github.com/vllm-project/vllm issues | Ecosyste.ms: OpenCollective

[Bugfix] We have fixed the bug that occurred when using FlashInfer as the backend in vLLM Speculative Decoding.

bong-furiosa opened this pull request 7 months ago

[Bugfix] Fix evict v2 with long context length

puf147 opened this pull request 7 months ago

[CI] docfix

rkooo567 opened this pull request 7 months ago

[Doc] add debugging tips

youkaichao opened this pull request 7 months ago

[Core] Refactor Worker and ModelRunner to consolidate control plane communication

stephanie-wang opened this pull request 7 months ago

[Performance]: Qwen2-72B-Instruction-GPTQ-Int4 Openai Server Request Problem

syngokhan opened this issue 7 months ago

hidden-states from final (or middle layers)

janphilippfranken opened this issue 7 months ago

[Bug]: The vllm service takes two hours to start because of NCCL

zhaotyer opened this issue 7 months ago

[Bug]: topk=1 and temperature=0 cause different output in vllm

rangehow opened this issue 7 months ago

[Doc][Typo] Fixing Missing Comma

ywang96 opened this pull request 7 months ago

[Bugfix] Add device assertion to TorchSDPA

bigPYJ1151 opened this pull request 7 months ago

[Kernel] Suppress mma.sp warning on CUDA 12.5 and later

tlrmchlsmth opened this pull request 7 months ago

[Speculative decoding] Initial spec decode docs

cadedaniel opened this pull request 7 months ago

[Core][Distributed] add shm broadcast

youkaichao opened this pull request 7 months ago

[Bugfix] fix lora_dtype value type in arg_utils.py

c3-ali opened this pull request 7 months ago

[Bug]: EngineArgs missing value type for `lora_dtype`

c3-ali opened this issue 7 months ago

[Kernel] Vectorized FP8 quantize kernel

comaniac opened this pull request 7 months ago

[Bug]: Llama3 output limited to around 10 tokens

arifsaeed opened this issue 7 months ago

[ci] Fix Buildkite agent path

khluu opened this pull request 7 months ago

[Kernel] Factor out epilogues from cutlass kernels

tlrmchlsmth opened this pull request 7 months ago

[Kernel] Adding fused bias add to cutlass_scaled_mm_dq kernel

cyang49 opened this pull request 7 months ago

[Misc] Remove VLLM_BUILD_WITH_NEURON env variable

WoosukKwon opened this pull request 7 months ago

[Doc] Add documentation for FP8 W8A8

mgoin opened this pull request 7 months ago

[Kernel] `w4a16` support for `compressed-tensors`

dsikka opened this pull request 7 months ago

Bump version to v0.5.0

simon-mo opened this pull request 7 months ago

[Docs] Add Docs on Limitations of VLM Support

ywang96 opened this pull request 7 months ago

[CI] Upgrade codespell version.

rkooo567 opened this pull request 7 months ago

[Hardware][Intel] OpenVINO vLLM backend

ilya-lavrenov opened this pull request 7 months ago

[RFC]: OpenVINO vLLM backend

ilya-lavrenov opened this issue 7 months ago

0.4.3 error CUDA error: an illegal memory access was encountered

maxin9966 opened this issue 7 months ago

[misc][typo] fix typo

youkaichao opened this pull request 7 months ago

[Core][Distributed] add same-node detection

youkaichao opened this pull request 7 months ago

[Misc] Various simplifications and typing fixes

njhill opened this pull request 7 months ago

[WIP][Core] Support tensor parallel division with remainder of attention heads

NadavShmayo opened this pull request 7 months ago

[Bug]: Docker image starts vllm.entrypoints.openai.api_server, Docker opens port 8000 but vllm isn't listening on 8000

elabz opened this issue 7 months ago

[Bug]: load nvidia/Llama3-ChatQA-1.5-8B model 15 min

JJplane opened this issue 7 months ago

[CI/Build] Add nightly benchmarking for tgi, tensorrt-llm and lmdeploy

KuntaiDu opened this pull request 7 months ago

[Bug]: Multi GPU setup for VLLM in Openshift still does not work

jayteaftw opened this issue 7 months ago

[Model] Add GLM-4v support

songxxzp opened this pull request 7 months ago

[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner)

youkaichao opened this pull request 7 months ago

[Kernel][RFC] Initial commit containing new Triton kernels for multi lora serving.

FurtherAI opened this pull request 7 months ago

[Bugfix] Take the VRAM usage of prompt_logprobs into account

Conless opened this pull request 7 months ago

[Core][Distributed] merge two broadcast_tensor_dict

youkaichao opened this pull request 7 months ago

[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale

mgoin opened this pull request 7 months ago

[Bug Fix] Fix the support check for FP8 CUTLASS

cli99 opened this pull request 7 months ago

[Bug]: TorchSDPAMetadata is out of date

Reichenbachian opened this issue 7 months ago

[Misc] Update to comply with the new `compressed-tensors` config

dsikka opened this pull request 7 months ago

[Bugfix][Core] fix broken state for recompute

youkaichao opened this pull request 7 months ago

[Speculative Decoding 2/2] Integrate typical acceptance sampler into Spec Decode Worker

sroy745 opened this pull request 7 months ago

[CI/Test] improve robustness of test by replacing del with context manager (hf_runner)

youkaichao opened this pull request 7 months ago

[RFC]: Refactor MoE

robertgshaw2-neuralmagic opened this issue 7 months ago

[Misc] Remove unused cuda_utils.h in CPU backend

DamonFool opened this pull request 7 months ago

[Bug]: with `--enable-prefix-caching`, `/completions` crashes server with `echo=True` above certain prompt length

hibukipanim opened this issue 7 months ago

[Bug]: Qwen2 MoE: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'. Did you mean: 'qweight'?

geekwish opened this issue 7 months ago

[Speculative decoding]: The content generated by speculative decoding is inconsistent with the content generated by : When I use the speculative mode and prompt_length+output_length > 2048, the error occurs

zhangxy1234 opened this issue 7 months ago

fix DbrxFusedNormAttention missing cache_config

Calvinnncy97 opened this pull request 7 months ago

[Performance]: [Automatic Prefix Caching] When hitting the KV cached blocks, the first execute is slow, and then is fast.

soacker opened this issue 7 months ago

[Usage]: How to quiet the terminal 'Info' outputs in vllm

rohitnanda1443 opened this issue 7 months ago

[Bug]: non-deterministic Python gc order leads to flaky tests

youkaichao opened this issue 7 months ago

[Bug]: Getting an empty string ('') for every call on fine-tuned Code-Llama-7b-hf model

arthbohra opened this issue 7 months ago

[Misc] Add args for selecting distributed executor to benchmarks

BKitor opened this pull request 7 months ago

[Bug]: Unexpected prompt token logprob behaviors of llama 2 when setting echo=True for openai-api server

fywalter opened this issue 7 months ago

[Misc][Utils] allow get_open_port to be called for multiple times

youkaichao opened this pull request 7 months ago

remove sort_keys=True in guided_decoding

DeyangKong opened this pull request 7 months ago

[Core] Fix sharing of stateful logits processors

maxdebayser opened this pull request 7 months ago

[Bug]: vLLM does not support virtual GPU

youkaichao opened this issue 7 months ago

[MISC] Upgrade dependency to PyTorch 2.3.1

comaniac opened this pull request 7 months ago

Sa 24 sparse

dsikka opened this pull request 7 months ago

[Doc] Add an automatic prefix caching section in vllm documentation

KuntaiDu opened this pull request 7 months ago

[AMD][ROCm][CI] unit tests fixes or skip

hongxiayang opened this pull request 7 months ago

[Usage]: Streaming Response from vLLM 0.4.2 -> 0.4.3

BiboyQG opened this issue 7 months ago

[Feature][Frontend]: Continued `stream_options` implementation also in CompletionRequest

Etelis opened this pull request 7 months ago

[New Model]: mistralai/Codestral-22B-v0.1

eduardozamudio opened this issue 7 months ago

[Installation]: Compiling VLLM for cpu only.

Zibri opened this issue 7 months ago

[Performance]: gptq and awq quantization do not improve the performance

aaronlyt opened this issue 7 months ago

[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs

maor-ps opened this pull request 7 months ago

[Bug]: RuntimeError: CUDA error: no kernel image is available for execution on the device CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

wuyueandrew opened this issue 7 months ago

GLM-4-9B-Chat:

Geaming-CHN opened this issue 7 months ago

[Bugfix]if the content is started with ":"(response of ping), client should i…

sywangyi opened this pull request 7 months ago

[Installation]: Building editable for vllm fails (pip install -e .)

felixzhu555 opened this issue 7 months ago

[Bug]: Cannot request more than 5 logprobs

coder109 opened this issue 7 months ago

Addition of lacked ignored_seq_groups in _schedule_chunked_prefill

JamesLim-sy opened this pull request 7 months ago

[Core][Distributed] add coordinator to reduce code duplication in tp and pp

youkaichao opened this pull request 7 months ago

[Hardware] Initial TPU integration

WoosukKwon opened this pull request 7 months ago

[Misc] Skip for logits_scale == 1.0

WoosukKwon opened this pull request 7 months ago

[Usage]: the docker image v0.4.3 cannot work

BUJIDAOVS opened this issue 7 months ago

[Misc] Missing error message for custom ops import

DamonFool opened this pull request 7 months ago

trigger_ci_cd

sergey-tinkoff opened this pull request 7 months ago

[Bug]: Regression in predictions in v0.4.3

hibukipanim opened this issue 7 months ago

[Model] Dynamic image size support for LLaVA-NeXT

DarkLight1337 opened this pull request 7 months ago

[Frontend] OpenAI API server: Add `add_special_tokens` to ChatCompletionRequest (default False)

tomeras91 opened this pull request 7 months ago

test

geeker-smallwhite opened this pull request 7 months ago

[Core] Dynamic image size support for VLMs

DarkLight1337 opened this pull request 7 months ago

[Kernel] Update Cutlass int8 kernel configs for SM80

varun-sundar-rabindranath opened this pull request 7 months ago

[Bug]: high gpu_memory_utilization with 'OOM' and low gpu_memory_utilization with 'No available memory for the cache blocks'

mars-ch opened this issue 7 months ago

[Bug]: chatglm3 with lora adapter

Qingyuncookie opened this issue 7 months ago

[Bug]: When I call the speculative model through the vllm interface, an error is reported: TypeError: 'type' object is not subscriptable

YuCheng-Qi opened this issue 7 months ago

[Misc] Fix docstring of get_attn_backend

WoosukKwon opened this pull request 7 months ago

[Bug]: a bug

lambda7xx opened this issue 7 months ago

[Usage]: How to load a model with less CPU memory

liulfy opened this issue 7 months ago