github.com/vllm-project/vllm issues | Ecosyste.ms: OpenCollective

Rahul quant merged

robertgshaw2-neuralmagic opened this pull request 2 months ago

[Kernel] Add CUTLASS sparse support, heuristics, and torch operators

Faraz9877 opened this pull request 2 months ago

[Perf] Reduce peak memory usage of llama

andoorve opened this pull request 2 months ago

[Bug]: KV Cache Error with KV_cache_dtype=FP8 and Large Sequence Length: Losing Context Length of Model

amakaido28 opened this issue 2 months ago

[Bug]: different garbage output of same prompt when inferred with single sequence vs concurrent requests on vllm openai server, temp=0 (mixed batching in longrope)

bhupendrathore opened this issue 2 months ago

[Kernel] Add CUTLASS sparse support with argument sweep, heuristics, and torch operators

Faraz9877 opened this pull request 2 months ago

[bugfix] Fix static asymmetric quantization case

ProExpertProg opened this pull request 2 months ago

[Tool parsing] Improve / correct mistral tool parsing

patrickvonplaten opened this pull request 2 months ago

Nir b2b latest

nirda7 opened this pull request 2 months ago

[Docs] Publish meetup slides

WoosukKwon opened this pull request 2 months ago

[Feature] enable host memory for kv cache

YZP17121579 opened this pull request 2 months ago

Rs 24 sparse

robertgshaw2-neuralmagic opened this pull request 2 months ago

[Misc] Add uninitialized params tracking for `AutoWeightsLoader`

Isotr0py opened this pull request 2 months ago

[Bug]: Out of Memory (OOM) Issues During MMLU Evaluation with lm_eval

wchen61 opened this issue 2 months ago

[Bug] custom chat template sends to model [{'type': 'text', 'text': '...'}]

victorserbu2709 opened this issue 2 months ago

[Feature]: To adapt to the TTS task, I need to directly pass in the embedding. How should I modify it?

1nlplearner opened this issue 2 months ago

[Usage]: using open-webui with vLLM inference engine instead of ollama

wolfgangsmdt opened this issue 2 months ago

[Installation]: Request to include vllm==0.6.2 for cuda 11.8

amew0 opened this issue 2 months ago

[Performance]: Results from the vLLM Blog article "How Speculative Decoding Boosts vLLM Performance by up to 2.8x" are unreproducible

yeonjoon-jung01 opened this issue 2 months ago

[Hardware][Cambricon MLU] Add Cambricon MLU inference backend (#9649)

zonghuaxiansheng opened this pull request 2 months ago

[Bug]: FusedMoE kernel performance depends on input prompt length while decoding

taegeonum opened this issue 2 months ago

[Bugfix] Fix unable to load some models

DarkLight1337 opened this pull request 2 months ago

[Model] Support telechat2

shunxing12345 opened this pull request 2 months ago

[Misc] Change RedundantReshapesPass and FusionPass logging from info to debug

tlrmchlsmth opened this pull request 2 months ago

[TPU] Implement prefix caching for TPUs

WoosukKwon opened this pull request 2 months ago

[Bug]: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12

yananchen1989 opened this issue 2 months ago

[Bug]: Get meaningless output when run long context inference of Qwen2.5 model with vllm>=0.6.3

piamo opened this issue 2 months ago

[Feature]: Quark quantization format upstream to VLLM

kewang-xlnx opened this issue 2 months ago

[Bug]: Can't use yarn rope config for long context in Qwen2 model

FlyCarrot opened this issue 2 months ago

[Bugfix] return zero point in static quantization in scaled_int8_quant

danieldk opened this pull request 2 months ago

[Model] Add Support for Multimodal Granite Models

alex-jw-brooks opened this pull request 2 months ago

[Feature]: 2D TP & EP

WenhaoHe02 opened this issue 2 months ago

[Misc] Update benchmark to support image_url file or http

kakao-steve-ai opened this pull request 2 months ago

[Bug]: vllm serve works incorrectly for (some) Vision LM models

Aktsvigun opened this issue 2 months ago

[CI/Build] Make shellcheck happy

DarkLight1337 opened this pull request 2 months ago

[Bug]: For a qwen2.5 service started under different vllm versions, the same input gives correct SSE output on 0.6.1.post2 but incorrect output on 0.6.3.post1?

mawenju203 opened this issue 2 months ago

Bump to compressed-tensors v0.8.0

dsikka opened this pull request 2 months ago

Bump to `compressed-tensors` v0.8.0

dsikka opened this pull request 2 months ago

[Core][Frontend] Add faster-outlines as guided decoding backend

unaidedelf8777 opened this pull request 2 months ago

[Bug]: Speculative Decoding + TP on Spec Worker + Chunked Prefill does not work.

andoorve opened this issue 2 months ago

[core][distributed] use tcp store directly

youkaichao opened this pull request 2 months ago

[help wanted]: add QwenModel to ci tests

youkaichao opened this issue 2 months ago

[torch.compile] PostGradPassManager, Inductor code caching fix, fix_functionalization pass refactor + tests

ProExpertProg opened this pull request 2 months ago

[V1] Fix CI tests on V1 engine

WoosukKwon opened this pull request 2 months ago

Revert "[ci][build] limit cmake version"

youkaichao opened this pull request 2 months ago

[doc] improve debugging doc

youkaichao opened this pull request 2 months ago

[Usage]: Adaptive Batching and number of concurrent requests

Leon-Sander opened this issue 2 months ago

[V1] Enable Inductor when using piecewise CUDA graphs

WoosukKwon opened this pull request 2 months ago

[Feature]: Support for NVIDIA Unified memory

khayamgondal opened this issue 2 months ago

[doc] fix location of runllm widget

youkaichao opened this pull request 2 months ago

[TPU] Use numpy to compute slot mapping

WoosukKwon opened this pull request 2 months ago

[Doc] Fix typo in arg_utils.py

xyang16 opened this pull request 2 months ago

[Bug]: qwen cannot be quantized in vllm

yananchen1989 opened this issue 2 months ago

[Bugfix] Fix QwenModel argument

DamonFool opened this pull request 2 months ago

[Bug]: The throughput computation in metric.py seems wrong

Achazwl opened this issue 2 months ago

[Feature]: 2:4 sparsity + w4a16 support

arunpatala opened this issue 2 months ago

[Feature]: Is it possible for VLLM to support inference with dynamic activation sparsity?

jiangjiadi opened this issue 2 months ago

[Usage]:Qwen2-VL not support Lora

menglrskr opened this issue 2 months ago

[Usage]: How to Use a Public URL for Remote Access to a Deployed vLLM Model?

Nothern-ai opened this issue 2 months ago

[Misc]Fix Idefics3Model argument

jeejeelee opened this pull request 2 months ago

[Kernel][Hardware][AMD] Add support for GGUF quantization on ROCm

kliuae opened this pull request 2 months ago

[V1] Use pickle for serializing EngineCoreRequest & Add multimodal inputs to EngineCoreRequest

WoosukKwon opened this pull request 2 months ago

[Installation]: Install Gpu vllm got no module named triton

Serenagirl opened this issue 2 months ago

[Bug]: Deepseek V2 coder 236B awq error!

tohnee opened this issue 2 months ago

[misc] Layerwise profile updates

varun-sundar-rabindranath opened this pull request 2 months ago

[V1] TPU Prototype

robertgshaw2-neuralmagic opened this pull request 2 months ago

[Hardware] [HPU]add `mark_step` for hpu

jikunshang opened this pull request 2 months ago

[New Model]: Registration info is lost for a new model registered via Out-of-Tree Model Integration when multi-GPU Ray mode is enabled

llery opened this issue 2 months ago

[Core] Reduce TTFT with concurrent partial prefills

joerunde opened this pull request 2 months ago

[Bugfix] Fix for Spec model TP + Chunked Prefill

andoorve opened this pull request 2 months ago

Making vLLM compatible with Mistral fp8 weights.

akllm opened this pull request 2 months ago

[V1] Enable custom ops with piecewise CUDA graphs

WoosukKwon opened this pull request 2 months ago

[Model] Add support for Qwen2-VL video embeddings input & multiple image embeddings input with varied resolutions

imkero opened this pull request 3 months ago

[Bugfix][Hardware][CPU] Fix broken encoder-decoder CPU runner

Isotr0py opened this pull request 3 months ago

[Frontend][Core] Add Guidance backend for guided decoding

JC1DA opened this pull request 3 months ago

[6/N] pass whole config to inner model

youkaichao opened this pull request 3 months ago

[Bugfix] bitsandbytes models fail to run pipeline parallel

HoangCongDuc opened this pull request 3 months ago

[Bugfix][SpecDecode] apply sampling parameters to target probabilities for consistency in rejection sampling.

jeongin601 opened this pull request 3 months ago

[Core] Loading model from S3 using RunAI Model Streamer as optional loader

omer-dayan opened this pull request 3 months ago

[help wanted]: why cmake 3.31 breaks vllm and how to fix it

youkaichao opened this issue 3 months ago

[Bug]: 500 Internal Server Error when calling v1/completions and v1/chat/completions with vllm/vllm-openai:v0.6.3.post1 on OpenShift

JohnWestlund opened this issue 3 months ago

[Model] Support Qwen2 embeddings and use tags to select model tests

DarkLight1337 opened this pull request 3 months ago

[Bug]: When apply continue_final_message for OpenAI server, the "echo":false is ignored

chaunceyjiang opened this pull request 3 months ago

[Installation]: error: identifier "__builtin_dynamic_object_size" is undefined

xiaoxiaosuaxuan opened this issue 3 months ago

[Frontend] Add per-request number of cached token stats

zifeitong opened this pull request 3 months ago

[Feature]: BASE_URL environment variable

bjb19 opened this issue 3 months ago

[Docs] Misc updates to TPU installation instructions

mikegre-google opened this pull request 3 months ago

[Bugfix][Frontend] Update Llama 3.2 Chat Template to support Vision and Non-Tool use

tjohnson31415 opened this pull request 3 months ago

[Doc] Move PR template content to docs

russellb opened this pull request 3 months ago

[Bug]: Error in benchmark model with vllm backend for endpoint /v1/chat/completions

rabaja opened this issue 3 months ago

[Bug]: Unable to load Llama-3.1-70B-Instruct using either `vllm serve` or `vllm-openai` docker

SMAntony opened this issue 3 months ago

[Installation]: VLLM does not support TPU v5p-16 (Multi-Host) with Ray Cluster

Bihan opened this issue 3 months ago

[Bug]: FlashInfer throws error in nightly: Please set `use_tensor_cores=True` in BatchDecodeWithPagedKVCacheWrapper for group size 3

nathan-az opened this issue 3 months ago

[Usage]: Error executing method determine_num_available_blocks. This might cause deadlock in distributed execution.

SamuelScc opened this issue 3 months ago

[Usage]: how can i get all logits of token?

joyyyhuang opened this issue 3 months ago

[Bug]: H100 - Your GPU does not have native support for FP8 computation

ScOut3R opened this issue 3 months ago

Fix missing data type in flashinfer prefill

reyoung opened this pull request 3 months ago

[Bug]: Outlines w/ Mistral

matbee-eth opened this issue 3 months ago

[Feature]: Support for predicted outputs

flozi00 opened this issue 3 months ago

[WIP] Disable spec-decode + chunked-prefill for draft models with tensor parallelism > 1

sroy745 opened this pull request 3 months ago