github.com/vllm-project/vllm issues | Ecosyste.ms: OpenCollective

[Bug]: Qwen2-VL incoherent output with OpenAI API

SinanAkkoyun opened this issue 3 months ago

[Bug]: tensor parallelism multinode

gpucce opened this issue 3 months ago

[Bugfix][SpecDecode] kv corruption with bonus tokens in spec decode

llsj14 opened this pull request 3 months ago

[Bug]: Bfloat16 or Half are not compatible with HF float16/bfloat16 result.

jason9693 opened this issue 3 months ago

[Bug]: Jetson support regression

conroy-cheers opened this issue 3 months ago

[Doc] Specify async engine args in docs

DarkLight1337 opened this pull request 3 months ago

[V1] Prototype Fully Async Detokenizer

robertgshaw2-neuralmagic opened this pull request 3 months ago

[core] cudagraph output with tensor weak reference

youkaichao opened this pull request 3 months ago

[Bug]: Incoherent Offline Inference Single Video with Qwen2-VL

hector-gr opened this issue 3 months ago

[Performance]: How to Improve Performance Under Concurrency

ljwps opened this issue 3 months ago

[Bugfix] Use temporary directory in registry

DarkLight1337 opened this pull request 3 months ago

[Model] Add BNB quantization support for Mllama

Isotr0py opened this pull request 3 months ago

[Misc] SpecDecodeWorker supports profiling

Abatom opened this pull request 3 months ago

[torch.compile] rework compile control with piecewise cudagraph

youkaichao opened this pull request 3 months ago

[Usage]: ValueError: Model architectures ['LlamaForCausalLM'] are not supported for now.

Mjiegu opened this issue 3 months ago

[Bug]: Inconsistent evaluations when enabling / disabling chunked_prefill?

Jingyu6 opened this issue 3 months ago

[Model] Add classification Task with Qwen2ForSequenceClassification

kakao-kevin-us opened this pull request 3 months ago

[Usage]: Using a model for inference and embedding

micuentadecasa opened this issue 3 months ago

[Installation] pip install vllm (0.6.3) will force a reinstallation of the CPU version torch and replace cuda torch on windows

xiezhipeng-git opened this issue 3 months ago

CI TEST

maxdebayser opened this pull request 3 months ago

[Model] Support math-shepherd-mistral-7b-prm model

Went-Liang opened this pull request 3 months ago

[Bug]: Function calling with stream vs without stream, arguments=None when stream option is enabled

ankush13r opened this issue 3 months ago

[Model] Support GGUF models newly added in `transformers` 4.46.0

Isotr0py opened this pull request 3 months ago

[Core] Support offloading KV cache to CPU

KuntaiDu opened this pull request 3 months ago

[Build] skip renaming files for release wheels pipeline

simon-mo opened this pull request 3 months ago

[Bug]: Input length greater than 32K in nvidia/Llama-3.1-Nemotron-70B-Instruct-HF generate garbage on v0.6.3 ( issue is not seen in v0.6.2)

source-ram opened this issue 3 months ago

[Bug]: hung when start openai api server with multiple gpu in one node.

weiminw opened this issue 3 months ago

[Doc] Update FAQ links in spec_decode.rst

whyiug opened this pull request 3 months ago

[Usage]: Llama-3.1-70B-Instruct best arguments for throughput at scale for multiple users

squinn1 opened this issue 3 months ago

[Misc]: huggingface_hub.errors.HFValidationError using LLama3.1-405b

unrue opened this issue 3 months ago

[Usage]: Pass multiple LoRA modules through YAML config

andreapairon opened this issue 3 months ago

[Feature]: support SageAttention

LSC527 opened this issue 3 months ago

[Performance]: Low GPU utilization - is it normal?

fzyzcjy opened this issue 3 months ago

[V1] Move mm_input_mapper to a separate process

WoosukKwon opened this pull request 3 months ago

[Bug]: pipeline parallel performance issue for 1 sample.

littletomatodonkey opened this issue 3 months ago

[torch.compile] Adding torch compile annotations to some models

CRZbulabula opened this pull request 3 months ago

[Bug]: glm4-9b-chat-lora-merge model with VLLM for concurrent requests, the process gets stuck and returns an "Aborted request" error.

Jimmy-L99 opened this issue 3 months ago

[Usage]: Multimodal content with benchmark_serving.py

khayamgondal opened this issue 3 months ago

[Bugfix] Fix edge cases for MistralTokenizer

tjohnson31415 opened this pull request 3 months ago

[Bug]: Incompatible shape in block table when running Phi-3.5-mini-instruct

vizsatiz opened this issue 3 months ago

[Model][LoRA]LoRA support added for Qwen

jeejeelee opened this pull request 3 months ago

[CI/Build] improve python-only dev setup

dtrifiro opened this pull request 3 months ago

[Bug]: crash：RecursionError: maximum recursion depth exceeded

wciq1208 opened this issue 3 months ago

[New Model]: stepfun-ai/GOT-OCR2_0

akhileshsharma99 opened this issue 3 months ago

[Core] Make encoder-decoder inputs a nested structure to be more composable

DarkLight1337 opened this pull request 3 months ago

Linter test

maxdebayser opened this pull request 3 months ago

[Misc] Upgrade to pytorch 2.5

bnellnm opened this pull request 3 months ago

[Feature]: LoRA support for Qwen model

zhangfan-algo opened this issue 3 months ago

[Bugfix] use AF_INET6 instead of AF_INET for OpenAI Compatible Server

jxpxxzj opened this pull request 3 months ago

[Feature]: Support for 1.58-bit models.

RealMrCactus opened this issue 3 months ago

[Performance]: vllm Eagle performance is worse than expected

LiuXiaoxuanPKU opened this issue 3 months ago

[Bug]: benchmark serving does not support --best_of>1

homeffjy opened this issue 3 months ago

[Encoder Decoder] Add flash_attn kernel support for encoder-decoder models

sroy745 opened this pull request 3 months ago

[Bug]: GGUF Llama-3.1-Nemotron-70B-Instruct-HF ValueError: cannot reshape array of size into shape

paolovic opened this issue 3 months ago

[Bug]: MistralTokenizer Detokenization Issue

prashantgupta24 opened this issue 3 months ago

[Usage]: Custom LLM Generate

Blaizzy opened this issue 3 months ago

[Bugfix][Misc]: fix graph capture for decoder

yudian0504 opened this pull request 3 months ago

[New Model]: bert-base-chinese

kangzemin opened this issue 3 months ago

OOM error :When using four 4500 ada cards to start four lora instances, an error occurs. However, no error occurs when not starting lora on the four 4500 ada cards, and there is no error when starting four lora instances on a single A100 card.

xllrun opened this issue 3 months ago

[Feature]: Support for Controlled Decoding

simonucl opened this issue 3 months ago

[Performance]: bitsandbytes quantization slow

lance0108 opened this issue 3 months ago

[Feature]: EAGLE fp8 quantization

fengyang95 opened this issue 3 months ago

[Bugfix] Fix load config when using bools

madt2709 opened this pull request 3 months ago

[Bugfix] Fix `illegal memory access` error with chunked prefill, prefix caching, block manager v2 and xformers enabled together

sasha0552 opened this pull request 3 months ago

[Usage]: When using vllm to start the interpl2-8b model service, an error occurs. The command is as follows: vllm serve/ internvl2-8b

hyyuananran opened this issue 3 months ago

[Bugfix] Pass json-schema to GuidedDecodingParams and make test stronger

heheda12345 opened this pull request 3 months ago

[Frontend] Support suffix in completions API (fill-in-the-middle)

njhill opened this pull request 3 months ago

[Bug]: Multiple inconsistencies wrt BOS injection and BOS duplication

stas00 opened this issue 3 months ago

Adds method to read the pooling types from model's files

flaviabeo opened this pull request 3 months ago

[Model] Update MPT model with GLU and rope and add low precision layer norm

kazuki opened this pull request 3 months ago

[Bug]: When reading the content from the configuration file specified by the --config parameter, the parameter type was not considered.

SakigamiYang opened this issue 3 months ago

[Bug]: [Performance] 100% performance drop using multiple lora vs no lora(qwen-chat model)

askcs517 opened this issue 3 months ago

[Feature]: LoRA support for InternVLChatModel

AkshataABhat opened this issue 3 months ago

[Misc] Fix ImportError caused by triton

MengqingCao opened this pull request 3 months ago

[Usage]: When to use flashinfer as the default backend

ehuaa opened this issue 3 months ago

【Frontend】Add sampler_priority and repetition_penalty_range

ZeroYuJie opened this pull request 3 months ago

[Performance]: InternVL multi image speed is not improved compare to original

luohao123 opened this issue 3 months ago

[Feature]: Support for Diff-Transformer to limit noise in attention calculation @ runtime

nightflight-dk opened this issue 3 months ago

[Feature]: Alternating local-global attention layers

griff4692 opened this issue 3 months ago

[Bug]: Too Many Tokens are Empty Strings and Empty Bytes, and `top_logprobs` Can't Identify End-of-Text (EOT) Tokens

DIYer22 opened this issue 3 months ago

[Installation]: No module named 'vllm._version' from vllm.version import __version__ as VLLM_VERSION

yangxin60-tal opened this issue 3 months ago

[Feature]: Consider parallel_tool_calls parameter at the API level

lucasalvarezlacasa opened this issue 3 months ago

[Misc]: offline inference inconsistency result of qwen2-7b

poppybrown opened this issue 3 months ago

[Bug]: vllm startup model error /proc file not found

970602 opened this issue 3 months ago

[Misc] Compute query_start_loc/seq_start_loc on CPU

zhengy001 opened this pull request 3 months ago

[Bug]: Could we provide an interface for setting the "dtype" when calling the example/benchmarks python?

hongfeng2013 opened this issue 3 months ago

[Bug]: Speculative decoding generate gibberish when receiving parallel requests with different seeds

wallashss opened this issue 3 months ago

[Frontend] re-enable multi-modality input in the new beam search implementation

FerdinandZhong opened this pull request 3 months ago

[Feature]: Allow setting tool_choice="none" in LLM calls if the OpenAI compatible vllm server is started with --enable-auto-tool-choice

deheim opened this issue 3 months ago

[Bug]: Speculative decoding breaks guided decoding.

roberthoenig opened this issue 3 months ago

[Bug]: RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241016-170451.pkl): view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

double-vin opened this issue 3 months ago

[Performance]: inference with qwen2.5 using version vLLM 0.6.3 is felt to be slower

Jimmy-L99 opened this issue 3 months ago

[Usage]: Which branch should I use to test speculative decoding

v-lmn opened this issue 3 months ago

Begin refactoring executor_base ABC

jberkhahn opened this pull request 4 months ago

Support Roberta embedding models

maxdebayser opened this pull request 4 months ago

[Performance][Kernel] Fused_moe Performance Improvement

charlifu opened this pull request 4 months ago

[New Model]: Support Zyphra/Zamba2-7B

mgoin opened this issue 4 months ago

[Bug]: KeyError: 'layers.60.mlp.gate_up_proj.weight' mistral large bitsandbytes

copasseron opened this issue 4 months ago

[CI/Build] remove .github from .dockerignore

dtrifiro opened this pull request 4 months ago

[Neuron] [Bugfix] Fix neuron startup

xendo opened this pull request 4 months ago