Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
github.com/vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
https://github.com/vllm-project/vllm
[Bug]: Engine timeout error due to request step residual
pushan01 opened this issue 7 months ago
[Bug]: segfault when using google/gemma-2-27b-it on vLLM
federicotorrielli opened this issue 7 months ago
[Bug]: Load LoRA adaptor for Llama3 seems not working
ANYMS-A opened this issue 7 months ago
add benchmark test for fixed input and output length
haichuan1221 opened this pull request 7 months ago
[Installation]: Installation with OpenVINO get dependency conflict Error !!!
HPUedCSLearner opened this issue 7 months ago
[Usage]: Gemma2-9b not working on A10G 24gb gpu
Abhinay2323 opened this issue 7 months ago
[Bug]: Performance : slow inference for FP8 on L20 with 0.5.1(v0.5.0.post1 was fine)
garycaokai opened this issue 7 months ago
[Installation]: Gemma2 Installing Flash Infer `[rank0]: TypeError: 'NoneType' object is not callable`
robertgshaw2-neuralmagic opened this issue 7 months ago
[Core] Support dynamically loading Lora adapter from HuggingFace
Jeffwan opened this pull request 7 months ago
[Feature]: Support loading lora adapters from HuggingFace in runtime
Jeffwan opened this issue 7 months ago
[Bug]: relative path doesn't work for Lora adapter model
Jeffwan opened this issue 7 months ago
[Doc] Fix the lora adapter path in server startup script
Jeffwan opened this pull request 7 months ago
[RFC] Drop beam search support
WoosukKwon opened this issue 7 months ago
[Bug]: benchmark_throughput gets TypeError: XFormersMetadata.__init__() got an unexpected keyword argument 'is_prompt' with CPU
LGLG42 opened this issue 7 months ago
[ BugFix ] Prompt Logprobs Detokenization
robertgshaw2-neuralmagic opened this pull request 7 months ago
[Bug]: Gemma2 supports 8192 context with sliding window, but vllm only does 4096 or fails if trying 8192
pseudotensor opened this issue 7 months ago
[Bug]: issue with Phi3 mini GPTQ 4Bit/8Bit
gm3000 opened this issue 7 months ago
[Hardware][Intel CPU][DOC] Update docs for CPU backend
zhouyuan opened this pull request 7 months ago
[Bug]: AsyncEngineDeadError: Task finished unexpectedly with qwen2 72b
thomZ1 opened this issue 7 months ago
[Installation]: pip install -e .
Kev1ntan opened this issue 7 months ago
[Usage]: Is there a way to make the results of two different calls to VLLM with temperature > 0 consistent?
Some-random opened this issue 7 months ago
do not exclude `object` field in CompletionStreamResponse
kczimm opened this pull request 7 months ago
[misc][frontend] log all available endpoints
youkaichao opened this pull request 7 months ago
[Bug]: No end point available after model is fully loaded
hassanzadeh opened this issue 7 months ago
[Bug]: Guided decoding with Phi-3-small crashes
crosiumreborn opened this issue 7 months ago
[Bug]: gemma-2-27b error loading with vllm.LLM
jl3676 opened this issue 7 months ago
[Usage]: OpenAI-like API in offline inference
1ncludeSteven opened this issue 7 months ago
[Bug]: AsyncEngineDeadError: Task finished unexpectedly with Gemma2 9B
nelyajizi opened this issue 7 months ago
[Feature]: Precise model device placement
vwxyzjn opened this issue 7 months ago
[Feature]: lazy import for VLM
zhyncs opened this issue 7 months ago
[Usage]: BNB Gemma2 9b loading problems
orellavie1212 opened this issue 7 months ago
[Core][Speculative Decoding] Add multi-query verifier for speculative decoding without batch expansion
sighingnow opened this pull request 7 months ago
[Usage]: solve problem like "Found no NVIDIA driver on your system." in WSL2
HelloCard opened this issue 7 months ago
[core][distributed] add zmq fallback for broadcasting large objects
youkaichao opened this pull request 7 months ago
Add test test (this is a test pr)
llmpros opened this pull request 7 months ago
[Bug]: Multiprocessing FileNotFound error in triton cache
jl3676 opened this issue 7 months ago
[Usage]: Struggling to get fp8 inference working correctly on 8xL40s
williambarberjr opened this issue 7 months ago
[Feature]: Support AVX2 for CPU (drop AVX-512 requirement)
kozuch opened this issue 7 months ago
[Bug]: Empty strings as output using gemma-2-27B with 4 A10s
lucafirefox opened this issue 7 months ago
[Bug]: LLaVA 1.6 in 0.5.1: Exceptions after some bigger image request, stuck in faulty mode
andrePankraz opened this issue 7 months ago
[Feature]: ROPE scaling supported by vLLM gemma2
kkk935208447 opened this issue 7 months ago
[Doc]: Code Shared for OpenAI Embedding Client gives base64 encode error
palash-fin opened this issue 7 months ago
[Bug]: As V100 does not support FlashAttention, it is not possible to run the gemma model, hopefully it can support the xformers way to run it
warlockedward opened this issue 7 months ago
Add FlashInfer to default Dockerfile
simon-mo opened this pull request 7 months ago
[Bug]: New bug in 0.5.1 (v0.5.0.post1 was fine)
andrePankraz opened this issue 7 months ago
[Core] implement disaggregated prefilling via KV cache transfer
KuntaiDu opened this pull request 7 months ago
[Bug]: TypeError: 'NoneType' object is not callable when loading Gemma 2 9B with new 0.5.1 version
DanielusG opened this issue 7 months ago
[Doc] Move guide for multimodal model and other improvements
DarkLight1337 opened this pull request 7 months ago
[Doc] Reorganize Supported Models by Type
ywang96 opened this pull request 7 months ago
[Feature]: Return hidden states (in progress?)
Elanmarkowitz opened this issue 7 months ago
[Core] Refactor _prepare_model_input_tensors - take 2
comaniac opened this pull request 7 months ago
Move release wheel env var to Dockerfile instead
simon-mo opened this pull request 7 months ago
Fix release wheel build env var
simon-mo opened this pull request 7 months ago
Update wheel builds to strip debug
simon-mo opened this pull request 7 months ago
[Bug]: Batch expansion doesn't work with lora
Adhyyan1252 opened this issue 7 months ago
[Docs] Fix readthedocs for tag build
simon-mo opened this pull request 7 months ago
bump version to v0.5.1
simon-mo opened this pull request 7 months ago
[Bug]: When starting deepseek-coder-v2-lite-instruct with vllm on 4 GPUs, one of them is at 0%.
fengyang95 opened this issue 7 months ago
[Usage]: How to use Multi-instance in Vllm? (Model replication on multiple GPUs)
KimMinSang96 opened this issue 7 months ago
[Feature]: expose the tqdm progress bar to enable logging the progress
hugolytics opened this issue 7 months ago
Exception when loading Baichuan2-13B-Chat: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU
czhcc opened this issue 7 months ago
[Bug]: When tensor_parallel_size>1, RuntimeError: Cannot re-initialize CUDA in forked subprocess.
excelsimon opened this issue 7 months ago
[Feature]: Integrate new backend
XDaoHong opened this issue 7 months ago
[Performance]: the performance with chunked-prefill-enabled is lower than default
BestKuan opened this issue 7 months ago
[VLM] Cleanup validation and update docs
DarkLight1337 opened this pull request 7 months ago
[Feature]: Model ChatGLMForCausalLM does not support LoRA, but LoRA is enabled.
wangbhan opened this issue 7 months ago
[Bug]: CUDA error when using multiple GPUs
ndao600 opened this issue 7 months ago
[VLM] Improve consistency between feature size calculation and dummy data for profiling
ywang96 opened this pull request 7 months ago
[Bug]: When using tp for inference, an error occurs: Worker VllmWorkerProcess pid 3283517 died, exit code: -15.
B-201 opened this issue 7 months ago
[Bugfix] Enable chunked-prefill and prefix cache with flash-attn backend
sighingnow opened this pull request 7 months ago
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend
kzawora-intel opened this pull request 7 months ago
[Feature]: deepseek-v2 awq support
fengyang95 opened this issue 7 months ago
[Usage]: Internal server error when serving LoRA adapters with Open-AI compatible vLLM server
ebi64 opened this issue 7 months ago
[Bugfix] Add custom Triton cache manager to resolve MoE MP issue
tdoublep opened this pull request 7 months ago
[Model] Implement DualChunkAttention for Qwen2 Models
hzhwcmhf opened this pull request 7 months ago
[Bugfix] Handle `best_of>1` case by disabling speculation.
tdoublep opened this pull request 7 months ago
[Bug]: Spec. decode fails for requests with n>1 or best_of>1
tdoublep opened this issue 7 months ago
[Bugfix] Use templated datasource in grafana.json to allow automatic imports
frittentheke opened this pull request 7 months ago
[Bug]: Phi-3 long context (longrope) doesn't work with fp8 kv cache
jphme opened this issue 7 months ago
[Installation]: Couldn't find CUDA library root.
CodexDive opened this issue 7 months ago
[Feature]: Multi lora on multi gpus
jiuzhangsy opened this issue 7 months ago
[Usage]: vllm server mode, gpu util
UbeCc opened this issue 7 months ago
[Bug]: Disable log requests and disable log stats do not work
wufxgtihub123 opened this issue 7 months ago
[Usage]: Does vllm currently support embedding inputs? I could not find a relevant interface
zhanghang-official opened this issue 7 months ago
[core][distributed] accelerate distributed weight loading
youkaichao opened this pull request 7 months ago
[Bug]: RuntimeError: No suitable kernel. h_in=16 h_out=7392 dtype=Float out_dtype=BFloat16
JJJJerry opened this issue 7 months ago
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU backend and update documentation
bigPYJ1151 opened this pull request 7 months ago
[Bug]: Number of available GPU blocks drop significantly for Phi3-vision
CatherineSue opened this issue 7 months ago
[Feature]: multi-lora support older nvidia gpus.
wuisawesome opened this issue 7 months ago
[VLM] Calculate maximum number of multi-modal tokens by model
DarkLight1337 opened this pull request 7 months ago
[Distributed][Core] Support Py39 and Py38 for PP
andoorve opened this pull request 7 months ago
[doc][misc] bump up py version in installation doc
youkaichao opened this pull request 7 months ago
[Installation]: ImportError: undefined symbol: __nvJitLinkAddData_12_1, version libnvJitLink.so.12
laithsakka opened this issue 7 months ago
[core][distributed] allow custom allreduce when pipeline parallel size > 1
youkaichao opened this pull request 7 months ago
[Bug]: Mixtral 8x7b FP8 encounters illegal memory access in custom_all_reduce.cuh
ferdiko opened this issue 7 months ago
[core][distributed] support layer size undividable by pp size in pipeline parallel inference
youkaichao opened this pull request 7 months ago
[Feature]: support layer size undividable by pp size in pipeline parallel inference
youkaichao opened this issue 7 months ago
[ Misc ] Clean Up `CompressedTensorsW8A8`
robertgshaw2-neuralmagic opened this pull request 7 months ago