Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

vLLM

vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLMs).
Collective - Host: opensource - https://opencollective.com/vllm - Code: https://github.com/vllm-project/vllm
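
For context on the project itself, vLLM is typically used through a simple offline-inference API built around an `LLM` class. The following is a minimal sketch, assuming the standard `LLM`/`SamplingParams` interface; the model name, prompts, and sampling settings are illustrative assumptions, not values taken from this page.

    # Minimal sketch of offline batch inference with vLLM.
    # Model name, prompts, and sampling settings are illustrative assumptions.
    from vllm import LLM, SamplingParams

    prompts = [
        "The capital of France is",
        "Summarize paged attention in one sentence:",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # Load the model once; vLLM manages the KV cache and batching internally.
    llm = LLM(model="facebook/opt-125m")

    # generate() runs batched inference over all prompts in a single call.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.prompt, "->", output.outputs[0].text)

The same engine can also be run as an OpenAI-compatible HTTP server (e.g. `python -m vllm.entrypoints.openai.api_server --model <model>`), which is the serving mode several of the issues and pull requests listed below refer to.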

[FIX] Fix prefix test error on main

github.com/vllm-project/vllm - zhuohan123 opened this pull request 7 months ago

Mixtral 4x 4090 OOM

github.com/vllm-project/vllm - SinanAkkoyun opened this issue 7 months ago

Order of keys for guided JSON

github.com/vllm-project/vllm - ccdv-ai opened this issue 7 months ago

unload the model

github.com/vllm-project/vllm - osafaimal opened this issue 7 months ago

install from source failed using the latest code

github.com/vllm-project/vllm - sleepwalker2017 opened this issue 7 months ago

[FIX] Make `flash_attn` optional

github.com/vllm-project/vllm - WoosukKwon opened this pull request 7 months ago

[Minor fix] Include flash_attn in docker image

github.com/vllm-project/vllm - tdoublep opened this pull request 8 months ago

OpenAI Tools / function calling v2

github.com/vllm-project/vllm - FlorianJoncour opened this pull request 8 months ago

Prefix Caching with FP8 KV cache support

github.com/vllm-project/vllm - chenxu2048 opened this pull request 8 months ago

[WIP] Build FlashInfer

github.com/vllm-project/vllm - WoosukKwon opened this pull request 8 months ago

ExLlamaV2: exl2 support

github.com/vllm-project/vllm - pabl-o-ce opened this issue 8 months ago

Supporting embedding models

github.com/vllm-project/vllm - jc9123 opened this pull request 8 months ago

add doc about serving option on dstack

github.com/vllm-project/vllm - deep-diver opened this pull request 8 months ago

Merge Gemma into Llama

github.com/vllm-project/vllm - WoosukKwon opened this pull request 8 months ago

[Feature] Add vision language model support.

github.com/vllm-project/vllm - xwjiang2010 opened this pull request 8 months ago

Support of AMD consumer GPUs

github.com/vllm-project/vllm - arno4000 opened this issue 8 months ago

Unable to specify GPU usage in VLLM code

github.com/vllm-project/vllm - humza-sami opened this issue 8 months ago

Separate attention backends

github.com/vllm-project/vllm - WoosukKwon opened this pull request 8 months ago

AWQ Quantization Memory Usage

github.com/vllm-project/vllm - vcivan opened this issue 8 months ago

Multi-GPU Support Failures with AMD MI210

github.com/vllm-project/vllm - tom-papatheodore opened this issue 8 months ago

Fix empty output when temp is too low

github.com/vllm-project/vllm - CatherineSue opened this pull request 8 months ago

E5-mistral-7b-instruct embedding support

github.com/vllm-project/vllm - DavidPeleg6 opened this issue 8 months ago

Runtime exception [step must be nonzero]

github.com/vllm-project/vllm - DreamGenX opened this issue 8 months ago

vllm keeps hanging when using djl-deepspeed

github.com/vllm-project/vllm - ali-firstparty opened this issue 8 months ago

Allow model to be served under multiple names

github.com/vllm-project/vllm - hmellor opened this pull request 8 months ago

HQQ quantization support

github.com/vllm-project/vllm - max-wittig opened this issue 8 months ago

Missing prometheus metrics in `0.3.0`

github.com/vllm-project/vllm - SamComber opened this issue 8 months ago

Add LoRA support for Mixtral

github.com/vllm-project/vllm - tterrysun opened this pull request 8 months ago

Add guided decoding for OpenAI API server

github.com/vllm-project/vllm - felixzhu555 opened this pull request 8 months ago

Adds support for gunicorn multiprocess process

github.com/vllm-project/vllm - jalotra opened this pull request 8 months ago

Add Splitwise implementation to vLLM

github.com/vllm-project/vllm - aashaka opened this pull request 8 months ago

model continue conversation

github.com/vllm-project/vllm - andrey-genpracc opened this issue 9 months ago

Add fused top-K softmax kernel for MoE

github.com/vllm-project/vllm - WoosukKwon opened this pull request 9 months ago

GPTQ & AWQ Fused MOE

github.com/vllm-project/vllm - chu-tianxiang opened this pull request 9 months ago

[Minor] More fix of test_cache.py CI test failure

github.com/vllm-project/vllm - LiuXiaoxuanPKU opened this pull request 9 months ago

Fix/async chat serving

github.com/vllm-project/vllm - schoennenbeck opened this pull request 9 months ago

KV Cache usage is 0% for mistral model

github.com/vllm-project/vllm - nikhilshandilya opened this issue 9 months ago

Ray worker out of memory

github.com/vllm-project/vllm - tristan279 opened this issue 9 months ago

Dockerfile: build-arg to punica kernel

github.com/vllm-project/vllm - AguirreNicolas opened this pull request 9 months ago

[RFC] Automatic Prefix Caching

github.com/vllm-project/vllm - zhuohan123 opened this issue 9 months ago

Speculative Decoding

github.com/vllm-project/vllm - ymwangg opened this pull request 9 months ago

RuntimeError on ROCm

github.com/vllm-project/vllm - rlrs opened this issue 9 months ago

Allow passing hf config args with openai server

github.com/vllm-project/vllm - Aakash-kaushik opened this issue 9 months ago