Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLMs).
Collective: https://opencollective.com/vllm
Host: opensource
Code: https://github.com/vllm-project/vllm
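Many of the issues listed below touch vLLM's Python API. As a quick orientation, here is a minimal offline-generation sketch using vLLM's public LLM and SamplingParams classes; the model name (facebook/opt-125m), prompts, and sampling values are illustrative placeholders, not recommendations.

from vllm import LLM, SamplingParams

# Illustrative prompts; any text works.
prompts = [
    "The capital of France is",
    "The future of AI is",
]

# Sampling values here are placeholders, not tuned recommendations.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# facebook/opt-125m is an assumption: a small model that fits on one GPU.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")

The serving-related issues below typically go through vllm.entrypoints.api_server or the OpenAI-compatible server, which wrap this same engine.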
When starting a second vllm.entrypoints.api_server using tensor parallelism on a single node, the second api_server gets stuck at "Started a local Ray instance." or fails with "Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory"
github.com/vllm-project/vllm - durant1999 opened this issue 11 months ago
Is 'all-reduce kernels are temporarily disabled' the cause of the higher memory requirement?
github.com/vllm-project/vllm - SafeyahShemali opened this issue 11 months ago
inference with AWQ quantization
github.com/vllm-project/vllm - Kev1ntan opened this issue 11 months ago
Bug when top_k is passed as a float outside the valid range
github.com/vllm-project/vllm - Drzhivago264 opened this issue 11 months ago
[Feature Request] Add GPTQ quantization kernels for 4-bit NormalFloat (NF4) use cases.
github.com/vllm-project/vllm - duchengyao opened this issue 11 months ago
What's the difference between the seed in LLMEngine and the seed in SamplingParams?
github.com/vllm-project/vllm - tomdzh opened this issue 11 months ago
Is it possible to use vllm-0.3.3 with CUDA 11.8?
github.com/vllm-project/vllm - HSLUCKY opened this issue 11 months ago
Implement structured engine for parsing json grammar by token with `response_format: {type: json_object}`
github.com/vllm-project/vllm - pathorn opened this pull request 11 months ago
What's up with Pipeline Parallelism?
github.com/vllm-project/vllm - duanzhaol opened this issue 11 months ago
How to run the gemma-7b model with vLLM 0.3.3 under CUDA 11.8?
github.com/vllm-project/vllm - adogwangwang opened this issue 11 months ago
When chat-ui and vLLM are used together, the dialogue output of Llama-2-70b-chat-hf (safetensors file) is abnormal.
github.com/vllm-project/vllm - majestichou opened this issue 11 months ago
AsyncEngineDeadError when LoRA loading fails
github.com/vllm-project/vllm - lifuhuang opened this issue 11 months ago
Multi-LoRA - Support for providing /load and /unload API
github.com/vllm-project/vllm - gauravkr2108 opened this issue 11 months ago
[feature on nm-vllm] Sparse inference with weight-only int8 quantization
github.com/vllm-project/vllm - shiqingzhangCSU opened this issue 11 months ago
Question regarding GPU memory allocation
github.com/vllm-project/vllm - wx971025 opened this issue 11 months ago
lm-evaluation-harness broken on master
github.com/vllm-project/vllm - pcmoritz opened this issue 11 months ago
v0.3.3 API server can't start up with the Neuron SDK
github.com/vllm-project/vllm - qingyuan18 opened this issue 11 months ago
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU)
github.com/vllm-project/vllm - AdrianAbeyta opened this pull request 11 months ago
[FIX] Fix prefix test error on main
github.com/vllm-project/vllm - zhuohan123 opened this pull request 11 months ago
Order of keys for guided JSON
github.com/vllm-project/vllm - ccdv-ai opened this issue 11 months ago
Regression in llama model inference due to #3005
github.com/vllm-project/vllm - Qubitium opened this issue 11 months ago
Installing from source failed using the latest code
github.com/vllm-project/vllm - sleepwalker2017 opened this issue 11 months ago
[FIX] Make `flash_attn` optional
github.com/vllm-project/vllm - WoosukKwon opened this pull request 11 months ago
[Minor fix] Include flash_attn in docker image
github.com/vllm-project/vllm - tdoublep opened this pull request 11 months ago
Error when prompt_logprobs + enable_prefix_caching
github.com/vllm-project/vllm - bgyoon opened this issue 11 months ago
Can vLLM handle concurrent requests with FastAPI?
github.com/vllm-project/vllm - Strongorange opened this issue 11 months ago
OpenAI Tools / function calling v2
github.com/vllm-project/vllm - FlorianJoncour opened this pull request 11 months ago
Prefix Caching with FP8 KV cache support
github.com/vllm-project/vllm - chenxu2048 opened this pull request 11 months ago
When running pytest tests/, undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
github.com/vllm-project/vllm - Imss27 opened this issue 11 months ago
vLLM fails to load a SqueezeLLM-quantized model
github.com/vllm-project/vllm - zuosong-peng opened this issue 11 months ago
[WIP] Build FlashInfer
github.com/vllm-project/vllm - WoosukKwon opened this pull request 11 months ago
Got completely wrong answers for the openchat model with vLLM
github.com/vllm-project/vllm - v-yunbin opened this issue 11 months ago
[Feature request] Output attention scores in vLLM
github.com/vllm-project/vllm - ChenxinAn-fdu opened this issue 11 months ago
Unable to run distributed inference on ray with tensor parallel size > 1
github.com/vllm-project/vllm - pravingadakh opened this issue 11 months ago
Supporting embedding models
github.com/vllm-project/vllm - jc9123 opened this pull request 11 months ago
Support `response_format: json_object` in OpenAI server
github.com/vllm-project/vllm - simon-mo opened this issue 11 months ago
[ROCm] Add support for Punica kernels on AMD GPUs
github.com/vllm-project/vllm - kliuae opened this pull request 11 months ago
Does vLLM support the 4-bit quantized version of the Mixtral-8x7B-Instruct-v0.1 model downloaded from Hugging Face?
github.com/vllm-project/vllm - leockl opened this issue 11 months ago
Benchmarking script does not limit the maximum concurrency
github.com/vllm-project/vllm - wangchen615 opened this issue 11 months ago
RuntimeError while running any model with the embeddedllminfo/vllm-rocm:vllm-v0.2.4 image and ROCm 5.7 (RHEL 8.7)
github.com/vllm-project/vllm - AjayKadoula opened this issue 11 months ago
Should one use tokenizer templates during offline inference?
github.com/vllm-project/vllm - vmkhlv opened this issue 11 months ago
Loading models from an S3 location instead of local path
github.com/vllm-project/vllm - simon-mo opened this issue 11 months ago
Add doc about the serving option on dstack
github.com/vllm-project/vllm - deep-diver opened this pull request 11 months ago
OpenAI Server issue when running on Apptainer (HPC)
github.com/vllm-project/vllm - vishruth-v opened this issue 11 months ago
Failed to build from source on ROCm (with pytorch and xformers working correctly)
github.com/vllm-project/vllm - nayn99 opened this issue 11 months ago
Building VLLM from source and running inference: No module named 'vllm._C'
github.com/vllm-project/vllm - Lena-Jurkschat opened this issue 11 months ago
Is there a mechanism of priorities when sending a new request?
github.com/vllm-project/vllm - brunorigal opened this issue 11 months ago
TypeError: 'NoneType' object is not callable
github.com/vllm-project/vllm - lixiaolx opened this issue 11 months ago
Fatal Python error: Segmentation fault
github.com/vllm-project/vllm - lmx760581375 opened this issue 11 months ago
Error running qwen1.5-14b-chat with the vLLM container
github.com/vllm-project/vllm - James-Dao opened this issue 11 months ago
How to shut off the unnecessary log that prints every 10s
github.com/vllm-project/vllm - sxk000 opened this issue 11 months ago
Merge Gemma into Llama
github.com/vllm-project/vllm - WoosukKwon opened this pull request 11 months ago
[Feature] Add vision language model support.
github.com/vllm-project/vllm - xwjiang2010 opened this pull request 11 months ago
Support for AMD consumer GPUs
github.com/vllm-project/vllm - arno4000 opened this issue 11 months ago
When deploying qwen1.5-7B-Chat, the API endpoint's responses are missing 10 characters
github.com/vllm-project/vllm - gaijigoumeiren opened this issue 11 months ago
Qwen 14B AWQ deploy: AttributeError: 'ndarray' object has no attribute '_torch_dtype'
github.com/vllm-project/vllm - testTech92 opened this issue 11 months ago
[BUG] Prompt logprobs causing tensor broadcast issue in `sampler.py`
github.com/vllm-project/vllm - AetherPrior opened this issue 11 months ago
Lots of blank output before each running step
github.com/vllm-project/vllm - Eutenacity opened this issue 11 months ago
AWQ: Implement new kernels (64% faster decoding)
github.com/vllm-project/vllm - casper-hansen opened this issue 11 months ago
Large length variance of sampled sequences from llama2 70b model compared to HuggingFace .generate()
github.com/vllm-project/vllm - uralik opened this issue 11 months ago
Unable to specify GPU usage in VLLM code
github.com/vllm-project/vllm - humza-sami opened this issue 11 months ago
Separate attention backends
github.com/vllm-project/vllm - WoosukKwon opened this pull request 11 months ago
Some errors happened when installing vLLM
github.com/vllm-project/vllm - finylink opened this issue 11 months ago
How can I use a LoRA adapter for a model with vocab size 40960?
github.com/vllm-project/vllm - hrson-1203 opened this issue 11 months ago
Failed to find C compiler. Please specify via CC environment variable
github.com/vllm-project/vllm - gangooteli opened this issue 11 months ago
Fix: Echo without asking for new tokens or logprobs in OpenAI Completions API
github.com/vllm-project/vllm - matheper opened this pull request 11 months ago
Limited Request Handling for AMD Instinct MI300X GPUs with Tensor Parallelism > 1
github.com/vllm-project/vllm - Spurthi-Bhat-ScalersAI opened this issue 11 months ago
Question: inference results are all empty when running a fine-tuned qwen-14b model with vLLM
github.com/vllm-project/vllm - lalalabobobo opened this issue 12 months ago
The answer accuracy of the Qwen series models is degraded
github.com/vllm-project/vllm - zhochengbiao opened this issue 12 months ago
Results from serving qwen7B with vLLM are inconsistent with the original qwen results, and accuracy drops significantly
github.com/vllm-project/vllm - chenshukai1015 opened this issue 12 months ago
Multi-GPU Support Failures with AMD MI210
github.com/vllm-project/vllm - tom-papatheodore opened this issue 12 months ago
Fix empty output when temp is too low
github.com/vllm-project/vllm - CatherineSue opened this pull request 12 months ago
E5-mistral-7b-instruct embedding support
github.com/vllm-project/vllm - DavidPeleg6 opened this issue 12 months ago
Runtime exception [step must be nonzero]
github.com/vllm-project/vllm - DreamGenX opened this issue 12 months ago
The results of a vLLM deployment of qwen-14B are inconsistent with the results of the original qwen-14B
github.com/vllm-project/vllm - qingjiaozyn opened this issue 12 months ago
vllm keeps hanging when using djl-deepspeed
github.com/vllm-project/vllm - ali-firstparty opened this issue 12 months ago
api_server.py: error: unrecognized arguments: --lora-modules sql-lora=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
github.com/vllm-project/vllm - xueyongfu11 opened this issue 12 months ago
--tensor-parallel-size 2 fails to load on GCP
github.com/vllm-project/vllm - noamgat opened this issue 12 months ago
Duplicate Token `<s>` in Tokenizer Encoded Token ids
github.com/vllm-project/vllm - zxybazh opened this issue 12 months ago
Add docker-compose.yml and corresponding .env
github.com/vllm-project/vllm - WolframRavenwolf opened this pull request 12 months ago
Allow model to be served under multiple names
github.com/vllm-project/vllm - hmellor opened this pull request 12 months ago
Missing prometheus metrics in `0.3.0`
github.com/vllm-project/vllm - SamComber opened this issue 12 months ago
Please add LoRA support for higher ranks and alpha values
github.com/vllm-project/vllm - parikshitsaikia1619 opened this issue 12 months ago
Add LoRA support for Mixtral
github.com/vllm-project/vllm - tterrysun opened this pull request 12 months ago
vLLM running on a Ray cluster hangs on initialization
github.com/vllm-project/vllm - Kaotic3 opened this issue 12 months ago
Add guided decoding for OpenAI API server
github.com/vllm-project/vllm - felixzhu555 opened this pull request 12 months ago
Adds support for gunicorn multiprocess mode
github.com/vllm-project/vllm - jalotra opened this pull request 12 months ago
Incorrect completions with tensor parallel size of 8 on MI300X GPUs
github.com/vllm-project/vllm - seungduk-yanolja opened this issue 12 months ago