Ecosyste.ms: Open Collective

An open API service for software projects hosted on Open Collective.

github.com/vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://github.com/vllm-project/vllm

TCPStore is not available
Z-Diviner opened this issue 10 months ago

Is it possible to use vllm-0.3.3 with CUDA 11.8?
HSLUCKY opened this issue 10 months ago

Add aya-101 model
ahkarami opened this issue 11 months ago

What's up with Pipeline Parallelism?
duanzhaol opened this issue 11 months ago

How to run the gemma-7b model with vllm 0.3.3 under CUDA 11.8?
adogwangwang opened this issue 11 months ago

AsyncEngineDeadError when LoRA loading fails
lifuhuang opened this issue 11 months ago

Multi-LoRA - Support for providing /load and /unload API
gauravkr2108 opened this issue 11 months ago
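
vLLM's multi-LoRA support in the 0.3.x series covered here attaches adapters per request via `LoRARequest` rather than through dedicated /load and /unload endpoints. A minimal offline sketch; the model, adapter name, and adapter path below are placeholders, not values from the issue:

```python
# Minimal multi-LoRA sketch for vLLM 0.3.x (placeholder model and paths).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    enable_lora=True,   # reserve capacity for LoRA adapters
    max_loras=4,        # adapters resident in GPU memory at once
)
params = SamplingParams(temperature=0.0, max_tokens=64)

# The adapter is loaded lazily the first time its id is seen and may be
# evicted later, which stands in for explicit /load and /unload calls.
outputs = llm.generate(
    ["Translate to SQL: list all users"],
    params,
    lora_request=LoRARequest("sql-adapter", 1, "/path/to/sql_adapter"),
)
print(outputs[0].outputs[0].text)
```
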
[feature on nm-vllm] Sparse Inference with weight only int8 quant
shiqingzhangCSU opened this issue 11 months ago

Question regarding GPU memory allocation
wx971025 opened this issue 11 months ago
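
For questions like the one above: vLLM preallocates a fixed fraction of GPU memory at startup for the weights and the PagedAttention KV cache, so usage looks flat regardless of traffic. A short sketch of the two knobs that bound the reservation (values are illustrative):

```python
# Sketch: bounding vLLM's upfront GPU memory reservation.
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    gpu_memory_utilization=0.7,  # reserve ~70% of VRAM instead of the 0.9 default
    max_model_len=4096,          # shorter max context -> smaller KV-cache pool
)
```
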
Error compiling kernels
declark1 opened this issue 11 months ago

lm-evaluation-harness broken on master
pcmoritz opened this issue 11 months ago

v0.3.3 API server can't start up with the Neuron SDK
qingyuan18 opened this issue 11 months ago

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU)
AdrianAbeyta opened this pull request 11 months ago

[FIX] Fix prefix test error on main
zhuohan123 opened this pull request 11 months ago

Mixtral 4x 4090 OOM
SinanAkkoyun opened this issue 11 months ago

Order of keys for guided JSON
ccdv-ai opened this issue 11 months ago

Regression in llama model inference due to #3005
Qubitium opened this issue 11 months ago

Unload the model
osafaimal opened this issue 11 months ago

Install from source fails using the latest code
sleepwalker2017 opened this issue 11 months ago

[FIX] Make `flash_attn` optional
WoosukKwon opened this pull request 11 months ago

[Minor fix] Include flash_attn in docker image
tdoublep opened this pull request 11 months ago

Error when prompt_logprobs + enable_prefix_caching
bgyoon opened this issue 11 months ago

Can vLLM handle concurrent requests with FastAPI?
Strongorange opened this issue 11 months ago
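
Yes, provided the async engine is used so requests interleave instead of blocking the event loop. A minimal sketch with vLLM's `AsyncLLMEngine` (0.3.x-era API; the endpoint shape is illustrative):

```python
# Sketch: serving concurrent requests from FastAPI with AsyncLLMEngine.
import uuid

from fastapi import FastAPI
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="mistralai/Mistral-7B-v0.1"))

@app.post("/generate")
async def generate(prompt: str) -> dict:
    # engine.generate yields partial outputs; keep the last one.
    final = None
    async for out in engine.generate(
            prompt, SamplingParams(max_tokens=128), str(uuid.uuid4())):
        final = out
    return {"text": final.outputs[0].text}
```
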
OpenAI Tools / function calling v2
FlorianJoncour opened this pull request 11 months ago

Prefix Caching with FP8 KV cache support
chenxu2048 opened this pull request 11 months ago

vLLM fails to load a SqueezeLLM-quantized model
zuosong-peng opened this issue 11 months ago

[WIP] Build FlashInfer
WoosukKwon opened this pull request 11 months ago

ExLlamaV2: exl2 support
pabl-o-ce opened this issue 11 months ago

Completely wrong answers for the openchat model with vllm
v-yunbin opened this issue 11 months ago

[Feature request] Output attention scores in vLLM
ChenxinAn-fdu opened this issue 11 months ago

Supporting embedding models
jc9123 opened this pull request 11 months ago

Support `response_format: json_object` in OpenAI server
simon-mo opened this issue 11 months ago

[ROCm] Add support for Punica kernels on AMD GPUs
kliuae opened this pull request 11 months ago

Benchmarking script does not limit the maximum concurrency
wangchen615 opened this issue 11 months ago

Should one use tokenizer templates during offline inference?
vmkhlv opened this issue 11 months ago
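
For instruct-tuned checkpoints, the usual answer is yes: without the chat template the model sees raw text instead of the turn markers it was trained on. A minimal sketch using Hugging Face's `apply_chat_template` before offline generation (model id is illustrative):

```python
# Sketch: applying the model's chat template for offline inference.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is PagedAttention?"}],
    tokenize=False,
    add_generation_prompt=True,  # append the assistant turn prefix
)

llm = LLM(model=model_id)
out = llm.generate([prompt], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```
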
Loading models from an S3 location instead of a local path
simon-mo opened this issue 11 months ago
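
Until such support lands, a common workaround is to stage the weights on local disk and point vLLM at the directory. A hedged sketch with boto3; the bucket, prefix, and paths are placeholders:

```python
# Sketch: staging S3-hosted weights locally before constructing the engine.
import os

import boto3
from vllm import LLM

bucket, prefix, local_dir = "my-bucket", "models/llama-7b/", "/tmp/llama-7b"
os.makedirs(local_dir, exist_ok=True)

s3 = boto3.client("s3")
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket,
                                                         Prefix=prefix):
    for obj in page.get("Contents", []):
        dest = os.path.join(local_dir, os.path.basename(obj["Key"]))
        s3.download_file(bucket, obj["Key"], dest)

llm = LLM(model=local_dir)  # a local directory works anywhere a hub id does
```
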
Add doc about serving option on dstack
deep-diver opened this pull request 11 months ago

OpenAI Server issue when running on Apptainer (HPC)
vishruth-v opened this issue 11 months ago

Is there a priority mechanism when sending a new request?
brunorigal opened this issue 11 months ago

TypeError: 'NoneType' object is not callable
lixiaolx opened this issue 11 months ago

Fatal Python error: Segmentation fault
lmx760581375 opened this issue 11 months ago

Error running qwen1.5-14b-chat with the vllm container
James-Dao opened this issue 11 months ago

How to shut off the unnecessary log that prints every 10s
sxk000 opened this issue 11 months ago

Merge Gemma into Llama
WoosukKwon opened this pull request 11 months ago

[Feature] Add vision language model support.
xwjiang2010 opened this pull request 11 months ago

Support of AMD consumer GPUs
arno4000 opened this issue 11 months ago

[BUG] Prompt logprobs causing tensor broadcast issue in `sampler.py`
AetherPrior opened this issue 11 months ago

Lots of blank output before each running step
Eutenacity opened this issue 11 months ago

AWQ: Implement new kernels (64% faster decoding)
casper-hansen opened this issue 11 months ago

Unable to specify GPU usage in vLLM code
humza-sami opened this issue 11 months ago
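
A common cause of the issue above is setting `CUDA_VISIBLE_DEVICES` after CUDA has already initialized. A minimal sketch of pinning vLLM to specific GPUs (device ids and model are illustrative):

```python
# Sketch: pin vLLM to GPUs 2 and 3. The env var must be set before any
# CUDA context is created, i.e. before the engine is imported/constructed.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

from vllm import LLM
llm = LLM(model="meta-llama/Llama-2-13b-hf", tensor_parallel_size=2)
```
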
Separate attention backends
WoosukKwon opened this pull request 11 months ago

Some errors happened when installing vllm
finylink opened this issue 11 months ago

How can I use a LoRA adapter for a model with vocab size 40960?
hrson-1203 opened this issue 11 months ago

Limited Request Handling for AMD Instinct MI300X GPUs with Tensor Parallelism > 1
Spurthi-Bhat-ScalersAI opened this issue 11 months ago

Results are all empty when running vllm inference on a fine-tuned qwen-14b model
lalalabobobo opened this issue 11 months ago

The answer accuracy of the Qwen series models is lost
zhochengbiao opened this issue 11 months ago

AWQ Quantization Memory Usage
vcivan opened this issue 11 months ago

Multi-GPU Support Failures with AMD MI210
tom-papatheodore opened this issue 11 months ago

Fix empty output when temp is too low
CatherineSue opened this pull request 11 months ago

E5-mistral-7b-instruct embedding support
DavidPeleg6 opened this issue 11 months ago

Runtime exception [step must be nonzero]
DreamGenX opened this issue 11 months ago

vllm keeps hanging when using djl-deepspeed
ali-firstparty opened this issue 11 months ago

--tensor-parallel-size 2 fails to load on GCP
noamgat opened this issue 11 months ago

Duplicate Token `<s>` in Tokenizer Encoded Token ids
zxybazh opened this issue 11 months ago

Add docker-compose.yml and corresponding .env
WolframRavenwolf opened this pull request 11 months ago

Allow model to be served under multiple names
hmellor opened this pull request 11 months ago

HQQ quantization support
max-wittig opened this issue 11 months ago

Missing prometheus metrics in `0.3.0`
SamComber opened this issue 11 months ago

Please add LoRA support for higher ranks and alpha values
parikshitsaikia1619 opened this issue 11 months ago

Add LoRA support for Mixtral
tterrysun opened this pull request 12 months ago

vLLM running on a Ray cluster hangs on initialization
Kaotic3 opened this issue 12 months ago

Add guided decoding for OpenAI API server
felixzhu555 opened this pull request 12 months ago
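
Once that PR's feature is available, guided decoding is exposed through vLLM-specific extension fields on the OpenAI-compatible server. A hedged sketch using the `openai` client's `extra_body` passthrough; the server address and model are placeholders:

```python
# Sketch: constraining output to a JSON schema via vLLM's guided decoding.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Invent a person and return JSON."}],
    extra_body={"guided_json": schema},  # vLLM extension, not a stock OpenAI field
)
print(resp.choices[0].message.content)
```
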
Adds support for gunicorn multiprocess mode
jalotra opened this pull request 12 months ago

Incorrect completions with tensor parallel size of 8 on MI300X GPUs
seungduk-yanolja opened this issue 12 months ago

OpenAI completions API with echo=True raises an error
seoyunYang opened this issue 12 months ago

Add Splitwise implementation to vLLM
aashaka opened this pull request 12 months ago

NVIDIA H20 with nvcr.io/nvidia/pytorch:23.12-py3: cuBLAS error
tohneecao opened this issue 12 months ago

Multi GPU ROCm6 issues, and workarounds
BKitor opened this issue 12 months ago

How to make the model continue a conversation
andrey-genpracc opened this issue 12 months ago