Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
github.com/vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
https://github.com/vllm-project/vllm
Allow passing HF config args with the OpenAI server
Aakash-kaushik opened this issue 11 months ago
OpenAI API server running but "POST /v1/chat/completions HTTP/1.1" 404 Not Found
Nero10578 opened this issue 11 months ago
`max_num_batched_tokens` and `max_num_seqs` values
isRambler opened this issue 11 months ago
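These two engine arguments bound each scheduler iteration: `max_num_batched_tokens` caps the total tokens batched per step, and `max_num_seqs` caps how many sequences run concurrently. A minimal sketch of tuning them (the model name and the values are placeholders):

```python
from vllm import LLM

# Hypothetical settings: at most 8192 tokens per scheduler step
# and at most 64 sequences in flight at once.
llm = LLM(
    model="facebook/opt-125m",      # placeholder model
    max_num_batched_tokens=8192,
    max_num_seqs=64,
)
```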
Aborted request without reason
erjieyong opened this issue 11 months ago
Support JSON mode.
MiyazonoKaori opened this issue 11 months ago
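For context, the OpenAI API feature being requested looks like this on the client side; whether the vLLM server honors `response_format` depends on the version (a hedged sketch; server URL and model are placeholders):

```python
from openai import OpenAI

# Point the official client at a vLLM OpenAI-compatible server (URL is a placeholder).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    messages=[{"role": "user", "content": "Reply with a JSON object listing three colors."}],
    # JSON mode as defined by the OpenAI API; vLLM support is version-dependent.
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```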
vLLM distributed inference stuck when using multi-GPU
RathoreShubh opened this issue 11 months ago
Add JSON format logging support
CatherineSue opened this pull request 11 months ago
Can anyone get Qwen-14B-Chat-AWQ to work with vLLM/TP?
s-natsubori opened this issue 12 months ago
top_k = 50 makes vLLM predictions align with transformers
sfyumi opened this issue 12 months ago
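The likely explanation: HF transformers samples with `top_k=50` by default, while vLLM's `SamplingParams` defaults to `top_k=-1` (no truncation), so setting the value explicitly aligns the two, assuming temperature and the other parameters already match. A sketch with a placeholder model:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# transformers' generate() samples with top_k=50 by default;
# vLLM defaults to top_k=-1 (disabled), so set it explicitly.
params = SamplingParams(temperature=1.0, top_k=50)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```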
Multi-node serving with vLLM - Problems with Ray
vbucaj opened this issue 12 months ago
Compute perplexity/logits for the prompt
dsmilkov opened this issue 12 months ago
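Prompt log-probabilities can be requested with `prompt_logprobs`, and perplexity follows from their mean. A sketch, assuming entries map token ids to either floats or `Logprob` objects (the exact shape has varied across versions):

```python
import math
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Ask for logprobs over the prompt itself; generate only one token.
params = SamplingParams(max_tokens=1, prompt_logprobs=1)
out = llm.generate(["The quick brown fox jumps over the lazy dog"], params)[0]

log_probs = []
for token_id, entry in zip(out.prompt_token_ids, out.prompt_logprobs):
    if entry is None:          # the first prompt token has no logprob
        continue
    value = entry[token_id]    # logprob of the actual prompt token
    log_probs.append(getattr(value, "logprob", value))

perplexity = math.exp(-sum(log_probs) / len(log_probs))
print(f"prompt perplexity ~ {perplexity:.2f}")
```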
OutOfMemoryError
Hobrus opened this issue 12 months ago
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED in v0.2.7
cocovoc opened this issue 12 months ago
AWQ compression of Llama 2 70B Chat gives bad results
fancyerii opened this issue 12 months ago
vLLM on OpenShift/Kubernetes Manifests
WinsonSou opened this issue 12 months ago
Out of memory with Mixtral AWQ
m0wer opened this issue 12 months ago
Docs: Add Haystack integration details
bilgeyucel opened this pull request 12 months ago
Could we support Fuyu-8B, a multimodal LLM?
leiwen83 opened this issue 12 months ago
Recent vLLMs ask for too much memory: ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
pseudotensor opened this issue 12 months ago
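The error fires when, after weights and activations are accounted for, no GPU memory is left for KV-cache blocks. Raising `gpu_memory_utilization` (default 0.90) or lowering `max_model_len` are the usual remedies; a sketch with placeholder values:

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model
    gpu_memory_utilization=0.95,  # give the engine a larger memory fraction
    max_model_len=4096,           # shorter context -> fewer cache blocks needed
)
```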
The output of vLLM differs from that of HF
will-wiki opened this issue about 1 year ago
[WIP] Speculative decoding using a draft model
cadedaniel opened this pull request about 1 year ago
Use LRU cache for CUDA Graphs
WoosukKwon opened this issue about 1 year ago
torch.cuda.OutOfMemoryError: CUDA out of memory
DenisStefanAndrei opened this issue about 1 year ago
argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
xxm1668 opened this issue about 1 year ago
How to pin vLLM inference to a specific GPU
SiqinLv opened this issue about 1 year ago
Unable to run any model with tensor_parallel_size > 1 on AWS SageMaker notebooks
samarthsarin opened this issue about 1 year ago
Inquiry Regarding vLLM Support for Mac Metal API
yihong1120 opened this issue about 1 year ago
Implement Triton-based AWQ kernel
WoosukKwon opened this pull request about 1 year ago
Support VLM models and the GPT-4V API
xunfeng1980 opened this issue about 1 year ago
vLLM RayWorker process hangs when using the LLM engine
SuoSiFire opened this issue about 1 year ago
[FEATURE REQUEST] SparQ Attention
AlpinDale opened this issue about 1 year ago
ARM aarch64 server build failed (host OS: Ubuntu 22.04.3)
zhudy opened this issue about 1 year ago
Why is online serving slower than offline serving?
BangDaeng opened this issue about 1 year ago
I want to add the mamba_chat (2.8B) model
SafeyahShemali opened this issue about 1 year ago
How to fix incomplete answers?
LuciAkirami opened this issue about 1 year ago
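Incomplete answers are most often the default token budget: `SamplingParams` defaults to `max_tokens=16`. Raising it and checking `finish_reason` usually resolves this; a sketch with a placeholder model:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# max_tokens defaults to 16, which cuts most answers short.
params = SamplingParams(max_tokens=1024, temperature=0.7)
out = llm.generate(["Explain how paged attention works."], params)[0].outputs[0]
# finish_reason is "length" when the budget (not a stop token) ended generation.
print(out.finish_reason, out.text[:200])
```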
Repeated answer: when using vLLM with OPT-13B, generation does not stop until max length is reached, repeating the answer
duihuhu opened this issue about 1 year ago
Error: RayWorkerVllm does not work when using --tensor-parallel-size. Please help.
JenniePing opened this issue about 1 year ago
Can it support macOS? (M2 chip)
znsoftm opened this issue about 1 year ago
Is there a way to terminate vllm.LLM and release the GPU memory
sfc-gh-zhwang opened this issue about 1 year ago
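These versions expose no official teardown API; a commonly shared best-effort workaround is to drop all references to the engine and flush the CUDA caching allocator (a sketch, not guaranteed to reclaim everything, e.g. under tensor parallelism):

```python
import gc

import torch
from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # placeholder model
# ... run inference ...

# Best-effort teardown: drop Python references, collect garbage,
# then release cached blocks back to the driver.
del llm
gc.collect()
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())  # should drop once references are gone
```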
Support `tools` and `tool_choice` parameter in OpenAI compatible service
simon-mo opened this issue about 1 year ago
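For reference, these are the OpenAI API parameters being requested; the client-side call would look like the sketch below once the server supports them (the URL, model, and `get_weather` tool are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder URL

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)
```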
01-ai/Yi-34B-Chat never stops
pseudotensor opened this issue about 1 year ago
ModuleNotFoundError: No module named "vllm._C"
Kawai1Ace opened this issue about 1 year ago
Please help me solve the problem. Thanks.
CP3666 opened this issue about 1 year ago
Proposal: force type hint check with mypy
wangkuiyi opened this issue about 1 year ago
pip install -e . failed
dachengai opened this issue about 1 year ago
Batched inference outputs are not the same as single inference
gesanqiu opened this issue about 1 year ago
vLLM always tries to download the model from Hugging Face/ModelScope even if --download-dir points at already-downloaded models
davideuler opened this issue about 1 year ago
Add worker registry service for hosting multiple vLLM models through a single API gateway
tjtanaa opened this issue about 1 year ago
How to use logits_processors
shuaiwang2022 opened this issue about 1 year ago
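A logits processor is a callable that receives the token ids generated so far plus the raw logits for the next token, and returns adjusted logits. A sketch that suppresses one hypothetical token id:

```python
from typing import List

import torch
from vllm import LLM, SamplingParams

BANNED_TOKEN_ID = 42  # hypothetical token id to suppress

def ban_token(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    # Called once per decoding step: token_ids are the tokens generated
    # so far; logits is the vocab-sized tensor for the next token.
    logits[BANNED_TOKEN_ID] = float("-inf")
    return logits

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(logits_processors=[ban_token])
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```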
NCCL error
maxmelichov opened this issue about 1 year ago
ImportError: libcudart.so.12
tranhoangnguyen03 opened this issue about 1 year ago
API causes slowdown in batch request handling
jpeig opened this issue about 1 year ago
Avoid re-initializing parallel groups
wangruohui opened this pull request about 1 year ago
[Feature] SYCL kernel support for Intel GPU
abhilash1910 opened this pull request about 1 year ago
Follow-up of #1687: handle safetensors models containing 0-rank tensors
twaka opened this pull request about 1 year ago
Plans to make the installation work on Windows without WSL?
alexandre-ist opened this issue about 1 year ago
API Server Performance
simon-mo opened this issue about 1 year ago
Usage of vLLM for extracting embeddings
ra-MANUJ-an opened this issue about 1 year ago
Revert 1 docker build
wasertech opened this pull request about 1 year ago
Generating with baichuan-13b-chat via vLLM, many test inputs (of varying length, none exceeding the length limit) produce only a single period, and some of those examples generate normally after a few words or sentences are deleted. What could be the cause?
MrInouye opened this issue about 1 year ago
Prompt caching
AIApprentice101 opened this issue about 1 year ago
No CUDA GPUs are available Error with vLLM in JupyterLab
SafeyahShemali opened this issue about 1 year ago
How to use the chat function
zhangzai666 opened this issue about 1 year ago
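`LLM` has no chat helper in these versions; the usual pattern is to render the messages with the tokenizer's chat template and feed the resulting string to `generate()`. A sketch with a placeholder chat model:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is vLLM?"},
]
# Render the conversation into the model's expected chat format.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

llm = LLM(model=model_id)
print(llm.generate([prompt], SamplingParams(max_tokens=256))[0].outputs[0].text)
```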
chatglm3: vllm/vllm/model_executor/models/chatglm.py, line 53, in __init__: `assert self.total_num_kv_heads % tp_size == 0` AssertionError
Changjy1997nb opened this issue about 1 year ago
Support for sparsity?
BDHU opened this issue about 1 year ago
Tensor parallelism on ray cluster
baojunliu opened this issue about 1 year ago
Adding support for switch-transformer / NLLB-MoE
yl3469 opened this issue about 1 year ago
[Bug] prompt_logprobs = 1 OOM problem
shunxing1234 opened this issue about 1 year ago
Error: when using the OpenAI-compatible server, the server is available but cannot be accessed from the same terminal
LuristheSun opened this issue about 1 year ago
Support W8A8 inference in vLLM
AniZpZ opened this pull request about 1 year ago
Support int8 KV cache quantization in vLLM
AniZpZ opened this pull request about 1 year ago
Added logits processor API to sampling params
noamgat opened this pull request about 1 year ago
ImportError: cannot import name 'MistralConfig' from 'transformers'
peter-ch opened this issue about 1 year ago
Adding Locally Typical Sampling (i.e. typical_p in transformers and TGI)
seongminp opened this issue about 1 year ago
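For reference, locally typical sampling is already exposed in HF transformers as the `typical_p` argument to `generate()`; the request is for an equivalent `SamplingParams` option in vLLM. The transformers usage looks like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("Once upon a time", return_tensors="pt")
# typical_p keeps tokens whose information content is close to the expected
# value (Meister et al., 2022), analogous to top_p's probability-mass cutoff.
out = model.generate(**inputs, do_sample=True, typical_p=0.9, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```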
Does vLLM support Mac/Metal/MPS?
Phil-U-U opened this issue about 1 year ago
Using VLLM with a Tesla T4 on SageMaker Studio (ml.g4dn.xlarge instance)
paulovasconcellos-hotmart opened this issue about 1 year ago
[Question] Does vLLM support macOS M1 or M2 chips?
acekingke opened this issue about 1 year ago
Could not build wheels for vllm, which is required to install pyproject.toml-based projects
ABooth01 opened this issue about 1 year ago
Run multiple replicas behind a load balancer
linkedlist771 opened this issue about 1 year ago
How to deploy vllm model across multiple nodes in kubernetes?
Ryojikn opened this issue about 1 year ago
[Error] 400 Bad Request
Tostino opened this issue about 1 year ago
What is the max number of prompts that the generate() method can take?
hxue3 opened this issue about 1 year ago
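There is no hard cap on the list length: `generate()` queues every prompt and the scheduler batches them internally under `max_num_seqs` / `max_num_batched_tokens`, so the practical limit is host memory. A sketch:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Thousands of prompts are fine: the scheduler queues them and batches
# internally; GPU memory use is governed by engine limits, not list length.
prompts = [f"Write a haiku about item {i}." for i in range(2000)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
print(len(outputs))  # one RequestOutput per prompt, in order
```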
Low VRAM batch processing mode
viktor-ferenczi opened this issue about 1 year ago
Feature request: allow multiple model instances on one GPU if they fit in VRAM
agrogov opened this issue about 1 year ago
feat: demonstrate using regex for suffix matching
wsxiaoys opened this pull request about 1 year ago
Memory leak
SatoshiReport opened this issue about 1 year ago
StreamingLLM support?
nivibilla opened this issue about 1 year ago
Workaround for AWQ on Turing GPUs
twaka opened this pull request about 1 year ago
Jetson AGX Orin
MrBrabus75 opened this issue about 1 year ago
Data parallel inference
kevinhu opened this issue about 1 year ago
Support Python 3.12
EwoutH opened this issue about 1 year ago
3 GPUs not supported?
ye7love7 opened this issue about 1 year ago
vLLM generates nothing in the output
FocusLiwen opened this issue about 1 year ago
[Discussion] Will vLLM consider using speculative sampling to accelerate LLM decoding?
gesanqiu opened this issue about 1 year ago
Adding a locally trained model to vLLM
atanikan opened this issue over 1 year ago
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly.
MUZAMMILPERVAIZ opened this issue over 1 year ago
AWQ: bfloat16 not supported? And `--dtype` arg doesn't allow specifying float16
TheBloke opened this issue over 1 year ago
vLLM Discord Server
zhuohan123 opened this issue over 1 year ago