Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
github.com/vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
https://github.com/vllm-project/vllm
[Misc]: Implement CPU/GPU swapping in BlockManagerV2
cadedaniel opened this issue 9 months ago
[Hardware][AMD][Kernel]Adding custom kernel for vector query on Rocm
charlifu opened this pull request 9 months ago
[Bug]: ChatCompletion prompt_logprobs does not work
noamgat opened this issue 9 months ago
[RFC] Initial Support for CPUs
bigPYJ1151 opened this issue 9 months ago
[Usage]: Generate specified number of tokens for each request individually
oximi123 opened this issue 9 months ago
[Kernel] Use flash-attn for decoding
skrider opened this pull request 9 months ago
[Misc] add the "download-dir" option to the latency/throughput benchmarks
AmadeusChan opened this pull request 9 months ago
[RFC] Initial Support for Cloud TPUs
WoosukKwon opened this issue 9 months ago
[Bug]: The fine-tuned qwen1.5 model uses transformers generate() to have a normal dialogue, but the dialogue output using vllm openai API has multiple line breaks.
qianghuangwhu opened this issue 9 months ago
parent_child_dict[sample.parent_seq_id].append(sample) KeyError: 4
Stosan opened this issue 9 months ago
ModuleNotFoundError: No module named 'transformers_modules' with API serving using phi-2b
haining78zhang opened this issue 9 months ago
[BugFix] Fix Falcon tied embeddings
WoosukKwon opened this pull request 9 months ago
[RFC]: Interface and Abstraction for Distributed Inference Environment
youkaichao opened this issue 9 months ago
[Misc]: Throughput/Latency for guided_json with ~100% GPU cache utilization
jens-create opened this issue 9 months ago
[Feature]: Offload Model Weights to CPU
chenqianfzh opened this issue 9 months ago
[New Model]: Phi-2 support for LoRA
andykhanna opened this issue 9 months ago
[Feature]: Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
tchaton opened this issue 9 months ago
[CI] Create nightly images/wheels
simon-mo opened this issue 9 months ago
[Usage]: model not support lora but listed in supported models
xiaobo-Chen opened this issue 9 months ago
[Usage]: punica LoRA kernels could not be imported. If you built vLLM from source, make sure VLLM_INSTALL_PUNICA_KERNELS=1 env var was set.
nlp-learner opened this issue 9 months ago
[Feature]: Support Guided Decoding in `LLM` entrypoint
simon-mo opened this issue 9 months ago
[Feature]: FastServe - Fast Distributed Inference Serving for Large Language Models
chizhang118 opened this issue 9 months ago
[Bug]: when installing vllm by pip, some errors happened.
finylink opened this issue 9 months ago
[Usage]: How to inference model with multi-gpus
ckj18 opened this issue 9 months ago
[Kernel] Full Tensor Parallelism for LoRA Layers
FurtherAI opened this pull request 9 months ago
[Bug]: aisingapore/sea-lion-7b-instruct fails with assert config.embedding_fraction == 1.0
pseudotensor opened this issue 9 months ago
[Feature]: Support distributing serving with KubeRay's autoscaler
TrafalgarZZZ opened this issue 9 months ago
[Bug]: vllm slows down after a long run
momomobinx opened this issue 9 months ago
[New Model]: Please support CogVLM
kietna1809 opened this issue 9 months ago
[Misc] Add attention sinks
felixzhu555 opened this pull request 9 months ago
[Bug]: Use of LoRARequest
meiru-cam opened this issue 9 months ago
[BugFix][Frontend] Use correct, shared tokenizer in OpenAI server
njhill opened this pull request 9 months ago
[Core] Add generic typing to `LRUCache`
njhill opened this pull request 9 months ago
[Usage]: Set dtype for VLLM using YAML
telekoteko opened this issue 9 months ago
Dynamic Multi LoRA Load \ Delete Support
gauravkr2108 opened this pull request 9 months ago
[Feature]: Compute and log the serving FLOPs
zhuohan123 opened this issue 9 months ago
[Usage]: Why increase max-num-seqs will use less memory
TaChao opened this issue 9 months ago
[Bug]: DynamicNTKScalingRotaryEmbedding implementation is different from Transformers
killawhale2 opened this issue 9 months ago
[Frontend] [Core] feat: Add model loading using `tensorizer`
sangstar opened this pull request 9 months ago
[Frontend] Support complex message content for chat completions endpoint
fgreinacher opened this pull request 9 months ago
[Core] Multiprocessing executor for single-node multi-GPU deployment
njhill opened this pull request 9 months ago
[Bug]: ValueError: Model architectures ['Qwen2ForCausalLM'] are not supported for now
finylink opened this issue 9 months ago
baichuan/qwen/chatlgm with lora adaption [feature]
kexuedaishu opened this issue 9 months ago
[Bugfix] Fix beam search logits processor
maximzubkov opened this pull request 9 months ago
[Feature]: Control vectors
generalsvr opened this issue 9 months ago
[Core] Support thread-based async tokenizer pools
njhill opened this pull request 9 months ago
[Bug]: Bug in Guided Generation Logits Processor with `n>1`
maximzubkov opened this issue 9 months ago
[Frontend] support new lora module to a live server in OpenAI Entrypoints
AlphaINF opened this pull request 9 months ago
[Test] Add a randomized test for OpenAI API
dylanwhawk opened this issue 9 months ago
[Bug]: Incompatible version between torch and triton
mzz12 opened this issue 9 months ago
Does vllm support pytorch/xla ?
dinghaodhd opened this issue 9 months ago
[Misc] add HOST_IP env var
youkaichao opened this pull request 9 months ago
[Bug]: RuntimeError: invalid argument to reset_peak_memory_stats when offline sampling using neuron
Sadden opened this issue 9 months ago
Incremental output for LLM entrypoint
yhu422 opened this pull request 9 months ago
Unable to load LoRA fine-tuned LLM from HF (AssertionError)
oscar-martin opened this issue 9 months ago
[Prefill with Prefix Cache] Improve the efficiency of prefilling with prefix cache by allowing a larger batch size
MeloYang05 opened this pull request 9 months ago
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
tchaton opened this issue 9 months ago
[Feature] Implement FastV's Token Pruning
chenllliang opened this issue 9 months ago
(core dumped) when running `vllm` with `AWQ` on `MIG` partition of a H100 GPU
remiconnesson opened this issue 9 months ago
ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory
mcleish7 opened this issue 9 months ago
Sampling is very slow, causing a CPU bottleneck
m-harmonic opened this issue 9 months ago
Can you choose which GPU to use, like tf inference device_map="cuda:0"?
wellcasa opened this issue 9 months ago
[TEST] Add a distributed test for async LLM engine.
zhuohan123 opened this issue 9 months ago
When starting the second vllm.entrypoints.api_server using tensor parallel in a single node, the second vllm api_server Stuck in " Started a local Ray instance." OR "Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory"
durant1999 opened this issue 9 months ago
Is 'all-reduce kernels are temporarily disabled' the cause of the higher memory requirement?
SafeyahShemali opened this issue 9 months ago
DeepSeek VL support
SinanAkkoyun opened this issue 9 months ago
inference with AWQ quantization
Kev1ntan opened this issue 9 months ago
Fixes #1556 double free
br3no opened this pull request 10 months ago
Bug when input top_k as a float that is outside of range
Drzhivago264 opened this issue 10 months ago
[Feature Request] Add GPTQ quantization kernels for 4-bit NormalFloat (NF4) use cases.
duchengyao opened this issue 10 months ago
TCPStore is not available
Z-Diviner opened this issue 10 months ago
What's difference between the seed in LLMEngine and seed in SamplingParams?
tomdzh opened this issue 10 months ago
Is it possible to use vllm-0.3.3 with CUDA 11.8
HSLUCKY opened this issue 10 months ago
Implement structured engine for parsing json grammar by token with `response_format: {type: json_object}`
pathorn opened this pull request 10 months ago
add aya-101 model
ahkarami opened this issue 10 months ago
What's up with Pipeline Parallelism?
duanzhaol opened this issue 10 months ago
how to run gemma-7b model with vllm 0.3.3 under CUDA 11.8?
adogwangwang opened this issue 10 months ago
When chat-ui and vllm are used together, the dialogue output of Llama-2-70b-chat-hf(safetensor file) is abnormal.
majestichou opened this issue 10 months ago
AsyncEngineDeadError when LoRA loading fails
lifuhuang opened this issue 10 months ago
Multi-LoRA - Support for providing /load and /unload API
gauravkr2108 opened this issue 10 months ago
[feature on nm-vllm] Sparse Inference with weight only int8 quant
shiqingzhangCSU opened this issue 10 months ago
Question regarding GPU memory allocation
wx971025 opened this issue 10 months ago
Error compiling kernels
declark1 opened this issue 10 months ago
lm-evaluation-harness broken on master
pcmoritz opened this issue 10 months ago
v0.3.3 api server can't startup with neuron sdk
qingyuan18 opened this issue 10 months ago
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU)
AdrianAbeyta opened this pull request 10 months ago
[FIX] Fix prefix test error on main
zhuohan123 opened this pull request 10 months ago
Mixtral 4x 4090 OOM
SinanAkkoyun opened this issue 10 months ago
Order of keys for guided JSON
ccdv-ai opened this issue 10 months ago
Regression in llama model inference due to #3005
Qubitium opened this issue 10 months ago
unload the model
osafaimal opened this issue 10 months ago
install from source failed using the latest code
sleepwalker2017 opened this issue 10 months ago
[FIX] Make `flash_attn` optional
WoosukKwon opened this pull request 10 months ago
[Minor fix] Include flash_attn in docker image
tdoublep opened this pull request 10 months ago
Error when prompt_logprobs + enable_prefix_caching
bgyoon opened this issue 10 months ago
Can vLLM handle concurrent request with FastAPI?
Strongorange opened this issue 10 months ago
OpenAI Tools / function calling v2
FlorianJoncour opened this pull request 10 months ago
Prefix Caching with FP8 KV cache support
chenxu2048 opened this pull request 10 months ago
When running pytest tests/, undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
Imss27 opened this issue 10 months ago