Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
github.com/vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Memory leak
SatoshiReport opened this issue about 1 year ago
StreamingLLM support?
nivibilla opened this issue over 1 year ago
workaround of AWQ for Turing GPUs
twaka opened this pull request over 1 year ago
Jetson agx orin
MrBrabus75 opened this issue over 1 year ago
Data parallel inference
kevinhu opened this issue over 1 year ago
Support Python 3.12
EwoutH opened this issue over 1 year ago
3 GPUs not supported?
ye7love7 opened this issue over 1 year ago
Generate nothing from VLLM output
FocusLiwen opened this issue over 1 year ago
[Discussion] Will vLLM consider using Speculative Sampling to accelerate LLM decoding?
gesanqiu opened this issue over 1 year ago
vLLM to add a locally trained model
atanikan opened this issue over 1 year ago
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly.
MUZAMMILPERVAIZ opened this issue over 1 year ago
AWQ: bfloat16 not supported? And `--dtype` arg doesn't allow specifying float16
TheBloke opened this issue over 1 year ago
vLLM Discord Server
zhuohan123 opened this issue over 1 year ago
Waiting sequence group should have only one prompt sequence.
Link-Li opened this issue over 1 year ago
Inconsistent results between HuggingFace Transformers and vllm
normster opened this issue over 1 year ago
How to deploy the API server over HTTPS
yilihtien opened this issue over 1 year ago
vllm hangs when reinitializing ray
nelson-liu opened this issue over 1 year ago
How to use vllm to compute ppl score for input text?
yinochaos opened this issue over 1 year ago
GGUF support
viktor-ferenczi opened this issue over 1 year ago
AsyncEngineDeadError / RuntimeError: CUDA error: an illegal memory access was encountered
xingyaoww opened this issue over 1 year ago
It seems that SamplingParams doesn't support the bad_words_ids parameter when generating
mengban opened this issue over 1 year ago
Can the model Qwen/Qwen-VL-Chat work well?
wangschang opened this issue over 1 year ago
vllm reducing quality when loading local fine tuned Llama-2-13b-hf model
BerndHuber opened this issue over 1 year ago
I want to disable the KV cache. If I set gpu_memory_utilization to 0, does that disable the KV cache?
amulil opened this issue over 1 year ago
Is authentication supported?
mluogh opened this issue over 1 year ago
Loading Model through Multi-Node Ray Cluster Fails
VarunSreenivasan16 opened this issue over 1 year ago
SIGABRT - Fatal Python error: Aborted when running vllm on llama2-7b with --tensor-parallel-size 2
dhritiman opened this issue over 1 year ago
Sagemaker support for inference
Tarun3679 opened this issue over 1 year ago
Starting vllm.entrypoints.api_server with model vicuna-13b-v1.3 fails: Fatal Python error: Bus error
luefei opened this issue over 1 year ago
Support for RLHF (ILQL)-trained Models
ojus1 opened this issue over 1 year ago
vLLM full name
designInno opened this issue over 1 year ago
Stream Tokens operation integration into LLM class (which uses LLMEngine behind the scenes)
orellavie1212 opened this issue over 1 year ago
pip installation error - ERROR: Failed building wheel for vllm
dxlong2000 opened this issue over 1 year ago
Stuck in Initializing an LLM engine
EvilCalf opened this issue over 1 year ago
Feature request: Support for embedding models
mantrakp04 opened this issue over 1 year ago
Testing qwen-7b-chat model produces incorrect output
dachengai opened this issue over 1 year ago
What fast tokenizer can be used for Baichuan-13b?
FURYFOR opened this issue over 1 year ago
How to deploy vllm with quantization
xxm1668 opened this issue over 1 year ago
Issue with raylet error
ZihanWang314 opened this issue over 1 year ago
Memory leak while using tensor_parallel_size>1
haiasd opened this issue over 1 year ago
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
jinfengfeng opened this issue over 1 year ago
Installing with ROCM
baderex opened this issue over 1 year ago
Best effort support for all Hugging Face transformers models
dwyatte opened this issue over 1 year ago
ValueError: The number of GPUs per node is not divisible by the number of tensor parallelism.
beratcmn opened this issue over 1 year ago
Cannot get a simple example working with multi-GPU
brevity2021 opened this issue over 1 year ago
How to use multiple GPUs?
xxm1668 opened this issue over 1 year ago
ModuleNotFoundError: No module named 'transformers_modules' with API serving using baichuan-7b
McCarrtney opened this issue over 1 year ago
vLLM stops all processing when CPU KV cache is used, has to be shut down and restarted.
TheBloke opened this issue over 1 year ago
LlaMA 2: Input prompt (2664 tokens) is too long and exceeds limit of 2048/2560
foamliu opened this issue over 1 year ago
Flash Attention V2
nivibilla opened this issue over 1 year ago
Faster model loading
imoneoi opened this issue over 1 year ago
+34% higher throughput?
naed90 opened this issue over 1 year ago
[Feature Request] Support input embedding in `LLM.generate()`
KimmiShi opened this issue over 1 year ago
Decode error while inferencing a batch of prompts
SiriusNEO opened this issue over 1 year ago
Support Multiple Models
aldrinc opened this issue over 1 year ago
Feature request: support ExLlama
alanxmay opened this issue over 1 year ago
8bit support
mymusise opened this issue over 1 year ago
Require a "Wrapper" feature
jeffchy opened this issue over 1 year ago
CTranslate2
Matthieu-Tinycoaching opened this issue over 1 year ago
Remove Ray from the dependencies
lanking520 opened this issue over 1 year ago
CUDA error: out of memory
SunixLiu opened this issue over 1 year ago
Adding support for encoder-decoder models, like T5 or BART
shermansiu opened this issue over 1 year ago
Can I directly obtain the logits here?
SparkJiao opened this issue over 1 year ago
Whisper support
gottlike opened this issue over 1 year ago
Build failure due to CUDA version mismatch
WoosukKwon opened this issue over 1 year ago
Support custom models
WoosukKwon opened this issue over 1 year ago
Add docstrings to some modules and classes
WoosukKwon opened this pull request over 1 year ago
Minor code cleaning for SamplingParams
WoosukKwon opened this pull request over 1 year ago
Add performance comparison figures on A100, V100, T4
WoosukKwon opened this issue over 1 year ago
Add CD to PyPI
WoosukKwon opened this issue over 1 year ago
Enhance SamplingParams
WoosukKwon opened this pull request over 1 year ago
Implement presence and frequency penalties
WoosukKwon opened this pull request over 1 year ago
Support top-k sampling
WoosukKwon opened this pull request over 1 year ago
Avoid sorting waiting queue & Minor code cleaning
WoosukKwon opened this pull request over 1 year ago
Support string-based stopping conditions
WoosukKwon opened this issue over 1 year ago
Rename variables and methods
WoosukKwon opened this pull request over 1 year ago
Log system stats
WoosukKwon opened this pull request over 1 year ago
Update example prompts in `simple_server.py`
WoosukKwon opened this pull request over 1 year ago
Support various sampling parameters
WoosukKwon opened this issue over 1 year ago
Make sure the system can run on T4 and V100
WoosukKwon opened this issue over 1 year ago
Clean up the scheduler code
WoosukKwon opened this issue over 1 year ago
Add a system logger
WoosukKwon opened this pull request over 1 year ago
Use slow tokenizer for LLaMA
WoosukKwon opened this pull request over 1 year ago
Enhance model loader
WoosukKwon opened this pull request over 1 year ago
Refactor system architecture
WoosukKwon opened this pull request over 1 year ago
Use runtime profiling to replace manual memory analyzers
zhuohan123 opened this pull request over 1 year ago
Bug in LLaMA fast tokenizer
WoosukKwon opened this issue over 1 year ago
[Minor] Fix a dtype bug
WoosukKwon opened this pull request over 1 year ago
Specify python package dependencies in requirements.txt
WoosukKwon opened this pull request over 1 year ago
Clean up Megatron-LM code
WoosukKwon opened this issue over 1 year ago
Add license
WoosukKwon opened this issue over 1 year ago
Implement client API
WoosukKwon opened this issue over 1 year ago
Add docstring
zhuohan123 opened this issue over 1 year ago
Use mypy
WoosukKwon opened this issue over 1 year ago
Support FP32
WoosukKwon opened this issue over 1 year ago
Dangerous floating point comparison
merrymercy opened this issue over 1 year ago
Replace FlashAttention with xformers
WoosukKwon opened this pull request over 1 year ago
Decrease the default size of swap space
WoosukKwon opened this issue over 1 year ago
Fix a bug in attention kernel
WoosukKwon opened this pull request over 1 year ago
Use O3 optimization instead of O2 for CUDA compilation?
WoosukKwon opened this issue over 1 year ago