Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

github.com/vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://github.com/vllm-project/vllm

Support W8A8 inference in vllm

AniZpZ opened this pull request about 1 year ago
Support int8 KVCache Quant in vLLM

AniZpZ opened this pull request about 1 year ago
Added logits processor API to sampling params

noamgat opened this pull request about 1 year ago
ImportError: cannot import name 'MistralConfig' from 'transformers'

peter-ch opened this issue about 1 year ago
Does vLLM support Mac/Metal/MPS?

Phil-U-U opened this issue over 1 year ago
Using VLLM with a Tesla T4 on SageMaker Studio (ml.g4dn.xlarge instance)

paulovasconcellos-hotmart opened this issue over 1 year ago
[question] Does vllm support macos M1 or M2 chip?

acekingke opened this issue over 1 year ago
Support multiple replicas behind a load balancer.

linkedlist771 opened this issue over 1 year ago
How to deploy vllm model across multiple nodes in kubernetes?

Ryojikn opened this issue over 1 year ago
[Error] 400 Bad Request

Tostino opened this issue over 1 year ago
Low VRAM batch processing mode

viktor-ferenczi opened this issue over 1 year ago
feat: demonstrate using regex for suffix matching

wsxiaoys opened this pull request over 1 year ago
Memory leak

SatoshiReport opened this issue over 1 year ago
StreamingLLM support?

nivibilla opened this issue over 1 year ago
workaround of AWQ for Turing GPUs

twaka opened this pull request over 1 year ago
Jetson AGX Orin

MrBrabus75 opened this issue over 1 year ago
Data parallel inference

kevinhu opened this issue over 1 year ago
Support Python 3.12

EwoutH opened this issue over 1 year ago
3 GPUs not supported?

ye7love7 opened this issue over 1 year ago
Generate nothing from VLLM output

FocusLiwen opened this issue over 1 year ago
vLLM to add a locally trained model

atanikan opened this issue over 1 year ago
vLLM Discord Server

zhuohan123 opened this issue over 1 year ago
Waiting sequence group should have only one prompt sequence.

Link-Li opened this issue over 1 year ago
Inconsistent results between HuggingFace Transformers and vllm

normster opened this issue over 1 year ago
How to deploy api server as https

yilihtien opened this issue over 1 year ago
vllm hangs when reinitializing ray

nelson-liu opened this issue over 1 year ago
How to use vllm to compute ppl score for input text?

yinochaos opened this issue over 1 year ago
GGUF support

viktor-ferenczi opened this issue over 1 year ago
Can the model Qwen/Qwen-VL-Chat work well?

wangschang opened this issue over 1 year ago
Is there authentication supported?

mluogh opened this issue over 1 year ago
Loading Model through Multi-Node Ray Cluster Fails

VarunSreenivasan16 opened this issue over 1 year ago
SageMaker support for inference

Tarun3679 opened this issue over 1 year ago
Support for RLHF (ILQL)-trained Models

ojus1 opened this issue over 1 year ago
vLLM full name

designInno opened this issue over 1 year ago
pip installation error - ERROR: Failed building wheel for vllm

dxlong2000 opened this issue over 1 year ago
Stuck in Initializing an LLM engine

EvilCalf opened this issue over 1 year ago
Feature request: Support for embedding models

mantrakp04 opened this issue over 1 year ago
Tested qwen-7b-chat model and the output is incorrect

dachengai opened this issue over 1 year ago
What fast tokenizer can be used for Baichuan-13b?

FURYFOR opened this issue over 1 year ago
How to deploy vLLM with quantization?

xxm1668 opened this issue over 1 year ago
Issue with raylet error

ZihanWang314 opened this issue over 1 year ago
Memory leak while using tensor_parallel_size>1

haiasd opened this issue over 1 year ago
Installing with ROCM

baderex opened this issue over 1 year ago
Best effort support for all Hugging Face transformers models

dwyatte opened this issue over 1 year ago
Cannot get a simple example working with multi-GPU

brevity2021 opened this issue over 1 year ago
How to use multiple GPUs?

xxm1668 opened this issue over 1 year ago
Flash Attention V2

nivibilla opened this issue over 1 year ago
Faster model loading

imoneoi opened this issue over 1 year ago
+34% higher throughput?

naed90 opened this issue over 1 year ago
[Feature Request] Support input embedding in `LLM.generate()`

KimmiShi opened this issue over 1 year ago
Decode error while inferencing a batch of prompts

SiriusNEO opened this issue over 1 year ago
Support Multiple Models

aldrinc opened this issue over 1 year ago
Feature request: support ExLlama

alanxmay opened this issue over 1 year ago
8bit support

mymusise opened this issue over 1 year ago
Require a "Wrapper" feature

jeffchy opened this issue over 1 year ago
CTranslate2

Matthieu-Tinycoaching opened this issue over 1 year ago
Remove Ray as a dependency

lanking520 opened this issue over 1 year ago
CUDA error: out of memory

SunixLiu opened this issue over 1 year ago
Adding support for encoder-decoder models, like T5 or BART

shermansiu opened this issue over 1 year ago
Can I directly obtain the logits here?

SparkJiao opened this issue over 1 year ago
Whisper support

gottlike opened this issue over 1 year ago
Build failure due to CUDA version mismatch

WoosukKwon opened this issue over 1 year ago
Support custom models

WoosukKwon opened this issue over 1 year ago
Add docstrings to some modules and classes

WoosukKwon opened this pull request over 1 year ago
Minor code cleaning for SamplingParams

WoosukKwon opened this pull request over 1 year ago
Add performance comparison figures on A100, V100, T4

WoosukKwon opened this issue over 1 year ago
Add CD to PyPI

WoosukKwon opened this issue over 1 year ago
Enhance SamplingParams

WoosukKwon opened this pull request over 1 year ago
Implement presence and frequency penalties

WoosukKwon opened this pull request over 1 year ago
Support top-k sampling

WoosukKwon opened this pull request over 1 year ago
Avoid sorting waiting queue & Minor code cleaning

WoosukKwon opened this pull request over 1 year ago
Support string-based stopping conditions

WoosukKwon opened this issue over 1 year ago
Rename variables and methods

WoosukKwon opened this pull request over 1 year ago
Log system stats

WoosukKwon opened this pull request over 1 year ago
Update example prompts in `simple_server.py`

WoosukKwon opened this pull request over 1 year ago
Support various sampling parameters

WoosukKwon opened this issue over 1 year ago
Make sure the system can run on T4 and V100

WoosukKwon opened this issue over 1 year ago
Clean up the scheduler code

WoosukKwon opened this issue over 1 year ago
Add a system logger

WoosukKwon opened this pull request over 1 year ago
Use slow tokenizer for LLaMA

WoosukKwon opened this pull request over 1 year ago
Enhance model loader

WoosukKwon opened this pull request over 1 year ago