Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
github.com/vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
https://github.com/vllm-project/vllm
[Misc]: Implement CPU/GPU swapping in BlockManagerV2
cadedaniel opened this issue 9 months ago
[Hardware][AMD][Kernel]Adding custom kernel for vector query on Rocm
charlifu opened this pull request 9 months ago
[Bug]: ChatCompletion prompt_logprobs does not work
noamgat opened this issue 9 months ago
[RFC] Initial Support for CPUs
bigPYJ1151 opened this issue 9 months ago
[Usage]: Generate specified number of tokens for each request individually
oximi123 opened this issue 9 months ago
[Kernel] Use flash-attn for decoding
skrider opened this pull request 9 months ago
[Misc] add the "download-dir" option to the latency/throughput benchmarks
AmadeusChan opened this pull request 9 months ago
[RFC] Initial Support for Cloud TPUs
WoosukKwon opened this issue 9 months ago
[Bug]: The fine-tuned qwen1.5 model uses transformers generate() to have a normal dialogue, but the dialogue output using vllm openai API has multiple line breaks.
qianghuangwhu opened this issue 9 months ago
parent_child_dict[sample.parent_seq_id].append(sample) KeyError: 4
Stosan opened this issue 9 months ago
ModuleNotFoundError: No module named 'transformers_modules' with API serving using phi-2b
haining78zhang opened this issue 9 months ago
[BugFix] Fix Falcon tied embeddings
WoosukKwon opened this pull request 9 months ago
[RFC]: Interface and Abstraction for Distributed Inference Environment
youkaichao opened this issue 9 months ago
[Misc]: Throughput/Latency for guided_json with ~100% GPU cache utilization
jens-create opened this issue 9 months ago
[Feature]: Offload Model Weights to CPU
chenqianfzh opened this issue 9 months ago
[New Model]: Phi-2 support for LoRA
andykhanna opened this issue 9 months ago
[Feature]: Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
tchaton opened this issue 9 months ago
[CI] Create nightly images/wheels
simon-mo opened this issue 9 months ago
[Usage]: model not support lora but listed in supported models
xiaobo-Chen opened this issue 9 months ago
[Usage]: punica LoRA kernels could not be imported. If you built vLLM from source, make sure VLLM_INSTALL_PUNICA_KERNELS=1 env var was set.
nlp-learner opened this issue 9 months ago
[Feature]: Support Guided Decoding in `LLM` entrypoint
simon-mo opened this issue 9 months ago
[Feature]: FastServe - Fast Distributed Inference Serving for Large Language Models
chizhang118 opened this issue 9 months ago
[Bug]: when installing vllm by pip, some errors happened.
finylink opened this issue 9 months ago
[Usage]: How to inference model with multi-gpus
ckj18 opened this issue 9 months ago
[Kernel] Full Tensor Parallelism for LoRA Layers
FurtherAI opened this pull request 9 months ago
[Bug]: aisingapore/sea-lion-7b-instruct fails with assert config.embedding_fraction == 1.0
pseudotensor opened this issue 9 months ago
[Feature]: Support distributing serving with KubeRay's autoscaler
TrafalgarZZZ opened this issue 9 months ago
[Bug]: vllm slows down after a long run
momomobinx opened this issue 9 months ago
[New Model]: Please support CogVLM
kietna1809 opened this issue 9 months ago
[Misc] Add attention sinks
felixzhu555 opened this pull request 9 months ago
[Bug]: Use of LoRARequest
meiru-cam opened this issue 9 months ago
[BugFix][Frontend] Use correct, shared tokenizer in OpenAI server
njhill opened this pull request 9 months ago
[Core] Add generic typing to `LRUCache`
njhill opened this pull request 9 months ago
[Usage]: Set dtype for VLLM using YAML
telekoteko opened this issue 9 months ago
Dynamic Multi LoRA Load \ Delete Support
gauravkr2108 opened this pull request 9 months ago
[Feature]: Compute and log the serving FLOPs
zhuohan123 opened this issue 9 months ago
[Usage]: Why increase max-num-seqs will use less memory
TaChao opened this issue 9 months ago
[Bug]: DynamicNTKScalingRotaryEmbedding implementation is different from Transformers
killawhale2 opened this issue 9 months ago
[Frontend] [Core] feat: Add model loading using `tensorizer`
sangstar opened this pull request 9 months ago
[Frontend] Support complex message content for chat completions endpoint
fgreinacher opened this pull request 9 months ago
[Core] Multiprocessing executor for single-node multi-GPU deployment
njhill opened this pull request 9 months ago
[Bug]: ValueError: Model architectures ['Qwen2ForCausalLM'] are not supported for now
finylink opened this issue 9 months ago
baichuan/qwen/chatlgm with lora adaption [feature]
kexuedaishu opened this issue 9 months ago
[Bugfix] Fix beam search logits processor
maximzubkov opened this pull request 9 months ago
[Feature]: Control vectors
generalsvr opened this issue 9 months ago
[Core] Support thread-based async tokenizer pools
njhill opened this pull request 9 months ago
[Bug]: Bug in Guided Generation Logits Processor with `n>1`
maximzubkov opened this issue 9 months ago
[Frontend] support new lora module to a live server in OpenAI Entrypoints
AlphaINF opened this pull request 9 months ago
[Test] Add a randomized test for OpenAI API
dylanwhawk opened this issue 9 months ago
[Bug]: Incompatible version between torch and triton
mzz12 opened this issue 9 months ago
Does vllm support pytorch/xla ?
dinghaodhd opened this issue 9 months ago
[Misc] add HOST_IP env var
youkaichao opened this pull request 9 months ago
[Bug]: RuntimeError: invalid argument to reset_peak_memory_stats when offline sampling using neuron
Sadden opened this issue 9 months ago
Incremental output for LLM entrypoint
yhu422 opened this pull request 9 months ago
Unable to load LoRA fine-tuned LLM from HF (AssertionError)
oscar-martin opened this issue 9 months ago
[Prefill with Prefix Cache] Improve the efficiency of prefilling with prefix cache by allowing a larger batch size
MeloYang05 opened this pull request 9 months ago
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
tchaton opened this issue 9 months ago
[Feature] Implement FastV's Token Pruning
chenllliang opened this issue 9 months ago
(core dumped) when running `vllm` with `AWQ` on `MIG` partition of a H100 GPU
remiconnesson opened this issue 9 months ago
ImportError: libcudart.so.11.0: cannot open shared object file: No such file or directory
mcleish7 opened this issue 9 months ago
Sampling is very slow, causing a CPU bottleneck
m-harmonic opened this issue 9 months ago
Can you choose which GPU to use, like tf inference device_map="cuda:0"?
wellcasa opened this issue 9 months ago
[TEST] Add a distributed test for async LLM engine.
zhuohan123 opened this issue 9 months ago
When starting the second vllm.entrypoints.api_server using tensor parallel in a single node, the second vllm api_server Stuck in " Started a local Ray instance." OR "Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory"
durant1999 opened this issue 9 months ago
Is 'all-reduce kernels are temporarily disabled' the cause of the higher memory requirement?
SafeyahShemali opened this issue 9 months ago
DeepSeek VL support
SinanAkkoyun opened this issue 9 months ago
inference with AWQ quantization
Kev1ntan opened this issue 9 months ago
Fixes #1556 double free
br3no opened this pull request 10 months ago
Bug when input top_k as a float that is outside of range
Drzhivago264 opened this issue 10 months ago
[Feature Request] Add GPTQ quantization kernels for 4-bit NormalFloat (NF4) use cases.
duchengyao opened this issue 10 months ago
TCPStore is not available
Z-Diviner opened this issue 10 months ago
What's difference between the seed in LLMEngine and seed in SamplingParams?
tomdzh opened this issue 10 months ago
Is it possible to use vllm-0.3.3 with CUDA 11.8
HSLUCKY opened this issue 10 months ago
Implement structured engine for parsing json grammar by token with `response_format: {type: json_object}`
pathorn opened this pull request 10 months ago
add aya-101 model
ahkarami opened this issue 10 months ago
What's up with Pipeline Parallelism?
duanzhaol opened this issue 10 months ago
how to run gemma-7b model with vllm 0.3.3 under CUDA 11.8?
adogwangwang opened this issue 10 months ago
When chat-ui and vllm are used together, the dialogue output of Llama-2-70b-chat-hf(safetensor file) is abnormal.
majestichou opened this issue 10 months ago
AsyncEngineDeadError when LoRA loading fails
lifuhuang opened this issue 10 months ago
Multi-LoRA - Support for providing /load and /unload API
gauravkr2108 opened this issue 10 months ago
[feature on nm-vllm] Sparse Inference with weight only int8 quant
shiqingzhangCSU opened this issue 10 months ago
Question regarding GPU memory allocation
wx971025 opened this issue 10 months ago
Error compiling kernels
declark1 opened this issue 10 months ago
lm-evaluation-harness broken on master
pcmoritz opened this issue 10 months ago
v0.3.3 api server can't startup with neuron sdk
qingyuan18 opened this issue 10 months ago
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU)
AdrianAbeyta opened this pull request 10 months ago
[FIX] Fix prefix test error on main
zhuohan123 opened this pull request 10 months ago
Mixtral 4x 4090 OOM
SinanAkkoyun opened this issue 10 months ago
Order of keys for guided JSON
ccdv-ai opened this issue 10 months ago
Regression in llama model inference due to #3005
Qubitium opened this issue 10 months ago
unload the model
osafaimal opened this issue 10 months ago
install from source failed using the latest code
sleepwalker2017 opened this issue 10 months ago
[FIX] Make `flash_attn` optional
WoosukKwon opened this pull request 10 months ago
[Minor fix] Include flash_attn in docker image
tdoublep opened this pull request 10 months ago
Error when prompt_logprobs + enable_prefix_caching
bgyoon opened this issue 10 months ago
Can vLLM handle concurrent request with FastAPI?
Strongorange opened this issue 10 months ago
OpenAI Tools / function calling v2
FlorianJoncour opened this pull request 10 months ago
Prefix Caching with FP8 KV cache support
chenxu2048 opened this pull request 10 months ago
When running pytest tests/, undefined symbol: _ZNSt15__exception_ptr13exception_ptr9_M_addrefEv
Imss27 opened this issue 10 months ago