github.com/vllm-project/vllm issues | Ecosyste.ms: OpenCollective

Added Support for guided decoding in offline interface

kevinbu233 opened this pull request 8 months ago

[AMD][Hardware][Misc][Bugfix] xformer cleanup and light navi logic and CI fixes and refactoring

hongxiayang opened this pull request 8 months ago

[Feature]: Support HuggingFaceM4/idefics2-8b as vision model

pseudotensor opened this issue 8 months ago

[Misc] [CI]: AMD test flaky on main CI

cadedaniel opened this issue 8 months ago

[Model] Update MPT model with GLU and rope and add low precision layer norm

marov opened this pull request 8 months ago

[Model] Jamba support

mzusman opened this pull request 8 months ago

[CI/BUILD] enable intel queue for longer CPU tests

zhouyuan opened this pull request 8 months ago

[Bug]: VLLM's output is unstable when handling requests CONCURRENTLY.

zhengwei-gao opened this issue 8 months ago

[Bug]: deepseek-coder-33b-instruct and deepseek-coder-6.7b-instruct broken, but deepseek-llm-7b-chat and deepseek-llm-67b-chat work well

lgw2023 opened this issue 8 months ago

[Frontend][Core] Update Outlines Integration from `FSM` to `Guide`

br3no opened this pull request 8 months ago

[Bug]: NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered

pseudotensor opened this issue 8 months ago

[Bug]: --engine-use-ray is broken. #4100

jdinalt opened this pull request 8 months ago

[Bugfix] Fix naive attention typos and make it run on navi3x

maleksan85 opened this pull request 8 months ago

[Bug]: guided_json bad output for llama2-13b

pseudotensor opened this issue 8 months ago

[Model] Adding support for MiniCPM-V

HwwwwwwwH opened this pull request 8 months ago

[FacebookAI/roberta-large]: vllm support for FacebookAI/roberta-large

pradeepdev-1995 opened this issue 8 months ago

[Bug]: vllm_C is missing.

Calvinnncy97 opened this issue 8 months ago

[Model] Add support for 360zhinao

garycaokai opened this pull request 8 months ago

[Bug]: RuntimeError: Unknown layout

zzlgreat opened this issue 8 months ago

[Bug]: sending request using response_format json twice breaks vLLM

samos123 opened this issue 8 months ago

[Feature]: Allow LoRA adapters to be specified as in-memory dict of tensors

jacobthebanana opened this issue 8 months ago

[Usage]: Unable to load mistralai/Mixtral-8x7B-Instruct-v0.1

rohitnanda1443 opened this issue 8 months ago

Does vllm support both CUDA 11.3 version and PyTorch 1.12?

iclgg opened this issue 8 months ago

[Usage]: Problem when loading my trained model.

hummingbird2030 opened this issue 8 months ago

[Feature][Chunked prefill]: Make sliding window work

rkooo567 opened this issue 8 months ago

[Feature]: bitsandbytes support

orellavie1212 opened this issue 8 months ago

[Frontend] Refactor prompt processing

DarkLight1337 opened this pull request 8 months ago

[Bug]: start api server stuck

QianguoS opened this issue 8 months ago

[Model] [Kernel] Add 16, 32 kernel sizes in compilation

nbardy opened this pull request 8 months ago

[Installation]: Any plans on providing vLLM pre-compiled for ROCm?

satyamk7054 opened this issue 8 months ago

[Core] Support LoRA on quantized models

jeejeelee opened this pull request 9 months ago

[Installation]: VLLM is impossible to install.

GPaolo opened this issue 9 months ago

[Kernel] Fused MoE Config for Mixtral 8x22

ywang96 opened this pull request 9 months ago

[Bug]: Qwen1.5-14B-Chat deployed with vllm==0.3.3 on a Tesla V100-PCIE-32GB GPU outputs only exclamation marks, no results

li995495592 opened this issue 9 months ago

[Usage]: flash_attn vs xformers

VeryVery opened this issue 9 months ago

[Performance]: FP8 KV Cache performance loss on FP16 models in ROCm

TNT3530 opened this issue 9 months ago

[Bug]: Command R+ GPTQ bad output on ROCm

TNT3530 opened this issue 9 months ago

[Core][Frontend][Doc] Initial support for LLaVA-NeXT and GPT-4V Chat Completions API

DarkLight1337 opened this pull request 9 months ago

[Feature]: Tree attention about Speculative Decoding

yukavio opened this issue 9 months ago

[CI/Build] Reduce race condition in docker build

youkaichao opened this pull request 9 months ago

[Misc]: Does prefix caching work together with multi lora?

sleepwalker2017 opened this issue 9 months ago

[Bug]: StableLM 12b head size incorrect

bjoernpl opened this issue 9 months ago

[Model] LoRA gptbigcode implementation

raywanb opened this pull request 9 months ago

[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics

DearPlanet opened this pull request 9 months ago

[Bug]: leading space within content via OpenAI Compatible Server

bufferoverflow opened this issue 9 months ago

[Usage]: How to offload some layers to CPU?

cheney369 opened this issue 9 months ago

Is there a stable-version Docker image?

huyang19881115 opened this issue 9 months ago

[Model] Initialize Fuyu-8B support

Isotr0py opened this pull request 9 months ago

[Bug]: Cannot use FlashAttention because the package is not found. Please install it for better performance.

pseudotensor opened this issue 9 months ago

[Bug]: Getting subprocess.CalledProcessError: Command '['/usr/bin/gcc',] error message.

yk287 opened this issue 9 months ago

[Usage]: I have two Gpus, how do I make my model run on 2 gpus

hxujal opened this issue 9 months ago

[Kernel] PyTorch Labs Fused MoE Kernel Integration

robertgshaw2-neuralmagic opened this pull request 9 months ago

[Feature]: Support Ray-free multi-node distributed inference on resource managers like Kubernetes

Jeffwan opened this issue 9 months ago

[Bug]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'

guangweiShaw opened this issue 9 months ago

[Bug]: Failed to generate the prompts with the google/gemma-2b model using Python code

936187425 opened this issue 9 months ago

[Usage]: How to determine whether the vllm engine is full with requests or not

man2machine opened this issue 9 months ago

[Bug]: killed due to high memory usage

xiewf1990 opened this issue 9 months ago

[Bug]: Cannot load lora adapters in WSL 2

invokeinnovation opened this issue 9 months ago

[Bug]: vllm 0.4.0.post1 crashed when loading dbrx-instruct on AMD MI250x

vgod-dbx opened this issue 9 months ago

[Doc/Feature]: Llava 1.5 in OpenAI compatible server

stikkireddy opened this issue 9 months ago

[Roadmap] vLLM Roadmap Q2 2024

simon-mo opened this issue 9 months ago

[Misc]: Can we remove `vllm/entrypoints/api_server.py`?

hmellor opened this issue 9 months ago

[Frontend] openAI entrypoint dynamic adapter load

DavidPeleg6 opened this pull request 9 months ago

[Bug]: Error happen in async_llm_engine when use multiple GPUs

for-just-we opened this issue 9 months ago

[Misc]: Implement CPU/GPU swapping in BlockManagerV2

Kaiyang-Chen opened this pull request 9 months ago

[Core] :loud_sound: Improve request logging truncation

joerunde opened this pull request 9 months ago

[Model] Cohere CommandR+

saurabhdash2512 opened this pull request 9 months ago

[Hardware][Intel GPU]Add Initial Intel GPU(XPU) inference backend

jikunshang opened this pull request 9 months ago

[Installation]: Tesla V100 cuda11.4, I have no permission to install a upper-version cuda driver, how can I install vllm? I have tried to build from source and use pip, both failed.

LaVieEnRose365 opened this issue 9 months ago

[Bug]: YI:34B cannot stop during use.

cat2353050774 opened this issue 9 months ago

[Feature]: Make `outlines` dependency optional

saattrupdan opened this issue 9 months ago

[Bug]: RuntimeError: No suitable kernel. h_in=16 h_out=3424 dtype=Float out_dtype=BFloat16

Edisonwei54 opened this issue 9 months ago

[Feature]: Add OpenTelemetry distributed tracing

ronensc opened this issue 9 months ago

[Feature]: cuda12.2 support

s-natsubori opened this issue 9 months ago

[Bug]: 【P100】RuntimeError: CUDA error: no kernel image is available for execution on the device [repeated 6x across cluster]

matrixssy opened this issue 9 months ago

[Bug]: vllm-0.4.0.post1+neuron213; ModuleNotFoundError: No module named 'vllm._C'

MojHnd opened this issue 9 months ago

Best server cmd for mistralai/Mistral-7B-v0.1

sshleifer opened this issue 9 months ago

[RFC] How do we test and support third-party models

youkaichao opened this issue 9 months ago

[Bug]: Qwen-14B-Chat-Int4 with guided_json error

xunfeng1980 opened this issue 9 months ago

[Bug]: n_inner divisible to number of GPUs

aliozts opened this issue 9 months ago

[Bug]: Starting vllm in Docker with host_IP configured still gives [W socket.cpp:663] [c10d] The client socket has failed to connect to [::ffff:172.16.8.232]:39623 (errno: 110 - Connection timed out)

huyang19881115 opened this issue 9 months ago

[Core] Eliminate parallel worker per-step task scheduling overhead

njhill opened this pull request 9 months ago

[WIP][Core] fully composable launcher/task/coordinator/communicator design and implementation

youkaichao opened this pull request 9 months ago

[Usage]: Expected output when prompt_logprobs=1

thefirebanks opened this issue 9 months ago

[Bug]: trying to run vllm inference behind a FastAPI server, but it gets stuck

sigridjineth opened this issue 9 months ago

[Bug]: CUDA error: invalid argument

qingjiaozyn opened this issue 9 months ago

[Model][Misc] Add e5-mistral-7b-instruct and Embedding API

CatherineSue opened this pull request 9 months ago

[CI/Build] A perplexity-computing test for the FP8 KV cache system. Originally used in the context of PR #3290

Alexei-V-Ivanov-AMD opened this pull request 9 months ago

[Model Loading] Speedup model loading with distributed loading

chestnut-Q opened this pull request 9 months ago

[Misc]: Cohere models are not working due to transformers library outdated?

Playerrrrr opened this issue 9 months ago

[RFC] Initial support for Intel GPU

jikunshang opened this issue 9 months ago

[Bug]: RuntimeError: CUDA error: invalid device ordinal with multi node multi gpus

kn1011 opened this issue 9 months ago

[Usage]: vllm can host offline? with internet connection?

juud79 opened this issue 9 months ago

[Feature]: An instruction/chat method for offline LLM class.

simon-mo opened this issue 9 months ago

[Usage]: Model Qwen2ForCausalLM does not support LoRA, but LoRA is enabled. Support for this model may be added in the future. If this is important to you, please open an issue on github

jcxcer opened this issue 9 months ago

[Bug]: VLLM OOMing unpredictably on prediction

hillarysanders opened this issue 9 months ago

[Bug]: Custom all reduce does not work.

esmeetu opened this issue 9 months ago

[Usage]: Error Segmentation fault(core dumped) while testing asynchronous high concurrency

alex1996-ljl opened this issue 9 months ago

Using the VLLM engine framework for inference, why is the first character generated always a space?

cy565025164 opened this issue 9 months ago

Enable mypy type checking

simon-mo opened this issue 9 months ago