github.com/vllm-project/vllm issues | Ecosyste.ms: OpenCollective

Is there a way to add a Hugging Face PEFT model for vLLM to load?

rsong0606 opened this issue 9 months ago

[Metrics] add more metrics

HarryWu99 opened this pull request 9 months ago

[Bugfix][Kernel] Fix compute_type for MoE kernel

WoosukKwon opened this pull request 9 months ago

[Usage]: How do you setup vllm to work in k8s/openshift cluster

jayteaftw opened this issue 9 months ago

[Distributed] refactor pynccl to support multiple TP groups

youkaichao opened this pull request 9 months ago

[Bug]: JSON Schema for multiple function choices

jamestwhedbee opened this issue 9 months ago

[Kernel] Update fused_moe tuning script for FP8

pcmoritz opened this pull request 9 months ago

[Doc] add visualization for multi-stage dockerfile

prashantgupta24 opened this pull request 9 months ago

[Usage] [Bug]: run inference on mistralai/Mixtral-8x7B-Instruct-v0.1 with tensor parallel > 1 (currently not working)

jayteaftw opened this issue 9 months ago

[Misc] Upgrade to `torch==2.3.0`

mgoin opened this pull request 9 months ago

[Misc] fix typo in block manager

Juelianqvq opened this pull request 9 months ago

[Usage]: How to start vllm with llava using docker compose

athenawisdoms opened this issue 9 months ago

[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption

rkooo567 opened this pull request 9 months ago

[mypy][6/N] Fix all the core subdirectory typing

rkooo567 opened this pull request 9 months ago

[Usage]: Why does vllm take as much RAM as possible?

xudong2019 opened this issue 9 months ago

[Bug]: 1-card deployment and 2-card deployment yield inconsistent output logits.

thisissum opened this issue 9 months ago

[CORE] Allow loading of quantized lm_head (ParallelLMHead)

Qubitium opened this pull request 9 months ago

[Performance]: Empirical Measurement of how to broadcast python object in vLLM

youkaichao opened this issue 9 months ago

[Bug]: OpenAI API request doesn't go through with 'guided_json'

Tejaswgupta opened this issue 9 months ago

[Bug]: Prefix caching does not work on Pascal GPUs

sasha0552 opened this issue 9 months ago

[Misc]: need "first good issue"

HarryWu99 opened this issue 9 months ago

[Feature]: option to return hidden states

zhenlan0426 opened this issue 9 months ago

[Usage]: How to disable multi-LoRA to avoid using punica? Or is punica the only choice?

laoda513 opened this issue 9 months ago

[Bug]: all_reduce assert result == 0, File "torch/cuda/graphs.py", line 88, in capture_end super().capture_end(), RuntimeError: CUDA error: operation failed due to a previous error during capture

lmx760581375 opened this issue 9 months ago

[Bug]: Initialising LLM on multiple GPUs stuck at "Started a local Ray instance"

timbmg opened this issue 9 months ago

[Bug]: Engine iteration timed out. This should never happen!

itechbear opened this issue 9 months ago

[Usage]: Not enough memory when run a 33b model float16 on 2 x L40 GPU (48G)

garyyang85 opened this issue 9 months ago

[CI/Build] Move `test_utils.py` to `tests/utils.py`

DarkLight1337 opened this pull request 9 months ago

[Usage]: If I use Offline way to launch the model, how can I get the metrics?

amumu96 opened this issue 9 months ago

[Core] Centralize GPU Worker construction

njhill opened this pull request 9 months ago

[Bug]: cannot load model back due to [does not appear to have a file named config.json]

yananchen1989 opened this issue 9 months ago

[WIP][Hardware][Intel] support intel builds with intel c++

kannon92 opened this pull request 9 months ago

Add support for ReFT

RonanKMcGovern opened this issue 9 months ago

[Core] Pipeline Parallel Support

andoorve opened this pull request 9 months ago

[Doc]: Offline Inference Distributed Broken for TP

sam-h-bean opened this issue 9 months ago

[Hardware][Nvidia] Enable support for Pascal GPUs

jasonacox opened this pull request 9 months ago

[RFC]: environment variable management in vllm

youkaichao opened this issue 9 months ago

[kernel] fix sliding window in prefix prefill Triton kernel

mmoskal opened this pull request 9 months ago

[Bug]: Can not run openapi server with cpu backend

kannon92 opened this issue 9 months ago

[Frontend] add tok/s speed metric to llm class when using tqdm

MahmoudAshraf97 opened this pull request 9 months ago

[Bug]: TypeError in XFormersMetadata

skonto opened this issue 9 months ago

[Model]: Support for InternVL-Chat-V1-5

Iven2132 opened this issue 9 months ago

[Bug]: Running llama2-7b on H20, Floating point exception (core dumped) appears on float16

yk1012664593 opened this issue 9 months ago

[Usage]: I doubt about the meaning of --enable-prefix-caching

chenchunhui97 opened this issue 9 months ago

[Bug]: vllm 0.4.1 and transformers 4.40.1 have conflicting dependencies on pydantic

AbbottKilig opened this issue 9 months ago

[Bug]: Chunked prefill doesn't seem to work when --kv-cache-dtype fp8

rkooo567 opened this issue 9 months ago

[Model] Phi-3 4k sliding window temp. fix

caiom opened this pull request 9 months ago

[Speculative decoding] Support target-model logprobs

cadedaniel opened this pull request 9 months ago

[Bug]: Phi3 still not supported

andrew-vold opened this issue 9 months ago

✨ support local cache for models

prashantgupta24 opened this pull request 9 months ago

[Usage]: When I installed version 0.4.1 and started `vllm.entrypoints.openai.api_server` with the `--engine-use-ray` parameter, I encountered some issues.

Uhao-P opened this issue 9 months ago

[Installation]: GitHub access required during install for vllm >=0.4.1 (for cu12-libnccl.so.2.18.1)

mattmalcher opened this issue 9 months ago

[Feature]: GPTQ/AWQ quantization is not fully optimized yet. The speed can be slower than non-quantized models.

ShubhamVerma16 opened this issue 9 months ago

[Feature]: AssertionError: Speculative decoding not yet supported for RayGPU backend.

cocoza4 opened this issue 9 months ago

[Core] Add `multiproc_worker_utils` for multiprocessing-based workers

njhill opened this pull request 9 months ago

[Frontend] Add APIs for dynamic LoRA models load/unload

graceleeis opened this pull request 9 months ago

[Kernel] Use flashinfer for decoding

LiuXiaoxuanPKU opened this pull request 9 months ago

[Bug]: mistralai/Mixtral-8x22B-Instruct-v0.1 fails to load 2/3 times on aae08249acca69060d0a8220cab920e00520932c

pseudotensor opened this issue 9 months ago

[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales

pcmoritz opened this pull request 9 months ago

[Bug]: Call to CUDA function failed - unknown error

roclark opened this issue 9 months ago

[Misc]: RuntimeError: Cannot find any model weights [vllm=0.4.0]

vishwa27yvs opened this issue 9 months ago

[Kernel] Support Fp8 Checkpoints (Dynamic + Static)

robertgshaw2-neuralmagic opened this pull request 9 months ago

[New Model]: launch error of Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4

eigen2017 opened this issue 9 months ago

[Misc] Upgrade outlines to v0.0.41

psykhi opened this pull request 9 months ago

Add logger extra

olehviniarchyk opened this pull request 9 months ago

[Core] Consolidate prompt arguments to LLM engines

DarkLight1337 opened this pull request 9 months ago

[Kernel][Core][WIP] Tree attention and parallel decoding

yukavio opened this pull request 9 months ago

[Bug]: phi-3 (microsoft/Phi-3-mini-128k-instruct) fails with assert "factor" in rope_scaling

pseudotensor opened this issue 9 months ago

[Usage]: Flash Attention not working any more

Techinix opened this issue 9 months ago

[CI] check size of the wheels

simon-mo opened this pull request 9 months ago

[Misc]: How is the continuous batching feature of vLLM implemented?

llx-08 opened this issue 9 months ago

[New Model]: Support Phi-3

alexkreidler opened this issue 9 months ago

Allow user to define whitespace pattern for outlines

robcaulk opened this pull request 9 months ago

[Feature]: batched parallel decoding

snyhlxde1 opened this issue 9 months ago

[Usage]: ValueError: Cannot find the config file for awq

grumpyp opened this issue 9 months ago

[New Model]: Llama 3 8B Instruct

K-Mistele opened this issue 9 months ago

[Speculative decoding] CUDA graph support

heeju-kim2 opened this pull request 9 months ago

[Bug]: "Engine iteration timed out. This should never happen!" occurred when vllm 0.4.1 deployed llama3.

blackblue9 opened this issue 9 months ago

[Hardware][Nvidia] Enable support for Pascal GPUs

cduk opened this pull request 9 months ago

[WIP] Infrastructure for encoder/decoder support

afeldman-nm opened this pull request 9 months ago

[Bug]: vllm stall on llama3-70b warmup with 0.4.1

piercefreeman opened this issue 9 months ago

[Bug]: CPU Inference vllm_ops not defined

bsu3338 opened this issue 9 months ago

[MISC] Rework logger to enable pythonic custom logging configuration to be provided

tdg5 opened this pull request 9 months ago

add standalone_api_server

alex-k-cart opened this pull request 9 months ago

[CI/Build] AMD CI pipeline with extended set of tests.

Alexei-V-Ivanov-AMD opened this pull request 9 months ago

[Bug]: offline test, Process hangs without exiting when using cuda graph

DefTruth opened this issue 9 months ago

[Bug]: Repeatedly printing <| im_end |><| im_start |> after the conversation ends

huangshengfu opened this issue 9 months ago

[Speculative decoding] Fix async executing

zxdvd opened this pull request 9 months ago

[Feature]: Cannot use FlashAttention backend for Volta and Turing GPUs. (but FlashAttention v1.0.9 supports Turing GPU.)

tutu329 opened this issue 9 months ago

[Bug]: Ray memory leak

saattrupdan opened this issue 9 months ago

Llama-3-70b: Should I apply some special template to use llama-3?

UbeCc opened this issue 9 months ago

[Speculative decoding] Add ngram prompt lookup decoding

leiwen83 opened this pull request 9 months ago

[Misc]: Is it possible to load a LoRA adapter on a per-request basis without restarting the base model for every new LoRA trained?

Wizmak9 opened this issue 9 months ago

[Misc]: Total number of attention heads (40) must be divisible by tensor parallel size (6)

CNXDZS opened this issue 9 months ago

[Bug]: NameError: name 'vllm_ops' is not defined

yananchen1989 opened this issue 9 months ago

[Model] Add moondream vision language model

vikhyat opened this pull request 9 months ago

[Bug]: NCCL locating mechanism in multi-user environment

ticoneva opened this issue 9 months ago

[Bugfix] Fix marlin kernel crash on H100

alexm-neuralmagic opened this pull request 9 months ago

[Feature]: beam search mode to allow for more options in sampling process

GeauxEric opened this issue 9 months ago

[Speculative decoding] [Performance]: Re-enable bonus tokens

cadedaniel opened this issue 9 months ago