Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLMs).
- Collective: https://opencollective.com/vllm
- Host: opensource
- Code: https://github.com/vllm-project/vllm
[Tracking issue] [Help wanted]: Deprecate BlockManagerV1
github.com/vllm-project/vllm - cadedaniel opened this issue 6 months ago
[Performance]: Profile & optimize the BlockManagerV2
github.com/vllm-project/vllm - cadedaniel opened this issue 6 months ago
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support
github.com/vllm-project/vllm - comaniac opened this pull request 6 months ago
Disable cuda version check in vllm-openai image
github.com/vllm-project/vllm - zhaoyang-star opened this pull request 6 months ago
[Doc] change installation version to 0.4.2
github.com/vllm-project/vllm - dimaioksha opened this pull request 6 months ago
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations)
github.com/vllm-project/vllm - mgoin opened this pull request 6 months ago
[Doc]Replace deprecated flag in readme
github.com/vllm-project/vllm - ronensc opened this pull request 6 months ago
[Kernel] Initial Activation Quantization Support
github.com/vllm-project/vllm - dsikka opened this pull request 6 months ago
[Doc]: Documentation disappeared
github.com/vllm-project/vllm - markovalexander opened this issue 6 months ago
--kv_cache_dtype fp8 should not check for nvcc
github.com/vllm-project/vllm - wizche opened this issue 6 months ago
[Bug]: Failing to find LoRA adapter for MultiLoRA Inference
github.com/vllm-project/vllm - RonanKMcGovern opened this issue 6 months ago
[Core][Model runner refactoring 1/N] Refactor attn metadata term
github.com/vllm-project/vllm - rkooo567 opened this pull request 6 months ago
[Bug]: For RDNA3 (navi31; gfx1100) VLLM_USE_TRITON_FLASH_ATTN=0 currently must be forced
github.com/vllm-project/vllm - lhl opened this issue 6 months ago
[Core] Add retention policy code for processing requests
github.com/vllm-project/vllm - James4Ever0 opened this pull request 6 months ago
[Bug]: `truncate_prompt_tokens` in SamplingParams only available for openai entrypoints, not for offline vLLM engine
github.com/vllm-project/vllm - YuWang916 opened this issue 6 months ago
[Doc] Add a note about MAX_THREADS and NVCC_THREADS
github.com/vllm-project/vllm - Tabrizian opened this pull request 6 months ago
[Doc] update(example model): for OpenAI compatible serving
github.com/vllm-project/vllm - fpaupier opened this pull request 6 months ago
[New Model]: how to debug the values of tensor while adding a new model
github.com/vllm-project/vllm - cillinzhang opened this issue 6 months ago
[Feature]: Load new LoRA adapters on request
github.com/vllm-project/vllm - markovalexander opened this issue 6 months ago
[Installation]: Feature 'f16 arithemetic and compare instructions' requires .target sm_53 or higher
github.com/vllm-project/vllm - fxavier-maf opened this issue 6 months ago
[Bug]: Token accuracy not as expected ad also got Junk values for starcoderbase-15b model with continuous batching
github.com/vllm-project/vllm - sreka opened this issue 6 months ago
[Usage]: It seems that vllm doesn't perform well under high concurrency
github.com/vllm-project/vllm - syGOAT opened this issue 6 months ago
[Usage]: How to modify the default system prompt?
github.com/vllm-project/vllm - lee0v0 opened this issue 6 months ago
[Bug]: The first token generated after prefill occupies the GPU memory but virtual block
github.com/vllm-project/vllm - March-H opened this issue 6 months ago
[Misc][Typo] type annotation fix
github.com/vllm-project/vllm - HarryWu99 opened this pull request 6 months ago
Unable to find Punica extension issue during source code installation
github.com/vllm-project/vllm - kingljl opened this pull request 6 months ago
fix_tokenizer_snapshot_download_bug
github.com/vllm-project/vllm - kingljl opened this pull request 6 months ago
[Usage]: Can decoding phase benefits from prefix-caching?
github.com/vllm-project/vllm - Juelianqvq opened this issue 6 months ago
[Misc] allow user to specify where to write gpu_p2p_access_cache through VLLM_CACHE_DIR env var
github.com/vllm-project/vllm - sfc-gh-zhwang opened this pull request 6 months ago
[Usage]: Do I need to specify chat-template for Qwen model?
github.com/vllm-project/vllm - xudong2019 opened this issue 6 months ago
[Bugfix][Kernel] allow non-power-of-2 head sizes for prefix prefill with alibi
github.com/vllm-project/vllm - DefTruth opened this pull request 6 months ago
[Core][Dsitributed] fix del exception
github.com/vllm-project/vllm - youkaichao opened this pull request 6 months ago
[Misc] remove debug log for chunk detected
github.com/vllm-project/vllm - DefTruth opened this pull request 6 months ago
[Bug]: with_pynccl_for_all_reduce causes GPU OOM
github.com/vllm-project/vllm - sfc-gh-zhwang opened this issue 6 months ago
[Core][Distributed] add cleanup code for model parallel
github.com/vllm-project/vllm - youkaichao opened this pull request 6 months ago
[Feature]: Sliding window attention in a fraction of layers
github.com/vllm-project/vllm - zhuzilin opened this issue 6 months ago
[WIP] Add IPEX Paged Att.
github.com/vllm-project/vllm - bigPYJ1151 opened this pull request 6 months ago
[Bugfix][Minor] Make ignore_eos effective
github.com/vllm-project/vllm - bigPYJ1151 opened this pull request 6 months ago
[Frontend] [Core] Tensorizer: support dynamic `num_readers`, update version
github.com/vllm-project/vllm - alpayariyak opened this pull request 6 months ago
[Core]Refactor gptq_marlin ops
github.com/vllm-project/vllm - jikunshang opened this pull request 6 months ago
is there a way we can add Hugging face PEFT model for VLLM to load?
github.com/vllm-project/vllm - rsong0606 opened this issue 6 months ago
[Metrics] add more metrics
github.com/vllm-project/vllm - HarryWu99 opened this pull request 6 months ago
[Bugfix][Kernel] Fix compute_type for MoE kernel
github.com/vllm-project/vllm - WoosukKwon opened this pull request 6 months ago
[Usage]: How do you setup vllm to work in k8s/openshift cluster
github.com/vllm-project/vllm - jayteaftw opened this issue 6 months ago
[Distributed] refactor pynccl to support multilpe TP groups
github.com/vllm-project/vllm - youkaichao opened this pull request 6 months ago
[Bug]: JSON Schema for multiple function choices
github.com/vllm-project/vllm - jamestwhedbee opened this issue 6 months ago
[Kernel] Update fused_moe tuning script for FP8
github.com/vllm-project/vllm - pcmoritz opened this pull request 6 months ago
[Doc] add visualization for multi-stage dockerfile
github.com/vllm-project/vllm - prashantgupta24 opened this pull request 6 months ago
[Usage] [Bug]: run inference on mistralai/Mixtral-8x7B-Instruct-v0.1 with tensor parallel > 1 (currently not working)
github.com/vllm-project/vllm - jayteaftw opened this issue 6 months ago
[Misc] Upgrade to `torch==2.3.0`
github.com/vllm-project/vllm - mgoin opened this pull request 6 months ago
[Misc] fix typo in block manager
github.com/vllm-project/vllm - Juelianqvq opened this pull request 6 months ago
[Usage]: How to start vllm with llava using docker compose
github.com/vllm-project/vllm - athenawisdoms opened this issue 6 months ago
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption
github.com/vllm-project/vllm - rkooo567 opened this pull request 6 months ago
[mypy][6/N] Fix all the core subdirectory typing
github.com/vllm-project/vllm - rkooo567 opened this pull request 6 months ago
[Usage]: why vllm takes as much ram as possible ?
github.com/vllm-project/vllm - xudong2019 opened this issue 6 months ago
[Bug]: 1-card deployment and 2-card deployment yield inconsistent output logits.
github.com/vllm-project/vllm - thisissum opened this issue 6 months ago
[CORE] Allow loading of quantized lm_head (ParallelLMHead)
github.com/vllm-project/vllm - Qubitium opened this pull request 6 months ago
[Performance]: Empirical Measurement of how to broadcast python object in vLLM
github.com/vllm-project/vllm - youkaichao opened this issue 6 months ago
[Bug]: OpenAI API request doesn't go through with 'guided_json'
github.com/vllm-project/vllm - Tejaswgupta opened this issue 6 months ago
[Bug]: Prefix caching does not work on Pascal GPUs
github.com/vllm-project/vllm - sasha0552 opened this issue 6 months ago
[Misc]: need "first good issue"
github.com/vllm-project/vllm - HarryWu99 opened this issue 6 months ago
[Feature]: option to return hidden states
github.com/vllm-project/vllm - zhenlan0426 opened this issue 6 months ago
[Usage]: How to disable multi lora to avoid using punica ? Or is the punica being the only choice?
github.com/vllm-project/vllm - laoda513 opened this issue 6 months ago
[Bug]: Initialising LLM on multiple GPUs stuck at "Started a local Ray instance"
github.com/vllm-project/vllm - timbmg opened this issue 6 months ago
[Bug]: Engine iteration timed out. This should never happen!
github.com/vllm-project/vllm - itechbear opened this issue 6 months ago
[Usage]: Not enough memory when run a 33b model float16 on 2 x L40 GPU (48G)
github.com/vllm-project/vllm - garyyang85 opened this issue 6 months ago
[CI/Build] Move `test_utils.py` to `tests/utils.py`
github.com/vllm-project/vllm - DarkLight1337 opened this pull request 6 months ago
[Usage]: If I use Offline way to launch the model, how can I get the metrics?
github.com/vllm-project/vllm - amumu96 opened this issue 6 months ago
[Core] Centralize GPU Worker construction
github.com/vllm-project/vllm - njhill opened this pull request 6 months ago
[Bug]: cannot load model back due to [does not appear to have a file named config.json]
github.com/vllm-project/vllm - yananchen1989 opened this issue 6 months ago
[WIP][Hardware][Intel] support intel builds with intel c++
github.com/vllm-project/vllm - kannon92 opened this pull request 6 months ago
[Core] Pipeline Parallel Support
github.com/vllm-project/vllm - andoorve opened this pull request 6 months ago
[Doc]: Offline Inference Distributed Broken for TP
github.com/vllm-project/vllm - sam-h-bean opened this issue 6 months ago
[Hardware][Nvidia] Enable support for Pascal GPUs
github.com/vllm-project/vllm - jasonacox opened this pull request 6 months ago
[RFC]: environment variable management in vllm
github.com/vllm-project/vllm - youkaichao opened this issue 6 months ago
[kernel] fix sliding window in prefix prefill Triton kernel
github.com/vllm-project/vllm - mmoskal opened this pull request 6 months ago
[Bug]: Can not run openapi server with cpu backend
github.com/vllm-project/vllm - kannon92 opened this issue 6 months ago
[Frontend] add tok/s speed metric to llm class when using tqdm
github.com/vllm-project/vllm - MahmoudAshraf97 opened this pull request 6 months ago
[Bug]: TypeError in XFormersMetadata
github.com/vllm-project/vllm - skonto opened this issue 6 months ago
[Model]: Support for InternVL-Chat-V1-5
github.com/vllm-project/vllm - Iven2132 opened this issue 6 months ago
[Bug]: Running llama2-7b on H20, Floating point exception (core dumped) appears on float16
github.com/vllm-project/vllm - yk1012664593 opened this issue 6 months ago
[Usage]: I doubt about the meaning of --enable-prefix-caching
github.com/vllm-project/vllm - chenchunhui97 opened this issue 6 months ago
[Bug]: vllm 0.4.1 and transformers 4.40.1 have conflicting dependencies on pydantic
github.com/vllm-project/vllm - AbbottKilig opened this issue 6 months ago
[Bug]: Chunked prefill doesn't seem to work when --kv-cache-dtype fp8
github.com/vllm-project/vllm - rkooo567 opened this issue 6 months ago
[Model] Phi-3 4k sliding window temp. fix
github.com/vllm-project/vllm - caiom opened this pull request 6 months ago
[Speculative decoding] Support target-model logprobs
github.com/vllm-project/vllm - cadedaniel opened this pull request 6 months ago
[Bug]: Phi3 still not supported
github.com/vllm-project/vllm - andrew-vold opened this issue 6 months ago
✨ support local cache for models
github.com/vllm-project/vllm - prashantgupta24 opened this pull request 6 months ago
[Installation]: GitHub access required during install for vllm >=0.4.1 (for cu12-libnccl.so.2.18.1)
github.com/vllm-project/vllm - mattmalcher opened this issue 6 months ago
[Feature]: GPTQ/AWQ quantization is not fully optimized yet. The speed can be slower than non-quantized models.
github.com/vllm-project/vllm - ShubhamVerma16 opened this issue 6 months ago
[Feature]: AssertionError: Speculative decoding not yet supported for RayGPU backend.
github.com/vllm-project/vllm - cocoza4 opened this issue 6 months ago
[Core] Add `multiproc_worker_utils` for multiprocessing-based workers
github.com/vllm-project/vllm - njhill opened this pull request 6 months ago
[Frontend] Add APIs for dynamic LoRA models load/unload
github.com/vllm-project/vllm - graceleeis opened this pull request 6 months ago