Ecosyste.ms: Open Collective
An open API service for software projects hosted on Open Collective.

vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLMs).
- Collective: https://opencollective.com/vllm (host: opensource)
- Code: https://github.com/vllm-project/vllm
[Kernel] Enhance MoE benchmarking & tuning script
github.com/vllm-project/vllm - WoosukKwon opened this pull request 5 months ago
[Doc]Add documentation to benchmarking script when running TGI
github.com/vllm-project/vllm - KuntaiDu opened this pull request 5 months ago
Virtual Office Hours: Jun 5 and Jun 20
github.com/vllm-project/vllm - robertgshaw2-neuralmagic opened this issue 5 months ago
[Performance]: Automatic Prefix Caching in multi-turn conversations
github.com/vllm-project/vllm - hmellor opened this issue 5 months ago
[Bugfix] Fix dummy weight for fp8
github.com/vllm-project/vllm - mzusman opened this pull request 5 months ago
[Bug]: Phi3 lora module not loading
github.com/vllm-project/vllm - arunpatala opened this issue 5 months ago
[Bugfix]: Fix communication Timeout error in safety-constrained distributed System
github.com/vllm-project/vllm - ZwwWayne opened this pull request 5 months ago
[Installation]: Failed building editable for vllm
github.com/vllm-project/vllm - Fanb1ing opened this issue 5 months ago
[Misc]:pydantic version conflict between vllm openai server and transformers
github.com/vllm-project/vllm - yunll opened this issue 5 months ago
[Bug]: `max_context_len_to_capture` deprecated, confusion with `max_seq_len_to_capture`
github.com/vllm-project/vllm - lianghsun opened this issue 5 months ago
[Bug]: Cannot use FlashAttention-2 backend because the flash_attn package is not found
github.com/vllm-project/vllm - maxin9966 opened this issue 5 months ago
[CI/Build] Make marlin kernel build conditional.
github.com/vllm-project/vllm - esmeetu opened this pull request 5 months ago
[Bug]: llm_engine_example.py (more requests) get stuck
github.com/vllm-project/vllm - CsRic opened this issue 5 months ago
[Usage]: Passing a guided_json in offline inference
github.com/vllm-project/vllm - ccdv-ai opened this issue 5 months ago
Update test_ignore_eos
github.com/vllm-project/vllm - simon-mo opened this pull request 5 months ago
[Core] Fix scheduler considering "no LoRA" as "LoRA"
github.com/vllm-project/vllm - Yard1 opened this pull request 5 months ago
[Core] Eliminate parallel worker per-step task scheduling overhead
github.com/vllm-project/vllm - njhill opened this pull request 5 months ago
[Misc] Load FP8 kv-cache scaling factors from checkpoints
github.com/vllm-project/vllm - comaniac opened this pull request 5 months ago
[Misc]: When is the planned date for the next release?
github.com/vllm-project/vllm - vrdn-23 opened this issue 5 months ago
[Bug]: `CohereForAI/c4ai-command-r-v01`OSError: [Errno 12] Cannot allocate memory
github.com/vllm-project/vllm - epignatelli opened this issue 5 months ago
[Bugfix] Relax tiktoken to >= 0.6.0
github.com/vllm-project/vllm - mgoin opened this pull request 5 months ago
[Core] Sharded State Loader download from HF
github.com/vllm-project/vllm - aurickq opened this pull request 5 months ago
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support)
github.com/vllm-project/vllm - afeldman-nm opened this pull request 5 months ago
[Model] Add Phi-2 LoRA support
github.com/vllm-project/vllm - Isotr0py opened this pull request 5 months ago
[Bugfix] Fix with verifying model max len
github.com/vllm-project/vllm - dimaioksha opened this pull request 5 months ago
[Bug]: Too strict version requirement on `tiktoken`
github.com/vllm-project/vllm - saattrupdan opened this issue 5 months ago
[Bug]: assert parts[0] == "base_model" AssertionError
github.com/vllm-project/vllm - Edisonwei54 opened this issue 5 months ago
[Usage]: why can't I set gpu nums while use "tensor_parallel_size"?
github.com/vllm-project/vllm - GodHforever opened this issue 5 months ago
[Installation]: Do we have the plan to update the pip package installation method for the CPU backend.
github.com/vllm-project/vllm - Zhenzhong1 opened this issue 5 months ago
[Usage]: gpu memory usage when using tensor parallel
github.com/vllm-project/vllm - DaiJianghai opened this issue 5 months ago
[Bug]: single lora request error make all processing requests error
github.com/vllm-project/vllm - jinzhen-lin opened this issue 5 months ago
[Build/CI] Extending AMD Tests
github.com/vllm-project/vllm - Alexei-V-Ivanov-AMD opened this pull request 5 months ago
[Draft][CI/Build] Optimize models tests
github.com/vllm-project/vllm - DarkLight1337 opened this pull request 5 months ago
[RFC]: Add control panel support for vLLM
github.com/vllm-project/vllm - leiwen83 opened this issue 5 months ago
[Bug]: Shape error encountered in speculative decoding when `enable_lora=True`
github.com/vllm-project/vllm - mitchellstern opened this issue 5 months ago
[Doc] Update Ray Data distributed offline inference example
github.com/vllm-project/vllm - Yard1 opened this pull request 5 months ago
[Misc] remove old comments
github.com/vllm-project/vllm - youkaichao opened this pull request 5 months ago
[Usage]: distributed inference with kuberay
github.com/vllm-project/vllm - hetian127 opened this issue 5 months ago
[Misc]: a question about chunked-prefill in flash-attn backends
github.com/vllm-project/vllm - HarryWu99 opened this issue 5 months ago
Add control panel allow manage multi vllm instances
github.com/vllm-project/vllm - leiwen83 opened this pull request 5 months ago
[Doc]: Why is the PA kernel time cost in the decode phase optimized after turning on Prefix Caching?
github.com/vllm-project/vllm - wjj19950828 opened this issue 5 months ago
[Feature]: add local_files_only parameter
github.com/vllm-project/vllm - yananchen1989 opened this issue 5 months ago
[Bug]: No CUDA GPUs are available on 'CPU' use
github.com/vllm-project/vllm - mcr-ksh opened this issue 5 months ago
[Bugfix] Still download from huggingface while set VLLM_USE_MODELSCOPE = true
github.com/vllm-project/vllm - liuzhenghua opened this pull request 5 months ago
[Usage]: How to determine how many concurrent requests can be supported in an acceptable time duration with demo api server?
github.com/vllm-project/vllm - senbinyu opened this issue 5 months ago
[Bug]: Qwen1.5-72B L20x8 latest vLLM TPOT slower than v0.4.0.post, 48ms vs 39ms, why?
github.com/vllm-project/vllm - DefTruth opened this issue 5 months ago
[Bugfix / Core] Prefix Caching Guards (merged with main)
github.com/vllm-project/vllm - zhuohan123 opened this pull request 5 months ago
[Core] Avoid one broadcast op when propagating metadata
github.com/vllm-project/vllm - njhill opened this pull request 5 months ago
[Doc] Highlight the fourth meetup in the README
github.com/vllm-project/vllm - zhuohan123 opened this pull request 5 months ago
Add a new kernel for fusing the dequantization in fused-moe gemm
github.com/vllm-project/vllm - RezaYazdaniAminabadi opened this pull request 5 months ago
[Speculative decoding][Re-take] Enable TP>1 speculative decoding
github.com/vllm-project/vllm - comaniac opened this pull request 5 months ago
[Bug]: Cache operations are not supported for Neuron backend.
github.com/vllm-project/vllm - milo157 opened this issue 5 months ago
[Feature]: Build and publish Neuron docker image
github.com/vllm-project/vllm - yaronr opened this issue 5 months ago
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support)
github.com/vllm-project/vllm - afeldman-nm opened this pull request 5 months ago
[Bug]: Running vllm docker image with neuron fails
github.com/vllm-project/vllm - yaronr opened this issue 5 months ago
[Bugfix] fix rope error when load models with different dtypes
github.com/vllm-project/vllm - jinzhen-lin opened this pull request 5 months ago
[Build/CI] Enabling AMD Entrypoints Test
github.com/vllm-project/vllm - Alexei-V-Ivanov-AMD opened this pull request 5 months ago
[New Model]: Google's Paligemma family of models
github.com/vllm-project/vllm - nfplay opened this issue 5 months ago
[Usage]: how to use run in mixed mode CPU/GPU (device_map="auto")
github.com/vllm-project/vllm - osafaimal opened this issue 5 months ago
[Bug]: llava inference result is wrong !
github.com/vllm-project/vllm - xiaoyudxy opened this issue 5 months ago
[Hardware][Intel] Add LoRA adapter support for CPU backend
github.com/vllm-project/vllm - Isotr0py opened this pull request 5 months ago
Support to serve vLLM on Kubernetes with LWS
github.com/vllm-project/vllm - kerthcet opened this pull request 5 months ago
[Bugfix] Avoid circular import in model loader
github.com/vllm-project/vllm - hiyouga opened this pull request 5 months ago
Can I still use FP8 E5M2 KV Cache if my GPU capability is less than 8.9?
github.com/vllm-project/vllm - blacker521 opened this issue 5 months ago
[Usage]: Passing image to the vllm api endpoint
github.com/vllm-project/vllm - davidramous opened this issue 5 months ago
[Usage]: How to use tensor-parallel-size argument when deploy Llama3-8b with AsyncLLMEngine
github.com/vllm-project/vllm - ANYMS-A opened this issue 5 months ago
[Feature]: rope_scaling for qwen2
github.com/vllm-project/vllm - HappyLynn opened this issue 5 months ago
[Performance]: Will memcpy happen with distributed kv caches while decoding ?
github.com/vllm-project/vllm - GodHforever opened this issue 5 months ago
[Bug]: llava, output is truncated, not fully displayed
github.com/vllm-project/vllm - xiaoyudxy opened this issue 5 months ago
[Bug]: Llama 3 - Out of memory - RTX 4060 TI
github.com/vllm-project/vllm - savi8sant8s opened this issue 5 months ago
Revert "[Kernel] Use flash-attn for decoding (#3648)"
github.com/vllm-project/vllm - rkooo567 opened this pull request 5 months ago
temporarily prioritize xformer for lora test
github.com/vllm-project/vllm - rkooo567 opened this pull request 5 months ago
[Core][Distributed] remove graph mode function
github.com/vllm-project/vllm - youkaichao opened this pull request 5 months ago
Add 4th meetup announcement to readme
github.com/vllm-project/vllm - simon-mo opened this pull request 5 months ago
[Bugfix] Properly set distributed_executor_backend in ParallelConfig
github.com/vllm-project/vllm - zifeitong opened this pull request 5 months ago
Add marlin unit tests and marlin benchmark script
github.com/vllm-project/vllm - alexm-nm opened this pull request 5 months ago
Remove EOS token before passing the tokenized input to model
github.com/vllm-project/vllm - VallabhMahajan1 opened this issue 5 months ago
[Bug]: 'ArgumentHelper' has no attribute 'enable_prefix_caching'
github.com/vllm-project/vllm - xiaohangguo opened this issue 5 months ago
[Usage]: convert llava-v1.5-7b to liuhaotian/llava-v1.5-7b-hf format
github.com/vllm-project/vllm - xiaoyudxy opened this issue 5 months ago
Qwen1.5-14B-Chat-GPTQ-Int4: quantization is not fully optimized yet. The speed can be slower than non-quantized models.
github.com/vllm-project/vllm - lostsollar opened this issue 5 months ago
[Bugfix][Model] Add base class for vision-language models
github.com/vllm-project/vllm - DarkLight1337 opened this pull request 5 months ago
[Speculative decoding] Enable TP>1 speculative decoding
github.com/vllm-project/vllm - cadedaniel opened this pull request 5 months ago
[Usage]: Seems nn.module definition may affect the output tokens. Don't know the reason.
github.com/vllm-project/vllm - Zhenzhong1 opened this issue 5 months ago
[Bugfix][Doc] Fix CI failure in docs
github.com/vllm-project/vllm - DarkLight1337 opened this pull request 5 months ago
[Performance]: how to test tensorrt-llm serving correctly
github.com/vllm-project/vllm - RunningLeon opened this issue 5 months ago
[Performance]: Deepseek-v2 support
github.com/vllm-project/vllm - ZixinxinWang opened this issue 5 months ago
[Doc] Add page for `PoolingParams`
github.com/vllm-project/vllm - DarkLight1337 opened this pull request 5 months ago
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model
github.com/vllm-project/vllm - linxihui opened this pull request 5 months ago
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests
github.com/vllm-project/vllm - Alexei-V-Ivanov-AMD opened this pull request 5 months ago
[Doc] Shorten README by removing supported model list
github.com/vllm-project/vllm - zhuohan123 opened this pull request 5 months ago
[Bug]: `logprobs` is not compatible with the OpenAI spec
github.com/vllm-project/vllm - GabrielBianconi opened this issue 5 months ago
[Frontend] Support OpenAI batch file format
github.com/vllm-project/vllm - wuisawesome opened this pull request 5 months ago
[CI/Build] PEP 517/518 improvements
github.com/vllm-project/vllm - dtrifiro opened this pull request 5 months ago
Add GPTQ Marlin 2:4 sparse structured support
github.com/vllm-project/vllm - alexm-neuralmagic opened this pull request 5 months ago
[Bug]: Async engine hangs with 0.4.* releases
github.com/vllm-project/vllm - glos-nv opened this issue 5 months ago
[Kernel] add bfloat16 support for gptq marlin kernel
github.com/vllm-project/vllm - jinzhen-lin opened this pull request 5 months ago
[Lora] Support long context lora
github.com/vllm-project/vllm - rkooo567 opened this pull request 5 months ago