Ecosyste.ms: Open Collective
An open API service for software projects hosted on Open Collective.

vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLMs).
- Collective: https://opencollective.com/vllm (host: opensource)
- Code: https://github.com/vllm-project/vllm
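For context on the engine behind the issues and pull requests listed below, here is a minimal offline-inference sketch of vLLM's Python API (assumptions: a CUDA-capable GPU, vLLM installed via `pip install vllm`, and `facebook/opt-125m` as a small stand-in model — any supported Hugging Face causal LM would do):

```python
# Minimal vLLM offline-inference sketch; requires a CUDA-capable GPU.
# The model name is illustrative only.
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
params = SamplingParams(temperature=0.0, max_tokens=16)

llm = LLM(model="facebook/opt-125m")  # weights are downloaded on first run
outputs = llm.generate(prompts, params)

for out in outputs:
    # Each RequestOutput carries the prompt and one or more completions.
    print(out.prompt, "->", out.outputs[0].text)
```

The same engine can instead be exposed as an OpenAI-compatible HTTP server (`vllm serve <model>`), which is what several of the frontend issues below concern.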
[Bug]: PaliGemma detection task is failing
github.com/vllm-project/vllm - nph4rd opened this issue 3 months ago
[Bug]: Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. But I have installed vllm-flash-attn.
github.com/vllm-project/vllm - xyfZzz opened this issue 3 months ago
[Misc] Use torch.compile for basic custom ops
github.com/vllm-project/vllm - WoosukKwon opened this pull request 3 months ago
[Core] Optimize SPMD architecture with delta + serialization optimization
github.com/vllm-project/vllm - rkooo567 opened this pull request 3 months ago
[Bugfix] fix spec decode with cuda graph
github.com/vllm-project/vllm - aurickq opened this pull request 3 months ago
[Core] Add span metrics for model_forward, scheduler and sampler time
github.com/vllm-project/vllm - sfc-gh-mkeralapura opened this pull request 3 months ago
[Bug]: InternVL2 Inference RuntimeError: GET was unable to find an engine to execute this computation
github.com/vllm-project/vllm - HuichiZhou opened this issue 3 months ago
Integrate fused Mixtral MoE with Marlin kernels
github.com/vllm-project/vllm - ElizaWszola opened this pull request 3 months ago
[Frontend] Add readiness and liveness endpoints to OpenAI API server
github.com/vllm-project/vllm - mfournioux opened this pull request 3 months ago
[Bug]: OutOfMemoryError when server running multi requests
github.com/vllm-project/vllm - lzcchl opened this issue 3 months ago
[Bug]: RuntimeError: CHECK_EQ(paged_kv_indptr.size(0), batch_size + 1) failed. 1 vs 257. When load gemma-2-9b-it using vllm
github.com/vllm-project/vllm - seongjiko opened this issue 3 months ago
[Core] Asynchronous Output Processor
github.com/vllm-project/vllm - megha95 opened this pull request 3 months ago
[BugFix] Fix multiprocessing shutdown errors
github.com/vllm-project/vllm - njhill opened this pull request 3 months ago
[Usage]: weird GPU RAM usage
github.com/vllm-project/vllm - hieunguyenquoc opened this issue 3 months ago
Add Classifier free guidance
github.com/vllm-project/vllm - zhaoyinglia opened this pull request 3 months ago
[Bug] [ROCm]: ROCm fails to stop generating tokens on multiple GPTQ models
github.com/vllm-project/vllm - TNT3530 opened this issue 3 months ago
[TPU] Add Load-time W8A16 quantization for TPU Backend
github.com/vllm-project/vllm - lsy323 opened this pull request 3 months ago
[Bug]: VLLM crashes when prefix caching is enabled
github.com/vllm-project/vllm - m-harmonic opened this issue 3 months ago
[core] Multi Step Scheduling
github.com/vllm-project/vllm - SolitaryThinker opened this pull request 3 months ago
[CI/Build] bump minimum cmake version
github.com/vllm-project/vllm - dtrifiro opened this pull request 3 months ago
[Doc] Proofreading documentation
github.com/vllm-project/vllm - sgolebiewski-intel opened this pull request 3 months ago
[WIP] Add Fused MoE W8A8 (Int8) Support
github.com/vllm-project/vllm - qingquansong opened this pull request 3 months ago
[Bug]: RuntimeError: CUDA error: an illegal memory access was encountered
github.com/vllm-project/vllm - chenchunhui97 opened this issue 3 months ago
[CI/Build][ROCm] Enabling tensorizer tests for ROCm
github.com/vllm-project/vllm - alexeykondrat opened this pull request 3 months ago
[Bug]: ValueError: Ray does not allocate any GPUs on the driver node. Consider adjusting the Ray placement group or running the driver on a GPU node.
github.com/vllm-project/vllm - youkaichao opened this issue 3 months ago
[Installation]: my env :cuda version is 12.0,python 3.10, which release should i choose?
github.com/vllm-project/vllm - fanjikang opened this issue 3 months ago
[Frontend]: Add apply_chat_template method and update generate method in LLM class
github.com/vllm-project/vllm - llStringll opened this pull request 3 months ago
[Model] Pipeline parallel support for Qwen2
github.com/vllm-project/vllm - xuyi opened this pull request 3 months ago
[MISC] Introduce pipeline parallelism partition strategies
github.com/vllm-project/vllm - comaniac opened this pull request 3 months ago
[Bug]: error: Segmentation fault(SIGSEGV received at time)
github.com/vllm-project/vllm - Archmilio opened this issue 3 months ago
[Kernel][Misc] Add meta functions for ops to prevent graph breaks
github.com/vllm-project/vllm - bnellnm opened this pull request 3 months ago
[Bug]: JSON-guided generation failing to close text values
github.com/vllm-project/vllm - vecorro opened this issue 3 months ago
[Bug]: vLLM takes forever to load a locally stored 7B model
github.com/vllm-project/vllm - vibhas-singh opened this issue 3 months ago
[Bug]: Error Running DeepSeek-v2-Lite w/ FP8
github.com/vllm-project/vllm - Jiayi-Pan opened this issue 3 months ago
[Bug]: Error: Failed to initialize the TMA descriptor 700 for LLaMa 3.1 405B on 8*H100 -- prefill error?
github.com/vllm-project/vllm - pseudotensor opened this issue 3 months ago
[Core] generate from input embeds
github.com/vllm-project/vllm - Nan2018 opened this pull request 3 months ago
[Kernel] [Triton] [AMD] Add Triton implementation of awq_dequantize
github.com/vllm-project/vllm - rasmith opened this pull request 3 months ago
[Speculative Decoding] EAGLE Implementation with Top-1 proposer
github.com/vllm-project/vllm - abhigoyal1997 opened this pull request 3 months ago
[Bug]: Pipeline parallelism is very slow when inferencing one request
github.com/vllm-project/vllm - gty111 opened this issue 3 months ago
[Usage]: How do I deploy a model on two GPUs with different memory?
github.com/vllm-project/vllm - Halflifefa opened this issue 3 months ago
[Bug]: ERROR 07-26 14:50:35 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 214281 died, exit code: -11
github.com/vllm-project/vllm - TypeFloat opened this issue 3 months ago
[Model] Teleflm Support
github.com/vllm-project/vllm - horizon94 opened this pull request 3 months ago
[CI/Build] upgrade Dockerfile to ubuntu 22.04
github.com/vllm-project/vllm - samos123 opened this pull request 3 months ago
[RFC]: Isolate OpenAI Server Into Separate Process
github.com/vllm-project/vllm - robertgshaw2-neuralmagic opened this issue 3 months ago
[CI] Reproduce SGLANG benchmark results
github.com/vllm-project/vllm - KuntaiDu opened this pull request 3 months ago
[Bug]: Engine iteration timed out. This should never happen!
github.com/vllm-project/vllm - Kelcin2 opened this issue 3 months ago
[Usage]: can I use it with classification model (e.g. GemmaForSequenceClassification) ?
github.com/vllm-project/vllm - dodler opened this issue 3 months ago
[Bugfix] Add synchronize to prevent possible data race
github.com/vllm-project/vllm - tlrmchlsmth opened this pull request 3 months ago
[Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCPM-V
github.com/vllm-project/vllm - HwwwwwwwH opened this pull request 3 months ago
[Bugfix] Allow vllm to still work if triton is not installed.
github.com/vllm-project/vllm - tdoublep opened this pull request 3 months ago
[Feature]: ngram-spec-decode
github.com/vllm-project/vllm - chenglu66 opened this issue 3 months ago
[Bugfix][Model] Jamba assertions and no chunked prefill by default for Jamba
github.com/vllm-project/vllm - tomeras91 opened this pull request 3 months ago
[Bug]: SIGSEGV received at time=1721904360 on cpu 140, Fatal Python error: Segmentation fault
github.com/vllm-project/vllm - eldarkurtic opened this issue 3 months ago
[BugFix][Speculative Decoding] Fixes the generation token numbers with sps
github.com/vllm-project/vllm - sighingnow opened this pull request 3 months ago
[Performance]: Slow TTFT(?) for Qwen2-72B-GPTQ-Int4 on H100 *2
github.com/vllm-project/vllm - cyc00518 opened this issue 3 months ago
[Bug]: N-gram spec_decode in flash_attention bug
github.com/vllm-project/vllm - chenglu66 opened this issue 3 months ago
[Core] Use array to speedup padding
github.com/vllm-project/vllm - peng1999 opened this pull request 3 months ago
[Feature]: support Mistral-Large-Instruct-2407 function calling
github.com/vllm-project/vllm - ybdesire opened this issue 3 months ago
[Performance]: Medusa SD have poor performance than baseline
github.com/vllm-project/vllm - cwlseu opened this issue 3 months ago
[Bug]: qwen2-72b-instruct model with RuntimeError: CUDA error: an illegal memory access was encountered
github.com/vllm-project/vllm - izhuhaoran opened this issue 3 months ago
[Bug]: Reproducing Llama 3.1 distributed inference from the blog
github.com/vllm-project/vllm - eldarkurtic opened this issue 3 months ago
[Bug]: --max-model-len configuration robustness
github.com/vllm-project/vllm - gargnipungarg opened this issue 3 months ago
[Usage]: Pipeline Parallelism but with quantized model?
github.com/vllm-project/vllm - fahadh4ilyas opened this issue 3 months ago
[Feature]: chat API assistant prefill
github.com/vllm-project/vllm - pseudotensor opened this issue 3 months ago
[wip] spmd delta optimization
github.com/vllm-project/vllm - rkooo567 opened this pull request 3 months ago
[Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor.
github.com/vllm-project/vllm - eaplatanios opened this pull request 3 months ago
[Installation]: Unable to build docker image using Dockerfile.openvino
github.com/vllm-project/vllm - zahidulhaque opened this issue 3 months ago
[Usage]: How to inference a model with medusa speculative sampling.
github.com/vllm-project/vllm - cwlseu opened this issue 3 months ago
[Bug]: Possible data race when running Llama 405b fp8
github.com/vllm-project/vllm - tlrmchlsmth opened this issue 3 months ago
[Bug]: `pt_main_thread` processes are not killed after main process is killed in MP distributed executor backend
github.com/vllm-project/vllm - oandreeva-nv opened this issue 3 months ago
[Bug]: FP8 Quantization (static and dynamic) incompatible with `--cpu-offload-gb`
github.com/vllm-project/vllm - drikster80 opened this issue 3 months ago
[ Kernel ] Add Fused Layernorm + Dynamic-Per-Token Quant Kernels
github.com/vllm-project/vllm - varun-sundar-rabindranath opened this pull request 3 months ago
[Bugfix] Fix `kv_cache_dtype=fp8` without scales for FP8 checkpoints
github.com/vllm-project/vllm - mgoin opened this pull request 3 months ago
[Bug]: Broken accuracy on LLaMa 3.1 70B -- worse than even 8B
github.com/vllm-project/vllm - pseudotensor opened this issue 3 months ago
[Bugfix] Fix decode tokens w. CUDA graph
github.com/vllm-project/vllm - comaniac opened this pull request 3 months ago
[Bugfix] Fix encoding_format in examples/openai_embedding_client.py
github.com/vllm-project/vllm - CatherineSue opened this pull request 3 months ago
[Bugfix]: use PretrainedConfig to communicate config objects with trust remote code
github.com/vllm-project/vllm - tjohnson31415 opened this pull request 3 months ago
[Usage]: The 8xH100 device failed to run meta-llama/Meta-Llama-3.1-405B-Instruct-FP8.
github.com/vllm-project/vllm - jueming0312 opened this issue 3 months ago
[Bugfix] Fix awq_marlin and gptq_marlin flags
github.com/vllm-project/vllm - alexm-neuralmagic opened this pull request 3 months ago
[Bug]: openai_embedding_client returns len 8192 embedding not 4096
github.com/vllm-project/vllm - ehuaa opened this issue 3 months ago
[Bugfix] Fix speculative decode seeded test
github.com/vllm-project/vllm - njhill opened this pull request 3 months ago
[Installation]: ImportError: cannot import name 'LogicalTokenBlock' from 'vllm.block'
github.com/vllm-project/vllm - peak-coco opened this issue 3 months ago
[Frontend] split run_server into build_server and run_server
github.com/vllm-project/vllm - dtrifiro opened this pull request 3 months ago
[Model][Jamba] Mamba cache single buffer
github.com/vllm-project/vllm - mzusman opened this pull request 3 months ago
[Bug]: The FP8 models and FP8 KV-Cache-Scales loaded together failed on the latest 0.5.3
github.com/vllm-project/vllm - wanzhenchn opened this issue 3 months ago
[Usage]: Using vllm==0.4.2 to infer the qwen2-0.5b model on one 80G H800, but the GPU's compute utilization is only around 20%
github.com/vllm-project/vllm - Ajay-Wong opened this issue 3 months ago
[Bug]: TypeError: snapshot_download() got an unexpected keyword argument 'ignore_patterns' when set VLLM_USE_MODELSCOPE=True
github.com/vllm-project/vllm - wutz opened this issue 3 months ago
[Bug]: batch inference not consistent (even temperature=0)
github.com/vllm-project/vllm - GGuo555 opened this issue 3 months ago
[Bug]: vllm-0.5.3.post1 serving the Qwen2-72b-instruct-awq model works normally at first, but errors occur under high concurrency
github.com/vllm-project/vllm - xinzaifeixiang1992 opened this issue 3 months ago
[Bugfix] Fix speculative decode seeded test
github.com/vllm-project/vllm - tdoublep opened this pull request 3 months ago
[Bug]: VLLM 0.5.3.post1 [rank0]: RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
github.com/vllm-project/vllm - jueming0312 opened this issue 3 months ago
[Feature]: Add support to Llama-3.1
github.com/vllm-project/vllm - KaifAhmad1 opened this issue 3 months ago
[Bugfix]fix modelscope compatible issue
github.com/vllm-project/vllm - liuyhwangyh opened this pull request 3 months ago
Adjust/openai api server turbo 20240724 v2
github.com/vllm-project/vllm - zyearw1024 opened this pull request 3 months ago
[Feature]: vllm support for Ascend NPU
github.com/vllm-project/vllm - hi-yifeng opened this issue 3 months ago
[Bug]: Cannot find any of ['adapter_name_or_path'] in the model's quantization config
github.com/vllm-project/vllm - fengyunflya opened this issue 3 months ago