github.com/vllm-project/vllm issues | Ecosyste.ms: OpenCollective

[ Kernel ] AWQ Fused MoE

robertgshaw2-neuralmagic opened this pull request 7 months ago

[Bug]: When using qwen-32b-chat-awq with multi-threaded access, errors occur after several hundred requests: "vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already."

ZHJ19970917 opened this issue 7 months ago

[ci][build] fix commit id

youkaichao opened this pull request 7 months ago

[Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests

g-eoj opened this pull request 7 months ago

[doc][distributed] add suggestion for distributed inference

youkaichao opened this pull request 7 months ago

[ Misc ] Apply MoE Refactor to Qwen2 + Deepseekv2 To Support Fp8

robertgshaw2-neuralmagic opened this pull request 7 months ago

[Feature]: Apply chat template through `LLM` class

robertgshaw2-neuralmagic opened this issue 7 months ago

[Bug]: Timeout Error When Deploying Llamafied InternLM2-5-7B-Chat-1M Model via vLLM OpenAI API Server

mf-skjung opened this issue 7 months ago

[Bugfix][CI/Build] Fix testing for generated commit hash

mgoin opened this pull request 7 months ago

[Doc] Add documentations for nightly benchmarks

KuntaiDu opened this pull request 7 months ago

Updating LM Format Enforcer version to v10.3

noamgat opened this pull request 7 months ago

[ci][distributed] add pipeline parallel correctness test

youkaichao opened this pull request 7 months ago

[Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF

tdoublep opened this pull request 7 months ago

when i set tensor_parallel_size>1(A100 * 4), it does not work

cx-hub opened this issue 7 months ago

[core][distributed] simplify code to support pipeline parallel

youkaichao opened this pull request 7 months ago

Remove unnecessary trailing period in spec_decode.rst

terrytangyuan opened this pull request 7 months ago

Report usage for beam search

simon-mo opened this pull request 7 months ago

[Model] Pipeline parallel support for Mixtral

binxuan opened this pull request 7 months ago

[Misc] Add deprecation warning for beam search

WoosukKwon opened this pull request 7 months ago

[Misc]: _run_workers_async function of DistributedGPUExecutorAsync

HMJW opened this issue 7 months ago

[Misc] Disambiguate quantized types via a new ScalarType

LucasWilkinson opened this pull request 7 months ago

[Bug]: Gemma-2 + FlashInfer: ValueError: Unsupported max_frags_z:

HanGuo97 opened this issue 7 months ago

[CI/Build] Cross python wheel

robertgshaw2-neuralmagic opened this pull request 7 months ago

[Doc] xpu backend requires running setvars.sh

rscohn2 opened this pull request 7 months ago

[Bug]: Problem loading Gemma 2 27b-it

rdaiello opened this issue 7 months ago

[Bug]: Runtime AssertionError: 32768 is not divisible by 3, multiproc_worker_utils.py:120, when using 3 GPUs for tensor-parallel

haltingstate opened this issue 7 months ago

[Kernel] Turn off CUTLASS scaled_mm for Ada Lovelace

tlrmchlsmth opened this pull request 7 months ago

[RFC]: A Graph Optimization System in vLLM using torch.compile

bnellnm opened this issue 7 months ago

torch.compile based model optimizer

bnellnm opened this pull request 7 months ago

[Bug]: vLLM 0.5.1 tensor parallel 2 hang

Flynn-Zh opened this issue 7 months ago

[BUGFIX] Raise an error for no draft token case when draft_tp>1

wooyeonlee0 opened this pull request 7 months ago

[Feature]: Request for Ascend NPU support

xuedinge233 opened this issue 7 months ago

[Bug]: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.

momomobinx opened this issue 7 months ago

[ Misc ] More Cleanup of Marlin

robertgshaw2-neuralmagic opened this pull request 7 months ago

[ Misc ] Support Act Order in Compressed Tensors

robertgshaw2-neuralmagic opened this pull request 7 months ago

[BugFix] Fix the lm_head in gpt_bigcode in lora mode

maxdebayser opened this pull request 7 months ago

[ Misc ] Support Models With Bias in `compressed-tensors` integration

robertgshaw2-neuralmagic opened this pull request 7 months ago

[Installation]: Running CohereForAI/c4ai-command-r-v01 with main pytorch

laithsakka opened this issue 7 months ago

[Bugfix] Fix Ray Metrics API usage

Yard1 opened this pull request 7 months ago

[ Misc ] Remove separate bias add

robertgshaw2-neuralmagic opened this pull request 7 months ago

[ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in vllm to get device count

hongxiayang opened this pull request 7 months ago

[Misc] Remove flashinfer warning, add flashinfer tests to CI

LiuXiaoxuanPKU opened this pull request 7 months ago

[CI/Build] (2/2) Switching AMD CI to store images in Docker Hub

adityagoel14 opened this pull request 7 months ago

[Bugfix] Fix usage stats logging exception warning with OpenVINO

helena-intel opened this pull request 7 months ago

[Feature]: FlashAttention 3 support

orellavie1212 opened this issue 7 months ago

[doc] update pipeline parallel in readme

youkaichao opened this pull request 7 months ago

[distributed][misc] keep consistent with how pytorch finds libcudart.so

youkaichao opened this pull request 7 months ago

[BugFix] BatchResponseData body should be optional

zifeitong opened this pull request 7 months ago

[Kernel] Fix identical branches

stevegrubb opened this pull request 7 months ago

[Model][Phi3-Small] Remove scipy from blocksparse_attention

mgoin opened this pull request 7 months ago

[Bug]: OpenAI batch file format pydantic validation error

ArsalShakil opened this issue 7 months ago

[Misc] add fixture to guided processor tests

kevinbu233 opened this pull request 7 months ago

[Bug]: Exception in thread Thread-3 (_report_usage_worker) when running python3 vllm/benchmarks/benchmark_throughput.py with vLLM OpenVINO

HPUedCSLearner opened this issue 7 months ago

[bug fix] Fix llava next feature size calculation.

xwjiang2010 opened this pull request 7 months ago

[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step

alexm-neuralmagic opened this pull request 7 months ago

[Bug]: Metrics time_to_first_token_seconds, time_per_output_token_seconds not working correctly

thies1006 opened this issue 7 months ago

[Performance]: How to use NVIDIA Nsight Compute on Linux

chenglu66 opened this issue 7 months ago

fix cuda118 can't find libcudart.so error

zhaotyer opened this pull request 7 months ago

[Bug]: Unable to run phi-3-small in latest release

ssmi153 opened this issue 7 months ago

[Bug]: Error on inference with LoRa request (safetensors format)

tsvisab opened this issue 7 months ago

[Bug]: `tests/basic_correctness/test_chunked_prefill.py` is failing on main in fp32

tdoublep opened this issue 7 months ago

[Bug]: Gemma 2 GPTQ - Complete output via API but incomplete through batch inference

ArsalShakil opened this issue 7 months ago

wip

thri5ha opened this pull request 7 months ago

[Bug]: The Gloo library cannot communicate between two computers

JKYtydt opened this issue 7 months ago

[Bug]: VLLM's output is unstable version==0.5.1

ffxmm opened this issue 7 months ago

[Model] RowParallelLinear: pass bias to quant_method.apply

tdoublep opened this pull request 7 months ago

[Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules.

tdoublep opened this pull request 7 months ago

[Usage]: Maximum Context Length Exceeded Due to Base64-Encoded Image in Prompt

tusharraskar opened this issue 7 months ago

[Feature]: Hybrid Attention

leo6022 opened this issue 7 months ago

[Bug]: VLLM 0.5.1 with LLaVA 1.6 exceptions

andrePankraz opened this issue 7 months ago

[Model]: Support for InternVL2

Weiyun1025 opened this issue 7 months ago

[Misc] refactor(config): clean up unused code

aniaan opened this pull request 7 months ago

[Bug]: In k8s pod, it takes approximately 1 hour to start the model using vllm

WangxuP opened this issue 7 months ago

[Core] offload model weights to CPU conditionally

chenqianfzh opened this pull request 7 months ago

[Core] Support Lora lineage and base model metadata management

Jeffwan opened this pull request 7 months ago

[Bug]: Server fails to boot due to a tensor size mismatch when LoRA is enabled for GPTBigCode

tjohnson31415 opened this issue 7 months ago

[Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor

WoosukKwon opened this pull request 7 months ago

[BUG FIX]fix compile error when building with torch2.1

maidabu opened this pull request 7 months ago

[Bug]: Gloo Connection reset by peer

thies1006 opened this issue 7 months ago

[Feature]: Is there any plan to support Cross-Layer Attention (CLA) ?

JiayiFeng opened this issue 7 months ago

[Misc]: Random Output Generation with mistralai/Mixtral-8x22B-v0.1

rajagond opened this issue 7 months ago

[Usage]: In phi3 vision maximum context length issue

tusharraskar opened this issue 7 months ago

[Feature]: Multi-Proposers support for speculative decoding.

ShangmingCai opened this issue 7 months ago

[Bug]: Vllm 0.5.1+cu118 timeout when init CustomAllreduce

zhaotyer opened this issue 7 months ago

Speculative decoding leads to zombie requests

naturomics opened this issue 7 months ago

[Model] Add support for 'gte-Qwen2' embedding models

Nickydusk opened this pull request 7 months ago

[ci] try to add multi-node tests

youkaichao opened this pull request 7 months ago

[CI/Build][TPU] Add TPU CI test

WoosukKwon opened this pull request 7 months ago

[Bug]: deepseek-coder-v2-lite-instruct; Exception in worker VllmWorkerProcess while processing method initialize_cache: [Errno 2] No such file or directory: '/root/.triton/cache/de758c429c9ff1f18930bbd9c3004506/fused_moe_kernel.json.tmp.pid_1528_587007', Traceback (most recent call last):

fengyang95 opened this issue 7 months ago

[RFC]: Enhancing LoRA Management for Production Environments in vLLM

Jeffwan opened this issue 7 months ago

[core] Sampling controller interface

mmoskal opened this pull request 7 months ago

[Doc]: Latency vs Throughput Configurations

antferdom opened this issue 7 months ago

[Bug]: TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker

areanddee opened this issue 7 months ago

[BugFix] get_and_reset only when scheduler outputs are not empty

mzusman opened this pull request 7 months ago

[Bug]: Qwen2 Moe FP8 not supported on L40

TopIdiot opened this issue 7 months ago

[Core][Model] Add simple_model_runner and a new model XLMRobertaForSequenceClassification through multimodal interface

AllenDou opened this pull request 7 months ago

No executable after building vllm from source with CPU support

parkesorgua opened this issue 7 months ago

[Bug]: tensor parallel (of 4 cards) gives bad answers in version 0.5.1 and later (compared to 0.4.1) with gptq marlin kernels (compared to gptq)

orellavie1212 opened this issue 7 months ago

[BugFix]: fix engine timeout due to request abort

pushan01 opened this pull request 7 months ago