Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLMs).
Collective: https://opencollective.com/vllm
Host: opensource
Code: https://github.com/vllm-project/vllm
[Misc] refactor(config): clean up unused code
github.com/vllm-project/vllm - aniaan opened this pull request 3 months ago
[Bug]: In k8s pod, it takes approximately 1 hour to start the model using vllm
github.com/vllm-project/vllm - WangxuP opened this issue 3 months ago
[Core] offload model weights to CPU conditionally
github.com/vllm-project/vllm - chenqianfzh opened this pull request 3 months ago
[Core] Support Lora lineage and base model metadata management
github.com/vllm-project/vllm - Jeffwan opened this pull request 3 months ago
[Bug]: Server fails to boot due to a tensor size mismatch when LoRA is enabled for GPTBigCode
github.com/vllm-project/vllm - tjohnson31415 opened this issue 3 months ago
[Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor
github.com/vllm-project/vllm - WoosukKwon opened this pull request 3 months ago
[BUG FIX]fix compile error when building with torch2.1
github.com/vllm-project/vllm - maidabu opened this pull request 3 months ago
[Bug]: Gloo Connection reset by peer
github.com/vllm-project/vllm - thies1006 opened this issue 3 months ago
[Feature]: Is there any plan to support Cross-Layer Attention (CLA) ?
github.com/vllm-project/vllm - JiayiFeng opened this issue 3 months ago
[Misc]: Random Output Generation with mistralai/Mixtral-8x22B-v0.1
github.com/vllm-project/vllm - rajagond opened this issue 3 months ago
[Usage]: In phi3 vision maximum context length issue
github.com/vllm-project/vllm - tusharraskar opened this issue 3 months ago
[Feature]: Multi-Proposers support for speculative decoding.
github.com/vllm-project/vllm - ShangmingCai opened this issue 3 months ago
[Bug]: Vllm 0.5.1+cu118 timeout when init CustomAllreduce
github.com/vllm-project/vllm - zhaotyer opened this issue 3 months ago
[Model] Add support for 'gte-Qwen2' embedding models
github.com/vllm-project/vllm - Nickydusk opened this pull request 3 months ago
[ci] try to add multi-node tests
github.com/vllm-project/vllm - youkaichao opened this pull request 3 months ago
[CI/Build][TPU] Add TPU CI test
github.com/vllm-project/vllm - WoosukKwon opened this pull request 3 months ago
[Bug]: deepseek-coder-v2-lite-instruct; Exception in worker VllmWorkerProcess while processing method initialize_cache: [Errno 2] No such file or directory: '/root/.triton/cache/de758c429c9ff1f18930bbd9c3004506/fused_moe_kernel.json.tmp.pid_1528_587007', Traceback (most recent call last):
github.com/vllm-project/vllm - fengyang95 opened this issue 3 months ago
[RFC]: Enhancing LoRA Management for Production Environments in vLLM
github.com/vllm-project/vllm - Jeffwan opened this issue 3 months ago
[core] Sampling controller interface
github.com/vllm-project/vllm - mmoskal opened this pull request 3 months ago
[Bug]: TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker
github.com/vllm-project/vllm - areanddee opened this issue 3 months ago
[BugFix] get_and_reset only when scheduler outputs are not empty
github.com/vllm-project/vllm - mzusman opened this pull request 3 months ago
[Bug]: Qwen2 Moe FP8 not supported on L40
github.com/vllm-project/vllm - TopIdiot opened this issue 3 months ago
[Core][Model] Add simple_model_runner and a new model XLMRobertaForSequenceClassification through multimodal interface
github.com/vllm-project/vllm - AllenDou opened this pull request 3 months ago
No executable after building vllm from source with CPU support
github.com/vllm-project/vllm - parkesorgua opened this issue 3 months ago
[Bug]: tensor parallel (of 4 cards) gives bad answers in version 0.5.1 and later (compared to 0.4.1) with gptq marlin kernels (compared to gptq)
github.com/vllm-project/vllm - orellavie1212 opened this issue 3 months ago
[BugFix]: fix engine timeout due to request abort
github.com/vllm-project/vllm - pushan01 opened this pull request 3 months ago
[Bug]: Engine timeout error due to request step residual
github.com/vllm-project/vllm - pushan01 opened this issue 3 months ago
[Bug]: segfault when using google/gemma-2-27b-it on vLLM
github.com/vllm-project/vllm - federicotorrielli opened this issue 3 months ago
add benchmark test for fixed input and output length
github.com/vllm-project/vllm - haichuan1221 opened this pull request 3 months ago
[Installation]: Installation with OpenVINO get dependency conflict Error !!!
github.com/vllm-project/vllm - HPUedCSLearner opened this issue 3 months ago
[Usage]: Gemma2-9b not working on A10G 24gb gpu
github.com/vllm-project/vllm - Abhinay2323 opened this issue 3 months ago
[Bug]: Performance : slow inference for FP8 on L20 with 0.5.1(v0.5.0.post1 was fine)
github.com/vllm-project/vllm - garycaokai opened this issue 3 months ago
[Installation]: Gemma2 Installing Flash Infer `[rank0]: TypeError: 'NoneType' object is not callable`
github.com/vllm-project/vllm - robertgshaw2-neuralmagic opened this issue 3 months ago
[Core] Support dynamically loading Lora adapter from HuggingFace
github.com/vllm-project/vllm - Jeffwan opened this pull request 3 months ago
[Feature]: Support loading lora adapters from HuggingFace in runtime
github.com/vllm-project/vllm - Jeffwan opened this issue 3 months ago
[Bug]: relative path doesn't work for Lora adapter model
github.com/vllm-project/vllm - Jeffwan opened this issue 3 months ago
[Doc] Fix the lora adapter path in server startup script
github.com/vllm-project/vllm - Jeffwan opened this pull request 3 months ago
[RFC] Drop beam search support
github.com/vllm-project/vllm - WoosukKwon opened this issue 3 months ago
[Bug]: benchmark_throughput gets TypeError: XFormersMetadata.__init__() got an unexpected keyword argument 'is_prompt' with CPU
github.com/vllm-project/vllm - LGLG42 opened this issue 3 months ago
[ BugFix ] Prompt Logprobs Detokenization
github.com/vllm-project/vllm - robertgshaw2-neuralmagic opened this pull request 3 months ago
[Bug]: Gemma2 supports 8192 context with sliding window, but vllm only does 4196 or fails if try 8192
github.com/vllm-project/vllm - pseudotensor opened this issue 3 months ago
[Bug]: issue with Phi3 mini GPTQ 4Bit/8Bit
github.com/vllm-project/vllm - gm3000 opened this issue 3 months ago
[Hardware][Intel CPU][DOC] Update docs for CPU backend
github.com/vllm-project/vllm - zhouyuan opened this pull request 3 months ago
[Bug]: AsyncEngineDeadError: Task finished unexpectedly with qwen2 72b
github.com/vllm-project/vllm - thomZ1 opened this issue 3 months ago
[Installation]: pip install -e .
github.com/vllm-project/vllm - Kev1ntan opened this issue 3 months ago
[Usage]: Is there a way to make the results of two different calls to VLLM with temperature > 0 consistent?
github.com/vllm-project/vllm - Some-random opened this issue 3 months ago
do not exclude `object` field in CompletionStreamResponse
github.com/vllm-project/vllm - kczimm opened this pull request 3 months ago
[misc][frontend] log all available endpoints
github.com/vllm-project/vllm - youkaichao opened this pull request 3 months ago
[Bug]: No end point available after model is fully loaded
github.com/vllm-project/vllm - hassanzadeh opened this issue 3 months ago
[Bug]: Guided decoding with Phi-3-small crashes
github.com/vllm-project/vllm - crosiumreborn opened this issue 3 months ago
[Bug]: gemma-2-27b error loading with vllm.LLM
github.com/vllm-project/vllm - jl3676 opened this issue 3 months ago
[Usage]: OpenAI-like API in offline inference
github.com/vllm-project/vllm - 1ncludeSteven opened this issue 3 months ago
[Bug]: AsyncEngineDeadError: Task finished unexpectedly with Gemma2 9B
github.com/vllm-project/vllm - nelyajizi opened this issue 3 months ago
[Feature]: Precise model device placement
github.com/vllm-project/vllm - vwxyzjn opened this issue 3 months ago
[Usage]: BNB Gemma2 9b loading problems
github.com/vllm-project/vllm - orellavie1212 opened this issue 3 months ago
[Core][Speculative Decoding] Add multi-query verifier for speculative decoding without batch expansion
github.com/vllm-project/vllm - sighingnow opened this pull request 3 months ago
[Usage]: solve problem like "Found no NVIDIA driver on your system." in WSL2
github.com/vllm-project/vllm - HelloCard opened this issue 3 months ago
[core][distributed] add zmq fallback for broadcasting large objects
github.com/vllm-project/vllm - youkaichao opened this pull request 3 months ago
Add test test (this is a test pr)
github.com/vllm-project/vllm - llmpros opened this pull request 3 months ago
[Bug]: Multiprocessing FileNotFound error in triton cache
github.com/vllm-project/vllm - jl3676 opened this issue 3 months ago
[Usage]: Struggling to get fp8 inference working correctly on 8xL40s
github.com/vllm-project/vllm - williambarberjr opened this issue 3 months ago
[Feature]: Support AVX2 for CPU (drop AVX-512 requirement)
github.com/vllm-project/vllm - kozuch opened this issue 4 months ago
[Bug]: Empty strings as output using gemma-2-27B with 4 A10s
github.com/vllm-project/vllm - lucafirefox opened this issue 4 months ago
[Bug]: LLaVA 1.6 in 0.5.1: Exceptions after some bigger image request, stuck in faulty mode
github.com/vllm-project/vllm - andrePankraz opened this issue 4 months ago
[Feature]: ROPE scaling supported by vLLM gemma2
github.com/vllm-project/vllm - kkk935208447 opened this issue 4 months ago
[Doc]: Code Shared for OpenAI Embedding Client gives base64 encode error
github.com/vllm-project/vllm - palash-fin opened this issue 4 months ago
[Bug]: As V100 does not support FlashAttention, it is not possible to run the gemma model, hopefully it can support the xformers way to run it
github.com/vllm-project/vllm - warlockedward opened this issue 4 months ago
Add FlashInfer to default Dockerfile
github.com/vllm-project/vllm - simon-mo opened this pull request 4 months ago
[Bug]: New bug in 0.5.1 (v0.5.0.post1 was fine)
github.com/vllm-project/vllm - andrePankraz opened this issue 4 months ago
[Core] implement disaggregated prefilling via KV cache transfer
github.com/vllm-project/vllm - KuntaiDu opened this pull request 4 months ago
[Bug]: TypeError: 'NoneType' object is not callable when loading Gemma 2 9B with new 0.5.1 version
github.com/vllm-project/vllm - DanielusG opened this issue 4 months ago
[Doc] Move guide for multimodal model and other improvements
github.com/vllm-project/vllm - DarkLight1337 opened this pull request 4 months ago
[Doc] Reorganize Supported Models by Type
github.com/vllm-project/vllm - ywang96 opened this pull request 4 months ago
[Bug]: When running gemma2 7b, an error is reported [rank0]: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)` Set up according to the prompts: os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASHINFER' print("Environment variable set for VLLM_ATTENTION_BACKEND:", os.getenv('VLLM_ATTENTION_BACKEND'))
github.com/vllm-project/vllm - orderer0001 opened this issue 4 months ago
[Feature]: Return hidden states (in progress?)
github.com/vllm-project/vllm - Elanmarkowitz opened this issue 4 months ago
[Core] Refactor _prepare_model_input_tensors - take 2
github.com/vllm-project/vllm - comaniac opened this pull request 4 months ago
Move release wheel env var to Dockerfile instead
github.com/vllm-project/vllm - simon-mo opened this pull request 4 months ago
Fix release wheel build env var
github.com/vllm-project/vllm - simon-mo opened this pull request 4 months ago
Update wheel builds to strip debug
github.com/vllm-project/vllm - simon-mo opened this pull request 4 months ago
[Bug]: Batch expansion doesn't work with lora
github.com/vllm-project/vllm - Adhyyan1252 opened this issue 4 months ago
[Docs] Fix readthedocs for tag build
github.com/vllm-project/vllm - simon-mo opened this pull request 4 months ago
bump version to v0.5.1
github.com/vllm-project/vllm - simon-mo opened this pull request 4 months ago
[Bug]: When starting deepseek-coder-v2-lite-instruct with vllm on 4 GPUs, one of them is at 0%.
github.com/vllm-project/vllm - fengyang95 opened this issue 4 months ago
[Usage]: How to use Multi-instance in Vllm? (Model replication on multiple GPUs)
github.com/vllm-project/vllm - KimMinSang96 opened this issue 4 months ago
[Feature]: expose the tqdm progress bar to enable logging the progress
github.com/vllm-project/vllm - hugolytics opened this issue 4 months ago
Exception when loading Baichuan2-13B-Chat: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU
github.com/vllm-project/vllm - czhcc opened this issue 4 months ago
[Bug]: When tensor_parallel_size>1, RuntimeError: Cannot re-initialize CUDA in forked subprocess.
github.com/vllm-project/vllm - excelsimon opened this issue 4 months ago
[Feature]: Integrate new backend
github.com/vllm-project/vllm - XDaoHong opened this issue 4 months ago
[Performance]: the performance with chunked-prefill-enabled is lower than default
github.com/vllm-project/vllm - BestKuan opened this issue 4 months ago
[VLM] Cleanup validation and update docs
github.com/vllm-project/vllm - DarkLight1337 opened this pull request 4 months ago
[Feature]: Model ChatGLMForCausalLM does not support LoRA, but LoRA is enabled.
github.com/vllm-project/vllm - wangbhan opened this issue 4 months ago
[Bug]: CUDA error when using multiple GPUs
github.com/vllm-project/vllm - ndao600 opened this issue 4 months ago
[VLM] Improve consistency between feature size calculation and dummy data for profiling
github.com/vllm-project/vllm - ywang96 opened this pull request 4 months ago
[Bug]: When using tp for inference, an error occurs: Worker VllmWorkerProcess pid 3283517 died, exit code: -15.
github.com/vllm-project/vllm - B-201 opened this issue 4 months ago
[Bugfix] Enable chunked-prefill and prefix cache with flash-attn backend
github.com/vllm-project/vllm - sighingnow opened this pull request 4 months ago
[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend
github.com/vllm-project/vllm - kzawora-intel opened this pull request 4 months ago
[Feature]: deepseek-v2 awq support
github.com/vllm-project/vllm - fengyang95 opened this issue 4 months ago
[Usage]: Internal server error when serving LoRA adapters with Open-AI compatible vLLM server
github.com/vllm-project/vllm - ebi64 opened this issue 4 months ago