Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLMs).
Collective: https://opencollective.com/vllm
Host: opensource
Code: https://github.com/vllm-project/vllm
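For context on what the engine described above does, here is a minimal offline-inference sketch using vLLM's Python API; the model name and sampling values are illustrative placeholders, not taken from this page.

    # Minimal offline-inference sketch (placeholder model and sampling values).
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # any HF-compatible model id or local path
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    outputs = llm.generate(["Hello, my name is"], params)
    for out in outputs:
        print(out.prompt, out.outputs[0].text)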
vLLM multi-LoRA with embed_tokens and lm_head in adapter weights
github.com/vllm-project/vllm - germanjke opened this issue almost 1 year ago
OpenAI completions API with `echo=True` raises an error
github.com/vllm-project/vllm - seoyunYang opened this issue almost 1 year ago
Add Splitwise implementation to vLLM
github.com/vllm-project/vllm - aashaka opened this pull request about 1 year ago
Nvidia-H20 with nvcr.io/nvidia/pytorch:23.12-py3, CUBLAS error!
github.com/vllm-project/vllm - tohneecao opened this issue about 1 year ago
Multi GPU ROCm6 issues, and workarounds
github.com/vllm-project/vllm - BKitor opened this issue about 1 year ago
model continue conversation
github.com/vllm-project/vllm - andrey-genpracc opened this issue about 1 year ago
[Bug] `v0.3.0` produces garbage output when serving CodeLlama-70B on 4xA6000
github.com/vllm-project/vllm - ganler opened this issue about 1 year ago
ERROR: Fails to install in editable mode. "UserWarning: There are no .../x86_64-conda-linux-gnu-c++ version bounds defined for CUDA version 12.1"
github.com/vllm-project/vllm - KartikYZ opened this issue about 1 year ago
Add fused top-K softmax kernel for MoE
github.com/vllm-project/vllm - WoosukKwon opened this pull request about 1 year ago
GPTQ & AWQ Fused MOE
github.com/vllm-project/vllm - chu-tianxiang opened this pull request about 1 year ago
Llama Guard inconsistent output between HuggingFace's Transformers and vLLM
github.com/vllm-project/vllm - AmenRa opened this issue about 1 year ago
vLLM ignores my requests when I increase the number of concurrent requests
github.com/vllm-project/vllm - savannahfung opened this issue about 1 year ago
[Minor] More fix of test_cache.py CI test failure
github.com/vllm-project/vllm - LiuXiaoxuanPKU opened this pull request about 1 year ago
ImportError: /ramyapra/vllm/vllm/_C.cpython-310-x86_64-linux-gnu.so: undefined symbol:
github.com/vllm-project/vllm - ramyaprabhu-alt opened this issue about 1 year ago
How to increase the vLLM scheduler prompt limit?
github.com/vllm-project/vllm - hanswang1 opened this issue about 1 year ago
Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
github.com/vllm-project/vllm - gty111 opened this issue about 1 year ago
Fix/async chat serving
github.com/vllm-project/vllm - schoennenbeck opened this pull request about 1 year ago
KV Cache usage is 0% for mistral model
github.com/vllm-project/vllm - nikhilshandilya opened this issue about 1 year ago
Seeking help: `Qwen-14B-Chat-Int4` ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
github.com/vllm-project/vllm - huangyunxin opened this issue about 1 year ago
OpenAIServingChat cannot be instantiated within a running event loop
github.com/vllm-project/vllm - schoennenbeck opened this issue about 1 year ago
Ray worker out of memory
github.com/vllm-project/vllm - tristan279 opened this issue about 1 year ago
IndexError when using Beam Search in Chat Completions
github.com/vllm-project/vllm - jamestwhedbee opened this issue about 1 year ago
ValueError: Total number of attention heads (52) must be divisible by tensor parallel size (8).
github.com/vllm-project/vllm - PolinaBokova opened this issue about 1 year ago
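The ValueError above encodes a hard constraint: the tensor parallel size must evenly divide the model's attention head count (52 in that report). A hedged sketch of picking a valid value when constructing the engine; the model name is a hypothetical placeholder.

    # Tensor parallelism splits each attention layer across GPUs, so the head count
    # must be divisible by tensor_parallel_size (for 52 heads: 1, 2, 4, 13, 26, or 52).
    from vllm import LLM

    llm = LLM(
        model="your-52-head-model",  # hypothetical placeholder
        tensor_parallel_size=4,      # 52 % 4 == 0, so this is valid; 8 is not
    )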
Question: Would a PR integrating ExLlamaV2 kernels with AWQ be accepted?
github.com/vllm-project/vllm - casper-hansen opened this issue about 1 year ago
Adding `/get_tokenizer` to api_server for easier lm-evaluation-harness integration.
github.com/vllm-project/vllm - AguirreNicolas opened this pull request about 1 year ago
Dockerfile: build-arg to punica kernel
github.com/vllm-project/vllm - AguirreNicolas opened this pull request about 1 year ago
Mixtral AWQ fails to work: asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fd214489990
github.com/vllm-project/vllm - pseudotensor opened this issue about 1 year ago
[RFC] Automatic Prefix Caching
github.com/vllm-project/vllm - zhuohan123 opened this issue about 1 year ago
How to get the logits of the first generated text?
github.com/vllm-project/vllm - Abigail61 opened this issue about 1 year ago
Speculative Decoding
github.com/vllm-project/vllm - ymwangg opened this pull request about 1 year ago
Add multi-LoRA support for more architectures
github.com/vllm-project/vllm - Yard1 opened this issue about 1 year ago
Combine multi-LoRA and quantization
github.com/vllm-project/vllm - Yard1 opened this issue about 1 year ago
Longer stop sequence not working in streaming mode
github.com/vllm-project/vllm - andrePankraz opened this issue about 1 year ago
Support for a production-grade inference server [Gunicorn vs Uvicorn]?
github.com/vllm-project/vllm - jalotra opened this issue about 1 year ago
GPU utilization decrease during long-term running
github.com/vllm-project/vllm - WrRan opened this issue about 1 year ago
CUDA out of memory error despite having enough memory
github.com/vllm-project/vllm - varonroy opened this issue about 1 year ago
Allow passing hf config args with openai server
github.com/vllm-project/vllm - Aakash-kaushik opened this issue about 1 year ago
OpenAI API server running, but "POST /v1/chat/completions HTTP/1.1" returns 404 Not Found
github.com/vllm-project/vllm - Nero10578 opened this issue about 1 year ago
`max_num_batched_tokens` and `max_num_seqs` values
github.com/vllm-project/vllm - isRambler opened this issue about 1 year ago
Aborted request without reason
github.com/vllm-project/vllm - erjieyong opened this issue about 1 year ago
vLLM Distributed Inference stuck when using multi-GPU
github.com/vllm-project/vllm - RathoreShubh opened this issue about 1 year ago
Add JSON format logging support
github.com/vllm-project/vllm - CatherineSue opened this pull request about 1 year ago
Can anyone get Qwen-14B-Chat-AWQ to work with vLLM/TP?
github.com/vllm-project/vllm - s-natsubori opened this issue about 1 year ago
ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (3664). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
github.com/vllm-project/vllm - handsomelys opened this issue about 1 year ago
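The error message above already names the two relevant engine arguments; here is a minimal sketch of applying that advice when initializing the engine. The model name and values are placeholders, not taken from the issue.

    # Either give the KV cache more headroom or cap the context length so it
    # fits in the cache blocks that can actually be allocated.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model
        gpu_memory_utilization=0.95,            # default is 0.90; raises the KV-cache budget
        # max_model_len=3584,                   # alternatively, lower the max sequence length
    )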
top_k = 50 makes vLLM predictions align with Transformers
github.com/vllm-project/vllm - sfyumi opened this issue about 1 year ago
Multi-node serving with vLLM - Problems with Ray
github.com/vllm-project/vllm - vbucaj opened this issue about 1 year ago
Compute perplexity/logits for the prompt
github.com/vllm-project/vllm - dsmilkov opened this issue about 1 year ago
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED about v0.2.7
github.com/vllm-project/vllm - cocovoc opened this issue about 1 year ago
AWQ compression of Llama 2 70B Chat gives bad results
github.com/vllm-project/vllm - fancyerii opened this issue about 1 year ago
vLLM on OpenShift/Kubernetes Manifests
github.com/vllm-project/vllm - WinsonSou opened this issue about 1 year ago
out of memory with mixtral AWQ
github.com/vllm-project/vllm - m0wer opened this issue about 1 year ago
Docs: Add Haystack integration details
github.com/vllm-project/vllm - bilgeyucel opened this pull request about 1 year ago
Could we support Fuyu-8B, a multimodal LLM?
github.com/vllm-project/vllm - leiwen83 opened this issue about 1 year ago
Recent vLLMs ask for too much memory: ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
github.com/vllm-project/vllm - pseudotensor opened this issue about 1 year ago
The output of vLLM is different from that of HF
github.com/vllm-project/vllm - will-wiki opened this issue about 1 year ago
[WIP] Speculative decoding using a draft model
github.com/vllm-project/vllm - cadedaniel opened this pull request about 1 year ago
Use LRU cache for CUDA Graphs
github.com/vllm-project/vllm - WoosukKwon opened this issue about 1 year ago
torch.cuda.OutOfMemoryError: CUDA out of memory
github.com/vllm-project/vllm - DenisStefanAndrei opened this issue about 1 year ago
argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
github.com/vllm-project/vllm - xxm1668 opened this issue about 1 year ago
Unable to run any model with tensor_parallel_size>1 on AWS sagemaker notebooks
github.com/vllm-project/vllm - samarthsarin opened this issue about 1 year ago
Inquiry Regarding vLLM Support for Mac Metal API
github.com/vllm-project/vllm - yihong1120 opened this issue about 1 year ago
Implement Triton-based AWQ kernel
github.com/vllm-project/vllm - WoosukKwon opened this pull request about 1 year ago
Support VLM model and GPT4V API
github.com/vllm-project/vllm - xunfeng1980 opened this issue about 1 year ago
vLLM RayWorker process hangs when using the LLM engine
github.com/vllm-project/vllm - SuoSiFire opened this issue about 1 year ago
[FEATURE REQUEST] SparQ Attention
github.com/vllm-project/vllm - AlpinDale opened this issue about 1 year ago
ARM aarch64 server build failed (host OS: Ubuntu 22.04.3)
github.com/vllm-project/vllm - zhudy opened this issue about 1 year ago
Why is online serving slower than offline serving?
github.com/vllm-project/vllm - BangDaeng opened this issue about 1 year ago
I want to add mamba_chat (2.8b) model
github.com/vllm-project/vllm - SafeyahShemali opened this issue about 1 year ago
How to fix incomplete answers?
github.com/vllm-project/vllm - LuciAkirami opened this issue about 1 year ago
Repeated answer: when I use vLLM with OPT-13B, the generated text does not end until the max length, with the answer repeated
github.com/vllm-project/vllm - duihuhu opened this issue about 1 year ago
Error: Rayworkervllm cannot work well when using --tensor-parallel-size. Please help.
github.com/vllm-project/vllm - JenniePing opened this issue about 1 year ago
Can it support macOS? M2 chip.
github.com/vllm-project/vllm - znsoftm opened this issue about 1 year ago
Is there a way to terminate vllm.LLM and release the GPU memory
github.com/vllm-project/vllm - sfc-gh-zhwang opened this issue about 1 year ago
Support `tools` and `tool_choice` parameter in OpenAI compatible service
github.com/vllm-project/vllm - simon-mo opened this issue about 1 year ago
01-ai/Yi-34B-Chat never stops
github.com/vllm-project/vllm - pseudotensor opened this issue about 1 year ago
ModuleNotFoundError: No module named "vllm._C"
github.com/vllm-project/vllm - Kawai1Ace opened this issue about 1 year ago
Please help me solve the problem. thanks
github.com/vllm-project/vllm - CP3666 opened this issue about 1 year ago
Proposal: force type hint check with mypy
github.com/vllm-project/vllm - wangkuiyi opened this issue about 1 year ago
Batched inference outputs are not the same as single inference
github.com/vllm-project/vllm - gesanqiu opened this issue about 1 year ago
vllm always tries to download model from huggingface/modelscope even if I specify --download-dir with already downloaded models
github.com/vllm-project/vllm - davideuler opened this issue about 1 year ago
Add a worker registry service for hosting multiple vLLM models through a single API gateway
github.com/vllm-project/vllm - tjtanaa opened this issue about 1 year ago
How to use logits_processors
github.com/vllm-project/vllm - shuaiwang2022 opened this issue about 1 year ago
ImportError: libcudart.so.12
github.com/vllm-project/vllm - tranhoangnguyen03 opened this issue about 1 year ago
API causes slowdown in batch request handling
github.com/vllm-project/vllm - jpeig opened this issue about 1 year ago
Avoid re-initialize parallel groups
github.com/vllm-project/vllm - wangruohui opened this pull request about 1 year ago
[Feature] SYCL kernel support for Intel GPU
github.com/vllm-project/vllm - abhilash1910 opened this pull request about 1 year ago
Follow-up of #1687: when a safetensors model contains 0-rank tensors
github.com/vllm-project/vllm - twaka opened this pull request about 1 year ago
Plans to make the installation work on Windows without WSL?
github.com/vllm-project/vllm - alexandre-ist opened this issue about 1 year ago
usage of vllm for extracting embeddings
github.com/vllm-project/vllm - ra-MANUJ-an opened this issue about 1 year ago
Revert 1 docker build
github.com/vllm-project/vllm - wasertech opened this pull request about 1 year ago
When generating with baichuan-13b-chat via vLLM, many test inputs (of various lengths, none exceeding the length limit) produce only a single period, and some examples generate normally once a few words or sentences are removed. What could be the cause?
github.com/vllm-project/vllm - MrInouye opened this issue about 1 year ago