Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
github.com/vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
https://github.com/vllm-project/vllm
Allow passing HF config args with the OpenAI server
Aakash-kaushik opened this issue 11 months ago
OpenAI API server running but "POST /v1/chat/completions HTTP/1.1" 404 Not Found
Nero10578 opened this issue 11 months ago
`max_num_batched_tokens` and `max_num_seqs` values
isRambler opened this issue 11 months ago
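These two engine arguments bound each scheduler iteration: `max_num_batched_tokens` caps the total tokens batched per step, and `max_num_seqs` caps how many sequences run concurrently. A minimal sketch of tuning them (the model name and the values are placeholders):

```python
from vllm import LLM

# Hypothetical settings: at most 8192 tokens per scheduler step
# and at most 64 sequences in flight at once.
llm = LLM(
    model="facebook/opt-125m",      # placeholder model
    max_num_batched_tokens=8192,
    max_num_seqs=64,
)
```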
Aborted request without reason
erjieyong opened this issue 11 months ago
Support JSON mode.
MiyazonoKaori opened this issue 11 months ago
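For context, the OpenAI API feature being requested looks like this on the client side; whether the vLLM server honors `response_format` depends on the version (a hedged sketch; server URL and model are placeholders):

```python
from openai import OpenAI

# Point the official client at a vLLM OpenAI-compatible server (URL is a placeholder).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    messages=[{"role": "user", "content": "Reply with a JSON object listing three colors."}],
    # JSON mode as defined by the OpenAI API; vLLM support is version-dependent.
    response_format={"type": "json_object"},
)
print(resp.choices[0].message.content)
```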
vLLM distributed inference stuck when using multi-GPU
RathoreShubh opened this issue 11 months ago
Add JSON format logging support
CatherineSue opened this pull request 11 months ago
Can anyone get Qwen-14B-Chat-AWQ to work with vLLM/TP?
s-natsubori opened this issue 12 months ago
top_k = 50 makes vLLM predictions align with transformers
sfyumi opened this issue 12 months ago
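The likely explanation: HF transformers samples with `top_k=50` by default, while vLLM's `SamplingParams` defaults to `top_k=-1` (no truncation), so setting the value explicitly aligns the two, assuming temperature and the other parameters already match. A sketch with a placeholder model:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# transformers' generate() samples with top_k=50 by default;
# vLLM defaults to top_k=-1 (disabled), so set it explicitly.
params = SamplingParams(temperature=1.0, top_k=50)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```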
Multi-node serving with vLLM - Problems with Ray
vbucaj opened this issue 12 months ago
Compute perplexity/logits for the prompt
dsmilkov opened this issue 12 months ago
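Prompt log-probabilities can be requested with `prompt_logprobs`, and perplexity follows from their mean. A sketch, assuming entries map token ids to either floats or `Logprob` objects (the exact shape has varied across versions):

```python
import math
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Ask for logprobs over the prompt itself; generate only one token.
params = SamplingParams(max_tokens=1, prompt_logprobs=1)
out = llm.generate(["The quick brown fox jumps over the lazy dog"], params)[0]

log_probs = []
for token_id, entry in zip(out.prompt_token_ids, out.prompt_logprobs):
    if entry is None:          # the first prompt token has no logprob
        continue
    value = entry[token_id]    # logprob of the actual prompt token
    log_probs.append(getattr(value, "logprob", value))

perplexity = math.exp(-sum(log_probs) / len(log_probs))
print(f"prompt perplexity ~ {perplexity:.2f}")
```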
OutOfMemoryError
Hobrus opened this issue 12 months ago
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED in v0.2.7
cocovoc opened this issue 12 months ago
AWQ compression of Llama 2 70B Chat gives bad results
fancyerii opened this issue 12 months ago
vLLM on OpenShift/Kubernetes Manifests
WinsonSou opened this issue 12 months ago
Out of memory with Mixtral AWQ
m0wer opened this issue 12 months ago
Docs: Add Haystack integration details
bilgeyucel opened this pull request 12 months ago
Could we support Fuyu-8B, a multimodal LLM?
leiwen83 opened this issue 12 months ago
Recent vLLMs ask for too much memory: ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
pseudotensor opened this issue 12 months ago
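The error fires when, after weights and activations are accounted for, no GPU memory is left for KV-cache blocks. Raising `gpu_memory_utilization` (default 0.90) or lowering `max_model_len` are the usual remedies; a sketch with placeholder values:

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder model
    gpu_memory_utilization=0.95,  # give the engine a larger memory fraction
    max_model_len=4096,           # shorter context -> fewer cache blocks needed
)
```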
The output of vLLM differs from that of HF
will-wiki opened this issue about 1 year ago
[WIP] Speculative decoding using a draft model
cadedaniel opened this pull request about 1 year ago
Use LRU cache for CUDA Graphs
WoosukKwon opened this issue about 1 year ago
torch.cuda.OutOfMemoryError: CUDA out of memory
DenisStefanAndrei opened this issue about 1 year ago
argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
xxm1668 opened this issue about 1 year ago
How to pin vLLM inference to a specific GPU
SiqinLv opened this issue about 1 year ago
Unable to run any model with tensor_parallel_size > 1 on AWS SageMaker notebooks
samarthsarin opened this issue about 1 year ago
Inquiry Regarding vLLM Support for Mac Metal API
yihong1120 opened this issue about 1 year ago
Implement Triton-based AWQ kernel
WoosukKwon opened this pull request about 1 year ago
Support VLM models and the GPT-4V API
xunfeng1980 opened this issue about 1 year ago
vLLM RayWorker process hangs when using the LLM engine
SuoSiFire opened this issue about 1 year ago
[FEATURE REQUEST] SparQ Attention
AlpinDale opened this issue about 1 year ago
ARM aarch64 server build failed (host OS: Ubuntu 22.04.3)
zhudy opened this issue about 1 year ago
Why is online serving slower than offline serving?
BangDaeng opened this issue about 1 year ago
I want to add the mamba_chat (2.8B) model
SafeyahShemali opened this issue about 1 year ago
How to fix incomplete answers?
LuciAkirami opened this issue about 1 year ago
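Incomplete answers are most often the default token budget: `SamplingParams` defaults to `max_tokens=16`. Raising it and checking `finish_reason` usually resolves this; a sketch with a placeholder model:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# max_tokens defaults to 16, which cuts most answers short.
params = SamplingParams(max_tokens=1024, temperature=0.7)
out = llm.generate(["Explain how paged attention works."], params)[0].outputs[0]
# finish_reason is "length" when the budget (not a stop token) ended generation.
print(out.finish_reason, out.text[:200])
```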
Repeated answer: when using vLLM with OPT-13B, generation does not stop until max length is reached, repeating the answer
duihuhu opened this issue about 1 year ago
Error: RayWorkerVllm does not work when using --tensor-parallel-size. Please help.
JenniePing opened this issue about 1 year ago
Can it support macOS? (M2 chip)
znsoftm opened this issue about 1 year ago
Is there a way to terminate vllm.LLM and release the GPU memory
sfc-gh-zhwang opened this issue about 1 year ago
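These versions expose no official teardown API; a commonly shared best-effort workaround is to drop all references to the engine and flush the CUDA caching allocator (a sketch, not guaranteed to reclaim everything, e.g. under tensor parallelism):

```python
import gc

import torch
from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # placeholder model
# ... run inference ...

# Best-effort teardown: drop Python references, collect garbage,
# then release cached blocks back to the driver.
del llm
gc.collect()
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())  # should drop once references are gone
```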
Support `tools` and `tool_choice` parameter in OpenAI compatible service
simon-mo opened this issue about 1 year ago
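For reference, these are the OpenAI API parameters being requested; the client-side call would look like the sketch below once the server supports them (the URL, model, and `get_weather` tool are placeholders):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder URL

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)
```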
01-ai/Yi-34B-Chat never stops
pseudotensor opened this issue about 1 year ago
ModuleNotFoundError: No module named "vllm._C"
Kawai1Ace opened this issue about 1 year ago
Please help me solve the problem. Thanks.
CP3666 opened this issue about 1 year ago
Proposal: force type hint check with mypy
wangkuiyi opened this issue about 1 year ago
pip install -e . failed
dachengai opened this issue about 1 year ago
Batched inference outputs are not the same as single inference
gesanqiu opened this issue about 1 year ago
vLLM always tries to download the model from Hugging Face/ModelScope even if --download-dir points at already-downloaded models
davideuler opened this issue about 1 year ago
Add worker registry service for hosting multiple vLLM models through a single API gateway
tjtanaa opened this issue about 1 year ago
How to use logits_processors
shuaiwang2022 opened this issue about 1 year ago
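A logits processor is a callable that receives the token ids generated so far plus the raw logits for the next token, and returns adjusted logits. A sketch that suppresses one hypothetical token id:

```python
from typing import List

import torch
from vllm import LLM, SamplingParams

BANNED_TOKEN_ID = 42  # hypothetical token id to suppress

def ban_token(token_ids: List[int], logits: torch.Tensor) -> torch.Tensor:
    # Called once per decoding step: token_ids are the tokens generated
    # so far; logits is the vocab-sized tensor for the next token.
    logits[BANNED_TOKEN_ID] = float("-inf")
    return logits

llm = LLM(model="facebook/opt-125m")  # placeholder model
params = SamplingParams(logits_processors=[ban_token])
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```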
NCCL error
maxmelichov opened this issue about 1 year ago
ImportError: libcudart.so.12
tranhoangnguyen03 opened this issue about 1 year ago
API causes slowdown in batch request handling
jpeig opened this issue about 1 year ago
Avoid re-initializing parallel groups
wangruohui opened this pull request about 1 year ago
[Feature] SYCL kernel support for Intel GPU
abhilash1910 opened this pull request about 1 year ago
Follow-up of #1687: handle safetensors models containing 0-rank tensors
twaka opened this pull request about 1 year ago
Plans to make the installation work on Windows without WSL?
alexandre-ist opened this issue about 1 year ago
API Server Performance
simon-mo opened this issue about 1 year ago
Usage of vLLM for extracting embeddings
ra-MANUJ-an opened this issue about 1 year ago
Revert 1 docker build
wasertech opened this pull request about 1 year ago
Generating with baichuan-13b-chat via vLLM, many test inputs (of varying length, none exceeding the length limit) produce only a single period, and some of those examples generate normally after a few words or sentences are deleted. What could be the cause?
MrInouye opened this issue about 1 year ago
Prompt caching
AIApprentice101 opened this issue about 1 year ago
No CUDA GPUs are available Error with vLLM in JupyterLab
SafeyahShemali opened this issue about 1 year ago
How to use the chat function
zhangzai666 opened this issue about 1 year ago
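`LLM` has no chat helper in these versions; the usual pattern is to render the messages with the tokenizer's chat template and feed the resulting string to `generate()`. A sketch with a placeholder chat model:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is vLLM?"},
]
# Render the conversation into the model's expected chat format.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

llm = LLM(model=model_id)
print(llm.generate([prompt], SamplingParams(max_tokens=256))[0].outputs[0].text)
```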
chatglm3: vllm/vllm/model_executor/models/chatglm.py, line 53, in __init__: `assert self.total_num_kv_heads % tp_size == 0` AssertionError
Changjy1997nb opened this issue about 1 year ago
Support for sparsity?
BDHU opened this issue about 1 year ago
Tensor parallelism on ray cluster
baojunliu opened this issue about 1 year ago
Adding support for switch-transformer / NLLB-MoE
yl3469 opened this issue about 1 year ago
[Bug] prompt_logprobs = 1 OOM problem
shunxing1234 opened this issue about 1 year ago
Error: when using the OpenAI-compatible server, the server is available but cannot be accessed from the same terminal
LuristheSun opened this issue about 1 year ago
Support W8A8 inference in vLLM
AniZpZ opened this pull request about 1 year ago
Support int8 KV cache quantization in vLLM
AniZpZ opened this pull request about 1 year ago
Added logits processor API to sampling params
noamgat opened this pull request about 1 year ago
ImportError: cannot import name 'MistralConfig' from 'transformers'
peter-ch opened this issue about 1 year ago
Adding Locally Typical Sampling (i.e. typical_p in transformers and TGI)
seongminp opened this issue about 1 year ago
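For reference, locally typical sampling is already exposed in HF transformers as the `typical_p` argument to `generate()`; the request is for an equivalent `SamplingParams` option in vLLM. The transformers usage looks like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("Once upon a time", return_tensors="pt")
# typical_p keeps tokens whose information content is close to the expected
# value (Meister et al., 2022), analogous to top_p's probability-mass cutoff.
out = model.generate(**inputs, do_sample=True, typical_p=0.9, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```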
Does vLLM support Mac/Metal/MPS?
Phil-U-U opened this issue about 1 year ago
Using VLLM with a Tesla T4 on SageMaker Studio (ml.g4dn.xlarge instance)
paulovasconcellos-hotmart opened this issue about 1 year ago
[Question] Does vLLM support macOS M1 or M2 chips?
acekingke opened this issue about 1 year ago
Could not build wheels for vllm, which is required to install pyproject.toml-based projects
ABooth01 opened this issue about 1 year ago
Run multiple replicas behind a load balancer
linkedlist771 opened this issue about 1 year ago
How to deploy vllm model across multiple nodes in kubernetes?
Ryojikn opened this issue about 1 year ago
[Error] 400 Bad Request
Tostino opened this issue about 1 year ago
What is the max number of prompts that the generate() method can take?
hxue3 opened this issue about 1 year ago
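There is no hard cap on the list length: `generate()` queues every prompt and the scheduler batches them internally under `max_num_seqs` / `max_num_batched_tokens`, so the practical limit is host memory. A sketch:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder model

# Thousands of prompts are fine: the scheduler queues them and batches
# internally; GPU memory use is governed by engine limits, not list length.
prompts = [f"Write a haiku about item {i}." for i in range(2000)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
print(len(outputs))  # one RequestOutput per prompt, in order
```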
Low VRAM batch processing mode
viktor-ferenczi opened this issue about 1 year ago
Feature request: allow multiple model instances on one GPU if they fit in VRAM
agrogov opened this issue about 1 year ago
feat: demonstrate using regex for suffix matching
wsxiaoys opened this pull request about 1 year ago
Memory leak
SatoshiReport opened this issue about 1 year ago
StreamingLLM support?
nivibilla opened this issue about 1 year ago
Workaround for AWQ on Turing GPUs
twaka opened this pull request about 1 year ago
Jetson AGX Orin
MrBrabus75 opened this issue about 1 year ago
Data parallel inference
kevinhu opened this issue about 1 year ago
Support Python 3.12
EwoutH opened this issue about 1 year ago
3 GPUs not supported?
ye7love7 opened this issue about 1 year ago
vLLM generates nothing in the output
FocusLiwen opened this issue about 1 year ago
[Discussion] Will vLLM consider using speculative sampling to accelerate LLM decoding?
gesanqiu opened this issue about 1 year ago
Adding a locally trained model to vLLM
atanikan opened this issue over 1 year ago
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly.
MUZAMMILPERVAIZ opened this issue over 1 year ago
AWQ: bfloat16 not supported? And `--dtype` arg doesn't allow specifying float16
TheBloke opened this issue over 1 year ago
vLLM Discord Server
zhuohan123 opened this issue over 1 year ago