vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://github.com/vllm-project/vllm

Allow passing hf config args with openai server

Aakash-kaushik opened this issue 11 months ago
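
For context on the request above: the OpenAI-compatible server is launched as `python -m vllm.entrypoints.openai.api_server --model <hf-model>`, and clients talk to it with the stock openai package. A minimal sketch using the pre-1.0 openai client (the issue asks for a way to also forward arbitrary HF config arguments at launch, which the existing flags do not cover):

```python
# Assumes a server already running locally, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
import openai

openai.api_key = "EMPTY"                      # the vLLM server ignores keys
openai.api_base = "http://localhost:8000/v1"  # point the client at vLLM

completion = openai.Completion.create(
    model="facebook/opt-125m",                # must match the server's --model
    prompt="San Francisco is a",
    max_tokens=32,
)
print(completion.choices[0].text)
```
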
`max_num_batched_tokens` and `max_num_seqs` values

isRambler opened this issue 11 months ago
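
The two knobs in the title above are scheduler limits: `max_num_batched_tokens` caps the token budget per engine step, and `max_num_seqs` caps how many sequences run concurrently. A minimal sketch, assuming a vLLM of this period (model name illustrative):

```python
from vllm import LLM

# Both arguments are forwarded to the engine by the LLM constructor.
llm = LLM(
    model="facebook/opt-125m",       # illustrative model
    max_num_batched_tokens=8192,     # per-step token budget
    max_num_seqs=256,                # max concurrently running sequences
)
```
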
Aborted request without reason

erjieyong opened this issue 11 months ago
Support JSON mode.

MiyazonoKaori opened this issue 11 months ago
vLLM distributed inference stuck when using multi-GPU

RathoreShubh opened this issue 11 months ago
Add JSON format logging support

CatherineSue opened this pull request 11 months ago
Can anyone get Qwen-14B-Chat-AWQ to work with vLLM/TP?

s-natsubori opened this issue 12 months ago
`top_k=50` makes vLLM predictions align with transformers

sfyumi opened this issue 12 months ago
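
Likely background for the issue above: HF transformers samples with `top_k=50` by default, while vLLM defaults to `top_k=-1` (disabled), so aligning the two means setting it explicitly. A sketch:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")   # illustrative model
params = SamplingParams(temperature=1.0, top_k=50, max_tokens=64)
print(llm.generate(["Once upon a time"], params)[0].outputs[0].text)
```
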
Multi-node serving with vLLM - Problems with Ray

vbucaj opened this issue 12 months ago
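
Multi-node tensor parallelism in vLLM rides on a Ray cluster, which is where most of the reported problems live. A sketch, assuming the cluster is brought up first with `ray start --head` on the head node and `ray start --address=<head-ip>:6379` on each worker (addresses are placeholders):

```python
from vllm import LLM

# With tensor_parallel_size larger than one node's GPU count, vLLM
# places the remaining workers on the other Ray nodes.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=4)
```
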
Compute perplexity/logits for the prompt

dsmilkov opened this issue 12 months ago
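
Prompt scoring is possible via `prompt_logprobs`, which returns a logprob for every prompt token; perplexity follows directly. A sketch assuming the vLLM of this period, where the returned values are plain floats (newer releases wrap them in `Logprob` objects):

```python
import math
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")   # illustrative model
params = SamplingParams(max_tokens=1, prompt_logprobs=1)
out = llm.generate(["The quick brown fox jumps over the lazy dog."], params)[0]

# out.prompt_logprobs[i] maps token_id -> logprob at prompt position i;
# position 0 is None because the first token has no left context.
chosen = [lp[tid] for tid, lp in
          zip(out.prompt_token_ids[1:], out.prompt_logprobs[1:])]
print(f"prompt perplexity: {math.exp(-sum(chosen) / len(chosen)):.2f}")
```
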
OutOfMemoryError

Hobrus opened this issue 12 months ago
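
OOM reports like the one above usually come down to the KV-cache reservation made at engine startup. The standard first aid is to lower `gpu_memory_utilization` and/or cap `max_model_len`; a sketch:

```python
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",      # illustrative model
    gpu_memory_utilization=0.80,    # default is 0.90; fraction of VRAM claimed
    max_model_len=4096,             # shorter context = smaller KV cache
)
```
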
awq compression of llama 2 70b chat got bad result

fancyerii opened this issue 12 months ago
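
For the AWQ reports in this list, loading a quantized checkpoint looks like the sketch below; `quantization="awq"` is usually auto-detected from the checkpoint config, and the model name is illustrative:

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",                     # explicit; usually auto-detected
    tensor_parallel_size=2,                 # split the weights across two GPUs
)
```
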
vLLM on OpenShift/Kubernetes Manifests

WinsonSou opened this issue 12 months ago
Out of memory with Mixtral AWQ

m0wer opened this issue 12 months ago
Docs: Add Haystack integration details

bilgeyucel opened this pull request 12 months ago
Could we support Fuyu-8B, a multimodal LLM?

leiwen83 opened this issue 12 months ago
The output of vLLM is different from that of HF

will-wiki opened this issue about 1 year ago
[WIP] Speculative decoding using a draft model

cadedaniel opened this pull request about 1 year ago
Use LRU cache for CUDA Graphs

WoosukKwon opened this issue about 1 year ago
torch.cuda.OutOfMemoryError: CUDA out of memory

DenisStefanAndrei opened this issue about 1 year ago
How to specify a particular GPU for vLLM inference

SiqinLv opened this issue about 1 year ago
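
vLLM itself has no flag for picking a physical GPU; the standard answer to the (translated) question above is `CUDA_VISIBLE_DEVICES`, set before anything initializes CUDA. A sketch pinning the engine to GPU 1:

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # must be set before importing vllm

from vllm import LLM

llm = LLM(model="facebook/opt-125m")      # now runs on physical GPU 1
```
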
Inquiry Regarding vLLM Support for Mac Metal API

yihong1120 opened this issue about 1 year ago
Implement Triton-based AWQ kernel

WoosukKwon opened this pull request about 1 year ago
Support VLM models and the GPT-4V API

xunfeng1980 opened this issue about 1 year ago
vLLM RayWorker process hangs when using the LLM engine

SuoSiFire opened this issue about 1 year ago
[FEATURE REQUEST] SparQ Attention

AlpinDale opened this issue about 1 year ago
ARM aarch64 server build failed (host OS: Ubuntu 22.04.3)

zhudy opened this issue about 1 year ago
Why is online serving slower than offline serving?

BangDaeng opened this issue about 1 year ago
I want to add the mamba_chat (2.8B) model

SafeyahShemali opened this issue about 1 year ago
How to fix incomplete answers?

LuciAkirami opened this issue about 1 year ago
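
Incomplete answers are most often the default token budget, not the model: `SamplingParams` caps generation at `max_tokens=16` unless told otherwise. A sketch:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")     # illustrative model
params = SamplingParams(max_tokens=512)  # default is 16, which truncates answers
print(llm.generate(["Explain KV caching."], params)[0].outputs[0].text)
```
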
Can it support macOS? M2 chip.

znsoftm opened this issue about 1 year ago
Is there a way to terminate vllm.LLM and release the GPU memory?

sfc-gh-zhwang opened this issue about 1 year ago
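
There was no official teardown API in this period; the commonly cited workaround is to drop every reference to the engine and flush CUDA's allocator cache. A best-effort sketch, not guaranteed across versions:

```python
import gc
import torch
from vllm import LLM

llm = LLM(model="facebook/opt-125m")  # illustrative model
# ... run inference ...

del llm                     # drop the engine and its workers
gc.collect()                # let Python actually free the tensors
torch.cuda.empty_cache()    # hand cached blocks back to the driver
```
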
01-ai/Yi-34B-Chat never stops

pseudotensor opened this issue about 1 year ago
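
Runaway generations like the one above are typically a missing stop condition. Yi chat models ship with a ChatML-style template, so passing its end-of-turn marker as a stop string is the usual fix; a sketch (marker assumed from the model's chat template):

```python
from vllm import SamplingParams

# Stop as soon as the model emits the ChatML end-of-turn marker.
params = SamplingParams(max_tokens=512, stop=["<|im_end|>"])
```
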
ModuleNotFoundError: No module named "vllm._C"

Kawai1Ace opened this issue about 1 year ago
Please help me solve the problem. Thanks.

CP3666 opened this issue about 1 year ago
Proposal: enforce type hint checks with mypy

wangkuiyi opened this issue about 1 year ago
`pip install -e .` failed

dachengai opened this issue about 1 year ago
Batched inference outputs are not the same as single inference

gesanqiu opened this issue about 1 year ago
How to use `logits_processors`

shuaiwang2022 opened this issue about 1 year ago
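
A logits processor is any callable taking the tokens generated so far plus the next-token logits and returning the (possibly modified) logits; vLLM applies it before each sampling step. A sketch that bans one token id (the id is illustrative):

```python
import torch
from vllm import LLM, SamplingParams

BANNED_TOKEN_ID = 42  # illustrative; look ids up via the model's tokenizer

def ban_token(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    logits[BANNED_TOKEN_ID] = -float("inf")  # make the token unsampleable
    return logits

llm = LLM(model="facebook/opt-125m")  # illustrative model
params = SamplingParams(max_tokens=32, logits_processors=[ban_token])
print(llm.generate(["Hello"], params)[0].outputs[0].text)
```
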
NCCL error

maxmelichov opened this issue about 1 year ago
ImportError: libcudart.so.12

tranhoangnguyen03 opened this issue about 1 year ago
API causes slowdown in batch request handling

jpeig opened this issue about 1 year ago
Avoid re-initializing parallel groups

wangruohui opened this pull request about 1 year ago
[Feature] SYCL kernel support for Intel GPU

abhilash1910 opened this pull request about 1 year ago
Follow-up of #1687: when a safetensors model contains 0-rank tensors

twaka opened this pull request about 1 year ago
Plans to make the installation work on Windows without WSL?

alexandre-ist opened this issue about 1 year ago
API Server Performance

simon-mo opened this issue about 1 year ago
Usage of vLLM for extracting embeddings

ra-MANUJ-an opened this issue about 1 year ago
Revert 1 docker build

wasertech opened this pull request about 1 year ago
Prompt caching

AIApprentice101 opened this issue about 1 year ago
No CUDA GPUs are available Error with vLLM in JupyterLab

SafeyahShemali opened this issue about 1 year ago
How to use the chat function

zhangzai666 opened this issue about 1 year ago
Support for sparsity?

BDHU opened this issue about 1 year ago
Tensor parallelism on a Ray cluster

baojunliu opened this issue about 1 year ago
Adding support for switch-transformer / NLLB-MoE

yl3469 opened this issue about 1 year ago
[Bug] `prompt_logprobs=1` OOM problem

shunxing1234 opened this issue about 1 year ago
Support W8A8 inference in vLLM

AniZpZ opened this pull request about 1 year ago
Support INT8 KV cache quantization in vLLM

AniZpZ opened this pull request about 1 year ago
Added logits processor API to sampling params

noamgat opened this pull request about 1 year ago
ImportError: cannot import name 'MistralConfig' from 'transformers'

peter-ch opened this issue about 1 year ago
Does vLLM support Mac/Metal/MPS?

Phil-U-U opened this issue about 1 year ago
Using vLLM with a Tesla T4 on SageMaker Studio (ml.g4dn.xlarge instance)

paulovasconcellos-hotmart opened this issue about 1 year ago
[Question] Does vLLM support macOS M1 or M2 chips?

acekingke opened this issue about 1 year ago
Run multiple replicas behind a load balancer

linkedlist771 opened this issue about 1 year ago
How to deploy a vLLM model across multiple nodes in Kubernetes?

Ryojikn opened this issue about 1 year ago
[Error] 400 Bad Request

Tostino opened this issue about 1 year ago
Low VRAM batch processing mode

viktor-ferenczi opened this issue about 1 year ago
feat: demonstrate using regex for suffix matching

wsxiaoys opened this pull request about 1 year ago
Memory leak

SatoshiReport opened this issue about 1 year ago
StreamingLLM support?

nivibilla opened this issue about 1 year ago
Workaround for AWQ on Turing GPUs

twaka opened this pull request about 1 year ago
Jetson AGX Orin

MrBrabus75 opened this issue about 1 year ago
Data parallel inference

kevinhu opened this issue about 1 year ago
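
vLLM had no built-in data parallelism in this period; the usual recipe is one independent engine per GPU, each fed a shard of the prompts. A sketch for two GPUs (everything here is illustrative, not a vLLM API):

```python
import os
from multiprocessing import Process, set_start_method

def worker(gpu_id: int, prompts: list[str]) -> None:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from vllm import LLM, SamplingParams   # import only after pinning the GPU
    llm = LLM(model="facebook/opt-125m")
    for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
        print(gpu_id, out.outputs[0].text)

if __name__ == "__main__":
    set_start_method("spawn")              # keep CUDA out of forked children
    prompts = ["Hello", "Bonjour", "Hola", "Ciao"]
    procs = [Process(target=worker, args=(i, prompts[i::2])) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```
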
Support Python 3.12

EwoutH opened this issue about 1 year ago
3 GPUs not supported?

ye7love7 opened this issue about 1 year ago
Getting no output from vLLM

FocusLiwen opened this issue about 1 year ago
Adding a locally trained model to vLLM

atanikan opened this issue over 1 year ago
vLLM Discord Server

zhuohan123 opened this issue over 1 year ago