Ecosyste.ms: OpenCollective

An open API service for software projects hosted on Open Collective.

A high-throughput and memory-efficient inference and serving engine for LLMs
https://github.com/vllm-project/vllm

Memory leak

SatoshiReport opened this issue about 1 year ago
StreamingLLM support?

nivibilla opened this issue over 1 year ago
Workaround for AWQ on Turing GPUs

twaka opened this pull request over 1 year ago
Jetson agx orin

MrBrabus75 opened this issue over 1 year ago
Data parallel inference

kevinhu opened this issue over 1 year ago
Support Python 3.12

EwoutH opened this issue over 1 year ago
3 GPUs not supported?

ye7love7 opened this issue over 1 year ago
Generate nothing from VLLM output

FocusLiwen opened this issue over 1 year ago
vLLM to add a locally trained model

atanikan opened this issue over 1 year ago
vLLM Discord Server

zhuohan123 opened this issue over 1 year ago
Waiting sequence group should have only one prompt sequence.

Link-Li opened this issue over 1 year ago
Inconsistent results between HuggingFace Transformers and vllm

normster opened this issue over 1 year ago
How to deploy api server as https

yilihtien opened this issue over 1 year ago
vllm hangs when reinitializing ray

nelson-liu opened this issue over 1 year ago
How to use vLLM to compute a perplexity (PPL) score for input text?

yinochaos opened this issue over 1 year ago
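
The perplexity question above reduces to a one-line formula once per-token log-probabilities are available from any engine: PPL is the exponential of the mean negative log-likelihood over the scored tokens. A minimal pure-Python sketch of the formula (the function name is illustrative, not a vLLM API):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability) over the scored tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.5 to every token has perplexity ≈ 2:
print(perplexity([math.log(0.5)] * 4))
```

Lower is better: a perplexity of 2 means the model is, on average, as uncertain as a fair coin flip per token.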
GGUF support

viktor-ferenczi opened this issue over 1 year ago
Can the model Qwen/Qwen-VL-Chat work well?

wangschang opened this issue over 1 year ago
Is there authentication supported?

mluogh opened this issue over 1 year ago
Loading Model through Multi-Node Ray Cluster Fails

VarunSreenivasan16 opened this issue over 1 year ago
Sagemaker support for inference

Tarun3679 opened this issue over 1 year ago
Support for RLHF (ILQL)-trained Models

ojus1 opened this issue over 1 year ago
vLLM full name

designInno opened this issue over 1 year ago
pip installation error - ERROR: Failed building wheel for vllm

dxlong2000 opened this issue over 1 year ago
Stuck in Initializing an LLM engine

EvilCalf opened this issue over 1 year ago
Feature request: Support for embedding models

mantrakp04 opened this issue over 1 year ago
Testing the Qwen-7B-Chat model gives incorrect output

dachengai opened this issue over 1 year ago
Which fast tokenizer can be used for Baichuan-13B?

FURYFOR opened this issue over 1 year ago
How to deploy vLLM with quantization?

xxm1668 opened this issue over 1 year ago
Issue with raylet error

ZihanWang314 opened this issue over 1 year ago
Memory leak while using tensor_parallel_size>1

haiasd opened this issue over 1 year ago
Installing with ROCM

baderex opened this issue over 1 year ago
Best effort support for all Hugging Face transformers models

dwyatte opened this issue over 1 year ago
Cannot get a simple example working with multi-GPU

brevity2021 opened this issue over 1 year ago
How to use multiple GPUs?

xxm1668 opened this issue over 1 year ago
Flash Attention V2

nivibilla opened this issue over 1 year ago
Faster model loading

imoneoi opened this issue over 1 year ago
+34% higher throughput?

naed90 opened this issue over 1 year ago
[Feature Request] Support input embedding in `LLM.generate()`

KimmiShi opened this issue over 1 year ago
Decode error while running inference on a batch of prompts

SiriusNEO opened this issue over 1 year ago
Support Multiple Models

aldrinc opened this issue over 1 year ago
Feature request: support ExLlama

alanxmay opened this issue over 1 year ago
8bit support

mymusise opened this issue over 1 year ago
Request for a "wrapper" feature

jeffchy opened this issue over 1 year ago
CTranslate2

Matthieu-Tinycoaching opened this issue over 1 year ago
Remove Ray from the dependencies

lanking520 opened this issue over 1 year ago
CUDA error: out of memory

SunixLiu opened this issue over 1 year ago
Adding support for encoder-decoder models, like T5 or BART

shermansiu opened this issue over 1 year ago
Can I directly obtain the logits here?

SparkJiao opened this issue over 1 year ago
Whisper support

gottlike opened this issue over 1 year ago
Build failure due to CUDA version mismatch

WoosukKwon opened this issue over 1 year ago
Support custom models

WoosukKwon opened this issue over 1 year ago
Add docstrings to some modules and classes

WoosukKwon opened this pull request over 1 year ago
Minor code cleaning for SamplingParams

WoosukKwon opened this pull request over 1 year ago
Add performance comparison figures on A100, V100, T4

WoosukKwon opened this issue over 1 year ago
Add CD to PyPI

WoosukKwon opened this issue over 1 year ago
Enhance SamplingParams

WoosukKwon opened this pull request over 1 year ago
Implement presence and frequency penalties

WoosukKwon opened this pull request over 1 year ago
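
Presence and frequency penalties, in the OpenAI-style sampling-parameter sense that this PR title refers to, subtract from the logit of every token already generated: a flat amount once the token has appeared at all, plus an amount proportional to how often it appeared. A minimal pure-Python sketch of the idea, not vLLM's implementation:

```python
from collections import Counter

def apply_penalties(logits, output_token_ids, presence_penalty, frequency_penalty):
    """Penalize tokens already present in the output: a flat presence
    penalty per distinct token, plus a frequency penalty per occurrence."""
    counts = Counter(output_token_ids)
    penalized = list(logits)
    for token_id, n in counts.items():
        penalized[token_id] -= presence_penalty + frequency_penalty * n
    return penalized

# Token 0 appeared twice, so its logit drops by 0.5 + 0.1 * 2 = 0.7:
print(apply_penalties([1.0, 1.0], [0, 0],
                      presence_penalty=0.5, frequency_penalty=0.1))
```

Both penalties discourage repetition; the frequency term grows with each occurrence, while the presence term is a one-time cost.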
Support top-k sampling

WoosukKwon opened this pull request over 1 year ago
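
Top-k sampling restricts the next-token draw to the k highest-scoring logits, then renormalizes them with a softmax. A self-contained sketch of the idea (vLLM implements this vectorized on the GPU over whole batches; this function is only illustrative):

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Keep only the k highest logits, softmax-renormalize, sample an index."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    weights = [math.exp(logits[i] - m) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

# With k=1 this degenerates to greedy (argmax) decoding:
print(top_k_sample([0.1, 5.0, -2.0], k=1))  # 1
```

Larger k admits more diversity; k equal to the vocabulary size recovers unrestricted sampling.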
Avoid sorting waiting queue & Minor code cleaning

WoosukKwon opened this pull request over 1 year ago
Support string-based stopping conditions

WoosukKwon opened this issue over 1 year ago
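
A string-based stopping condition scans the text generated so far for any of the configured stop strings and truncates at the earliest match. A minimal sketch of that check (the function name is illustrative; a real engine also has to handle stop strings that straddle token boundaries during streaming):

```python
def find_stop(generated_text, stop_strings):
    """Truncate at the earliest stop string; return None if no stop was hit."""
    cut = None
    for stop in stop_strings:
        i = generated_text.find(stop)
        if i != -1 and (cut is None or i < cut):
            cut = i
    return None if cut is None else generated_text[:cut]

print(find_stop("Hello\nWorld", ["\n"]))   # Hello
print(find_stop("still going", ["###"]))   # None
```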
Rename variables and methods

WoosukKwon opened this pull request over 1 year ago
Log system stats

WoosukKwon opened this pull request over 1 year ago
Update example prompts in `simple_server.py`

WoosukKwon opened this pull request over 1 year ago
Support various sampling parameters

WoosukKwon opened this issue over 1 year ago
Make sure the system can run on T4 and V100

WoosukKwon opened this issue over 1 year ago
Clean up the scheduler code

WoosukKwon opened this issue over 1 year ago
Add a system logger

WoosukKwon opened this pull request over 1 year ago
Use slow tokenizer for LLaMA

WoosukKwon opened this pull request over 1 year ago
Enhance model loader

WoosukKwon opened this pull request over 1 year ago
Refactor system architecture

WoosukKwon opened this pull request over 1 year ago
Use runtime profiling to replace manual memory analyzers

zhuohan123 opened this pull request over 1 year ago
Bug in LLaMA fast tokenizer

WoosukKwon opened this issue over 1 year ago
[Minor] Fix a dtype bug

WoosukKwon opened this pull request over 1 year ago
Specify python package dependencies in requirements.txt

WoosukKwon opened this pull request over 1 year ago
Clean up Megatron-LM code

WoosukKwon opened this issue over 1 year ago
Add license

WoosukKwon opened this issue over 1 year ago
Implement client API

WoosukKwon opened this issue over 1 year ago
Add docstring

zhuohan123 opened this issue over 1 year ago
Use mypy

WoosukKwon opened this issue over 1 year ago
Support FP32

WoosukKwon opened this issue over 1 year ago
Dangerous floating point comparison

merrymercy opened this issue over 1 year ago
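
For context on the issue title: comparing floats with a bare `==` fails whenever rounding error accumulates, and a tolerance-based comparison such as `math.isclose` is the usual fix. A two-line demonstration:

```python
import math

# 0.1 + 0.2 is not exactly 0.3 in binary floating point:
total = 0.1 + 0.2
print(total == 0.3)              # False
print(math.isclose(total, 0.3))  # True
```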
Replace FlashAttention with xformers

WoosukKwon opened this pull request over 1 year ago
Decrease the default size of swap space

WoosukKwon opened this issue over 1 year ago
Fix a bug in attention kernel

WoosukKwon opened this pull request over 1 year ago
Use O3 optimization instead of O2 for CUDA compilation?

WoosukKwon opened this issue over 1 year ago