vLLM issues | Ecosyste.ms: OpenCollective

Debug the optimal upper-bound performance for swapping (0-cost swapping).

github.com/vllm-project/vllm - zhuohan123 opened this issue over 1 year ago

Turn ShareGPT data into a standard benchmark

github.com/vllm-project/vllm - zhuohan123 opened this issue over 1 year ago

Fix the rushed-out multi-query kernel

github.com/vllm-project/vllm - zhuohan123 opened this issue over 1 year ago

Add support for Stable-LM and OpenAssistant

github.com/vllm-project/vllm - WoosukKwon opened this issue over 1 year ago

Modify the current PyTorch model to C++

github.com/vllm-project/vllm - zhuohan123 opened this issue over 1 year ago

[DO NOT MERGE] Orca prefix sharing benchmark

github.com/vllm-project/vllm - suquark opened this pull request over 1 year ago

[DO NOT MERGE] Prefix sharing (bug fixed)

github.com/vllm-project/vllm - suquark opened this pull request over 1 year ago

[DO NOT MERGE] Prefix stash siyuan

github.com/vllm-project/vllm - suquark opened this pull request over 1 year ago

Support various block sizes

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Implement prefix sharing

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Add chatbot benchmark scripts

github.com/vllm-project/vllm - merrymercy opened this pull request over 1 year ago

Support block size 32

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Fix timeout error in the FastAPI frontend

github.com/vllm-project/vllm - zhuohan123 opened this pull request over 1 year ago

Add an option to use dummy weights

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Implement block copy kernel to optimize beam search

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

[DO NOT MERGE] Hao integration

github.com/vllm-project/vllm - zhisbug opened this pull request over 1 year ago

Add a script for serving experiments & Collect system stats in scheduler

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Memcpy kernel for flash attention

github.com/vllm-project/vllm - suquark opened this pull request over 1 year ago

Fix potential bugs in FastAPI frontend and add comments

github.com/vllm-project/vllm - zhuohan123 opened this pull request over 1 year ago

Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Add CUDA graph-based all reduce launcher

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Batched benchmark script and more detailed benchmark metrics

github.com/vllm-project/vllm - zhuohan123 opened this pull request over 1 year ago

Basic attention kernel that supports cached KV + (multi-)prompts

github.com/vllm-project/vllm - suquark opened this pull request over 1 year ago

Add an option to disable Ray when using a single GPU

github.com/vllm-project/vllm - WoosukKwon opened this issue over 1 year ago

Tensor Parallel profiling result

github.com/vllm-project/vllm - zhuohan123 opened this issue over 1 year ago

Add ninja to dependency

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Optimize data movement

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Use FP32 for log probabilities

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Modify README to include info on loading LLaMA

github.com/vllm-project/vllm - zhuohan123 opened this pull request over 1 year ago

Optimize tensor parallel execution speed

github.com/vllm-project/vllm - zhuohan123 opened this pull request over 1 year ago

Add custom kernel for RMS normalization

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Merge QKV into one linear layer

github.com/vllm-project/vllm - zhuohan123 opened this pull request over 1 year ago

Implement custom kernel for LLaMA rotary embedding

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Refactor the test code for attention kernels

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Implement preemption via recomputation & Refactor scheduling logic

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Add cache watermark to avoid frequent cache eviction

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

FastAPI-based working frontend

github.com/vllm-project/vllm - zhuohan123 opened this pull request over 1 year ago

Implement LLaMA

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Add miscellaneous updates

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Support beam search & parallel generation

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Automatically configure KV cache size

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Fix a bug in 1D input shape

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Use FlashAttention for `multi_query_kv_attention`

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Implement `single_query_cached_kv_attention` kernel

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago

Support tensor parallel

github.com/vllm-project/vllm - zhuohan123 opened this pull request over 1 year ago

Fix a bug in tying OPT embeddings

github.com/vllm-project/vllm - WoosukKwon opened this pull request over 1 year ago