Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
github.com/vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
https://github.com/vllm-project/vllm
Refactor attention kernels
WoosukKwon opened this pull request over 1 year ago
New weight loader without np copy
zhuohan123 opened this pull request over 1 year ago
Add an option to launch cacheflow without ray
zhuohan123 opened this pull request over 1 year ago
Add support for GPT-NeoX (Pythia)
WoosukKwon opened this pull request over 1 year ago
Add plot scripts
Ying1123 opened this pull request over 1 year ago
Improve Weight Loading
zhuohan123 opened this issue over 1 year ago
Frontend Improvements
zhuohan123 opened this issue over 1 year ago
Debug the optimal upper-bound performance for swapping (0-cost swapping).
zhuohan123 opened this issue over 1 year ago
Turn shareGPT data into a standard benchmark
zhuohan123 opened this issue over 1 year ago
Fix the rushed out multi-query kernel
zhuohan123 opened this issue over 1 year ago
Add support for Stable-LM and OpenAssistant
WoosukKwon opened this issue over 1 year ago
Modify the current PyTorch model to C++
zhuohan123 opened this issue over 1 year ago
[DO NOT MERGE] Orca prefix sharing benchmark
suquark opened this pull request over 1 year ago
[DO NOT MERGE] Prefix sharing (bug fixed)
suquark opened this pull request over 1 year ago
[DO NOT MERGE] Prefix stash siyuan
suquark opened this pull request over 1 year ago
Support various block sizes
WoosukKwon opened this pull request over 1 year ago
Implement prefix sharing
WoosukKwon opened this pull request over 1 year ago
Add chatbot benchmark scripts
merrymercy opened this pull request over 1 year ago
Support block size 32
WoosukKwon opened this pull request over 1 year ago
Fix timeout error in the FastAPI frontend
zhuohan123 opened this pull request over 1 year ago
Add an option to use dummy weights
WoosukKwon opened this pull request over 1 year ago
Implement block copy kernel to optimize beam search
WoosukKwon opened this pull request over 1 year ago
[DO NOT MERGE] Hao integration
zhisbug opened this pull request over 1 year ago
Add a script for serving experiments & Collect system stats in scheduler
WoosukKwon opened this pull request over 1 year ago
Memcpy kernel for flash attention
suquark opened this pull request over 1 year ago
Fix potential bugs in FastAPI frontend and add comments
zhuohan123 opened this pull request over 1 year ago
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script
WoosukKwon opened this pull request over 1 year ago
Add CUDA graph-based all reduce launcher
WoosukKwon opened this pull request over 1 year ago
Batched benchmark script and more detailed benchmark metrics
zhuohan123 opened this pull request over 1 year ago
Basic attention kernel that supports cached KV + (multi-)prompts
suquark opened this pull request over 1 year ago
Add an option to disable Ray when using a single GPU
WoosukKwon opened this issue over 1 year ago
Tensor Parallel profiling result
zhuohan123 opened this issue over 1 year ago
Add ninja to dependency
WoosukKwon opened this pull request over 1 year ago
Optimize data movement
WoosukKwon opened this pull request over 1 year ago
Use FP32 for log probabilities
WoosukKwon opened this pull request over 1 year ago
Modify README to include info on loading LLaMA
zhuohan123 opened this pull request over 1 year ago
Optimize tensor parallel execution speed
zhuohan123 opened this pull request over 1 year ago
Add custom kernel for RMS normalization
WoosukKwon opened this pull request over 1 year ago
Merge QKV into one linear layer
zhuohan123 opened this pull request over 1 year ago
Implement custom kernel for LLaMA rotary embedding
WoosukKwon opened this pull request over 1 year ago
Refactor the test code for attention kernels
WoosukKwon opened this pull request over 1 year ago
Implement preemption via recomputation & Refactor scheduling logic
WoosukKwon opened this pull request over 1 year ago
Add cache watermark to avoid frequent cache eviction
WoosukKwon opened this pull request over 1 year ago
FastAPI-based working frontend
zhuohan123 opened this pull request over 1 year ago
Implement LLaMA
WoosukKwon opened this pull request over 1 year ago
Add miscellaneous updates
WoosukKwon opened this pull request almost 2 years ago
Support beam search & parallel generation
WoosukKwon opened this pull request almost 2 years ago
Automatically configure KV cache size
WoosukKwon opened this pull request almost 2 years ago
Fix a bug in 1D input shape
WoosukKwon opened this pull request almost 2 years ago
Use FlashAttention for `multi_query_kv_attention`
WoosukKwon opened this pull request almost 2 years ago
Implement `single_query_cached_kv_attention` kernel
WoosukKwon opened this pull request almost 2 years ago
Support tensor parallel
zhuohan123 opened this pull request almost 2 years ago
Fix a bug in tying OPT embeddings
WoosukKwon opened this pull request almost 2 years ago