Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
github.com/vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
https://github.com/vllm-project/vllm
vLLM inference with THUDM/chatglm3-6b-128k cannot stop
linzm1007 opened this issue 7 months ago
[Bug]: Pending but Avg generation throughput: 0.0 tokens/s
hitsz-zxw opened this issue 7 months ago
[Usage]: how to get the output embedding for a text generation model using vllm
Apricot1225 opened this issue 7 months ago
[Bugfix] Destroy PP groups properly
andoorve opened this pull request 7 months ago
[Bug]: prompt_logprobs doesn't work with openai compatible server
Some-random opened this issue 7 months ago
[misc] benchmark_serving.py -- add ITL results and tweak TPOT results
tlrmchlsmth opened this pull request 7 months ago
[Kernel] Allow 8-bit outputs for cutlass_scaled_mm
tlrmchlsmth opened this pull request 7 months ago
p
khluu opened this pull request 7 months ago
[Misc] Add CustomOp interface for device portability
WoosukKwon opened this pull request 7 months ago
[Bugfix] Fix `MultiprocessingGPUExecutor.check_health` when world_size == 1
jsato8094 opened this pull request 7 months ago
[CI/Build] Add `is_quant_method_supported` to control quantization test configurations
mgoin opened this pull request 7 months ago
[Speculative Decoding] Add `ProposerWorkerBase` abstract class
njhill opened this pull request 7 months ago
[Misc]: vllm ONLY allocate KVCache on the first device in CUDA_VISIBLE_DEVICES
CatYing opened this issue 7 months ago
how to compile with GLIBCXX_USE_CXX11_ABI=1
demonatic opened this issue 7 months ago
[BugFix]Fix the problem that StopChecker assumes a single token produ…
IcyFeather233 opened this pull request 7 months ago
[Kernel] Add back batch size 1536 and 3072 to MoE tuning
WoosukKwon opened this pull request 7 months ago
[CI/Build] Reducing CPU CI execution time
bigPYJ1151 opened this pull request 7 months ago
[Bug]: Tokenizer setter of LLM without CachedTokenizer adapter
DriverSong opened this issue 7 months ago
[Performance]: Speculative Performance almost same or lower
tolry418 opened this issue 7 months ago
[Kernel] Re-tune Mixtral MoE configurations for FP8 on H100
pcmoritz opened this pull request 7 months ago
[Frontend] Add OpenAI Vision API Support
ywang96 opened this pull request 7 months ago
[Bug]: LLM.generate() collapse with some padding side
kevin3314 opened this issue 7 months ago
[Bugfix] Add warmup for prefix caching example
zhuohan123 opened this pull request 7 months ago
[Feature]: Add efficient interface for evaluating probabilities of fixed prompt-completion pairs
xinyangz opened this issue 7 months ago
Bugfix: fix broken download of models from modelscope
liuyhwangyh opened this pull request 7 months ago
[Feature]: vllm-flash-attn cu118 compatibility
epark001 opened this issue 7 months ago
[Model] Correct Mixtral FP8 checkpoint loading
comaniac opened this pull request 7 months ago
[Core][Doc] Default to multiprocessing for single-node distributed case
njhill opened this pull request 7 months ago
[Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor
zifeitong opened this pull request 7 months ago
[Feature]: Custom attention masks
ojus1 opened this issue 7 months ago
[Usage]: How to start inference serving through `LLM` object
Jiayi-Pan opened this issue 7 months ago
[Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to False
zifeitong opened this pull request 7 months ago
v0.5.0 Release Tracker
simon-mo opened this issue 7 months ago
[Misc] Adding Speculative decoding to Throughput Benchmarking script
abhibambhaniya opened this pull request 7 months ago
[Usage]: RuntimeError: CUDA error: uncorrectable ECC error encountered
DJCoolDev opened this issue 7 months ago
[Doc]: Update the vllm distributed Inference and Serving with the new MultiprocessingGPUExecutor
rcarrata opened this issue 7 months ago
[Bug]: Mixtral-8x22 request cancelled by cancel scope when client sends multiple concurrent requests
markovalexander opened this issue 7 months ago
[Bug]: Mistral 7B crashes on NVidia Tesla P100 with a CUDA Error
oe3gwu opened this issue 7 months ago
Support W4A8 quantization for vllm
HandH1998 opened this pull request 7 months ago
[Bugfix] Support `prompt_logprobs==0`
toslunar opened this pull request 7 months ago
[CI/Build] Add inputs tests
DarkLight1337 opened this pull request 7 months ago
[Core] Registry for processing model inputs
DarkLight1337 opened this pull request 7 months ago
[Bug]: prompt_logprobs=0 raises AssertionError
toslunar opened this issue 7 months ago
[Installation]: Failed to build punica
asinglestep opened this issue 7 months ago
[Usage]: how to terminate a vllm model and free or release gpu memory
wellcasa opened this issue 7 months ago
[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend
afeldman-nm opened this pull request 7 months ago
[Feature]: Support for Mirostat, Dynamic Temperature, and Quadratic Sampling
Emmie411 opened this issue 7 months ago
[Bug]: VLLM_ATTENTION_BACKEND set to ROCM_FLASH only in GHA environment, overriding automatic backend selection; this breaks other kernel unit tests.
afeldman-nm opened this issue 7 months ago
[BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM
DriverSong opened this pull request 7 months ago
[Feature]: Option to override HuggingFace's configurations
DarkLight1337 opened this issue 7 months ago
[Feature]: inconsistent vocab_sizes support for draft and target workers while using Speculative Decoding
ShangmingCai opened this issue 7 months ago
[Feature]: Speculative edits
Muhtasham opened this issue 7 months ago
[Bug]: Issues with Applying LoRA in vllm on a T4 GPU
rikitomo opened this issue 7 months ago
[Bug]: Issues with Applying LoRA in vllm on a T4 GPU
rikioka-tomokazu opened this issue 7 months ago
[Frontend] Customizable RoPE theta
sasha0552 opened this pull request 7 months ago
push error
triple-Mu opened this pull request 7 months ago
[Usage]: how to use the gpu_cache_usage_perc as a custom metric in k8s HPA?
chakpongchung opened this issue 7 months ago
[Misc] Improve error message when LoRA parsing fails
DarkLight1337 opened this pull request 7 months ago
[Usage]: How can I deploy llama3-70b on a server with 8 3090 GPUs with lora and CUDA graph.
AlphaINF opened this issue 7 months ago
[Core] Support loading GGUF model
Isotr0py opened this pull request 7 months ago
[Bug]: loading squeezellm model
yuhuixu1993 opened this issue 7 months ago
[Model] Add PaliGemma
ywang96 opened this pull request 7 months ago
[Core][Prefix Caching] Fix hashing logic for non-full blocks
zhuohan123 opened this pull request 7 months ago
[Bugfix] [Frontend] vLLM api_server.py when using with prompt_token_ids causes error.
TikZSZ opened this pull request 7 months ago
[Bug]: vLLM api_server.py when using with prompt_token_ids causes error.
TikZSZ opened this issue 7 months ago
[Feature]: MoE kernels (Mixtral-8x22B-Instruct-v0.1) are not yet supported on CPU only ?
xxll88 opened this issue 7 months ago
[BugFix] Prevent `LLM.encode` for non-generation Models
robertgshaw2-neuralmagic opened this pull request 7 months ago
[Kernel] Switch fp8 layers to use the CUTLASS kernels
tlrmchlsmth opened this pull request 7 months ago
[Bug]: Offline Inference with the OpenAI Batch file format yields unnecessary `asyncio.exceptions.CancelledError`
jlcmoore opened this issue 7 months ago
[Bug]: The Offline Inference Embedding Example Fails
cuizhuyefei opened this issue 7 months ago
[Bugfix]: Fix issues related to prefix caching example (#5177)
Delviet opened this pull request 7 months ago
[Feature]: BERT models for embeddings
mevince opened this issue 7 months ago
[Model] LoRA support added for command-r
sergey-tinkoff opened this pull request 7 months ago
[Bug]: Incorrect Example for the Inference with Prefix
Delviet opened this issue 7 months ago
[Usage]: Prefix caching in VLLM
Abhinay2323 opened this issue 7 months ago
draft2
khluu opened this pull request 7 months ago
[Bugfix] Remove deprecated @abstractproperty
zhuohan123 opened this pull request 7 months ago
bug fixed: cuda out of memory lead to 'AsyncEngineDeadError: Background loop has errored already.
charent opened this pull request 7 months ago
Adding fp8 gemm computation
charlifu opened this pull request 7 months ago
[Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support
njhill opened this pull request 7 months ago
[Bug]: Model Launch Hangs with 16+ Ranks in vLLM
wushidonguc opened this issue 7 months ago
[Bugfix] Fix illegal memory access for lora
sfc-gh-zhwang opened this pull request 7 months ago
[Build] Guard against older CUDA versions when building CUTLASS 3.x kernels
tlrmchlsmth opened this pull request 7 months ago
[Performance]: What can we learn from OctoAI
hmellor opened this issue 7 months ago
[Build] Do not compile cutlass scaled_mm on CUDA 11
simon-mo opened this pull request 7 months ago
[Bugfix] Fix KeyError: 1 When Using LoRA adapters
BlackBird-Coding opened this pull request 7 months ago
[Bug]: Unable to Use Prefix Caching in AsyncLLMEngine
kezouke opened this issue 7 months ago
[Bug]: WSL2(also Docker) 1 GPU work but 2 not,(--tensor-parallel-size 2 )
goodmaney opened this issue 7 months ago
[Bug]: Issue with Token Processing Efficiency and Key-Value Cache Utilization in AsyncLLMEngine
kezouke opened this issue 7 months ago
[Kernel] Pass a device pointer into the quantize kernel for the scales
tlrmchlsmth opened this pull request 7 months ago
[Core] Bump up the default of --gpu_memory_utilization to be more similar to TensorRT Triton's default
alexm-neuralmagic opened this pull request 7 months ago
[Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size
tlrmchlsmth opened this pull request 7 months ago
[Feature]: VLLM support for function calling in Mistral-7B-Instruct-v0.3
javierquin opened this issue 7 months ago
[Feature]: Linear adapter support for Mixtral
DhruvaBansal00 opened this issue 7 months ago
[Bug] [spec decode] [flash_attn]: CUDA illegal memory access when calling flash_attn_cuda.fwd_kvcache
khluu opened this issue 7 months ago
[Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py
dashanji opened this pull request 7 months ago
[Misc]: Should inference with temperature 0 generate the same results for a lora adapter and equivalent merged model?
rohan-daniscox opened this issue 7 months ago
[Bug]: torch.cuda.OutOfMemoryError: CUDA out of memory when handling inference requests
zhaotyer opened this issue 7 months ago
add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088
alexm-neuralmagic opened this pull request 7 months ago
[Kernel] Update Cutlass fp8 configs
varun-sundar-rabindranath opened this pull request 7 months ago