Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
github.com/vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
https://github.com/vllm-project/vllm
vLLM inference with THUDM/chatglm3-6b-128k cannot stop
linzm1007 opened this issue 7 months ago
[Bug]: Pending but Avg generation throughput: 0.0 tokens/s
hitsz-zxw opened this issue 7 months ago
[Usage]: how to get the output embedding for a text generation model using vllm
Apricot1225 opened this issue 7 months ago
[Bugfix] Destroy PP groups properly
andoorve opened this pull request 7 months ago
[Bug]: prompt_logprobs doesn't work with openai compatible server
Some-random opened this issue 7 months ago
[misc] benchmark_serving.py -- add ITL results and tweak TPOT results
tlrmchlsmth opened this pull request 7 months ago
[Kernel] Allow 8-bit outputs for cutlass_scaled_mm
tlrmchlsmth opened this pull request 7 months ago
p
khluu opened this pull request 7 months ago
[Misc] Add CustomOp interface for device portability
WoosukKwon opened this pull request 7 months ago
[Bugfix] Fix `MultiprocessingGPUExecutor.check_health` when world_size == 1
jsato8094 opened this pull request 7 months ago
[CI/Build] Add `is_quant_method_supported` to control quantization test configurations
mgoin opened this pull request 7 months ago
[Speculative Decoding] Add `ProposerWorkerBase` abstract class
njhill opened this pull request 7 months ago
[Misc]: vllm ONLY allocate KVCache on the first device in CUDA_VISIBLE_DEVICES
CatYing opened this issue 7 months ago
how to compile with GLIBCXX_USE_CXX11_ABI=1
demonatic opened this issue 7 months ago
[BugFix]Fix the problem that StopChecker assumes a single token produ…
IcyFeather233 opened this pull request 7 months ago
[Kernel] Add back batch size 1536 and 3072 to MoE tuning
WoosukKwon opened this pull request 7 months ago
[CI/Build] Reducing CPU CI execution time
bigPYJ1151 opened this pull request 7 months ago
[Bug]: Tokenizer setter of LLM without CachedTokenizer adapter
DriverSong opened this issue 7 months ago
[Performance]: Speculative Performance almost same or lower
tolry418 opened this issue 7 months ago
[Kernel] Re-tune Mixtral MoE configurations for FP8 on H100
pcmoritz opened this pull request 7 months ago
[Frontend] Add OpenAI Vision API Support
ywang96 opened this pull request 7 months ago
[Bug]: LLM.generate() collapse with some padding side
kevin3314 opened this issue 7 months ago
[Bugfix] Add warmup for prefix caching example
zhuohan123 opened this pull request 7 months ago
[Feature]: Add efficient interface for evaluating probabilities of fixed prompt-completion pairs
xinyangz opened this issue 7 months ago
Bugfix: fix broken download of models from modelscope
liuyhwangyh opened this pull request 7 months ago
[Feature]: vllm-flash-attn cu118 compatibility
epark001 opened this issue 7 months ago
[Model] Correct Mixtral FP8 checkpoint loading
comaniac opened this pull request 7 months ago
[Core][Doc] Default to multiprocessing for single-node distributed case
njhill opened this pull request 7 months ago
[Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor
zifeitong opened this pull request 7 months ago
[Feature]: Custom attention masks
ojus1 opened this issue 7 months ago
[Usage]: How to start inference serving through `LLM` object
Jiayi-Pan opened this issue 7 months ago
[Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to False
zifeitong opened this pull request 7 months ago
v0.5.0 Release Tracker
simon-mo opened this issue 7 months ago
[Misc] Adding Speculative decoding to Throughput Benchmarking script
abhibambhaniya opened this pull request 7 months ago
[Usage]: RuntimeError: CUDA error: uncorrectable ECC error encountered
DJCoolDev opened this issue 7 months ago
[Doc]: Update the vllm distributed Inference and Serving with the new MultiprocessingGPUExecutor
rcarrata opened this issue 7 months ago
[Bug]: Mixtral-8x22 request cancelled by cancel scope when client sends multiple concurrent requests
markovalexander opened this issue 7 months ago
[Bug]: Mistral 7B crashes on NVidia Tesla P100 with a CUDA Error
oe3gwu opened this issue 7 months ago
Support W4A8 quantization for vllm
HandH1998 opened this pull request 7 months ago
[Bugfix] Support `prompt_logprobs==0`
toslunar opened this pull request 7 months ago
[CI/Build] Add inputs tests
DarkLight1337 opened this pull request 7 months ago
[Core] Registry for processing model inputs
DarkLight1337 opened this pull request 7 months ago
[Bug]: prompt_logprobs=0 raises AssertionError
toslunar opened this issue 7 months ago
[Installation]: Failed to build punica
asinglestep opened this issue 7 months ago
[Usage]: how to terminate a vllm model and free or release gpu memory
wellcasa opened this issue 7 months ago
[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend
afeldman-nm opened this pull request 7 months ago
[Feature]: Support for Mirostat, Dynamic Temperature, and Quadratic Sampling
Emmie411 opened this issue 7 months ago
[Bug]: VLLM_ATTENTION_BACKEND set to ROCM_FLASH only in GHA environment, overriding automatic backend selection; this breaks other kernel unit tests.
afeldman-nm opened this issue 7 months ago
[BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM
DriverSong opened this pull request 7 months ago
[Feature]: Option to override HuggingFace's configurations
DarkLight1337 opened this issue 7 months ago
[Feature]: inconsistent vocab_sizes support for draft and target workers while using Speculative Decoding
ShangmingCai opened this issue 7 months ago
[Feature]: Speculative edits
Muhtasham opened this issue 7 months ago
[Bug]: Issues with Applying LoRA in vllm on a T4 GPU
rikitomo opened this issue 7 months ago
[Bug]: Issues with Applying LoRA in vllm on a T4 GPU
rikioka-tomokazu opened this issue 7 months ago
[Frontend] Customizable RoPE theta
sasha0552 opened this pull request 7 months ago
push error
triple-Mu opened this pull request 7 months ago
[Usage]: how to use the gpu_cache_usage_perc as a custom metric in k8s HPA?
chakpongchung opened this issue 7 months ago
[Misc] Improve error message when LoRA parsing fails
DarkLight1337 opened this pull request 7 months ago
[Usage]: How can I deploy llama3-70b on a server with 8 3090 GPUs with lora and CUDA graph.
AlphaINF opened this issue 7 months ago
[Core] Support loading GGUF model
Isotr0py opened this pull request 7 months ago
[Bug]: loading squeezellm model
yuhuixu1993 opened this issue 7 months ago
[Model] Add PaliGemma
ywang96 opened this pull request 7 months ago
[Core][Prefix Caching] Fix hashing logic for non-full blocks
zhuohan123 opened this pull request 7 months ago
[Bugfix] [Frontend] vLLM api_server.py when using with prompt_token_ids causes error.
TikZSZ opened this pull request 7 months ago
[Bug]: vLLM api_server.py when using with prompt_token_ids causes error.
TikZSZ opened this issue 7 months ago
[Feature]: MoE kernels (Mixtral-8x22B-Instruct-v0.1) are not yet supported on CPU only ?
xxll88 opened this issue 7 months ago
[BugFix] Prevent `LLM.encode` for non-generation Models
robertgshaw2-neuralmagic opened this pull request 7 months ago
[Kernel] Switch fp8 layers to use the CUTLASS kernels
tlrmchlsmth opened this pull request 7 months ago
[Bug]: Offline Inference with the OpenAI Batch file format yields unnecessary `asyncio.exceptions.CancelledError`
jlcmoore opened this issue 7 months ago
[Bug]: The Offline Inference Embedding Example Fails
cuizhuyefei opened this issue 7 months ago
[Bugfix]: Fix issues related to prefix caching example (#5177)
Delviet opened this pull request 7 months ago
[Feature]: BERT models for embeddings
mevince opened this issue 7 months ago
[Model] LoRA support added for command-r
sergey-tinkoff opened this pull request 7 months ago
[Bug]: Incorrect Example for the Inference with Prefix
Delviet opened this issue 7 months ago
[Usage]: Prefix caching in VLLM
Abhinay2323 opened this issue 7 months ago
draft2
khluu opened this pull request 7 months ago
[Bugfix] Remove deprecated @abstractproperty
zhuohan123 opened this pull request 7 months ago
bug fixed: cuda out of memory lead to 'AsyncEngineDeadError: Background loop has errored already.
charent opened this pull request 7 months ago
Adding fp8 gemm computation
charlifu opened this pull request 7 months ago
[Doc] Add checkmark for GPTBigCodeForCausalLM LoRA support
njhill opened this pull request 7 months ago
[Bug]: Model Launch Hangs with 16+ Ranks in vLLM
wushidonguc opened this issue 7 months ago
[Bugfix] Fix illegal memory access for lora
sfc-gh-zhwang opened this pull request 7 months ago
[Build] Guard against older CUDA versions when building CUTLASS 3.x kernels
tlrmchlsmth opened this pull request 7 months ago
[Performance]: What can we learn from OctoAI
hmellor opened this issue 7 months ago
[Build] Do not compile cutlass scaled_mm on CUDA 11
simon-mo opened this pull request 7 months ago
[Bugfix] Fix KeyError: 1 When Using LoRA adapters
BlackBird-Coding opened this pull request 7 months ago
[Bug]: Unable to Use Prefix Caching in AsyncLLMEngine
kezouke opened this issue 7 months ago
[Bug]: WSL2(also Docker) 1 GPU work but 2 not,(--tensor-parallel-size 2 )
goodmaney opened this issue 7 months ago
[Bug]: Issue with Token Processing Efficiency and Key-Value Cache Utilization in AsyncLLMEngine
kezouke opened this issue 7 months ago
[Kernel] Pass a device pointer into the quantize kernel for the scales
tlrmchlsmth opened this pull request 7 months ago
[Core] Bump up the default of --gpu_memory_utilization to be more similar to TensorRT Triton's default
alexm-neuralmagic opened this pull request 7 months ago
[Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to reduce binary size
tlrmchlsmth opened this pull request 7 months ago
[Feature]: VLLM support for function calling in Mistral-7B-Instruct-v0.3
javierquin opened this issue 7 months ago
[Feature]: Linear adapter support for Mixtral
DhruvaBansal00 opened this issue 7 months ago
[Bug] [spec decode] [flash_attn]: CUDA illegal memory access when calling flash_attn_cuda.fwd_kvcache
khluu opened this issue 7 months ago
[Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py
dashanji opened this pull request 7 months ago
[Misc]: Should inference with temperature 0 generate the same results for a lora adapter and equivalent merged model?
rohan-daniscox opened this issue 7 months ago
[Bug]: torch.cuda.OutOfMemoryError: CUDA out of memory when handling inference requests
zhaotyer opened this issue 7 months ago
add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088
alexm-neuralmagic opened this pull request 7 months ago
[Kernel] Update Cutlass fp8 configs
varun-sundar-rabindranath opened this pull request 7 months ago