Ecosyste.ms: OpenCollective
An open API service for software projects hosted on Open Collective.
vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLMs).
Collective: https://opencollective.com/vllm
Host: opensource
Code: https://github.com/vllm-project/vllm
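For context on what the engine described above does, here is a minimal offline-inference sketch using vLLM's Python API; the model name and sampling values are illustrative placeholders, not taken from this page.

    # Minimal offline-inference sketch (placeholder model and sampling values).
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # any HF-compatible model id or local path
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    outputs = llm.generate(["Hello, my name is"], params)
    for out in outputs:
        print(out.prompt, out.outputs[0].text)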
vLLM multi-LoRA with embed_tokens and lm_head in adapter weights
github.com/vllm-project/vllm - germanjke opened this issue almost 1 year ago
OpenAI completions API with `echo=True` raises an error
github.com/vllm-project/vllm - seoyunYang opened this issue almost 1 year ago
Add Splitwise implementation to vLLM
github.com/vllm-project/vllm - aashaka opened this pull request about 1 year ago
Nvidia-H20 with nvcr.io/nvidia/pytorch:23.12-py3, CUBLAS error!
github.com/vllm-project/vllm - tohneecao opened this issue about 1 year ago
Multi GPU ROCm6 issues, and workarounds
github.com/vllm-project/vllm - BKitor opened this issue about 1 year ago
model continue conversation
github.com/vllm-project/vllm - andrey-genpracc opened this issue about 1 year ago
[Bug] `v0.3.0` produces garbage output when serving CodeLlama-70B on 4xA6000
github.com/vllm-project/vllm - ganler opened this issue about 1 year ago
ERROR: Fails to install in editable mode. "UserWarning: There are no .../x86_64-conda-linux-gnu-c++ version bounds defined for CUDA version 12.1"
github.com/vllm-project/vllm - KartikYZ opened this issue about 1 year ago
Add fused top-K softmax kernel for MoE
github.com/vllm-project/vllm - WoosukKwon opened this pull request about 1 year ago
GPTQ & AWQ Fused MOE
github.com/vllm-project/vllm - chu-tianxiang opened this pull request about 1 year ago
Llama Guard inconsistent output between HuggingFace's Transformers and vLLM
github.com/vllm-project/vllm - AmenRa opened this issue about 1 year ago
vLLM ignores my requests when I increase the number of concurrent requests
github.com/vllm-project/vllm - savannahfung opened this issue about 1 year ago
[Minor] More fix of test_cache.py CI test failure
github.com/vllm-project/vllm - LiuXiaoxuanPKU opened this pull request about 1 year ago
ImportError: /ramyapra/vllm/vllm/_C.cpython-310-x86_64-linux-gnu.so: undefined symbol:
github.com/vllm-project/vllm - ramyaprabhu-alt opened this issue about 1 year ago
How to increase the vLLM scheduler prompt limit?
github.com/vllm-project/vllm - hanswang1 opened this issue about 1 year ago
Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.
github.com/vllm-project/vllm - gty111 opened this issue about 1 year ago
Fix/async chat serving
github.com/vllm-project/vllm - schoennenbeck opened this pull request about 1 year ago
KV Cache usage is 0% for mistral model
github.com/vllm-project/vllm - nikhilshandilya opened this issue about 1 year ago
Seeking help: `Qwen-14B-Chat-Int4` ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
github.com/vllm-project/vllm - huangyunxin opened this issue about 1 year ago
OpenAIServingChat cannot be instantiated within a running event loop
github.com/vllm-project/vllm - schoennenbeck opened this issue about 1 year ago
Ray worker out of memory
github.com/vllm-project/vllm - tristan279 opened this issue about 1 year ago
IndexError when using Beam Search in Chat Completions
github.com/vllm-project/vllm - jamestwhedbee opened this issue about 1 year ago
ValueError: Total number of attention heads (52) must be divisible by tensor parallel size (8).
github.com/vllm-project/vllm - PolinaBokova opened this issue about 1 year ago
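The ValueError above encodes a hard constraint: the tensor parallel size must evenly divide the model's attention head count (52 in that report). A hedged sketch of picking a valid value when constructing the engine; the model name is a hypothetical placeholder.

    # Tensor parallelism splits each attention layer across GPUs, so the head count
    # must be divisible by tensor_parallel_size (for 52 heads: 1, 2, 4, 13, 26, or 52).
    from vllm import LLM

    llm = LLM(
        model="your-52-head-model",  # hypothetical placeholder
        tensor_parallel_size=4,      # 52 % 4 == 0, so this is valid; 8 is not
    )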
Question: Would a PR integrating ExLlamaV2 kernels with AWQ be accepted?
github.com/vllm-project/vllm - casper-hansen opened this issue about 1 year ago
Adding `/get_tokenizer` to api_server for easier lm-evaluation-harness integration.
github.com/vllm-project/vllm - AguirreNicolas opened this pull request about 1 year ago
Dockerfile: build-arg to punica kernel
github.com/vllm-project/vllm - AguirreNicolas opened this pull request about 1 year ago
Mixtral AWQ fails to work: asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fd214489990
github.com/vllm-project/vllm - pseudotensor opened this issue about 1 year ago
[RFC] Automatic Prefix Caching
github.com/vllm-project/vllm - zhuohan123 opened this issue about 1 year ago
How to get the logits of the first generated text?
github.com/vllm-project/vllm - Abigail61 opened this issue about 1 year ago
Speculative Decoding
github.com/vllm-project/vllm - ymwangg opened this pull request about 1 year ago
Add multi-LoRA support for more architectures
github.com/vllm-project/vllm - Yard1 opened this issue about 1 year ago
Combine multi-LoRA and quantization
github.com/vllm-project/vllm - Yard1 opened this issue about 1 year ago
Longer stop sequence not working in streaming mode
github.com/vllm-project/vllm - andrePankraz opened this issue about 1 year ago
Support for a production-grade inference server [Gunicorn vs Uvicorn]?
github.com/vllm-project/vllm - jalotra opened this issue about 1 year ago
GPU utilization decrease during long-term running
github.com/vllm-project/vllm - WrRan opened this issue about 1 year ago
CUDA out of memory error despite having enough memory
github.com/vllm-project/vllm - varonroy opened this issue about 1 year ago
Allow passing hf config args with openai server
github.com/vllm-project/vllm - Aakash-kaushik opened this issue about 1 year ago
OpenAI API server running, but "POST /v1/chat/completions HTTP/1.1" returns 404 Not Found
github.com/vllm-project/vllm - Nero10578 opened this issue about 1 year ago
`max_num_batched_tokens` and `max_num_seqs` values
github.com/vllm-project/vllm - isRambler opened this issue about 1 year ago
Aborted request without reason
github.com/vllm-project/vllm - erjieyong opened this issue about 1 year ago
vLLM Distributed Inference stuck when using multi-GPU
github.com/vllm-project/vllm - RathoreShubh opened this issue about 1 year ago
Add JSON format logging support
github.com/vllm-project/vllm - CatherineSue opened this pull request about 1 year ago
Can anyone get Qwen-14B-Chat-AWQ to work with vLLM/TP?
github.com/vllm-project/vllm - s-natsubori opened this issue about 1 year ago
ValueError: The model's max seq len (4096) is larger than the maximum number of tokens that can be stored in KV cache (3664). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
github.com/vllm-project/vllm - handsomelys opened this issue about 1 year ago
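The error message above already names the two relevant engine arguments; here is a minimal sketch of applying that advice when initializing the engine. The model name and values are placeholders, not taken from the issue.

    # Either give the KV cache more headroom or cap the context length so it
    # fits in the cache blocks that can actually be allocated.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-2-7b-chat-hf",  # placeholder model
        gpu_memory_utilization=0.95,            # default is 0.90; raises the KV-cache budget
        # max_model_len=3584,                   # alternatively, lower the max sequence length
    )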
top_k = 50 makes vLLM predictions align with Transformers
github.com/vllm-project/vllm - sfyumi opened this issue about 1 year ago
Multi-node serving with vLLM - Problems with Ray
github.com/vllm-project/vllm - vbucaj opened this issue about 1 year ago
Compute perplexity/logits for the prompt
github.com/vllm-project/vllm - dsmilkov opened this issue about 1 year ago
RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED about v0.2.7
github.com/vllm-project/vllm - cocovoc opened this issue about 1 year ago
AWQ compression of Llama 2 70B Chat gives bad results
github.com/vllm-project/vllm - fancyerii opened this issue about 1 year ago
vLLM on OpenShift/Kubernetes Manifests
github.com/vllm-project/vllm - WinsonSou opened this issue about 1 year ago
out of memory with mixtral AWQ
github.com/vllm-project/vllm - m0wer opened this issue about 1 year ago
Docs: Add Haystack integration details
github.com/vllm-project/vllm - bilgeyucel opened this pull request about 1 year ago
Could we support Fuyu-8B, a multimodal LLM?
github.com/vllm-project/vllm - leiwen83 opened this issue about 1 year ago
Recent vLLMs ask for too much memory: ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
github.com/vllm-project/vllm - pseudotensor opened this issue about 1 year ago
The output of vLLM is different from that of HF
github.com/vllm-project/vllm - will-wiki opened this issue about 1 year ago
[WIP] Speculative decoding using a draft model
github.com/vllm-project/vllm - cadedaniel opened this pull request about 1 year ago
Use LRU cache for CUDA Graphs
github.com/vllm-project/vllm - WoosukKwon opened this issue about 1 year ago
torch.cuda.OutOfMemoryError: CUDA out of memory
github.com/vllm-project/vllm - DenisStefanAndrei opened this issue about 1 year ago
argument 'tokens': 'NoneType' object cannot be converted to 'PyString'
github.com/vllm-project/vllm - xxm1668 opened this issue about 1 year ago
Unable to run any model with tensor_parallel_size>1 on AWS sagemaker notebooks
github.com/vllm-project/vllm - samarthsarin opened this issue about 1 year ago
Inquiry Regarding vLLM Support for Mac Metal API
github.com/vllm-project/vllm - yihong1120 opened this issue about 1 year ago
Implement Triton-based AWQ kernel
github.com/vllm-project/vllm - WoosukKwon opened this pull request about 1 year ago
Support VLM model and GPT4V API
github.com/vllm-project/vllm - xunfeng1980 opened this issue about 1 year ago
vLLM RayWorker process hangs when using the LLM engine
github.com/vllm-project/vllm - SuoSiFire opened this issue about 1 year ago
[FEATURE REQUEST] SparQ Attention
github.com/vllm-project/vllm - AlpinDale opened this issue about 1 year ago
ARM aarch64 server build failed (host OS: Ubuntu 22.04.3)
github.com/vllm-project/vllm - zhudy opened this issue about 1 year ago
Why is online serving slower than offline serving?
github.com/vllm-project/vllm - BangDaeng opened this issue about 1 year ago
I want to add mamba_chat (2.8b) model
github.com/vllm-project/vllm - SafeyahShemali opened this issue about 1 year ago
How to fix incomplete answers?
github.com/vllm-project/vllm - LuciAkirami opened this issue about 1 year ago
Repeated answer: when I use vLLM with OPT-13B, the generated text does not end until the max length, with the answer repeated
github.com/vllm-project/vllm - duihuhu opened this issue about 1 year ago
Error: Rayworkervllm cannot work well when using --tensor-parallel-size. Please help.
github.com/vllm-project/vllm - JenniePing opened this issue about 1 year ago
Can it support macOS? M2 chip.
github.com/vllm-project/vllm - znsoftm opened this issue about 1 year ago
Is there a way to terminate vllm.LLM and release the GPU memory
github.com/vllm-project/vllm - sfc-gh-zhwang opened this issue about 1 year ago
Support `tools` and `tool_choice` parameter in OpenAI compatible service
github.com/vllm-project/vllm - simon-mo opened this issue about 1 year ago
01-ai/Yi-34B-Chat never stops
github.com/vllm-project/vllm - pseudotensor opened this issue about 1 year ago
ModuleNotFoundError: No module named "vllm._C"
github.com/vllm-project/vllm - Kawai1Ace opened this issue about 1 year ago
Please help me solve the problem. thanks
github.com/vllm-project/vllm - CP3666 opened this issue about 1 year ago
Proposal: force type hint check with mypy
github.com/vllm-project/vllm - wangkuiyi opened this issue about 1 year ago
Batched inference outputs are not the same as single inference
github.com/vllm-project/vllm - gesanqiu opened this issue about 1 year ago
vllm always tries to download model from huggingface/modelscope even if I specify --download-dir with already downloaded models
github.com/vllm-project/vllm - davideuler opened this issue about 1 year ago
Add a worker registry service for hosting multiple vLLM models through a single API gateway
github.com/vllm-project/vllm - tjtanaa opened this issue about 1 year ago
How to use logits_processors
github.com/vllm-project/vllm - shuaiwang2022 opened this issue about 1 year ago
ImportError: libcudart.so.12
github.com/vllm-project/vllm - tranhoangnguyen03 opened this issue about 1 year ago
API causes slowdown in batch request handling
github.com/vllm-project/vllm - jpeig opened this issue about 1 year ago
Avoid re-initialize parallel groups
github.com/vllm-project/vllm - wangruohui opened this pull request about 1 year ago
[Feature] SYCL kernel support for Intel GPU
github.com/vllm-project/vllm - abhilash1910 opened this pull request about 1 year ago
Follow-up of #1687: when a safetensors model contains 0-rank tensors
github.com/vllm-project/vllm - twaka opened this pull request about 1 year ago
Plans to make the installation work on Windows without WSL?
github.com/vllm-project/vllm - alexandre-ist opened this issue about 1 year ago
usage of vllm for extracting embeddings
github.com/vllm-project/vllm - ra-MANUJ-an opened this issue about 1 year ago
Revert 1 docker build
github.com/vllm-project/vllm - wasertech opened this pull request about 1 year ago
When generating with baichuan-13b-chat via vLLM, many test inputs (of various lengths, none exceeding the length limit) produce only a single period, and some examples generate normally once a few words or sentences are removed. What could be the cause?
github.com/vllm-project/vllm - MrInouye opened this issue about 1 year ago