github.com/sgl-project/sglang issues | Ecosyste.ms: OpenCollective

[Bug] when llama-3.1-70b-instruct batch inference, CUDA memory usage is unusually large

yak9meat opened this issue 5 months ago

[Feature] Support TRI-ML/prismatic-vlms

Depetrol opened this issue 5 months ago

[RFC] Add an LLM engine

JianyuZhan opened this pull request 5 months ago

[FEAT] JSON constrained support

havetc opened this pull request 5 months ago

[Bug] I set `--host 0.0.0.0`, but it can't be called on another server

YinSonglin1997 opened this issue 5 months ago

[Feature] add disable_custom_all_reduce

Xu-Chen opened this issue 5 months ago

[Bug] After service, `torch.distributed.DistBackendError`

YinSonglin1997 opened this issue 5 months ago

[Bug] Failure to Dispatch Head Dimension 80 in sglang with Specific Configurations

hxer7963 opened this issue 5 months ago

[Feature] Do we have any plan for supporting Phi3V?

boqiny opened this issue 5 months ago

[Develop] Performance Improving Feature

yukavio opened this issue 5 months ago

[Bug] Low QPS for 1.2b model

lxww302 opened this issue 5 months ago

[Bug] Can't run Qwen2-57B-A14B-Instruct-GPTQ-Int4

xcxjack opened this issue 5 months ago

will triton kernels support cuda graph?

AlvL1225 opened this issue 5 months ago

[Bug] Always Watch Dog TimeOut

Rookie-Kai opened this issue 5 months ago

[Bug] cuda out of memory when using MQA and input_len=output_len=1024

lxww302 opened this issue 5 months ago

[Feature] Are there plans to implement a prefill-decode split inference architecture?

CSEEduanyu opened this issue 5 months ago

[Bug] nsys profile failed

zhangjun opened this issue 5 months ago

[Bug] T4 does not work

zhyncs opened this issue 5 months ago

[Feature] Support InternVL 2

luohao123 opened this issue 5 months ago

Sequence Parallel

ZYHowell opened this pull request 5 months ago

[Feature] Allow arbitrary logit processors

iiLaurens opened this issue 5 months ago

[Bug] OOM for concurrent long requests

hahmad2008 opened this issue 5 months ago

[Bug] Multinode Llama 3.1 405B fp8

matthew-hippocratic opened this issue 5 months ago

Torch.compile Performance Tracking

merrymercy opened this issue 5 months ago

[Bug] backend stuck at Prefill batch

sophiapeng90 opened this issue 5 months ago

[Feature] DeepSeek-Coder-V2-Instruct-FP8 on 8xA100

halexan opened this issue 5 months ago

[Feature] Add runtime/process cache to avoid booting server each time.

hnyls2002 opened this issue 5 months ago

feat: frequency, min_new_tokens, presence, and repetition penalties

vhain opened this pull request 5 months ago

Add skip_tokenizer_init args.

gryffindor-rr opened this pull request 5 months ago

[Bug] Multinode cannot be started on runpod

Desmond819 opened this issue 5 months ago

[Bug] pt_main_thread uses 100% cpu all the time

wizd opened this issue 5 months ago

[Bug] FlashInfer support for <=sm_75

horiacristescu opened this issue 5 months ago

Inference Llama3-70b has an AssertionError

Ikkyu321 opened this issue 5 months ago

[Feature] tokenizer_manager accept external tokenizer or skip tokenizer init

gryffindor-rr opened this issue 5 months ago

TTFT latency for long context (16K) is very high around 15 seconds for llama3.1 70b model. (same or worse than vLLM)

gkiri opened this issue 5 months ago

[Feature] Google TPU Support

RonanKMcGovern opened this issue 5 months ago

[Feature] Does sglang now support beam search

StevenZHB opened this issue 5 months ago

[Feature] Add a flag for computing the prompt's logprobs or not.

hnyls2002 opened this issue 5 months ago

[Bug] Error when running sglang.launch_server: cannot import name 'default_dump_dir' from 'triton.runtime.cache'

NoobPythoner opened this issue 5 months ago

run llama 3.1 405B with multi node has tp server error [Bug]

kinglion811 opened this issue 5 months ago

[Bug] AWQ Marlin does not work with Torch Compile

zhyncs opened this issue 5 months ago

RuntimeError: TopKTopPSamplingFromProbs failed with error code no kernel image is available for execution on the device (Killed) [Bug]

mayu123mayu opened this issue 5 months ago

[Feature] plan to support medusa?

CSEEduanyu opened this issue 5 months ago

[Bug] Multi-Node communication issue

dmakhervaks opened this issue 5 months ago

[Feature] RadixCache: remove recursive logic

hnyls2002 opened this issue 5 months ago

OPTIONS method is not supported when using sglang with the nextchat client

jjiwei opened this issue 6 months ago

[Feature] Frontend: be able to run generate super long text

xianbaoqian opened this issue 6 months ago

ROCM

BasDiaz opened this issue 6 months ago

[Feature] Generation Inputs: input_embeds

AlekseyKorshuk opened this issue 6 months ago

Initialization failed. warmup error:

bravelll opened this issue 6 months ago

Support for WebAssembly models

jaanli opened this issue 6 months ago

Development Roadmap (2024 Q3)

Ying1123 opened this issue 6 months ago

select() on first assistant token broken (in different ways in Mistral and Llama). Likely tokenization issue.

max99x opened this issue 6 months ago

`model_override_args` with server

ValeKnappich opened this issue 6 months ago

Add a HuggingFace backend

cloneofsimo opened this issue 6 months ago

Function calling for OpenAI backend

Yiyun-Liang opened this pull request 6 months ago

Add Support to Florence-2

KaifAhmad1 opened this issue 6 months ago

Will speculative decoding be supported?

arunpatala opened this issue 7 months ago

Llava CUDA error: device-side assert triggered

dmilcevski opened this issue 7 months ago

[Bug]: Random model output using sglang backend server

PanJason opened this issue 7 months ago

SG-Lang Runtime Stuck Launching in Docker Container

schopra8 opened this issue 7 months ago

Qwen 2 7B not working

sudarshan-kamath opened this issue 7 months ago

Does llava-next-video deploy only focus on first frames?

LetheRiver0 opened this issue 7 months ago

Unable to load 72b llava qwen on 8*A100 40GB

jeffhernandez1995 opened this issue 7 months ago

remove redundant pad_input_ids function

amosyou opened this pull request 7 months ago

llava-next-video inference result is empty

AmazDeng opened this issue 7 months ago

no longer can load 72b llava qwen on 4*H100 80GB

pseudotensor opened this issue 8 months ago

Invalid API key

pseudotensor opened this issue 8 months ago

Trace OpenAI backend usage

Ying1123 opened this issue 8 months ago

Regex generation causes 37x lower performance

Gintasz opened this issue 8 months ago

OOM CUDA error on 8 * L4 machine when launching sglang server

mounamokaddem opened this issue 8 months ago

Llama-3 regex generation can get stuck in infinite generation beyond max_tokens and crash server (reproduction example)

Gintasz opened this issue 8 months ago

Please add Phi3 support

Curiosity007 opened this issue 8 months ago

no batch run when using openai's format for calling.

xjw00654 opened this issue 8 months ago

ImportError: cannot import name 'function' from partially initialized module 'sglang'

lambda7xx opened this issue 9 months ago

ImportError: cannot import name 'get_cuda_stream' from 'triton.runtime.jit' In triton-nightly(V100)

nenomigami opened this issue 9 months ago

Add Default Timeout to urllib.request.urlopen Calls to Prevent Potential Hanging

alessiodallapiazza opened this issue 10 months ago

Allow OPTIONS Method on Http Server and add Cors headers.

kseyhan opened this issue 10 months ago

Supports the InternVL multimodal large model

exceedzhang opened this issue 10 months ago

Openrouter usage example