sgl-project/sglang

[Bug] sglang.launch_server error

Opened this issue · 1 comment

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if your issue lacks environment info and a minimal reproducible demo, it will be hard for us to reproduce and resolve it, reducing the likelihood of feedback.
  • 4. If the issue you raised is not a bug but a question, please open a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English; otherwise, the issue will be closed.

Describe the bug

(sglang) aluo@titan:~/sglang$ python -m sglang.launch_server --model-path /scratch3/data/Meta-Llama-3.1-8B-Instruct/ --enable-torch-compile --disable-radix-cache
server_args=ServerArgs(model_path='/scratch3/data/Meta-Llama-3.1-8B-Instruct/', tokenizer_path='/scratch3/data/Meta-Llama-3.1-8B-Instruct/', tokenizer_mode='auto', load_format='auto', dtype='auto', trust_remote_code=False, context_length=None, quantization=None, chat_template=None, host='127.0.0.1', port=30000, additional_ports=[30001, 30002, 30003, 30004], mem_fraction_static=0.88, max_prefill_tokens=None, max_running_requests=None, max_num_reqs=None, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=1059195386, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key='', file_storage_pth='SGlang_storage', dp_size=1, load_balance_method='round_robin', chunked_prefill_size=None, disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=True, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_disk_cache=False, enable_torch_compile=True, enable_p2p_check=False, attention_reduce_in_fp32=False, efficient_weight_load=False, nccl_init_addr=None, nnodes=1, node_rank=None)
[gpu=0] Init nccl begin.
[gpu=0] Load weight begin. avail mem=78.69 GB
Initialization failed. controller_init_state: Traceback (most recent call last):
  File "/home/aluo/sglang/python/sglang/srt/managers/controller_single.py", line 150, in start_controller_process
    controller = ControllerSingle(
  File "/home/aluo/sglang/python/sglang/srt/managers/controller_single.py", line 84, in __init__
    self.tp_server = ModelTpServer(
  File "/home/aluo/sglang/python/sglang/srt/managers/tp_worker.py", line 91, in __init__
    self.model_runner = ModelRunner(
  File "/home/aluo/sglang/python/sglang/srt/model_executor/model_runner.py", line 123, in __init__
    self.load_model()
  File "/home/aluo/sglang/python/sglang/srt/model_executor/model_runner.py", line 170, in load_model
    self.model = get_model(
  File "/home/aluo/miniconda3/envs/sglang/lib/python3.9/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
    return loader.load_model(model_config=model_config,
  File "/home/aluo/miniconda3/envs/sglang/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py", line 267, in load_model
    model = _initialize_model(model_config, self.load_config,
  File "/home/aluo/miniconda3/envs/sglang/lib/python3.9/site-packages/vllm/model_executor/model_loader/loader.py", line 104, in _initialize_model
    return model_class(config=model_config.hf_config,
  File "/home/aluo/sglang/python/sglang/srt/models/llama2.py", line 318, in __init__
    self.model = LlamaModel(config, quant_config=quant_config)
  File "/home/aluo/sglang/python/sglang/srt/models/llama2.py", line 257, in __init__
    [
  File "/home/aluo/sglang/python/sglang/srt/models/llama2.py", line 258, in <listcomp>
    LlamaDecoderLayer(
  File "/home/aluo/sglang/python/sglang/srt/models/llama2.py", line 192, in __init__
    self.self_attn = LlamaAttention(
  File "/home/aluo/sglang/python/sglang/srt/models/llama2.py", line 125, in __init__
    self.qkv_proj = QKVParallelLinear(
TypeError: __init__() got an unexpected keyword argument 'prefix'
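For context, this kind of TypeError usually points to a version mismatch between sglang's model code and the installed vllm: one side passes a `prefix` keyword that the other side's constructor does not accept. The sketch below is a hypothetical stand-in (the class `OldLinear` and the helper `construct_compat` are not from sglang or vllm); it reproduces the failure mode and shows one defensive pattern for bridging such signature differences.

```python
import inspect

class OldLinear:
    """Hypothetical stand-in for a layer whose __init__ predates the
    'prefix' keyword argument used by newer callers."""

    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features

# A caller written against the newer API passes prefix=... and fails,
# producing the same shape of error as in the traceback above:
try:
    OldLinear(1024, 3072, prefix="model.layers.0.self_attn.qkv_proj")
except TypeError as exc:
    print(exc)  # ... got an unexpected keyword argument 'prefix'

def construct_compat(cls, *args, **kwargs):
    """Drop keyword arguments the target constructor does not accept.
    This is a workaround sketch, not how sglang fixed the issue."""
    accepted = inspect.signature(cls.__init__).parameters
    filtered = {k: v for k, v in kwargs.items() if k in accepted}
    return cls(*args, **filtered)

# 'prefix' is silently discarded because OldLinear does not accept it.
layer = construct_compat(OldLinear, 1024, 3072, prefix="ignored")
print(layer.out_features)  # 3072
```

In practice the cleaner fix is aligning the sglang and vllm versions (as the reply below suggests) rather than filtering kwargs at call sites.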

Reproduction

(sglang) aluo@titan:~/sglang$ python -m sglang.launch_server --model-path /scratch3/data/Meta-Llama-3.1-8B-Instruct/ --enable-torch-compile --disable-radix-cache

Environment

Python: 3.9.19 (main, May 6 2024, 19:43:03) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H100 80GB HBM3
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.5, V12.5.82
CUDA Driver Version: 555.42.06
PyTorch: 2.3.1+cu121
sglang: 0.2.7
flashinfer: 0.1.4+cu121torch2.3
requests: 2.32.3
tqdm: 4.66.5
numpy: 1.26.4
aiohttp: 3.10.3
fastapi: 0.112.0
hf_transfer: 0.1.8
huggingface_hub: 0.24.5
interegular: 0.3.3
packaging: 24.1
PIL: 10.4.0
psutil: 6.0.0
pydantic: 2.8.2
uvicorn: 0.30.5
uvloop: 0.19.0
zmq: 26.1.0
vllm: 0.5.2
openai: 1.40.6
anthropic: 0.33.1
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 0-95,192-287 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 0-95,192-287 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 0-95,192-287 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 0-95,192-287 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 96-191,288-383 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 96-191,288-383 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 96-191,288-383 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X 96-191,288-383 1 N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

ulimit soft: 1024

Please try v0.3.0