[Bug] Memory Issue With --mem-fraction-static Parameter


Description:

We are experiencing a memory-related issue when setting the --mem-fraction-static parameter in the sglang framework. The problem arises when allocating memory for speculative decoding: depending on the value chosen, the server fails with either a CUDA out of memory error or a RuntimeError: Not enough memory. Please try to increase --mem-fraction-static error.

Setup:

  • GPU: NVIDIA L40s (48GB VRAM) x 2 (using 1 GPU when running)
  • CUDA Version: 12.8
  • PyTorch Version: 2.5.1
  • sglang Version: 0.4.3.post2

Errors:

  1. Run sglang with --mem-fraction-static 0.85 → CUDA OOM error occurs.
  2. Run sglang with --mem-fraction-static 0.84 → RuntimeError: Not enough memory. Please try to increase --mem-fraction-static error occurs.

Expected Behavior:

We expect the server to start and allocate memory successfully. Instead, every value we try lands on one of the two errors above: 0.85 over-commits and 0.84 under-commits, leaving no workable setting in between.
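Back-of-envelope arithmetic shows how narrow the workable window is. The figures below are estimates, assuming roughly 0.5 bytes per parameter for AWQ INT4 weights plus ~10% overhead for scales, zeros, and unquantized layers; they are not measured values:

```python
# Back-of-envelope memory budget for one 48 GB L40S (all figures are estimates).
GIB = 1024**3

total_vram = 48 * GIB
static_pool = 0.85 * total_vram          # what --mem-fraction-static 0.85 reserves

# A 70B-parameter model quantized to INT4 (AWQ) stores roughly 0.5 bytes/param,
# plus ~10% overhead (scales/zeros, unquantized layers) -- an assumption.
weights = int(70e9 * 0.5 * 1.10)

kv_headroom = static_pool - weights      # what is left inside the pool for KV cache
print(f"static pool:  {static_pool / GIB:.1f} GiB")
print(f"weights est.: {weights / GIB:.1f} GiB")
print(f"KV headroom:  {kv_headroom / GIB:.1f} GiB")
```

With only about 5 GiB of the static pool left for the KV cache and draft state, small changes to the fraction can plausibly flip between the pool being too small (the 0.84 error) and the leftover dynamic memory being too small (the 0.85 OOM).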

Reproduction:

# Only the imports actually used by this reproduction script are kept.
from sglang.test.test_utils import is_in_ci
from sglang.utils import wait_for_server

if is_in_ci():
    from sglang.docs.frontend.patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model /LLM/model/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path /LLM/model/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
    --speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 32 \
    --mem-fraction-static 0.85 --max-running-requests 2 --chunked-prefill-size 256 \
    --enable-torch-compile --cuda-graph-max-bs 2
"""
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
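Once wait_for_server returns, a single short request is a quick way to check that the server survives an actual decode. This sketch uses sglang's native /generate HTTP endpoint with a minimal sampling_params payload (stdlib only; adjust the fields if your sglang version expects a different schema):

```python
import json
import urllib.request

# Minimal smoke-test payload for sglang's native /generate endpoint.
payload = {
    "text": "The capital of France is",
    "sampling_params": {"max_new_tokens": 16, "temperature": 0.0},
}

def smoke_test(port: int) -> str:
    """POST one short generate request to the local sglang server.

    A single short request exercises prefill plus speculative decode
    without pushing the KV cache budget.
    """
    req = urllib.request.Request(
        f"http://localhost:{port}/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["text"]
```

If the server only crashes under load, calling smoke_test in a loop with --max-running-requests 2 worth of concurrency narrows down whether decode-time KV growth is the trigger.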

Environment:

  • Python: 3.11.11 (main, Dec 9 2024, 15:32:27) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)]

  • CUDA available: True

  • GPU 0,1: NVIDIA L40S

  • GPU 0,1 Compute Capability: 8.9

  • CUDA_HOME: /usr/local/cuda

  • NVCC: Cuda compilation tools, release 12.8, V12.8.61

  • CUDA Driver Version: 570.86.10

  • PyTorch: 2.5.1+cu124

  • sglang: 0.4.3.post2

  • sgl_kernel: 0.0.3.post6

  • flashinfer: 0.2.2.post1

  • triton: 3.1.0

  • transformers: 4.48.3

  • torchao: 0.9.0

  • numpy: 1.26.4

  • aiohttp: 3.11.13

  • fastapi: 0.115.9

  • hf_transfer: Module Not Found

  • huggingface_hub: 0.29.1

  • interegular: 0.3.3

  • modelscope: Module Not Found

  • orjson: 3.10.15

  • packaging: 24.2

  • psutil: 7.0.0

  • pydantic: 2.10.6

  • multipart: 0.0.20

  • zmq: 26.2.1

  • uvicorn: 0.34.0

  • uvloop: 0.21.0

  • vllm: 0.7.2

  • openai: 1.65.1

  • tiktoken: 0.9.0

  • anthropic: Module Not Found

  • litellm: Module Not Found

  • decord: 0.6.0

  • NVIDIA Topology:

            GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
    GPU0    X       SYS     SYS     0-23,48-71      0               N/A
    GPU1    SYS     X       NODE    24-47,72-95     1               N/A
    NIC0    SYS     NODE    X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_bond_0

ulimit soft: 262144

Workarounds:

  1. Tune --mem-fraction-static more finely: the flag accepts any float, so values between the two failing settings (e.g., 0.845) are worth trying.
  2. Reduce the speculative decoding budget: lower --speculative-num-draft-tokens (32 is large) or --speculative-num-steps so the draft tree needs less memory.
  3. Shrink non-static allocations: lower --cuda-graph-max-bs or drop --enable-torch-compile so less memory competes with the static pool.
  4. Use both GPUs: launch with tensor parallelism (--tp 2) so each L40S holds only half of the 70B weights. (Note: there are no --mem-fraction-dynamic or --mem-fraction-adaptive flags in sglang; --mem-fraction-static is the only knob of this kind.)
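As a concrete starting point, a launch command that trades speculative throughput for memory headroom might look like this. This is a sketch, not a verified-working configuration: it assumes both L40S cards are available (--tp 2 splits the weights across them), and the reduced --speculative-num-draft-tokens value is illustrative rather than tuned:

```shell
python3 -m sglang.launch_server \
    --model /LLM/model/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path /LLM/model/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
    --speculative-num-steps 3 --speculative-eagle-topk 4 \
    --speculative-num-draft-tokens 8 \
    --tp 2 \
    --mem-fraction-static 0.80 --max-running-requests 2 \
    --chunked-prefill-size 256 --cuda-graph-max-bs 2
```

With the weights split across two GPUs, the static fraction no longer has to be pushed to the edge, so a conservative 0.80 should leave room for CUDA graphs and the dynamic allocator.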

Conclusion:

The memory issue with the --mem-fraction-static parameter in the sglang framework is a complex problem that requires further investigation. We have provided some workarounds to help alleviate the issue, but a more comprehensive solution is needed to resolve the problem. We encourage the community to contribute to the discussion and provide insights to help resolve this issue.

FAQ:

Q: What is the memory issue with the --mem-fraction-static parameter?

A: When launching sglang with speculative decoding enabled, memory allocation fails for every --mem-fraction-static value tried: larger values trigger a CUDA out of memory error, while smaller values trigger a RuntimeError: Not enough memory error.

Q: What are the symptoms of the memory issue?

A: The symptoms of the memory issue include:

  • CUDA out of memory error
  • RuntimeError: Not enough memory error
  • Model crashes or freezes due to memory constraints

Q: What are the possible causes of the memory issue?

A: The possible causes of the memory issue include:

  • Insufficient memory allocation for speculative decoding
  • Inefficient memory usage by the model
  • Conflicting memory requirements between different components of the model

Q: How can I troubleshoot the memory issue?

A: To troubleshoot the memory issue, you can try the following steps:

  • Check the speculative decoding settings (--speculative-num-draft-tokens, --speculative-num-steps, --speculative-eagle-topk)
  • Verify how much VRAM is actually free before launch (other processes may hold memory)
  • Reduce non-static memory consumers such as --cuda-graph-max-bs
  • Run the model on a different GPU, or across both GPUs with --tp 2, to see if the issue persists
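Before launching, it helps to confirm how much VRAM is actually free, since other processes can silently shrink the budget. This sketch shells out to nvidia-smi using its standard query flags; the helper names (parse_free_mib, free_vram_mib) are hypothetical, not part of any library:

```python
import subprocess

def parse_free_mib(csv_output: str) -> list[int]:
    """Parse the output of
    nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
    which is one integer (MiB) per line, one line per GPU."""
    return [int(line.strip()) for line in csv_output.strip().splitlines()]

def free_vram_mib() -> list[int]:
    # Query free memory on every visible GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_free_mib(out)
```

On an idle 48 GB L40S the reported free memory should be close to the full card; a markedly lower number means something else is already holding VRAM.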

Q: What are the workarounds for the memory issue?

A: The workarounds for the memory issue include:

  • Tuning --mem-fraction-static between the two failing values (the flag accepts any float)
  • Reducing the speculative decoding budget (--speculative-num-draft-tokens, --speculative-num-steps)
  • Running the model across both GPUs with tensor parallelism (--tp 2)
  • Lowering non-static memory consumers such as --cuda-graph-max-bs

Q: How can I prevent the memory issue from occurring in the future?

A: To prevent the memory issue from occurring in the future, you can try the following steps:

  • Size --mem-fraction-static with headroom rather than pushing it to the largest value that boots
  • Match the speculative decoding budget to the VRAM actually available
  • Check free GPU memory before starting the server
  • Run large models across multiple GPUs so each device has sufficient memory

Q: What is the current status of the memory issue?

A: The memory issue with the --mem-fraction-static parameter is currently being investigated by the sglang community. We are working to identify the root cause of the issue and to develop a comprehensive solution to resolve it.

Q: How can I contribute to the resolution of the memory issue?

A: To contribute to the resolution of the memory issue, you can try the following steps:

  • Report any issues or bugs related to the memory issue
  • Provide feedback and suggestions on how to improve the memory allocation settings for speculative decoding
  • Contribute to the development of new features or improvements to the sglang framework
  • Participate in the discussion and provide insights to help resolve the issue

Q: Where can I find more information about the memory issue?

A: You can find more information in the sglang GitHub repository, including its issue tracker and the server arguments documentation.