[Bug] Memory Issue With --mem-fraction-static Parameter
Description:
We are experiencing a memory-related issue when setting the --mem-fraction-static parameter in the sglang framework. The problem arises when trying to allocate memory for speculative decoding, resulting in a CUDA out of memory error or a RuntimeError: Not enough memory error, depending on the value used.
Setup:
- GPU: NVIDIA L40s (48GB VRAM) x 2 (using 1 GPU when running)
- CUDA Version: 12.8
- PyTorch Version: 2.5.1
- sglang Version: 0.4.3.post2
Errors:
- Run sglang with --mem-fraction-static 0.85 → CUDA OOM error occurs.
- Run sglang with --mem-fraction-static 0.84 → RuntimeError: Not enough memory. Please try to increase --mem-fraction-static error occurs.
Expected Behavior:
We expect the server to start and allocate memory properly instead of being wedged between these two errors: 0.85 is too much for the GPU, while 0.84 is reported as not enough.
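As we understand it, --mem-fraction-static sets the fraction of total GPU memory that sglang reserves up front for model weights and the KV cache. Below is a minimal sketch (plain PyTorch, independent of sglang internals; the arithmetic is illustrative only) of roughly how much each setting would try to claim on the 48 GB L40S:
import torch

# Free and total memory on GPU 0, in bytes.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)

for fraction in (0.84, 0.85):
    # Rough size of the pool sglang would reserve for weights + KV cache.
    pool_gib = fraction * total_bytes / 1024**3
    print(f"--mem-fraction-static {fraction}: ~{pool_gib:.1f} GiB "
          f"of {total_bytes / 1024**3:.1f} GiB total "
          f"({free_bytes / 1024**3:.1f} GiB currently free)")
Given that the INT4 70B weights alone are on the order of 35-40 GB, the margin between "too much" and "not enough" is very narrow on a single 48 GB card.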
Reproduction:
import requests
import os
from sglang import assistant_begin, assistant_end
from sglang import assistant, function, gen, system, user
from sglang import image
from sglang import RuntimeEndpoint, set_default_backend
from sglang.srt.utils import load_image
from sglang.test.test_utils import is_in_ci
from sglang.utils import print_highlight, terminate_process, wait_for_server
if is_in_ci():
from sglang.docs.frontend.patch import launch_server_cmd
else:
from sglang.utils import launch_server_cmd
# Launch the server with EAGLE speculative decoding on the AWQ INT4 model.
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model /LLM/model/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path /LLM/model/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
    --speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 32 \
    --mem-fraction-static 0.85 --max-running-requests 2 --chunked-prefill-size 256 \
    --enable-torch-compile --cuda-graph-max-bs 2
"""
)
wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
Environment:
- Python: 3.11.11 (main, Dec 9 2024, 15:32:27) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)]
- CUDA available: True
- GPU 0,1: NVIDIA L40S
- GPU 0,1 Compute Capability: 8.9
- CUDA_HOME: /usr/local/cuda
- NVCC: Cuda compilation tools, release 12.8, V12.8.61
- CUDA Driver Version: 570.86.10
- PyTorch: 2.5.1+cu124
- sglang: 0.4.3.post2
- sgl_kernel: 0.0.3.post6
- flashinfer: 0.2.2.post1
- triton: 3.1.0
- transformers: 4.48.3
- torchao: 0.9.0
- numpy: 1.26.4
- aiohttp: 3.11.13
- fastapi: 0.115.9
- hf_transfer: Module Not Found
- huggingface_hub: 0.29.1
- interegular: 0.3.3
- modelscope: Module Not Found
- orjson: 3.10.15
- packaging: 24.2
- psutil: 7.0.0
- pydantic: 2.10.6
- multipart: 0.0.20
- zmq: 26.2.1
- uvicorn: 0.34.0
- uvloop: 0.21.0
- vllm: 0.7.2
- openai: 1.65.1
- tiktoken: 0.9.0
- anthropic: Module Not Found
- litellm: Module Not Found
- decord: 0.6.0
- NVIDIA Topology:
        GPU0  GPU1  NIC0  CPU Affinity   NUMA Affinity  GPU NUMA ID
  GPU0   X    SYS   SYS   0-23,48-71     0              N/A
  GPU1  SYS    X    NODE  24-47,72-95    1              N/A
  NIC0  SYS   NODE   X
  Legend:
    X    = Self
    SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
    NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
    PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
    PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
    PIX  = Connection traversing at most a single PCIe bridge
    NV#  = Connection traversing a bonded set of # NVLinks
  NIC Legend:
    NIC0: mlx5_bond_0
- ulimit soft: 262144
Workarounds:
- Increase the --mem-fraction-static value: try increasing the value to allocate more memory for speculative decoding, as the second error message suggests (in our case, 0.85 already overflows the GPU).
- Use a different GPU: try running the model on a different GPU to see if the issue persists.
- Optimize the model: reduce its memory requirements, for example by shrinking the speculative-decoding budgets (a sketch with reduced settings follows this list).
- Use a different memory allocation strategy: try an alternative such as --mem-fraction-dynamic or --mem-fraction-adaptive, if your sglang version provides one.
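As referenced above, here is an untested sketch of a launch command with smaller speculative-decoding budgets (fewer draft tokens, lower EAGLE top-k, no torch.compile) and a lower static fraction; the exact values are guesses, not a verified fix:
# Variant of the reproduction command with reduced memory pressure (untested).
server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model /LLM/model/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path /LLM/model/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
    --speculative-num-steps 3 --speculative-eagle-topk 2 --speculative-num-draft-tokens 8 \
    --mem-fraction-static 0.80 --max-running-requests 2 --chunked-prefill-size 256 \
    --cuda-graph-max-bs 2
"""
)
wait_for_server(f"http://localhost:{port}")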
Conclusion:
The memory issue with the --mem-fraction-static parameter in the sglang framework is a complex problem that requires further investigation. We have provided some workarounds that may alleviate the issue, but a more comprehensive solution is needed. We encourage the community to contribute to the discussion and provide insights to help resolve this issue.
Additional Information:
- Related Issues: Issue 1, Issue 2
- Documentation: sglang Documentation
- Community: sglang Community
FAQ:
Q: What is the memory issue with the --mem-fraction-static parameter?
A: The memory issue with the --mem-fraction-static parameter is a problem that occurs when trying to allocate memory for speculative decoding in the sglang framework. This can result in a CUDA out of memory error or a RuntimeError: Not enough memory error.
Q: What are the symptoms of the memory issue?
A: The symptoms of the memory issue include:
- CUDA out of memory error
- RuntimeError: Not enough memory error
- Model crashes or freezes due to memory constraints
Q: What are the possible causes of the memory issue?
A: The possible causes of the memory issue include:
- Insufficient memory allocation for speculative decoding
- Inefficient memory usage by the model
- Conflicting memory requirements between different components of the model
Q: How can I troubleshoot the memory issue?
A: To troubleshoot the memory issue, you can try the following steps:
- Check the memory allocation settings for speculative decoding (a simple nvidia-smi polling sketch follows this list)
- Optimize the model to reduce its memory requirements
- Use a different memory allocation strategy, such as --mem-fraction-dynamic or --mem-fraction-adaptive
- Run the model on a different GPU to see if the issue persists
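For the first troubleshooting step, a simple way to watch how close the allocation gets to the 48 GiB limit is to poll nvidia-smi while the server initializes:
import subprocess
import time

# Print per-GPU memory usage every 2 seconds for about a minute.
for _ in range(30):
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print(result.stdout.strip())
    time.sleep(2)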
Q: What are the workarounds for the memory issue?
A: The workarounds for the memory issue include:
- Increasing the --mem-fraction-static value to allocate more memory for speculative decoding
- Using a different GPU to run the model
- Optimizing the model to reduce its memory requirements
- Using a different memory allocation strategy
Q: How can I prevent the memory issue from occurring in the future?
A: To prevent the memory issue from occurring in the future, you can try the following steps:
- Regularly check the memory allocation settings for speculative decoding
- Optimize the model to reduce its memory requirements
- Use a different memory allocation strategy, such as --mem-fraction-dynamic or --mem-fraction-adaptive
- Run the model on a different GPU to ensure that it has sufficient memory resources
Q: What is the current status of the memory issue?
A: The memory issue with the --mem-fraction-static parameter is currently being investigated by the sglang community. We are working to identify the root cause of the issue and to develop a comprehensive solution to resolve it.
Q: How can I contribute to the resolution of the memory issue?
A: To contribute to the resolution of the memory issue, you can try the following steps:
- Report any issues or bugs related to the memory issue
- Provide feedback and suggestions on how to improve the memory allocation settings for speculative decoding
- Contribute to the development of new features or improvements to the sglang framework
- Participate in the discussion and provide insights to help resolve the issue
Q: Where can I find more information about the memory issue?
A: You can find more information about the memory issue on the following resources:
- sglang documentation: sglang Documentation
- sglang community: sglang Community
- Related issues: Issue 1, Issue 2