[Bug] Memory Issue With --mem-fraction-static Parameter


Description:

We are experiencing a memory-related issue when setting the --mem-fraction-static parameter in the sglang framework. The problem arises when allocating memory for speculative decoding: depending on the value chosen, the server fails with either a CUDA out of memory error or a RuntimeError: Not enough memory. Please try to increase --mem-fraction-static error.

Setup:

  • GPU: NVIDIA L40s (48GB VRAM) x 2 (using 1 GPU when running)
  • CUDA Version: 12.8
  • PyTorch Version: 2.5.1
  • sglang Version: 0.4.3.post2

Errors:

  1. Run sglang with --mem-fraction-static 0.85 → CUDA OOM error occurs.
  2. Run sglang with --mem-fraction-static 0.84 → RuntimeError: Not enough memory. Please try to increase --mem-fraction-static error occurs.

Expected Behavior:

We expect the server to start and allocate memory successfully. Instead, every value we try lands on one of the two errors above: 0.85 over-commits and 0.84 under-commits, leaving no workable setting in between.
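Back-of-envelope arithmetic shows how narrow the workable window is. The figures below are estimates, assuming roughly 0.5 bytes per parameter for AWQ INT4 weights plus ~10% overhead for scales, zeros, and unquantized layers; they are not measured values:

```python
# Back-of-envelope memory budget for one 48 GB L40S (all figures are estimates).
GIB = 1024**3

total_vram = 48 * GIB
static_pool = 0.85 * total_vram          # what --mem-fraction-static 0.85 reserves

# A 70B-parameter model quantized to INT4 (AWQ) stores roughly 0.5 bytes/param,
# plus ~10% overhead (scales/zeros, unquantized layers) -- an assumption.
weights = int(70e9 * 0.5 * 1.10)

kv_headroom = static_pool - weights      # what is left inside the pool for KV cache
print(f"static pool:  {static_pool / GIB:.1f} GiB")
print(f"weights est.: {weights / GIB:.1f} GiB")
print(f"KV headroom:  {kv_headroom / GIB:.1f} GiB")
```

With only about 5 GiB of the static pool left for the KV cache and draft state, small changes to the fraction can plausibly flip between the pool being too small (the 0.84 error) and the leftover dynamic memory being too small (the 0.85 OOM).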

Reproduction:

# Only the imports actually used by this reproduction script are kept.
from sglang.test.test_utils import is_in_ci
from sglang.utils import wait_for_server

if is_in_ci():
    from sglang.docs.frontend.patch import launch_server_cmd
else:
    from sglang.utils import launch_server_cmd

server_process, port = launch_server_cmd(
    """
python3 -m sglang.launch_server --model /LLM/model/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path /LLM/model/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
    --speculative-num-steps 3 --speculative-eagle-topk 4 --speculative-num-draft-tokens 32 \
    --mem-fraction-static 0.85 --max-running-requests 2 --chunked-prefill-size 256 \
    --enable-torch-compile --cuda-graph-max-bs 2
"""
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")
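Once wait_for_server returns, a single short request is a quick way to check that the server survives an actual decode. This sketch uses sglang's native /generate HTTP endpoint with a minimal sampling_params payload (stdlib only; adjust the fields if your sglang version expects a different schema):

```python
import json
import urllib.request

# Minimal smoke-test payload for sglang's native /generate endpoint.
payload = {
    "text": "The capital of France is",
    "sampling_params": {"max_new_tokens": 16, "temperature": 0.0},
}

def smoke_test(port: int) -> str:
    """POST one short generate request to the local sglang server.

    A single short request exercises prefill plus speculative decode
    without pushing the KV cache budget.
    """
    req = urllib.request.Request(
        f"http://localhost:{port}/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["text"]
```

If the server only crashes under load, calling smoke_test in a loop with --max-running-requests 2 worth of concurrency narrows down whether decode-time KV growth is the trigger.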

Environment:

  • Python: 3.11.11 (main, Dec 9 2024, 15:32:27) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)]

  • CUDA available: True

  • GPU 0,1: NVIDIA L40S

  • GPU 0,1 Compute Capability: 8.9

  • CUDA_HOME: /usr/local/cuda

  • NVCC: Cuda compilation tools, release 12.8, V12.8.61

  • CUDA Driver Version: 570.86.10

  • PyTorch: 2.5.1+cu124

  • sglang: 0.4.3.post2

  • sgl_kernel: 0.0.3.post6

  • flashinfer: 0.2.2.post1

  • triton: 3.1.0

  • transformers: 4.48.3

  • torchao: 0.9.0

  • numpy: 1.26.4

  • aiohttp: 3.11.13

  • fastapi: 0.115.9

  • hf_transfer: Module Not Found

  • huggingface_hub: 0.29.1

  • interegular: 0.3.3

  • modelscope: Module Not Found

  • orjson: 3.10.15

  • packaging: 24.2

  • psutil: 7.0.0

  • pydantic: 2.10.6

  • multipart: 0.0.20

  • zmq: 26.2.1

  • uvicorn: 0.34.0

  • uvloop: 0.21.0

  • vllm: 0.7.2

  • openai: 1.65.1

  • tiktoken: 0.9.0

  • anthropic: Module Not Found

  • litellm: Module Not Found

  • decord: 0.6.0

  • NVIDIA Topology:

            GPU0    GPU1    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
    GPU0    X       SYS     SYS     0-23,48-71      0               N/A
    GPU1    SYS     X       NODE    24-47,72-95     1               N/A
    NIC0    SYS     NODE    X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_bond_0

ulimit soft: 262144

Workarounds:

  1. Tune --mem-fraction-static more finely: the flag accepts any float, so values between the two failing settings (e.g., 0.845) are worth trying.
  2. Reduce the speculative decoding budget: lower --speculative-num-draft-tokens (32 is large) or --speculative-num-steps so the draft tree needs less memory.
  3. Shrink non-static allocations: lower --cuda-graph-max-bs or drop --enable-torch-compile so less memory competes with the static pool.
  4. Use both GPUs: launch with tensor parallelism (--tp 2) so each L40S holds only half of the 70B weights. (Note: there are no --mem-fraction-dynamic or --mem-fraction-adaptive flags in sglang; --mem-fraction-static is the only knob of this kind.)
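As a concrete starting point, a launch command that trades speculative throughput for memory headroom might look like this. This is a sketch, not a verified-working configuration: it assumes both L40S cards are available (--tp 2 splits the weights across them), and the reduced --speculative-num-draft-tokens value is illustrative rather than tuned:

```shell
python3 -m sglang.launch_server \
    --model /LLM/model/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
    --speculative-algorithm EAGLE \
    --speculative-draft-model-path /LLM/model/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4 \
    --speculative-num-steps 3 --speculative-eagle-topk 4 \
    --speculative-num-draft-tokens 8 \
    --tp 2 \
    --mem-fraction-static 0.80 --max-running-requests 2 \
    --chunked-prefill-size 256 --cuda-graph-max-bs 2
```

With the weights split across two GPUs, the static fraction no longer has to be pushed to the edge, so a conservative 0.80 should leave room for CUDA graphs and the dynamic allocator.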

Conclusion:

The memory issue with the --mem-fraction-static parameter in the sglang framework is a complex problem that requires further investigation. We have provided some workarounds to help alleviate the issue, but a more comprehensive solution is needed to resolve the problem. We encourage the community to contribute to the discussion and provide insights to help resolve this issue.

FAQ:

Q: What is the memory issue with the --mem-fraction-static parameter?

A: When launching sglang with speculative decoding enabled, memory allocation fails for every --mem-fraction-static value tried: larger values trigger a CUDA out of memory error, while smaller values trigger a RuntimeError: Not enough memory error.

Q: What are the symptoms of the memory issue?

A: The symptoms of the memory issue include:

  • CUDA out of memory error
  • RuntimeError: Not enough memory error
  • Model crashes or freezes due to memory constraints

Q: What are the possible causes of the memory issue?

A: The possible causes of the memory issue include:

  • Insufficient memory allocation for speculative decoding
  • Inefficient memory usage by the model
  • Conflicting memory requirements between different components of the model

Q: How can I troubleshoot the memory issue?

A: To troubleshoot the memory issue, you can try the following steps:

  • Check the speculative decoding settings (--speculative-num-draft-tokens, --speculative-num-steps, --speculative-eagle-topk)
  • Verify how much VRAM is actually free before launch (other processes may hold memory)
  • Reduce non-static memory consumers such as --cuda-graph-max-bs
  • Run the model on a different GPU, or across both GPUs with --tp 2, to see if the issue persists
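Before launching, it helps to confirm how much VRAM is actually free, since other processes can silently shrink the budget. This sketch shells out to nvidia-smi using its standard query flags; the helper names (parse_free_mib, free_vram_mib) are hypothetical, not part of any library:

```python
import subprocess

def parse_free_mib(csv_output: str) -> list[int]:
    """Parse the output of
    nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
    which is one integer (MiB) per line, one line per GPU."""
    return [int(line.strip()) for line in csv_output.strip().splitlines()]

def free_vram_mib() -> list[int]:
    # Query free memory on every visible GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_free_mib(out)
```

On an idle 48 GB L40S the reported free memory should be close to the full card; a markedly lower number means something else is already holding VRAM.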

Q: What are the workarounds for the memory issue?

A: The workarounds for the memory issue include:

  • Tuning --mem-fraction-static between the two failing values (the flag accepts any float)
  • Reducing the speculative decoding budget (--speculative-num-draft-tokens, --speculative-num-steps)
  • Running the model across both GPUs with tensor parallelism (--tp 2)
  • Lowering non-static memory consumers such as --cuda-graph-max-bs

Q: How can I prevent the memory issue from occurring in the future?

A: To prevent the memory issue from occurring in the future, you can try the following steps:

  • Size --mem-fraction-static with headroom rather than pushing it to the largest value that boots
  • Match the speculative decoding budget to the VRAM actually available
  • Check free GPU memory before starting the server
  • Run large models across multiple GPUs so each device has sufficient memory

Q: What is the current status of the memory issue?

A: The memory issue with the --mem-fraction-static parameter is currently being investigated by the sglang community. We are working to identify the root cause of the issue and to develop a comprehensive solution to resolve it.

Q: How can I contribute to the resolution of the memory issue?

A: To contribute to the resolution of the memory issue, you can try the following steps:

  • Report any issues or bugs related to the memory issue
  • Provide feedback and suggestions on how to improve the memory allocation settings for speculative decoding
  • Contribute to the development of new features or improvements to the sglang framework
  • Participate in the discussion and provide insights to help resolve the issue

Q: Where can I find more information about the memory issue?

A: You can find more information in the sglang GitHub repository, including its issue tracker and the server arguments documentation.