Logging & Checkpoint Output During Multi-node Training

Introduction

Multi-node training is a powerful technique for scaling up deep learning models to large datasets and complex tasks. However, it also introduces new challenges, such as managing logging and checkpoint output across multiple nodes. In this article, we will explore the expected behavior of logging and checkpoint output during multi-node training and provide guidance on how to configure these settings for optimal performance.

Logging in Multi-Node Training

When running a multi-node training job, you may notice that the training progress log is only visible on the master node. This is expected: the per-step metrics are printed only by the rank-0 process, which runs on the master node and coordinates the job, so the worker nodes appear silent even though they are training.

However, this can be a problem if you want to monitor progress on each node individually. The --logging_steps argument controls how often those metrics are logged; for example, --logging_steps 5 logs loss, learning rate, and related metrics every 5 steps. It does not change which node produces the log, so if you need per-node visibility you have to capture each node's own console output.
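
If you do want a record of what each node prints (NCCL warnings, stack traces, and so on), one simple, framework-agnostic option is to redirect each node's console output to its own file. A minimal sketch, assuming the launch command from the full example further below and a hypothetical per-node log file name:

# Run the same command on every node, changing NODE_RANK accordingly
# (dataset and the remaining arguments are as in the full example below).
# The per-step metrics appear only in the master node's file, but warnings
# and errors from this node are captured as well.
NODE_RANK=0   # set to 1 on the second node
NNODES=2 NODE_RANK=$NODE_RANK MASTER_ADDR=XXX.XXX.XXX MASTER_PORT=29500 \
swift sft \
    --model Qwen/Qwen2-Audio-7B-Instruct \
    --logging_steps 5 \
    --output_dir /path/to/output \
    2>&1 | tee "train_node${NODE_RANK}.log"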

Checkpoint Output in Multi-Node Training

Checkpoint output is another important aspect of multi-node training. When running a multi-node training job, you specify where checkpoints are written using the --output_dir argument. If you omit it, checkpoints typically land in a default directory under the current working directory, so it is usually worth setting it explicitly.

One common question is whether to set the output directory to a common storage location or to store the output locally on each node. The answer depends on your specific use case and requirements.

Common Storage Location

If you want to store the checkpoint output in a common location that is accessible from all nodes, you can set the output directory to a shared storage location, such as a network file system (NFS) or a distributed file system (DFS). This approach has several advantages, including:

  • Easy access: You can access the checkpoint output from any node in the cluster.
  • Improved collaboration: Multiple users can access and share the checkpoint output.
  • Better data management: You can manage the checkpoint output more efficiently using a centralized storage location.

However, this approach also has some disadvantages, including:

  • Performance overhead: Writing checkpoints over the network can be slower than writing to local disk, especially when large checkpoints are saved frequently or the storage server is far from the compute nodes.
  • Write coordination: You need to make sure only one process writes a given checkpoint file; most trainers handle this by saving only from the main process, but concurrent writes to a shared directory can otherwise corrupt a checkpoint.
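
As a concrete illustration, here is a minimal sketch of pointing the checkpoints at a shared mount. The path /mnt/shared/checkpoints/... used here is a hypothetical example and should be replaced with whatever shared filesystem your cluster actually provides:

# Hypothetical NFS/DFS mount, assumed to be available at the same path on
# every node, so the resulting checkpoints are readable from any of them.
SHARED_DIR=/mnt/shared/checkpoints/qwen2-audio-asr

# Launch as in the full example below, but write checkpoints to the shared
# mount (dataset and the remaining arguments are omitted here for brevity).
swift sft \
    --model Qwen/Qwen2-Audio-7B-Instruct \
    --save_steps 100 \
    --save_total_limit 2 \
    --output_dir "$SHARED_DIR"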

Local Storage on Each Node

Alternatively, you can store the checkpoint output locally on each node. This approach has several advantages, including:

  • Improved performance: Accessing local storage is generally faster than accessing a shared storage location.
  • No shared-filesystem contention: Each node writes to its own disk, so there is no network bottleneck and no risk of write conflicts on a shared directory.

However, this approach also has some disadvantages, including:

  • Limited access: You can only access the checkpoint output from the node where it is stored.
  • Data management: You need to manage the checkpoint output on each node individually.
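
The sketch below illustrates the local-storage variant with hypothetical paths. One caveat worth checking: trainers built on the Hugging Face Trainer usually save checkpoints only from the main process unless an option such as save_on_each_node is enabled, so verify how your framework behaves when the output directory is not shared.

# Hypothetical node-local path (e.g. a local SSD); the directory exists
# independently on every node and is not visible from the other nodes.
LOCAL_DIR=/local_ssd/checkpoints/qwen2-audio-asr
mkdir -p "$LOCAL_DIR"

# Launch as in the full example below, but write checkpoints locally
# (dataset and the remaining arguments are omitted here for brevity).
swift sft \
    --model Qwen/Qwen2-Audio-7B-Instruct \
    --save_steps 100 \
    --output_dir "$LOCAL_DIR"

# After training, collect the checkpoints from the node that wrote them onto
# the machine where you will evaluate or export ("node0" is a placeholder).
rsync -a "node0:$LOCAL_DIR/" ./collected_checkpoints/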

Configuring Checkpoint Output

To configure the checkpoint output, you can use the --output_dir argument to specify the output directory. For example, you can set --output_dir /path/to/output to store the checkpoint output in the specified directory.

You can also use the --save_steps argument to specify the frequency at which the checkpoint output is saved. For example, you can set --save_steps 100 to save the checkpoint output every 100 steps.

Example Use Case

Here is an example use case that demonstrates how to configure the checkpoint output for a multi-node training job:

nnodes=2
nproc_per_node=1

# Run the same command on every node; set NODE_RANK=1 on the second node.
NCCL_DEBUG=WARN \
NNODES=$nnodes \
NODE_RANK=0 \
MASTER_ADDR=XXX.XXX.XXX \
MASTER_PORT=29500 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model Qwen/Qwen2-Audio-7B-Instruct \
    --dataset 'speech_asr/speech_asr_aishell1_trainsets:validation#2000' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps $(expr 32 / $nproc_per_node / $nnodes) \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir /path/to/output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 1 \
    --gradient_checkpointing_kwargs '{"use_reentrant": false}'

In this example, we set the output directory to /path/to/output, save a checkpoint every 100 steps with --save_steps 100, and keep at most 2 checkpoints at a time with --save_total_limit 2.

Frequently Asked Questions

Q: What is the expected behavior of logging during multi-node training?

A: Only the rank-0 process, which runs on the master node, prints the per-step training metrics. The worker nodes train but do not print the progress log, so the log is only visible on the master node.

Q: Why is the logging output only visible on the master node?

A: Printing metrics from every rank would produce duplicated, interleaved output, so most frameworks log only from the rank-0 process on the master node, which coordinates the training run. The other ranks execute the same steps but stay quiet by default.

Q: How can I configure the logging output to be visible on each node?

A: --logging_steps only controls how often the master process logs; it does not enable the progress log on the worker nodes. If you need per-node visibility, capture each node's stdout/stderr (for example by redirecting it to a per-node file, as sketched earlier) and, for communication issues, raise diagnostics such as NCCL_DEBUG=INFO.

Q: What is the difference between a common storage location and local storage on each node?

A: A common storage location is a shared storage location that is accessible from all nodes in the cluster. Local storage on each node refers to the storage location on each individual node.

Q: What are the advantages and disadvantages of using a common storage location?

A: The advantages of a common storage location are easy access from any node, simpler collaboration, and centralized data management. The disadvantages are the performance overhead of writing over the network and the need to make sure only one process writes each checkpoint file.

Q: What are the advantages and disadvantages of using local storage on each node?

A: The advantages of local storage are faster writes and no contention on a shared filesystem. The disadvantages are that checkpoints end up scattered across nodes, so access is limited to the node that wrote them and you have to collect and manage them per node.

Q: How can I configure the checkpoint output to be stored in a common storage location?

A: Set --output_dir to a directory on the shared mount, for example --output_dir /path/to/shared/output, making sure that path resolves to the same shared filesystem (NFS/DFS) on every node.

Q: How can I configure the checkpoint output to be stored locally on each node?

A: Point --output_dir at a path on each node's local disk, for example --output_dir /path/to/local/output. Each node resolves that path on its own filesystem, so the checkpoints it writes stay on that node.

Q: What is the difference between --save_steps and --save_total_limit?

A: --save_steps specifies the frequency at which the checkpoint output is saved, while --save_total_limit specifies the maximum number of checkpoint outputs to save.

Q: How can I configure the checkpoint output to be saved every 100 steps?

A: Set --save_steps 100; a checkpoint is then written every 100 training steps.

Q: How can I keep only the 2 most recent checkpoints?

A: Set --save_total_limit 2. Older checkpoints are removed as new ones are saved, so at most 2 checkpoint directories are kept at any time.
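
For illustration, this is roughly what the output directory ends up looking like with --save_steps 100 and --save_total_limit 2. The exact layout can vary between frameworks and versions (some nest checkpoints under a run-specific subdirectory), but checkpoint-<step> naming is the usual convention:

# Illustrative listing only: after 400 steps, only the two most recent
# checkpoints remain because --save_total_limit 2 deletes older ones.
ls /path/to/output
# checkpoint-300  checkpoint-400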

Q: What is the purpose of --logging_steps?

A: --logging_steps sets how often (in steps) the training metrics are logged by the master process.

Q: What is the purpose of --save_steps?

A: The purpose of --save_steps is to specify the frequency at which the checkpoint output is saved.

Q: What is the purpose of --save_total_limit?

A: The purpose of --save_total_limit is to specify the maximum number of checkpoint outputs to save.

Q: How can I configure the checkpoint output to be saved in a specific directory?

A: Set --output_dir to the directory you want, for example --output_dir /path/to/output; all checkpoints for the run are written under it.

Q: How can I configure the checkpoint output to be saved with a specific name?

A: The arguments used above do not include an option for naming individual checkpoints; checkpoints are normally written to subdirectories of --output_dir named after the training step (typically checkpoint-<step>). The usual approach is to choose a descriptive --output_dir and, if you need a specific name, rename or copy the checkpoint directory after it is written.