DISABLED test_train_parity_multi_group_unshard_async_op (__main__.TestFullyShard1DTrainingCore)


Disabled Test: test_train_parity_multi_group_unshard_async_op (__main__.TestFullyShard1DTrainingCore)

The test_train_parity_multi_group_unshard_async_op test, defined in distributed/_composable/fsdp/test_fully_shard_training.py, has been disabled because it is failing in Continuous Integration (CI). It belongs to the TestFullyShard1DTrainingCore suite, which exercises the core 1D training path of PyTorch's fully_shard (FSDP) API; as the name suggests, this test checks training parity with multiple parameter groups when unshard is issued as an async op.

Platforms

The test is currently failing on the inductor platform.

Flakiness

Over the past 3 hours, the test has been flaky in 5 workflow(s), with 5 failures and 5 successes. Because it neither consistently passes nor consistently fails, the issue is harder to diagnose and fix.

Debugging Instructions

To debug this test, follow these steps:

  1. Open the recent samples link: visit the recent examples page to view the recent runs of the test.
  2. Open the workflow logs: follow the workflow logs linked above to view the detailed logs of those runs.
  3. Expand the Test step: click the Test step of the job so that it is expanded and its output is searchable.
  4. Grep for the test name: search for test_train_parity_multi_group_unshard_async_op in the logs to find the relevant snippets.

Sample Error Message

The test fails with a RuntimeError raised when a worker process does not finish within the harness timeout, as shown in the following sample error message:

Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 605, in wrapper
    self._join_processes(fn)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 845, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 899, in _check_return_codes
    raise RuntimeError(
RuntimeError: Process 0 terminated or timed out after 300.1126618385315 seconds

Important Note

DO NOT ASSUME THINGS ARE OKAY IF THE CI IS GREEN. CI now shields developers from flaky tests, so this test may appear to be passing even though it is not consistently working as expected.

Additional Information

The following individuals have been cc'd on this issue:

  • @H-Huang
  • @awgu
  • @kwen2501
  • @wanchaol
  • @fegin
  • @fduwjj
  • @wz337
  • @wconstab
  • @d4l3k
  • @c-p-i-o
  • @clee2000
  • @chauhang
  • @penguinwu

The test_train_parity_multi_group_unshard_async_op test is disabled due to its failures in CI. To debug it, follow the instructions above and investigate the relevant log snippets. If you have any questions or concerns, reach out to the individuals listed above.

Q&A: Disabled Test - test_train_parity_multi_group_unshard_async_op (__main__.TestFullyShard1DTrainingCore)

Q: What is the current status of the test_train_parity_multi_group_unshard_async_op test? A: The test is currently disabled due to its failure in Continuous Integration (CI) environments.

Q: Why is the test failing in CI environments? A: The test is flaky: over the past 3 hours it was flagged in 5 workflow(s), with 5 failures and 5 successes, so it neither consistently passes nor consistently fails.

Q: What is the impact of the test being disabled? A: The test will not run in CI, and its results will not be reported. This also means the functionality it covers is no longer exercised in CI, so regressions could land unnoticed.

Q: How can I debug the test? A: Follow the four steps in the Debugging Instructions section above: open the recent samples link, open the workflow logs, expand the Test step of the job, and grep for test_train_parity_multi_group_unshard_async_op in the logs.

Q: What is the sample error message for the test? A: A RuntimeError raised by the test harness when a worker process hangs: "Process 0 terminated or timed out after 300.1126618385315 seconds". The full traceback is shown in the Sample Error Message section above.

Q: What is the importance of not assuming things are okay if the CI is green? A: CI now shields developers from flaky tests, so a green run does not prove this test works; it may appear to pass even though it is not consistently working as expected.

Q: Who has been cc'd on this issue? A: The same individuals listed under Additional Information above: @H-Huang, @awgu, @kwen2501, @wanchaol, @fegin, @fduwjj, @wz337, @wconstab, @d4l3k, @c-p-i-o, @clee2000, @chauhang, and @penguinwu.

Q: What is the next step in resolving this issue? A: The next step is to investigate the relevant log snippets and determine the root cause of the issue. If you have any questions or concerns, please don't hesitate to reach out to the individuals listed above.