[2024.2.6] FullScanAggregateEvent Got 0 Responses From CL=1 During Rolling Restart
Issue Description
During the disrupt_rolling_restart_cluster test, a FullScanAggregateEvent error occurred: the FullScanAggregatesOperation failed with a ReadTimeout. The error message indicates that the coordinator node timed out waiting for replica responses, receiving 0 of the 1 response required at CL=ONE.
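The consistency-level arithmetic behind the message is simple: CL=ONE requires one replica response, and the coordinator collected zero before its timeout expired. A minimal sketch of that check (a simplified model for illustration, not Scylla's actual read path):

```python
# Simplified model of the coordinator's consistency check: a read at a
# given consistency level succeeds only if enough replicas respond in time.
def responses_required(consistency: str, replication_factor: int = 3) -> int:
    """Number of replica responses the coordinator must collect."""
    return {
        "ONE": 1,
        "TWO": 2,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }[consistency]

def read_outcome(received: int, consistency: str) -> str:
    required = responses_required(consistency)
    if received >= required:
        return "success"
    # Mirrors the shape of the reported error: 0 responses from 1 CL=ONE.
    return f"ReadTimeout: received only {received} responses from {required} CL={consistency}"

print(read_outcome(0, "ONE"))
```

With a node down mid-restart and the scan token range owned by that node, even the minimal CL=ONE requirement can go unmet within the timeout window.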
Impact
This issue causes the FullScanAggregatesOperation to fail, so the full-scan aggregate query does not complete during the rolling restart. Since this is a read timeout rather than a write failure, it primarily indicates reduced read availability while a node is restarting; whether it also masks a deeper correctness problem requires investigation.
How Frequently Does It Reproduce?
The issue does not reproduce frequently, but it is still worth addressing. The exact reproduction frequency is unknown, and further investigation is required to determine the root cause.
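If the timeout is a transient side effect of a replica being mid-restart, a retry with backoff on the test side would separate a one-off blip from a persistent failure. A minimal sketch; `ReadTimeoutError` and the `run_aggregate` callable are placeholders for illustration, not SCT or driver APIs:

```python
import time

class ReadTimeoutError(Exception):
    """Placeholder for the driver's read-timeout exception."""

def run_with_retries(run_aggregate, attempts=3, base_delay=0.1):
    """Retry a full-scan aggregate, backing off exponentially between tries."""
    for attempt in range(attempts):
        try:
            return run_aggregate()
        except ReadTimeoutError:
            if attempt == attempts - 1:
                raise  # persistent failure: surface it to the test
            time.sleep(base_delay * 2 ** attempt)
```

A single timeout during a rolling restart would then be tolerated, while repeated timeouts would still fail the run and flag a genuine availability problem.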
Installation Details
The test was run with the following configuration (the configured cluster size is 6 nodes, though 9 nodes appear in the node list below, likely added by nemeses during the run):
- Cluster size: 6 nodes (n2-highmem-16)
- Scylla Nodes used in this run:
- longevity-10gb-3h-2024-2-db-node-010f8f2c-0-9 (34.139.180.132 | 10.142.0.122) (shards: 14)
- longevity-10gb-3h-2024-2-db-node-010f8f2c-0-8 (34.148.229.223 | 10.142.0.30) (shards: 14)
- longevity-10gb-3h-2024-2-db-node-010f8f2c-0-7 (35.196.20.19 | 10.142.0.6) (shards: 14)
- longevity-10gb-3h-2024-2-db-node-010f8f2c-0-6 (34.138.13.253 | 10.142.0.117) (shards: 14)
- longevity-10gb-3h-2024-2-db-node-010f8f2c-0-5 (35.185.87.183 | 10.142.0.115) (shards: 14)
- longevity-10gb-3h-2024-2-db-node-010f8f2c-0-4 (34.23.32.239 | 10.142.0.104) (shards: 14)
- longevity-10gb-3h-2024-2-db-node-010f8f2c-0-3 (35.196.17.144 | 10.142.0.53) (shards: 14)
- longevity-10gb-3h-2024-2-db-node-010f8f2c-0-2 (35.227.117.58 | 10.142.0.36) (shards: 14)
- longevity-10gb-3h-2024-2-db-node-010f8f2c-0-1 (35.196.50.184 | 10.142.0.35) (shards: 14)
- OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/7869686360737388405 (gce: undefined_region)
- Test: longevity-10gb-3h-gce-test
- Test id: 010f8f2c-57f2-46f1-aa46-d7ff7e587117
- Test name: enterprise-2024.2/longevity/longevity-10gb-3h-gce-test
- Test method: longevity_test.LongevityTest.test_custom_time
- Test config file(s):
Logs and Commands
The following logs and commands are available:
- Restore Monitor Stack command:
$ hydra investigate show-monitor 010f8f2c-57f2-46f1-aa46-d7ff7e587117
- Restore monitor on AWS instance using Jenkins job
- Show all stored logs command:
$ hydra investigate show-logs 010f8f2c-57f2-46f1-aa46-d7ff7e587117
Logs
The following logs are available:
- longevity-10gb-3h-2024-2-db-node-010f8f2c-0-5 - https://cloudius-jenkins-test.s3.amazonaws.com/010f8f2c-57f2-46f1-aa46-d7ff7e587117/20250311_215756/longevity-10gb-3h-2024-2-db-node-010f8f2c-0-5-010f8f2c.tar.gz
- longevity-10gb-3h-2024-2-db-node-010f8f2c-0-6 - https://cloudius-jenkins-test.s3.amazonaws.com/010f8f2c-57f2-46f1-aa46-d7ff7e587117/20250311_215756/longevity-10gb-3h-2024-2-db-node-010f8f2c-0-6-010f8f2c.tar.gz
- db-cluster-010f8f2c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/010f8f2c-57f2-46f1-aa46-d7ff7e587117/20250312_012952/db-cluster-010f8f2c.tar.gz
- sct-runner-events-010f8f2c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/010f8f2c-57f2-46f1-aa46-d7ff7e587117/20250312_012952/sct-runner-events-010f8f2c.tar.gz
- sct-010f8f2c.log.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/010f8f2c-57f2-46f1-aa46-d7ff7e587117/20250312_012952/sct-010f8f2c.log.tar.gz
- loader-set-010f8f2c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/010f8f2c-57f2-46f1-aa46-d7ff7e587117/20250312_012952/loader-set-010f8f2c.tar.gz
- monitor-set-010f8f2c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/010f8f2c-57f2-46f1-aa46-d7ff7e587117/20250312_012952/monitor-set-010f8f2c.tar.gz