[2024.2.6] FullScanAggregateEvent Got 0 Responses From CL=1 During Rolling Restart

Issue Description

During the disrupt_rolling_restart_cluster nemesis, a FullScanAggregateEvent error was raised: the full-scan aggregate operation failed with a ReadTimeout. The error indicates that the coordinator node timed out waiting for replica responses, receiving 0 of the 1 response required at consistency level ONE.
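
For context, the failing operation is conceptually a full-table aggregate issued at CL=ONE. Below is a minimal, hypothetical sketch (Python, cassandra-driver) of such a query against an assumed keyspace1.standard1 table, using one of the node addresses listed under Installation Details; it is not the actual FullScanAggregatesOperation code, only an illustration of the kind of request that raises ReadTimeout when the coordinator receives 0 of the 1 required replica responses.

```python
# Hypothetical sketch of a CL=ONE full-scan aggregate; the keyspace/table names,
# contact point, and timeouts are assumptions, not taken from the test code.
from cassandra import ConsistencyLevel, ReadTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.142.0.35"])  # any reachable node from the cluster above
session = cluster.connect()

# Full-table aggregate; USING TIMEOUT is a Scylla CQL extension for a per-query timeout.
stmt = SimpleStatement(
    "SELECT COUNT(*) FROM keyspace1.standard1 USING TIMEOUT 120s",
    consistency_level=ConsistencyLevel.ONE,
)

try:
    row = session.execute(stmt, timeout=130).one()
    print("rows counted:", row[0])
except ReadTimeout as exc:
    # Mirrors the reported failure: 0 responses received, 1 required, CL=ONE.
    print("coordinator timed out waiting for replicas:", exc)
finally:
    cluster.shutdown()
```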

Impact

This failure aborts the FullScanAggregatesOperation, one of the background full-scan read workloads exercised during the test. A read timeout does not by itself corrupt or lose data, but receiving 0 of the 1 response required at CL=ONE means the cluster was momentarily unable to serve the scan even at the lowest consistency level while nodes were restarting, which would surface to applications as read failures.
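
A rolling restart is expected to take replicas down one at a time, so long-running scan workloads normally tolerate transient timeouts by retrying with backoff rather than failing the run. The helper below is a hypothetical illustration of that pattern; it is not the retry logic the test framework uses.

```python
import time
from cassandra import ReadTimeout

def execute_with_retries(session, statement, attempts=3, backoff_s=10):
    """Retry a read that may hit ReadTimeout while nodes are restarting.

    Hypothetical helper for illustration only; not part of the test framework.
    """
    for attempt in range(1, attempts + 1):
        try:
            return session.execute(statement)
        except ReadTimeout:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
```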

How Frequently Does It Reproduce?

The reproduction frequency is unknown: this failure has not been observed often, and further investigation is required to determine the root cause.

Installation Details

The test ran on GCE with the following configuration (nodes may be added or replaced during nemesis operations, so more than six nodes appear in the list):

  • Cluster size: 6 nodes (n2-highmem-16)
  • Scylla Nodes used in this run:
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-9 (34.139.180.132 | 10.142.0.122) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-8 (34.148.229.223 | 10.142.0.30) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-7 (35.196.20.19 | 10.142.0.6) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-6 (34.138.13.253 | 10.142.0.117) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-5 (35.185.87.183 | 10.142.0.115) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-4 (34.23.32.239 | 10.142.0.104) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-3 (35.196.17.144 | 10.142.0.53) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-2 (35.227.117.58 | 10.142.0.36) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-1 (35.196.50.184 | 10.142.0.35) (shards: 14)
  • OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/7869686360737388405 (gce: undefined_region)
  • Test: longevity-10gb-3h-gce-test
  • Test id: 010f8f2c-57f2-46f1-aa46-d7ff7e587117
  • Test name: enterprise-2024.2/longevity/longevity-10gb-3h-gce-test
  • Test method: longevity_test.LongevityTest.test_custom_time
  • Test config file(s):

Logs and Commands

The following logs and commands are available:

  • Restore Monitor Stack command: $ hydra investigate show-monitor 010f8f2c-57f2-46f1-aa46-d7ff7e587117
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 010f8f2c-57f2-46f1-aa46-d7ff7e587117

Logs

The following logs are available:
