[2024.2.6] FullScanAggregateEvent Got 0 Responses From CL=1 During Rolling Restart

Issue Description

During the disrupt_rolling_restart_cluster nemesis, a FullScanAggregateEvent error was raised: the full-scan aggregate operation failed with a ReadTimeout. The error indicates that the coordinator node timed out waiting for replica responses, receiving 0 of the 1 response required at consistency level ONE.
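
For context, the failing operation is conceptually a full-table aggregate issued at CL=ONE. Below is a minimal, hypothetical sketch (Python, cassandra-driver) of such a query against an assumed keyspace1.standard1 table, using one of the node addresses listed under Installation Details; it is not the actual FullScanAggregatesOperation code, only an illustration of the kind of request that raises ReadTimeout when the coordinator receives 0 of the 1 required replica responses.

```python
# Hypothetical sketch of a CL=ONE full-scan aggregate; the keyspace/table names,
# contact point, and timeouts are assumptions, not taken from the test code.
from cassandra import ConsistencyLevel, ReadTimeout
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.142.0.35"])  # any reachable node from the cluster above
session = cluster.connect()

# Full-table aggregate; USING TIMEOUT is a Scylla CQL extension for a per-query timeout.
stmt = SimpleStatement(
    "SELECT COUNT(*) FROM keyspace1.standard1 USING TIMEOUT 120s",
    consistency_level=ConsistencyLevel.ONE,
)

try:
    row = session.execute(stmt, timeout=130).one()
    print("rows counted:", row[0])
except ReadTimeout as exc:
    # Mirrors the reported failure: 0 responses received, 1 required, CL=ONE.
    print("coordinator timed out waiting for replicas:", exc)
finally:
    cluster.shutdown()
```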

Impact

This failure aborts the FullScanAggregatesOperation, one of the background full-scan read workloads exercised during the test. A read timeout does not by itself corrupt or lose data, but receiving 0 of the 1 response required at CL=ONE means the cluster was momentarily unable to serve the scan even at the lowest consistency level while nodes were restarting, which would surface to applications as read failures.
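
A rolling restart is expected to take replicas down one at a time, so long-running scan workloads normally tolerate transient timeouts by retrying with backoff rather than failing the run. The helper below is a hypothetical illustration of that pattern; it is not the retry logic the test framework uses.

```python
import time
from cassandra import ReadTimeout

def execute_with_retries(session, statement, attempts=3, backoff_s=10):
    """Retry a read that may hit ReadTimeout while nodes are restarting.

    Hypothetical helper for illustration only; not part of the test framework.
    """
    for attempt in range(1, attempts + 1):
        try:
            return session.execute(statement)
        except ReadTimeout:
            if attempt == attempts:
                raise  # give up after the last attempt
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
```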

How Frequently Does It Reproduce?

The reproduction frequency is unknown: this failure has not been observed often, and further investigation is required to determine the root cause.

Installation Details

The test ran on GCE with the following configuration (nodes may be added or replaced during nemesis operations, so more than six nodes appear in the list):

  • Cluster size: 6 nodes (n2-highmem-16)
  • Scylla Nodes used in this run:
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-9 (34.139.180.132 | 10.142.0.122) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-8 (34.148.229.223 | 10.142.0.30) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-7 (35.196.20.19 | 10.142.0.6) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-6 (34.138.13.253 | 10.142.0.117) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-5 (35.185.87.183 | 10.142.0.115) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-4 (34.23.32.239 | 10.142.0.104) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-3 (35.196.17.144 | 10.142.0.53) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-2 (35.227.117.58 | 10.142.0.36) (shards: 14)
    • longevity-10gb-3h-2024-2-db-node-010f8f2c-0-1 (35.196.50.184 | 10.142.0.35) (shards: 14)
  • OS / Image: https://www.googleapis.com/compute/v1/projects/scylla-images/global/images/7869686360737388405 (gce: undefined_region)
  • Test: longevity-10gb-3h-gce-test
  • Test id: 010f8f2c-57f2-46f1-aa46-d7ff7e587117
  • Test name: enterprise-2024.2/longevity/longevity-10gb-3h-gce-test
  • Test method: longevity_test.LongevityTest.test_custom_time
  • Test config file(s):

Logs and Commands

The following logs and commands are available:

  • Restore Monitor Stack command: $ hydra investigate show-monitor 010f8f2c-57f2-46f1-aa46-d7ff7e587117
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 010f8f2c-57f2-46f1-aa46-d7ff7e587117

Logs

The following logs are available:
