Node Becomes Unresponsive After Splitting From The Cluster And Merging Back
Introduction
In this article, we explore a critical issue that can occur in OrientDB clusters: a node that splits from the cluster and then merges back can become unresponsive, leading to database inconsistencies and client request failures. We cover the expected behavior, the actual behavior observed, steps to reproduce the issue, and a workaround.
Expected Behavior
When a node splits from the cluster and then merges back, the following behavior is expected (a quick verification sketch follows this list):
- The node should re-open known databases.
- The node should process client requests.
- The node should process quorum requests.
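A quick way to check the first point is to ask the node for its database list over OrientDB's HTTP API (default port 2480). This is a sketch; the host name odb2 and the credentials are placeholders tied to the reproduction setup below:

$ curl -u root:rootpwd http://odb2:2480/listDatabases

A healthy node that has re-opened its databases returns a JSON document with a non-empty databases array.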
Actual Behavior
However, in the case of the issue described, the actual behavior is as follows:
- The node is ONLINE, but it has no databases listed in the cluster view. Other nodes in the cluster also report an empty database list for the affected node.
- The node does not respond to client requests, failing with the following error (a console session illustrating this is sketched after this list):
com.orientechnologies.orient.core.exception.OStorageException: Cannot connect to the remote server/database
- The node counts towards the quorum from the perspective of other nodes but does not respond to quorum requests.
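For example, connecting to the affected node from the OrientDB console fails immediately. The host name odb2, database name mydb, and credentials below are illustrative placeholders, not values from the original report:

$ ./bin/console.sh
orientdb> CONNECT remote:odb2/mydb admin admin
com.orientechnologies.orient.core.exception.OStorageException: Cannot connect to the remote server/database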
Steps to Reproduce
To reproduce this issue, follow these steps:
- Start a cluster of three nodes: Start three nodes labeled odb1, odb2, and odb3. The node roles are not relevant, but at least one node should be designated as the MASTER.
- Create a new database: Use OrientDB Studio to connect to the MASTER node and create a new database.
- Wait for the database to become ONLINE: Allow the database to become ONLINE on each node in the cluster.
- Disrupt node odb2's heartbeats: Stop odb2's heartbeats with docker pause or docker compose pause, or by blocking its traffic with a network firewall (see the command sketch after this list). Any node can be used to simulate this scenario.
- Wait for odb1 and odb3 to detect the missing node: Allow odb1 and odb3 to detect that odb2 is missing and kick it from the cluster view.
- Restore node odb2's heartbeats: Reverse the action taken in step 4 to restore odb2's heartbeats. Node odb2 will notice the pause and log the following message:
System clock apparently jumped from 2025-03-12 08:36:20.974 to 2025-03-12 08:40:11.880 since last heartbeat (+225906 ms) [ClusterHeartbeatManager]
Resetting heartbeat timestamps because of huge system clock jump! Clock-Jump: 225906 ms, Heartbeat-Timeout: 60000 ms [ClusterHeartbeatManager]
- Wait for odb2 to start merging back into the cluster.
- Wait for odb2 to become ONLINE again.
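The heartbeat disruption in step 4 and its reversal in step 6 can be scripted as follows, assuming the nodes run as Docker containers named after the labels above. The firewall alternative assumes Hazelcast's default cluster port 5701, which OrientDB's distributed mode uses; adjust it to your configuration:

# Step 4: freeze odb2 so it stops sending heartbeats
$ docker pause odb2            # or: docker compose pause odb2

# Alternative: drop cluster traffic on odb2 instead of pausing it
$ iptables -A INPUT -p tcp --dport 5701 -j DROP

# Step 6: resume odb2 so it rejoins the cluster
$ docker unpause odb2          # or: docker compose unpause odb2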
Workaround
To resolve this issue, a workaround is to restart any of the remaining nodes (odb1 or odb3). This action will trigger the synchronization of the database on odb2, and the node will participate in the cluster as expected.
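Assuming the nodes run as Docker containers named after their labels, the workaround is a single command:

$ docker restart odb1          # or: docker compose restart odb1

Once odb1 rejoins, it triggers a resync of the database on odb2, after which odb2 should list the database again and accept client and quorum requests.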
Frequently Asked Questions
Q: What causes a node to become unresponsive after splitting from the cluster and merging back?
A: The exact cause of this issue is not yet fully understood, but it is believed to be related to the way OrientDB handles node heartbeats and cluster synchronization. When a node splits from the cluster and then merges back, it may experience a "clock jump" or a significant change in its system clock, which can cause the node to become desynchronized with the rest of the cluster.
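The ClusterHeartbeatManager messages in the log above come from Hazelcast, which OrientDB's distributed mode is built on. As a hedged mitigation sketch rather than a confirmed fix, Hazelcast's tolerance for missed heartbeats can be raised with the standard hazelcast.max.no.heartbeat.seconds system property; how JVM properties are passed depends on your distribution or Docker image (the ORIENTDB_SETTINGS variable shown here is one common mechanism):

# Allow up to 300 s without a heartbeat before a member is evicted
# (untested against this specific bug; note that eviction will also
# be slower during real failures)
$ export ORIENTDB_SETTINGS="-Dhazelcast.max.no.heartbeat.seconds=300"
$ ./bin/server.sh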
Q: What are the symptoms of a node becoming unresponsive after splitting from the cluster and merging back?
A: The symptoms of this issue include:
- The node is ONLINE, but it has no databases listed in the cluster view.
- The node does not respond to client requests, resulting in an OStorageException: Cannot connect to the remote server/database error.
- The node counts towards the quorum from the perspective of other nodes but does not respond to quorum requests.
Q: How can I reproduce this issue?
A: Follow the steps listed in the Steps to Reproduce section above: start a three-node cluster, create a database, pause odb2's heartbeats until the other nodes evict it from the cluster view, then restore the heartbeats and wait for odb2 to merge back and report ONLINE.
Q: What is the workaround for this issue?
A: Restart any of the remaining nodes (odb1 or odb3), as described in the Workaround section above. This triggers a resync of the database on odb2, after which the node participates in the cluster as expected.
Q: Is there a way to prevent this issue from occurring in the first place?
A: While there is no foolproof way to prevent this issue from occurring, there are several steps you can take to minimize the risk:
- Regularly back up your database to prevent data loss in case of a node failure.
- Use a load balancer or other traffic management tool to distribute traffic across multiple nodes and reduce the impact of a single node failure.
- Implement a monitoring and alerting system to quickly detect and respond to node failures or other issues (a minimal polling sketch follows this list).
- Regularly update and patch your OrientDB installation to ensure you have the latest security fixes and features.
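To make the monitoring point concrete for this particular failure mode, the sketch below flags any node whose database list is empty, which is the tell-tale symptom here. Host names, port, and credentials are assumptions based on the reproduction setup; adjust them for your deployment:

# Poll each node's database list via the HTTP API (default port 2480)
for node in odb1 odb2 odb3; do
  dbs=$(curl -s -u root:rootpwd "http://$node:2480/listDatabases")
  echo "$node: $dbs"
  case "$dbs" in
    *'"databases":[]'*) echo "ALERT: $node reports an empty database list" ;;
  esac
done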
Q: Is this issue specific to OrientDB, or can it occur in other databases as well?
A: While this issue is specific to OrientDB, similar issues can occur in other databases that use a distributed architecture and rely on node heartbeats and cluster synchronization. However, the specific symptoms and behavior of this issue may vary depending on the database and its configuration.