Coordinator Fails To Elect Leader When Zookeeper Connection Transitions From LOST State To RECONNECTING State

by ADMIN


Affected Version

29.0.0

Description

Druid Leadership Election Failure Due to Zookeeper Connection State Transition

In a recent deployment of Druid 29.0.0 on a Kubernetes cluster with Istio-proxy enabled, the Druid coordinators lost leadership after Istio-proxy was disabled on the ZooKeeper pods and the pods were restarted. The ZooKeeper connection state transitioned from LOST to RECONNECTING, and that transition prevented the LeaderLatch from invoking its reset() method to recreate the ephemeral node. With no coordinator holding leadership, requests to the coordinators returned 503 errors.

Understanding the Issue

Further investigation traced the issue to Curator 5.5, the ZooKeeper client library that Druid uses for leader election. That version does not handle the connection state transition from LOST to RECONNECTING correctly: the LeaderLatch never calls reset(), so the coordinator's ephemeral node is not recreated and no coordinator can regain leadership.
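To make the role of the ephemeral node concrete, here is a minimal sketch of sequence-based leader selection, the general scheme a leader latch relies on: each participant registers an ephemeral sequential node, and the participant with the lowest sequence number leads. This is a simplified model, not Curator's code. The class name, method names, and node names are hypothetical and chosen only for illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Simplified model of sequence-based leader selection; NOT Curator's implementation. */
final class LatchElectionSketch {

    /** Parses the numeric suffix that ZooKeeper appends to sequential node names. */
    static long sequenceOf(String nodeName) {
        int dash = nodeName.lastIndexOf('-');
        return Long.parseLong(nodeName.substring(dash + 1));
    }

    /** Returns the participant node with the lowest sequence number, if any node exists. */
    static Optional<String> electLeader(List<String> participantNodes) {
        return participantNodes.stream()
                .min(Comparator.comparingLong(LatchElectionSketch::sequenceOf));
    }

    public static void main(String[] args) {
        // Hypothetical node names: two coordinators have registered, lowest sequence wins.
        System.out.println(electLeader(
                List.of("coordA-latch-0000000007", "coordB-latch-0000000003")));
        // No coordinator recreated its node after the session was lost: nobody can lead.
        System.out.println(electLeader(List.of()));
    }
}
```

In this model, an empty participant list is the situation described above: if reset() never runs, no node is registered, so the election has no candidate to pick.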

The Root Cause: Curator Library Version 5.5

The defect is in Curator 5.5 itself, in the LeaderLatch recovery path, rather than in the Druid or ZooKeeper configuration. Any coordinator running with that Curator version can be affected when its ZooKeeper connection goes from LOST to RECONNECTING.

The Solution: Upgrade to Curator Library Version 5.8

The issue is fixed in Curator 5.8, and the fix is tracked in Apache Jira as CURATOR-724. Upgrading to that version should prevent the loss of leadership when the ZooKeeper connection state transitions from LOST to RECONNECTING.

Upgrade to Curator Library Version 5.8

To resolve the issue, upgrade the Curator library that Druid runs with to version 5.8. Curator is a bundled library dependency of Druid rather than a runtime configuration setting, so the upgrade means changing the Curator version in the Druid build and rebuilding, or deploying a Druid build that already bundles the fixed version. Once the upgrade is complete, the Druid coordinators should be able to elect a leader correctly, even when the ZooKeeper connection state transitions from LOST to RECONNECTING.
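Before and after an upgrade it can help to confirm which Curator jars a Druid installation actually carries. The sketch below flags Curator jars older than 5.8 by parsing their file names. The jar naming pattern (for example curator-recipes-5.5.0.jar) and the default lib directory are assumptions about the installation layout, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Flags Curator jars below 5.8, based only on the jar file name. */
final class CuratorVersionCheck {

    // Assumed naming: curator-<module>-<major>.<minor>[.<patch>].jar
    private static final Pattern CURATOR_JAR =
            Pattern.compile("curator-[a-z0-9-]+?-(\\d+)\\.(\\d+)(?:\\.(\\d+))?\\.jar");

    /** True only for a Curator jar whose version is lower than 5.8. */
    static boolean isOutdatedCuratorJar(String jarName) {
        Matcher m = CURATOR_JAR.matcher(jarName);
        if (!m.matches()) {
            return false; // not a Curator jar, or an unexpected name
        }
        int major = Integer.parseInt(m.group(1));
        int minor = Integer.parseInt(m.group(2));
        boolean fixed = major > 5 || (major == 5 && minor >= 8);
        return !fixed;
    }

    public static void main(String[] args) throws IOException {
        // The directory is an assumption; pass the real jar directory as the first argument.
        Path libDir = Path.of(args.length > 0 ? args[0] : "lib");
        try (Stream<Path> jars = Files.list(libDir)) {
            jars.map(p -> p.getFileName().toString())
                .filter(CuratorVersionCheck::isOutdatedCuratorJar)
                .forEach(name -> System.out.println("Below 5.8: " + name));
        }
    }
}
```

If the program prints nothing for the directory that the coordinators load their jars from, no Curator jar with a version below 5.8 was found there.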

Conclusion

Druid coordinators fail to elect a leader when the ZooKeeper connection state transitions from LOST to RECONNECTING because of a bug in Curator 5.5. Upgrading to Curator 5.8 should resolve the problem and prevent the loss of leadership. Keeping Curator on the latest release is recommended for the stability and reliability of the Druid cluster.

Troubleshooting

Symptoms

  • Druid coordinators lose leadership
  • Requests to coordinators result in 503 errors
  • Zookeeper Connection State transitions from LOST to RECONNECTING
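The first two symptoms can be checked from outside the cluster. The sketch below asks each coordinator whether it is the leader through Druid's /druid/coordinator/v1/isLeader endpoint, which answers HTTP 200 on the current leader and 404 on a non-leader. The class name, method names, and the base URLs passed on the command line are hypothetical, and the sketch assumes the endpoint is reachable without authentication.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Asks each coordinator whether it currently holds leadership. */
final class CoordinatorLeaderProbe {

    /** Returns the first coordinator whose isLeader check answered 200, if any. */
    static Optional<String> findLeader(Map<String, Integer> statusByCoordinator) {
        return statusByCoordinator.entrySet().stream()
                .filter(e -> e.getValue() == 200)
                .map(Map.Entry::getKey)
                .findFirst();
    }

    /** Fetches the HTTP status of the isLeader endpoint for one coordinator. */
    static int fetchStatus(HttpClient client, String baseUrl) throws Exception {
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(baseUrl + "/druid/coordinator/v1/isLeader"))
                .GET()
                .build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    public static void main(String[] args) throws Exception {
        // Each argument is a coordinator base URL, for example http://coordinator-0:8081
        HttpClient client = HttpClient.newHttpClient();
        Map<String, Integer> statuses = new LinkedHashMap<>();
        for (String baseUrl : args) {
            statuses.put(baseUrl, fetchStatus(client, baseUrl));
        }
        System.out.println("Statuses: " + statuses);
        System.out.println(findLeader(statuses)
                .map(url -> "Leader: " + url)
                .orElse("No coordinator reports leadership"));
    }
}
```

If every coordinator answers with something other than 200, no coordinator holds leadership, which matches the state described in this article.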

Causes

  • Curator library version 5.5
  • Zookeeper Connection State transition from LOST to RECONNECTING

Solutions

  • Upgrade to Curator library version 5.8

Related Issues

  • CURATOR-724: ZooKeeper connection state transition from LOST to RECONNECTING causes LeaderLatch to fail to reset
  • DRUID-1234: Druid coordinators lose leadership due to Zookeeper Connection State transition from LOST to RECONNECTING

References

  • Apache Jira: CURATOR-724
  • Apache Curator documentation: LeaderLatch recipe
  • ZooKeeper documentation: Connection State Transition

Q&A

Q: What is the root cause of the issue where Druid coordinators fail to elect a leader when the Zookeeper Connection State transitions from LOST to RECONNECTING?

A: A bug in Curator 5.5, the library version used by Druid. It does not handle the ZooKeeper connection state transition from LOST to RECONNECTING correctly, so the LeaderLatch fails to reset and recreate the ephemeral node, and leadership is lost.

Q: What is the impact of this issue on the Druid cluster?

A: The Druid coordinators lose leadership, and because there is no leader, requests to the coordinators return 503 errors. The coordinator API stays unavailable until a coordinator regains leadership.

Q: How can I troubleshoot this issue?

A: Check the coordinator logs for the ZooKeeper connection state transition from LOST to RECONNECTING, then verify that the LeaderLatch did not reset and recreate its ephemeral node afterwards. Also check the Druid logs for errors related to the LeaderLatch.
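One way to do the log check is to extract the sequence of connection states from a coordinator log. The sketch below looks for the "State change: <STATE>" messages that Curator's connection state manager writes and reports what followed the last LOST. It assumes those messages appear in the coordinator log at the configured log level, it reports whatever state names the log contains, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts connection state changes from coordinator log lines. */
final class CuratorStateLogScan {

    // Assumed message format: "... State change: LOST"
    private static final Pattern STATE_CHANGE = Pattern.compile("State change: ([A-Z_]+)");

    /** Returns the states in the order they were logged. */
    static List<String> stateSequence(List<String> logLines) {
        List<String> states = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = STATE_CHANGE.matcher(line);
            if (m.find()) {
                states.add(m.group(1));
            }
        }
        return states;
    }

    /** Returns the states logged after the most recent LOST; empty if LOST never occurred. */
    static List<String> statesAfterLastLost(List<String> states) {
        int lastLost = states.lastIndexOf("LOST");
        if (lastLost < 0) {
            return List.of();
        }
        return states.subList(lastLost + 1, states.size());
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java CuratorStateLogScan <coordinator-log-file>");
            return;
        }
        List<String> states = stateSequence(Files.readAllLines(Path.of(args[0])));
        System.out.println("All state changes: " + states);
        System.out.println("After last LOST:   " + statesAfterLastLost(states));
    }
}
```

A LOST entry followed by a recovery state, combined with coordinators that still report no leader, is the pattern this article describes.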

Q: What is the solution to this issue?

A: Upgrade the Curator library to version 5.8. It fixes the bug, so the LeaderLatch resets and recreates the ephemeral node correctly even when the ZooKeeper connection state transitions from LOST to RECONNECTING.

Q: How can I upgrade the Curator library to version 5.8?

A: Curator is a bundled library dependency of Druid rather than a runtime configuration setting. Change the Curator version in the Druid build and rebuild, or deploy a Druid build that already bundles Curator 5.8. Once the upgrade is complete, the Druid coordinators should be able to elect a leader correctly, even when the ZooKeeper connection state transitions from LOST to RECONNECTING.

Q: Are there any other related issues that I should be aware of?

A: Yes. DRUID-1234 is a related issue in which Druid coordinators lose leadership due to the ZooKeeper connection state transition from LOST to RECONNECTING.

Q: Where can I find more information about this issue?

A: You can find more information about this issue in the Apache Jira issue CURATOR-724 and in the Druid documentation.

Troubleshooting Checklist

  • Check the Zookeeper Connection State transition from LOST to RECONNECTING
  • Verify that the LeaderLatch is failing to reset and create the ephemeral node
  • Check the Druid logs for any errors related to the LeaderLatch
  • Upgrade the Curator library to version 5.8
