Mainnet Shard 0 - 12-Minute Downtime on November 8, 2024


Summary

On November 8, 2024, mainnet shard 0 experienced approximately 12 minutes of downtime. The last block before the halt, at height 65173161, was produced at 03:30:09 AM UTC, and the chain resumed block production at 03:42:55 AM UTC with block 65173162.


Detailed Timeline

  • 03:30:09 AM UTC: Block height 65173161 was produced.
  • 03:30:38 AM UTC: Consensus on block 65173162 timed out, initiating a view change.
  • 03:42:54 AM UTC: A new leader was successfully elected.
  • 03:42:55 AM UTC: Block height 65173162 was produced, and normal operations resumed.
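
The intervals implied by the timeline can be checked with a short calculation. This is only an illustrative sketch: the helper `utc` and the variable names are made up for this example, and the only inputs are the four timestamps listed above.

```python
from datetime import datetime, timezone


def utc(hms: str) -> datetime:
    """Parse an HH:MM:SS timestamp on the incident date (November 8, 2024, UTC)."""
    parsed = datetime.strptime(f"2024-11-08 {hms}", "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


# Timestamps copied from the timeline above.
last_block = utc("03:30:09")      # block 65173161 produced
timeout = utc("03:30:38")         # consensus on block 65173162 timed out
leader_elected = utc("03:42:54")  # new leader elected
resumed = utc("03:42:55")         # block 65173162 produced

time_to_timeout = timeout - last_block         # wait before the view change started
view_change_duration = leader_elected - timeout  # time spent in view change
gap_between_blocks = resumed - last_block      # total gap in block production

print("time to timeout:     ", time_to_timeout)       # 0:00:29
print("view change duration:", view_change_duration)  # 0:12:16
print("gap between blocks:  ", gap_between_blocks)    # 0:12:46
```

Almost all of the gap (12 minutes 16 seconds out of 12 minutes 46 seconds) falls between the consensus timeout and the election of a new leader.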

Impact Analysis

The incident resulted in a temporary halt in block production on shard 0, lasting approximately 12 minutes.


Root Cause Analysis

Investigation revealed a flaw in the view change mechanism. During the view change, the protocol re-elected the same validator as leader, and that validator was likely offline or unresponsive throughout the downtime. Because the view change did not rotate leadership to a different validator, the restoration of consensus was delayed.
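
To illustrate the kind of failure described here, the sketch below contrasts two rotation strategies. It is not Harmony's implementation. It assumes, based only on the multi-BLS-key remark under Lessons Learned, that one validator held several consecutive key slots in the committee; the `Slot` type, both function names, and the sample committee are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Slot:
    bls_key: str    # hypothetical key label
    validator: str  # hypothetical validator that owns the key


def next_slot_naive(committee: List[Slot], current: int) -> int:
    """Advance to the next key slot, regardless of which validator owns it."""
    return (current + 1) % len(committee)


def next_slot_skip_validator(committee: List[Slot], current: int) -> int:
    """Advance to the next slot owned by a validator other than the current leader."""
    owner = committee[current].validator
    size = len(committee)
    for step in range(1, size):
        candidate = (current + step) % size
        if committee[candidate].validator != owner:
            return candidate
    return current  # every slot belongs to the same validator; nothing to rotate to


# Assumed layout: validator-A (offline) holds three consecutive keys.
committee = [
    Slot("key-a1", "validator-A"),
    Slot("key-a2", "validator-A"),
    Slot("key-a3", "validator-A"),
    Slot("key-b1", "validator-B"),
]

naive = next_slot_naive(committee, 0)
skipping = next_slot_skip_validator(committee, 0)
print(committee[naive].validator)     # validator-A
print(committee[skipping].validator)  # validator-B
```

Under this assumed layout, rotating by key hands leadership back to the same offline validator, while rotating by validator reaches a different one in a single step.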


Actions Taken

  • Conducted initial troubleshooting of the consensus and view change mechanisms during the incident.
  • Reached out to the affected validator to obtain logs and gather further information for investigation.
  • Promptly notified node operators of the incident via a forum post to ensure awareness.

Follow-Up Actions

  • The Harmony team will work on an urgent release to prevent this issue from recurring.
  • The team plans to set up and test validators with multi-BLS key configurations on devnet and testnet to better replicate mainnet behavior.

Lessons Learned

The leader rotation feature was deployed on devnet and testnet; however, this specific scenario was not adequately tested. This incident highlights the need for expanded test cases, especially involving validators with multiple BLS keys, to ensure that potential issues are identified and addressed in lower environments before deploying to mainnet.


Feel free to reach out if you have any questions.
