Harmony Mainnet Stuck Incident Report: Feb 10, 2024

Summary

The Harmony mainnet experienced a significant outage when nodes became stuck during the validation phase because blocks carrying old crosslink data could not be verified. The team traced the root cause to shard state data missing from the snapDB and evaluated several fixes, including database updates and code patches. After a coordinated effort, a solution was implemented to ignore old crosslinks in block proposals, which resolved the issue without affecting network integrity. Subsequently, the crosslink management logic was updated to allow crosslinks to catch up. The incident highlighted the importance of dynamic verification methods and the potential need for protocol adjustments to handle similar situations.

Timeline of Events

  • Initial Discovery: The problem was first reported on February 10, when mainnet nodes were found to be stuck on repeated “block already known” errors.
  • Immediate Response: The team promptly initiated diagnostic procedures, identifying issues related to block and shard state verification failures.
  • Investigation and Troubleshooting: After snapDB was restored and fully synchronized, the team found that the shard state data needed for block verification was not accessible, which caused a loop of view changes triggered by outdated crosslinks. The snapDB state did not include the older crosslink data required for validation. This shows how much the network depends on complete state data in the database and on careful handling of crosslinks.
  • Resolution Attempts: Several strategies were deployed, including database rollbacks, DNS updates to guide nodes toward healthy peers, and using snapDB for node recovery.
  • Root Cause Analysis: The incident began with a segmentation fault that crashed nodes. On restart, a flawed rollback mechanism was invoked, which led to persistent “block already known” errors. The subsequent restoration and full sync of snapDB did not include the necessary shard state data, which worsened the block verification failures caused by outdated or missing crosslink information.
  • Final Solution: The final resolution involved ignoring old crosslinks in block proposals and updating the block proposal logic to delete outdated crosslinks. This approach ensured the network processed only valid blocks, addressing the validation failures caused by obsolete crosslink data.
  • Recovery and Monitoring: After the initial 26-hour outage, the team applied fixes for consensus and addressed crosslink syncing issues. This work extended the total recovery time beyond the outage itself to ensure network stability and functionality.

Technical Analysis

  • Initial Symptoms: Harmony mainnet nodes encountered “block already known” errors, preventing the processing of new blocks. This issue caused a significant network stall, initially observed through error logs indicating duplicate block entries.
  • Investigation and Findings: Detailed log analysis pinpointed issues with block verification during view changes. The core problem was identified as an inability to read the shard state from the database for epochs significantly older than the current one, which led to verification failures during consensus.
  • Resolution Attempts: The team undertook multiple strategies to mitigate the issue, including synchronized network restarts and the deployment of snapDB to facilitate quicker node synchronization. The breakthrough came with the realization that outdated crosslinks in block proposals failed validation due to missing shard state data. This led to the development and deployment of fixes to ignore very old crosslinks during block proposal, ultimately restoring network functionality.
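
The failure mode described above can be illustrated with a small Go sketch. It is a toy model and not Harmony's code: `CrossLink`, `StateStore`, `VerifyCrossLinks`, and the epoch numbers are hypothetical names and values chosen for illustration. The assumption is simply that verifying a crosslink requires the shard state for that crosslink's epoch, so a store that only holds recent epochs rejects any proposal carrying a much older crosslink.

```go
package main

import (
	"errors"
	"fmt"
)

// CrossLink is a simplified, hypothetical stand-in for a crosslink.
// It records only the shard, epoch, and block number it refers to.
type CrossLink struct {
	ShardID  uint32
	Epoch    uint64
	BlockNum uint64
}

// ErrNoShardState signals that the store holds no shard state for an epoch.
var ErrNoShardState = errors.New("shard state not found for epoch")

// StateStore models which epochs have shard state available (assumption:
// a snapshot database may contain only recent epochs).
type StateStore map[uint64]struct{}

// VerifyCrossLinks returns an error as soon as one crosslink refers to an
// epoch whose shard state is missing from the store.
func VerifyCrossLinks(store StateStore, links []CrossLink) error {
	for _, cl := range links {
		if _, ok := store[cl.Epoch]; !ok {
			return fmt.Errorf("crosslink shard %d block %d epoch %d: %w",
				cl.ShardID, cl.BlockNum, cl.Epoch, ErrNoShardState)
		}
	}
	return nil
}

func main() {
	// Store that only holds two recent epochs (made-up numbers).
	store := StateStore{1999: {}, 2000: {}}
	proposal := []CrossLink{
		{ShardID: 1, Epoch: 2000, BlockNum: 300},
		{ShardID: 1, Epoch: 1990, BlockNum: 100}, // older than anything in the store
	}
	if err := VerifyCrossLinks(store, proposal); err != nil {
		fmt.Println("proposal rejected:", err)
	}
}
```

In this model, every proposal that still carries the old crosslink is rejected for the same reason, which matches the repeated verification failures and view changes observed during the incident.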

Root Cause

The incident’s root cause was the inclusion of outdated crosslink data within block proposals, coupled with the deployment of snapDB, which lacked the state information needed to validate crosslinks from those epochs. This mismatch led to validation failures and caused nodes to become stuck, preventing the network from processing new blocks.

Solution and Implementation

The resolution involved developing and deploying hotfixes to bypass the old crosslink data that caused the validation failures and left the mainnet nodes stuck. These fixes ensured that block proposals carried only recent crosslinks, so the network processed only blocks it could validate. This addressed the immediate problem and prevents similar failures caused by outdated data in block proposals. The successful deployment of these hotfixes brought the network back online, marking the end of the incident.
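
As a rough illustration of the idea behind the hotfixes, the Go sketch below partitions pending crosslinks by age before a block is proposed. It is not the code from the referenced PRs: `CrossLink`, `SplitByAge`, and the `maxCrossLinkAgeEpochs` cutoff are hypothetical, and the actual threshold used in the fix is not stated in this report.

```go
package main

import "fmt"

// CrossLink is a simplified, hypothetical stand-in for a crosslink.
type CrossLink struct {
	ShardID  uint32
	Epoch    uint64
	BlockNum uint64
}

// maxCrossLinkAgeEpochs is an assumed cutoff for illustration only.
const maxCrossLinkAgeEpochs = 2

// SplitByAge partitions pending crosslinks into those recent enough to
// include in a proposal (keep) and those too old to validate (stale).
// Stale entries can then be deleted from the pending set.
func SplitByAge(pending []CrossLink, currentEpoch uint64) (keep, stale []CrossLink) {
	for _, cl := range pending {
		// Guard the subtraction so a crosslink from a future epoch
		// cannot underflow the unsigned age computation.
		if currentEpoch > cl.Epoch && currentEpoch-cl.Epoch > maxCrossLinkAgeEpochs {
			stale = append(stale, cl)
			continue
		}
		keep = append(keep, cl)
	}
	return keep, stale
}

func main() {
	pending := []CrossLink{
		{ShardID: 1, Epoch: 1990, BlockNum: 100},
		{ShardID: 1, Epoch: 1999, BlockNum: 200},
		{ShardID: 2, Epoch: 2000, BlockNum: 300},
	}
	keep, stale := SplitByAge(pending, 2000)
	fmt.Printf("propose %d crosslinks, drop %d stale\n", len(keep), len(stale))
}
```

Returning the stale entries separately, rather than silently discarding them, lets the caller both skip them in the proposal and remove them from the pending set, which mirrors the two parts of the fix described in the timeline.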

Impact Assessment

The Harmony mainnet experienced a significant outage because nodes became stuck on “block already known” errors and failures in block processing. The investigation showed that the core issue was block verification failing during view changes, caused by the inability to read the shard state from the DB for old epochs. Efforts to rectify the situation included synchronized restart attempts, the use of snapDB, and the identification of outdated crosslinks as the trigger for the validation failures. The resolution involved deploying hotfixes to exclude very old crosslinks from block proposals and to process only valid blocks, which addressed the validation issues and unstuck the nodes. The event caused around 26 hours of network downtime, impacted transactions, and highlighted the need for further security and stability measures.

Lessons Learned

The incident underscored the critical nature of maintaining comprehensive state data for validation purposes, highlighting the inherent challenges of effectively managing snapshot databases (snapDBs). It also spotlighted the necessity for robust mechanisms capable of handling old crosslinks efficiently, ensuring network integrity and minimizing the potential for future disruptions. The lessons learned from this event will guide system resilience and validation protocol improvements to prevent similar occurrences and enhance overall network stability.

Future Preventive Measures

To enhance Harmony’s resilience against similar incidents in the future, the team can focus on optimizing database management, particularly concerning the maintenance and updating of snapDBs to ensure they contain comprehensive state data for all epochs. Improvements in the processing and validation of crosslinks can prevent the inclusion of outdated or invalid data in block proposals. Additionally, enhancements to the consensus mechanisms, such as more robust handling of view changes and block verifications, could safeguard against validation failures and improve network stability. Implementing these measures would fortify Harmony’s infrastructure, ensuring higher reliability and security.
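
One way to picture the snapDB measure is a preflight check run before a snapshot is used for recovery. The Go sketch below is hypothetical: `MissingEpochs` and its inputs are illustrative names, and it assumes the operator can list both the epochs with shard state in the snapshot and the epochs referenced by pending crosslinks.

```go
package main

import (
	"fmt"
	"sort"
)

// MissingEpochs returns, in ascending order and without duplicates, the
// epochs that pending crosslinks refer to but that have no shard state
// in the snapshot. An empty result means the snapshot covers them all.
func MissingEpochs(available map[uint64]bool, referenced []uint64) []uint64 {
	seen := make(map[uint64]bool)
	var missing []uint64
	for _, e := range referenced {
		if available[e] || seen[e] {
			continue
		}
		seen[e] = true
		missing = append(missing, e)
	}
	sort.Slice(missing, func(i, j int) bool { return missing[i] < missing[j] })
	return missing
}

func main() {
	// Made-up epochs: the snapshot holds state for two recent epochs only.
	available := map[uint64]bool{1999: true, 2000: true}
	referenced := []uint64{2000, 1990, 1990, 1985, 1999}

	if missing := MissingEpochs(available, referenced); len(missing) > 0 {
		fmt.Println("snapshot lacks shard state for epochs:", missing)
	}
}
```

A check of this kind would not fix a gap in the snapshot, but it would surface the gap before nodes restored from that snapshot begin rejecting proposals.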

Acknowledgments

The resolution of the Harmony mainnet incident was a collective effort, showcasing the dedication and expertise of the team. Special recognition goes to Gheis and Soph for identifying the root cause, alongside valuable contributions from Diego, Konstantin, and Ulad. Their collaborative approach underlines the team’s strength in navigating complex challenges and ensuring the stability of the network.

References

PR #4627, PR #4628, and PR #4629
