Summary
On May 2, 2023, Shard 0 (S0) consensus was interrupted for approximately 3 hours because internal validator nodes were unable to add block 41256348. The issue was resolved by reconfiguring the DNS node to point to the explorer node, which allowed the internal validators to synchronize with external nodes. Further investigation into the code hash issue is underway to prevent future incidents.
Impact
Shard 0 consensus was interrupted and no transactions were processed for approximately 3 hours. Shard 0 consensus was restored at 2023-05-02, 22:18:46 UTC.
Timeline
- May 2, 07:12 PM (UTC): PagerDuty alerts for consensus stuck on shard 0
- May 2, 08:54 PM (UTC): Revert from block 4125637 to 4125636
- May 2, 10:14 PM (UTC): DNS is updated
- May 2, 10:17 PM (UTC): Nodes are all restarted
- May 2, 10:26 PM (UTC): Issue is resolved - consensus resumes
Root Cause
Internal S0 validator nodes could not add block 41256348 because of an error loading the code hash of a smart contract. Internal non-consensus nodes, explorer nodes, and external validators (not yet upgraded) were not affected and successfully added block 41256348. As a result, internal and external nodes were at different heights and could not reach consensus.
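The height divergence described above can be illustrated with a minimal sketch. It is not the monitoring used during the incident: the node names, the `NodeHeight` type, and the `lagging_nodes` helper are hypothetical, and the heights are illustrative values chosen around block 41256348.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeHeight:
    name: str    # hypothetical node label, not a real hostname
    height: int  # latest block number the node reports having added


def lagging_nodes(reports, tolerance=0):
    """Return names of nodes trailing the highest reported height by more than `tolerance` blocks."""
    if not reports:
        return []
    tip = max(r.height for r in reports)
    return [r.name for r in reports if tip - r.height > tolerance]


# Illustrative snapshot: internal validators are one block behind
# the explorer node and the external validator.
reports = [
    NodeHeight("internal-validator-1", 41256347),
    NodeHeight("internal-validator-2", 41256347),
    NodeHeight("explorer-1", 41256348),
    NodeHeight("external-validator-1", 41256348),
]

stuck = lagging_nodes(reports)
print(stuck)
```

A check of this kind only shows which nodes are behind; it does not explain why they failed to add the block, which in this incident required inspecting the code hash error.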
Action Taken
To resolve the problem, the internal S0 validator nodes were restarted. In addition, the DNS node was reconfigured to point to the explorer node. This allowed the internal S0 validator nodes to synchronize to the latest block and return to the same height as the external nodes.
What’s Next
- Fix the code hash issue (In Progress); the May 10 outage was caused by a similar issue
- Hardfork (v2023.2.0) postponed until the root cause is formally fixed