Post Mortem: Tuesday 2 May 2023

Summary

On May 2, 2023, Shard 0 (S0) consensus was interrupted for approximately 3 hours because internal validator nodes were unable to add block 41256348. The issue was resolved by reconfiguring the DNS node to point to the explorer node, which allowed the internal validators to synchronize with external nodes. Further investigation into the code hash issue is underway to prevent future incidents.

Impact

Shard 0 consensus was interrupted and no transactions were processed for approximately 3 hours. Shard 0 consensus was restored at 2023-05-02, 22:18:46 UTC.

Timeline

  • May 2, 07:12 PM (UTC): PagerDuty alerts for consensus stuck on shard 0
  • May 2, 08:54 PM (UTC): Revert from block 4125637 to 4125636
  • May 2, 10:14 PM (UTC): DNS is updated
  • May 2, 10:17 PM (UTC): Nodes are all restarted
  • May 2, 10:26 PM (UTC): Issue is resolved - consensus resumes

Root Cause

Internal S0 validator nodes could not add block 41256348 due to an error loading the code hash of a smart contract. Internal non-consensus nodes, explorer nodes, and external validators (not yet upgraded) were not impacted and successfully added block 41256348. As a result, internal and external nodes were at different heights and could not reach consensus.
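
The height divergence described above is the kind of condition a simple monitoring check can flag. The following is a minimal sketch, not the tooling actually used during this incident: the node names, the snapshot values, and the helper functions are all hypothetical, and in practice the heights would come from each node's own reporting rather than a hard-coded dictionary.

```python
from collections import defaultdict


def group_by_height(heights):
    """Group node names by the block height each one reports."""
    groups = defaultdict(list)
    for node, height in heights.items():
        groups[height].append(node)
    return dict(groups)


def find_lagging_nodes(heights):
    """Return the nodes whose reported height is below the highest one seen."""
    if not heights:
        return []
    tip = max(heights.values())
    return sorted(node for node, height in heights.items() if height < tip)


# Hypothetical snapshot: names and heights are illustrative assumptions,
# not values taken from the incident logs.
snapshot = {
    "internal-validator-1": 41256347,
    "internal-validator-2": 41256347,
    "explorer-1": 41256348,
    "external-validator-1": 41256348,
}

lagging = find_lagging_nodes(snapshot)
groups = group_by_height(snapshot)

if len(groups) > 1:
    print("Height divergence detected; lagging nodes:", lagging)
```

A check like this only reports that nodes disagree on height. It does not explain why, so the code hash error would still have to be diagnosed from node logs.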

Action Taken

To resolve the problem, the internal S0 validator nodes were restarted. In addition, the DNS node was reconfigured to point to the explorer node. This allowed the internal S0 validator nodes to synchronize to the latest block and come back in sync with the external nodes.

What’s Next

  • Fix the code hash issue (In Progress); the May 10 outage was caused by a similar issue
  • Hardfork (v2023.2.0) postponed until the root cause is formally fixed

Will we have another post-mortem about the outage we had between 14-5-2023 and 15-5-2023?

Yes. We are currently looking into the issue. We will post one as soon as possible.