[POST MORTEM] JAN 2023 - Shard 1 and Shard 3 consensus loss

Hello everyone,

The root cause of the previous two network outages, in Shard 3 and Shard 1, has been identified: the leader nodes' system clocks were not time-synced.

What happened?
The S1 and S3 leader nodes were built in a DC from an image whose NTP service configuration pointed at the cloud provider's internal NTP server. That server stopped working on the 14th, and some days later Shard 3, then Shard 1, failed on us.

In Harmony node protocol operation, during a view change (i.e. when we need to elect a new leader), the expected leader sends a timestamp to all the validator nodes in the network. If that timestamp did not come from a node with its time synced, it would eventually crash the node receiving it. The crashed node would then rewind the blocks not yet saved to disk and remain stuck there.
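
To illustrate the failure mode, here is a minimal Python sketch of a defensive check a receiving validator could apply to a view-change timestamp. It is not the Harmony implementation: the function name `is_timestamp_plausible`, the `MAX_SKEW_SECONDS` tolerance and the idea of rejecting the message are all hypothetical, chosen only to show how a bounded-skew check could flag a timestamp from an unsynced leader.

```python
import time

# Hypothetical tolerance; the real protocol may use a different bound or none at all.
MAX_SKEW_SECONDS = 5.0


def is_timestamp_plausible(leader_timestamp, local_now=None, max_skew=MAX_SKEW_SECONDS):
    """Return True if the leader's timestamp is within max_skew seconds of the local clock."""
    if local_now is None:
        local_now = time.time()
    return abs(leader_timestamp - local_now) <= max_skew


now = 1_673_700_000.0  # fixed example value (mid-January 2023) so the output is repeatable
print(is_timestamp_plausible(now + 1.0, local_now=now))    # leader 1 s ahead   -> True
print(is_timestamp_plausible(now - 120.0, local_now=now))  # leader 2 min behind -> False
```

A check like this only helps if the receiving node's own clock is synced, which is one more reason every node should follow a reliable NTP source.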

As of now, only Harmony nodes can become leader, so there is no risk of a random validator in the network crashing a shard. However, in the near future (expected in 2023) there will be external validators in that role. It will be very important for every prospective external leader to keep their node's system time synced with a public NTP server.

What we did to remediate the problem:

  • reconfigured all our internal validator nodes to use a public NTP server
  • set up monitoring and alerting on time synchronisation
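
As a rough illustration of the monitoring item, the sketch below computes a clock offset from the four timestamps of an NTP request/response exchange, using the standard NTP offset and delay formulas, and compares it with an alert threshold. The threshold value and the `should_alert` helper are assumptions made for this example, not a description of our actual alerting setup.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Clock offset and round-trip delay from one NTP exchange (all values in seconds).

    t0: client send time      (client clock)
    t1: server receive time   (server clock)
    t2: server send time      (server clock)
    t3: client receive time   (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


# Hypothetical alert threshold; pick a value that suits your own tolerance.
ALERT_THRESHOLD_SECONDS = 0.5


def should_alert(offset, threshold=ALERT_THRESHOLD_SECONDS):
    """Return True when the absolute clock offset exceeds the threshold."""
    return abs(offset) > threshold


# Example: local clock 2 s behind the server, 0.25 s network latency each way,
# 0.25 s of processing time on the server.
offset, delay = ntp_offset_and_delay(t0=100.0, t1=102.25, t2=102.5, t3=100.75)
print(offset, delay)         # 2.0 0.5
print(should_alert(offset))  # True
```

In practice the four timestamps would come from an NTP client or daemon rather than being typed in by hand; the point here is only the arithmetic and the threshold comparison.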

What is next:

  • update the validator documentation to reflect the importance of time sync for future external leaders
  • review the view change protocol and identify whether we can get rid of the time component
  • decentralize fully, with external validators taking the role of leader (pending evaluation of how safe we think the network will be)

Soph, is this strictly an AWS issue or a cloud-based issue? Or, for those of us who run our own servers and automatically get time from the internet, is this a non-issue?


When you build a new server, the image is usually specific to the cloud provider. The NTP service configuration is up to them and might differ depending on the cloud provider, and even on the cloud provider's DC location.


Gotcha, so I should probably check my home server periodically for time drift? I’m checking to see if there’s a Linux util that does it automatically.


yeah and you can check this Discord

was this the 1291-1293 epochs?

I don’t have the exact epoch, but a rough calculation seems to match the date of the outages.