Hello everyone,
The root cause of the previous two network outages, in shard 3 and shard 1, has been identified: the leader nodes did not have their time synchronized.
What happened?
The S1 and S3 leader nodes were built in a DC from an image whose NTP service configuration pointed to the cloud provider's internal NTP server. That server stopped working on the 14th, and some days later Shard 3, then Shard 1, failed on us.
In the Harmony node protocol, during a view change (i.e. when we need to elect a new leader), the expected leader sends a timestamp to all the validator nodes in the network. If that timestamp did not come from a node with its time synced, it would eventually crash the node receiving it. The node would then rewind the blocks not yet saved to disk and remain stuck there.
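To make the failure condition concrete, here is a minimal sketch of how a receiving node could compare a leader's timestamp against its own clock. This is not Harmony's actual code: the function name `timestamp_is_plausible` and the 10-second tolerance are hypothetical and chosen only for illustration.

```python
# Hypothetical tolerance, in seconds, between the leader's timestamp
# and the receiving node's own clock. Not a Harmony protocol constant.
MAX_SKEW_SECONDS = 10.0


def timestamp_is_plausible(leader_ts, local_now, max_skew=MAX_SKEW_SECONDS):
    """Return True if the leader's timestamp is within max_skew of local time.

    Both arguments are Unix timestamps in seconds.
    """
    return abs(leader_ts - local_now) <= max_skew
```

A leader whose clock has drifted by several minutes would fail this kind of check, which is the situation the unsynced S1 and S3 leaders were in.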
As of now, only Harmony nodes can become leader, so there is no risk of a random validator in the network crashing a shard. However, in the near future (expected in 2023), external validators will be able to take the role of leader. It will be very important for all prospective external leaders to make sure their node's system time is synced with a public NTP server.
What we did to remediate the problem:
- reconfigured all our internal validator nodes to use a public NTP server
- set up monitoring and alerting on time synchronisation
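For validator operators who want a similar check, the following is a rough sketch of measuring a node's clock offset against a public NTP server using only the Python standard library. It is not the monitoring we run internally. The server name `pool.ntp.org`, the 0.5-second alert threshold, and the function names are assumptions for illustration. The offset formula is the standard NTP one.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_DELTA = 2208988800


def clock_offset(t0, t1, t2, t3):
    """Standard NTP offset estimate, in seconds.

    t0: client send time, t1: server receive time,
    t2: server transmit time, t3: client receive time.
    A positive result means the local clock is behind the server.
    """
    return ((t1 - t0) + (t2 - t3)) / 2.0


def is_drift_excessive(offset, threshold=0.5):
    """Return True if the absolute offset exceeds the (assumed) threshold."""
    return abs(offset) > threshold


def query_offset(server="pool.ntp.org", timeout=2.0):
    """Send one SNTP request and return the estimated local clock offset."""
    # First byte 0x1b: leap indicator 0, version 3, mode 3 (client).
    packet = b"\x1b" + 47 * b"\0"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t0 = time.time()
        sock.sendto(packet, (server, 123))
        data, _ = sock.recvfrom(512)
        t3 = time.time()
    # The 48-byte header is twelve 32-bit words; words 8-9 hold the
    # receive timestamp and words 10-11 the transmit timestamp.
    words = struct.unpack("!12I", data[:48])
    t1 = words[8] + words[9] / 2**32 - NTP_EPOCH_DELTA
    t2 = words[10] + words[11] / 2**32 - NTP_EPOCH_DELTA
    return clock_offset(t0, t1, t2, t3)


if __name__ == "__main__":
    offset = query_offset()
    status = "ALERT" if is_drift_excessive(offset) else "ok"
    print(f"clock offset: {offset:+.3f}s ({status})")
```

Run periodically (for example from cron), a script like this can feed an existing alerting system. In production, the status reported by the node's NTP daemon is usually a better signal than a one-off query.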
What is next:
- update the validator documentation to reflect the importance of time synchronisation for future external leaders
- review the view change protocol and identify whether we can get rid of the time component
- decentralize fully, with external validators taking the role of leader (pending evaluation of how safe we think the network will be)