[POST MORTEM] JAN 2023 - Shard 1 and Shard 3 consensus loss

sophoah · January 20, 2023, 2:40am

Hello everyone,

The root cause of the two recent network outages in shard 3 and shard 1 has been identified: the leader nodes were not time synced.

What happened?
The S1 and S3 leader nodes were built in a DC with an image whose NTP service configuration used the cloud provider's internal NTP server. That server stopped working on the 14th, and some days later Shard 3 and then Shard 1 failed on us.

In the harmony node protocol, during a view change (i.e. when we need to elect a new leader), the expected leader sends a timestamp to all the validator nodes in the network. If that timestamp didn’t come from a node with its time synced, it would eventually crash the node receiving it. The node would then rewind the blocks not yet saved to disk and stay stuck there.

As of now only harmony nodes can become leader, so there is no risk of a random validator in the network crashing a shard. However, in the near future, expected in 2023, external validators will be able to become leader. It will be very important for all prospective external leaders to make sure their node's system time is synced with a public NTP server.

What we did to remediate the problem:

reconfigured all our internal validator nodes to use a public NTP server
added monitoring and alerting on time synchronisation
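
For anyone who wants a similar check on their own node, here is a minimal sketch of measuring the local clock offset against a public NTP server and flagging excessive drift. It is not the monitoring we run internally. The server name `pool.ntp.org`, the 0.5 second threshold and all function names are assumptions chosen for illustration.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def ntp_to_unix(seconds: int, fraction: int) -> float:
    """Convert a 64-bit NTP timestamp (seconds, 32-bit fraction) to Unix time."""
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset(t0: float, t1: float, t2: float, t3: float) -> float:
    """Standard NTP offset estimate.

    t0: client send time, t1: server receive time,
    t2: server transmit time, t3: client receive time.
    A positive result means the local clock is behind the server.
    """
    return ((t1 - t0) + (t2 - t3)) / 2


def query_offset(server: str = "pool.ntp.org", timeout: float = 5.0) -> float:
    """Send one SNTP request and return the estimated offset in seconds."""
    packet = b"\x1b" + 47 * b"\0"  # LI=0, version=3, mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t0 = time.time()
        sock.sendto(packet, (server, 123))
        data, _ = sock.recvfrom(512)
        t3 = time.time()
    if len(data) < 48:
        raise ValueError("short NTP response")
    # Bytes 32-39: server receive timestamp, bytes 40-47: server transmit timestamp.
    rx_s, rx_f, tx_s, tx_f = struct.unpack("!4I", data[32:48])
    return clock_offset(t0, ntp_to_unix(rx_s, rx_f), ntp_to_unix(tx_s, tx_f), t3)


def is_drift_excessive(offset: float, max_offset: float = 0.5) -> bool:
    """Return True when the absolute offset exceeds the (assumed) threshold."""
    return abs(offset) > max_offset


# Example use (needs outbound UDP port 123):
#   offset = query_offset()
#   if is_drift_excessive(offset):
#       print(f"ALERT: clock is off by {offset:+.3f}s")
```

Run periodically (for example from cron), a check like this would have raised an alert once the node's clock drifted past the threshold after its NTP source stopped answering.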

What is next:

update the validator documentation to reflect the importance of time sync for future external leaders
review the view change protocol and identify whether we can get rid of the time component
decentralize fully, with external validators taking the role of leader (pending evaluation of how safe we think the network will be)

Jimbo_JCR.one · January 20, 2023, 12:42pm

Soph, is this strictly an AWS issue or a cloud-based issue? Or, for those of us that run our own servers and automatically get time from the internet, is this a non-issue?

sophoah · January 21, 2023, 2:24am

When you build a new server, usually the image is specific to the cloud provider. The NTP service configuration is up to them and might differ depending on the cloud provider, and even on the cloud provider's DC location.

Jimbo_JCR.one · January 21, 2023, 2:39am

Gotcha, so I should probably check my home server periodically for time drift? I’m checking to see if there’s a Linux util that does it automatically.
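
For readers after the same thing, one possible approach on a systemd-based Linux server is to ask `timedatectl` whether the clock is currently synchronised. The sketch below assumes a reasonably recent systemd that supports `timedatectl show`; the function names are made up for illustration, and other setups (for example chrony without systemd-timesyncd) may need a different check.

```python
import subprocess


def parse_show_output(text: str) -> dict:
    """Parse the KEY=VALUE lines printed by `timedatectl show` into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' separator
            props[key.strip()] = value.strip()
    return props


def is_ntp_synchronized(props: dict) -> bool:
    """True when systemd reports the system clock as NTP-synchronised."""
    return props.get("NTPSynchronized") == "yes"


def check_local_sync() -> bool:
    """Run `timedatectl show` and report whether the clock is synchronised."""
    result = subprocess.run(
        ["timedatectl", "show"],
        capture_output=True,
        text=True,
        check=True,
    )
    return is_ntp_synchronized(parse_show_output(result.stdout))


# Example use from a cron job or a monitoring script:
#   if not check_local_sync():
#       print("WARNING: system clock is not NTP-synchronised")
```

This only tells you whether the local time service considers itself in sync; it does not measure the actual offset against an external server.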

sophoah · January 21, 2023, 1:17pm

yeah and you can check this Discord

Matthew_Lopez · January 24, 2023, 5:45am

was this the 1291-1293 epochs?

sophoah · January 26, 2023, 2:30am

i don’t have the exact epoch, but the rough calculation of it seems to match the date of the outages
