Postmortem
Event: Network Outage
Date(s): January 12, 2022 to January 14, 2022
The focus of this postmortem is to describe the events leading to the outage that occurred on January 13th, 2022. Its purpose is not only to offer transparency to our community and ecosystem, but also to present our lessons learned and our action plan.
TL;DR
The network was down for 19 hours between January 12 and January 14, 2022. Based on current evidence, the root cause was an overwhelming amount of spam traffic directed at leader nodes, combined with suboptimal handling of high traffic in the pre-released binary v4.3.2, which together created a cascade of network and block synchronization issues.
Summary
The network suffered a block synchronization issue at 2:22 PM PST on January 12th, causing all RPC nodes on shard 0 to become stuck. This cascaded into external validator nodes also becoming stuck in their attempts to synchronize with shard 0. Harmony worked to diagnose and resolve the issue alongside the validator community while on-chain transactions were failing. The issue was resolved at approximately 1:30 AM PST, bringing the outage duration to approximately 11 hours.
The following morning at 7:00 AM PST, on January 13th, 2022, we observed the same situation: shard 0 had become stuck and on-chain transactions were unsuccessful. The Harmony team quickly ran through similar steps as the day prior and worked together with the validator community to resolve the issue.
The same issue occurred once more later that day at 6:50 PM PST. The Harmony team joined a conference call and began diagnosing the issue immediately while communicating with the community. Network traffic was observed to be more than 100x above average, indicating either that the network was under a flood attack or that an issue was occurring at the p2p level. We updated the code to mitigate this, deployed new binaries, communicated with our communities across live-chat channels, and worked diligently through the night to resolve the problem.
The issue was resolved at approximately 1:00 PM PST on January 14th, 2022, bringing this outage duration to 19 hours. Based on the evidence at hand, the issue was the result of an overwhelming amount of spam traffic across leader nodes, combined with suboptimal handling of high traffic in the pre-released binary v4.3.2.
Detailed Events
The Harmony team updated all internal validator nodes on the morning (PST) of January 12th, 2022, with the pre-released binary v4.3.2 to prepare for the new release. Hours later, around 2:22 PM PST, all RPC nodes on shard 0 became stuck at an orphaned block at height 21651687, failing to synchronize with the latest blocks. The shard 0 validator committee was still generating new blocks without problem, and no signing issues were reported by external validators. The Harmony team immediately jumped in to debug the issue and found the root cause at around 4 PM PST.
All the necessary steps were taken to recover the RPC nodes, including 1) reverting the orphaned block on the RPC nodes so they could continue synchronizing and 2) switching the state sync DNS endpoint to the Harmony validator nodes so RPC nodes could sync to the latest blocks. RPC nodes and all validators on shards 1, 2, and 3 were recovered and fully synced with shard 0 at around 12:00 AM PST on January 13th, at which point RPC services resumed serving user traffic normally.
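For context, "reverting the orphaned block" amounts to rolling the affected node's chain head back below the bad block so it can re-sync the canonical chain from the DNS sync endpoint. The sketch below illustrates the idea with a hypothetical chain interface; it is not Harmony's actual tooling or recovery command.

```go
package recovery

// Chain is a hypothetical interface standing in for the node's chain database.
type Chain interface {
	CurrentHeight() uint64
	SetHead(height uint64) error // discard all blocks above `height`
}

// revertOrphanedBlock rewinds a node that accepted the orphaned block at
// orphanHeight (21651687 in this incident) so state sync can resume from the
// committee's canonical chain.
func revertOrphanedBlock(c Chain, orphanHeight uint64) error {
	if c.CurrentHeight() < orphanHeight {
		return nil // node never accepted the orphaned block; nothing to do
	}
	return c.SetHead(orphanHeight - 1)
}
```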
Around 7:00 AM PST on January 13th, the same issue occurred at block 21681128. The team repeated the procedure to recover all nodes and RPC services. Once more, around 3 PM PST on January 13th, shard 1 stopped producing blocks, and shard 0 stopped producing blocks approximately three hours later. The Harmony team immediately created a conference call to discuss the situation, diagnose the issue, coordinate communication with the community, and work to determine the root cause.
The Harmony team performed a series of changes based on the evidence found during the investigation.
Actions Taken and Fixes:
- Made the leader broadcast the committed message early, before block insertion: send committed message before block insertion by rlan35 · Pull Request #3987 · harmony-one/harmony · GitHub
- Tried libp2p rate limiting ([P2P] Add a rate limiter at p2p pubsub message handling by JackyWYX · Pull Request #3998 · harmony-one/harmony · GitHub), but found that the rate-limited messages were old and sent from legitimate validators, suggesting that either the attacker found a way to relay other peers' messages without revealing its identity, or the p2p network was resending old messages repeatedly by itself.
- Fine-tuned the libp2p message broadcasting configuration to stop broadcast "echoing" and upgraded the libp2p version: p2p tuning and module upgrade by LeoHChen · Pull Request #4005 · harmony-one/harmony · GitHub
- Made the leader optimistically re-propose the already proposed block if consensus is not reached: Keep proposing the same block by rlan35 · Pull Request #4000 · harmony-one/harmony · GitHub
- Stopped reusing the prepared block in view change due to an invalid block error: [fixme] ignore M1 in view change to reach consensus by LeoHChen · Pull Request #3996 · harmony-one/harmony · GitHub
Based on the evidence at hand, the issue was the result of an overwhelming amount of spam traffic across leader nodes, combined with suboptimal handling of high traffic in the pre-released binary v4.3.2. The pre-released binary v4.3.2 has since been rolled back on all validator nodes, and various fixes, including the tuning of the libp2p configuration, were deployed, fully recovering the network.
Root Cause Analysis
Issue of Orphaned Block in Pre-Release Binary 4.3.2
At block 21651687, the shard 0 leader successfully collected the ⅔ commit signatures but failed to broadcast the ⅔ commit message before the view change protocol kicked in. The leader couldn't broadcast the ⅔ commit message in time because it was spammed with a sudden spike of transaction requests: it took around 30 seconds to store the block into the chain database, and only after that was the ⅔ commit message sent. Since the rest of the validators didn't receive the ⅔ commit message in time, they all started the view change protocol, which aims to replace a malfunctioning leader and continue consensus.
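PR #3987 addresses this by reordering the leader's steps so the committed (⅔ commit) message is broadcast before the slow block insertion. The sketch below illustrates that ordering with hypothetical types and names; it is not the actual Harmony code.

```go
package consensus

// Hypothetical types standing in for Harmony's consensus and chain objects.
type Block struct{}
type CommitQuorum struct{}
type CommittedMessage struct{}

// Chain abstracts the database write that took ~30 seconds under spam load.
type Chain interface {
	InsertBlock(b *Block) error
}

// Leader holds only what the sketch needs: the chain and a broadcast function.
type Leader struct {
	chain     Chain
	broadcast func(*CommittedMessage) error
}

func buildCommittedMessage(b *Block, q *CommitQuorum) *CommittedMessage {
	return &CommittedMessage{}
}

// finalizeBlock shows the ordering change: broadcast the committed message
// first, then perform the slow block insertion, so validators do not time out
// into a view change while the leader is still writing to disk.
func (l *Leader) finalizeBlock(b *Block, q *CommitQuorum) error {
	msg := buildCommittedMessage(b, q)

	// Before the fix, InsertBlock ran first and could take ~30 seconds under
	// heavy transaction spam, so the committed message arrived only after the
	// validators' view change timer had already fired.
	if err := l.broadcast(msg); err != nil {
		return err
	}
	return l.chain.InsertBlock(b)
}
```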
The validators received the prepared message for the block at height 21651687 with ID 0x005399509311b2bd669c8cfabf3be2a238b743a8e865c40cec77f866c6d9591b (referred to as block 591b), but didn't receive the ⅔ commit message due to the failure on the leader described above. The view change protocol should have proceeded with the original prepared block 591b, but due to a bug in the view change implementation, the validators didn't reuse the prepared block to continue consensus and instead produced a new block, 0x5446997a30537d02b5370643e1b22a5beea4a856ec39a8efd2279b46660208db, at the same height. The following log records show the validators failing to reuse the prepared block 591b during view change.
{"level":"info","myBlock":21651687,"myViewID":21652342,"phase":"Prepare","mode":"Normal","MsgBlockNum":21651687,"MsgViewID":21652342,"caller":"/home/sp/harmony/harmony/consensus/validator.go:191","time":"2022-01-12T22:22:42.486767661Z","message":"[OnPrepared] Received prepared message"}
{"level":"info","myBlock":21651687,"myViewID":21652342,"phase":"Prepare","mode":"Normal","caller":"/home/sp/harmony/harmony/consensus/validator.go:96","time":"2022-01-12T22:22:43.610974447Z","message":"[validateNewBlock] Block Already verified"}
{"level":"info","myBlock":21651687,"myViewID":21652342,"phase":"Prepare","mode":"Normal","blockNum":21651687,"blockHash":"005399509311b2bd669c8cfabf3be2a238b743a8e865c40cec77f866c6d9591b","caller":"/home/sp/harmony/harmony/consensus/validator.go:178","time":"2022-01-12T22:22:43.616193434Z","message":"[sendCommitMessages] Sent Commit Message!!"}
{"level":"warn","myBlock":21651687,"myViewID":21652342,"phase":"Commit","mode":"ViewChanging","nextViewID":21652343,"viewChangingID":21652343,"timeoutDuration":27000,"NextLeader":"5ecf182ef0cfe449470d900dbe4ce312d534c882765e639c59525f2710d9f96a04abd871e4f0afadc1bd4b07a54ea106","caller":"/home/sp/harmony/harmony/consensus/view_change.go:265","time":"2022-01-12T22:23:08.267356266Z","message":"[startViewChange]"}
{"level":"info","context":"ViewChange","viewID":21652343,"blockNum":21651687,"caller":"/home/sp/harmony/harmony/consensus/view_change_construct.go:436","time":"2022-01-12T22:23:08.26752208Z","message":"[InitPayload] add my M2 (NIL) type messaage"}
The original leader, slow in broadcasting block 591b, eventually sent out the ⅔ commit message for the block after a 30-second delay. Unfortunately, by the time the ⅔ commit message was sent, the rest of the validators had already started the view change protocol, so the message was ignored. However, all the non-validator nodes, unaware of the view change in progress, received the ⅔ commit message and accepted block 591b. As a result, they could not accept the canonical blocks agreed upon by the shard 0 committee, and they were all stuck at block 591b.
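The intended view change behavior is for validators to carry the already-prepared block into the new view rather than producing a new block at the same height. The following sketch uses hypothetical types and names, not Harmony's actual consensus code, to show the difference between the two behaviors.

```go
package consensus

// Block is a minimal stand-in for a proposed block.
type Block struct {
	Number uint64
	Hash   [32]byte
}

// ViewChangeState captures what a validator knows when it enters a view change.
type ViewChangeState struct {
	Height        uint64
	PreparedBlock *Block // block 591b in this incident, if the prepared message was received
}

// proposalForNewView returns the block consensus should continue with after a
// view change. The bug was equivalent to always taking the second branch,
// which produced a second block at height 21651687 and left every node that
// had already accepted 591b unable to follow the canonical chain.
func proposalForNewView(s ViewChangeState, buildNewBlock func(height uint64) *Block) *Block {
	if s.PreparedBlock != nil && s.PreparedBlock.Number == s.Height {
		// Reuse the block that already reached the prepared phase.
		return s.PreparedBlock
	}
	return buildNewBlock(s.Height)
}
```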
Issue with P2P Broadcasted Message “Echoing”
Network inbound and outbound traffic increased by around 100 times throughout the period when shard 0 and shard 1 failed to produce blocks. The downtime of the shards was highly correlated with the spike in network traffic on those shards.
Initially, a p2p network flooding attack was suspected because validator nodes logged receipt of a substantial number of stale p2p messages (e.g. view change and block messages). We suspected that an attacker was storing p2p broadcast messages and later flooding them into the network to squeeze resources on the normal validator nodes. We experimented with rate limiting based on message sender ID and discovered that the senders of the flooding messages were actually legitimate, well-behaving validators. This points to two potential causes:
- The attacker was able to craft the flooding messages while hiding its own sender ID; or
- The flooding messages kept being resent and "echoing" in the p2p broadcast network.
Since libp2p inherently has spam protection based on peer identity and signature checking, we suspect the second cause is more likely the root cause.
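For illustration, the per-sender rate limiting we experimented with (PR #3998) is conceptually a token bucket per sender ID. The sketch below is a minimal, hypothetical version of that idea using golang.org/x/time/rate; the thresholds and wiring are assumptions, not the actual implementation.

```go
package p2p

import (
	"sync"

	"golang.org/x/time/rate"
)

// SenderRateLimiter keeps one token bucket per message sender ID.
type SenderRateLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	limit    rate.Limit // sustained messages per second per sender
	burst    int        // short burst allowance
}

func NewSenderRateLimiter(perSecond float64, burst int) *SenderRateLimiter {
	return &SenderRateLimiter{
		limiters: make(map[string]*rate.Limiter),
		limit:    rate.Limit(perSecond),
		burst:    burst,
	}
}

// Allow reports whether a message from the given sender should be processed.
// In this incident, the senders that tripped the limiter turned out to be
// legitimate validators, which is what pointed the investigation toward
// message echoing rather than a simple flood from a single attacker.
func (s *SenderRateLimiter) Allow(senderID string) bool {
	s.mu.Lock()
	l, ok := s.limiters[senderID]
	if !ok {
		l = rate.NewLimiter(s.limit, s.burst)
		s.limiters[senderID] = l
	}
	s.mu.Unlock()
	return l.Allow()
}
```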
In an attempt to stop potential "echoing" of p2p messages, the p2p broadcasting configuration was tuned down, reducing several thresholds that could cause excessive message fanout or re-transmission. After the parameter change was deployed to the validator nodes, the flooding traffic immediately disappeared and network traffic returned to its normal level.
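The parameters in question are the gossipsub mesh and gossip settings exposed by go-libp2p-pubsub, which control how widely each message is forwarded and how aggressively old message IDs keep being advertised. The sketch below shows the kind of tuning involved; the specific values are illustrative assumptions, not the values deployed in PR #4005.

```go
package p2p

import (
	"context"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/host"
)

// newTunedPubSub builds a gossipsub router with a smaller mesh and less
// gossip re-transmission than the library defaults.
func newTunedPubSub(ctx context.Context, h host.Host) (*pubsub.PubSub, error) {
	params := pubsub.DefaultGossipSubParams()

	// Keep the mesh small so each message is forwarded to fewer peers.
	params.D = 6
	params.Dlo = 4
	params.Dhi = 8

	// Announce message IDs (IHAVE gossip) to fewer peers and for fewer
	// heartbeats, limiting how long old messages keep circulating.
	params.Dlazy = 6
	params.GossipFactor = 0.15
	params.HistoryGossip = 2

	return pubsub.NewGossipSub(ctx, h, pubsub.WithGossipSubParams(params))
}
```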
We suspect the echoing was triggered by certain network conditions, such as a traffic spike or a change in traffic pattern (the RPC issue and stuck blocks certainly caused a lot of chaos, such as excessive view change messages being sent). That would explain why the network flooding happened soon after the orphaned block incident.
Lessons Learned
- Change Management
The blockchain space often requires teams to remain nimble and adapt to the fast pace of the industry. Despite this, a slower pace could occasionally be used in the rollout of critical code changes. We can accomplish this by adopting a simple, lightweight change management process. See "Next Steps" below.
- Flow of Communication
We received positive feedback from the community on our communication efforts throughout the troubleshooting process, but improvement should never stop. One way we plan to keep improving is by documenting, in advance, the specific channels where updates will be shared with the community. See "Next Steps" below.
Next Steps
- The action plan for moving forward, the Post Outage Response Plan, can be found using the link below. It will be updated over time as progress is made.
Timeline of January 13th Events
Timeframe: 7:45 PM > 8:00 PM
- Discover beacon shard block stuck via proactive monitoring
- Priority-1 conference call is created for all infrastructure team members. All hands on deck.
Timeframe: 8:00 PM > 10:00 PM
- Doc for unblocking validator nodes updated and re-shared with external validators
- Walked validators through the process of unblocking their local copy of the beacon shard.
Timeframe: 10:00 PM > 12:00 AM
- Upgraded all Shard-0 internal validators with new v4 binary v7215-v4.3.1-4-g073df164
- DNS entries removed to limit traffic while we investigate block halt events.
- Traffic analysis performed
Timeframe: 12:00 AM > 2:00 AM
- Request for shard-0 node operators to update their binary to: v4.3.1-4-g073df164
- Assist validators with the process / answer questions in #mainnet-nodes (Discord), Validators (Telegram), and Validator Incubator (Telegram).
Timeframe: 2:00 AM > 4:00 AM
- Request for shard-1 node operators to update their binary to: v4.3.1-4-g073df164
- Chain started moving. Stopped shortly after.
- Observe spam traffic at the P2P level.
- Action plan developed to combat potential spam transactions.
- Whitelisting internal nodes on traffic filters to better understand the source of p2p spam and alleviate the issue.
- Additional PRs created.
- Push to all internal validators.
Timeframe: 4:00 AM > 6:00 AM
- Saw improvement but lacked voting power on S0.
- Coding additional fixes.
- New branch for v4 created with a number of fixes.
- Shard0 recovery v4 branch installed on internal S0 nodes.
- Testing on Jacky’s local node is successful. Publishing fixes including whitelist.
- Another build is pushed to the repository which includes additional fixes.
Timeframe: 6:00 AM > 8:00 AM
- Merged another set of fixes and applied them to internal nodes. Noticed a significant drop in unwanted traffic. Watched to ensure legitimate traffic from internal or other legitimate nodes was not blocked.
- Monitored the new build on SIN and IAD servers with promising results.
- Expanding this to additional regions. Deploy the release to additional RPC nodes.
- Installed build to Shard0, Shard1 RPC nodes. Installing on Shard2 and Shard3 RPC nodes.
- Installed the build on an internal Shard0 validator node.
- Monitoring and observing a significant reduction in traffic.
Timeframe: 8:00 AM > 10:00 AM
- Legitimate external validator nodes were identified, and shard 0 validator nodes were reverted to the previous code (without p2p rate limits).
Timeframe: 10:00 AM > 12:00 PM
- Additional traffic analysis.
- A v4 build with Jacky's latest DDoS PR is created.
Timeframe: 12:00 PM > 2:00 PM
- The new binary resulted in a further reduction in P2P traffic. Pushed to the validator nodes.
- Shards begin producing blocks. Bandwidth utilization on external validators lowers significantly.
- Continue to see issues where the local beacon shard is slow to synchronize…
- Team finds a potential solution for the delay. Upgrades the DNS nodes.
Timeframe: 2:00 PM > 4:00 PM
- Work on shard-3 DNS nodes is completed. DNS is updated for all shards now.
- Validators begin updating shard-3 nodes to new binary.