Postmortem: January Network Outage

Postmortem

Event: Network Outage
Date(s): January 12, 2022 to January 14, 2022


The focus of this postmortem is to describe the events leading to the outage which occurred on January 13th, 2022. The purpose of this writing is to not only offer transparency to our community and ecosystem, but also to present our lessons learned and action plan.


TLDR;

The network was down for 19 hours, between January 12, 2022 to January 14, 2022. Based on current evidence, the root cause(s) stem from the fact that the network received an overwhelming amount of spam traffic across leader nodes, combined with suboptimal handling of high traffic in the pre-released binary v4.3.2, created a cascade of network and block synchronization issues.

Summary

The network suffered a block synchronization issue at 2:22 PM PST January 12th, resulting in all RPC nodes on shard-0 to become stuck. This cascaded into an issue where external validator nodes were also stuck in their attempts to synchronize with shard-0. Harmony worked to diagnose and resolve the issue alongside the validator community while on-chain transactions were beginning to fail throughout. The issue was resolved at approximately 1:30AM PST, bringing the outage duration to approximately 11 hours.

The following morning at 7:00 AM PST, on January 13th, 2022, we observed the same situation; shard-0 had become stuck and on-chain transactions were unsuccessful. The Harmony team quickly ran through similar steps as the day prior, and worked together with the validator community to resolve the issue.

The same issue occurred once more later that day at 6:50 PM PST. The Harmony team joined together on a conference call and began diagnosing the issue immediately while communicating with the community. It was observed that network traffic was more than 100x above the average, indicating either the network was under a flood attack or an issue was occurring at the p2p level. We updated the code to mitigate this and deployed new binaries, communicated to our communities across live-chat channels, and the team worked diligently through the night to resolve the problem.

The issue was resolved at approximately 1:00 PM PST on January 14th, 2022, bringing this outage duration to 19 hours. Based on the evidence at hand, the issue was the result of an overwhelming amount of spam traffic across leader nodes, combined with suboptimal handling of high traffic in the pre-released binary v4.3.2.

Detailed Events

The Harmony team updated all internal validator nodes in the PST morning of January 12th, 2022, with pre-released binary v4.3.2 to prepare for the new release. Hours later, around 2:22 PM PST, all RPC nodes on shard 0 were stuck at an orphaned block at height 21651687, failing to synchronize with the latest blocks. The shard 0 validator committee was still generating new blocks without problem, and no signing issues were reported by external validators. The Harmony team immediately jumped in to debug the issue and found the root cause at around 4PM.

All the necessary steps were taken to recover the RPC nodes including 1) reverting the orphaned block on the RPC nodes so they continue synchronizing and 2) switching the state sync DNS endpoint to the Harmony validator nodes so RPC nodes can sync to the latest blocks. RPC nodes and all validators on shard 1,2,3 were recovered and fully synced with shard 0 at around 12:00 AM PST on Jan 13th, at which point RPC services resumed serving user traffic normally.

Around 7:00 AM PST January 13th, the same issue occurred at block 21681128. The team repeated the procedure to recover all nodes and RPC services. Once more, around 3 PM PST Jan 13th, shard 1 stopped producing blocks, and shard 0 stopped producing blocks approximately three hours later. The Harmony team immediately created a conference call to discuss the situation, diagnose the issue, coordinate communication with the community, and work to determine the root cause.

The Harmony team performed a series of changes based on the evidence found during the investigation.

Actions Taken and Fixes:

Based on the evidence at hand, the issue was the result of an overwhelming amount of spam traffic across leader nodes, combined with suboptimal handling of high traffic in the pre-released binary v4.3.2. As of now, pre-released binary 4.3.2 was rolled back on all validator nodes and various fixes including the tuning of libp2p configuration were deployed which fully recovered the network.

Root Cause Analysis

Issue of Orphaned Block in Pre-Release Binary 4.3.2

At block 21651687, the shard 0 leader successfully collected ⅔ of the commit message but failed to broadcast it soon enough before the view change protocol kicked in. The reason why the leader couldn’t broadcast the ⅔ commit message in time is because the leader got spammed with a sudden spike of transaction requests and it took around 30 seconds to store the block into the chain database after which the ⅔ commit message was sent. Since the rest of the validators didn’t receive the ⅔ commit message in time, they all started the view change protocol, which aims to switch the malfunctioning leader and continue the consensus.

The validators received the prepared message for block at height 21651687 with ID 0x005399509311b2bd669c8cfabf3be2a238b743a8e865c40cec77f866c6d9591b (referred as block 591b), but didn’t receive the ⅔ commit message due to the failure on the leader (described above). The view change protocol should’ve proceeded with the original prepared block 591b, but due to a bug in the view change implementation, the validators didn’t reuse the prepared block to continue the consensus but instead produced a new block 0x5446997a30537d02b5370643e1b22a5beea4a856ec39a8efd2279b46660208db at the same height. The following log records showed the failure of validators to reuse the prepared block 591b for view change.

{“level”:“info”,“myBlock”:21651687,“myViewID”:21652342,“phase”:“Prepare”,“mode”:“Normal”,“MsgBlockNum”:21651687,“MsgViewID”:21652342,“caller”:"/home/sp/harmony/harmony/consensus/validator.go:191",“time”:“2022-01-12T22:22:42.486767661Z”,“message”:"[OnPrepared] Received prepared message"}
{“level”:“info”,“myBlock”:21651687,“myViewID”:21652342,“phase”:“Prepare”,“mode”:“Normal”,“caller”:"/home/sp/harmony/harmony/consensus/validator.go:96",“time”:“2022-01-12T22:22:43.610974447Z”,“message”:"[validateNewBlock] Block Already verified"}

{“level”:“info”,“myBlock”:21651687,“myViewID”:21652342,“phase”:“Prepare”,“mode”:“Normal”,“blockNum”:21651687,“blockHash”:“005399509311b2bd669c8cfabf3be2a238b743a8e865c40cec77f866c6d9591b”,“caller”:"/home/sp/harmony/harmony/consensus/validator.go:178",“time”:“2022-01-12T22:22:43.616193434Z”,“message”:"[sendCommitMessages] Sent Commit Message!!"}

{“level”:“warn”,“myBlock”:21651687,“myViewID”:21652342,“phase”:“Commit”,“mode”:“ViewChanging”,“nextViewID”:21652343,“viewChangingID”:21652343,“timeoutDuration”:27000,“NextLeader”:“5ecf182ef0cfe449470d900dbe4ce312d534c882765e639c59525f2710d9f96a04abd871e4f0afadc1bd4b07a54ea106”,“caller”:"/home/sp/harmony/harmony/consensus/view_change.go:265",“time”:“2022-01-12T22:23:08.267356266Z”,“message”:"[startViewChange]"}

{“level”:“info”,“context”:“ViewChange”,“viewID”:21652343,“blockNum”:21651687,“caller”:"/home/sp/harmony/harmony/consensus/view_change_construct.go:436",“time”:“2022-01-12T22:23:08.26752208Z”,“message”:"[InitPayload] add my M2 (NIL) type messaage"}

The original leader that’s slow in broadcasting the block 591b eventually sent out the ⅔ commit message for the block after a 30s delay. Unfortunately, when the ⅔ commit message was sent, the rest of the validators already started the view change protocol so the message was ignored by the validators. However, all the non-validator nodes, without being aware of the view change happening, received the ⅔ commit message and accepted the block 591b. Therefore, they can not accept the canonical blocks that’s in the chain agreed by the shard 0 committee. The consequence is that they were all stuck with block 591b.

Issue with P2P Broadcasted Message “Echoing”

Network Inbound and outbound traffic significantly increased by around 100 times throughout the time when shard 0 and shard 1 failed to produce blocks. The downtime of shards was highly correlated with the hiked network traffic on the shards.

Initially, a p2p network flooding attack was suspected because validator nodes logged receipt of a substantial number of stale p2p messages (e.g. view change and block messages). We suspected that an attacker was storing p2p broadcasted messages and later flooding them into the network to cause resource squeeze on the normal validator nodes. We experimented with rate limiting based on message sender ID and discovered that sender of the flooding messages are actually from legit validators that’s known to be well behaving validators. That directs to 2 potential causes:

  1. the attacker was able to craft the flooding messages and hide it’s own sender ID;

  2. Or the flooding messages kept being resent and “echoing” in the p2p broadcasting network. Since libp2p inherently has the spam protection based on peer identity and signature checking, we suspect #2 is more likely the root cause.

In an attempt to stop potential “echoing” of p2p messages, p2p broadcasting configuration was tuned down to reduce many thresholds that may cause excessive message fanout or re-transmission. After the parameter change was deployed to validator nodes, the flooding traffic was immediately removed and network traffic returned to normal level as before.

It is suspected that the echoing was triggered when certain network conditions occur, such as traffic spike or traffic pattern change (the rpc issue and stuck blocks definitely caused lots of chaos such as excessive view change messages being sent). That explains why the network flooding happened coincidently soon after the orphaned block incident happened.

Lessons Learned

  • Change Management

The blockchain space often requires teams to remain nimble and able to adapt to the fast pace of the industry. Despite this, a slower pace could occasionally be used in the rollout of critical code changes. We could accomplish this by adopting an easy, simple change management process. See “Next Steps” below.

  • Flow of Communication

We received positive feedback from the community on our communication efforts through the troubleshooting process, but improvement should never stop. One way we plan to continue improving has us documenting, in advance, the specific areas where communication will be shared with the community. See “Next Steps” below.

Next Steps

  • The action plan for moving forward can be found using the link below. This document will be updated as progress is made.

The Post Outage Response Plan below will be updated over time as progress is made.


Timeline of January 13th Events

Timeframe: 7:45 PM > 8:00 PM

  • Discover beacon shard block stuck via proactive monitoring
  • Priority-1 conference call is created for all infrastructure team members. All hands on deck.

Timeframe: 8:00 PM > 10:00 PM

  • Doc for unblocking validator nodes updated and re-shared with external validators
  • Walk validators through the process of unblocking local copy of beacon shard.

Timeframe: 10:00 PM > 12:00 AM

  • Upgraded all Shard-0 internal validators with new v4 binary v7215-v4.3.1-4-g073df164
  • DNS entries removed to limit traffic while we investigate block halt events.
  • Traffic analysis performed

Timeframe: 12:00 AM > 2:00 AM

  • Request for shard-0 node operators to update their binary to: v4.3.1-4-g073df164
  • Assist validators with the process / answer questions in #mainnet-nodes (Discord), Validators (Telegram), and Validator Incubator (Telegram).

Timeframe: 2:00 AM > 4:00 AM

  • Request for shard-1 node operators to update their binary to: v4.3.1-4-g073df164
  • Chain started moving. Stopped shortly after.
  • Observe spam traffic at the P2P level.
  • Action plan developed to combat potential spam transactions.
  • Whitelisting internal nodes on traffic filters to better understand the source of p2p spam and alleviate the issue.
  • Additional PRs created.
  • Push to all internal validators.

Timeframe: 4:00 AM > 6:00 AM

  • See improvement but lack voting power on S0.
  • Coding additional fixes.
  • New branch for v4 created with a number of fixes.
    • Shard0 recovery v4 branch installed on internal S0 nodes.
  • Testing on Jacky’s local node is successful. Publishing fixes including whitelist.
  • Another build is pushed to the repository which includes additional fixes.

Timeframe: 6:00 AM > 8:00 AM

  • Merged another set of fixed and applied to internal nodes. Noticed a significant drop in unwanted traffic. Watching to ensure legitimate traffic from internal nodes or legits nodes were not blocked
  • Monitored the new build on SIN and IAD servers with promising results.
    • Expanding this to additional regions. Deploy the release to additional RPC nodes.
  • Installed build to Shard0, Shard1 RPC nodes. Installing on Shard2 and Shard3 RPC nodes.
  • Installed the build on an internal Shard0 validator node.
    • Monitoring and observing a significant reduction in traffic.

Timeframe: 8:00 AM > 10:00 AM

  • Legitimate external validator nodes were identified and shard 0 validator node were reverted to previous code (without p2p rate limits)

Timeframe: 10:00 AM > 12:00 PM

  • Additional traffic analysis.
  • v4 build with Jacky’s latest ddos PR is created.

Timeframe: 12:00 PM > 2:00 PM

  • New binary push resulted in more of a reduction in P2P traffic. Pushed to the validator nodes.
  • Shards begin producing blocks. Bandwidth utilization on external validators lowers significantly.
  • Continue to see issues where the local beacon shard is slow to synchronize…
  • Team finds a potential solution for the delay. Upgrades the DNS nodes.

Timeframe: 2:00 PM > 4:00 PM

  • Work on shard-3 DNS nodes is completed. DNS is updated for all shards now.
  • Validators begin updating shard-3 nodes to new binary.
28 Likes

Thank you for the hard work and great write-up :pray:

6 Likes

Thank you for the summary and the relentless work put in by the team!
Was amazing to see all the validators and Harmony team working together and responding so quickly.

Think we already came out much stronger out of this situation and really excited to see further building on what we learned here :slightly_smiling_face::blue_heart:

5 Likes

Thank you for evey teaching in this post. Congratulations all in the fast reactions.

Thank you for this succinct explanation! It was quite impressive how quickly the team was able to resolve the issue, and increases my confidence in it.

Thank you, team!

“Obstacles are designed to make you stronger, only the weak avoid them”

:blue_heart:

1 Like

Great write up Daniel and thanks to the Harmony team for working so diligently to solve this while members like Soph and you skipped sleep to keep the communication up.

Based on the evidence at hand, the issue was the result of an overwhelming amount of spam traffic across leader nodes, combined with suboptimal handling of high traffic in the pre-released binary v4.3.2.

Curious if the incident would have still happened if the internal nodes stayed on v4.3.1 or would the spam attack itself have been enough to cause consensus to stop?

2 Likes

Hey we had some feedback from Saturday’s VDAO twitter space. Some were considered that sharing the Binary links on social media might open the possibility for a 3 party to access the file and maliciously alter the file and harm the validators that are installing the binary.

Any thoughts on this?

Great job and great RCA, and thanks for the updates team. Keep u the good work

Great to see the transaparency.
30 gwei is 30 times the old fees for old users atleast. Any plans on reducing the fees to old levels or increasing it gradually. Or is their any phased roadmap for for increment in gas?

Great job thanks for bringing the network back again.

I wish to write some comments observations, feedback and suggestions. I will not pretend I know what you are talking about 100% in the article and excuse potently being a bit ignorant. I would like to use this opportunity to offer food for thought and creative thinking.

From my expire in the Photonics RnD department of Marconi we were developing nodes for the fibre internet. Each node reminds me of the validator nodes. Those nodes we were designing them to withstand extreme network loads, network malicious attacks and the echoing you mentioned. I can’t remember exactly but the code was regulating traffic even if it was attacked and in extreme occasions the node was auto reset to unblock “stuck packages”. I know we were over 600 engineers only on my site and the code was extremely complex using C++. (If I ever find my old notes I will share)

Furthermore, not only the nodes are regulating a network attack but also redirecting the network’s “good” information to other channels. I assume what you may call at Harmony shards.

That why we always have internet!

Anyway, the point is with your resources and the community resources can we look into similar functionally for the blockchain? Have you heard of HOLO chain? I think they might have something going on there. The network (in theory at least) does not fail just some notes close and reset.

As another more customer-oriented observations and as the comment above says GWEI is 30x more. That might be fine as long as the price of $ONE is low but at the end of the day it kind of resembles what governments do, increasing taxes to regulate the economy.

30 GWEI it is fine for now don’t get me wrong however on a strategic, marketing and competitive advantaged scenarios to the end used it is like being taxed.

If the attack was from a trading bot as I hear in the media channels the 30gwei might have stopped them now but someone will reconfigure their bot so they can make profit with 30gwei.

On the competitive advantage over ETH and other networks Harmony want to build bridges. However, if the gwei might need to increase again by the time ETH2 is out Harmony might lose its position as a bridge maker.

This is a creative thinking process asking questions analysing an event and how we can make things better.

So, my question is do you think that potentially shard hoping blocks alleviate any syncing and stuck blocks?

Could NFT be a solution offering a decentralised way that protect anonymity but root out malicious attacks and malicious profiteering bots?

Think nature we don’t have to go far look at our bodies and its immune system…. NFT antibodies? Thinking out loud

I know some things are easier said than done however, I am instigating a creative thinking process and offering some food for thought and happy help anyway I can to take this forward and keep researching as well as having an open creative dialogue and see what can be found and achieved.

Thanks again and let’s build bigger and better