Weekly OnCall Rotation
OnCall US Pacific: Leo (leo@harmony.one)
OnCall Asia/EMEA: Soph (soph@harmony.one)
Duration: 11/16 8:30am - 11/23 8:30am PST
Summary
Details
11/16:
- Node OOS: PagerDuty
- Multisig Frontend Down/Up - uptimerobot
- Node OOS: PagerDuty
- Batch removed all 'latest/*.gz' archives from all nodes to free disk space on the root volume (a batch-cleanup sketch follows below)
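For reference, a cleanup like this can be scripted. The sketch below is only an illustration: it assumes passwordless SSH as ec2-user, a hypothetical nodes.txt inventory file, and the archives sitting under ~/latest/ on each node; none of these details are from the report.

```python
#!/usr/bin/env python3
"""Remove old snapshot archives (latest/*.gz) on every node to free root-volume space.

Assumptions (illustrative, not from the report): passwordless SSH as 'ec2-user',
a plain-text inventory file 'nodes.txt' with one hostname per line, and the
archives living under ~/latest/ on each node.
"""
import subprocess

INVENTORY = "nodes.txt"                            # hypothetical inventory file
CLEANUP_CMD = "rm -f ~/latest/*.gz && df -h /"     # delete archives, then report disk usage

def cleanup(host: str) -> None:
    # BatchMode makes ssh fail fast instead of prompting for a password
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", f"ec2-user@{host}", CLEANUP_CMD],
        capture_output=True, text=True, timeout=60,
    )
    status = "ok" if result.returncode == 0 else f"failed ({result.returncode})"
    print(f"{host}: {status}\n{result.stdout.strip()}")

if __name__ == "__main__":
    with open(INVENTORY) as f:
        hosts = [line.strip() for line in f if line.strip()]
    for host in hosts:
        cleanup(host)
```

Running the deletions sequentially keeps the per-host output readable; for a larger fleet the loop could be parallelized.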
11/17:
- Beacon out of sync
- Shard 2 signing power at 75% due to S2 DNS nodes all being out of sync. Fixed by adding more DNS nodes.
11/18:
- Beacon out of sync, had to restart to avoid OOM
- Mainnet BTC indexer sync issue auto-resolved
- Updated Grafana to start alerting at 85% CPU usage
- Out of sync issue:
- Upgraded 9 DNS nodes from c5.large to c5.2xlarge
- Restarted node
- Validator node:
- Memory Usage Alert: PagerDuty
- OOM killer; time to replace nodes (https://github.com/harmony-one/harmony-ops-priv/issues/55). A stopgap memory-guard sketch follows after this day's list.
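Until the nodes are replaced per harmony-ops-priv#55, a stopgap restart before the OOM killer fires could look like the sketch below. It is only an illustration: it assumes the node process and its systemd unit are both named harmony and that psutil is installed, and the 85% memory threshold is likewise an assumption.

```python
#!/usr/bin/env python3
"""Stopgap memory guard: restart the harmony process before the OOM killer does.

Assumptions (illustrative only): the node process is named 'harmony', it runs
under a systemd unit also named 'harmony', psutil is installed, and 85% system
memory usage is the chosen restart threshold.
"""
import subprocess
import psutil

MEM_THRESHOLD_PCT = 85.0   # restart once system memory usage crosses this level

def harmony_running() -> bool:
    # Look for the harmony process among all running processes
    return any(p.info["name"] == "harmony" for p in psutil.process_iter(["name"]))

def main() -> None:
    used_pct = psutil.virtual_memory().percent
    print(f"system memory used: {used_pct:.1f}%")
    if harmony_running() and used_pct >= MEM_THRESHOLD_PCT:
        # A controlled restart drops the process RSS before the kernel OOM killer steps in
        subprocess.run(["sudo", "systemctl", "restart", "harmony"], check=True)
        print("harmony restarted")

if __name__ == "__main__":
    main()
```

Run from cron (or a systemd timer) on each affected node until the replacement instances land.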
11/19:
- Testnet fork issue at block 17732590; fixed by reverting all s1/s2/s3 nodes and the s0 explorer node by 1 block (a fork-check sketch follows after this day's list)
- Memory alert; while #55 is pending, restarted a list of nodes for now …
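For context on the fork fix above, a quick way to confirm which nodes disagree at the fork height is to compare the block hash each one reports. The sketch below is illustrative only: the endpoint URLs are placeholders and it assumes the nodes expose the Ethereum-compatible eth_getBlockByNumber RPC method.

```python
#!/usr/bin/env python3
"""Detect a fork by comparing the block hash each node reports at a given height.

Assumptions (illustrative only): the endpoint URLs below and the availability of
the Ethereum-compatible eth_getBlockByNumber RPC method on each node.
"""
import json
import urllib.request
from collections import defaultdict

FORK_HEIGHT = 17732590
NODES = [                      # hypothetical testnet endpoints
    "http://s1-node-1:9500",
    "http://s2-node-1:9500",
    "http://s3-node-1:9500",
]

def block_hash(endpoint: str, height: int) -> str:
    payload = json.dumps({
        "jsonrpc": "2.0", "id": 1,
        "method": "eth_getBlockByNumber",
        "params": [hex(height), False],
    }).encode()
    req = urllib.request.Request(endpoint, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["result"]["hash"]

if __name__ == "__main__":
    by_hash = defaultdict(list)
    for node in NODES:
        by_hash[block_hash(node, FORK_HEIGHT)].append(node)
    if len(by_hash) > 1:
        print(f"FORK at height {FORK_HEIGHT}:")
        for h, nodes in by_hash.items():
            print(f"  {h}: {nodes}")
    else:
        print(f"all nodes agree at height {FORK_HEIGHT}")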
11/20:
- Disk at 90%: disks on 3 nodes were upgraded
All S1/S2/S3 validator nodes were upgraded to r5.xlarge. Old instances will be deleted Monday/Tuesday.
- Disk upgrade to 750GB (PagerDuty). A boto3 sketch of the volume/instance upgrades follows after this day's list.
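A rough sketch of how the volume and instance-type upgrades can be driven with boto3 is below. The volume/instance IDs and region are placeholders, and after modify_volume the filesystem on the host still has to be extended (e.g. growpart + resize2fs), which is not shown.

```python
#!/usr/bin/env python3
"""Grow a node's EBS volume and move the instance to r5.xlarge with boto3.

IDs and region below are placeholders. After modify_volume the filesystem on
the host still has to be extended (growpart/resize2fs), and an instance must be
stopped before its type can be changed.
"""
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")   # region is an assumption

def grow_volume(volume_id: str, new_size_gb: int) -> None:
    ec2.modify_volume(VolumeId=volume_id, Size=new_size_gb)
    print(f"{volume_id}: resize to {new_size_gb} GB requested")

def change_instance_type(instance_id: str, new_type: str) -> None:
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(InstanceId=instance_id,
                                  InstanceType={"Value": new_type})
    ec2.start_instances(InstanceIds=[instance_id])
    print(f"{instance_id}: now {new_type}")

if __name__ == "__main__":
    grow_volume("vol-0123456789abcdef0", 750)                 # disk upgrade to 750GB
    change_instance_type("i-0123456789abcdef0", "r5.xlarge")  # validator upgrade
```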
11/21:
- Disk at 90% PD alerts: two nodes were upgraded to 700GB
11/22:
- One node was stuck. Had to restart the harmony process to unblock it
- Multiple s0 RPC nodes were constantly running 5~20 blocks behind (a lag-check sketch follows after this day's list)
These were mainly due to:
- The previously stuck node, which was still lagging behind
- One of the API websocket nodes was constantly behind
- The snapshot node was still syncing because the daily snapshot had run a few hours before
- Asian nodes all had CPU usage varying between 5%~30% (the same zig-zag pattern as when we had the RPC issue 2-3 weeks ago)
After a while it would eventually go back to normal, leaving only the previously stuck node behind:
Unique Blocks: [19669427 19662283]
- Multiple disks at 90%; after a true-up across all S0 nodes, all are now below 70%. Only the snapshot node needs separate work.
Note: the websocket node has been consistently 1~10 blocks behind over the past few days; Jack was following up on that and made some changes to the target group.
Impact on Nov 21 vs. Nov 11
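The lag diagnosis above boils down to comparing block heights across the shard 0 fleet. The sketch below is illustrative only: the endpoint names are placeholders and it assumes the nodes expose the Ethereum-compatible eth_blockNumber RPC method.

```python
#!/usr/bin/env python3
"""Report how far each s0 RPC node is behind the highest block in the fleet.

Assumptions (illustrative only): the endpoint names below and the availability
of the Ethereum-compatible eth_blockNumber RPC method on each node.
"""
import json
import urllib.request

NODES = [                      # hypothetical s0 RPC endpoints
    "http://s0-rpc-1:9500",
    "http://s0-rpc-2:9500",
    "http://s0-api-ws-1:9500",
]
LAG_THRESHOLD = 5              # blocks behind the fleet head worth flagging

def block_number(endpoint: str) -> int:
    payload = json.dumps({"jsonrpc": "2.0", "id": 1,
                          "method": "eth_blockNumber", "params": []}).encode()
    req = urllib.request.Request(endpoint, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return int(json.load(resp)["result"], 16)

if __name__ == "__main__":
    heights = {node: block_number(node) for node in NODES}
    head = max(heights.values())
    print(f"unique heights: {sorted(set(heights.values()), reverse=True)}")
    for node, height in heights.items():
        lag = head - height
        flag = "  <-- lagging" if lag >= LAG_THRESHOLD else ""
        print(f"{node}: {height} (lag {lag}){flag}")
```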
11/23:
- 2 s0 RPC nodes were stuck for no apparent reason; created "Mainnet S0 node got stuck" (harmony-one/harmony#3941)
Takeaways:
- Whenever a single alert/alarm comes in, check similar instances for the same condition (e.g. OOS, OOM); a fan-out check sketch follows after this list.
- Testnet bridge alerts and explorer-api-websockets are not being auto-resolved.
- The Shard 0 pruned DB keeps growing.
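On the first takeaway, one way to turn a single alert into a fleet-wide check is a small fan-out script. The sketch below is illustrative: it assumes passwordless SSH as ec2-user and a hypothetical nodes.txt inventory, and uses root-disk usage as the example check (any other check, e.g. sync or memory, could be swapped in).

```python
#!/usr/bin/env python3
"""Fan-out check: when one node alerts (disk, OOS, OOM, ...), run the same
check on every similar instance instead of only the one that paged.

Assumptions (illustrative only): passwordless SSH as 'ec2-user', a plain-text
inventory file 'nodes.txt', and root-disk usage as the example check.
"""
import subprocess

INVENTORY = "nodes.txt"        # hypothetical inventory, one host per line
DISK_ALERT_PCT = 90            # same threshold the PagerDuty disk alerts use

def root_disk_pct(host: str) -> int:
    # df on the root volume; keep only the use% value without the '%' sign
    out = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", f"ec2-user@{host}",
         "df --output=pcent / | tail -1 | tr -d ' %'"],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return int(out.stdout.strip())

if __name__ == "__main__":
    with open(INVENTORY) as f:
        hosts = [line.strip() for line in f if line.strip()]
    for host in hosts:
        pct = root_disk_pct(host)
        flag = "  <-- over threshold" if pct >= DISK_ALERT_PCT else ""
        print(f"{host}: root disk {pct}% used{flag}")
```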