Weekly OnCall Summary Nov 16 - 23, 2021

Weekly OnCall Rotation

OnCall US Pacific: Leo (leo@harmony.one)

OnCall Asia/EMEA: Soph (soph@harmony.one)

Duration: 11/16 8:30am - 11/23 8:30am PST

Summary

Details

11/16:

  • Node OOS: PagerDuty
  • Multisig frontend down/up - UptimeRobot
  • Node OOS: PagerDuty
  • Batch-removed all ‘latest/*.gz’ archives from all nodes to free disk space on the root volume (a scripted sketch follows this list)
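
A rough sketch of how that batch cleanup could be scripted, assuming SSH access to each node and that the archives live under ~/latest/ (the hosts file name and path are hypothetical, not the exact procedure used):

```python
#!/usr/bin/env python3
"""Free root-volume space by deleting leftover 'latest/*.gz' archives on every node."""
import subprocess

HOSTS_FILE = "hosts.txt"        # hypothetical: one hostname/IP per line
ARCHIVE_GLOB = "~/latest/*.gz"  # assumed location of the leftover archives

def clean_node(host: str) -> None:
    # df before/after so the reclaimed space is visible in the output.
    remote_cmd = f"df -h / && rm -f {ARCHIVE_GLOB} && df -h /"
    subprocess.run(["ssh", host, remote_cmd], check=True)

if __name__ == "__main__":
    with open(HOSTS_FILE) as fh:
        for host in (line.strip() for line in fh if line.strip()):
            print(f"--- cleaning {host}")
            clean_node(host)
```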

11/17:

  • Beacon out of sync
  • Shard 2 signing power dropped to 75% because all S2 DNS nodes were out of sync. Fixed by adding more DNS nodes.

11/18:

  • Beacon out of sync; had to restart it to avoid OOM
  • Mainnet BTC indexer sync alert auto-resolved
  • Out-of-sync issue:
    • Updated Grafana to start alerting at 85% CPU usage
    • Upgraded 9 DNS nodes from c5.large to c5.2xlarge (an instance-resize sketch follows this list)
    • Restarted the affected validator node
  • Memory Usage Alert: PagerDuty
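
One common way to do that kind of instance upgrade is an in-place stop/modify/start with boto3. A minimal sketch, assuming in-place resizes were acceptable here (the instance IDs and region below are placeholders, not taken from the report):

```python
"""Resize EC2 instances in place: stop, change instance type, start."""
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # region is an assumption
DNS_NODE_IDS = ["i-0123456789abcdef0"]              # hypothetical instance IDs

def resize(instance_id: str, new_type: str = "c5.2xlarge") -> None:
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    # The instance type can only be changed while the instance is stopped.
    ec2.modify_instance_attribute(
        InstanceId=instance_id, InstanceType={"Value": new_type}
    )
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

for iid in DNS_NODE_IDS:
    resize(iid)
```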

11/19:

  • Testnet fork issue at block 17732590; fixed by rolling back all s1/s2/s3 nodes and the s0 explorer node by 1 block
  • Memory alert; while #55 is still pending, restarted a list of nodes for now …

11/20:

  • Disk at 90%: disks on 3 nodes were upgraded

All S1/S2/S3 validator nodes have been upgraded to r5.xlarge. Old instances will be deleted on Monday/Tuesday.

11/21:

  • Disk-at-90% PagerDuty alerts: two nodes’ disks were upgraded to 700 GB (a resize sketch follows)
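
A minimal sketch of that kind of disk upgrade, assuming the disks are EBS volumes grown in place with boto3 and then expanded on the host (volume ID, device name, partition layout, and ext4 are all assumptions):

```python
"""Grow an EBS volume to 700 GB, then expand the partition and filesystem on the node."""
import subprocess
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # region is an assumption

def grow_volume(volume_id: str, size_gb: int = 700) -> None:
    ec2.modify_volume(VolumeId=volume_id, Size=size_gb)
    # Wait until EBS has finished (or is optimizing) the size change.
    while True:
        mods = ec2.describe_volumes_modifications(VolumeIds=[volume_id])
        state = mods["VolumesModifications"][0]["ModificationState"]
        if state in ("optimizing", "completed"):
            return
        time.sleep(15)

def grow_filesystem(device: str = "/dev/nvme1n1", partition: int = 1) -> None:
    # Run on the node itself once the volume modification is done.
    subprocess.run(["sudo", "growpart", device, str(partition)], check=True)
    subprocess.run(["sudo", "resize2fs", f"{device}p{partition}"], check=True)

grow_volume("vol-0123456789abcdef0")  # hypothetical volume ID
```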

11/22:

  • One node was stuck; had to restart the harmony process to unblock it
  • Multiple s0 RPC nodes were constantly lagging by 5 ~ 20 blocks

Those delays were mainly due to:

  • The previously stuck node that was lagging behind
  • One of the API websocket nodes was constantly behind
  • The snapshot node was still syncing because the daily snapshot had run a few hours earlier
  • The Asia-region nodes all had CPU usage varying between 5% ~ 30% (the same zigzag pattern as during the RPC issue 2-3 weeks ago)

After a while the lag would eventually go back to normal, leaving only the previously stuck node behind:

Unique Blocks: [19669427 19662283]
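
For reference, a sketch of the kind of cross-node check that produces output like the line above, assuming the nodes expose the standard eth_blockNumber JSON-RPC method on port 9500 (the endpoint list is hypothetical):

```python
"""Compare block heights across s0 RPC nodes; report unique heights and per-node lag."""
import requests

RPC_NODES = [                 # hypothetical s0 RPC endpoints
    "http://s0-rpc-1:9500",
    "http://s0-rpc-2:9500",
]

def block_number(url: str) -> int:
    payload = {"jsonrpc": "2.0", "id": 1, "method": "eth_blockNumber", "params": []}
    result = requests.post(url, json=payload, timeout=5).json()["result"]
    return int(result, 16)    # the result is a hex string, e.g. "0x12c1a33"

heights = {url: block_number(url) for url in RPC_NODES}
tip = max(heights.values())
print("Unique Blocks:", sorted(set(heights.values()), reverse=True))
for url, h in heights.items():
    print(f"{url}: {h} (lag {tip - h})")
```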

  • Multiple disks at 90%; after a true-up across all S0 nodes, all are now below 70%. Only the snapshot node needs separate work.

Note: the websocket node has been a bit behind (1 ~ 10 blocks) for the past few days; Jack was following up on that and made some changes to the target group.

Impact on Nov 21 vs. Nov 11:

11/23:

Takeaways:

  1. Whenever a single alert/alarm is received, check similar instances for the same issue (e.g., OOS, OOM).
  2. Testnet bridge alerts and explorer-api-websockets are not being auto-resolved.
  3. The Shard 0 pruned DB keeps growing.

thanks for writing in public! let’s do this for all of our team meetings. only way to scale to 10 billion people.
