Weekly OnCall Rotation - Aug 10-17, 2021

Oncall: Leo, Soph ({lc,soph}@harmony.one)

Duration: 8/10 12:00pm - 8/17 12:00pm

Summary

  • Lots of low-severity incidents that were auto-resolved, mostly out-of-sync issues on mainnet, which are usually temporary
  • S0 Leader disk space issue

Details

  • 8/10

  • 8/11

  • 8/12

    • Node Disk Space alert on 34.227.78.124 at 91%. Usage was still the same after a manual cleanup, so we had to increase the disk size
  • 8/13

  • 8/14

    • Monitor all local disks in Grafana
    • Grafana updated to show local disks
  • 8/15

  • 8/16

  • 8/17

    • Jenya informed Nita Neou (Soph) that the explorer-v2 API was returning inconsistent results between the 2 APIs. A restart resolved the issue. Need to investigate why our monitor didn’t catch it.
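
As a rough illustration of the disk space check behind the 8/12 alert and the 8/14 Grafana change, here is a minimal Python sketch. It is not the actual monitoring setup. The function names, the 90% threshold, and the example mount point "/" are assumptions made for the example only.

```python
import shutil
from typing import Callable, Iterable, List, Tuple


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 if total is 0)."""
    return 100.0 * used / total if total else 0.0


def disks_over_threshold(
    paths: Iterable[str],
    threshold: float = 90.0,  # assumed threshold, not the real alert setting
    usage_fn: Callable = shutil.disk_usage,
) -> List[Tuple[str, float]]:
    """Return (path, percent_used) for each path at or above the threshold."""
    flagged = []
    for path in paths:
        usage = usage_fn(path)  # object with .total and .used in bytes
        pct = percent_used(usage.total, usage.used)
        if pct >= threshold:
            flagged.append((path, round(pct, 1)))
    return flagged


if __name__ == "__main__":
    # "/" is only an example; pass every local mount point that matters.
    for path, pct in disks_over_threshold(["/"]):
        print(f"{path}: {pct}% used")
```

Passing every local mount point rather than only the root volume matches the intent of the 8/14 item, where all local disks are tracked.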

Operation Improvement

  • Apart from “Node Disk Space Alert”, no other alert needed action. How can the alerts be fine-tuned to avoid too many notifications?

  • The “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0” alert is super annoying

  • Can the “Node CPU Usage Rate Alert” be fine-tuned?

  • Watchdog now auto-resolves “beacon out-of-sync” pages. 22 tickets were automatically resolved.
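
One common way to cut down on noisy notifications like the ones above is to page only when a metric stays over its threshold for several consecutive checks. The Python sketch below shows that idea only. It is not how the current alerts are configured, and the function name and the default of 3 consecutive samples are hypothetical.

```python
from typing import Iterable, List


def should_page(samples: Iterable[float], threshold: float, consecutive: int = 3) -> bool:
    """Return True only if the last `consecutive` samples are all >= threshold."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    recent: List[float] = list(samples)[-consecutive:]
    # Not enough history yet, or a recent sample dipped below the threshold.
    return len(recent) == consecutive and all(s >= threshold for s in recent)


if __name__ == "__main__":
    # A single unhealthy reading between healthy ones does not page.
    print(should_page([0, 1, 0], threshold=1.0))
    # Three unhealthy readings in a row do.
    print(should_page([0, 1, 1, 1], threshold=1.0))
```

A larger `consecutive` value trades slower detection for fewer pages on transient blips, which is the kind of issue the summary describes as usually temporary.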

Resources

Runbook: GitBook

Uptime Status: https://status.harmony.one/

Grafana Server: http://grafana.harmony.one:3000/d/DoL_wq7Zk/harmony-nodes-monitoring?orgId=1&refresh=5s

Node out-of-sync: https://harmony.one/status

Pager duty: https://harmonyone.pagerduty.com/