Weekly OnCall Rotation - Aug 10-17, 2021

Oncall: Leo, Soph ({lc,soph}@harmony.one)

Duration: 8/10 12:00pm - 8/17 12:00pm

Summary

  • Many low-severity incidents that auto-resolved. Out-of-sync issues on mainnet, which are usually temporary
  • S0 Leader disk space issue

Details

  • 8/10

  • 8/11

  • 8/12

    • Node Disk Space alert on 34.227.78.124 at 91%. Usage was still the same after a manual cleanup, so we had to increase the disk size
  • 8/13

    • Updated the UptimeRobot monitors for the explorer-v2 API and indexer, wiki, and docs
    • Leader out of space warning: PagerDuty
  • 8/14

    • Monitor all local disks in Grafana
    • Grafana updated to show local disks
  • 8/15

  • 8/16

  • 8/17

    • Jenya informed Nita Neou (Soph) that explorer-v2 was returning inconsistent results between the 2 APIs. A restart resolved the issue. We still need to investigate why our monitor didn’t work.
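
One way a monitor could catch the kind of inconsistency reported on 8/17 is to compare the answers of both APIs instead of only checking that each one responds. The following is a minimal sketch, not the existing monitor. It assumes each API can report a latest block height, and the names `heights_consistent`, `check_apis`, and `max_lag` are hypothetical.

```python
from typing import Callable


def heights_consistent(height_a: int, height_b: int, max_lag: int = 5) -> bool:
    """Return True when the two heights differ by at most max_lag blocks."""
    return abs(height_a - height_b) <= max_lag


def check_apis(
    fetch_a: Callable[[], int],
    fetch_b: Callable[[], int],
    max_lag: int = 5,
) -> str:
    """Classify the pair of APIs as 'ok', 'inconsistent', or 'down'.

    fetch_a and fetch_b are hypothetical callables that return the latest
    block height reported by each API and raise an exception on failure.
    """
    try:
        height_a = fetch_a()
        height_b = fetch_b()
    except Exception:
        # Either API failed to answer at all.
        return "down"
    if heights_consistent(height_a, height_b, max_lag):
        return "ok"
    return "inconsistent"
```

Keeping the comparison separate from the fetching means the tolerance can be tested without network access, and an "inconsistent" result can page separately from a plain outage.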

Operation Improvement

  • Apart from “Node Disk Space Alert”, no other alert needed action. How can the alerts be fine-tuned to avoid too many notifications?

  • The “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0” alert is very noisy

  • Can the Node CPU Usage Rate Alert be fine-tuned?

  • Watchdog now auto-resolves “beacon out-of-sync” pages. 22 tickets were automatically resolved.
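
A common way to cut notification volume for alerts like the ones above is to page only when a breach persists across several consecutive samples. The sketch below illustrates the idea. The function name, the threshold, and the window of 3 samples are assumptions for illustration, not the current configuration of any alert listed here.

```python
from typing import Sequence


def should_notify(
    samples: Sequence[float],
    threshold: float,
    consecutive: int = 3,
) -> bool:
    """Return True only if the last `consecutive` samples all meet or exceed threshold."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(samples) < consecutive:
        # Not enough history yet to call it a sustained breach.
        return False
    return all(value >= threshold for value in samples[-consecutive:])
```

With a threshold of 1.0, a metric that flaps between 0 and 1 never pages, while three breaching samples in a row do. The trade-off is that a real incident is reported a few samples later.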

Resources

Runbook: https://app.gitbook.com/@harmony-one/s/onboarding-wiki/devops-run-book/harmony-mainnet-devops-runbook

Uptime Status: https://status.harmony.one/

Grafana Server: http://grafana.harmony.one:3000/d/DoL_wq7Zk/harmony-nodes-monitoring?orgId=1&refresh=5s

Node out-of-sync: Network Status | Harmony

PagerDuty: https://harmonyone.pagerduty.com/