Weekly OnCall Rotation
Oncall: Leo, Soph ({lc,soph}@harmony.one)
Duration: 8/10 12:00pm - 8/17 12:00pm
Summary
- Many minor incidents that were auto-resolved. Out-of-sync issues on mainnet, which are usually temporary.
- S0 Leader disk space issue
Details
- 8/10
- 8/11
  - Revoked public access to the internal wiki (https://docs.harmony.one/onboarding-wiki)
  - Onboarded Jack Chan
- 8/12
  - Node Disk Space Alert on 34.227.78.124 at 91%. Usage was unchanged after a manual cleanup, so the disk had to be enlarged.
- 8/13
  - Updated the UptimeRobot monitors for the explorer-v2 API and indexer, the wiki, and the docs
  - Leader out-of-space warning from PagerDuty
- 8/14
  - Monitor all local disks in Grafana
  - Grafana updated to show local disks
- 8/15
- 8/16
- 8/17
  - Jenya informed Nita Neou (Soph) that the explorer-v2 API was returning inconsistent results between the 2 APIs. A restart resolved the issue. Need to investigate why our monitor did not catch this.
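The disk alerts on 8/12 and 8/13 could be reproduced on a node with a small local check. The following is only a minimal sketch using the Python standard library: the function names and the 90% threshold are assumptions for illustration, not the actual Grafana or PagerDuty alert rule.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the used space of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    # Note: `df` may report a slightly different Use% because it
    # accounts for blocks reserved for root.
    return 100.0 * usage.used / usage.total


def disks_over_threshold(paths, threshold_percent=90.0):
    """Return {path: percent_used} for every path at or above the threshold.

    The 90% default is a hypothetical value, not the team's alert setting.
    """
    over = {}
    for path in paths:
        percent = disk_usage_percent(path)
        if percent >= threshold_percent:
            over[path] = round(percent, 1)
    return over


if __name__ == "__main__":
    # Check the root filesystem; add other mount points as needed.
    print(disks_over_threshold(["/"]))
```

Running it before and after a manual cleanup shows whether the cleanup actually freed space or whether the disk needs to be enlarged.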
Operation Improvement
- Apart from “Node Disk Space Alert”, no other alert needed action. How can the alerts be fine-tuned to avoid too many notifications?
- The “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0” alert is super annoying
- Can the Node CPU Usage Rate Alert be fine-tuned?
- Watchdog now auto-resolves “beacon out-of-sync” pages. 22 tickets were automatically resolved.
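One common way to reduce notification noise is to page only when a threshold breach persists for several consecutive samples, so that a single transient spike is ignored. The sketch below illustrates that idea only: `should_notify`, the sample values, and the default of 3 consecutive samples are hypothetical and do not reflect the current alert configuration.

```python
def should_notify(samples, threshold, min_consecutive=3):
    """Return True if the latest `min_consecutive` samples all breach the threshold.

    `samples` is a list of metric values, oldest first. A breach means
    value >= threshold, mirroring a "greater than or equal" alert rule.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(samples) < min_consecutive:
        # Not enough data yet to confirm a sustained breach.
        return False
    recent = samples[-min_consecutive:]
    return all(value >= threshold for value in recent)


if __name__ == "__main__":
    # A flapping host count never breaches three times in a row.
    print(should_notify([0, 1, 0, 1, 0], threshold=1.0))
    # A sustained breach does.
    print(should_notify([0, 1, 1, 1], threshold=1.0))
```

The trade-off is slower detection: with `min_consecutive=3` a real outage is reported two sampling intervals later than with an immediate alert.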
Resources
Uptime Status: https://status.harmony.one/
Grafana Server: http://grafana.harmony.one:3000/d/DoL_wq7Zk/harmony-nodes-monitoring?orgId=1&refresh=5s
Node out-of-sync: Network Status | Harmony
PagerDuty: https://harmonyone.pagerduty.com/