Weekly OnCall Rotation

leo · September 13, 2021, 6:15pm

Oncall: Leo, Soph ({lc,soph}@harmony.one)

Duration: 8/10 12:00pm - 8/17 12:00pm

Summary

Many minor incidents that auto-resolved. Out-of-sync issues on mainnet, which are usually temporary.
S0 leader disk space issue.

8/10
8/11
- Revoke public access to the internal wiki (https://docs.harmony.one/onboarding-wiki)
- Onboard Jack Chan
8/12
- Node Disk Space alert on 34.227.78.124 at 91%. Usage stayed the same after a manual cleanup, so we had to increase the disk size.
8/13
- Updated UptimeRobot monitors for the explorer-v2 API and indexer, the wiki, and the docs
- Leader out of space warning: https://harmonyone.pagerduty.com/incidents/PFAGH8L
8/14
- Monitor all local disks in Grafana
- Grafana updated to show local disks
8/15
8/16
8/17
- Jenya informed Nita Neou (Soph) that explorer-v2 API was returning inconsistent results between the 2 APIs. A restart resolved the issue. Need to investigate why our monitor didn’t work.
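
For the disk-space alerts on 8/12 and 8/13, a quick manual check on a node could look like the sketch below. It is only an illustration: the function names and the 90% threshold are assumptions, not the actual alert configuration, and the percentage may differ slightly from what df or Grafana reports.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return used space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(percent: float, threshold: float = 90.0) -> bool:
    """True when usage is at or above the (assumed) alert threshold."""
    return percent >= threshold


if __name__ == "__main__":
    pct = disk_usage_percent("/")
    status = "ALERT" if over_threshold(pct) else "ok"
    print(f"/ is {pct:.1f}% full: {status}")
```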

Apart from “Node Disk Space Alert”, no other alert needed action. How can the alerts be fine-tuned to avoid too many notifications?
The “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0” alert is super annoying.
Can the “Node CPU Usage Rate Alert” be fine-tuned?
Watchdog now auto-resolves “beacon out-of-sync” pages. 22 tickets were automatically resolved.
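
One common way to cut down on noisy notifications is to fire only after several consecutive breaches instead of on the first one. The sketch below illustrates the idea in plain Python. The class name, the threshold of 90 and the count of 3 are hypothetical and are not taken from our current alert rules.

```python
from collections import deque


class ConsecutiveBreachAlert:
    """Fire only when the last `required` samples all breach the threshold."""

    def __init__(self, threshold: float, required: int) -> None:
        self.threshold = threshold
        self.required = required
        # Keeps only the most recent `required` breach flags.
        self._recent = deque(maxlen=required)

    def observe(self, value: float) -> bool:
        """Record one sample and return True if the alert should fire."""
        self._recent.append(value >= self.threshold)
        return len(self._recent) == self.required and all(self._recent)


if __name__ == "__main__":
    alert = ConsecutiveBreachAlert(threshold=90.0, required=3)
    for sample in [95, 50, 95, 95, 95]:
        print(sample, alert.observe(sample))
```

With these sample values, the single spike at the start does not page anyone. Only the third consecutive breach at the end fires.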

Runbook: GitBook
