Weekly OnCall Rotation
Oncall: Ganesha, Jenya ({ganesha,jenya}@harmony.one)
Duration: 11/9 12:00pm - 11/15 12:00pm
Summary
- BTC mainnet node downtime (https://btc.main.hmny.io & https://btc2.main.hmny.io)
  - Happens whenever the BTC node falls out of sync; keep monitoring & restart manually if needed
- Many beacon out-of-sync alerts & auto resolution (too frequent?)
  - Needs root cause analysis & better alerting
- A couple of node out-of-sync alerts & auto resolution
- Node memory usage rate alerts for shard-2/3 nodes & auto resolution
- Node disk space abnormal alerts for shard0 nodes & auto resolution
- Monitor is down: s0-min-sync-3
  - Newly added DNS nodes
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 (Nov 11)
  - The Nov 11 IDO event resulted in huge traffic. Normal behavior, no specific cause to remedy.
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration…
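
Several of the alerts above (BTC, beacon, and node out of sync) come down to the same check: a node is trailing the chain tip by too many blocks. A minimal sketch of that check follows, for illustration only. The function names, the node names in the usage note, and the default lag threshold of 3 blocks are all assumptions, not values taken from the actual monitoring setup.

```python
from typing import Dict, List


def is_out_of_sync(local_height: int, reference_height: int,
                   max_lag_blocks: int = 3) -> bool:
    """Return True when the local node trails the reference height
    by more than max_lag_blocks (hypothetical threshold)."""
    if local_height < 0 or reference_height < 0:
        raise ValueError("block heights must be non-negative")
    return reference_height - local_height > max_lag_blocks


def nodes_needing_attention(heights: Dict[str, int], reference_height: int,
                            max_lag_blocks: int = 3) -> List[str]:
    """Return the sorted names of nodes that are out of sync.

    `heights` maps a node name to its last reported block height.
    """
    return sorted(
        name
        for name, height in heights.items()
        if is_out_of_sync(height, reference_height, max_lag_blocks)
    )
```

For example, `nodes_needing_attention({"btc": 100, "btc2": 109}, 110)` would flag only `btc`, which could then be a candidate for the manual restart mentioned above.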
Action Items
- Create tracking issues for the beacon out-of-sync and memory usage alerts @sophoah