Weekly On-call Summary Nov 9-15, 2021

Weekly On-call Rotation

On-call: Ganesha, Jenya ({ganesha,jenya}@harmony.one)

Duration: 11/9 12:00pm - 11/15 12:00pm

Summary

  • BTC mainnet node downtime (https://btc.main.hmny.io & https://btc2.main.hmny.io)
    • Happens whenever the BTC node falls out of sync. Keep monitoring and restart manually if needed.
  • Many beacon out-of-sync alerts, all auto-resolved (too frequent?)
    • Needs root cause analysis and better alerting
  • A couple of node out-of-sync alerts, auto-resolved
  • Node memory usage rate alerts for shard-2/3 nodes, auto-resolved
  • Node disk space abnormal alerts for shard-0 nodes, auto-resolved
  • Monitor is down: s0-min-sync-3
    • Newly added DNS nodes
  • Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 (Nov 11)
    • The Nov 11 IDO event caused a traffic spike. This is expected behavior; there is no specific cause to remedy.
  • AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration…
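
For the out-of-sync alerts listed above, one possible way to cut down on alerts that fire and then auto-resolve is to require several consecutive lagging samples before alerting. The following is only a sketch under stated assumptions: the names `SyncCheck` and `should_alert`, the 10-block tolerance, and the 3-sample requirement are all hypothetical and are not taken from the existing monitoring setup.

```python
from dataclasses import dataclass


@dataclass
class SyncCheck:
    """Fires only after several consecutive samples show the node lagging."""

    max_lag_blocks: int = 10    # hypothetical tolerance, not from the report
    required_failures: int = 3  # hypothetical number of consecutive bad samples
    _consecutive: int = 0       # running count of consecutive lagging samples

    def observe(self, node_height: int, reference_height: int) -> bool:
        """Record one height sample; return True when an alert should fire."""
        lag = reference_height - node_height
        if lag > self.max_lag_blocks:
            self._consecutive += 1
        else:
            # Any healthy sample resets the streak, so brief dips do not alert.
            self._consecutive = 0
        return self._consecutive >= self.required_failures


def should_alert(samples, max_lag_blocks=10, required_failures=3):
    """Evaluate a sequence of (node_height, reference_height) samples."""
    check = SyncCheck(max_lag_blocks, required_failures)
    return [check.observe(node, ref) for node, ref in samples]
```

With samples whose lags are 0, 20, 25, 20, and 1 blocks, only the fourth sample (the third consecutive lagging one) would alert. The trade-off is a slower first alert in exchange for fewer short-lived ones.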

Action Items

  • Create tracking issues for the beacon out-of-sync and memory usage alerts (@sophoah)