Weekly On-call Summary Oct 26th - Nov 2nd

Weekly On-call Rotation

On-call: Jacky & Jack (US Pacific daytime) & Yuriy (US Pacific nighttime)

Duration: Oct 26, 2021 to Nov 2, 2021

Summary

  1. Generally quiet, with minor issues that resolved themselves.
  2. Most beacon out of sync issues resolved automatically.
  3. Disk expansion on several nodes - shard0 (Nita Neou (Soph))
  4. False alarms on RPC on 10/29 for 9 hours (noise from websockets load mix-in)
  5. Re-aligned LB → Targets → Nodes
  6. Websockets added unnecessary pressure to nodes

Details

Oct 26, 2021

  1. Three beacon out of sync issues

Oct 27, 2021

  1. Several beacon out of sync issues
  2. Built dedicated DNS nodes, split out from the explorer nodes (by Soph)

Oct 28, 2021

  1. Several beacon out of sync issues
     1. Initial guess: flooding of the RPC method InSync keeps the DNS nodes busy.
     2. The guess was verified when deploying the initial fix: Sync status rpc fix by JackyWYX · Pull Request #3912 · harmony-one/harmony · GitHub
     3. After the deploy, CPU usage on the DNS nodes dropped dramatically.
     4. The fix was then undeployed because of an issue found in testnet: Out of memory on v7192-v4.3.0-18-g791c9d20 · Issue #3915 · harmony-one/harmony · GitHub. Needs further investigation.
  2. Disk upgraded on (by Soph):
     1. Explorer-v2 token indexer
     2. Public graph node
     3. Mainnet snapshot node
  3. Improvement suggestion: configure log rotation for the Docker logs of our running services
  4. Public graph node: the Postgres DB is taking most of the disk space. We need to look at how to clean that up, or just scrap everything and restart?
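
Note on the log rotation suggestion: one possible approach, assuming the services use Docker's default json-file logging driver, is to cap log size through the `max-size` and `max-file` options in `/etc/docker/daemon.json`. The sketch below only generates such a fragment. The 50m / 5 values and the helper names are placeholders, not agreed settings. The daemon has to be restarted for the change to apply, and it only affects containers created afterwards.

```python
import json


def docker_log_rotation_config(max_size: str = "50m", max_file: int = 5) -> dict:
    """Build a daemon.json fragment that caps json-file logs per container.

    max_size and max_file defaults are illustrative placeholders.
    """
    if max_file < 1:
        raise ValueError("max_file must be at least 1")
    return {
        "log-driver": "json-file",
        "log-opts": {
            "max-size": max_size,
            # Docker expects log-opts values in daemon.json to be strings.
            "max-file": str(max_file),
        },
    }


def worst_case_log_bytes(max_size_bytes: int, max_file: int, containers: int) -> int:
    """Rough upper bound on disk used by rotated logs across all containers."""
    return max_size_bytes * max_file * containers


if __name__ == "__main__":
    # Print the fragment so it can be reviewed before being merged into daemon.json.
    print(json.dumps(docker_log_rotation_config(), indent=2))
```

With a cap like this, worst-case log usage per host becomes a known number (size cap × file count × containers) rather than growing until the disk has to be upgraded.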

Oct 29, 2021

  1. Several beacon out of sync issues
  2. Extra fixes to the InSync RPC fix:
     1. PR: [SYNC] refactor and make sync status check interval smaller by JackyWYX · Pull Request #3918 · harmony-one/harmony · GitHub
  3. Prometheus metrics on all RPC methods:
     1. PR: [RPC] General solution of prometheus metrics for all RPC methods by JackyWYX · Pull Request #3919 · harmony-one/harmony · GitHub
  4. In the evening, saw a lot of AverageDurationRPCTest pagers
     1. All auto-resolved.
     2. RPC call pattern and network pattern looked normal.
     3. The canary RPC method causing the problem differed over time.
     4. Isolating websocket traffic from the rest of the RPC traffic made a big difference; the issue was root-caused and traffic was shifted.
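
For readers unfamiliar with per-method RPC metrics, here is a minimal sketch of the general idea: wrap every handler so that call count, error count and total duration are recorded under the method name. This is not the code from PR #3919 (which is Go and uses Prometheus). `RPCMetrics`, `instrument` and the `hmy_blockNumber` label are illustrative names only, and the counters live in memory instead of being exported.

```python
import time
from collections import defaultdict
from functools import wraps


class RPCMetrics:
    """In-memory per-method counters (hypothetical stand-in for a metrics exporter)."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)
        self.total_seconds = defaultdict(float)

    def instrument(self, method_name):
        """Decorator that records calls, errors and duration for one RPC method."""
        def decorator(handler):
            @wraps(handler)
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return handler(*args, **kwargs)
                except Exception:
                    self.errors[method_name] += 1
                    raise
                finally:
                    # Runs on both success and failure.
                    self.calls[method_name] += 1
                    self.total_seconds[method_name] += time.perf_counter() - start
            return wrapper
        return decorator

    def average_seconds(self, method_name):
        """Mean handler duration, or 0.0 if the method was never called."""
        calls = self.calls.get(method_name, 0)
        if calls == 0:
            return 0.0
        return self.total_seconds[method_name] / calls


if __name__ == "__main__":
    metrics = RPCMetrics()

    @metrics.instrument("hmy_blockNumber")  # label chosen for illustration
    def block_number():
        return 42

    block_number()
    print(dict(metrics.calls), metrics.average_seconds("hmy_blockNumber"))
```

Per-method averages of this kind are what make it possible to see which method is slow at a given moment, which matters when the offending method changes over time as it did with the AverageDurationRPCTest pagers.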

Oct 30, 2021

  1. Several beacon out of sync issues on Digital Ocean nodes (shards 1-3), auto-resolved
  2. Isolated all websocket servers onto their own nodes, which reduced load on nodes
     1. Resolved the growing pileup of websocket connections: the LB timeout was too short (60s), changed to 1800s (30 min)
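
A rough illustration of why the timeout change matters, assuming the LB setting is an idle timeout (the connection is closed once no data has flowed for that long) and that a dropped client reconnects when it next has traffic. The message timestamps are made up for the example, and `count_idle_drops` is a hypothetical helper, not part of any tooling we run.

```python
def count_idle_drops(message_times, idle_timeout):
    """Count how many times one connection would be cut for idleness.

    message_times: sorted timestamps (seconds) at which data flows.
    idle_timeout: LB idle timeout in seconds.
    Simplification: each gap longer than the timeout counts as one drop,
    and the client reconnects at the next message.
    """
    drops = 0
    for previous, current in zip(message_times, message_times[1:]):
        if current - previous > idle_timeout:
            drops += 1
    return drops


if __name__ == "__main__":
    # Hypothetical quiet subscription: gaps of 30s, 300s, 30s and 1140s.
    times = [0, 30, 330, 360, 1500]
    print("60s timeout drops:", count_idle_drops(times, 60))
    print("1800s timeout drops:", count_idle_drops(times, 1800))
```

Under these assumptions, every quiet period longer than 60s forces a reconnect, while the 1800s setting lets the same connection survive.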

Oct 31, 2021 to Nov 2, 2021

  1. Several beacon out of sync issues on Digital Ocean nodes (shards 1-3), auto-resolved