Weekly OnCall Rotation
On-call: Jacky & Jack (US Pacific daytime) & Yuriy (US Pacific nighttime)
Duration: Oct 26, 2021 to Nov 2, 2021
Summary
- Generally quiet, with minor issues that resolved themselves.
- Most beacon out-of-sync issues resolved automatically.
- Disk expansion on several nodes - shard0 (Nita Neou (Soph))
- False alarms on RPC on 10/29 for 9 hours (noise from websockets load mix-in)
- Re-aligned LB → Targets → Nodes
- Websockets added unnecessary pressure to nodes
Details
Oct 26, 2021
- Three beacon out-of-sync issues
Oct 27, 2021
- Several beacon out-of-sync issues
- Built dedicated DNS nodes, taken out of the explorer nodes (by Soph)
Oct 28, 2021
- Several beacon out-of-sync issues
- Initial guess: flooding of the RPC method InSync keeps the DNS nodes busy
- The guess was verified by deploying the initial fix: Sync status rpc fix by JackyWYX · Pull Request #3912 · harmony-one/harmony · GitHub
- After the deploy, CPU usage on the DNS nodes dropped dramatically.
- The fix was then rolled back because of an issue found in testnet: Out of memory on v7192-v4.3.0-18-g791c9d20 · Issue #3915 · harmony-one/harmony · GitHub. Needs further investigation.
- Disks upgraded (by Soph) on:
  - Explorer-v2 token indexer
  - Public graph node
  - Mainnet snapshot node
- Improvement suggestion: configure log rotation for the Docker logs of our running services
- Public graph node: the Postgres DB takes up most of the disk space. We need to look at how to clean it up, or else scrap everything and restart.
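The log rotation suggestion above could look roughly like the sketch below. It assumes the services use Docker's default `json-file` logging driver. The helper name and the size and file-count limits are illustrative, not values agreed during this rotation.

```python
import json


def docker_log_rotation_config(max_size="50m", max_file=5):
    """Build a daemon.json fragment that enables json-file log rotation.

    Hypothetical helper; the default limits are placeholders.
    """
    return {
        "log-driver": "json-file",
        "log-opts": {
            # Rotate a container's log once the file reaches this size.
            "max-size": max_size,
            # Keep at most this many log files per container.
            # daemon.json expects log-opts values as strings.
            "max-file": str(max_file),
        },
    }


if __name__ == "__main__":
    # Print the fragment so it can be reviewed before being applied.
    print(json.dumps(docker_log_rotation_config(), indent=2))
```

The printed fragment would go into `/etc/docker/daemon.json`. After a daemon restart, the settings apply only to newly created containers, so existing service containers would need to be recreated.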
Oct 29, 2021
- Several beacon out-of-sync issues
- Extra fixes to the InSync rpc fix:
- Prometheus metrics on all RPC methods:
- In the evening, saw a lot of AverageDurationRPCTest pager alerts
- All auto-resolved.
- RPC call pattern and network pattern looked normal
- The canary RPC method causing the problem differed over time
- Isolating websocket traffic from the rest of the RPC traffic made a big difference; root cause identified and traffic shifted
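Because the slow canary method changed over time, per-method duration data is what makes it possible to see which method is slow at a given moment. The sketch below shows the idea using a plain in-memory tracker as a stand-in for Prometheus metrics. It is not the Harmony implementation, and the class name, method names, and threshold are hypothetical.

```python
from collections import defaultdict


class RPCDurationTracker:
    """Track call durations per RPC method (illustrative stand-in for real metrics)."""

    def __init__(self):
        self._total = defaultdict(float)  # summed duration per method, in seconds
        self._count = defaultdict(int)    # number of observed calls per method

    def observe(self, method, seconds):
        """Record one call of `method` that took `seconds`."""
        self._total[method] += seconds
        self._count[method] += 1

    def average(self, method):
        """Return the mean duration for `method`, or 0.0 if it was never observed."""
        count = self._count.get(method, 0)
        return self._total[method] / count if count else 0.0

    def slow_methods(self, threshold_seconds):
        """Return the sorted names of methods whose mean duration exceeds the threshold."""
        return sorted(m for m in self._count if self.average(m) > threshold_seconds)


if __name__ == "__main__":
    tracker = RPCDurationTracker()
    tracker.observe("hmy_call", 0.2)
    tracker.observe("hmy_call", 0.4)
    tracker.observe("hmy_blockNumber", 0.01)
    print(tracker.slow_methods(0.25))
```

An alert on one aggregate average only shows that something is slow. A per-method breakdown like this one shows which method to look at.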
Oct 30, 2021
- Several beacon out-of-sync issues at Digital Ocean nodes (shards 1-3), auto-resolved
- Isolated all websocket servers onto their own nodes, which reduced load on the other nodes
- Resolved the growing pileup of websocket connections: the LB timeout was too short (60s), changed to 1800s (30 min)
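The effect of the timeout change can be illustrated with a simplified model. The report does not spell out the mechanism, so this assumes the LB closes any connection that stays idle longer than its timeout and that the client then reconnects. The function name and the traffic timestamps are made up for illustration.

```python
def count_idle_disconnects(event_times, idle_timeout, session_end):
    """Count how often an LB idle timeout would cut one websocket session.

    event_times: sorted timestamps (seconds) at which traffic is seen.
    The connection is assumed to open at t=0 and the client is assumed to
    reconnect whenever traffic resumes after a cut.
    """
    disconnects = 0
    last_activity = 0.0
    for t in list(event_times) + [session_end]:
        # A gap longer than the idle timeout means the LB dropped the connection.
        if t - last_activity > idle_timeout:
            disconnects += 1
        last_activity = t
    return disconnects


if __name__ == "__main__":
    traffic = [30, 200, 260, 1500]  # hypothetical message times within a 30-minute session
    print(count_idle_disconnects(traffic, idle_timeout=60, session_end=1800))
    print(count_idle_disconnects(traffic, idle_timeout=1800, session_end=1800))
```

Under this model, a quiet subscription is cut repeatedly with a 60s timeout and never with an 1800s one. That is consistent with the reconnect pileup going away after the change.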
Oct 31, 2021 to Nov 2, 2021
- Several beacon out-of-sync issues at Digital Ocean nodes (shards 1-3), auto-resolved