Summary
- High CPU and high memory usage due to abuse of RPC methods. Resolved automatically.
- Beacon out of sync - happened at almost the same time as the RPC abuse. Resolved automatically.
- DB/disk space alerts - mitigated by expanding disks or removing excessive logs.
July 27th
- Memory alert for explorer nodes (4:45 a.m. - 11:00 a.m.)
  - After restarting the nodes, the alert rate dropped and calmed down.
  - Another high-memory alert occurred between 6:41 and 8:00.
  - Prometheus data is missing; suspect abuse of RPC calls.
- Meanwhile, Beacon out of sync occurred
  - The out-of-sync rate was high before the nodes were restarted for the memory issue; after the restart, the rate dropped significantly.
  - Resolved automatically.
July 28th
- Only one Beacon out of sync issue. Resolved automatically.
- Error “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0”
  - Resolved automatically
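For context on what this alarm evaluates (it recurs below and is flagged in the takeaways as needing better documentation), here is a minimal sketch that pulls the same statistic, the per-period Average of UnHealthyHostCount, from CloudWatch with the AWS SDK for Go. The namespace, dimensions, and identifiers are placeholders, since the actual load balancer type and names are not recorded in these notes:

```go
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatch"
)

func main() {
	sess := session.Must(session.NewSession())
	cw := cloudwatch.New(sess)

	// Query the statistic the alarm evaluates: the Average of UnHealthyHostCount
	// per 60s period over the last 15 minutes. The alarm fires when this average
	// is >= 1.0, i.e. at least one backend host failed its health checks.
	// NOTE: namespace and dimension values are placeholders; a classic ELB would
	// use the AWS/ELB namespace with a LoadBalancerName dimension instead.
	out, err := cw.GetMetricStatistics(&cloudwatch.GetMetricStatisticsInput{
		Namespace:  aws.String("AWS/ApplicationELB"),
		MetricName: aws.String("UnHealthyHostCount"),
		Dimensions: []*cloudwatch.Dimension{
			{Name: aws.String("LoadBalancer"), Value: aws.String("app/EXAMPLE/0123456789abcdef")},
			{Name: aws.String("TargetGroup"), Value: aws.String("targetgroup/EXAMPLE/0123456789abcdef")},
		},
		StartTime:  aws.Time(time.Now().Add(-15 * time.Minute)),
		EndTime:    aws.Time(time.Now()),
		Period:     aws.Int64(60),
		Statistics: []*string{aws.String("Average")},
	})
	if err != nil {
		fmt.Println("GetMetricStatistics failed:", err)
		return
	}
	for _, dp := range out.Datapoints {
		fmt.Printf("%s  average unhealthy hosts: %.1f\n",
			dp.Timestamp.Format(time.RFC3339), *dp.Average)
	}
}
```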
July 29th
- Eth bridge down - resolved by Soph
- Error “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0”
  - Resolved automatically
July 30th
- Error “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0”
  - Resolved automatically
July 31st
- RPC response delay (AverageDurationRPCTest): over a 15-minute window the average is normally below 500ms; during the issue there were multiple spikes, by which point users were already complaining about problems.
  - Restarted explorer nodes
  - Data missing in Grafana
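For reference, a minimal sketch of the kind of measurement such a duration test makes: one timed JSON-RPC round trip compared against the 500ms baseline. The endpoint, port, and method below are assumptions, not the actual test configuration:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Placeholder endpoint and method; the real AverageDurationRPCTest
	// configuration (URL, port, RPC method) is not recorded in these notes.
	const endpoint = "http://localhost:9500"
	const threshold = 500 * time.Millisecond

	body := []byte(`{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}`)

	start := time.Now()
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	elapsed := time.Since(start)
	if err != nil {
		fmt.Println("RPC call failed:", err)
		return
	}
	defer resp.Body.Close()

	fmt.Printf("round trip took %v (baseline < %v)\n", elapsed, threshold)
	if elapsed > threshold {
		fmt.Println("WARNING: response time above the normal 500ms average")
	}
}
```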
Aug 1st - Aug 2nd
- Unhealthy host alert
  - Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0
  - Resolved automatically
- Disk space alert
  - Explorer nodes running out of disk
    - Some report wrong disk usage (at /data instead of /)
  - Extended the volume of some Virginia instances from 160GB to 200GB
  - Other instances removed the latest logs from July (freeing ~5GB of disk)
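Since some nodes report usage for the wrong mount point, a small check like the sketch below can report both mount points side by side. This uses Linux statfs(2); the /data path is assumed for nodes that keep chain data on a separate volume:

```go
package main

import (
	"fmt"
	"syscall"
)

// usage returns total and available space (GiB) for a mount point via statfs(2).
func usage(path string) (totalGiB, freeGiB float64, err error) {
	var st syscall.Statfs_t
	if err = syscall.Statfs(path, &st); err != nil {
		return 0, 0, err
	}
	bs := uint64(st.Bsize)
	totalGiB = float64(st.Blocks*bs) / (1 << 30)
	freeGiB = float64(st.Bavail*bs) / (1 << 30)
	return totalGiB, freeGiB, nil
}

func main() {
	// Check both mount points: some explorer nodes keep chain data under /data
	// on a separate volume, others write everything to the root filesystem.
	for _, p := range []string{"/", "/data"} {
		total, free, err := usage(p)
		if err != nil {
			fmt.Printf("%-6s statfs failed: %v\n", p, err)
			continue
		}
		fmt.Printf("%-6s %.1f GiB free of %.1f GiB\n", p, free, total)
	}
}
```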
Takeaways:
- Some Prometheus data is missing from July 28th to Aug 2nd. Needs investigation.
  - Fixed already
  - Add the Prometheus service to PagerDuty
- Some new alerts need better documentation of handling:
  - Average UnHealthyHostCount (description can be clearer?) - updated on 4 Aug by Nita Neou (Soph)
  - AverageDurationRPCTest
- Some nodes have volumes mounted at different points (/ and /data). Need to distinguish them and check disk usage at /data (see the sketch after the Aug 1st - Aug 2nd section).
  - Haodi can look into this issue.
- A large number of logs come from core/tx_pool.go:1023 ("new transaction added to transaction pool"). The log level should be lowered (see the sketch below).
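A minimal illustration of the effect of that change, using Go's standard log/slog rather than the node's actual logging package (which may differ): once the per-transaction message is emitted at Debug level, it is dropped under a default Info threshold, while less frequent summary lines still appear.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// Handler level is Info, so anything logged at Debug is dropped.
	logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
		Level: slog.LevelInfo,
	}))

	// Hypothetical stand-in for the noisy per-transaction line at
	// core/tx_pool.go:1023: once demoted to Debug it no longer floods the logs.
	logger.Debug("new transaction added to transaction pool", "hash", "0xabc123") // suppressed
	logger.Info("tx pool summary", "pending", 1024, "queued", 37)                 // still emitted
}
```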