Weekly on-call summary July 27 - August 2nd, 2021

Jacky · August 4, 2021, 6:07pm

Summary

Memory alert for explorer nodes (4:45 a.m. - 11:00 a.m.)
- After the nodes were restarted, the alert rate dropped and settled.
- Another high-memory episode occurred at 6:41-8:00.
  - Prometheus data is missing for this period. We suspect abuse of RPC calls.
Meanwhile, a Beacon out-of-sync alert fired
- The out-of-sync rate was high before the nodes were restarted for the memory issue. After the restart, the rate dropped significantly.
- Resolved automatically.

Only one Beacon out of sync issue. Resolved automatically.
Error “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0”
- Resolved automatically

Eth bridge down - resolved by Soph
Error “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0”
- Resolved automatically

Error “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0”
- Resolved automatically

RPC response delay (AverageDurationRPCTest: averaged over a period of 15 min. Under normal conditions the average is below 500ms; during an issue there are multiple spikes, and users will already have started complaining.)
- Restarted explorer nodes
- Data missing in Grafana
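
To make the AverageDurationRPCTest condition concrete, here is a minimal sketch of a 15-minute rolling average compared against the 500ms figure. The class and function names are hypothetical, and treating 500ms as the alert threshold is an assumption; the real alert configuration is not shown in this summary.

```python
from collections import deque

WINDOW_SECONDS = 15 * 60   # 15-minute averaging period from the alert description
THRESHOLD_MS = 500.0       # assumed threshold: the normal average stays below this


class RollingAverage:
    """Average of RPC durations sampled within the last `window_seconds`."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp_seconds, duration_ms) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, duration_ms):
        """Record one sample and drop samples that fell out of the window."""
        self.samples.append((timestamp, duration_ms))
        self.total += duration_ms
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            _, old_duration = self.samples.popleft()
            self.total -= old_duration

    def average(self):
        """Return the mean duration in ms, or 0.0 when there are no samples."""
        return self.total / len(self.samples) if self.samples else 0.0


def is_degraded(average_ms, threshold_ms=THRESHOLD_MS):
    """True when the rolling average reaches the assumed threshold."""
    return average_ms >= threshold_ms
```

A few large spikes are enough to push the window average over the threshold, which matches the observation that users notice problems as soon as spikes appear.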

Unhealthy host alert
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0
- Resolved automatically
Disk space alert
- Explorer nodes were running out of disk
  - Some nodes report the wrong disk usage (the volume is mounted at /data instead of /)
  - Extended the volumes of some Virginia instances from 160GB to 200GB
  - On other instances, removed the latest logs from July (frees ~5GB of disk)

Some Prometheus data is missing from July 28th to Aug 2nd. Needs investigation.
Fixed already
Add the Prometheus service to PagerDuty
Some new alerts need better documentation on how to handle them:
Average UnHealthyHostCount (Description can be clearer?) - updated on 4 Aug by Nita Neou (Soph)
AverageDurationRPCTest
Some nodes have volumes mounted at different points (/ and /data). We need to distinguish them and check disk usage at /data.
Haodi can look into this issue.
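
As a starting point for the mount-point issue, here is a minimal sketch that checks disk usage on whichever of / and /data is actually a mount point on a given node. The function names and the 85% threshold are assumptions for illustration, not values from the existing alerts.

```python
import os
import shutil


def usage_percent(path):
    """Return used space on the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def check_mounts(candidates=("/", "/data"), threshold=85.0):
    """Map each real mount point in `candidates` to (percent_used, over_threshold).

    Paths that are not mount points on this node are skipped, so a node without
    a separate /data volume only reports /.
    """
    results = {}
    for path in candidates:
        if not os.path.ismount(path):
            continue
        percent = usage_percent(path)
        results[path] = (percent, percent >= threshold)
    return results


if __name__ == "__main__":
    for mount, (percent, over) in check_mounts().items():
        status = "ALERT" if over else "ok"
        print(f"{mount}: {percent:.1f}% used [{status}]")
```

Reporting both mount points separately avoids the case where / looks healthy while the data volume at /data is nearly full.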
A large number of log lines come from core/tx_pool.go:1023 (new transaction added to transaction pool). We should lower the log level.
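
To confirm which call sites dominate the logs before changing levels, a rough sketch like the following could count lines per source location. It assumes each log line is a JSON object with a `caller` field; that format and the function names are assumptions, since the actual log format is not described here.

```python
import json
from collections import Counter


def count_callers(lines):
    """Count log lines per `caller` value, skipping lines that are not JSON objects."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON; ignore rather than fail on a mixed-format log
        if not isinstance(record, dict):
            continue
        caller = record.get("caller")
        if isinstance(caller, str) and caller:
            counts[caller] += 1
    return counts


def top_callers(lines, n=5):
    """Return the `n` most frequent callers as (caller, count) pairs."""
    return count_callers(lines).most_common(n)
```

Running `top_callers` over a sample of a node's log file would show whether a single location accounts for most of the volume.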
