Weekly on-call summary July 27 - August 2, 2021

Summary

  • High CPU and High memory due to abuse of RPC methods. Auto resolved.
  • Beacon out of sync - occurred at almost the same time as the RPC abuse. Auto resolved.
  • DB space alerts - resolved by expanding the disk or removing excessive logs.

July 27th

  • Memory alert for explorer nodes (4:45 a.m. - 11:00 a.m.)
    • After the nodes were restarted, the alert rate dropped and settled.
    • Another high-memory episode still occurred at 6:41-8:00.
      • Prometheus data is missing. Suspected cause: abuse of RPC calls.
  • Meanwhile, Beacon out of sync occurred
    • The out-of-sync rate was high before the nodes were restarted for the memory issue. After the restart, the rate dropped significantly.
    • Resolved automatically.

July 28th

  • Only one Beacon out of sync issue. Resolved automatically.
  • Error “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0”
    • Resolved automatically

July 29th

  • Eth bridge down - resolved by Soph
  • Error “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0”
    • Resolved automatically

July 30th

  • Error “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0”
    • Resolved automatically

July 31st

  • RPC response delay (AverageDurationRPCTest: averaged over a 15-minute period. Under normal conditions the average is below 500 ms; during an issue there are multiple spikes, and users will already have started complaining.)
    • Restarted explorer nodes
    • Data missing in Grafana
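
The AverageDurationRPCTest condition described above (a 15-minute average that normally stays below 500 ms) can be sketched roughly as follows. This is only an illustration, not the actual alert definition: the class name, the sample format, and the use of 500 ms as the alerting threshold are all assumptions.

```python
from collections import deque

WINDOW_SECONDS = 15 * 60   # averaging period taken from the alert description
THRESHOLD_MS = 500.0       # normal average is below this; using it as the threshold is an assumption


class RollingRpcLatency:
    """Hypothetical helper: keeps (timestamp, duration_ms) samples from the last window."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window_seconds = window_seconds
        self.samples = deque()

    def add(self, timestamp, duration_ms):
        """Record one RPC test result and drop samples older than the window."""
        self.samples.append((timestamp, duration_ms))
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()

    def average_ms(self):
        """Average duration over the samples currently in the window."""
        if not self.samples:
            return 0.0
        return sum(d for _, d in self.samples) / len(self.samples)

    def is_degraded(self, threshold_ms=THRESHOLD_MS):
        """True when the windowed average reaches the threshold."""
        return self.average_ms() >= threshold_ms
```

Because the value is an average, a few large spikes inside the window are enough to push it over the threshold even when most calls are fast.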

Aug 1st - Aug 2nd

  • Unhealthy host alert
    • Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0
    • Resolved automatically
  • Disk space alert
    • Explorer nodes running out of disk
      • Some nodes report the wrong disk usage (at /data instead of /)
      • Extended the volume of some Virginia instances from 160 GB to 200 GB
      • On other instances, removed the latest logs from July (frees ~5 GB of disk)
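
Since some nodes keep their data under /data rather than /, a check that looks at both mount points avoids the misleading readings noted above. The sketch below is a minimal example using only the Python standard library; the function names and the 90% threshold are assumptions for illustration, not the existing monitoring setup.

```python
import os
import shutil


def disk_usage_report(mount_points=("/", "/data")):
    """Return {path: percent_used} for each mount point that exists on this host."""
    report = {}
    for path in mount_points:
        if not os.path.isdir(path):
            continue  # e.g. nodes that have no separate /data volume
        usage = shutil.disk_usage(path)
        if usage.total == 0:
            continue  # skip pseudo-filesystems that report no capacity
        report[path] = round(100.0 * usage.used / usage.total, 1)
    return report


def over_threshold(report, threshold_percent=90.0):
    """List the mount points whose usage is at or above the (assumed) threshold."""
    return sorted(p for p, pct in report.items() if pct >= threshold_percent)


if __name__ == "__main__":
    report = disk_usage_report()
    print(report)
    print("needs attention:", over_threshold(report))
```

On a node where /data is not a separate volume, both paths report the same filesystem, so the two numbers will simply match.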

Takeaways:

  1. Some Prometheus data is missing from July 28th to Aug 2nd. Needs investigation.
    • Fixed already
    • Add the Prometheus service to PagerDuty
  2. Some new alerts need better documentation on how to handle them:
    • Average UnHealthyHostCount (Description can be clearer?) - updated on 4 Aug Nita Neou (Soph)
    • AverageDurationRPCTest
  3. Some nodes have volumes mounted at different points (/ and /data). Need to distinguish them and check disk usage at /data.
    • Haodi can look into this issue.
  4. A large number of logs come from core/tx_pool.go:1023 (new transaction added to transaction pool). The log level should be lowered.
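
To confirm which call sites produce the most log volume before changing any log level, the lines can be tallied by source location. The sketch below assumes the node logs are JSON lines with a "caller" field holding the file and line number; that format, the field name, and the function name are assumptions and should be checked against the real log files.

```python
import json
from collections import Counter


def count_callers(lines, field="caller"):
    """Count log lines per source location, skipping lines that are not JSON objects."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # not JSON; ignore rather than fail
        if isinstance(record, dict) and field in record:
            counts[str(record[field])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python count_callers.py < some-node.log
    for caller, n in count_callers(sys.stdin).most_common(10):
        print(n, caller)
```

Running this on a sample of a node's log shows the top sources, which makes it easier to judge how much disk a lower log level would actually save.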