Weekly OnCall Rotation - Aug 2-9, 2021

Oncall: Ganesha, Jenya ({}@harmony.one)

Duration: 8/2 12:00pm - 8/9 12:00pm

Summary

  • Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal
  • Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0
  • beacon out of sync! - mainnet
  • Node CPU Usage Rate Alert - the cpu usage rate of the mainnet shard{0-3} node(IP) is abnormal
    • Many “Harmony Process has been restarted due to 300 sec above 90% cpu” notifications
  • AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration…
  • Explorer indexer was down because of a Postgres integer type overflow; fixed by updating the schema. This was a one-time software issue, so no runbook update is needed
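
For context on the indexer failure: PostgreSQL's integer type is a signed 32-bit value, so a value past 2,147,483,647 no longer fits. The sketch below only illustrates that boundary. The helper name and the sample value are hypothetical, and this report does not say which column overflowed or which type the fix moved to (bigint is assumed here as a common choice).

```python
# Documented PostgreSQL ranges: integer is 4 bytes, bigint is 8 bytes.
PG_INTEGER_MAX = 2**31 - 1  # 2147483647
PG_BIGINT_MAX = 2**63 - 1


def fits_pg_type(value: int, type_max: int) -> bool:
    """Return True if value fits a signed column whose maximum is type_max."""
    return -type_max - 1 <= value <= type_max


# Hypothetical value, one past the 32-bit limit.
sample = PG_INTEGER_MAX + 1
print(fits_pg_type(sample, PG_INTEGER_MAX))  # False
print(fits_pg_type(sample, PG_BIGINT_MAX))   # True
```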

Details

  • 8/2
    • Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #3
  • 8/3
    • Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #13
    • Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #26 (annoying)
  • 8/4
    • beacon out of sync! - mainnet - #3
    • Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #7
    • Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #28
    • Node CPU Usage Rate Alert - the cpu usage rate of the mainnet shard{0-3} node(IP) is abnormal - #2
    • AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #2
  • 8/5
    • beacon out of sync! - mainnet - #3
    • AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #7
    • Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #25
    • Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #10
  • 8/6
    • beacon out of sync! - mainnet - #1
    • AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #4
    • Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #17
    • explorer-indexer-s0 ( … ). It was down for 7 hours and 37 minutes.
    • explorer-indexer-token ( … ). It was down for 7 hours and 37 minutes.
  • 8/7
    • beacon out of sync! - mainnet - #2
    • AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #2
    • Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #15
    • Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #2
  • 8/8
    • beacon out of sync! - mainnet - #5
    • AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #3
    • Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #15
    • Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #2
  • 8/9
    • beacon out of sync! - mainnet - #4
    • AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #2
    • Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #5
    • Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #5
    • Node CPU Usage Rate Alert - the cpu usage rate of the mainnet shard{0-3} node(IP) is abnormal - #2
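
To see which alerts dominated the week, the per-day numbers above can be totalled. This is a minimal sketch that assumes each "#N" is the number of times the alert fired that day; the short keys (disk, unhealthy_host, and so on) are labels made up for this example. The explorer-indexer outages on 8/6 have no count and are left out.

```python
from collections import Counter

# Per-day counts copied from the Details list above.
# Assumption: "#N" means the alert fired N times that day.
daily_counts = {
    "8/2": {"disk": 3},
    "8/3": {"disk": 13, "unhealthy_host": 26},
    "8/4": {"beacon": 3, "disk": 7, "unhealthy_host": 28, "cpu": 2, "rpc": 2},
    "8/5": {"beacon": 3, "rpc": 7, "unhealthy_host": 25, "disk": 10},
    "8/6": {"beacon": 1, "rpc": 4, "unhealthy_host": 17},
    "8/7": {"beacon": 2, "rpc": 2, "unhealthy_host": 15, "disk": 2},
    "8/8": {"beacon": 5, "rpc": 3, "unhealthy_host": 15, "disk": 2},
    "8/9": {"beacon": 4, "rpc": 2, "unhealthy_host": 5, "disk": 5, "cpu": 2},
}

# Sum the counts for each alert type across all days.
totals = Counter()
for counts in daily_counts.values():
    totals.update(counts)

# Print alert types from most to least frequent.
for alert, count in totals.most_common():
    print(f"{alert}: {count}")
```

Under that assumption the weekly totals are unhealthy_host 131, disk 42, rpc 20, beacon 18 and cpu 4, so UnHealthyHostCount accounts for well over half of the counted alerts.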

Operation Improvement

  • Apart from “Node Disk Space Alert”, no other alert needed action

  • Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 is annoying; fine-tune it

  • Node CPU Usage Rate Alert can be fine-tuned - recurring?

    • Narrow down which RPC is triggering this (staking/non-staking RPCs) - TODO soph
    • Separate explorer endpoint for staking?
  • Update the AWS console setup instructions - Jack
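
One possible direction for the alert fine-tuning above is to fire only when a breach persists, rather than on every single datapoint at or above the threshold. The sketch below simulates an "at least M of the last N datapoints" rule. The function name, the window sizes and the sample series are all hypothetical, and this report does not say which tuning the team chose.

```python
from collections import deque


def alarm_states(samples, threshold=1.0, window=5, required=3):
    """Return one bool per sample: True when at least `required` of the
    last `window` samples are >= threshold."""
    recent = deque(maxlen=window)  # rolling record of recent breaches
    states = []
    for value in samples:
        recent.append(value >= threshold)
        states.append(sum(recent) >= required)
    return states


# Hypothetical series: one brief blip, then a sustained problem.
samples = [0, 1, 0, 0, 0, 1, 1, 1, 0, 0]

# Fires on every datapoint at or above the threshold, including the blip.
print(alarm_states(samples, window=1, required=1))

# Ignores the blip and fires only once breaches accumulate.
print(alarm_states(samples, window=5, required=3))
```

The trade-off is latency: with the stricter rule the alarm in this example starts two datapoints after the sustained problem begins and clears later as well.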
