Weekly OnCall Rotation
Oncall: Ganesha, Jenya ({}@harmony.one)
Duration: 8/2 12:00pm - 8/9 12:00pm
Summary
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0
- beacon out of sync! - mainnet
- Node CPU Usage Rate Alert - the cpu usage rate of the mainnet shard{0-3} node(IP) is abnormal
- Many “Harmony Process has been restarted due to 300 sec above 90% cpu” alerts
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration…
- Explorer indexer was down because of a Postgres integer type overflow; fixed by updating the schema. This was a one-time software issue, so no runbook update is needed
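For context on the explorer indexer outage: a PostgreSQL `integer` column is 4 bytes and tops out at 2147483647, while `bigint` is 8 bytes. The report does not say which column overflowed or what the schema change was, so the following is only a rough sketch of how headroom against that limit could be checked; `needs_wider_type` and the 80% warning fraction are hypothetical, not part of the indexer.

```python
# Upper bounds of PostgreSQL's signed integer types.
PG_INTEGER_MAX = 2**31 - 1  # 4-byte integer
PG_BIGINT_MAX = 2**63 - 1   # 8-byte bigint


def integer_headroom(current_max, limit=PG_INTEGER_MAX):
    """Return the fraction of the type's positive range already used."""
    return current_max / limit


def needs_wider_type(current_max, warn_fraction=0.8, limit=PG_INTEGER_MAX):
    """True when the largest stored value is close to the type's limit.

    warn_fraction is an illustrative threshold, not a recommended value.
    """
    return integer_headroom(current_max, limit) >= warn_fraction


if __name__ == "__main__":
    # A hypothetical column whose largest value is 2 billion.
    print(needs_wider_type(2_000_000_000))                       # close to the integer limit
    print(needs_wider_type(2_000_000_000, limit=PG_BIGINT_MAX))  # ample room as bigint
```

The input to such a check would be the largest value currently stored in the column, for example the result of a `SELECT max(...)` query.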
Details
- 8/2
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #3
- 8/3
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #13
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #26 (annoying)
- 8/4
- beacon out of sync! - mainnet #3
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #7
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #28
- Node CPU Usage Rate Alert - the cpu usage rate of the mainnet shard{0-3} node(IP) is abnormal - #2
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #2
- 8/5
- beacon out of sync! - mainnet #3
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #7
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #25
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #10
- 8/6
- beacon out of sync! - mainnet #1
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #4
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #17
- explorer-indexer-s0 ( … ). It was down for 7 hours and 37 minutes.
- explorer-indexer-token ( … ). It was down for 7 hours and 37 minutes.
- 8/7
- beacon out of sync! - mainnet #2
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #2
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #15
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #2
- 8/8
- beacon out of sync! - mainnet #5
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #3
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #15
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #2
- 8/9
- beacon out of sync! - mainnet #4
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #2
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #5
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #5
- Node CPU Usage Rate Alert - the cpu usage rate of the mainnet shard{0-3} node(IP) is abnormal - #2
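The per-day entries above can be totalled per alert type to see which alerts dominate. This is a minimal Python sketch, assuming the trailing `#N` on each line is the number of times that alert fired that day (the report does not state this explicitly). `tally_alerts` and `SAMPLE` are illustrative names, and the sample covers only the 8/3 and 8/4 entries.

```python
import re
from collections import Counter

# Lines copied from the 8/3 and 8/4 entries of the log above.
SAMPLE = """\
- 8/3
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #13
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #26 (annoying)
- 8/4
- beacon out of sync! - mainnet #3
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #7
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #28
- Node CPU Usage Rate Alert - the cpu usage rate of the mainnet shard{0-3} node(IP) is abnormal - #2
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #2
"""


def tally_alerts(log_text):
    """Sum the assumed '#N' counts per alert name across all days."""
    totals = Counter()
    for line in log_text.splitlines():
        match = re.search(r"#(\d+)", line)
        if not line.startswith("- ") or match is None:
            continue  # skip date headers and lines without a count
        # The alert name is the text before the first " - " separator.
        name = line[2:].split(" - ")[0].strip()
        totals[name] += int(match.group(1))
    return totals


if __name__ == "__main__":
    for name, count in tally_alerts(SAMPLE).most_common():
        print(f"{count:4d}  {name}")
```

On this two-day sample, the UnHealthyHostCount alert accounts for 54 of the counted alerts, ahead of Node Disk Space Alert at 20.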
Operation Improvement
- No action needed except for “Node Disk Space Alert”
- “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0” is annoying; fine-tune it
- “Node CPU Usage Rate Alert” can be fine-tuned. Is it recurring?
- Fine-tune by which RPC triggers this (staking vs. non-staking RPCs) - TODO soph
- Separate explorer endpoint for staking?
- Update the AWS console setup instructions - Jack
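One possible way to quiet an alert such as UnHealthyHostCount is to require several breaching datapoints within a window before alarming, rather than firing on every single breach. The report does not say how the alarm is configured or how it should be tuned, so this is only a sketch of the idea. The `1.0` threshold comes from the alert name; `window`, `required`, and `alarm_states` are hypothetical.

```python
def alarm_states(datapoints, threshold=1.0, window=5, required=3):
    """Return one alarm state per datapoint.

    The state is True when at least `required` of the last `window`
    datapoints are greater than or equal to `threshold`.
    """
    states = []
    for i in range(len(datapoints)):
        recent = datapoints[max(0, i - window + 1): i + 1]
        breaches = sum(1 for value in recent if value >= threshold)
        states.append(breaches >= required)
    return states


if __name__ == "__main__":
    # Hypothetical series: isolated one-off unhealthy hosts.
    flapping = [0, 1, 0, 0, 1, 0, 0, 0]
    # Firing on every breach (1 of 1) alarms twice.
    print(alarm_states(flapping, window=1, required=1))
    # Requiring 3 of the last 5 datapoints stays quiet.
    print(alarm_states(flapping, window=5, required=3))
```

The trade-off is slower detection: a sustained problem is reported only after `required` breaching datapoints, so the values would need to be chosen against how quickly an unhealthy host must be noticed.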