Weekly OnCall Rotation

ganesha · August 10, 2021, 4:32pm

Oncall: Ganesha, Jenya ({}@harmony.one)

Duration: 8/2 12:00pm - 8/9 12:00pm

Summary

Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal
Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0
beacon out of sync! - mainnet
Node CPU Usage Rate Alert - the cpu usage rate of the mainnet shard{0-3} node(IP) is abnormal
- Many “Harmony Process has been restarted due to 300 sec above 90% cpu”
AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration…
The explorer indexer was down because of a Postgres integer type overflow; it was fixed by updating the schema. This was a one-time software issue, so the runbook does not need updating
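To make the indexer overflow concrete, here is a minimal sketch of the range check involved. The post does not say which column overflowed or what it was changed to, so the move from `integer` to `bigint` and the sample value below are assumptions for illustration only.

```python
# Upper bounds of PostgreSQL's signed integer column types.
PG_INTEGER_MAX = 2**31 - 1  # 4-byte "integer"
PG_BIGINT_MAX = 2**63 - 1   # 8-byte "bigint"


def fits_pg_type(value: int, pg_type: str) -> bool:
    """Return True if value can be stored in the given Postgres integer type."""
    limits = {"integer": PG_INTEGER_MAX, "bigint": PG_BIGINT_MAX}
    max_value = limits[pg_type]
    # Signed two's-complement range: [-max - 1, max].
    return -max_value - 1 <= value <= max_value


# Hypothetical value, chosen only because it exceeds the 4-byte range.
sample_value = 3_000_000_000
print(fits_pg_type(sample_value, "integer"))  # False
print(fits_pg_type(sample_value, "bigint"))   # True
```

A check like this could run in the indexer before an insert, so an out-of-range value is logged instead of taking the service down.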

Details

8/2
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #3
8/3
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #13
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #26 (annoying)
8/4
- beacon out of sync! - mainnet #3
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #7
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #28
- Node CPU Usage Rate Alert - the cpu usage rate of the mainnet shard{0-3} node(IP) is abnormal - #2
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #2
8/5
- beacon out of sync! - mainnet #3
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #7
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #25
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #10
8/6
- beacon out of sync! - mainnet #1
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #4
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #17
- explorer-indexer-s0 ( … ). It was down for 7 hours and 37 minutes.
- explorer-indexer-token ( … ). It was down for 7 hours and 37 minutes.
8/7
- beacon out of sync! - mainnet #2
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #2
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #15
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #2
8/8
- beacon out of sync! - mainnet #5
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #3
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #15
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #2
8/9
- beacon out of sync! - mainnet #4
- AverageDurationRPCTest - Over a period of 15 min, in normal condition the average of all RPC duration… - #2
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 - #5
- Node Disk Space Alert - the free space of the mainnet shard{0-3} node(IP) is abnormal - #5
- Node CPU Usage Rate Alert - the cpu usage rate of the mainnet shard{0-3} node(IP) is abnormal - #2
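The daily figures above can be totalled per alert to see which ones dominate the week. This is only a sketch: it assumes each "#N" is the number of times the alert fired that day, and the short alert keys are names made up for this example.

```python
from collections import Counter

# "#N" values copied from the Details list, read as daily alert counts (assumption).
daily_counts = {
    "8/2": {"disk": 3},
    "8/3": {"disk": 13, "unhealthy_host": 26},
    "8/4": {"beacon_sync": 3, "disk": 7, "unhealthy_host": 28, "cpu": 2, "rpc_duration": 2},
    "8/5": {"beacon_sync": 3, "rpc_duration": 7, "unhealthy_host": 25, "disk": 10},
    "8/6": {"beacon_sync": 1, "rpc_duration": 4, "unhealthy_host": 17},
    "8/7": {"beacon_sync": 2, "rpc_duration": 2, "unhealthy_host": 15, "disk": 2},
    "8/8": {"beacon_sync": 5, "rpc_duration": 3, "unhealthy_host": 15, "disk": 2},
    "8/9": {"beacon_sync": 4, "rpc_duration": 2, "unhealthy_host": 5, "disk": 5, "cpu": 2},
}

# Sum the per-day counts into one weekly total per alert.
totals = Counter()
for counts in daily_counts.values():
    totals.update(counts)

# Print alerts from most to least frequent.
for alert, total in totals.most_common():
    print(f"{alert}: {total}")
```

Under that reading, UnHealthyHostCount accounts for 131 of the 215 alerts in the week, followed by disk space at 42.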

Operation Improvement

Apart from “Node Disk Space Alert”, no other alert needed action
Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 is annoying and should be fine-tuned
Node CPU Usage Rate Alert can be fine-tuned - recurring?
- Narrow down which RPCs trigger it (staking/non-staking RPCs) - TODO soph
- Separate explorer endpoint for staking?
Update the AWS console setup instructions - Jack
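For the CPU alert tuning, it may help to spell out the rule quoted in the Summary ("300 sec above 90% cpu") as code. The sketch below is not the actual Harmony watchdog; the function name, the sample format, and the 60-second sampling interval are assumptions.

```python
def sustained_above(samples, threshold=90.0, window_sec=300):
    """Check for a continuous run of high CPU.

    samples: list of (timestamp_sec, cpu_percent) tuples in time order.
    Returns True if CPU stayed above threshold for at least window_sec
    without dropping back to or below it.
    """
    run_start = None  # timestamp where the current high-CPU run began
    for ts, cpu in samples:
        if cpu > threshold:
            if run_start is None:
                run_start = ts
            if ts - run_start >= window_sec:
                return True
        else:
            # Any sample at or below the threshold resets the run.
            run_start = None
    return False


# Hypothetical samples taken every 60 seconds.
steady = [(t, 95.0) for t in range(0, 301, 60)]
dipped = [(0, 95.0), (60, 95.0), (120, 95.0), (180, 50.0),
          (240, 95.0), (300, 95.0), (360, 95.0)]
print(sustained_above(steady))  # True
print(sustained_above(dipped))  # False
```

Raising `window_sec` or `threshold` is one way to tune how often the restart and its alert fire, at the cost of reacting later to a genuinely stuck process.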
