Weekly OnCall Summary Jan 25 - Feb 01, 2022

Jacksteroo · February 7, 2022, 10:28pm

Weekly OnCall Rotation

OnCall US Pacific: Jack
OnCall Asia/EMEA: Haodi / Soph
Duration: Jan 25, 2022 8:30am - Feb 1, 2022 8:30am PST

RPC simulation / synthetic tests: the Average Duration threshold for the RPC test was raised from 500ms to 1s to reduce how often the alert fires, so we can focus on legitimate issues
The new threshold will still trigger when there’s a real incident (sample of a past incident below)

All synthetic test frequencies increased from every 5m to every 1m for finer granularity
Two new S0 archival nodes built with attached storage in RAID 0, in the hope that we can phase out the expensive SSD-based nodes
Slow graph node sync against our archival node was reported by a Viper dev, on the assumption that rate limits are the root cause. Tested v4.4.0 (RPC with no rate limits), and he confirmed that it worked.

No activity stood out; alarms fired less often

Slow graph node sync - An ENG build is now available with PR 4039, which removes the rate limits on the do_evm_call method; the Viper dev confirmed that the build worked well. This build is required because of the upcoming hard fork
Our secondary cloud nodes have disks running low on space (> 90% disk usage)
Began upgrading to 2TB attached SSD nodes, completed two days later
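
For anyone who wants to spot-check a node by hand, here is a minimal sketch of the kind of disk-usage check described above. It is not the team's actual monitoring; the function names and the default path are made up for illustration, and the only figure taken from this summary is the 90% level.

```python
import shutil

# Assumption: alert level taken from the "> 90% disk usage" figure in this summary.
DISK_USAGE_ALERT_PERCENT = 90.0


def disk_usage_percent(path: str = "/") -> float:
    """Return used space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100.0


def is_disk_low(percent_used: float, threshold: float = DISK_USAGE_ALERT_PERCENT) -> bool:
    """True when usage is strictly above the alert threshold."""
    return percent_used > threshold


if __name__ == "__main__":
    # Hypothetical mount point; point this at the node's data volume instead.
    pct = disk_usage_percent("/")
    status = "LOW ON SPACE" if is_disk_low(pct) else "ok"
    print(f"/ is {pct:.1f}% full: {status}")
```

The comparison is kept separate from the measurement so the threshold logic can be checked without touching a real disk.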

Tested a traffic split between api.harmony.one and api.s0.t.hmny.io for 10 days, with each endpoint routed to its own siloed Load Balancer and dedicated (not shared) nodes behind it
The test concluded that nodes are healthier behind the traffic-split LBs than behind the non-split LBs
Rolled out to the secondary cloud nodes in the London region
Watchdog barked a lot less; archive nodes struggle when blocks with 1000+ tx come in

xxx-api-harmony-one (servicing api.harmony.one)

xxx-api-s0-t-hmny-io (servicing api.s0.t.hmny.io)

Completed setup of all new traffic split region dependencies (Load Balancers, DNS healthchecks, increased alarm coverage)
Updated AWS CloudWatch alarms to be less sensitive (10 data points ~= 10 mins)
All RPC Synthetics alarms are now updated to be more “realistic”
Updated the alarm to be more lenient but to fire sooner (3 out of 5)
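
To make the "3 out of 5" idea concrete, here is a rough sketch of an M-out-of-N alarm over a sliding window. It is an illustration only, not CloudWatch's actual evaluation logic (which also handles missing data and state transitions); the class name, the 1.0 second threshold, and the sample durations are all assumptions chosen for the example.

```python
from collections import deque
from typing import Deque


class MOutOfNAlarm:
    """Fire when at least `m` of the last `n` data points breach the threshold."""

    def __init__(self, threshold: float, m: int = 3, n: int = 5) -> None:
        if not 0 < m <= n:
            raise ValueError("require 0 < m <= n")
        self.threshold = threshold
        self.m = m
        # Sliding window of breach flags; the oldest point drops off automatically.
        self.window: Deque[bool] = deque(maxlen=n)

    def observe(self, value: float) -> bool:
        """Record one data point and return True if the alarm is firing."""
        self.window.append(value > self.threshold)
        return sum(self.window) >= self.m


if __name__ == "__main__":
    # Hypothetical per-minute average durations in seconds.
    alarm = MOutOfNAlarm(threshold=1.0)
    durations = [0.4, 1.2, 0.6, 1.5, 1.3, 0.5, 0.4, 0.3]
    for minute, seconds in enumerate(durations, start=1):
        print(minute, seconds, alarm.observe(seconds))
```

With these sample values a single slow minute does not fire the alarm, but it fires once three of the last five minutes are over the threshold, and clears again after the slow points age out of the window. That is the sense in which such a rule can be both more lenient and quicker than waiting for a long run of consecutive breaches.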
