Weekly OnCall Summary Jan 25 - Feb 01, 2022

Weekly OnCall Rotation

OnCall US Pacific: Jack
OnCall Asia/EMEA: Haodi / Soph
Duration: Jan 25, 2022 8:30am - Feb 1, 2022 8:30am PST

Overview

Jan 25, 2022

  • Noisy alarms were being ignored and needed tuning
  • Watchdog was barking a lot

Jan 26, 2022

  • RPC simulation / synthetic tests: the Average Duration RPC test threshold has been increased from 500ms to 1s to reduce alert frequency so we can respond to legitimate issues (a minimal latency-check sketch follows this list)
  • This level will still trigger during a real incident (verified against a sample past incident)
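For reference, a minimal sketch of the kind of measurement behind the Average Duration check, assuming a plain eth_blockNumber JSON-RPC call stands in for the real test's RPC mix; the endpoint, sample count, and method here are assumptions, and the 1.0s threshold mirrors the new alert level.

    # Minimal sketch of an "average duration" RPC probe (assumed details, not
    # the actual synthetic test implementation).
    import time
    import requests

    ENDPOINT = "https://api.harmony.one"   # endpoint under test
    THRESHOLD_SECONDS = 1.0                # raised from 0.5s on Jan 26

    def average_rpc_duration(samples=5):
        """Average wall-clock duration of a basic JSON-RPC call."""
        payload = {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
        durations = []
        for _ in range(samples):
            start = time.monotonic()
            requests.post(ENDPOINT, json=payload, timeout=10).raise_for_status()
            durations.append(time.monotonic() - start)
        return sum(durations) / len(durations)

    if __name__ == "__main__":
        avg = average_rpc_duration()
        print(f"average duration: {avg:.3f}s")
        if avg > THRESHOLD_SECONDS:
            print("ALARM: average RPC duration above 1s threshold")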

Jan 27, 2022

  • All synthetic tests now run every 1 minute instead of every 5 minutes for finer granularity
  • Built two new S0 archival nodes with RAID 0 attached storage, in the hope of phasing out the expensive SSD-based nodes
  • A Viper dev reported slow Graph node sync against our archival node, with rate limits assumed to be the root cause. Tested v4.4.0 (RPC with no rate limits) and the dev confirmed it worked (a rough burst-test sketch follows this list)
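A rough sketch of how the rate-limit hypothesis could be checked, assuming eth_call traffic exercises the rate-limited do_evm_call path; the node URL, call target, and burst size are placeholders, not our actual test.

    # Burst a batch of eth_call requests at the archival node and count failed
    # responses; a spike in errors under load is consistent with rate limiting.
    import requests

    NODE_URL = "http://archival-node.example:9500"   # placeholder archival node endpoint
    CALL = {"to": "0x0000000000000000000000000000000000000000", "data": "0x"}  # placeholder call

    def burst_eth_call(n=200):
        """Fire n back-to-back eth_call requests and count failures."""
        failures = 0
        for i in range(n):
            payload = {"jsonrpc": "2.0", "method": "eth_call",
                       "params": [CALL, "latest"], "id": i}
            resp = requests.post(NODE_URL, json=payload, timeout=10)
            if resp.status_code != 200 or "error" in resp.json():
                failures += 1
        return failures

    if __name__ == "__main__":
        print(f"failed calls in burst: {burst_eth_call()}")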

Jan 28, 2022

No activity stood out; alarms fired less often

Jan 29, 2022

  • Slow Graph node sync: an ENG build is now available with PR 4039, which removes the rate limits on the do_evm_call method, and the Viper dev confirmed that build worked well. This build is required because of the upcoming hard fork
  • Our secondary cloud nodes have disks running low on space (>90% disk usage; a quick usage-check sketch follows this list)
  • Began upgrading them to nodes with 2TB attached SSDs; completed two days later
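A quick sketch of the disk check behind the >90% figure; the mount point is an assumption for the chain data volume.

    # Report disk usage for the chain data mount and warn above 90%.
    import shutil

    MOUNT = "/data"          # assumed mount point for the chain database
    THRESHOLD_PCT = 90.0

    def disk_usage_pct(path):
        usage = shutil.disk_usage(path)
        return 100.0 * usage.used / usage.total

    if __name__ == "__main__":
        pct = disk_usage_pct(MOUNT)
        print(f"{MOUNT}: {pct:.1f}% used")
        if pct > THRESHOLD_PCT:
            print("WARNING: disk usage above 90%, plan an upgrade or prune")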

Jan 30, 2022

  • Tested a 10-day traffic split between api.harmony.one and api.s0.t.hmny.io, routing each to siloed Load Balancers with dedicated (not shared) nodes behind them; a quick DNS sanity-check sketch follows the LB names below
  • The test concluded that nodes are healthier behind traffic-split LBs vs. non-split LBs
  • Rolled out to secondary cloud nodes in the London region
  • Watchdog barked a lot less; archive nodes still struggle when blocks with 1000+ transactions come in

xxx-api-harmony-one (servicing api.harmony.one)

xxx-api-s0-t-hmny-io (servicing api.s0.t.hmny.io)
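A small sanity check for the split, assuming the two domains should now resolve to different (siloed) load balancer addresses; this only inspects public DNS, not the LB configuration itself.

    # Resolve both API domains and confirm they no longer share addresses.
    import socket

    DOMAINS = ["api.harmony.one", "api.s0.t.hmny.io"]

    def resolve(domain):
        """Return the set of IPs the domain currently resolves to."""
        _, _, addrs = socket.gethostbyname_ex(domain)
        return set(addrs)

    if __name__ == "__main__":
        ips = {d: resolve(d) for d in DOMAINS}
        for domain, addrs in ips.items():
            print(f"{domain}: {sorted(addrs)}")
        shared = ips[DOMAINS[0]] & ips[DOMAINS[1]]
        print("split OK" if not shared else f"shared addresses: {sorted(shared)}")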

Jan 31, 2022

  • Completed setup of all new traffic split region dependencies (Load Balancers, DNS healthchecks, increased alarm coverage)
  • Updated AWS CloudWatch alarms to be less sensitive (10 data points ~= 10 mins)
  • All RPC synthetics alarms are now updated to more realistic thresholds
  • Updated alarms to be more lenient overall but to fire sooner when breached (3 out of 5 data points; see the alarm sketch after this list)
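A hedged sketch of the "3 out of 5" tuning via boto3; the alarm name, namespace, metric, and SNS topic are placeholders, and the 1.0s threshold matches the Jan 26 change.

    # Configure a CloudWatch alarm that fires when 3 of the last 5 one-minute
    # data points breach the 1s average-duration threshold.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="rpc-synthetic-average-duration",   # placeholder name
        Namespace="Custom/RPCSynthetics",             # placeholder namespace
        MetricName="AverageDuration",                 # placeholder metric (seconds)
        Statistic="Average",
        Period=60,                                    # matches the 1-minute test cadence
        EvaluationPeriods=5,                          # look at the last 5 data points
        DatapointsToAlarm=3,                          # alarm when 3 of 5 breach
        Threshold=1.0,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="breaching",
        AlarmActions=["arn:aws:sns:us-west-2:123456789012:oncall-alerts"],  # placeholder topic
    )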

Takeaway

  • Archive nodes struggle during high-traffic spikes
  • Explorer nodes are much healthier with traffic splitting (on the existing v4.3.3)
  • The secondary cloud's 2TB SSD-based nodes are still more affordable than the AWS equivalents