Weekly On-call Summary: Oct 19 - Oct 25, 2021

Weekly On-call Rotation

On-call: @giv @sophoah

10/19 - 10/25

Summary

  • Most incidents auto-resolved
  • Several RPC outages; moved 85% of traffic to Pocket to relieve the load

Details

Oct 19, 2021

  • Giv: A few “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0” alarms, which auto-resolved.
  • Giv: multisig.harmony.one was put in maintenance mode to sync its DB

Oct 20, 2021

  • Giv: A few “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0” alarms, which auto-resolved.
  • Giv: multisig.harmony.one is back online
  • Giv: The Multisig API went down briefly but auto-resolved; it looks like a short CPU spike
  • Giv: Another short Multisig API timeout

Oct 21, 2021

  • Giv: Several out-of-sync beacon incidents back to back; I restarted the service.
  • Soph: Created 2 new nodes to replace old ones; disk was at 90%.
  • Soph: Not long after 7 of the 8 nodes were replaced around 11:25 PM UTC, many of our load balancers started reporting lots of unhealthy hosts, and eventually all of them did. I had to add back HARMONY-MIN-PRUNE-EXPLORER-S0-3/4/5/6, which had been slated for termination.

Oct 22, 2021

  • Took HARMONY-MIN-PRUNE-EXPLORER-S0-7-api.s0-NEW out of all the target groups; it seemed to be a bad actor, causing multiple target groups (including HARMONY-MIN-PRUNE-EXPLORER-S0-6-api.s0-NEW) to trigger the “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0” alarm. A deregistration sketch follows below.
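
A minimal sketch, assuming boto3 and that the node is registered by instance ID in each target group, of how a misbehaving node can be pulled out of every target group; the region, ARNs and instance ID below are placeholders, not the values used during the incident.

    import boto3

    # Assumed region and a hypothetical instance ID for the S0-7 node.
    elbv2 = boto3.client("elbv2", region_name="us-west-2")
    bad_instance_id = "i-0123456789abcdef0"

    # Walk every target group and deregister the bad node wherever it is registered.
    paginator = elbv2.get_paginator("describe_target_groups")
    for page in paginator.paginate():
        for tg in page["TargetGroups"]:
            arn = tg["TargetGroupArn"]
            health = elbv2.describe_target_health(TargetGroupArn=arn)
            registered = {d["Target"]["Id"] for d in health["TargetHealthDescriptions"]}
            if bad_instance_id in registered:
                elbv2.deregister_targets(TargetGroupArn=arn, Targets=[{"Id": bad_instance_id}])
                print(f"deregistered {bad_instance_id} from {tg['TargetGroupName']}")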

Oct 24, 2021

  • Giv: Multiple RPC errors. We moved 85% of the traffic to Pocket to reduce the stress; errors are subsiding.
  • All our pruned RPC nodes are out of sync (3k ~ 7k blocks behind); see the sync-check sketch after this list.
  • I added archival nodes that are not exposed externally (i.e. the sushi archival nodes) to the bridge endpoint to restore the bridge RPC service.
  • Adding the archival nodes to api.harmony.one would unsync them after a very short while (I had 3 of them).
  • I then took all archival nodes back out, to avoid taking our archival RPC service down and to allow a faster sync.
  • I took 4 of the original 8 api.harmony.one nodes out to upgrade them from c5.2xlarge to c5.9xlarge (they were the most behind; the remaining four are c5.4xlarge) and to wait for them to sync; see the instance-resize sketch after this list.
  • My objective is to add 3 archival nodes + 5 c5.9xlarge nodes (4 + 1, see below) into api.harmony.one, in the hope that there are enough nodes to support the load/traffic.
  • 2 other pruned nodes that were on standby (7/8) were stopped and upgraded to c5.9xlarge; 1 never came back online, and 1 will be used.
  • Waited for the 8 nodes to fully sync and added them to the api.harmony.one group; 2 minutes later, 4 were already out of sync again.
  • The nodes are now playing musical chairs, taking turns going in and out of sync.
  • I don’t think the user experience is fixed right now: as the nodes go in and out of sync, they are taken out of and added back into the load balancer, which adds to the instability.
  • Soph: Our RPC nodes still go out of sync, be they archival or pruned, as long as they are added to api.harmony.one, but at least the nodes are now powerful enough to stay near the last block and catch up. Previously a node couldn’t catch up with the last block and would stay unhealthy forever.
  • There is only one pruned node left (44.229.11.235) that is still catching up.
  • All targets are taking turns between healthy and unhealthy states.
  • Whatever I do, I can’t SSH to i-0244bd36143490418 (HARMONY-MIN-PRUNE-EXPLORER-S0-7-api.s0-NEW); is there a way to get console access to it in AWS?
  • I believe the stickiness configuration on the load balancer helped with the CALL_EXCEPTION error issue; see the stickiness sketch after this list.
  • However, we still have long RPC calls: see team-devops with the synthetic test (not sure why the notification still goes to that group, though) and the timeout issue we discussed above with Wolf and the OpenSwap Farms - Yield Farm ERC20 / BEP20 tokens page load (it works better now but takes a very long time to load).
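
Sync-check sketch referenced above: a minimal way to measure how far each RPC node is behind, assuming the nodes expose the standard Harmony JSON-RPC port 9500 and using the public shard-0 endpoint as the height reference; apart from 44.229.11.235, the IPs are placeholders.

    import requests

    NODES = ["44.229.11.235", "10.0.0.2", "10.0.0.3"]  # only the first IP appears in the notes
    REFERENCE = "https://api.s0.t.hmny.io"             # public shard-0 endpoint used as reference

    def block_number(url: str) -> int:
        payload = {"jsonrpc": "2.0", "id": 1, "method": "hmyv2_blockNumber", "params": []}
        return int(requests.post(url, json=payload, timeout=10).json()["result"])

    reference_height = block_number(REFERENCE)
    for ip in NODES:
        try:
            lag = reference_height - block_number(f"http://{ip}:9500")
            print(f"{ip}: {lag} blocks behind")
        except Exception as exc:
            print(f"{ip}: unreachable ({exc})")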
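
Instance-resize sketch referenced above: a minimal stop / resize / start cycle for moving a node from c5.2xlarge to c5.9xlarge with boto3, assuming EBS-backed instances; the region and instance ID are placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region
    instance_id = "i-0123456789abcdef0"                 # hypothetical pruned-node ID

    # The instance must be stopped before its type can be changed.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Resize, then bring the node back up; it still has to re-sync before
    # it can be added back to api.harmony.one.
    ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": "c5.9xlarge"})
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])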
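
Stickiness sketch referenced above: a minimal example of enabling target-group stickiness with boto3, assuming the RPC target group sits behind an ALB; the ARN and cookie duration are placeholders, not the values actually applied.

    import boto3

    elbv2 = boto3.client("elbv2", region_name="us-west-2")  # assumed region
    tg_arn = "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/api-s0/0123456789abcdef"

    # Pin a client to the same backend node so multi-call RPC flows
    # (e.g. sending a tx and then polling its receipt) see a consistent chain height.
    elbv2.modify_target_group_attributes(
        TargetGroupArn=tg_arn,
        Attributes=[
            {"Key": "stickiness.enabled", "Value": "true"},
            {"Key": "stickiness.type", "Value": "lb_cookie"},
            {"Key": "stickiness.lb_cookie.duration_seconds", "Value": "300"},
        ],
    )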

Oct 25, 2021

  • Giv: api.harmony.one is down with { “message”: “Relay attempts exhausted” }
  • Giv: Many errors across apps
  • Giv: More RPC issues; slow response times for getBalance calls (see the latency-probe sketch below)
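
Latency-probe sketch referenced above: a minimal way to quantify the slow getBalance responses, assuming the hmyv2_getBalance JSON-RPC method on api.harmony.one; the address is a placeholder to replace with a real account.

    import time
    import requests

    ENDPOINT = "https://api.harmony.one"
    ADDRESS = "one1exampleaddress"  # placeholder; substitute a real one1 address

    payload = {"jsonrpc": "2.0", "id": 1, "method": "hmyv2_getBalance", "params": [ADDRESS]}

    # Fire a handful of calls and report each round-trip time.
    for i in range(5):
        start = time.monotonic()
        resp = requests.post(ENDPOINT, json=payload, timeout=30)
        elapsed = time.monotonic() - start
        print(f"call {i}: {elapsed:.2f}s, HTTP {resp.status_code}")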