Weekly On-call Summary: Oct 19 - Oct 25, 2021

Weekly On-call Rotation

On-call: @giv @sophoah

10/19 - 10/25

Summary

  • Most incidents auto-resolved
  • Several RPC outages; moved 85% of traffic to Pocket to relieve the load

Details

Oct 19, 2021

  • Giv: A few “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0” alarms, which auto-resolved.
  • Giv: multisig.harmony.one was put in maintenance mode to sync its DB

Oct 20, 2021

  • Giv: A few “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0” alarms, which auto-resolved.
  • Giv: multisig.harmony.one is back online
  • Giv: The Multisig API went down briefly but auto-resolved; it looks like a short CPU spike
  • Giv: Another short Multisig API timeout

Oct 21, 2021

  • Giv: Several out-of-sync beacon incidents back to back; I restarted the service.
  • Soph: Created 2 new nodes to replace old ones; disk was at 90%.
  • Soph: Not long after 7 of the 8 nodes were replaced around 11:25 PM UTC, many of our load balancers started reporting lots of unhealthy hosts, and eventually all of them did. I had to add back HARMONY-MIN-PRUNE-EXPLORER-S0-3/4/5/6, which had been slated for termination.

Oct 22, 2021

  • Took HARMONY-MIN-PRUNE-EXPLORER-S0-7-api.s0-NEW out of all the target groups; it seemed to be a bad actor, causing multiple target groups (including HARMONY-MIN-PRUNE-EXPLORER-S0-6-api.s0-NEW) to trigger the “Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0” alarm. A deregistration sketch follows below.
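
A minimal sketch, assuming boto3 and that the node is registered by instance ID in each target group, of how a misbehaving node can be pulled out of every target group; the region, ARNs and instance ID below are placeholders, not the values used during the incident.

    import boto3

    # Assumed region and a hypothetical instance ID for the S0-7 node.
    elbv2 = boto3.client("elbv2", region_name="us-west-2")
    bad_instance_id = "i-0123456789abcdef0"

    # Walk every target group and deregister the bad node wherever it is registered.
    paginator = elbv2.get_paginator("describe_target_groups")
    for page in paginator.paginate():
        for tg in page["TargetGroups"]:
            arn = tg["TargetGroupArn"]
            health = elbv2.describe_target_health(TargetGroupArn=arn)
            registered = {d["Target"]["Id"] for d in health["TargetHealthDescriptions"]}
            if bad_instance_id in registered:
                elbv2.deregister_targets(TargetGroupArn=arn, Targets=[{"Id": bad_instance_id}])
                print(f"deregistered {bad_instance_id} from {tg['TargetGroupName']}")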

Oct 24, 2021

  • Giv: Multiple RPC errors. We moved 85% of the traffic to Pocket to reduce the stress; errors are subsiding.
  • All our pruned RPC nodes are out of sync (3k ~ 7k blocks behind); see the sync-check sketch after this list.
  • I added archival nodes that are not exposed externally (i.e. the sushi archival nodes) to the bridge endpoint to restore the bridge RPC service.
  • Adding the archival nodes to api.harmony.one would unsync them after a very short while (I had 3 of them).
  • I then took all archival nodes back out, to avoid taking our archival RPC service down and to allow a faster sync.
  • I took 4 of the original 8 api.harmony.one nodes out to upgrade them from c5.2xlarge to c5.9xlarge (they were the most behind; the remaining four are c5.4xlarge) and to wait for them to sync; see the instance-resize sketch after this list.
  • My objective is to add 3 archival nodes + 5 c5.9xlarge nodes (4 + 1, see below) into api.harmony.one, in the hope that there are enough nodes to support the load/traffic.
  • 2 other pruned nodes that were on standby (7/8) were stopped and upgraded to c5.9xlarge; 1 never came back online, and 1 will be used.
  • Waited for the 8 nodes to fully sync and added them to the api.harmony.one group; 2 minutes later, 4 were already out of sync again.
  • The nodes are now playing musical chairs, taking turns going in and out of sync.
  • I don’t think the user experience is fixed right now: as the nodes go in and out of sync, they are taken out of and added back into the load balancer, which adds to the instability.
  • Soph: Our RPC nodes still go out of sync, be they archival or pruned, as long as they are added to api.harmony.one, but at least the nodes are now powerful enough to stay near the last block and catch up. Previously a node couldn’t catch up with the last block and would stay unhealthy forever.
  • There is only one pruned node left (44.229.11.235) that is still catching up.
  • All targets are taking turns between healthy and unhealthy states.
  • Whatever I do, I can’t SSH to i-0244bd36143490418 (HARMONY-MIN-PRUNE-EXPLORER-S0-7-api.s0-NEW); is there a way to get console access to it in AWS?
  • I believe the stickiness configuration on the load balancer helped with the CALL_EXCEPTION error issue; see the stickiness sketch after this list.
  • However, we still have long RPC calls: see team-devops with the synthetic test (not sure why the notification still goes to that group, though) and the timeout issue we discussed above with Wolf and the OpenSwap Farms - Yield Farm ERC20 / BEP20 tokens page load (it works better now but takes a very long time to load).
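
Sync-check sketch referenced above: a minimal way to measure how far each RPC node is behind, assuming the nodes expose the standard Harmony JSON-RPC port 9500 and using the public shard-0 endpoint as the height reference; apart from 44.229.11.235, the IPs are placeholders.

    import requests

    NODES = ["44.229.11.235", "10.0.0.2", "10.0.0.3"]  # only the first IP appears in the notes
    REFERENCE = "https://api.s0.t.hmny.io"             # public shard-0 endpoint used as reference

    def block_number(url: str) -> int:
        payload = {"jsonrpc": "2.0", "id": 1, "method": "hmyv2_blockNumber", "params": []}
        return int(requests.post(url, json=payload, timeout=10).json()["result"])

    reference_height = block_number(REFERENCE)
    for ip in NODES:
        try:
            lag = reference_height - block_number(f"http://{ip}:9500")
            print(f"{ip}: {lag} blocks behind")
        except Exception as exc:
            print(f"{ip}: unreachable ({exc})")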
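
Instance-resize sketch referenced above: a minimal stop / resize / start cycle for moving a node from c5.2xlarge to c5.9xlarge with boto3, assuming EBS-backed instances; the region and instance ID are placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")  # assumed region
    instance_id = "i-0123456789abcdef0"                 # hypothetical pruned-node ID

    # The instance must be stopped before its type can be changed.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Resize, then bring the node back up; it still has to re-sync before
    # it can be added back to api.harmony.one.
    ec2.modify_instance_attribute(InstanceId=instance_id, InstanceType={"Value": "c5.9xlarge"})
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])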
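
Stickiness sketch referenced above: a minimal example of enabling target-group stickiness with boto3, assuming the RPC target group sits behind an ALB; the ARN and cookie duration are placeholders, not the values actually applied.

    import boto3

    elbv2 = boto3.client("elbv2", region_name="us-west-2")  # assumed region
    tg_arn = "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/api-s0/0123456789abcdef"

    # Pin a client to the same backend node so multi-call RPC flows
    # (e.g. sending a tx and then polling its receipt) see a consistent chain height.
    elbv2.modify_target_group_attributes(
        TargetGroupArn=tg_arn,
        Attributes=[
            {"Key": "stickiness.enabled", "Value": "true"},
            {"Key": "stickiness.type", "Value": "lb_cookie"},
            {"Key": "stickiness.lb_cookie.duration_seconds", "Value": "300"},
        ],
    )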

Oct 25, 2021

  • Giv: api.harmony.one is down with { “message”: “Relay attempts exhausted” }
  • Giv: Many errors across apps
  • Giv: More RPC issues; slow response times for getBalance calls (see the latency-probe sketch below)
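
Latency-probe sketch referenced above: a minimal way to quantify the slow getBalance responses, assuming the hmyv2_getBalance JSON-RPC method on api.harmony.one; the address is a placeholder to replace with a real account.

    import time
    import requests

    ENDPOINT = "https://api.harmony.one"
    ADDRESS = "one1exampleaddress"  # placeholder; substitute a real one1 address

    payload = {"jsonrpc": "2.0", "id": 1, "method": "hmyv2_getBalance", "params": [ADDRESS]}

    # Fire a handful of calls and report each round-trip time.
    for i in range(5):
        start = time.monotonic()
        resp = requests.post(ENDPOINT, json=payload, timeout=30)
        elapsed = time.monotonic() - start
        print(f"call {i}: {elapsed:.2f}s, HTTP {resp.status_code}")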