On-call: @giv July 21 - July 26, 2021
- Excessive node memory usage alerts on PagerDuty.
- Missing runbook for troubleshooting RPC issues.
- Beacon out of sync alerts.
- Bridge backend downtime.
- TODO: Improve runbook and potentially combine on-call contacts (@ganesha)
- Giv: Removed testnet from the #status bot due to noisy incidents; will re-add once @sophoah implements a maintenance window.
- Giv: Added Explorer backend (184.108.40.206:8888) to status monitoring.
- Giv: Incident 8:27 pm PT: (@sophoah) to adjust Watchdog to reduce noise.
- Giv: Incident 10:50 am PT: 220.127.116.11:9500 beacon out of sync! - mainnet - @giv restarted process on the node.
- Giv: Incident 4:05 pm PT: Ongoing memory usage alerts on 18.104.22.168 (> 85%). @giv restarted the process; memory is back to normal but slowly climbing back up.
- Giv: Incident 5:02 pm PT: 22.214.171.124 memory usage high; auto-resolved. The memory usage alert window likely needs to be widened so brief spikes don't trigger pages.
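Widening the window amounts to a "sustained breach" rule: page only when every sample in the window exceeds the threshold, so a short spike never fires. A minimal Python sketch of the idea (threshold and window values here are illustrative, not our actual alert settings):

```python
THRESHOLD = 85.0  # percent memory usage (illustrative value)
WINDOW = 15       # consecutive samples required above threshold (illustrative)

def should_alert(samples, threshold=THRESHOLD, window=WINDOW):
    """Fire only if the most recent `window` samples all exceed `threshold`."""
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])

# A brief two-sample spike no longer pages:
print(should_alert([60] * 20 + [90, 92] + [60] * 5))  # False
# A sustained breach still does:
print(should_alert([60] * 5 + [90] * 15))             # True
```

The trade-off is detection latency: a genuine leak now pages only after the full window elapses, which is acceptable for slow memory climbs like the ones above.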
- Giv: Incident 8:14 pm PT: Low disk space on 126.96.36.199. @sophoah looking into increasing disk space.
- Soph: (https://github.com/harmony-one/harmony-ops-priv/issues/50) The node has been upgraded with a bigger disk in Lightsail and is now syncing, though syncing seems very slow.
- Soph: Watchdog testnet consensus thresholds are defined on the Jenkins machine in /usr/local/watchdog/configs/testnet.yaml. A warning/alert is sent after 10 minutes of downtime; we can increase that again if it is still too noisy.
- Soph: A new PagerDuty Jenkins job has been created and is now used by the testnet nightly update: https://jenkins.harmony.one/job/PagerDuty-Maintenance/
- Giv: Incident 8:37 am PT: 188.8.131.52:9500 (shard 2) has its beacon chain at block height 15195035, while the actual beacon height is 15505824 (~310k blocks behind). Instance health looks normal.
- Giv: 9:23 am PT: Seeing a lot of node memory alerts in PD, every few minutes, mostly on shard 2.
- Giv: Incident 10:37 am PT: RPC response times climbing; hmy_getStorageAt is taking over a minute to respond. Status page updated.
- Soph: Grafana mem alert has been updated to alert only after 15 min of mem usage above 85%
- Soph: The S2 Explorer node's beacon shard chain is being cloned to fix the ~300k-block lag.
- Soph: Multiple high-memory Grafana alerts that required restarting the node's harmony process (x5).
- Giv: Incident 12:22 pm PT: The getlastcrosslinks method is taking 3.5 minutes to respond. TODO: Add timeouts to all methods.
- Giv: Incident 10:32 am PT: 184.108.40.206:9500 beacon out of sync! - mainnet - Restarted process.
- Giv: Incident 1:02 pm PT: Monitor is DOWN: eth-bridge(backend-4). Unsure how to respond; auto-resolved after 1 hour.
- Giv: Incident 1:37 pm PT: AverageDurationRPCTest. Seeing a huge increase in response times and CPU utilization across the board. Incident auto-resolved.
- Giv: Many memory alerts throughout the day.
- Giv: Users are complaining about RPC performance. Response times do seem slower overall, with a few spikes.
- Giv: Bridge is down: complaints from Viper users. The bridge incident on PD was assigned to Ganesha instead of me. Resolved by Yuriy; it was caused by a conflict with the last update.
- Giv: Still seeing > 10 memory alerts per day