On-call: @giv July 21 - July 26, 2021
- Excessive node memory usage alerts on PagerDuty.
- Missing runbook for troubleshooting RPC issues.
- Beacon out of sync alerts.
- Bridge backend downtime.
- TODO: Improve runbook and potentially combine on-call contacts (@ganesha)
- Giv: Removed testnet from the #status bot due to noisy incidents; will re-add once @sophoah implements a maintenance window.
- Giv: Added Explorer backend (184.108.40.206:8888) to status monitoring.
- Giv: Incident 8:27 pm PT: (@sophoah) to adjust Watchdog to reduce noise.
- Giv: Incident 10:50 am PT: 220.127.116.11:9500 beacon out of sync! - mainnet - @giv restarted process on the node.
- Giv: Incident 4:05 pm PT: Ongoing memory usage alerts on 18.104.22.168 (> 85%). @giv restarted the process; memory is back to normal but slowly climbing back up.
- Giv: Incident 5:02 pm PT: 22.214.171.124 memory usage high; auto-resolved. The memory usage alert window likely needs to be widened so brief spikes don't trigger pages.
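Widening the window amounts to a "sustained breach" rule: page only when every sample in the window exceeds the threshold, so a short spike never fires. A minimal Python sketch of the idea (threshold and window values here are illustrative, not our actual alert settings):

```python
THRESHOLD = 85.0  # percent memory usage (illustrative value)
WINDOW = 15       # consecutive samples required above threshold (illustrative)

def should_alert(samples, threshold=THRESHOLD, window=WINDOW):
    """Fire only if the most recent `window` samples all exceed `threshold`."""
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])

# A brief two-sample spike no longer pages:
print(should_alert([60] * 20 + [90, 92] + [60] * 5))  # False
# A sustained breach still does:
print(should_alert([60] * 5 + [90] * 15))             # True
```

The trade-off is detection latency: a genuine leak now pages only after the full window elapses, which is acceptable for slow memory climbs like the ones above.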
- Giv: Incident 8:14 pm PT: Low disk space on 126.96.36.199. @sophoah looking into increasing disk space.
- Soph: (https://github.com/harmony-one/harmony-ops-priv/issues/50) The node has been upgraded with a bigger disk in Lightsail and is now syncing, though syncing seems very slow.
- Soph: Watchdog testnet consensus thresholds are defined on the Jenkins machine in /usr/local/watchdog/configs/testnet.yaml. A warning/alert is sent after 10 minutes of downtime; we can increase that again if it is still too noisy.
- Soph: A new PagerDuty Jenkins job has been created and is now used by the testnet nightly update: https://jenkins.harmony.one/job/PagerDuty-Maintenance/
- Giv: Incident 8:37 am PT: 188.8.131.52:9500 (shard 2) has its beacon chain at block height 15195035, while the actual beacon height is 15505824 (~310k blocks behind). Instance health looks normal.
- Giv: 9:23 am PT: Seeing a lot of node memory alerts in PD, every few minutes, mostly on shard 2.
- Giv: Incident 10:37 am PT: RPC response times climbing; hmy_getStorageAt is taking over a minute to respond. Status page updated.
- Soph: Grafana mem alert has been updated to alert only after 15 min of mem usage above 85%
- Soph: The S2 Explorer node's beacon shard chain is being cloned to fix the ~300k-block lag.
- Soph: Multiple high-memory Grafana alerts that required restarting the node's harmony process (x5).
- Giv: Incident 12:22 pm PT: The getlastcrosslinks method is taking 3.5 minutes to respond. TODO: Add timeouts to all methods.
- Giv: Incident 10:32 am PT: 184.108.40.206:9500 beacon out of sync! - mainnet - Restarted process.
- Giv: Incident 1:02 pm PT: Monitor is DOWN: eth-bridge(backend-4). Unsure how to respond; auto-resolved after 1 hour.
- Giv: Incident 1:37 pm PT: AverageDurationRPCTest. Seeing a huge increase in response times and CPU utilization across the board. Incident auto-resolved.
- Giv: Many memory alerts throughout the day.
- Giv: Users are complaining about RPC performance. Response times do seem slower overall, with a few spikes.
- Giv: Bridge is down: complaints from Viper users. The bridge incident on PD was assigned to Ganesha instead of me. Resolved by Yuriy; it was caused by a conflict with the last update.
- Giv: Still seeing > 10 memory alerts per day