On-call: @giv July 21 - July 26, 2021
- Excessive node memory usage alerts on PagerDuty.
- Missing runbook for troubleshooting RPC issues.
- Beacon out of sync alerts.
- Bridge backend downtime.
- TODO: Improve runbook and potentially combine on-call contacts (@ganesha)
- Giv: Removed testnet from the #status bot due to noisy incidents; will re-add once @sophoah implements a maintenance window.
- Giv: Added Explorer backend (184.108.40.206:8888) to status monitoring.
- Giv: Incident 8:27 pm PT: (@sophoah) to adjust Watchdog to reduce noise.
- Giv: Incident 10:50 am PT: 220.127.116.11:9500 beacon out of sync! - mainnet - @giv restarted process on the node.
- Giv: Incident 4:05 pm PT: Ongoing memory usage alerts on 18.104.22.168 (> 85%). @giv restarted the process; memory is back to normal but slowly climbing back up.
- Giv: Incident 5:02 pm PT: 22.214.171.124 memory usage high; auto-resolved. The memory usage alert window likely needs to be widened so brief spikes don't trigger pages.
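Widening the window amounts to a "sustained breach" rule: page only when every sample in the window exceeds the threshold, so a short spike never fires. A minimal Python sketch of the idea (threshold and window values here are illustrative, not our actual alert settings):

```python
THRESHOLD = 85.0  # percent memory usage (illustrative value)
WINDOW = 15       # consecutive samples required above threshold (illustrative)

def should_alert(samples, threshold=THRESHOLD, window=WINDOW):
    """Fire only if the most recent `window` samples all exceed `threshold`."""
    if len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])

# A brief two-sample spike no longer pages:
print(should_alert([60] * 20 + [90, 92] + [60] * 5))  # False
# A sustained breach still does:
print(should_alert([60] * 5 + [90] * 15))             # True
```

The trade-off is detection latency: a genuine leak now pages only after the full window elapses, which is acceptable for slow memory climbs like the ones above.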
- Giv: Incident 8:14 pm PT: Low disk space on 126.96.36.199. @sophoah looking into increasing disk space.
- Soph: (https://github.com/harmony-one/harmony-ops-priv/issues/50) The node has been upgraded with a bigger disk in Lightsail and is now syncing, though syncing seems very slow.
- Soph: Watchdog testnet consensus thresholds are defined on the Jenkins machine in /usr/local/watchdog/configs/testnet.yaml. A warning/alert is sent after 10 minutes of downtime; we can increase that again if it is still too noisy.
- Soph: A new PagerDuty Jenkins job has been created and is now used by the testnet nightly update: https://jenkins.harmony.one/job/PagerDuty-Maintenance/
- Giv: Incident 8:37 am PT: 188.8.131.52:9500 (shard 2) has its beacon chain at block height 15195035, while the actual beacon height is 15505824 (~310k blocks behind). Instance health looks normal.
- Giv: 9:23 am PT: Seeing a lot of node memory alerts in PD, every few minutes, mostly on shard 2.
- Giv: Incident 10:37 am PT: RPC response times climbing; hmy_getStorageAt is taking over a minute to respond. Status page updated.
- Soph: Grafana mem alert has been updated to alert only after 15 min of mem usage above 85%
- Soph: The S2 Explorer node's beacon shard chain is being cloned to fix the ~300k-block lag.
- Soph: Multiple high-memory Grafana alerts that required restarting the node's harmony process (x5).
- Giv: Incident 12:22 pm PT: The getlastcrosslinks method is taking 3.5 minutes to respond. TODO: Add timeouts to all methods.
- Giv: Incident 10:32 am PT: 184.108.40.206:9500 beacon out of sync! - mainnet - Restarted process.
- Giv: Incident 1:02 pm PT: Monitor is DOWN: eth-bridge(backend-4). Unsure how to respond; auto-resolved after 1 hour.
- Giv: Incident 1:37 pm PT: AverageDurationRPCTest. Seeing a huge increase in response times and CPU utilization across the board. Incident auto-resolved.
- Giv: Many memory alerts throughout the day.
- Giv: Users are complaining about RPC performance. Response times do seem slower overall, with a few spikes.
- Giv: Bridge is down: complaints from Viper users. The bridge incident on PD was assigned to Ganesha instead of me. Resolved by Yuriy; it was caused by a conflict with the last update.
- Giv: Still seeing > 10 memory alerts per day