Weekly On-call Summary Oct 11th - Oct 18th, 2021

Weekly OnCall Rotation

Oncall: RJ, Jenya (rongjian@harmony.one, jenya@harmony.one)

Duration: 10/11 08:30am - 10/18 08:30pm

Summary

  • The MultiSig service is unstable and keeps going down. In most cases a simple restart brings the service back up. Jenya and Jack are investigating and fixing the root cause, which appears to be mostly related to the AWS instances; more profiling is needed.

  • Migrated the Multisig Postgres database from a local Docker container to the AWS cloud. This fixed the CPU issue, and the API is now stable.

  • The Multisig indexers are 24 hours behind. The tracing RPC is slow, and we currently index only 20 blocks (40 seconds of data) in ~25 seconds. When I try to fire more requests, the RPC starts to return empty responses or disconnects on timeout. As a result, the latest actions are currently not visible in the UI.

  • The API and indexers run in different Docker containers, so it will be easy to separate them. This is not urgent, as the indexers consume very little CPU/RAM.

  • Soph increased the disks on all S1/S2/S3 validator nodes to 500GB; all are now at 50% disk usage. On S0, some nodes are reaching 80%, but there is not much we can do there since those are attached disks (not EBS).

  • Auto-resolved:
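
As a rough check on the indexer backlog figures above: while the indexer processes a batch, the chain keeps producing blocks, so only the surplus (40 seconds of chain data indexed minus ~25 seconds spent) shrinks the backlog. The sketch below estimates the catch-up time, assuming the rate stays at the reported numbers and the backlog is exactly 24 hours; the function name is made up for illustration.

```python
def catch_up_hours(backlog_hours: float,
                   chain_seconds_per_batch: float,
                   wall_seconds_per_batch: float) -> float:
    """Estimate wall-clock hours until the indexer reaches the chain head.

    Each batch indexes `chain_seconds_per_batch` of chain data but takes
    `wall_seconds_per_batch` to run, during which the chain advances by the
    same amount of wall time. Only the difference reduces the backlog.
    """
    surplus = chain_seconds_per_batch - wall_seconds_per_batch
    if surplus <= 0:
        raise ValueError("indexer is not faster than the chain; backlog never shrinks")
    batches_needed = backlog_hours * 3600 / surplus
    return batches_needed * wall_seconds_per_batch / 3600


if __name__ == "__main__":
    # Figures from the summary: 24 h behind, 40 s of data indexed in ~25 s.
    print(f"{catch_up_hours(24, 40, 25):.1f} hours to catch up")
```

With the reported figures this comes to roughly 40 hours, and any further slowdown of the tracing RPC would push that out.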

Details

10/12

10/13

10/14

10/15

10/16

10/17

  • Soph increased the disks on all S1/S2/S3 validator nodes to 500GB; all are now at 50% disk usage. On S0, some nodes are reaching 80%, but there is not much we can do there since those are attached disks (not EBS).
  • Multi-sig server down: https://harmonyone.pagerduty.com/incidents/Q029FY5VFDXMZ2
  • Node Disk Space Alert - the free space of the mainnet shard2 node abnormal
  • A few beacon out-of-sync issues that were auto-resolved.

10/18

Suggestions:

  1. For each alert, link the runbook to it in the alert message.
  2. For each existing host, put some description on what it’s used for and whether it’s customer-impacting.
  3. Hold a walkthrough with the team on the architecture/runbook of each service.
  4. Store the vaultpass password in LastPass and share instructions on how to retrieve it.
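
For suggestion 1, one possible shape is a small lookup that appends the runbook link to the alert text before it is sent. This is only a sketch: the alert names, the example.com URLs, and the `with_runbook` helper are all hypothetical placeholders, not anything from our current alerting setup.

```python
# Hypothetical alert-name -> runbook-URL mapping; names and URLs are placeholders.
RUNBOOKS = {
    "multisig-server-down": "https://example.com/runbooks/multisig-server-down",
    "node-disk-space": "https://example.com/runbooks/node-disk-space",
}


def with_runbook(alert_name: str, message: str) -> str:
    """Return the alert message with a runbook line appended.

    Alerts without a runbook are flagged explicitly so the gap is visible
    to whoever is on call.
    """
    url = RUNBOOKS.get(alert_name)
    if url is None:
        return f"{message}\nRunbook: none yet for '{alert_name}'"
    return f"{message}\nRunbook: {url}"


if __name__ == "__main__":
    print(with_runbook("node-disk-space", "Free space low on mainnet shard2 node"))
```

Flagging missing runbooks in the message itself also gives a running list of which alerts still need one.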