Weekly On-call Summary Oct 11th - Oct 18th, 2021

Weekly On-call Rotation

On-call: RJ, Jenya (rongjian@harmony.one, jenya@harmony.one)

Duration: 10/11 08:30am - 10/18 08:30pm

Summary

  • The MultiSig service is not stable and keeps going down. Usually a simple restart brings the service back up. The root cause is being investigated and fixed by Jenya and Jack; it is mostly related to the AWS instances, and more profiling is needed.

  • Migrated the MultiSig Postgres database from local docker to the AWS cloud; a sketch of the migration follows this list. This fixed the CPU issue and the API is now stable.

  • MultiSig indexers are behind by 24 hours of data. The tracing RPC is slow, and we currently index only 20 blocks (40 seconds of data) in ~25 seconds; a catch-up estimate follows this list. When I fire more requests, the RPC starts to return empty responses or disconnects on timeout, so the latest actions are currently not visible on the UI.

  • The API and the indexers are located in different docker containers, so it will be easy to separate them. Not urgent, as the indexers consume a very small amount of CPU/RAM.

  • Soph updated all validator node S1/S2/S3 disks to 500GB; all are now at 50% disk space usage. For S0, some nodes are reaching 80%, but there is not much we can do there since those are attached disks (not EBS).

  • Auto-resolved: a few beacon out-of-sync alerts (see 10/17).
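
For reference, here is a minimal sketch of the Postgres move mentioned above, assuming a plain pg_dump/pg_restore migration into a managed AWS instance; the hostnames, database name, and credentials are placeholders, not the actual values used.

```python
#!/usr/bin/env python3
"""Sketch of the MultiSig Postgres migration: dump from the local docker
Postgres, restore into AWS. All connection details are placeholders."""
import subprocess

SRC_HOST, SRC_DB, SRC_USER = "localhost", "multisig", "postgres"
# Hypothetical RDS endpoint -- substitute the real one.
DST_HOST = "multisig.xxxxxxxx.us-east-1.rds.amazonaws.com"
DUMP_FILE = "multisig.dump"

# 1. Dump the source database in custom format (-Fc) so pg_restore can
#    load it. Passwords are expected via PGPASSWORD or ~/.pgpass.
subprocess.run(
    ["pg_dump", "-h", SRC_HOST, "-U", SRC_USER, "-Fc", "-f", DUMP_FILE, SRC_DB],
    check=True,
)

# 2. Restore into the AWS instance; --no-owner avoids role mismatches
#    between the docker Postgres and the managed one.
subprocess.run(
    ["pg_restore", "-h", DST_HOST, "-U", SRC_USER, "-d", SRC_DB, "--no-owner", DUMP_FILE],
    check=True,
)
```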
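
The indexer numbers above also give a rough catch-up estimate at the current RPC throughput; the only assumption beyond the quoted figures is the ~2s block time implied by "20 blocks = 40 seconds of data".

```python
# Back-of-the-envelope catch-up estimate for the MultiSig indexer.
BLOCK_TIME = 2.0                  # seconds per block (20 blocks = 40s of data)
INDEX_RATE = 20 / 25              # blocks indexed per second (~0.8)
PRODUCE_RATE = 1 / BLOCK_TIME     # blocks produced per second (0.5)
BACKLOG = 24 * 3600 / BLOCK_TIME  # 24h behind -> 43,200 blocks

# The indexer only gains INDEX_RATE - PRODUCE_RATE (~0.3) blocks/s on the
# chain head, so draining the backlog is slow:
hours = BACKLOG / (INDEX_RATE - PRODUCE_RATE) / 3600
print(f"~{hours:.0f}h to catch up")  # ~40h at current throughput
```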

Details

10/12

10/13

10/14

  • Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 PagerDuty

  • Monitor is DOWN: explorer-api1 ( http://api1.explorer.t.hmny.io:3000/v0/shard/0/block/number/0 ) PagerDuty

    • Runbook: GitBook
    • AWS instance was down and was restarted by Leo.
    • An Elastic IP (EIP) was added, which required an update of the devops host file; that file is encrypted with ansible-vault, and the vault password wasn't available to the on-call (see the sketch after this list).
    • The alert stayed open until Soph came online to update the host file and restart the instance.
  • Monitor is DOWN: explorer-api1-WebSocket PagerDuty

    • api1 WebSocket and HTTPS are linked; see above.
  • The CPU usage rate of the mainnet service node is abnormal PagerDuty
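
For context on the vault issue above, the blocked step looks roughly like this; the host-file path and vault-password location are placeholders, not our actual layout.

```python
#!/usr/bin/env python3
"""Sketch of the devops host-file update that was blocked: the file is
encrypted with ansible-vault, so editing it needs the vault password.
Paths below are placeholders."""
import os
import subprocess

HOSTS_FILE = "ansible/hosts"                          # hypothetical encrypted host file
VAULT_PASS = os.path.expanduser("~/.vault_pass.txt")  # hypothetical password file (see Suggestions #4)

# Decrypts the file, opens $EDITOR on the plaintext, re-encrypts on save.
subprocess.run(
    ["ansible-vault", "edit", HOSTS_FILE, "--vault-password-file", VAULT_PASS],
    check=True,
)
```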

10/15

  • MultiSig Backend ( https://multisig.t.hmny.io/api/v1/about/) PagerDuty
    • Quote from Jack: “So, we managed to get it back up and running, but few things left to do:
      - the healthcheck is not able to sense that the service is actually up (still says failed, don’t kill it please!), probably due to missing docker IP to instance IP mapping
      - the docker container startup script needs to be registered as a new service
      Bigger picture-wise, we need to…
      - either commit more resources to take care of our services like these
      - or outsource them to like a Gnosis partner”
      (A sketch of an external healthcheck follows this list.)
  • Node Memory Usage Rate Alert - the memory usage rate of the mainnet shard3 node is abnormal PagerDuty
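
Since the built-in healthcheck cannot see that the service is up, one stopgap is an external probe against the public endpoint, which sidesteps the docker-IP vs instance-IP mapping. This is a sketch only, not what is deployed; the container name and restart threshold are assumptions.

```python
#!/usr/bin/env python3
"""External watchdog sketch for the MultiSig backend: probe the public
/api/v1/about endpoint and restart the container after repeated failures.
Container name and thresholds are assumptions."""
import subprocess
import time
import urllib.request

URL = "https://multisig.t.hmny.io/api/v1/about/"
CONTAINER = "multisig-backend"  # hypothetical container name
MAX_FAILURES = 3
failures = 0

while True:
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            ok = resp.status == 200
    except Exception:
        ok = False

    failures = 0 if ok else failures + 1
    if failures >= MAX_FAILURES:
        # So far a plain restart has been enough to bring the service back.
        subprocess.run(["docker", "restart", CONTAINER], check=False)
        failures = 0
    time.sleep(30)
```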

10/16

10/17

  • Soph updated all validator node S1/S2/S3 disks to 500GB; all are now at 50% disk space usage. For S0, some nodes are reaching 80%, but there is not much we can do there since those are attached disks (not EBS).
  • MultiSig server down: PagerDuty
  • Node Disk Space Alert - the free space of the mainnet shard2 node is abnormal
  • A few beacon out-of-sync issues that were auto-resolved.

10/18

Suggestions:

  1. For each alert, link its runbook in the alert message.
  2. For each existing host, add a description of what it's used for and whether it's customer-impacting.
  3. Do a run-through with the team of the architecture/runbook for each service.
  4. Store the vaultpass password in LastPass and share instructions on how to retrieve it.