Weekly OnCall Rotation
Oncall: RJ, Jenya (rongjian@harmony.one, jenya@harmony.one)
Duration: 10/11 08:30am - 10/18 08:30pm
Summary
- The MultiSig service is unstable and keeps going down. In most cases a simple restart brings the service back up. Jenya and Jack are investigating and fixing the root cause, which appears to be mostly related to the AWS instances; more profiling is needed.
- Migrated the MultiSig Postgres database from local Docker to AWS cloud. It fixed the CPU issue and now the API is stable.
- MultiSig indexers are 24 hours of data behind. The tracing RPC is slow, and we currently index only 20 blocks (40 seconds of data) in ~25 seconds. When I try to fire more requests, the RPC starts returning empty responses or disconnecting on timeout. As a result, the latest actions are currently not visible in the UI.
- The API and indexers run in different Docker containers, so it will be easy to separate them. This is not urgent, as the indexers consume a very small amount of CPU/RAM.
- Soph expanded the disks on all S1/S2/S3 validator nodes to 500GB; all are now at 50% disk usage. For S0, some nodes are reaching 80%, but there is not much we can do there since those are attached disks (not EBS).
- Auto-resolved:
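A rough back-of-the-envelope estimate of how long the indexers would need to catch up, using only the figures quoted above (24 hours behind, 20 blocks / 40 seconds of data indexed per ~25 seconds). This is a sketch that assumes the indexing rate stays constant and the chain keeps producing data in real time; the function name and parameters are made up for illustration.

```python
def catch_up_hours(backlog_s: float, batch_data_s: float, batch_wall_s: float) -> float:
    """Estimate wall-clock hours needed to clear an indexing backlog.

    While one batch is being indexed, the chain produces another
    batch_wall_s seconds of data, so the net gain per batch is
    batch_data_s - batch_wall_s seconds.
    """
    net_gain_s = batch_data_s - batch_wall_s
    if net_gain_s <= 0:
        raise ValueError("indexer is not faster than the chain; it never catches up")
    batches_needed = backlog_s / net_gain_s
    return batches_needed * batch_wall_s / 3600


# Figures from the summary: 24 h behind, 40 s of data indexed per ~25 s.
estimate = catch_up_hours(backlog_s=24 * 3600, batch_data_s=40, batch_wall_s=25)
print(f"~{estimate:.0f} hours to catch up")  # ~40 hours to catch up
```

Under these assumptions the net gain is only 15 seconds of chain data per 25 seconds of wall-clock time, so the backlog would take roughly 40 hours to clear unless the tracing RPC gets faster.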
Details
10/12
- MultiSig Backend ( https://multisig.t.hmny.io/api/v1/about/ ) was down for 59 minutes and 19 seconds: PagerDuty. Cause: out of resources; access is needed to grant more resources.
- Disk out of space: PagerDuty
- Beacon out of sync (auto-resolved): PagerDuty
10/13
- the cpu usage rate of the mainnet service node is abnormal
- beacon out of sync
10/14
- Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 PagerDuty
- Monitor is DOWN: explorer-api1 ( http://api1.explorer.t.hmny.io:3000/v0/shard/0/block/number/0 ) PagerDuty
  - Runbook: GitBook
  - The AWS instance was down and was restarted by Leo.
  - An EIP was added, which required an update of the devops host file. This file is encrypted with ansible-vault and vaultpass wasn’t enabled, so the fix had to wait until Soph came online to update the host file and restart the instance.
- Monitor is DOWN: explorer-api1-WebSocket PagerDuty
  - Api1 ws and https are linked, see above
- the cpu usage rate of the mainnet service node is abnormal PagerDuty
10/15
- MultiSig Backend ( https://multisig.t.hmny.io/api/v1/about/) PagerDuty
- Quote from Jack: “So, we managed to get it back up and running, but few things left to do
  - the healthcheck is not able to sense that the service is actually up (still says failed, don’t kill it please!), probably due to missing docker IP to instance IP mapping
  - the docker container startup script needs to be registered as a new service
  bigger picture-wise, we need to…
  - either commit more resources to take care of our services like these
  - or outsource them to like a Gnosis partner”
- Node Memory Usage Rate Alert - the memory usage rate of the mainnet shard3 node is abnormal PagerDuty
10/16
- Insufficient sign power of Shard1! - mainnet
- A few beacon out of sync issues that got auto-resolved.
10/17
- Soph expanded the disks on all S1/S2/S3 validator nodes to 500GB; all are now at 50% disk usage. For S0, some nodes are reaching 80%, but there is not much we can do there since those are attached disks (not EBS).
- Multi-sig server down: PagerDuty
- Node Disk Space Alert - the free space of the mainnet shard2 node is abnormal
- A few beacon out of sync issues that got auto-resolved.
10/18
- Tons of Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 alerts (a few every hour) that got auto-resolved.
- Monitor is DOWN: MultiSig Backend PagerDuty
Suggestions:
- For each alert, link its runbook in the alert message.
- For each existing host, add a description of what it’s used for and whether it’s customer-impacting.
- Hold a run-through with the team on the architecture/runbook of each service.
- Store the vaultpass password in LastPass, share it with the team, and include instructions on how to retrieve it.
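One possible way to act on the first suggestion is to keep a small lookup from alert name to runbook and append the link when the alert message is built. This is only a sketch: the `RUNBOOKS` mapping, the `with_runbook` helper, and the example.com URLs are placeholders for illustration, not the real runbook locations or the actual alerting setup.

```python
from typing import Optional

# Hypothetical mapping from alert name to runbook URL (placeholder URLs).
RUNBOOKS = {
    "MultiSig Backend": "https://example.com/runbooks/multisig-backend",
    "explorer-api1": "https://example.com/runbooks/explorer-api1",
}


def with_runbook(alert_name: str, message: str) -> str:
    """Return the alert message with a runbook line appended.

    Alerts without a known runbook get an explicit reminder, so the
    missing documentation is visible to whoever is on call.
    """
    url: Optional[str] = RUNBOOKS.get(alert_name)
    if url is None:
        return f"{message}\nRunbook: none yet, please add one"
    return f"{message}\nRunbook: {url}"


print(with_runbook("MultiSig Backend", "Monitor is DOWN: MultiSig Backend"))
```

Flagging alerts that have no runbook, rather than silently omitting the line, makes the documentation gaps show up during the rotation itself.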