Weekly On-call Summary Oct 11th - Oct 18th, 2021

rongjian · October 19, 2021, 4:46pm

Weekly OnCall Rotation

Oncall: RJ, Jenya (rongjian@harmony.one, jenya@harmony.one)

Duration: 10/11 08:30am - 10/18 08:30pm

The MultiSig service is unstable and keeps going down. In most cases a simple restart brings the service back up. Jenya and Jack are investigating and fixing the root cause, which appears to be mostly related to the AWS instances; more profiling is needed.
Migrated Multisig Postgres database from local docker to AWS cloud. It fixed the CPU issue and now the API is stable.
Multisig indexers are 24 hours behind. The tracing RPC is slow, and we currently index only 20 blocks (40 seconds of data) in ~25 seconds. When I try to fire more requests, the RPC starts to return empty responses or disconnects on timeout. Hence the latest actions are currently not visible in the UI.
The API and indexers run in different Docker containers, so it will be easy to separate them. This is not urgent, as the indexers consume very little CPU/RAM.
Soph updated all validator node S1/S2/S3 disks to 500GB; all are now at 50% disk space usage. For S0, some are reaching 80%, but there is not much we can do there since those are attached disks (not EBS).
Auto-resolved:
- Beacon out of sync
- CPU usage rate
- Insufficient sign power of Shard
- UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0
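
As a rough check on the indexer lag noted above (20 blocks, or 40 seconds of data, indexed in ~25 seconds against a 24-hour backlog), the catch-up time can be estimated. This is a back-of-the-envelope sketch only: it assumes the indexing rate stays constant, the chain keeps producing one second of data per wall-clock second, and the backlog is exactly 24 hours. The function name is made up for illustration.

```python
def catch_up_hours(backlog_hours, data_seconds_per_batch, wall_seconds_per_batch):
    """Estimate wall-clock hours needed to clear an indexing backlog.

    Returns None when the indexer is not faster than the chain, since the
    backlog would then never shrink.
    """
    # The chain adds 1 second of new data per wall-clock second, so only the
    # surplus above real time actually reduces the backlog.
    net_gain = data_seconds_per_batch / wall_seconds_per_batch - 1.0
    if net_gain <= 0:
        return None
    return backlog_hours / net_gain


# Figures from this report: 40 seconds of data per ~25 seconds, 24 hours behind.
hours = catch_up_hours(24, 40, 25)
print(f"Estimated catch-up time: {hours:.0f} hours")
```

Under those assumptions the indexer gains about 0.6 seconds of backlog per wall-clock second, so a 24-hour backlog takes roughly 40 hours to clear.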

MultiSig Backend ( https://multisig.t.hmny.io/api/v1/about/ ). It was down for 59 minutes and 19 seconds. PagerDuty - out of resource: need access to grant more resource
Disk out of space: PagerDuty
Beacon out of sync (auto-resolved): PagerDuty

10/13

10/14

Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 PagerDuty
Monitor is DOWN: explorer-api1 ( http://api1.explorer.t.hmny.io:3000/v0/shard/0/block/number/0 ) PagerDuty
- Runbook: GitBook
- The AWS instance was down and was restarted by Leo.
- An EIP was added, which required an update of the devops host file; this file is encrypted with ansible-vault and vaultpass wasn’t enabled.
- This had to wait until Soph came online to update the host file and restart the instance.
Monitor is DOWN: explorer-api1-WebSocket PagerDuty
- Api1 ws and https are linked, see above
the cpu usage rate of the mainnet service node is abnormal PagerDuty

10/15

MultiSig Backend ( https://multisig.t.hmny.io/api/v1/about/) PagerDuty
- Quote from Jack: “So, we managed to get it back up and running, but few things left to do
  - the healthcheck is not able to sense that the service is actually up (still says failed, don’t kill it please!), probably due to missing docker IP to instance IP mapping
  - the docker container startup script needs to be registered as a new service
  bigger picture-wise, we need to…
  - either commit more resources to take care of our services like these
  - or outsource them to like a Gnosis partner”
Node Memory Usage Rate Alert - the memory usage rate of the mainnet shard3 node is abnormal PagerDuty

10/16

10/17

Soph updated all validator node S1/S2/S3 disks to 500GB; all are now at 50% disk space usage. For S0, some are reaching 80%, but there is not much we can do there since those are attached disks (not EBS).
Multi-sig server down: PagerDuty
Node Disk Space Alert - the free space of the mainnet shard2 node is abnormal
A few beacon out of sync issues that got auto-resolved.

10/18

Tons of Average UnHealthyHostCount GreaterThanOrEqualToThreshold 1.0 alerts (a few every hour) that get auto-resolved.
Monitor is DOWN: MultiSig Backend PagerDuty

Suggestions:

For each alert, link the runbook to it in the alert message.
For each existing host, put some description on what it’s used for and whether it’s customer-impacting.
Hold a run-through with the team on the architecture/runbook of each service.
Store the vaultpass password in LastPass and share instructions on how to retrieve it.
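
The first two suggestions could be sketched as a small lookup that appends the runbook link and host description to each alert message. This is only an illustration: the registry, the function name, the URL and the description are all placeholders, not our actual runbooks or PagerDuty setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertInfo:
    runbook_url: str
    host_purpose: str
    customer_impacting: bool


# Hypothetical registry: every value here is a placeholder to be filled in
# per alert; none of it reflects a real runbook or host description.
ALERT_REGISTRY = {
    "MultiSig Backend": AlertInfo(
        runbook_url="https://example.invalid/runbooks/multisig-backend",
        host_purpose="placeholder: what this host is used for",
        customer_impacting=True,  # placeholder value
    ),
}


def annotate_alert(alert_name: str, message: str) -> str:
    """Append runbook and host details to an alert message."""
    info = ALERT_REGISTRY.get(alert_name)
    if info is None:
        # Make the gap visible so a missing runbook gets noticed and added.
        return f"{message}\nRunbook: MISSING (add an entry for '{alert_name}')"
    impact = "yes" if info.customer_impacting else "no"
    return (
        f"{message}\n"
        f"Runbook: {info.runbook_url}\n"
        f"Host purpose: {info.host_purpose}\n"
        f"Customer-impacting: {impact}"
    )


print(annotate_alert("MultiSig Backend", "Monitor is DOWN: MultiSig Backend"))
```

Flagging alerts with no registry entry, instead of passing them through silently, keeps the list of missing runbooks visible to whoever is on call.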
