Weekly OnCall Rotation
- OnCall US Pacific: Leo (leo@harmony.one)
- OnCall Asia/EMEA: Haodi (haodi@harmony.one)
- Duration: 2021/12/28 8:30am - 2022/01/04 8:30am PST
Summary
Details
12/28:
- PagerDuty
- CPU usage alert on the BTC test node
- PagerDuty
- PagerDuty (resolved for Yuriy)
- Node OOS. Increased disk to 1000G; usage is now 75%
- Removed “latest/*.gz” to free disk space on the root volume
- PagerDuty
- Node was very slow to catch up with the latest blocks; it could only sync 30 blocks per minute
- Catching up now
- PagerDuty
- Auto-resolved after catching up
12/29:
- PagerDuty
- BTC test node has constantly high CPU usage
- PagerDuty
- Indextoken.explorer.t.hmny.io is down
12/30:
- Explorer Postgres DB went down due to an error:
[PostgresStorage:shard0] [2021-12-30T20:12:43.707Z] multixact "members" limit exceeded {
- Tried running the “VACUUM” command on the Postgres DB manually.
- Copied the “internal_transaction” table to a new table.
- Temporarily disabled writes to the “internal_transaction” table to unblock explorer.harmony.one, after discussion with Jack/Daniel. Executed by Jenya.
- Adjusted settings to run auto vacuum every 1 or 2 hours at the table level. Vacuum slows down DB operations, so the explorer will lag behind for a few minutes every 1 or 2 hours.
- Applied multi-row insertion in the indexer.
- Applied a blocklist to leaders to stop a potential hack.
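The multi-row insertion mentioned above can be sketched as follows. This is a minimal illustration and not the indexer's actual code: the helper name `build_multi_row_insert` and the column names in the usage note are hypothetical. It batches several rows into one parameterized `INSERT` statement, which cuts the number of round trips to the database.

```python
from typing import Any, List, Sequence, Tuple


def build_multi_row_insert(
    table: str, columns: Sequence[str], rows: Sequence[Sequence[Any]]
) -> Tuple[str, List[Any]]:
    """Build one parameterized INSERT that covers every row in `rows`.

    `table` and `columns` are interpolated directly into the SQL text, so
    they must come from trusted code and never from user input.
    """
    if not rows:
        raise ValueError("rows must not be empty")
    width = len(columns)
    groups: List[str] = []
    params: List[Any] = []
    for row in rows:
        if len(row) != width:
            raise ValueError("row length does not match column count")
        start = len(params) + 1
        # PostgreSQL-style numbered placeholders: ($1, $2), ($3, $4), ...
        placeholders = ", ".join(f"${i}" for i in range(start, start + width))
        groups.append(f"({placeholders})")
        params.extend(row)
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) "
        f"VALUES {', '.join(groups)}"
    )
    return sql, params
```

For example, calling it with the hypothetical columns `block_number` and `transaction_hash` and two rows yields a single statement with placeholders `($1, $2), ($3, $4)` and a flat list of four parameters, ready to hand to a database driver that accepts numbered placeholders.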
1/1:
- Re-applied the blocklist to a high-value account after the leader changed.
- Top Explorer DB inserts
1/2:
- Regression found in the /node-sync API that prevented the load balancer from taking the OOS node offline
- A PR was deployed to test the fix:
- Explorer DB throughput improved. Charts are available in AWS us-east-1 on the Harmony Dev account here
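To illustrate why the /node-sync regression matters, here is a minimal sketch of a sync health check that a load balancer could poll. It assumes OOS in that item means the node has fallen behind the chain. The function name, the lag threshold, and the 200/503 status codes are hypothetical choices for illustration and are not taken from the actual /node-sync implementation.

```python
from typing import Tuple


def node_sync_response(
    local_height: int, network_height: int, max_lag_blocks: int = 100
) -> Tuple[int, str]:
    """Return (status_code, body) for a sync health check.

    A load balancer that treats non-2xx responses as unhealthy will stop
    routing traffic to a node once it lags by more than `max_lag_blocks`.
    The threshold and status codes are assumptions, not Harmony's values.
    """
    lag = max(network_height - local_height, 0)
    if lag <= max_lag_blocks:
        return 200, "true"
    return 503, "false"
```

If the endpoint keeps returning a success status for a lagging node, the load balancer has no signal to remove it, which matches the symptom described in the regression.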
1/3:
- PagerDuty
- OOS warning on root volume; cleaned up log files
- The RDS behind Explorer is in a better state. Below is a 1-week time window showing an unstable state
Takeaways:
- Whenever you receive a single alert/alarm, check similar instances for the same issue, such as OOS or OOM.
- Please do not forget to resolve the PagerDuty incidents
- Please remember to update the ticket with all the details and actions taken. Example: PagerDuty
- You may snooze the pager while waiting for resolution
- When RPC traffic went up, Pocket Networks reported a traffic spike on their end as well. Pocket Grafana charts and credentials (GitBook) are available in our runbooks
- We didn’t have full AWS LB log coverage, nor any way to query the logs. We now have a runbook to perform log dives on AWS
- The Aurora RDS supporting Explorer runs on PostgreSQL; common SQL queries are now available in our runbook
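The first takeaway, checking similar instances after a single alert, could be sketched like this for disk usage. It is a rough illustration only: the function names and the 80% threshold are hypothetical, and how the per-host numbers are collected (SSH, an agent, a metrics API) is left out.

```python
import shutil
from typing import Dict, List


def usage_percent(path: str = "/") -> float:
    """Return the used percentage of the filesystem that contains `path`."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total


def hosts_over_threshold(
    usage_by_host: Dict[str, float], threshold: float = 80.0
) -> List[str]:
    """Return the hosts whose disk usage is at or above `threshold` percent.

    `usage_by_host` maps a host name to its used percentage; the threshold
    is an assumed value, not one taken from the team's alert rules.
    """
    return sorted(
        host for host, pct in usage_by_host.items() if pct >= threshold
    )
```

Running `usage_percent` on each similar node and passing the results to `hosts_over_threshold` gives a quick list of instances likely to raise the same alert next.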