Weekly OnCall Summary Dec/28 - Jan/03

Weekly OnCall Rotation

  • OnCall US Pacific: Leo (leo@harmony.one)
  • OnCall Asia/EMEA: Haodi (haodi@harmony.one)
  • Duration: 2021/12/28 8:30am - 2022/01/04 8:30am PST

Summary

Details

12/28:

  • PagerDuty
    • CPU usage of btc testnode
  • PagerDuty
  • PagerDuty (resolved for Yuriy)
    • Node out of space (OOS). Increased disk to 1000G; usage is now 75%
  • Removed “latest/*.gz” to free disk space on the root volume
  • PagerDuty
    • Node is very slow to catch up with latest blocks; can only sync 30 blocks per minute
    • Catching up now
  • PagerDuty
    • Auto resolved after catching up
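
As a rough aid for the slow-sync incident above, the sketch below estimates how long a lagging node needs to reach the chain head. The function name and the example numbers are illustrative assumptions, not values from the incident; only the 30 blocks per minute sync rate comes from the report.

```python
def catch_up_minutes(blocks_behind: int, sync_rate: float, chain_rate: float) -> float:
    """Estimate minutes until a syncing node reaches the chain head.

    sync_rate and chain_rate are in blocks per minute. The node only gains
    on the head by the difference between the two rates.
    """
    net_gain = sync_rate - chain_rate
    if net_gain <= 0:
        # The head moves at least as fast as the node syncs.
        return float("inf")
    return blocks_behind / net_gain


# Hypothetical inputs: 6000 blocks behind, syncing 30 blocks/min,
# chain producing 20 blocks/min.
eta = catch_up_minutes(6000, sync_rate=30, chain_rate=20)
```

If the sync rate is not above the chain's block production rate, the estimate is infinite, which is a signal to look at the node's disk or peers rather than wait.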

12/29:

12/30:

  • Explorer postgres DB was down due to an error: [PostgresStorage:shard0] [2021-12-30T20:12:43.707Z] multixact "members" limit exceeded

  • Tried running the “VACUUM” command on the postgres DB manually.

  • Copied the “internal_transaction” table to a new table.

  • Temporarily disabled writes to the “internal_transaction” table to unblock explorer.harmony.one, after discussion with Jack/Daniel. Executed by Jenya.

  • Adjusted the settings to auto vacuum every 1 or 2 hours at table level. Vacuum slows down DB operations, so the explorer will lag behind for a few minutes every 1 or 2 hours.

  • Applied multi-row insertion in the indexer.

  • Applied a blocklist to leaders to stop a potential hack.
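
Two of the database mitigations above can be sketched as small statement builders. This is a minimal sketch, not the indexer's actual code: the function names, the column names, and the setting values are hypothetical. The per-table storage parameters shown are standard PostgreSQL autovacuum options, but the post does not say which settings were actually changed. Placeholders use PostgreSQL's native `$1` numbering.

```python
from typing import Mapping, Sequence


def autovacuum_table_sql(table: str, settings: Mapping[str, object]) -> str:
    """Build an ALTER TABLE statement that sets per-table storage parameters."""
    assignments = ", ".join(f"{name} = {value}" for name, value in settings.items())
    return f"ALTER TABLE {table} SET ({assignments})"


def multi_row_insert(table: str, columns: Sequence[str],
                     rows: Sequence[Sequence[object]]):
    """Build one INSERT covering many rows instead of one INSERT per row."""
    if not rows:
        raise ValueError("rows must not be empty")
    width = len(columns)
    params = []
    groups = []
    for row in rows:
        if len(row) != width:
            raise ValueError("row length does not match columns")
        start = len(params)
        # Number placeholders continuously across rows: ($1, $2), ($3, $4), ...
        groups.append("(" + ", ".join(f"${start + i + 1}" for i in range(width)) + ")")
        params.extend(row)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES {', '.join(groups)}"
    return sql, params


# Hypothetical values, for illustration only.
vacuum_sql = autovacuum_table_sql(
    "internal_transaction",
    {"autovacuum_vacuum_scale_factor": 0.01, "autovacuum_vacuum_threshold": 10000},
)
insert_sql, insert_params = multi_row_insert(
    "internal_transaction",
    ["block_number", "tx_hash"],
    [[1, "0xa"], [2, "0xb"]],
)
```

Batching rows into one statement cuts the number of round trips and transactions, which is the usual reason to prefer it for a high-volume table.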

1/1:

  • Re-applied the blocklist to a high-value account after the leader changed.
  • Top Explorer DB inserts

1/2:

1/3:

  • PagerDuty
    • OOS warning on root volume, cleaned up log files
  • The RDS behind Explorer is now in a better state. Below is a 1-week time window showing the unstable state.

Takeaways:

  1. Whenever you receive a single alert/alarm, such as OOS or OOM, check similar instances for the same problem.
  2. Please do not forget to resolve the PagerDuty incidents.
  3. Please remember to update the ticket with all the details and actions taken, e.g., PagerDuty.
  4. You may snooze the pager while waiting for a resolution.
  5. When RPC traffic went up, Pocket Network reported a traffic spike on their end as well. Pocket Grafana charts and credentials are available in our runbooks (GitBook).
  6. We didn’t have full AWS LB log coverage, nor any way to query the logs. We now have a runbook to perform log dives on AWS.
  7. The Aurora RDS supporting Explorer is on PostgreSQL; common SQL queries are now available in our runbook.
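
For takeaway 1, a check like the one below could be run on each similar instance after an OOS alert. It is a minimal sketch using only the Python standard library; the function names and the 80% threshold are assumptions, not values from our alerting setup.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the used share of the filesystem containing path, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def is_low_on_space(path: str = "/", threshold_percent: float = 80.0) -> bool:
    """True when usage is at or above the (hypothetical) alert threshold."""
    return disk_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    print(f"root volume usage: {disk_usage_percent('/'):.1f}%")
```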