
11am Monday (Oct 14th) state of affairs:
Things started to deteriorate the night of Oct 9th (PST):
- API latency slowly builds, leading to site outages every few hours
- Redeploying the API fixes things for a time
- Running the server on localhost works fine during an outage
- Simple queries start to take forever, filling up the connection pool (see the diagnostic sketch below)
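For anyone who wants to dig in, here's roughly the kind of pg_stat_activity check that shows what's eating the pool during a spike. It's a sketch, not code from our repo; the pg-promise connection settings and the 5-second threshold are placeholders (and you need a role that can see other backends' queries).

```ts
import pgPromise from 'pg-promise'

const pgp = pgPromise()
const db = pgp({
  host: process.env.DB_HOST,
  port: 5432,
  database: 'postgres',
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
})

async function inspectActivity() {
  // Queries that have been running for more than 5 seconds, oldest first.
  const slow = await db.manyOrNone(
    `select pid, state, now() - query_start as runtime, left(query, 80) as query
     from pg_stat_activity
     where state <> 'idle' and query_start < now() - interval '5 seconds'
     order by query_start`
  )
  console.table(slow)

  // Connection counts per state (active, idle, idle in transaction, ...).
  const byState = await db.manyOrNone(
    `select state, count(*)::int as connections
     from pg_stat_activity
     group by state
     order by connections desc`
  )
  console.table(byState)
}

inspectActivity().finally(() => pgp.end())
```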
So far we've:
- Set up a read replica that the frontend uses when querying the DB directly with the supabase-js library (sketch after this list)
- Updated our backend pg-promise version
- Checked open connections on the API (1.3k), total ingress/egress bytes, total API query counts, and memory usage (10%); all are normal
- Reverted suspicious-looking commits from the past few days
- Increased, then decreased the pg pool size (config sketch below)
- Moved some requests from the API to the supabase-js client, which talks directly to the DB's load balancer
- Discovered that our server's CPU usage climbs until it hits 100% of the single core running Node, and that this coincides with our latency spikes (event-loop check below)
- Opened a PR to integrate Datadog into our server to get some visibility into what's causing the CPU spikes (init sketch below)
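Roughly what the read-replica split looks like from the frontend: direct reads go to a second supabase-js client pointed at the replica, so they stop competing with the API for primary connections. This is a sketch, not the code in the repo; the env var names, table, and column are illustrative.

```ts
import { createClient } from '@supabase/supabase-js'

// Primary: anything that writes still goes here.
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!
)

// Read replica: the frontend's direct queries go here instead.
const supabaseRead = createClient(
  process.env.SUPABASE_READ_REPLICA_URL!,
  process.env.SUPABASE_ANON_KEY!
)

// Example of a read that can safely hit the replica.
export async function getRecentBets(limit = 100) {
  const { data, error } = await supabaseRead
    .from('contract_bets')
    .select('*')
    .order('created_time', { ascending: false })
    .limit(limit)
  if (error) throw error
  return data
}
```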
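And roughly where the pool size gets tuned: pg-promise hands the connection object to node-postgres's pool, so `max` is the knob we were turning up and down. The numbers here are illustrative, not our production values.

```ts
import pgPromise from 'pg-promise'

const pgp = pgPromise()

export const db = pgp({
  host: process.env.DB_HOST,
  port: 5432,
  database: 'postgres',
  user: process.env.DB_USER,
  password: process.env.DB_PASSWORD,
  max: 20,                   // pool size: the value we increased, then decreased
  idleTimeoutMillis: 10_000, // drop idle connections after 10s
})
```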
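On the CPU finding: Node runs all our request handling on one thread, so a CPU-bound stretch of JS blocks the event loop and every request queues behind it, which would explain latency tracking CPU. Here's a generic check (not from our repo) that logs event-loop delay next to CPU time so the blocking is visible:

```ts
import { monitorEventLoopDelay } from 'perf_hooks'

const histogram = monitorEventLoopDelay({ resolution: 20 })
histogram.enable()

setInterval(() => {
  const cpu = process.cpuUsage() // microseconds since process start
  console.log({
    eventLoopDelayP99Ms: Math.round(histogram.percentile(99) / 1e6), // ns -> ms
    eventLoopDelayMaxMs: Math.round(histogram.max / 1e6),
    cpuUserSec: Math.round(cpu.user / 1e6),
    cpuSystemSec: Math.round(cpu.system / 1e6),
  })
  histogram.reset()
}, 10_000)
```

If the p99 delay climbs along with CPU during an outage, the time is being spent in our JS rather than waiting on the database.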
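The Datadog PR boils down to something like this: initialize dd-trace before anything else is imported so it can patch http and pg. The service/env names are placeholders and the actual PR may look different.

```ts
import tracer from 'dd-trace'

tracer.init({
  service: 'manifold-api',                   // placeholder service name
  env: process.env.NODE_ENV ?? 'production',
  profiling: true,       // continuous profiler, to see what's burning the CPU
  runtimeMetrics: true,  // event-loop delay, GC, and heap stats
})

export default tracer
```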
Typical API stats during an outage: (screenshot)

DB stats during an example outage: (screenshot)
I'm happy to provide more info, stats, etc.
repo: https://github.com/manifoldmarkets/manifold
previous market: https://manifold.markets/ian/will-we-fix-our-database-problems-b?play=true
This resolves NO if we haven't fixed the underlying problem; it still resolves NO even if, say, we have a cron job restarting the server every hour.
🏅 Top traders
| # | Name | Total profit |
|---|---|---|
| 1 | | Ṁ5,754 |
| 2 | | Ṁ4,343 |
| 3 | | Ṁ3,295 |
| 4 | | Ṁ1,748 |
| 5 | | Ṁ1,340 |