
Another outage, another market! This time, it's multiple choice, since I feel it's more likely than not that there will be another outage this year.
Forecast when our next Supabase outage will occur, where an outage means Manifold losing read access for > 5 minutes or write access for > 12 hours.
Info on our latest outage
We've been having issues recently with our database CPU usage being too high. This is probably our fault. We narrowed the cause down to a combination of external API usage and inefficient queries against our biggest tables (user_contract_metrics, user_feed). But that didn't cause the full downtime.
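For reference, here's a rough sketch of the kind of check one might run to spot CPU-heavy queries and big tables. It's illustrative only (not our actual tooling): it assumes the pg_stat_statements extension is enabled, uses Postgres 13+ column names, and the connection string is a placeholder.

```python
# Illustrative sketch only (not Manifold's actual tooling).
# Assumes the pg_stat_statements extension is enabled and psycopg2 is installed;
# the connection string below is a placeholder.
import psycopg2

# Top queries by total execution time (column names are for Postgres 13+;
# older versions use total_time / mean_time instead).
TOP_QUERIES = """
    SELECT query, calls, total_exec_time, mean_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;
"""

# On-disk size of the two biggest tables mentioned above.
TABLE_SIZES = """
    SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
    FROM pg_class
    WHERE relname IN ('user_contract_metrics', 'user_feed');
"""

with psycopg2.connect("postgresql://localhost/manifold") as conn:  # placeholder DSN
    with conn.cursor() as cur:
        cur.execute(TOP_QUERIES)
        for row in cur.fetchall():
            print(row)
        cur.execute(TABLE_SIZES)
        for row in cur.fetchall():
            print(row)
```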
On Monday morning (July 17th, 10:50 am), we restarted the database (in an attempt to fix the above issues), and it got stuck in a boot loop.
It was down for two hours. A Supabase engineer got us out of the boot loop.
This was mostly a bug in the way Supabase was hosting the database (they only gave it 90 seconds to start up, but it needed longer to replay transactions). Doc on the outage: https://manifoldmarkets.notion.site/DB-Outage-849d055d43f64807b98ca0930f890346?pvs=4
A Supabase engineer wrote a report on the issue:
We took a look at the timeline of events, as well as what caused the issue, and the facts are as follows:
2023-07-17 17:49:17 UTC
a shutdown request was sent to your project's Postgres service
internally, we use systemd to manage services
the default value for a systemd service's TimeoutStopSec is 90 seconds
this is a variable specifying how long systemd waits for a process to exit before outright killing the process after receiving a shutdown signal
90 seconds is usually sufficient time for Postgres to exit, but given your project's database size and the number of IO operations it was running when it received the shutdown signal, this proved insufficient
2023-07-17 17:50:47 UTC
systemd terminates Postgres' process, forcing an ungraceful shutdown of your project's database
2023-07-17 17:51:43 UTC
your project's Postgres service attempts booting up and replaying WAL changes, exactly like you've assumed in the outage doc you've shared
unfortunately, since the default for TimeoutStartSec is also 90 seconds, systemd signalled Postgres to terminate its process 90 seconds after startup, exactly as it did when this issue started
this effectively put your database service in shutdown mode for another 90 seconds, before systemd killed the Postgres process and the chain of events started repeating
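To make the failure mode concrete, here's a minimal sketch of how one might inspect those systemd timeouts and what a drop-in override raising them could look like. The unit name postgresql.service is an assumption (Supabase's internal unit name may differ), and this is not what Supabase actually did to fix the issue.

```python
# Minimal sketch, not Supabase's actual fix. Inspects the systemd start/stop
# timeouts for a Postgres unit and shows a drop-in override that raises them,
# so a long WAL replay or shutdown isn't killed at the 90-second default.
import subprocess

UNIT = "postgresql.service"  # assumption: Supabase's internal unit name may differ

def current_timeouts(unit: str) -> str:
    """Return the start/stop timeouts systemd currently enforces for the unit."""
    result = subprocess.run(
        ["systemctl", "show", unit, "-p", "TimeoutStartUSec", "-p", "TimeoutStopUSec"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Drop-in override giving Postgres far longer than 90 seconds to start (replay WAL)
# and to stop (finish in-flight IO). It would go in
# /etc/systemd/system/postgresql.service.d/override.conf, followed by
# `systemctl daemon-reload`.
OVERRIDE_CONF = """\
[Service]
TimeoutStartSec=30min
TimeoutStopSec=30min
"""

if __name__ == "__main__":
    print(current_timeouts(UNIT))
    print("Proposed override:")
    print(OVERRIDE_CONF)
```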
In total, we've had three outages this year: