When will we have another Supabase outage?
Jan 1
July / Aug: 0.2%
Sept / Oct: 79%
Nov / Dec: 18%
No more in 2023: 3%

Another outage, another market! This time, it's multiple choice, since I feel it's more likely than not that there will be another outage this year.

Forecast when our next Supabase outage will occur. To count, the outage must cause Manifold to lose read access for > 5 minutes, or write access for > 12 hours.
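One way to read the criteria above is as a simple either/or check on downtime durations. The sketch below is only an illustration of that reading, not an official resolution rule: the function name and constants are hypothetical, the comparison is assumed to be strict (">"), and it does not encode whether Supabase was actually the cause.

```python
from datetime import timedelta

# Thresholds taken from the market description.
READ_THRESHOLD = timedelta(minutes=5)
WRITE_THRESHOLD = timedelta(hours=12)


def counts_as_outage(read_down: timedelta, write_down: timedelta) -> bool:
    """Return True if either downtime strictly exceeds its threshold.

    Hypothetical helper: assumes the downtime is already attributed
    to Supabase, which the market creator would judge separately.
    """
    return read_down > READ_THRESHOLD or write_down > WRITE_THRESHOLD
```

Under this reading, a two-hour full outage counts (reads were down for more than 5 minutes), while a four-hour write-only failure does not.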

Info on our latest outage

We've been having issues recently with our database CPU usage being too high. This is probably our fault. We narrowed down the cause to a combination of external API usage, and poor usage of our biggest tables (user_contract_metrics, user_feed). But that didn't cause the full downtime.

On Monday morning (July 17th, 10:50 am), we restarted the database (in an attempt to fix the above issues), and it got stuck in a boot loop.

It was down for two hours. A Supabase engineer got us out of the boot loop.

This was mostly a bug in the way Supabase was hosting it (they only gave it 90 seconds to start up, but we needed longer to replay transactions). Doc on the outage: https://manifoldmarkets.notion.site/DB-Outage-849d055d43f64807b98ca0930f890346?pvs=4

A Supabase engineer wrote a report on the issue:

We took a look at the timeline of events, as well as what caused the issue, and the facts are as follows:

  • 2023-07-17 17:49:17 UTC

    • a shutdown request was sent to your project's Postgres service

    • internally, we use systemd to manage services

    • the default value for a systemd service's TimeoutStopSec is 90 seconds

      • this is a variable specifying how long systemd waits for a process to exit before outright killing the process after receiving a shutdown signal

      • 90 seconds is usually sufficient time for Postgres to exit, but given your project's database size, and number of IO operations it was running when it received the shutdown signal, this proved insufficient

  • 2023-07-17 17:50:47 UTC

    • systemd terminates Postgres' process, forcing an ungraceful shutdown of your project's database

  • 2023-07-17 17:51:43 UTC

    • your project's Postgres service attempts booting up and replaying WAL changes, exactly like you've assumed in the outage doc you've shared

    • unfortunately, since the default for TimeoutStartSec is also 90 seconds, after 90 seconds from startup, systemd signalled Postgres to terminate its process, exactly as it did when this issue started

    • this effectively put your database service in shutdown mode for another 90 seconds, before systemd killed the Postgres process and the chain of events started repeating
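The loop described in the report can be sketched as a small timeline simulation. This is a simplified model under stated assumptions, not Supabase's actual tooling: the function name is hypothetical, each start attempt is assumed to begin WAL replay from scratch, the restart is assumed to happen immediately after the kill (the real timeline shows a gap of about a minute), and the 10-minute replay time in the usage note is an invented figure.

```python
from datetime import datetime, timedelta

DEFAULT_TIMEOUT = timedelta(seconds=90)  # systemd default cited in the report


def boot_loop_timeline(first_start, replay_needed,
                       start_timeout=DEFAULT_TIMEOUT,
                       stop_timeout=DEFAULT_TIMEOUT,
                       max_attempts=3):
    """Return (time, event) pairs for repeated start attempts.

    Assumption: replay progress is lost on every kill, so an attempt
    only succeeds if replay_needed fits within start_timeout.
    """
    events = []
    t = first_start
    for attempt in range(1, max_attempts + 1):
        events.append((t, f"attempt {attempt}: start, begin WAL replay"))
        if replay_needed <= start_timeout:
            events.append((t + replay_needed,
                           f"attempt {attempt}: replay done, service up"))
            return events
        # Start timeout expires: systemd signals Postgres to terminate.
        t += start_timeout
        events.append((t, f"attempt {attempt}: start timeout, terminate signal"))
        # Shutdown also exceeds its timeout: systemd kills the process.
        t += stop_timeout
        events.append((t, f"attempt {attempt}: stop timeout, process killed"))
    return events
```

For example, `boot_loop_timeline(datetime(2023, 7, 17, 17, 51, 43), timedelta(minutes=10))` never reaches "service up" in this model, because no single attempt is allowed more than 90 seconds. Raising `start_timeout` above the replay time is what breaks the loop here.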

In total, we've had three outages this year:


@JamesGrugett How does this resolve?

bought Ṁ10 of Nov / Dec YES

bought Ṁ100 of Sept / Oct YES

to lose read access for > 5 minutes

I lost read access for over five minutes, seemingly due to Cloudflare

bought Ṁ50 of Sept / Oct YES

Supabase didn't work correctly today! Voting in polls, sending managrams, reviewing markets and clearing notifications didn't work.

{"code":"XX000","details":null,"hint":null,"message":"must be superuser to use repack_trigger function"}

(these errors were at least 4 hours long but unlikely to be 12 hours, though)

Yeah 12 hours is a weirdly high bar.

@AnT notifications aren't clearing again + home markets won't load (on mobile)!