What caused the Optus telecommunications outage in Australia? [unlinked free-response]
resolved Nov 28
• Resolved YES: A number got too big
• Resolved YES: Accidental (software-only)
• Resolved YES: BGP Problem
• Resolved YES: Configuration change
• Resolved NO: A software bug
• Resolved NO: Insider sabotage (e.g. a disgruntled employee)
• Resolved NO: Intentionally shut down for any reason
• Resolved NO: Intentionally shut down additional services to help recover from an unintentional initial cause
• Resolved NO: Anything weather-related (including space weather)
• Resolved NO: Expiration of something based on a date/time
• Resolved NO: Intentionally shut down to avoid exposure to a discovered security vulnerability
• Resolved NO: Accidental (physical world involved, not only software)
• Resolved NO: Someone dialed the wrong number
• Resolved NO: Outside cyber-attack
• Resolved NO: Intentionally shut down on government request for national security reasons
• Resolved NO: Damaged fiber optic cable
• Resolved NO: Power outage
• Resolved NO: Intentionally shut down in response to an actual security breach
• Resolved NO: Intentional manifestation of SCP-4882
• Resolved NO: Damage/malfunction due to the recent geomagnetic storm

Australia's second-largest mobile telecommunications provider, Optus, recently experienced a nationwide service outage, with mobile phone and internet access unavailable for all customers, causing major disruption to business, finance, public transportation, and other important services.

https://www.abc.net.au/news/2023-11-08/optus-outage-live-blog/103076996

What caused it?

I'll resolve answers as we learn enough to rule them in or out, according to information published by reputable media sources, or otherwise to my judgement. Answers that can't reasonably be ruled in or out by end of year will resolve N/A.

Note that if a primary cause is identified, all other causes not known at that time will resolve NO - they will not remain open on the off chance that there were multiple causes not yet identified when the primary cause was found.

Feel free to add answers that are more specific than existing answers. For example, I've already added "outside cyber-attack", but you might add "Cyber-attack by <entity>", and both can resolve YES if they're correct.



Answer-specific clarifications/rulings:

Intentionally shut down to avoid exposure to a discovered security vulnerability

  • This will resolve YES only if the vulnerability was not exploited.



Here's the full Senate committee submission (link downloads a PDF):

https://www.aph.gov.au/DocumentStore.ashx?id=2ed95079-023d-49d5-87fd-d9029740629b

I've resolved all remaining answers. One resolution I think might be controversial: "A software bug", which traders had bet up to 99.7%.

A software bug is quite unambiguously distinct from incorrect configuration or usage of software that may cause unexpected problems. A software bug is incorrect behaviour of the software itself. This doesn't seem to be the case for the Optus outage.

The Senate submission says on page 5:

19. It is now understood that the outage occurred due to approximately 90 PE routers automatically self-isolating in order to protect themselves from an overload of IP routing information. These self-protection limits are default settings provided by the relevant global equipment vendor (Cisco).

20. This unexpected overload of IP routing information occurred after a software upgrade at one of the Singtel internet exchanges (known as STiX) in North America, one of Optus’ international networks. During the upgrade, the Optus network received changes in routing information from an alternate Singtel peering router. These routing changes were propagated through multiple layers of our IP Core network. As a result, at around 4:05am (AEDT), the pre-set safety limits on a significant number of Optus network routers were exceeded. Although the software upgrade resulted in the change in routing information, it was not the cause of the incident.

There isn't an indication in this that the software itself was behaving incorrectly by exchanging new routing prefixes following an upgrade, or that the software on the routers was not functioning as designed when they shut themselves down.

Routing tables are data, not software, and the threshold at which the routers were to shut themselves off is configuration, not code. An incompatibility between different pieces of data or configuration, leading either to excessive transmission of routing prefixes or to excessively conservative shutdown of routers (it's unclear which was the actual problem), doesn't suggest that the software was doing anything other than what it was designed to do.

The manufacturer may even have shipped poorly considered default configuration settings that were not appropriate for most customers, or not appropriate in combination with other defaults. Nonetheless it's configuration, which means we're talking about misconfigured software, not buggy software.
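To make the distinction concrete, here's a minimal Python sketch of the self-isolation behaviour being described. Everything in it is hypothetical (made-up names, a made-up limit value); it illustrates the mechanism, not Cisco's actual implementation:

```python
# Minimal, hypothetical sketch of a router that self-isolates when a
# configured prefix limit is exceeded. The limit is configuration and
# the routing table is data; neither is code.

DEFAULT_MAX_PREFIXES = 500_000  # made-up vendor default


class Router:
    def __init__(self, name, max_prefixes=DEFAULT_MAX_PREFIXES):
        self.name = name
        self.max_prefixes = max_prefixes  # configuration, not code
        self.routing_table = set()        # data, not code
        self.isolated = False

    def receive_update(self, prefixes):
        """Ingest a routing update; self-isolate past the configured limit."""
        if self.isolated:
            return
        self.routing_table.update(prefixes)
        # Designed self-protection: drop off the network rather than keep
        # processing an overload of routing information.
        if len(self.routing_table) > self.max_prefixes:
            self.isolated = True
            print(f"{self.name}: prefix limit exceeded, self-isolating")


r = Router("pe1", max_prefixes=3)
r.receive_update({"10.0.0.0/8", "10.1.0.0/16", "10.2.0.0/16", "10.3.0.0/16"})
# pe1: prefix limit exceeded, self-isolating
```

The flood of prefixes is data and the limit is configuration; receive_update itself does exactly what it was designed to do, which is why I don't think "software bug" fits.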

Don't think I've resolved something at 99.7% to NO before! Condolences to those who bet YES, but I think this is the correct resolution.

Will consider remaining options later - ping me if I forget!

Intentionally shut down additional services to help recover from an unintentional initial cause

Happy to hear it if folks disagree with this one; let me know. But although there is obviously some kind of protective reason for the automated shutdown, it's seemingly being framed as the problem rather than as a solution that might have helped recover from anything. And in any case I don't really want to call it "intentional" when it was automated.

The report suggests that approximately 90 edge provider routers disconnected as an automated protective measure against routing update overload

[...] triggered Optus's routers to rapidly update its own routing tables which triggered the shutdown due to the pre-configured default threshold limits set by Cisco Systems being exceeded

@chrisjbillington I agree with that assessment, this answer (in my head at least) required human intervention. I did think about the automated shutdown case at the time I wrote it.

Intentionally shut down for any reason

Although the shutdown was programmed to happen given the circumstances, I wouldn't call it intentional; certainly the circumstances that led to it were not intentional.

The report suggests that approximately 90 edge provider routers disconnected as an automated protective measure against routing update overload

A number got too big

Resolves YES on the basis that the shutdown was in fact directly caused by threshold limits in routing tables being exceeded (though this was more of a proximate than ultimate cause):

This, in turn, triggered Optus's routers to rapidly update its own routing tables which triggered the shutdown due to the pre-configured default threshold limits set by Cisco Systems being exceeded

Reckon this is an easy no:

"The issues preventing Optus customers from making calls and accessing mobile data began about 4am."

https://www.9news.com.au/national/optus-outage-australia-wide-millions-affected-updates/70501aaf-9bb7-45ed-9569-cf682e5e8bc5

Resolving whatever can be resolved based on what Wikipedia currently says:

A submission tabled to the Australian Senate Standing Committee on Environment and Communications committee describes the outage as a gradual event triggered by loss of connectivity between neighbouring computer networks.[1] The report suggests that approximately 90 edge provider routers disconnected as an automated protective measure against routing update overload. The failures occurred following a software upgrade at a North American Singtel exchange that caused one of the routers to disconnect. This, in turn, triggered Optus's routers to rapidly update its own routing tables which triggered the shutdown due to the pre-configured default threshold limits set by Cisco Systems being exceeded. The tabled report and Singtel stressed that the software upgrade was not the cause of the fault.[1][10]

@chrisjbillington do you reckon you'll be confident enough to resolve any of the low probability outcomes to no anytime soon?

@Daniel_MC Yes, should do. I'll do a quick search to see what the latest is, but it sounds like anything not plausibly related to or a possible cause of a BGP prefix update gone wrong should resolve NO.

@chrisjbillington sick let me know if I can help

ABC News

In a statement released on Monday afternoon, Optus says its network was affected by "changes to routing information from an international peering network" around 4:05am AEDT last Wednesday, "following a routine software upgrade".

@Gen Thanks, I think "changes to routing information" is enough to resolve "configuration change" YES, not sure anything else resolves yet.

@chrisjbillington 🥳 first thing I added

I forgot to bet on it, 0 profit.

🤣

Wikipedia currently says "The specific cause of the outage is unknown to the public", but mentions that "Post-outage forensics suggest that a Border Gateway Protocol (BGP) routing problem played a role in the outage."

It seems pretty clear from skimming the cited articles that there's enough evidence that a BGP-related problem was involved even if it wasn't the ultimate cause, so I'll resolve that YES.

I'll check back in periodically to see if the article is updated. There's a Senate inquiry soon that may give us some more info: https://www.theguardian.com/business/2023/nov/11/optus-chief-executive-set-to-face-senate-inquiry-over-nationwide-outage and it looks like the government will be doing some investigation, so we should get some answers.

@chrisjbillington Seems that the senate inquiry will give us the most information. Optus definitely doesn't want eyes on them, so I doubt they release anything of their own accord.

Accidental (physical world involved, not only software)

Don't sleep on this one IMO. There is still a chance that "we plugged the cable in the wrong hole" could be involved!

https://www.abc.net.au/news/2023-11-08/optus-deep-network-outage-what-is-the-cause/103080080

This article includes a lot of speculation -- almost all of its theories are already represented in the answers we already have.

BGP Problem

Okay, some reports do claim SOMETHING now.

This is probably a variant of "configuration change" -- at least, it was in my head when I wrote that answer.

https://www.smh.com.au/technology/what-caused-the-optus-outage-20231108-p5eiep.html

Matt Tett, the managing director of Enex TestLab, said the issue appeared to be caused by a so-called “BGP [Border Gateway Protocol] prefix flood”.

Essentially, it means that one of Optus’ routers was likely fed incorrect routing information in an update, leading to total network gridlock. This could have been caused by either Optus or an external party. Optus has been contacted to confirm whether this is indeed the case.

Network operators suggested this possible scenario after Optus sent a message to them stating that the suspected root cause of the issue lay with “route reflectors, which are currently handling an excessive number of routes, leading to session shutdown and a complete traffic halt”.
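If the jargon is opaque: here's a rough Python sketch of that failure mode, with an invented topology and invented limits (the real numbers aren't public), just to show how one oversized update can shut down every session at once:

```python
# Rough, invented illustration of a "BGP prefix flood": route
# reflectors that receive more routes than a configured limit shut
# down their sessions, halting traffic through all of them at once.

SESSION_ROUTE_LIMIT = 1_000_000  # hypothetical limit


def apply_flood(reflectors, flood_size):
    """Propagate a flood of routes and report which reflectors halt."""
    halted = []
    for rr in reflectors:
        rr["routes"] += flood_size  # the flood reaches every reflector
        if rr["routes"] > SESSION_ROUTE_LIMIT:
            rr["up"] = False  # session shutdown -> complete traffic halt
            halted.append(rr["name"])
    return halted


reflectors = [{"name": f"rr{i}", "routes": 900_000, "up": True}
              for i in range(4)]
print(apply_flood(reflectors, flood_size=300_000))
# ['rr0', 'rr1', 'rr2', 'rr3'] -- one bad update trips the whole core
```

The point is that a single flood can trip every reflector at the same time, which matches the "complete traffic halt" wording above.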

@Eliza Note that even if this is the case, which is no guarantee, that does NOT rule out "it broke physical hardware in a way that meant we had to replace it"

I added two versions of Accidental to go along with Intentional.

Physical world is anything from the cleaner unplugging the backup drive to a bomb blowing up the network building. Someone at a keyboard typing the wrong command would not count for the Accidental (physical world) answer, that would go under the (software only) one.


For example "Software bug" already existed, and could go with either Intentional for any reason, or Accidental.

Limit order at 35% for intentional shutdown due to security breach if anyone wants to take the other side

@NGK Arb opportunity with "Intentionally shut down for any reason"!

@NGK Why am I betting on something that I know nothing about?