May 15, 2024

Incident Review: UniSuper GCP Outage

Australian superannuation fund UniSuper had a very bad couple of weeks recently, when their entire platform experienced a major outage. With $125B under management and over 600k members, the fund was disrupted for about two weeks. From a joint press release between UniSuper and Google Cloud:

Google Cloud CEO, Thomas Kurian has confirmed that the disruption arose from an unprecedented sequence of events whereby an inadvertent misconfiguration during provisioning of UniSuper’s Private Cloud services ultimately resulted in the deletion of UniSuper’s Private Cloud subscription.

This outage was speculated on in detail on Twitter by Gergely Orosz, and in a blog post by Daniel Compton. It is likely that this incident cost UniSuper millions of dollars directly, and even more indirectly through reputational damage and lost business.

In his analysis, Daniel points out that there is no such object as a Private Cloud subscription. As detailed in his blog post, it is likely that UniSuper actually uses Google Cloud VMware Engine (GCVE), which does have a resource called a private cloud.

Based on the language in their press release, a plausible scenario is that UniSuper ran a Terraform command that inadvertently deleted their GCVE private cloud. Consider the following scenario proposed by Daniel:

UniSuper ran a terraform apply with a bad configuration or perhaps a terraform destroy with the prod tfvar file. The Terraform plan showed “delete private cloud,” and the operator approved it.

One more frightening detail: if this command were executed, the deletion would also take effect immediately. The GCP Terraform provider executes Private Cloud deletion with a hardcoded parameter of delay_hours=0.
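
A destructive change like this is visible before it is applied: `terraform show -json` renders a saved plan as JSON, where each entry under `resource_changes` lists its `change.actions`. The sketch below is a minimal, tool-agnostic illustration of catching a delete on a protected resource type. It is not Resourcely's implementation, and the protected-type list, the function name `find_protected_deletions`, and the sample resource addresses are assumptions made for the example.

```python
import json

# Resource types treated as critical. This list is an assumption for illustration.
PROTECTED_TYPES = frozenset({"google_vmwareengine_private_cloud"})


def find_protected_deletions(plan, protected_types=PROTECTED_TYPES):
    """Return the addresses of protected resources that the plan would delete."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # "delete" also appears when a resource is replaced
        # (actions contain both "delete" and "create").
        if rc.get("type") in protected_types and "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hand-written fragment in the shape of `terraform show -json` output.
# The resource addresses are hypothetical.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "google_vmwareengine_private_cloud.prod",
     "type": "google_vmwareengine_private_cloud",
     "change": {"actions": ["delete"]}},
    {"address": "google_compute_network.vpc",
     "type": "google_compute_network",
     "change": {"actions": ["update"]}}
  ]
}
""")

if __name__ == "__main__":
    for address in find_protected_deletions(SAMPLE_PLAN):
        print(f"BLOCKED: plan would delete {address}")
```

In a CI pipeline, a check along these lines could run against the output of `terraform plan -out=tfplan` followed by `terraform show -json tfplan`, failing the build whenever the returned list is non-empty.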

Preventing accidental deletion with Resourcely

Incidents, outages, and breaches happen from misconfiguration every day, causing millions of dollars in impact and losses. Resourcely is a misconfiguration engine that can help prevent catastrophic incidents caused by inadvertent settings or incorrect parameters. This is done with two key concepts: blueprints (paved roads to deployment) and guardrails (rule and policy enforcement).

If this outage was indeed caused by an inadvertent terraform apply, a Resourcely guardrail could have blocked the change before it was applied and prevented this entire incident.

Implementing this guardrail and preventing inadvertent deletion of critical cloud resources is easy with Resourcely: simply create the guardrail from our provided template, and Resourcely will block any PR that violates this rule.

Give it a try

Cloud services are inherently complex, as is Terraform. Even if UniSuper's outage was not caused by a misconfigured Terraform plan, guardrails like this can help other organizations that are struggling with limited resources, heterogeneous cloud environments, and a lack of infrastructure-as-code expertise.

To prevent misconfiguration of cloud infrastructure, give Resourcely a try 👉