#CensusFail 2016: A Comedy of Errors


On Tuesday, August 9th, 2016, the Australian Bureau of Statistics (ABS) oversaw the 2016 national Census. For the first time ever, this massive survey of the nation's ten million households was to be completed online. Rather than receiving a paper form, Australians received a letter with a 12-digit code, and instructions on how to log in online and fill out their Census.

To put it bluntly, it was a disaster.

The website crashed at the worst possible time, and even though no data was compromised, the ABS wasn't able to effectively reassure the public that their sensitive information was safe. News articles spoke of 'malicious overseas hackers' and 'World War 3 online', #CensusFail did the rounds on social media, and the reputation of the ABS was dealt a catastrophic blow.

The overwhelming public sentiment was perhaps best summed up by opposition leader Bill Shorten, who remarked:

"They had one job and they couldn't even do that properly."

If you're Australian, you're probably already familiar with this story. But with the subsequent Senate inquiry having recently delivered its final report [APH, 24 Nov 2016], now is the perfect time to take an in-depth look at what exactly went so wrong on Census night...


Background


Since 1911, the Census has been conducted every five years, but 2016 was the first time it had ever been run primarily online: 24 million people across 10 million households were expected to participate, and of these, at least 65 percent were expected to do so through the ABS website [ABS, 9 Aug 2016].

"The 2016 Census will be the first Census in Australia's history where we expect more people to complete online than on paper. Online login codes will be posted to the majority of households, replacing hand delivery and collection of forms by Census field staff.

"These new delivery and collection procedures aim to increase online participation, make the Census easier for households, and reduce costs for the taxpayer."

To achieve this, the ABS partnered with IBM, who were contracted to "develop, implement and host the eCensus platform and application". Acting as the main internet service providers were Telstra and NextGen Networks, who in turn relied on upstream services provided by Vocus Communications and NTT Communications.

This was an ambitious undertaking, even with the support of IBM, and many expressed doubts as to whether the ABS could keep this sort of online Census system running smoothly. The ABS brushed off these concerns, insisting that IBM's servers were more than capable of handling the expected traffic [ABC News, 9 Aug 2016].

"There is plenty of reserve capacity to cope if more than 80 percent of Australians choose to complete the census online. The online Census form can handle 1,000,000 form submissions every hour. That's twice the capacity we expect to need."

During this time, the ABS was already under fire for their controversial proposal to store the names and addresses of respondents for up to four years, when previously the data was destroyed after a maximum of 18 months. A group of Senate crossbenchers made headlines by vowing not to give their name when completing the Census, and encouraging the public to do the same. In response, the government threatened those who refused to fully complete the Census with fines of $180 per day [ABC News, 9 Aug 2016].

On top of this, even though the Census site was to be open for submissions all the way from July 26th through to September 25th [ABS, 21 Sep 2016], large-scale advertising campaigns run by the ABS urged Australians to "Get online on August 9". As a result, many incorrectly believed that they would be fined for failing to submit their online Census on this particular date.

Most significantly, the appropriate security measures were not put in place to mitigate against distributed denial-of-service (DDoS) attacks. In place of traditional mitigation techniques, the decision was made to simply geoblock all international traffic, a plan IBM referred to as 'Island Australia'.

This combination of poor communication and inadequate planning meant that the ABS's online Census scheme was effectively doomed to failure from the very beginning. As a result, few were surprised to see the site plagued by technical issues on the night. What was surprising, however, was just how much of an impact these issues ended up having.


Census Day


There are multiple differing accounts of what exactly went wrong on August 9th.

One of these accounts is that of Michael McCormack, the minister responsible for the Census, who presented his timeline of events at a press conference a day later [ABC News, 10 Aug 2016]. He claimed that this information had come directly from David Kalisch, the Chief Statistician at the ABS.

However, much of McCormack's account is contradicted in the ABS's own submission to the Senate Economics References Committee [ABS, 21 Sep 2016], and submissions by IBM [IBM, 2016], NextGen [NextGen, 18 Oct 2016] and Vocus [Vocus, 18 Oct 2016] all offer their own diverging interpretations of what happened.

On top of this, there was also a set of anonymous insider leaks, published by information security journalist Patrick Gray two days after Census day [Risky Business, 11 Aug 2016]. As with all anonymous leaks, they shouldn't necessarily be taken as absolute fact; however, they've so far proven accurate and have helped shed some light on IBM's handling of the project.

The first signs of trouble

The first DDoS occurred from 10:10am to 10:21am. This was a very minor attack, so much so that IBM was initially unsure if it was merely a spike in legitimate traffic. However as mentioned earlier, IBM had eschewed traditional DDoS protection, and at this point Island Australia had not yet been activated, which implies that the site was running with no protection whatsoever.

The attack caused the website to be unavailable for a period of five minutes, though at this time of the morning this wouldn't have caused much inconvenience. ABS figures show that the rate of submissions in the morning and early afternoon was only ~20 per second, and this number wouldn't begin to rise until around 3:30pm.

Island Australia

The second DDoS occurred from 11:45am to 11:49am. Again, no mitigation mechanisms were in place at this stage, and again there was a short outage, this time for two minutes. The attack, and the resulting outage, ended soon after the ABS and IBM made the decision to enable Island Australia. The plan was to continue blocking overseas traffic until midnight, at which point most Census forms would have been submitted.

The third DDoS occurred at 4:58pm (or 4:52pm according to NextGen). To IBM's credit, this attack was mitigated by Island Australia, and unlike the previous two attacks it did not cause any additional downtime. Curiously, McCormack also mentioned another small-scale attack at 6:15pm, but there has been no mention of this event in any subsequent statements or inquiry submissions. This suggests that this particular 'attack' was actually a spike in legitimate traffic, which would make sense given it occurred just as the submission rate was beginning to rise sharply.

The Final DDoS

The final attack, the largest yet, began at 7:28pm. According to McCormack, once the attack was detected "the ABS made the decision to shut down the online form to protect the system from further incidents". At the time, he admitted that there had been "an attack [...] from overseas", but maintained that the site was taken down voluntarily at 7:45pm as a precautionary measure, and had not failed because of the attack.

The account given by the ABS and IBM, and generally backed up by Vocus and NextGen, paints a different picture. According to the ABS, the site was rendered unavailable by 7:33pm. At 7:43pm, an IBM router responsible for maintaining the border firewall became overloaded and failed, causing further outages. Finally, the decision was made to shut down the site, but this didn't happen until 8:09pm.

Router Failure

To facilitate access to the site, IBM was running two routers, which were forwarding traffic from their two direct upstream ISPs. One router was linked to Telstra, the other to NextGen. Under Island Australia, Telstra and NextGen were responsible for geoblocking international traffic, only forwarding requests from IP addresses within Australia.
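Stripped to its essentials, Island Australia amounts to checking each incoming source address against an allow-list of Australian IP ranges and dropping everything else at the ISP edge. The sketch below illustrates the idea using Python's ipaddress module; the two prefixes are examples of APNIC-allocated Australian ranges, not the actual (non-public) list used on Census night.

```python
import ipaddress

# Hypothetical allow-list of Australian prefixes. The real list used
# under 'Island Australia' by Telstra and NextGen is not public.
AU_PREFIXES = [
    ipaddress.ip_network("1.120.0.0/13"),    # example Australian range
    ipaddress.ip_network("101.160.0.0/11"),  # example Australian range
]

def is_allowed(src_ip: str) -> bool:
    """Forward a packet only if its source falls inside an allowed prefix."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in net for net in AU_PREFIXES)

# Traffic from inside the allow-list is forwarded on to the site...
print(is_allowed("1.120.0.1"))    # True
# ...while everything else is silently dropped at the ISP edge.
print(is_allowed("203.0.113.9"))  # False
```

As Vocus would later point out, the weakness of this scheme is that membership of an Australian IP range says nothing about where the traffic actually originates.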

In their inquiry submission, IBM states that, following the final attack, "the firewall to the eCensus site - through which IBM's control link to the routers on both the NextGen link and the Telstra link operated - became overloaded with data". Specifically, it was the router facing NextGen that failed after being flooded with incoming traffic. According to the inquiry report, the attack "had the effect of commencing new sessions which quickly exhausted the memory capacity" of the router.

To expand on this, a typical stateful firewall works by evaluating each incoming connection against a pre-defined security policy. If the initial incoming packet satisfies this policy, the connection is added to a dynamic 'state table', and any subsequent packets not matching an existing state table entry are dropped.

However, there is a limit to the size of this state table. If the rate of incoming connections is unexpectedly high, and if these connections appear to be legitimate, the state table can fill up completely, overloading the firewall and leaving it unable to evaluate new incoming connections. It's likely this was the root cause of the NextGen-facing router failure.
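This failure mode can be sketched with a toy model. In the (deliberately tiny) example below, each new connection that passes the policy occupies a slot in a fixed-size state table; once the table fills, the firewall can no longer evaluate new connections at all:

```python
# A toy stateful firewall. A capacity of 3 is artificially small for
# illustration - real state tables hold hundreds of thousands of entries.

class StatefulFirewall:
    def __init__(self, capacity):
        self.capacity = capacity
        self.state_table = set()

    def handle_packet(self, connection, policy):
        if connection in self.state_table:
            return "forwarded"            # part of an established flow
        if len(self.state_table) >= self.capacity:
            return "firewall overloaded"  # state table exhausted
        if policy(connection):
            self.state_table.add(connection)  # new flow admitted
            return "forwarded"
        return "dropped"                  # failed the security policy

fw = StatefulFirewall(capacity=3)
allow_https = lambda conn: conn[1] == 443  # policy: HTTPS only

# Legitimate-looking connections fill the table one by one...
for src in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]:
    fw.handle_packet((src, 443), allow_https)

# ...until the next new connection finds the firewall unable to cope.
print(fw.handle_packet(("10.0.0.4", 443), allow_https))  # "firewall overloaded"
```

Note that a DDoS doesn't need to exceed the site's bandwidth to succeed here - it only needs to open new-looking connections faster than old entries expire.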

"A very expensive paperweight"

At 7:43pm IBM attempted to reboot both routers, but this only made matters much, much worse. The Telstra-facing router wasn't correctly set up to handle a hard reset - when it came back online, the security policy wasn't restored, leaving the router effectively non-functional. As Gray put it, it was left "operating as a very expensive paperweight". Meanwhile, the NextGen router had been successfully rebooted, but was still being hammered by DDoS traffic. This resulted in another round of outages.

Intriguingly, this failed router reboot happened right around the time McCormack originally claimed the site was 'voluntarily' shut down (7:45pm). In light of what we now know, it seems that the Census site, already crippled by what should have been a minor DDoS, had gone down after IBM inadvertently disabled their own firewall.

The False Positive

If we take McCormack's account at face value, the obvious question that arises is: why would IBM respond to the DDoS, an attempt to take down their site, by deciding to voluntarily take down their site? According to the ABS:

"At the time of the DDoS attack, the ABS and IBM observed (on a shared real-time dashboard) an unusual spike in outbound traffic in the IBM monitoring systems. The spike was unexplained, which prompted concerns that the system may have been compromised."

At this point, IBM began to suspect their network had already been breached, and that the DDoS was just a distraction. This anomalous outbound traffic raised the alarming possibility that someone was performing data exfiltration, actively siphoning off the very Census data they'd been hired to secure. No doubt still unnerved from the earlier attacks, they felt they had little choice but to completely shut down the site (even though it was already largely unavailable, owing to the earlier DDoS and router failure).

However, this turned out to be a false positive. IBM explains in their submission that their monitoring system was programmed to report on the volume of inbound and outbound traffic every 60 seconds. But due to the DDoS, the timing of this process became wildly inconsistent. Rather than arriving every minute, readings were being delayed for several minutes, then arriving all at once, giving the illusion of a spike in traffic. In reality, there was only a small, constant stream of outbound traffic. According to Gray, this traffic was completely normal and expected, consisting of "offshore-bound system information [and] logs".
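IBM's explanation is easy to model. In the sketch below (all figures invented for illustration), a monitor emits one constant-sized reading of outbound traffic per 60-second interval, but congestion delays several readings until they all land in the same interval. A dashboard that plots traffic as it is reported sees a phantom spike:

```python
# Toy model of the dashboard artefact. The monitor reports a constant
# 5 Mbit of outbound traffic per 60-second reading, but readings 2-5
# are delayed by congestion and all arrive during interval 5.

READING = 5  # constant outbound volume per reading (illustrative)

# Which dashboard interval each of six consecutive readings arrives in:
arrival_interval = [0, 1, 5, 5, 5, 5]

dashboard = [0] * 6
for interval in arrival_interval:
    dashboard[interval] += READING

print(dashboard)  # [5, 5, 0, 0, 0, 20] - a 4x 'spike' in interval 5
```

The total volume of traffic is unchanged - only its apparent timing has shifted, which is exactly why the 'spike' evaporated once the readings were examined properly.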

By the time IBM realised this, it was too late. The Australian Signals Directorate's (ASD) Incident Response team had already begun investigating, and it would be a long while before they gave the all clear. The site was restored at 2:30pm on Thursday, August 11th, a total downtime of just over 43 hours.


Inside the Attack


To understand the significance (or lack thereof) of this event, it's necessary to first understand the basics of how a Denial of Service attack works. If you're already familiar with this, go ahead and skip to the next section.

Denial of Service

The word 'hacking' is often used to describe a Denial of Service (DoS) attack, which is somewhat misleading, as it gives the impression that the aim is to steal data or take control of a website. There are plenty of ways to do this, but a DoS is not one of them. Rather, a basic DoS simply attempts to flood a site with traffic, to the point where the server hosting the site can no longer keep up. As long as the attack continues, the site remains unreachable to legitimate users.

Since a DoS requires a large amount of traffic to have a noticeable effect, it's common for an attacker to utilise a 'botnet' - a collection of remotely-controlled, malware-infected computers that can be ordered to continuously send requests to a specific site. A DoS from a single one of these machines would have little effect, but a simultaneous DoS leveraging the entire botnet could be potentially crippling. This is the essential idea behind a Distributed Denial of Service attack.

The volume of this traffic is measured in megabits per second (Mbps) or gigabits per second (Gbps, or 1000Mbps), and the larger botnets can generate hundreds of Gbps of traffic. To provide a recent example, in September this year the website of security journalist Brian Krebs was hit with a 620Gbps attack [KrebsOnSecurity, 16 Sep 2016]. Originating from the infamous Mirai botnet, it was the largest DDoS ever recorded.

But leveraging a botnet isn't the only way to generate massive amounts of malicious traffic...

Reflection and Amplification

All communication between machines on a network happens using a network protocol, a set of rules defining the structure of messages and the way they're exchanged. A comprehensive explanation of networking protocols is well beyond the scope of this article, but suffice to say, they're everywhere. The page you're reading was transmitted using Hypertext Transfer Protocol over SSL, hence the https:// at the beginning of the address.

Just as network protocols can be used legitimately, they can also be abused. In the case of DDoS attacks, an attacker can generate a much larger volume of traffic by tricking servers into responding to 'spoofed' requests [CloudFlare, 30 Oct 2012]. This is possible because several commonly-used network protocols share two significant features:

- They run over UDP, a connectionless protocol with no handshake, meaning the source address of a request is never verified and can be 'spoofed'.
- They can generate responses that are many times larger than the requests that trigger them.

The first point means that an attacker can send a request to a server and spoof the source address, replacing it with the address of the target. The server will then send off a response, but this response will be directed to the target instead. This becomes significant when combined with the second point: the size of the request isn't always proportional to the size of the response. This means it's easy to prompt a server to respond with a large amount of data by sending a relatively small request.

By combining the two, it's possible to generate massive response packets (amplification) and direct them toward the target by spoofing the source address (reflection).
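Some back-of-the-envelope arithmetic shows why this is so attractive to attackers. All figures below (64-byte requests, a 30x amplification factor, 10,000 reflectors) are illustrative assumptions, not measurements from the Census attack:

```python
# Reflection/amplification arithmetic with assumed, illustrative figures.

request_bytes = 64            # small spoofed request sent by the attacker
amplification_factor = 30     # response ~30x larger (plausible for DNS)
requests_per_second = 100     # per reflector
reflectors = 10_000           # open servers answering spoofed requests

response_bytes = request_bytes * amplification_factor
attacker_bps = request_bytes * 8 * requests_per_second * reflectors
attack_bps = response_bytes * 8 * requests_per_second * reflectors

print(f"attacker sends  {attacker_bps / 1e9:.2f} Gbps")
print(f"target receives {attack_bps / 1e9:.2f} Gbps")
```

Under these assumptions, roughly half a gigabit per second of spoofed requests is turned into over 15Gbps of traffic aimed at the target - and none of it traces back to the attacker's own address.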

The above chart shows the most common protocols used for reflection/amplification attacks in 2015, as measured by Arbor Networks [Arbor Networks, 2016].

There are three main offenders:

- DNS (Domain Name System), used to translate domain names into IP addresses
- NTP (Network Time Protocol), used to synchronise clocks across the internet
- SSDP (Simple Service Discovery Protocol), used by networked devices to discover one another

Combined, these three protocols account for 90% of reflection/amplification attacks.

DNS Amplification

By all accounts, the DDoS that hit the ABS Census site was an amplification attack using the DNS protocol. As can be seen above, DNS is frequently abused to launch amplification attacks, primarily because there are as many as 32 million 'open DNS resolvers' running across the internet [Open Resolver Project, 27 Oct 2013]. Whereas a properly-configured DNS server will only respond to requests from a specific range of trusted addresses, an open resolver has no such restrictions, leaving it open to misuse.

Below is an example of a simple query that uses DNS to return the IP address corresponding to a domain:

C:\>nslookup dperrysvendsen.com 8.8.8.8
Server: dns.google
Address: 8.8.8.8

Non-authoritative answer:
Name: dperrysvendsen.com
Address: 37.60.254.141

Here, we've run the nslookup command for the domain name of this site. We've directed this query to 8.8.8.8, the IP address of Google's public DNS server. In response, dns.google has returned the corresponding IP address 37.60.254.141.

The example above is just a typical lookup command, not anything malicious, but even in this case the response we receive is several times larger than the initial request. A DNS amplification attack takes advantage of this 'amplification factor' to generate large amounts of traffic.
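To get a sense of just how small such a request is, the sketch below hand-builds a minimal DNS query following the standard DNS wire format (a fixed 12-byte header followed by a length-prefixed question). The resulting packet is a few dozen bytes, yet an ANY query like this can elicit a response of several thousand bytes from an open resolver:

```python
import struct

def build_dns_query(domain: str) -> bytes:
    """Build a minimal DNS query packet (the kind sent to a resolver)."""
    header = struct.pack(">HHHHHH",
                         0x1234,   # transaction ID
                         0x0100,   # flags: standard query, recursion desired
                         1,        # one question
                         0, 0, 0)  # no answer/authority/additional records
    # QNAME: each label is length-prefixed, terminated by a zero byte
    qname = b"".join(bytes([len(p)]) + p.encode() for p in domain.split("."))
    qname += b"\x00"
    question = qname + struct.pack(">HH", 255, 1)  # QTYPE=ANY, QCLASS=IN
    return header + question

query = build_dns_query("dperrysvendsen.com")
print(len(query))  # 36 - the entire request is only 36 bytes
```

Sending this 36-byte query with a spoofed source address is all it takes to direct one amplified response at the target; the attack is simply this, repeated millions of times across thousands of open resolvers.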

ICMP Flood

On top of DNS amplification, Gray claims that there may have been a second simultaneous attack, this time using the lower-level Internet Control Message Protocol (ICMP). No official sources have confirmed this, but it's not unusual to see multiple attacks across different protocols used in combination.

One possibility is that the attacker also launched a Ping flood, an old-school ICMP-based DoS technique. Ping flooding is the sort of thing you'd expect to see in the nineties, not 2016, and compared to the main DDoS it probably wasn't particularly effective. In all likelihood, it was used less as an actual threat and more for comedic effect.

Nonetheless, the principle is much the same as any other reflection attack. An ICMP Echo request (or Ping) is usually used to determine if a particular host is reachable. If the source address of a ping request is spoofed, the reply will be directed to the target server instead.

Below is an example of a ping request that will attempt to reach the specified website and report the results:

C:\>ping dperrysvendsen.com

Pinging dperrysvendsen.com [37.60.254.141] with 32 bytes of data:
Reply from 37.60.254.141: bytes=32 time=248ms TTL=52
Reply from 37.60.254.141: bytes=32 time=226ms TTL=52
Reply from 37.60.254.141: bytes=32 time=226ms TTL=52
Reply from 37.60.254.141: bytes=32 time=225ms TTL=52

Ping statistics for 37.60.254.141:
Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
Minimum = 225ms, Maximum = 248ms, Average = 231ms

Here, in response to our ping request, we've received several echo replies from the IP address 37.60.254.141 (the same one we discovered above). Unlike DNS, an ICMP echo reply is roughly the same size as the request, so there's no real amplification at play - but spoofed replies can still be redirected to the target server, flooding it with ICMP packets.
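For completeness, here's what such an echo request looks like at the byte level. This sketch only constructs the packet (actually sending it requires a raw socket and elevated privileges): an 8-byte ICMP header with type 8 (echo request), protected by the standard RFC 1071 ones'-complement checksum, followed by the 32-byte payload that the reply will mirror back.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement checksum, as used by ICMP."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f">{len(data) // 2}H", data))
    while total >> 16:                      # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo_request(ident: int, seq: int, payload: bytes) -> bytes:
    """ICMP type 8 (echo request); the echo reply mirrors the payload."""
    header = struct.pack(">BBHHH", 8, 0, 0, ident, seq)  # checksum = 0
    checksum = internet_checksum(header + payload)
    return struct.pack(">BBHHH", 8, 0, checksum, ident, seq) + payload

packet = build_echo_request(ident=1, seq=1, payload=b"x" * 32)
print(len(packet))  # 40 bytes: 8-byte ICMP header + 32-byte payload
```

A handy property of the internet checksum is that recomputing it over the finished packet (checksum field included) yields zero, which is how receivers validate it.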

Interestingly, if you were to attempt to ping census.abs.gov.au in the aftermath of this attack, you'd get no response. As of the time of writing, this is still the case, even though the site remains up with a placeholder message. This would indicate that the servers are blocking ICMP entirely, which would be unusual under normal circumstances, but would make sense if they were being targeted by ICMP-based attacks.

Size Definitely Matters

The exact volume of the attack is unclear. Gray's original estimate, again based on his anonymous sources, was 2Gbps. When Alastair MacGibbon, Special Adviser to the Prime Minister on Cyber Security, appeared before the Senate inquiry, he estimated that the volume was ~3Gbps [iTnews, 25 Oct 2016]. However, Vocus contradicts this in their submission to the inquiry, in which they claim the actual volume was five times smaller, at only 536Mbps. This amount of traffic "is not considered significant in the industry" the submission states, since "it is materially below the mean attack size".

Indeed, the volume was remarkably small given the profound impact of the attack. Below is a chart of the average size of reflection/amplification attacks in 2015. This shows that, even if we accept MacGibbon's significantly higher figure of 3Gbps, this is only marginally above average for this class of attack.

In fact, if we break down this data by protocol, we can see that the average size of DNS amplification attacks (shown below in light blue) was steadily rising over the course of the year, eventually reaching 4.4Gbps. This would mean that, according to the latest available data, the attack was between 1.5 and 8.2 times smaller than the average volume of comparable attacks.
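These comparisons are simple to check. Using the figures reported above - Vocus's 536Mbps, MacGibbon's ~3Gbps, and Arbor's 4.4Gbps average for DNS amplification attacks:

```python
# Comparing each reported estimate of the attack's size against Arbor's
# 4.4Gbps average for DNS amplification attacks in late 2015.

average_gbps = 4.4
estimates_gbps = {"Vocus": 0.536, "MacGibbon": 3.0}

for source, size in estimates_gbps.items():
    print(f"{source}: {average_gbps / size:.1f}x smaller than average")
```

Whichever estimate you trust, the attack sits at or below the average for its class - hardly the 'unprecedented' event it was initially made out to be.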

Perhaps this goes some way toward explaining the lack of noticeable activity on real-time DDoS trackers including NorseCorp's live attack map and Arbor Networks' own Digital Attack Map. When questioned about this, MacGibbon suggested the attack was probably so small it wouldn't have shown up at all.


The Blame Game


We now know that not only was the technique utilised in this attack an extremely common one, the volume of traffic was far below average for this attack class. So why wasn't the Census site able to cope?

Each party offered up their view as to why things didn't go to plan:

ABS: Attack should not have caused disruptions

The ABS claims that, along with IBM, they "conclusively determined that [...] the outage was caused by an overseas-based DDoS attack". They blame IBM for this, saying that they "had sought and received various assurances [...] about operational preparedness and resilience to DDoS attacks". They claim IBM assured them that Island Australia would be sufficient to protect the site, and maintain that "at no time was the ABS offered or advised of additional DDoS protections that could be put into place".

"The online Census system was hosted by IBM under contract to the ABS and the DDoS attack should not have been able to disrupt the system. Despite extensive planning and preparation by the ABS for the 2016 Census this risk was not adequately addressed by IBM [...]"

IBM: Geoblocking should have been effective

IBM claims that "the site underwent performance and security testing by the ABS [...] before it went live". They further claim that "the geo-blocking arrangement was tested prior to Census Day and worked". At best, this appears to be a half-truth. The ABS did commission their own independent testing of the system, which was carried out by Revolution IT, an Australian software firm. But this load testing was never intended to take into account the impact of potential DDoS traffic.

Revolution IT has since confirmed that, for the purposes of testing, the projected volume was 250 submissions per second (900,000 per hour), with a peak of 400 submissions per second (1.4 million per hour) [Revolution IT, 10 Aug 2016]. Figures provided by the ABS show that by 7:30pm the load had reached 150 submissions per second, still well within this expected range.
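The per-second and per-hour figures quoted across the various statements are easy to reconcile, and doing so puts the ABS's original capacity claim in context:

```python
# Reconciling the quoted figures: the ABS claimed capacity of 1,000,000
# submissions per hour; Revolution IT tested at 250/sec with a peak of
# 400/sec; the ABS reported an actual load of 150/sec at 7:30pm.

claimed_capacity_per_hour = 1_000_000
tested_per_second = 250
tested_peak_per_second = 400
observed_per_second = 150

claimed_per_second = claimed_capacity_per_hour / 3600
print(f"claimed capacity: ~{claimed_per_second:.0f} submissions/sec")
print(f"tested load:      {tested_per_second * 3600:,} submissions/hour")
print(f"tested peak:      {tested_peak_per_second * 3600:,} submissions/hour")
print(f"observed load:    {observed_per_second * 3600:,} submissions/hour")
```

In other words, the claimed capacity (~278 submissions per second) comfortably exceeded both the tested load and the actual load on the night - the site failed under attack traffic, not under the weight of legitimate submissions.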

As for the geoblocking failure, IBM blames their upstream providers, saying "had NextGen (and through it Vocus) properly implemented Island Australia, it would have been effective to prevent this DDoS [and the] site would not have become unavailable to the public". They describe how "a Singapore link operated by one of NextGen's upstream suppliers (Vocus Communications) had not been closed off", which had allowed attack traffic to "[enter] the NextGen link to the eCensus site".

NextGen: DDoS prevention measures were rejected

NextGen argues that the requirements of Island Australia were "well beyond what is provided for a standard internet service". They claim IBM were offered "additional feature[s] designed to effectively detect and defend against DDoS attacks", but IBM had chosen not to purchase this option, even after it was strongly recommended to them.

IBM claims there were several reasons for this; in particular, they were concerned that NextGen's existing mitigation mechanisms might interfere with their load-balancing systems, which were needed to "distribute the load across multiple back-end process streams". They also argued that these protections might mischaracterise normal traffic spikes as DDoS attack traffic. It was not until August 13th, five days after the attack, that IBM finally agreed to implement this protection, which was subsequently provided free of charge.

NextGen further claims that, as part of Island Australia, IBM requested they geoblock traffic from 20 specific host routes. Their submission outlines how this was achieved by implementing several remote 'black holes' at the edge of the network. A black hole refers to the common Border Gateway Protocol (BGP) trick of forwarding traffic to null0, the universal null interface that immediately discards any received packets. This effectively blocks all traffic from a specific source without alerting the sender [Cisco, 2005].
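The blackholing mechanism is simple enough to sketch. In the toy forwarding table below, routing a prefix to None stands in for the null0 interface, and longest-prefix match decides each packet's fate; the prefixes are documentation ranges, not the real host routes IBM supplied.

```python
import ipaddress

# Toy forwarding table: None models null0, the interface that silently
# discards whatever is routed to it.
routes = {
    ipaddress.ip_network("0.0.0.0/0"): "upstream",   # default route
    ipaddress.ip_network("198.51.100.0/24"): None,   # blackholed range
}

def forward(src_ip: str):
    addr = ipaddress.ip_address(src_ip)
    # Longest-prefix match: the most specific matching route wins
    matching = [net for net in routes if addr in net]
    best = max(matching, key=lambda net: net.prefixlen)
    return routes[best]

print(forward("203.0.113.7"))   # 'upstream' - normal traffic flows on
print(forward("198.51.100.9"))  # None - silently dropped at the edge
```

The catch, as NextGen's submission makes clear, is that a blackhole only catches traffic matching a listed prefix - any route missing from the list sails straight through on the default route.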

But this list failed to cover all possible routes that could be used to deliver DDoS traffic. In other words, IBM's plan was fundamentally flawed, no matter how well it was implemented.

"Nextgen believes that the individual host routes picked by IBM may not be exhaustive, and DDoS attacks could come from other routes in the IP address range (which they did in the 3rd DDoS attack on Census Day). [...] There were a number of routes without geoblocking during the fourth DDoS attack, and which were not identified during testing, along with the Singapore link mentioned in [...] IBM's Submission."

Vocus: Island Australia was not realistic

Vocus admits that, even though it blocked the majority of malicious traffic, it wasn't successful in stopping a relatively small volume of traffic originating from a Singapore link. However they dispute that this was the cause of the outage, and argue that the site "should have had relevant preparations in place to enable it to cater for the expected traffic from users as well as high likelihood of DDoS attacks".

They also echo NextGen's assertion that Island Australia was never an appropriate plan to begin with, and raise the possibility that some DDoS traffic actually originated from within Australia. They are not the first to suggest this - in his original set of leaks, Gray had also claimed that the attack had come from within Australia, though both the ABS and IBM strongly deny this. In their submission, Vocus argues that IBM's plan simply wasn't realistic:

"It is incorrect for IBM to represent that DDoS attack traffic travels through a single link, in this case, the Vocus Singapore peering link. [...] The devices ('botnets') can be located anywhere in the world, including inside Australia. Furthermore, the Island Australia approach does not consider the reality of overseas network operators connecting to Australian service providers inside Australian borders."

Indeed, IBM's justification for choosing Island Australia is at best shaky, and at worst outright laughable. As the inquiry report puts it:

"It appears that IBM and the ABS were in agreement that any botnet in Australia was of insufficient size to cause serious damage to the eCensus website, and therefore geoblocking would be sufficient."

This might have been true, had IBM not specifically asked Vocus to disable their standard DDoS mitigation measures. Vocus's submission also claims that they were "in fact requested to disable [their] DDoS protection product covering the eCensus IP space", which they say would have "appropriately shielded [the site] from DDoS attacks".

Finally, Vocus states that they "[were] not informed of IBM's DDoS mitigation strategy, Island Australia or its specific requirements, until after the fourth attack." They argue that "any assumption[s] that Vocus was required to, or had implemented Island Australia or geo-blocking [...] are inaccurate".

Turnbull: "Heads will roll"

Finally, Australian Prime Minister Malcolm Turnbull has publicly laid the blame on IBM, threatening that "heads will roll" over the incident [ABC News, 12 Aug 2016], though he clarified that this would only happen after a full investigation had been completed:

"[IBM] had an obligation to protect [the site] against denial-of-service attacks, and the measures that they'd agreed to put into place [...] didn't work, and that's why you had the problems that occurred. There were a number of other failures, but fundamentally, that's the reason for the problem.

"What occurred was not a massive [or] unprecedented attack. It was highly expected."

The Aftermath


In conclusion, based on the information put forward by the Senate inquiry into the handling of the 2016 Census, we now know that IBM:

- opted for geoblocking ('Island Australia') in place of industry-standard DDoS mitigation, a plan that could never account for every route into the country;
- declined the additional DDoS protection offered by NextGen, and asked Vocus to disable the protection already covering the eCensus IP space;
- failed to configure their routers to recover from a hard reset, leaving one "operating as a very expensive paperweight"; and
- misinterpreted delayed monitoring data as evidence of data exfiltration, prompting a complete shutdown of the site.

Despite the ASD having investigated the incident, we are no closer to knowing who was responsible, with the inquiry report confirming that "the perpetrators of the DDoS attack remain unknown".

The Australian government has since reached a confidential settlement with IBM. Though the details of this settlement are not public, the compensation bill is reported to be upwards of $30 million [ABC News, 25 Nov 2016], three times what IBM was originally paid to implement the project.

The ABS maintains that, in spite of these failures, the response rate for the 2016 Census was "over 96%", a figure comparable with that of previous years. However it's likely they'll be thinking very carefully about their approach next time.

Don't be surprised if the 2021 Census sees a return to pen-and-paper - it may be for the best.


Note: This article was originally published on November 27, 2016, on my now-defunct blog.