GDS: Incident report – GOV.UK DNS outage

Originally published by Dafydd Vaughan on 15 November 2016

This article was originally published on the Inside GOV.UK blog and has now been re-published here. You can read the original article online.

This post is a report about 2 recent incidents on the GOV.UK website. We’re still working through our incident process, but due to the scale of these issues, we felt it was important to publish our report as soon as possible.

This post covers 2 similar and related incidents:

  • an intermittent outage of the GOV.UK website and related services for around 3 hours on Friday 21 October 2016
  • an intermittent outage of just the GOV.UK website for around 25 minutes on Wednesday 26 October 2016

Background

Both incidents were related to problems with DNS (the Domain Name System), the mechanism used to convert a website address like ‘www.gov.uk’ to an IP address like ‘151.101.60.144’.

Browsers use DNS to find the IP address of a website so they can send requests to it and receive responses.
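
As a rough illustration of that lookup, the Python sketch below asks the operating system's resolver for the IPv4 addresses of a hostname, using the standard library's socket.getaddrinfo. The function name resolve_ipv4 and its lookup parameter are illustrative choices, not part of any GOV.UK tooling, and the addresses returned for a real site will vary by time and location.

```python
import socket


def resolve_ipv4(hostname, lookup=socket.getaddrinfo):
    """Return the sorted, de-duplicated IPv4 addresses for a hostname.

    `lookup` defaults to the system resolver; it is a parameter only so
    the function can be exercised without network access (an assumption
    made for this sketch).
    """
    results = lookup(hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each result is (family, type, proto, canonname, sockaddr).
    # For IPv4, sockaddr is an (ip_address, port) pair.
    return sorted({sockaddr[0] for _, _, _, _, sockaddr in results})


if __name__ == "__main__":
    # Needs a working DNS path; raises socket.gaierror if the lookup fails.
    print(resolve_ipv4("www.gov.uk"))
```

If the DNS provider cannot be reached and no cached answer is available, the lookup raises an error instead of returning an address, which is the failure some users experienced during these incidents.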

Friday 21 October

On Friday, a third-party company was targeted by a large distributed denial of service (DDoS) attack. The company – Dyn – provides DNS services for many large websites.

At around 5pm on Friday, the attack began. It caused problems connecting to Dyn’s servers around the world. This meant that DNS requests received slow or no responses.

This outage affected the GOV.UK website, GOV.UK blogs on ‘blog.gov.uk’, government services on the ‘service.gov.uk’ domain name and legacy systems running on the ‘direct.gov.uk’ and ‘businesslink.gov.uk’ domains.

The outage also affected some third-party systems that GOV.UK uses, including Zendesk and Basecamp.

The GOV.UK on-call technical team worked to restore the service by removing Dyn from the infrastructure for ‘www.gov.uk’ and moving DNS for ‘service.gov.uk’ to an alternative provider.

These 2 domains were prioritised over others as they have the highest number of users, and so were most affected.

However, as several internal GDS systems were unavailable, the external organisation that runs our domain names couldn’t be sure that the change requests we were making were authorised. This delayed the work to restore services.

Service for ‘www.gov.uk’ and ‘service.gov.uk’ was restored by 8pm. The remaining services including GOV.UK blogs and legacy systems were restored at 11pm when the DDoS attack ended.

Wednesday 26 October

At around 3:15pm, an external organisation made a planned change to the DNS record for the ‘www.gov.uk’ domain name.

This change was requested by GOV.UK to restore some resilience following the actions we took during Friday’s incident.

An engineer at the external organisation mistakenly changed the DNS record to point at something that didn’t exist.

This meant that requests for ‘www.gov.uk’ didn’t work.

Once the mistake was identified, the external organisation reacted quickly to fix the record.

The GOV.UK website was restored by 3:40pm and the total downtime was approximately 25 minutes.

What users saw

DNS responses are easily cached. Temporary copies are often kept in various places, including in your browser, your home or office router and at your internet provider.

Only those who had no copy of the relevant DNS records and needed to request them from our DNS provider would have seen problems.

Additionally, during the Friday incident, access to Dyn’s servers was intermittent, so some requests would still have succeeded.

This means that not everyone would have been affected by these outages.

If they were affected, users would not have been able to access the GOV.UK website, GOV.UK blogs, or other government services hosted on the ‘service.gov.uk’, ‘direct.gov.uk’ and ‘businesslink.gov.uk’ domains.
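
The effect of caching described above can be sketched in Python. This is a simplified model, not how any particular browser or router works: the class name CachingResolver, the upstream callable and the 300-second time-to-live are all assumptions made for illustration.

```python
import time


class CachingResolver:
    """Keep DNS answers until their time-to-live (TTL) expires."""

    def __init__(self, upstream, clock=time.monotonic):
        # `upstream` is a hypothetical callable: hostname -> (address, ttl_seconds).
        # It may raise OSError when the DNS provider cannot be reached.
        self._upstream = upstream
        self._clock = clock
        self._cache = {}  # hostname -> (address, expires_at)

    def resolve(self, hostname):
        entry = self._cache.get(hostname)
        if entry is not None and entry[1] > self._clock():
            # A fresh cached copy exists, so the provider is not contacted.
            return entry[0]
        # No copy, or the copy has expired: the provider must answer.
        address, ttl = self._upstream(hostname)
        self._cache[hostname] = (address, self._clock() + ttl)
        return address


if __name__ == "__main__":
    provider_up = True

    def upstream(hostname):
        if not provider_up:
            raise OSError("DNS provider unreachable")
        return "151.101.60.144", 300

    resolver = CachingResolver(upstream)
    print(resolver.resolve("www.gov.uk"))  # answered by the provider
    provider_up = False
    print(resolver.resolve("www.gov.uk"))  # answered from the cache
```

In this model, a user whose cached copy is still fresh sees no problem while the provider is unreachable. A user with no copy, or an expired one, gets an error.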

Immediate things we’re doing to prevent this from happening again

Since Friday’s incident, we’ve made a number of changes to our DNS setup and have several more in progress.

We identified that our DNS provision was a single point of failure, so we now use a second DNS service. We’re also looking to see if there are any other single points of failure.
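
The benefit of a second DNS service can be sketched as follows. In practice this redundancy usually comes from publishing name servers from more than one provider, so that resolvers can try another when one does not respond. The Python below only models that "try the next provider" behaviour. The function name resolve_with_fallback and the provider callables are hypothetical and do not describe GOV.UK's actual configuration.

```python
def resolve_with_fallback(hostname, providers):
    """Try each DNS provider in order and return the first answer.

    `providers` is a list of (name, query) pairs, where `query` is a
    hypothetical callable taking a hostname and returning an address,
    or raising OSError if that provider cannot be reached.
    """
    errors = []
    for name, query in providers:
        try:
            return name, query(hostname)
        except OSError as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise OSError("all DNS providers failed: " + "; ".join(errors))


if __name__ == "__main__":
    def unreachable(hostname):
        raise OSError("timed out")

    def healthy(hostname):
        return "151.101.60.144"

    providers = [("primary", unreachable), ("secondary", healthy)]
    print(resolve_with_fallback("www.gov.uk", providers))
```

With a single provider in the list, one failure is enough to make the lookup fail. With two, both would need to be unavailable at the same time.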

In addition to the DNS changes, we’re looking again at our monitoring and alerting setup. Several of our monitoring services were affected by Friday’s outage, so we weren’t alerted in the way we should have been, although that did not cause a problem in this case.

Finally, we’re looking at how we can provide better information to people if we experience an outage in the future. Many of our usual communication methods were unavailable as they all relied on Dyn’s DNS service.

Next steps

As I said at the beginning of the post, we’re still working through our incident process. We’re reviewing the incident, the causes and our handling of it, and expect more actions to come from that.

We’re also working with our colleagues at the National Cyber Security Centre and other parts of government to coordinate our incident management processes and understand the wider government impact.

We’ll post a further update to this incident report if there are any other disruptions or relevant findings.