Monday 30 November 2015

Google cloud outage caused by failure that saw admins run it manually ... and fail

A mistaken peering advertisement from a European network took Google Cloud's europe-west1 region offline last week for around 70 minutes.

The slip-up happened when an unnamed network owner connected a new peering link to Google and, in the process, advertised reachability for far more destinations than the link had capacity to serve.

As The Chocolate Factory explains in its post on the incident, most of the traffic lost as a result carried destination addresses in eastern Europe and the Middle East.

“The peer's network signalled that it could route traffic to many more destinations than Google engineers had anticipated, and more than the link had capacity for. Google's network responded accordingly by routing a large volume of traffic to the link. At 11:55, the link saturated and began dropping the majority of its traffic”, the post states.

That kind of error, Google's report continues, would usually be caught by automated safety checks, but “the automation was not operational due to an unrelated failure, and the link was brought online manually, so the automation's safety checks did not occur”.
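Google's post does not say how its automated checks work, so the following is only a rough sketch of the kind of pre-activation test that could catch such a mistake. Everything in it is hypothetical: the PeeringLink record, the safe_to_activate function, the 20 per cent slack and the example figures are illustrative assumptions, not Google's tooling or data.

```python
from dataclasses import dataclass


@dataclass
class PeeringLink:
    # Hypothetical record describing a new peering link.
    name: str
    capacity_gbps: float      # how much traffic the link can carry
    expected_prefixes: int    # how many prefixes engineers expect the peer to announce


def safe_to_activate(link, advertised_prefixes, projected_gbps, prefix_slack=1.2):
    """Return (ok, reasons) for a proposed link activation.

    Assumed policy, for illustration only: refuse activation if the peer
    announces noticeably more prefixes than expected, or if the traffic
    projected to follow those announcements exceeds the link's capacity.
    """
    reasons = []
    if advertised_prefixes > link.expected_prefixes * prefix_slack:
        reasons.append(
            f"{link.name}: {advertised_prefixes} prefixes advertised, "
            f"expected about {link.expected_prefixes}"
        )
    if projected_gbps > link.capacity_gbps:
        reasons.append(
            f"{link.name}: projected {projected_gbps} Gbps exceeds "
            f"capacity of {link.capacity_gbps} Gbps"
        )
    return (not reasons, reasons)


if __name__ == "__main__":
    # Made-up numbers: a peer announcing far more than the link can serve.
    link = PeeringLink(name="new-peer", capacity_gbps=10.0, expected_prefixes=1000)
    ok, reasons = safe_to_activate(link, advertised_prefixes=50000, projected_gbps=80.0)
    print("activate" if ok else "refuse", reasons)
```

A check of this sort only helps if it cannot be skipped, which is presumably why the remedy Google describes is procedural rather than technical.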

“To prevent recurrence of this issue, Google network engineers are changing procedure to disallow manual link activation”, the post notes.

Route announcement errors are a continuing headache on the Internet.

In June, Telekom Malaysia mis-advertised routes to Level 3, causing the US provider to sling most of its traffic onto a network that couldn't cope and so dropped the packets.

Read More: http://www.theregister.co.uk/2015/11/30/euro_network_gobbles_googles_cloud/
