An update on Sunday’s service disruption | Google Cloud Blog (cloud.google.com)
When the biggest cloud providers in the world have issues, it’s a good reminder of how hard it is to run technology platforms at scale, continuously.
In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions, and it caused those regions to stop using more than half of their available network capacity.
This sounds similar to other big outages: the unintended consequences of a change, or a change accidentally scoped too widely. These changes come from expert administrators, so the machines follow the direction. I don’t think it’ll be long before we build some skepticism into our platforms, so that they don’t automatically trust even an expert administrator’s direction when the scope of the change is large.

Posted on June 6, 2019
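As a rough sketch of what that kind of platform skepticism might look like, here is a small Python check that compares what a change would actually touch against what the operator said they intended. Everything in it is hypothetical: the `Change` shape, the `check_scope` function, and the 5% fleet threshold are illustrative assumptions, not anything Google has described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A proposed configuration change (hypothetical shape)."""
    description: str
    targets: frozenset           # server IDs the change would actually touch
    intended_regions: frozenset  # regions the operator said they meant


def check_scope(change, server_regions, max_fraction=0.05):
    """Return reasons to pause the change; an empty list means proceed.

    server_regions maps every server ID in the fleet to its region.
    max_fraction is an assumed limit on how much of the fleet one
    change may touch without extra review.
    """
    if not server_regions:
        raise ValueError("fleet inventory is empty")

    reasons = []

    # Flag any region the change reaches that the operator did not name.
    touched_regions = {server_regions[s] for s in change.targets}
    unexpected = touched_regions - change.intended_regions
    if unexpected:
        reasons.append(f"touches unintended regions: {sorted(unexpected)}")

    # Flag changes that cover an unusually large share of the fleet.
    fraction = len(change.targets) / len(server_regions)
    if fraction > max_fraction:
        reasons.append(
            f"touches {fraction:.0%} of the fleet (limit {max_fraction:.0%})"
        )

    return reasons
```

A deployment tool could call `check_scope` before applying anything and, if the list is not empty, hold the change for a second approval rather than refusing it outright. The point is only that the platform asks "did you really mean this much?" before following the direction.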