| By Jeremy Geelan | Article Rating: |
|
| September 2, 2009 09:15 AM EDT | Reads: |
5,878 |
"We've turned our full attention to helping ensure this kind of event doesn't happen again," wrote Ben Treynor, self-described 'VP Engineering and Site Reliability Czar' at Google, in an official blogged explanation last night of the 100-minute Gmail outage yesterday, which Treynor conceded "was a Big Deal, and we're treating it as such."
He then described in detail the events that conspired to bring about the outage:
"Here's what happened: This morning (Pacific Time) we took a small fraction of Gmail's servers offline to perform routine upgrades. This isn't in itself a problem — we do this all the time, and Gmail's web interface runs in many locations and just sends traffic to other locations when one is offline.
However, as we now know, we had slightly underestimated the load which some recent changes (ironically, some designed to improve service availability) placed on the request routers — servers which direct web queries to the appropriate Gmail server for response. At about 12:30 pm Pacific a few of the request routers became overloaded and in effect told the rest of the system "stop sending us traffic, we're too slow!". This transferred the load onto the remaining request routers, causing a few more of them to also become overloaded, and within minutes nearly all of the request routers were overloaded. As a result, people couldn't access Gmail via the web interface because their requests couldn't be routed to a Gmail server. IMAP/POP access and mail processing continued to work normally because these requests don't use the same routers.
The Gmail engineering team was alerted to the failures within seconds (we take monitoring very seriously). After establishing that the core problem was insufficient available capacity, the team brought a LOT of additional request routers online (flexible capacity is one of the advantages of Google's architecture), distributed the traffic across the request routers, and the Gmail web interface came back online."
Treynor's post ended with a detailed explanation of Google's plans to prevent a repeat of the same problem:
"What's next ... Some of the actions are straightforward and are already done — for example, increasing request router capacity well beyond peak demand to provide headroom. Some of the actions are more subtle — for example, we have concluded that request routers don't have sufficient failure isolation (i.e. if there's a problem in one datacenter, it shouldn't affect servers in another datacenter) and do not degrade gracefully (e.g. if many request routers are overloaded simultaneously, they all should just get slower instead of refusing to accept traffic and shifting their load). We'll be hard at work over the next few weeks implementing these and other Gmail reliability improvements."
He ends, "Gmail remains more than 99.9% available to all users, and we're committed to keeping events like today's notable for their rarity."
Published September 2, 2009 Reads 5,878
Copyright © 2009 SYS-CON Media, Inc. — All Rights Reserved.
Syndicated stories and blog feeds, all rights reserved by the author.
More Stories By Jeremy Geelan
Jeremy Geelan is President & COO of Cloud Expo, Inc. and Conference Chair of the worldwide Cloud Expo series. He appears regularly at conferences and trade shows, speaking to technology audiences both in North America and overseas. He is executive producer and presenter of Cloud Expo's "Power Panels" on SYS-CON.TV.
- Acquia Announces Two New Board Members
- CollabNet Adds Board Member and Senior Executives to Fuel Continued Growth in Agile ALM and Enterprise Cloud Development
- Learn Open Source Database Tools from Stanford for Free
- Research and Markets: Global Mobile Device Management Enterprise Software Market 2010-2014 Includes a Discussion of the Key Vendors Operating in This Market
- Alternative Search Engines for the Contemporary User
- FORTUNE Magazine Names Rackspace Among “100 Best Companies to Work For”
- New York City : Blueprint for Cloud-enabled economic transformation
- EnterpriseDB Announces Availability of Postgres Plus Cloud Database
- Connectria Hosting Achieves "Off the Chart" Operational Efficiency With Cloud-Based Storage Solution From Nexsan and CommVault
- eXo Platform 3.5 Now Available: First Cloud-Ready Enterprise Portal and User Experience Platform-as-a-Service (UXPaaS)
- Research and Markets: WordPress 24-Hour Trainer, 2nd Edition
- ICOS and Joyent Announce Strategic Partnership to Deliver Joyent's Cloud Infrastructure Solution to Channel Partners and Service Providers
- Five Years Waiting for JRE 7: Is It Justified? (Part 1)
- Book Review: The CERT Oracle Secure Coding Standard for Java
- Acquia Announces Two New Board Members
- CollabNet Adds Board Member and Senior Executives to Fuel Continued Growth in Agile ALM and Enterprise Cloud Development
- Learn Open Source Database Tools from Stanford for Free
- Research and Markets: Global Mobile Device Management Enterprise Software Market 2010-2014 Includes a Discussion of the Key Vendors Operating in This Market
- Government Big Data Solutions Award Nominee: Wayne Wheeles (Sherpa Surfing)
- Alternative Search Engines for the Contemporary User
- FORTUNE Magazine Names Rackspace Among “100 Best Companies to Work For”
- New York City : Blueprint for Cloud-enabled economic transformation
- EnterpriseDB Announces Availability of Postgres Plus Cloud Database
- Load testing the post office
- Java Developer's Journal Exclusive: 2006 "JDJ Editors' Choice" Awards
- The i-Technology Right Stuff
- Creating Web Applications with the Eclipse Web Tools Project
- Eclipse Special: Remote Debugging Tomcat & JBoss Apps with Eclipse
- The Next Programming Models, RIAs and Composite Applications
- Where Are RIA Technologies Headed in 2008?
- SYS-CON Webcast: Eclipse IDE for Students, Useful Eclipse Tips & Tricks
- How to Bring Eclipse 3.1, J2SE 5.0, and Tomcat 5.0 Together
- Eclipse: The Story of Web Tools Platform 0.7
- "Eclipse 3.0 is a Great Leap Forward," Says JDJ's Dudney
- The Top 250 Players in the Cloud Computing Ecosystem
- Developing an Eclipse BIRT Report Item Extension






















