Andrew Pollack's Blog

Technology, Family, Entertainment, Politics, and Random Noise

Data Center explosion in Houston takes down 9000 servers - what happened, and what you should consider about disaster recovery

By Andrew Pollack on 06/03/2008 at 08:10 AM EDT

Having been impacted by the outage at The Planet, I obviously followed it closely. Through many telephone calls and chats with staff there, and with people I know personally in the area who have professional relationships with The Planet's staff, here's the best take I can give you on what really happened.

What Happened?

At about 17:45 CST on a Saturday, a fire and explosion took down an entire data center in Houston, TX belonging to hosting provider "The Planet". The state-of-the-art facility manages more than 9000 servers in racks on two floors. The facility is known for extremely high reliability and high-performance network connectivity to the internet, and as such a large number of its customers ignored best practices and did not have hot standby servers in other data centers. Although the facility has standby generators capable of instantly providing enough power to sustain it indefinitely, with enough fuel on hand to run for well over a week, they were unable to use those generators because of the nature of the incident.


[Image: Transformer Explosion]


The initial incident was an explosion and fire that (obviously) required the immediate evacuation of the entire facility. Initial reports were that a transformer exploded in their power distribution room. It may in fact have been an electrical conduit explosion, and some have suggested that the initial blast ruptured a pressurized fire suppression system, which did more of the damage. It will be some time before we know for sure what the initial explosion was. We do know, however, that the force of the blast blew out three interior walls of the power distribution room, moving those walls "several feet" from their original position.

Power distribution rooms in large data centers are like the circuit breaker panel at your house, only instead of distributing 20-50 kilowatts to a dozen or so circuits, a data center can be dealing with several megawatts and hundreds or thousands of circuits. Clearly this couldn't have been an explosion of a megawatt-sized transformer, as that would have left a crater where that half of the building used to be. It could have been one of several smaller transformers, or, more likely, a conduit explosion.

The damage from the explosion destroyed the connections of nearly every circuit from the distribution room out to the racks of equipment, the cooling equipment, and everything else. Had they started the generators, they'd only have created a dangerous fire.

Even with dozens of electricians at work, and with vendors from the power company to the networking gear manufacturers already having offices on site, it was nearly 28 hours before the electrical connections could be made safe enough, and enough equipment replaced, for the fire marshal to allow them to begin restoring power. Initially, power was restored to the second floor, which houses 2/3 of the servers. We're talking about Houston in the summertime, so the area first had to be cooled to safe temperatures slowly, without cooling so quickly as to create a condensation problem. Racks could then be started in small groups, each rack drawing its maximum load as servers restarted and rack-mounted batteries (designed to carry gear through a transition to generator power) all demanded a full charge at once.
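To put some rough numbers on why the racks had to come up in waves, here's a toy Python calculation. Every figure in it is invented for illustration (these are not The Planet's numbers); the point is just that a rack's startup surge, with drives spinning up and batteries recharging, is a multiple of its steady-state draw, so only a fraction of a floor can be started at once.

```python
# All figures below are invented for illustration -- not The Planet's numbers.
FEED_CAPACITY_KW = 1200       # assumed usable capacity of the restored feed
STEADY_KW_PER_RACK = 4.0      # assumed average running draw per rack
STARTUP_KW_PER_RACK = 10.0    # assumed peak draw while servers boot and batteries recharge
TOTAL_RACKS = 200             # assumed rack count on the restored floor

def racks_startable(racks_running):
    """How many additional racks can begin their startup surge right now."""
    headroom = FEED_CAPACITY_KW - racks_running * STEADY_KW_PER_RACK
    return max(0, int(headroom // STARTUP_KW_PER_RACK))

if __name__ == "__main__":
    running, wave = 0, 1
    while running < TOTAL_RACKS and racks_startable(running) > 0:
        batch = min(racks_startable(running), TOTAL_RACKS - running)
        print(f"wave {wave}: start {batch} racks ({running + batch} of {TOTAL_RACKS} running)")
        running += batch
        wave += 1
```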

By mid-day on Monday, most of the second floor was powered up. The Planet had brought in teams of support people from their Dallas and other Houston data centers to help with the work. Many servers hadn't been restarted in months or years, and some small percentage did not come back up. Drives that had been spinning all that time no longer had motors strong enough to spin up again. Configurations that had changed since the last reboot failed to start properly. All of these issues were handled as best they could be.

The first-floor servers simply no longer had any connection to the power distribution room. Their power conduits were in the concrete floor of the building, and were no longer anywhere near the walls that used to carry their cables to the distribution room. By Monday afternoon, temporary circuits had been created, allowing huge cables from outdoor generators to run directly into the building and connect to the first-floor circuits. This is how it is currently running, and it will continue that way for at least a week while equipment is brought in and the production grid is rebuilt from scratch.

Some Lessons Reinforced --

I would say lessons learned, but anyone in our business should know these already.

1. Any data center, no matter how good, is subject to this kind of rare incident. A data center can be blown up, a plane can fall on one, a meteor could hit one, a sinkhole could swallow one, or flying monkeys could carry one away. I have little sympathy for people screaming at tech support staff about losing thousands of dollars an hour. If uptime is that critical, you should have a standby in another data center -- possibly with another vendor entirely. For me, I've already got the failover machine in place now, and am building it out and configuring it today (the first sketch after this list shows the kind of basic readiness check I have in mind). A bit late, but it can happen again. Especially since right now there is a lot of temporary patchwork in that data center and a lot of mitigation work to be done over the coming weeks.

2. Don't delay your disaster recovery plans. I got caught with my pants down, as I'd been planning a hot standby server in another data center and was months overdue setting one up. I own that failure, not The Planet. They do own their own mistake, however. The Planet had acquired another outfit and was still using the DNS setup that the older firm had used. This left them without a backup DNS system in another data center for those customers. That, combined with the customers themselves not getting their own backup DNS providers (which is free in some places), left some customers in other data centers without service. I think that is the only part of this incident where The Planet holds fault. They knew about the issue, but hadn't completed their plans to better redistribute that configuration. Like me, they'd put off what they knew they had to do and they got caught. The second sketch after this list is a quick way to check whether all of your name servers sit in one facility.

3. Don't delay implementing your disaster recovery plans in hopes that you'll be up and running before you could complete the process. If your cut-over process takes a long time, start it right away. If the primary service comes back before you finish, that's fine. In the meantime, the sooner you start, the better off you are. Often with these things, it is many hours into the incident before enough facts are known and verified to give an accurate time estimate.
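On point 1: the actual failover mechanics depend entirely on your stack, but the monitoring half is simple. Here's a minimal Python sketch, using only the standard library, that watches a primary and confirms the standby is reachable before you act. The host names, port, and thresholds are hypothetical placeholders, not anything from my setup or The Planet's.

```python
import socket
import time

# Hypothetical placeholders -- substitute your own hosts and ports.
PRIMARY = ("www.example.com", 80)       # primary server in data center A
STANDBY = ("standby.example.com", 80)   # hot standby in data center B
FAILURES_BEFORE_ACTION = 3              # consecutive failed checks before acting

def is_reachable(host_port, timeout=5.0):
    """True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection(host_port, timeout=timeout):
            return True
    except OSError:
        return False

def monitor(poll_seconds=60):
    failures = 0
    while True:
        if is_reachable(PRIMARY):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_ACTION:
                ready = is_reachable(STANDBY)
                print(f"Primary down for {failures} checks; standby reachable: {ready}")
                # The actual failover step (DNS change, load balancer update, etc.)
                # is site-specific and deliberately left out of this sketch.
        time.sleep(poll_seconds)

if __name__ == "__main__":
    monitor()
```

The script isn't the point. The point is that if the standby doesn't exist, there is nothing for a script like this to switch to.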
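On point 2: it takes about a minute to see where your name servers actually live. This sketch assumes the third-party dnspython package (pip install dnspython; older versions call the function dns.resolver.query instead of resolve). If every NS host resolves into the same facility's address space, your DNS has the same single point of failure that bit some of The Planet's customers.

```python
import dns.resolver  # third-party: dnspython

def nameserver_addresses(domain):
    """Map each NS host for the domain to the addresses it resolves to."""
    result = {}
    for ns in dns.resolver.resolve(domain, "NS"):
        ns_host = str(ns.target).rstrip(".")
        result[ns_host] = [str(a) for a in dns.resolver.resolve(ns_host, "A")]
    return result

if __name__ == "__main__":
    # Replace with your own domain; example.com is just a placeholder.
    for host, addrs in nameserver_addresses("example.com").items():
        print(host, addrs)
```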

