Data Center Outages: Three Decades of Lessons Learned

By: Jeremy Gilbertson

EDI has been in business for 30 years, and in that time we have seen just about everything in the data center world. Any data center expert worth their salt has been in the middle of an outage scenario, but for me, one particular experience will always stand out. The quality of a data center operator’s day can be affected by a multitude of factors; the appearance of TV news vans in the data center parking lot, however, can degrade that quality by several orders of magnitude.

Imagine reporting to one of your sites to respond to an outage, and the first people you meet in the parking lot are TV crews from the major network affiliates. I survived that day and many others that preceded and followed it. Survived and learned. The good news is that you probably won’t have to navigate those parking-lot interviews on your way into the office. My experience as a data center facility manager, responsible for a portfolio of almost thirty critical facilities and a staff of over 11 chief engineers, has yielded some hard-won lessons learned that I’d like to share with you. Hopefully, you can apply one of these takeaways to an existing challenge or problem.

Over time, technology has improved the resiliency of data centers; however, it has also introduced complications that create risk driven by human intervention. The old systems were less reliable and required more hands-on attention. Equipment and designs have improved tremendously over the years.

Today’s critical facility manager has been lulled into a kind of complacency by the reliability these newer technologies achieve. Technology and complexity, however, expose more of the system to human-intervention errors. The outage possibilities have evolved along with the systems, and, collectively, we are encountering a new set of vulnerabilities we must identify and anticipate. An old adage applies even more today than in the past: “We have met the enemy and he is us.”

To move forward, we need to look back and adopt a defense rooted in the old days, when we had to be more proactive simply because those systems were less reliable. We cannot afford to be reactive. If anything, the expectations placed on data centers today make any outage more damaging than in the era of batch processing, when attitudes were more tolerant and expectations lower.

One of the most challenging aspects of data center operations is communicating critical information effectively to senior leadership and the diverse stakeholder teams the data center supports. Existing industry tools, like the Tier methodology, are too complicated and esoteric for stakeholders without a deep understanding of critical electrical and mechanical systems. Over time, I applied my experience to develop two tools that simplify data center risk for those without a degree in electrical engineering. The RED/GREEN Single Line and the HARRM standard have helped us actively bridge the gap between IT, Facilities and the business units to drive timely and informed data center decision-making.

RED/GREEN Single Line

  • Simple color coding to define data center resiliency
  • RED = Single Points of Failure (SPOF)
  • GREEN = Redundant
  • Replaces formal electrical icons with simplified pictures

Using this format, most executives can easily recognize the UPS, diesel generator or CRAC unit on the drawing, because they have often walked past these same components during tours. We start with a RED/GREEN drawing of the systems as they exist, then develop two or three options with less RED and more GREEN, with a budget value for each upgrade. RED is BAD and GREEN is GOOD. More GREEN costs more money. With this tool, we can define risk and budget in an understandable way that drives dynamic decision-making alongside the financial impact.
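The RED/GREEN idea reduces to a simple rule: a component class is GREEN only if at least one spare exists beyond need. A minimal sketch, with hypothetical component names and unit counts (the published tool is a drawing, not software):

```python
# Illustrative sketch of the RED/GREEN classification.
# REQUIRED holds N, the units needed to carry the load;
# INSTALLED holds the units actually in place. All values are hypothetical.
REQUIRED = {"UPS": 2, "Generator": 1, "CRAC": 4}
INSTALLED = {"UPS": 2, "Generator": 2, "CRAC": 5}

def red_green(required, installed):
    """Return {component: 'RED' | 'GREEN'}.

    GREEN = at least one spare beyond need (N+1 or better).
    RED   = losing any single unit drops the system below need (a SPOF).
    """
    return {
        name: "GREEN" if installed.get(name, 0) > n else "RED"
        for name, n in required.items()
    }

# With the counts above: UPS has no spare (2 of 2 needed), so it is RED;
# the Generator and CRAC classes each carry a spare, so they are GREEN.
```

The same pass over each option drawing, re-run with the upgraded counts, shows which budget line turns which component GREEN.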

HARRM, or Hardening, Availability, Redundancy, Reliability and Maintainability, is the framework we developed to simplify data center resiliency benchmarking and planning discussions. With this framework, the current state can be identified and validated, and the gaps between current and desired future state can be quantified and remediated.

HARDENING: Site, base building and systems details designed for data center survivability in harsh environmental conditions.

Examples include:
  • Higher wind ratings in glazing and roofing materials
  • Documented recovery methods
  • Coordinated camera, access-control and physical security systems
  • Locked-ON, accessible disconnects at the condenser yard

AVAILABILITY: Systems designed to be highly available.

Examples include:
  • Anticipate and design for critical equipment access
  • Arrange for systems to be cross-supporting with minimal switching
  • Equipment specified for minimal annual “downtime”
  • Perform an availability cross-study of 5-, 10- and 20-year (wet/dry) battery options for UPS systems

REDUNDANCY: Systems designed to eliminate single points of failure (SPOF). This is the all-encompassing conceptual design basis that drives and supports all the other critical-path data center principles in the HARRM methodology. Redundancy is what the “Tier” methodology we commonly encounter represents, as shorthand for HARRM. N+1 (Need plus one) and 2N (twice Need) are representative redundancy models. For example, the A/B, or 2N, data center systems model is commonly encountered today because the IT industry recognized the value of the dual-corded system power method. Before that, single-corded IT infrastructure forced designers of high-Tier or client-driven HARRM models in critical data centers to use static switch technology to drive redundancy down to the lowest system point. Redundancy was achieved, albeit at the cost of increased capital and lowered Hardening, Availability, Reliability and Maintainability. HARRM is pursued as a balance among its components and the client’s appetite for the Risk/Loss/Investment equation.

RELIABILITY: Reliability is the simplest yet most elusive of the HARRM components. It evaluates devices both singly, as components, and collectively, as a system. Data center Reliability is the absence of DOWNTIME, and it is achieved through the long-term operational application of all the other HARRM principles: Hardening, Availability, Redundancy and Maintainability are the building blocks from which Reliability is built, and critical experience is how those blocks are applied. A data center comprises many critical equipment components, each highly complicated and manufactured by different OEMs (Original Equipment Manufacturers), so a data center can be built or expanded in an exponential variety of combinations. Examples can be cited of data centers failing due to incompatible components; systemically, experience with critical equipment combinations is essential to achieve Reliability. Reliability is also a function of data center operational experience, the ability to apply real-world experience over time within challenging environments. Uptime, predictive maintenance and mean time between failures (MTBF) all provide useful tools for evaluating equipment-level Reliability, but such information is often biased. Again, real-world data center operations experience is valuable. Experience, access to equipment technicians and an unbiased, real-world operational presence give clients useful tools to achieve systemic and component-level data center Reliability.

MAINTAINABILITY: Systems designed for concurrent preventative and predictive maintenance, and for complementary support. One of the major goals of maintainability is to eliminate the human-error element.

Examples include:
  • A simple key transfer that allows the PLC to execute all of the conditional and permissive switching sequences needed to make a complicated transfer and isolate the opposite side for critical maintenance work
  • A UPS system with multiple parallel battery strings, designed so that individual strings can be isolated for battery maintenance without losing UPS functionality
  • Correct physical positioning of critical equipment, anticipated in design, to allow access without disruption
  • CRACs that use a single shaft extended across the cabinet to drive the cage fans, positioned with enough clearance from adjacent PDUs to pull the fan shaft
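The five HARRM components above lend themselves to a benchmark record comparing current state against a desired target. A minimal sketch, assuming a 1–5 maturity score per component (the scale and scores are hypothetical; the published framework is qualitative):

```python
# Hypothetical HARRM benchmark: current vs. target scores per component.
from dataclasses import dataclass, fields

@dataclass
class HARRMScore:
    hardening: int
    availability: int
    redundancy: int
    reliability: int
    maintainability: int

def gaps(current: HARRMScore, target: HARRMScore) -> dict:
    """Components where the target exceeds the current state, with gap size."""
    return {
        f.name: getattr(target, f.name) - getattr(current, f.name)
        for f in fields(HARRMScore)
        if getattr(target, f.name) > getattr(current, f.name)
    }

current = HARRMScore(hardening=3, availability=4, redundancy=2,
                     reliability=3, maintainability=2)
target = HARRMScore(hardening=4, availability=4, redundancy=4,
                    reliability=4, maintainability=3)
# gaps(current, target) identifies redundancy (gap of 2) as the largest
# remediation item; availability is already at target and drops out.
```

Each gap then maps to a remediation option and budget line, the same way each RED item on the single line maps to an upgrade.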

These tools can be implemented in single-site or portfolio-based assessments to understand current-state capacity, resiliency, inherent risk, ability to expand, operational capabilities and other key metrics that determine the data center’s ability to support the business. Evaluations accompanied by a simple framework (like HARRM and RED/GREEN) and a solid technical process can be tremendously valuable risk management and strategic planning tools. Often, owners struggle with cost justification for these types of evaluations because their environments haven’t recently experienced an outage. Proactive planning and risk identification can be far less costly than a reactive response to an outage scenario.

A proper data center evaluation includes operational assessments in addition to the design review tools described above, RED/GREEN and HARRM. The objective is to document site operational management standards and to establish goals and practices that make preventing data center outages the first priority and recovery the second. Modern redundant designs are complicated, requiring more focus on operational policies and procedures. Every aspect of the facility is open to interpretation unless management establishes the law of the land through an Operational Program. Preparing policies and procedures for everything from general maintenance to storm preparedness, and recovery procedures for all failure scenarios, can be a daunting challenge. We find, more often than not, that in smaller facilities these operational programs simply don’t exist. If they do exist, they are covered in dust at the bottom of a rusted file cabinet in the maintenance closet. This is the equivalent of trying to remember how to get to a town you visited five years ago, in a state you visited once, in the middle of a hurricane with an air horn in your ear. The first step is developing the operational procedures; the second step is making sure the operational staff is trained and drilled on executing them.

Despite our best efforts, situations will arise that we cannot predict or prevent. We can, however, mitigate, respond and recover effectively. So, what is our best advice for avoiding a parking lot crowded with your friendly network affiliates?

  1. Lean on the hard-earned experience of data center experts
  2. Trust in simple frameworks to make informed data center decisions
  3. Don’t forget about the operational component