Some Rules for Successful Data Center Operations
There are numerous policies and practices that every data center owner or operator follows, or should be following, but in reality there are only a few rules that must be adhered to for gaining the best results.
Evaluate what you are doing and why you are doing it.
It is far better to prevent a “Lights Out” occurrence than to sit around a conference table discussing the forensics of a shutdown.
Years ago, I was doing a walk-through in a small data center in the Midwest. I saw the Emergency Power Off (EPO) on the wall near the door and I asked the facility manager why it did not have a cover or a sign indicating its purpose. I’ll never forget his response. He pointed at the familiar red mushroom button and said, “That is what we call the resume generating switch. If you punch that, you should make sure your resume is updated because your career here is over.”
We both laughed about that, but it got me thinking that this data center was likely going to have an unscheduled shutdown one of these days.
Even if the company communicated that message to each employee, there is still a chance that someone did not get the memo, or that someone heard and understood the message about the EPO but was disgruntled and looking for a new place to work. Without the cover or sign, there is always a risk someone could accidentally lean against the wall and dump the data center.
A practical solution is to determine the necessity of the EPO, based on updates to the National Electrical Code (NEC), and then consider the risks associated with the EPO and how to eliminate or reduce them. Providing a flip cover and posting a sign is only a portion of the solution. Make sure every new and existing employee understands the white space is where their paycheck is printed, and emphasize the importance of practicing common-sense procedures when working in that environment.
However, caution must be applied here. Following processes out of habit leads to complacency. Complacency may lead to disaster, which brings us to point #2.
Test your defenses.
I am not referring to IT security here; that should be covered from both the virtual and physical side. Rather, have an understanding of the vulnerabilities your data center may suffer. Are your maintenance practices and performance reviewed periodically? I have looked through maintenance logs that have been checked and dated, but curiously the handwriting all looks very similar, almost as if the technician, who is likely either bored with the paperwork or overwhelmed by other demands, just sat down and filled out a month or two worth of maintenance records. I’m not casting stones at the technicians; the demands upon their time sometimes require shortcuts, and paperwork is one of those shortcuts.
The operator should review the logs after each scheduled maintenance visit to look for trends or anomalies, whether they are generated from a CMMS or kept in a binder. For example, look over a UPS report and compare it to the previous month’s. Do the recorded voltage and current (input and output) match the UPS display? Or does the unit need recalibration? How is the battery health? Are you testing the UPS and generator together under load conditions? Entire volumes have been written regarding maintenance of critical infrastructure, but don’t just rely on a completed maintenance report; do a quality control check and look deeper at each component. A couple of airlines would agree that a few minutes spent here each month could save your data center from an unfortunate event that makes headlines and causes company stock prices to tumble in the short term.
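As a rough illustration of the quality control check described above, the comparison between a logged UPS reading and the value on the UPS display could be sketched as follows. This is only a sketch under stated assumptions: the UpsReading fields, the flag_deviations helper, the sample values, and the 2 percent tolerance are all hypothetical and are not taken from any particular UPS vendor or CMMS. The same function could also be used to compare this month's report with last month's.

```python
from dataclasses import dataclass


@dataclass
class UpsReading:
    """One set of UPS measurements (hypothetical fields for illustration)."""
    input_voltage: float
    output_voltage: float
    output_current: float


def flag_deviations(reference, observed, tolerance_pct=2.0):
    """Return the field names where `observed` differs from `reference`
    by more than `tolerance_pct` percent of the reference value.

    The 2 percent default is an assumed threshold, not a standard.
    """
    flagged = []
    for name in ("input_voltage", "output_voltage", "output_current"):
        ref = getattr(reference, name)
        obs = getattr(observed, name)
        if ref == 0:
            # Avoid dividing by zero; any nonzero reading is a mismatch.
            if obs != 0:
                flagged.append(name)
            continue
        if abs(obs - ref) / abs(ref) * 100 > tolerance_pct:
            flagged.append(name)
    return flagged


# Made-up example values: what the technician wrote down vs. the UPS display.
logged = UpsReading(input_voltage=480.0, output_voltage=208.0, output_current=120.0)
display = UpsReading(input_voltage=481.0, output_voltage=208.5, output_current=131.0)

print(flag_deviations(display, logged))
```

Any field that comes back flagged is a prompt to walk over to the unit and look, not a diagnosis in itself.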
Looking past the maintenance programs for the critical infrastructure, has the warranty information been archived? Do you know when the end of life for the systems will sneak up? One of the more difficult conversations to have with a data center manager is telling them their Computer Room Air Conditioning (CRAC) unit or UPS is approaching its final days, only to hear the response “there is no money in the budget for a new unit.” This exchange is often coupled with the fact that the unit is operating above design capacity and redundancy. In other words, running to failure.
The additional stress this creates for an operator running on borrowed time could be reduced by setting a calendar alarm for the end-of-life date minus four to five years, or whatever lead time is appropriate for your organization’s budget planning. This will allow time to build the necessary budget for a capital investment without surprising the CFO.
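The calendar arithmetic is simple enough to automate for a whole equipment list. A minimal sketch, assuming you already know each unit's end-of-life date; the budget_reminder_date name, the five-year default, and the sample date are illustrative assumptions only.

```python
from datetime import date


def budget_reminder_date(end_of_life, lead_years=5):
    """Return the date on which to start budget planning:
    the end-of-life date minus `lead_years` years (assumed default: 5)."""
    target_year = end_of_life.year - lead_years
    try:
        return end_of_life.replace(year=target_year)
    except ValueError:
        # February 29 does not exist in a non-leap target year; use the 28th.
        return end_of_life.replace(year=target_year, day=28)


# Hypothetical CRAC unit expected to reach end of life on 2035-06-30.
print(budget_reminder_date(date(2035, 6, 30)))
```

Adjust lead_years to match however far ahead your organization plans its capital budget.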
Other defense strategies are reviewing and assessing the disaster recovery (DR) plan and business continuity plan (BCP). You say you haven’t blown the dust off those documents since Y2K?
DR and BCP have a far-reaching impact outside the data center, so a comprehensive risk assessment study should not be overlooked.
The data center manager should be concerned with anything that could become a disruption.
Stacks of boxes and paper in the data center? Very common, but also a potential fire hazard or trip hazard. At a minimum, non-IT related items, including old servers and racks, impede work flow. Underfloor smoke detectors? What is the policy for lifting floor tiles? How often is the underfloor area cleaned? Lifting a floor tile could present some nasty surprises that may include setting off a fire alarm.
These are only a few of the many possible scenarios that may impact your IT operations.
A best practice for testing your defenses is Management by Walking Around (MBWA). This time-tested custom was popular back in the 1980s but appears to have roots reaching much further back, to Abraham Lincoln’s review of the Union troops during the Civil War.
It is to your advantage to get inside the white space and see, smell, touch and hear what is going on. Are there stains on the ceiling tiles? How long have they been there? Probably should open a tile and take a peek with a flashlight. Does the air smell musty? Is there water building up in the condensate pans? Do you hear the belts squealing on the CRACs? Open up the unit and check to see if they are aligned properly. Do you smell a whiff of sulphur? The UPS batteries may be telling you something. What’s the temperature in the space?
Too often we rely on email or texts and completed checklists and don’t take a hands-on approach to identifying risks in the data center environment.
Challenge Your Assumptions
One of my favorite authors, Rudyard Kipling, wrote,
“I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.”
One of the epiphanies I had early in the data center world was to know what a red herring was and how to address it. Historically a red herring was a logical device to distract an opponent during an argument. Supposedly the term comes from the practice of dragging a kipper or herring across the ground while training hunting dogs, thus throwing them off the trail.
In my particular case, there were no fish. I had met with a technician in the data center who was very upset that the UPS failed every time he was in the room. I was set back on my heels by that statement. While I was collecting data to determine the cause of the loss of power in the data center, I ran into a network administrator working inside a rack. I asked him about the failures and his response was that there had been two in the past six months. I countered, “But your colleague just told me that ‘every’ time he was in the data center there was a UPS failure.” The network guy stroked his beard, looked down for a second and looked back at me. “Well, the other guy you met works at another location. He is only here a couple of times a year, so that would make sense.”
That revelation taught me to keep asking questions and challenge my own assumptions.
The first technician told me the truth. The network administrator also told me the truth, but he included details that the first observer did not possess. If I had only spoken to the first technician, I would have wasted a lot of time tracking a problem that I assumed was an ongoing, everyday occurrence. The network administrator pointed me in the right direction and I was able to determine the true cause of the shutdown. (As it turned out, it was not entirely a UPS issue, as previously understood, but a site wiring fault.)
These are just a few of the water cooler discussions a data center manager should be having with their team. Speaking of discussions, it’s never a bad idea to talk shop over coffee and donuts. Third shift employees should have some interaction with their first and second shift counterparts.
Even the best run organizations with the sharpest minds and most detailed operating procedures thrive on the tribal knowledge that each engineer or technician brings to the data center, but it’s important to capture those conversations to allow the gained knowledge to grow beyond the informal to be developed as process improvements.
If you think your data center may be at risk, contact us. We’d love to schedule a site assessment to help your site stay online.