The duty shift service of the data center operations unit, together with customer service, is *the* main indicator by which the true quality of a data center is judged. And for good reason: whether the Service Level Agreement (SLA) stated in the contract is met depends entirely on the training and efficiency of the engineers and on the quality of their interaction with the ‘life systems’ of the data center. This is a major reputation factor for almost any self-respecting data center operator.
We talked to Maksim Malyutin, head of the duty shift service at IXcellerate, who described in detail his colleagues' daily routine, the procedures they follow in case of incidents and how their shifts are organized.
Maksim, what exactly does your service do?
The duty shift service is responsible for the operation of the data center’s engineering equipment. We are the first to respond to any emergency and we exercise full control over the engineering systems, as well as over the actions of colleagues from the service team and contractors. If there is the slightest change in how the equipment is working, or – God forbid – a breakdown occurs, we are the first to find out and take the initial actions. This is a job with a great deal of responsibility.
Where does customer service fit into this equation? At what stage do your activities intersect?
In the early years of IXcellerate, when there was only one data center and not much equipment, there was one single service. Over time, our site has expanded significantly, and the number of data centers, racks, cooling equipment, transformer substations, distribution networks, etc. has increased. As this happened, the company’s management decided to split the service in two. Customer service interacts directly with our valued clients and supports them with a variety of tasks and requests, from equipment unloading to the Remote Hands package. The duty shift service oversees the operation of all the data center’s engineering equipment and carries out maintenance checks.
Basically, we monitor everything related to customer equipment, including temperature and humidity, power and cooling, connections between racks and other parameters. But if we need to inform customers, we do that only through the customer service department.
Let me give you an example. Say a rise in temperature is detected near a customer rack. The duty engineers see this in the monitoring system and send an employee to inspect the equipment and determine the cause. Upon inspection, it turns out that a client employee who had been working with the equipment forgot to install plugs on the cold corridor side, which caused the temperature near the rack to rise. We immediately inform customer service about the incident and take the necessary actions to stabilize the temperature. Meanwhile, the customer desk interacts with the client and logs the situation through the client portal. The two processes always run in parallel, which makes tackling incidents very efficient.
What mode does your service run in?
Since our data centers operate 24/7 all year round, duty engineers monitor the equipment in 24-hour shifts, with three days off between shifts. The entire work schedule is approved a month in advance.
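One 24-hour shift followed by three days off amounts to a four-day cycle, so continuous coverage would need four rotating crews. The sketch below illustrates how a month's schedule could be laid out on that basis. The crew names, the `first_day` anchor date and both function names are assumptions made for illustration, not IXcellerate's actual roster or tooling.

```python
from datetime import date, timedelta

# Hypothetical crew names: one day on plus three days off implies four crews.
CREWS = ["A", "B", "C", "D"]


def crew_on_duty(day: date, first_day: date) -> str:
    """Return the crew whose 24-hour shift starts on `day`.

    `first_day` is an assumed anchor date on which crew "A" is on duty.
    """
    offset = (day - first_day).days
    return CREWS[offset % len(CREWS)]


def monthly_schedule(year: int, month: int, first_day: date) -> dict:
    """Map every calendar day of the given month to the crew on duty."""
    schedule = {}
    day = date(year, month, 1)
    while day.month == month:
        schedule[day] = crew_on_duty(day, first_day)
        day += timedelta(days=1)
    return schedule


if __name__ == "__main__":
    march = monthly_schedule(2024, 3, first_day=date(2024, 3, 1))
    for day, crew in list(march.items())[:5]:
        print(day.isoformat(), crew)
```

Generating the whole month in one pass mirrors the point that the schedule is fixed a month ahead rather than worked out day by day.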
Tell us about the shift schedule in more detail. What routine actions or checks do you perform and what stages is the shift divided into?
The majority of hours are spent monitoring all the engineering systems.
We monitor all the parameters of the data center: there are several hundred of them, and tens of thousands of reading points in total.
Employees arrive at work early in the morning, well in advance of their shift. Then the shift change happens. It is a whole algorithm of actions, the most important of which is handing over information about all the situations and occurrences from the previous shift. The shift lead must have a clear picture of the state of the entire infrastructure: whether there were any shutdowns, power switches, unscheduled maintenance work, load transfers, incidents and so on.
The operations service has an approved annual preventive maintenance and repair plan, in accordance with which the service team engineers perform maintenance. Colleagues give us the work order, for instance, for a planned shutdown of a precision air conditioner. They then carry out the work while the engineers on duty do the monitoring. Our task at this time is to keep an eye on the state of the client infrastructure that could be affected by the ongoing work.
We also have routine rounds. They are divided into two categories: indoor premises (including the data halls) and outdoor premises (diesel generators, cooling equipment, etc.).
What do you specifically pay attention to during these rounds?
Maksim takes out a standard tour sheet and shows it to me.
Here is a typical tour sheet for the MOS1 data center. The round is carried out four times a day. As a rule, it starts from the duty shift room, continues to the loading area, then to the fire extinguishing system, on to the client area, then to the data hall…
Suddenly, the interview is interrupted by an alarm in the monitoring system. Maksim immediately turns to his colleagues, who are watching the monitors.
– Power drop? – he asks. – There you go, a live scenario. We are observing a minor voltage drop from the city power supply. Right now one of our engineers stays on monitoring and informs customer service, while the other immediately heads off to inspect the equipment. Any situation can arise at any time, and we must always be ready for it.
What could be the reasons for such power drops? Is this a normal situation?
We are ready for absolutely any scenario. Any notification of a power drop or shutdown is not normal by definition, but we have all our algorithms worked out.
This happens sometimes, and there may be several reasons. The most common cause is a short circuit at the city power supply center. A drop can also happen during scheduled maintenance of engineering equipment, or when customers dismantle a rack, move it or rearrange some of the units.
We recently performed a controlled rack shutdown with one of our clients in the MOS2 data center. They brought their technicians along for training, so that they would be better prepared for such situations in the future.
Let’s get back to the routine rounds and checklists. Why are you doing them if all your parameters are displayed in real time in the monitoring system?
Four times a day! (Laughs.) Indeed, everything is displayed in the program, but we cannot rule out inaccuracies. Let’s say that during the round the engineer notices a slightly unusual hum coming from the equipment, while the system doesn’t show any deviations. It may be nothing, but sometimes the difference between what we see in our system and the real indicators can only be detected during a physical inspection. This way we exercise double control, which is also extremely important.
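The double control described here can be pictured as a comparison between what the monitoring system shows and what the engineer records on the tour sheet. The following is a minimal sketch under stated assumptions: the reading-point names, the `tolerance` value and the `find_discrepancies` function are hypothetical and do not describe the actual monitoring software.

```python
def find_discrepancies(monitored: dict, inspected: dict, tolerance: float = 1.0) -> list:
    """Return the reading points where the tour sheet disagrees with monitoring.

    `monitored` holds values shown by the monitoring system, `inspected` holds
    values written down during the round. A point is flagged when it is missing
    from monitoring or differs by more than `tolerance` (an assumed threshold).
    """
    flagged = []
    for point, seen in inspected.items():
        shown = monitored.get(point)
        if shown is None or abs(shown - seen) > tolerance:
            flagged.append(point)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical temperature readings in degrees Celsius.
    monitored = {"rack-12-temp": 22.0, "rack-13-temp": 23.5}
    inspected = {"rack-12-temp": 22.4, "rack-13-temp": 26.0, "rack-14-temp": 21.0}
    print(find_discrepancies(monitored, inspected))
```

A check like this only covers numeric readings. Observations such as an unusual hum still depend on the engineer being physically present, which is the point of the rounds.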
All the metrics, indicators, norms – everything is registered and backed up by policies and standards. All the work, with the exception of emergency situations, is planned. Everything is very strictly regulated.
And after the inspection, you go back and compare the data, right? And then what?
After the inspection tour is completed, the on-duty engineers continue to monitor the equipment and the work carried out by colleagues and contractors. Somewhere around 6 PM the activities start to wrap up, and some get moved to the following days. Once the work is complete, we need to make sure that the entire infrastructure is back in normal mode. After that comes evening and overnight duty.
What are the most active hours?
When the service team engineers perform equipment maintenance, which usually happens during the day: power switching, load changes. Customer employees interact with racks and sometimes work in the cold corridors, blocking the flow of air from under the raised floor, so we take appropriate actions, and so on. A lot of activity! This is the busiest and most risk-prone time.
What happens at the end of your shift?
At the end of the shift, we hand over the operational logs with recorded events, the schedule and the inspection plan to the next team. Each shift must be aware of all the events that happened over the last two weeks. People go on vacation and some get sick, so when a duty shift engineer takes over, he must have a clear understanding of what happened before him.
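The two-week handover window can be illustrated with a short sketch that pulls the relevant entries out of an operational log. The `LogEvent` structure, the `handover_events` function and the sample entries are assumptions for illustration. The interview does not say how the logs are actually stored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEvent:
    """A hypothetical operational log entry."""
    timestamp: datetime
    description: str


def handover_events(log, now, window=timedelta(days=14)):
    """Return events from the last `window` (two weeks by default), oldest first."""
    cutoff = now - window
    recent = [event for event in log if cutoff <= event.timestamp <= now]
    return sorted(recent, key=lambda event: event.timestamp)


if __name__ == "__main__":
    # Illustrative entries only, not real incidents.
    log = [
        LogEvent(datetime(2024, 3, 14, 16, 30), "Minor voltage drop from city supply"),
        LogEvent(datetime(2024, 2, 20, 9, 0), "Planned air conditioner shutdown"),
        LogEvent(datetime(2024, 3, 2, 11, 15), "Temperature rise near customer rack"),
    ]
    for event in handover_events(log, now=datetime(2024, 3, 15, 8, 0)):
        print(event.timestamp.isoformat(), event.description)
```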
Does it mean that each of you has an entire map of events of the last two weeks in their heads?
Incredible! Thank you very much and good luck!