[Op1st-dc-alerts] [mghpcc-alerts] Update re: Brief cooling system problem in the MGHPCC Computer Room at 3:03PM June 15, 2023

John Goodhue jtgoodhue at mghpcc.org
Sat Jun 17 16:30:39 EDT 2023


The facility continues to operate with manual settings where needed to work around the faulty controller.  At approximately 12:35 PM today, an error condition in the chiller plant briefly increased the cold aisle temperature in the computer room.  Temperature has been restored to normal, but please let us know at help at mghpcc.org if you see a problem that needs to be addressed.

In addition, subsequent recovery actions led to a water spill at approximately 3:45 PM in a room adjacent to the Staging Room, with some water going under the wall into the Staging Room.  We are in the process of cleaning up the spill.

======prior alerts========

***June 16, 2023 10:10PM***

The hardware replacement that the Schneider technician provided was an incorrect part, and they will not be able to do any further work until after the holiday weekend.  

The facility continues to operate with manual settings where needed to work around the faulty controller.  The facility can operate in this mode, and we will have staff either on site or close by, prepared to respond if problems occur.  However, we may not be able to react as quickly as the automated controls to low-probability, high-impact events such as a utility power failure.

======prior alerts========

***June 16, 2023 2:45PM***
Subject: Update re: Brief cooling system problem in the MGHPCC Computer Room at 3:03PM June 15, 2023

Diagnosis:
The source of yesterday’s cooling system malfunction was a Building Management System (BMS) controller that incorrectly turned a valve, shutting off part of the chilled water supply to the computer room.  

Follow-up:
After the controller malfunctioned for a second time at 10PM yesterday evening (6/15), we took several actions:
   Restored the valve position within a few minutes, minimizing impact.
   Disabled the control signal from the BMS controller to the valve.
   Kept people on site who can operate the valve manually in the (unlikely) event that it
      needs to be operated.
   Brought in Schneider technicians to analyze the controller.
   Inspected relevant sections of the controller software to look for possible errors and/or
       problematic inputs.

Next Step:
The Schneider technicians have recommended replacing the CPU board in the controller.  We will be doing a short test to verify that this can be done non-disruptively, followed by the actual replacement after the new board arrives (ETA 4:30PM today).


***June 15, 2023 3:25PM***

At 3:03PM this afternoon (June 15), a cooling system malfunction affected the flow of chilled water to the MGHPCC Computer Room.  The malfunction was corrected at approximately 3:08PM, bringing water flow back to normal.

We will send an update after determining the root cause.

Apologies for any inconvenience and/or equipment alarms that this may have caused.

