[Op1st-dc-alerts] [mghpcc-alerts] Update re: Brief cooling system problem in the MGHPCC Computer Room at 3:03PM June 15, 2023

mghpcc-alerts mghpcc-alerts at mghpcc.org
Fri Jun 16 14:46:34 EDT 2023


Diagnosis:
The source of yesterday’s cooling system malfunction was a Building Management System (BMS) controller that incorrectly turned a valve, shutting off part of the chilled water supply to the computer room.  

Follow-up:
After the controller malfunctioned for a second time at 10PM yesterday evening (6/15), we took several actions:
   Restored the valve position within a few minutes, minimizing impact.
   Disabled the control signal from the BMS controller to the valve actuator.
   Kept people on site who can operate the valve manually in the unlikely event that it
      needs to be operated.
   Brought in Schneider technicians to analyze the controller.
   Inspected relevant sections of the controller software to look for possible
       errors and/or problematic inputs.

Next Step:
The Schneider technicians have recommended replacing the CPU board in the controller.  We will be doing a short test to verify that this can be done non-disruptively, followed by the actual replacement after the new board arrives (ETA 4:30PM today).

======prior alerts========

* June 15, 2023

At 3:03PM this afternoon (June 15), a cooling system malfunction affected the flow of chilled water to the MGHPCC Computer Room.  The malfunction was corrected at approximately 3:08PM, bringing water flow back to normal.

We will send an update after determining the root cause.

Apologies for any inconvenience and/or equipment alarms that this may have caused.

