For much of this spring semester, ITS staff have been chasing down several issues causing network instability. One of the causes was a failing fiber optic cable, which has been located and replaced.
We also identified a bug in the network operating system running on some of our core network switches. We have been working with engineers from Juniper (the company that manufactures our networking equipment) for several weeks to identify and fix the problem. This may sound like a lot of time, but the issue is complicated and had to be escalated through Juniper technical support.
Last Friday evening, February 15, ITS staff performed emergency network maintenance. The maintenance, advised by Juniper, was an upgrade of the network operating system running on the core switches. We expected a 10- to 20-minute loss of Internet connectivity during the process, nothing more. ITS posted a message on the Dash that afternoon announcing the necessary emergency maintenance, and the work started at about 7:30 p.m.
Unfortunately, the maintenance process didn’t go as planned. At about 8:30 p.m. we began experiencing unintended side effects of the upgrade. From about 9:20 to 11:20 p.m. we lost the campus network completely: wired, eduroam, and telephone service. Shortly after the start of the outage, Public Safety, in cooperation with ITS, sent a message to the campus community via LiveSafe announcing the outage. We also emailed faculty, staff, and students about the outage around 10 p.m. At about 11:30 p.m., once everything was operational again, Public Safety posted an “all clear” LiveSafe message.
Yesterday, Monday, February 18, ITS staff debriefed about Friday’s outage and resumed working with Juniper technical support to identify and validate a reliable procedure for performing the upgrade as soon as possible and without another unplanned outage. We also put in a request to have a Juniper network engineer on site to work with ITS staff during the next upgrade process.
Once we have resources lined up and have a clear idea of the procedure and how long it will take, we will announce, in advance, any maintenance windows we need for the upgrade. We’ll schedule them to affect services as little as possible (think: between midnight and 5 a.m.). By planning outages in this timeframe, we hope to avoid interrupting significant campus events.
If you have any questions or concerns, please feel free to contact me.
Joel Cooper
Chief Information Technology Officer