During last night’s maintenance window our Packet-Voice Gateways (PVGs) were upgraded in preparation for moving the entire cellular switch to the latest software release, MTX14. Preparations have been underway for several weeks, leading up to what Nortel calls the One Night Process (ONP). The preparations include component-by-component changes to hardware, software and firmware. Mostly this has been pretty uneventful, save for the issues caused by our being one of the verification offices (VOs) for Nortel’s then-new Packet MSC. Nortel has dealt with issues as they arose, and the upgrade has only slipped a day or two.
The PVG preparation work had been completed on schedule during the maintenance window by our Nortel installer and a member of the Network Operations staff. The summary of the night’s activities had already been sent, network traffic was starting to pick up, and the day shift was sipping the day’s first latté as the night shift left to go home.
The day started as so many do: an inbox full of emails, a couple of voice mails, yesterday’s undone to-do list, the routine “I need this now” action item from the PHB, and of course the entire Web to be read. Around 09:15 the team was up to their ears in all that when the switch started dropping alarms fast. By the time the trunk maintenance screen could be brought up on the computer, it was already showing 61 trunk groups in critical alarm. One of the engineers began sorting through the alarms, trying to find a pattern in the failures. Another engineer fielded phone calls from the Network Operations guys and informed Customer Care of the service-affecting event in progress. I took a call from our handset guy at one of the retail stores saying the over-the-air (OTA) service had stopped working.
Nortel Technical Support (TAS) was quickly engaged on the phone and began sifting through the logs looking for the cause of the failure. Meanwhile my engineering manager had pretty much figured it out. He found that all the failing trunks were homed on one of the PVGs that had been upgraded during the previous maintenance window. Nortel followed his line of reasoning and started rolling back the upgrade on that PVG. As soon as they had switched activity to the rolled-back controller, the trunks began to come back into service. It was about 35 minutes into the event. Once service was restored and the alarms cleared, Nortel really began digging in to determine the root cause. It took them a couple of hours, but eventually they got it.
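The correlation my engineering manager made, asking which gateway the failing trunks had in common, can be sketched in a few lines of Python. This is only an illustration: the trunk group names, the gateway labels, the `TRUNK_HOME` table and the `suspect_gateways` helper are all hypothetical and do not come from the switch or from any Nortel tool.

```python
from collections import Counter

# Hypothetical inventory: trunk group -> the gateway it is homed on.
# On a real switch this mapping would come from provisioning data.
TRUNK_HOME = {
    "TG_LOCAL_1": "PVG1",
    "TG_LOCAL_2": "PVG1",
    "TG_911_A": "PVG2",
    "TG_LD_1": "PVG2",
    "TG_OTA_1": "PVG2",
}


def suspect_gateways(alarmed, trunk_home):
    """Rank gateways by how many of the alarmed trunk groups are homed on them.

    Returns a list of (gateway, count, share) tuples, most affected first.
    Trunk groups missing from the inventory are ignored.
    """
    counts = Counter(trunk_home[tg] for tg in alarmed if tg in trunk_home)
    total = sum(counts.values())
    return [(gw, n, n / total) for gw, n in counts.most_common()]


if __name__ == "__main__":
    # Hypothetical list of trunk groups currently in critical alarm.
    alarmed = ["TG_911_A", "TG_LD_1", "TG_OTA_1"]
    for gateway, count, share in suspect_gateways(alarmed, TRUNK_HOME):
        print(f"{gateway}: {count} alarmed trunk groups ({share:.0%})")
```

When every alarmed trunk group lands on a single gateway, as it did for us, that gateway and whatever was last changed on it become the first place to look.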
As I mentioned before, our Packet MSC was a verification office when it went into service in early 2006. At that time, the software load could not properly handle signaling for E9-1-1, so patches were applied that allowed the 911 trunks to work using multi-frequency (MF) signaling. The PVG upgrade, however, had been prepared for the standard switch load; the 911 MF trunk workaround was forgotten. The PVG upgrade had been otherwise successful, right up until someone dialed 911.