
Monday, December 20, 2010

Tier 1 NOC operator + Shodown not paying attention = FAIL

This post about change control was inspired by Turgon from the techexam board. A few weeks ago I had the pleasure of bringing down a few offices that share a connection across a SIP trunk. So how could a guy who likes to brag about being a network ninja be so dumb? Hmmm. What went wrong? I changed the duplex setting on a router without checking whether the guy on the other side was even making sense (we'll come back to this later). I had a customer whose incoming calls over a SIP trunk were hitting random errors. Sometimes the call would just drop. Other times it wouldn't take DTMF, and other times, if the call was being transferred to an outbound call, the VoIP gateway wouldn't complete it. Add in a furious customer who has been yelling at you for the past few days and we have a recipe for trouble. So let's get started!


SIP appeared to be working correctly. The dial-peers were correct, the SIP settings were correct, and the numbers were registered. Debugs showed nothing out of the ordinary. Scratching my head. Let's set up a syslog server and see if anything weird pops up in the debugs over the next few hours. BAM! Here it is: I see late collisions popping up, then they go away.

I'm speaking to a Tier 1 NOC operator and he suggests changing the duplex settings. I ask him what he is set at. He's at auto and I'm at auto. He suggests going to 100 full. Now, we are talking about Ethernet, right? At no time during this evolution did I think to question him about this SIP trunk, since they give us two hand-offs (one for voice and one for data). I assumed he was talking about the voice side. We go to 100 full, I'm locked out, and I go to "OH SHIT". I'm not getting any response from the router. A "reload in 10" would have saved me, but I didn't do it. Lesson learned about playing with connectivity interfaces!

He still had access to his gear, so I asked him to change the settings back to where they were (still haven't gotten to this part of the story yet). I also had to call someone at the building to go reboot the gear, since they still had phones. Nobody answered because they weren't in yet, so I assumed I might be driving out there. I finally got hold of someone at the company next door, who went and rebooted the device for me. Back to normal, and in time we made the correct changes.
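For anyone who hasn't used the "reload in 10" safety net, here is a rough sketch of what the change could have looked like on a Cisco IOS router. The interface name FastEthernet0/0 is an assumption for illustration (not the real one from that day), and the CDP check only helps if the other side is a Cisco box with CDP enabled, which a provider may well have turned off.

```
! Look at current speed/duplex and the late-collision counter first
Router# show interfaces FastEthernet0/0

! Confirm which device is really on the other end of this port
Router# show cdp neighbors FastEthernet0/0 detail

! Arm the safety net: the router reboots in 10 minutes unless cancelled
Router# reload in 10

! Make the change (running-config only, nothing saved yet)
Router# configure terminal
Router(config)# interface FastEthernet0/0
Router(config-if)# speed 100
Router(config-if)# duplex full
Router(config-if)# end

! Still connected and the link looks clean? Cancel the reboot, then save
Router# show interfaces FastEthernet0/0
Router# reload cancel
Router# copy running-config startup-config
```

If the change cuts you off, you just wait: the router reloads on its own and comes back up on the startup-config, which never had the change in it. The one thing not to do is save the config before you know the change is good.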

Now where did I go wrong? Why didn't I ask these questions beforehand? The young engineer changed the port settings on his edge Juniper router. That wasn't the device I was connected to. I was actually connected to his Cisco IAD, which is how they hand me a voice line and a data line. Why was he in his Juniper router instead of the IAD? I never got around to asking him that. Maybe he thought it was our equipment? The world will never know. In the end, don't move too fast; it can be costly. I got lucky this time. I've had buddies bring down 10,000 call center agents, black-hole an entire section of a network, reset BGP, and so on. I haven't seen anyone lose their job over these incidents, but we have all heard the stories. Happy Networking.




Update: I wrote this post a while back and never posted it. This week I was working with a TAC engineer and we brought down a call center. Hence I'm up at 11pm troubleshooting while call volume is low. Once again, this could have been avoided with some pre-planning, but I guess you learn by making mistakes.
