
Tuesday, January 18, 2011

When poop hits the fan!!!!!





We talk about this a lot on the boards and in real life: how do we handle this sort of event? You get a call that XXXX is down and nobody has internet or phone access across several sites. Where do you begin? Do you hit the sweat button? Is this your time to take a coffee or smoke break to get mentally ready to get in the game? Some people like this part of the job more than others. I just happen to be a guy who likes dealing with outages. I get several throughout the day, ranging from “one user has no access” to “my entire company can’t make calls.” I’m going to shed some light on a method to the madness, so you can handle the call, excel, and be the envy of your peers. I also want to hear from you; you might know something better, and we can all learn something new.





I approach all outages the same way. Before I jump into gear, I ask some preliminary questions:



1. What is the problem?

2. When did it happen?

3. How many users are affected?

4. Any changes to your system recently?
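If you like keeping notes while you are on the call, the four questions above can be captured in a small record. This is only a rough sketch: the `OutageIntake` class and its field names are made up for illustration, and an empty string simply means "the caller has not answered this yet."

```python
from dataclasses import dataclass, fields
from typing import List

# The four intake questions from the list above, keyed by a hypothetical field name.
QUESTIONS = {
    "problem": "What is the problem?",
    "started": "When did it happen?",
    "users_affected": "How many users are affected?",
    "recent_changes": "Any changes to your system recently?",
}


@dataclass
class OutageIntake:
    """Hypothetical record of the caller's answers; empty string = not answered."""
    problem: str = ""
    started: str = ""
    users_affected: str = ""
    recent_changes: str = ""

    def unanswered(self) -> List[str]:
        """Return the questions that still have no answer, in list order."""
        return [
            QUESTIONS[f.name]
            for f in fields(self)
            if not getattr(self, f.name).strip()
        ]

    def as_note(self) -> str:
        """Format every question and answer as plain text for pasting into notes."""
        lines = []
        for f in fields(self):
            answer = getattr(self, f.name).strip() or "(no answer yet)"
            lines.append(f"{QUESTIONS[f.name]} {answer}")
        return "\n".join(lines)


if __name__ == "__main__":
    intake = OutageIntake(problem="No phones", users_affected="all sites")
    print(intake.as_note())
    print("Still need to ask:", intake.unanswered())
```

The point is not the code itself, just that writing the answers down (and noticing which ones you never got) gives you something to check against the logs later.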





OK, let’s be honest: most of the time nobody is going to be completely honest with you on some of these questions. I’m usually talking to another IT guy who was probably messing around and doesn’t want to come clean (there are several stories about that involving call centers and auto attendants that I will explain one day). But this gives us a place to start.

Next I get access to the network through whatever path is left. Sometimes it’s a terminal server if it’s a complete outage; sometimes I can VPN into another site. Once I’m in, I get my tools ready and fire most of them up: usually SecureCRT, Notepad++, Kiwi CatTools, Wireshark, Kiwi Syslog, and a command prompt for pings. Then I get access to whatever device seems to be the problem, or the closest I can get to it.

Now comes my initial playbook: show commands for the particular technology, plus checking the log (I’m also checking for the last login and who made changes). 75 to 95 percent of the time the problem is right there if I work my way up the OSI model. Other times debugs will be needed; that’s where the syslog server and the other tools come in.

This is also the time when people start asking questions. The usual answer of “I’m running debugs and looking for any errors” is enough to back off most people I have dealt with. Other times the VP of technology or whoever is on the phone and starts asking questions. Usually they don’t know what the hell they are talking about; one enterprise architect told me that he could reach a 192.168 network from his connection at home :) I usually tell them, “Give me a minute, I’m going to put you on hold while I gather more info.” If they become too problematic, I get someone else in to run damage control, because asking for updates every 5 minutes takes time away from solving the problem.

If I still don’t have it up after 30-45 minutes and it’s a global outage, I will reboot whatever device it is. That has a better success rate than we like to give it credit for; smart IT guys tend to think the old Raytheon reset is beneath us. Then after an hour or so it’s time to escalate to someone smarter than me. Sometimes it’s Cisco; other times it’s another engineer in the office who has worked with that technology. In the end we solve some, and others kick our arse, but it’s how we all learn.
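The timeline above (troubleshoot, reboot on a global outage, then escalate) can be written down as a tiny decision function. Treat this as a sketch under assumptions: the function name `next_step` is made up, and the hard thresholds of 30 and 60 minutes are my stand-ins for "30-45 min" and "an hour or so", which in real life are judgment calls.

```python
# Assumed cut-offs; the post only says "30-45 min" and "an hour or so".
REBOOT_AFTER_MIN = 30
ESCALATE_AFTER_MIN = 60


def next_step(minutes_elapsed: float, global_outage: bool, rebooted: bool = False) -> str:
    """Suggest the next move based on how long the outage has been worked."""
    # After roughly an hour, hand it to the vendor or another engineer.
    if minutes_elapsed >= ESCALATE_AFTER_MIN:
        return "escalate"
    # A reboot is only on the table for a global outage that has not been rebooted yet.
    if minutes_elapsed >= REBOOT_AFTER_MIN and global_outage and not rebooted:
        return "reboot"
    # Otherwise keep going with show commands, logs, and debugs.
    return "keep troubleshooting"


if __name__ == "__main__":
    for minutes in (10, 35, 75):
        print(minutes, "min ->", next_step(minutes, global_outage=True))
```

Having the cut-offs written down ahead of time makes it easier to stop digging and reboot or escalate when the clock says so, instead of burning another half hour on debugs.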


