Thursday, August 23, 2018

Troubleshooting 101


Think of yourself as a doctor, but for computers.  Start with "DO NO HARM" as your credo.  Don't make things worse: take snapshots, GO SLOWLY, think before taking any action, and ask for a double check.
There are two basic approaches to troubleshooting: the stab-in-the-dark approach and the systematic approach. The stab-in-the-dark approach usually involves little knowledge of the technology involved and is completely random in nature. The systematic approach, on the other hand, proceeds step by step and requires in-depth knowledge of the technology.
1) When did it start? (almost always change related, planned or unplanned)
     Find an error message, then look for its first occurrence in the logs to pin down the starting time.
2) Isolate, isolate, isolate.
  How can I split this complex problem into several smaller problems?  Say packets go from A to Z, but don't arrive.
First divide the problem in half: check if the packet makes it from A to M. If it does, then check M-Z.
If you see it didn't make it from M to Z, halve it again: check M-T, then T-Z, and keep dividing in half until you find the break.
3) The WORST problems to troubleshoot are always two separate things that aggravate each other.
Sometimes you have one problem that, due to redundancy or other reasons, you don't even KNOW you have had for months.
Then another thing breaks, and suddenly you have a bizarre scenario that just doesn't add up.
4) Check the health of EVERYTHING
Log into switches and servers (consoles, people!). Often errors don't show up in logs, but you'll see them sitting right in front of you.
5) Get creative. Approach the problem from different angles and ask for help; a second point of view or skillset can really help.   Go play foosball, or step back for 20 minutes and refresh your mind.
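
The divide-in-half idea from step 2 can be written down as a small binary search. This is only a rough sketch, not a real network tool: the function name first_failing_hop and the reaches() callback are made up for illustration, and reaches() stands in for whatever check you actually run at a hop (a ping, a packet capture, a counter on a switch port). It also assumes there is a single break, so that once the packet is lost no later hop sees it.

```python
import string
from typing import Callable, Optional, Sequence


def first_failing_hop(
    hops: Sequence[str], reaches: Callable[[str], bool]
) -> Optional[str]:
    """Return the first hop the packet does NOT reach, or None if it reaches all.

    Hypothetical helper: assumes a single break, so every hop before the
    break is reached and no hop from the break onward is.
    """
    lo, hi = 0, len(hops)  # the first failing index lies somewhere in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if reaches(hops[mid]):
            lo = mid + 1   # packet got this far, so the break is further along
        else:
            hi = mid       # packet is already lost here, so look earlier
    return hops[lo] if lo < len(hops) else None


# Example: hops A..Z, with a simulated break just before T.
hops = list(string.ascii_uppercase)
checked = []

def simulated_check(hop: str) -> bool:
    checked.append(hop)    # record which hops we actually had to look at
    return hop < "T"       # pretend the packet reaches every hop before T

print(first_failing_hop(hops, simulated_check))  # T
print(len(checked), "hops checked out of", len(hops))
```

With 26 hops this takes about five checks instead of up to 26, which is the whole point of halving. If two faults are interacting (see step 3), the single-break assumption no longer holds and the result can be misleading.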

More Advice:
Look for workarounds, or multiple paths to restore service.
If you have a known method to restore service, but it may take hours or days, start it and keep looking for a faster fix, working both paths in parallel.