Troubleshooting is an art, according to most IT practitioners. The reasons are not far to seek: there are no formal books on troubleshooting; it is not taught as a subject in schools; and, what’s more, it is treated as an afterthought, shunted to an insignificant appendix in most administration manuals.
If one thinks about it, troubleshooting is not that hard. Everybody, including children, does troubleshooting all the time. For example: if it is dark in the room, the light must be off; if my foot is stinging, there must be an ant biting it. We do this kind of reasoning every day without thinking. Troubleshooting becomes hard in certain contexts primarily because of complex interrelationships between subsystems and incomplete information.
Greek philosophers studied one line of reasoning more than two thousand years ago. It has come down to us under a fancy Latin name, modus ponendo ponens, or modus ponens for short. In plain English, it is the rule of “affirming the antecedent”. That sounds abstract, so let me illustrate it with an example:
If it rained today then the roads must be wet.
This statement consists of two parts: “it rained today” (the antecedent) and “the roads must be wet” (the consequent). So when someone says it rained today, thus affirming the antecedent, it follows that the consequent must be true; that is, the roads must be wet. Symbolically, modus ponens is written as follows:
P → Q
P
∴ Q
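The symbolic form can be checked mechanically with a truth table. The following is a minimal sketch in Python; the helper names `implies` and `is_valid` are illustrative assumptions, not part of any library. It also checks the reverse direction (from P → Q and Q, conclude P) for comparison.

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    """Material implication: P -> Q is false only when P is true and Q is false."""
    return (not p) or q

def is_valid(premises, conclusion) -> bool:
    """An argument form is valid if the conclusion holds in every
    truth-table row where all of the premises hold."""
    return all(
        conclusion(p, q)
        for p, q in product([True, False], repeat=2)
        if all(premise(p, q) for premise in premises)
    )

# Modus ponens: from (P -> Q) and P, conclude Q.
modus_ponens = is_valid(
    premises=[implies, lambda p, q: p],
    conclusion=lambda p, q: q,
)

# The reverse direction: from (P -> Q) and Q, conclude P.
affirming_consequent = is_valid(
    premises=[implies, lambda p, q: q],
    conclusion=lambda p, q: p,
)

print(modus_ponens)          # True
print(affirming_consequent)  # False
```

The second result is False because of the row where P is false and Q is true: the roads can be wet without it having rained.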
Modus ponens is a simple but very powerful concept. There are two ways to use it. The first is to start with a known incident (the antecedent) and arrive at a conclusion (the consequent); this is called forward chaining. The second is to start with a conclusion and find the preconditions, or antecedents, that match it; this is called backward chaining. More than one antecedent may match a given conclusion. For example, if we see that the roads are wet, it probably rained today or recently, but it is also possible that the road was washed by the cleaning crew. Backward chaining therefore yields candidate causes that still have to be confirmed or ruled out.
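The two directions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule base `RULES` and the function names are hypothetical, and the third rule (wet roads are slippery) is invented here purely to show a second forward step.

```python
# Hypothetical rule base: each pair reads "if antecedent then consequent".
RULES = [
    ("it rained today", "the roads are wet"),
    ("the cleaning crew washed the road", "the roads are wet"),
    ("the roads are wet", "the roads are slippery"),  # invented for illustration
]

def forward_chain(facts, rules=RULES):
    """Start from known facts and apply modus ponens until nothing new follows."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            if antecedent in known and consequent not in known:
                known.add(consequent)
                changed = True
    return known

def backward_chain(observation, rules=RULES):
    """Return every antecedent whose rule would produce the observation.
    These are candidate causes, not proven ones."""
    return [antecedent for antecedent, consequent in rules
            if consequent == observation]

derived = forward_chain({"it rained today"})
candidates = backward_chain("the roads are wet")

print(sorted(derived))
print(candidates)
```

Forward chaining from "it rained today" derives both the wet and the slippery roads. Backward chaining from "the roads are wet" returns two candidates, the rain and the cleaning crew, which is exactly the ambiguity a troubleshooter then has to resolve with more information.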
Within the problem management discipline of ITIL, Rapid Problem Resolution (RPR) is used for resolving problems. The method was developed by Advance7 in the 1990s and incorporated into V3 of ITIL in 2007. It has two parts: a core process and supporting techniques. The core process involves the following steps (from Wikipedia):
- Discover
  - Gather & review existing information
  - Reach an agreed understanding
- Investigate
  - Create & execute a diagnostic data capture plan
  - Analyse the results & iterate if necessary
  - Identify Root Cause
- Fix
  - Translate diagnostic data
  - Determine & implement fix
  - Confirm Root Cause addressed
The supporting techniques explain the above in more detail.
So in my opinion, troubleshooting IT systems as a discipline has a long way to go. It presents many challenges and therefore many exciting opportunities as well.