Note: See [1] for the ITIL terms used in this entry.
Note: Although this blog refers only to Incident Management, most of it also applies to Problem Management, which is a separate process under ITIL. I treat them together to keep this blog short. Incident Management is responsible for the fix or workaround, but ultimately it is Problem Management that performs the root cause analysis for chronic Incidents and provides a permanent solution.
“The Process responsible for managing the Lifecycle of all Incidents. The primary Objective of Incident Management is to return the IT Service to Users as quickly as possible.”
An Incident is defined as any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of service. A simplified Incident Management workflow is provided in the figure below.
When an Incident is reported, the Service Desk attempts to resolve it by consulting the Known Error Database and the CMDB. If that is unsuccessful, the Incident is classified and transferred to Incident Management. Incident Management typically consists of first line support specialists who can resolve most common Incidents. When they are unable to do so, they quickly escalate the Incident to the second line support team, and the process continues until the Incident is resolved. As per its charter, Incident Management tries to find a quick resolution so that Service degradation or downtime is minimized.
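The escalation path described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Incident` class, the `KNOWN_ERRORS` table, the keyword-based `classify` function and the two support lines are all hypothetical, and the CMDB lookup is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical Known Error Database: symptom -> documented workaround.
KNOWN_ERRORS = {"printer offline": "restart the print spooler"}


@dataclass
class Incident:
    summary: str
    category: Optional[str] = None
    resolved_by: Optional[str] = None
    history: List[str] = field(default_factory=list)


def classify(incident: Incident) -> str:
    """Toy keyword classification; a real Service Desk uses far richer rules."""
    return "database" if "database" in incident.summary else "general"


def handle(
    incident: Incident,
    support_lines: List[Tuple[str, Callable[[Incident], bool]]],
) -> Incident:
    """Pass the Incident to the Service Desk, then to each support line in order."""
    incident.history.append("service desk")
    if incident.summary in KNOWN_ERRORS:
        incident.resolved_by = "service desk"
        return incident

    # Not a known error: classify and hand over to Incident Management.
    incident.category = classify(incident)
    for name, can_resolve in support_lines:
        incident.history.append(name)
        if can_resolve(incident):
            incident.resolved_by = name
            break
    return incident  # resolved_by stays None if every line gave up


# Assumed capabilities of each support line, for illustration only.
support_lines = [
    ("first line", lambda inc: inc.category == "general"),
    ("second line", lambda inc: inc.category == "database"),
]

result = handle(Incident("database connection pool exhausted"), support_lines)
print(result.resolved_by, result.history)
```

Here the `history` list records every hand-off. In practice each hand-off costs time, which is part of why a long escalation chain makes Incidents expensive.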
So why is it hard?
There are several factors that make Incident Management one of the most difficult and expensive of all the ITIL processes. This is by no means an exhaustive list. Please feel free to add to it.
Complex System Architecture
Over the last 60 years, the IT industry has seen breakneck growth. IT services have evolved to meet increasingly sophisticated and complex business demands. A typical IT service today includes the following:
Hardware
- One or more servers or virtual machines
- SAN storage
- Network components
- Backup servers
Software
- Hypervisor (if virtualized)
- Operating system
- One or more databases
- One or more web servers
- One or more application servers
- Load balancing servers
- Monitoring software
- Interfaces to internal and external services
The list above does not even include Business Continuity, which adds its own layers. The result is a complex architecture that is difficult to understand and manage. What’s more, the architecture is often inadequately documented, and the documentation that exists is often out of date.
Poorly architected or missing processes
In addition to inadequate documentation, many IT departments lack defined processes for managing their IT services. This leads to ad hoc and sometimes unauthorized changes, which in turn cause cascading effects.
Silo effect caused by super specialization among IT professionals
Complex architectures make super specialists necessary to manage them. This creates silos in which the specialists use jargon that is comprehensible only within their own silo. When a serious Incident is reported, it is not uncommon to find half a dozen domain experts spending valuable time on SWAT calls.
Incomplete monitoring of processes and systems
For a variety of reasons, not all of the processes and systems that belong to an IT Service are monitored. Cost and resource constraints often leave no alternative, but the result is blind spots. An unmonitored Incident in one stack may cause an unpredictable Incident in another, and the second Incident may take a long time to diagnose because no one is aware of the first.
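As a rough illustration of how blind spots hide in a dependency chain, the sketch below walks a small dependency map and lists every component that a service relies on but that nobody monitors. The component names come from the list earlier in this entry. The dependency edges, the `MONITORED` set and the `blind_spots` function are assumptions made up for the example.

```python
# Hypothetical dependency map: component -> components it relies on.
DEPENDS_ON = {
    "web server": ["application server", "load balancer"],
    "application server": ["database"],
    "database": ["SAN storage"],
    "load balancer": [],
    "SAN storage": [],
}

# Assumed monitoring coverage, for illustration only.
MONITORED = {"web server", "application server", "database"}


def blind_spots(component, depends_on, monitored):
    """Return the unmonitored components reachable from `component`."""
    seen = set()
    stack = [component]
    unmonitored = []
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against shared or circular dependencies
        seen.add(current)
        if current not in monitored:
            unmonitored.append(current)
        stack.extend(depends_on.get(current, []))
    return sorted(unmonitored)


spots = blind_spots("web server", DEPENDS_ON, MONITORED)
print(spots)
```

In this made-up setup, a fault in the storage layer would surface only as a database or web server Incident, which is exactly the slow-to-diagnose case described above.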
Lessons learned do not propagate
Even though domain experts may have excellent troubleshooting skills, they often lack the tools to spread what they learned once a difficult Incident has been resolved. Search engines have reduced this problem somewhat by providing tag-based searches. However, complex Incidents that have multiple or cascading root causes cannot easily be captured in a community knowledge base. The result is frequent reinventing of the wheel.
Missing or unclear context in exception handling
IT hardware and software are often developed in an environment that is far removed from the ecosystems where they eventually end up. When exceptions do occur, the exception handlers usually do not understand the context and therefore do not provide a comprehensible explanation.
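One way to narrow this gap is for the software to attach operational context when it re-raises a low-level error. The Python sketch below shows the idea. The `ServiceContextError` class, the `load_price` function, and the service, component and hint strings are all hypothetical. Only the `raise ... from` chaining is standard Python.

```python
class ServiceContextError(Exception):
    """Hypothetical wrapper that carries operational context with the original error."""

    def __init__(self, message, service, component, hint):
        super().__init__(f"[{service}/{component}] {message} (hint: {hint})")
        self.service = service
        self.component = component
        self.hint = hint


def load_price(catalog, sku):
    """Look up a price, translating a bare KeyError into something an operator can act on."""
    try:
        return catalog[sku]
    except KeyError as exc:
        # The service, component and hint values are assumptions for this example.
        raise ServiceContextError(
            f"no price found for SKU {sku!r}",
            service="order-entry",
            component="pricing-cache",
            hint="check whether the nightly catalog sync completed",
        ) from exc


try:
    load_price({"A-100": 9.99}, "B-200")
except ServiceContextError as err:
    message = str(err)
    cause = err.__cause__  # the original low-level exception is preserved
    print(message)
    print("caused by:", repr(cause))
```

Without the wrapper, the support specialist would see only a bare `KeyError`, which says nothing about which service is affected or where to look next.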
There are many other reasons why Incident Management remains hard. There is a tendency to throw resources at Incidents when the underlying cause is poorly architected software, infrastructure or business processes. Not enough attention is paid to training IT professionals in troubleshooting, which remains an art form. Finally, trained professionals are getting more and more expensive to hire while IT budgets are shrinking.
Better automation and autonomics provide some relief for Incident Management, but that is a topic for another blog.
[1] www.itsmfi.org/files/itSMF_ITILV3_Intro_Overview.pdf
