Why is Incident Management hard?

July 18, 2010

Note: See [1] for the ITIL terms used in this entry.

Note: Although I refer only to Incident Management in this entry, much of it also applies to Problem Management, even though that is a separate process under ITIL; I am combining them to keep this entry short. While Incident Management is responsible for the fix or workaround, it is ultimately Problem Management that performs the root cause analysis for chronic Incidents and provides a permanent solution.

ITIL defines Incident Management as follows:

“The Process responsible for managing the Lifecycle of all Incidents. The primary Objective of Incident Management is to return the IT Service to Users as quickly as possible.”

An Incident is defined as any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of service. A simplified Incident Management workflow is provided in the figure below.

Figure: A simplified version of Incident Management

When an Incident is reported to the Service Desk, the Service Desk attempts to resolve it by consulting the Known Error Database and the CMDB. If it is unsuccessful, the Incident is classified and transferred to Incident Management. Incident Management typically consists of first line support specialists who can resolve most of the common Incidents. When they are unable to do so, they quickly escalate it to the second line support team, and the process continues until the Incident is resolved. As per its charter, Incident Management tries to find a quick resolution to the Incident so that Service degradation or downtime is minimized.
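To make this concrete, here is a minimal Python sketch of the escalation flow described above. The known-error entries, the tier names, and the try_tier callback are hypothetical stand-ins for illustration only, not part of any ITIL tool.

```python
# Minimal sketch of the escalation flow; the known-error entries and
# support tiers below are illustrative stand-ins, not ITIL artifacts.

KNOWN_ERROR_DB = {
    "disk full on /var": "rotate and compress old logs",
    "stale NFS handle": "remount the export on the affected host",
}

SUPPORT_TIERS = ["first line", "second line", "third line"]


def resolve_incident(symptom, try_tier):
    """try_tier(tier, symptom) returns a workaround string or None."""
    # 1. The Service Desk consults the Known Error Database first.
    workaround = KNOWN_ERROR_DB.get(symptom)
    if workaround:
        return f"resolved from KEDB: {workaround}"

    # 2. Otherwise the Incident is classified and escalated tier by tier.
    for tier in SUPPORT_TIERS:
        workaround = try_tier(tier, symptom)
        if workaround:
            return f"resolved by {tier}: {workaround}"

    # 3. Chronic, unresolved Incidents are handed to Problem Management
    #    for root cause analysis.
    return "escalated to Problem Management"
```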

So why is it hard?

There are several factors that make Incident Management one of the most difficult and expensive of all the ITIL processes. This is by no means an exhaustive list, so please feel free to add to it.

Complex System Architecture

Over the last 60 years, the IT industry has seen breakneck growth. IT services have evolved to meet increasingly sophisticated and complex business demands. A typical IT service today includes the following:

Hardware

  • One or more servers or virtual machines
  • SAN storage
  • Network components
  • Backup servers

Software

  • Hypervisor (if virtualized)
  • Operating system
  • One or more databases
  • One or more web servers
  • One or more application servers
  • Load balancing servers
  • Monitoring software
  • Interfaces to internal and external services

The above does not even include Business Continuity, which adds its own layers. The result is a complex architecture that is difficult to understand and manage. What’s more, the architecture is often inadequately documented and out of date.

Poorly architected or missing processes

In addition to inadequate documentation, many IT departments do not have well-defined processes to manage their IT services. This leads to ad hoc and sometimes unauthorized changes with cascading effects.

Silo effect caused by super specialization among IT professionals

Complex architectures require super specialists to manage them. This creates silos in which the super specialists operate with jargon that is comprehensible within the silo but not elsewhere. When serious Incidents are reported, it is not uncommon to find half a dozen domain experts spending valuable time on SWAT calls.

Incomplete monitoring of processes and systems

For a variety of reasons, not all of the processes and systems that belong to an IT Service are monitored. Cost and resource constraints make this hard to avoid, but it results in blind spots. An unmonitored failure in one part of the stack may cause an unpredictable Incident in another, and the latter can take a long time to diagnose because no one is aware of the original failure.

Lessons learned do not propagate

Even though domain experts may have excellent troubleshooting skills, once a difficult Incident has been resolved they often lack the tools to spread that knowledge. Search engines have reduced this problem somewhat by providing tag-based searches, but complex Incidents with multiple or cascading root causes cannot easily be captured in a community knowledge base. The result is frequent reinventing of the wheel.

Missing or unclear context in exception handling

IT hardware and software are often developed in environments far removed from the ecosystems where they eventually end up. When exceptions do occur, the exception handlers usually lack the context and therefore cannot provide a comprehensible explanation.

There are many other reasons why Incident Management remains hard. There is a tendency to throw resources at Incidents when the underlying cause is poorly architected software, infrastructure, or business processes. Not enough attention is paid to training IT professionals in troubleshooting, which remains an art form. Finally, it is getting more and more expensive to hire trained professionals even as IT budgets shrink.

Better automation and autonomics can provide some relief to Incident Management, but that is a topic for another entry.


[1] www.itsmfi.org/files/itSMF_ITILV3_Intro_Overview.pdf


Troubleshooting IT Systems

April 12, 2010

Troubleshooting is an art, according to most IT practitioners. The reasons are not far to seek: there are no formal books on troubleshooting; it is not taught as a subject in schools; what’s more, troubleshooting is an afterthought, shunted off to an insignificant appendix in most administration manuals.

If one thinks about it, troubleshooting is not that hard. Everybody, including children, does troubleshooting all the time. For example: if it is dark in the room, the light must be off; if my foot is stinging, there must be an ant biting it. We do this kind of reasoning without thinking every day. Troubleshooting becomes hard in certain contexts primarily because of complex interrelationships between subsystems and incomplete information.

The Greeks thought about one line of reasoning more than two thousand years ago and gave it a fancy name, modus ponendo ponens, or modus ponens. In plain English, it means “affirming the antecedent”. OK, I know, so let me illustrate it with an example:

If it rained today then the roads must be wet.

This statement consists of two parts: “it rained today” (the antecedent) and “the roads must be wet” (the consequent). So when someone says it rained today, thus affirming the antecedent, it follows that the consequent must be true, that is, the roads must be wet. Symbolically, modus ponens is written in the following manner:

P → Q
P
∴ Q

Modus ponens is a simple but powerful concept, and there are two ways to use it. Start with a known incident (or antecedent) and arrive at a certain conclusion (or consequent); this is called forward chaining. Conversely, given a conclusion, find the matching preconditions or antecedents; this is called backward chaining. There can be more than one condition that matches a given conclusion. For example, if we see that the roads are wet, it probably rained today or recently; however, it is also possible that the road was washed by the cleaning crew.
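To make the two directions concrete, here is a small Python sketch of forward and backward chaining over a handful of modus ponens style rules. The rules are just the wet-road examples from this post plus one invented one; nothing here is a real inference engine.

```python
# Tiny rule base of the form antecedent -> consequent (modus ponens).
RULES = [
    ("it rained today", "the roads are wet"),
    ("the cleaning crew washed the road", "the roads are wet"),
    ("the roads are wet", "driving is slippery"),   # invented extra rule
]


def forward_chain(facts):
    """Start from known facts and derive every consequent that follows."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in RULES:
            if antecedent in derived and consequent not in derived:
                derived.add(consequent)
                changed = True
    return derived


def backward_chain(observation):
    """Given an observed conclusion, list antecedents that could explain it."""
    return [a for a, c in RULES if c == observation]


print(forward_chain({"it rained today"}))
# includes "the roads are wet" and "driving is slippery"
print(backward_chain("the roads are wet"))
# ['it rained today', 'the cleaning crew washed the road']
```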

Within the Problem Management discipline of ITIL, rapid problem resolution (RPR) is used for resolving problems. The method was developed by Advance7 in the 1990s and incorporated into V3 of ITIL in 2007. It consists of two components: a core process and supporting techniques. The core process involves the following steps (from Wikipedia):

  • Discover
    • Gather & review existing information
    • Reach an agreed understanding
  • Investigate
    • Create & execute a diagnostic data capture plan
    • Analyse the results & iterate if necessary
    • Identify Root Cause
  • Fix
    • Translate diagnostic data
    • Determine & implement fix
    • Confirm Root Cause addressed

The supporting techniques explain the above in more detail.

So in my opinion, troubleshooting IT systems as a discipline has a long way to go. It presents many challenges and therefore many exciting opportunities as well.

(Cloud) Computing as a utility

March 11, 2010

In the developed world, computing is ubiquitous. It is so common now that it is on par with electricity, cable TV, telephone service, or tap water. It is hard to imagine life without computers. Today we still have manual fallbacks, such as an operator on an 800 number; in due course, I see such options disappearing entirely.

Computing is likely to become so pervasive that eventually the entire world will be connected and practically every device and tool we use will have a computer in it. This appears to be the theme behind the “Smarter Planet” initiative by IBM. Business analytics will make better use of the glut of information already generated by the huge number of electronic gadgets. Easy availability of such tools will further increase the demand for more such data, and this feedback loop will eventually culminate in a completely networked world.

What happens to the service providers when that happens? There are two scenarios I can see unfolding. Cloud computing is already cheap, and prices will continue to drop as the costs of hardware and networking keep dropping. This will put severe pressure on the margins that the providers enjoy. Ultimately it will become impossible for all but a handful of companies worldwide to provide cloud computing on any profitable basis. This will lead to monopolies and oligopolies, which are not good for consumers. Governments worldwide will move to correct this situation.

Slowly but surely, public/private ownership models will begin to emerge, as we have seen in the utility (water, electricity) sector. Of course the pace will depend upon local politics and market conditions. The same debates that we see today on education, healthcare, and so on, we will see in computing as well. Can the government do a better job, or the private sector? Who is more efficient? History repeats itself.

Manageability by Design – A Definition

November 20, 2009

My friend Iggy Fernandez of Database Specialists and I coined the term “Manageable by Design (MBD)” in the cloud computing context. We define MBD as follows:

“An IT System is Manageable By Design (MBD) if it uses Standards, Instrumentation, Interfaces, Automation, Autonomics, and Documentation to facilitate the activities and purposes of IT Service Management (ITSM).”

As stated above, we have identified six objective criteria by which an IT System becomes manageable:

  • Standards
  • Instrumentation
  • Interfaces
  • Automation
  • Autonomics
  • Documentation

Depending upon the complexity of the IT System, some or all of the above may be required to make it manageable. An IT System can also be augmented to make it MBD by the vendor or by a third party. In a cloud computing environment, IT Systems are deployed in an ecosystem which is in a state of continuous flux. This dynamic nature coupled with security concerns makes it imperative that the cloud-deployed IT Systems are manageable.

Standards

Whether they are deployed traditionally or in a cloud, IT Systems need to interact with many other IT Systems for operational purposes: operating systems, backup systems, monitoring systems, security sentinels, discovery tools, performance analysis and tuning systems, and so on. If standards are created for each of these interactions, the total cost of ownership of IT infrastructure can be brought down substantially. Manageable IT Systems need to implement Instrumentation to provide access to internal parameters, metrics, and diagnostics. They need to implement well-defined and commonly accepted Interfaces so that they can be operated upon in a uniform manner. Vendors or third parties need to provide standardized Automation tools around these IT Systems so that the labor cost of operation is reduced. Finally, Autonomics should be incorporated to reduce downtime and the need for expert human intervention. Standards lay the foundation for the implementation of all of these disciplines.

As an example of such a standard, consider the RPM Package Manager utility on Linux [1]. RPM defines a standard for packaging software components for installation and upgrade. Applications packaged in the RPM format can be installed, upgraded, and uninstalled using the same rpm utility, which allows a system administrator to automate installations and upgrades and otherwise maintain IT Systems more efficiently.
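For instance, here is a rough sketch of how an administrator might script around the rpm utility from Python. The package and file names are placeholders, and installs normally require root; only the standard rpm -q (query) and rpm -Uvh (upgrade-or-install) invocations are used.

```python
import subprocess


def is_installed(package):
    """Query the RPM database; rpm -q returns non-zero if the package is absent."""
    return subprocess.run(
        ["rpm", "-q", package],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0


def install_or_upgrade(rpm_file):
    """Install or upgrade from a local .rpm file; -U upgrades or installs as needed.
    Normally this must be run as root."""
    subprocess.run(["rpm", "-Uvh", rpm_file], check=True)


# Example: roll a package out to this host only if it is missing.
# (Package and file names below are placeholders.)
if not is_installed("examplepkg"):
    install_or_upgrade("/tmp/examplepkg-1.0-1.x86_64.rpm")
```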

Instrumentation

Instrumentation refers to an inherent or extended ability of an IT System to monitor and report its internal parameters, metrics, and diagnostics. An IT System is described by its architecture and by numerous parameters, which are stored in configuration profiles or in the memory of its processes. IT Systems also contain transient state information about the transactions they are processing, and they generate diagnostic information in the form of error logs and trace files. Instrumentation captures or measures this information and uses Interfaces to deliver it to consumers. As an example of great Instrumentation, consider Oracle Corporation, which has implemented comprehensive Instrumentation within its flagship RDBMS product that can be used for operational activities.
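As a rough sketch of the idea, and not any vendor’s actual API, an instrumented component might expose its configuration, metrics, and recent diagnostics through a single machine-readable call. The class and field names below are hypothetical.

```python
import json
import time


class InstrumentedService:
    """Hypothetical sketch: a service that reports its own parameters,
    metrics, and diagnostics through a uniform, machine-readable channel."""

    def __init__(self, name, config):
        self.name = name
        self.config = config           # static configuration parameters
        self.requests_served = 0       # transient state / metrics
        self.errors = []               # diagnostic log entries

    def handle_request(self, ok=True):
        self.requests_served += 1
        if not ok:
            self.errors.append({"ts": time.time(), "msg": "request failed"})

    def snapshot(self):
        """Return configuration, metrics, and recent diagnostics as one JSON
        document so monitoring and discovery tools can consume it uniformly."""
        return json.dumps({
            "name": self.name,
            "config": self.config,
            "metrics": {
                "requests_served": self.requests_served,
                "error_count": len(self.errors),
            },
            "recent_errors": self.errors[-5:],
        })
```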

Interfaces

An Interface is  “…the place at which independent and often unrelated systems meet and act on or communicate with each other” [2].

In the IT domain there are many examples of standard interfaces. For example, USB, FireWire, and CompactFlash are well-known standard hardware interfaces. These interfaces provide a channel for two devices to communicate with each other. In the software arena, however, interfaces are not standardized and are usually limited to the exchange of data. Standardized Interfaces should expose configuration information, internal states, error logs and trace files, and control functions for startup, shutdown, clone, backup, install, uninstall, upgrade, and patch activities. Such Interfaces make automation possible and reduce the need for a team of highly trained professionals for routine operational tasks.
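A minimal sketch of what such a standardized software Interface could look like, with the operation names taken from the list above; the class itself is hypothetical and not part of any existing standard.

```python
from abc import ABC, abstractmethod


class ManagementInterface(ABC):
    """Hypothetical standardized control surface. If every Manageable-By-Design
    system implemented these operations, automation tooling could drive any
    compliant system in the same way."""

    @abstractmethod
    def startup(self): ...

    @abstractmethod
    def shutdown(self): ...

    @abstractmethod
    def backup(self, target_path): ...

    @abstractmethod
    def clone(self, new_name): ...

    @abstractmethod
    def apply_patch(self, patch_file): ...

    @abstractmethod
    def configuration(self):
        """Return configuration parameters for CMDB and discovery tools."""
        ...
```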

Automation

Automation in the IT System context refers to the technique of making a system operate without human intervention. According to some estimates, labor costs now exceed the cost of IT Systems by an order of magnitude or more. IT personnel spend a lot of time installing, patching, cloning, and troubleshooting. Given proper Interfaces and Instrumentation, many of these tasks can be automated.

Automation is usually not a part of the IT System itself, but a collection of scripts, processes and jobs. Automation leverages existing Interfaces and Instrumentation provided by an IT System. It has the potential to substantially reduce the labor cost in an enterprise application deployment. Some of the common automation tasks are: installation, upgrades, patching, startup and shutdown routines, backups, cloning, etc.
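As a sketch of how that labor saving plays out, an automation script only stays this simple when every system exposes the same control operations, here the hypothetical shutdown/apply_patch/startup calls sketched in the Interfaces section.

```python
def patch_fleet(systems, patch_file):
    """Apply one patch across a fleet, assuming each system object exposes the
    hypothetical shutdown/apply_patch/startup operations sketched above."""
    results = {}
    for system in systems:
        name = getattr(system, "name", repr(system))
        try:
            system.shutdown()
            system.apply_patch(patch_file)
            system.startup()
            results[name] = "patched"
        except Exception as exc:        # record the failure and move on
            results[name] = f"failed: {exc}"
    return results
```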


Autonomics

IBM defines autonomic computing in this manner:

“An approach to self-managed computing systems with a minimum of human interference. The term derives from the body’s autonomic nervous system, which controls key functions without conscious awareness or involvement” [3]

Cloud computing requires that the deployed IT Systems be demand elastic, and configurations change frequently in such an environment. It is impossible for support personnel to keep track of all the configuration changes, provisioning, and deprovisioning that happen in a cloud, and Incident and Problem Management become even more difficult. Therefore such systems have to be self-aware and able to perform their tasks without frequent human intervention. IT Systems that have incorporated autonomic computing are introspective, self-reconfiguring, continually optimizing, self-healing, self-protecting, adaptive, standards compliant, and demand elastic in nature. Autonomics can be implemented within the IT System itself or externally, using the Interfaces and Instrumentation the system provides.
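A toy sketch of the self-healing aspect: a supervisor loop that watches a health probe and restarts the managed component without waiting for a human. The check_health and restart callables are supplied by the managed system; everything here is hypothetical.

```python
import time


def autonomic_supervisor(check_health, restart, interval_s=30, max_restarts=3):
    """Hypothetical self-healing loop. check_health() -> bool and restart()
    are provided by the managed system. Runs as a long-lived supervisor."""
    restarts = 0
    while restarts < max_restarts:
        if check_health():
            restarts = 0               # healthy again, reset the counter
        else:
            restart()                  # self-healing action, no human involved
            restarts += 1
        time.sleep(interval_s)
    # Repeated failures exceed what the system can heal on its own:
    # raise an Incident for human (or higher-level) intervention.
    raise RuntimeError("component keeps failing; escalate to Incident Management")
```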

Documentation

Documentation is the most obvious of the six, and all IT vendors provide some documentation with their products. However, there is no standard for documentation, and the quality is not always top notch. Documentation should be context sensitive, indexed, and cross-referenced, and it should be accessible from the Internet. The IT System should also provide context-sensitive help, where appropriate, drawn from the same documentation.

References:

1. Matt Frye, “The Story of RPM”, Red Hat Magazine (http://magazine.redhat.com/2007/02/08/the-story-of-rpm/)

2. “Interface”, Merriam-Webster Online (http://www.merriam-webster.com/dictionary/Interface)

3. Autonomic computing overview, IBM Research (http://www.research.ibm.com/autonomic/overview/faqs.html#1)

Cloud Camp in Phoenix

September 11, 2009

We are organizing a Cloud Camp in Phoenix! It is a free, all-day event, open to all cloud enthusiasts, vendors, IT consumers, and anyone else interested. Grab your seat at the following link:

http://www.cloudcamp.com/?page_id=1128

Ravi

A tragedy in the cloud

June 13, 2009

According to a news report, UK-based hosting company VAServ was the target of a hacking attack and, as a result, lost data for 100,000 web sites. This is a huge blow to the hosting services industry, especially to providers of cheap services based on virtualization.

It is not yet clear whether the attack was the result of carelessness on the part of VAServ or a vulnerability in HyperVM, a product from a company called Lxlabs. According to the Lxlabs website, “HyperVM is a multi-platform, multi-tiered, multi-server, multi-virtualization web based application that will allow you to create and manage different Virtual Machines each based on different technologies across machines and platforms.”

What’s truly tragic is that the Lxlabs founder, K. T. Ligesh, 32, committed suicide on the 8th of June. As I said earlier, it is not yet clear whether the loss of data at VAServ was due to a HyperVM vulnerability or to serious security lapses at VAServ. Someone boasted about the exploit and claimed it was carried out through simple sniffing and password guessing, not through HyperVM. If true, it just goes to show how terrible cybercrime can be.

Such incidents make it clear why enterprises will remain wary of public clouds. Earlier I blogged about public vs. private clouds. There is a market for self-service clouds like the one offered by VAServ, but for anything more than a small mom-and-pop operation, it is clearly not enough. A full-service private cloud (either internal or hosted) is the only solution. We are reaching a turning point where vendors are beginning to offer Cloud services, and it is only a matter of time before they offer to convert their clients’ entire hosted IT services to private Clouds.

Manageability in Cloud Computing

June 8, 2009

There have been many attempts to define and characterize Cloud Computing recently. NIST (National Institute of Standards and Technology) leads with a draft.

Several follow-up articles then appeared in the blogosphere here and here, and this one appeared even before the NIST draft.

What is interesting is that the NIST draft provided the following definition of Cloud Computing:

“Cloud computing is a pay-per-use model for enabling available, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is comprised of five key characteristics, three delivery models, and four deployment models. (emphasis mine)”

The draft then goes on to list the five characteristics:

  • On-demand self-service
  • Ubiquitous network access
  • Location independent resource pooling
  • Rapid elasticity
  • Pay per use

But in my opinion they missed the most important part; I have highlighted the manageability aspect in the definition above. Managing the applications, or IT Service Support, is one of the most expensive factors in the total cost of ownership (TCO). Software license costs and the cost of the supporting infrastructure form only a part of the TCO; over the lifetime of an enterprise software application, the majority of the costs are incurred in maintaining and supporting it.

Therefore any Cloud Computing Environment (CCE) should consider the manageability of the applications deployed in it. If the deployed applications are not manageable by design, the CCE will not be able to manage them autonomically, which dramatically increases the cost of support. Stated another way, applications developed for the Cloud should treat manageability as part of the design rather than as an afterthought.

Change Management in the Cloud

May 28, 2009

Change Management is an essential process of any IT department. Change Management ensures that only authorized and carefully considered Changes are implemented. There are planned Changes and there are unplanned or emergency Changes and there is a process to handle both.

Typically, an RFC (Request For Change) is raised when one of the following happens:

  1. There is a Problem that needs resolution – RFC raised by Problem Management
  2. There is a vendor-supplied patch or upgrade – RFC raised by the Operations/Infrastructure team
  3. There is a change in Architecture to address growing needs – RFC raised by Capacity Management
  4. There is an emergency which requires a quick fix – Emergency Change raised by Problem or Incident Management

In a Cloud Computing Environment, the requirements are very similar, except for (3) above. Thanks to Automated Provisioning and Virtualization, the Cloud’s promise is rapid elasticity. To ensure that requests for new resources are attended to in minutes or hours instead of weeks or months, all the ITIL processes need to modify their functioning accordingly. In the case of Change Management, this is what needs to happen:

1. All provisioning activities follow an established and approved business workflow, and the workflow is completely automated (a minimal sketch follows this list).

2. Configuration Management is automatically updated to reflect the Changes.

3. Even regular Changes need to be applied to the Images used by Automated Provisioning.

4. Change Management must keep in mind that the Cloud architecture is dynamic by definition, so yesterday’s snapshot may not be good enough for tomorrow’s Change.
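As a minimal sketch of points 1 and 2, here is how an automated provisioning step might raise a pre-approved change record and update Configuration Management in the same motion. The in-memory change log and CMDB are hypothetical stand-ins; a real environment would call the APIs of its own ITSM and CMDB tools.

```python
import datetime

# Hypothetical in-memory stand-ins for a change log and a CMDB;
# real environments would use their ITSM and CMDB tools' own APIs.
CHANGE_LOG = []
CMDB = {}


def provision_vm(name, cpu, memory_gb, approved_workflow_id):
    """Automated provisioning that raises a pre-approved change record and
    updates Configuration Management in the same step, so the CMDB never
    lags behind what is actually running in the cloud."""
    change = {
        "id": len(CHANGE_LOG) + 1,
        "type": "standard",                 # pre-approved workflow
        "workflow": approved_workflow_id,
        "summary": f"provision VM {name}",
        "timestamp": datetime.datetime.utcnow().isoformat(),
    }
    CHANGE_LOG.append(change)

    # ... call the actual provisioning layer here ...

    CMDB[name] = {
        "cpu": cpu,
        "memory_gb": memory_gb,
        "change_ref": change["id"],
        "status": "running",
    }
    return change["id"]
```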

Public clouds vs. private clouds

May 27, 2009

One of the major objections to cloud computing has been that it is not secure enough. There is some truth to this; it is not easy to secure an entire enterprise in a public cloud. Given all the apprehensions about security, privacy, and legislation, it is safe to say that the deployment of public cloud computing in large enterprises remains a distant dream.

Having said that, I believe the public cloud can greatly benefit individuals. The cost of ownership of a computer today is unnecessarily high, for all the well-known reasons. Having to pay only for the computing power, software licenses, storage, and networking bandwidth that I actually use is a very compelling proposition. I think over time people will begin to see the value in cloud computing, just as they did when utility companies began delivering electricity to homes. There are concerns that cloud computing could lead to a loss of freedom of choice, but I think those can be managed by proper legislation and by developing open cloud standards and a bill of rights (http://wiki.cloudcommunity.org/wiki/Cloud_Computing_Manifesto).

Private clouds can benefit large enterprises that invest in enormous computing power, network bandwidth, and storage. Companies like IBM are developing the tools and technology to make this happen. Private clouds address the security and privacy issues as well as the risk of the cloud hosting company going under. Given a large enough enterprise, private cloud computing can be as cost effective as public cloud computing.

Entities that can benefit from cloud computing:

  • Large enterprises
  • Defense organizations
  • Government agencies
  • NGOs

Pay as you go

May 27, 2009

I read a blog post by Dave Malcolm of Surgient on the characteristics of cloud computing. As cloud computing still remains nebulous, this kind of clarity helps everyone understand it a little better. He talks about five characteristics, which I list here:

Characteristic 1: Dynamic computing infrastructure
Characteristic 2: IT service-centric approach
Characteristic 3: Self-service based usage model
Characteristic 4: Minimally or self-managed platform
Characteristic 5: Consumption-based billing

I was particularly struck by consumption-based billing. What a great idea! When was the last time you paid for a generator installed by your utility? When was the last time you paid for the cable laid by your cable television company? And yet we continue to pay for the CPUs, the hard disks, the network interfaces, not to mention all the junk that the Microsofts, the Ciscos, the Intels, and the rest of them want to put on your PC. If you ever look at the services running, you will notice that most of them are never used. Most of the computing power we purchase is never used.

Imagine if you only had to pay for what you use. Imagine a world where you could plug in a simple device and begin to use an IT service just as you would electricity or telephone service. You pay only for the storage, processing, and network bandwidth you consume. In addition, unlike electricity or cable, you have many competing companies to choose from. And the service is available wherever you are, not just at home.

As an extension, you pay for software only when you use it. ALL computing services will be metered on a pay-as-you-go basis rather than licensed per copy with fat yearly support fees. And if you are using open source products, there is no need to pay for them at all!

I know this will greatly upset the establishment, such as Microsoft and Oracle. So be it. For too long they have ruled the IT world with outsized profits. Monopolies rule enterprise and desktop software. This cannot go on forever. The open source community has matured sufficiently that we can now do a lot of computing without buying anything from Microsoft or Oracle.