Posts Tagged ‘Change Management’

Why is Incident Management hard?

July 18, 2010

Note: See [1] for the ITIL terms used in this entry.

Note: Although I refer only to Incident Management in this post, the discussion also applies to Problem Management, which is a separate process under ITIL. I treat them together to keep this post short. While Incident Management is responsible for the fix or workaround, it is ultimately Problem Management that performs the root cause analysis for chronic Incidents and provides a permanent solution.

“The Process responsible for managing the Lifecycle of all Incidents. The primary Objective of Incident Management is to return the IT Service to Users as quickly as possible.”

An Incident is defined as any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to, or a reduction in, the quality of service. A simplified Incident Management workflow is provided in the figure below.

Figure: A simplified version of Incident Management

When an Incident is reported to the Service Desk, the Service Desk attempts to resolve it by consulting the Known Error Database and the CMDB. If that is unsuccessful, the Incident is classified and transferred to Incident Management. Incident Management typically consists of first line support specialists who can resolve most of the common Incidents. When they are unable to do so, they quickly escalate the Incident to the second line support team, and the process continues until the Incident is resolved. As per its charter, Incident Management tries to find a quick resolution to the Incident so that Service degradation or downtime is minimized.
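The escalation chain described above can be sketched in a few lines of Python. This is only an illustration: the `Incident` class, the `KNOWN_ERRORS` table, the sample symptoms and the three handler functions are all names I made up for this sketch, and a real Service Desk tool would be far richer.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    summary: str
    category: Optional[str] = None
    resolution: Optional[str] = None
    history: List[str] = field(default_factory=list)


# Hypothetical Known Error Database: symptom -> documented workaround.
KNOWN_ERRORS = {"disk full on app server": "purge old log files"}


def service_desk(incident: Incident) -> Optional[str]:
    """Look up the Known Error Database; classify the Incident if nothing matches."""
    incident.history.append("service desk")
    fix = KNOWN_ERRORS.get(incident.summary)
    if fix is None:
        incident.category = "needs investigation"  # placeholder classification
    return fix


def first_line(incident: Incident) -> Optional[str]:
    """First line support resolves the common Incidents it recognizes."""
    incident.history.append("first line")
    common_fixes = {"password expired": "reset password"}  # made-up example
    return common_fixes.get(incident.summary)


def second_line(incident: Incident) -> Optional[str]:
    """Second line support; assumed here to always find a fix."""
    incident.history.append("second line")
    return "specialist fix for: " + incident.summary


def handle(incident: Incident,
           levels=(service_desk, first_line, second_line)) -> Incident:
    """Pass the Incident up the chain until some level resolves it."""
    for level in levels:
        fix = level(incident)
        if fix is not None:
            incident.resolution = fix
            break
    return incident
```

Calling `handle(Incident("password expired"))` stops at the first line, while an unfamiliar symptom travels all the way to the second line. The `history` list records how far each Incident had to go, which is a rough proxy for how expensive it was.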

So why is it hard?

There are several factors that make Incident Management one of the most difficult and expensive of all the ITIL processes. This is by no means an exhaustive list. Please feel free to add to it.

Complex System Architecture

Over the last 60 years, the IT industry has seen breakneck growth. IT services have evolved to meet increasingly sophisticated and complex business demands. A typical IT service today includes the following:

Hardware

  • One or more servers or virtual machines
  • SAN storage
  • Network components
  • Backup servers

Software

  • Hypervisor (if virtualized)
  • Operating system
  • One or more databases
  • One or more web servers
  • One or more application servers
  • Load balancing servers
  • Monitoring software
  • Interfaces to internal and external services

The list above does not even include Business Continuity, which adds its own layers. The result is a complex architecture that is difficult to understand and manage. What’s more, the architecture is often inadequately documented and the documentation is not kept up to date.

Poorly architected or missing processes

In addition to inadequate documentation, many IT departments do not have processes to manage their IT services. This leads to ad-hoc and sometimes unauthorized changes, which in turn have cascading effects.

Silo effect caused by super specialization among IT professionals

Complex architectures make super specialists necessary to manage them. This creates silos in which super specialists operate with jargon that is comprehensible only within the silo. When serious incidents are reported, it is not uncommon to find half a dozen domain experts spending valuable time on SWAT calls.

Incomplete monitoring of processes and systems

For a variety of reasons, not all of the processes and systems that belong to an IT Service are monitored. While cost and resource constraints often leave no alternative, the result is blind spots. An unmonitored Incident in one stack may cause an unpredictable Incident in another, which can then take a long time to diagnose because no one is aware of the original Incident.
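One way to make blind spots visible is to compare what a service depends on with what is actually monitored. The sketch below assumes a hand-written dependency map; the component names, the `DEPENDS_ON` table and the `MONITORED` set are invented for illustration and do not come from any real CMDB.

```python
# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON = {
    "web server": ["application server", "load balancer"],
    "application server": ["database"],
    "database": ["SAN storage"],
    "load balancer": [],
    "SAN storage": [],
}

# Hypothetical set of components that have monitoring in place.
MONITORED = {"web server", "application server", "database"}


def blind_spots(component, depends_on, monitored):
    """Return the unmonitored components that `component` depends on,
    directly or indirectly, as a sorted list."""
    seen = set()
    unmonitored = set()
    stack = [component]
    while stack:
        current = stack.pop()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
                if dep not in monitored:
                    unmonitored.add(dep)
    return sorted(unmonitored)
```

With the sample data, `blind_spots("web server", DEPENDS_ON, MONITORED)` reports the load balancer and the SAN storage: an Incident in either of them would surface only as a mysterious symptom on the web server.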

Lessons learned do not propagate

Even though domain experts may have excellent troubleshooting skills, once a difficult Incident has been resolved they often lack the tools to spread the knowledge. Search engines have reduced this problem somewhat by providing tag-based searches. However, complex Incidents that have multiple or cascading root causes cannot easily be captured in a community knowledge base. The result is frequent reinventing of the wheel.

Missing or unclear context in exception handling

IT hardware and software are often developed in an environment that is far removed from the ecosystems where they eventually end up. When exceptions do occur, the exception handlers usually have no knowledge of the context and therefore do not provide a comprehensible explanation.
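A small Python sketch shows the difference that context makes. The `ServiceError` class, the `pool_size` setting and the function names are hypothetical; the point is only that the layer which knows the context should add it before the exception travels further up.

```python
class ServiceError(Exception):
    """Hypothetical exception type that carries operational context."""


def read_setting(settings, key):
    # Low-level helper: on failure it raises a bare KeyError with no context.
    return settings[key]


def load_pool_size(settings, service, config_path):
    """Read the pool size, re-raising failures with the context an operator needs."""
    try:
        return int(read_setting(settings, "pool_size"))
    except (KeyError, ValueError) as exc:
        # `raise ... from exc` keeps the original error attached as __cause__.
        raise ServiceError(
            f"{service}: cannot determine database pool size from {config_path}; "
            f"expected an integer 'pool_size' entry ({type(exc).__name__}: {exc})"
        ) from exc
```

Without the wrapper, the person on call sees only `KeyError: 'pool_size'`. With it, the message names the service, the file and the expected value, and the original exception is still available for the developer.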

There are many other reasons why Incident Management remains hard. There is a tendency to throw resources at Incidents when the underlying cause is poorly architected software, infrastructure or business processes. Not enough attention is paid to training IT professionals in troubleshooting, which remains an art form. Finally, it is getting more and more expensive to hire trained professionals while IT budgets are shrinking.

Better automation and autonomics provide some relief from Incident Management but that is a topic for another blog.


[1] www.itsmfi.org/files/itSMF_ITILV3_Intro_Overview.pdf

A tragedy in the cloud

June 13, 2009

According to a news report, UK-based hosting company VAServ was the target of a hacking attack and, as a result, lost data for 100,000 web sites. This is a huge blow to the hosting services industry, especially to those who provide cheap services based on virtualization.

It is not yet clear whether the attack was a result of carelessness on the part of VAServ or of a vulnerability in HyperVM, a product from a company called Lxlabs. According to the Lxlabs website, “HyperVM is a multi-platform, multi-tiered, multi-server, multi-virtualization web based application that will allow you to create and manage different Virtual Machines each based on different technologies across machines and platforms.”

What’s truly tragic is that the Lxlabs founder, K. T. Ligesh, 32, committed suicide on the 8th of June. As I said earlier, it is not yet clear whether the loss of data at VAServ was due to a HyperVM vulnerability or to serious security breaches at VAServ. Someone boasted about the exploit at VAServ and claimed it was done through simple sniffing and password guessing, and not through HyperVM. If true, it just goes to show how terrible cybercrime can be.

Such incidents make it clear why enterprises will remain wary of public clouds. Earlier I blogged about public vs private clouds. There is a market for self-service clouds like the one offered by VAServ, but for anything more than a small mom-and-pop operation, it is clearly not enough. A full-service private cloud (either internal or hosted) is the only solution. We are reaching a turning point where vendors are beginning to offer Cloud services, and it is a matter of time before they offer to convert the entire hosted IT services of their clients to private Clouds.

Manageability in Cloud Computing

June 8, 2009

There have been many attempts to define and characterize Cloud Computing recently. NIST (National Institute of Standards and Technology) leads with a draft.

Two follow-up articles then appeared in the blogosphere, and another one appeared before the NIST draft.

What is interesting is that the NIST draft provided the following definition of Cloud Computing:

“Cloud computing is a pay-per-use model for enabling available, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is comprised of five key characteristics, three delivery models, and four deployment models. (emphasis mine)”

The draft then went on to list the five characteristics:

  • On-demand self-service
  • Ubiquitous network access
  • Location independent resource pooling
  • Rapid elasticity
  • Pay per use

But in my opinion they missed the most important part. I have highlighted manageability in the definition above. Managing the applications, or IT Service Support, is one of the most expensive factors in the total cost of ownership (TCO). The software license costs and the cost of the infrastructure to support the software form only a part of the TCO. During the lifetime of an enterprise software application, the majority of the costs are incurred in maintaining and supporting it.

Therefore any Cloud Computing Environment (CCE) should consider the manageability of the applications deployed in it. If the deployed applications are not manageable by design, the CCE will not be able to manage them autonomically, which dramatically increases the cost of support. To put it another way, applications being developed for the Cloud should include manageability as part of the design rather than as an afterthought.
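To make "manageable by design" a little more concrete, here is a minimal Python sketch. It assumes that every application exposes two management operations, `health()` and `restart()`, so that the CCE can act on it without knowing its internals. The `Manageable` interface, the `QueueWorker` class and the `reconcile` function are illustrative names of my own, not part of any cloud platform.

```python
from typing import Dict, List, Protocol


class Manageable(Protocol):
    """Assumed management contract that every deployed application honours."""

    def health(self) -> str: ...

    def restart(self) -> None: ...


class QueueWorker:
    """Example application that was designed with the contract in mind."""

    def __init__(self, max_backlog: int) -> None:
        self.max_backlog = max_backlog
        self.backlog = 0
        self.restarts = 0

    def health(self) -> str:
        return "ok" if self.backlog <= self.max_backlog else "degraded"

    def restart(self) -> None:
        self.backlog = 0
        self.restarts += 1


def reconcile(apps: Dict[str, Manageable]) -> List[str]:
    """One pass of a hypothetical autonomic loop: restart whatever is not healthy."""
    restarted = []
    for name, app in apps.items():
        if app.health() != "ok":
            app.restart()
            restarted.append(name)
    return restarted
```

The loop needs no application-specific knowledge. An application that does not offer these operations cannot be handled this way, and that is exactly where a human, and the support cost, comes in.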

Change Management in the Cloud

May 28, 2009

Change Management is an essential process of any IT department. Change Management ensures that only authorized and carefully considered Changes are implemented. There are planned Changes and unplanned or emergency Changes, and there is a process to handle both.

Typically an RFC (Request For Change) is raised when one of the following happens:

  1. There is a Problem that needs resolution – RFC raised by Problem Management
  2. There is a vendor supplied patch or upgrade – RFC raised by Operation/Infrastructure team
  3. There is a change in Architecture to address growing needs – RFC raised by Capacity Management
  4. There is an emergency which requires a quick fix – Emergency Change raised by Problem or Incident Management

In a Cloud Computing Environment, the requirements are very similar, except in (3) above. Thanks to Automated Provisioning and Virtualization, the Cloud’s promise is rapid elasticity. To ensure that requests for new resources are attended to in minutes or hours instead of weeks or months, all the ITIL processes need to modify their functioning suitably. In the case of Change Management, this is what needs to happen:

1. All provisioning activities follow an established and approved business workflow and are completely automated.

2. Configuration Management is automatically updated to reflect the Changes.

3. Even regular Changes need to be applied to the Images used by Automated Provisioning.

4. Change Management keeps in mind that the Cloud architecture is dynamic by definition, so yesterday’s snapshot may not be good enough for tomorrow’s Change.
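The four points above can be sketched as a single provisioning step. This is a toy model under my own assumptions: `ChangeRequest`, `CMDB` and `provision` are hypothetical names, the approval is a simple flag, and "current images" is just a set of image names that already include all regular Changes.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    rfc_id: str
    image: str
    approved: bool = False


class CMDB:
    """Toy configuration database: item name -> recorded attributes."""

    def __init__(self):
        self.items = {}

    def record(self, name, image, rfc_id):
        self.items[name] = {"image": image, "rfc": rfc_id}


def provision(request, name, cmdb, current_images):
    """Provision `name` from an image, enforcing the workflow on every request."""
    # Point 1: nothing is provisioned outside the approved workflow.
    if not request.approved:
        raise PermissionError(f"{request.rfc_id} has not been approved")
    # Points 3 and 4: only images that carry the latest Changes may be used,
    # and this is checked at request time rather than against an old snapshot.
    if request.image not in current_images:
        raise ValueError(f"{request.image} is not a current image")
    # Point 2: the CMDB is updated as part of the same automated step.
    cmdb.record(name, request.image, request.rfc_id)
    return name
```

Because the CMDB update happens inside `provision`, there is no separate manual step that someone can forget, and an unapproved or stale request fails before any resource is created.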

Public clouds vs private clouds

May 27, 2009

One of the major objections to cloud computing has been that it is not secure enough. There is some truth to this, and it is not an easy matter to secure an entire enterprise in a public cloud. Given all the apprehensions about the security, privacy and legislation involved, it is safe to say that the deployment of public cloud computing in large enterprises remains a distant dream.

Having said that, I believe the public cloud can greatly benefit individuals. The cost of ownership of a computer today is unnecessarily high for all the well-known reasons. Having to pay only for the computing power, software licenses, storage and networking bandwidth that I actually use is a very compelling proposition. I think that over time people will begin to realize the value in cloud computing, just as they did when utility companies began to deliver electricity to homes. There are some concerns that cloud computing could lead to a loss of freedom to choose, but I think those can be managed by proper legislation and also by developing open cloud standards and a bill of rights (http://wiki.cloudcommunity.org/wiki/Cloud_Computing_Manifesto).

Private clouds can benefit large enterprises which invest in enormous computing power, network bandwidth and storage. Companies like IBM are developing tools and technology to make it happen. Private clouds will address the security and privacy issues as well as the risk of a cloud hosting company going under. Given a large enough enterprise, private cloud computing can be as cost effective as public cloud computing.

Entities that can benefit from cloud computing:

Large enterprises
Defense organizations
Government agencies
NGOs

Pay as you go

May 27, 2009

I read a blog post by Dave Malcolm of Surgient on the characteristics of cloud computing. As cloud computing still remains nebulous, this kind of clarity helps everyone understand it a little better. He talks about five characteristics, which I list here:

Characteristic 1: Dynamic computing infrastructure
Characteristic 2: IT service-centric approach
Characteristic 3: Self-service based usage model
Characteristic 4: Minimally or self-managed platform
Characteristic 5: Consumption-based billing

I was particularly struck by Consumption-based billing. What a great idea! When was the last time you paid for a generator installed by your utility? When was the last time you paid for the cable laid by your cable television company? And yet we continue to pay for the CPUs, the hard disks, the network interfaces. Not to mention all the junk that the Microsofts, the Ciscos, the Intels and the rest of them want to put on your PC. If you have ever looked at the services running on your PC, you will have noticed that most of them are never used. Most of the computing power we purchase is never used.

Imagine if you only needed to pay for what you use. Imagine a world where you could plug in a simple device and begin to use the IT service just as you would electricity or a telephone service. You pay only for the storage, processing and network bandwidth you use. In addition, unlike electricity or cable, you have many competing companies to choose from. This service will be available wherever you are, not just at home.

As an extension, you only pay for software when you use it. ALL computing services will be metered on a pay-as-you-go basis rather than a license per copy with fat yearly support fees. If you are using open source products, there is no need to pay for them ever!
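To see what a metered bill might look like, here is a small Python sketch. The rates and meter names in `RATES` are numbers I made up purely for the arithmetic; they are not any provider's actual prices.

```python
# Made-up unit prices, for illustration only.
RATES = {
    "cpu_hours": 0.10,          # per CPU hour
    "storage_gb_months": 0.05,  # per GB stored for a month
    "network_gb": 0.02,         # per GB transferred
}


def monthly_bill(usage, rates=RATES):
    """Sum quantity x unit price over every metered item, rounded to cents."""
    unknown = set(usage) - set(rates)
    if unknown:
        raise KeyError(f"no rate defined for: {sorted(unknown)}")
    return round(sum(rates[item] * quantity for item, quantity in usage.items()), 2)
```

With these assumed rates, a month with 40 CPU hours, 20 GB of storage and 15 GB of network traffic costs 40 x 0.10 + 20 x 0.05 + 15 x 0.02 = 5.30. A month in which the machine sits idle costs nothing for CPU, which is the whole point of paying as you go.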

I know this will greatly upset the establishment, such as Microsoft and Oracle. So be it. For too long they have ruled the IT world with outsized profits. Monopolies rule enterprise and desktop software. This cannot go on forever. The open source community has matured sufficiently that we can now do a lot of computing without buying anything from Microsoft or Oracle.

