Trading Off IT Service Availability and Costs

To govern an IT service, whether it is provided internally or externally, service level agreements (SLAs) are formulated between service provider and consumer. Among other information, SLAs include guarantees for quality metrics to ensure fast, safe, and secure IT service provision. One of the most crucial quality metrics for consumers is service availability.

The IT Infrastructure Library (ITIL) defines availability as “the ability of a service […] to perform its agreed function when required”; this metric is thus at “the core of customer satisfaction” for IT service providers. If service availability falls below the agreed level, the provider incurs penalty costs in the short term, while the loss of reputation may be harder to deal with in the long term. Hence it is crucial for IT service providers to ensure sufficient availability levels, of course at minimum cost.

Four principal approaches can be identified to increase availability:

  • Fault prevention means to minimize the probability of faults in the system, e.g., by using high-quality software and hardware components
  • Fault tolerance means to ensure operation even in the case of faults, normally by introducing redundancy mechanisms
  • Fault forecasting means to know beforehand that a fault will occur so that counter-measures can be taken, e.g., by combining monitoring data and data stream analytics
  • Fault removal means to minimize the time to recover in case of a fault and requires fault identification techniques as well as a set of immediately available (and ideally automatic) counter-measures

The first two of these approaches can be addressed in the service design stage. However, fault prevention has limited success because faults can never be excluded completely. Fault tolerance mechanisms are therefore a powerful, albeit expensive, way to increase availability.
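To illustrate why redundancy is powerful but expensive, the following minimal sketch computes the availability of a group of identical redundant components. It relies on simplifying assumptions that are not taken from the text above: component failures are independent, and the group is available as long as at least one component is up. The function name and the 99% example value are hypothetical.

```python
def parallel_availability(component_availability: float, n: int) -> float:
    """Availability of n identical components operated in parallel.

    Assumption (not from the article): failures are independent, so the
    group is down only if all n components are down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    if n < 1:
        raise ValueError("at least one component is required")
    return 1.0 - (1.0 - component_availability) ** n


if __name__ == "__main__":
    # Hypothetical component with 99% availability; cost grows linearly with n.
    for n in (1, 2, 3):
        print(n, parallel_availability(0.99, n))
```

Under these assumptions, each additional component removes most of the remaining downtime, while the component cost grows linearly with the number of components. This is the tension that a cost-aware redundancy design has to resolve.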

To trade off the costs and effects of redundancy mechanisms, a redundancy allocation problem (RAP) can be defined. For this combinatorial optimization problem, a service is divided into the subsystems in which redundancy can be applied. Using a combination of meta-heuristics and simulation, combinations of redundancy mechanisms can be identified for which availability and costs are nearly optimal. Following a Pareto approach, the results can be visualized as a Pareto front, which represents the trade-off between both objectives, as can be seen in the figure. On this basis, a decision maker in the service design stage can choose a suitable redundancy design from the points depicted.
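The idea of a Pareto front over redundancy designs can be sketched on a toy instance. Note that this sketch deliberately swaps in simpler techniques than the ones named above: it enumerates all designs exhaustively instead of using a meta-heuristic, and it evaluates availability with a closed-form series-parallel formula (independent failures assumed) instead of simulation. The subsystem names, availabilities, costs, and the redundancy limit are all hypothetical.

```python
from itertools import product

# Hypothetical subsystems: (availability of one component, cost of one component).
SUBSYSTEMS = {
    "web": (0.990, 2.0),
    "app": (0.980, 3.0),
    "db": (0.995, 5.0),
}
MAX_REDUNDANCY = 3  # assumed upper bound on components per subsystem


def evaluate(design):
    """Return (availability, cost) for a design, i.e. a tuple of component counts."""
    availability, cost = 1.0, 0.0
    for n, (a, c) in zip(design, SUBSYSTEMS.values()):
        # Subsystems in series; each is a parallel group with independent failures.
        availability *= 1.0 - (1.0 - a) ** n
        cost += n * c
    return availability, cost


def dominates(p, q):
    """True if p is no worse than q in both objectives and not identical to q."""
    return p[0] >= q[0] and p[1] <= q[1] and p != q


def pareto_front():
    """Return all non-dominated (score, design) pairs, sorted by increasing cost."""
    designs = product(range(1, MAX_REDUNDANCY + 1), repeat=len(SUBSYSTEMS))
    scored = [(evaluate(d), d) for d in designs]
    front = [(s, d) for s, d in scored
             if not any(dominates(t, s) for t, _ in scored)]
    return sorted(front, key=lambda item: item[0][1])


if __name__ == "__main__":
    for (availability, cost), design in pareto_front():
        print(f"design={design} cost={cost:5.1f} availability={availability:.6f}")
```

Each printed line is one non-dominated design: no other design in the toy search space is both cheaper and more available. For realistic services the design space is far too large to enumerate and availability has no simple closed form, which is why heuristic search combined with simulation is used in practice.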