Service level management can be the key to customer satisfaction for every type of service provider, including ASPs, ISPs, Telcos, and IT organizations. However, there has been a lack of information and advice about service level management and Service Level Agreements (SLAs), at least until now. Foundations of Service Level Management solves that problem by providing the following resources to use with service level management:
Practical tips, cautions, and notes of interest to help you take advantage of the experience of the authors and others who have implemented service level management
Templates for building effective Service Level Agreements
Guidelines to shorten the process of negotiating SLAs
Advice on developing service level management disciplines within your organization
Sample business justifications supporting service level management investments
A comprehensive list of products that can be used for service level management
FOUNDATIONS OF SERVICE LEVEL MANAGEMENT
This book can save managers time and money, as well as help them avoid the frustration of attempting to "reinvent the wheel" for service level management.
Rick Sturm has over 25 years of experience in the computer industry. He is president of Enterprise Management Associates, a leading industry analyst firm that provides strategic and tactical advice on the issues of managing computing and communications environments and the delivery of those services. He was co-chair of the IETF Working Group that developed the SNMP MIB for managing applications, and was a founder of the OpenView Forum. Rick also is a columnist for Internet Week, has published numerous articles in other leading trade publications, and is a frequent speaker at industry events.
Wayne Morris has over 20 years of experience in the computer industry. He is the vice president of
corporate marketing and a company officer of BMC Software, the leading supplier of application
service assurance solutions. He has held a variety of technical, support, sales, marketing, and
executive management positions in several companies in Australia and the United States.
Mary Jander has spent 15 years tracking information technology. She is presently a senior analyst
with Enterprise Management Associates. Prior to that, she was with Data Communications
magazine, where she covered network and systems management.
$29.99
CATEGORY: NETWORKING
COVERS: SERVICE LEVEL MANAGEMENT
Sams
www.samspublishing.com
Foundations of
Service Level
Management
Rick Sturm,
Wayne Morris,
and Mary Jander
Sams
800 E. 96th Street, Indianapolis, Indiana 46240
Trademarks
All terms mentioned in this book that are known to be
trademarks or service marks have been appropriately capitalized. Sams cannot attest to the accuracy of this information.
Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark.
Associate Publisher
Michael Stephens
Executive Editor
Tim Ryan
Acquisitions Editor
Steve Anglin
Development Editor
Songlin Qiu
Managing Editor
Lisa Wilson
Project Editor
Elizabeth Roberts
Copy Editor
Rhonda Tinch-Mize
Indexer
Kevin Kent
Proofreader
Katherin Bidwell
Professional Reviewers
Ben Bolles
Eric Goldfarb
Interior Designer
Dan Armstrong
Cover Designer
Alan Clements
Copywriter
Eric Borgert
Production
Darin Crone
About the Authors
Rick Sturm has over 25 years experience in the computer industry. He is president of Enterprise Management Associates, a leading industry analyst firm that provides strategic and tactical advice on the issues of managing computing and communications environments and the delivery of those services. He was co-chair of the IETF Working Group that developed the SNMP MIB for managing applications and was a founder of the OpenView Forum. He also is a columnist for Internet Week, has published numerous articles in other leading trade publications, and is a frequent speaker at industry events.
Wayne Morris has over 20 years experience in the computer industry. He is the vice president of corporate marketing and a company officer of BMC Software, the leading supplier of application service assurance solutions. He has held a variety of technical, support, sales, marketing, and executive management positions in several companies in Australia and the United States. His articles on systems and service level management have been published in the United States, Europe, and Australia, and he speaks regularly at industry conferences.
Mary Jander has spent 15 years tracking information technology. She is a senior analyst with Enterprise Management Associates (Boulder, CO). Prior to that, she was with Data Communications magazine, where she covered network and systems management for an international readership of network architects and information systems managers. She has also worked at Computer Decisions magazine and as a freelance writer and copy editor.
Dedication
To Marilyn and David: thanks again for your understanding, encouragement, and forbearance through the entire process of creating this book.
Rick Sturm
To my family on two continents whose understanding, support, and love carry me.
Wayne Morris
Mary Jander
Acknowledgments
This book generated significant interest and excitement as we put it together.
Service level management is a growing management discipline, and there are a
number of professionals who are contributing to the growth in understanding
and acceptance of proactive service management. This includes industry analysts,
members of the press, and courageous individuals within IT departments who
have taken a leadership role in implementing service level management, and who
have subsequently shared their best practices in industry forums and conferences.
Many of our colleagues and friends contributed their insights and support for
this book. To mention them all here is not possible, but we'd like to call out some
individuals who made our jobs much easier.
Thanks to Jeanne Moreno, Linda Harvey, and Alex Shootman of BMC Software
who have been instrumental in implementing service level management and in
educating many others in the methodology and procedures for assuring that
service levels can be met. David Spuler, David Johnson, Sharon Dearman, and
Shannon Whiting of BMC Software contributed research that helped in many
chapters. Amy DeCarlo, Mike Howell, Elizabeth North, and Colleen Prinster
with Enterprise Management Associates contributed research and editorial assistance that helped in many chapters and Appendix E. Finally, our thanks go to Sara
Nupen with Enterprise Management Associates. She assisted with the creation of
several of the illustrations for the book.
Many vendors contributed information to the descriptions of current service level
management products. Although it is not practical to list all those companies, our
special thanks go to BMC Software, Cabletron, Candle Corporation, FirstSense,
Hewlett-Packard, IBM/Tivoli, Landmark, Luminate, and Mercury Interactive for
their contributions.
Our thanks go to Rosemarie Waiand and Jan Watson for setting up the Web site,
http://www.sim-info.org, which will be used as a repository for templates and
also for discussion forums around service level management.
There are many others who had a direct role in developing this book and bringing it to market including Songlin Qiu, Tim Ryan, and Steve Anglin of Sams
Publishing and the book reviewer, Ben Bolles.
To all these individuals and others who helped us, we give our thanks and sincere
appreciation.
When you write, please be sure to include this book's title and author as well as
your name and phone or fax number. I will carefully review your comments and
share them with the author and editors who worked on the book.
Fax:
317-581-4770
Email:
Mail:
Michael Stephens
Associate Publisher
Sams
201 West 103rd Street
Indianapolis, IN 46290 USA
Introduction
Information Technology (IT) departments in both large and small businesses are under pressure to operate more like a business and become more efficient. Customers are demanding assurances of service levels from their IT departments, Telcos, ASPs, ISPs, and other service providers. New categories of service providers are emerging, such as application service providers (ASPs). Increasingly, businesses are turning to outsourcing of IT functions as a way to control costs and to achieve consistent levels of service. Businesses are increasingly dependent upon service providers, including their own internal IT department and external service providers. No longer can service providers, such as an IT department, just focus on keeping each of the pieces (network, systems, databases, and so on) running. Today's environment demands a comprehensive, customer-focused, holistic approach to management. In some cases, IT itself is becoming the focal point for new business, as evidenced in the growing trend toward managed e-commerce, e-business, application services, and virtual private networks. To fulfill these new roles, IT managers must remain focused on important customers, while still providing affordable service to other clients.
successful and one that is a complete failure, a waste of time, money, and effort. It also looks at the various types of products that are available to help with service level management.
Part III: Recommendations provides insights and guidance about the actual contents of a Service Level Agreement. It provides guidance on building a business case for service level management and guidelines for choosing the appropriate metrics for Service Level Agreements. A third key component of this section is guidelines for implementing a service level management program in any organization.
Appendixes provide detailed information that will be helpful for implementing service level management. The appendixes include templates for reports, Service Level Agreements, and follow-up assessments. There is also an appendix that contains a comprehensive list of vendors and their products to assist with service level management.
Glossary: In light of the confusion surrounding service level management terminology, a set of clear, concise definitions is vital to understanding this subject. This glossary contains definitions of over 60 terms important to understanding service level management.
Note
A Note presents interesting pieces of information related to the surrounding discussion.
Tip
A Tip offers advice or teaches an easier way to do something.
Caution
A Caution advises you about potential problems and helps you steer clear of disaster.
PART I
Theory and Principles
Chapter
1 The Challenge
2 The Perception and Management of Service Levels
3 Service Level Reporting
4 Service Level Agreements
5 Standards Efforts
CHAPTER 1
The Challenge
Like the peasants in an old monster film, armed with torches and pitchforks, ready to storm the castle, today's IT clients are fed up with the service that they have been receiving. They are storming IT castles demanding change. They want improved service and they want it now. This book looks at the problem and how IT can respond to the demands of the user community.
Note
In this book, the terms users and clients are used interchangeably to identify those people and
groups within a company who are the consumers of services provided by IT. The term customer is
reserved for those groups and individuals who buy the company's goods and services.
The world of business has always been one of change and innovation. The objective of those changes has always been to maximize profits. Owners and managers
have constantly sought new ways to achieve this objective. Historically, change has
Mission Impossible
Information technology (IT) plays a double role in today's global business
environment. IT's role in facilitating change is well-known and well-documented.
However, it is also subject to forces of change from outside the IT department and
even from outside the corporation.
Among the forces affecting IT is the accounting department. Over the past two
decades, companies have sought to become even more competitive. In some cases,
this has been driven by a desire for increased profits, and in other instances, it is a
matter of survival, as competitors force prices downward. This has translated into
pressure on IT to reduce costs. IT is being asked to live with smaller budgets, both
for capital expenditures and for ongoing expenses. The result is that it is difficult to
acquire additional equipment to accommodate the growth in usage most companies
are experiencing (increased number of users, transaction volume, number of applications, and so on). Acquisition of more modern, faster, and more reliable equipment is made difficult.
Reductions in expense budgets usually translate into reductions in the size of the
IT staff because payroll is normally the greatest single expense in an IT budget.
Other casualties of budget cuts are salary increases and training for the IT staff.
In a competitive job market, the results of these latter items can be higher staff
turnover and employees who are less experienced and not as well trained as would
be desirable. Ultimately, this limits IT's ability to improve or maintain the levels
of service delivered to the end users.
While pressures to reduce costs have been mounting, IT clients, the end users, have
become less ignorant and increasingly sophisticated and technically savvy in the
ways of computing. They have computers at home. They have computers on their
desks in their offices and many have purchased servers for their departments. They
are no longer as accepting of excuses or explanations from IT as they once were.
The users know what they want and believe they know what is possible. They
have aggressive timetables for the delivery of new systems and high expectations
in terms of the availability and performance of those systems.
Just as users' technical awareness has risen, so has their reliance on computer systems. The number of mission-critical systems (systems essential to the operation of the business and, ultimately, its very survival) continues to grow daily. Thus, IT's
level of responsibility within the corporation has risen significantly. IT has gone
from a facilitator of the business process to becoming part of the process and from
supporting staff functions to becoming a key element of the business.
Today, some businesses are built solely on electronic commerce. In these cases,
the company has no existence except through its computers. Companies like
Amazon.com, eBay, E*Trade, and the like cease to exist without the functions
provided by IT.
However, the criticality of systems is not limited to cyberspace. For industries such
as the airlines, financial services, and telecommunications, continuous availability
of mission-critical applications is essential. The spread of enterprise resource planning (ERP) applications (for example, SAP R/3, PeopleSoft, Oracle, and so on)
has produced another form of mission-critical application. The need for highly
available applications with high performance levels has become nearly ubiquitous.
IT has found itself in an unenviable position. CIOs around the world are being
told to reduce their budget and improve service levels for an ever-increasing number of applications. In other words, they are faced with the impossible situation of
having to deliver "more" with "less."
Divergent Views
IT managers have not been ignoring the needs of their users or the business
impacts of the services provided by their organizations. In fact, from the very
beginning, IT managers have attempted to measure and assess the performance
of the services provided by their organizations. However, they have been limited
by perspective and technology.
Historically, IT managers have measured the effectiveness of their organizations
by looking at the individual hardware and software components. In the beginning,
this made perfect sense: there was only one computer, and it could only run a single program at a time. However, that condition did not last long. Today, analyzing individual components provides information that is relevant and important for
managing a specific device or component, but it does not provide a perspective on
the overall service being provided to the end user.
Consider the example of Acme Manufacturing. The order entry department has
negotiated a Service Level Agreement (SLA) with IT. That agreement calls for the
order entry system to be available 99.9% of the time, and no component is to have
more than 10 minutes of total downtime in a month. This SLA is incorporated
into the objectives for each of the IT department managers. Table 1.1 shows the
results for one month. At the end of the month, almost everyone is pleased with
the results. All but one of the components have met or exceeded the objectives for
availability and for total downtime in the month.
Table 1.1 IT Perspective: Acme Order Entry System Performance

Component                  Minutes of Downtime    Availability
Building Hub                      0.00              100.00%
Customer Database                 4.32               99.99%
Inventory Database                0.00              100.00%
LAN                               6.00               99.99%
Local Server                      8.64               99.98%
Order Entry Application           7.54               99.98%
Remote Host                      69.72               99.84%
WAN                               9.88               99.98%
Obviously, the Remote Host had some problems during the month, but this
was because of the failure of a circuit board. Operations had to wait for a service
technician to arrive on site and install the new board. Considering this, even the
operations group is reasonably satisfied with the performance that IT delivered.
The management of the order entry department takes a much different view of the
performance for the month. The end users see a month in which there was a total
of 106.1 minutes in which they could not process orders. As shown in Table 1.2,
the availability that they experienced was 99.75%well below the target of 99.9%.
(Availability of 99.9% would allow a total unavailable time of 43.2 minutes in a
30-day month.) It is possible that there was some overlap in outages; however, that
is statistically unlikely. Also, this example has been constructed using the assumption
that outages impacting more than one component would be charged to the root
cause. For example, if the remote host fails, it will necessarily result in an outage
for the order entry application, the customer database, and the inventory database.
However, the outage would only be charged against the remote host because that
is the cause of the other components being unavailable.
Table 1.2 Users' Perspective: Acme Order Entry System Performance

Component                  Minutes of Downtime    Availability
Building Hub                      0.00              100.00%
Customer Database                 4.32               99.99%
Inventory Database                0.00              100.00%
LAN                               6.00               99.99%
Local Server                      8.64               99.98%
Order Entry Application           7.54               99.98%
Remote Host                      69.72               99.84%
WAN                               9.88               99.98%
Composite                       106.10               99.75%
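The figures in the two tables can be tied together with simple arithmetic. The following Python sketch uses the values from the tables above and assumes, as described earlier, a 30-day month of 43,200 minutes and non-overlapping outages charged to their root-cause component; it reproduces both the per-component availability that satisfies IT and the composite 99.75% that the order entry department experienced:

MINUTES_PER_MONTH = 30 * 24 * 60   # 43,200 minutes in a 30-day month

downtime_minutes = {
    "Building Hub": 0.00,
    "Customer Database": 4.32,
    "Inventory Database": 0.00,
    "LAN": 6.00,
    "Local Server": 8.64,
    "Order Entry Application": 7.54,
    "Remote Host": 69.72,
    "WAN": 9.88,
}

def availability(downtime: float) -> float:
    """Percentage of the month that the component (or service) was available."""
    return 100.0 * (MINUTES_PER_MONTH - downtime) / MINUTES_PER_MONTH

# Component view: each element looks healthy in isolation.
for component, minutes in downtime_minutes.items():
    print(f"{component:<24} {minutes:7.2f} min  {availability(minutes):6.2f}%")

# User view: an outage on any component interrupts the end-to-end service.
total_downtime = sum(downtime_minutes.values())
print(f"{'Composite':<24} {total_downtime:7.2f} min  {availability(total_downtime):6.2f}%")

# A 99.9% target allows only 0.001 * 43,200 = 43.2 minutes of downtime per
# month, so the composite 106.1 minutes (99.75%) misses the SLA.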
At this point, IT managers are congratulating each other on the high level of
service that they have delivered to the order entry department. Meanwhile, the
order entry department managers are hopping mad. The users see IT as being
unresponsive and unable (or unwilling) to meet their needs. As they look at the
situation, outsourcing the IT function starts to sound like an appealing alternative.
Technical Challenge
It is true that IT still maintains far too much of a component-centric view of the services that it delivers. Although IT organizations must share in the blame for this continued limited perspective, much of it can be explained by the limitations of the technology available for management reporting. Consider the problem that is reflected in Figure 1.1. This is a simplified illustration of a distributed computing
environment. Each type of component has a unique management system attached.
Those unique management systems (element management systems) can provide a
great deal of information about any single device. The problem presented by these
systems is that they also produce a fragmented view of the service. Element management systems are not designed to assess each device in the overall context of
the service that it is helping to deliver.
Element management systems are providing information about each component in isolation. It is as if a doctor carefully examines a single part of your body. From that examination, it will be possible to describe the state of that body part. In the case of certain critical body parts (such as the heart, liver, and so on), it might be possible to state how the part is impacting your general health or life expectancy. However, more often it is necessary to consider the part within the context of the total body. It is this perspective that is moving the medical community toward a holistic approach to treatment and diagnosis. There is a similar need within the IT community.
Figure 1.1
Figure 1.2
In Figure 1.1, there is only a single router, and if the router fails, there is no question
that the service is interrupted. However, in a more complex environment, with many routers (see Figure 1.2), alternate paths for data, and so on, the impact of the failure of a single router is not as obvious.
What is needed is the ability to assess the impact of any aberration in the
service delivery environment on the service and the end users. Herein lies the challenge: assessing the overall impact when the data is only available on a piecemeal basis. Many companies simply rely on the subjective judgment of the operations personnel. Unfortunately, this is an unreliable approach that cannot produce accurate measurements of the overall level of the service being delivered. Fortunately, new software products are emerging that aim to provide such measurements. We will look at these tools in Chapter 7, "Service Level Management
Products."
What Is SLM?
Service level management (SLM) is the disciplined, proactive methodology and procedures used to ensure that adequate levels of service are delivered to all IT users in accordance with business priorities and at acceptable cost. Effective SLM requires the IT organization to thoroughly understand each service it provides, including the relative priority and business importance of each.
Service levels typically are defined in terms of the availability, responsiveness, integrity, and security delivered to the users of the service. These criteria must be viewed in light of the specific goals of the application being provided. For example, a human resources application might require communications such as email among individuals. An order-entry application might involve multiple cooperating applications such as supply chain management. In all cases, the service should be treated as a closed-loop system with all service levels related directly to the end-user experience.
The instrument for enforcing SLM is the Service Level Agreement (SLA): a contract between IT and its clients that specifies the parameters of system capacity, network performance, and overall response time required to meet business objectives. The SLA also specifies a process for measuring and reporting the quality of service provided by IT, and it describes compensation due the client if IT misses the mark.
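Although the book's own SLA templates appear in the appendixes, a structured record along the following lines illustrates the kinds of measurable terms the paragraph above describes. The field names and values here are illustrative assumptions, not the book's template:

from dataclasses import dataclass, field

@dataclass
class ServiceLevelAgreement:
    service: str
    availability_target_pct: float        # e.g., 99.9 allows 43.2 minutes/month of downtime
    response_time_target_sec: float       # end-to-end interactive response time
    reporting_period: str                 # how often quality of service is reported
    penalty_per_missed_period: float      # compensation due the client if IT misses the mark
    measurement_points: list[str] = field(default_factory=list)

order_entry_sla = ServiceLevelAgreement(
    service="Order Entry",
    availability_target_pct=99.9,
    response_time_target_sec=2.0,
    reporting_period="monthly",
    penalty_per_missed_period=5000.0,
    measurement_points=["end-user desktop", "application server", "database"],
)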
SLM's benefits are so compelling that its use isn't relegated only to IT environments. Seminole Electric Cooperative Inc. (Tampa, FL) also uses SLAs as a key element in its multilevel service offerings. Customers receive cycles of electrical
power instead of packets or system capacity.
Consider a company that produces sugar (powdered and granulated) and sells it
in two-pound bags. First, the specifications ("granulated sugar" and "net weight:
2 lb.") appear clearly on the bag. For reasons of customer satisfaction and cost
control, in addition to governmental regulations, the product must meet those
specifications.
Note
An IT department without a service level management program is like a sugar producer that puts
a "reasonable" amount of product in unlabeled bags. Clearly, this is not a formula for success.
There are six basic reasons for an IT organization to implement service level
management. These reasons are as follows:
Client satisfaction
Managing expectations
Resource regulation
Internal marketing of IT services
Cost control
Defensive strategy
Client Satisfaction
The leading reason for implementing service level management is client satisfaction. To begin with, SLM necessitates a dialog between IT managers and their
clients. This is necessary in order for IT to be able to understand the client's service requirements. It also forces clients to clearly state (perhaps for the first time)
their requirements or expectations. When IT and the client agree on what is an
acceptable level of service, they are establishing a benchmark against which IT
performance can be measured. IT is able to shift toward a defined objective: the
client's requirements. The dialog that is initially established continues through the
process with regular reports. Even a process of service level management cannot
produce happy clients when service level commitments are not met. However, it
will significantly raise overall client satisfaction when commitments are met. It can
also help to improve the situation when targets are missed.
Managing Expectations
An ancillary benefit of implementing service level management is that it makes it
possible to avoid so-called expectation creep, that is, the ever-rising levels of users'
undocumented expectations. It is common for people to want improvements over
the status quo. If users' requirements are not documented, their expectations are
Resource Regulation
SLM provides a form of governance over IT resources. In some organizations, a powerful user group will sometimes demand support for an application that unfairly ties up resources. With an SLA in place, it is more difficult for a strong minority to outweigh the interests of the majority. SLAs also help IT avoid capacity problems that result when too many applications crowd the network, server, mainframe, or desktop. And because SLAs specify levels of service, they can be used as indicators for ongoing system capacity and network bandwidth requirements. Specific resources will be needed to keep abreast of SLA parameters. And
the monitoring and measurement deployed by IT to keep up with SLAs ensures
early warning for any new capacity that might be required.
Cost Control
In the context of cost control, service level management is a double-edged sword.
First, it helps IT to better determine the appropriate level of service to provide.
Without service level objectives arrived at in dialog with their clients, IT management is forced to guess. Too often, this guesswork leads to excess. That is, it can
lead to over-staffing, configuring networks with excess capacity, buying larger,
faster computers, and so on.
In the absence of dialog with IT, the users' requirements are established by what is
desirable rather than what is affordable. The requirements and expectations are not
tempered by the reality of feasibility or affordability. Service level management can
also impact costs through moderating user demands for higher levels of service.
This can happen in two ways. As discussed in the previous section, service level
management can limit the escalation of user demands. Also, as part of the dialog
with IT, the financial impact of higher levels of service can also be explained. In
some instances, the business case will justify the additional cost of providing higher
levels of service. In other cases, there will not be a financial justification and,
hopefully, the unnecessary cost will be avoided.
Defensive Strategy
Ultimately everyone is motivated by self-interest. IT managers are no different.
It can clearly be in the interest of IT managers to implement a service level management process. With SLM in place, IT has a tool to use in defending itself from
user attacks. Clear objectives are set and documented. There is no room for doubt
about whether the objectives have been met. In a well-written Service Level
Agreement, even the metrics for measuring service levels are defined and agreed
to by both the users and IT.
Ultimately, service level management is something that can benefit the user, the
IT organization, and the corporation in which they both work. The process of service level management can temper users' demands for higher levels of service.
Conversely, service level management can hold IT accountable for delivering
agreed upon levels of service, while providing them with clear objectives for service. Outsourcing continues to be very popular. SLM can be the best defensive
strategy that IT can have against user dissatisfaction that can lead to outsourcing.
Probably more important than all the other factors fueling interest in service level management is the fact that technology has matured, making end-to-end measurement and reporting available at a reasonable cost. Dozens of vendors, ranging from the very largest to minuscule start-ups, have focused their attention and the considerable talents of their technical wizards on the challenges of service level management. A 1996 Enterprise Management Associates survey of IT managers found only twelve products that were being used for service level management. However, possibly only one of those products (Microsoft Excel) added any value to the process, albeit minimal. Eighteen months later, in May 1998, the number of companies that identified themselves as offering products specifically for service level management had risen to 62. By March of 1999, the number had climbed
to 90 (see Figure 1.3).
Figure 1.3 SLM Products, 1996-1999
Why Now?
If the case for service level management is so compelling, why is it just now
receiving widespread attention? Several reasons help explain the sudden attention
that service level management is receiving. First, there has been a dramatic increase
in the number of applications (that is, the number of services being provided) and
in the relative importance of those applications. Companies are more dependent
upon the services that IT provides.
The next factor driving the increased interest in service level agreements is
increasing user sophistication and their growing dissatisfaction with the level of
service they are receiving. This change in the user community is discussed at the
beginning of this chapter.
If a company wants to implement SLM, it is a much simpler process than it was even 4 years ago. In the past, collecting the data (if available) and generating the SLM reports was slow and labor-intensive. Sometimes it required custom programs to be written, or expensive data collection products to be purchased. Even then, the results were usually marginal. The situation has improved significantly. The introduction of new products has facilitated the data collection process as well as the merging or correlating of data from diverse sources. Although there are more advances to come, SLM reporting has become dramatically easier.
Summary
In today's global business environment, IT professionals find themselves undergoing
pressure to reduce costs and deliver higher-than-ever levels of service to increasingly
savvy users. To achieve this, they are deploying service level management (SLM), a
methodology for ensuring consistent levels of capacity and performance in IT environments. SLM includes a contract between IT and its clients (whether in-house or external to the organization) that specifies the client's expectations, IT's responsibilities, and compensation that IT will provide if goals are not met. Despite some initial
misgivings, the value and importance of SLM have been established. Its successful use
is well documented in numerous case studies, and its popularity is increasing not
only within IT organizations, but also among service providers. A range of new
products geared to supporting SLM further testifies to its deployment as an unquestioned requirement in IT organizations worldwide.
CHAPTER 2
The Perception and Management of Service Levels
This requires that the IT organization understand each service it provides, including relative priorities, business importance, and which lines of business and individual users consume which service.
There are a number of important aspects that relate to the perception and management of service levels. The first consideration is to ensure that the service levels to be managed are measured and evaluated from a perspective that matches the business goals of the IT organization. The IT department supports business productivity by ensuring that the applications used by internal personnel are available to them when required and that they are responsive enough to allow these users to be optimally productive. It is almost certain that failures and errors will lead to a service outage. The IT department must restore the service as quickly as possible with the least amount of disruption to other services.
The IT department must also ensure that automated business processes complete
in a timely fashion to meet required deadlines and to enhance effectiveness and
profitability. Additionally, if customers directly use or interact with IT services, IT
must ensure that their experience is pleasurable, leading to customer loyalty and
repeat business.
Service levels are measured and managed to improve a number of quantifiable
aspects of the perceived quality of the services delivered. These are described in
the remainder of the chapter.
Availability
Availability is the percentage of the time that a service is available for use. This can
Figure 2.1
Note
This end-to-end definition of service using the user's perspective is required for all measures relating
to service quality.
Note
Unless availability measurement relates directly to the user experience, it will have little positive
value and might, in fact, damage the credibility of the IT organization quoting such measurements.
True availability must be measured end-to-end from the end user through all the
technology layers and components to the desired business application and data, and
back to the end user. Such an aggregate value can be difficult to measure directly
and might have to be derived by combining the availability of all the components
traversed. Figure 2.1 shows the concept of end-to-end service level management.
The reality of a business-oriented IT service might, in fact, be more complex,
involving multiple applications, extranets, and Internet connections.
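One way to derive such an aggregate figure, assuming the traversed components form a simple serial chain (an assumption for illustration; the book does not prescribe a formula), is to multiply the availability of each component, since the service is up only when every component is up. The component names and values below are made up:

import math

component_availability = {     # fraction of time each component is available
    "desktop and LAN": 0.9999,
    "WAN link": 0.9995,
    "application server": 0.9990,
    "database": 0.9992,
}

# Serial chain: the end-to-end service is available only when all components are.
end_to_end = math.prod(component_availability.values())
print(f"Derived end-to-end availability: {end_to_end:.4%}")   # roughly 99.76%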
The first step in measuring and managing service levels is to define each service
and map out the service from end-to-end. Each of the end users and their locations should be identified together with the path they take to access the business
application providing the core part of the service. The data used by the application
should be determined along with where it resides and how it is accessed. If the
core application needs to interact with other applications, these should also be
identified and mapped. In this manner the overall flow of a service or business
transaction can be determined, recorded, and used to define transaction types,
component dependencies, and appropriate service measurement points.
The availability of each service is defined within standard hours of operation for that service. Most IT organizations need to remove applications from service at periodic intervals (called a maintenance window) in order to undertake routine maintenance of the application, supporting databases, and underlying infrastructure. Hence availability objectives will also specify planned outages by service, together with the schedule for those outages. The standard hours of operation will vary depending on the nature and criticality of each service. As more services become Internet based, the length of the maintenance window shrinks and might not be acceptable at all.
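A common convention, assumed here for illustration, is that downtime inside the agreed maintenance window does not count against the availability objective, so the measurement base is the scheduled hours of operation minus planned outages. A rough sketch with made-up figures:

SCHEDULED_MINUTES = 30 * 24 * 60     # a 24x7 service over a 30-day month
PLANNED_MAINTENANCE = 4 * 60         # an agreed 4-hour monthly maintenance window
unplanned_downtime = 50.0            # minutes of outage outside the window

measurement_base = SCHEDULED_MINUTES - PLANNED_MAINTENANCE
availability = 100.0 * (measurement_base - unplanned_downtime) / measurement_base
print(f"Availability against scheduled service hours: {availability:.2f}%")   # about 99.88%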
Availability is the most important factor that influences the users' perception of the
quality of IT services. It is also the most critical factor affecting user productivity,
particularly if the user depends entirely on a particular business application to perform her job function.
Performance
As with availability, performance must be measured from the end users' perspective
and must also relate to the business goals of IT. The performance of a service is
measured by the responsiveness of the application to interactive users and the time
required to complete each batch job to be processed (also called job turnaround). The
responsiveness of the application and batch job processing times will be affected
directly by the amount of work to be processed (also called workload levels). This
concept is discussed in the next section.
A large amount of processing does not require continuous interaction with either the user or system operator and happens in batch mode or as background tasks. In this case, job streams are scheduled and processed to perform routine operations that produce predetermined outputs such as payroll, financial reporting, inventory reconciliations, and so on. Responsiveness of the batch jobs is referred to as turnaround. This is the time between submitting the batch or background request and the completion of all processing associated with that request, including delivering output to the required recipients.
Interactive Responsiveness
Interactive responsiveness relates to the time taken to complete a request on behalf
of a user. The quicker the requests are completed, the more responsive the service.
The request could be processing a service transaction or retrieving some information. It is important that any measure of responsiveness match the user experience,
hence response time measures must be end-to-end, from the end user's desktop
through the business application (including any database access and interaction
with other applications) and back to the end user.
The responsiveness of the service is second only to availability as an important
factor in the user's perception of the quality of services provided by IT. There is a
direct correlation between how fast the application responds to online users and
their productivity. An important consideration is the consistency of the interactive
response times experienced by the end users. Erratic and unpredictable response
times that vary from exceptionally fast to extremely slow will be perceived by the
users as unacceptable and far worse than consistent response times that might be
merely adequate.
It is important that all applications supported by the IT environment meet their
performance goals. If balanced performance is not maintained, one application
service might meet its performance objectives at the expense of other application
services, which will result in a dissatisfied user community.
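Because consistency matters as much as raw speed, summaries of end-to-end response time should capture spread as well as the average. A small sketch (the sample values are invented) that reports the mean, a rough 95th percentile, and the standard deviation of measured response times:

import statistics

response_times_sec = [0.8, 0.9, 1.1, 0.7, 4.9, 0.8, 1.0, 5.3, 0.9, 1.0]

mean = statistics.mean(response_times_sec)
p95 = sorted(response_times_sec)[int(0.95 * len(response_times_sec)) - 1]   # rough percentile
spread = statistics.pstdev(response_times_sec)

# Two slow outliers inflate the spread even though the mean still looks adequate.
print(f"mean {mean:.2f}s, 95th percentile {p95:.2f}s, standard deviation {spread:.2f}s")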
Tip
In cases where performance is certain to degrade over time because of increasing workloads, and where responsiveness is significantly better than required initially, some IT departments build in latency that can be removed gradually over time. This ensures that response times can be held constant at acceptable levels, and ensures that unrealistic expectations aren't set.
In a large IT environment, numerous batch jobs will have to be processed every day, and the volume of jobs typically varies in cycles with peaks at the end of the week, month, quarter, and fiscal year.
There is usually a specified "batch window" within which all batch processing has to finish to ensure that the performance, and particularly the responsiveness, of interactive processing is not degraded by the batch processing. Some background tasks are continuous in nature, such as print spooling or file system management.
Critical Deadlines
In addition to the normal window for processing all batch jobs, there might be specified times at which certain jobs or tasks must finish to satisfy external vendors or regulation. For example, the payroll run might have to be completed by 3:00 a.m. to ensure that the information is sent to the bank in time for the electronic funds transfer to be completed that night.
Note
Meeting critical deadlines can be very important because there might be monetary damages or penalties for not completing the work by the specified time.
In many cases, the completion of critical deadline jobs will take precedence over completing all jobs within the batch window and might have priority over interactive processing.
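Checking critical deadlines is straightforward once job completion times are recorded. The schedule below is invented for illustration, following the payroll example above:

from datetime import datetime

deadlines = {"payroll_run": datetime(2000, 3, 1, 3, 0)}       # must finish by 3:00 a.m.
completions = {"payroll_run": datetime(2000, 3, 1, 2, 41)}    # actual completion time

for job, deadline in deadlines.items():
    finished = completions[job]
    status = "met" if finished <= deadline else "MISSED"
    print(f"{job}: finished {finished:%H:%M}, deadline {deadline:%H:%M} -> {status}")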
Workload Levels
The workload level is the volume of processing performed by a particular service.
This includes both the rate of processing interactive transactions, as well as the
number of completed batch jobs within a given time period.
These service workloads generally relate to specific applications; however, workload processing might span multiple applications and generate work on multiple
systems. A service workload uses all the components involved in delivering the service, including using network, system, database, and middleware resources. In order
to plan capacity requirements and understand the effect of service workloads, it is
very useful to correlate service workloads with individual resource utilization levels
across all components that provide the service infrastructure.
The most important measures of service workload volumes are online transaction rates, the number of batch jobs to be processed, and the number of these jobs that will be completed in parallel.
Transaction Rates
Interactive workloads are usually measured as the number of transactions per second; however, it is important to understand the nature of the quoted transactions. Transactions represent a complete unit of useful work. It is important to recognize the difference between a transaction and a system or application interaction. A single interaction is simply one pair of messages in a dialog between the user and the application, such as the user submitting a request for service and receiving an acknowledgment of the request. A single transaction can involve multiple user interactions.
Definition of a Transaction
An application transaction performs some business task that results in a change to the data associated with, or the state of, the automated application. A business transaction changes the state of a business entity, changes the state of the relationship between business entities, or performs some service on behalf of a business customer.
Tip
The overriding rule for determining the scope of a business transaction is that it begins with an end user initiating a business action or request and ends when the automated business process fulfills that request.
For example, a business transaction might result in the sale of a number of shares in a publicly traded company. In order to complete this business transaction, a number of different interactions and application transactions might occur. The broker registers the customer's desire to sell the shares using one application transaction. The broker's system checks the stock price from the stock exchange with a different application transaction. Perhaps the broker then lists the volume to be sold and its asking price with a market maker by interfacing to that entity's trading system and receives a bid indication. A trading house accepts the bid in yet another transaction. Finally, the sale confirmation is sent back to the broker in another transaction, and the broker notifies the customer of the sale. Each of these interactions could be considered a business transaction by itself, but will have little relevance to the customer wanting to sell stock. The customer's perception will be that he completed one business transaction: he sold a volume of stock.
All measures of interactive workloads need to specify whether business transactions or application transactions are being quoted, and if a business transaction, the parties to the business transaction should also be understood, particularly whether one
Note
Measuring business transactions can be complicated unless the application code itself supports such a measure.
Mapping business transactions to application transactions, to underlying interactions, and relating how the supporting technology components were used to process the transaction will be required in order to plan capacity requirements.
Client/Server Interactions
Many applications have been developed using the client/server architecture. In many cases, this is multi-tiered, such as an application that is split among the client-side presentation, application server, and database server. This complicates the measurement of the transaction. Processing times will vary depending on the power of the system performing the processing and the characteristics of the workloads themselves.
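The share-sale example above shows why the counting unit must be stated: one business transaction decomposes into several application transactions, so a quoted transactions-per-second figure changes with the unit. The decomposition and volumes below are illustrative assumptions:

business_transactions = {
    "sell_shares": ["register_sell_order", "check_stock_price",
                    "list_with_market_maker", "accept_bid", "confirm_sale"],
}

completed_business_tx = 1_200     # business transactions completed in one hour
interval_seconds = 3_600

app_tx_per_business_tx = len(business_transactions["sell_shares"])
business_tps = completed_business_tx / interval_seconds
application_tps = business_tps * app_tx_per_business_tx

print(f"{business_tps:.2f} business tx/sec vs {application_tps:.2f} application tx/sec")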
Tip
Workload balancing is important to ensure that synergistic workloads run concurrently because conflicting jobs (jobs with similar characteristics, such as being CPU intensive or I/O intensive) might lead to thrashing and performance degradation.
It becomes particularly important to carefully control the amount and priority of batch jobs and background processing when these are performed concurrently with interactive processing. There is a very real possibility that these background tasks and jobs will detract from the quality of service provided to interactive users.
Security
Defining the security of a service includes the definition of who can access the service, the nature of the access, and the mechanisms used to detect, prevent, and report unauthorized access. As applications span multiple platforms and users require access to data across multiple databases, the complexity of the security environment increases tremendously, and multiple security management systems will be employed.
Note
Coordinating actions and administration across these multiple security systems becomes critical for ensuring consistency of access privileges and reducing the administrative overhead and potential for errors.
Defining Resources
All users and resources, including services, data, applications, systems, and network elements, must be defined to the security systems. To avoid issues with multiple inconsistent definitions, a resource-naming architecture should be defined and adopted. Additionally, a centralized security administration application can be used to automate the coordination and propagation of definitions and updates between multiple distributed security systems.
Note
Understanding, mapping, and maintaining the resources used by each service is important for understanding which resources a particular user will need to access in order to perform his job function.
Access Controls
When the resources have been defined, access control lists are defined and then used to determine which users have access to which resources, and the nature of the authorized access. Depending on the type of resource to be accessed or used, the nature of the access will vary. For example, a user might be authorized to use a particular service that also requires access rights to certain data objects within a database. The nature or level of the access can range from read, write, update, create, or delete. Depending on her access privileges to the underlying data, the user's ability to invoke the various service options will vary.
The service options available to the user will determine the level of access to each application and resource she requires. Again, the use of a registry or directory service can simplify this aspect of maintaining access control.
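A minimal sketch of an access control list of the kind described above, mapping each resource to the access its users hold; the resource names, groups, and rights are assumptions for illustration, not a real product's interface:

from enum import Flag, auto

class Access(Flag):
    READ = auto()
    WRITE = auto()
    UPDATE = auto()
    CREATE = auto()
    DELETE = auto()

# ACL: resource -> {user or group: granted access rights}
acl = {
    "order_entry_service": {"order_entry_staff": Access.READ | Access.CREATE},
    "customer_database": {"order_entry_staff": Access.READ | Access.UPDATE,
                          "dba_group": Access.READ | Access.WRITE | Access.DELETE},
}

def is_authorized(principal: str, resource: str, requested: Access) -> bool:
    """True only if every requested right has been granted on the resource."""
    granted = acl.get(resource, {}).get(principal, Access(0))
    return (granted & requested) == requested

print(is_authorized("order_entry_staff", "customer_database", Access.UPDATE))   # True
print(is_authorized("order_entry_staff", "customer_database", Access.DELETE))   # False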
Privacy Issues
An important consideration when implementing security systems is to ensure that, where appropriate, the identity of the users is kept private and is not available to unauthorized access. If registries and directory services are used, controlling access to these data stores is an important aspect of ensuring information privacy.
Note
Information privacy becomes more important for those applications that directly touch customers (for example, e-commerce applications) or where applications interface with business partners, such as with supply chain and e-business applications.
The issue of information privacy is a sensitive topic within the Internet community and one that has a direct impact on the users' perception of service quality. Standards and regulation can be expected to continue to evolve in this area, and hence service level management must embrace managing information privacy.
Intrusion Detection
The IT resource definitions, as well as the identity and information associated with users who access those resources, are all business assets, and as such it is important to identify the business owner of IT security. This security business manager must be responsible for defining security requirements, policies, privilege classes, access controls, escalation procedures, and monitoring roles and procedures.
The security aspects of service level management and reporting should be aimed
Accuracy
Service accuracy is
Data Integrity
Data integrity is the most significant aspect of ensuring the accuracy of the data used for making decisions. Hardware failures, logic errors, and program architecture issues, as well as operator and user error, can all impact the integrity of data. Ensuring data integrity requires checking the consistency of data and database structures including views, stored procedures, indices, and so on.
Additionally, defining and implementing appropriate data backup and recovery procedures will improve data integrity by enabling restoration of corrupted data. Recovery of data is addressed in more detail in the section "Recoverability" later in the chapter.
Data Currency
Another important aspect of data accuracy is the currency of data. This is particularly important when data is distributed across multiple data stores such as replicated databases, data warehouses, and data marts. In these cases, the latency, or delay in propagating data changes to the distributed data stores, affects the accuracy of the data. Longer propagation delays result in data that is not consistent across the enterprise, and different users will be working with various versions of the data. Web servers, e-business, and e-commerce exacerbate this problem because in many cases data is moved from operational databases to data stores outside the firewall, for example, to the external Web site or to the business partner via an extranet.
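Data currency can be watched by comparing each replica's last propagated change against the operational source. The timestamps and the acceptable lag below are invented for illustration:

from datetime import datetime, timedelta

source_last_update = datetime(2000, 3, 1, 12, 0)
replica_last_update = {
    "data_warehouse": datetime(2000, 3, 1, 11, 15),
    "external_web_site": datetime(2000, 3, 1, 9, 30),
}
max_acceptable_lag = timedelta(hours=1)

for replica, updated in replica_last_update.items():
    lag = source_last_update - updated
    state = "current" if lag <= max_acceptable_lag else "stale"
    print(f"{replica}: {lag} behind the operational data ({state})")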
Caution
Applications using replicated data can result in customers and partners having different and inaccurate data available to them, depending on the frequency of data updates.
Job Control
The accuracy of provided services also depends on ensuring all the required batch jobs are run with the correct sequencing and dependency rules and that critical deadlines are met. This aspect can rely on operator intervention, job scripts, or an automated job scheduler.
Scheduled Maintenance
As mentioned when discussing service availability, most IT environments require maintenance functions to be performed regularly during scheduled downtimes. Service availability is directly affected by the IT department's ability to remain within the scheduled periods. Service quality also depends on the IT department ensuring that all appropriate maintenance, including backups, bulk data moves and loads, database reorganizations, and database schema changes, as well as upgrades to applications, supporting software, and hardware, is completed correctly during the planned downtime. Hence, service management should include precise definition and implementation of scheduled maintenance requirements, frequency, and procedures.
Recoverability
Recovering from unplanned outage conditions as rapidly as possible is necessary to improve the availability of services provided by IT. The ultimate goal of a recoverability strategy is to provide business continuity or as close to this ideal as possible. Hence, the IT organization must be able to recover from multiple types of outages in a minimal time and with minimal disruption to the other services provided by IT.
Types of Outages
Outages can be because of physical failures, logical errors, or a natural disaster. Depending on the nature of the outage, a single service might be disrupted or multiple services might be affected. For example, a physical failure will affect those services using that device or component, whereas a logic error will impact a single service. In either case, the failure might have a cascading effect on other services that use the data or other output from a service initially impacted by the failure. A full disaster will affect all services in that location and all other services that depend on any physical devices in that location.
Note
The vast majority of outages today are because of logic errors rather than either hardware failure or disasters.
Understanding the impact of each outage type and planning for the correct recovery procedure requires knowledge of the relationship between each service and the underlying resources, as well as the inter-relationship between services, particularly those in which there is data sharing. The registry or directory of services and associated supporting resources can be invaluable in understanding the effect of an individual resource failure. Similarly, knowledge of the business process, application integration, data model, and the association between data objects and access by application services will allow the impact of one service outage on other services to be assessed.
Levels of Recovery
Recovering from an outage will take place in multiple stages. In the event of a physical failure, the device is repaired or replaced. Then the data must be restored from a back-up copy; the application restarted; and as much lost work as possible, from the time of the last backup to the time of failure, is re-created. Then business processing can be resumed. Each of these processes can be automated to some degree, and the use of additional automation reduces the time required for recovery and can also reduce the possibility of error in the recovery process.
Time to Recover
The time taken for the recovery includes the time required to cease processing, restore a stable environment, recover corrupted data, and re-create lost transactions. The recovery time directly impacts service availability, whereas the ability to recover all data and completed transactions has a direct effect on the accuracy and integrity of the data and the perceived quality of the service.
The time taken to restore a stable environment depends on the extent of any physical damage and availability of additional or substitute hardware resources. The additional recovery time depends on the amount of data to be recovered, the time required to locate and mount the back-up media, the speed of data transfer from the back-up media, the time required to re-create transactions, and the time required to initialize and restart applications and background tasks.
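Because the phases described above are largely additive, a first estimate of time to recover can be built by summing them. The phase durations here are invented for illustration:

recovery_phases_minutes = {
    "repair or replace failed device": 45.0,
    "locate and mount back-up media": 10.0,
    "restore data from back-up media": 30.0,
    "re-create lost transactions": 20.0,
    "restart applications and background tasks": 15.0,
}

time_to_recover = sum(recovery_phases_minutes.values())
print(f"Estimated unavailable time for this outage: {time_to_recover:.0f} minutes")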
Affordability
A distinct balance exists between the service levels provided by the IT department and the associated costs of delivering the service. Typically, the higher the availability and performance required, the more costly it is to provide the service. In order to better understand this relationship, and to ensure that lines of business use fully loaded costs when assessing their profit and loss, many organizations charge IT costs directly to the users of IT services.
Tip
Assigning IT costs to lines of business allows IT to be seen as a business partner and service supplier.
When allocating costs, a mechanism for calculating IT costs together with a methodology for allocating those costs to the various users and lines of business should be negotiated and agreed to by IT and the user community.
Quantifying Cost
The costs associated with running the IT environment include hardware costs
(capital depreciation and expenses), software costs, maintenance costs, personnel
costs, telecommunications cost, consultant and professional service costs, and
environmental costs. In most cases, the costs associated with operating the IT
environment are the largest costs incurred, outweighing the expenses and capital
depreciation of hardware and software. This means that effectively managing the
When assigning costs to lines of business, considerations include what costs to use,
what method of allocation to use, and how to demonstrate return on investment
and show value for money.
Software development costs vary depending on the demand for custom applications and can normally be related to specific projects or business initiatives.
Software license costs for application software usually increase with the number of
users. Assigning application license costs and custom development costs to lines of
business is generally straightforward.
Other costs will vary with the size and complexity of the IT environment including additional capacity for contingencies, backups, and hot standby systems, along
with utility software and network and system management solutions. These are difficult to assign to individual users or lines of business but are necessary for smooth
operation.
Tip
These costs lend themselves more to an agreed-upon allocation as overhead, rather than trying to
relate costs to usage by the lines of business.
Costs can also be allocated based on measurements of actual usage. More simplistic methods have gathered support in distributed environments, including allocating costs based on service subscriptions or calculating costs based on the volume of business transactions. These are easier to calculate and have analogies that most management can relate to, making them easier to understand and sell to the lines of business.
Service subscription cost allocation is based on the cable television model. Lines of business pay for services that are accessed by their personnel, and the cost per user does not vary by the intensity of usage. This is a very simple model, and, provided that the cost per user per service is set appropriately, it is easy to understand and easy to calculate. The price for each service subscription should relate to the cost of providing the service and preferably will also reflect the perceived value of the service.
Caution
The IT department needs to ensure that the sum of all subscriptions sold equals the total costs to be
allocated.
Using service subscriptions can be easy for the user community to understand
because there is an analogy to the cable television industry. There is not a direct
correlation between cost, usage intensity, and business volumes in this model, and
this might cause some confusion when trying to relate business value.
Using business transaction volumes provides a more direct link between cost and
business value, and the analogy is bank fees on banking transactions. The difficulty
here might be in negotiating a suitable fee per business transaction and differentiating value between the various types of transactions, while still having a simple
model that is easy to calculate. This also places the responsibility on the IT department to understand business volumes by transaction type well enough to ensure
that total IT costs are recovered.
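As a rough illustration of these two allocation models, the following sketch sets a per-user subscription price and a per-transaction fee so that the total cost pool is exactly recovered. The figures, service names, and volumes are hypothetical; they exist only to show the calculation.

# Illustrative sketch: recover a total IT cost pool through either service
# subscriptions or business-transaction fees. All figures are hypothetical.

total_it_costs = 1_200_000.00  # annual cost pool to be recovered

# Subscription model: each line of business pays per subscribed user.
subscribers = {"Sales": 250, "Finance": 120, "Manufacturing": 430}
price_per_user = total_it_costs / sum(subscribers.values())
subscription_charges = {lob: n * price_per_user for lob, n in subscribers.items()}

# Transaction-volume model: each line of business pays per business transaction.
transactions = {"Sales": 900_000, "Finance": 150_000, "Manufacturing": 450_000}
fee_per_transaction = total_it_costs / sum(transactions.values())
transaction_charges = {lob: v * fee_per_transaction for lob, v in transactions.items()}

print(f"Subscription price per user:  ${price_per_user:,.2f}")
print(f"Fee per business transaction: ${fee_per_transaction:,.4f}")
for lob in subscribers:
    print(f"{lob:15s} subscription ${subscription_charges[lob]:>12,.2f}   "
          f"transactions ${transaction_charges[lob]:>12,.2f}")

# In both models, the allocations sum back to the total cost pool.
assert round(sum(subscription_charges.values()), 2) == total_it_costs
assert round(sum(transaction_charges.values()), 2) == total_it_costs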
Summary
As outlined in this chapter, there are many aspects to service level management.
The most important concept is to ensure that the definition of the services to be
managed relates to the perception of the lines of business and the IT users. The
quality of the services delivered to these users will be judged according to the
users' ability to safely, effectively, and cost efficiently use the services when required
to perform their jobs.
Service Level Reporting
Audience
When determining how best to report on the quality of services provided by the
IT department, various audience types should be identified and categorized along
with their interest areas and characteristics. Each audience category requires different information that varies in focus, granularity, and frequency. Many common
elements can provide the underlying information used in all reports; however, the
perspective and presentation format will differ by audience.
Executive Management
Executive management wants to know that the IT department is providing value
to the business overall and contributing to business success. As information technology becomes viewed increasingly as a competitive advantage, senior management becomes more attuned to the impact (positive and negative) of the service
quality delivered by the IT department. This includes understanding how enhancing the quality of IT services improves business competitiveness and efficiency.
Similarly, management understands that outages and degraded service cost the
business both in real dollars and in related lost opportunity costs. As IT services are provided directly to customers, such as with e-commerce and e-business
initiatives, the visibility of service difficulties increases and extends to the press,
financial community, and investors who assess the impact of service problems on
business viability and performance.
Reports aimed at the executive management team must be highly summarized and
outline the quality of service experienced by the company's personnel, customers,
and business partners. The report should directly relate the delivery of superior service to associated productivity improvements. Conversely, service outages or degradation should be related to real costs as well as lost opportunity costs in both
revenue and staff productivity.
Note
Although reports that include the business impact of service difficulties might be painful and
embarrassing, they build credibility and might be very helpful when asking for management's support to fix the problems.
Lines of Business
The lines of business are interested in knowing how the quality of services provided by IT helps them to drive more business. This means the reports should relate service levels to business transaction volumes, personnel productivity, and, where possible, customer satisfaction. Reporting the impact of service outages or service degradation in terms of these same business measures is equally important.
Note
Establishing the relationship between service quality and the ability to optimize business transactions is important.
Increases in business transaction volumes might be related to improved service levels, and business expenses might be reduced by improved staff productivity because of better service performance. These types of relationships can be shown with many different types of applications, such as automated manufacturing operations where the bottleneck might lie with computerized control, or customer-facing operations such as reservation systems where computer delays lead to increases in staffing levels and reductions in customer satisfaction. Therefore, it is important that this relationship be explained to the lines of business and utilized in service level reports. The goal is to quantify the business benefits associated with the reported service quality.
Internal to IT
IT must be service oriented in order to provide better support for the business.
To foster this orientation, the same service level reports provided to the lines of
business should be available to and reviewed by all levels of IT management. Many
IT departments are organizing first-level support along service lines rather than
technology layers. This provides a focus for service level reviews as well as natural
interface points for user communities. These service-oriented teams also act as the
user advocates within the IT department.
Additional reports showing all underlying technology outages and performance
degradation should be produced. Where possible, these reports should be correlated
with overall service quality using time as the common variable. These allow IT
management and technology-focused second-level support to relate the impact of
technology and component failures and degradation to the quality of service levels
delivered to the lines of business. Overall service delivery performance should be
graded against service level objectives. This ensures all IT personnel know how
well the department is performing overall, and how their particular role and the technology they support affect the achievement of these objectives.
Outside Customers
Summarized reports should be available to the customers of IT services who are
outside the corporation. These should provide information on the quality of the
services delivered to them, and should also outline the steps taken to improve service quality, particularly if customer expectations have not been met.
Tip
Regular customer satisfaction surveys should be conducted to relate the satisfaction of external customers to the service levels delivered by IT.
A powerful business driver can be established if service levels can be related to customer satisfaction and if there is a relationship between customer satisfaction, customer loyalty, and buying behaviors. If these relationships can be demonstrated, the
IT department is in a powerful position to show the true value of the services and
service quality it provides.
Tip
One aspect of service level reporting that can dramatically improve customer perception and satisfaction is the proactive notification provided by real-time reporting and alerts, as outlined in the
next section.
Types of Reports
Several different report types are required to provide sufficient detail on all the
aspects of service quality and to satisfy the interests and focus of the different audience types. The format and content of each report also varies with the frequency
with which it is produced. Reporting frequency is discussed in the next section.
This section outlines the components of a service level report; however, not all
reports incorporate all components and not all audiences are interested in receiving all components.
Executive Summary
This report provides an overall assessment of achieved service levels including
quantitative and qualitative reports against agreed service level objectives. It should
provide quick summaries of the quality of the services delivered and, preferably,
make effective use of graphs and charts to impart this information. Relating the service levels achieved to the associated business impact is an important aspect of the executive summary. The executive summary should be self-contained, particularly for end-of-period reports.
Performance Reporting
Performance must relate directly to the end-user experience and should be broken
out by online transaction responsiveness as well as batch job turnaround. To communicate effectively with lines of business and senior management, responsiveness
should be shown by application, user group, location, and line of business.
Tip
It might be beneficial to group transactions based on their characteristics, and to report the responsiveness of those transaction types individually as well as in aggregate.
The response times of the underlying technology components should also be calculated if possible and reported. These should then be graphed against
overall end-user response times to show any correlation using time as the common
variable. This enables IT management, capacity planners, and performance analysts
to focus on the most critical performance issues.
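One simple way to establish that correlation is to line up the component measurements and the end-user measurements on the same time axis and compute a correlation coefficient. The following sketch is illustrative only; the hourly figures are invented.

# Sketch: correlate a component's response time with the end-user response time,
# using time of day as the common variable. The hourly samples are invented.
from statistics import correlation  # available in Python 3.10 and later

database_ms = [120, 130, 180, 420, 400, 190, 140, 125]      # component measurements, 09:00-16:00
end_user_ms = [900, 950, 1100, 2600, 2500, 1150, 980, 940]  # what the users experienced

r = correlation(database_ms, end_user_ms)
print(f"Correlation between database and end-user response times: r = {r:.2f}")
# A value close to 1.0 suggests this component is the one worth investigating first.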
Workload Volumes
The lines of business and senior management want to see workload volumes
expressed in terms of business transaction rates. This provides a common basis for
discussion of workload levels between the IT department and the lines of business.
Note
I hose reports should be timed when reviewing security procedures with lines of
business, and when discussing ;md recommending additional security measures.
Recoveries
All outages should have an additional report outlining recovery time, technique
used to recover, and procedures implemented to prevent or reduce the impact of
subsequent occurrences.This report is very useful to ensure that IT operations
Income more proactive and move away from continuously operating in a reactive
niode.The preventive measures might require additional IT resources (human or
equipment) and might involve implementing new procedures within the IT
department or the lines of business.
Business transaction volume supports a better understanding of the value of the services provided by
IT, and provides a foundation for demonstrating the positive (or negative) impact of service levels on
Tip
Outlining the real costs, as well as the lost opportunity costs of the downtime caused by each outage together with the incremental expense of preventing future occurrences, allows an informed
Cost Allocation
II' costs are allocated to the lines of business, either to allow them to be charged
hack or to provide a pseudo profit and loss statement, an appropriate report will
he required. This report should outline the methodology used to calculate IT costs,
total costs calculated by this method, the mechanism used to allocate costs to individual lines of business, and the calculated cost for each line of business.
Caution
Using a cost allocation model can lead to unpleasant discussions about the amount of costs
involved, unless this report is also accompanied by the associated service level reports showing the
value of the IT services in business terms.
Security Intrusion
An important aspect of service quality is maintaining the confidentiality, privacy, and
integrity of business data. Thus, reports should be provided on security intrusion
attempts, security violations, and compromised or damaged data. Additionally, reports
on virus infections and their associated impacts are useful to understanding how
viruses are spread and for making decisions on preventive measures. In all cases of
data damage or security violations, the report should also include a summary of
techniques used to detect the intrusion, the recovery procedures used to restore data
integrity, and the processes and mechanisms used to prevent reoccurrence.
Many corporations choose not to allocate IT costs, and treat the IT department as
a general administration expense. This overlooks the value of IT as a competitive
advantage for the business and overstates the profitability of individual lines of
business. In this case, total IT costs are allocated to the lines of business using the
formula for allocating general administration costs.This is a simple way of allocating IT costs, but doesn't necessarily provide a true representation of how IT
resources are used in reality or the relative utilization and value of IT services to
each line of business.
However the allocated costs are calculated, the IT department has to decide
whether to produce a single report showing allocated costs for all lines of business,
or to produce individual reports for each line of business showing only the costs
associated with that organization. This decision depends on the culture of the
organization and any associated internal political ramifications.
Frequency of Reporting
Reports produced less frequently are aimed at a higher-level audience, and relate more to the business aspects of service delivery and the resulting business impact of service level quality.

Oracle Financials Service
Measure                                Grade
Availability for Normal Operations     B+
Proactive Problem Notification         A
Outage Recovery Times                  A
Responsiveness for Queries             A
Responsiveness for Update              B
Report Timeliness                      A
Security of Data                       A

Figure 3.1  A sample service level report card for the Oracle Financials service.

Daily Reports
Reports produced daily are detailed and show the quality of service provided by the IT department during the previous day. All reports and quality grades of the various aspects of service should be provided segmented by application, user group, location, and line of business. The report should also show how the service quality varied by time of day for each of these segments.
Tip
Daily reports are very detailed and are useful for identifying any patterns or trends in workload volumes or service quality that require analysis or improvement.
Detailed, daily reports are primarily for consumption within the IT department and, thus, can be more technology focused. However, the relationship between technology performance and the service quality experienced by end-users should be clearly established.
Figure 3.2 shows a sample daily report for a help desk application. This report clearly shows that the response time experienced by the users in Paris was problematic.

Vantive Help Desk Performance for Wednesday July 7, 1999
Measure                                Grade
Availability for Normal Operations     C
Proactive Problem Notification         A
Outage Recovery Times                  B
Responsiveness for Queries             A
Responsiveness for Update              B
Report Timeliness                      A
Security of Data                       A

Figure 3.2  A sample daily service level report showing response times by location for the help desk application.
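A report card such as the ones shown in Figures 3.1 and 3.2 can be produced mechanically once each measure has an agreed objective. The short sketch below illustrates the idea; the grade boundaries and attainment figures are assumptions, not values taken from the figures.

# Minimal sketch: convert measured attainment of each objective into a letter
# grade for a report card such as Figures 3.1 and 3.2.
# The grade boundaries and attainment figures are illustrative assumptions.

def grade(attainment_percent: float) -> str:
    if attainment_percent >= 99.5:
        return "A"
    if attainment_percent >= 98.0:
        return "B+"
    if attainment_percent >= 95.0:
        return "B"
    if attainment_percent >= 90.0:
        return "C"
    return "D"

measures = {
    "Availability for Normal Operations": 97.2,
    "Responsiveness for Queries": 99.7,
    "Report Timeliness": 99.9,
}

for measure, attainment in measures.items():
    print(f"{measure:38s} {grade(attainment)}")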
Weekly Summaries
Weekly summaries provide similar information to the daily reports, but are summarized relative to time. The service quality can be summarized by shift or half-shift for each day of the week rather than on an hourly basis. If additional detail is
required to explain a pattern or trend, the drill-down detail from a particular day's
report should be provided.
The weekly reports should start with a business focus of availability, performance,
and workload volumes, provide technology focused reports of these same measures, and highlight any correlation to show the impact of technology issues on
overall service delivery. Additional aspects to be covered in the report should be a
summary of security violations and attempted intrusion, as well as detailed analysis
of outages and recoveries.
Although the primary audience is the IT department, lines of business might also
want to review the weekly reports, particularly if the service quality was perceived
to be abnormal.
Monthly Overviews
The monthly overview is primarily a reporting mechanism for the lines of business and senior management. It should communicate the quality of services delivered by the IT department succinctly and should relate the quality of IT services to business value.
Tip
The report card format, combined with graphical explanation, enables clear understanding and quick interpretation.
All aspects of service quality should be covered, but allocated costs are not typically shown in monthly reports unless accounting and budget procedures call for monthly cost reporting.
Additional reports internal to the IT department should relate the availability and performance of the various technology layers to the business view of service. This allows the correlation between technology problems and associated business impact to be clearly established so that resources can be focused on the most important issues.
Quarterly Summaries
The quarterly summaries, which include allocated costs, can be very useful for a quarterly line of business review conducted by the IT department.
Tip
The quarterly summary report, combined with a line of business customer satisfaction survey, is an excellent vehicle for continuing the communications between IT and its internal and external customers.
These business reports can also be useful for understanding future plans and IT requirements for each business unit and for renegotiating service level agreements as necessary. The quarterly summaries are also where costs would typically be allocated if a chargeback mechanism were implemented. In order to ensure no surprises for the lines of business, additional exception reports might be required more frequently if anticipated costs are exceeded.
Real-Time Reporting
Outage Alerts
Tip
Providing a proactive mechanism that lets users know of problems identified by the IT department
significantly increases confidence of the user community in the IT department.
Figure 3.3 demonstrates a sample online reporting system in which one service is currently in warning status and one service is currently in alert status. Drill-down details for each service are available as shown in Figure 3.4.

Figure 3.3  A sample online alert system showing the service status for a specific location.

Figure 3.4 shows that the first warning is due to a problem with the document management system. This additional information shows the locations affected by the problem, together with the organizations, as well as the estimated time for service resumption.

Figure 3.4  Drill-down detail for a service in warning status, showing the affected locations and organizations and the estimated time of service resumption.

Planned Downtime
Tip
It might also improve user relations if the alert is published with sufficient notice to enable the user community to negotiate an alternative schedule for the downtime or to make other business arrangements.
Performance Degradation
Typically, users will experience performance degradation in the form of poor
responsiveness before a problem is identified by the IT department, unless technology solutions are implemented to proactively measure the end-user experienced
response times. In either case, as soon as the performance degradation is identified, an
alert should be sent to all impacted users notifying them of the condition. In many
cases, isolating the cause of the performance degradation is complex, and it might be
difficult to determine the length of time this will take. However, making users aware
of the problem and that the IT department is actively working to improve performance reduces the number of problem reports and calls to the help desk.
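Such a proactive measurement can be as simple as timing a synthetic transaction and raising an alert when the result exceeds an agreed threshold. The sketch below is purely illustrative; the probe address, threshold, and notification mechanism are assumptions rather than features of the product shown in the figures.

# Illustrative sketch: time a synthetic transaction and alert affected users
# when responsiveness degrades. URL, threshold, and notification are assumptions.
import time
import urllib.request

PROBE_URL = "http://orders.example.com/health"   # hypothetical service endpoint
THRESHOLD_SECONDS = 2.0                          # agreed responsiveness objective

def measure_response_time(url: str) -> float:
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=30) as response:
        response.read()
    return time.perf_counter() - start

def notify_users(message: str) -> None:
    # Stand-in for email, pager, or an intranet alert page.
    print(f"ALERT: {message}")

elapsed = measure_response_time(PROBE_URL)
if elapsed > THRESHOLD_SECONDS:
    notify_users(f"Order entry responsiveness degraded: {elapsed:.1f}s "
                 f"(objective {THRESHOLD_SECONDS:.1f}s). IT is investigating.")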
Service level reporting consists of a variety of report formats, each with different
content and production frequency. Each of these reports has a different set of audiences, and the report format and content should be tailored to each specific audience. Effective reporting requires an understanding of the audiences, and service
levels should be reported from the audience's perspective. Of particular importance
is establishing and reporting the relationship between the delivered service levels
and the business impact (both positive and negative). Additionally, real-time alerting and reporting mechanisms are very valuable to the service users, and can significantly enhance the reputation of the IT department as well as reduce the
workload on the help desk staff and systems.
Summary
Reporting achieved service levels is an important aspect of communicating
between the IT department and the lines of business it services. The reports will
vary by audience, frequency, and content detail; however, in all cases, the reports
should discuss the quality of services provided in terms the audience understands.
If possible, establishing a direct link between service quality and business impact
adds credibility to the reports. Additionally, the use of online alerts and proactive
notification of service degradation allows the IT department to show greater
responsiveness to the IT user community. This improves user relationships while
reducing the number of calls to help desk personnel.
Service Level
Agreements
Service Level Agreements (SLAs) are central to managing the quality of service
delivered by, or received by, an IT organization. More than anything else, SLAs are
what people think of when they discuss service level management. Obviously, as
discussed in Chapter 3, "Service Level Reporting," Service Level Reports are a key
component of service level management. However, without service level agreements, efforts to manage service levels are little more than a collection of good
intentions.
Imagine a cooking thermometer that has no markings on it. If you insert it into a roasting turkey and then take it out, it shows that your prospective dinner is hot enough to
register halfway up the thermometer. What does this tell you? Is the turkey nearly
done, barely thawed, or so overcooked that it soon can be converted to jerky?
From the data provided by the uncalibrated thermometer, you simply cannot
judge. If you have used this thermometer enough times prior to this, you might
be able to interpret its display. However, such interpretation requires considerable
experience and is highly subjective.
In far too many cases, IT managers are working with uncalibrated thermometers. That is, they are working with data that is difficult to relate to the service levels expected or provided. For example, detailed collections of data (such as packet collisions) might be very useful to network administrators and, indeed, might be related to the level of service being provided. However, that relationship is not immediately apparent.
Take the analogy of a cooking thermometer a bit further. Assume that you are going to cook some steaks for a group of friends. Now these are not just ordinary steaks. They are ostrich steaks. They are quite expensive, and you have never cooked them before. To make matters worse, your friends are quite particular about how their steaks are to be served. Each of your friends tells you how he or she would like their steak prepared. Two of them request rare. One requests medium and one requests medium rare, "but not too rare."
Because you have never cooked ostrich before, you do not feel that you will be
able to visually judge when the steaks are properly cooked. Fortunately, you have a
cookbook that tells you the temperature to which an ostrich steak must be cooked
for various degrees of completion. Although you have upgraded your thermometer
and now have one that has degrees marked on it, you still have a problem. You do
not know if the cookbook's standard for medium rare matches your friends'
expectations. What should you do? You could proceed to cook the steaks using the
standard specified in the cookbook and hope that your friends approve of your
decision. You decide that would be too risky. You could tell your friends to cook
their steaks themselves, thereby absolving yourself of responsibility. However, you
want to impress your friends with your great culinary skills, and telling them to
cook their own steaks could hardly be expected to impress them as you are hoping
to do. As you ponder your dilemma, you notice that your cookbook contains a
short description for each of the labels provided (rare, medium, and so on). You hit
upon an idea; you read the descriptions to your friends. A discussion follows in
which you learn that your friend who wanted her steak cooked medium rare, but
not too rare, in reality would like her steak cooked to what your cookbook
describes as medium-well done. The others agree with the cookbook's descriptions. Using your new thermometer, you cook the steaks to perfection and your
friends hail you as a great chef.
Functions of SLAs
Like defining how well steaks will be cooked, some basic benefits result from creating a Service Level Agreement. First, an SLA defines what levels of service are considered acceptable by users and are attainable by the service provider. This is particularly beneficial to the service provider. It guards against expectation creep. There is a basic characteristic in human nature to always want more and better, regardless of the subject. If you receive a huge raise, it is likely that, even if the cost of living does not change, you will be hoping for (or even demanding) another raise a year later.
In the case of IT services, if the availability of a key application is increased dramatically, higher than ever requested before, clients will soon become used to that level of availability and begin to demand an even higher level of availability, and they will vilify IT if it is not provided. If the expectations are documented in an SLA, they become a reference point, an anchor, for client expectations. In other words, the SLA provides permanence for the agreements arrived at and documented in it. More specifically, a well-written Service Level Agreement will define not only the expectations (how good is good enough), but it will also define a mutually acceptable and agreed upon set of indicators of the quality of service.
Service providers and their clients are like beings from different planets when talking about service levels. They tend to speak very different languages. The result is that it is often quite difficult for them to understand each other. Ultimately, an SLA, through those service level indicators, will provide a common language for discussing the service being delivered.
Note
There are six primary benefits that can be expected from Service Level Agreements. Those
benefits are
Provides permanence
Provides clarity
Serves as communications vehicle
Guards against expectation creep
Sets mutual standards for service
Types of SLAs
Broadly, there are three types of SLAs. The one that is most common is an In-House SLA. An In-House SLA is one between a service provider and an in-house client. An example of an In-House SLA would be the agreement between IT and a user department. The second most common SLA is an External SLA. That is, an SLA between a service provider and its client (another company). The third type of SLA is an Internal SLA. The Internal SLA is used by the service provider to measure the performance of groups within the service provider's organization. An example of the Internal SLA would be between the network services group in an IT organization and the overall organization, or perhaps the CIO. The Internal SLA is typically tied to annual reviews of managers and provides a mechanism for holding individuals and groups accountable for their portion of an overall service.
The process for creating an SLA is fundamentally the same for each type of agreement. Likewise, the contents that are found in each different type of agreement are basically the same. The differences come largely in the formality that is attached to the process of creating the agreement, the language that is used, and the consequences that will result if the service level commitments are not met.
In-House SLAs
When the service provider and client work for the same company, familiarity should not be allowed to preclude establishing a detailed, legally binding contract. If the SLA is constructed in a considered, serious way, the results can benefit both parties as well as the company itself. Most large banks and financial institutions, for
External SLAs
The most rigorous type of agreement is the External SLA. Because it is usually a legally binding contract between two companies, it requires more care in crafting it. Legal review of the External agreement is strongly advised. However, many companies overlook this step and, as a result, end up with an agreement that is of little value. Of course, another error, at least as serious, is to fail to have SLAs with external service providers. The lack of SLAs has proven disastrous for many companies.
Caution
Have External SLAs reviewed by an attorney before signing.
Internal SLAs
The Internal SLA is a relatively simple matter. It typically is written in an informal
manner. In fact, the Internal SLA might not exist as a separate agreement. Instead,
its commitments and intent might be embodied in other documents, such as individual or departmental goals and objectives, or even in the criteria for the company's bonus plan. Frequently, the Internal SLA will specify service levels in very
technical terms. The use of technical terminology, and even jargon, can be acceptable in this document because all the parties are familiar with the terms and
understand them.
SLA Processes
Service level management is a process. While the SLA itself is a document, it is the
product of a process. Processes are required to create, maintain, and administer the
Service Level Agreement.
Creation Process
The process of creating an SLA typically follows a series of predictable steps, which
are summarized in the following section. Although this guide is applicable to most
SLAs, keep in mind that every situation is different. In many instances, SLAs will
need to be tweaked to accommodate special functions and measurements. It is
crucial for IT managers to follow their instincts in tailoring each SLA to fit the
requirements of particular constituents.
SLA creation begins with a serious commitment to negotiate an agreement. It
would be easy to say the groups involved must make that commitment. However,
in reality, commitments are not made by groups, but by individuals. In this case, the
commitment needs to begin with senior management of the groups involved;
that is, the management of the service provider organization and the user or client
organization. Ideally, the commitment is made at the very highest levels in the
respective organizations.
Assemble a Team
When there is executive commitment to creating an SLA, the next step is to
assemble a team of people to actually negotiate the terms of the agreement. It is
important that the members of the team be personally committed to the success
of the process; that is, committed to creating a fair and reasonable Service Level
Agreement.
In order to assemble a team to negotiate an agreement, it is necessary to determine the team size and membership. As with most questions in life, there is not a single, one-size-fits-all answer to this question. In part, the size of the team will be dictated by the culture of the company. However, some guidelines can be offered. The team needs to be large enough that each stakeholder group is represented on the team. (A stakeholder group is any organization that is either engaged in providing the service or is a user of the service.) In most cases, the agreement will have only two stakeholders, the service provider and the user. Although it is possible for the team to consist of just two individuals, that is not very common. The typical size of an SLA negotiating team in a medium to large company is 4-10 people. Every effort should be made to minimize the size of the team, although the realities of corporate politics might dictate otherwise.
Ideally, every team member should have something unique to contribute to the
process, such as knowledge about how the service impacts the users, the limitations
of the technology being used to deliver the service, and so on. The members need
to be, in some respect, subject-matter experts on some aspect of the service delivery or consumption.
Note
In assembling a team to negotiate a service level agreement, there are four points to keep in mind.
First, there should be equal representation from the service provider and their client. Second, the
leaders of the team should be peers. Third, the members of the team should be stakeholders. That is,
they should have a vested interest in the service being provided. Fourth, the team members need to
be subject matter experts; for example, knowledgeable about the service and its business impacts.
There should be equal representation on the team from both the user group and
the service provider. Too great a disparity in numbers will give an unfair psychological advantage to the larger team.
At a minimum, the leaders from each group need to be peers. It is difficult for
effective negotiations to take place if there is a significant disparity in rank among
team members. Some companies are more sensitive to this than others. If peer relationships are not considered, this can lead to one group being apt to dictate (possibly unintentionally) the terms of the SLA, rather than negotiating them through a
process of exchange between peers. Another requirement for the team leaders is
that they have sufficient authority to commit their organization to the SLA.
The negotiating team should have a charter, written by the leaders, that specifies
its responsibilities, membership, leadership, structure, and functioning. In terms of
functioning, the charter needs to include a schedule for the development of the
SLA. It is advisable to make the schedule aggressive. A pitfall of some teams, especially large ones, is that some members almost make a career of the negotiation
process. In most cases, depending on availability of the team members, it should be
possible to negotiate an agreement in 6-8 weeks.
Tip
Negotiating team meetings should be brief and infrequent. Most of the work will be done outside
the meetings.
As noted previously, the SLA is a contract. When the negotiations have been completed, the next step is to document what was agreed upon. The basic components of a Service Level Agreement are as follows:
Parties to the agreement
Exclusions
Term
Reporting
Scope
Administration
Limitations
Reviews
Revisions
Approvals
Non-performance
Optional services
Tip
Do your homework before negotiating. You should know the following:
Cost of delivering a given level of service
Term
Typically, the term of the SLA will be two years. Creating an SLA is too much work to warrant an agreement term of much less than two years. At the same time,
technology and business conditions change too rapidly to be able to confidently
expect the agreement to be valid beyond two years.
Scope
This section will define the services covered by the agreement. For example, an
agreement might specify that it covers an online order entry system, the facilities
where the users will be located, volumes of transactions anticipated, when the
service will be available (days of the week and hours of the day). Note that this
section does not specify the levels of services to be provided. In the preceding
example, nothing is mentioned about the percent availability for the service.
Limitations
This section of the agreement can be thought of as the service provider's Caveat
clause. This section basically qualifies the services defined in the Scope section of
the agreement. The service provider is saying, "We will provide the services covered by this agreement as long as you don't exceed any of the limitations." Typical
limitations are volume (for example, transactions per minute or per hour, number
of concurrent users, and so on), topology (location of facilities to which the service is delivered, distribution of users, and so on), and adequate funding for the
service provider. These types of limitations are quite reasonable.
In order to enter into the Service Level Agreement, the service provider has to
believe that they have adequate resources to meet the commitments of the agreement. Making this commitment without these limitations would be like someone
agreeing to feed you and all the members of your household for a lump sum payment of $10,000. This might be a good deal for either party of this agreement. If
your immediate family consists of just yourself and your very petite wife, the person agreeing to provide the food has struck a great bargain. However, one month
into the term of your agreement, your five children (two of whom are training to
become sumo wrestlers) move back into your house. Also, you decide to host a
foreign exchange student. The student happens to be a 300-pound weight lifter.
Suddenly the balance of the equation has shifted. Without limitations in the agreement to provide food for your current household, the other party to the agreement will start losing money after the second month of feeding your enlarged
household. In business, equally dramatic changes can occur. Mergers and acquisitions can bring sudden increases in workload, as well as shifts in network traffic
patterns. Closing or opening facilities will shift workloads and might require new
links for your network. Consolidation of functions into fewer locations might
change traffic patterns. Growth of the business is also a source of additional data to
be handled.
Service Level Objectives
More than any other factor, the service level objectives are what most people think of when they refer to SLAs. The service level objectives are the agreed upon levels of service that are to be provided. These might include such things as response time, availability, and so on. For each aspect of the service covered by the agreement, there should be a target level defined. In fact, in some cases it can be desirable to define two levels for each factor. The first will be the minimum level of service that will be considered acceptable. The second will be a stretch objective. That is, the second number will reflect a higher level of service that is desirable, but not guaranteed. Clearly, the second category is optional, and if it is utilized in an SLA, it will normally have some type of incentive or reward associated with meeting it.
The most popular categories of Service Level Objectives are availability, performance, and accuracy. Availability can be specified in terms of the days and hours that the service will be available or as a percentage of that time. It is generally best to specify the time period when the service is expected to be available and then define the minimum acceptable percentage of availability. Performance can include measurements of speed and/or volume. Volume (also referred to as throughput or workload) might be expressed in terms of transactions/hour, transactions/day, or gigabits of files transferred from one location to another. Speed includes the always-popular response time objective. However, speed is not limited to just response time. It could also include the time required to transfer data, retrieve archived files, and so on. The objective for accuracy is basically centered on the question of whether the service is doing what it is supposed to do. For example, are email messages delivered to the intended recipient? Although availability, performance, and accuracy are the most popular categories for objectives, they are by no means the only objectives. Other categories include cost and security.
In any discussion of service level objectives, a question always raised is, "What is the right number of objectives?" Although there is not a specific number that is always the correct number to use, this is a case in which the principle of brevity has merit. Including more objectives does not automatically raise the quality of the SLA. In general terms, 5-10 service level objectives are usually sufficient. This number of objectives is usually sufficient to cover the most important aspects of the service. Including more objectives usually means that less important objectives are being introduced and drawing attention away from the more important ones. If there appears to be a large number of critical service level objectives that need to be included in the agreement, the SLA team should carefully consider the possibility that they are attempting to cover more than one service with the agreement. If that is the case, they should redefine their effort and write separate SLAs for each service.
Tip
Limit the number of service level objectives to 5-10 critical objectives.
A service level objective must be meaningful to all the parties to the agreement. Another way of stating this is to say that it must be relevant. An IT organization might consider an important metric to be CPU utilization for the servers used to deliver the service in question. However, from the users' perspective, the relevance of this to the service they receive is difficult to grasp.
Another requirement for a service level objective is closely related to the need to be meaningful. That is, the service level objective and its associated metrics must be understandable. Interviews of IT managers by Enterprise Management Associates have found that some of them are providing the users with statistics that are intended to reflect service levels. Unfortunately, those statistics tend to be ones that are easily captured, and mean little to anyone other than a network engineer or system administrator. Two of the more popular statistics reported were packet collisions and dropped packets. Although these might impact the level of service being delivered, they are not readily related to what the user is experiencing. In fact, these statistics meant little or nothing to nearly all the users receiving the reports. Thus, those statistics failed the tests for being understandable and for being meaningful.
Figure 4.1  Map of Indonesia.
You might wonder why an IT organization would ever commit to a service level
objective that it cannot meet. There are a variety of reasons for this. The IT representatives on the SLA team might have been poor negotiators. The user team might not
have negotiated in good faith, approaching the process from a win-lose perspective.
The negotiators might not have been peers with the more senior representatives on
the user team. Another possibility is that the IT representatives failed to do their homework. Even if the team members, individually, lacked specific knowledge about the
connection in question, they certainly should have been able to research it and make
an informed response to the request for this level of service for user response time.
Note
In order for service level agreements to be successful, the criteria that they use to measure the level
of service must be
Attainable
Meaningful
Measurable
Controllable
Understandable
Affordable
Mutually acceptable
The next requirement for a service level objective is that it must be measurable.
Among users, a very popular service level objective is user (that is, end-to-end)
response time. Certainly, this is one of the key factors shaping the users' opinions
about the level of service they are receiving. Unfortunately, measuring user
response time on an end-to-end basis is still a technical challenge today. At the
other end of the feasibility spectrum is the availability of a service. This is relatively
straightforward and can be measured with a minimum of effort and difficulty. If it
is not possible (and affordable) to measure something to represent a service level
objective, that objective is worthless and should not be included in an agreement.
A service level objective belongs in an SLA only if it represents something that is
controllable. That is, the service provider must have the ability to exercise control
over the factors that determine the level of service delivered. If unlimited resources
are available, it is difficult to conceive of a common IT-provided service that is not
controllable. However, faced with the limitations of the real world, such conditions
become much more plausible. Consider the IT manager in a Third World country.
The manager's budget does not permit the purchase of a standby generator to prevent service interruptions during the frequent power failures. Strikes by union
workers, poor service by a telco (with no reasonable alternative), and so on are just
a few of the factors that can place certain service level objectives beyond the control of the service provider. When assessing whether an objective is controllable,
consider providing exclusions (or waivers) for factors that are not controllable and
that might impact the level of service provided.
As has been previously mentioned, no organization has unlimited resources. The
amount that can be spent on delivering any service is limited. Therefore, in setting
service level objectives, it is also necessary to consider whether the desired level of
service is affordable. (This might also be thought of as being cost effective.) The
first way to look at this is by considering whether the desired level can be delivered within the existing budget of the service provider, without adversely impacting any other services. If it can, there is no question that it is affordable. If it
cannot, the question becomes more difficult to answer. It is necessary to consider
the business value of the desired level of service compared with the current level.
In one case, the client of an IT organization was adamant that for the service in
question (order entry system), they absolutely had to have 99.999999% availability.
The current availability for that system was 99.999%. Instead of digging in their
heels and insisting that the higher availability was impossible, the IT organization
did their homework. They researched what changes would be required in order to
deliver the requested availability and the cost of those changes. They returned to
the user organization and explained that they would be happy to provide the
desired level of service if (as was the company policy) the user organization would
provide the necessary funds. It was explained that the cost of the necessary changes
would be $87 million initially and $8 to $10 million per year thereafter. Suddenly the
user organization decided that a more modest increase would be acceptable
(99.9999%). Another aspect of affordability pertains to the cost of collecting the
data for service level reporting. Like so many things, this often becomes a tradeoff
between precision and cost.
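The gap between availability figures such as these is easier to appreciate when translated into allowed downtime per year, as the short calculation below illustrates.

# Quick arithmetic: allowed downtime per year for several availability levels.
minutes_per_year = 365 * 24 * 60

for availability in (99.9, 99.99, 99.999, 99.9999, 99.999999):
    allowed_minutes = minutes_per_year * (1 - availability / 100)
    print(f"{availability:>11.6f}%  ->  {allowed_minutes:10.3f} minutes/year "
          f"({allowed_minutes * 60:.2f} seconds)")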
Finally, the service level objectives that are included in an SLA must be mutually
acceptable to all the parties to the agreement. It is not possible for a viable, effective agreement to be arrived at if one of the parties to the agreement simply dictates the terms of the agreement. Creating an SLA is a process of negotiation to
arrive at a result that both parties consider acceptable and that they both feel they
can live with for the term of the agreement.
potential trouble spots. Or the problem could be in one of the many network connections linking the client to the server: a router could be down in the network, the communications server at the user site could be down, or the application could be running but not responding because it is waiting for some critical resource. Any of these examples would prevent the users from being able to access the application. From the user's perspective, however, the truth is that the application is unavailable.
Tip
Remember that the user's perspective is the one that counts.
Therefore, it can be seen that careful thought must go into defining what indicators will be used to provide metrics to represent each service objective. In some cases, the service level indicators will be the same as the objective they represent. In other cases, the indicators are an indirect representation of the service level objective.
Consider the case of the availability of an order entry system. Ideally, there will be a single indicator for the service's availability; that is, an indicator which reflects the overall availability of the service to the end user. Unfortunately, in the case of the order entry system, there is not a way to directly measure the availability of the service. However, it might be possible to develop an estimate of the service's availability. Perhaps a special application can be constructed that will reside at the users' location and periodically test the service's availability (perhaps by submitting an inquiry transaction). However, security and other concerns might preclude such an approach.
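Such a probe might look something like the following sketch. It is illustrative only; the inquiry transaction itself is left as a stub, and the sampling interval and sample count are assumptions.

# Illustrative sketch: a client-side probe that periodically submits a harmless
# inquiry transaction and accumulates an availability estimate from the results.
import time

def submit_inquiry_transaction() -> bool:
    # Stand-in for a real, harmless inquiry against the order entry service.
    # It should return True only when a correct answer arrives in time.
    raise NotImplementedError("replace with a real inquiry transaction")

def probe_availability(samples: int = 12, interval_seconds: int = 300) -> float:
    successes = 0
    for _ in range(samples):
        try:
            if submit_inquiry_transaction():
                successes += 1
        except Exception:
            pass  # any error or timeout counts as an unavailable sample
        time.sleep(interval_seconds)
    return 100.0 * successes / samples

# Example (commented out because the stub above must be filled in first):
# print(f"Estimated availability over the last hour: {probe_availability():.1f}%")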
Continuing with our example, in the event that it is not possible to develop a single measurement that represents the overall availability of the service, it becomes
necessary for the SLA to define what will provide an adequate approximation of
the service's availability. One approach might be to track the availability of each of
the components required for the delivery of the service (for example, application,
server, network, and so on). Obviously, this is not a perfect solution,
but it might be good enough. If more precision is necessary, it is possible to analyze and correlate the data to provide a better view of overall availability. However,
greater precision will normally carry with it greater complexity, greater cost, and
higher likelihood of error. Remember that perfection is unlikely and compromise
is an inherent part of the SLA process.
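When component availabilities must stand in for an end-to-end measurement, one common approximation (an assumption here, not a method mandated by any particular SLA) is to treat the components as a serial chain and multiply their individual availabilities, as the sketch below illustrates.

# Sketch: approximate end-to-end availability from per-component availability by
# treating the components needed to deliver the service as a serial chain.
# The component list and figures are illustrative, not prescribed by the SLA.

component_availability = {
    "application": 0.998,
    "database_server": 0.997,
    "network": 0.995,
    "user_site_gateway": 0.999,
}

estimate = 1.0
for availability in component_availability.values():
    estimate *= availability

print(f"Approximate end-to-end availability: {estimate:.2%}")
# Overlapping outages and redundant components both make the true figure differ
# from this serial-chain estimate, so treat it only as an approximation.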
Whatever is chosen, the SLA needs to document each of the service level indicators that will be used to represent each of the service level objectives. It will be
necessary to specify the data source for each of the indicators.
Non-Performance
Although raising the visibility of a problem within their company might not be something that the sales organization or the support organization consider desirable, it is clearly in the customer's best interests. Another interesting aspect to the use of this contract clause is that it has almost never actually been applied. The reason that it has not been applied is twofold. First, and most importantly, the vendors will turn their organizations inside out to make sure that they don't have to make the penalty payment. Second, in all honesty, the terms that the customer specifies are so loose that almost any action on the vendor's part will satisfy the language of the contract. However, in their fear of incurring any penalty, the vendors seem not to recognize this and go far beyond what is required in the contract.
One last point about this example is warranted. The penalty is calculated as a percentage of the annual maintenance fee. The result is a potential penalty that is minuscule. Consider a software product that costs $30,000 and has an annual maintenance fee of $4,500. The contractual penalty for not responding to a complaint from the customer about a problem would be calculated based on the amount of time in excess of that allowed in the contract. Assume that the agreement specifies that the vendor will respond to a serious problem within one business day. If the vendor actually takes three business days to respond, the violation of the agreement would be two days. Even though the violation is measured in business days, the maintenance year is specified in calendar days, so the 2-day violation is divided by 365 calendar days. This result (0.005479) is then multiplied by the annual maintenance fee. Therefore, in this case the penalty would be $24.66! A ridiculously small amount, yet sufficient to make very large companies jump through hoops to avoid it. A simpler alternative to determining the amount of the penalty is to simply specify a dollar amount for a period of time (minutes, hours, days, and so on) of violation of the agreement.
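Worked through in code, the arithmetic of this example is as follows; the figures are the ones used above.

# The penalty arithmetic from the example above.
annual_maintenance_fee = 4_500.00     # on a $30,000 software product
days_late = 2                         # responded in three business days instead of one
calendar_days_per_year = 365

penalty = (days_late / calendar_days_per_year) * annual_maintenance_fee
print(f"Fraction of the year late: {days_late / calendar_days_per_year:.6f}")
print(f"Penalty owed: ${penalty:.2f}")  # approximately $24.66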
Although in the previous example a token penalty was sufficient, you cannot
always rely on a token payment to produce the desired effect. The key to a penalty
for non-performance being effective is that it must cause pain or discomfort
within the service provider's organization. The objective is to maximize the discomfort so that in the future the service provider will choose to ensure the proper
level of service is delivered, rather than to suffer the discomfort that will result
from non-performance. The most obvious way to cause pain is through a large
financial penalty. However, smart service providers won't agree to terms that can
result in very large penalties. Also, although financial penalties are possible with
internal service providers, they are more difficult to implement. Another drawback
to the financial penalty is that a large penalty can cripple the service provider,
making it even more difficult for them to meet their commitments. On the other
hand, a small penalty applied to an unscrupulous service provider can become
another incidental cost of doing business for them, less expensive than providing
the level of service specified in the agreement.
Tip
Penalties for non-performance should be sufficiently large so that they will cause pain within the
vendor organization. However, even small penalties can be constructed in such a way as to make
them painful.
Creating effective penalties for non-performance calls for creativity. Some of the
best penalties do not involve money. For example, an effective requirement might
be specified that in a case of non-performance, the head of the service provider's
organization must meet with the head of the client organization and provide an
explanation. It might be even more effective if it was stipulated that the meeting
had to be in person, at the client's office, and within 48 hours of the determination
of the non-performance condition. You will be most successful in creating effective
penalties if you know as much as possible about the service provider. This way, you
can better understand which penalties will have the maximum effect within their
organization. The bottom line is that you must be creative to be effective in defining penalties for non-performance. Also, never accept a service provider's claim that
they never agree to penalty clauses. First, they probably already have done so with
other clients. Second, they will do so if they want your business, particularly if you
are creative in defining the penalties.
Tip
Do not accept a service provider's claim that they never agree to penalties for non-performance. This
is almost certainly untrue and even if true can be circumvented through persistence and creativity.
It is very important that both parties have a clear understanding of what constitutes non-performance. Consider the example of an agreement that specifies a
response time of 2.2 seconds for the order entry application. Is that requirement
an absolute threshold that must never be crossed? Or is the response time specification actually referring to an average? Alternatively, there might be some
allowance for the threshold to be exceeded under certain circumstances (refer to
the section "Limitations") or for a maximum number of transactions in a given
time period or for a maximum allowable period in the busiest part of the day.
What is most important here is that both of the parties have a clear understanding
of what constitutes a violation of the agreement and therefore warrants some consequential action.
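To see how much the interpretation matters, the following sketch judges the same invented set of response-time samples against a 2.2-second objective read three different ways.

    # Sketch: the same response-time samples judged under three readings of a
    # "2.2 second" service level objective. Sample values are invented.

    samples = [1.8, 2.0, 2.5, 1.9, 2.1, 2.6, 1.7, 2.0]   # seconds
    target = 2.2

    absolute_ok = all(s <= target for s in samples)        # never exceeded?
    average_ok = sum(samples) / len(samples) <= target     # average within target?
    allowed_exceptions = 2                                 # tolerated slow transactions
    tolerant_ok = sum(1 for s in samples if s > target) <= allowed_exceptions

    print(absolute_ok, average_ok, tolerant_ok)            # False True True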
Caution
Beware of non-performance remedies that provide the compensation in the form of future services.
Bad service is never a bargain.
In constructing the non-performance section of the agreement, creativity and flexibility are important. These are particularly important when dealing with internal service providers. It does not make sense to negotiate a penalty for non-performance
that consists of reducing IT's budget. That would effectively reduce their ability to
meet the users' requirements. Instead, non-financial penalties should be considered
(for example, reductions in individual bonuses, and so on). Also, as an alternative to
penalties, internal service providers can be motivated by rewarding them for meeting
or exceeding the service level commitments in the SLA.
Optional Services
There might be additional service components that are not normally provided, or
are not provided at this time. However, if there is reason to anticipate that the
user might want some of these options within the term of the SLA, it is wise to
include a provision for that in this agreement. For example, a company might not
currently be open for business on Sunday, allowing IT to perform batch processing
and system administration work during the day. However, if it is anticipated that
Sunday work will be required during the Christmas holiday season, hence
requiring the availability of the online systems, the possibility should be included
in the agreement.
Exclusions
In addition to spelling out the services that are covered by the agreement, the SLA
should also specify what is not included in the agreement. Some common sense is
warranted here. Obviously, if the agreement covers the online order entry system, it
is not necessary to specify that the agreement does not include the payroll system.
Instead, the exclusions that are specified in the SLA are those categories that might
reasonably be assumed to be covered. For example, it might be appropriate to specify
that the service encompassed by the order entry system's SLA does not cover the
entry of orders by customers via the company's Web site. Clearly the e-commerce
activity, although important to the company and a means by which orders can be
received, is not part of the current order entry system. The e-commerce component
is too distinct to be covered by the SLA for the online order entry system. It has different users, employs different software, is accessed differently, and so on. What might
be appropriate to consider for inclusion in the agreement would be the interface
through which e-commerce orders are received by the order entry system.
Reporting
The reports generated for the Service Level Agreement are key components of the
SLA process. Without reports, the agreement is left merely as a statement of good
intentions. The lack of reports would mean that it would never be possible to contrast actual performance against the stated objectives contained in the agreement.
The reports must be relevant to the service level objectives and reflect the service
level indicators. Like the service level objectives, users must readily understand
them, even the ones who have no understanding of the underlying technical
issues. In many cases, graphs are the best way to represent the information about
the service level performance. However, remember that some users will want to
look at the data more closely. Therefore, it is advisable to have the supporting data
available in tabular form for those who want to review it. Another recommendation is to keep the reports simple and focused. Although it might be easy to distribute copies of a report already being produced that includes the required
information (plus a lot of other information), it is unwise to use this report.
Instead, it is better to distribute reports that contain only the specific information
required by the SLA. Additional information can be confusing or lead to misunderstandings. Reports might contain information about multiple service level indicators, but should not contain extraneous data.
Who Can You Trust?
A certain degree of caution is appropriate with the SLA reports, particularly if the employees producing the reports stand to gain personally from the results reflected in the reports. A large company learned this lesson painfully. The company had an internally developed trouble ticketing system.
The system was not terribly sophisticated, but it was adequate for the needs of that company. One
day, an executive got the bright idea that he could motivate the IT department employees to provide
better service (higher availability) by linking their quarterly bonuses to the level of service that was
being delivered to their clients. On the face of it, this seemed like a reasonable idea. The question
then became how to measure the service being delivered. Someone hit upon the idea of using the
trouble ticketing system. This seemed reasonable because it did track every outage and, by implication, could then be used to calculate the remaining availability.
Data from the trouble ticketing system was analyzed and it was agreed that this could be used to
provide a reasonably accurate indication of service availability. The decision was made to implement
the plan.
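As a rough sketch of the kind of calculation the company relied on, availability can be backed out of the outage durations recorded in closed trouble tickets; the figures below are invented for illustration.

    # Sketch: deriving monthly service availability from trouble-ticket outage
    # records. The ticket durations are invented example data.

    outage_minutes = [45, 120, 15, 30]        # outage duration per closed ticket
    minutes_in_month = 30 * 24 * 60           # scheduled service time for the month

    availability = 1 - sum(outage_minutes) / minutes_in_month
    print(f"availability: {availability:.3%}")   # 99.514%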
Tip
Remember that graphs can convey more information and be more readily understood than
tables. Therefore, whenever possible, use graphs to display information about actual service level
performance.
It should be noted that the trouble ticketing system was not perfect. Its greatest weakness was the
fact that it relied on individual IT employees (Help Desk) to manually enter information about service
interruptions. The employees were quite good about opening trouble tickets when a problem
occurred. However, when they were very busy, or if multiple people were involved in resolving the
problem, sometimes a trouble ticket might not be closed at the time that the problem was resolved.
The SLA should contain a list of each of the reports that will need to be produced
in support of the agreement. For each report, the SLA should specify the name of
the report and when it will be produced (frequency). It should also indicate which
service level indicator(s) are reflected in this report. There should be a brief
description of the content of the report and possibly even an example of the
report itself. A description of the source of the data for the report should be
included in this section of the agreement. Although this might seem tedious, it
does prevent misunderstandings later. Also, it can serve as a limited guard against
unethical manipulation of the reports during the term of the agreement. For each
report specify the following:
Report name
Frequency
Content
Data sources
Responsibility
Distribution
This could result in some trouble tickets being open for several days, or even longer. This problem
had been discovered a couple of years earlier and the program was modified to allow anomalies like
this to be corrected. At the end of each month, any apparent problems of this type were researched
and new information was entered to reflect the correct duration of the problem. This facility became
the source of a problem of a very different type.
The employees responsible for researching outages and, if appropriate, entering the correct information were part of the same group whose bonuses were tied to service availability. It did not take
long for these individuals to figure out that the facility used to correct errors in outage durations could
also be used to ensure that their group always met or exceeded the objectives that had been established for service availability. Within a couple of months of implementing the idea to link bonuses to
service availability, availability had soared. The executive who had conceived the plan was congratulated.
Executives from the user departments grumbled that they had not seen any improvement in service.
However, this was initially dismissed with the thought that the users would never be satisfied no
matter how much service improved. After about eight months and continued complaints from the
user departments, supported by their own documentation of problems, the IT department decided to
investigate the situation. The investigation did reveal that the employees had been falsifying the
records in order to meet availability objectives and thereby maximize their personal bonuses. As a
consequence, the link of availability to bonuses was discontinued and accurate reporting returned.
Amazingly, no one was ever disciplined for this scam.
The SLA needs to specify who will be responsible for producing the reports. The
responsibility should be specified by position or group rather than by individual.
It is also necessary to include specifications about the distribution of each report.
As illustrated in the sidebar, care must be taken not to create a situation in which
a conflict of interest might arise and lead to the reports being compromised. At a
minimum, it should list the groups, or positions, that are to receive the reports.
However, it is also desirable to specify whether the report will be produced in hard
copy or electronic form. If electronic copies are chosen, the SLA should specify
how the report would be distributed (email, Web, and so on).
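To make the reporting requirements concrete, a minimal sketch of one report specification follows; every value in it (the report name, indicators, sources, and recipients) is a hypothetical example rather than wording from any actual agreement.

    # Sketch: one SLA report specification captured as a simple structure.
    # All values shown are hypothetical examples.

    order_entry_availability_report = {
        "report_name": "Order Entry Availability Summary",
        "frequency": "monthly",
        "service_level_indicators": ["application availability"],
        "content": "Availability per business day, with a monthly average",
        "data_sources": ["trouble ticketing system", "availability monitor"],
        "responsibility": "IT service level reporting group",
        "distribution": {"format": "email",
                         "recipients": ["order entry department head",
                                        "IT operations manager"]},
    }

    for field, value in order_entry_availability_report.items():
        print(f"{field}: {value}")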
Administration
This section of an SLA describes the ongoing administration of the SLA and the
processes that it specifies. In this section, there needs to be a description of the
ongoing processes and a definition of where in the organization responsibility for
each process lies.
Reviews
Periodically, the SLA needs to be reviewed to verify that it is still valid and that its
processes are working satisfactorily. It is possible for a review to occur at any time,
if both parties are agreeable to doing so. However, the SLA needs to specify times
when regular, periodic reviews will occur. In a typical agreement, with a term of
24 months, three reviews should be scheduled. The first review should be held six
months after the agreement is put in place. The other two reviews should occur in
the twelfth month and the eighteenth month.
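For example, a small sketch like the following (with a hypothetical start date) works out those review dates for a 24-month agreement.

    # Sketch: scheduling reviews at months 6, 12, and 18 of a 24-month SLA.
    # The start date is a hypothetical example.

    from datetime import date

    def add_months(start, months):
        """Return the same day of the month, `months` later (day clamped to 28)."""
        month_index = start.month - 1 + months
        year = start.year + month_index // 12
        month = month_index % 12 + 1
        return date(year, month, min(start.day, 28))

    sla_start = date(2000, 1, 15)
    review_dates = [add_months(sla_start, m) for m in (6, 12, 18)]
    print(review_dates)   # three review dates: 2000-07-15, 2001-01-15, 2001-07-15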
In a review of an SLA, some fundamental questions need to be addressed. The first
question is whether the agreement and its associated processes are functioning as
intended. Particularly in the first review, it is important to address the question of
whether the agreement and its service levels are still acceptable. The reviews need
to consider whether any changes are required. For example, it might be necessary
to replace a service level indicator because data is no longer available for it. Or, it
might be necessary to redefine responsibilities or report distributions because of an
organization restructuring.
SLA reviews can range from very informal to very formal. They can be little more
than two department heads (for example, the former negotiating team leaders) discussing the SLA over a cup of coffee. Alternatively, at the other end of the spectrum, the review might consist of reconvening the entire SLA negotiating team.
The method chosen will depend in large part on the culture of the company, the
warmth or coolness of the relations between the departments involved, and the
user department's satisfaction with the service levels being delivered.
Revisions
When an SLA is put into place, it should be expected that revisions to it would be
necessary. The agreement is not set in concrete, nor are the organizations that it
serves. Revisions are very common and tend to be driven by a variety of factors
including requirements, technology, workload, staffing, staff location, mergers and
acquisitions, and so on. When revisions are necessary, a new agreement will need to
be written and approved. As with the agreement reviews, the process can be quite
informal or require a lengthy negotiation process.
Approvals
After all the details for an SLA have been defined, and all the parties are in agreement, the agreement needs to be signed. In the case of an SLA with an external
service provider, this is obviously necessary. With internal service providers, the
need to sign the agreement might be less obvious, but it is just as important. In
signing the agreement, both parties are formally acknowledging that they are in
agreement with its terms and are committed to its success. The person signing the
agreement for the service provider should be the person who has authority over
all aspects of the services covered by the agreement. Likewise, the user signing the
agreement should be the overall department head, that is, the person to whom all
the users of the service report. However, regardless of level, the individuals signing
the agreement must have authority to sign the agreement and have an interest in
its success.
Summary
Service Level Agreements are a key component to any service level management
process. To begin with, they provide a basis for effective dialog between the client
and the service provider. They can be beneficial to both the service provider and
the client because SLAs hold both parties accountable. That is, the client is forced
to define the level of service that will be considered acceptable. On the other
hand, the service provider is held accountable for delivering the level of service
to which they have agreed. To be effective, the SLA must be negotiated fairly
and in good faith. When established, the SLA is one of the most effective vehicles
available to the service provider for managing client satisfaction.
CHAPTER 5
Standards Efforts
Industry standards for service level management are not very mature at this time,
and in general there is a lack of industry-accepted methodologies, practices, and
standards in place. Most standards efforts have focused on infrastructure management rather than service management. This is at least partially because of the difficulty of setting standards for defining, measuring, and managing services, which is
a more complex issue than standards for monitoring and configuring individual
devices and components.
The most notable standards effort to date has been driven by the UK
Government's Central Computing and Telecommunications Agency (CCTA).
CCTA has delivered a documented methodology for managing service called the
IT Infrastructure Library (ITIL). Other efforts include the Service Level
Agreement (SLA) Working Group created by the Distributed Management Task
Force (DMTF) and the Appl MIB by the Internet Engineering Task Force (IETF).
A more focused effort, the Application Response Measurement (ARM) Working
Group is supported by several vendors, as well as a special interest group sponsored
by the Computer Measurement Group. The rest of this chapter looks at these
efforts in more detail.
Tip
Standards efforts continually change over time, with some efforts gaining momentum whereas others lapse, and new initiatives are regularly introduced into the industry. It is worthwhile to check the
following Web sites regularly to ensure that you are aware of new developments:
http://www.dmtf.org (Distributed Management Task Force)
http://www.ietf.org (Internet Engineering Task Force)
IT Infrastructure Library
The IT Infrastructure Library (ITIL) was initially developed for use within UK
government IT departments by the Central Computing and Telecommunications
Agency (CCTA). This library consists of 24 volumes available to interested parties.
The use of the ITIL has spread outside the UK government and, in fact, has a significant amount of support throughout Europe. Awareness and support for ITIL in
the United States is very limited, although an organization has been established in
the United States to try to increase its acceptance.
The ITIL has a number of service management modules that cover topics including help desk operations, problem management, change management, software
control and distribution, service level management, cost management, capacity
management, contingency planning, configuration management, and availability
management. The volumes provide a methodology for defining, communicating,
planning, implementing, and reviewing services to be delivered by the IT department. They include guidelines, process flowcharts, job descriptions, and discussions
on benefits, costs, and potential problems.
The specific module on service level management refers to the relationship
between service level managers, suppliers, and maintainers of services. ITIL sees
service level management as being primarily concerned with the quality of IT services in the face of changing needs and demands. Figure 5.1 shows how IT users,
service providers, and suppliers relate via the use of Service Level Agreements.
This module also outlines the responsibilities of the service level manager as follows:
. Creating a service catalog that describes provided services
. Identifying service level requirements relating to each service and user community
. Negotiating Service Level Agreements between service suppliers and IT users
. Reviewing support services with service suppliers
Figure 5.1
Users, IT service providers, and the suppliers and maintainers of hardware, application software, and telecomms, related through Service Level Agreements, systems, and contracts.
ITIL views service management as a single discipline with multiple aspects and
advocates taking an integrated approach to implementing service management.
Hence, as an important implementation consideration, ITIL recommends the use of
a single repository for configuration data that is available to the help desk and used
as a base for change management, problem management, and contingency planning.
The base ITIL modules don't specify which order to implement all aspects
of service management because they can be implemented either consecutively or
simultaneously.
Note
ITIL does not cover all aspects of implementation management, and recommends the use of formal
project management as well as complete procedure documentation, risk management, audits, and
regular reviews.
The ITIL approach gained the support of the British Computer Society, which has
validated the training and examinations associated with ITIL's Certificate in IT
Infrastructure Management. A number of user groups have formed to support
ITIL, including the IT Infrastructure Management Forum and the IT Service
Management Forum. These user groups comprise IT departments of government
and commercial organizations as well as academic bodies and vendor representatives. EXIN, the Dutch equivalent of CCTA, has become a partner in ITIL and
is helping to fund the ongoing updating and re-issuing of the library.
Several thousand professionals in Europe have been trained and certified in the
ITIL methodologies, and multiple authorized vendors provide ITIL certification
training, almost all of whom are located in Europe. A large number of organizations in Europe are training their IT staff in ITIL methods; however, it is important to recognize that smaller environments should scale down the processes and
methods appropriately.
It is recommended that service managers become familiar with the concepts and
methodologies provided by the ITIL and use this information as a framework to
review current service management processes with the IT department. Then, select
those areas in which no formal procedures exist and that appear to have the largest
potential return on investment in terms of better support for the business and IT
user. High priority should be given to those practices that will help improve the
quality and consistency of IT service delivery. Use the ITIL methodology as a
starting point and alter it to better suit the size and maturity of your organization
and the scope of the services to be managed.
Note
Training in ITIL methods does not mean that you are ready to implement the methodologies immediately as learned in your IT organization. ITIL starts with the presumption that no current service
management methodology, products, or processes are in place, and this will not be the case in
most organizations. The goal is to adopt and adapt those methodologies that bring an appropriate
level of discipline and the most benefit to the organization. These should be implemented in a
phased approach to build on and extend existing practices.
Note
The Common Information Model began as a component of the Web Based Enterprise Management
(WBEM) initiative and has gained significant support from hardware, software, and management
vendors.
The SLA Working Group is extending the syntax and metaschema of CIM to
embrace the concepts of service management. The concept of a service spans
across multiple areas of the CIM schema, such as network support of the service,
software used to deliver the service, and end users who consume the service. Core
Model extensions are being created, and these will also allow for further subclassing within the Common Models for domain-specific usage. In addition, policies
will be supported for representing management goals, desired system states, or the
commitments of a Service Level Agreement.
Of these, the most interesting with respect to service level management are the
Transactions Statistics table and the Running Application Element Control table.
The requirement that the application runs on a single system is a limitation that
needs to be considered as part of any management solution implemented using
the Appl MIB standard.
Note
As the Appl MIB is still a proposal in RFC stage, it will be some time before it is finalized and addi-
tional time before there is any significant support and implementation of software products that
use this standard to provide management information.
arm_update: provides updated status, such as a heartbeat or progress information, for a transaction that has been started.
arm_stop: marks the end of a measured transaction and records its completion status.
arm_end: signals that the application has finished using the ARM agent.
Caution
The ARM API is intrusive and must be used during the application development or retro-fitted if the
application is already in service. Because many of the applications used within an IT department are
produced by third-party software vendors, the IT department will not have access to the source code
of these applications. Hence retro-fitting the applications for ARM might be impossible unless the
vendor agrees to make the modifications. It might be possible to use remote terminal emulation and
embed ARM API calls with the scripts; however, there are other mechanisms for simulating transactions and gathering statistics that might be simpler to implement. As ARM does not have widespread acceptance yet, reliance on it as the only way to measure end-user response times might be
premature.
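As a rough illustration of the simpler simulation approach mentioned in the caution, the following sketch times a synthetic transaction using only the Python standard library; the URL is a placeholder for a transaction representative of the service being measured.

    # Sketch: measuring end-user response time with a simulated transaction
    # instead of ARM instrumentation. Replace the placeholder URL with a
    # transaction that represents the service being measured.

    import time
    import urllib.request

    def timed_request(url, timeout=10):
        """Return the elapsed seconds for one synthetic transaction."""
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()
        return time.monotonic() - start

    if __name__ == "__main__":
        elapsed = timed_request("http://www.example.com/")
        print(f"synthetic transaction took {elapsed:.2f} seconds")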
Summary
No single industry standard exists for service level management that has broad
acceptance. Although a number of initiatives are underway, the best approach is
to keep apprised of standard developments and use a best practice approach. The
methodology of the ITIL, suitably modified for your particular organization and
combined with suitable mechanisms for measuring aspects of service quality, can
provide a base platform for successful implementation of service level management. As standards continually evolve and new initiatives appear frequently, it is
wise to monitor the various standards organizations via their Web sites.
CHAPTER 6
Service Level Management Practices
This chapter examines the current practices in use today in typical corporations
and organizations. Most of the information used to draw conclusions has come
from the United States, although anecdotal evidence suggests that common practices in other countries are quite similar. In general, the current state of service
management, particularly for newer applications and services across distributed
enterprises, is somewhat immature. Although a number of organizations proactively
manage the services they provide, the definition, understanding, and scope of
service management vary tremendously from organization to organization.
the subject. Even among organizations that practice service management, the scope
and understanding of this discipline varied. Of those who could define service
level management, the most common answer (around 35% of those surveyed)
associated the term with meeting or improving end-user perception of the
service, which might be a specific application or network service.6
Several industry research firms have concluded that there is significant confusion
in the industry and in the marketplace around service level management. For
example, META Group titled a May 1999 research note "Service Level Mess,"
citing hype from vendors as helping to increase this level of confusion. META
Group also indicated that the service level management market maturity would
begin in 2001.
The industry research firm Gartner Group outlines common pitfalls of today's
Service Level Agreements as being too complex, with no set baseline, leading to
staff overcommitment in trying to meet unrealistic goals. These unrealistic objectives might be set by the IT organization in response to customer demands and, in
many cases, the agreements are too one-sided and don't clearly specify the responsibilities of both parties.
Different industry analysts emphasize different aspects of service level management.
Hurwitz Group advocates that a service level agreement needs to specify three
service level objectives: user response time, application availability, and application
recoverability. Hurwitz Group also sees service level management as an iterative
process that extends beyond Service Level Agreements and must be managed as
outlined in Figure 6.1.
Figure 6.1
The iterative service level management process (steps include assigning the SLA owner, monitoring SLA compliance, and collecting and analyzing data).
In contrast, Giga Group believes that the three critical areas of Service Level
Agreements are response time, application availability, and cost of service delivery.
Giga Group recognizes that cost analysis is a complex subject, but highlights the
need to consider cost as it relates to achieving a specified set of service levels.
Forrester Research sees service level management as consisting of ensuring business application availability, performance planning to enhance the infrastructure
to meet response time requirements, and administrative support to provide the
day-to-day operations.
META Group advocates an approach it calls Service Value Agreements (SVAs),
which is an evolution from a static SLM model to a dynamic one that is oriented
around business-focused process management. In META Group's assessment, less
than 25% of IT organizations would implement service management from a quality discipline wherein the IT department is aligned with the goals of the lines of
business and has appropriate compensation programs to ensure SLA goals are met.1
The increased attention from industry analysts, as well as more service level
management articles appearing in trade publications, will help to educate IT
professionals. The general level of understanding and acceptance will also increase
as more industry forums evolve to include the opportunity for the sharing of best
practices around service management. At this time, there is no common agreement
even among the industry analyst community regarding the definition, scope, and
process for managing services effectively. As outlined in Chapter 5, "Standards
Efforts," few standards have emerged and none have any significant support. We
can expect the situation and the maturity of service management to continue to
improve, particularly if accepted standards do emerge.
Tip
While waiting for standards to emerge and evolve, including the basics such as common definitions,
you might want to become involved in groups such as the Distributed Management Task Force and
the IT Service Management Forum. You could also attend selected, focused trade shows and conferences where you can share your experiences and listen to best practices in use at other organizations.
International Network Services conducts an annual online survey on service level
management. The 1999 INS survey showed that, of those respondents who have
implemented service level management, 63% were satisfied with their organization's
SLM capabilities versus only 17% in the previous year. Although satisfaction was
increasing, the same survey indicated that 90% of the respondents felt improving
their SLM capabilities was an important goal, the same number as the previous
year. This is a good indication that service level management is, in fact, a continuous
improvement process, with most IT organizations seeking better capabilities.3
The 1999 INS survey also found that organizational issues including processes and
procedures were the most significant barriers to implementing or improving service level management. A number of other challenges related to the difficulties in
defining, negotiating, and measuring Service Level Agreements. Also noted was the
problem of justifying the cost/benefits to upper management.
The degree to which the IT department has implemented sophisticated service
management varies by the perspective of the IT department. If the IT organization
sees itself as a partner with the lines of business and responsible for helping those
business units gain market advantage and improve profits, continuous service
improvement comes more naturally. The nature of Service Level Agreements and
management is also different for services the IT department provides internally
and services it contracts for with external suppliers.
department. To understand service quality, a review is typically undertaken of
reported problems, including the trend in the number of problem reports, the time
to close problems, the number of backlogged problems, which organizations are
most affected, and which IT functions are handling the greatest number of problems. Following the review and the establishment of a baseline, the IT department
can then set a number of measurable, internal goals that will lead to service
improvement.
In most cases, these early steps establish a baseline of procedures and internal
processes necessary to ensure consistency of approach to service delivery. This
approach often helps the IT department understand the current level of service
delivery and highlights which areas require most improvement. Although an
important first step, real service level management will not begin until the IT
department establishes agreements with internal clients and external suppliers.
Tip
To be most effective, the initial service quality goals should be simple, easy to understand, and
clearly measurable. There should also be a link between achievement of the goals and incentives
for the IT staff responsible for the service, such as a bonus component of their compensation.
Note
Even after Service Level Agreements are established outside the IT department, internal agreements
will still be required to define interfaces between the various areas within the IT organization, along
with agreements with external suppliers.
Caution
Determining business-value-based Service Level Agreements might be a difficult concept for all
lines of business to accept. It typically requires senior-management-level and sometimes executive-level sponsorship to ensure complete buy-in by the lines of business.
Only half the participants in the 1999 INS survey had SLAs in place, versus 87%
who stated that they had some form of service level management. This indicates
that a large number of organizations were still in the ad hoc stage of managing
service levels. The survey also indicates that the acceptance and implementation
of Service Level Agreements will improve as 60% of the respondents planned to
implement either initial or additional SLAs, at which time 65% of respondents will
have at least one SLA in place. Interestingly, IT departments recognized the importance of mapping resources to the most critical applications and services, with 42%
of respondents making this an objective for Service Level Agreements.3
The 1999 INS survey also examined the primary objectives for developing Service
Level Agreements between the IT department and lines of business. The most
prominent themes were setting and managing user expectations and assessing their
satisfaction, understanding service priorities and mapping resources accordingly, and
measuring the quality of the services provided by the IT department. Figure 6.2
provides additional detail on this aspect of the survey.
Figure 6.2
Primary objectives for developing SLAs: the top objectives for developing SLAs between the IT department and internal organizations, by percentage of respondents in each category (1999 INS service level management survey).
This approach is also shown in the 1999 INS survey, where network availability
was selected as very important by 90% of respondents. This is a technology view
of availability. The second most important metric was customer satisfaction, which
can be achieved only if the user experience is perceived to be acceptable. The
third most important component was network performance, followed by application availability and application response time. This supports the typical view of
first addressing availability, and then performance, while at the same time moving
from a pure component view to one that centers around the application and the
end-user experience.
Tip
Telecommunications companies and Internet service providers are becoming much more competitive
and aggressive in trying to increase their respective market shares. If you don't have a formal service
level agreement with your supplier, you should be able to use the competitive pressures to negotiate
one that includes penalty clauses for failure to deliver the required level of service.
Tip
IT departments should ensure that they have Service Level Agreements in place with external suppliers, and that those agreements are monitored and regularly reviewed. Without appropriate service
quality from these suppliers, it is extremely difficult, if not impossible, for the IT department to meet
its own Service Level Agreements with the lines of business.
For the most part, telecommunications service providers and Internet service
providers are offering SLAs that guarantee high levels of network performance.
To illustrate this, we will look at a sample of providers offering such agreements;
however, note that this is meant to be only representative and not exhaustive.
AT&T offers SLAs for its domestic, international, and managed frame relay
environments. AT&T provides SLAs in five areas including provisioning, service
restoration time, latency, throughput, and network availability. Each of these areas
has agreed-upon service levels and if they are not met, AT&T credits customers
for monthly charges and maintenance fees based on the terms outlined in each
customer's contract.
GTE Internetworking offers SLAs for its Internet Advantage dedicated access
customers. These SLAs include credits for network outages, the inability to reach
specific Internet sites, and packet losses. GTE guarantees only its own backbone,
but customers can test to identify packet losses or delays within that portion of
the network. GTE also keeps performance statistics on a central database, which
allows verification of customer claims of poor performance.
MCI WorldCom's networkMCl Enterprise Assurance SLA extends performance
guarantees across all its data services. These include guarantees for availability,
performance such as transit delays, and network restoration time.
NaviSite Internet Services provides Internet outsourcing solutions and offers the
SiteHarbor product family of service guarantees. These guarantees cover the database server, Web server, network infrastructure, and facility infrastructure. NaviSite
includes penalties in the form of free service if the guarantees are not met.
Sprint's Frame Relay for LAN service is backed by performance guarantees for
network availability and network response time. Sprint also offers performance
guarantees for its Frame Relay for SNA service. In both cases, Sprint provides
customers with financial credits if performance guarantees are not met.
UUNET Technologies offers SLAs for frame relay, dedicated circuits, and Internet
access services. These cover network availability, latency, proactive outage notification,
and installation interval guarantees. Again, with each of these there are financial
penalties if UUNET fails to meet the performance guarantees.
These offerings are consistent with the 1999 INS survey that showed the top three
elements included in external network service provider SLAs to be
Network availability (77% of respondents)
Network performance (73%)
Network throughput (64%)
In summary, offering Service Level Agreements and managing the quality of the
services they provide is seen as a competitive necessity by the telecommunications
and Internet services providers.
Typical Agreements
Currently, most Service Level Agreements for services provided by the IT department are fairly simple and are more focused on specifying roles, responsibilities,
and procedures. The 1999 INS survey found the top elements included in internal
SLAs to be:
Assignment of responsibilities and roles (64% of respondents)
Goals and objectives (61%)
Reporting policies and escalation procedures (61%)
Help desk availability (59%)
Below these were more performance-oriented metrics including network availability, network performance, application availability, and application response time.
The structure of most Service Level Agreements begins with a statement of intent,
a description of the service, approval process for changes to the SLA, definition of
terms, and identification of the primary users of the service. A number of procedures are described, including the problem-reporting procedures, definition of the
change management process, and how requests for new users will be processed.
Typically, the schedule of normal service availability and the schedule of planned outages are specified.
Following this definition of roles and procedures, any specific performance objectives
are specified. In most of today's SLAs, these goals tend to be limited to availability
measures and response times and resolution times for reported problems. In some
cases, additional measures and objectives are stated for application-response times.
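As a quick reference, the typical structure just described can be kept as a simple checklist; the sketch below restates those sections in outline form (the wording is illustrative, not a prescribed template).

    # Sketch: the typical internal SLA structure described above, expressed as
    # an ordered checklist a drafting team could walk through.

    typical_sla_sections = [
        "Statement of intent",
        "Description of the service",
        "Approval process for changes to the SLA",
        "Definition of terms",
        "Identification of the primary users of the service",
        "Problem-reporting procedures",
        "Change management process",
        "Procedure for requesting new users",
        "Schedule of normal service availability and planned outages",
        "Performance objectives (availability, response and resolution times)",
    ]

    for number, section in enumerate(typical_sla_sections, start=1):
        print(f"{number}. {section}")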
Today, very few internal Service Level Agreements either specify the costs of services
Reporting Practices
Just as Service Level Agreements are somewhat immature at this time, so are the
service level reporting practices of most IT departments. The majority of service
level reports are very detailed, component-level availability and performance statistics that are incomprehensible to most recipients outside the IT department. Some
organizations produce useful reports showing the number, severity, and type of
problems reported by users of IT services, including response and resolution times.
These help the IT department show its responsiveness to the lines of business and
can be used to determine whether the problems are systematic, underlying technology or staffing issues requiring attention.
Tip
Unless you can provide service level metrics in terms that users can relate to and that represent
their experience, it might be best to disseminate the service level reports only within the IT department. Technology-oriented component reporting confuses the lines of business and reduces the
credibility of the IT department.
Figure 6.3
Effectiveness ratings by respondent category, rated on a scale of 1=Not at all effective, 2=Not so effective, 3=Somewhat effective, and 4=Very effective; categories include application management.
Summary
References
1. META Group, Service Management Strategies Delta, 10 February 1999, File: 754
2. META Group, Service Management Strategies Delta, 30 April 1999, File: 778
3. Rick Blum, Jeffrey Kaplan/International Network Services, INS 1999 Survey Results - Service Level Management, 10 May 1999
4. The Forrester Report, IT Pacts Beyond SLAs, April 1999
5. The Forrester Report, Service Level Management, Volume 15, Number Four, February 1998
6. Tim Wilson/InternetWeek, "Service Level Management: Build Stronger External Bonds," 10 May 1999
7. Enterprise Management Associates, Service Level Management Market Research Study, 30 November 1998
CHAPTER 7
Service Level Management Products
New products can generate a "buzz" that leads to an area becoming a hot topic.
Also, a shift in user interests will sometimes be the driver for interest in a particular
arena. Sometimes press coverage can become a driver independently of any of
these factors, or it can be fueled by these factors. Also, vendor publicity can
become both a driving force and also lead to increased press coverage.
Today, SLM is one of the latest hot topics. Predictably, there has been a flood of
products, from new and established companies, aimed at this market segment.
However, with SLM, there is a fundamental problem. There is not a clear definition of terminology. Therefore, vendors are free to create their own definitions,
ones that include their products in the domain of service level management.
Unfortunately, this plethora of definitions has created confusion within the user
community.
In this chapter, we will provide a framework for classifying and assessing SLM
products. This will enable managers to better decipher the confusing array of products offered for SLM. And hopefully, it will give managers the means to find SLM
solutions that meet their organization's particular requirements.
We will use our own classification system to scope out SLM products. Keep in
mind, however, that it's possible for SLM tools to fit into more than one category.
And when given the chance, most vendors will insist that their products "do it all."
Still, for our purposes, SLM products can be grouped into the following broad
functional categories:
Monitoring
Reporting
Analysis
Administration
Monitoring Tools
When a Service Level Agreement has been negotiated, it is necessary to capture
data about the actual quality, or level, of service delivered. To do this, managers
need to use tools to monitor the performance of the service. These monitoring
tools comprise software or hardware that retrieves data about the state of underlying components driving the service. This data is stored in a database for future reference or interpreted and put into reports. (Reporting tools will be discussed in
the next section.)
Basic Strategies
Monitoring tools collect data in two ways: In the first approach, primary data
collectors capture data directly from the network elements underlying the service
(bridges, routers, switches, hubs, and so forth). Some also gather input from
software programs that affect overall service availability (applications, databases,
middleware, and the like).
Most primary data collectors are not dedicated to SLM. Instead, they are typically
management systems that gather data for a range of purposes, one of which is
SLM. For example, Hewlett-Packard's OpenView Network Node Manager
(NNM) monitors an enterprise network for a range of parameters, including network availability. Although this data can be used for SLM reporting, it also aids
troubleshooting by tipping off network operators about degraded performance. HP
provides a separate SLM reporting package that works with NNM. That product,
called Information Technology Service Management (ITSM) Service Level
Manager, also takes input from other HP applications.
Another class of product, secondary data collectors, has appeared (see Figure 7.1).
These tools do not need to communicate directly with the managed environment
(although some of them are able to do so, if necessary). Instead, they extract data
from other products that are primary data collectors. Tools such as Luminate's
Service Level Analyzer fit this category. Infovista's Vistaviews is another example.
This product retrieves data from third-party management applications, including
BMC Patrol and Compaq Insight. Also, it comes in versions capable of interacting
directly with routers, Ethernet switches, and WAN gear. Secondary data collectors
like Service Level Analyzer offer a means of extending management platforms
from different vendors for SLM monitoring, while filling in where management
systems might be absent. This approach offers a number of advantages. First, it
eliminates the need for redundant agents throughout the distributed computing
environment. Second, redundant management traffic is eliminated by relying on
original sources. Third, this approach eliminates the need for the redundant storage
of large quantities of data.
Figure 7.1
Primary data collectors gather data (such as inventory, fault, network, and systems information) directly from the managed environment; secondary data collectors extract SLM data from the primary collectors.
Knowing whether a product is a primary or secondary data collector helps determine how an SLM monitoring tool fits a particular environment. To get a better
sense of actual requirements, however, it's important to gauge how products fit the
manager-agent model.
Both primary and secondary data collectors are designed according to this engineering scheme (see Figure 7.2), in which each device or software program uses an
integral mechanism called an agent to collect data about its status. This information
is automatically forwarded to a central application called a manager, usually in
response to a poll signal or request. Many agents in a network can be set up to
communicate with one or more managers. For a comprehensive description of the
manager-agent model and its implementation in various products, see Appendix F,
"Selected Vendors of Service Level Management Products."
Figure 7.2
The manager-agent model: agents embedded in network devices, servers, and workstations forward status information to a central manager.
The manager-agent model can help prospective buyers determine what they need
to look for in an SLM monitoring tool. If a company already has an SNMP manager such as HP's OpenView NNM, for instance, all that might be needed is an
SLM package capable of using NNM data. That's because most network devices
today are shipped with integral SNMP agents, ready to send data to any vendor's
standard SNMP manager on request.
In other cases, special agents will be needed to furnish additional information for
SLM reports. Suppose that, for example, a catalog retailer needs to track how well
its call center has performed in a given month. Data will be required about the
functions of the CSU/DSUs, routers, and network connections that bring customer orders into the call center. SNMP agents embedded in those devices, plus
standard RMON probes, can furnish this information. And if the retailer has HP
OpenView NNM installed, the data can be easily captured.
But the retailer's IT department also needs to know how quickly orders are
processed after they're taken over the phone. To obtain this input, software agents
will need to be placed on the call center's database server. Because most database
servers don't come with SNMP agents installed, the retailer will need to purchase
an application that includes special agent software. In this example, another
OpenView product, HP's IT/Operations, could be purchased to track the server
database via agents bundled with the product. Data from IT/Operations could
then be combined with NNM data for use in HP's ITSM Service Level Manager.
Our hypothetical retailer might take a different tack if OpenView NNM wasn't
available. If BMC Patrol were installed, for instance, server agents would already be
in place. The problem then would be to purchase an SLM monitoring tool to capture data about the underlying network. The retailer could choose secondary SLM
data collectors like Quallaby's Proviso to add data about routers and other gear to
the system information from Patrol.
In some instances, IT will need to obtain data for SLM from a legacy application,
device, or system that does not have its own standard SNMP agent. In this case, IT
personnel might have to build agents that can report either directly or indirectly
into existing management solutions. This requirement is not as difficult to meet as
it might seem. It is relatively easy to construct an SNMP agent using object modeling via Visual Basic or Visual C++. If need be, reporting tools and alerting programs also can be constructed or augmented in a relatively straightforward fashion.
Keep in mind that it will be easier to augment SLM tools that support open, well-documented databases and formats.
Data Capture
SLM monitoring tools use a range of methods to capture data. In the implementations previously described, agents are used to check on the devices and software
underlying a network service. Other techniques include the use of probes and simulation. Take a look at each of these methods, along with their key benefits and
drawbacks.
Agents
Several types of agents can be used with SLM tools: Hardware agents comprise
software or firmware embedded in network devices that retrieve status information
via SNMP or proprietary commands. Nearly all devices in today's corporate environments ship with embedded SNMP agents. All devices from Cisco, for instance,
ship with integral agents that use special commands to capture information about
device status. This data is converted within the agent to SNMP for transmission to
local or remote manager applications from Cisco and other vendors.
Another type of agent important to SLM products is the RMON agent, which
consists of code installed at the network interface to analyze traffic and gauge
overall network availability. Many RMON agents are packed into standalone
boxes called probes (see the next section for more on these). Alternatively, RMON
agents are sold as firmware embedded in switches, hubs, and network interface
cards. All major hub and switch vendors include RMON agents in their wares.
Because SNMP agents aren't ubiquitously installed on servers or within software
packages, many SLM products come with specially designed agents. These agents
consist of code that resides on a server and taps log files for information on the
performance of databases, network applications, middleware, or the operating system itself. BMC Software offers Patrol agents for a range of distributed databases
as well as mainframe environments. These agents report back to BMC's Patrol
manager, which in turn is accessible to a range of third-party applications from
vendors who've partnered with BMC.
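As a rough idea of what such a log-tapping software agent does, the following simplified sketch parses response times out of an application log and summarizes them for a collector; it is a stand-in for illustration only (the log location and format are hypothetical), not code from BMC Patrol or any other product.

    # Sketch: a simplified software "agent" that taps an application log for
    # response times and summarizes them for a collector. The log path and
    # line format ("... completed in <n> ms") are hypothetical.

    import re

    LOG_PATH = "/var/log/orderentry/app.log"       # hypothetical location
    PATTERN = re.compile(r"completed in (\d+) ms")

    def summarize_response_times(path=LOG_PATH):
        times = []
        with open(path) as log:
            for line in log:
                match = PATTERN.search(line)
                if match:
                    times.append(int(match.group(1)))
        if not times:
            return None
        return {"samples": len(times),
                "average_ms": sum(times) / len(times),
                "worst_ms": max(times)}

    if __name__ == "__main__":
        try:
            print(summarize_response_times())
        except FileNotFoundError:
            print("log file not found; point LOG_PATH at a real log")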
The chief benefits of agent technology are its flexibility and support for mixing
and matching of products from different vendors. Software agents also can be used
to extract data from a range of sources, as previously noted. Agents also are versatile: Any standard SNMP agent works with any SNMP manager, and vice versa.
Even proprietary agents can be integrated with third-party managers, as long as
the vendors are willing to cooperate.
On the downside, agent technology can add a processing burden to networks
and systems if it is not well planned. Communication between agents and managers is usually based on the client/server model, in which data is exchanged
between the two entities over a network. When SNMP is used, this means that
packets are transmitted back and forth across a TCP/IP connection. This traffic
can tax bandwidth on network links set up to handle mission-critical applications.
Congestion can result, especially in large networks, in which many devices are
"talking to" a central manager console. One way to avoid congestion is to set up
the manager console to poll agents only at specified intervals, or to retrieve only
certain types of data from the agents, such as critical alarm information.
Poorly designed SNMP and proprietary agent software also can burden a host
computer, causing slowdowns in response time. However, broad experience by
IT organizations in many industries over several years, coupled with computer
models, both support the conclusion that only in an extreme worst-case situation
can the traffic between managers and agents be expected to exceed 1% of the
available bandwidth.
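A back-of-the-envelope estimate along the following lines (with assumed agent counts, packet sizes, polling interval, and link speed) shows why polling traffic normally stays well below that level.

    # Sketch: rough estimate of SNMP polling traffic as a share of link
    # bandwidth. All inputs are assumptions chosen for illustration.

    def polling_load(agents, bytes_per_poll, poll_interval_s, link_bps):
        """Fraction of link bandwidth used by request/response polling."""
        bits_per_second = agents * bytes_per_poll * 8 / poll_interval_s
        return bits_per_second / link_bps

    if __name__ == "__main__":
        # 200 agents, about 1,500 bytes exchanged per poll, polled every
        # 60 seconds, over a 10 Mbps link.
        share = polling_load(agents=200, bytes_per_poll=1500,
                             poll_interval_s=60, link_bps=10_000_000)
        print(f"{share:.4%} of the link")   # 0.4000%, well under 1%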
The growth of the Internet has prompted many vendors to investigate Web-based
techniques, such as Java applets and XML (eXtensible Markup Language), as an
alternative to traditional manager-agent communications. Some products, including
Trinity from Avesta (a company recently purchased by Visual Networks), e-Specto
from Dirigo, and FrontLine e.M from Manage.com put these techniques to work
monitoring the availability and health of e-commerce services. Using the Web
saves bandwidth and system resources and eliminates the need to set up multiple
consoles for management from remote locations. Instead, managers can obtain
SLM data from any location via Web browsers. Today, most Web-based management products rely on proprietary protocols and interfaces. But ongoing work by
the Distributed Management Task Force (DMTF) is aimed at creating formal standards for Web-based management.
Note
Agent software embedded in hardware devices and network servers is used to gather status and
configuration data for transmission to central management consoles. Agent technology has been
standardized by the IETF using SNMP, which allows third-party platforms and applications to gather
input from multiple sources in the network, regardless of vendor or brand.
Probes are limited in other ways too: A probe designed to monitor leased-line
services, for instance, won't track traffic operating above rates of 2.048 megabits per
second (Mbps). And the number of links a probe can handle is limited to its physical port capacity: As the number of monitored links increases, more probes need to
be purchased.
Note
Probes and packet monitors use agents embedded in packet-filtering devices to track and report the
status of network traffic as it moves over LAN or WAN connections. The RMON MIB standardizes
this data for compatibility with any SNMP console.
A range of vendors of CSU/DSUs have entered the SLM market by adapting their
equipment for use as SLM monitoring tools. ADC Kentrox, Adtran, Digital Link,
Eastern Research, Paradyne, Sync Research, Verilink, and Visual Networks all fit
this category. Each of these vendors offers a series of CSU/DSUs that keep track
of physical-layer performance while divvying up WAN bandwidth to enterprise
segments. These products are comparatively inexpensive, and they can be a convenient solution for organizations that want to press existing equipment into the service of SLM monitoring. On the downside, these units only track the performance
of WAN links. They don't monitor routed segments. And they might not work
on international networks, although most of the CSU/DSU vendors furnish
standalone probe versions of their monitors for use overseas.
Note
Some WAN CSU/DSUs come with integral agents that track the physical-layer performance of WAN
connections and apply this data to SLM reports.
Simulation
Simulation tools simulate transactions over LAN and WAN links and furnish a way to test multiple connections.
SLM Domains
Effective use of SLM monitoring calls for a skillful application of the basic strategies and data capture techniques previously outlined. But just having the tools isn't
enough; a manager needs to apply the tools at the right times in the right places.
Like a carpenter equipped with wood and a hammer but no nails, SLM tools
won't deliver good information if they're not used in the proper combinations.
And the right mix of tools differs with each organization.
One step toward success is to examine the portions of a network that need to be
monitored, and then put tools in place to generate the needed SLM data. In general, networks can be described as having the following components or domains:
Network devices and connections
Servers and desktops
Applications
Databases
Transactions
Taken together, these domains control the quality of network services. An accounting department, for instance, can't run effectively unless all personnel, including
debit and credit professionals, tax accountants, the controller, and the CFO, are all
properly connected over the intranet, which in turn requires switches, hubs, and
routers to be in working order. Likewise, the servers and workstations used by the
staff need to be configured correctly. But no IT manager needs to be told that
response time can slow to a crawl even if the underlying devices and servers are
working. Applications can be awkwardly designed, databases clogged with useless
entries, and transactions poorly structured.
To get the best SLM information, it is usually, but not always, necessary to install
products to monitor each domain. To get the best read on the quality of the
accounting services in the previous example requires tools to deliver input on network availability and response time of applications. If multiple sites are involved, a probe might be used to track the quality of WAN links furnished by a carrier. Based on the network design and ongoing performance input, it might be important to adjust the level of monitoring, increase the number and quality of tools in
one domain, or consolidate tools across others.
To know how to best coordinate a solution that fits a particular organization's
requirements, it's important to know the basic functions of each domain, the tools
typically used to monitor those functions, and where and when they're applied.
Take a closer look at each of the domains in turn (see Figure 7.3) and examine
how the basic strategies and the data capture techniques that we've already covered
are applied in each. Examples of currently available products will be furnished for
each domain.
Figure 7.3 (diagram: an SLM monitor and a router)

The quality of the underlying network is key to SLM monitoring. After all, no networked service or e-business application can operate without reliable physical connectivity. Monitoring a network requires keeping track of whether each device is operating, and how well all components are working in concert. Getting this data calls for a two-pronged approach that includes tracking the availability of individual devices and monitoring the performance of network connections. Typically, performance data includes information about the throughput, or quantity of delivered packets, and the latency or delay between devices on a particular connection.

Availability and performance data can be obtained by tapping standard SNMP and RMON/RMON II agents located in hubs, switches, routers, and other gear. As previously noted, this can be done via primary data collectors such as HP OpenView or Tivoli Netview, both of which, like other platforms, support their own SLM tools as well as those of third-party vendors. Alternatively, a growing number of management systems dedicated to performance monitoring and reporting also support SLM, including Keystone VPNview from Bridgeway, ProactiveNet Watch from ProactiveNet, and Netvoyant from Redpoint Network Systems. Each of these products can take the place of a primary data collector to feed its own SLM reporting tools. They can be used where no primary data collector is in place, or where there is a primary data collector in a central location that needs to be augmented at remote sites.

Deciding which SLM monitor to use depends in part on the size and design of the network. In large nets, it might be practical and economical to simply extend a platform like OpenView to include SLM monitoring by using tools from the platform vendor. It might also make sense to deploy the platform's scalability options. HP, Tivoli, and other vendors of SNMP management platforms furnish software called a midlevel manager that gathers data at specific segments or sites and sifts it for selective transmission to a central console, reducing the amount of bandwidth and processing required to monitor multiple sites.

Most mission-critical networks these days rely to some extent on carrier connectivity. To keep track of how well the carrier is contributing to service levels, probes may be deployed at specific WAN links (see Figure 7.4). Probes can be polled just like any other SNMP or RMON device. More in-depth data, however, can be obtained by using the application that is sold with the probe. In many instances, this app can be set up with a bit of tweaking to transmit data to OpenView or to a third-party SLM tool.

Figure 7.4 (diagram: a probe reporting status information from hubs, workstations, servers, and other devices)
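The two-pronged approach described above can be illustrated with a few lines of code. The sketch below is a simplified stand-in for an SLM monitor, not any vendor's product: it times a TCP connection attempt to each of a few placeholder devices, treating success as availability and the connect time as a rough latency figure.

# Crude availability/latency probe: time a TCP connect to each device.
# Device names and the port are placeholders; real SLM monitors rely on
# SNMP, RMON, or ICMP rather than a bare TCP connect.
import socket, time

def check(host, port, timeout=2.0):
    start = time.time()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, (time.time() - start) * 1000.0   # latency in milliseconds
    except OSError:
        return False, None

for device in ['router1.example.com', 'switch1.example.com']:
    up, latency_ms = check(device, port=80)
    if up:
        print(f'{device}: available, round trip {latency_ms:.1f} ms')
    else:
        print(f'{device}: unreachable')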
Application agents do more than measure the time it takes a server to respond to a desktop request; they also monitor the health and functionality of the application's inner workings. They do this by residing inside the software itself, monitoring the keystrokes, commands, and transactions deployed by the service. They can identify applications that send too many requests to the server, or highlight those that use transactions that are awkwardly constructed.
Because they're so detailed, these agents are specially designed to keep tabs on specific brands of apps or databases. The Collaborative Service Level Suite from Envive Corp. and Luminate for SAP R/3, for instance, track SAP R/3 databases.
ETEWatch from Candle Corp. monitors the response time of Lotus Notes,
PeopleSoft, and SAP R/3 applications. Empirical Director from Empirical Software
gathers performance data in Oracle databases as well as a range of operating systems.
Smartwatch from Landmark Systems can be set up to track the performance of a
variety of middleware packages, operating systems, and applications. And BMC
Software furnishes a comprehensive framework suite encompassing Patrol, Best/1,
and other packages for managing all these elements.
Application agents differ in their monitoring orientation: ETEwatch and
Smartwatch, for instance, monitor the performance of applications from the workstation perspective, whereas Envive and Luminate take the response time view from
the server. Which view is more valid is generally a matter of opinion. Proponents
of the workstation approach claim their wares gauge end-to-end response times,
whereas vendors of the server approach say their agents are easier to maintain
because they don't have to be placed on desktops throughout the network.
Monitoring specific transactions within applications represents the most sophisticated type of application monitoring. It also requires the user to deploy the highest
level of expertise. That's because products like Smartwatch call for users to select
the transactions they want to monitor. This calls for in-depth knowledge of how
applications are structured, as well as a sense of the specific transactions that require
most attention. For most organizations, a product like Smartwatch will need to be
run by a programmer.
Packet monitors can furnish granular information about software performance by
analyzing application traffic. Optimal's Application Expert, for example, depicts
specific application threads using color-coded graphs; managers can visually pick
out bulky command sequences that might be holding up response time.
A key consideration in choosing a software-monitoring tool is its ability to integrate with other vendors' wares, particularly vendors that offer other SLM solutions. BMC Software, for example, has integrated its tools with HP OpenView,
Tivoli, and a range of other third-party management platforms and applications.
And vendors such as Compuware and Envive also have made integration with
platforms and frameworks a priority. No SLM shopping expedition is complete
without a thorough check of a vendor's partnerships and integrated solutions.
We will examine how each of these functions might incorporate SLM monitoring
and reporting, and how commercial products can be used to fit the specific
requirements.
Fault Management
These days, it's rare to find a network that isn't equipped with some form of
fault-reporting software or hardware. The SNMP management systems of the
early 1990s were focused primarily on reporting broken links and devices, and
the descendants of these early OpenView and Netview systems remain in many
organizations today.
Also in today's organizations are the techniques of fault reporting that originated
ten years ago. The trouble is, yesterday's fault management systems are no longer
able to meet the needs of today's burgeoning networks. The reason is sheer numbers: Larger and more complicated networks breed lots of alerts that can cause as
many problems as they solve. When a router breaks, for instance, the management
system will not only receive alerts from that device, but also from all the hubs,
workstations, servers, and other gear that depend on that router for connectivity.
Weeding through the resulting avalanche of alarms can delay troubleshooting and
repair, resulting in missed service levels.
To cope with this, a new breed of product has emerged that works alongside standard SNMP managers, sifting their alerts and reporting only those the manager
needs to see. Included in this category is Netcool/Omnibus from Micromuse,
which lets managers gather and filter events from multiple management systems,
including those supporting non-SNMP protocols. In effect, Netcool/Omnibus acts
as a manager of managers, providing a single console in which selected events and
alerts are displayed to streamline troubleshooting. Another group of products takes
event filtering a step further, using built-in intelligence to identify the root cause
of network problems from telltale patterns of alerts. The Incharge system from
System Management Arts, Eye of the Storm from Prosum, and tsc/Eventwatch
from Tavve Software fit this category.
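The core idea behind this kind of event filtering can be shown in a few lines of code. The sketch below is not any vendor's algorithm; it uses a hand-built topology and hypothetical device names to suppress alarms from devices that sit downstream of a device that is itself down, so only the probable root cause reaches the operator.

# Toy alarm suppression: if a device's upstream parent is also down, treat the
# device's alarm as a symptom rather than a root cause. Topology and alarms
# are hypothetical examples.
upstream = {                 # child -> parent in the connectivity tree
    'hub-3': 'router-1',
    'server-7': 'hub-3',
    'server-8': 'hub-3',
}

alarms = ['router-1', 'hub-3', 'server-7', 'server-8']   # devices reporting "down"
down = set(alarms)

def has_failed_ancestor(device):
    parent = upstream.get(device)
    while parent is not None:
        if parent in down:
            return True
        parent = upstream.get(parent)
    return False

root_causes = [d for d in alarms if not has_failed_ancestor(d)]
print('Show the operator:', root_causes)                  # -> ['router-1']

Commercial tools layer topology discovery, protocol awareness, and time correlation on top of this basic idea.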
Configuration
Ideally, successful SLM includes the ability to control as well as monitor network
devices and connections. But this capability is only just starting to emerge, as vendors add traffic-shaping capabilities to their SLA monitoring tools. The Wise
IP/Accelerator hardware/software product from Netreality, for example, combines
a traffic monitor and shaper with SLA reporting tools. This lets managers assign
bandwidth to applications according to priority. Mission-critical e-commerce
applications, for instance, are run at high, guaranteed rates, whereas internal email
might get "best effort" status if congestion occurs. There are other vendors with
offerings in this space, although many do not have integral SLA reporting tools.
Packeteyes from SBE, for example, combines an access router and firewall with
software that assigns and controls application bandwidth. There are also software-only products for bandwidth management: The Enterprise Edition software suite
from Orchestream, for instance, enforces prioritization of traffic across switches and
routers from Cisco, Lucent, and Xedia. On the downside, a lack of standards for
policy management has up to now kept products like Orchestream's limited to
specific vendors' wares.
Accounting
A key aim of SLM is to keep costs in line. Ironically, products that track the usage
of enterprise network services have only recently emerged. These tools, including
Netcountant from Apogee Networks, IT Charge Manager from SAS, and Telemate.net from Telemate Software, tap RMON probes and log files in routers and applications in order to tally the amount of bandwidth consumed by a particular application, department, or individual. This data is matched up to a dollar value and placed
in a bill. Alternatively, managers can use the data to populate financial reports or
forecast the cost of upcoming additions to networking hardware and software.
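The chargeback arithmetic these tools perform can be sketched simply. The departments, usage figures, and per-gigabyte rate below are hypothetical; a real product would pull the usage numbers from RMON probes or router log files rather than from a hand-typed table.

# Toy chargeback: convert measured usage into a billed amount.
usage_gb = {'accounting': 120.5, 'engineering': 640.0, 'sales': 75.2}
rate_per_gb = 1.50                     # dollars charged per gigabyte transferred

bill = {dept: round(gb * rate_per_gb, 2) for dept, gb in usage_gb.items()}
for dept, amount in bill.items():
    print(f'{dept}: ${amount:.2f}')
print(f'total: ${sum(bill.values()):.2f}')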
Because these products are still so new, they haven't reached their full potential
yet. It's conceivable, for instance, that by linking these accounting applications to
Web load balancers, switches, and bandwidth prioritization gear, IT and network
managers could include cost parameters along with network performance in future
SLAs. An IT department might, for instance, be able to keep track of how much
of a costly leased line or virtual private network a particular group has used in a
given month. And if usage threatens to exceed budgeted funds, the department
could be notified. Likewise, if more bandwidth is required, a manager can test
out various configurations before signing on the dotted line.
Performance

Security
Increased use of the Internet and carrier services in corporate networks has
made management of security a full-time job in many networks. Keeping passwords up-to-date, making sure that access is properly assigned, and monitoring
software for viruses are just a few of the tasks required to ensure that today's larger,
more public networks guard business secrets and avoid resource tampering. A compelling argument for considering security as part of the service level management
equation is quite simple. If the security of an environment is compromised, the
availability and/or performance of the service can be compromised. Some Service
Level Agreements include specific metrics regarding the security of the environment and the data contained therein.
Several vendors of secondary data collectors furnish comprehensive security application suites along with SLM. Unfortunately, many of these products aren't directly
integrated with the platform. Exceptions include BullSoft, which offers security
management, authentication, monitoring, and documentation as an integral part
of its OpenMaster platform.
Reporting Tools
We've spent the lion's share of this chapter describing a framework for selecting
products that monitor and capture data for SLM. There's a good reason for this:
Without the right input, any SLM project is doomed to failure. Even the best
information won't guarantee a successful SLM strategy if the results can't be published effectively. An examination of reporting capabilities is a key part of any SLM
product selection.
Broken routers, congested links, and malfunctioning adapters all generate SNMP alerts that show up as alarms in fault-management consoles such as OpenView or Netview. This is information that's typically required by network operators in the course of day-to-day troubleshooting and management. In fact, most management tools with real-time capabilities can automatically generate a page or dial a phone number to notify operations personnel when critical alarms occur. (Operators and other IT personnel can select ahead of time the particular events that will trigger the notification.) Still, keep in mind that although real-time data is an important gauge of overall
availability and uptime, it can't give the perspective on overall performance required
for SLM. Prompt response to an outage can reduce the impact on a particular SLA,
or help operations personnel keep to the repair times stipulated in the SLA.
SLM Analysis

It's characteristic of SLM that when it's properly in place in any organization, it starts to exceed its original function. When constituents see the benefits of SLM, they aren't content with a monthly report. Network operators want day-to-day downloads for proactive management. Executives want to see data cut and sifted in various ways to furnish better insight into how the technology they're purchasing is serving the business, and so on.

SLM tools vary widely in their capability to adapt to all these requirements. Some products, such as Desktalk's Trend series, were designed with built-in data analysis flexibility, whereas others, such as the Network Health series from Concord Communications, were advertised from the start as offering off-the-shelf reports that didn't require tweaking. But even if a product furnishes in-depth analytical capability, it might not have the data in hand to do the numbers required. Concord and Desktalk, for instance, have limited real-time data capture capabilities.

Generally speaking, serious data analysis will require the use of sophisticated third-party packages. SAS Institute, the statistical software vendor, now provides a range of data analysis and reporting tools tailored to fit SLM. Among these are the IT Service Vision series, which creates a data warehouse of network, system, and Web performance information. SAS also has added cost accounting, capacity planning, and high-end financial analysis to its suite.

Some organizations will need consulting help to properly analyze SLM data. Cases like these might be best served by reliance on a service from the likes of Lucent, Winterfold Datacomm, or X-Cel Communications. These vendors provide services that can help orchestrate data capture to fuel specially tailored reports and analyses. But as is the case with any kind of customization, extra costs might be involved.

In some instances, managers will need reports that can't be provided by the vendor. For cases like these, many vendors offer APIs and software development kits that allow their wares to be customized. This option might cost extra, however.

Caution
IT managers choosing to use vendor APIs and software development kits (SDKs) sometimes need to spend twice as much as they did to obtain the basic product. Even if APIs or SDKs seem reasonably priced, there might be a need to hire the vendor's professional services team to create customized software that works. In some cases, it might be more practical to simply export data to a third-party reporting package such as Crystal Reports from Seagate rather than going the made-to-order route. Vendors usually consider a $2,000-per-day price tag for consulting help to be a bargain. The value of customized software must be weighed against the outlay beforehand in order to avoid disappointment.

Administration Tools

SLM calls for a new approach to the day-to-day tasks involved in managing and administering network services. After all, it's tough to analyze network costs or ensure ongoing performance if it's not clear what is installed. To make changes as required to improve service levels demands tools that enable network elements to be located and reconfigured quickly and efficiently.
Summary
SLM products span a broad range of functions and formats. What's more, vendors
have jumped on the SLM bandwagon in order to promote products that weren't
created with service level management in mind. Only by using a framework that
keeps the primary purposes and goals of SLM at the forefront can managers hope
to make sense of the many offerings crowding the market.
A workable approach is to first look at products according to the SLM functions
of monitoring, reporting, analysis, and administration. Then it's important to scope
out the monitoring issueswhere data is captured and in what format. Careful
planning is required to ensure that the right data is gathered in the right spots at
the right times to create an adequate basis for service level reporting. When this is
accomplished, an organization is ready to choose tools for publishing and analyzing
SLM data in ways that meet its particular requirements.
When selecting SLM tools, it is especially important to keep in mind the database format supported by each product. You will need to be able to get data in and out of an SLM system easily. Selecting a system with a proprietary database for back-end functions will limit your ability to customize the software or augment it with third-party products. Many of today's SLM tools are based on open, well-documented databases like SQL Server, so it should not be difficult to select one that meets your requirements.
To be effective, any SLM strategy also needs to be flexible enough to accommodate
ongoing information requests. In fact, the test of a successful SLM implementation
will be the demands put on the IT department for more information once initial
reports are generated. Managers need to be ready to "slice and dice" SLM data in
order to meet these demands. Again, in constructing reports, it helps to have an
integral database that is familiar to your staff.
Ultimately, SLM monitoring and reporting will lead to a more efficient approach to managing network services, one that calls for improved record-keeping and tighter centralized control over administrative parameters. The Distributed Management Task
Force (DMTF) and other organizations are working to make this happen by creating
interoperable schemas for management data in applications, databases, and directories.
Recommendations

CHAPTER 8
Tip
The cost justification for service level management will be much more credible if true business value
can be related directly to improved quality of service. This is more powerful than attempting to justify service level management based on cost or staff savings within the IT department.
Availability        Cost of downtime
99%                 $7,358,400
99.5%               $3,679,200
99.9%               $736,400
99.99%              $7,000
Note
The cost of downtime varies significantly by industry. Financial trading systems have extremely high
costs associated with even minor service disruptions. As more corporations enter the age of e-business,
opportunity costs as a result of outages of front office applications will continue to increase.
Business revenue can also be affected by performance degradations that impair the ability to handle the required workload volumes. If application responsiveness degrades, revenues can suffer as well, as best illustrated by financial trading systems, where a few additional seconds can lead to significant losses or reduced profits from trades. As e-commerce is used to sell goods and services directly to consumers across the Internet, slow responsiveness can also lead to consumers buying from a competitor.
Quantifying the impact on business revenue requires an understanding of the
critical business systems and the associated revenue generated by those systems on
an annual basis. This information can be used to calculate an hourly rate, and by
assessing the increased service availability due to proactive service management,
an associated benefit can be calculated.
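As a rough illustration of that arithmetic, the hourly rate and the benefit of an availability improvement might be computed as follows; every figure in the sketch is hypothetical, and the availability numbers would come from the baseline and the SLA.

# Hypothetical figures: annual revenue carried by a critical system and the
# availability improvement expected from proactive service management.
annual_revenue = 100_000_000            # dollars per year through the system
business_hours_per_year = 8 * 5 * 52    # 8-hour days, 5-day weeks

hourly_revenue = annual_revenue / business_hours_per_year

current_availability = 0.995            # 99.5% during business hours today
improved_availability = 0.999           # 99.9% after proactive management
hours_recovered = (improved_availability - current_availability) * business_hours_per_year

print(f'Hourly revenue rate: ${hourly_revenue:,.0f}')
print(f'Downtime hours avoided per year: {hours_recovered:.1f}')
print(f'Estimated annual benefit: ${hourly_revenue * hours_recovered:,.0f}')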
Caution
The lines of business should be consulted when calculating revenue impact because they might have
manual backup systems that will allow processing to continue in a degraded mode. This produces
quantifiable revenues, but at a reduced rate.
Similarly, the potential of improved relationships with the distribution channels and
the effectiveness of supply chain transactions can be used as a basis for calculating
the benefit of improved service quality or the negative impact of unacceptable
service levels for the critical application services used by these business partners.
Quantifying the impact of slow response times will be more difficult and will
require the cooperation of the lines of business. Revenue impact will include
any penalties involved in not meeting critical deadlines, as well as the competitive
disadvantage associated with reduced effectiveness of internal personnel or lost
business due to customers shopping elsewhere.
When calculating employee costs for productivity calculations, remember to use fully loaded costs,
which include salary, bonuses, benefits, equipment costs, real estate, and utilities.
Similar to the benefits of reduced outages, the benefits associated with improved
and consistent responsiveness can be calculated by determining how much more
work can be performed in a given time period. This can translate into cost avoidance by deferring the hiring of additional employees.
Note
User productivity is also affected by offline activities such as output distribution and information
archival and retrieval. When determining the scope of service management in your environment, the
service level agreement should extend to cover these offline requirements.
Tip
When implementing a service level management solution, it is best to start with the most critical
or highly visible service provided by the IT department. By focusing on one service at a time, the
probability of a successful implementation increases significantly and the initial success leads to
continued management support for service level management of additional services.
Employee costs within lines of business; these are the end users of services covered by service management

Lost business due to service outages that could have been prevented by service management

Cost of customer dissatisfaction due to service outages and degradation that could have been prevented by service management

Employee Costs

Not included in these calculations are losses of productivity due to degradation in service responsiveness, nor the opportunity costs of better service.
Table 8.2 Sample cost-justification worksheet (the detailed assumptions cover employee costs within the lines of business, IT infrastructure hardware and software, numbers of databases and DBAs, availability levels, and lost-business estimates; the annual benefits are summarized below).

Lost productivity for application downtime      $3,577,275
Lost business                                   $375,000
Customer satisfaction                           $500,000
SLA penalties cost                              $93,600
Improved IT productivity                        $207,500
Total                                           $4,753,375
Application Downtime
The costs associated with downtime include both unscheduled downtime due
to failures as well as planned downtime for maintenance that extends into normal
business hours.
Percentage of application downtime:
(100) - (percentage of availability during business hours)

Annual unscheduled downtime in hours for all servers:
(number of business hours per day) x (number of business days per week) x (52 weeks per year) x (number of servers) x (percentage of downtime during business hours)

Annual unscheduled downtime during business hours:
(number of business hours per day) / (24) x (annual unscheduled downtime in hours for all servers)
Lost Business
When calculating lost income due to unscheduled downtime, we must factor the
revenue normally generated by those application services to account for the ability
to operate manually in a degraded mode.
Hourly income related to server applications:
(annual income related to server applications) / ((number of business hours per day) x (number of business days per week) x (52 weeks per year))
Annual lost business due to application downtime:
(hourly income related to server applications) x (annual unscheduled downtime during business hours) x (estimated percentage of business lost due to downtime)
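A minimal sketch of this lost-business arithmetic follows. The input values are hypothetical; the downtime figure would come from the formulas above, and the percentage of business actually lost is an estimate the lines of business must supply.

# Hypothetical inputs for the lost-business calculation above.
business_hours_per_day = 10
business_days_per_week = 5
annual_income = 100_000_000            # revenue tied to the server applications
downtime_hours = 26.7                  # annual unscheduled downtime during business hours
pct_business_lost = 0.10               # share of affected business actually lost

business_hours_per_year = business_hours_per_day * business_days_per_week * 52
hourly_income = annual_income / business_hours_per_year
annual_lost_business = hourly_income * downtime_hours * pct_business_lost

print(f'Hourly income:        ${hourly_income:,.0f}')
print(f'Annual lost business: ${annual_lost_business:,.0f}')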
Customer satisfaction:
(estimated value of application availability to customers; this is an estimate of future business that will be affected by unacceptable quality of service)
Note
Industry analysts have estimated that the practical limit of the number of databases a database administrator can manage manually is from five to ten. Through the use of tools that take proactive actions, this number rises to fifty or more. Similar ratios can be used for system support personnel.

Summary

It is possible to quantify the cost savings associated with implementing proactive service level management strategies and tools. When doing so, it is important to begin with the impact on business revenue, productivity, and customer satisfaction. Additional cost avoidance resulting from improved IT staff productivity, better return on investment in IT assets, and deferring system upgrades can also be used to justify service level management.

Appendix E provides an actual case study of qualitative value and quantitative return on investment for implementing service level management for an SAP application at a service provider.

CHAPTER 9
Implementing Service Level Management

Readers can pause at the start of this chapter. After all, we've covered the fundamental concepts and parameters of SLAs and offered a framework for product selection. Isn't that what implementing service level management is all about?
The answer is a resounding no. In fact, we've only laid the groundwork.
Successfully implementing service level management (SLM) calls for more than
buying some software and slapping a contract on the desk of the nearest department head. It requires a strategy, an organized, flexible plan for introducing SLAs
and working with them day to day to achieve maximum efficiency and savings.
Without this, projects can fail despite the best efforts to make them work.
Consider the following case: A couple of years back, a network manager working
for a large Eastern retailer decided SLM would suit his firm. He hired a consultant
to scope out the basics and evaluate products. The CIO signed off. After a large
expenditure, software was installed and SLA templates prepared. The first of these
was sent to the head of the customer service department, the largest in-house IT user in the company, where it sat on her desk. Time passed. Other divisions were
sent SLA forms with similar results. A meeting was called to explain the benefits of
the new system, during which the head of customer service asked why she hadn't
been given the opportunity to help shape the terms of her SLA. She did not, she
pointed out, have time to help IT do its job. The other managers present at the
meeting concurred. The next day, the network manager found himself summoned
to the boss's office for a long talk about the high cost of his pet project. Two
months later, the manager who'd instigated SLM resigned.
This anecdote illustrates that good intentions and products don't constitute an
SLM strategy. Instead, what's needed is an in-depth analysis of a company's unique
culture and requirements, with a clear sense of information regarding potential
pitfalls and opportunities. The network manager wasn't wrong to propose service
level management. In fact, he could have been a trendsetter. His products and
templates were state of the art. The trouble was, he hadn't bothered to consider
how best to introduce SLM to his constituents. He had not focused on soliciting
buy-in from all parts of the business, not just IT. He had mistakenly focused on the
network layer alone, and he had not followed an inclusive strategy that incorporated
all services that would be affected by SLM. Inevitably, the vacuum of unanswered
questions soon filled with misunderstanding and political rivalry. In the end, our
hero fell victim to his own initiative.
Unfortunately, it's a scenario that's repeated all too often in today's business world.
But with proper planning, it can be avoided. In this chapter, we'll outline ways to
construct an effective SLM strategy, thereby not only avoiding failure, but also
planning for best results in real-world situations.
This story shows clearly what can happen when trouble isn't taken up front to obtain top-down buy-in by all IT personnel, from the CIO down. The concepts of SLM cannot be effective unless all IT personnel are informed of their particular role in making the SLA work.

After the IT department itself is fully briefed on its roles and responsibilities, it is time to choose a clientele. Who will be first?

In choosing a first client, it's best to pick according to need and visibility within the corporation. But there are no hard and fast rules, and in the end the best course of action will depend on the company's particular circumstances. The following selection criteria can help:

Making Contact

When a starting point for introducing SLM has been chosen, the next step is to initiate contact with the prospective client. This needs to be done from the top. SLM can't succeed without the endorsement of the folks who appear at the head of the client's org chart. Don't make the common mistake of assuming that the boss is too busy or doesn't care about the changes you're trying to make. Also don't assume that those below him in the organization will fall into step on their own.
When you've decided whom to contact, it's time to make your pitch. SLM puts
any IT manager in the position of a service provider who must sell the client on
a proposal's benefits. Set up a formal meeting with your target executives and give
a standard business presentation, complete with graphics (see Table 9.1). This might
be your chance to become a corporate hero. Don't reduce your effectiveness with
poor preparation.
After the presentation, be ready for confrontation. Don't expect the benefits of
SLM to be immediately evident. Furthermore, the audience (that is, the client) is
very apt to be antagonistic to the service provider (regardless of whether it is internal or external). Certainly in a majority of companies today, the IT department is
viewed with a mixture of attitudes ranging from mild suspicion to open hostility.
As with any new technology, there will be plenty of questions. Field these pleasantly and with candor. Do not become defensive; if you do become defensive,
your client will think you have something to defend. Similarly, any aggressive
behavior will work against you.
Obtaining a Baseline
No SLM strategy can begin without a baseline of performance. Baselining, or
monitoring the network and systems to determine the present state of performance, is crucial in determining 1) how services need to be changed for more
satisfactory performance, and 2) how services will be maintained and guaranteed
over time. The operative principle is simple: You must know where you are before
you can proceed to a better place.
Taking a baseline doesn't mean racing to the nearest network connector with a portable monitor. Unless all parties agree to a set of fundamental parameters ahead of time, and clearly understand what they're agreeing to, the baseline report will be worthless.
Start by deciding what measurements will be needed to adequately identify existing network and system performance. In most cases, these boil down to availability,
or uptime of all devices and system, and performance, defined in terms of response
time, network latency, or job turnaround. As ever, it's important to keep the focus
on how the end user perceives the service. The end user is the consumer, whereas
IT is the service provider. The end-user experience determines how the service is
actually meeting key business goals.
Clearly explain all metrics as you suggest them, and make sure that you consult
with colleagues in other parts of IT before suggesting anything. Clients are likely
to be confused if metrics are explained inadequately or if multiple metrics are
presented for the same service. Worse, they might feel IT is attempting to mystify
them in order to gain control of the project. Perceptions like these can sound
the death knell for SLM.
Next, determine who will be responsible for capturing the metrics, which
methodology will be used, and how the data will be captured. If application
response time and network availability are determined to be the key baseline
elements, two distinct measurements might need to be taken by two IT groups
using two distinct types of instruments. The network group, for instance, might
gauge uptime via a performance monitor, and the systems group might use a
software agent to measure application response time at the server. Choose a time
and place for coordinating input from multiple sources.
It's also vital to determine the time values for the baseline. Careful consideration
must be given to the time interval over which samples or measurements will be
taken, as well as the overall period of time allowed for baseline sampling. These
values probably will end up in the SLA itself, so it's important to give this some
thought, and perhaps even to run through a few trials before coming to a final
decision.
Regarding time intervals for sampling, it's generally best to err on the side of
granularity. If you start out with too much data, it can always be reduced to a
significant and accurate figure. Too little information, on the other hand, defeats
the purpose of baselining. Start by measuring at least on an hourly basis, then tally
results into daily and weekly averages.
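One way to handle that roll-up, sketched below with hypothetical sample data, is to retain the raw hourly readings and derive the daily (and, by extension, weekly) figures from them, so the granular data is never thrown away.

# Roll hourly availability samples (1.0 = up for the whole hour) into daily averages.
from collections import defaultdict
from datetime import datetime

hourly_samples = [                       # hypothetical readings
    ('2000-03-06 09:00', 1.0),
    ('2000-03-06 10:00', 0.75),          # 15 minutes of downtime in this hour
    ('2000-03-06 11:00', 1.0),
    ('2000-03-07 09:00', 1.0),
]

daily = defaultdict(list)
for stamp, availability in hourly_samples:
    day = datetime.strptime(stamp, '%Y-%m-%d %H:%M').date()
    daily[day].append(availability)

for day in sorted(daily):
    values = daily[day]
    print(day, f'average availability {100 * sum(values) / len(values):.1f}%')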
Let the duration of baseline sampling be determined by the business cycle itself.
A payroll application might show the full range of possible variations in response
time and availability over the two to six weeks it takes to complete a company
payroll. In contrast, a customer service department specializing in seasonal equipment might peak in bandwidth and system requirements for three months of the
year, and then show minor fluctuation for the remaining nine months. In that
instance, a baseline might have to be taken twice in one year to establish reasonable performance expectations.
Note
SLM team members must agree on the following parameters before baselining can begin:
Clients might ask for availability guarantees that can be met only if considerable funds are shelled out for multiple redundancy. Clients might also need counsel in order to avoid shortchanging themselves. One company we worked with recently signed for 99% uptime per month on all WAN links ordered from a particular carrier, but soon found out that metric allowed for several hours of downtime every 30 days. It took some haggling, but adjusting the 99% figure to reflect biweekly rather than monthly performance resulted in significant savings for the company.
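(The arithmetic is simple: at 99%, a 30-day month of 720 hours permits 0.01 x 720, or about 7.2 hours, of downtime; measured biweekly, the same 99% permits only about 3.4 hours in any one period.)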
Specific Metrics                          Sources

Availability
Network availability                      Network management platforms
                                          Performance management applications
                                          Protocol analyzers
                                          Traffic monitors
                                          RMON probes
System uptime                             Systems management platforms
                                          Systems management applications
                                          System log files
                                          Some network management systems and
                                          performance management applications

Performance
Network latency                           Performance management applications
                                          Protocol analyzers
                                          Traffic monitors
                                          RMON probes
Transaction rates                         Log files
                                          Systems management platforms and applications

Other
Recoverability (mean time to repair)      Log files
                                          Asset management systems
Security                                  Log files
                                          Radius servers
Start by taking stock of what you already have. You might already have SLM tools on hand that you don't recognize as such. Management systems, application log files, and
performance management applications all can be used to obtain SLM metrics. By
creating a record-keeping database and customized reports, it might be possible to
minimize the need to acquire additional products.
Another alternative to acquiring additional products is to upgrade existing tools.
For example, most IT organizations use one or more protocol analyzers or network monitors. Nearly all these devices feature upgrades and add-ons for SLM
implementation. Not only can these enhancements equip monitors and analyzers
with SLA reporting functions, they also can extend their scope of functionality.
Vendors like Concord Communications and Netscout now furnish basic application response-time measurement along with traffic monitoring. The same goes for
products originally designed only to measure application performance. BMC
Patrol, long known for app-management wares, now works with a range of network management platforms and performance monitoring tools. Upgrading existing products can usually be done simply by installing new releases of products to
which you are entitled under current maintenance agreements for those products.
In answering these questions, dig into details. Avoid disappointment and embarrassment by making sure that newly acquired SLM tools match specific releases of
operating system, database, and management products in house. Nail down support
contracts before officially introducing new tools: Ask all vendors for a commitment
to furnish upgrades to ensure that these key parts of your SLM system keep working together.
request adjustments, and make changes as needed? Or will hidden mistrust and
rivalries threaten the project? Will the group doing the monitoring have time to
deal with all this? Follow your instincts here. Remember, anything shoved under
the carpet at this point will surface in one way or another later on. If you sense
problems, it might be best to charge one or two team members with reviewing
metrics and generating reports.
In many instances, complex SLAs will call for input from multiple departments. If
this is the case, create a reporting team to coordinate results. This team also should
be accountable for the results; don't enable "buck passing." Pick folks who have
the time, the ability, and the diplomacy to get the job done properly.
The next choice is when to issue reports. Much depends on the terms of the SLA
itself. If a contract stipulates that IT must live up to a monthly service level, reports
should be delivered at a set time each month, preferably in time for the client to
obtain credit against next month's bill. In some cases, clients might want more frequent reports, even if the terms of the contract call for once-a-month review. Encourage all parties to compromise in order to reach a frequency that is easy to
meet within everyone's schedule, while allowing time for discussion and changes.
Next, establish a report distribution list. This can be tricky. If too many people
receive reports, you might be faced with a periodic chorus of opinions and
demands (depending, of course, on how well you've managed to field input up
front). But too few recipients can lower the project's visibility and value. Each
organization will have its own circumstances to consider, but in most instances,
it is best to err on the side of having too many rather than too few included in
the SLM report loop. If you've done your job ahead of time, report recipients
shouldn't have much to complain about or change. And some folks, happy to
have been included in the first place, will tend to drop out of active participation
over time. An alternative that is growing in popularity is the use of a Web site
with authenticated access to make the SLM report information available to clients.
However, a study of IT managers by Enterprise Management Associates (see Figure
9.1) found that hard-copy reports are still the most favored method of distributing
information about SLM performance.
The capabilities of the SLM tools chosen will help determine how reports are
distributed. As noted, many SLM tools today are Web-enabled: Results can be
emailed over the Internet or posted to a Web site for general browser access.
Where Web distribution isn't possible, the time and trouble it takes to get reports
to the right people should influence the size of the distribution list. Alternatively,
someone might be designated to supervise the actual publishing and distribution of
reports. Using administrative staff or part-time help might be economical ways to
get the job done.
Figure 9.1
Preferred methods of distributing SLM performance reports (hard copy, Web based, and verbal), from an Enterprise Management Associates survey of IT managers.
Following Through
If you've followed an orderly and well thought out strategy, your SLM rollout
should proceed smoothly. But don't rest on your laurels. All SLM projects require
continuous care and feeding to stay successful. Part of a winning strategy is a
follow-up program of continual improvement. This doesn't mean that you must
make changes just to keep up the appearance of flexibility. It does mean that you
need to be open to suggestions and willing to make corrections to any aspect of
the project as needed.
Sometimes this means parting graciously with pet products and plans. Consider the
following case: One IT manager I know implemented SLM in his company using
a management system he'd already installed. Working overtime, he prepared SLA
templates, a database, and reports tailored to fit the incumbent system. The savings
realized from this earned my friend praise and a bonus. Time passed, and the success of the SLM project caused other clients in the company to clamor for their
own contracts, based on new parameters. It was clear new tools were needed to
meet these requests. One day, one of the man's IT colleagues unexpectedly presented the SLM team with a sweeping proposal for a new suite of tools he'd evaluated. My friend felt slighted and argued publicly against the purchase. Eventually,
this caused rivalries to surface, the boss took sides, and my friend felt compelled to
take a back seat on the SLM team. By failing to recognize that following new suggestions did not detract from the value of his contribution, my friend stopped
reaping the rewards of his success.
This story shows that the right attitude is an important first step in any SLM
follow-up plan. But it's just a first step. To ensure continual improvement, you
need to get input at the right time and in the right format. A good review process
makes this happen. This includes 1) getting input from members of the SLM team
in regularly scheduled evaluation meetings, 2) conducting client satisfaction surveys
to get input that might not be put forward in a public meeting, and 3) finding
When exploring the reason for dissatisfaction, be proactive. If you've heard grumblings about SLAs, ask for input: "Is the agreement working for you?" "Can
we meet to discuss any adjustments that need to be made?" "How can we help
make this work better for your group?" Don't wait for the client to become
unhappier. Don't think that by hiding in your hole you'll avoid confrontation. If
anything, putting off contact will cause disgruntlement to fester and increase the
chances of ultimate SLM rejection.
When complaints are voiced, try to defuse them before they get to be insurmountable obstacles or crises. If a client is unhappy with the time interval being
monitored for service level performance, change it. Don't ask for more time or
argue against it. Instead, give a simple response such as, "Yes, that sounds like a
good idea, let's give it a try." Demonstrating your willingness to act as the client
wants will dispel suspicions that you're using your technical expertise to rule the
roost.
Sometimes, mistakes will be made. You might fail to be proactive or initiate SLM contact with clients. If this happens, there is a risk that anything you say regarding SLM will be viewed with skepticism. You must accept this. You made a mistake by
not being proactive or by responding inadequately to your clients, and you must
pay the price. There is no magic fix. Candor and honesty, coupled with open communication, stand the best chance of healing the wound over time. So if you find
yourself confronted by an unhappy client, be frank. Admit your mistakes, outline
your plans to resolve the problems, and move forward. Invite the client to work
with you to establish SLAs that will meet their requirements. And keep the lines
of communication open. Where problems have occurred, it's important to exceed
the minimum level of dialog that the SLA process requires.
Summary
Effective implementation of SLM requires more than good intentions and good
products. It calls for a carefully considered strategy that emphasizes cooperation
and planning. IT managers can ensure success by first analyzing their company's
unique culture; then proceeding with an open mind and a willing attitude to create a plan that fits it. Putting the plan into action requires assembling a team of
professionals who are committed to the rollout. The team must use a thorough,
orderly process to create SLAs, track them, and distribute reports in agreed-upon
formats. In addition, IT must follow up with ongoing checks on user satisfaction.
At every juncture, the time and trouble invested in establishing trust, reliability, and
orderly and open communication will determine an organization's success in
putting SLM into practice.
CHAPTER 10
Capturing Data for Service Level Agreements (SLAs)
Four broad parameters typically used in evaluating service levels are as follows:

Availability
Performance
Reliability
Recoverability
Availability refers to the percentage of time available for use, preferably of the end-to-end service, but many times of a server, device, or the network. Performance
basically indicates the rate (or speed) at which work is performed. The most popular
indicator of performance today is response time. However, other indicators are also
useful in specific contexts. For example, in a company that performs remote data
backups for its clients, an important performance indicator would be the file transfer rate. Reliability refers to how often a service, device, or network goes down or
how long it stays down, and recoverability is the time required to restore the service
following a failure. These metrics offer high-level views of service quality, whereas
response time is a way to directly measure how the end user's productivity and satisfaction are affected by service performance.
Simply providing measurements of availability, performance, reliability, and recoverability is not enough to perform service level monitoring. All aspects of service
that affect end-user productivity and satisfaction should be covered by the Service
Level Agreements. The characteristic that has the highest visibility among customers is response time. Other aspects to measure and monitor include workload
volumes, help desk responsiveness, implementation times for configuration changes
and new services, as well as overall customer satisfaction.
With a multitier server structure, end-to-end response time might not provide a detailed enough picture of delays within the underlying components to pinpoint performance problems. Another technique, inter-server response time measurement, focuses on the response time between servers. Providing multitiered response time measurement allows IT personnel to drill down and discover the source of performance problems. Undoubtedly the best approach to measuring response time is to
implement both end-to-end and inter-server response time measurements.
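As a simple illustration of the drill-down, the inter-server measurements can be subtracted from the end-to-end figure to apportion a slow transaction among the tiers; the timings below are hypothetical.

# Hypothetical measurements (seconds) for one business transaction.
end_to_end = 4.8                # measured at the user's workstation
web_to_app = 3.1                # inter-server: web server to application server
app_to_db = 2.2                 # inter-server: application server to database

client_and_network = end_to_end - web_to_app    # time outside the server tiers
app_tier = web_to_app - app_to_db
db_tier = app_to_db

for tier, seconds in [('client/network', client_and_network),
                      ('application tier', app_tier),
                      ('database tier', db_tier)]:
    print(f'{tier:17s} {seconds:.1f} s ({100 * seconds / end_to_end:.0f}%)')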
Today, only a small percentage of IT departments set and measure Service Level
Agreements for distributed application availability and performance. Many of these
IT departments do so using in-house developed tools and manual processes. These
processes are typically based on analyzing end-user service problem calls and correlating the end-user locations with the components that are failing or performing
poorly. New technologies are emerging that can assist the IT department to measure
distributed application availability and performance in a more automated fashion.
Although Service Level Agreements should align with the end-user perception of
service quality, IT departments have been reluctant to agree to such SLAs for distributed application services because of the difficulty in measuring actual application availability and performance on an end-to-end basis.
Today's application architectures vary widely and typically use some variation of
the client/server model. This results in some processing occurring on the desktop,
some on the application server, and in the case of a multitiered architecture, some
occurs on back-end database servers. This complicates capturing end-to-end
response times because a single business or application transaction will span
multiple interactions between the various client/server layers.
Tip
Selecting which method to use depends on a number of factors including access to code for instrumentation purposes, willingness to proliferate and manage agents on desktops, ability to acquire
sophisticated network traffic monitors, and the inherent inaccuracies with some of these approaches.
In many cases, a combination of approaches deployed pragmatically will provide the best solution.
Use of these techniques does not eliminate the need for measuring the service levels of individual components. In many cases, these techniques will identify service
problems based on end-to-end measurements, but this might not be enough to
determine where the problem is located or how to correct it. However, by comparing response times by application across various locations, it might be possible
to isolate the problem location. For example, if an application is performing poorly
across all locations, the server or database is the likely cause. If an application is
performing poorly in only one location, it is likely a location-specific problem
such as the local server, the local area network, or the wide area network connection
between that location and the application server.
Caution
These techniques for measuring end-to-end response times aren't able to detect outages of
individual desktops. These methods measure availability and performance of application transactions
between the user and the business process. This might be an issue for client/server applications
in which a significant portion of the application code actually runs on the desktop itself. The IT
department should continue to monitor help desk calls and the problem resolution system closely
to determine the business impact of individual desktop problems.
We will now examine each of these methods in more detail. As these techniques
require data to be collected continuously, we will also discuss some of the common
architectures used by data monitoring solutions later in the chapter.
Figure 10.1
Note
UNIX systems come with a variety of performance measurement utilities. Unfortunately, these utilities were designed as standalone tools, and each addresses the particular problem the utility
designer was trying to solve at the time of its design. The outputs of these utilities vary between
UNIX variants. In addition, the procedure for underlying measurement is not well documented and
supported. As a result, it takes a large amount of effort to correctly collect, understand, and interpret UNIX performance data in consistent ways.
The utilities generally available with the UNIX operating system include
sar, the system activity reporter, records and reports on system-wide resource utilization and performance information, including total CPU utilization. CPU utilization is measured using the tick-based sampling method: a system counter accumulates the number of CPU ticks during which a non-idle process was running, and this counter is sampled at specified intervals to compute the average CPU utilization between samples. This method leads to a relatively low capture ratio. (A sketch of this tick-sampling arithmetic appears after the overview below.)
The accounting utility records the resources used by a process upon the process's
termination. The principal drawback of this method is that no information is
available for the process until it terminates. Accounting reports summarize
these statistics by the command or process name and username.
The ps utility provides a snapshot of the processes running on the system as
an ASCII report. It reports the amount of CPU used by the process since its
inception. When reporting information on all the processes, overhead is quite
high.
As seen from this quick overview, these tools primarily provide resource utilization
information and don't measure the end-user response times or application transaction throughput. The output of these utilities differs among the assorted UNIX
variants and doesn't provide historical or trend information.
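To make the tick-sampling arithmetic concrete, the following sketch reads cumulative CPU counters twice and computes utilization from the deltas. It uses the Linux /proc/stat interface purely as an illustration, since, as noted, each UNIX variant exposes these counters differently.

# Illustrative tick-based CPU sampling: read the cumulative CPU tick counters
# twice and compute utilization from the deltas (Linux /proc/stat shown).
import time

def read_ticks():
    with open('/proc/stat') as f:
        fields = [int(x) for x in f.readline().split()[1:]]   # aggregate "cpu" line
    idle = fields[3] + fields[4]       # idle + iowait ticks
    return sum(fields), idle

total1, idle1 = read_ticks()
time.sleep(5)                          # the sampling interval
total2, idle2 = read_ticks()

busy_pct = 100.0 * (1 - (idle2 - idle1) / (total2 - total1))
print(f'CPU utilization over the interval: {busy_pct:.1f}%')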
A number of performance monitoring products are available from independent
software vendors. Most of these collect data through a standard UNIX interface
called the /dev/kmem kernel device driver. The advantages of the third-party
products include the ability to normalize and compare the data across different
UNIX variants as well as greater productivity through enhanced user interfaces
and reports, including trend analysis reports.
Windows NT provides analogous built-in facilities:

Perfmon monitors performance and server resource usage (including CPU, memory, and disk I/O). It uses counters from the Windows registry, and the data can be logged and viewed online or charted in reports.

Task Manager provides information on all the processes and services running and the amount of memory and CPU they are using.

Process Explode monitors processes, threads, and the committed mapped memory. This is primarily of use to developers.

Quick Slice is a basic tool for viewing CPU usage by each active process.
Similar to the standard UNIX utilities, these Windows facilities focus on resource
utilization and don't directly measure or monitor the service levels experienced by
users. The event logs can also provide a significant amount of information about
activity on an NT system. The NT Resource Kit Utilities allow these logs to be
dumped and imported into a database for easier manipulation and analysis.
When the information has been placed in the repository, analysis tools that support
a specific application service are required to correlate and aggregate information
across all components. It is typically easier to use this method to determine end-to-end availability than it is to determine end-to-end response times.
The primary drawbacks of analyzing network traffic are the inability to define transactions in user
terms and the difficulty of matching all traffic. Additionally, these techniques do not capture
response time stemming from desktop application components.
The client agent attempts to detect the start and end of a transaction and to measure the time between these events. Typically, the agent then sends the measured data back to a central place where broader analysis occurs.
Caution
The main issue with this approach is the high costs of modifying legacy application code and the
lack of coordination between most IT operations staff and applications development departments.
These client agents capture response time from the client perspective without having to instrument the application itself. For example, some capture information on
Web browser interactions such as the response time for page retrievals or downloads. Similarly, some can decode client transactions for popular ERP applications.
The primary benefits of this method are the granularity of collection, for example
at an individual screen level, lack of application instrumentation, and the ability to
analyze user interaction from the detail data.
Tip
The primary drawback of this approach is the large volume of data captured. To mitigate this issue,
place agents on representative desktops rather than on every desktop in the organization. Using a
sampling mechanism can also reduce the volume of data while still providing reasonable availability
and response time metrics.
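A minimal sketch of such a sampling mechanism is shown below: a client-side wrapper times each transaction but forwards only a configurable fraction of the measurements. The transaction name and the forwarding call are placeholders for whatever the agent actually reports.

```python
# Sketch: a client-side measurement wrapper that samples transactions so only
# a fraction of the measured response times is forwarded to the repository.
# forward_to_repository() is a placeholder for the agent's real transport.
import random
import time

SAMPLE_RATE = 0.10        # record roughly 1 in 10 transactions

def forward_to_repository(record):
    print("reporting:", record)          # placeholder for the real upload

def measure(transaction_name, func, *args, **kwargs):
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    if random.random() < SAMPLE_RATE:    # sample to reduce data volume
        forward_to_repository({"transaction": transaction_name,
                               "response_time_s": round(elapsed, 3)})
    return result

# Usage: wrap the user-visible action the agent wants to time.
measure("load_order_screen", time.sleep, 0.05)
```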
These client capture agents might also be appropriate for user workflow analysis in
addition to capturing the service quality from the end-user perspective.
Instrumenting Applications
The next approach involves building application programming interfaces (APIs)
that provide monitoring directly into the application. These API calls allow a
monitoring tool to query the application for end-to-end response times, as well as
run application management actions on the application, for example backup and
recovery routines. This approach is still developing, as an industry accepted standard
for these API calls has not emerged yet.
Even after a standard becomes widely accepted, many popular applications will
likely go through several releases before they fully support the standard. This
embedded API approach offers the best accuracy for measuring application
response time. The Application Response Measurement API (ARM), discussed
in Chapter 5, "Standards Efforts," is a good example of an API used to measure
response time.
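The general shape of this instrumentation pattern is easy to illustrate: the application marks the start and end of each business transaction, and the elapsed time and completion status are recorded for later reporting. The class and method names in the sketch below are illustrative only; they are not the actual ARM bindings.

```python
# Sketch of an ARM-style instrumentation pattern: the application brackets each
# business transaction and the elapsed time is captured. Names are illustrative.
import time
from contextlib import contextmanager

class TransactionMonitor:
    def __init__(self):
        self.records = []

    @contextmanager
    def transaction(self, name):
        start = time.perf_counter()
        status = "GOOD"
        try:
            yield
        except Exception:
            status = "FAILED"
            raise
        finally:
            self.records.append((name, status, time.perf_counter() - start))

monitor = TransactionMonitor()

# The application wraps each business transaction it wants measured.
with monitor.transaction("submit_purchase_order"):
    time.sleep(0.02)        # stand-in for the real business logic

print(monitor.records)      # [('submit_purchase_order', 'GOOD', ~0.02)]
```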
The instrumentation APIs define the start and end of business transactions and
capture the total end-to-end response times as users process their transactions. This
technology is invasive to the application itself. The strength of this approach is that
transactions are defined in terms of business processes. The primary drawback is
application invasiveness, which is an expense that most enterprises are willing to
incur for only their most critical applications. Further, instrumentation adds overhead that could impact the runtime performance of the application.
The need for application modification makes this approach inapplicable to older, noninstrumented
versions of an application. Hence, this intrusive instrumentation approach is best used in a situation
in which a full revision and upgrade of the application is already required or under way.
Commands like ping and traceroute are special cases of simulated transactions. They
measure only the response time of the network round trip, and do not include any
information about the application server or database. These approaches can be useful in detecting network congestion, diagnosing if a problem is network or server
related, and separating measured transaction times into network and non-network
times. As an example of the latter, transaction-level synthetic transactions correlated
with network-level pings can provide a reasonable division of response time into
network and server times.
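The following sketch illustrates that division of response time: it times a synthetic Web transaction and a ping against the same host, then attributes the difference to the server and application side. The host name is a placeholder, the ping flags assume a UNIX-style ping, and the split is an approximation rather than an exact attribution.

```python
# Sketch: split a synthetic transaction's response time into network and
# server components by correlating it with a ping to the same host.
# "example.com" is a placeholder; "-c" assumes a UNIX-style ping command.
import re
import subprocess
import time
import urllib.request

HOST = "example.com"

def ping_rtt_ms(host):
    out = subprocess.run(["ping", "-c", "3", host],
                         capture_output=True, text=True).stdout
    times = [float(m) for m in re.findall(r"time=([\d.]+)", out)]
    return sum(times) / len(times) if times else None

def synthetic_transaction_ms(url):
    start = time.perf_counter()
    urllib.request.urlopen(url, timeout=10).read()
    return (time.perf_counter() - start) * 1000

total = synthetic_transaction_ms(f"http://{HOST}/")
network = ping_rtt_ms(HOST)
if network is not None:
    print(f"total {total:.0f} ms, network ~{network:.0f} ms, "
          f"server/application ~{total - network:.0f} ms")
```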
Tip
Synthetic transaction generation, together with built-in sampling capabilities, offers the best
approach to measuring availability and response time metrics for the widest variety of business
transactions. This approach is not intrusive into the application, and it requires less technical skill to
implement.
their management solution. All these agents perform similar tasks, but the
functionality differs based on certain agent characteristics.
True intelligent agents have the following characteristics:
Autonomous: Operates independently of the management console, including the ability to start, collect data, and take actions.
Social: Communicates with other agents, management consoles, and directly with users.
Reactive: Detects events and initiates actions based on the event.
Dynamic: Operates differently depending on time and the context of other activities that might be happening.
There are also a number of technical aspects of an intelligent agent, including the following (a minimal agent sketch follows this list):
Asynchronous: Does not need a permanent link to the initiating event or console.
Event-driven: Reacts to events and runs only when certain events occur.
No active user interaction: Does not require constant user intervention to run.
Self-executing: Has the ability to run itself.
Self-contained: Has all required knowledge to perform its task.
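The minimal agent loop below illustrates several of these characteristics at once: it runs autonomously in its own thread, reacts only when events arrive, and carries its own rules for what to do. The event types, event source, and corrective actions are placeholders.

```python
# Sketch: a minimal autonomous, event-driven agent loop. It runs on its own,
# reacts only when events occur, and is self-contained in its rules.
import queue
import threading
import time

events = queue.Queue()

RULES = {   # self-contained knowledge: event type -> action
    "cpu_high":  lambda e: print("agent: throttling batch jobs", e),
    "disk_full": lambda e: print("agent: purging temp files", e),
}

def agent_loop():
    while True:
        event = events.get()            # blocks until an event arrives
        if event is None:               # shutdown signal
            break
        action = RULES.get(event["type"])
        if action:
            action(event)               # reactive: act on the event locally

threading.Thread(target=agent_loop, daemon=True).start()

# Something (a poller, an SNMP trap receiver, ...) feeds events to the agent.
events.put({"type": "cpu_high", "host": "server01", "value": 97})
time.sleep(0.1)
events.put(None)
```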
To avoid excessive overhead on the servers, the number of agents should be limited. This might be best achieved by acquiring agents from as few different vendors
as possible. When selecting agent vendors, agent-to-agent integration capabilities,
agent intelligence, and agent security are important considerations.
Tip
Before deploying agents, implement a pilot to measure CPU, memory, and bandwidth consumption of
agents and consoles under a variety of operating conditions. By estimating the number of events
and the overhead required to manage that number of events, the agent impact on the system can be
accurately planned.
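A back-of-the-envelope version of that estimate looks like the following; every input value is a placeholder to be replaced with numbers observed during the pilot.

```python
# Sketch: back-of-the-envelope agent overhead estimate. All inputs are
# placeholder assumptions; replace them with numbers measured in the pilot.
events_per_hour   = 1_200        # measured event rate per server
cpu_ms_per_event  = 4.0          # agent CPU cost observed per event
bytes_per_event   = 600          # payload forwarded to the console
servers           = 50

cpu_pct_per_server = events_per_hour * cpu_ms_per_event / (3600 * 1000) * 100
bandwidth_kbps     = servers * events_per_hour * bytes_per_event * 8 / 3600 / 1000

print(f"agent CPU overhead per server: ~{cpu_pct_per_server:.2f}%")
print(f"aggregate console bandwidth:   ~{bandwidth_kbps:.1f} kbit/s")
```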
Tip
Multiple agents that duplicate agent functionality on a server should be avoided wherever possible. This can be achieved by careful coordination across management disciplines such as network management, database administration, and systems management. Each management discipline should be responsible for controlling agent deployment within its functional area. Policies and procedures should be developed for deploying and managing distributed agents.
Measurement Techniques
Before using a specific performance metric, it is important to have a clear and
unambiguous understanding of its semantics. This is particularly important when
using multiple metrics in conjunction to derive end-to-end service quality or to
solve a problem of service degradation. Almost all operating systems and management solutions have had some metrics with ambiguous meaning at some point in
time. Measurement techniques for collecting performance data can be divided into
two general categories, which are event-driven and sampling-based.
Comparative Analysis
The event-driven collection method is generally the most accurate, but it does
have some limitations. Its accuracy depends on the level to which the events are
instrumented. There can also be discrepancies depending on the nature of the event interrupt and when the actual measurements are taken. Depending on the frequency of events, the overhead of the event-driven measurement can be significantly larger than that of sampling-based measurement, and it can potentially distort the measurements significantly.
On the other hand, the sampling-based method is subject to errors when multiple
activities occur or processes run between two samples. The activity occurring at
the time of the sample will be allocated the entire length of the sample interval.
Other activities or processes are not allocated any time during that sample.
Similarly, if an activity takes place totally within a sample or if a process is created
and terminated between two samples, it is not allocated any time at all.
Note
The amount of error in the sampling depends primarily on the sampling frequency. Longer time
between the samples will result in larger potential errors. The trade-off is that more frequent sampling places more measurement overhead on the system.
Event-Driven Measurement
Event-driven measurement means that the times at which certain events happen
are recorded and then desired statistics are computed by analyzing the data.
For example, when measuring CPU utilization, the events of interest are the
scheduling of a process to run on a processor and the suspension of its execution.
The elapsed time between the scheduling of a process to run and suspension of its
processing is added to the CPU's busy counter and the process's CPU use counter,
which can be sampled and written periodically to a log file or repository. With this
method, both the total CPU utilization and the CPU utilization for each process
are measured.
The same method can be used for collecting other information including end-to-end
response times based on instrumentation APIs or synthetically generated transactions.
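The bookkeeping just described can be sketched as follows, using a small synthetic trace of scheduler events in place of real kernel instrumentation.

```python
# Sketch: event-driven CPU accounting. Scheduler events (process put on / taken
# off the CPU) drive the counters; the trace here is synthetic, not a real feed.
from collections import defaultdict

trace = [           # (timestamp_ms, event, pid)
    (0,   "run",     101),
    (30,  "suspend", 101),
    (30,  "run",     202),
    (75,  "suspend", 202),
    (90,  "run",     101),
    (100, "suspend", 101),
]

cpu_busy_ms = 0
per_process_ms = defaultdict(int)
run_start = {}

for ts, event, pid in trace:
    if event == "run":
        run_start[pid] = ts
    elif event == "suspend":
        elapsed = ts - run_start.pop(pid)
        cpu_busy_ms += elapsed                 # total CPU busy counter
        per_process_ms[pid] += elapsed         # per-process CPU use counter

interval_ms = trace[-1][0] - trace[0][0]
print(f"CPU utilization: {100 * cpu_busy_ms / interval_ms:.0f}%")   # 85%
print(dict(per_process_ms))                   # {101: 40, 202: 45}
```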
Sampling-Based Measurement
The sampling method of data collection involves taking a scheduled periodic look
at certain counters or information access points. For example, when measuring
CPU utilization by sampling, the measurement method periodically takes a sample
to see if any process is running on the CPU, and if so, it increments the system
busy counter as well as the CPU usage counter for the process. The data collector
will typically sample these counters and record values in a log file.
The sampling method is generally more efficient because it places less overhead on
the system under measurement.
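A corresponding sketch of the sampling approach is shown below. The check for what is currently running on the CPU is simulated here; a real collector would read kernel counters or /proc.

```python
# Sketch: sampling-based CPU measurement. At each tick the collector checks
# whether a process is on the CPU and, if so, charges the whole interval to it.
import random
from collections import defaultdict

SAMPLE_INTERVAL_MS = 100
SAMPLES = 600                       # one minute of 100 ms samples

def running_process_at(tick):       # stand-in for "what is on the CPU now?"
    return random.choice([None, "db_writer", "app_server", "app_server"])

busy_samples = 0
per_process = defaultdict(int)

for tick in range(SAMPLES):
    pid = running_process_at(tick)
    if pid is not None:
        busy_samples += 1           # system busy counter
        per_process[pid] += 1       # CPU usage counter for that process

print(f"sampled CPU utilization: {100 * busy_samples / SAMPLES:.1f}%")
for pid, n in per_process.items():
    print(f"  {pid}: {100 * n / SAMPLES:.1f}%")
```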
Summary
There are multiple data collection techniques and agent architectures that are used
to collect data that is useful for measuring and monitoring service levels. Careful
consideration should be given to the nature of the technology used and the
deployment of collection agents. The goal is to ensure that sufficient data is collected to accurately measure service quality, while not placing excessive overhead
on the computing environment.
It is also very important to ensure that service levels are measured on an end-to-end basis so that the end-user experience is captured. A number of techniques can
be used to measure end-to-end availability and response times. When selecting a
method to use, considerations include access to code for instrumentation, agent
proliferation, and the level of expertise in house for implementing and supporting
the management solution. In many cases, adopting a pragmatic approach utilizing
synthetically generated transactions that are measured using sampling techniques
will provide sufficient scope and accuracy of information.
Very critical and time-sensitive applications might require more sophisticated techniques such as intrusive application instrumentation or client agents to provide
more comprehensive, accurate information.
CHAPTER 11
Service Level Management as Service Enabler
The benefits of service level management can be clearly delineated for any organization that takes the time to make it work. But SLM can be especially advantageous for those companies seeking to sell their IT services to outside users. In fact,
the growing ranks of Internet service providers (ISPs), application service providers
(ASPs), and outsourcers testify to the value of SLM.
By defining the parameters of acceptable service and setting clear goals and expectations for providers and users, SLM provides a framework in which providers can
offer more and better services, while maximizing the potential of existing ones.
SLM also is the key to helping users ensure that they get the most value from
their growing investment in outside services. In most organizations, increased
demands on IT are accompanied by staff shortages and budget constraints. IT
managers are turning to external providers for help. SLM gives them a way to
quantify and get the level of performance and capacity they require.
In this chapter, we'll take a closer look at how SLM is playing a role in the growing world of online services. In doing so, we will attempt to focus on SLM issues from two perspectives: the service provider's and the end user's.
The Ascendance of IP
First, we will take a look at the types of services that are most often used by corporate customers to extend their internal networks. These services vary, but there is
a preponderance of demand for IP-based services. The reasons are clear: Internet
access is inexpensive compared with the costs of dedicated, private networks. The
Internet is a fast and easy way to extend the corporate network without adding
new facilities. An Internet presence gives companies with limited geographic scope
a way to market their wares to 55 million computers in 222 countries.
Improved security and performance on the Internet also make it a unique environment for .com businesses such as Amazon.com that exist solely in cyberspace.
The market for firms like Amazon.com that conduct business-to-consumer electronic commerce over the Internet is expected to exceed $100 billion over the
next three to four years. And business-to-business electronic commerce, in which
companies use the Internet to support transactions with partners and suppliers,
is even bigger. Estimated revenue for companies in this space is expected to top
$1 trillion within the same timeframe.
Market opportunities like these are forcing SLM into the spotlight, as providers and
their customers seek ways to establish and maintain ever-higher levels of network
performance and availability in an increasingly service-oriented environment.
Note
The advantages of Internet-based services are as follows:
For end users: A quick, inexpensive way to extend in-house networks and interact with customers and suppliers.
For service providers: Fast deployment of services at low cost; worldwide reach; and a unique environment for services like electronic commerce.

A Spectrum of Providers
As demand rises for Internet-based services, the market is becoming increasingly segmented. Internet service providers (ISPs) tout a range of offerings IT professionals can use to increase their companies' online capabilities, including standard Web access and hosting, email, virtual private networks, electronic commerce networking services, remote access, and voice-over IP services. According to investment banker Credit Suisse First Boston (New York), projected worldwide revenue for ISPs will exceed $45 billion by 2002.
Meanwhile, an emerging segment of application service providers (ASPs) offers remote access to specific applications, such as enterprise resource planning (ERP) applications, corporate databases, and complex vertical applications, over the Web. Market researchers such as Forrester Research (Cambridge, MA) and International Data Corp. (IDC, Framingham, MA) estimate this market will grow at annual rates of 100% or more, reaching at least $2 billion by the end of 2001.
Companies also are turning to outsourcers for assistance. Consultants and systems integrators often take over all or part of the duties of the data center, including supervision of local and wide area network services, maintenance and management of assets, network monitoring, and security. According to IDC, worldwide revenue for outsourcing services now exceeds $100 billion and is expected to reach $151 billion by 2003.

What's an ASP?
The demands of e-business are driving companies to sign on with service providers who offer them online access to mission-critical applications. This approach reduces the initial investment organizations must make, and it saves development and implementation time. It also eliminates the need to hire extra IT talent to run new systems.
But, like all technology buzz phrases (such as service level management), the term ASP seems to take on new meanings every month. And as this profitable services segment grows, the term is likely to become even more inclusive, at least to marketing experts. Outsourcers, integrators, and even consultants are jumping on the revenue bandwagon and are labeling themselves as ASPs.
On a more down-to-earth level, the question of who really qualifies as an ASP is more limiting. According to the ASP Consortium (Wakefield, MA), "An application service provider manages and delivers application capabilities to multiple entities from data centers across a wide area network." Market research firm International Data Corp. (Framingham, MA) gives this definition: "Application service providers (ASPs) provide a contractual service offering to deploy, host, manage, and rent access to an application from a centrally managed facility. ASPs are responsible for either directly or indirectly providing all the specific activities and expertise aimed at managing a software application or set of applications."
These definitions leave room for two kinds of providers: those who offer applications from their own facilities, and those who rely on the cooperation of other carriers or Web hosting companies to furnish the necessary network services. In either case, the ASP is charged with the direct management of its own servers and holds ultimate responsibility to the customer for maintaining agreed-on levels of service.
The generally accepted definition of ASP does not include those companies that provide applications over a customer's own network, such as systems integrators. And in most instances (although there is some disagreement about this) it does not include network outsourcers. In some cases, however, these providers might host customer applications from their own servers, using the Web as the transport medium.
The ASP market has many helpers, as evidenced by the list of ASP Consortium members, which includes hardware vendors like Compaq and Cisco, which furnish the servers and network infrastructure gear for ASPs, as well as software suppliers like Citrix, Great Plains Software, and IBM. Still, these companies do not qualify as ASPs by themselves.
Don't try to sell corporate networkers on the merits of the honor system. What works well at West Point
shows flaws outside the walls of the academy, especially when it comes to SLAs: Can customers really
trust that carriers will meet the pledges they make, and make restitution when they come up short?
Figure 11.1 [Bar charts of survey results on Service Level Agreements: improvements respondents want (more metrics, tougher penalties, more frequent measurements, more reliable tools for measuring performance) and reasons SLAs disappoint (provider does not honor them, provider unwilling to negotiate, provider responds to performance problems only case by case, other).]
Ask David Giambruno, the global transition program manager for medical equipment manufacturer
Datex-Ohmeda, a subsidiary of Instrumentarium Corp. (Helsinki, Finland). He suffered days of downtime on AT&T's international frame relay network last October, and while he was fixing the problem, he discovered AT&T had charged him for five years of service he never knew about.
Different Strokes
The burgeoning growth in IP services, and the variety of services now on offer, presents a challenge when it comes to SLM. New services such as e-commerce and ASP services will not always fit a single SLM mold; an extension or modification of existing SLM parameters is often required. And in some cases, it might be necessary to set new parameters.
"'We had PVCs (permanent virtual circuits) in Canada that weren't even hooked up to routers: he says.
To top it off, ATEtT (Basking Ridge, NJ) refused to make compensationpointing out the SLA (Service
Level Agreement) didn't cover either problem. ATEtT declined to comment, but Giambruno is blunt:
'I've learned that unless you can actually see what's going on in your network, it's going to cost you:"
From the article, "SLA Monitoring Tools, Heavyweight Help," Data Communications magazine,
February 7, 1999.
by those of the business partners and suppliers who support the customer's online
presence. Callers who place an online order with an e-commerce retailer, for
example, might experience a response time delay if the supplier on whom the
retailer depends suffers a network failure. Although the retailer is not technically
responsible for the delay, it will affect his ability to deliver service to his customers.
He is liable to compensate those customers if response times fall below promised
service levels. The retailer is legally the provider of service, regardless of the components that service contains. Any SLAs set up with e-commerce providers, and any SLAs made between providers, need to reflect these new facts of life in the world of online services.
Emerging ASP services also present special challenges. ASP services are still so new
that users and providers have not determined the precise elements that will constitute service-level criteria. New types of services are changing the rules. Do ASPs,
for instance, offer their customers SLAs based on user response time, server
response time, overall network uptime, or a combination of all these? Questions
like these are still in debate as the market for services develops.
There are other SLM challenges presented by emerging services: In many
instances, a service might include interdependencies between Web hosting
providers and carriers offering the underlying network facilities. SLAs will need
to be established between the multiple providers as well as between providers and
customers.
Smart Implementation
Regardless of the complexities of particular SLA criteria, all the suggestions and
templates for creating and maintaining SLAs covered so far in this book can be
successfully applied to the service environment. From the perspective of the user
of services as well as the provider, it is important to set up a task force, define
SLA parameters, and agree on continual methods of monitoring and follow-up.
The service environment also presents unique SLM implementation challenges,
both from the user and service provider perspectives. It is important to be aware
of these from the beginning in order to ensure success.
Fundamentally, these distinctions center on the fact that the service provider holds
the advantage in relation to its customers. By offering the vital services on which
the customer's business is run, the provider is virtually in control of the customer's
business itself.
will be sure to surface to the users' disadvantage later on. Ask lots of questions: As noted, most service providers have prefabricated SLAs they use as standard. They will not offer to extend these SLAs unless they are asked to do so.
Note
Remember: Like it or not, corporate customers are the underdogs in the service relationship.
Controlling the business infrastructure gives the provider control over the customer's business. It
is therefore vital to clarify all terms of the SLA and its implementation right from the start.
IT professionals can give themselves a better chance for success by keeping several things in mind when setting up an SLM relationship with service providers:
Know what you're talking about: Enter negotiations armed with baseline measurements. Know what constitutes adequate performance and capacity for all business functions that the provider will be expected to support. Also know exactly how long you are willing to wait to have something fixed if it breaks.
Establish a common frame of reference: Make certain up front that the terms used in the SLA match those the service provider uses. Also, agree with the provider on what methods and products you'll be using to monitor conformance with parameters. Some providers might not support your products, and vice versa, making it tough to make a case for compensation if something goes wrong.
Document everything: Make sure that you ask your provider to endorse all SLA parameters, including reimbursement in the event of outage or failure. If your requests are not documented, the provider will be under no obligation to follow them. And the fact that the provider has so many customers will make it resistant to extending special unasked-for privileges.
Be ready to pay for extras: Providers will often prove flexible when asked to extend the terms of their one-size-fits-all SLAs. But most will ask for additional payment beyond a certain point. This situation is normal; expect to pay for the terms you need.
Get your act together: Make sure that your in-house SLM team is well prepared and unified. Recordkeeping is a key part of the SLA and needs full support on your end. If something does go wrong with the service you are contracting, it will be vital for team members to work as a well-informed team in order to get repairs and compensation.
Keep an open mind: Our advice so far is based on the fact that users need to take extra precautions in setting up SLM with their providers. But don't maintain a defensive attitude: Remember that most service providers intend to do the best possible job for their customers in order to keep themselves in business. Encouraging an atmosphere of cooperation will serve you better than harboring an adversarial attitude.
SLM can work as well, or better, for service providers as it can for their customers. But there are unique considerations from the service provider perspective.
Providers might start from a position of power relative to their customers, but this
does not mean that they themselves are not vulnerable. In many ways, today's service providers are just as vulnerable as their customers. After all, their business is to
offer reliable service. If they fail to do that, they cannot stay in business.
SLM does more for service providers than merely offer protection from liability. It
helps them create a frame of reference for new and existing services. Knowing the
level of performance they can guarantee allows them to pass along SLAs to their
customers that differentiate them from other providers.
SLM also helps in the creation of differentiated services, in which different groups of
users are offered disparate guarantees of service, based on their payment plan.
"Gold" customers, for instance, might be offered continuous availability at an
agreed-upon level of response time; "silver" customers would get response times
within a certain range of measurement; and "bronze" customers would receive
"best effort" service. SLM provides the input that enables the service provider to
offer differentiated services; and it also gives them the ongoing framework for
implementing them with customers. Table 11.1 illustrates a typical model used by
providers of differentiated services.
Table 11.1 A Typical Differentiated Services Model

Class      Rate           Response Time   Availability
Gold       10Mbps         <1 second       99.9%
Silver     5 to 10Mbps    <3 seconds      Over 90%
Bronze     2 to 5Mbps     <5 seconds      Over 80%
Standard   Best effort    Best effort     Best effort
In general, service providers have the same problems their customers do, often on a
grander scale. Also, service providers face a range of challenges that their customers
do not. Specifically, they must ensure that other providers and suppliers on whom
they depend can furnish a level of performance, availability, and capacity that
enable them to pass along a single, consistent level of service to their customers.
In effect, many of today's online services, such as e-commerce services, depend on
a group of suppliers maintaining a chain of performance. A break in the chain will
affect the ability of all participants to meet service-level expectations.
In some cases, service providers will need to take the initiative in establishing
SLA parameters ahead of industry trends. Many ASPs, for instance, are breaking
new ground when it comes to service models. They find themselves creating SLAs where few precedents exist. Providers can improve their chances for success by keeping several things in mind:
Make infrastructure serve your SLAs: Service providers can make the most of service level management by building it into their infrastructure through the use of technologies like Quality of Service (QoS), which uses intelligence built into routers and switches to control the flow of network traffic. It is worth the investment of time and effort to find and take advantage of these new resources for guaranteeing performance.
Stay in tune with customers: Most providers are focused more on determining the kinds of services customers want than on ways to present new SLAs. Still, it pays to keep in touch with demands for SLA improvements, particularly as the quality and content of SLAs will increasingly differentiate services from multiple providers in emerging segments.
Take the lead with other participants and suppliers: Many types of services today call for cooperation among multiple providers. E-commerce is one example. Leave nothing to chance if you have these kinds of interdependent relationships. Remember, your business depends on all the links in the chain performing consistently.
Keep an open mind: Stay flexible with your customers and business suppliers. Rigidity will not serve you well in a market in which new competitors are ready to take business from you at a moment's notice.
Summary
The burgeoning services market is a proving ground for SLM. Users need to be firm and clear in negotiations with providers. Providers need to stay in touch with customer demand and remain flexible and open to new methods, while ensuring a consistent level of performance and availability in multiprovider situations. Over time, the demands of the service environment will no doubt help service level management develop beyond its present scope.
CHAPTER 12
Moving Forward
management in the industry today. However, the story has only just begun. We anticipate many advances during the coming months and years. The continuing
maturity of the understanding and best practices of service level management will
happen more rapidly if IT managers share information, monitor the evolution of
standards, and push vendors to provide more capable solutions.
This chapter recaps some of the more salient aspects of the current state of service
level management and also suggests a mechanism for commencing and continuing
a dialog to assist in the maturation process for service level management.
Similarly, many corporations are moving to a direct self-service model for interacting with their customers using the Web as the communication mechanism.
Hundreds of millions of dollars are being spent to attract customers to those Web
sites. If the site is not available, the Internet application is not responsive, or the
customer feels vulnerable from a security or privacy perspective, he will not have
a good experience. Not only will he be reluctant to buy something or conduct a
business transaction on the initial visit, it is unlikely that he will return to the site, or it will take significant marketing dollars to attract him again.
Chapter 8, "Business Case for Service Level Management," provides a business case
for service level management along with a sample cost justification worksheet.
Hopefully, these won't be needed in order to convince senior management of the
need to carefully manage service quality in the same way as they would monitor
and manage other valuable business assets.
business and market effectiveness. However, when the IT department takes on this challenge, it will
be even more important to ensure consistent, high-quality service delivery.
The dialog with the lines of business is important when defining services and
negotiating Service Level Agreements. Reporting on service quality when problems occur, as well as when excellent service is delivered, is another important
aspect of building trust and credibility. It will also be important to jointly conduct,
with the lines of business, regular satisfaction surveys and reviews of the Service
Level Agreements. This helps the IT department to stay in touch with changing
business requirements and user perceptions of service quality.
All the agreements used by the IT department should follow a similar format,
but the actual terms and conditions can vary from one agreement to another.
The detail required within each agreement could also vary, particularly when the
business importance and time criticality of the services vary. The Service Level
Agreements should include the conditions under which the agreement should
be re-negotiated to remove any contention in the future.
It might also be beneficial to align some component of compensation for IT personnel with service quality and meeting Service Level Agreements. In some corporations, incentives have also changed to encourage more proactive automated approaches to ensuring that service level objectives are met, rather than providing incentives for reactive fire-fighting problem correction practices.
A new breed of solution has evolved over the last few years. Typically referred to as application management, these solutions seek to manage from an application perspective and drill into the underlying technology layers where necessary to resolve problems. This provides a much better alignment between the IT department and the lines of business and is more attuned to supporting a service level management initiative. These solutions are being augmented to capture the end-user experience, which provides the basis for understanding and improving service quality.
Several vendors are also providing more sophisticated service reporting capabilities
based on the ability to either capture directly or derive the end-to-end availability
and responsiveness of critical application services.
The future direction of management solutions will include advances in the following two important areas:
other attendees is one mechanism for achieving this. Later in this chapter, we suggest another
mechanism using a Web site.
problem causing the service degradation. Reducing the time required to diagnose
problems will also allow the IT staff to spend more time on automating recovery
actions, proactively planning future requirements, and working with the lines of
businesses on supporting strategic business initiatives.
Reading relevant articles in trade publications as well as research reports from
industry analyst firms is one way of keeping abreast of technology and solutions
advances.
Caution
As with any new initiative within the industry, vendor hype around business process management
will confuse the marketplace. When evaluating these claims, it is a good idea to go back to the
basics. A business process spanning multiple applications can't be effectively managed unless the
solution has visibility into, and can manage, the individual applications. Similarly, a single application can't be effectively managed unless the solution has visibility into, and can manage, all the supporting infrastructure layers.
Appendixes
Appendix A Internal Service Level Agreement Template
Appendix B Simple Internal Service Level Agreement Template
We would like to participate in that evolution and extend an invitation to you, the
reader, to also be involved in the progress of service level management. To this end,
we have set up a Web site at www.nextslm.org. On this site, you will find some of
the templates provided in the appendix to this book, as well as other material we
felt would be beneficial to share. There are chat capabilities as well as instructions
on how to post material to the site.
We hope the Web site will promote sharing of best practices and a continuing
dialog between like-minded professionals seeking to advance service level management. We thank you for your interest in this book, and in advance, for sharing in
the forthcoming dialog.
Appendix
Statement of Intent
This service level agreement (SLA) documents the characteristics of an IS service
that is required by a business function as they are mutually understood and agreed
to by representatives of the owner groups. The purpose of the SLA is to ensure
that the proper elements and commitment are in place to provide optimal data
processing services for the business function. The owner groups use this SLA to
facilitate their planning process. This agreement is not meant to override current
procedures, but to complement them. Service levels specified within this agreement are communicated on a monthly basis to the owner group representatives.
Approvals
Table A.1 shows which business groups and IS groups share ownership of the service, and which of their representatives have reviewed and approved this SLA.

Table A.1

                       Business Function              IS Service
Organizational Group   Business unit                  Computing Services
Representative         Business unit representative   Service manager, Team leader

Review Dates
Last Review: Date of last SLA review
Next Review: Scheduled date for next SLA review

Description
The service management group provides the following service:
Ensures that the specify name application is available for users to log on and to specify business purpose of the service
Responds to and resolves user questions about, problems with, and requests for enhancements to the application
User Environment
The business function is conducted in the following data processing environment
as shown in Table A.2.
Table A.2 Service User Community Characteristics
Number of Users
Geographic Location
Computer Platform
and so on
This SLA uses the following conventions to refer to times and percents:
Times expressed in the format "hours:minutes" reflect a 24-hour clock in the
central standard time zone.
Times expressed as a number of "business hours" include the hours from 8:30 to 17:30.
This section provides information about the normal schedule of times when the service is available. It also describes the process for enhancing or changing the service.
Table A.5 Service Level Targets

Service Availability: insert target percentage
Performance Target: insert targets, normally specified as X% of transactions of type Y to be completed within Z seconds
Problem Response Time:
1-High Priority: insert target time
2-Medium Priority: insert target time
3-Low Priority: insert target time
Problem Circumvention or Resolution Time:
1-High Priority: insert target time
2-Medium Priority: insert target time
3-Low Priority: insert target time

Change Process
Nonemergency Enhancements
All changes that take more than four hours to implement or that impact user workflow are reviewed by the service name Advisory Board for approval and prioritization.
Enhancements and changes that do not require a service outage and that do not impact user workflow are implemented upon completion.
Enhancements and changes that require a service outage are scheduled on Saturday mornings. Users are notified at least two business days in advance when a nonemergency service outage is required to implement an enhancement or change.
The Help Desk prioritizes requests for support according to the following
priority-level guidelines:
1-High Priority
Service name is not operational for multiple users.
A major function of service name is not operational for
multiple users.
2-Medium Priority
Service name is not operational for a single user.
A major function of service name is not operational for a
single user.
Appendix
Simple Internal Service Level Agreement Template

The insert service name is used by insert description of user community to insert description of the service capability. The IT department guarantees that:
1. The service name will be available insert percentage of the time from insert
normal hours of operation including hours and days of the week. Any individual
outage in excess of insert time period or sum of outages exceeding insert time
period per month will constitute a violation.
2. Insert percentage of service name transactions will exhibit insert value seconds
or less response time, defined as the interval from the time the user sends a
transaction to the time a visual confirmation of transaction completion is
received. Missing the metric for business transactions measured over any
business week will constitute a violation.
3. The IT department will respond to service incidents that affect multiple
users within insert time period, resolve the problem within insert time period,
and update status every insert time period. Missing any of these metrics on
an incident will constitute a violation.
4. The IT department will respond to service incidents that affect individual
users within insert time period, resolve the problem within
insert time period,
and update status every insert time period.
Missing any of these metrics on an
incident will constitute a violation.
5. The IT department will respond to noncritical inquiries within
insert time
period, deliver an answer within insert time period,
and update status within
insert time period. Missing any of these metrics on an incident will constitute
a violation.
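As an illustration of how the measured data might be checked against clauses 1 and 2 of this template, the sketch below evaluates one month of sample figures; the thresholds stand in for the "insert ..." values and, like the sample data, are purely hypothetical.

```python
# Sketch: evaluating clauses 1 and 2 of the simple SLA template against one
# month of collected data. All thresholds and sample figures are placeholders.
AVAILABILITY_TARGET_PCT = 99.5     # clause 1: percentage of scheduled time
MAX_SINGLE_OUTAGE_MIN   = 30       # clause 1: any individual outage
MAX_TOTAL_OUTAGE_MIN    = 120      # clause 1: sum of outages per month
RESPONSE_TARGET_SECONDS = 3.0      # clause 2: response-time threshold
RESPONSE_TARGET_PCT     = 95.0     # clause 2: percentage of transactions

scheduled_minutes = 22 * 10 * 60                 # e.g. 22 business days x 10 h
outages_min = [12, 45, 8]                        # measured outage durations
response_times_s = [1.2, 2.8, 0.9, 3.4, 1.1, 2.2, 2.9, 4.1, 1.8, 2.0]

availability = 100 * (scheduled_minutes - sum(outages_min)) / scheduled_minutes
within_target = 100 * sum(t <= RESPONSE_TARGET_SECONDS
                          for t in response_times_s) / len(response_times_s)

violations = []
if (availability < AVAILABILITY_TARGET_PCT
        or max(outages_min) > MAX_SINGLE_OUTAGE_MIN
        or sum(outages_min) > MAX_TOTAL_OUTAGE_MIN):
    violations.append("clause 1 (availability/outage)")
if within_target < RESPONSE_TARGET_PCT:
    violations.append("clause 2 (response time)")

print(f"availability {availability:.2f}%, "
      f"{within_target:.0f}% of transactions within target")
print("violations:", violations or "none")
```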
Appendix
Sample Customer Satisfaction Survey

Thank you for taking the time to provide feedback regarding the services provided by the IT department. There are three areas that you may evaluate:
Customer Service Orientation
Results Orientation
Expertise of Staff
There is also an area for general comments and future IT requirements.
If you enter a Fair or Poor rating, we ask that you provide additional comments.
Table C.1 shows the qualities and skills descriptions that should be used when making evaluations: Customer Service Orientation, Results Orientation, and Expertise of Staff.

General Comments
What things do you feel the IT department does well and what things could we do better? What works and what does not? Please be specific.
Current Usage
This section helps the IT department gain a better understanding of the service
usage and support patterns of our customers. Please answer the following question.
How would you describe your reliance on information technology to perform your job?
Extremely Heavy
Heavy
Moderate
Light
Very Light

Please complete Table C.2 by rating each of the services against the three attributes.

Table C.2 Service Ratings
(Rate each service for Customer Service Orientation, Results Orientation, and Expertise of Staff.)
BUSINESS APPLICATIONS
Financial Application
H/R Application
Web Access
DESKTOP SUPPORT
PC Hardware/Software
UNIX, X-terms
NETWORK SUPPORT
Local Network
Remote Network
Phones/Voice mail
TECHNICAL SUPPORT
Mainframe
UNIX Servers
NT Servers

Please indicate in Table C.3 the most frequent contact you have with the IT department in each of the designated areas.

Table C.3
(For each contact type, indicate the frequency: Daily, Weekly, Monthly, Quarterly, or Annually.)
Contact Type
Reporting a service problem
Requesting a new application project
Requesting an application enhancement
Requesting new network access
Requesting service access
Future Requirements
In your opinion, what specific areas should the IT department focus on during the
next year? Please be specific.
Optional Information
Please provide the following information so that we can follow up with you:
Name:
Department:
Location:

Appendix
Sample Reporting Schedule
Daily Report
The daily report is a tactical report showing sufficient detail to allow the IT
department and IT management to have a good understanding of the service
quality of the previous day. These reports are typically kept online for two weeks.
The contents include
Outage report by application by location
Response time report by application by location summarized at 15-minute
intervals for the prime shift, and at 30-minute intervals for the off-shift
Problem reports by priority, including a brief description of the problem for
critical and severe problems
Average problem response time by priority
. Problems closed and outstanding by priority
Security violations and attempted intrusions
Weekly Report
The weekly reports are used by both the IT department and the lines of business
to review the service quality delivered by the IT department. These reports are
kept online for eight weeks. The contents include
. Workload volumes by application summarized by shift by day
. Outage summary by application by shift by day
Recovery analysis for all outages of significant duration
Cumulative outage duration for the month by application
Response time percentiles by application
Monthly Report
The monthly report is a management report that focuses on how well the IT
department is servicing the lines of business. The monthly reports are kept online
for six months. The contents include
Report card summary
. Workload volumes by application
Service level achievement summary by application service
. Highlighted problem areas and analysis
Quarterly Report
The quarterly report is a business report focused on identifying trends in service
quality as well as overall satisfaction. It also provides information on future initiatives.
The quarterly reports are kept online for four to six quarters. The contents include
. Workload trend report by application and user community
. Customer satisfaction survey results
. Service level achievement trends
Cost allocation summary
. New IT initiatives
Summary of Value
The nature of the service provider business is providing application availability.
Inherent in this is a guarantee of a certain level of availability and performance.
The Service Level Agreement (SLA) evidences this with our customers. We provide for a penalty when this level of availability is not met. The major service level
management value comes from providing the exact methodology and tools needed
to manage at the required level.
The ROI value areas are
Avoid paying a financial penalty by meeting service level objectives for a
customer
Slower growth (hiring) of the support and operations staff
Reduce the number of help desk calls
Eventual possible elimination of the help desk operations
Software licensing savings
The benefit value areas are
Avoiding lost customer credibility due to excessive downtime or poor response time
Time savings of the operations staff
Time savings of the shared group
Greater credibility of the SLA numbers
Sales competitive advantage
Reduced time and manual effort for the billing staff
Benefit Areas
Benefits include soft dollar areas such as people productivity, customer confidence,
and brand perception that are harder to quantify and use as a justification for service level management, but are nonetheless important.
Proactive service management manages both availability and performance to customer needs. These are two of the items that can significantly contribute to loss of
credibility.
Proactive service management provides automation, emailing, and paging capabilities, which free up time for the staff to perform nonroutine tasks and projects.
Figure E.1 [Sample cost and benefit worksheet for Years 0 through 3, showing annual costs, cumulative benefits, and the resulting return on the service level management investment.]
Summary
The implementation of proactive service level management at this sample service provider shows an excellent rate of return on the level of investment required for the implementation.

Appendix
Selected Vendors of Service Level Management Products

As this book went to print, the ever-expanding market for service level management included over 800 vendors, each with a claim to provide at least one SLM solution. In reality, many products cover just one aspect of SLM, such as event monitoring or historical reporting. But this limitation does not stop vendors from selling their wares as comprehensive SLM solutions. Given these claims, it is difficult, if not impossible, to assemble a complete list of SLM products that does full justice to the market and the prospective buyer. The information that follows is intended as a sampling of representative offerings that readers can use to start the evaluation process.
ServicePoint Series
The ServicePoint Service Delivery Unit (SDU) is a WAN access device that combines termination, monitoring, and control. It maps specific types of services, such
IQ Series
The Adtran IQ series of intelligent performance monitoring devices provides
detailed statistics on the overall health and performance of frame relay networks
at rates from 56Kbps to 2.048Mbps. It is specifically targeted at Service Level
Agreement verification for frame relay subscribers. In-depth diagnostics for circuit
management and troubleshooting also are furnished. The IQ family features IQ
View, an SNMP management program that runs under Windows NT. This software manages IQ devices while providing a database and trend analysis of the
frame relay statistics gathered.
Adtran Incorporated
EnView
EnView software monitors service levels for distributed applications and alerts IT managers when objectives are not being met, so problems can be resolved proactively. The software runs under Windows NT and is designed to continuously provide response-time information from the end-user perspective. The software identifies trends in response time over long periods of time to spot potential problems before they interfere with business and to identify growth patterns for capacity planning. To facilitate this, detailed service-level data generated by EnView can be stored in a central reporting repository. Thus, availability and end-user response time might be tracked historically by application and location, enabling service-level trending for IS management and the end-user community.
Amdahl Corporation
+1-603-337-7000
+1-212-285-1500
800-822-9773
http://www.aprisma.com
http://www.avesta.com
Attention!
PILOT
PILOT is a performance tuning and capacity planning tool for mainframes. It features reporting, tracking, forecasting, and modeling. PILOT tracks response times,
identifies peak periods, builds simulations of current and future systems for capacity planning and justification, and produces reports that facilitate timely problem
diagnosis and resolution. Versions of PILOT are offered for MVS, CICS, and SMF
environments.
http://www.attentionsoftware.com
Hauppauge, NY 11788
+1-631-979-0100
800-877-0990
http://www.axios.com
Bridgeway Corporation
P.O. Box 229
Redmond, WA 98073-0229
+1-425-881-4270
http://www.bridgeway.com
OpenMaster
BullSoft, the worldwide software division of Groupe Bull SA (Paris), offers
OpenMaster to manage multi-vendor IT networks, systems, and applications.
OpenMaster, based on UNIX, incorporates an object-based repository and management services to allow IT staff to easily deploy software, manage assets and configurations, manage availability and performance of IT, and secure IT components.
OpenMaster also furnishes service-level reporting on all IT elements across geographical, functional, or business process boundaries. Reports are provided on
network devices, desktops, servers, and applications. Information is delivered on
configuration, significant events, and security parameters. A range of report formats
are offered for a variety of media, including the Web via graphical Java interfaces. In
addition, multi-dimensional analysis tools are available for more complex tasks, such
as return on investment evaluations or analyses of the overall performance of critical
components over long periods of time.
BullSoft
300 Concord Road
Billerica, MA 01821
+1-978-294-6000
800-285-5727
http://www.bullsoft.com
ETEWatch is software that runs under Windows NT and measures end-to-end application performance. Versions are offered to support Citrix MetaFrame, Lotus Notes, SAP R/3, PeopleSoft, and custom applications.
Candle's Response Time Network (RTN) is a service that monitors applications
from the end user's point of view. RTN is based on Candle's ETEWatch and lets
users see how applications are performing for any site, time, user, server, or time
period right at the desktop. An advanced online application process engine structures the data into information that can be customized. Candle's Performance
Monitoring Network (PMN) automates the transformation of performance data
into intelligent business analysis. The service provides daily, weekly, monthly, or
quarterly information on service levels, capacity, and application monitoring.
Candle Corporation
under Windows NT, UNIX, and Novell NetWare. EcoTOOLS uses a single,
consistent Windows NT interface to furnish at-a-glance scorecard reports for
management and the general user population in addition to the in-depth operational reports required for the daily management of applications and servers.
Customizable reports also are available.
Compuware Corporation
TREND
DeskTalk's TREND product automates the collection and analysis of performance data and delivers business-critical reports out of the box. TREND collects
performance data from industry-standard sources such as SNMP MIBs, as well as
from application monitoring partners such as FirstSense Software and Ganymede
Software. Utilizing these heterogeneous data sources, TREND reports deliver a
cohesive view of network, system, and application performance, providing IT
organizations with an end-to-end service level picture of the entire business
process. TREND is built on a distributed architecture with a Web interface for
report creation and viewing. TREND users can add new data sources, update
polling policies, fine-tune threshold definitions, and create customized performance
reports. A predictive analysis feature warns network managers in advance of
impending slowdowns so they can prevent problems and quickly identify the
root cause of any delay. TREND operates on and between AIX, HP-UX, Solaris,
Windows 95, and Windows NT platforms.
DeskTalk Systems Incorporated
+1-508-460-4646
http://www.concord.com
CrossKeys Resolve
CrossKeys Resolve is a software suite designed to help service providers define, set,
and measure service level goals. The product also includes performance reporting software. The Solaris-based package enables service providers to deliver sets of network and service performance reports to their customers and internal users via the
with statistical trending information on the utilization, link status, and interface
detail as well as other parameters vital to the frame relay network. WANwatcher
collects the network statistics in real time or on an hourly, daily, or weekly basis or
at preset intervals. It can handle data on up to 1,280 channels in the network.
Statistics can be viewed in a range of report formats.
Eastern Research Incorporated
knowledge-based rules against the baseline to identify abnormal behavior that can lead to performance problems. Engineers can drill down and learn more about the abnormal behavior and perform what-if analyses to see how changes in loading can improve system performance. Basis engineers can also enhance Envive's health check by adding additional knowledge rules using their own SAP knowledge. SLS runs on a separate architecture from the database system itself, enabling it to perform analyses even when R/3 is down.
Envive Corporation
http://www.erinc.com
Empirical Suite
Empirical's flagship product, the Empirical Suite, covers the planning, measurement, and prediction functions associated with improving enterprise service levels.
The suite is comprised of three products: Empirical Planner, Empirical Director,
and Empirical Controller. The applications are sold either individually or as a bundled solution. Empirical Planner helps IT managers set baselines, define corporate
service levels, and implement requirements. Empirical Director runs under
Windows NT, UNIX, or VMS and tracks actual application service, sending alerts
when performance falls below an optimum level, and diagnosing the source of a
problem. IT managers can also use the application to perform trend analysis for
capacity planning and long-term troubleshooting purposes. Empirical Controller
performs corrective actions to fix service level issues. The application promises to help administrators automate the tuning of application SQL and the database's physical structure.
Empirical Software Incorporated
888-236-8483
http://www.envive.com
FirstSense Enterprise
FirstSense, which was acquired by Concord Communications on January 2, 2000,
offers FirstSense Enterprise, software that continuously monitors application
performance and availability from the end-user perspective. FirstSense says this
approach provides IT organizations the information necessary to measure true
application quality of service. FirstSense Enterprise uses patented lightweight
intelligent autonomous agents on end-user client systems to continuously monitor
and collect information on business transactions that affect the end user. The agents
track end-to-end response times (in real-time) comparing actual availability and
performance against service-level thresholds. When a transaction exceeds defined
service-level thresholds, FirstSense Enterprise captures diagnostic information at
the moment the exception occurs and at every tier involved with that specific
application transaction. FirstSense sends notification of an alarm, and compares values at exception time to normally observed behavior. These "normalcy profiles"
provide a baseline of application behavior so that IT can determine what is typical
for a particular environment. The baseline data and exception diagnostics provide
IT with the context for resolving problems, whether on the client, network, or
server.
FirstSense Software Incorporated
21 B Street
Burlington, MA 01803
+1-781-685-1000
http://www.firstsense.com
Pegasus
Ganymede Software's Pegasus monitoring solution is designed to minimize the
time and effort required to detect, diagnose, and trend network performance problems. The Pegasus Application Monitor component gives a user's view of application performance. It passively monitors the performance of end-user transactions,
so IT professionals can identify, prioritize, isolate, and diagnose application performance problems. It tells staffers if an application on a particular desktop is being
constrained by the client, the network, or the server so that they can deal with
these problems before end users are aware of them. If the network is causing an
application to slow down, Pegasus identifies which network segment is causing the
performance degradation by using active application flows of known transactions
to determine where performance is being constrained. In addition, key system
statistics can be monitored to see how they are affecting application performance.
This information can be used to establish trends, set SLAs, and monitor
conformance to agreed-on criteria.
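Attributing a slow transaction to the client, the network, or the server ultimately comes down to comparing each tier's share of the end-to-end time. The minimal sketch below uses invented timings rather than Pegasus measurements.

# Sketch: attribute a transaction's end-to-end time to client, network, and server,
# and name the dominant contributor. The timings below are illustrative.

def dominant_tier(client_s, network_s, server_s):
    parts = {"client": client_s, "network": network_s, "server": server_s}
    total = sum(parts.values())
    tier = max(parts, key=parts.get)
    share = 100.0 * parts[tier] / total
    return f"{tier} accounts for {share:.0f}% of a {total:.1f}s transaction"

print(dominant_tier(client_s=0.3, network_s=2.4, server_s=0.5))
# network accounts for 75% of a 3.2s transaction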
Ganymede Software Incorporated
Morrisville, NC 27560-9119
919-469-0997
http://www.ganymede.com
Gecko Software Limited
http://www.geckoware.com
Hewlett-Packard Company
+1-650-857-1501
http://www.hp.com
Continuity
Continuity software is designed to help IT organizations manage service requirements in complex distributed environments. It gathers baseline information on normal network performance and then tracks information on availability, performance,
response time, throughput, service levels, and operational risks in terms that both
IT operations managers and business managers can understand. Continuity provides
real-time, correlated diagnostics to maximize availability and performance by helping managers to correct and prevent service disruption quickly. By monitoring
business transactions as users experience them, the product aims to address problems
before users are aware they exist.
Intelligent Communication Software GmbH
Munich, Germany
+49-89-748598-35
http://www.ics.de
environment. SLMs notify the mid-level manager when response times exceed a
defined level. The mid-level manager can, in turn, forward that data to upstream
management stations. SLMs can also be used for local reporting.
Jyra Research Incorporated
NETClarity Suite
The NETClarity Suite of network performance management and diagnostic tools
allows the network manager to monitor, measure, test, and diagnose performance
across the entire network. The suite's six network performance tools are Network
Checker+, Remote Analyzer Probe, Load Balancer, Service Level Manager,
Capacity Planner, and NETClarity Complete. All the tools are based on technology and methodologies taken from LANquest's independent LAN/WAN testing
services.
LANQuest
Netcool
Micromuse's Netcool suite is designed to help telecommunications and Internet
service providers ensure the uptime of network-based customer services and applications. The Netcool ObjectServer is the central component in the suite. The
ObjectServer is an in-memory database optimized for collecting events, associating
events with business services, and creating real-time reports that show the availability of services. The ObjectServer performs all formatting and filtering of this data,
allowing operators to create customized EventLists and views of business services.
The suite also contains ObjectiveView, an object-based topographical front-end
toolset that allows operators to build clickable maps, icons, and other graphical
interfaces to ObjectServer data and EventLists. ObjectiveViews are used by
managers in the network operations center because they supply a concise, global
summary of event severities and service availability throughout the entire network.
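The service-availability rollup such a view presents amounts to grouping events by business service and keeping the worst severity seen. The small sketch below uses an invented event schema, not the ObjectServer's.

# Sketch: roll raw events up to a per-service worst-severity summary,
# the kind of view an operations EventList presents. The schema is invented.

events = [
    {"node": "gw-01",  "service": "Online Banking", "severity": 5},  # critical
    {"node": "web-07", "service": "Online Banking", "severity": 2},
    {"node": "dns-02", "service": "Web Hosting",    "severity": 0},  # clear
]

def service_summary(events):
    summary = {}
    for event in events:
        svc = event["service"]
        summary[svc] = max(summary.get(svc, 0), event["severity"])
    return summary

print(service_summary(events))
# {'Online Banking': 5, 'Web Hosting': 0}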
Micromuse Incorporated
139 Townsend Street
San Francisco, CA 94107
+1-415-538-9090
http://www micromuse.com
the DSMs into the network and interprets the information from the agents. NetOps
then offers suggestions to correct network deficiencies.
NetOps Corporation
501 Washington Avenue
2nd Floor
Pleasantville, NY 10570
+1-914-747-7600
http://www.operations.com
NetReality
2350 Mission College Boulevard, Suite 900
Santa Clara, CA 95054
+1-408-988-8100
http://www.nreality.com
NetPredict
NetPredict software monitors the end-to-end performance of specific applications
on user-selected paths through a network. To perform this function, the software
collects key data obtained from SNMP and distributed RMON sources. That
information is then stored in a relational database for long-term trending and
historical review. By comparing this data against measured traffic on the network,
NetPredictor is able to perform accurate predictions of the effects of changes in
the network or the application. With this capability, IT personnel can accurately
gauge their capacity requirements to improve the performance of both their
applications and networks. NetPredict supplies a tool for creating and tracking
Service Level Agreements. IT managers can use it to estimate what their actual
requirements are and then use the technology to measure the application
performance that end users experience on a day-to-day basis.
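Trend-based capacity estimates of this kind can be as simple as fitting a line to historical utilization and projecting when it will cross a planning threshold. The sketch below uses a plain least-squares slope and invented data; NetPredict's own models are more sophisticated.

# Sketch: project when link utilization will cross a planning threshold,
# using a simple least-squares trend over weekly samples (illustrative data).

def weeks_until(history, threshold):
    n = len(history)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(history) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None                      # no growth trend
    return (threshold - history[-1]) / slope

utilization = [41, 44, 46, 49, 53, 55]   # percent, one sample per week
print(f"~{weeks_until(utilization, 80):.0f} weeks to 80% utilization")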
NetPredict Incorporated
1010 El Camino Real, Suite 300
+1-650-853-8301
http://www.netscout.com
http://www.netpredict.com
NetSolve Services
Wise IP/Accelerator
The Wise IP/Accelerator enables carriers and ISPs to offer SLAs for IP-based
virtual private networks (VPNs). The IP SLAs supported by Wise IP/Accelerator
furnish point-to-point bandwidth availability guarantees for virtual private
networks (similar to the committed information rate or CIR of a frame relay
network). By utilizing Wise IP/Accelerator to offer SLAs for IP VPNs, carriers
and ISPs can generate additional subscribers among companies looking for an
inexpensive alternative to dedicated network services.
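A CIR-style bandwidth guarantee is commonly enforced with a token bucket: traffic conforms as long as it stays within the committed rate plus an allowed burst. The following generic Python sketch (invented rates and packet sizes, not the vendor's implementation) shows the mechanism.

# Sketch: a token-bucket policer of the kind used to enforce a committed
# information rate (CIR). Rates and packet sizes are illustrative.

class TokenBucket:
    def __init__(self, rate_bps, burst_bits):
        self.rate = rate_bps          # committed rate in bits per second
        self.capacity = burst_bits    # committed burst size
        self.tokens = burst_bits
        self.last = 0.0

    def allow(self, packet_bits, now):
        """True if the packet conforms to the committed rate at time `now`."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return True
        return False

bucket = TokenBucket(rate_bps=128_000, burst_bits=64_000)
print(bucket.allow(12_000, now=0.0))    # True  - within the committed burst
print(bucket.allow(60_000, now=0.05))   # False - exceeds the credit accumulated so far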
NetSolve Incorporated
12331 Riata Trace Parkway
Austin, TX 78727
+1-512-340-3000
http://www.netsolve.com
Netuitive Incorporated
3460 Preston Ridge Rd., Suite 125
Alpharetta, GA 30005
+1-678-256-6100
http://www.netuitive.com
Bluebird
N*Manage supplies Service Level Agreement monitoring software for systems and
networks. Bluebird, N*Manage's SLA tracking software, collects service and availability data for IP, email, FTP, HTTP, NFS, and other applications. Bluebird uses a
distributed architecture and a Java client to present network health information.
Bluebird issues real-time alerts when network performance exceeds acceptable
thresholds or availability falls below an acceptable level.
N*Manage Company
Raleigh, NC 27606
+1-919-362-8866
http://www.nmanage.com
Opticom Incorporated
One Riverside Drive
Andover, MA 01810
+1-978-946-6200
http://www.opticominc.com
Energizer PME
OptiSystems designs and sells products to manage the performance of SAP R/3
systems. The company also offers management products for R/2 applications.
Energizer PME (Performance Management Environment) for R/3 dynamically
analyzes system usage and reacts to events as they happen in order to improve
system performance.
The data collection engine for the Energizer PME for R/3 products runs as an
R/3 task and captures real-time interval data, as well as summary data, for all system components using SAP's own data collection routines. As a result, Energizer's
overhead is negligible (less than 1%, according to the vendor) and R/3's own data
collection is not needlessly duplicated. In addition, the data collected by the
Energizer data collection engine is used as the basis of the Energizer PME for
R/3 product modules. After one of the modules is installed, any one of the other
modules can make use of the same data.
OptiSystems Incorporated
+1-941-263-3885
http://www.optisystems.com
OpenLane
Paradyne's OpenLane network management application provides diagnostics and
real-time performance support for SNMP-managed narrowband and broadband
networks through its access device product lines. OpenLane collects and reports
performance against the terms of an SLA. Support is provided for Paradyne's
FrameSaver Frame Relay Access Units as well as Paradyne's Hotwire xDSL and
MVL products. OpenLane also supports Paradyne's 31xx, 7xxx, and NextEDGE
9xxx T1 and subrate access products.
Paradyne Corporation
Largo, FL 33773
+1-727-530-2000
http://www.paradyne.com
PacketShaper
Packeteer supplies products to both enterprise customers and service providers
for managing network bandwidth. PacketShaper detects and classifies network traffic,
analyzes traffic behavior, offers policy-based bandwidth allocation for specific
applications, and provides network reports. PacketShaper automatically detects over
150 types of traffic. It can categorize traffic by application, service, protocol, port
number, URL or wildcard (for Web traffic), hostname, precedence bits, and IP or
MAC address.
PacketShaper tracks average and peak traffic levels, calculates the percentage of
bandwidth that's wasted on retransmissions, highlights top users and applications, and
measures performance. PacketShaper's high-level network summaries record network
trends. The product also has the capability to measure response times and then compare
those numbers to what is deemed acceptable response time performance.
Packeteer
10495 N. De Anza Boulevard
Cupertino, CA 95014
+1-408-873-4400
http://www.packeteer.com
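The wasted-bandwidth figure PacketShaper reports is simply retransmitted traffic as a share of total traffic for a class. A short Python sketch with invented byte counts (not Packeteer data) illustrates the calculation.

# Sketch: percentage of bandwidth wasted on retransmissions for a traffic class.
# Byte counts are illustrative; PacketShaper derives them from its own classifier.

def wasted_bandwidth_percent(total_bytes, retransmitted_bytes):
    return 100.0 * retransmitted_bytes / total_bytes

classes = {"Oracle": (1_850_000_000, 37_000_000), "HTTP": (6_200_000_000, 496_000_000)}
for name, (total, retrans) in classes.items():
    print(f"{name}: {wasted_bandwidth_percent(total, retrans):.1f}% retransmitted")
# Oracle: 2.0% retransmitted
# HTTP: 8.0% retransmitted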
Foglight
Foglight software ensures the reliability and performance of electronic commerce
sites, enterprise resource planning (ERP) systems, and information technology
infrastructures.
Foglight monitors business applications for availability and performance, alerting
system managers to actual or potential application problems and allowing them to
identify and correct potential problems before end users are affected. Foglight
keeps critical applications up and running properly, monitors and reports on
application service levels, and supports scaling e-business systems through
accurate capacity planning.
ResponseCenter
ResponseCenter is an active testing solution that provides comprehensive, end-to-end transaction performance and problem diagnosis for e-business and e-commerce
sites. ResponseCenter diagnoses the response time of a complete e-transaction
across networks, servers, databases, middleware objects, and application components,
breaking down the individual components of total end-to-end performance. The
product is designed to help e-businesses get an early warning of potential application brownouts or outages before e-commerce service is interrupted.
Statscout
Statscout is a network performance monitoring package based on SNMP. It runs
under FreeBSD-3.X UNIX, a little-known flavor of UNIX comparable to Linux.
Statscout boasts that its software can monitor thousands of devices and ports simultaneously while requiring minimal disk space. The software measures network
health statistics, including average response time (calculated by measuring ping
response times), utilization, and errors. Statscout also produces SLA summary
reports that include information on SLA non-conformance, as well as detailed
network management statistics.
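The arithmetic behind average response time and SLA conformance reporting is simple. The sketch below uses invented ping samples rather than Statscout output; a real collector would poll devices via ICMP or SNMP.

# Sketch: summarize ping response times against an SLA target.
# Sample data is invented.

samples_ms = {"router-a": [12, 15, 11, 240, 14], "router-b": [35, 33, 40, 38, 36]}
sla_target_ms = 100

for device, samples in samples_ms.items():
    average = sum(samples) / len(samples)
    conformance = 100.0 * sum(s <= sla_target_ms for s in samples) / len(samples)
    print(f"{device}: avg {average:.0f} ms, {conformance:.0f}% of samples within SLA")
# router-a: avg 58 ms, 80% of samples within SLA
# router-b: avg 36 ms, 100% of samples within SLA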
Statscout
One World Trade Center, Suite 7967
New York, NY 10048
+1-212-321-9282
http://www.statscout.com
SOLVE Series
Sterling's SOLVE products monitor network performance and diagnose any problems that could have a negative impact on enterprise service levels. The software
supplies IT managers with utilization information so that they can adequately allocate network resources and control spending. Sterling claims its SOLVE product
line can instantly determine the location of a problem and accelerate resolution.
Sterling offers SOLVE products for a variety of platforms and environments.
Included among those are software solutions for SNA, TCP/IP, CICS, and MVS.
Sterling Software
300 Crescent Court, Suite 1200
Dallas, Texas 75201
+1-214-981-1000
http://www sterlingsoftware .com
maintain SLAs, plan for future growth, and manage network change by compiling
statistics over time and analyzing that data for trend information. Sync also offers
an SNMP-managed CSU/DSU.
Sync Research Incorporated
12 Morgan
Irvine, CA 92719
+1-949-588-2070
http://www.sync.com
Madison, AL 35758
+1-256-772-3770
http://www.verilink.com
+1-512-436-8000
800-926-0085
http://www.tivoli.com
Visual UpTime comes with a series of SLA monitoring and reporting tools that
track performance of frame relay and ATM network services on a daily, monthly,
or multimonth basis. The Visual UpTime Burst Advisor continuously measures one-second usage over each port and PVC. From this information, the system automatically makes recommendations on correct bandwidth allocations. A series of
executive reports puts this data into a format suitable for presentation to CEOs
and top-level executives.
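One common way to turn per-second usage samples into a bandwidth recommendation is to size the committed rate to a high percentile of observed demand. Whether Burst Advisor works exactly this way is not documented here, so the Python sketch below is only illustrative, with invented samples.

# Sketch: recommend a PVC committed rate from one-second usage samples by sizing
# to roughly the 95th percentile of observed demand. Data and method are illustrative.

def recommended_cir(samples_bps, percentile=0.95):
    ordered = sorted(samples_bps)
    index = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[index]

one_second_usage = [22_000, 31_000, 18_000, 96_000, 41_000, 38_000,
                    27_000, 55_000, 33_000, 29_000]
print(f"recommended CIR: {recommended_cir(one_second_usage)} bps")
# recommended CIR: 96000 bps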
Visual IP InSight leverages technology that the company picked up as part of its
acquisition of Inverse Network Technology to give service providers and enterprises the tools required to manage IP connectivity and applications such as dedicated and remote access and Web sites from the perspective of the end user.
Visual IP InSight comprises three application suites that let IP services managers
provide and track service level agreements, offer new levels of end-user customer
care, and monitor end-to-end network performance. The service level management suite includes Service Level Performance Reports: a series of programs that
gather actual end-user performance information via the network operator's
deployment of the Visual IP InSight client. Single Visual IP InSight installations can
take feeds from as few as 500 clients, scaling to the millions, the vendor says.
Visual IP InSight service level management reports can be used with other suite
applications in order to manage a user's end-to-end experience. The reports can be
used, for instance, with Visual IP InSight Dial Care to furnish information about
access functionality at end-user desktops, or with Visual IP InSight Dial Operations
to proactively manage network access, be it in-house or outsourced. The suite also
can be used to track and manage the performance of application services, such as
virtual private networks, Web, email, and news.
Visual Networks Incorporated
2092 Gaither Road
Rockville, MD 20850
+1-301-296-2300
http://www.visualnetworks.com
Glossary
Like other specialized areas of information technology, service level
management (SLM) has acquired a
language of its own. For the most
part, the terms used in SLM are derived from the fields of networking,
general IT, enterprise management,
and software development. Here is
an alphabetical list of key terms you'll
encounter in most SLM activities and
interactions:
CPU utilization: The amount of
time an application requires to process
information in a computer's central
processing unit (CPU). CPU usage
governs the response time a computer
can deliver.
Critical deadlines: The specified
Lines of business: Those parts of an
Privilege class: A group of staffers or
Secondary data collector: A management tool that does not communicate directly with the managed
environment (although some secondary data collectors are able to do
so, if necessary). Secondary data collectors extract data from other products that are primary data collectors.
Security: The actions involved in
Index
customers, to reports, 42
users
defining, 29
group privileged access definition, 30
calculating
cost avoidance improvements, 136
downtime cost, 134
employee cost, 134
lost business cost, 135
productivity cost, 135
Service Level Agreement penalties,
135
Candle Corporation, 217
CCTA (Central Computing and
Telecommunications Agency), 78
charters (Service Level Agreement
negotiating teams), 59
CIM (Common Information
Model), 81
Cisco Systems Incorporated
CiscoWorks 2000, 218
Web site, 219
client-by-client SLM implementation, 139
clients. See also customers; users
agents, measuring end-to-end
response times, 161-162
client/server interactions, 27
SLM implementation
benefits, 106
drawbacks, 106-107
hardware, 106
RMON, 106
SNMP, 106
standards, 107
Web-based alternatives, 107
application management, 185
Appvisor Application Management,
213
ARM, 83-84
history, 114
procedure calls, 83
Software Developer Kit, 84
Attention!, 214
availability, 19
benefits, 183
Bluebird, 232
BMC Software Incorporated, 216
Candle Corporation, 217
CiscoWorks 2000, 218
Continuity, 225
CrossKeys Resolve, 220
CSU/DSUs, 108
Custom Network Analysis, 230
customizing, 185
Do It Yourself, 229
EcoSCOPE, 219
EcoTOOLS, 220
Empirical Suite, 222
Energizer PME, 234
EnView, 213
evolving market, 184-185
eWatcher, 215
Executive Information System, 233
FirstSense Enterprise, 223
Foglight, 235
Frame Relay Access Probe, 237
Help Desk, 238
HP OpenView ITSM Service Level
Manager, 225
Information Technology Service
Management (ITSM), 103
InfoVista Corporation, 226
IQ series, 212
Keystone CNM, 216
Keystone VPNview, 216
Luminate Software Corporation,
228
benefits, 55-56
change request process, 192
conventions, 190
creating, 56-58
documentation, 18, 61
effectiveness, 14-15
elements, 97-98
exclusions, 71
external, 56-57, 95-97
formats, 182
Gartner Group, 88
Giga Group, 89
GTE Internetworking, 96
Hurwitz Group, 89
incentives, 183
in-house, 56
internal, 56-58
accuracy, 63
affordability, 65
attainability, 63-64
availability, 62
controllability, 65
criteria, 64
measurability, 65
mutual acceptability, 66
number, 63
performance, 62
relevance, 65
selecting, 62-66
stretch objectives, 62
understandability, 65
service measures, 193-194
Service Value Agreements (META
Group), 89
setting with lines of business, 93-95
Sprint, 96
stakeholder groups, 59
standard setting, 53-55
standards, 172
statement of intent, 189
structure, 97
term, 61
user abuses, 14
user environment, 191
UUNET Technologies, 97
conventions (Service Level
Agreements), 190
cost
customizing reporting information,
120
IT cost pressures, 8
network usage accounting, 116
reports, 45-46
service level objectives, 65
services, 34
allocation complexity, 36
assigning, 35-37, 45
D
daily reports, 47, 201-202
data currency/integrity, 31
databases
monitoring, 112-114
monitoring by footprint, 159
databases domain (networks),
monitoring, 112-114
deadlines, 25
defining
group privileged access, 30
managed services, 180
reporting specifications (Service
Level Agreements), 71-72
service level indicators, 66-67
service levels, 13
user access, 29
dependencies (batch jobs), 28
DeskTalk Systems Incorporated,
221
desktops, monitoring, 112
detecting intrusion, 30
real-time alerts, 50-52
reports, 44
developing (Service Level
Agreements)
elements, 97-98
structure, 97
devices, network, 110-111
differentiated services, creating,
176
distributing reports, 148
DMTF (Distributed Management
Task Force), 184
CIM, 81
SLA Working Group, 81-82
Web site, 78
Do It Yourself (DIY), 229
documentation. See Service Level
Agreements
domains
applications, 112-114
databases, 112-114
network devices and connections,
110-111
planning monitoring strategies,
109-110
E
Eastern Research Incorporated,
221
eBA*ServiceMonitor, 217
eBA*ServiceNetwork, 217
e-business
expansion, 170
SLM challenges, 173
e-commerce, 9
expansion, 170
IT costs, 36
SLM challenges, 173
EcoSCOPE, 219
EcoTOOLS, 220
editing Service Level Agreements,
75
eliminating help desks, 206-207
Empirical Software Incorporated
employees
cost, 134-135
IT personnel
APIs, 162-163
client agents, 161-162
generating synthetic transactions,
163-164
end-to-end SLM, measuring
quality, 22-23, 182
Energizer PME, 234
EnView, 213
Envive Corporation
Service Level Suite, 222
Web site, 223
errors. See outages
establishing client contacts, 141
ETEWatch, 218
event-driven measurement,
166-167
eWatcher, 215
exclusions (Service Level
Agreements), 71
Executive Information System, 233
executive summaries, 42
expectation creep, 55-56
external Service Level Agreements,
56-57, 95-97
F-G
failures. See outages
fault management monitoring, 115
FCAPs management model, 114
configuration monitoring, 115
fault management monitoring, 115
network usage accounting, 116
performance management, 116-117
security management, 117
FirstSense Enterprise, 223
FirstSense Software Incorporated,
223
Foglight, 235
formalizing external Service Level
Agreements, 96
formats
reports, 46-48
Service Level Agreements, 182
Forrester Research, 89
FRAP (Frame Relay Access
Probe), 237
frequency (reports)
daily, 47, 201-202
monthly overviews, 48, 202
quarterly summaries, 49, 202
real-time reporting, 49-52
weekly summaries, 48, 202
functions, monitoring, 114
configuration, 115
fault management, 115
network usage accounting, 116
performance management, 116-117
security management, 117
Ganymede Software Incorporated,
224
Gartner Group (Service Level
Agreements), 88
Gecko Software Limited
SAMAN, 224
Web site, 225
generating synthetic transactions
benefits, 163-164
drawbacks, 163
ping command, 164
traceroute command, 164
Giga Group (Service Level
Agreements), 89
group privileged access definition,
30
GTE Internetworking (Service
Level Agreements), 96
H-I
hardware agents, 106
help desks, 238
eliminating, 206-207
reducing calls, 205-206
Hewlett-Packard
ARM, 83
HP OpenView ITSM Service Level
Manager, 225
Information Technology Service
Management (ITSM), 103
OpenView Network Node
Manager (NNM), 103
Web site, 225
historical data, 119
HP OpenView ITSM Service Level
Manager, 225
Hurwitz Group (Service Level
Agreements), 89
IETF (Internet Engineering Task
Force), 184
Application Management MIB,
82-83
Web site, 78
impact (services)
balancing workloads, 130
customer loyalty, 128-129
IT personnel productivity, 130-131
planning upgrades, 129
revenue, 127-128
user productivity, 129
implementing
Internet services, 174
SLM
baseline sampling duration, 143-144
client presentation tips, 141-142
client-by-client, 139
determining baselines, 143-144
establishing client contacts, 141
establishing reporting procedures,
147-148
implementation follow-up procedures,
149-150
IT personnel priority, 139-140
negotiating complaints, 150-151
ongoing open communication, 150
ongoing service management team
meetings, 150
planning, 137-138
improving services
research, 184
strategies, 183-184
in-house Service Level
Agreements, 56
incentives (IT personnel), 183
information sources, 118
information technology. See IT
Information Technology Service
Management (ITSM), 103
InfoVista Corporation
Vistaviews, 103
Web site, 226
Intelligent Communication
Software GmbH, 225
inter-server response time, 155-156
interactive responsiveness, 24
intercepting socket traffic (networks), 161
internal reports, 41-42
internal Service Level Agreements,
56-58
internal SLA template, 195-196
Internet
ASPs, 170
defined, 171
service challenges, 174
e-commerce, 9
expansion, 170
IT costs, 36
SLM challenges, 173
ISPs, 170
services
benefits, 170
emerging service challenges, 173-174
growing demand, 170
implementing, 174
outsourcing, 171
Service Level Agreements, 172
J-L
jobs, batch
accuracy issues, 32
concurrency, 27-28
dependencies, 28
measuring service quality, 25
justifying service costs, 125-126
cost justification worksheets,
131-133
Jyra Research Incorporated
Service Management Architecture,
226
Web site, 227
Keystone CNM, 216
Keystone VPNview, 216
Landmark Systems Corporation
PerformanceWorks software, 227
Web site, 228
LANQuest, 227
lawyers (Service Level
Agreements), 57
levels, service. See Service Level
Agreements
M
maintenance
accuracy issues, 32
availability issues, 32
planned downtime, 50
management
agents
characteristics, 164-165
drawbacks, 164
limiting, 165
manager-agent model, 104-105
optimizing, 165-166
managing
Service Level Agreements, 182-183
services
application management, 185
ASPs, 174
balancing workloads, 130
benefits, 15-18
business process management, 186
commercial products, 19, 183-185
communication, 181
cost control, 17-18
current practices, 90-91
current quality perception, 91
current understanding, 87-88
customizing solutions, 185
defining managed services, 180
documentation benefits, 18
e-business, 173
e-commerce, 173
emerging research, 88-90
external suppliers, 95-97
focus areas, 90
future goals, 179-180
improvement strategies, 183-184
IT profile improvement, 17
resource regulation authority, 17
service level reporting, 98
setting initial goals, 91-93
tools, 99
user expectation management, 16
user satisfaction gains, 16
MCI WorldCom (Service Level
Agreements), 96
measuring. See also monitoring;
monitoring tools
cost, 34-36
end-to-end response times, 153-154
APIs, 162-163
client agents, 161-162
generating synthetic transactions,
163-164
selecting measurement method,
155-156, 161
event-driven measurement, 166-167
inter-server response time, 155-156
drawbacks, 161
intercepting socket traffic, 161
wire sniffing, 160
quality, 94
customer surveys, 42, 197-200
meeting
deadlines, 25
service level objectives, 204-205
meetings, negotiation teams
(Service Level Agreements), 60
META Group
Service Value Agreements, 89
SLM focus areas, 90
metrics, measuring service levels,
153-155
MIBs (Management Information
Bases), 159
Micromuse Incorporated, 229
modules (ITIL), 78-80
monitoring. See also measuring;
monitoring tools
functions, 114
configuration, 115
fault management, 115
network usage accounting, 116
performance management, 116-117
security management, 117
network domains
applications, 112-114
databases, 112-114
network devices and connections,
110-111
planning strategies, 109-110
servers and desktops, 112
transactions, 112-114
hardware, 106
RMON, 106
SNMP, 106
standards, 107
Web-based alternatives, 107
ARM, 114
CSU/DSUs, 108
manager-agent model, 104-105
managers, 104-105
packet monitors, 107-108
performance management, 116-117
primary data collectors, 102
probes, 107-108
secondary data collectors, 103
simulations, 108-109
monthly overviews (reports), 48,
202
N
N*Manage Company
Bluebird, 232
Web site, 233
NaviSite (Service Level
Agreements), 96
negotiating
Service Level Agreements, 144-145,
181
documentation, 61
goals, 60
negotiation teams, 58-59
nonperformance consequences, 68-71
preparation, 60
scheduling negotiation meetings, 60
stakeholder groups, 59
O
objectives, service level
accuracy, 63
affordability, 65
attainability, 63-64
availability, 62
criteria, 64
measurability, 65
mutual acceptability, 66
number, 63
performance, 62
relevance, 65
selecting, 62-66
stretch objectives, 62
understandability, 65
OpenLane, 235
OpenMaster, 217
OpenView Network Node
Manager (NNM), 103
operating systems. See OSs
Opticom Incorporated
Executive Information System, 233
Web site, 234
optimizing management agents,
165-166
OptiSystems Incorporated, 234
Optivity SLM, 233
organizing IT personnel, 182
OSs (operating systems)
UNIX
accounting utility, 158
monitoring by footprint, 157-158
ps utility, 158
sar utility, 158
Windows NT/2000
monitoring by footprint, 158-159
Perfmon utility, 159
Process Explode utility, 159
Quick Slice utility, 159
Taskmanager utility, 159
outages. See also recovery
P
packet monitors, 107-108
Packeteer
PacketShaper, 234
Web site, 235
packets (network), 160
PacketShaper, 234
Paradyne Corporation, 235
Pegasus, 224
penalties (Service Level
Agreements), 68-71, 135
percent conventions (Service Level
Agreements), 190
Perfmon utility (Windows
NT/2000), 159
performance
baseline, 143-144
frames of reference, creating, 176
service level objectives, 62
services, 24-25, 154
batch job processing, 25
deadlines, 25
importance, 24
interactive responsiveness, 24
performance alerts, 50
performance reporting, 43-44
setting initial goals, 92
user perception, 25
Q
qualifying Service Level
Agreements, 61-62
quality
customer surveys, 42, 197
future requirements section, 200
general comment areas, 199
tools, 117
customizing reporting information,
120
historical data, 119
information sources, 118
real-time data, 118-119
report presentation issues, 119-120
reports. See also reporting
management
relating services to business performance, 40, 120
reporting service difficulties, 40
outage alerts, 49
performance alerts, 50
planned downtime, 50
security alerts, 50-52
report card format, 46-48
reporting technology advancements,
19
reporting time-saving strategies, 208
security intrusions, 44
selecting report personnel, 73-74
service availability reporting, 43
Service Level Agreement
specifications, 71-72
service level reporting, 98
technical information, 11
technical reporting limitations,
11-12
weekly summaries, 48, 202
workload levels, 44
requests, user, 24
resources, redundancy, 130
Response Networks Incorporated,
236
Response Time Network (RTN),
218
response times
end-to-end, 153
measuring, 154
selecting measurement method,
155-156, 161-164
inter-server, 155-156
ResponseCenter, 236
responsiveness, interactive, 24
Return on Investment. See ROI
revenues, service impact
lines of business input, 128
measuring, 127-128
reviewing Service Level
Agreements, 74, 190
revising Service Level Agreements,
75
S
SAMAN (Service Level Agreement
Manager), 224
sampling-based measurement
benefits, 166
drawbacks, 167
sampling frequency, 167
sar utility (UNIX), 158
satisfaction surveys, 150
scheduling
reports, 148
Service Level Agreement
negotiations, 60
scope (Service Level Agreements),
61
secondary data collectors, 103
security. See also access
intrusion reports, 44
real-time alerts, 50-52
security management, 117
services, 28
selecting
client SLM implementation order,
140
end-to-end response time measurement methods, 155-156
APIs, 162-163
client agents, 161-162
generating synthetic transactions,
163-164
negotiation teams
charters, 59
equal representation, 59
Service Level Agreements, 58-59
stakeholder groups, 59
nonperformance penalties (Service
Level Agreements), 68-71, 135
report personnel, 73-74
service level objectives
accuracy, 63
affordability, 65
attainability, 63-64
availability, 62
controllability, 65
criteria, 64
measurability, 65
mutual acceptability, 66
number, 63
performance, 62
relevance, 65
stretch objectives, 62
understandability, 65
servers
client/server interactions, 27
monitoring, 112
servers and desktops domain (networks), monitoring, 112
service description (Service Level
Agreements), 191
Service Level Agreement Manager
(SAMAN), 224
Service Level Agreement Working
Group (SLA Working Group),
81-82
Service Level Agreements (SLAs),
13. See also services; SLM
administration, 74
approving, 75, 190
AT&T, 96
benefits, 55-56
change request process, 192
conventions, 190
creating, 56, 58
documentation, 18, 61
effectiveness, 14-15
elements, 97-98
exclusions, 71
external, 56-57, 95-97
formats, 182
Gartner Group, 88
Giga Group, 89
GTE Internetworking, 96
Hurwitz Group, 89
in-house, 56
incentives, 183
internal, 56-58
mutual acceptability, 66
number, 63
performance, 62
relevance, 65
selecting, 62-66
stretch objectives, 62
understandability, 65
drawbacks, 156
measurement factors, 157
MIBs, 159
networks, 159
SNMP, 159
UNIX systems, 157-158
Windows NT/2000 systems,
158-159
performance, 24-25, 154
recovery reports, 45
recovery time necessary, 34
stages, 33
time-specific recovery, 33
redundant resources, 130
reliability, 154
revenue impact
agents, 107
Application Management MIB,
82-83
ARM, 83-84
CIM, 81
evolving standard initiatives, 184
ITIL, 78-81
Service Level Agreements, 53-54,
172
SLA Working Group, 81-82
tools, 99
user abuses, 14
user/service provider relationship
standards (SLM), 77
agents, 107
Application Management MIB,
82-83
ARM, 83-84
CIM, 81
evolving standard initiatives, 184
ITIL, 78-81
Service Level Agreements, 53-54,
172
SLA Working Group, 81-82
statement of intent (Service Level
Agreements), 189
Statscout, 237
Sterling Software, 237
stretch objectives, 62
surveys, customer, 42, 150, 197
future requirements section, 200
general comment areas, 199
IT contact frequency, 199
optional information section, 200
service quality ratings, 197-198
service usage information, 199
SVAs (Service Value Agreements),
89
Sync Research Incorporated
Frame Relay Access Probe, 237
Web site, 238
synthetic transactions
benefits, 163-164
drawbacks, 163
generating, 163-164
ping command, 164
traceroute command, 164
systems integrators (Internet
services), 171
T
Taskmanager utility (Windows
NT/2000), 159
teams, service, 182
telecommunications
AT&T, 96
external service management, 95-97
GTE Internetworking, 96
agents, 104-107
ARM, 114
CSU/DSUs, 108
manager-agent model, 104-105
managers, 104-105
packet monitors, 107-108
performance management, 116-117
primary data collectors, 102
probes, 107-108
secondary data collectors, 103
simulations, 108-109
reporting, 117
benefits, 163-164
drawbacks, 163
generating, 163-164
ping command, 164
traceroute command, 164
transaction rates, 26-27
transactions domain (networks),
monitoring, 112-114
TREND, 221
Trinity, 215
U-V
Unicenter TNG products, 219
UNIX
accounting utility, 158
monitoring by footprint, 157-158
ps utility, 158
sar utility, 158
upgrades (services), 129
user environment (Service Level
Agreements), 191
users. See also clients; customers
access
defining, 29
group privileged access, 30
complaints, negotiating, 150-151
determining baselines, 143
expectation creep, 55-56
increased IT dependence, 8-9, 18
increased IT knowledge, 8, 18
new user request process (Service
Level Agreements), 192
performance perception, 25
productivity, 129
requests, 24
satisfaction surveys, 150
Service Level Agreement/SLM
abuses, 14
user/service provider relationship
W-Z
WANs (wide area networks), 108
WANsuite, 239
WANview, 236
WBEM (Web Based Enterprise
Management) initiative, 81
Web sites. See sites
weekly summaries (reports), 48,
202
Windows 2000
monitoring by footprint, 158-159
Perfmon utility, 159
Process Explode utility, 159