ITIL OM
Service operation fundamentals
The operation of service is where these
plans, designs and optimizations are executed and
measured. From a customer viewpoint, service
operation is where actual value is seen.
Challenges exist outside
that focus that can put business value at risk:
■ It is difficult to obtain funding during the
operational stage to fix design flaws or
unforeseen requirements, because this was not
part of the original value proposition.
■ It is difficult to obtain additional funding for
tools or actions (including training) aimed at
improving the efficiency of service operation.
■ Some services are taken for granted and any action
to optimize them is perceived as ‘fixing services
that are not broken’.
Optimizing service operation performance
■ Long-term incremental improvement This
is based on evaluating the performance and
output of all service operation processes,
technologies, functions and outputs over time.
■ Short-term ongoing improvements These
are the improvements made to working practices within the processes, functions and technologies that underpin service operation itself.
Examples include tuning, workload
balancing, personnel redeployment and
training.
Functions within service operation
Functions include groups of skilled people who carry out one
or more service lifecycle processes and activities.
Service desk
Technical management
It provides detailed technical
skills and resources needed to support the ongoing
operation of IT services and the management
of the IT infrastructure.
IT operations management
IT operations management executes the daily
operational activities needed to manage IT services
and the supporting IT infrastructure.
IT operations control
This is generally staffed
by shifts of operators and ensures that
routine operational tasks are carried out. IT
operations control will also provide centralized
monitoring and control activities, usually using
an operations bridge or network operations
centre.
Facilities management
This refers to the
management of the physical IT environment,
usually data centres or computer rooms. In
many organizations technical and application
management are co-located with IT operations
in large data centres.
Achieving balance in Service operation
Internal IT view versus external business view
Examples on table on page 40
■ The external view of IT is the way in which
services are experienced by its users and
customers. They do not always understand, nor
do they care about, the details of what
technology is used to manage those services.
All they are concerned about is that the services
are delivered as required and agreed.
■ The internal view of IT is the way in which
IT components and systems are managed to
deliver the services. Because IT systems are
complex and diverse, this often means that the
technology is managed by several different
teams or departments – each of which is focused on achieving good performance and availability of ‘its’ systems.
Stability versus responsiveness
Quality of service versus cost of service
Reactive versus proactive
A reactive organization is one which does not
act unless it is prompted to do so by an external
driver, e.g. a new business requirement, an
application that has been developed or escalation
in complaints made by users and customers.
A proactive organization is always looking
for ways to improve the current situation. It
will continually scan the internal and external
environments, looking for changes that may have
potential impact.
Operational health
Communication
An important principle is that all communication
must have an intended purpose or a resultant
action.
■ Routine operational communication
■ Communication between shifts
■ Performance reporting
■ Communication in projects
■ Communication related to changes
■ Communication related to exceptions
■ Communication related to emergencies
■ Training on new or customized processes and
service designs
■ Communication of strategy, design and
transition to service operation teams.
Meetings
A number of factors are essential for successful
meetings. Although these may seem to be common
sense, they are sometimes neglected
■ Establish and communicate a clear agenda
in advance to allow the audience to prepare
and to ensure that the meeting achieves its
objective.
■ Ensure that the rules for participating are
understood. Organizations tend to have a
set of meeting rules, ranging from
relatively informal to very formal (e.g.
published books such as Robert's Rules of Order
that describe procedures, rules, ethics and
customs for governing meetings).
■ Minutes of the meeting: rules should be set
about when minutes are taken. Minutes are
used to remind people who are assigned actions
and to track the progress of delegated actions.
Types
The operations meeting
Operations meetings are normally held between
the managers of the IT operational departments, teams or groups, at the beginning of each business day or week.
Department, group or team meetings
Customer meetings
■ Follow-up after serious incidents The purpose
of these meetings is to repair relationships with
the customer, but also to ensure that IT has all
the information required to prevent recurrence.
■ A customer forum This can be used for a
range of purposes, including testing ideas
for new services or solutions, or gathering
requirements for new or revised services or
procedures.
■ Customer meetings These should be scheduled
and held in coordination with business
relationship management and the service level
manager to ensure that communications to the
customer are coordinated and consistent.
Documentation
Documentation activities include the following:
■ Establishing their own technical procedures
manuals.
■ Participation in the definition and maintenance
of process manuals for all processes they
are involved in.
■ Participation in the creation and maintenance
of the service portfolio.
■ Participation in the definition and maintenance
of service management tool work instructions
in order to meet reporting requirements.
Service operation inputs and outputs
Examples of interfaces to other service lifecycle processes
There are other processes that will be executed or
supported during service operation, but which are
driven during other stages of the service lifecycle.
■ Change management, which is a major process
that should be closely linked to service asset
and configuration management and release
and deployment management
Capacity and availability management, which
are covered
Financial management for IT services
Service catalogue management, which identifies
the live IT services that are to be delivered. This
process is covered in ITIL Service Design.
Service strategy
Inputs to service operation:
■ Vision and mission
■ Service portfolio
■ Policies
■ Strategies and strategic plans
■ Priorities
■ Financial information and budgets
■ Demand forecasts and strategies
■ Strategic risks
Outputs from service operation:
■ Operating risks
■ Operating cost information for total cost of ownership (TCO) calculations
■ Actual performance data
Service design
Inputs to service operation:
■ Service catalogue
■ Service design packages, including:
■■ Details of utility and warranty
■■ Operations plans and procedures
■■ Recovery procedures
■ Knowledge and information in the SKMS
■ Vital business functions
■ Hardware and software maintenance requirements
■ Designs for service operation processes and procedures
■ SLAs, OLAs and underpinning contracts
■ Security policies
Outputs from service operation:
■ Operational requirements
■ Actual performance data
■ RFCs to resolve operational issues
■ Historical incident and problem records
Service transition
Inputs to service operation:
■ New or changed services
■ Known errors
■ Standard changes for use in request fulfilment
■ Knowledge and information in the SKMS (including the configuration management system)
■ Change schedule
Outputs from service operation:
■ RFCs to resolve operational issues
■ Feedback on quality of transition activities
■ Input to operational testing
■ Actual performance information
■ Input to change evaluation and change advisory board meetings
Continual service improvement
Inputs to service operation:
■ Results of customer and user satisfaction surveys
■ Service reports and dashboards
■ Data required for metrics, key performance indicators (KPIs) and critical success factors (CSFs)
■ RFCs for implementing improvements
Outputs from service operation:
■ Operational performance data and service records
■ Proposed problem resolutions and proactive measures
■ Knowledge and information in the SKMS
■ Achievements against metrics, KPIs and CSFs
■ Improvement opportunities logged in the continual service improvement register
Service operation processes
Problem vs Incident MGMT
Note that without a distinction between incidents
and problems, and without keeping separate incident and
problem records, there is a risk that:
■ Incident resolution activities may extend the
duration of service outages by ‘looking for root
cause’ rather than taking direct actions to restore
normal state service operation.
■ Incident records will be closed too early in
the overall support cycle and there will be no
actions taken to prevent recurrence – so the
same incidents will continue to disrupt the
business.
■ Incident records will be kept open so that root
cause analysis can be done and visibility will
be lost of when the user’s service was actually
restored – so SLA targets may not be met even
though the service has been restored within
users’ expectations.
Separating the two processes and managing
through separate incident and problem records
allows support staff to meet the rapid restoration
objective for incident management while allowing
root cause to be investigated and resolved in a
separate, parallel problem management process.
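The separation of records described above can be pictured with a minimal Python sketch. The class names, field names and sample values are hypothetical and are not taken from ITIL or from any service management tool. The point is only that the incident record carries its own restoration time, while the linked problem record can stay open for root cause analysis.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class IncidentRecord:
    incident_id: str
    opened_at: datetime
    restored_at: Optional[datetime] = None   # set when the user's service is restored
    problem_id: Optional[str] = None         # link to a separate problem record

    def restoration_time(self) -> Optional[timedelta]:
        """Time taken to restore service, or None if not yet restored."""
        if self.restored_at is None:
            return None
        return self.restored_at - self.opened_at


@dataclass
class ProblemRecord:
    problem_id: str
    related_incidents: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None         # may stay unknown after incidents close


# Hypothetical example: service restored in 30 minutes, root cause still open.
incident = IncidentRecord(
    incident_id="INC-1",
    opened_at=datetime(2024, 1, 8, 9, 0),
    restored_at=datetime(2024, 1, 8, 9, 30),
    problem_id="PRB-1",
)
problem = ProblemRecord(problem_id="PRB-1", related_incidents=["INC-1"])
```

Because the restoration time lives on the incident record, SLA reporting does not depend on when the problem record is eventually closed.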
Event Management
An event can be defined as any change of state
that has significance for the management of a
configuration item (CI) or IT service.
Events are typically recognized through notifications created
by an IT service, CI or monitoring tool.
There are two types of monitoring tool:
■ Active monitoring tools that poll key CIs to
determine their status and availability. Any
exceptions will generate an alert that needs to
be communicated to the appropriate tool or
team for action.
■ Passive monitoring tools that detect and
correlate operational alerts or communications
generated by CIs.
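The difference between the two monitoring styles can be sketched in Python as follows. This is an illustration under assumptions: the CI names, the `Alert` type and the status-check callables are hypothetical, and the passive monitor here only records notifications (a real tool would also correlate them).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Alert:
    ci_name: str
    message: str


def poll_cis(status_checks: Dict[str, Callable[[], bool]]) -> List[Alert]:
    """Active monitoring: ask each CI for its status and alert on exceptions."""
    alerts = []
    for ci_name, is_available in status_checks.items():
        if not is_available():
            alerts.append(Alert(ci_name, "CI did not pass its status poll"))
    return alerts


class PassiveMonitor:
    """Passive monitoring: receive notifications that CIs generate themselves."""

    def __init__(self) -> None:
        self.received: List[Alert] = []

    def on_notification(self, ci_name: str, message: str) -> None:
        # The CI pushes the message; the monitor only listens and records.
        self.received.append(Alert(ci_name, message))


# Hypothetical CIs: the lambdas stand in for real availability checks.
checks = {"router-01": lambda: True, "db-server-02": lambda: False}
active_alerts = poll_cis(checks)

monitor = PassiveMonitor()
monitor.on_notification("app-server-03", "disk usage threshold reached")
```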
Scope
Configuration items (CIs)
Environmental conditions
Software licence monitoring
Security
Normal activity (e.g. tracking the use of an
application or the performance of a server).
Policies, principles and basic concepts
Event notifications should go only to those
responsible for handling the further
actions or decisions related to them.
Event management and support should be
centralized as much as reasonably possible.
All application events should utilize a common
set of messaging and logging standards and
protocols wherever possible.
Event handling actions should be automated
wherever possible.
A standard classification scheme should be in
place that references common handling and
escalation processes.
All recognized events should be captured
and logged. This will provide a means for
examining incidents, problems and trends after
events have occurred.
Types of events
Informational events
■ A scheduled workload has completed
■ A user has logged in to use an application
■ An email has reached its intended recipient.
Warning events
■ A server’s memory utilization reaches within 5%
of its highest acceptable performance level
■ The completion time of a transaction is 10%
longer than normal.
Exception events
■ A user attempts to log on to an application
with the incorrect password
■ An unusual situation has occurred in a business
process that may indicate an exception
requiring further business investigation (e.g.
a web page alert indicates that a payment
authorization site is unavailable – impacting
financial approval of business transactions)
■ A device’s CPU is above the acceptable
utilization rate
■ A PC scan reveals the installation of
unauthorized software.
Filtering of events
There are several strategies that can be used to
obtain the correct level of filtering. These are
shown as follows:
■ Integration Integrate event management
into all service management processes where
feasible. This will ensure that only the events
significant to these processes are reported.
■ Design Design new services with event
management in mind.
■ Trial and error No matter how thoroughly
event management is prepared, there will be
classes of events that are not properly filtered.
Event management must therefore include a
formal process to evaluate the effectiveness of
filtering.
■ Planning Proper planning is needed for
the deployment of event management
software across the entire IT infrastructure.
Key considerations for designing event
management can include:
■ What needs to be monitored?
■ What type of monitoring is required (e.g. active
or passive; performance or output)?
■ When do we need to generate an event?
■ What type of information needs to be
communicated in the event?
■ Who are the messages intended for?
■ Who will be responsible for recognizing,
communicating, escalating and taking action on
events?
Instrumentation
Instrumentation is about defining and designing
exactly how to monitor and control the IT
infrastructure and IT services.
■ How will events be generated?
■ How will events be classified?
■ How will events be communicated and
escalated?
■ Does the CI already have event generation
mechanisms as a standard feature and, if so,
which of these will be used? Are they sufficient
or does the CI need to be customized to include
additional mechanisms or information?
■ What data will be used to populate the event
record?
■ Are events generated automatically or does the
CI have to be polled?
■ Where will events be logged and stored?
■ How will supplementary data be gathered?
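One way to make the instrumentation questions concrete is to write down what an event record might hold. The sketch below is an assumption for illustration only: the `EventRecord` class and all of its field names are hypothetical and are not prescribed by ITIL.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass
class EventRecord:
    ci_name: str          # which CI generated the event
    classification: str   # e.g. "informational", "warning" or "exception"
    message: str          # the information communicated in the event
    generated_by: str     # "automatic" (pushed by the CI) or "polled"
    # When the event was logged, stored as a timezone-aware UTC timestamp.
    logged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Supplementary data gathered alongside the event.
    supplementary: Dict[str, str] = field(default_factory=dict)


# Hypothetical record for a polled server.
record = EventRecord(
    ci_name="db-server-02",
    classification="warning",
    message="memory utilization approaching threshold",
    generated_by="polled",
    supplementary={"memory_percent": "65"},
)
```

Each field answers one of the questions in the list: how the event was generated, how it is classified, what data populates the record and what supplementary data accompanies it.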
Event detection and alert mechanisms
Router(CI) ---> |Rule set| ---> Service "n" ---> |Rule set| ---> Process sales order (Business Process)
Thorough design of the event detection and alert
mechanisms requires the following:
■ Detailed knowledge of the service level
requirements of the service being supported by
each CI
■ Knowledge of who is going to be supporting
the CI
■ Knowledge of the significance of multiple
similar events (on the same CI or various similar
CIs)
■ Familiarity with incident prioritization and
categorization codes so that if it is necessary to
create an incident record, these codes can be
provided
■ Knowledge of other CIs that may be dependent
on the affected CI, or those CIs on which
it depends.
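The chain from CI through service to business process, together with the dependency knowledge listed above, can be sketched as a simple lookup. The mapping data below is hypothetical (only the business process name "Process sales order" comes from the diagram), and a real implementation would draw these relationships from the configuration management system.

```python
from typing import Dict, List

# Hypothetical dependency data: which services run on a CI, and which
# business processes rely on each service.
CI_TO_SERVICES: Dict[str, List[str]] = {
    "router-01": ["order-entry-service"],
}
SERVICE_TO_PROCESSES: Dict[str, List[str]] = {
    "order-entry-service": ["Process sales order"],
}


def impacted_business_processes(ci_name: str) -> List[str]:
    """Follow CI -> service -> business process links for an affected CI."""
    processes: List[str] = []
    for service in CI_TO_SERVICES.get(ci_name, []):
        for process in SERVICE_TO_PROCESSES.get(service, []):
            if process not in processes:   # avoid listing a process twice
                processes.append(process)
    return processes
```

Calling `impacted_business_processes("router-01")` returns the business processes that an event on that router could affect, which is the information needed to judge the significance of the event.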
Process activities, methods and techniques
Event notification
A general principle of event notification is that
the more meaningful the data it contains and
the more targeted the audience, the easier it is
to make decisions about the event.
Event occurs
Event detection
There should be a record of the event and any
subsequent actions. The event can be logged as
an event record in the event management tool or
it can simply be left as an entry in the system log
of the device or application that generated the
event.
First-level event correlation and filtering (CI level)
The purpose of first-level event correlation and
filtering is to decide whether to communicate
the event to a management tool or to ignore it.
During the filtering step, the first level of
correlation is performed, i.e. the determination
of whether the event is informational, a warning,
or an exception (see next step).
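A minimal Python sketch of the communicate-or-ignore decision follows. The rule that informational events stay in the local log while warnings and exceptions are forwarded is an assumption made for the example, as are the dictionary keys. Each organization would define its own filtering rules.

```python
from typing import Dict, Optional

# Hypothetical rule: informational events stay in the local log, while
# warnings and exceptions are communicated to the management tool.
FORWARDED_TYPES = {"warning", "exception"}


def first_level_filter(event: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return the event if it should be communicated, otherwise None."""
    if event.get("type") in FORWARDED_TYPES:
        return event
    return None
```

An event that returns `None` is ignored by the management tool but can still remain in the log of the device or application that generated it.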
Significance of events
Every organization will have its own categorization
of the significance of an event, but it is suggested
that at least these three broad categories be
represented.
Informational
This refers to an event that does not require
any action and does not represent an exception.
Such events are typically stored in the system or service
log files and kept for a predetermined period.
■ A user logs onto an application
■ A job in the batch queue completes successfully
■ A device has come online
■ A transaction is completed successfully.
Warning
A warning is an event that is generated when
a service or device has reached a threshold
that indicates a situation must be checked and
appropriate actions taken to prevent an exception.
■ Memory utilization on a server is currently at
65% and increasing. If it reaches 75%, response
times will be unacceptably long and the OLA
for that department will be breached.
■ The collision rate on a network has increased by
15% over the past hour.
Exception
An exception means that a service or device is
currently operating abnormally (however that has
been defined). Typically, this means that an OLA
and SLA have been breached and the business
is being impacted.
■ A server is down
■ Response time of a standard transaction across
the network has slowed to more than 15 seconds
■ More than 150 users have logged on to the
general ledger application concurrently
■ A segment of the network is not responding to
routine requests.
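The three categories can be illustrated with a small threshold check in Python. The 65% and 75% figures come from the memory utilization example in the notes. Treating 65% as the warning threshold, and the function and parameter names, are assumptions made for this sketch.

```python
def classify_memory_event(utilization_percent: float,
                          warning_at: float = 65.0,
                          exception_at: float = 75.0) -> str:
    """Classify a memory utilization reading by significance."""
    if utilization_percent >= exception_at:
        return "exception"       # operating abnormally, business impacted
    if utilization_percent >= warning_at:
        return "warning"         # threshold reached, check before it worsens
    return "informational"       # normal operation, log only
```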
Second-level event correlation
If an event is a warning, a decision has to be made
about exactly what the significance is and what
actions need to be taken to deal with it. It is here
that the meaning of the event is determined.
Further action required?
If the second-level correlation activity recognizes
an event, a response will be required.
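One simple form of correlation is to count how often the same warning repeats on a CI within a window and to require a response once a limit is reached. The sketch below assumes that approach. The repeat threshold of three, the event codes and the function name are all hypothetical, and real correlation engines apply much richer rule sets.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def cis_needing_response(warnings: Iterable[Tuple[str, str]],
                         repeat_threshold: int = 3) -> List[str]:
    """Return CIs whose identical warning repeated at least repeat_threshold times.

    warnings holds (ci_name, event_code) pairs seen in one correlation window.
    """
    counts = Counter(warnings)
    flagged = {
        ci_name
        for (ci_name, _event_code), seen in counts.items()
        if seen >= repeat_threshold
    }
    return sorted(flagged)


# Hypothetical window: srv1 repeats the same warning three times, srv2 once.
window = [("srv1", "MEM_HIGH")] * 3 + [("srv2", "MEM_HIGH")]
needs_response = cis_needing_response(window)
```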
Triggers, inputs, outputs and interfaces
Triggers
Event management can be initiated by any type of
change in state. The key is to define which of these
state changes need to be acted upon.
■ An exception within a business process that is
being monitored by event management
■ The completion of an automated task or job
■ A status change in a server or database CI
■ Access of an application or database by a user
or automated procedure or job
■ Exceptions to any level of CI performance
defined in the design specifications, OLAs or
SOPs
■ Exceptions to an automated procedure
or process, e.g. a routine change that has
been assigned to a build team has not been
completed in time.
Inputs
Alarms, alerts and thresholds
Operational and service level requirements
associated with events and their actions
Event correlation tables, rules, event codes and
automated response solutions that will support
event management activities
Roles and responsibilities for recognizing events
and communicating them to those that need to
handle them
Operational procedures for recognizing,
logging, escalating and communicating events
Outputs
Events that have been communicated and
escalated to those responsible for further action
Event logs describing what events took place
and any escalation and communication activities
taken to support forensic, diagnosis or further
CSI activities
Events that indicate an incident has occurred
Events that indicate the potential breach of an
SLA or OLA objective
Events and alerts that indicate completion
status of deployment
Populated SKMS with event information and
history.
Critical success factors and key performance indicators
CSF Detecting all changes of state that have
significance for the management of CIs and IT
services
KPI Number and ratio of events compared
with the number of incidents
KPI Number and percentage of each type
of event per platform or application versus
total number of platforms and applications
underpinning live IT services
CSF Ensuring all events are communicated
to the appropriate functions that need to be
informed or take further control actions
KPI Number and percentage of events that
required human intervention and whether
this was performed
KPI Number of incidents that occurred and
percentage of these that were triggered
without a corresponding event
CSF Providing the trigger, or entry point,
for the execution of many service operation
processes and operations management activities
KPI Number and percentage of events that
required human intervention and whether
this was performed
CSF Providing the means to compare actual
operating performance and behaviour against
design standards and SLAs
KPI Number and percentage of incidents
that were resolved without impact to the
business
KPI Number and percentage of events that
resulted in incidents or changes
KPI Number and percentage of events
caused by existing problems or known errors
KPI Number and percentage of events
indicating performance issues (for
example, growth in the number of times
an application exceeded its transaction
thresholds over the past six months)
KPI Number and percentage of events
indicating potential availability issues
CSF Providing a basis for service assurance,
reporting and service improvement
KPI Number and percentage of repeated
or duplicated events
KPI Number of events/alerts generated
without actual degradation of service/
functionality (false positives – indication
of the accuracy of the instrumentation
parameters, important for CSI)
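Several of the KPIs above are simple counts and percentages, so they can be computed directly from event and incident totals. The Python sketch below covers four of them. The function name, the dictionary keys and the sample figures are hypothetical and only show the arithmetic.

```python
from typing import Dict, Optional


def percentage(part: int, total: int) -> float:
    """Return part as a percentage of total, or 0.0 when total is zero."""
    return 0.0 if total == 0 else 100.0 * part / total


def event_kpis(total_events: int,
               events_needing_intervention: int,
               false_positive_events: int,
               total_incidents: int,
               incidents_without_event: int) -> Dict[str, Optional[float]]:
    """Compute a few event management KPIs from raw counts."""
    # Ratio of events to incidents; undefined when there were no incidents.
    events_per_incident = (
        None if total_incidents == 0 else total_events / total_incidents
    )
    return {
        "events_per_incident": events_per_incident,
        "pct_events_needing_intervention": percentage(
            events_needing_intervention, total_events),
        "pct_false_positive_events": percentage(
            false_positive_events, total_events),
        "pct_incidents_without_event": percentage(
            incidents_without_event, total_incidents),
    }


# Hypothetical monthly figures.
kpis = event_kpis(
    total_events=200,
    events_needing_intervention=20,
    false_positive_events=10,
    total_incidents=40,
    incidents_without_event=4,
)
```

With these sample figures there are 5 events per incident, 10% of events needed human intervention, 5% were false positives and 10% of incidents had no corresponding event.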