What Is Fault Management - Describe Five Steps Process in Fault Management.
1 Answer
written 7.4 years ago by
Fault Management:
A fault in a network is normally associated with the failure of a network component and the
subsequent loss of connectivity. Fault management involves a five-step process:
(1) fault detection, (2) fault location, (3) restoration of service, (4) identification of the root cause
of the problem, and (5) problem resolution.
i. The fault should be detected as quickly as possible by the centralized management system,
preferably before, or at about the same time as, the users notice it.
ii. Fault location involves identifying where the problem is located. We distinguish this from
problem isolation, although in practice it could be the same.
iii. We make this distinction because it is important to restore service to the users as quickly as
possible, using alternative means.
iv. The restoration of service takes priority over diagnosing the problem and fixing it.
v. Identification of the root cause of the problem can be a complex process, which we will examine
in greater depth shortly.
vi. After identifying the source of the problem, a trouble ticket can be generated to resolve the
problem.
vii. In an automated network operations center, the trouble ticket could be generated
automatically by the NMS.
Fault Detection:
i. Fault detection is accomplished using either a polling scheme (the NMS periodically polls the
management agents for status) or traps (management agents, acting on information from the
network elements, send unsolicited alarms to the NMS).
ii. An application program in the NMS issues the ping command periodically and waits for a
response. Connectivity is declared broken when a preset number of consecutive responses are not
received.
iii. The pinging frequency and the preset number for failure detection may be tuned to balance
traffic overhead against the speed with which a failure is detected.
iv. The alternative detection scheme is to use traps. One of the advantages of traps is that failure
detection is accomplished faster with less traffic overhead.
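The polling scheme in items ii and iii can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `PollingDetector`, `probe`, and `max_misses` are hypothetical, and the probe is passed in as a callable so that a real ping could be substituted for the simulated one used here.

```python
from typing import Callable


class PollingDetector:
    """Declare connectivity broken after N consecutive missed responses."""

    def __init__(self, probe: Callable[[], bool], max_misses: int = 3) -> None:
        if max_misses < 1:
            raise ValueError("max_misses must be at least 1")
        self.probe = probe            # hypothetical: returns True if the node answered
        self.max_misses = max_misses  # the "preset number" of consecutive misses
        self.misses = 0               # length of the current run of missed responses

    def poll_once(self) -> bool:
        """Run one poll; return True if the node is now considered down."""
        if self.probe():
            self.misses = 0           # any response resets the run
        else:
            self.misses += 1
        return self.misses >= self.max_misses


# Simulated probe: one miss, one response, then three consecutive misses.
responses = iter([False, True, False, False, False])
detector = PollingDetector(probe=lambda: next(responses), max_misses=3)
states = [detector.poll_once() for _ in range(5)]
print(states)  # prints [False, False, False, False, True]
```

The isolated miss on the first poll does not trigger a failure because the following response resets the count. The polling interval (not shown here) and `max_misses` are the two values that item iii suggests tuning.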
Fault Location and Isolation:
ii. If an interface card on a router has failed, all managed components connected to that
interface would indicate failure.
iii. After having located where the fault is, the next step is to isolate the fault (i.e. determine the
source of the problem).
iv. First, we should determine whether the problem lies in the component or in the physical link.
Thus, in the above example, the interface card may be functioning well, but the link to the interface
may be down. We need to use various diagnostic tools to isolate the cause.
v. Let us assume for the moment that the link is not the problem but that the interface card is. We
then proceed to isolate the problem to the layer that is causing it. It is possible that excessive
packet loss is causing disconnection.
vi. We can measure packet loss by pinging, if pinging can be used. We can query the various
Management Information Base (MIB) parameters on the node itself or other related nodes to
further localize the cause of the problem.
vii. For example, error rates calculated from the interface group parameters ifInDiscards, ifInErrors,
ifOutDiscards, and ifOutErrors, relative to the input and output packet rates, could help us
isolate the problem in the interface card.
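As a rough sketch of the calculation in item vii, the code below derives error and discard ratios from two successive samples of the interface counters. Several things here are assumptions rather than part of the original answer: the function names are hypothetical, the samples are plain dictionaries rather than the result of real SNMP queries, only unicast packet counters (ifInUcastPkts, ifOutUcastPkts) are used for the packet totals, and the counters are treated as 32-bit values that may wrap once between samples.

```python
COUNTER32_MODULUS = 2**32  # assumption: counters are SNMP Counter32 and wrap at 2^32


def counter_delta(old: int, new: int) -> int:
    """Difference between two counter readings, allowing for a single wrap."""
    return (new - old) % COUNTER32_MODULUS


def interface_error_rates(prev: dict, curr: dict) -> dict:
    """Return input/output error and discard ratios between two samples."""
    delta = {name: counter_delta(prev[name], curr[name]) for name in prev}

    # Assumption: received packets = delivered + errored + discarded (unicast only).
    total_in = delta["ifInUcastPkts"] + delta["ifInErrors"] + delta["ifInDiscards"]
    # Assumption: ifOutUcastPkts already counts packets that were later dropped.
    total_out = delta["ifOutUcastPkts"]

    def ratio(count: int, total: int) -> float:
        return count / total if total else 0.0  # avoid dividing by zero on idle links

    return {
        "in_error_rate": ratio(delta["ifInErrors"], total_in),
        "in_discard_rate": ratio(delta["ifInDiscards"], total_in),
        "out_error_rate": ratio(delta["ifOutErrors"], total_out),
        "out_discard_rate": ratio(delta["ifOutDiscards"], total_out),
    }


# Two illustrative (made-up) samples taken one polling interval apart.
prev = {"ifInUcastPkts": 1000, "ifInErrors": 10, "ifInDiscards": 5,
        "ifOutUcastPkts": 2000, "ifOutErrors": 0, "ifOutDiscards": 20}
curr = {"ifInUcastPkts": 1900, "ifInErrors": 60, "ifInDiscards": 55,
        "ifOutUcastPkts": 3000, "ifOutErrors": 10, "ifOutDiscards": 40}
rates = interface_error_rates(prev, curr)
print(rates)
```

Working from the difference between two samples, rather than from the raw counter values, matters because the counters are cumulative since the agent last restarted. What ratio counts as excessive depends on the link and is not specified here.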
Service Restoration:
i. Whenever there is a service failure, it is the responsibility of the network operations center
(NOC) to restore service as soon as possible. This involves detection and isolation of the problem
causing the failure, and restoration of service.
ii. In several failure situations, the network will do this automatically. This network feature is called
self-healing. In other situations, the NMS can detect the failure of components and raise
appropriate alarms.
iii. Restoration of service does not include fixing the cause of the problem. That responsibility
usually rests with the installation and maintenance (I&M) group.
iv. A trouble ticket is generated and followed up for resolution of the problem by the I&M group.
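The separation between restoring service and resolving the problem can be made concrete with a small trouble-ticket record. This is only a sketch: `TroubleTicket`, `TicketStatus`, and all field names are hypothetical and do not correspond to any particular NMS or ticketing product.

```python
from dataclasses import dataclass, field
from enum import Enum


class TicketStatus(Enum):
    OPEN = "open"
    RESOLVED = "resolved"


@dataclass
class TroubleTicket:
    component: str                       # where the fault was located
    description: str                     # what was observed
    service_restored: bool = False       # tracked separately from resolution
    status: TicketStatus = TicketStatus.OPEN
    log: list = field(default_factory=list)

    def mark_service_restored(self, note: str) -> None:
        """Record that users have service again; the ticket stays open."""
        self.service_restored = True
        self.log.append(f"restored: {note}")

    def resolve(self, note: str) -> None:
        """Record that the underlying cause has been fixed."""
        self.status = TicketStatus.RESOLVED
        self.log.append(f"resolved: {note}")


# Illustrative values only.
ticket = TroubleTicket("router-1/interface-3", "interface card failure")
ticket.mark_service_restored("traffic rerouted over an alternative path")
print(ticket.service_restored, ticket.status.value)  # prints True open
ticket.resolve("interface card replaced")
print(ticket.status.value)  # prints resolved
```

Keeping `service_restored` apart from `status` reflects item iii: service can be back for users while the ticket remains open for follow-up.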
Root Cause Analysis:
Root cause analysis seeks to identify the origin of a problem using a specific set of steps, with
associated tools, to find the primary cause of the problem, so that you can:
Problem Resolution:
Problem resolution means correcting the problem using hardware and software techniques: the
affected managed objects are repaired or replaced and operations return to normal, which indicates
that the problem has been solved.