Copyright (c) 2024, Oracle. All rights reserved. Oracle Confidential.
Commands To Clear FMA faults on the T5-x, T7-x, S7-x, T8-x, Mx-32, M7-x, M8-x Servers (Doc ID 2216293.1)
In this Document
Purpose
Scope
Details
References
APPLIES TO:
MiniCluster S7-2 Hardware
SPARC T4-4
SPARC T8-2
SPARC M5-32
SPARC S7-2 - Version All Versions and later
Information in this document applies to any platform.
PURPOSE
Quick reference for the CLI commands used to fully clear FMA faults on T5-x, T7-x, T8-x, and S7-x servers.
SCOPE
This document is intended for anyone responsible for clearing faults on T5-x, T7-x, S7-x, T8-x, Mx-32, M7-x, and
M8-x servers. It includes only a limited description of the fault handling process. For more complete information, see:
Managing Faults, Defects, and Alerts in Oracle® Solaris 11.3
and
Oracle® ILOM User's Guide for System Monitoring and Diagnostics Firmware Release 3.2.x
DETAILS
What's New Starting With T5-x Systems
Starting with T5-x systems and continuing with T7-x and S7-x systems, fault reporting and clearing changed
significantly.
CPU and memory diagnosis is now done mostly by ILOM FMA (FDD) instead of Solaris FMA.
An example MSG ID for such a diagnosis is:
SPSUN4V-8000-EJ Memory Uncorrectable Error
Note the SPSUN4V prefix used on the T5, versus the SUN4V prefix used on the T4.
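The prefix difference can be checked mechanically. The following is a minimal sketch in Python, not an Oracle tool. The function name diagnosis_engine is hypothetical, and the mapping covers only the two prefixes mentioned in this document; any other prefix is reported as unknown rather than guessed.

```python
def diagnosis_engine(msg_id: str) -> str:
    """Guess the diagnosis engine from the prefix of an FMA message ID.

    Assumption: only the two prefixes named in this document are mapped.
    """
    prefix = msg_id.strip().split("-", 1)[0].upper()
    if prefix == "SPSUN4V":
        return "ILOM FMA"      # SP-side diagnosis, as on T5-x and later
    if prefix == "SUN4V":
        return "Solaris FMA"   # prefix used on the T4
    return "unknown"


print(diagnosis_engine("SPSUN4V-8000-EJ"))  # prints "ILOM FMA"
```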
There is now a shared fault database.
A fault can be cleared from either Solaris or the ILOM FMA shell, no matter which diagnosis engine (Solaris or ILOM
FMA FDD) diagnosed it, and the fault is then cleared in both places.
This behavior differs from the T4, where Solaris-diagnosed faults cleared via Solaris also had to be cleared in the
ILOM FMA shell, and ILOM-diagnosed faults such as power or environmental faults could not be cleared via Solaris.
Power and environmental faults are now shown not only by ILOM “show faulty” and the ILOM FMA shell commands, but
also by Solaris “fmadm faulty”. These faults can also be cleared via Solaris.
IO errors are still diagnosed by Solaris FMA. Some of the more common IO errors are PCIe related. Unlike on the T4,
an IO error is propagated down to the ILOM FMA shell, and clearing it there also clears it from Solaris.
Types of Fault Repair
When a component in your system has faulted, the Fault Manager can repair the component implicitly or you can repair
the component explicitly.
Implicit repair
An implicit repair can occur when the faulty component is replaced if that component has serial number information that
the Fault Manager daemon (fmd) can track. On many systems, serial number information is included in the FMRIs so that
fmd can determine when components have been replaced. When fmd determines that a component has been replaced and
the replacement has been successfully brought into service, then the Fault Manager no longer displays that component in
fmadm list output. The component is maintained in the Fault Manager internal resource cache until the fault event is 30
days old. When fmd faults a piece of hardware, that hardware might be taken out of service so that it does not adversely
affect the system. Hardware removal from service can occur whether Solaris or ILOM diagnosed the problem. Hardware
removal from service is usually reported in the Response section of the diagnosis message.
Explicit repair
Sometimes no FRU serial number information is available even though the FMRI includes a chassis identifier. In this case,
fmd cannot detect an FRU replacement, and you must perform an explicit repair by using the fmadm command with the
replaced, repaired, or acquit subcommand, as shown in the following sections.
Other corner case situations may exist where a fault needs to be explicitly repaired.
1) Clearing Faults from Solaris
These fmadm commands take the following operands:
The UUID, also shown as the EVENT-ID in Fault Manager output, identifies the fault event. The UUID can be used only
with the fmadm acquit command. You can specify that the entire event can be safely ignored, or you can specify that a
particular resource is not a suspect in this event.
The FMRI and the label identify the suspect faulted resource. Typically, the label is easier to use than the FMRI.
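To make the three operand types concrete, here is a rough Python sketch that guesses which kind of operand a string is. It is a heuristic under stated assumptions, not how fmadm itself parses its arguments: the function name operand_kind is hypothetical, an FMRI is assumed to contain a scheme followed by "://", and both the sample FMRI and the sample UUID are made up for illustration.

```python
import uuid


def operand_kind(operand: str) -> str:
    """Classify an fmadm operand as 'uuid', 'fmri', or 'label' (heuristic)."""
    try:
        uuid.UUID(operand)  # raises ValueError if not a well-formed UUID
        return "uuid"
    except ValueError:
        pass
    # Assumption: an FMRI carries a scheme followed by "://", e.g. "hc://...".
    if "://" in operand:
        return "fmri"
    # Anything else is treated as a label such as /SYS/MB.
    return "label"


# The FMRI and UUID below are invented examples, not real fault data.
for op in ("/SYS/MB",
           "hc:///chassis=0/motherboard=0",
           "6a1f5e2c-3b4d-4e8f-9a0b-1c2d3e4f5a6b"):
    print(op, "->", operand_kind(op))
```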
a) fmadm replaced command
Use the fmadm replaced command to indicate that the suspect FRU has been replaced. If multiple faults are currently
reported against one FRU, the FRU shows as replaced in all cases.
example: fmadm replaced /SYS/MB
When an FRU is replaced, the serial number of the FRU changes. If fmd automatically detects that the serial number of an
FRU has changed, the Fault Manager behaves in the same way as if you had entered the fmadm replaced command. If
fmd cannot detect whether the serial number of the FRU has changed, then you must enter the fmadm replaced command
if you have replaced the FRU. If fmd detects that the serial number of the FRU has not changed, then the fmadm
replaced command exits with an error.
b) fmadm repaired command
Use the fmadm repaired command when you have performed a physical repair other than replacement of the FRU to
resolve the problem. Examples of such repairs include reseating a card or straightening a bent pin. If multiple faults are
currently reported against one FRU, the FRU shows as repaired in all cases.
example: fmadm repaired /SYS/MB
c) fmadm acquit command
Use the acquit subcommand if you determine that the indicated resource is not the cause of the fault. Usually the Fault
Manager automatically acquits some suspects in a multi-element suspect list. Acquittal can occur implicitly as the Fault
Manager refines the diagnosis, for example if additional error events occur. Sometimes Support Services gives you
instructions to perform a manual acquittal.
Replacement takes precedence over repair, and both replacement and repair take precedence over acquittal. Thus, you can
acquit a component and then subsequently repair the component, but you cannot acquit a component that has already
been repaired.
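The precedence rule above can be modeled as a simple ranking. This Python sketch is illustrative only: the names RANK and allowed_after are hypothetical, and the text does not say what happens when the same action is repeated, so the sketch assumes that case is allowed.

```python
# Higher rank takes precedence: replaced > repaired > acquit.
RANK = {"acquit": 1, "repaired": 2, "replaced": 3}


def allowed_after(previous: str, requested: str) -> bool:
    """Return True if `requested` can still be applied after `previous`."""
    # Assumption: repeating the same action is harmless, so equal ranks pass.
    return RANK[requested] >= RANK[previous]


print(allowed_after("acquit", "repaired"))  # True: acquit, then repair
print(allowed_after("repaired", "acquit"))  # False: already repaired
```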
If you do not specify an FMRI or label with the UUID, then the entire event is marked as safe to ignore. A case is
considered repaired when the fault event UUID is acquitted.
example: fmadm acquit <UUID>
Acquit by FMRI or label with no UUID only if you determine that the resource is not a factor in any current cases in which
that resource is a suspect. If multiple faults are currently reported against one FRU, the FRU shows as acquitted in all
cases.
example: fmadm acquit /SYS/MB
To acquit a resource in one case and keep that resource as a suspect in other cases, specify both the fault event UUID and
the resource FMRI or both the UUID and the resource label, as shown in the following example:
example: fmadm acquit /SYS/MB <UUID>
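The operand rules for the three subcommands can be summarized in a small helper that builds the command line without running it. This is a sketch in Python, assuming the rules exactly as described in this section; the function name fmadm_argv and its parameters are hypothetical, and the resource is placed before the UUID to match the example above.

```python
from typing import List, Optional


def fmadm_argv(subcommand: str,
               resource: Optional[str] = None,
               event_uuid: Optional[str] = None) -> List[str]:
    """Build an fmadm repair command as an argument list (nothing is run)."""
    if subcommand not in ("replaced", "repaired", "acquit"):
        raise ValueError("unsupported subcommand: " + subcommand)
    if subcommand != "acquit":
        # The UUID can be used only with the acquit subcommand.
        if event_uuid is not None:
            raise ValueError("a UUID is accepted only by 'acquit'")
        if resource is None:
            raise ValueError(subcommand + " needs an FMRI or label")
        return ["fmadm", subcommand, resource]
    if resource is None and event_uuid is None:
        raise ValueError("acquit needs a UUID, a resource, or both")
    argv = ["fmadm", "acquit"]
    if resource is not None:
        argv.append(resource)    # resource first, as in the example above
    if event_uuid is not None:
        argv.append(event_uuid)
    return argv


# "<UUID>" is a placeholder, as in the examples in this document.
print(" ".join(fmadm_argv("replaced", "/SYS/MB")))
print(" ".join(fmadm_argv("acquit", "/SYS/MB", "<UUID>")))
```

Returning a list instead of a single string keeps each operand separate, which makes it easier to review the command before typing it or passing it to another tool.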
2) Clearing Faults from the ILOM Fault Management Shell
i) While logged into the Oracle ILOM/SP, start the Fault Management Shell
-> start /SP/faultmgmt/shell
ii) View faulted components
faultmgmtsp> fmadm faulty
iii) For each fault listed, type one of the following fmadm commands to manually clear a fault:
a) fmadm replaced [ fru|cru ]
A suspect component has been replaced or removed.
example: fmadm replaced /SYS/MB
b) fmadm repaired [ fru|cru ]
A suspect component has been physically repaired to resolve the reported problem. For example, a component has been
reseated or a bent pin has been fixed.
example: fmadm repaired /SYS/MB
c) fmadm acquit [ fru|cru ] [ uuid ]
A suspect component or uuid resource is not the cause of the problem. Where [ fru|cru ] [ uuid ] appears, type the system
path to the suspect chassis FRU or CRU, or type the associated universally unique identifier ( uuid ) for the resource
reported in the problem.
example: fmadm acquit <UUID>
Acquit by fru/cru or label with no UUID only if you determine that the resource is not a factor in any current cases in which
that resource is a suspect. If multiple faults are currently reported against one FRU, the FRU shows as acquitted in all
cases.
example: fmadm acquit /SYS/MB
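Steps i) through iii) can be written out as an ordered list of lines to type at the SP. The Python sketch below only assembles that list and does not connect to an SP; the names ALLOWED and ilom_session are hypothetical, and any confirmation prompt the SP may show when starting the shell is not modeled.

```python
ALLOWED = ("replaced", "repaired", "acquit")


def ilom_session(clears):
    """Return the lines to type for steps i) to iii), in order.

    `clears` is a list of (subcommand, target) pairs, where target is a
    FRU/CRU path such as /SYS/MB or, for acquit only, a UUID.
    """
    lines = ["start /SP/faultmgmt/shell",  # step i, typed at the -> prompt
             "fmadm faulty"]               # step ii, view faulted components
    for subcommand, target in clears:      # step iii, one line per fault
        if subcommand not in ALLOWED:
            raise ValueError("unsupported subcommand: " + subcommand)
        lines.append("fmadm " + subcommand + " " + target)
    return lines


for line in ilom_session([("repaired", "/SYS/MB")]):
    print(line)
```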
NOTE: Do not use 'fmadm faulty -a' to determine whether there are any currently active faults. When you specify the -a
option, all resource information cached by the Fault Manager is listed, including faults that have already been
corrected or where no recovery action is needed (see the 'fmadm' man page). The listings also include information for
resources that may no longer be present in the system.
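A script that checks for active faults can guard against the -a option. The following Python sketch shows one way to do that; the function name faulty_argv is hypothetical, it only builds the argument list, and any flag other than -a is passed through without being validated.

```python
def faulty_argv(extra_flags=()):
    """Build the 'fmadm faulty' listing command, refusing the -a option."""
    for flag in extra_flags:
        # Catch -a on its own and inside a combined short flag such as -av.
        if flag.startswith("-") and not flag.startswith("--") and "a" in flag[1:]:
            raise ValueError("do not use -a when checking for active faults")
    return ["fmadm", "faulty", *extra_flags]


print(" ".join(faulty_argv()))  # prints "fmadm faulty"
```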
REFERENCES
NOTE:1004229.1 - How to Clear FMA Faults From Solaris[TM] and SC (System Controller) on T1000/T2000
T5120/T5220/T5140/T5240/T5440,T6320,T6340, T3-1/T3-2/T3-4, T4-1/T4-2/T4-4
NOTE:1309092.1 - How to Use the Oracle ILOM 3.x Fault Management Shell
NOTE:1643464.1 - [SPARC T3/T4/T5 and T7] OBP reports "One or more resources have been retired, please run 'show
faulty' on the SP" on console