IBM PowerHA SystemMirror for AIX
Standard Edition
Version 7.1
Troubleshooting PowerHA SystemMirror
Note
Before using this information and the product it supports, read the information in “Notices” on page 89.
This edition applies to IBM PowerHA SystemMirror 7.1 Standard Edition for AIX and to all subsequent releases and
modifications until otherwise indicated in new editions.
© Copyright IBM Corporation 2010, 2017.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents

About this document  v
   Highlighting  v
   Case-sensitivity in AIX  v
   ISO 9000  v
   Related information  v
Troubleshooting PowerHA SystemMirror  1
   What's new in Troubleshooting PowerHA SystemMirror  1
   Troubleshooting PowerHA SystemMirror clusters  1
      Becoming aware of the problem  2
      Determining a problem source  3
      Stopping the cluster manager  3
      Using the AIX data collection utility  3
      Using PowerHA SystemMirror diagnostic utilities  4
      Verifying expected behavior  4
   Problem determination tools  4
   Sample custom scripts  9
   Using cluster log files  10
   System components  32
      Investigating system components  32
      Checking highly available applications  32
      Checking the PowerHA SystemMirror layer  32
      Checking the logical volume manager  38
      Checking the TCP/IP subsystem  43
      Checking the AIX operating system  46
      Checking physical networks  46
      Checking disks and disk adapters  47
      Checking the cluster communications daemon  47
      Checking system hardware  48
   PowerHA SystemMirror installation issues  48
   Solving common problems  49
      PowerHA SystemMirror startup issues  50
      Disk and file system issues  54
      Network and switch issues  61
      Cluster communications issues  67
      PowerHA SystemMirror takeover issues  68
      Client issues  72
      Miscellaneous issues  73
Notices  89
   Privacy policy considerations  91
   Trademarks  91
Index  93
Highlighting
The following highlighting conventions are used in this document:
Bold
    Identifies commands, subroutines, keywords, files, structures, directories, and other items whose names are predefined by the system. Also identifies graphical objects such as buttons, labels, and icons that the user selects.
Italics
    Identifies parameters whose actual names or values are to be supplied by the user.
Monospace
    Identifies examples of specific data values, examples of text similar to what you might see displayed, examples of portions of program code similar to what you might write as a programmer, messages from the system, or information you should actually type.
Case-sensitivity in AIX
Everything in the AIX operating system is case-sensitive, which means that it distinguishes between
uppercase and lowercase letters. For example, you can use the ls command to list files. If you type LS, the
system responds that the command is not found. Likewise, FILEA, FiLea, and filea are three distinct file
names, even if they reside in the same directory. To avoid causing undesirable actions to be performed,
always ensure that you use the correct case.
ISO 9000
ISO 9000 registered quality systems were used in the development and manufacturing of this product.
Related information
- The IBM® PowerHA SystemMirror Standard Edition for 7.1.1 for AIX Update Redbooks® publication.
- The PowerHA SystemMirror PDF documents, available in the IBM Knowledge Center.
- The PowerHA SystemMirror release notes, located in the following locations:
  - PowerHA SystemMirror Standard Edition for AIX: /usr/es/sbin/cluster/release_notes
  - PowerHA SystemMirror for Smart Assists: /usr/es/sbin/cluster/release_notes_assist
In this PDF file, you might see revision bars (|) in the left margin that identify new and changed
information.
August 2017
December 2015
Added information about Tivoli® System Automation for Multiplatform on a node that already has
PowerHA SystemMirror installed in the “Troubleshooting PowerHA SystemMirror and Tivoli System
Automation for Multiplatform” on page 49 topic.
Typically, a functioning PowerHA SystemMirror cluster requires minimal intervention. If a problem does
occur, diagnostic and recovery skills are essential. Therefore, troubleshooting requires that you identify
the problem quickly and apply your understanding of the PowerHA SystemMirror software to restore the
cluster to full operation.
Note: These topics present the default locations of log files. If you redirected any logs, check the
appropriate location.
Related concepts:
“Using cluster log files” on page 10
These topics explain how to use the PowerHA SystemMirror cluster log files to troubleshoot the cluster.
Included also are some sections on managing parameters for some of the logs.
You can also be notified of a cluster problem through mail notification or through remote notification (pager notification and text messaging):
- Mail notification. Although the standard PowerHA SystemMirror components do not send mail to the system administrator when a problem occurs, you can create a mail notification method as a pre- or post-event that runs before or after an event script executes. In a PowerHA SystemMirror cluster environment, mail notification is effective and highly recommended.
- Remote notification. Through the SMIT interface, you can also define a notification method (a numeric or alphanumeric page, or a text message to any address, including a cell phone) to issue a customized response to a cluster event.
  - Pager notification. You can send messages to a pager number on a given event. You can send textual information to pagers that support text display (alphanumeric page) and numerical messages to pagers that display only numbers.
  - Text messaging. You can send cell phone text messages by using a standard data modem and a telephone land line through the standard Telocator Alphanumeric Protocol (TAP). Your provider must support this service.
    You can also issue a text message by using a Falcom-compatible GSM modem to transmit SMS (Short Message Service) text-message notifications wirelessly. SMS messaging requires an account with an SMS service provider. GSM modems accept the TAP modem protocol as input through an RS-232 or USB line and send the message wirelessly to the provider's cell phone tower; the provider forwards the message to the addressed cell phone. Each provider has a Short Message Service Center (SMSC).
For each person, define remote notification methods that contain all the events and nodes so you can
switch the notification methods as a unit when responders change.
Note: Manually distribute each message file to each node. PowerHA SystemMirror does not
automatically distribute the file to other nodes during synchronization unless the File Collections utility is
set up specifically to do so.
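To illustrate the mail notification approach described above, a minimal post-event script might look like the following sketch. The recipient address, the message wording, and the assumption that the event name arrives as the first argument are examples only, not part of the product:

#!/bin/ksh
# Hypothetical post-event mail notification script (example only).
# Assumption: the cluster passes the event name as the first argument.
EVENT=$1
NODE=$(hostname)
mail -s "PowerHA SystemMirror event ${EVENT} on ${NODE}" admin@example.com <<EOF
Event ${EVENT} was processed on node ${NODE} at $(date).
Check /var/hacmp/log/hacmp.out for details.
EOF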
The PowerHA SystemMirror system generates descriptive messages when the scripts it executes (in
response to cluster events) start, stop, or encounter error conditions. In addition, the daemons that make
up a PowerHA SystemMirror cluster generate messages when they start, stop, encounter error conditions,
or change state. The PowerHA SystemMirror system writes these messages to the system console and to
one or more cluster log files. Errors might also be logged to the system error log, which you can view with the errpt command.
Related concepts:
“Using cluster log files” on page 10
These topics explain how to use the PowerHA SystemMirror cluster log files to troubleshoot the cluster.
Included also are some sections on managing parameters for some of the logs.
Related information:
Planning PowerHA SystemMirror
If a problem with PowerHA SystemMirror is detected, perform the following actions for initial problem analysis (a command sketch follows this list):
1. Collect a PowerHA SystemMirror snapshot with the snap -e command. Do this as soon as possible after the problem is detected, because the collected log files cover only a limited time window around the error.
2. Establish the state of the cluster and resource groups by using the /usr/es/sbin/cluster/clstat and /usr/es/sbin/cluster/utilities/clRGinfo commands.
3. If an event error occurred, inspect the /var/hacmp/log/hacmp.out file to locate the error. If an AIX command failed, proactively collect further debug data for the corresponding AIX component by using the snap command. The most commonly requested flag for further PowerHA SystemMirror problem determination is snap -egGtL.
4. Consult the /var/hacmp/log/clutils.log and /var/hacmp/log/autoverify.log files for the result of the most recent cluster verification, and run cluster verification.
5. If a C-SPOC command failed, consult the /var/hacmp/log/cspoc.log file.
6. Verify network connectivity between nodes.
7. Inspect the error log (errpt -a) to establish whether errors were logged in the time window of the failure.
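A minimal sketch of the data-collection portion of this sequence, run as root on the affected node (the paths and flags are the defaults cited above):

# Collect the PowerHA SystemMirror snapshot first, while the logs still
# cover the time window of the error.
snap -e
# If an AIX command failed, the broader collection is often requested instead:
#   snap -egGtL
# Check the AIX error log for entries around the time of the failure.
errpt -a | more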
You can also stop the cluster manager process after stopping cluster services with the "unmanage resource
groups" option. This option leaves the resources active but not monitored on the node. You can then begin
the troubleshooting procedure.
If all else fails, stop the PowerHA SystemMirror cluster services on all cluster nodes. Then, manually start
the application that the PowerHA SystemMirror cluster event scripts were attempting to start and run the
application without the PowerHA SystemMirror software. This may require varying on volume groups,
mounting file systems, and enabling IP addresses. With the PowerHA SystemMirror cluster services
stopped on all cluster nodes, correct the conditions that caused the initial problem.
The -e flag collects data that aids IBM support in troubleshooting a problem with PowerHA SystemMirror and its interaction with other components. In particular, the -e flag collects all log files of PowerHA SystemMirror utilities, the ODMs maintained by PowerHA SystemMirror, some AIX ODMs, and the most commonly required AIX configuration data (such as LVM, TCP/IP, and installp information). The snap -e command runs /usr/sbin/rsct/bin/ctsnap, which collects data for RSCT Group Services.
Collect the PowerHA SystemMirror snapshot as soon as possible after a problem is encountered with PowerHA SystemMirror, to ensure that the data pertaining to the time window of the error is still contained in the log files.
The snap -e command relies on the Cluster Communications daemon subsystem (clcomd) to collect data. If this subsystem is affected by an error, the snap -e command might fail; in this case, collect the necessary data manually on all cluster nodes.
For more information on the snap command, see the AIX Version 6.1 Commands Reference, Volume 5.
The key PowerHA SystemMirror diagnostic tools (in addition to the cluster logs and messages) include the following; usage examples follow this list:
- clRGinfo, which provides information about resource groups for troubleshooting purposes.
- clstat, which reports the status of key cluster components: the cluster itself, the nodes in the cluster, the network interfaces connected to the nodes, the service labels, and the resource groups on each node.
- cldisp, which displays resource groups and their startup, fallover, and fallback policies.
- The SMIT Problem Determination Tools; for more information, see the section Problem determination tools.
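For example, the status utilities can be run directly from the command line. The clstat flags shown here request a single ASCII status report; adjust them to your needs:

# Location and state of all resource groups
/usr/es/sbin/cluster/utilities/clRGinfo
# One-time ASCII report of cluster, node, interface, and group status
/usr/es/sbin/cluster/clstat -a -o
# Resource groups with their startup, fallover, and fallback policies
/usr/es/sbin/cluster/utilities/cldisp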
You can still specify in SMIT that the logs be collected if you want them. Skipping log collection reduces the size of the snapshot and the running time of the snapshot utility.
The SMIT Problem Determination Tools menu includes the options offered by the cluster snapshot utility to help you diagnose and solve problems.
Related concepts:
“Problem determination tools”
You can use the SMIT interface to help you troubleshoot problems with PowerHA SystemMirror.
Related information:
Monitoring a PowerHA SystemMirror cluster
Saving and restoring cluster configurations
If the applications are not up and running, you might need to look elsewhere to identify problems affecting your cluster. This document describes ways in which you can locate potential problems.
You can use the following tools to troubleshoot PowerHA SystemMirror. To access the following tools,
enter smit sysmirror from the command line and select Problem Determination Tools.
PowerHA SystemMirror Verification
You can use this tool to verify that the configuration on all nodes is synchronized, set up a
custom verification method, or set up automatic cluster verification.
View Current State
You can use this tool to display the state of the nodes, communication interfaces, resource groups,
and the local event summary for the last five events.
You can verify cluster topology resources and custom-defined verification methods.
The cluster verification utility runs on one user-selectable PowerHA SystemMirror cluster node once
every 24 hours.
By default, the first node in alphabetical order runs the verification at midnight. During verification, any errors that might cause problems at some point in the future are displayed. You can change the defaults by selecting a node and time that suit your configuration.
This information is stored on every available cluster node in the PowerHA SystemMirror log file /var/hacmp/log/clutils.log. If the selected node became unavailable or could not complete cluster verification, you can detect this by the lack of a report in the /var/hacmp/log/clutils.log file.
If cluster verification completes and detects configuration errors, you are notified about the potential problems in the following ways:
- The exit status of cluster verification is communicated across the cluster, along with information about the completion of the cluster verification process.
- Broadcast messages are sent across the cluster and displayed on stdout. These messages inform you about detected configuration errors.
- A cluster_notify event runs on the cluster and is logged in hacmp.out (if cluster services are running).
More detailed information is available on the node that completes cluster verification in /var/hacmp/clverify/clverify.log. If a failure occurs during processing, error messages and warnings clearly indicate the node and the reasons for the verification failure.
You can configure the node, and specify the time, at which cluster verification runs automatically.
Make sure the /var file system on the node has enough space for the /var/hacmp/log/clutils.log file.
To configure the node and specify the time at which cluster verification runs automatically:
1. From the command line, enter smit sysmirror.
2. From the SMIT interface, select Problem Determination Tools > PowerHA SystemMirror Verification
> Automatic Cluster Configuration Monitoring, and press Enter.
3. Enter field values as follows:
Table 3. Automatic Cluster Configuration Monitoring fields

* Automatic cluster configuration verification
    Enabled is the default.
Node name
    Select one of the cluster nodes from the list. By default, the first node in alphabetical order verifies the cluster configuration. This node is determined dynamically every time the automatic verification occurs.
* HOUR (00 - 23)
    Midnight (00) is the default. Verification runs automatically once every 24 hours at the selected hour.
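After automatic verification runs, you can confirm from the command line that it left a recent report and scan it for problems. This sketch assumes the default log location described above:

# Verify that /var has room for the verification log to grow
df -g /var
# Confirm that a report was written (check the modification time)
ls -l /var/hacmp/log/clutils.log
# Scan the report for errors or warnings
grep -iE "error|warning" /var/hacmp/log/clutils.log | more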
For example, if script failure occurs because a filesystem mount failed, you can correct the problem,
mount the filesystem manually, then use this option to complete the rest of the cluster event processing.
The Recover From PowerHA SystemMirror Script Failure menu option sends a signal to the Cluster Manager daemon (clstrmgrES) on the specified node, causing it to proceed to the next step in the cluster event. If a subsequent event failure occurs, you must repeat the process of correcting the problem and then using the Recover From PowerHA SystemMirror Script Failure option to continue to the next step. Continue this process until the cluster state goes to "stable".
Make sure that you fix the problem that caused the script failure. You need to manually complete the remaining steps that followed the failure in the event script (see /var/hacmp/log/hacmp.out). Then, to resume clustering, complete the following steps to bring the PowerHA SystemMirror event script state to EVENT COMPLETED:
1. Enter smit hacmp
2. In SMIT, select Problem Determination Tools > Recover From PowerHA SystemMirror Script
Failure.
3. Select the IP label/address for the node on which you want to run the clruncmd command and press
Enter. The system prompts you to confirm the recovery attempt. The IP label is listed in the /etc/hosts
file and is the name assigned to the service IP address of the node on which the failure occurred.
4. Press Enter to continue. Another SMIT panel appears to confirm the success of the script recovery.
Select this option from the Problem Determination Tools menu to automatically save any of your changes in the Configuration Database as a snapshot with the path /usr/es/sbin/cluster/snapshots/UserModifiedDB before restoring the Configuration Database with the values actively being used by the Cluster Manager.
If a cron job executes in conjunction with a resource or application, it is useful to have that cron entry
fallover along with the resource. It may also be necessary to remove that cron entry from the cron table if
the node no longer possesses the related resource or application.
The following example shows one way to use a customized script to do this:
The example cluster is a two-node hot-standby cluster where node1 is the primary node and node2 is the backup. Node1 normally owns the shared resource group and application. The application requires that a cron job be executed once per day, but only on the node that currently owns the resources.
To ensure that the job will run even if the shared resource group and application fall over to node2,
create two files as follows:
1. Assuming that the root user is executing the cron job, create two files, for example root.resource and root.noresource (illustrative names), in a directory on a non-shared file system on node1. Make these files resemble the cron tables that reside in the directory /var/spool/cron/crontabs.
   The root.resource table should contain all normally executed system entries, plus all entries pertaining to the shared resource or application.
   The root.noresource table should contain all normally executed system entries, but no entries pertaining to the shared resource or application.
2. Copy the files to the other node so that both nodes have a copy of the two files.
3. On both systems, run the following command at system startup:
   crontab root.noresource
   This ensures that the cron table for root has only the "no resource" entries at system startup.
4. You can use either of two methods to activate the root.resource cron table. The first method is the simpler of the two; a sketch of it follows this list.
   - Run crontab root.resource as the last line of the application start script. In the application stop script, the first line should then be crontab root.noresource. By executing these commands in the application start and stop scripts, you ensure that the tables are activated and deactivated on the proper node at the proper time.
   - Run the crontab commands as post-events to node_up_complete and node_down_complete:
     - Upon node_up_complete on the primary node, run crontab root.resource.
     - Upon node_down_complete, run crontab root.noresource.
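A sketch of the first method, using the example table names from step 1 (the directory /usr/local/cluster is an assumption):

# At the end of the application server start script:
/usr/bin/crontab /usr/local/cluster/root.resource    # activate shared-resource entries
# At the beginning of the application server stop script:
/usr/bin/crontab /usr/local/cluster/root.noresource  # revert to the "no resource" table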
The print spooling system consists of two directories: /var/spool/qdaemon and /var/spool/lpd/qdir. One directory contains the data (content) of each job; the other contains the files describing the print job itself. When jobs are queued, there are files in each of the two directories. In the event of a fallover, these directories do not normally fall over, and the print jobs are therefore lost.
The solution to this problem is to define two file systems on a shared volume group; you might call these file systems /prtjobs and /prtdata. When PowerHA SystemMirror starts, these file systems are mounted over /var/spool/lpd/qdir and /var/spool/qdaemon.
Write a script to perform this operation as a post-event to node_up. The script should do the following:
1. Stop the print queues.
2. Stop the print queue daemon.
3. Mount /prtjobs over /var/spool/lpd/qdir.
4. Mount /prtdata over /var/spool/qdaemon.
5. Restart the print queue daemon.
6. Restart the print queues.
In the event of a fallover, the surviving node needs to do the following:
1. Stop the print queues.
2. Stop the print queue daemon.
3. Move the contents of /prtjobs into /var/spool/lpd/qdir.
4. Move the contents of /prtdata into /var/spool/qdaemon.
5. Restart the print queue daemon.
6. Restart the print queues.
To do this, write a script that is called as a post-event to node_down_complete on the takeover node. The script needs to determine whether the node_down is from the primary node; a minimal sketch follows.
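The following sketch assumes that the failed node's name is passed as the first argument and that all queues should be cycled; both are assumptions to adapt to your cluster and event-parameter conventions:

#!/bin/ksh
# Hypothetical node_down_complete post-event script on the takeover node.
PRIMARY=node1                         # assumed name of the primary node
[[ "$1" = "$PRIMARY" ]] || exit 0     # act only if the primary went down
disable $(lsallq)                     # stop all print queues
stopsrc -s qdaemon                    # stop the print queue daemon
mv /prtjobs/* /var/spool/lpd/qdir/    # recover the queued job descriptions
mv /prtdata/* /var/spool/qdaemon/     # recover the queued job data
startsrc -s qdaemon                   # restart the print queue daemon
enable $(lsallq)                      # restart the print queues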
For most troubleshooting, the /var/hacmp/log/hacmp.out file is the most helpful log file. Resource group handling has been enhanced in recent releases, and the hacmp.out file has been expanded to capture more information on the activity and location of resource groups after cluster events. For instance, the hacmp.out file captures details of resource group parallel processing that other logs (such as the cluster history log) cannot show.
The PowerHA SystemMirror software writes the messages it generates to the system console and to
several log files. Each log file contains a different subset of messages generated by the PowerHA
SystemMirror software. When viewed as a group, the log files provide a detailed view of all cluster
activity.
The following list describes the log files into which the PowerHA SystemMirror software writes messages
and the types of cluster messages they contain. The list also provides recommendations for using the
different log files. Note that the default log directories are listed here; you have the option of redirecting
some log files to a chosen directory. If you have redirected any logs, check the appropriate location.
Table 4. Cluster message log files

system error log
    Contains time-stamped, formatted messages from all AIX subsystems, including scripts and daemons. For information about viewing this log file and interpreting the messages it contains, see the section Understanding the system error log.
    Recommended use: Because the system error log contains time-stamped messages from many other system components, it is a good place to correlate cluster events with system events.

/tmp/clconvert.log
    Contains a record of the conversion progress when upgrading to a recent PowerHA SystemMirror release. The installation process runs the cl_convert utility and creates the /tmp/clconvert.log file.
    Recommended use: View the clconvert.log file to gauge conversion success when running cl_convert from the command line.

/var/ha/log/grpglsm
    Contains time-stamped messages in ASCII format. These track the execution of internal activities of the RSCT Group Services Globalized Switch Membership daemon. IBM support personnel use this information for troubleshooting. The file gets trimmed regularly; save it promptly if there is a chance you may need it.

/var/hacmp/adm/cluster.log
    Contains time-stamped, formatted messages generated by PowerHA SystemMirror scripts and daemons.
    Recommended use: Because this log file provides a high-level view of current cluster status, check this file first when diagnosing a cluster problem.

/var/hacmp/adm/history/cluster.mmddyyyy
    Contains time-stamped, formatted messages generated by PowerHA SystemMirror scripts. The system creates a cluster history file every day, identifying each file by its file name extension, where mm indicates the month, dd indicates the day, and yyyy the year. For information about viewing this log file and interpreting its messages, see the section Understanding the cluster history log file.
    Recommended use: Use the cluster history log files to get an extended view of cluster behavior over time. Note that this log is not a good tool for tracking resource groups processed in parallel. In parallel processing, certain steps formerly run as separate events are processed differently and will not be evident in the cluster history log. Use the hacmp.out file to track parallel processing activity.

/var/log/clcomd/clcomddiag.log
    Contains time-stamped, formatted, diagnostic messages generated by clcomd.
    Recommended use: Information in this file is for IBM Support personnel.

/var/hacmp/log/autoverify.log
    Contains any warnings or errors that occurred during the automatic cluster verification run.

/var/hacmp/log/clavan.log
    Contains the state transitions of applications managed by PowerHA SystemMirror; for example, when each application managed by PowerHA SystemMirror is started or stopped, and when the node on which an application is running stops. Each node has its own instance of the file. Each record in the clavan.log file consists of a single line that contains a fixed portion and a variable portion.
    Recommended use: By collecting the records in the clavan.log file from every node in the cluster, a utility program can determine how long each application has been up, as well as compute other statistics describing application availability time.

/var/hacmp/log/clinfo.log, clinfo.log.n, n=1,..,7
    You can install Client Information (Clinfo) services on both client and server systems. Client systems will not have any HACMP ODMs (for example, HACMPlogs) or utilities (for example, clcycle); therefore, Clinfo logging will not take advantage of cycling or redirection. The default debug level is 0, or "off". You can enable logging by using command line flags; use the clinfo -l flag to change the log file name.

/var/hacmp/log/clstrmgr.debug, clstrmgr.debug.n, n=1,..,7
    Contains time-stamped, formatted messages generated by the clstrmgrES subsystem. The default messages are verbose and are typically adequate for troubleshooting most problems; however, IBM support may direct you to enable additional debugging.
    Recommended use: Information in this file is for IBM Support personnel.

/var/hacmp/log/clstrmgr.debug.long, clstrmgr.debug.long.n, n=1,..,7
    Contains high-level logging of cluster manager activity: in particular, its interaction with other components of PowerHA SystemMirror and with RSCT, which event is currently being run, and information about resource groups (for example, their state and actions to be performed, such as acquiring or releasing them during an event).
    Recommended use: Information in this file is for IBM Support personnel.

/var/hacmp/log/cspoc.log
    Contains time-stamped, formatted messages generated by PowerHA SystemMirror C-SPOC commands. The cspoc.log file resides on the node that invokes the C-SPOC command.
    Recommended use: Use the C-SPOC log file when tracing a C-SPOC command's execution on cluster nodes.

/var/hacmp/log/cspoc.log.long
    Contains high-level logging for the C-SPOC utility: the commands and utilities that have been invoked by C-SPOC on specified nodes, and their return status.

/var/hacmp/log/cspoc.log.remote
    Contains logging of the execution of C-SPOC commands on remote nodes with the ksh xtrace option enabled (set -x).

/var/hacmp/log/hacmp.out, hacmp.out.n, n=1,..,7
    Contains time-stamped, formatted messages generated by PowerHA SystemMirror scripts on the current day. In verbose mode (recommended), this log file contains a line-by-line record of every command executed by scripts, including the values of all arguments to each command. An event summary of each high-level event is included at the end of each event's details. For information about viewing this log and interpreting its messages, see the section Understanding the hacmp.out log file.
    Recommended use: Because the information in this log file supplements and expands upon the information in the /var/hacmp/adm/cluster.log file, it is the primary source of information when investigating a problem.

/var/hacmp/log/oraclesa.log
    Contains information about any Oracle-specific errors that occur when using the Oracle Smart Assist; used by the Oracle Smart Assist.

/var/hacmp/log/sa.log
    Contains information about any general errors that occur when using the Smart Assists; used by the Smart Assist infrastructure.

/var/log/clcomd/clcomd.log
    Contains time-stamped, formatted messages generated by Cluster Communications daemon (clcomd) activity. The log shows information about incoming and outgoing connections, both successful and unsuccessful. It also displays a warning if the file permissions for /usr/es/sbin/cluster/etc/rhosts are not set correctly; users on the system should not be able to write to the file.
    Recommended use: Use this file to troubleshoot communication problems of PowerHA SystemMirror utilities.

/var/hacmp/log/clconfigassist.log
    Contains debugging information for the Two-Node Cluster Configuration Assistant. The Assistant stores up to ten copies of the numbered log files to assist with troubleshooting activities.

/var/hacmp/clverify/clverify.log
    Contains the verbose messages output by the cluster verification utility. The messages indicate the node(s), devices, command, and so on, in which any verification error occurred. It also contains information for the file collection utility, the Two-Node Cluster Configuration Assistant, and the Cluster Test Tool.

/var/hacmp/log/cl_testtool.log
    Includes excerpts from the hacmp.out file. The Cluster Test Tool saves up to three log files and numbers them so that you can compare the results of different cluster tests. The tool also rotates the files, with the oldest file being overwritten.
Related reference:
“Understanding the cluster.log file”
The /var/hacmp/adm/cluster.log file is a standard text file. When checking this file, first find the most recent error message associated with your problem. Then read back through the log file to the first message relating to that problem. Many error messages cascade from an initial error that usually indicates the problem source.
“Understanding the cluster history log file” on page 22
The cluster history log file is a standard text file with the system-assigned name /usr/es/sbin/cluster/history/cluster.mmddyyyy, where mm indicates the month, dd indicates the day in the month, and yyyy indicates the year.
“Understanding the hacmp.out log file” on page 14
The /var/hacmp/log/hacmp.out file is a standard text file. The system cycles the hacmp.out log file seven times. Each copy is identified by a number appended to the file name. The most recent log file is named /var/hacmp/log/hacmp.out; the oldest version of the file is named /var/hacmp/log/hacmp.out.7.
Related information:
Upgrading a PowerHA SystemMirror cluster
Verifying and synchronizing a PowerHA SystemMirror cluster
The /var/hacmp/adm/cluster.log file is a standard text file. When checking this file, first find the most recent error message associated with your problem. Then read back through the log file to the first message relating to that problem. Many error messages cascade from an initial error that usually indicates the problem source.
For example, an entry in this log might indicate that the Cluster Information program (clinfoES) stopped running on the node named nodeA at 5:25 P.M. on March 3.
Because the /var/hacmp/adm/cluster.log file is a standard ASCII text file, you can view it using standard AIX file commands, such as the more or tail commands. However, you can also use the SMIT interface. The following sections describe each of the options.
Note: You can either scan the contents of the cluster.log file as it exists, or watch an active log file as new events are appended to it in real time. Typically, you scan the file to try to find a problem that has already occurred, and you watch the file as you test a solution to a problem to determine the results.
The /var/hacmp/log/hacmp.out file is a standard text file. The system cycles the hacmp.out log file seven times. Each copy is identified by a number appended to the file name. The most recent log file is named /var/hacmp/log/hacmp.out; the oldest version of the file is named /var/hacmp/log/hacmp.out.7.
Given the recent changes in the way resource groups are handled and prioritized in fallover circumstances, the hacmp.out file contains event summaries that are useful in tracking the activities and location of your resource groups.
When a cluster event runs longer than expected, a config_too_long warning message is added to hacmp.out. This can occur if there is an event script failure or if a system command hangs or is simply running slowly. You can customize the wait period before the warning message appears. Because this setting affects how often the config_too_long message is posted to the log, the config_too_long console message may not be evident in every case where a problem exists.
When checking the hacmp.out file, search for EVENT FAILED messages. These messages indicate that a failure has occurred. Then, starting from the failure message, read back through the log file to determine exactly what went wrong. The hacmp.out log file provides the most important source of information when investigating a problem.
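For example, to list failed events quickly (with line numbers, so you can read back from each match in the full log):

grep -n "EVENT FAILED" /var/hacmp/log/hacmp.out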
When a cluster event processes resource groups with dependencies or replicated resources, an event preamble is included in the hacmp.out file.
This preamble shows the sequence of events through which the cluster manager plans to bring the resource groups online on the correct nodes and sites, taking into account the individual group dependencies and the site configuration.
Note: The preamble represents the sequence of events the cluster manager enqueues during the planning stage of the event. When an individual event fails, or the cluster manager recalculates the plan for any reason, a new preamble is generated. Not all events in the original preamble are necessarily run.
Event summaries:
Event summaries that appear at the end of each event's details make it easier to check the hacmp.out file for errors. The event summaries contain pointers back to the corresponding event, which allow you to easily locate the output for any event.
See the section Non-verbose and verbose output of the hacmp.out log file for an example of the output.
You can also view a compilation of only the event summary sections pulled from current and past hacmp.out files. The option for this display is found on the Problem Determination Tools > PowerHA SystemMirror Log Viewing and Management > View/Save/Remove Event Summaries > View Event Summaries SMIT panel. For more detail, see the section Viewing compiled hacmp.out event summaries.
Related reference:
“Viewing compiled hacmp.out event summaries” on page 20
In the hacmp.out file, event summaries appear after those events that are initiated by the Cluster Manager, for example, node_up and node_up_complete and related subevents such as node_up_local and node_up_remote_complete.
“Non-verbose and verbose output of the hacmp.out log file” on page 17
You can select either verbose or non-verbose output.
You can view the hacmp.out log file in HTML format by setting formatting options on the Problem Determination Tools > PowerHA SystemMirror Log Viewing and Management > Change/Show PowerHA SystemMirror Log File Parameters SMIT panel.
Reported resource group acquisition failures (failures indicated by a non-zero exit code returned by a command) are tracked in hacmp.out.
The hacmp.out file, event summaries, and clstat include information and messages about resource groups in the ERROR state that attempted to come online on a joining node or on a node that is starting up. Similarly, you can trace the cases in which the acquisition of such a resource group has failed and PowerHA SystemMirror launched an rg_move event to move the resource group to another node in the node list. If, as a result of consecutive rg_move events through the nodes, a non-concurrent resource group still failed to be acquired, PowerHA SystemMirror adds a message to the hacmp.out file.
When you add a network interface on a network, the actual event that runs in this case is called join_interface. This is reflected in the hacmp.out file.
Similarly, when a network interface failure occurs, the actual event that runs is called fail_interface. This is also reflected in the hacmp.out file. Remember that the event being run in this case simply indicates that a network interface on the given network has failed.
The hacmp.out file allows you to fully track how resource groups have been processed in PowerHA SystemMirror.
This topic provides a brief description; for detailed information and examples of event summaries with job types, see the section Tracking resource group parallel and serial processing in the hacmp.out file.
For each resource group that has been processed by PowerHA SystemMirror, the software sends the following information to the hacmp.out file:
- Resource group name
- Script name
- Name of the command that is being executed
In cases where an event script does not process a specific resource group (for instance, at the beginning of a node_up event), a resource group's name cannot be obtained. In this case, the resource group's name part of the tag is blank.
For example, the hacmp.out file may contain either of the following lines:
cas2:node_up_local[199] set_resource_status ACQUIRING
:node_up[233] cl_ssa_fence up stan
In addition, references to the individual resources in the event summaries in the hacmp.out file contain reference tags to the associated resource groups.
Related reference:
“Tracking resource group processing in the hacmp.out file” on page 23
Output to the hacmp.out file allows you to isolate details related to a specific resource group and its resources. Based on the content of the hacmp.out event summaries, you can determine whether or not the resource groups are being processed in the expected order.
For each cluster event that does not complete within the specified event duration time, config_too_long messages are logged in the hacmp.out file.
The messages are then sent to the console according to the following pattern:
- The first five config_too_long messages appear in the hacmp.out file at 30-second intervals.
- The next set of five messages appears at an interval that is double the previous interval, until the interval reaches one hour.
- These messages are logged every hour until the event completes or is terminated on that node.
You can customize the waiting period before a config_too_long message is sent.
Related information:
Planning for cluster events
Non-verbose output
In non-verbose mode, the hacmp.out log contains the start, completion, and error notification messages output by all PowerHA SystemMirror scripts.
Verbose output
In verbose mode, the hacmp.out file also includes the values of arguments and flag settings passed to the scripts and commands.
Some events (those initiated by the Cluster Manager) are followed by event summaries, as shown in
these excerpts:
....
Mar 25 [Link] EVENT COMPLETED: network_up alcuin tmssanet_alcuin_bede
CustomRG has a settling time configured. A lower priority node joins the cluster:
Mar 25 [Link] EVENT COMPLETED: node_up alcuin
CustomRG has a daily fallback timer configured to fall back on 22 hrs 10 minutes. The resource group is
on a lower priority node (bede). Therefore, the timer is ticking; the higher priority node (alcuin) joins the
cluster:
The message on bede
...
Setting the level and format of information recorded in the hacmp.out file:
You can set the level of information recorded in the /var/hacmp/log/hacmp.out file.
Note: If you set your formatting options for hacmp.out to Default (None), then no event summaries will be generated. For information about event summaries, see the section Viewing compiled hacmp.out event summaries.
6. To change the level of debug information, set the value of Cluster Manager debug level field to
either standard or high.
Related reference:
“Viewing compiled hacmp.out event summaries”
In the hacmp.out file, event summaries appear after those events that are initiated by the Cluster Manager, for example, node_up and node_up_complete and related subevents such as node_up_local and node_up_remote_complete.
In the hacmp.out file, event summaries appear after those events that are initiated by the Cluster Manager, for example, node_up and node_up_complete and related subevents such as node_up_local and node_up_remote_complete.
Note that event summaries do not appear for all events; for example, when you move a resource group
through SMIT.
The View Event Summaries option displays a compilation of all event summaries written to a node's hacmp.out file. This utility can gather and display this information even if you have redirected the hacmp.out file to a new location. You can also save the event summaries to a file of your choice instead of viewing them via SMIT.
Note: Event summaries pulled from the hacmp.out file are stored in the /usr/es/sbin/cluster/cl_event_summary.txt file. This file continues to accumulate as hacmp.out cycles, and is not automatically truncated or replaced. Consequently, it can grow too large and crowd your /usr directory. You should clear event summaries periodically, using the Remove Event Summary History option in SMIT.
This feature is node-specific. Therefore, you cannot access one node's event summary information from
another node in the cluster. Run the View Event Summaries option on each node for which you want to
gather and display event summaries.
The event summaries display is a good way to get a quick overview of what has happened in the cluster lately. If the event summaries reveal a problem event, you will probably want to examine the source hacmp.out file to see full details of what happened.
Note: If you have set your formatting options for hacmp.out to Default (None), then no event summaries will be generated, and the View Event Summaries command will yield no results.
The Problem Determination Tools > PowerHA SystemMirror Log Viewing and Management > View/Save/Remove PowerHA SystemMirror Event Summaries > View Event Summaries option gathers information from the hacmp.out log file, not directly from PowerHA SystemMirror while it is running. Consequently, you can access event summary information even when PowerHA SystemMirror is not running. The summary display is updated once per day with the current day's event summaries.
Note that clRGinfo displays resource group information more quickly when the cluster is running. If the
cluster is not running, wait a few minutes and the resource group information will eventually appear.
You can view a compiled list of event summaries on a node using SMIT.
You can store the compiled list of a node's event summaries to a file using SMIT.
Depending on the format you select (for example, .txt or .html), you can then move this file to view it in a text editor or browser.
The PowerHA SystemMirror software logs messages to the system error log whenever a daemon
generates a state message.
The PowerHA SystemMirror messages in the system error log follow the same format used by other AIX
subsystems. You can view the messages in the system error log in short or long format.
In short format, also called summary format, each message in the system error log occupies a single line. The fields in the short format of the system error log are described in the following table.

Table 7. System error log fields

Error_ID
    A unique error identifier.
Time stamp
    The day and time on which the event occurred.
T
    Error type: permanent (P), unresolved (U), or temporary (T).
CL
    Error class: hardware (H), software (S), or informational (O).
Resource_name
    A text string that identifies the AIX resource or subsystem that generated the message. PowerHA SystemMirror messages are identified by the name of their daemon.
Error_description
    A text string that briefly describes the error.
Unlike the PowerHA SystemMirror log files, the system error log is not a text file.
To view the AIX system error log, you must use the AIX SMIT:
1. Enter smit
2. In SMIT, select Problem Determination Tools > PowerHA SystemMirror Log Viewing and
Management > View Detailed PowerHA SystemMirror Log Files > Scan the PowerHA
SystemMirror for AIX System Log and press Enter.
SMIT displays the error log.
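The same entries can also be listed from the command line with the errpt command used earlier in this document for initial problem analysis, for example:

# One-line summary entries, most recent first
errpt | more
# Full detail for every entry
errpt -a | more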
The cluster history log file is a standard text file with the system-assigned name /usr/es/sbin/cluster/history/cluster.mmddyyyy, where mm indicates the month, dd indicates the day in the month, and yyyy indicates the year.
You should decide how many of these log files you want to retain and purge the excess copies on a
regular basis to conserve disk storage space. You may also decide to include the cluster history log file in
your regular system backup procedures.
The fields in the cluster history log file messages are described in the following table.

Table 8. Cluster history log file fields

Date and Time stamp
    The date and time at which the event occurred.
Message
    The text of the message.
Description
    The name of the event script.

Note: This log reports specific events. When resource groups are processed in parallel, certain steps previously run as separate events are processed differently, and therefore do not show up as events in the cluster history log file. To track parallel processing activity, use the hacmp.out file, which contains greater detail on resource group activity and location.
Because the cluster history log file is a standard text file, you can view its contents using standard AIX
file commands, such as cat, more, and tail. You cannot view this log file using SMIT.
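For example, to page through the history file for March 3, 2010 (an illustrative date):

more /usr/es/sbin/cluster/history/cluster.03032010
# or look at only the most recent entries
tail -50 /usr/es/sbin/cluster/history/cluster.03032010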
If you encounter a problem with PowerHA SystemMirror and report it to IBM support, you may be
asked to collect log files pertaining to the problem. In PowerHA SystemMirror, the Collect PowerHA
SystemMirror Log Files for Problem Reporting SMIT panel aids in this process.
CAUTION:
Use this panel only if requested by the IBM support personnel. If you use this utility without
direction from IBM support, be careful to fully understand the actions and the potential consequences.
PowerHA SystemMirror automatically manages cluster log files. The individual logs are limited to a
maximum size and are removed after they reach a certain age, or are overwritten by newer versions.
In general, PowerHA SystemMirror defaults to the following rules for all log files.
Table 10. General rules for log files

Maximum size
    Log files that are over 1 MB in size are cycled.
Maximum number of outdated logs
    No more than 7 prior versions of a file are preserved.
Maximum age
    Log files older than one day are cycled.
If you want to customize the values that are specified in the general rules, you can override them by
specifying different values in the /etc/environment file on each cluster node.
Several features in the hacmp.out file and in the event summaries might help you follow the flow of parallel resource group processing:
- Each line in the hacmp.out file flow includes the name of the resource group to which it applies.
- The event summary information includes details about all resource types.
In an event summary for resource groups processed in parallel (for example, cascrg1 and cascrg2), all processed resource groups are listed first, followed by the individual resources that are being processed.
When resource group dependencies or sites are configured in the cluster, check the event preamble, which lists the Cluster Manager's plan of action. This plan describes the process the resource groups follow for the prescribed events.
Execution of individual events is traced in the hacmp.out file. If there is a problem with an event, or it did not produce the expected results, certain patterns and keywords in hacmp.out can be used to identify the source of the problem.
The following information is provided for users who are interested in understanding the low-level details
of cluster event processing. It is not intended as a reference for use in primary problem determination.
If you have a problem with PowerHA SystemMirror Enterprise Edition, follow your local problem reporting and support procedures as a primary response.
The cluster manager uses an approach described as “parallel processing” for planning cluster events.
Parallel processing combines several different recovery steps in a single event in order to maximize the
efficiency and speed of event processing. With parallel processing, the process_resources event script is
used as a main event for processing different resources based on resource types. The process_resources
event uses a keyword “JOB_TYPE” to identify the resources currently being processed.
Job types are listed in the hacmp.out log file. This list helps you identify the sequence of events that takes place during the acquisition or release of different types of resources. Depending on the configuration of the cluster's resource groups, other specific job types can appear during parallel processing of resource groups.
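For example, to see which job types ran during recent events:

grep "JOB_TYPE=" /var/hacmp/log/hacmp.out | more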
JOB_TYPE=ONLINE:
In the complete phase of an acquisition event, after all resources for all resource groups have been
successfully acquired, the ONLINE job type is run. This job ensures that all successfully acquired
resource groups are set to the online state. The RESOURCE_GROUPS variable contains the list of all
groups that were acquired.
:process_resources[1476] clRGPA
:clRGPA[48] [[ high = high ]]
:clRGPA[48] version=1.16
:clRGPA[50] usingVer=clrgpa
:clRGPA[55] clrgpa
:clRGPA[56] exit 0
:process_resources[1476] eval JOB_TYPE=ONLINE RESOURCE_GROUPS="cascrg1 cascrg2 conc_rg1"
JOB_TYPE=OFFLINE:
In the complete phase of a release event, after all resources for all resource groups have been successfully
released, the OFFLINE job type is run. This job ensures that all successfully released resource groups are
set to the offline state. The RESOURCE_GROUPS variable contains the list of all groups that were
released.
conc_rg1:process_resources[1476] clRGPA
conc_rg1:clRGPA[48] [[ high = high ]]
conc_rg1:clRGPA[48] version=1.16
conc_rg1:clRGPA[50] usingVer=clrgpa
conc_rg1:clRGPA[55] clrgpa
conc_rg1:clRGPA[56] exit 0
conc_rg1:process_resources[1476] eval JOB_TYPE=OFFLINE RESOURCE_GROUPS="cascrg2 conc_rg1"
JOB_TYPE=ERROR:
If an error occurred during the acquisition or release of any resource, the ERROR job type is run. The
variable RESOURCE_GROUPS contains the list of all groups where acquisition or release failed during
the current event. These resource groups are moved into the error state. When this job is run during an
acquisition event, PowerHA SystemMirror uses the Recovery from Resource Group Acquisition Failure
feature and launches an rg_move event for each resource group in the error state.
JOB_TYPE=NONE:
After all processing is complete for the current process_resources script, the final job type of NONE is
used to indicate that processing is complete and the script can return. When exiting after receiving this
job, the process_resources script always returns 0 for success.
conc_rg1:process_resources[1476] clRGPA
conc_rg1:clRGPA[48] [[ high = high ]]
conc_rg1:clRGPA[48] version=1.16
conc_rg1:clRGPA[50] usingVer=clrgpa
conc_rg1:clRGPA[55] clrgpa
conc_rg1:clRGPA[56] exit 0
conc_rg1:process_resources[1476] eval JOB_TYPE=NONE
conc_rg1:process_resources[1476] JOB_TYPE=NONE
conc_rg1:process_resources[1478] RC=0
conc_rg1:process_resources[1479] set +a
conc_rg1:process_resources[1481] [ 0 -ne 0 ]
conc_rg1:process_resources[1721] break
conc_rg1:process_resources[1731] exit 0
JOB_TYPE=ACQUIRE:
The ACQUIRE job type occurs at the beginning of any resource group acquisition event. Search hacmp.out for JOB_TYPE=ACQUIRE and view the value of the RESOURCE_GROUPS variable to see a list of the resource groups that are being acquired in parallel during the event.
:process_resources[1476] clRGPA
:clRGPA[48] [[ high = high ]]
:clRGPA[48] version=1.16
:clRGPA[50] usingVer=clrgpa
:clRGPA[55] clrgpa
:clRGPA[56] exit 0
:process_resources[1476] eval JOB_TYPE=ACQUIRE RESOURCE_GROUPS="cascrg1 cascrg2"
:process_resources[1476] JOB_TYPE=ACQUIRE RESOURCE_GROUPS="cascrg1 cascrg2"
:process_resources[1478] RC=0
:process_resources[1479] set +a
:process_resources[1481] [ 0 -ne 0 ]
:process_resources[1687] set_resource_group_state ACQUIRING
JOB_TYPE=RELEASE:
The RELEASE job type occurs at the beginning of any resource group release event. Search hacmp.out for JOB_TYPE=RELEASE and view the value of the RESOURCE_GROUPS variable to see a list of the resource groups that are being released in parallel during the event.
:process_resources[1476] clRGPA
:clRGPA[48] [[ high = high ]]
:clRGPA[48] version=1.16
:clRGPA[50] usingVer=clrgpa
:clRGPA[55] clrgpa
:clRGPA[56] exit 0
JOB_TYPE=SSA_FENCE:
The SSA_FENCE job type is used to handle fencing and unfencing of SSA disks. The variable ACTION indicates what should be done to the disks listed in the HDISKS variable. All resource groups (both parallel and serial) use this method for disk fencing.
:process_resources[1476] clRGPA FENCE
:clRGPA[48] [[ high = high ]]
:clRGPA[55] clrgpa FENCE
:clRGPA[56] exit 0
:process_resources[1476] eval JOB_TYPE=SSA_FENCE ACTION=ACQUIRE HDISKS="hdisk6" RESOURCE_GROUPS="conc_rg1 " HOSTS="electron"
:process_resources[1476] JOB_TYPE=SSA_FENCE ACTION=ACQUIRE HDISKS=hdisk6 RESOURCE_GROUPS=conc_rg1 HOSTS=electron
:process_resources[1478] RC=0
:process_resources[1479] set +a
:process_resources[1481] [ 0 -ne 0 ]
:process_resources[1675] export GROUPNAME=conc_rg1
conc_rg1:process_resources[1676] process_ssa_fence ACQUIRE
Note: Disk fencing uses the process_resources script; therefore, when disk fencing occurs, the log may mislead you into assuming that resource processing is taking place when, in fact, only disk fencing is taking place. If disk fencing is enabled, you will see in the hacmp.out file that the disk fencing operation occurs before any resource group processing. Although the process_resources script handles SSA disk fencing, the resource groups are processed serially: cl_ssa_fence is called once for each resource group that requires disk fencing. The hacmp.out content indicates which resource group is being processed.
conc_rg1:process_resources[8] export GROUPNAME
conc_rg1:process_resources[10] get_list_head hdisk6
conc_rg1:process_resources[10] read LIST_OF_HDISKS_FOR_RG
conc_rg1:process_resources[11] read HDISKS
conc_rg1:process_resources[11] get_list_tail hdisk6
conc_rg1:process_resources[13] get_list_head electron
conc_rg1:process_resources[13] read HOST_FOR_RG
conc_rg1:process_resources[14] get_list_tail electron
conc_rg1:process_resources[14] read HOSTS
conc_rg1:process_resources[18] cl_ssa_fence ACQUIRE electron hdisk6
conc_rg1:cl_ssa_fence[43] version=1.9.1.2
conc_rg1:cl_ssa_fence[44]
conc_rg1:cl_ssa_fence[44]
conc_rg1:cl_ssa_fence[46] STATUS=0
conc_rg1:cl_ssa_fence[48] (( 3 < 3 ))
conc_rg1:cl_ssa_fence[56] OPERATION=ACQUIRE
JOB_TYPE=SERVICE_LABELS:
The SERVICE_LABELS job type handles the acquisition or release of service labels. The variable
ACTION indicates what should be done to the service IP labels listed in the IP_LABELS variable.
conc_rg1:process_resources[1476] clRGPA
conc_rg1:clRGPA[55] clrgpa
conc_rg1:clRGPA[56] exit 0
conc_rg1:process_resources[1476] eval JOB_TYPE=SERVICE_LABELS ACTION=ACQUIRE IP_LABELS="elect_svc0:shared_svc1,shared_svc2" RESOURCE_GROUPS="cascrg1 rotrg1" COMMUNICATION_LINKS=":commlink1"
conc_rg1:process_resources[1476] JOB_TYPE=SERVICE_LABELS ACTION=ACQUIRE IP_LABELS=elect_svc0:shared_svc1,shared_svc2
This job type launches an acquire_service_addr event. Within the event, each individual service label is acquired. The content of the hacmp.out file indicates which resource group is being processed. Within each resource group, the event flow is the same as it is under serial processing.
cascrg1:acquire_service_addr[251] export GROUPNAME
cascrg1:acquire_service_addr[251] [[ true = true ]]
cascrg1:acquire_service_addr[254] read SERVICELABELS
cascrg1:acquire_service_addr[254] get_list_head electron_svc0
cascrg1:acquire_service_addr[255] get_list_tail electron_svc0
cascrg1:acquire_service_addr[255] read IP_LABELS
cascrg1:acquire_service_addr[257] get_list_head
cascrg1:acquire_service_addr[257] read SNA_CONNECTIONS
cascrg1:acquire_service_addr[258] export SNA_CONNECTIONS
cascrg1:acquire_service_addr[259] get_list_tail
cascrg1:acquire_service_addr[259] read _SNA_CONNECTIONS
cascrg1:acquire_service_addr[270] clgetif -a electron_svc0
JOB_TYPE=VGS:
The VGS job type handles the acquisition or release of volume groups. The variable ACTION indicates
what should be done to the volume groups being processed, and the names of the volume groups are
listed in the VOLUME_GROUPS and CONCURRENT_VOLUME_GROUPS variables.
conc_rg1:process_resources[1476] clRGPA
conc_rg1:clRGPA[55] clrgpa
conc_rg1:clRGPA[56] exit 0
This job type runs the cl_activate_vgs event utility script, which acquires each individual volume group.
The content of the [Link] file indicates which resource group is being processed, and within each
resource group, the script flow is the same as it is under serial processing.
cascrg1 cascrg2:cl_activate_vgs[256] 1> /usr/es/sbin/cluster/etc/lsvg.out.21266 2> /tmp/lsvg.err
Resource groups in clusters that are configured with dependent groups or sites are handled with
dynamic event phasing. These events process one or more resource groups at a time. Multiple
nonconcurrent resource groups can be processed within one rg_move event.
Related information:
Applications and PowerHA SystemMirror
5. Press Enter to add the values into the PowerHA SystemMirror for AIX Configuration Database.
6. Return to the main PowerHA SystemMirror menu. Select Extended Configuration > Extended
Verification and Synchronization.
The software checks whether cluster services are running on any cluster node. If so, there will be no
option to skip verification.
7. Select the options you want to use for verification and press Enter to synchronize the cluster
configuration and node environment across the cluster.
Related information:
Verifying and synchronizing a PowerHA SystemMirror cluster
The [Link] file provides information about all connections to and from the daemon, including the
initial connections established during discovery. Because [Link] contains diagnostic information for
the daemon, you usually do not use this file in troubleshooting situations.
The following example shows the type of output generated in the [Link] file. The second and third
entries are generated during the discovery process.
Wed May 7 [Link] 2003: Daemon was successfully started
Wed May 7 [Link] 2003: Trying to establish connection to node
temporarynode0000001439363040
Wed May 7 [Link] 2003: Trying to establish connection to node
temporarynode0000002020023310
Wed May 7 [Link] 2003: Connection to node temporarynode0000002020023310, success, [Link]->
Wed May 7 [Link] 2003: CONNECTION: ACCEPTED: test2: [Link]->[Link]
Wed May 7 [Link] 2003: WARNING: /usr/es/sbin/cluster/etc/rhosts permissions
must be -rw-------
Wed May 7 [Link] 2003: Connection to node temporarynode0000001439363040: closed
Wed May 7 [Link] 2003: Connection to node temporarynode0000002020023310: closed
Wed May 7 [Link] 2003: CONNECTION: CLOSED: test2: [Link]->[Link]
Wed May 7 [Link] 2003: Trying to establish connection to node test1
Wed May 7 [Link] 2003: Connection to node test1, success, [Link]->[Link]
Wed May 7 [Link] 2003: Trying to establish connection to node test3.
You can view the content of the [Link] or [Link] file by using the AIX vi or more
commands.
You can turn off logging to [Link] temporarily (until the next reboot, or until you enable logging
for this component again) by using the AIX tracesoff command. To permanently stop logging to
[Link], start the daemon from SRC without the -d flag by using the following command:
chssys -s clcomd -a ""
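The chssys command changes only the subsystem definition; the daemon must be restarted for the
change to take effect. A minimal sketch, assuming clcomd runs under the SRC as shown above:
stopsrc -s clcomd     # stop the daemon
startsrc -s clcomd    # restart it without the -d debug flag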
Note: Logs should be redirected to local file systems and not to shared or NFS file systems. Having logs
on those file systems may cause problems if the file system needs to unmount during a fallover event.
Redirecting logs to NFS file systems may also prevent cluster services from starting during node
reintegration.
These checks decrease the possibility that the chosen file system may become unexpectedly unavailable.
If no error messages are displayed on the console and if examining the log files proves fruitless, you next
investigate each component of your PowerHA SystemMirror environment and eliminate it as the cause of
the problem.
Your knowledge of the PowerHA SystemMirror system is essential. You must know the characteristics of
a normal cluster beforehand and be on the lookout for deviations from the norm as you examine the
cluster components. Often, the surviving cluster nodes can provide an example of the correct setting for a
system parameter or for other cluster configuration information.
You should review the PowerHA SystemMirror cluster components that you can check, along with the
useful utilities described here. If examining the cluster log files does not reveal the source of a problem,
investigate each system component using a top-down strategy to move through the layers. You should
investigate the components in the following order:
1. Application layer
2. PowerHA SystemMirror layer
3. Logical Volume Manager layer
4. TCP/IP layer
5. AIX layer
6. Physical network layer
7. Physical disk layer
8. System hardware layer
You should also know what to look for when examining each layer and know the tools you should use to
examine the layers.
Note: These steps assume that you have checked the log files and that they do not point to the problem.
Make sure that the PowerHA SystemMirror files required for your cluster are in the proper place, have
the proper permissions (readable and executable), and are not zero length. The PowerHA SystemMirror
files and the AIX files modified by the PowerHA SystemMirror software are listed in the README file
that accompanies the product.
When these components are not responding normally, determine if the daemons are active on a cluster
node. Use either the options on the SMIT System Management (C-SPOC) > Cluster Services > Show
Cluster Services panel or the lssrc command.
For example, to check on the status of all daemons under the control of the SRC, enter:
lssrc -a | grep active
syslogd          ras          290990   active
sendmail         mail         270484   active
portmap          portmap      286868   active
inetd            tcpip        295106   active
snmpd            tcpip        303260   active
dpid2            tcpip        299162   active
hostmibd         tcpip        282812   active
aixmibd          tcpip        278670   active
biod             nfs          192646   active
[Link]           nfs          254122   active
[Link]           nfs          274584   active
qdaemon          spooler      196720   active
writesrv         spooler      250020   active
ctrmc            rsct         98392    active
clcomdES         clcomdES     204920   active
IBM.CSMAgentRM   rsct_rm      90268    active
[Link]           rsct_rm      229510   active
[Link]           rsct_rm      188602   active
IBM.AuditRM      rsct_rm      151722   active
topsvcs          topsvcs      602292   active
grpsvcs          grpsvcs      569376   active
emsvcs           emsvcs       561188   active
emaixos          emsvcs       557102   active
clstrmgrES       cluster      544802   active
gsclvmd                       565356   active
IBM.HostRM       rsct_rm      442380   active
To check on the status of all cluster daemons under the control of the SRC, enter:
lssrc -g cluster
To view additional information about the status of a daemon, run the clcheck_server command. The
clcheck_server command makes additional checks and retries beyond those made by the lssrc
command. For more information, see the clcheck_server man page.
To determine whether the Cluster Manager is running, or if processes started by the Cluster Manager are
currently running on a node, use the ps command.
See the ps man page for more information about using this command.
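For example, to check whether the Cluster Manager daemon (clstrmgr) is running on a node:
ps -ef | grep clstrmgr    # no clstrmgr process in the output means it is not running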
To begin checking for configuration problems, ask yourself if you (or others) have made any recent
changes that may have disrupted the system. Have components been added or deleted? Has new
software been loaded on the machine? Have new PTFs or application updates been performed? Has a
system backup been restored? Then run verification to ensure that the proper PowerHA
SystemMirror-specific modifications to AIX software are in place and that the cluster configuration is
valid.
The cluster verification utility checks many aspects of a cluster configuration and reports any
inconsistencies. Using this utility, you can perform the following tasks:
v Verify that all cluster nodes contain the same cluster topology information
v Check that all network interface cards are properly configured, and that shared disks are accessible to
all nodes that can own them
v Check for agreement among all nodes on the ownership of defined resources, such as file systems, log
files, volume groups, disks, and application controllers
v Check for invalid characters in cluster names, node names, network names, network interface names
and resource group names
v Verify takeover information.
The verification utility will also print out diagnostic information about the following:
v Custom snapshot methods
v Custom verification methods
v Custom pre or post events
v Cluster log file redirection.
From the main PowerHA SystemMirror SMIT panel, select Problem Determination Tools > PowerHA
SystemMirror Verification > Verify PowerHA SystemMirror Configuration. If you find a configuration
problem, correct it, then resynchronize the cluster.
Note: Some errors require that you make changes on each cluster node. For example, a missing
application start script or a volume group with autovaryon=TRUE requires a correction on each affected
node. Some of these issues can be taken care of by using PowerHA SystemMirror File Collections.
The command ls -lt /etc lists all the files in the /etc directory and shows the most recently modified
files that are important to configuring AIX, such as:
v /etc/[Link]
v /etc/hosts
v /etc/services
It is also very important to check the resource group configuration for any errors that may not be flagged
by the verification process. For example, make sure the file systems required by the application
controllers are included in the resource group with the application.
Check that the nodes in each resource group are the ones intended, and that the nodes are listed in the
proper order. To view the cluster resource configuration information from the main PowerHA
SystemMirror SMIT panel, select Extended Configuration > Extended Resource Configuration >
PowerHA SystemMirror Extended Resource Group Configuration > Show All Resources by Node or
Resource Group.
You can also run the /usr/es/sbin/cluster/utilities/clRGinfo command to see the resource group
information.
Note: If cluster configuration problems arise after running the cluster verification utility, do not run
C-SPOC commands in this environment as they may fail to execute on cluster nodes.
Related information:
Verifying and synchronizing a PowerHA SystemMirror cluster
The default directory path for storage and retrieval of a snapshot is /usr/es/sbin/cluster/snapshots.
Note that you cannot use the cluster snapshot facility in a cluster that is running different versions of
PowerHA SystemMirror concurrently.
Related information:
Saving and restoring cluster configurations
The primary information saved in a cluster snapshot is the data stored in the PowerHA SystemMirror
Configuration Database classes (such as HACMPcluster, HACMPnode, and HACMPnetwork). This is the
information used to recreate the cluster configuration when a cluster snapshot is applied.
The cluster snapshot does not save any user customized scripts, applications, or other configuration
parameters that are not for PowerHA SystemMirror. For example, the name of an application controller
and the location of its start and stop scripts are stored in the PowerHA SystemMirror server
Configuration Database object class. However, the scripts themselves as well as any applications they
may call are not saved.
If you moved resource groups using the Resource Group Management utility clRGmove, once you apply
a snapshot, the resource groups return to behaviors specified by their default nodelists. To investigate a
cluster after a snapshot has been applied, run clRGinfo to view the locations and states of resource
groups.
In addition to this Configuration Database data, a cluster snapshot also includes output generated by
various PowerHA SystemMirror and standard AIX commands and utilities. This data includes the current
state of the cluster, node, network, and network interfaces as viewed by each cluster node, as well as the
state of any running PowerHA SystemMirror daemons.
Skipping the logs collection reduces the size of the snapshot and speeds up running the snapshot utility.
You can use SMIT to collect cluster log files for problem reporting. This option is available under the
Problem Determination Tools > PowerHA SystemMirror Log Viewing and Management > Collect
Cluster log files for Problem Reporting SMIT menu. It is recommended to use this option only if
requested by IBM support personnel.
Note that you can also use the AIX snap -e command to collect PowerHA SystemMirror cluster data,
including the [Link] and [Link] log files.
Related information:
Saving and restoring cluster configurations
The cluster snapshot facility stores the data it saves in two separate files, the Configuration Database data
file and the Cluster State Information File, each displaying information in three sections.
This file contains all the data stored in the PowerHA SystemMirror Configuration Database object classes
for the cluster.
This file is given a user-defined basename with the .odm file extension. Because the Configuration
Database information must be largely the same on every cluster node, the cluster snapshot saves the
values from only one node. The cluster snapshot Configuration Database data file is an ASCII text file
divided into three delimited sections:
Table 13. Database Data file (.odm) sections
Section Description
Version section This section identifies the version of the cluster snapshot. The characters <VER identify the
start of this section; the characters </VER identify the end of this section. The cluster
snapshot software sets the version number.
Description section This section contains user-defined text that describes the cluster snapshot. You can specify
up to 255 characters of descriptive text. The characters <DSC identify the start of this section;
the characters </DSC identify the end of this section.
ODM data section This section contains the PowerHA SystemMirror Configuration Database object classes in
generic AIX ODM stanza format. The characters <ODM identify the start of this section; the
characters </ODM identify the end of this section.
The following is an excerpt from a sample cluster snapshot Configuration Database data file showing
some of the ODM stanzas that are saved:
<VER
1.0
</VER
<DSC
My Cluster Snapshot
</DSC
<ODM
This file contains the output from standard AIX and PowerHA SystemMirror system management
commands.
This file is given the same user-defined basename with the .info file extension. If you defined custom
snapshot methods, the output from them is appended to this file. The Cluster State Information file
contains three sections:
Table 14. Cluster State information file (.info)
Section Description
Version section This section identifies the version of the cluster snapshot. The characters <VER identify
the start of this section; the characters </VER identify the end of this section. The cluster
snapshot software sets this section.
Description section This section contains user-defined text that describes the cluster snapshot. You can
specify up to 255 characters of descriptive text. The characters <DSC identify the start of
this section; the characters </DSC identify the end of this section.
Command output section This section contains the output generated by AIX and PowerHA SystemMirror ODM
commands. This section lists the commands executed and their associated output. This
section is not delimited in any way.
In the SMIT panel Initialization and Standard Configuration > Configure PowerHA SystemMirror
Resource Groups > Change/Show Resources for a Resource Group (standard), all volume groups listed
in the Volume Groups field for a resource group should be varied on the node(s) that have the resource
group online.
To check for inconsistencies among volume group definitions on cluster nodes, use the lsvg command to
display information about the volume groups defined on each node in the cluster:
lsvg
To list only the active (varied on) volume groups in the system, use the lsvg -o command as follows:
lsvg -o
Note: The volume group must be varied on to use the lsvg -l command.
You can also use PowerHA SystemMirror SMIT to check for inconsistencies: select the System
Management (C-SPOC) > PowerHA SystemMirror Logical Volume Management > Shared Volume
Groups option to display information about shared volume groups in your cluster.
Depending on your configuration, the lsvg command shows the volume group state as active (if the
volume group is actively varied on) or passive-only (if it is passively varied on).
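For example, to check the varyon state and the autovaryon setting of a shared volume group (sharedvg
is a placeholder name):
lsvg sharedvg | grep -E "VG STATE|AUTO ON"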
To check for inconsistencies among volume group definitions on cluster nodes in a two-node C-SPOC
environment:
1. Enter smitty hacmp
2. In SMIT, select System Management (C-SPOC) > PowerHA SystemMirror Logical Volume
Management > Shared Volume Groups > List All Shared Volume Groups and press Enter to accept
the default (no).
A list of all shared volume groups in the C-SPOC environment appears. This list also contains enhanced
concurrent volume groups included as resources in non-concurrent resource groups.
You can also use the C-SPOC cl_lsvg command from the command line to display this information.
The first column of the display shows the logical name of the disk. The second column lists the physical
volume identifier of the disk. The third column lists the volume group (if any) to which it belongs.
Note that on each cluster node, AIX can assign different names (hdisk numbers) to the same physical
volume. To tell which names correspond to the same physical volume, compare the physical volume
identifiers listed on each node.
If you specify the logical device name of a physical volume (hdiskx) as an argument to the lspv
command, it displays information about the physical volume, including whether it is active (varied on).
For example:
lspv hdisk2
PHYSICAL VOLUME:    hdisk2               VOLUME GROUP:     abalonevg
PV IDENTIFIER:      0000301919439ba5     VG IDENTIFIER:    00003019460f63c7
PV STATE:           active               VG STATE:         active/complete
STALE PARTITIONS:   0                    ALLOCATABLE:      yes
PP SIZE:            4 megabyte(s)        LOGICAL VOLUMES:  2
TOTAL PPs:          203 (812 megabytes)  VG DESCRIPTORS:   2
FREE PPs:           192 (768 megabytes)
USED PPs:           11 (44 megabytes)
FREE DISTRIBUTION:  41..30..40..40..41
USED DISTRIBUTION:  00..11..00..00..00
If a physical volume is inactive (not varied on, as indicated by question marks in the PV STATE field),
use the appropriate command for your configuration to vary on the volume group containing the
physical volume. Before doing so, however, you may want to check the system error report to determine
whether a disk problem exists. Enter the following command to check the system error report:
errpt -a|more
You can also use the lsdev command to check the availability or status of all physical volumes known to
the system.
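For example, the following command lists all disk devices known to the system along with their status
(Available or Defined):
lsdev -Cc disk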
As shown in the following example, you can use the lspv command to determine the names of the
logical volumes defined on a physical volume:
lspv -l hdisk2
LV NAME   LPs   PPs   DISTRIBUTION         MOUNT POINT
lv02      50    50    25..00..00..00..25   /usr
lv04      44    44    06..00..00..32..06   /clusterfs
Use the lslv logicalvolume command to display information about the state (opened or closed) of a specific
logical volume, as indicated in the LV STATE field. For example:
lslv nodeAlv
If a logical volume state is inactive (or closed, as indicated in the LV STATE field), use the appropriate
command for your configuration to vary on the volume group containing the logical volume.
In SMIT select System Management (C-SPOC) > PowerHA SystemMirror Logical Volume Management
> Shared Logical Volumes > List All Shared Logical Volumes by Volume Group. A list of all shared
logical volumes appears.
You can also use the C-SPOC cl_lslv command from the command line to display this information.
Use the following commands to obtain this information about file systems:
v The mount command
v The df command
v The lsfs command.
Use the cl_lsfs command to list file system information when running the C-SPOC utility.
Use the mount command to list all the file systems, both JFS and NFS, currently mounted on a system
and their mount points.
For example:
mount
Determine whether and where the file system is mounted, then compare this information against the
PowerHA SystemMirror definitions to note any differences.
For example:
df
Check the %used column for file systems that are using more than 90% of their available space. Then
check the free column to determine the exact amount of free space left.
For example:
lsfs
Important: For file systems to be NFS exported, be sure to verify that logical volume names for these file
systems are consistent throughout the cluster.
Check to see whether the necessary shared file systems are mounted and where they are mounted on
cluster nodes in a two-node C-SPOC environment.
In SMIT select System Management (C-SPOC) > PowerHA SystemMirror Logical Volume Management
> Shared Filesystems. Select from either Journaled Filesystems > List All Shared Filesystems or
Enhanced Journaled Filesystems > List All Shared Filesystems to display a list of shared file systems.
You can also use the C-SPOC cl_lsfs command from the command line to display this information.
At boot time, AIX attempts to check all the file systems listed in /etc/filesystems with the check=true
attribute by running the fsck command.
For file systems controlled by PowerHA SystemMirror, this error message typically does not indicate a
problem. The file system check fails because the volume group on which the file system is defined is not
varied on at boot time.
To avoid generating this message, edit the /etc/filesystems file to ensure that the stanzas for the shared
file systems do not include the check=true attribute.
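For example, a stanza for a shared file system should look similar to the following, with check set to
false or omitted (all names here are placeholders):
/clusterfs:
        dev     = /dev/sharedlv
        vfs     = jfs2
        log     = /dev/sharedloglv
        mount   = false
        check   = false
        account = false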
Look at the first, third, and fourth columns of the output. The Name column lists all the interfaces
defined and available on this node. Note that an asterisk preceding a name indicates the interface is
down (not ready for use). The Network column identifies the network to which the interface is connected
(its subnet). The Address column identifies the IP address assigned to the node.
The netstat -rn command indicates whether a route to the target node is defined. To see all the defined
routes, enter:
netstat -rn
The same test, run on a system that does not have this route in its routing table, returns no response. If
the service and boot interfaces are separated by a bridge, router, or hub and you experience problems
communicating with network devices, the devices may not be set to handle two network segments as one
physical network. Try testing the devices independent of the configuration, or contact your system
administrator for assistance.
Note that if you have only one interface active on a network, the Cluster Manager will not generate a
failure event for that interface.
See the netstat man page for more information on using this command.
Related information:
Network interface events
Be sure to test all TCP/IP interfaces configured on the nodes (service and boot).
For example, to test the connection from a local node to a remote node named nodeA enter:
/etc/ping nodeA
Type Control-C to end the display of packets; the ping command then prints summary statistics. The
ping command sends packets to the specified node, requesting a response. If a correct response arrives,
ping reports that no packets were lost, indicating a valid connection between the nodes.
If the ping command hangs, it indicates that there is no valid path between the node issuing the ping
command and the node you are trying to reach. It could also indicate that required TCP/IP daemons are
not running. Check the physical connection between the two nodes. Use the ifconfig and netstat
commands to check the configuration. A "bad value" message indicates problems with the IP addresses or
subnet definitions.
Note that if "DUP!" appears at the end of the ping response, it means the ping command has received
multiple responses for the same address. This response typically occurs when network interfaces have
been misconfigured, or when a cluster event fails during IP address takeover. Check the configuration of
all interfaces on the subnet to verify that there is only one interface per address. For more information,
see the ping man page.
In addition, you can assign a persistent node IP label to a cluster network on a node. When, for
administrative purposes, you want to reach a specific node in the cluster by using the ping or telnet
commands without worrying about whether the service IP label you are using belongs to any of the
resource groups present on that node, it is convenient to use a persistent node IP label defined on that
node.
Related information:
Planning PowerHA SystemMirror
Configuring PowerHA SystemMirror cluster topology and resources (extended)
en0: flags=2000063<UP,BROADCAST,NOTRAILERS,RUNNING,NOECHO>
inet [Link] netmask 0xffffff00 broadcast [Link]
inet6 fe80::214:5eff:fe4d:6045/64
tcp_sendspace 131072 tcp_recvspace 65536 rfc1323 0
The ifconfig command displays multiple lines of output. The first line shows the interface's name and
characteristics. Check for these characteristics:
Table 15. ifconfig command output
Field Value
UP The interface is ready for use. If the interface is down, use the ifconfig command to initialize it. For
example:
ifconfig en0 up
If the interface does not come up, replace the interface cable and try again. If it still fails, use the diag
command to check the device.
The remaining output from the ifconfig command includes information for each address configured on
the interface. Check these fields to make sure the network interface is properly configured.
Use the arp command to view the IP and MAC addresses that a host currently associates with the
nodes listed in its arp cache. For example:
arp -a
This output shows what the host node currently believes to be the IP and MAC addresses for nodes
flounder, cod, seahorse and pollock. (If IP address takeover occurs without Hardware Address Takeover,
the MAC address associated with the IP address in the host's arp cache may become outdated. You can
correct this situation by refreshing the host's arp cache.)
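For example, to refresh a stale entry for the host flounder (a host name taken from the example above),
you might delete the entry and then trigger a new ARP request:
arp -d flounder        # delete the cached entry for flounder
ping -c 1 flounder     # the reply repopulates the cache with the current MAC address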
Be on the lookout for disk and network error messages, especially permanent ones, which indicate real
failures. See the errpt man page for more information.
For SCSI disks, including IBM SCSI disks and arrays, make sure that each array controller, adapter, and
physical disk on the SCSI bus has a unique SCSI ID. Each SCSI ID on the bus must be an integer value
from 0 through 15, although some SCSI adapters may have limitations on the SCSI ID that can be set. See
the device documentation for information about any device-specific limitations. A common configuration
is to set the SCSI ID of the adapters on the nodes to be higher than the SCSI IDs of the shared devices.
Devices with higher IDs take precedence in SCSI bus contention.
For example, if the standard SCSI adapters use IDs 5 and 6, assign values from 0 through 4 to the other
devices on the bus. You may want to set the SCSI IDs of the adapters to 5 and 6 to avoid a possible
conflict when booting one of the systems in service mode from a mksysb tape or other boot devices,
since this will always use an ID of 7 as the default.
If the SCSI adapters use IDs of 14 and 15, assign values from 3 through 13 to the other devices on the
bus.
You can check the SCSI IDs of adapters and disks using either the lsattr or lsdev command. For example,
to determine the SCSI ID of the adapter scsi1 (SCSI-3), use the following lsattr command and specify the
logical name of the adapter as an argument:
lsattr -E -l scsi1 | grep id
Do not use wildcard characters or full pathnames on the command line for the device name designation.
Important: If you restore a backup of your cluster configuration onto an existing system, be sure to
recheck or reset the SCSI IDs to avoid possible SCSI ID conflicts on the shared bus. Restoring a system
backup causes adapter SCSI IDs to be reset to the default SCSI ID of 7.
If you note a SCSI ID conflict, see the Planning Guide for information about setting the SCSI IDs on disks
and disk adapters.
For more information, refer to your hardware manuals or search for information about devices on IBM's
website.
In this case, you must update the information that is saved in the /etc/cluster/rhosts file on all cluster
nodes, and refresh the clcomd daemon to make it aware of the changes. When you synchronize and
verify the cluster again, the clcomd daemon starts using the IP addresses added to the PowerHA
SystemMirror Configuration Database.
Also, configure the /etc/cluster/rhosts file to contain all the addresses currently used by PowerHA
SystemMirror for inter-node communication, and then copy this file to all cluster nodes. The
/etc/cluster/rhosts file can contain IPv4 and IPv6 addresses.
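For example, one way to populate the file and make clcomd reread it (the addresses shown are
placeholders for your cluster's inter-node addresses) is:
echo "10.10.10.1" >> /etc/cluster/rhosts
echo "10.10.10.2" >> /etc/cluster/rhosts
refresh -s clcomd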
Related reference:
“Cluster communications issues” on page 67
These topics describe potential cluster communication issues.
Without an argument, diag runs as a menu-driven program. You can also run diag on a specific piece of
hardware. For example:
diag -d hdisk0 -c
Starting diagnostics.
Ending diagnostics.
Problem
At boot time, AIX tries to check, by running the fsck command, all the file systems listed in
/etc/filesystems with the check=true attribute. If it cannot check a file system, AIX reports an error. The
system displays the following:
+----------------------------------------------------------+
Filesystem Helper: 0506-519 Device open failed
+----------------------------------------------------------+
Solution
For file systems controlled by PowerHA SystemMirror, this error typically does not indicate a problem.
The file system check failed because the volume group on which the file system is defined is not varied
on at boot-time. To prevent the generation of this message, edit the /etc/filesystems file to ensure that the
stanzas for the shared file systems do not include the check=true attribute.
When you install PowerHA SystemMirror, cl_convert is run automatically. The software checks for an
existing PowerHA SystemMirror configuration and attempts to convert that configuration to the format
used by the version of the software being installed. However, if installation fails, cl_convert will fail to
run as a result. Therefore, conversion from the Configuration Database of a previous PowerHA
SystemMirror version to the Configuration Database of the current version will also fail.
Solution
Run cl_convert from the command line. To gauge conversion success, refer to the [Link] file,
which logs conversion progress.
CAUTION:
Before converting, be sure that your ODMDIR environment variable is set to /etc/es/objrepos.
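A sketch of a manual conversion follows; the version number is a placeholder, and the flags should be
checked against the cl_convert man page for your level:
export ODMDIR=/etc/es/objrepos
/usr/es/sbin/cluster/conversion/cl_convert -F -v 6.1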
Problem
During the installation of PowerHA SystemMirror client software, the following message appears:
+----------------------------------------------------------+
Post-installation Processing...
+----------------------------------------------------------+
Some configuration files could not be automatically merged into
the system during the installation. The previous versions of these files
have been saved in a configuration directory as listed below. Compare
the saved files and the newly installed files to determine if you need
to recover configuration data. Consult product documentation
to determine how to merge the data.
Configuration files, which were saved in /usr/lpp/[Link]:
/usr/es/sbin/cluster/utilities/[Link]
Solution
As part of the PowerHA SystemMirror installation process, copies of PowerHA SystemMirror files that
could potentially contain site-specific modifications are saved in the /usr/lpp/[Link] directory before
they are overwritten. As the message states, you must merge site-specific configuration information into
the newly installed files.
PowerHA SystemMirror 7.1.0, or later, is built upon the Cluster Aware AIX (CAA) capabilities. Tivoli
System Automation for Multiplatform is built upon Reliable Scalable Cluster Technology (RSCT)
capabilities. Therefore, you cannot use PowerHA SystemMirror and Tivoli System Automation for
Multiplatform on the same node because they are built upon different clustering capabilities.
Problem
Solution
PowerHA SystemMirror has a dependency on the location of certain ODM repositories to store
configuration data. The ODMPATH environment variable allows ODM commands and subroutines to
query locations other than the default location if the queried object does not reside in the default location.
You can set this variable, but it must include the default location, /etc/objrepos, or the integrity of
configuration information may be lost.
Problem
The "smux-connect" error occurs after starting the clinfoES daemon with the -a option. Another process is
using port 162 to receive traps.
Solution
Check to see if another process, such as the trapgend smux subagent of NetView® for AIX or the System
Monitor for AIX sysmond daemon, is using port 162. If so, restart clinfoES without the -a option and
configure NetView for AIX to receive the SNMP traps. Note that you will not experience this error if
clinfoES is started in its normal way using the startsrc command.
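For example, to restart clinfoES under the SRC without the -a option:
stopsrc -s clinfoES
startsrc -s clinfoES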
Problem
The node powers itself off or appears to hang after starting cluster services. The errpt report shows an
operator message logged by the [Link] script, which issues a halt -q command to the system.
Solution
Use the cluster verification utility to uncover discrepancies in cluster configuration information on all
cluster nodes.
Correct any configuration errors uncovered by the cluster verification utility. Make the necessary changes
using the PowerHA SystemMirror Configuration SMIT panels. After correcting the problem, select the
Verify and Synchronize PowerHA SystemMirror Configuration option to synchronize the cluster
configuration across all nodes. Then select the Start Cluster Services option from the System
Management (C-SPOC) > Manage PowerHA SystemMirror Services SMIT panel to start the Cluster
Manager.
For more information about the snap -e command, see the section Using the AIX data collection utility.
Related reference:
“Using the AIX data collection utility” on page 3
Use the AIX snap command to collect data from a PowerHA SystemMirror cluster.
Related information:
Abnormal termination of Cluster Manager daemon
Problem
The /etc/hosts file on each cluster node does not contain the IP labels of other nodes in the cluster. For
example, in a four-node cluster, Node A, Node B, and Node C's /etc/hosts files do not contain the IP
labels of the other cluster nodes.
If this situation occurs, the configchk command returns the following message to the console:
"your hostname not known," "Cannot access node x."
This message indicates that the /etc/hosts file on Node x does not contain an entry for your node.
Solution
Before starting the PowerHA SystemMirror software, ensure that the /etc/hosts file on each node includes
the service and boot IP labels of each cluster node.
Problem
The Cluster Manager hangs during reconfiguration and generates messages similar to the following:
The cluster has been in reconfiguration too long; Something may be wrong.
Solution
Determine why the script failed by examining the /var/hacmp/log/[Link] file to see what process
exited with a non-zero status. The error messages in the /var/hacmp/adm/[Link] file may also be
helpful. Fix the problem identified in the log file. Then run the clruncmd command either at the
command line, or by using the SMIT Problem Determination Tools > Recover From PowerHA
SystemMirror Script Failure panel. The clruncmd command signals the Cluster Manager to resume
cluster processing.
Problem
At the first boot of a newly installed AIX node, AIX runs the installation assistant from /etc/inittab and
does not proceed with the other entries in this file. The AIX installation assistant waits for your input on
the system console, and AIX runs the installation assistant on every subsequent boot, until you indicate
that installation is finished.
Solution
Manually indicate to the system console (for the AIX installation assistant) that the AIX installation is
finished. Once you do so, the system will proceed to start the cluster communications daemon (clcomd)
and the Cluster Manager daemon (clstrmgr).
Problem
The cluster verification utility indicates that a pre- or post-event does not exist on a node after upgrading
to a new version of the PowerHA SystemMirror software.
Solution
Ensure that a script by the defined name exists and is executable on all cluster nodes.
Each node must contain a script associated with the defined pre- or post-event. While the contents of the
script do not have to be the same on each node, the name of the script must be consistent across the
cluster. If no action is desired on a particular node, a no-op script with the same event-script name
should be placed on nodes on which no processing should occur.
Problem
The system appears to be hung. 869 is displayed continuously on the system LED display.
Solution
A number of situations can cause this display to occur. Make sure all devices connected to the SCSI bus
have unique SCSI IDs to avoid SCSI ID conflicts. In particular, check that the adapters and devices on
each cluster node connected to the SCSI bus have a different SCSI ID. By default, AIX assigns an ID of 7
to a SCSI adapter when it configures the adapter. See the Planning Guide for more information about
checking and setting SCSI IDs.
Related information:
Planning PowerHA SystemMirror
Solution
When you remove a node from the cluster, the cluster definition remains in the node's Configuration
Database. If you start cluster services on the removed node, the node reads this cluster configuration data
and attempts to rejoin the cluster from which it had been removed. The other nodes no longer recognize
this node as a member of the cluster and refuse to allow the node to join. Because the node requesting to
join the cluster has the same cluster name as the existing cluster, it can cause the cluster to become
unstable or crash the existing nodes.
Important: You must stop cluster services on the node before removing it from the cluster.
The -R flag removes the PowerHA SystemMirror entry in the /etc/inittab file, preventing cluster
services from being automatically started when the node is rebooted.
2. Remove the PowerHA SystemMirror entry from the [Link] file using the following command:
clchipat false
3. Remove the cluster definition from the node's Configuration Database using the following command:
clrmclstr
You can also perform this task by selecting Extended Configuration > Extended Topology Configuration
> Configure a PowerHA SystemMirror Cluster > Remove a PowerHA SystemMirror Cluster from the
SMIT panel.
Problem
You have specified a resource group migration operation using the Resource Group Migration Utility,
requesting that this particular migration persist across cluster reboot by setting the Persist across
Cluster Reboot flag to true (or by issuing the clRGmove command). Then, after you stopped and
restarted the cluster services, this policy is not followed on one of the nodes in the cluster.
Solution
This problem occurs if a node was down and inaccessible when you specified the persistent resource
group migration. In this case, the node did not obtain information about the persistent resource group
migration, and if, after cluster services are restarted, this node is the first to join the cluster, it will have
no knowledge of the Persist across Cluster Reboot setting. Thus, the resource group migration will not
be persistent. To restore the persistent migration setting, you must again specify it in SMIT under the
Extended Resource Configuration > PowerHA SystemMirror Resource Group Configuration SMIT
menu.
Problem
The ODM entry for group "hacmp" is removed on SP nodes. This problem manifests itself as the inability
to start the cluster or clcomd errors.
Solution
To further improve security, the PowerHA SystemMirror Configuration Database (ODM) has the
following enhancements:
v Ownership. All PowerHA SystemMirror ODM files are owned by user root and group hacmp. In
addition, all PowerHA SystemMirror binary files that are intended for use by non-root users are also
owned by user root and group hacmp.
v Permissions. All PowerHA SystemMirror ODM files, except for the hacmpdisksubsystem file with 600
permissions, are set with 640 permissions (readable by user root and group hacmp, writable by user root).
During the installation, PowerHA SystemMirror creates the group "hacmp" on all nodes if it does not
already exist. By default, group hacmp has permission to read the PowerHA SystemMirror ODMs, but
does not have any other special authority. For security reasons, it is recommended not to expand the
authority of group hacmp.
If you use programs that access the PowerHA SystemMirror ODMs directly, you may need to rewrite
them if they are intended to be run by non-root users:
v All access to the ODM data by non-root users should be handled via the provided PowerHA
SystemMirror utilities.
v In addition, if you are using the PSSP File Collections facility to maintain the consistency of /etc/group,
the new group "hacmp" that is created at installation time on the individual cluster nodes may be lost
when the next file synchronization occurs.
Generally, PowerHA SystemMirror nodes have the same fileset level, but you are more likely to run
into this situation while doing a node-by-node rolling PTF upgrade. These types of errors will prevent
successful cluster startup.
When starting your cluster in this situation, ignore verification errors. You can do this by entering the
following SMIT path: smit sysmirror > System Management (C-SPOC) > Manage PowerHA
SystemMirror Services > Start Cluster Services.
Within this panel, change Ignore verification errors? (default false) to true.
You can then start your cluster despite the errors reported by the clverify program.
Note: Make sure your nodes are at equal fileset levels as soon as possible to avoid having to perform this
procedure. Ignoring verification errors should be avoided.
Problem
The redefinevg, varyonvg, lqueryvg, and syncvg commands fail and report errors against a shared
volume group during system restart. These commands send messages to the console when automatically
varying on a shared volume group. When configuring the volume groups for the shared disks,
autovaryon at boot was not disabled. If a node that is up owns the shared drives, other nodes attempting
to vary on the shared volume group will display various varyon error messages.
Solution
When configuring the shared volume group, set the Activate volume group AUTOMATICALLY at
system restart? field to no on the SMIT System Management (C-SPOC) > PowerHA SystemMirror
Logical Volume Management > Shared Volume Groups > Create a Shared Volume Group panel. After
Problem
The PowerHA SystemMirror software (the /var/hacmp/log/[Link] file) indicates that the varyonvg
command failed when trying to vary on a volume group.
Solution
Ensure that the volume group is not set to autovaryon on any node and that the volume group (unless it
is in concurrent access mode) is not already varied on by another node.
The lsvg -o command can be used to determine whether the shared volume group is active. Enter:
lsvg volume_group_name
on the node that has the volume group activated, and check the AUTO ON field to determine whether
the volume group is automatically set to be on. If AUTO ON is set to yes, correct this by entering:
chvg -an volume_group_name
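For example, assuming a shared volume group named sharedvg:
lsvg -o                          # confirm whether sharedvg is listed as active
lsvg sharedvg | grep "AUTO ON"   # check the autovaryon setting
chvg -an sharedvg                # disable automatic varyon at system restart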
Problem 2
The volume group information on disk differs from that in the Device Configuration Data Base.
Solution 2
Correct the Device Configuration Data Base on the nodes that have incorrect information:
1. Use the smit exportvg fastpath to export the volume group information. This step removes the
volume group information from the Device Configuration Data Base.
2. Use the smit importvg fastpath to import the volume group. This step creates a new Device
Configuration Data Base entry directly from the information on disk. After importing, be sure to
change the volume group to not autovaryon at the next system boot.
3. Use the SMIT Problem Determination Tools > Recover From PowerHA SystemMirror Script Failure
panel to issue the clruncmd command to signal the Cluster Manager to resume cluster processing.
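The command-line equivalent of steps 1 and 2, assuming a volume group named sharedvg whose disks
include hdisk2, is similar to the following:
exportvg sharedvg             # remove the volume group from the Device Configuration Data Base
importvg -y sharedvg hdisk2   # rebuild the entry from the information on disk
chvg -an sharedvg             # prevent autovaryon at the next system boot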
Problem 3
The PowerHA SystemMirror software indicates that the varyonvg command failed because the volume
group could not be found.
Solution 3
The volume group is not defined to the system. If the volume group has been newly created and
exported, or if a mksysb system backup has been restored, you must import the volume group. Follow
the steps described in Problem 2 to verify that the correct volume group name is being referenced.
Problem 4
The PowerHA SystemMirror software indicates that the varyonvg command failed because the logical
volume <name> is incomplete.
Solution 4
This indicates that the forced varyon attribute is configured for the volume group in SMIT, and that when
attempting a forced varyon operation, PowerHA SystemMirror did not find a single complete copy of the
specified logical volume for this volume group.
Also, it is possible that you requested a forced varyon operation but did not specify the super strict
allocation policy for the mirrored logical volumes. In this case, the success of the varyon command is not
guaranteed.
Related information:
Configuring HACMP resource groups (extended)
Planning shared LVM components
Problem
The /var/hacmp/log/[Link] file shows that the cl_nfskill command fails when attempting to perform
a forced unmount of an NFS-mounted file system. NFS provides certain levels of file system locking
that resist forced unmounting by the cl_nfskill command.
Solution
Make a copy of the /etc/locks file in a separate directory before executing the cl_nfskill command. Then
delete the original /etc/locks file and run the cl_nfskill command. After the command succeeds, re-create
the /etc/locks file using the saved copy.
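A sketch of this procedure (the backup location is arbitrary):
cp -p /etc/locks /tmp/locks.save    # save a copy of the file
rm /etc/locks                       # remove the original
# run the cl_nfskill command as described above
cp -p /tmp/locks.save /etc/locks    # re-create the file from the saved copy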
Problem
The cl_scdiskreset command logs error messages to the /var/hacmp/log/[Link] file. To break the
reserve held by one system on a SCSI device, the PowerHA SystemMirror disk utilities issue the
cl_scdiskreset command. The cl_scdiskreset command may fail if back-level hardware exists on the SCSI
bus (adapters, cables or devices) or if a SCSI ID conflict exists on the bus.
Solution
See the appropriate sections in Using cluster log files to check the SCSI adapters, cables, and devices.
Make sure that you have the latest adapters and cables. The SCSI IDs for each SCSI device must be
different.
Related concepts:
“Using cluster log files” on page 10
These topics explain how to use the PowerHA SystemMirror cluster log files to troubleshoot the cluster.
Included also are some sections on managing parameters for some of the logs.
At boot time, AIX runs the fsck command to check all the file systems listed in /etc/filesystems with the
check=true attribute. If it cannot check a file system, AIX reports the following error:
Filesystem Helper: 0506-519 Device open failed
Solution
For file systems controlled by PowerHA SystemMirror, this message typically does not indicate a
problem. The file system check fails because the volume group defining the file system is not varied on.
The boot procedure does not automatically vary on PowerHA SystemMirror-controlled volume groups.
To prevent this message, make sure that all the file systems under PowerHA SystemMirror control do not
have the check=true attribute in their /etc/filesystems stanzas. To delete this attribute or change it to
check=false, edit the /etc/filesystems file.
Problem
The /etc/filesystems file has not been updated to reflect changes to log names for a logical volume. If you
change the name of a logical volume after the file systems have been created for that logical volume, the
/etc/filesystems entry for the log does not get updated. Thus when trying to mount the file systems, the
PowerHA SystemMirror software tries to get the required information about the logical volume name
from the old log name. Because this information has not been updated, the file systems cannot be
mounted.
Solution
Be sure to update the /etc/filesystems file after making changes to logical volume names.
Problem
Solution
Once the node is back online, export the volume group, then import it again before starting PowerHA
SystemMirror on this node.
Problem 2
The disk replacement process failed while the replacepv command was running.
Solution 2
Delete the /tmp/replacepv directory, and attempt the replacement process again.
Problem 3
The disk replacement process failed with a "no free disks" message while VPATH devices were available
for replacement.
Solution 3
Be sure to convert the volume group from VPATH devices to hdisks, and attempt the replacement
process again. When the disk is replaced, convert hdisks back to the VPATH devices.
Related information:
Managing shared LVM components
Problem
If you change the name of a file system, or remove a file system and then perform a lazy update, lazy
update does not run the imfs -lx command before running the imfs command. This may lead to a failure
during fallover or prevent a successful restart of the PowerHA SystemMirror cluster services.
Solution
Use the C-SPOC utility to change or remove file systems. This ensures that imfs -lx runs before imfs and
that the changes are updated on all nodes in the cluster.
AIX Error Reporting provides detailed information about inconsistencies in volume group state across
the cluster. If such an inconsistency occurs, take manual corrective action. If the file system changes are
not updated on all nodes, update the nodes manually with this information.
Problem
The clam_nfsv4 application monitor takes more than 60 seconds to complete. The monitor is not
responding and is stopped. Therefore, a fallover occurs on the Network File System (NFS) node. This
fallover usually occurs if the system that hosts the application monitor is experiencing high performance
workloads.
Solution
You must reduce the system workloads to correct this problem. You can also apply APAR IV08873 to
your system, which reduces the amount of time it takes to run the clam_nfsv4 application monitor script.
Related information:
clam_nfsv4 application monitor concepts
Using NFS with PowerHA SystemMirror
NFS cross-mounting in PowerHA SystemMirror
APAR IV08873: NFSV4 monitor script execution time improvements
When the repository disk fails, you are notified of the disk failure. PowerHA SystemMirror continues to
notify you of the repository disk failure until it is resolved.
To determine what the problem is with the repository disk, you can view the following log files:
v [Link]
v AIX error log (using the errpt command)
The following is an example of an error message in the [Link] log file when a repository disk fails:
When a node loses access to the repository disk, an entry is made in the AIX error log of each node that
has a problem.
The following is an example of an error message in the error log file when a repository disk fails.
Note: To view the AIX error log, you must use the errpt command.
LABEL: OPMSG
IDENTIFIER: AA8AB241
Description
OPERATOR NOTIFICATION
User Causes
Recommended Actions
REVIEW DETAILED DATA
Detail Data
MESSAGE FROM ERRLOGGER COMMAND
Error: Node 0x54628FEA1D0611E183EE001A64B90DF0 has lost access to repository disk hdisk75.
If a repository disk fails, the repository disk must be recovered on a different disk to restore all cluster
operations. The circumstances for your cluster environment and the type of the repository disk failure
determine the possible methods for recovering the repository disk.
The following are two possible scenarios where a repository disk fails and the possible methods for
restoring the repository disk on a new storage disk.
Repository disk fails but the cluster is still operational
In this scenario, repository disk access is lost on one or more nodes in the cluster. When this
failure occurs, Cluster Aware AIX (CAA) continues to operate in restricted mode by using the
repository disk information that it has cached in memory. If CAA remains active on a single
node in the cluster, the information from the previous repository disk can be used to rebuild a
new repository disk.
To rebuild the repository disk after a failure, complete the following steps from any node where
CAA is still active:
1. Verify that CAA is active on the node by using the lscluster -c command and then the
lscluster -m command.
2. Replace the repository disk by completing the steps in the Replacing a repository disk with
SMIT topic. PowerHA SystemMirror recognizes the problem and interacts with CAA to
rebuild the repository disk on the new storage disk.
Note: This step updates the repository information that is stored in the PowerHA
SystemMirror configuration data.
3. Synchronize the PowerHA SystemMirror cluster configuration information by selecting Cluster
Nodes and Networks > Verify and Synchronize Cluster Configuration from the SMIT
interface.
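For reference, the two verification commands from step 1 are run as follows:
lscluster -c    # display the CAA cluster configuration
lscluster -m    # display the state of each node in the cluster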
Repository disk fails and the nodes in the cluster rebooted
In this rare scenario, a series of critical failures occur that result in a worst case scenario where
access to the repository disk is lost and all nodes in the cluster were rebooted. Thus, none of the
nodes in the cluster remained online during the failure, and you cannot rebuild the repository
disk from the AIX operating system's memory. When the nodes are brought back online, they
cannot start CAA because a repository disk is not present in the cluster. To fix this problem, it is
ideal to bring back the repository disk and allow the cluster to self-heal. If that is not possible,
you must rebuild the repository disk on a new storage disk and use it to start the CAA cluster.
To rebuild the repository disk and start cluster services, complete the following steps:
1. On a node in the cluster, rebuild the repository disk by completing the steps in the Replacing a
repository disk with SMIT topic. PowerHA SystemMirror recognizes the problem and interacts
with CAA to rebuild the repository disk on the new storage disk.
Note: This step updates the repository information that is stored in the PowerHA
SystemMirror configuration data and rebuilds the repository disk from the CAA cluster cache
file.
Note: For AIX Version 7.1 with Technology Level 4, or later, you do not need to perform steps
3 and 4. After you complete step 2, all nodes that were rebooted must wait for about 10
minutes to use the new repository disk.
5. Verify that CAA is active by first using the lscluster -c command and then the lscluster -m
command.
6. Synchronize the PowerHA SystemMirror cluster configuration information about the newly
created repository disk to all other nodes by selecting Cluster Nodes and Networks > Verify
and Synchronize Cluster Configuration from the SMIT interface.
7. Start PowerHA SystemMirror cluster services on all nodes (besides the first node where the
repository disk was created) by selecting System Management (C-SPOC) > PowerHA
SystemMirror Services > Start Cluster Services from the SMIT interface.
The snapshot migration process for an online cluster requires that the cluster information in the snapshot
matches the online cluster information. This requirement also applies to repository disks. If you change a
repository disk configuration, you must update the snapshot to reflect these changes and then complete
the snapshot migration process.
Related information:
Planning for repository disk
Repository disk failure
Creating a snapshot of the cluster configuration
Upgrading PowerHA SystemMirror using a snapshot
Unexpected network interface failures can occur in PowerHA SystemMirror configurations that use
switched networks if the networks and the switches are incorrectly defined or configured.
Solution
To test end-to-end multicast communication for all nodes used to create the cluster on your network, run
the mping command to send and receive packets between nodes.
If you are running PowerHA SystemMirror 7.1.1, or later, you cannot create a cluster if the mping
command fails. If the mping command fails, your network is not set up correctly for multicast
communication. If so, review the documentation for your switches and routers to enable multicast
communication.
You can run the mping command with a specific multicast address; otherwise, the command uses a
default multicast address. You must use the multicast addresses that are used for creating the cluster as
input for the mping command.
Note: The mping command uses the interface that has the default route. To use the mping command to
test multicast communication on a different interface that does not have the default route, you must
temporarily add a static route with the required interface to the multicast IP address.
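For example, a temporary static route for the IPv4 multicast range through interface en1 might be
added as follows (the interface name is a placeholder, and the exact route syntax should be verified for
your AIX level):
route add -net 224.0.0.0 -netmask 240.0.0.0 -interface en1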
The following example shows a success case and a failure case for the mping command, where node A is
the receiver and node B is the sender.
Success case:
Receiver
root@nodeA:/# mping -r -R -c 5
mping version 1.1
Listening on [Link]/4098:
Sender
root@nodeB:/# mping -R -s -c 5
mping version 1.1
mpinging [Link]/4098 with ttl=1:
Failure case:
Receiver
root@nodeA:/# mping -r -R -c 5 -6
mping version 1.1
Listening on ff05::7F01:0101/4098:
Sender
root@nodeB:/# mping -R -s -c 5 -6
mping version 1.1
mpinging ff05::7F01:0101/4098 with ttl=1:
Note: To verify a result, check only the sender side of the mping command, and note the percentage of
packet loss. To verify whether multicast is working on a network, perform the mping tests with each
node acting as both the sender and the receiver. Typically, the non-verbose output provides the
necessary information. If you choose to use the -v flag with the mping command, you need a good
knowledge of the program's internals; without it, the verbose output can be misinterpreted. You can
also check the return code from the sender side of the mping command: on error, the sender side
returns 255; on success, it returns 0.
Cluster Aware AIX (CAA) selects a default multicast address if you do not specify a multicast address
when you create the cluster. The default multicast address is formed by performing a logical OR of the
value ([Link]) with the lower 24 bits of the IP address of the node. For example, if the IP address is
[Link], then the default multicast address would be [Link].
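As a hedged illustration, assuming the combining value is 228.0.0.0: for a node IP address of 10.50.30.55, the lower 24 bits (50.30.55) are OR'd into 228.0.0.0, producing the default multicast address 228.50.30.55.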
The Internet Protocol version 6 (IPv6) addresses are supported by PowerHA SystemMirror 7.1.2, or later.
When IPv6 addresses are configured in the cluster, Cluster Aware AIX (CAA) activates heartbeating for
the IPv6 addresses with an IPv6 multicast address. You must verify that the IPv6 connections in your
environment can communicate with multicast addresses.
To verify that IPv6 multicast communications are configured correctly in your environment, you can run
the mping command with the -6 option. When you run the mping command, it verifies the IPv6
multicast communications with the default IPv6 multicast address. To specify a specific IPv6 multicast
address, run the mping command with the -a option and specify an IPv6 multicast address. You do not
need to specify the -6 option when using the -a option. The mping command automatically determines
the family of the address passed with the -a option.
Related information:
Troubleshooting Cisco multicast switches
Multicast support for Cisco switches
Note: If your network infrastructure does not allow IGMP snooping to be disabled permanently, you
might be able to troubleshoot problems by temporarily disabling snooping on the switches and then
adding additional network components one at a time.
v Eliminate any cascaded switches between the nodes in the cluster. In other words, have only a single
switch between the nodes in the cluster.
Related information:
Troubleshooting Cisco multicast switches
Multicast support for Cisco switches
Troubleshooting unicast
By default, PowerHA SystemMirror uses unicast, socket-based communication between nodes in the
cluster.
If you are having problems with unicast communications, follow general network troubleshooting
procedures. For example:
v Use the ifconfig and netstat commands to verify the IP address configuration and routing.
v Use the ping and traceroute commands to verify that nodes and adapters can communicate.
v If the preceding steps do not identify the problem, use the iptrace command to trace low-level packet
activity, as in the sketch that follows this list.
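A hedged sketch of this sequence (the peer node name nodeB, the interface en0, and the trace file paths are hypothetical). First, verify the IP address configuration and routing on the local node:
ifconfig en0
netstat -rn
Next, verify that the peer node can be reached:
ping -c 5 nodeB
traceroute nodeB
If the problem is still not identified, trace low-level packet activity, reproduce the problem, and then format the trace:
startsrc -s iptrace -a "/tmp/trace.out"
stopsrc -s iptrace
ipreport /tmp/trace.out > /tmp/trace.rpt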
To configure IPv6 addresses after a reboot, you can manually run the autoconf6 command. Alternatively,
PowerHA SystemMirror will run the autoconf6 command automatically before starting cluster services.
To configure the autoconf6 command to run automatically for the AIX operating system, complete the
following steps to change the /etc/[Link] file:
1. Uncomment the following lines to run the autoconf6 command:
# Start up autoconf6 process
start /usr/sbin/autoconf6
Note: You can specify individual interfaces by entering the -i flag. For example,
# Start up autoconf6 process
start /usr/sbin/autoconf6 "" "-i en1"
2. Uncomment the following lines to start the ndpd daemons:
# Start up ndpd-host daemon
start /usr/sbin/ndpd-host "$src_running"
Troubleshooting VLANs
This topic discusses troubleshooting interface failure in Virtual Local Area Networks.
Problem
Interface failures occur in virtual LAN networks (hereafter referred to as VLANs, Virtual Local Area
Networks).
Solution
To troubleshoot VLAN interfaces defined to PowerHA SystemMirror and detect an interface failure,
treat these interfaces as interfaces defined on single-adapter networks.
In particular, list the network interfaces that belong to a VLAN in the ping_client_list variable in the
/usr/es/sbin/cluster/etc/[Link] script and run clinfo. This way, whenever a cluster event occurs, clinfo
monitors and detects a failure of the listed network interfaces. Due to the nature of Virtual Local Area
Networks, other mechanisms to detect the failure of network interfaces are not effective.
Problem
If your configuration has two or more nodes connected by a single network, you may experience a
partitioned cluster. A partitioned cluster occurs when cluster nodes cannot communicate. In normal
circumstances, a service network interface failure on a node causes the Cluster Manager to recognize and
handle a swap_adapter event, where the service IP label/address is replaced with another IP
label/address. Heartbeats are exchanged via the shared disks. However, there is a chance the node
becomes isolated from the cluster. Although the Cluster Managers on other nodes are aware of the
attempted swap_adapter event, they cannot communicate with the now isolated (partitioned) node
because no communication path exists.
Solution
Problem
Using the AIX utility DSMIT for operations other than starting or stopping PowerHA SystemMirror
cluster services can cause unpredictable results.
Solution
DSMIT manages the operation of networked IBM System p processors. It includes the logic necessary to
control execution of AIX commands on all networked nodes. Since a conflict with PowerHA
SystemMirror functionality is possible, use DSMIT only to start and stop PowerHA SystemMirror cluster
services.
Problem
If an unrecoverable error causes a PCI hot-replacement process to fail, the NIC may be left in an
unconfigured state and the node may be left in maintenance mode. The PCI slot holding the NIC and/or
the new NIC may be damaged at this point.
Solution
User intervention is required to get the node back in fully working order.
Related information:
Operating system and device management
Problem
When you define network interfaces to the cluster configuration by entering or selecting a PowerHA
SystemMirror IP label, PowerHA SystemMirror discovers the associated AIX network interface name.
PowerHA SystemMirror expects this relationship to remain unchanged. If you change the AIX
network interface name after configuring and synchronizing the cluster, PowerHA SystemMirror will
not function correctly.
Solution
If this problem occurs, you can reset the network interface name from the SMIT PowerHA SystemMirror
System Management (C-SPOC) panel.
Related information:
Managing the cluster resources
Problem
If data is intermittently lost during transmission, the maximum transmission unit (MTU) might be set
to different sizes on different nodes. For example, if Node A sends 8 K packets to Node B, which can
accept only 1.5 K packets, Node B assumes the message is complete even though data might have
been lost.
Solution
Run the cluster verification utility to ensure that all of the network interface cards on all cluster nodes
on the same network have the same setting for MTU size. If the MTU size is inconsistent across the
network, an error is displayed, which enables you to determine which nodes to adjust.
Note: You can change the MTU size by using the following command:
chdev -l en0 -a mtu=<new_value_from_1_to_8>
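Before changing the MTU size, you can compare the current settings on each node; for example (en0 is an example interface):
lsattr -El en0 -a mtu
netstat -in
The first command shows the MTU attribute of the interface; the second lists the MTU for all configured interfaces.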
Problem
Encryption or decryption fails after security is enabled, and clcomd daemon communication fails across
nodes. To verify whether encryption or decryption failed, view the [Link] file.
Solution
Disable security by using SMIT from the master node or any node, and then stop and start the PowerHA
SystemMirror communication daemon on all nodes.
Verify that the cluster node has the following file sets installed before enabling security:
v For data encryption with DES message authentication: [Link]
v For data encryption with Triple DES message authentication: [Link].3des
v For data encryption with Advanced Encryption Standard (AES) message authentication:
[Link].aes256. You must also have the clic version 4.7 file set installed.
If needed, install these file sets from the AIX Expansion Pack CD-ROM.
If the file sets are installed after PowerHA SystemMirror is already running, stop and start the PowerHA
SystemMirror Cluster Communications daemon so that PowerHA SystemMirror can use these file sets.
To restart the Cluster Communications daemon:
stopsrc -s clcomd
startsrc -s clcomd
If the file sets are present, and you get an encryption error, the encryption file sets may have been
installed, or reinstalled, after PowerHA SystemMirror was running. In this case, restart the Cluster
Communications daemon as described above.
Problem
Cluster nodes are unable to communicate with each other, and you have one of the following configured:
v Message authentication, or message authentication and encryption enabled
v Use of persistent IP labels for VPN tunnels.
Solution
Make sure that the network is operational; see the section Network and switch issues.
Check if the cluster has persistent IP labels. If it does, make sure that they are configured correctly and
that you can ping the IP label.
If you are investigating resource group movement in PowerHA SystemMirror and want to know why an
rg_move event has occurred, you should always check the /var/hacmp/log/[Link] file. In general,
given the changes in the way resource groups are handled and prioritized in fallover circumstances,
the [Link] file and its event summaries have become even more important in tracking the activity
and resulting location of your resource groups. In addition, with parallel processing of resource groups,
the [Link] file reports details that cannot be seen in the cluster history log or the [Link] log file.
Always check the [Link] log early when investigating resource group movement after takeover
activity.
Problem
The PowerHA SystemMirror software failed to vary on a shared volume group. The volume group name
is either missing or is incorrect in the PowerHA SystemMirror Configuration Database object class.
Solution
v Check the /var/hacmp/log/[Link] file to find the error associated with the varyonvg failure.
v List all the volume groups known to the system using the lsvg command; then check that the volume
group names used in the PowerHA SystemMirror resource Configuration Database object class are
correct. To change a volume group name in the Configuration Database, from the main PowerHA
SystemMirror SMIT panel select Initialization and Standard Configuration > Configure PowerHA
SystemMirror Resource Groups > Change/Show Resource Groups, and select the resource group
where you want the volume group to be included. Use the Volume Groups or Concurrent Volume
Groups fields on the Change/Show Resources and Attributes for a Resource Group panel to set the
volume group names. After you correct the problem, use the SMIT Problem Determination Tools >
Recover From PowerHA SystemMirror Script Failure panel to issue the clruncmd command to signal
the Cluster Manager to resume cluster processing.
v Run the cluster verification utility to verify cluster resources.
Problem
An application that a user manually stopped after stopping cluster services with resource groups placed
in an UNMANAGED state does not restart when the node reintegrates.
Solution
Check that the relevant application entry in the /usr/es/sbin/cluster/[Link] file has been removed
prior to node reintegration.
Since an application entry in the /usr/es/sbin/cluster/[Link] file lists all applications already
running on the node, PowerHA SystemMirror will not restart the applications with entries in the
[Link] file.
Deleting the relevant application [Link] entry before reintegration allows PowerHA SystemMirror
to recognize that the highly available application is not running, and that it must be restarted on the
node.
Problem
PowerHA SystemMirror fails to selectively move the affected resource group to another cluster node
when a volume group quorum loss occurs.
Solution
If quorum is lost for a volume group that belongs to a resource group on a cluster node, the system
checks whether the LVM_SA_QUORCLOSE error appeared in the node's AIX error log file and informs
the Cluster Manager to selectively move the affected resource group. PowerHA SystemMirror uses this
error notification method only for mirrored volume groups with quorum enabled.
If fallover does not occur, check that the LVM_SA_QUORCLOSE error appeared in the AIX error log.
When the AIX error log buffer is full, new entries are discarded until buffer space becomes available and
an error log entry informs you of this problem. To resolve this issue, increase the size of the AIX error log
internal buffer for the device driver.
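For example, a hedged sketch (the buffer size shown is illustrative; choose a value appropriate for your system):
/usr/lib/errdemon -B 65536
/usr/lib/errdemon -l
The first command increases the in-memory error log buffer; the second displays the current error log settings so that you can confirm the change.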
Problem
A Group Services merge message is displayed and the node receiving the message shuts itself down. You
see a GS_DOM_MERGE_ER error log entry, as well as a message in the Group Services daemon log file:
"A better domain XXX has been discovered, or domain master requested to dissolve the domain."
A Group Services merge message is sent when a node loses communication with the cluster and then
tries to reestablish communication.
Solution
Because it may be difficult to determine the state of the missing node and its resources (and to avoid a
possible data divergence if the node rejoins the cluster), you should shut down the node and successfully
complete the takeover of its resources.
For example, if a cluster node becomes unable to communicate with other nodes, yet it continues to work
through its process table, the other nodes conclude that the "missing" node has failed because they no
longer receive keepalive messages from it. The remaining nodes then process the necessary events to
acquire the disks, IP addresses, and other resources from the "missing" node.
As the disks are being acquired by the takeover node (or after the disks have been acquired and
applications are running), the "missing" node completes its process table (or clears an application
problem) and attempts to resend keepalive messages and rejoin the cluster. Since the disks and IP address
have been successfully taken over, it becomes possible to have a duplicate IP address on the network and
the disks may start to experience extraneous traffic on the data bus.
Because the reason for the "missing" node remains undetermined, you can assume that the problem may
repeat itself later, causing additional downtime of not only the node but also the cluster and its
applications. Thus, to ensure the highest cluster availability, GS merge messages should be sent to any
"missing" cluster node to identify node isolation, to permit the successful takeover of resources, and to
eliminate the possibility of data corruption that can occur if both the takeover node and the rejoining
"missing" node attempt to write to the disks. Also, if two nodes exist on the network with the same IP
address, transactions may be missed and applications may hang.
When you have a partitioned cluster, the node(s) on each side of the partition detect this and run a
node_down for the node(s) on the opposite side of the partition. If, while this is running or after
communication is restored, the two sides of the partition do not agree on which nodes are still members
of the cluster, a decision is made as to which partition should remain up; the other partition is
shut down by a GS merge from nodes in the surviving partition or by a node sending a GS merge to itself.
In clusters consisting of more than two nodes, the decision is based on which partition has the most
nodes left in it, and that partition stays up. With an equal number of nodes in each partition (as is always
the case in a two-node cluster), the node(s) that remain up are determined by the node number (the
lowest node number in the cluster remains), which is also generally the first in alphabetical order.
Group Services domain merge messages indicate that a node isolation problem was handled to keep the
resources as highly available as possible, giving you time to later investigate the problem and its cause.
When a domain merge occurs, Group Services and the Cluster Manager exit. The [Link] file will
contain the following error:
"announcementCb: GRPSVCS announcement code=n; exiting"
"CHECK FOR FAILURE OF RSCT SUBSYSTEMS (topsvcs or grpsvcs)"
Problem
SMIT commands like Configure Devices Added After IPL use the cfgmgr command. Sometimes this
command can cause unwanted behavior in a cluster. For instance, if there has been a network interface
swap, the cfgmgr command tries to reswap the network interfaces, causing the Cluster Manager to fail.
Solution
See the Installation Guide for information about modifying [Link], thereby bypassing the issue. You can use
this technique at all times, not just for IP address takeover, but it adds to the overall takeover time, so it
is not recommended.
Related information:
Installing PowerHA SystemMirror
Problem
A network interface swap fails due to an rmdev device busy error. For example, /var/hacmp/log/[Link]
shows a message similar to the following:
Method error (/etc/methods/ucfgdevice):
0514-062 Cannot perform the requested function because the specified device is busy.
Solution
Check to see whether the following applications are being run on the system. These applications may
keep the device busy:
v SNA
Use the following commands to see if SNA is running:
lssrc -g sna
Use the following command to stop SNA:
stopsrc -g sna
If that does not work, use the following command:
stopsrc -f -s sna
If that does not work, use the following command:
/usr/bin/sna -stop sna -t forced
If that does not work, use the following command:
/usr/bin/sna -stop sna -t cancel
v Netview / Netmon
Ensure that the sysmond daemon has been started with the -H flag. This causes the network interface
to be opened and closed each time SM/6000 reads the status, and allows the cl_swap_HW_address
script to succeed when it runs the rmdev command (after the ifconfig detach) before swapping the
hardware address.
Use the following command to stop all Netview daemons:
/usr/OV/bin/nv6000_smit stopdaemons
v IPX
Use the following commands to see if IPX is running:
ps -ef | grep npsd
ps -ef | grep sapd
Use the following command to stop IPX:
/usr/lpp/netware/bin/stopnps
v NetBIOS
Use the following commands to see if NetBIOS is running:
ps -ef | grep netbios
Use the following commands to stop NetBIOS and unload NetBIOS streams:
mcsadm stop; mcs0 unload
– Unload various streams if applicable (that is, if the file exists):
cd /etc
strload -uf /etc/[Link]
strload -uf /etc/[Link]
strload -uf /etc/[Link]
strload -uf /etc/[Link]
– Some customer applications will keep a device busy. Ensure that the shared applications have been
stopped properly.
Problem
The client cannot connect to the cluster. The ARP cache on the client node still contains the address of the
failed node, not the fallover node.
Solution
Issue a ping command to the client from a cluster node to update the client's ARP cache. Be sure to
include the client name as the argument to this command. The ping command will update a client's ARP
cache even if the client is not running clinfoES. You might need to add a call to the ping command in
your application's pre-event or post-event processing scripts to automate this update on specific clients.
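For example (clienta is a hypothetical client host name):
ping -c 1 clienta
A single ping from the cluster node is enough to refresh the client's ARP entry for the takeover address.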
Problem
Solution
Check the /etc/hosts file on the node on which SNMP failed to ensure that it contains IP labels or
addresses of cluster nodes. Also see Clients cannot find clusters.
Related reference:
“Clients cannot find clusters”
This topic describes a situation where the clstat utility running on a client cannot find any clusters.
Problem
The clstat utility running on a client cannot find any clusters. The clinfoES daemon has not properly
managed the data structures it created for its clients (like clstat) because it has not located an SNMP
process with which it can communicate. Because clinfoES obtains its cluster status information from
SNMP, it cannot populate the PowerHA SystemMirror MIB if it cannot communicate with this daemon.
As a result, a variety of intermittent problems can occur between SNMP and clinfoES.
Solution
Create an updated client-based clhosts file by running verification with automatic corrective actions
enabled. This produces a [Link] file on the server nodes. Copy this file to the
/usr/es/sbin/cluster/etc/ directory on the clients, renaming the file clhosts. The clinfoES daemon uses the
addresses in this file to attempt communication with an SNMP process executing on a PowerHA
SystemMirror server.
Also, check the /etc/hosts file on the node on which the SNMP process is running and on the node
having problems with clstat or other clinfo API programs.
Problem
The service and boot addresses of the cluster node from which clinfoES was started do not exist in the
client-based clhosts file.
Solution
Create an updated client-based clhosts file by running verification with automatic corrective actions
enabled. This produces a [Link] file on the server nodes. Copy this file to the
/usr/es/sbin/cluster/etc/ directory on the clients, renaming the file clhosts. Then run the clstat command.
Problem
Even though the node is down, the SNMP daemon and clinfoES report that the node is up. All the
node's interfaces are listed as down.
Solution
When one or more nodes are active and another node tries to join the cluster, the current cluster nodes
send information to the SNMP daemon that the joining node is up. If, for some reason, the node fails to
join the cluster, clinfoES does not send another message to the SNMP daemon to report that the node is
down.
To correct the cluster status information, restart the SNMP daemon, using the options on the PowerHA
SystemMirror Cluster Services SMIT panel.
Miscellaneous issues
These topics describe potential non-categorized PowerHA SystemMirror issues.
If you are investigating resource group movement in PowerHA SystemMirror and want to know why an
rg_move event occurred, you should always check the /var/hacmp/log/[Link] file. In general, given
the changes in the way resource groups are handled and prioritized in fallover circumstances, the
[Link] file and its event summaries have become even more important in tracking the activity and
resulting location of your resource groups. In addition, with parallel processing of resource groups, the
[Link] file reports details that are not seen in the cluster history log or the [Link] file. Always
check this log early when investigating resource group movement after takeover activity.
Problem
Only script start messages appear in the /var/hacmp/log/[Link] file. The script specified in the
message is not executable, or the DEBUG level is set to low.
Solution
Add executable permission to the script by using the chmod command, and make sure the DEBUG level
is set to high.
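For example (the script path is hypothetical):
chmod +x /usr/local/cluster/scripts/app_start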
Problem
You get the following message regardless of whether or not you have configured Auto Error Notification:
"Remember to redo automatic error notification if configuration
has changed."
Solution
Ignore this message if you have not configured Auto Error Notification.
This message appears each time a cluster event takes more time to complete than a specified time-out
period.
In versions prior to 4.5, the time-out period was fixed for all cluster events and set to 360 seconds by
default. If a cluster event, such as a node_up or a node_down event, lasted longer than 360 seconds, then
every 30 seconds PowerHA SystemMirror issued a config_too_long warning message that was logged in
the [Link] file.
In PowerHA SystemMirror you can customize the time period allowed for a cluster event to complete
before PowerHA SystemMirror issues a system warning for it.
Starting with version 4.5, for each cluster event that does not complete within the specified event
duration time, config_too_long messages are logged in the [Link] file and sent to the console
according to the following pattern:
v The first five config_too_long messages appear in the [Link] file at 30-second intervals.
v The next set of five messages appears at an interval that is double the previous interval, until the
interval reaches one hour.
v These messages are logged every hour until the event completes or is terminated on that node.
Problem
Activities that the script is performing take longer than the specified time to complete; for example, this
can happen with events involving many disks or complex scripts.
Solution
v Determine what is taking so long to execute, and correct or streamline that process if possible.
v Increase the time to wait before calling config_too_long.
You can customize Event Duration Time using the Change/Show Time Until Warning panel in SMIT.
Access this panel through the Extended Configuration > Extended Event Configuration SMIT panel.
Problem
A command hangs and the event script is waiting for it before resuming execution. If so, you can
probably see the command in the AIX process table (ps -ef). It is most likely the last command in the
/var/hacmp/log/[Link] file before the config_too_long script output.
Solution
Problem
The foreground startup process is specified for an application controller start script, but that script is not
exiting.
Note: This problem only exists if you are using PowerHA SystemMirror 7.1.1, or later.
Solution
Examine the start script to see if it is functioning properly. If there is any possibility of the script hanging,
consider using a combination of the background startup option, along with a startup monitor instead of
foreground startup.
Related reference:
“Dynamic reconfiguration sets a lock” on page 79
This topic discusses a situation where an error message is generated when attempting a dynamic
reconfiguration.
Related information:
Tuning event duration time until warning
Problem
The /etc/syslog.conf file has been changed to send the [Link] output to /dev/console.
Solution
Edit the /etc/syslog.conf file to redirect the [Link] output to /usr/tmp/[Link]. The [Link] file
is the default location for logging messages.
Solution
To prevent unplanned system reboots from disrupting a fallover in your cluster environment, all nodes in
the cluster should either have the Automatically REBOOT a system after a crash field on the
Change/Show Characteristics of Operating System SMIT panel set to false, or you should keep the IBM
System p key in Secure mode during normal operation.
Both measures prevent a system from rebooting if the shutdown command is issued inadvertently.
Without one of these measures in place, if an unplanned reboot occurs the activity against the disks on
the rebooting node can prevent other nodes from successfully acquiring the disks.
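The automatic reboot setting can also be changed from the command line; a hedged sketch:
chdev -l sys0 -a autorestart=false
lsattr -El sys0 -a autorestart
The first command disables automatic reboot after a crash; the second verifies the setting.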
Problem
Solution
To rebuild the NetView database, perform the following steps on the NetView server:
1. Stop all NetView daemons:
/usr/OV/bin/ovstop -a
2. Remove the database from the NetView server:
rm -rf /usr/OV/database/*
3. Start the NetView object database:
/usr/OV/bin/ovstart ovwdb
4. Restore the NetView/HAView fields:
/usr/OV/bin/ovw -fields
5. Start all NetView daemons:
/usr/OV/bin/ovstart -a
Problem
Solution
Help can be displayed only if the LANG variable is set to one of the languages supported by PowerHA
SystemMirror, and if the associated PowerHA SystemMirror message catalogs are installed. The
languages supported by PowerHA SystemMirror are:
v en_US
v ja_JP
Since the LANG environment variable determines the active locale, if LANG=en_US, the locale is en_US.
Problem
In PowerHA SystemMirror, event summaries are pulled from the [Link] file and stored in the
cl_event_summary.txt file. This file continues to grow as [Link] cycles and is not automatically
truncated or replaced. Consequently, it can become too large and crowd your /usr directory.
Solution
Clear event summaries periodically, using the Problem Determination Tools > PowerHA SystemMirror
Log Viewing and Management > View/Save/Remove PowerHA SystemMirror Event Summaries >
Remove Event Summary History option in SMIT.
View event summaries does not display resource group information as expected
This topic discusses how View event summaries does not display resource group information as
expected.
Problem
In PowerHA SystemMirror, event summaries are pulled from the [Link] file and can be viewed using
the Problem Determination Tools > PowerHA SystemMirror Log Viewing and Management >
View/Save/Delete Event Summaries > View Event Summaries option in SMIT. This display includes
resource group status and location information at the end. The resource group information is gathered by
clRGinfo; gathering it may take extra time if the cluster is not running when you run the View Event
Summaries option.
Solution
clRGinfo displays resource group information more quickly when the cluster is running.
If the cluster is not running, wait a few minutes and the resource group information will eventually
appear.
Problem
Checking the state of an application monitor. In some circumstances, it may not be clear whether an
application monitor is currently running. To check the state of an application monitor, run the
following command:
ps -ef | grep <application controller name> | grep clappmond
This command produces a long line of verbose output if the application is being monitored.
Solution
If the application monitor is not running, there may be a number of reasons, including:
v No monitor has been configured for the application controller
v The monitor has not started yet because the stabilization interval has not completed
v The monitor is in a suspended state
v The monitor was not configured properly
v An error has occurred.
Check to see that a monitor has been configured, the stabilization interval has passed, and the monitor
has not been placed in a suspended state, before concluding that something is wrong.
If something is clearly wrong, reexamine the original configuration of the monitor in SMIT and
reconfigure as needed.
Problem 2
Application monitor does not perform specified failure action. The specified failure action does not occur
even when an application has clearly failed.
Solution 2
Check the Restart Interval. If it is set too short, the Restart Counter may be reset to zero too quickly, resulting
in an endless series of restart attempts and no other action taken.
Problem 3
Application monitor does not always indicate that the application is working correctly.
Solution 3
v Check that the monitor is written to return the correct exit code in all cases. The return value must be
zero if the application is working fine, and it must be a non-zero value if the application has failed.
v Check all possible paths through the code, including error paths, to make sure that the exit code is
consistent with the application state. A minimal sketch of such a monitor follows this list.
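A minimal hedged sketch of a custom monitor method that follows this exit-code convention (the daemon name myappd and the script itself are hypothetical):
#!/bin/ksh
# Custom application monitor method: exit 0 if the application is
# healthy, and exit non-zero if it has failed.
if ps -ef | grep -v grep | grep -q myappd ; then
    exit 0
fi
exit 1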
Problem 4
Solution 4
Check the log files that are created by the monitor. The monitor can log messages by printing them to
standard output (stdout). For long-running monitors, the output is stored in a file in the
/var/hacmp/log/ directory whose name includes the application monitor name and the resource group
name. For startup monitors, the output is stored in a file in the /var/hacmp/log/ directory whose name
includes the application server name and the resource group name. The monitor log files are overwritten
each time the application monitor runs.
Problem
The disk replacement process fails while the replacepv command is running.
Solution
Be sure to delete the /tmp/replacepv directory, and attempt the replacement process again.
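For example:
rm -rf /tmp/replacepv
This removes the state directory left behind by the failed replacepv run so that the replacement can start cleanly.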
Problem
In [Link], you see that an rg_move event processes multiple non-concurrent resource groups in one
operation.
Solution
This is the expected behavior. In clusters with dependencies, PowerHA SystemMirror processes all
resource groups upon node_up events, via rg_move events. During a single rg_move event, PowerHA
SystemMirror can process multiple non-concurrent resource groups within one event.
Related reference:
“Processing in clusters with dependent resource groups or sites” on page 29
Resource groups in clusters that are configured with dependent groups or sites are handled with
dynamic event phasing.
Problem
A file system is not unmounted properly during an event such as when you stop cluster services with the
option to bring resource groups offline.
Solution
One of the more common reasons a file system fails to unmount when you stop cluster services with
the option to bring resource groups offline is that the file system is busy. To unmount a file system
successfully, no processes or users can be accessing it at the time. If a user or process is holding it, the
file system will be "busy" and will not unmount.
The same issue may result if a file has been deleted but is still open.
The script that stops an application should also check that the shared file systems are not in use, and
that no files in them are deleted but still held open. You can do this by using the fuser command. The
script should use the fuser command to see what processes or users are accessing the file systems in question.
The PIDs of these processes can then be acquired and killed. This will free the file system so it can be
unmounted.
Refer to the AIX man pages for complete information on this command.
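A hedged sketch of such a check in an application stop script (/sharedfs is a hypothetical mount point). First, list the processes, with their owning users, that hold files open in the file system:
fuser -cu /sharedfs
Then kill those processes so that the file system can be unmounted:
fuser -cku /sharedfs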
When attempting a dynamic reconfiguration (DARE) operation, an error message may be generated
regarding a DARE lock if another DARE operation is in process, or if a previous DARE operation did not
complete properly.
The error message suggests that one should take action to clear the lock if a DARE operation is not in
process. "In process" here refers to another DARE operation that may have just been issued, but it also
refers to any previous DARE operation that did not complete properly.
Solution
The first step is to examine the /var/hacmp/log/[Link] logs on the cluster nodes to determine the
reason for the previous DARE failure. A config_too_long entry will likely appear in [Link] where an
operation in an event script took too long to complete. If the [Link] log indicates that a script failed to
complete due to some error, correct this problem and manually complete the remaining steps that are
necessary to complete the event.
Run the PowerHA SystemMirror SMIT Problem Determination Tools > Recover from PowerHA
SystemMirror Script Failure option. This should bring the nodes in the cluster to the next complete
event state.
You can clear the DARE lock by selecting the PowerHA SystemMirror SMIT option Problem
Determination Tools > Release Locks Set by Dynamic Configuration if the PowerHA SystemMirror
SMIT Recover from PowerHA SystemMirror Script Failure step did not do so.
Problem
Solution
1. Verify that the node in question is WPAR-capable. An AIX node with WPAR capability should have
the [Link] fileset installed. If the node is not WPAR-capable, then the resource group will not run
in the WPAR. Issue the following command to check if this fileset is installed:
lslpp -L "[Link]"
2. On the specified node, verify there is a WPAR with the same name as the WPAR-enabled resource
group. Use the lswpar <resource group name> command to check this. If there is no WPAR with the
specified name, create it using the mkwpar command. After creating a WPAR, make sure that all the
user-defined scripts associated with the WPAR-enabled resource group are accessible within the
WPAR.
3. Ensure that the file systems on the node are not full. If they are full, free up some disk space by moving some
files to external storage.
4. Verify that the rsh service is enabled in the corresponding WPAR. This can be done as follows:
v Check that the inetd service is running in the WPAR by issuing the following command in the
WPAR:
lssrc -s inetd
If the inetd service is not active, then start the service using the startsrc command.
v Make sure that rsh is listed as a known service in the /etc/[Link] file in the WPAR.
The Simple Network Management Protocol (SNMP) provides access to a database of status and
configuration variables referred to as the Management Information Base (MIB). The SNMP subsystem
provided with base AIX provides a subset of the overall MIB, and can also work with peer daemons that
provide access to other portions of the MIB. The SystemMirror cluster manager daemon acts as such a
peer and provides access to the SystemMirror specific variables in the MIB.
When you experience problems with SNMP or the utilities that rely on it, first verify that the
basic SNMP configuration is functioning, and then check the SystemMirror-specific function.
You can check for the basic function of SNMP by using the snmpinfo command. Use the snmpinfo -m
dump command to display the default part of the MIB. If this command does not produce any output,
there is a problem with the base setup of SNMP and the snmpd subsystem itself. Check to ensure that
the snmpd subsystem is running and follow the steps in the following sections to make sure that the
basic snmpinfo command is working.
Once you have verified that the basic function is working, you can query the SystemMirror specific
portion of the MIB with the following command:
snmpinfo -m dump -v -o /usr/sbin/cluster/[Link] risc6000clsmuxpd
If the preceding command does not produce any output (and snmpinfo -m dump does), the problem is
specific to the SystemMirror portion of the MIB. Follow the steps below to verify the status and
configuration of the SystemMirror-specific components.
Problem
There are two common issues with the [Link] file that is shipped with the AIX operating system.
They are as follows:
v Access to the internet portion of the SNMP Management Information Base (MIB) is commented out.
v In PowerHA SystemMirror 7.1.2, there is no COMMUNITY entry for the IPv6 loopback address.
Complete the steps in the “Troubleshooting common SNMP problems” section to resolve these issues.
However, even after the first two issues have been fixed, other issues could still interfere with the proper
working of the SNMP-based status commands. Complete the steps in the “Troubleshooting SNMP status
commands” on page 83 section to resolve these issues. If the status commands still fail, complete the
steps in the “Troubleshooting [Link] file” on page 84 section to resolve the rest of the issues.
Solution
This topic helps to resolve the two common SNMP problems. Usually, fixing these problems solves the
issues and you might not need to go through the other sections.
1. Check for access permission to the PowerHA portion of the SNMP Management Information Base
(MIB) in the SNMP configuration file. Find the defaultView entries in the /etc/[Link] file:
# grep defaultView /etc/[Link]
#VACM_VIEW defaultView internet - included -
VACM_VIEW defaultView [Link].[Link].[Link] - included -
VACM_VIEW defaultView [Link].[Link].191.1.6 - included -
VACM_VIEW defaultView snmpModules - excluded -
VACM_VIEW defaultView [Link].[Link].4 - included -
VACM_VIEW defaultView [Link].[Link].5 - included -
Beginning with AIX 7.1, as a security precaution, the [Link] file is shipped with the internet
access commented out. The preceding example shows the unmodified configuration file: the internet
descriptor is commented out, which means that there is no access to most of the MIB, including the
PowerHA information. (Other included entries provide access to other limited parts of the MIB.) By
default in AIX 7.1 and later, the PowerHA SNMP-based status commands do not work, unless you
edit the [Link] file. There are two ways to provide access to the PowerHA MIB:
v Uncomment the following internet line in the [Link] file :
VACM_VIEW defaultView internet - included -
Note: After editing the SNMP configuration file, you must stop and restart snmpd, and then refresh
the cluster manager, by using the following commands:
stopsrc -s snmpd
startsrc -s snmpd
refresh -s clstrmgrES
Try the SNMP-based status commands again. If the commands work, you do not need to go through
the rest of the section.
2. If you use PowerHA SystemMirror 7.1.2 or later, check for the correct IPv6 entries in the configuration
files for clinfoES and snmpd. In PowerHA 7.1.2, an entry is added to the /usr/es/sbin/cluster/etc/
clhosts file to support IPv6. However, the required corresponding entry is not added to the
/etc/[Link] file. This causes intermittent problems with the clstat command. There are two
ways to address this problem:
v If you do not plan to use IPv6, comment the line in the /usr/es/sbin/cluster/etc/clhosts file and
restart clinfoES, by using the following commands:
# ::1 # PowerHA SystemMirror
stopsrc -s clinfoES
startsrc -s clinfoES
Try the SNMP-based status commands again. If the commands work, you do not need to go
through the rest of the section.
v If you plan to use IPv6 in the future, add the following line to the /[Link] file:
COMMUNITY public public noAuthNoPriv :: 0 -
If you are using a different community (other than public), substitute the name of that community
for the word public.
Note: After editing the SNMP configuration file, you must stop and restart snmpd, and then refresh
the cluster manager, by using the following commands:
stopsrc -s snmpd
startsrc -s snmpd
refresh -s clstrmgrES
Try the SNMP-based status commands again. If the commands work, you do not need to go through
the next section.
This topic helps you resolve other issues that can still interfere with the working of the SNMP-based
status commands, even after you have fixed the common issues.
1. Run the following command to check whether snmpd is running:
lssrc -s snmpd
If not, start the cluster services. None of the SNMP status commands work if the cluster services are
not running.
3. If you are using the clstat command, check if the /usr/es/sbin/cluster/etc/clhosts file is correct. The
clhosts file must contain a list of IP addresses of the PowerHA nodes with which the clinfoES
daemon can communicate. (Persistent addresses are preferred. If the file contains addresses that do
not belong to a cluster node, it might cause further problems.) If you edit the file on a system, you
must restart clinfoES on that system.
v In a cluster node
– By default, the clhosts file is pre-populated with the localhost address. You can add entries for
all the nodes in the cluster so that the clstat command works while the cluster services are
running on the node.
– Beginning with PowerHA SystemMirror 7.1.2, an entry for the IPv6 loopback address is added to
the default clhosts file. As described in the “Troubleshooting common SNMP problems” on page
81 section, you can either comment this line or add a line for the IPv6 loopback address to the
SNMP configuration file.
v In a client system
– By default the clhosts file is empty. You must add addresses for the cluster nodes.
4. If you are using the clstat command, run the following command to check whether clinfoES is
running:
lssrc -s clinfoES
Tip: Start clinfoES every time you start cluster services to avoid this issue.
5. Check whether snmpd is listening at the smux port and if the cluster manager is connected. Run the
following netstat command to list active sockets that use the smux port:
# netstat -Aa | grep smux
f1000e0002988bb8 tcp 0 *.smux *.* LISTEN
f1000e00029d8bb8 tcp4 0 0 [Link] loopback.32776 ESTABLISHED
f1000e00029d4bb8 tcp4 0 0 loopback.32776 [Link] ESTABLISHED
f1000e000323fbb8 tcp4 0 0 [Link] loopback.34266 ESTABLISHED
f1000e0001b86bb8 tcp4 0 0 loopback.34266 [Link] ESTABLISHED
If you do not see a socket in the LISTEN state, use the following commands to stop and start snmpd:
stopsrc -s snmpd; startsrc -s snmpd
6. Once you have an smux socket in the LISTEN state, look for a socket pair in the ESTABLISHED
state, with one of the sockets owned by the cluster manager. You can use the rmsock command to
find which process owns the sockets. If you just restarted snmpd, ensure that there is a LISTEN
socket at the smux port. If you do not see any smux socket in the ESTABLISHED state, you can
either refresh the cluster manager (refresh -s clstrmgrES), or you can wait for a couple of minutes.
In this example, there are two ESTABLISHED socket pairs: one between snmpd and muxatmd, and
one between snmpd and the cluster manager.
8. Try the SNMP-based status commands again. If the commands work, you do not need to go through
the next section.
This topic helps to resolve issues that are related to the SNMP configuration file.
1. Determine which version of snmpd is running, by using the following command:
# ls -l /usr/sbin/snmpd
lrwxrwxrwx 1 root system 9 May 14 22:19 /usr/sbin/snmpd -> snmpdv3ne
snmpdv1 uses the /etc/[Link] file and snmpdv3 uses the /etc/[Link] file.
Note: In the rest of these instructions, it is assumed that snmpdv3 daemon, which is the default
version, is running.
2. Check authentication and access control (authorization) settings for [Link] file. clinfoES,
cldump, and cldisp use community-based authentication. They use the first community that is listed
in the configuration file. Although rare, it is possible to specify the community to clinfoES. To check
this setting, use the following command:
odmget SRCsubsys | grep -p clinfo
Note: If you want to change the community that is used by clinfoES, use the chssys command.
After you change the community that is used by clinfoES, you must restart clinfoES.
3. Find the first SNMP community in the [Link] file.
# grep -i comm /etc/[Link] | grep -v ^#
COMMUNITY powerha powerha noAuthNoPriv [Link] [Link] -
COMMUNITY test test noAuthNoPriv [Link] [Link] -
In this example, the VACM_GROUP is group1. You can ignore the director_group, which is used
by IBM Systems Director.
b. Find the view that is associated with this group by searching for the group you identified. The
view is listed in a VACM_ACCESS entry.
# grep group1 /etc/[Link]
VACM_GROUP group1 SNMPv1 powerha -
VACM_ACCESS group1 - - noAuthNoPriv SNMPv1 defaultView - defaultView -
Look for the name of the view for readView access. In this example, defaultView is used for
readView and notifyView access for group group1. No access is provided for writeView and
storageType.
c. Find the VACM_VIEW entries that are associated with this community by searching for the view
you identified:
# grep defaultView /etc/[Link]
#VACM_VIEW defaultView internet - included -
VACM_VIEW defaultView [Link].[Link].[Link] - included -
VACM_VIEW defaultView [Link].[Link].191.1.6 - included -
VACM_VIEW defaultView snmpModules - excluded -
VACM_VIEW defaultView [Link].[Link].4 - included -
VACM_VIEW defaultView [Link].[Link].5 - included -
VACM_VIEW defaultView [Link].[Link].191 - excluded -
VACM_ACCESS group1 - - noAuthNoPriv SNMPv1 defaultView - defaultView -
VACM_ACCESS director_group - - noAuthNoPriv SNMPv2c defaultView - defaultView -
1) Look for a VACM_VIEW entry that gives access to the PowerHA MIB. Locations in the MIB
are identified either by a string of numbers (object identifier (OID)) or by a name (object
descriptor). In this example, the first entry uses the object descriptor internet. That corresponds
to the OID [Link]. If this line is uncommented, it allows access to the entire MIB, that is [Link]
and everything that starts with [Link], which is effectively the entire SNMP MIB.
2) However, in this example, the internet descriptor is commented out, which means that there is
no access at that level. Beginning with AIX 7.1, as a security precaution, the [Link] file
is shipped with the internet access commented out. This means that by default in AIX 7.1 and
later, the PowerHA SNMP-based status commands do not work, unless you edit the
[Link] file. Also, ensure that the relevant VACM_VIEW entry has the word included
in the second-to-last field and not excluded.
3) As described in the “Troubleshooting common SNMP problems” on page 81 section, there are
two ways to provide access to the PowerHA MIB:
v Uncomment the internet line in [Link]. This gives you access to the entire MIB.
v Add a line that provides access to the PowerHA MIB only. The PowerHA MIB can be
identified by the object descriptor or by the OID.
5. Edit the [Link] file to ensure that the PowerHA MIB is accessible to the first community. You
must make sure that the first COMMUNITY entry in the file maps to a VACM_GROUP entry whose
associated view includes the PowerHA MIB.
Note: You must use the stopsrc and startsrc commands, instead of the refresh command for snmpd.
stopsrc -s snmpd; startsrc -s snmpd
7. Repeat steps 5, 6, 7 as described in the “Troubleshooting SNMP status commands” on page 83 section
to ensure that the cluster manager is connected to snmpd.
8. Try the SNMP-based status commands again.
Problem
Nodes and repository disks fail simultaneously during an event such as a data center failure.
Solution
In a simultaneous node and repository disk failure, such as when a data center fails, it might be
necessary to replace the repository disk before all nodes restart.
1. To replace the repository disk, use the following System Management Interface Tool (SMIT) path:
$ smitty sysmirror
>Problem Determination Tools > Replace the Primary Repository Disk
Note: A node that is in the DOWN state while the repository disk is being replaced continues to
access the 'original' repository disk even after the reboot. If the 'original' repository disk becomes
available again, Cluster Aware AIX (CAA) cluster services start to use that disk. The node remains in
the DOWN state.
2. To check the status of a node, enter the following command:
lscluster -m
Note: You might need to wait up to 10 minutes for the node to join the CAA cluster again, by using
the 'new' repository disk.
5. To verify that the CAA cluster services have successfully restarted, enter the following command:
a. lscluster -c
b. lscluster -m
6. Before restarting PowerHA SystemMirror on the affected node, the PowerHA SystemMirror
configuration must be synchronized. The synchronization must be started from a node that was in
the UP state while the repository disk was replaced. To start the verification and synchronization
process on a node, use the following SMIT path:
following SMIT path:
$ smitty sysmirror
>Cluster Nodes and Networks > Verify and Synchronize Cluster Configuration
Note: If there are multiple nodes available and PowerHA is not running on all of them, you need to
choose an active node to start the synchronization.
After the verification and synchronization is successfully completed in step 6, you can restart PowerHA
SystemMirror on the previously failed node by using the following SMIT path:
$ smitty sysmirror
>System Management (C-SPOC) > PowerHA SystemMirror Services > Start Cluster Services
IBM may not offer the products, services, or features discussed in this document in other countries.
Consult your local IBM representative for information on the products and services currently available in
your area. Any reference to an IBM product, program, or service is not intended to state or imply that
only that IBM product, program, or service may be used. Any functionally equivalent product, program,
or service that does not infringe any IBM intellectual property right may be used instead. However, it is
the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or
service.
IBM may have patents or pending patent applications covering subject matter described in this
document. The furnishing of this document does not grant you any license to these patents. You can send
license inquiries, in writing, to:
For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual
Property Department in your country or send inquiries, in writing, to:
This information could include technical inaccuracies or typographical errors. Changes are periodically
made to the information herein; these changes will be incorporated in new editions of the publication.
IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in
any manner serve as an endorsement of those websites. The materials at those websites are not part of
the materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the
exchange of information between independently created programs and other programs (including this
one) and (ii) the mutual use of the information which has been exchanged, should contact:
Such information may be available, subject to appropriate terms and conditions, including in some cases,
payment of a fee.
The licensed program described in this document and all licensed material available for it are provided
by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or
any equivalent agreement between us.
The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those products, their
published announcements or other publicly available sources. IBM has not tested those products and
cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM
products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of
those products.
Statements regarding IBM's future direction or intent are subject to change or withdrawal without notice,
and represent goals and objectives only.
All IBM prices shown are IBM's suggested retail prices, are current and are subject to change without
notice. Dealer prices may vary.
This information is for planning purposes only. The information herein is subject to change before the
products described become available.
This information contains examples of data and reports used in daily business operations. To illustrate
them as completely as possible, the examples include the names of individuals, companies, brands, and
products. All of these names are fictitious and any similarity to actual people or business enterprises is
entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs
in any form without payment to IBM, for the purposes of developing, using, marketing or distributing
application programs conforming to the application programming interface for the operating platform for
which the sample programs are written. These examples have not been thoroughly tested under all
conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these
programs. The sample programs are provided "AS IS", without warranty of any kind. IBM shall not be
liable for any damages arising out of your use of the sample programs.
Each copy or any portion of these sample programs or any derivative work must include a copyright
notice as follows:
Portions of this code are derived from IBM Corp. Sample Programs.
This Software Offering does not use cookies or other technologies to collect personally identifiable
information.
If the configurations deployed for this Software Offering provide you as the customer the ability to collect
personally identifiable information from end users via cookies and other technologies, you should seek
your own legal advice about any laws applicable to such data collection, including any requirements for
notice and consent.
For more information about the use of various technologies, including cookies, for these purposes, see
IBM’s Privacy Policy at [Link] and IBM’s Online Privacy Statement at
[Link] the section entitled “Cookies, Web Beacons and Other
Technologies” and the “IBM Software Products and Software-as-a-Service Privacy Statement” at
[Link]
Trademarks
IBM, the IBM logo, and [Link] are trademarks or registered trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at
Copyright and trademark information at [Link]/legal/[Link].