Oracle Clusterware 11g Release 2 (11.2)
Technical White Paper
Internal / Confidential
Version 1.0 update 3
Disclaimer
The information contained in this document is subject to change without notice. If you find
any problems in this paper, or have any comments, corrections or suggestions, please report
them to us via E-Mail (mailto:[email protected]). We do not warrant that this
document is error-free. No part of this document may be reproduced in any form or by any
means, electronic or mechanical, for any purpose, without the permission of the authors.
This document is for internal use only and may not be distributed outside of Oracle.
The diagram below gives a high-level overview of the daemons, resources and agents used in Oracle Clusterware 11g release 2 (11.2).
The first big change between pre-11.2 and 11.2 is the new OHASD daemon, which replaces the init scripts that exist in pre-11.2.
1.2
Oracle Clusterware consists of two separate stacks: the upper stack is anchored by the Cluster Ready Services daemon (crsd), and the lower stack is anchored by the Oracle High Availability Services daemon (ohasd). These two stacks have several processes that facilitate cluster operations. The following chapters describe them in detail.
The OHASD is the daemon which starts every other daemon that is part of the Oracle
Clusterware stack on a node. OHASD will replace all the pre-11.2 existing init scripts.
The entry point for OHASD is /etc/inittab, which executes the /etc/init.d/ohasd and /etc/init.d/init.ohasd control scripts. The /etc/init.d/ohasd script is an RC script that includes the start and stop actions. The /etc/init.d/init.ohasd script is the OHASD framework control script, which spawns the Grid_home/bin/ohasd.bin executable.
The cluster control files are located in /etc/oracle/scls_scr/<hostname>/root (this is the
location for Linux) and maintained by crsctl; in other words, a crsctl enable / disable crs
will update the files in this directory.
# crsctl enable -h
Usage:
crsctl enable crs
Enable OHAS autostart on this server
# crsctl disable -h
Usage:
crsctl disable crs
Disable OHAS autostart on this server
The content of the file scls_scr/<hostname>/root/ohasdstr controls the autostart of the CRS stack; the two possible values in the file are enable (autostart enabled) and disable (autostart disabled).
The file scls_scr/<hostname>/root/ohasdrun controls the init.ohasd script. The three possible values are reboot (sync with OHASD), restart (restart a crashed OHASD), and stop (scheduled OHASD shutdown).
The big benefit of having OHASD in Oracle Clusterware 11g release 2 (11.2) is the ability to
run certain crsctl commands in a clusterized manner. Clusterized commands are completely
operating system independent, as they only rely on ohasd. If ohasd is running, then remote operations, such as starting, stopping, and checking the status of the stack on remote nodes, can be performed.
Clusterized commands include the following:
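For illustration, a hedged sketch of such clusterized operations follows; it assumes the standard 11.2 crsctl cluster verbs, and node2 is only an example node name.
Check the upper stack on all nodes:
# crsctl check cluster -all
Stop the upper stack on a single remote node:
# crsctl stop cluster -n node2
Start the upper stack cluster-wide:
# crsctl start cluster -all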
There are more functions that OHASD is performing, such as processing and managing the
Oracle Local Repository (OLR), as well as acting as the OLR server. In a cluster, OHASD runs
as root; in an Oracle Restart environment, where OHASD manages application resources, it
runs as the oracle user.
1.2.1
The clusterware stack in Oracle Clusterware 11g release 2 (11.2) is started by the OHASD
daemon, which itself is spawned by the script /etc/init.d/init.ohasd when a node is started.
Alternatively, ohasd is started on a running node with crsctl start crs after a prior crsctl
stop crs. The OHASD daemon will then start other daemons and agents. Each Clusterware
daemon is represented by an OHASD resource, stored in the OLR. The chart below shows
the association of the OHASD resources / Clusterware daemons and their respective agent
processes and owner.
Resource Name        Agent Name      Owner
ora.gipcd            oraagent        crs user
ora.gpnpd            oraagent        crs user
ora.mdnsd            oraagent        crs user
ora.cssd             cssdagent       root
ora.cssdmonitor      cssdmonitor     root
ora.diskmon          orarootagent    root
ora.ctssd            orarootagent    root
ora.evmd             oraagent        crs user
ora.crsd             orarootagent    root
ora.asm              oraagent        crs user
ora.drivers.acfs     orarootagent    root
The figure below shows the resource dependencies between the OHASD-managed resources / daemons (CRSD, EVMD, CTSSD, CSSD, CSSDMONITOR, DISKMON, GPNPD, MDNSD and GIPCD), expressed as START dependencies (hard, weak, pullup, concurrent) and STOP dependencies (hard, intermediate).
Figure 3: For details regarding the hard/weak and pullup/intermediate resource dependencies see 3.2.
1.2.2
Daemon Resources
A typical daemon resource list from a node is shown below. To get the list of daemon resources, the -init flag must be used with the crsctl command.
# crsctl stat res -init -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       node1                    Started
ora.crsd
      1        ONLINE  ONLINE       node1
ora.cssd
      1        ONLINE  ONLINE       node1
ora.cssdmonitor
      1        ONLINE  ONLINE       node1
ora.ctssd
      1        ONLINE  ONLINE       node1                    OBSERVER
ora.diskmon
      1        ONLINE  ONLINE       node1
ora.drivers.acfs
      1        ONLINE  ONLINE       node1
ora.evmd
      1        ONLINE  ONLINE       node1
ora.gipcd
      1        ONLINE  ONLINE       node1
ora.gpnpd
      1        ONLINE  ONLINE       node1
ora.mdnsd
      1        ONLINE  ONLINE       node1
The list below shows the resource types used, and their hierarchy. Everything is built on the base resource type: cluster_resource uses resource as its base type, and ora.daemon.type uses cluster_resource as its base type, which in turn is the building block for ora.cssd.type and all the other daemon resource types.
To print the internal resource type names and resources, use the crsctl -init flag.
# crsctl stat type -init
TYPE_NAME=application
BASE_TYPE=cluster_resource
TYPE_NAME=cluster_resource
BASE_TYPE=resource
TYPE_NAME=local_resource
BASE_TYPE=resource
TYPE_NAME=ora.asm.type
BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.crs.type
BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.cssd.type
BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.cssdmonitor.type
BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.ctss.type
BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.daemon.type
BASE_TYPE=cluster_resource
TYPE_NAME=ora.diskmon.type
BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.drivers.acfs.type
BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.evm.type
BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.gipc.type
BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.gpnp.type
BASE_TYPE=ora.daemon.type
TYPE_NAME=ora.mdns.type
BASE_TYPE=ora.daemon.type
TYPE_NAME=resource
BASE_TYPE=
Using the ora.cssd resource as an example, all the ora.cssd attributes can be shown using crsctl stat res ora.cssd -init -f (note that not all attributes are listed in the example below, only the most important ones).
# crsctl stat res ora.cssd -init -f
NAME=ora.cssd
TYPE=ora.cssd.type
STATE=ONLINE
TARGET=ONLINE
ACL=owner:root:rw-,pgrp:oinstall:rw-,other::r--,user:oracle11:r-x
AGENT_FILENAME=%CRS_HOME%/bin/cssdagent%CRS_EXE_SUFFIX%
CHECK_INTERVAL=30
ocssd_PATH=%CRS_HOME%/bin/ocssd%CRS_EXE_SUFFIX%
CSS_USER=oracle11
ID=ora.cssd
LOGGING_LEVEL=1
START_DEPENDENCIES=weak(ora.gpnpd,concurrent:ora.diskmon)hard(ora.cssdmonitor)
STOP_DEPENDENCIES=hard(intermediate:ora.gipcd,shutdown:ora.diskmon)
In order to debug daemon resources, the -init flag must always be used. To enable additional debugging for e.g. ora.cssd:
# crsctl set log res ora.cssd:3 -init
1.3
Agents
Oracle Clusterware 11g Release 2 (11.2) introduces a new agent concept which makes the
Oracle Clusterware more robust and performant. These agents are multi-threaded daemons
which implement entry points for multiple resource types and which spawn new processes
for different users. The agents are highly available and besides the oraagent, orarootagent
and cssdagent/cssdmonitor, there can be an application agent and a script agent.
The two main agents are the oraagent and the orarootagent. Both ohasd and crsd employ
one oraagent and one orarootagent each. If the CRS user is different from the ORACLE user,
then crsd would utilize two oraagents and one orarootagent.
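A quick, hedged way to see which of these agent processes are currently running on a node is a simple process listing; the exact process names and owners depend on the CRS and ORACLE users configured in the environment:
# ps -ef | grep -E 'oraagent|orarootagent|cssdagent|cssdmonitor' | grep -v grep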
1.3.1
oraagent
ohasd's oraagent:
crsd's oraagent:
Receives eONS events, and translates and forwards them to interested clients (eONS will be removed and its functionality included in EVM in 11.2.0.2)
Receives CRS state change events, dequeues RLB events, and enqueues HA events for OCI and ODP.NET clients
1.3.2
orarootagent
ohasd's orarootagent:
crsd's orarootagent:
Performs start/stop/check/clean actions for GNS, VIP, SCAN VIP and network resources
1.3.3
cssdagent / cssdmonitor
1.3.4
2009-10-07 13:25:18.231: [ora.crsd][2991836048] [check] DaemonAgent::check returned 0
2009-10-07 13:25:18.231: [ora.crsd][2991836048] [check] CRSD Deep Check
If any error occurs, the entry points for determining what happened are:
2009-11-25 06:20:25.767: [ ora.scan2.vip 1 1
2009-11-25 06:20:26.769: [ ora.scan2.vip 1 1
1.4
The CSS daemon (ocssd) manages the cluster configuration by controlling which nodes are
members of the cluster and by notifying members when a node joins or leaves the cluster. If
you are using certified third-party clusterware, then ocssd interfaces with the vendor
clusterware to manage node membership information.
The other clusterware daemons, as well as ASM and the database instance(s), rely on a functioning CSS. If ocssd cannot bootstrap for any reason, for example because no voting file information is found, none of the other layers can start either.
ocssd also monitors the cluster health via the network heartbeat (NHB) and the disk
heartbeat (DHB). The NHB is the primary indicator that a node is alive and can participate in
a cluster. The DHB will mainly be used for split brain resolution.
1.4.1
The section below lists and explains the threads used by ocssd.
Cluster Listener Thread (CLT) attempts to connect to all remote nodes at boot
time, receives and processes all incoming messages, and responds to connect
requests from other nodes. Whenever a packet is received from a node, the
listener resets the miss count for that node.
Sending Thread (ST) dedicated to sending network heartbeat (NHB) to all nodes
once per second, and local heartbeat (LHB) to the cssdagent and the cssdmonitor
once per second using Grid IPC (GIPC).
Reconfig Manager Thread (RMT) initiates and manages the cluster reconfiguration
on the Reconfig Manager Node (RMN) when the polling thread requests a
reconfiguration. The Reconfig Manager Threads on the remaining nodes (not the RMN) monitor the health of the manager node via the disk heartbeat so that they can complete the reconfiguration if the Reconfig Manager Node fails.
Fencing thread for communicating with the diskmon process for I/O fencing, if
EXADATA is used.
1.4.2
Reads the kill block to see if its host node has been evicted.
This thread also monitors the voting-disk heartbeat for remote nodes. The
disk heartbeat information is used during reconfigurations in order to
determine whether a remote ocssd has terminated.
Kill Block thread (one per voting file) monitors voting file availability to ensure a
sufficient number of voting files are accessible. If Oracle redundancy is used, we
require the majority of the configured voting disks online.
Worker thread (new in 11.2.0.1, one per voting file) performs miscellaneous I/O to the voting files.
This thread watches to ensure that the disk ping threads are correctly reading their kill blocks on a majority of the configured voting files. If we can't perform I/O to the voting file(s) due to I/O hangs, I/O failures or other reasons, we take the voting file(s) offline. This thread monitors the progress of the disk ping threads. If CSS is unable to read a majority of the voting files, it is possible that it no longer shares access to at least one disk with each other node. It would be possible for this node to miss an eviction notice; in other words, CSS is not able to cooperate and must be terminated.
1.4.3
Node Kill threads (transient) used for killing nodes via IPMI
local-kill thread - when a CSS client initiates a member kill, the local CSS kill
thread will be created
1.4.4
This thread registers as a member of the node group with skgxn and
watches for changes in node-group membership. When a reconfig event
occurs, this thread requests the current node-group membership bitmap
from skgxn and compares it to the bitmap it received last time and the
current values of two other bitmaps: eviction pending, which identifies
nodes that are in the process of going down, and VMONs group
membership, which indicates nodes whose oclsmon process is still running
(nodes that are (still) up). When a membership transition is identified, the
node-monitor thread initiates the appropriate action.
In Oracle Clusterware 11g release 2 (11.2) there are diminished configuration requirements,
meaning nodes are added back automatically when started and deleted if they have been
down for too long. Unpinned servers that stop for longer than a week are no longer
reported by olsnodes. These servers are automatically administered when they leave the
cluster, so you do not need to explicitly remove them from the cluster.
1.4.4.1 Pinning nodes
The appropriate command to change the node pin behavior (i.e. to pin or unpin any specific node) is the crsctl pin/unpin css command. Pinning a node means that the association of a
node name with a node number is fixed. If a node is not pinned, its node number may
change if the lease expires while it is down. The lease of a pinned node never expires.
Deleting a node with the crsctl delete node command implicitly unpins the node.
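A hedged usage sketch follows; node1 is only an example node name, and the olsnodes output shown is illustrative (olsnodes -n -t prints the node number and the pinned state):
# crsctl pin css -n node1
# crsctl unpin css -n node1
# olsnodes -n -t
node1   1       Pinned
node2   2       Unpinned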
During upgrade of Oracle Clusterware, all servers are pinned, whereas after a fresh
installation of Oracle Clusterware 11g release 2 (11.2), all servers you add to the
cluster are unpinned.
You cannot unpin a server that has an instance of Oracle RAC that is older than
Oracle Clusterware 11g release 2 (11.2) if you installed Oracle Clusterware 11g
release 2 (11.2) on that server.
Pinning a node is required for rolling upgrade to Oracle Clusterware 11g release 2 (11.2) and
will be done automatically. We have seen cases where customers performed a manual upgrade and it failed due to unpinned nodes.
1.4.4.2 Port assignment
The fixed port assignment for the CSS and node monitor has been removed, so there should
be no contention with other applications for ports. The only exception is during rolling
upgrade where we assign two fixed ports.
1.4.4.3 GIPC
The CSS layer is using the new communication layer Grid IPC (GIPC) and it still supports the
interaction with the pre-11.2 CLSC communication layer. In 11.2.0.2, GIPC will support the
use of multiple NICs for a single communications link, e.g. CSS/NM internode
communications.
1.4.4.4 Cluster alert.log
More cluster_alert.log messages have been added to allow faster location of entries
associated with a problem. An identifier will be printed in both the alert.log and the daemon
log entries that are linked to the problem. The identifier will be unique within the
component, e.g. CSS or CRS.
2009-11-24 03:46:21.110
[crsd(27731)]CRS-2757:Command 'Start' timed out waiting for response from the
resource 'ora.stnsp006.vip'. Details at (:CRSPE00111:) in
/scratch/grid_home_11.2/log/stnsp005/crsd/crsd.log.
2009-11-24 03:58:07.375
[cssd(27413)]CRS-1605:CSSD voting file is online: /dev/sdj2; details in
/scratch/grid_home_11.2/log/stnsp005/cssd/ocssd.log.
error; this is expected behaviour when another node is already up. As John Leys noted, do not file bugs solely because you receive CRS-4402.
1.4.4.6 Voting file discovery
The method of identifying voting files has changed in 11.2. While voting files were
configured in OCR in 11.1 and earlier, in 11.2 voting files are located via the CSS voting file
discovery string in the GPNP profile. Examples:
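A hedged way to see which voting files the discovery string currently resolves to is to query CSS directly (this requires a running CSS; the output line below is illustrative, with the device name taken from the alert.log example earlier in this paper):
# crsctl query css votedisk
##  STATE    File Universal Id                File Name Disk group
 1. ONLINE   <file universal id>              (/dev/sdj2) []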
1.4.5
CSS lease
Lease acquisition is a mechanism through which a node acquires a node number. A lease denotes that a node owns the associated node number for a period defined by the lease duration. The lease duration is hardcoded in the GPnP profile to be one week. A node owns the lease for the lease duration from the time of the last lease renewal, and a lease is considered to be renewed with every DHB. Hence, lease expiry is defined as: lease expiry time = last DHB time + lease duration.
There are two types of lease.
Pinned leases
A node uses a hardcoded static node number. A pinned lease is used in an upgrade scenario that involves an older version of the clusterware that uses static node numbers.
Unpinned leases
A node acquires a node number dynamically using a lease acquisition algorithm. The lease acquisition algorithm is designed to resolve conflicts among nodes that try to acquire the same slot at the same time.
For a lease acquisition failure, an appropriate message is put into the alert<hostname>.log and the ocssd.log. In the current release there is no tunable to adjust the lease duration.
1.4.6
The chapter below describes the main components and techniques used to resolve split brain situations.
1.4.6.1 Heartbeats
The CSS uses two main heartbeat mechanisms for cluster membership, the network
heartbeat (NHB) and the disk heartbeat (DHB). The heartbeat mechanisms are intentionally
redundant and they are used for different purposes. The NHB is used for the detection of
loss of cluster connectivity, whereas the DHB is mainly used for network split brain
resolution. Each cluster node must participate in the heartbeat protocols in order to be
considered a healthy member of the cluster.
Each node sends an NHB once per second to all the other nodes in the cluster and receives an NHB every second from the remote nodes. The NHB is also sent to the cssdmonitor and the cssdagent.
The NHB contains time stamp information from the local node, and is used by the remote node to figure out when the NHB was sent. It indicates that a node can participate in cluster activities, e.g. group membership changes, message sends etc. If the NHB is missing for <misscount> seconds (30 seconds on Linux in 11.2), a cluster membership change (cluster reconfiguration) is required. The loss of network connectivity is not necessarily fatal if the connectivity is restored in less than <misscount> seconds.
To debug NHB issues, it is sometimes useful to increase the ocssd log level to 3 to see each
heartbeat message. Run the crsctl set log command as root user on each node:
# crsctl set log css ocssd:3
Monitor the largest misstime value in milliseconds to see if the misscount is increasing,
which would indicate network problems.
# tail -f ocssd.log | grep -i misstime
2009-10-22 06:06:07.275: [    ocssd][2840566672]clssnmPollingThread: node 2, 1256205968 220 slgtime 246596654 DTO 28030 (index=1) biggest misstime 220 NTO 28280
2009-10-22 06:06:08.277: [    ocssd][2840566672]clssnmPollingThread: node 2, 1256205969 223 slgtime 246597654 DTO 28030 (index=1) biggest misstime 1230 NTO 28290
2009-10-22 06:06:09.279: [    ocssd][2840566672]clssnmPollingThread: node 2, 1256205970 226 slgtime 246598654 DTO 28030 (index=1) biggest misstime 2785 NTO 28290
To display the value of the current misscount setting use the command crsctl get css
misscount. We do not support a misscount setting other than the default; for customers
with more stringent HA requirements, contact Support / Development.
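A hedged sketch of querying the current CSS timeout settings follows; unless changed, expect the Linux defaults mentioned in this paper (misscount 30 seconds, disktimeout 200 seconds):
# crsctl get css misscount
# crsctl get css disktimeout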
Via an eviction message sent through the network. In most cases this will fail
because of the existing network failure.
To explain this in more detail we use the following example for a cluster with nodes A, B, C
and D:
Split begins when 2 cohorts stop receiving NHBs from each other
CSS assumes a symmetric failure, i.e. the cohort of A+B stops receiving NHBs from the
cohort of C+D at the same time that C+D stop receiving NHBs from A+B.
In scenarios like this, CSS uses the voting file and DHB for split brain resolution. The kill
block, which is one part of the voting file structure, will be updated and used to notify nodes
that they have been evicted. Each node is reading its kill block every second, and will commit
suicide after another node has updated this kill block section.
In cases like the above, where we have similar-sized sub-clusters, the sub-cluster containing the node with the lower node number will survive, and the nodes of the other sub-cluster will reboot.
In case of a split in a larger cluster, the bigger sub-cluster will survive. In the two-node
cluster case, the node with the lower node number will survive in case of a network split,
independent from where the network error occurred.
Connectivity to a majority of voting files is required for a node to stay active.
1.4.7
The kill daemon in 11.2.0.1 is an unprivileged process that kills members of CSS groups. It is
spawned by the ocssd library code when an I/O capable client joins a group, and it is
respawned when required. There is ONE kill daemon (oclskd) per user (e.g. crsowner,
oracle).
1.4.7.1 Member kill description
The following ocssd threads are involved in member kill / member kill escalation:
Member kills are issued by clients who want to eliminate group members doing IO, for
example:
Member kills always involve a remote target; either a remote ASM or database instance, or a remote, non-PE-master crsd. The member kill request is handed over to the local ocssd, which then sends the request to ocssd on the target node. In 11.1 and 11.2.0.1, ocssd hands over the process IDs of the primary and shared members of the group to be killed to oclskd. The oclskd then performs a kill -9 on these processes. In 11.2.0.2 and later, the kill daemon runs as a thread in the cssdagent and cssdmonitor processes, hence there is no longer a running oclskd.bin process. The kill daemon / thread registers with CSS separately in the KILLD group.
In some situations, and more likely in 11.2.0.1 and earlier, such as extreme CPU and memory starvation, the remote node's kill daemon or remote ocssd cannot service the local ocssd's member kill request in time (misscount seconds), and therefore the member kill request will time out. If LMON (ASM and/or RDBMS) requested the member kill, then the request will be escalated by the local ocssd to a remote node kill. A member kill request by crsd will never be escalated to a node kill; instead, we rely on the orarootagent's check action to detect the dysfunctional crsd and restart it. The target node's ocssd will receive the member kill escalation request and will commit suicide, thereby forcing a node reboot.
With the kill daemon running as real-time thread in cssdagent/cssdmonitor (11.2.0.2),
there's a higher chance that the kill request succeeds despite high system load.
If IPMI is configured and functional, the ocssd node monitor will spawn a node termination
thread to shutdown the remote node using IPMI. The node termination thread
communicates with the remote BMC via the management LAN; it will establish an
authentication session (only a privileged user can shutdown a node) and check the power
status. The next step is requesting a power-off and repeatedly checking the status until the node status is OFF. After receiving the OFF status, we will power the remote node ON again, and the node termination thread will exit.
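The sequence the node termination thread drives over the management LAN corresponds roughly to the following manual steps. This is only an illustrative sketch using ipmitool, which is not the mechanism ocssd itself uses; the BMC address and credentials are placeholders:
# ipmitool -I lan -H <bmc_ip> -U <ipmi_admin> -P <password> chassis power status
# ipmitool -I lan -H <bmc_ip> -U <ipmi_admin> -P <password> chassis power off
# ipmitool -I lan -H <bmc_ip> -U <ipmi_admin> -P <password> chassis power status
# ipmitool -I lan -H <bmc_ip> -U <ipmi_admin> -P <password> chassis power on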
1.4.7.2 Member kill example:
LMON of database instance 3 issues a member kill for the instance on node 2 due to CPU starvation:
The local ocssd (third node, internal node number 2) receives the member kill request:
2009-10-21 12:22:22.151: [    ocssd][2996095904]clssgmExecuteClientRequest: Member
2009-10-21 12:22:22.151: [    ocssd][2996095904]clssgmReqMemberKill: Kill
2009-10-21 12:22:22.152: [    ocssd][2712714144]clssgmMbrKillThread: Kill requested map 0x00000002 id 1 Group name DBPOMMI flags 0x00000001 start time 0x91794756 end time 0x91797442 time out 11500 req node 2
The remote ocssd on the target node (second node, internal node number 1) receives the request and submits the PIDs to the kill daemon:
2009-10-21 12:22:22.201: [    ocssd][3799477152]clssgmmkLocalKillThread: Local kill requested: id 1 mbr map 0x00000002 Group name DBPOMMI flags 0x00000000 st time 1088320132 end time 1088331632 time out 11500 req node 2
2009-10-21 12:22:22.201: [    ocssd][3799477152]clssgmmkLocalKillThread: Kill
2009-10-21 12:22:22.201: [    ocssd][3799477152]clssgmUnreferenceMember: global
2009-10-21 12:22:22.201: [    ocssd][3799477152]GM Diagnostics started for mbrnum/grockname: 1/DBPOMMI
2009-10-21 12:22:22.201: [    ocssd][3799477152](client 0xe331fd68, pid 23973) sharing group DBPOMMI, member 1, share type normal
2009-10-21 12:22:22.201: [    ocssd][3799477152](client 0x89f7858, pid 23957) sharing group DBPOMMI, member 1, share type xmbr
2009-10-21 12:22:22.201: [    ocssd][3799477152](client 0x8a1e648, pid 23949) sharing group DBPOMMI, member 1, share type normal
2009-10-21 12:22:22.201: [    ocssd][3799477152](client 0x89e7ef0, pid 23951) sharing group DBPOMMI, member 1, share type normal
2009-10-21 12:22:22.202: [    ocssd][3799477152](client 0xe8aabbb8, pid 23947) sharing group DBPOMMI, member 1, share type normal
2009-10-21 12:22:22.202: [    ocssd][3799477152](client 0x8a23df0, pid 23949) sharing group DG_LOCAL_POMMIDG, member 0, share type normal
2009-10-21 12:22:22.202: [    ocssd][3799477152](client 0x8a25268, pid 23929) sharing group DG_LOCAL_POMMIDG, member 0, share type normal
2009-10-21 12:22:22.202: [    ocssd][3799477152](client 0x89e9f78, pid 23951) sharing group DG_LOCAL_POMMIDG, member 0, share type normal
2009-10-21 12:22:22.202: [    ocssd][3799477152]GM Diagnostics completed for mbrnum/grockname: 1/DBPOMMI
2009-10-21 12:22:22.202: [    ocssd][3799477152]clssgmmkLocalSendKD: Copy pid 23929
2009-10-21 12:22:22.202: [    ocssd][3799477152]clssgmmkLocalSendKD: Copy pid 23973
2009-10-21 12:22:22.202: [    ocssd][3799477152]clssgmmkLocalSendKD: Copy pid 23957
2009-10-21 12:22:22.202: [    ocssd][3799477152]clssgmmkLocalSendKD: Copy pid 23949
2009-10-21 12:22:22.202: [    ocssd][3799477152]clssgmmkLocalSendKD: Copy pid 23951
2009-10-21 12:22:22.202: [    ocssd][3799477152]clssgmmkLocalSendKD: Copy pid 23947
2009-10-21 12:22:22.202: [    ocssd][3799477152]clssgmmkLocalSendKD: Copy pid 23949
2009-10-21 12:22:22.202: [    ocssd][3799477152]clssgmmkLocalSendKD: Copy pid 23929
2009-10-21 12:22:22.202: [    ocssd][3799477152]clssgmmkLocalSendKD: Copy pid 23951
2009-10-21 12:22:22.202: [    ocssd][3799477152]clssgmmkLocalSendKD: Copy pid 23947
At this point, the oclskd.log should indicate the successful kill of these processes, and
thereby the completion of the kill request. In 11.2.0.2 and later, the kill daemon thread will
perform the kill:
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsnkillagent_main: killreq received:
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23929
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23973
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23957
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23949
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23951
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23947
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23949
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23929
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23951
2009-10-21 12:22:22.295: [ USRTHRD][3980221344] clsskdKillMembers: kill status 0
pid 23947
However, if within (misscount + 1/2 seconds) the request doesn't complete, the ocssd on the
local node escalates the request to a node kill:
2009-10-21 12:22:33.655: [    ocssd][2712714144]clssgmMbrKillThread: Time up: Start time -1854322858 End time -1854311358 Current time -1854311358 timeout 11500
2009-10-21 12:22:33.655: [    ocssd][2712714144]clssgmMbrKillThread: request complete.
2009-10-21 12:22:33.655: [    ocssd][2712714144]clssgmMbrKillSendEvent: Missing answers or immediate escalation: Req member Req node Number of answers
2009-10-21 12:22:33.656: [    ocssd][2712714144]clssgmQueueGrockEvent: kill initiated
2009-10-21 12:22:33.656: [    ocssd][2712714144]clssgmMbrKillThread: Exiting Timeout 11500 Start time 1088320132 End time 1088331632 Current time 1088331632
2009-10-21 12:22:33.705: [    ocssd][3799477152]clssgmmkLocalKillResults: Replying to kill request from remote node 2 kill id 1 Success map 0x00000000 Fail map 0x00000000
2009-10-21 12:22:33.705: [    ocssd][3799477152]clssgmmkLocalKillThread: Exiting
...
2009-10-21 12:22:34.679: [    ocssd][3948735392]###################################
2009-10-21 12:22:34.679: [    ocssd][3948735392]thread clssnmvKillBlockThread
2009-10-21 12:22:34.679: [    ocssd][3948735392]###################################
1.4.7.3 How to identify the client who originally requested the member kill?
From the ocssd.log, the requestor can also be derived:
2009-10-21 12:22:22.151: [ocssd][2996095904]clssgmExecuteClientRequest: Member
2009-10-21 12:13:24.913: [ocssd][2996095904]clssgmRegisterClient: proc(22/0x8a5d5e0), client(1/0x8b054a8)
<search backwards to when process connected to ocssd>
2009-10-21 12:13:24.897: [ocssd][2996095904]clssgmClientConnectMsg: Connect from con(0x677b23) proc(0x8a5d5e0) pid(20485/20485) version 11:2:1:4, properties: 1,2,3,4,5
Using 'ps', or from other history (e.g. trace file, IPD/OS, OSWatcher), the process can be identified:
0 01:46 ?        00:01:15 ora_lmon_pommi_3
1.4.8
obtains IPMI username and password and configures OLR on all cluster nodes
Manual configuration - after install or when using static IP addresses for BMCs
See Also: Oracle Clusterware Administration and Deployment Guide, "Configuration and Installation for Node Fencing" for more information, and Oracle Grid Infrastructure Installation Guide, "Enabling Intelligent Platform Management Interface (IPMI)".
1.4.9
Debugging CSS
Logging level 3 = verbose, e.g. it displays each heartbeat message including the misstime, which can be helpful when debugging NHB-related problems
Most problems can be solved with level 2. Some require level 3, few require level 4. Using
level 3 or 4, trace information may only be kept for a few hours (or even minutes) because
the trace files can fill up and information can be overwritten. Please note that a high logging
level will incur a performance impact on ocssd due to the amount of tracing. If you need to
keep data for a longer period of time, create a cron job to back up and compress the CSS
logs.
In order to trace the cssdagent or the cssdmonitor, the enhanced tracing below can be set via crsctl.
# crsctl set log res ora.cssd=2 -init
# crsctl set log res ora.cssdmonitor=2 -init
In Oracle Clusterware 11g release 2 (11.2), CSS prints the stack dump into the cssdOUT.log. There are enhancements which help to flush diagnostic data to disk before a reboot occurs, so in 11.2 we don't consider it necessary to change the diagwait (default 0) unless advised by support or development.
In very rare cases, and only during debugging, it might be necessary to disable ocssd reboots. This can be done via a crsctl command. Disabling reboots should only be done when instructed by support or development, and can be done online without a clusterware stack restart.
Starting with 11.2.0.2, it is possible to set higher log levels for individual modules.
To list all the module names for the css daemon, the following command should be used:
# crsctl lsmodules css
To check which trace level is currently set the following command can be used:
# crsctl get log ALL
# crsctl get log css GIPCCM
1.4.10
The cssdagent and cssdmonitor provide almost the same functionality. The cssdagent
(represented by the ora.cssd resource) starts, stops, and checks the status of the ocssd
daemon. The cssdmonitor (represented by the ora.cssdmonitor resource) monitors the
cssdagent. There is no ora.cssdagent resource, and there is no resource for the ocssd
daemon.
Both agents implement the functionality of several pre-11.2 daemons, such as oprocd and oclsomon; the thread that implements the oclsvmon functionality runs in one of the two processes, not in both. The cssdagent and cssdmonitor run at real-time priority with locked-down memory, just like ocssd.
In addition, the cssdagent and cssdmonitor provide the following services to guarantee data
integrity:
Monitoring the node scheduling: if the node is hung / not scheduled, reboot the node.
To make more comprehensive decisions whether a reboot is required, both cssdagent and
cssdmonitor receive state information from ocssd, via NHB, to ensure that the state of the
local nodes as perceived by remote nodes is accurate. Furthermore, the integration will
leverage the time before other nodes perceive the local node to be down for purposes such
as filesystem sync to get complete diagnostic data.
1.4.10.1 CSSDAGENT and CSSDMONITOR debugging
In order to enable ocssd agent debugging, the command crsctl set log res ora.cssd:3 -init should be used. The operation is logged in Grid_home/log/<hostname>/agent/ohasd/oracssdagent_root/oracssdagent_root.log, and more trace information is immediately written to the oracssdagent_root.log.
2009-11-25 10:00:52.386: [ RESOURCE_MODIFY_ATTR[ora.cssd 1 1] ID 4355:106099
2009-11-25 10:00:52.387: [ AGFW][2945420176] ora.cssd 1 1
2009-11-25 10:00:52.388: [ RESOURCE_MODIFY_ATTR[ora.cssd 1 1] ID 4355:106099
2009-11-25 10:00:52.484: [ CSSCLNT][3031063440]clssgsgrpstat: rc 0, gev 0, incarn
2, mc 2, mast 1, map 0x00000003, not posted
1.4.11
Concepts
1.4.11.1 HEARTBEATS
Disk HeartBeat (DHB) is written to the voting file periodically, once per second
Network HeartBeat (NHB) is sent to the other nodes periodically, once per second
Local HeartBeat (LHB) is sent to the agent/monitor periodically, once per second
Sending Thread (ST) sends NHBs and LHBs (at the same time)
Cluster Listener (CLT) receives messages from other nodes, mostly NHBs
OMON thread (OMT) monitors for connection failure and the state of its local peer
VMON thread (VMT) replaces the clssvmon executable; registers in the skgxn group when vendor clusterware is present
1.4.11.4 Timeouts
Misscount (MC) amount of time with no NHB from a node before removing the
node from the cluster
Network Time Out (NTO) maximum time remaining with no NHB from a node
before removing the node from the cluster
Disk Time Out (DTO) maximum time left before a majority of voting files are
considered inaccessible
ReBoot Time (RBT) the amount of time allowed for a reboot; historically had to
account for init script latencies in rebooting. The default is 3 seconds.
Long I/O Timeout (LIOT) is configurable via crsctl set css disktimeout and the
default is 200 seconds
If node alive, wait full misscount for DHB activity to be missing, i.e. node
not alive
Perception of local state by other nodes must be valid to avoid data corruption
DHB only read starting shortly before a reconfig to remove the node is started
When no reconfig is impending, the I/O timeout is not important, so it need not be monitored
If the disk timeout expires, but the NHBs have been sent to and received from
other nodes, it will still be misscount seconds before other nodes will start a
reconfig
1.4.11.8 Clocks
Time Of Day Clock (TODC) the clock that indicates the hour/minute/second of the
day (may change as a result of commands)
Invariant Time Clock (ITC) a monotonically increasing clock that is invariant (i.e. does not change as a result of commands). The invariant clock does not change if the time is set backwards or forwards; it always advances at a constant rate.
1.4.12
How It Works
ocssd state information contains the current clock information, the network time out (NTO)
based on the node with the longest time since the last NHB and a disk I/O timeout based on
the amount of time since the majority of voting files was last online. The sending thread
gathers this current state information and sends both an NHB and a local heartbeat to ensure that the agent's perception of the aliveness of ocssd is the same as that of the other nodes.
The cluster listener thread monitors the sending thread. It ensures the sending thread has
been scheduled recently and wakes up if necessary. There are enhancements here to ensure
that even after clock shifts backwards and forwards, the sending thread is scheduled
accurately.
There are several agent threads; one is the oprocd thread, which just sleeps and wakes up
periodically. Upon wakeup, it checks if it should initiate a reboot, based on the last known
ocssd state information and the local invariant time clock (ITC). The wakeup is timer driven.
The heartbeat thread is just waiting for a local heartbeat from the ocssd. The heartbeat
thread will calculate the value that the oprocd thread looks at, to determine whether to
reboot. It checks if the oprocd thread has been awake recently and if not, pings it awake.
The heartbeat thread is event driven and not timer driven.
1.4.13
Filesystem Sync
When the ocssd fails, a filesystem sync is started. There is a fair amount of time to get this
done, so we can wait several seconds for a sync. The last local heartbeat indicates how long
we can wait, and the wait time is based on misscount. When the wait time expires, oprocd
will reboot the node. In most cases, diagnostic data will get written to disk. There are rare cases when this may not be possible, e.g. when the sync is not issued due to CSS being hung.
1.5
Cluster Ready Services is the primary program for managing high availability operations in a
cluster. The CRS daemon (crsd) manages cluster resources based on the configuration
information that is stored in OCR for each resource. This includes start, stop, monitor, and
failover operations. The crsd daemon monitors the Oracle database instance, listener, and
so on, and automatically restarts these components when a failure occurs.
The crsd daemon runs as root and restarts automatically after a failure. When Oracle
Clusterware is installed in a single-instance database environment for Oracle ASM and
Oracle Restart, ohasd instead of crsd manages application resources.
1.5.1
Policy Engine
1.5.1.1 Overview
Resource high availability in 11.2 is handled by OHASD (usually for infrastructure resources) and CRSD (for applications deployed in the cluster). Both daemons share the same architecture and most of the code base; for most intents and purposes, OHASD can be seen as a CRSD in a cluster of one node. The discussion in the subsequent sections therefore applies to both daemons.
Since 11.2, the architecture of CRSD implements a master-slave model: a single CRSD in the cluster is picked to be the master and the others are all slaves. Upon daemon start-up, and every time the master is re-elected, every CRSD writes the current master into its crsd.log (grep for PE MASTER NAME), e.g.
grep "PE MASTER" Grid_home/log/hostname/crsd/crsd.*
crsd.log:2010-01-07 07:59:36.529: [
CRSD is a distributed application composed of several modules. Modules are mostly stateless and operate by exchanging messages; the state (context) is always carried with each individual message, and most interactions are asynchronous in nature. Some modules have dedicated threads, others share a single thread, and some operate with a pool of threads. The important CRSD modules are as follows:
-
The Policy Engine (a.k.a PE/CRSPE in logs) is responsible for rendering all policy decisions
The Agent Proxy Server (a.k.a Proxy/AGFW in logs) is responsible for agent management
and proxy-ing commands/events between the Policy Engine and the agents
The UI Server (a.k.a UI/UiServer in logs) is responsible for managing client connections
(APIs/crsctl), and being a proxy between the PE and client programs
The OCR/OLR module (OCR in logs) is the front-end for all OCR/OLR interactions
The Reporter module (CRSRPT in logs) is responsible for all event publishing out of CRSD
For example, a client request to modify a resource will produce the following interaction:
CRSCTL -> UI Server -> PE -> OCR Module -> PE -> Reporter (event publishing) -> Proxy (to notify the agent)
and the reply travels back: PE -> UI Server -> CRSCTL
Note that the UiServer/PE/Proxy can each be on different nodes, as shown on Figure 4 below.
Figure 4: A crsctl client and an agent interacting with the CRSD processes on Node 0, Node 1 and Node 2; the numbered steps (1-8) show the message flow between them.
[   CRSPE][2660580256] {1:25747:256} RI [r1 1 1] new target state: [ONLINE] old value: [OFFLINE]
CRSD log node1 (crsctl always connects to the local CRSD; UI server forwards the
command to the PE):
2009-12-29 17:07:24.742: [UiServer][2689649568] {1:25747:256} Container [ Name: UI_START RESOURCE: TextMessage[r1]
2009-12-29 17:07:24.742: [UiServer][2689649568] {1:25747:256} Sending message to PE. ctx= 0xa3819430
2009-12-29 17:07:24.745: [   CRSPE][2660580256] {1:25747:256} Processing PE command id=347. Description: [Start Resource : 0xa7258ba8]
2009-12-29 17:07:24.748: [   CRSPE][2660580256] {1:25747:256} RI [r1 1 1] new target state: [ONLINE] old value: [OFFLINE]
2009-12-29 17:07:24.748: [ processing...
2009-12-29 17:07:24.753: [ agfw: id = 2198
Here, the PE performs a policy evaluation and interacts with the Proxy on the
destination node (to issue the start action) and the OCR (to record the new value for the
TARGET).
CRSD log node 2 (The proxy starts the agent, forwards the message to it)
2009-12-29 17:07:24.763: [
AGFW][2703780768] {1:25747:256} Agfw Proxy Server
received the message: RESOURCE_START[r1 1 1] ID 4098:2198
2009-12-29 17:07:24.767: [
AGFW][2703780768] {1:25747:256} Starting the agent:
/ade/agusev_bug/oracle/bin/scriptagent with user id: agusev and incarnation:1
2009-12-29 17:07:26.990: [
AGFW][2987383712] {1:25747:256} Command: start for
resource: r1 1 1 completed with status: SUCCESS
2009-12-29 17:07:26.991: [
AGFW][2966404000] {1:25747:256} Agent sending reply
for: RESOURCE_START[r1 1 1] ID 4098:1459
CRSD log node 2 (The proxy gets a reply, forwards it back to the PE)
2009-12-29 17:07:27.514: [
AGFW][2703780768] {1:25747:256} Agfw Proxy Server
received the message: CMD_COMPLETED[Proxy] ID 20482:2212
2009-12-29 17:07:27.514: [
AGFW][2703780768] {1:25747:256} Agfw Proxy Server
replying to the message: CMD_COMPLETED[Proxy] ID 20482:2212
CRSD log node 0 (with PE master: receives the reply, notifies the Reporter and replies to UI
Server; the Reporter publishes to EVM)
2009-12-29 17:07:27.012: [
CRSPE][2660580256] {1:25747:256} Received reply to
action [Start] message ID: 2198
2009-12-29 17:07:27.504: [
CRSPE][2660580256] {1:25747:256} RI [r1 1 1] new
external state [ONLINE] old value: [OFFLINE] on agusev_bug_2 label = []
2009-12-29 17:07:27.504: [
2009-12-29 17:07:27.513: [
CRSPE][2660580256] {1:25747:256} UI Command [Start
Resource : 0xa7258ba8] is replying to sender.
CRSD log node1 (where crsctl command was issued; UI server writes out the response,
completes the API call)
2009-12-29 17:07:27.525: [UiServer][2689649568] {1:25747:256} Container [ Name:
UI_DATA
r1:
TextMessage[0]
]
2009-12-29 17:07:27.526: [UiServer][2689649568] {1:25747:256} Done for
ctx=0xa3819430
The above demonstrates the ease of following the distributed processing of a single request across 4 processes on 3 nodes by using tints as a way to filter, extract, group and correlate information pertaining to a single event across multiple diagnostic logs.
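As a hedged example, all log entries belonging to the request above can be pulled out of the CRSD and agent logs by grepping for the tint; the tint value is the one from the example, and the paths assume the Grid_home log layout used elsewhere in this paper:
# grep '{1:25747:256}' Grid_home/log/*/crsd/crsd.log Grid_home/log/*/agent/crsd/*/*.log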
1.6
A new feature in Oracle Clusterware 11g release 2 (11.2) is Grid Plug and Play, which is
mainly managed by the Grid Plug and Play Daemon (GPnPD). The GPnPD provides access to
the GPnP profile, and coordinates updates to the profile among the nodes of the cluster to
ensure that all of the nodes have the most recent profile.
1.6.1
GPnP Configuration
The GPnP configuration is a profile and wallet configuration, identical for every peer node. The profile and wallet are created and copied by the Oracle Universal Installer. The GPnP profile is an XML text file which contains the bootstrap information necessary to form a cluster, such as the cluster name, the GUID, the discovery strings and the expected network connectivity. It does not contain node-specific information. The profile is managed by GPnPD, and it exists on every node in the GPnP cache. When there are no updates to the profile, it is identical on all cluster nodes. The best profile is judged via a sequence number.
The GPnP wallet is a binary blob containing the public / private RSA keys used to sign and verify the GPnP profile. The wallet is identical for all GPnP peers; once created by the Oracle Universal Installer, it never changes.
A typical profile contains the information below. Never change the XML file directly; instead, use supported tools, like OUI, ASMCA, asmcmd, oifcfg etc., to modify GPnP profile information.
The use of gpnptool to make changes to the GPnP profile is discouraged, as multiple steps have to be executed to even get a modification into the profile, and if the modification adds invalid content, it will corrupt the profile information and cause subsequent errors.
# gpnptool get
Warning: some command line parameters were defaulted. Resulting command line:
/scratch/grid_home_11.2/bin/gpnptool.bin get -o-
<?xml version="1.0" encoding="UTF-8"?><gpnp:GPnP-Profile Version="1.0" xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:gpnp="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:orcl="http://www.oracle.com/gpnp/2005/11/gpnp-profile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.grid-pnp.org/2005/11/gpnp-profile gpnp-profile.xsd" ProfileSequence="4" ClusterUId="0cd26848cf4fdfdebfac2138791d6cf1" ClusterName="stnsp0506" PALocation=""><gpnp:Network-Profile><gpnp:HostNetwork id="gen" HostName="*"><gpnp:Network id="net1" IP="10.137.8.0" Adapter="eth0" Use="public"/><gpnp:Network id="net2" IP="10.137.20.0" Adapter="eth2" Use="cluster_interconnect"/></gpnp:HostNetwork></gpnp:Network-Profile><orcl:CSS-
The initial GPnP configuration is created and propagated by the root script as part of the
Oracle Clusterware installation. During a fresh install the profile content is sourced from the
Oracle Universal Installer interview results in Grid_home/crs/install/crsconfig_params.
1.6.2
GPnP Daemon
The GPnP daemon, like all other daemons, is OHASD-managed and is spawned by the OHASD oraagent. The main purpose of the GPnPD is to serve the profiles, therefore it must run in order for the stack to start. The GPnPD startup sequence is mainly:
opens wallet/profile
equalizes profile
1.6.3
There are a few client tools which indirectly perform GPnP profile changes. They require ocssd to be running:
ASM - srvctl or sqlplus changing the spfile location or the ASM disk discovery string
Note that profile changes are serialized cluster-wide with a CSS lock (bug 7327595).
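As a hedged example of a supported tool that updates the network section of the GPnP profile, oifcfg can list and set the cluster interfaces; the interface names and subnets below are taken from the sample profile shown earlier, so the setif command is only illustrative:
# oifcfg getif
eth0  10.137.8.0  global  public
eth2  10.137.20.0  global  cluster_interconnect
# oifcfg setif -global eth2/10.137.20.0:cluster_interconnect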
Grid_home/bin/gpnptool is the actual tool to manipulate the gpnp profile. To see the
detailed usage, run gpnptool help.
Oracle GPnP Tool
Usage:
"gpnptool <verb> <switches>", where verbs are:
create
edit
getpval
get
rget
put
find
lfind
check
c14n
sign
unsign
verify
help
ver
1.6.4
In order to get more log and trace information, there is a tracing environment variable GPNP_TRACELEVEL whose range is 0-6 (see the example after the list of trace locations below). The GPnP traces are located mainly at
Grid_home/log/<hostname>/alert*,
Grid_home/log/<hostname>/client/gpnptool*, other client logs
Grid_home/log/<hostname>/gpnpd|mdnsd/*
Grid_home/log/<hostname>/agent/ohasd/oraagent_<username>/*
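A hedged example of raising the gpnptool client trace level via the environment variable mentioned above; the value 5 is arbitrary within the 0-6 range:
# export GPNP_TRACELEVEL=5
# gpnptool get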
The product setup files which hold the initial information are located at
Grid_home/crs/install/crsconfig_params
Grid_home/cfgtoollogs/crsconfig/root*
Grid_home/gpnp/*, Grid_home /gpnp/<hostname>/* [profile+wallet]
If the GPnP setup is failing the following failure scenario checks should be performed.
Failed to create wallet, profile? Failed to sign profile? Wrong signature? No access
to wallet or profile? [gpnpd is dead, stack is dead] (bug:8609709,bug:8445816)
If something is failing during the GPnP runtime the following checks should be done.
Is mdnsd running? Gpnpd failed to register with mdnsd? Discovery fails? [no put,
rget]
Is gpnpd not fully up? [no get, no put, client spins in retries, times out]
Discovering spurious nodes as a part of the cluster? [no put, can block gpnpd
dispatch]
OCR was up, but failed [gpnpd dispatch can block, client waits in receive until OCR
recovers]
For all of the above, a first source would be the appropriate daemon log files; also check the resource status via crsctl stat res -init -t
Check if the GPnP configuration is valid and check the GPnP log files for errors.
Some sanity checks can be done with gpnptool check or gpnptool verify
# gpnptool check -p=/scratch/grid_home_11.2/gpnp/stnsp006/profiles/peer/profile.xml
Profile cluster="stnsp0506", version=4
GPnP profile signed by peer, signature valid.
Got GPnP Service current profile to check against.
Current GPnP Service Profile cluster="stnsp0506", version=4
Error: profile version 4 is older than- or duplicate of- GPnP Service
current profile version 4.
Profile appears valid, but push will not succeed.
# gpnptool verify
Oracle GPnP Tool
verify
Usage:
"gpnptool verify <switches>", where switches are:
-p[=profile.xml]
-w[=file:./]
keys
-wp=<val>
-wu[=owner]
owner,peer,pa)
-t[=3]
-f=<val>
-?
gpnptool get should return the local profile information. If gpnptool lfind|get
hangs, a pstack from the hanging client and the GPnPD log files under
Grid_home/log/<hostname>/gpnpd would be beneficial for further debugging.
To check if the remote GPnPD daemon is responding, the find option is very
helpful:
# gpnptool find -h=stnsp006
Found 1 instances of service 'gpnp'.
mdns:service:gpnp._tcp.local.://stnsp006:17452/agent=gpnpd,cname=stnsp0506
,host=stnsp006,pid=13133/gpnpd h:stnsp006 c:stnsp0506
To check if all the peers are responding, run gpnptool find -c=<clustername>
# gpnptool find -c=stnsp0506
Found 2 instances of service 'gpnp'.
mdns:service:gpnp._tcp.local.://stnsp005:23810/agent=gpnpd,cname=stnsp0506
,host=stnsp005,pid=12408/gpnpd h:stnsp005 c:stnsp0506
mdns:service:gpnp._tcp.local.://stnsp006:17452/agent=gpnpd,cname=stnsp0506
,host=stnsp006,pid=13133/gpnpd h:stnsp006 c:stnsp0506
We store copies of the GPnP profile in the local OLR and the OCR. In case of loss or
corruption, GPnPD pulls the information from there and recreates the profile.
1.7
GNS performs name resolution in the cluster. GNS doesn't always use mDNS for
performance reasons.
In Oracle Clusterware 11g release 2 (11.2) we support the use of DHCP for both the private
interconnect and for almost all virtual IP addresses on the public network. For clients outside
the cluster to find the virtual hosts in the cluster, we provide a Grid Naming Service (GNS).
This works with any higher-level DNS to provide resolvable names to external clients.
This section explains how to perform a simple setup of DHCP and GNS. A complex network
environment may require a more elaborate solution. The GNS and DHCP setup must be in
place before the grid infrastructure installation.
1.7.1
DHCP provides dynamic configuration of the host's IP address, but does not provide a good way to produce names that are useful to external clients. As a result, it has been uncommon in server complexes. In Oracle Clusterware 11g release 2 (11.2), this problem is solved by
providing our own service for resolving names in the cluster, and connecting this to the DNS
that is visible to the clients.
1.7.2
To get GNS to work for clients, it is necessary to configure the higher-level DNS to delegate
a subdomain to the cluster, and the cluster must run GNS on an address known to the DNS.
The GNS address will be maintained as a statically configured VIP in the cluster. The GNS
daemon (GNSD) will follow that VIP around the cluster and service names in the subdomain.
Four things need to be configured:
A single static address in the public network for the cluster to use as the GNS VIP.
Delegation from the higher-level DNS for names within the cluster sub-domain to
the GNS VIP.
A DHCP server for dynamic address provisioning on the public network.
A running cluster with properly configured GNS.
strdv0108.mycorp.com NS strdv0108-gns.mycorp.com
#Let the world know to go to the GNS vip
strdv0108-gns.mycorp.com 10.9.8.7
Here, the sub-domain is strdv0108.mycorp.com, the GNS VIP has been assigned the
name strdv0108-gns.us.mycorp.com (corresponding to a chosen static IP address),
and the GNS daemon will listen on the default port 53.
NOTE: This does not establish an address for the name strdv0108.mycorp.com; it
creates a way of resolving names within this sub-domain, such as clusterNode1VIP.strdv0108.mycorp.com.
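To sanity-check the delegation from a client machine, a simple lookup like the following can be used (the VIP name below is illustrative and only resolves once the cluster and GNS are up):
$ nslookup strdv0108-gns.mycorp.com
$ nslookup clusterNode1VIP.strdv0108.mycorp.com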
1.7.3 DHCP
With DHCP, a host requiring an IP address sends a broadcast message to the hardware
network. A DHCP server on the segment can respond to the request, and give back an
address, along with other information such as what gateway to use, what DNS server(s) to
use, what domain should be used, what NTP server should be used, etc.
When DHCP is used for the public network, several IP addresses (node VIPs and SCAN VIPs) are obtained this way.
The GNS VIP cannot be obtained from DHCP, because it must be known in advance, so it must
be statically assigned.
The DHCP configuration file is /etc/dhcp.conf.
Using the following configuration example:
the domain the machines will reside in for DNS purposes is strdv0108.mycorp.com
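A minimal sketch of such a DHCP server configuration (the file location, subnet, address range, DNS server and lease time below are assumptions for illustration only and must be adapted to the actual environment):
# cat /etc/dhcpd.conf
subnet 10.9.8.0 netmask 255.255.255.0 {
   # dynamic addresses handed out for node VIPs and SCAN VIPs
   range 10.9.8.100 10.9.8.150;
   option domain-name "strdv0108.mycorp.com";
   option domain-name-servers 10.9.8.1;
   default-lease-time 43200;
}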
The /etc/nsswitch.conf controls name service lookup order. In some system configurations,
the Network Information System (NIS) can cause problems with Oracle SCAN address
resolution. It is suggested to place the NIS entry at the end of the search list.
/etc/nsswitch.conf
hosts:      files dns nis
1.7.4
During startup, the GNS server retrieves the name of the sub-domain to be serviced from the
OCR and starts its threads. Once all the threads are running, the GNS server performs a self
check to verify that name resolution is working: the client API is called to register a dummy
name and address, and the server then attempts to resolve the name. If the resolution succeeds
and one of the addresses matches the dummy address, the self check has succeeded and a message
is written to the cluster alert<hostname>.log. This self check is done only once, and even if
the test fails the GNS server keeps running.
The default trace location for GNS server is Grid_home/log/<hostname>/gnsd/. The trace file
format looks like the following:
<Time stamp>: [GNS][Thread ID]<Thread name>::<function>:<message>
2009-09-21 10:33:14.344: [GNS][3045873888] Resolve::clsgnmxInitialize:
initializing mutex 0x86a7770 (SLTS 0x86a777c).
1.7.5
The GNS Agent (orarootagent) will check the GNS server periodically. The check is done by
querying the GNS for its status.
To see if the agent is successfully advertising with GNS, run:
# grep -i 'updat.*gns' Grid_home/log/<hostname>/agent/crsd/orarootagent_root/orarootagent_*
orarootagent_root.log:2009-10-07 10:17:23.513: [ora.gns.vip] [check] Updating GNS
with stnsp0506-gns-vip 10.137.13.245
orarootagent_root.log:2009-10-07 10:17:23.540: [ora.scan1.vip] [check] Updating
GNS with stnsp0506-scan1-vip 10.137.12.200
orarootagent_root.log:2009-10-07 10:17:23.562: [ora.scan2.vip] [check] Updating
GNS with stnsp0506-scan2-vip 10.137.8.17
orarootagent_root.log:2009-10-07 10:17:23.580: [ora.scan3.vip] [check] Updating
GNS with stnsp0506-scan3-vip 10.137.12.214
orarootagent_root.log:2009-10-07 10:17:23.597: [ora.stnsp005.vip] [check] Updating
GNS with stnsp005-vip 10.137.12.228
orarootagent_root.log:2009-10-07 10:17:23.615: [ora.stnsp006.vip] [check] Updating
GNS with stnsp006-vip 10.137.12.226
1.7.6
The command line interface to interact with GNS is srvctl (the only supported way).
crsctl can stop and start the ora.gns resource, but this is not supported unless explicitly
directed by development.
GNS operations are performed on the gns noun, for example:
# srvctl {start|stop|modify|etc.} gns ...
To start gns:
# srvctl start gns [-l <log_level>]
To stop gns:
# srvctl stop gns
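The current state and configuration can be inspected with the same tool (exact options vary by patch level; srvctl -h shows the full syntax):
# srvctl status gns
# srvctl config gns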
1.7.7 Debugging GNS
The default GNS server logging level is 0, which can be seen via a simple ps -ef | grep
gnsd.bin:
/scratch/grid_home_11.2/bin/gnsd.bin -trace-level 0 -ip-address 10.137.13.245 -startup-endpoint ipc://GNS_stnsp005_31802_429f8c0476f4e1
To debug GNS server issues it is sometimes necessary to increase this log level, which can be
done by stopping the GNS server via srvctl stop gns and restarting it via srvctl start gns -v -l 5.
Only the root user can stop and start the GNS.
Usage: srvctl start gns [-v] [-l <log_level>] [-n <node_name>]
    -v                Verbose output
    -l <log_level>    Specify the level of logging that GNS should run with.
    -n <node_name>    Node name
    -h                Print usage
The trace level ranges from 0 to 6; level 5 should be sufficient in all the cases; setting the
trace level to level 6 is not recommended as gnsd will consume a lot of CPU.
Due to bug 8705125 in 11.2.0.1, the default logging level for the GNS server (gnsd daemon) will
be level 6 after the initial installation. To set the log level back to the default value of 0, stop
and start the GNS using srvctl stop / start gns. This only stops and starts gnsd.bin and does
not cause any harm to the running cluster.
Starting with 11.2.0.2, the -l option (list all records in GNS) is very helpful for debugging
GNS issues.
1.8 Grid Interprocess Communication (GIPC)
The configuration regarding listening endpoints with GIPC is a little different. The
private/cluster interconnects are now defined in the GPnP profile.
The requirement for the same interfaces to exist with the same name on all nodes is more
relaxed, as long as communication can be established. The part of the GPnP profile
regarding the private and public network configuration is:
<gpnp:Network id="net1" IP="10.137.8.0" Adapter="eth0" Use="public"/><gpnp:Network
id="net2" IP="10.137.20.0" Adapter="eth2" Use="cluster_interconnect"/>
1.8.1
The GIPC default trace level only prints errors, and the default trace level for the different
components ranges from 0 to 2. To debug GIPC related issues, it might be necessary to
increase the trace levels, which are described below.
1.8.2
With crsctl it is possible to set a GIPC trace level for different components.
Example:
# crsctl set log css COMMCRS:abcd
Where
If the component of interest is GIPC and you want to modify only the GIPC trace level, up from
its default value of 2, simply run:
To turn on GIPC tracing for all components (NM, GM, etc.), set
# crsctl set log css COMMCRS:3 or
# crsctl set log css COMMCRS:4
With level 4, a lot of tracing is generated, so the ocssd.log will wrap around fairly quickly.
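The currently configured level can be verified with crsctl as well, using the same module:component notation (a small sketch):
# crsctl get log css COMMCRS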
1.8.3
Another option is to set a pair of environment variables for the component that uses GIPC for
communication, e.g. ocssd. In order to achieve this, a wrapper script is required. Taking
ocssd as an example, the wrapper script is Grid_home/bin/ocssd that invokes ocssd.bin.
Adding the variables below to the wrapper script (under the LD_LIBRARY_PATH) and
restarting ocssd will enable GIPC tracing. To restart ocssd.bin, perform a crsctl stop/start
cluster.
case `/bin/uname` in
Linux)
LD_LIBRARY_PATH=/scratch/grid_home_11.2/lib
export LD_LIBRARY_PATH
export GIPC_TRACE_LEVEL=4
export GIPC_FIELD_LEVEL=0x80
# forcibly eliminate LD_ASSUME_KERNEL to ensure NPTL where available
LD_ASSUME_KERNEL=
export LD_ASSUME_KERNEL
LOGGER="/usr/bin/logger"
if [ ! -f "$LOGGER" ];then
LOGGER="/bin/logger"
fi
LOGMSG="$LOGGER -puser.err"
;;
This will set the trace level to 4. The values for the trace environment variables are
To enable more fine-grained tracing, use the GIPC_COMPONENT_TRACE environment variable.
The defined components are:
GIPCGEN, GIPCTRAC, GIPCWAIT, GIPCXCPT, GIPCOSD, GIPCBASE, GIPCCLSA, GIPCCLSC,
GIPCEXMP, GIPCGMOD, GIPCHEAD, GIPCMUX, GIPCNET, GIPCNULL, GIPCPKT, GIPCSMEM,
GIPCHAUP, GIPCHALO, GIPCHTHR, GIPCHGEN, GIPCHLCK, GIPCHDEM, GIPCHWRK
Example:
# export GIPC_COMPONENT_TRACE=GIPCWAIT:4,GIPCNET:3
0xa481c830, len 104, olen 104, parentEndp 0x8f99118, ret gipcretSuccess (0),
objFlags 0x0, reqFlags 0x4 }
Only some layers like CSS (client and server), GPNPD, GNSD, and small parts of MDNSD are
using GIPC right now.
Others like CRS/EVM/OCR/CTSS will use GIPC starting with 11.2.0.2. This is important to
know in order to turn on GIPC tracing or the old NS/CLSC tracing to debug communication
issues.
1.9 Cluster Time Synchronization Service (CTSS)
The CTSS is a new feature in Oracle Clusterware 11g release 2 (11.2), which takes care of
time synchronization in a cluster, in case the network time protocol daemon is not running
or is not configured properly.
The CTSS synchronizes the time on all of the nodes in a cluster to match the time setting on
the CTSS master node. When Oracle Clusterware is installed, the Cluster Time
Synchronization Service (CTSS) is installed as part of the software package. During
installation, the Cluster Verification Utility (CVU) determines if the network time protocol
(NTP) is in use on any nodes in the cluster. On Windows systems, CVU checks for NTP and
Windows Time Service.
If Oracle Clusterware finds that NTP is running or that NTP has been configured, then NTP is
not affected by the CTSS installation. Instead, CTSS starts in observer mode (this condition is
logged in the alert log for Oracle Clusterware). CTSS then monitors the cluster time and logs
alert messages, if necessary, but CTSS does not modify the system time. If Oracle
Clusterware detects that NTP is not running and is not configured, then CTSS designates one
node as a clock reference, and synchronizes all of the other cluster member time and date
settings to those of the clock reference.
Oracle Clusterware considers an NTP installation to be misconfigured if one of the following
is true:
NTP is not installed on all nodes of the cluster; CVU detects an NTP installation by a
configuration file, such as ntp.conf
The primary and alternate clock references are different for all of the nodes of the
cluster
The NTP processes are not running on all of the nodes of the cluster; only one type
of time synchronization service can be active on the cluster.
To check whether CTSS is running in active or observer mode run crsctl check ctss
CRS-4700: The Cluster Time Synchronization Service is in Observer mode.
or
CRS-4701: The Cluster Time Synchronization Service is in Active mode.
CRS-4702: Offset from the reference node (in msec): 100
The tracing for the ctssd daemon is written to the octssd.log. The alert log
(alert<hostname>.log) also contains information about the mode in which CTSS is running.
[ctssd(13936)]CRS-2403:The Cluster Time Synchronization Service on host node1 is
in observer mode.
[ctssd(13936)]CRS-2407:The new Cluster Time Synchronization Service reference node
is host node1.
[ctssd(13936)]CRS-2401:The Cluster Time Synchronization Service started on host
node1.
1.9.1 CVU checks
There are pre-install CVU checks performed automatically during installation, like cluvfy
stage -pre crsinst <options>.
This step will check and make sure that the operating system time synchronization software
(e.g. NTP) is either properly configured and running on all cluster nodes, or on none of the
nodes.
During the post-install check, CVU will run cluvfy comp clocksync -n all. If CTSS is in observer
mode, it will perform a configuration check as above. If CTSS is in active mode, we verify
that the time difference is within the limit.
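A manual run of the same post-install check, with verbose output, could look like this (the node list is illustrative):
$ cluvfy comp clocksync -n all -verbose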
1.9.2 CTSS resource
When CTSS comes up as part of the clusterware startup, it performs a step time synchronization,
and if everything goes well, it publishes its state as ONLINE. There is a start dependency on
ora.cssd, but note that it has no stop dependency, so if for some reason CTSSD dumps core or
exits, nothing else should be affected.
The chart below shows the start dependency that other resources build on ora.ctssd.
               TARGET   STATE    SERVER   STATE_DETAILS
---------------------------------------------------------------------
ora.ctssd
      1        ONLINE   ONLINE   node1    OBSERVER
1.10 mdnsd
1.10.1 Debugging mdnsd
In order to capture mdnsd network traffic, use the mDNS Network Monitor located in
Grid_home/bin:
# mkdir Grid_home/log/$HOSTNAME/netmon
# Grid_home/bin/oranetmonitor &
The output from oranetmonitor will be captured in netmonOUT.log in the above directory.
ASM manages voting files differently from other files that it stores. When voting files are
placed on disks in an ASM disk group, Oracle Clusterware records exactly on which disks in
that diskgroup they are located. If ASM fails, then CSS can still access the voting files. If you
choose to store voting files in ASM, then all voting files must reside in ASM, i.e. we do not
support mixed configurations like storing some voting files in ASM and some on NAS.
The number of voting files you can store in a particular Oracle ASM disk group depends upon
the redundancy of the disk group.
External redundancy: A disk group with external redundancy can store only one
voting file
Normal redundancy: A disk group with normal redundancy can store up to three
voting files
High redundancy: A disk group with high redundancy can store up to five voting files
By default, Oracle ASM puts each voting file in its own failure group within the disk group. A
failure group is a subset of the disks in a disk group, which could fail at the same time
because they share hardware, e.g. a disk controller. The failure of common hardware must
be tolerated. For example, four drives that are in a single removable tray of a large JBOD
(Just a Bunch of Disks) array are in the same failure group because the tray could be
removed, making all four drives fail at the same time. Conversely, drives in the same cabinet
can be in multiple failure groups if the cabinet has redundant power and cooling so that it is
not necessary to protect against failure of the entire cabinet. However, Oracle ASM
mirroring is not intended to protect against a fire in the computer room that destroys the
entire cabinet. If voting files are stored on Oracle ASM with Normal or High redundancy, and the
storage hardware in one failure group suffers a failure, then if there is another disk available
in a disk group in an unaffected failure group, Oracle ASM recovers the voting file in the
unaffected failure group.
2.2
The formation-critical data is now stored in the voting file itself and not in the
OCR anymore. From a voting file perspective, the OCR is not touched at all. The
critical data each node must agree on to form a cluster is, for example, misscount and
the list of configured voting files.
New blocks added to the voting file include the voting file identifier block (needed for
voting files stored in ASM), which contains the cluster GUID and the file UID. The
committed and pending configuration incarnation numbers (CCIN and PCIN) contain
this formation-critical data.
To query the configured voting files and to see their location run crsctl query css
votedisk:
$ crsctl query css votedisk
##  STATE    File Universal Id    File Name    Disk group
--  -----    -----------------    ---------    ----------
 1. ONLINE
 2. ONLINE
 3. ONLINE
Voting files that reside in ASM may be automatically deleted and added
back if one of the existing voting files gets corrupted.
Voting files can be migrated from/to NAS/ASM and from ASM to ASM with, e.g.:
$ crsctl replace css votedisk /nas/vdfile1 /nas/vdfile2 /nas/vdfile3
or
$ crsctl replace css votedisk +OTHERDG
If all voting files are corrupted, however, you can restore them as described below.
If the cluster is down and cannot restart due to lost voting files, then you must start
CSS in exclusive mode to replace the voting files, as sketched below.
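A minimal sketch of that recovery, assuming the new voting file location is a disk group named +DATA (run as root; if the target is an ASM disk group, the ASM instance with that disk group mounted must be available in exclusive mode as well):
# crsctl start crs -excl
# crsctl replace css votedisk +DATA
# crsctl stop crs -f
# crsctl start crs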
For more information see the Appendix "Oracle Clusterware 11g release 2 (11.2) - Using standard
NFS to support a third voting file on a stretch cluster configuration".
See Also: Oracle Clusterware Administration and Deployment Guide, "Voting file, Oracle
Cluster Registry, and Oracle Local Registry" for more information. For information about
extended clusters and how to configure the quorum voting file see the Appendix.
2.3
As of 11.2, OCR can also be stored in ASM. The ASM partnership and status table (PST) is
replicated on multiple disks and is extended to store OCR. Consequently, OCR can tolerate
the loss of the same number of disks as the underlying disk group, and it can be
relocated / rebalanced in response to disk failures.
In order to store an OCR in a disk group, a special ASM file type called ocr is used.
The default configuration location is /etc/oracle/ocr.loc
# cat /etc/oracle/ocr.loc
ocrconfig_loc=+DATA
local_only=FALSE
From a user and maintenance perspective, the rest remains the same. The OCR can only be
configured in ASM once the cluster has completely migrated to 11.2 (crsctl query crs
activeversion >= 11.2.0.1.0). We still support mixed configurations, so one OCR could be
stored in ASM and another on a supported NAS device, as we support up to 5 OCR
locations in 11.2.0.1. We no longer support raw or block devices for either OCR or voting
files.
The OCR diskgroup is auto mounted by the ASM instance during startup. The CRSD and ASM
dependency is maintained by OHASD.
OCRCHECK
There are small enhancements in ocrcheck, like the -config option, which only checks the
configuration. Run ocrcheck as root, otherwise the logical corruption check will not run. To
check OLR data, use the -local option.
Usage: ocrcheck [-config] [-local]
Shows OCR version, total, used and available space
Performs OCR block integrity (header and checksum) checks
Performs OCR logical corruption checks (11.1.0.7)
-config checks just configuration (11.2)
-local checks OLR, default OCR
Can be run when stack is up or down
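For example, the following invocations cover the three variants (run as root so the logical corruption check is included; sample output from a plain ocrcheck run is shown further below):
# ocrcheck
# ocrcheck -config
# ocrcheck -local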
         Total space (kbytes)     :     262120
         Used space (kbytes)      :       3072
         Available space (kbytes) :     259048
         ID                       :  701301903
         Device/File Name         :      +DATA
         Device/File Name         : /nas/cluster3/ocr3
         Device/File Name         : /nas/cluster5/ocr1
         Device/File Name         : /nas/cluster2/ocr2
         Device/File Name         : /nas/cluster4/ocr4
2.4
The OLR, similar in structure to the OCR, is a node-local repository, and is managed by
OHASD. The configuration data in the OLR pertains to the local node only and is not shared
among other nodes.
The configuration is stored in /etc/oracle/olr.loc (on Linux) or the equivalent on other
operating systems. The default location after installing Oracle Clusterware is:
RAC: Grid_home/cdata/<hostname>.olr
The information stored in the OLR is needed by OHASD to start or join a cluster; this includes
data about GPnP wallets, clusterware configuration and version information.
OLR keys have the same properties as OCR keys and the same tools are used to either check
or dump them.
To see the OLR location, run the command:
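For example, either of the following shows the configured location (the olr.loc path is the Linux default described above):
# cat /etc/oracle/olr.loc
# ocrcheck -local -config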
See Also: Oracle Clusterware Administration and Deployment Guide, "Managing the Oracle
Cluster Registry and Oracle Local Registries" for more information about using the ocrconfig
and ocrcheck.
2.5
ASM has to be up with the diskgroup mounted before any OCR operations can be
performed. There are bugs reported for cases where the diskgroup containing the OCR was
force-dismounted and/or the ASM instance was shut down with abort.
When the stack is running, CRSD keeps reading/writing OCR.
OHASD maintains the resource dependency and will bring up ASM with the required
diskgroup mounted before it starts CRSD.
Once ASM is up with the diskgroup mounted, the usual ocr* commands (ocrcheck,
ocrconfig, etc.) can be used.
The shutdown command will fail with an ORA-15097 for an ASM instance with an active
OCR in it (meaning that crsd is running on this node). In order to see which clients are
accessing ASM, use the following commands:
asmcmd lsct (v$asm_client)
DB_Name  Status     Software_Version  Compatible_version  Instance_Name  Disk_Group
+ASM     CONNECTED  11.2.0.1.0        11.2.0.1.0          +ASM2          DATA
asmcmd lsof
DB_Name  Instance_Name  Path
+ASM     +ASM2          +data.255.4294967295
Where +data.255 is the OCR file number which is used to identify the OCR file within ASM.
2.6
2.7
Ensure that the ASM instance is up and running with the required diskgroup
mounted, and/or check the ASM alert.log for the status of the ASM instance.
Verify that the OCR files were properly created in the diskgroup, using asmcmd ls.
Since the clusterware stack keeps accessing OCR files, most of the time the error
will show up as a CRSD error in the crsd.log. Any error related to an ocr* command
(like crsd, also considered an ASM client) will generate a trace file in the
Grid_home/log/<hostname>/client directory; in either case, look for kgfo / kgfp /
kgfn at the top of the error stack.
Confirm that the ASM compatible.asm property of the diskgroup is set to at least
11.2.0.0.
The ASM Diskgroup Resource
When the diskgroup is created, the diskgroup resource is automatically created with the
name ora.<DGNAME>.dg, and its status is set to ONLINE. The status OFFLINE will be set if
the diskgroup is dismounted, as this is a CRS-managed resource now. When the diskgroup is
dropped, the diskgroup resource is removed as well.
A dependency between the database and the diskgroup is automatically created when the
database accesses the ASM files. However, when the database no longer uses the ASM files,
or the ASM files are removed, we do not remove the database dependency automatically.
This must be done using the srvctl command line tool.
Typical ASM alert.log messages for success/failure and warnings are
Success:
NOTE: diskgroup resource ora.DATA.dg is offline
NOTE: diskgroup resource ora.DATA.dg is online
Failure
ERROR: failed to online diskgroup resource ora.DATA.dg
ERROR: failed to offline diskgroup resource ora.DATA.dg
Warning
WARNING: failed to online diskgroup resource ora.DATA.dg (unable to
communicate with CRSD/OHASD)
This warning may appear when the stack is started
WARNING: unknown state for diskgroup resource ora.DATA.dg
If errors happen, look at the ASM alert.log for the related resource operation status message
like,
ERROR: the resource operation failed; check CRSD log and Agent log for more
details
Grid_home/log/<hostname>/crsd/
Grid_home/log/<hostname>/agent/crsd/oraagent_user/
WARNING: cannot communicate with CRSD.
This warning can be ignored during bootstrap, as the ASM instance starts up and mounts the
diskgroup before CRSD.
The status of the diskgroup resource and the diskgroup should be consistent. In rare cases,
they may become out of sync transiently. To get them back in sync manually run srvctl to
sync the status, or wait some time for the agent to refresh the status. If they become out of
sync for a long period, please check CRSD log and ASM log for more details.
To turn on more comprehensive tracing use event="39505 trace name context forever, level 1".
2.8
A quorum failure group is a special type of failure group and disks in these failure groups do
not contain user data and are not considered when determining redundancy requirements.
The COMPATIBLE.ASM disk group compatibility attribute must be set to 11.2 or greater to
store OCR or voting file data in a disk group.
During Oracle Clusterware installation we do not offer to create a quorum failure group,
which is needed for a third voting file in the case of extended / stretched clusters or two
storage arrays.
Create a diskgroup with a failgroup and optionally a quorum failgroup if a third array is
available.
SQL> CREATE DISKGROUP PROD NORMAL REDUNDANCY
     FAILGROUP fg1 DISK '<a disk in SAN1>'
     FAILGROUP fg2 DISK '<a disk in SAN2>'
     QUORUM FAILGROUP fg3 DISK '<another disk or file on a third location>'
     ATTRIBUTE 'compatible.asm' = '11.2.0.0';
If the diskgroup creation was done using ASMCA, then after adding a quorum disk to the disk
group, Oracle Clusterware will automatically change the CSS votedisk location to something
like below:
$ crsctl query css votedisk
##  STATE    File Universal Id                    File Name                  Disk group
--  -----    -----------------                    ---------                  ----------
 1. ONLINE
 2. ONLINE
 3. ONLINE   462722bd24c94f70bf4d90539c42ad4c     (/voting_disk/vote_node1)  [DATA]
Located 3 voting file(s).
If the disk group is created via SQL*Plus, then crsctl replace css votedisk must be used.
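For example, following the CREATE DISKGROUP statement above (the disk group name is the one used in that example):
$ crsctl replace css votedisk +PROD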
See Also: Oracle Database Storage Administrator's Guide, "Oracle ASM Failure Groups" for
more information. Oracle Clusterware Administration and Deployment Guide, "Voting file,
Oracle Cluster Registry, and Oracle Local Registry" for more information about backup and
restore and failure recovery.
2.9 ASM spfile
2.9.1
Oracle recommends that the Oracle ASM SPFILE be placed in a disk group. You cannot use a
new alias created on an existing Oracle ASM SPFILE to start up the Oracle ASM instance.
If you do not use a shared Oracle grid infrastructure home, then the Oracle ASM instance
can use a PFILE. The same rules for file name, default location, and search order that apply
to database initialization parameter files also apply to Oracle ASM initialization parameter
files.
When an Oracle ASM instance searches for an initialization parameter file, the search order
is:
The location of the initialization parameter file specified in the Grid Plug and Play
(GPnP) profile
If the location has not been set in the GPnP profile, the search order changes to the SPFILE
in the Oracle ASM instance home, followed by the PFILE in the Oracle ASM instance home.
2.9.2
You can back up, copy, or move an Oracle ASM SPFILE with the ASMCMD spbackup, spcopy
or spmove commands. For information about these ASMCMD commands see the Oracle
Database Storage Administrator's Guide.
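A small sketch of a backup, assuming an illustrative SPFILE path inside the DATA disk group (look up the actual path, e.g. with asmcmd find, before running this):
$ asmcmd spbackup +DATA/mycluster/asmparameterfile/spfileASM.ora /tmp/spfileASM.ora.bak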
See Also: Oracle Database Storage Administrator's Guide "Configuring Initialization
Parameters for an Oracle ASM Instance" for more information.
3 Resources
Oracle Clusterware manages applications and processes as resources that you register with
Oracle Clusterware. The number of resources you register with Oracle Clusterware to
manage an application depends on the application. Applications that consist of only one
process are usually represented by only one resource. More complex applications, built on
multiple processes or components, may require multiple resources.
3.1 Resource types
Generally, all resources are unique but some resources may have common attributes. Oracle
Clusterware uses resource types to organize these similar resources. Using resource types
provides the following benefits:
Every resource that is registered in Oracle Clusterware must have a certain resource type. In
addition to the resource types included in Oracle Clusterware, custom resource types can be
defined using the crsctl utility. The included base resource types are local_resource and
cluster_resource.
All user-defined resource types must be based, directly or indirectly, on either the
local_resource or cluster_resource type.
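A minimal sketch of defining and inspecting such a custom type (the type name is hypothetical; attributes can additionally be supplied with -attr or -file):
# crsctl add type myapp.type -basetype cluster_resource
# crsctl stat type myapp.type -p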
In order to list all defined types and their base types, run the crsctl stat type command:
TYPE_NAME=application
BASE_TYPE=cluster_resource
TYPE_NAME=cluster_resource
BASE_TYPE=resource
TYPE_NAME=local_resource
BASE_TYPE=resource
TYPE_NAME=ora.asm.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.cluster_resource.type
BASE_TYPE=cluster_resource
TYPE_NAME=ora.cluster_vip.type
BASE_TYPE=ora.cluster_resource.type
TYPE_NAME=ora.cluster_vip_net1.type
BASE_TYPE=ora.cluster_vip.type
TYPE_NAME=ora.database.type
BASE_TYPE=ora.cluster_resource.type
TYPE_NAME=ora.diskgroup.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.eons.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.gns.type
BASE_TYPE=ora.cluster_resource.type
TYPE_NAME=ora.gns_vip.type
BASE_TYPE=ora.cluster_vip.type
TYPE_NAME=ora.gsd.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.listener.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.local_resource.type
BASE_TYPE=local_resource
TYPE_NAME=ora.network.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.oc4j.type
BASE_TYPE=ora.cluster_resource.type
TYPE_NAME=ora.ons.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.registry.acfs.type
BASE_TYPE=ora.local_resource.type
TYPE_NAME=ora.scan_listener.type
BASE_TYPE=ora.cluster_resource.type
TYPE_NAME=ora.scan_vip.type
BASE_TYPE=ora.cluster_vip.type
TYPE_NAME=resource
BASE_TYPE=
To list all the attributes and default values for a type, run crsctl stat type <typeName> -f (for
full configuration) or -p (for static configuration).
3.1.1
This section specifies the attributes that make up the resource type definition. The resource
type is an abstract and read-only type definition; it may only serve as a base for other types.
Oracle Clusterware 11.2.0.1 will not allow user-defined types to extend this type directly.
To see all default values and names from the base resource type, run crsctl stat type
resource -p.
Name               History                  Description
NAME               From 10gR2
TYPE               From 10gR2, modified     Type: string; Special Values: No
CHECK_INTERVAL     From 10gR2, unchanged    Type: unsigned integer; Special Values: No; Per-X Support: Yes
DESCRIPTION        From 10gR2, unchanged    Type: string; Special Values: No
RESTART_ATTEMPTS   From 10gR2, unchanged    Type: unsigned integer; Special Values: No; Per-X Support: Yes
START_TIMEOUT      From 10gR2, unchanged    Type: unsigned integer; Special Values: No; Per-X Support: Yes
STOP_TIMEOUT       From 10gR2, unchanged    Type: unsigned integer; Special Values: No
SCRIPT_TIMEOUT     From 10gR2, unchanged    Type: unsigned integer; Special Values: No; Per-X Support: Yes
UPTIME_THRESHOLD   From 10gR2, unchanged    Type: string; Special Values: No; Per-X Support: Yes
AUTO_START         From 10gR2, unchanged    Type: string; Format: restore|never|always; Required: No; Default: restore; Special Values: No
BASE_TYPE          New                      The name of the base type from which this type extends. This is the value
                                            of the TYPE in the base type's profile.
                                            Type: string; Format: [name of the base type]; Required: Yes;
                                            Default: empty string (none); Special Values: No; Per-X Support: No
DEGREE             New
ENABLED            New                      The flag that governs the state of the resource as far as being managed by
                                            Oracle Clusterware, which will not attempt to manage a disabled resource,
                                            whether directly or because of a dependency on another resource. However,
                                            stopping of the resource when requested by the administrator will be allowed.
START_DEPENDENCIES New
STOP_DEPENDENCIES  New
AGENT_FILENAME     New
ACTION_SCRIPT      From 10gR2, modified
ACL                New                      Format: owner:<user>:rwx,pgrp:<group>:rwx,other::r
                                            where owner is the OS user of the resource owner, followed by the permissions
                                            that the owner has (resource actions will be executed as this user ID); pgrp is
                                            the OS group that is the resource's primary group, followed by the permissions
                                            that members of the group have; other is followed by the permissions that
                                            others have.
                                            Type: string; Required: No; Special Values: No
STATE_CHANGE_EVENT_TEMPLATE     New
PROFILE_CHANGE_EVENT_TEMPLATE   New         Special Values: No
ACTION_FAILURE_EVENT_TEMPLATE   New
LAST_SERVER                     New
OFFLINE_CHECK_INTERVAL          New
STATE_DETAILS                   New
3.1.2
The local_resource type is the basic building block for resources that are instantiated for
each server but are cluster oblivious and have a locally visible state. While the definition of
the type is global to the clusterware, the exact property values of the resource instantiation
on a particular server are stored on that server. This resource type has no equivalent in
Oracle Clusterware 10gR2 and is a totally new concept to Oracle Clusterware.
The following table specifies the attributes that make up the local_resource type definition.
To see all default values, run the command crsctl stat type local_resource -p.
Name          Description
ALIAS_NAME    Type: string; Required: No; Special Values: Yes; Per-X Support: No
LAST_SERVER   Overridden from resource: the name of the server to which the resource is assigned (pinned).

Only Cluster Administrators will be allowed to register local resources.
3.1.3
The cluster_resource type is the basic building block for resources that are cluster aware and
have a globally visible state. 11.1's application resource is a cluster_resource. The type's base
is resource, and the type definition is read-only.
The following table specifies the attributes that make up the cluster_resource type
definition. Run crsctl stat type cluster_resource -p to see all default values.
Name                 History                  Description
ACTIVE_PLACEMENT     From 10gR2, unchanged    Type: unsigned integer; Special Values: No
FAILOVER_DELAY       From 10gR2, unchanged    Deprecated. Special Values: No
FAILURE_INTERVAL     From 10gR2, unchanged    Type: unsigned integer; Special Values: No; Per-X Support: Yes
FAILURE_THRESHOLD    From 10gR2, unchanged    Type: unsigned integer; Special Values: No; Per-X Support: Yes
PLACEMENT            From 10gR2               Format: value, where value is one of the following:
                                              restricted - Only servers that belong to the associated server
                                              pool(s) or hosting members may host instances of the resource.
                                              favored - If only SERVER_POOLS or HOSTING_MEMBERS
                                              attribute is non-empty, servers belonging to the
HOSTING_MEMBERS      From 10g
SERVER_POOLS         New                      Format: * | [<pool name1> [...]]
                                              This attribute creates an affinity between the resource and one or
                                              more server pools as far as placement goes. The meaning of this
                                              attribute depends on what the value of PLACEMENT is. When a resource
                                              should be able to run on any server of the cluster, a special value
                                              of * needs to be used. Note that only Cluster Administrators can
                                              specify * as the value for this attribute.
                                              Required: restricted PLACEMENT requires either SERVER_POOLS or
                                              HOSTING_MEMBERS; favored PLACEMENT requires either SERVER_POOLS or
                                              HOSTING_MEMBERS but allows both; balanced PLACEMENT does not
                                              require a value.
                                              Type: string; Default: *; Special Values: No
CARDINALITY          New
LOAD                 New                      Required: No; Default: 1; Special Values: No; Per-X Support: Yes
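Putting several of these attributes together, a hypothetical cluster_resource registration could look like the following sketch (resource name, action script path, server pool and attribute values are all illustrative):
# crsctl add resource myapp -type cluster_resource -attr "ACTION_SCRIPT=/opt/myapp/myapp.scr,PLACEMENT=restricted,SERVER_POOLS=mypool,CHECK_INTERVAL=30,RESTART_ATTEMPTS=2,CARDINALITY=1"
# crsctl start resource myapp
# crsctl stat resource myapp -f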
3.2 Resource Dependencies
With Oracle Clusterware 11.2 a new dependency concept is introduced, which makes it possible
to define dependencies for start and stop actions independently of each other and with much
finer granularity.
3.2.1
Hard Dependency
Weak Dependency
Attraction
If resource A attracts B, then whenever B needs to be started, servers that currently have A
running will be first on the list of placement candidates. Since a resource may have more
than one resource to which it is attracted, the number of attraction-exhibiting resources will
govern the order of precedence as far as server placement goes.
If the dependency is on a resource type, as opposed to a concrete resource, this should be
interpreted as any resource of the type.
A possible flavor of this relation is to require that a resource's placement be re-evaluated
when a related resource's state changes. For example, resource A is attracted to B and C. At
the time of starting A, A is started where B is. Resource C may either be running or started
thereafter. Resource B is subsequently shut down/fails and does not restart. Then resource
A requires that at this moment its placement be re-evaluated and it be moved to C. This is
somewhat similar to the AUTO_START attribute of the resource profile, with the dependent
resource's state change acting as a trigger as opposed to a server joining the cluster.
A possible parameter to this relation is whether or not resources in intermediate state
should be counted as running and thus exhibit attraction.
If resource A excludes resource B, this means that starting resource A on a server where B is
running will be impossible. However, please see the dependency's namesake for STOP to
find out how B may be stopped/relocated so A may start.
3.2.4 Pull-up
only start resources if their TARGET is ONLINE. Note that this modifier is on the relation, not
on any of the targets as it applies to the entire relation.
If the dependency is on a resource type, as opposed to a concrete resource, this should be
interpreted as any resource of the type. The aforementioned modifiers for locality/state
still apply accordingly.
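To see how these relations surface in a resource profile, the dependency attributes of an existing resource can be inspected with crsctl; the resource name below is hypothetical, and the values in the comments only illustrate the typical shape of the hard/weak/pullup modifier syntax rather than the output of a specific system:
# crsctl stat res ora.mydb.db -p | grep DEPENDENCIES
#   e.g. START_DEPENDENCIES=hard(ora.DATA.dg) weak(type:ora.listener.type) pullup(ora.DATA.dg)
#   e.g. STOP_DEPENDENCIES=hard(intermediate:ora.asm)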
3.2.5 Dispersion
The dispersion relation describes the property between two resources that prefer not to be
co-located, as long as there is an alternative other than one of them being stopped. In other
words, if resource A prefers to run on a different server than the one occupied by resource B,
then resource A is said to have a dispersion relation to resource B at start time. This sort of
relation between resources has an advisory effect, much like that of attraction: it is not
binding, as the two resources may still end up on the same server.
A special variation on this relation is whether or not crsd is allowed/expected to disperse
resources that are already running, once that becomes possible. In other words, normally crsd
will not disperse co-located resources when, for example, a new server comes online: it will
not actively relocate resources once they are running, only disperse them when starting
them. However, if the dispersion is active, then crsd will try to relocate one of the
dispersing resources to the newly available server.
A possible parameter to this relation is whether or not resources in intermediate state
should be counted as running and thus exhibit dispersion.
4.1 Event Sources
In 11.2, the CRSD master is the originator of most events, and the database is the source of
the Runtime Load Balancing (RLB) events. The CRSD master passes events from the
PolicyEngine thread to the ReporterModule thread, in which the events are translated to
eONS events, and then the events are sent out to peers within the cluster. If eONS is not
running, the ReporterModule attempts to cache the events until the eONS server is
running, and then retries. The events are guaranteed to be sent and received in the order in
which the actions happened.
4.2
4.2.1
Every node runs one database agent, one ONS agent, and one eONS agent within crsd's
oraagent process. These agents are responsible for stop/start/check actions. There are no
dedicated threads for each agent; instead, oraagent uses a pool of threads to execute these
actions for the various resources.
4.2.2
Each of the three agents (as mentioned above) is associated with one other thread in the
oraagent that is blocked on ons_subscriber_receive(). These eONS subscriber threads can be
identified by the strings "Thread:[EonsSub ONS]", "Thread:[EonsSub EONS]" and
"Thread:[EonsSub FAN]" in the oraagent log. In the example below, a service was stopped
and this node's crsd oraagent process and its three eONS subscribers received the event:
2009-05-26 process {
2009-05-26 process }
2009-05-26 process }
2009-05-26 process {
2009-05-26 process {
2009-05-26 process }
4.2.3
On one node of the cluster, the eONS subscriber of the following agents also assumes the
role of a publisher or processor or master (pick your favorite terminology):
Each eonsagent's eONS subscriber on every node publishes eONS events as user
callouts. There is no single eONS publisher in the cluster. User callouts are no longer
produced by racgevtf.
[AGENTUSR][2934959008][UNKNOWN] CssLock::tryLock, got lock CLSN.ONS.ONSPROC
staiu02/agent/crsd/oraagent_spommere/oraagent_spommere.l01:2009-05-26 19:51:41.626: [AGENTUSR][3992972192][UNKNOWN] CssLock::tryLock, got lock CLSN.ONS.ONSNETPROC
staiu03/agent/crsd/oraagent_spommere/oraagent_spommere.l01:2009-05-26 20:00:21.214: [AGENTUSR][2856319904][UNKNOWN] CssLock::tryLock, got lock CLSN.RLB.pommi
staiu02/agent/crsd/oraagent_spommere/oraagent_spommere.l01:2009-05-26 20:00:27.108: [AGENTUSR][3926576032][UNKNOWN] CssLock::tryLock, got lock CLSN.FAN.pommi.FANPROC
These CSS-based locks work in such a way that any node can grab the lock if it is not already
held. If the process of the lock holder goes away, or CSS thinks the node went away, the lock
is released and someone else tries to get the lock. The different processors try to grab the
lock whenever they see an event. If a processor previously was holding the lock, it doesn't
have to acquire it again. There is currently no implementation of a "backup" or designated
failover-publisher.
4.2.4 ONSNETPROC
In a cluster of 2 or more nodes, one onsagent's eONS subscriber will also assume the role of
CLSN.ONS.ONSNETPROC, i.e. is responsible for just publishing network down events. The
publishers with the roles of CLSN.ONS.ONSPROC and CLSN.ONS.ONSNETPROC cannot and
will not run on the same node, i.e. they must run on distinct nodes.
If both the CLSN.ONS.ONSPROC and CLSN.ONS.ONSNETPROC simultaneously get their public
network interface pulled down, there may not be any event.
4.2.5 RLB publisher
An additional thread, tied to the dbagent thread in the oraagent process of only one node in
the cluster, is "Thread:[RLB:dbname]". It dequeues the LBA/RLB/affinity events from the
SYS$SERVICE_METRICS queue and publishes them to eONS clients. It assumes the lock role
of CLSN.RLB.dbname. The CLSN.RLB.dbname publisher can run on any node and is not
related to the location of the MMON master (which enqueues LBA events into the
SYS$SERVICE_METRICS queue). Since the RLB publisher (RLB.dbname) can run on a
different node than the ONS publisher (ONSPROC), RLB events can be dequeued on one
node and published to ONS on another node. There is one RLB publisher per database in
the cluster.
Sample trace, where Node 3 is the RLB publisher, and Node 2 has the ONSPROC role:
Node 3:
2009-05-28 19:29:10.754: [AGENTUSR][2857368480][UNKNOWN]
Thread:[RLB:pommi] publishing message srvname = rlb
2009-05-28 19:29:10.754: [AGENTUSR][2857368480][UNKNOWN]
Thread:[RLB:pommi] publishing message payload = VERSION=1.0 database=pommi
service=rlb { {instance=pommi_3 percent=25 flag=UNKNOWN
aff=FALSE}{instance=pommi_4 percent=25 flag=UNKNOWN
aff=FALSE}{instance=pommi_2 percent=25 flag=UNKNOWN
aff=FALSE}{instance=pommi_1 percent=25 flag=UNKNOWN aff=FALSE} }
timestamp=2009-05-28 19:29:10
The RLB events will be received by the eONS subscriber of the ONS publisher
(ONSPROC) who then posts the event to ONS:
Node 2:
2009-05-28 19:29:40.773: [AGENTUSR][3992976288][UNKNOWN] Publishing the
ONS event type database/event/servicemetrics/rlb
4.2.6
4.2.7
Example
Node 1
o
Node 2
o
Node 3
o
Node 4
o
Coming up in 11.2.0.2
The above description is only valid for 11.2.0.1. In 11.2.0.2, the eONS proxy, a.k.a. the eONS
server, will be removed and its functionality will be assumed by evmd. In addition, the
tracing described above will change significantly. The major reason for this change was
the high resource usage of the eONS JVM.
In order to find the publishers in the oraagent.log in 11.2.0.2, search for these patterns:
ONS.ONSNETPROC CssLockMM::tryMaster I am the master
ONS.ONSPROC CssLockMM::tryMaster I am the master
FAN.<dbname> CssLockMM::tryMaster I am the master
RLB.<dbname> CssSemMM::tryMaster I am the master
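For example, a quick way to locate the current publishers (the path follows the log layout used earlier in this paper):
# grep 'tryMaster I am the master' Grid_home/log/<hostname>/agent/crsd/oraagent_*/oraagent_*.l*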
5
5.1
Oracle does not recommend configuring separate interfaces for Oracle Clusterware and
Oracle RAC; instead, if multiple private interfaces are configured in the system, we
recommend bonding them into a single interface in order to provide redundancy in case
of a NIC failure. Unless bonded, multiple private interfaces provide only load balancing, not
failover capabilities.
The consequences of changing interface names depend on which name you are changing,
and whether you are also changing the IP address. In cases where you are only changing the
interface names, the consequences are minor. If you change the name for the public
interface that is stored in the OCR, then you also must modify the node applications for each
node. Therefore, you must stop the node applications for this change to take effect.
Changes made with oifcfg delif / setif for the cluster interconnect also change the private
interconnect used by clusterware, hence an Oracle Clusterware restart is the consequence.
The interface used by the Oracle RAC (RDBMS) interconnect must be the same interface that
Oracle Clusterware is using with the hostname. Do not configure the private interconnect
for Oracle RAC on a separate interface that is not monitored by Oracle Clusterware.
See Also: Oracle Clusterware Administration and Deployment Guide, "Changing Network
Addresses on Manually Configured Networks" for more information.
5.2 misscount
As misscount is a critical value, Oracle does not support changing the default value. The
current misscount value can be checked with
# crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
In case of vendor clusterware integration we set misscount to 600 in order to give the
vendor clusterware enough time to make a node join / leave decision. Never change the
default in a vendor clusterware configuration.
6
6.1
After a successful cluster installation or node startup the health of the entire cluster or a
node can be checked.
crsctl check has will check if OHASD is started on the local node and if the daemon is
running and healthy.
# crsctl check has
CRS-4638: Oracle High Availability Services is online
crsctl check crs will check the OHASD, the CRSD, the ocssd and the EVM daemon.
# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
crsctl check cluster -all will check all the daemons on all nodes belonging to the cluster.
# crsctl check cluster -all
**************************************************************
node1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
node2:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
During startup issues, monitor the output of the crsctl start cluster command; all attempts
to start a resource should be successful. If the start of a resource fails, consult the
appropriate log file to see the errors.
# crsctl start cluster
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'node1'
CRS-2676: Start of 'ora.cssdmonitor' on 'node1' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'node1'
CRS-2672: Attempting to start 'ora.diskmon' on 'node1'
6.2
crsctl is the Oracle Clusterware management utility that has commands to manage all
Clusterware entities under the Oracle Clusterware framework. This includes the daemons
that are part of the Clusterware, wallet management and clusterized commands that work
on all or some of the nodes in the cluster.
You can use CRSCTL commands to perform several operations on Oracle Clusterware, such
as:
A full comprehensive list of all debugging features and options is listed in the
Troubleshooting and Diagnostic Output section in the Oracle Clusterware Administration
and Deployment Guide.
6.3
Oracle Clusterware uses a unified log directory structure to consolidate component log files.
This consolidated structure simplifies diagnostic information collection and assists during
data retrieval and problem analysis.
Oracle Clusterware uses a file rotation approach for log files. If you cannot find the reference
given in the file specified in the "Details in" section of an alert file message, then this file
might have been rolled over to a rollover version, typically ending in *.l<number>, where
<number> starts at 01 and increments up to however many logs are being kept; the total can
differ between logs. While there is usually no need to follow the reference unless you are
asked to do so by Oracle Support, you can check the given path for rolled-over versions of
the file. The log retention policy, however, foresees that older logs are purged as required by
the amount of logs generated.
GRID_HOME/log/<host>/diskmon Disk Monitor Daemon
GRID_HOME/log/<host>/client OCRDUMP, OCRCHECK, OCRCONFIG, CRSCTL (edit the
GRID_HOME/srvm/admin/ocrlog.ini file to increase the trace level)
GRID_HOME/log/<host>/admin not used
GRID_HOME/log/<host>/ctssd Cluster Time Synchronization Service
GRID_HOME/log/<host>/gipcd Grid Interprocess Communication Daemon
GRID_HOME/log/<host>/ohasd Oracle High Availability Services Daemon
GRID_HOME/log/<host>/crsd Cluster Ready Services Daemon
GRID_HOME/log/<host>/gpnpd Grid Plug and Play Daemon
GRID_HOME/log/<host>/mdnsd Multicast Domain Name Service Daemon
GRID_HOME/log/<host>/evmd Event Manager Daemon
GRID_HOME/log/<host>/racg/racgmain RAC RACG
GRID_HOME/log/<host>/racg/racgeut RAC RACG
GRID_HOME/log/<host>/racg/racgevtf RAC RACG
GRID_HOME/log/<host>/racg RAC RACG (only used if pre-11.1 database is installed)
GRID_HOME/log/<host>/cssd Cluster Synchronization Service Daemon
GRID_HOME/log/<host>/srvm Server Manager
GRID_HOME/log/<host>/agent/ohasd/oraagent_oracle11 HA Service Daemon Agent
GRID_HOME/log/<host>/agent/ohasd/oracssdagent_root HA Service Daemon CSS Agent
GRID_HOME/log/<host>/agent/ohasd/oracssdmonitor_root HA Service Daemon
ocssdMonitor Agent
GRID_HOME/log/<host>/agent/ohasd/orarootagent_root HA Service Daemon Oracle Root
Agent
GRID_HOME/log/<host>/agent/crsd/oraagent_oracle11 CRS Daemon Oracle Agent
Diagcollection
The best way to get all clusterware related traces for an incident is using
Grid_home/bin/diagcollection.pl. To get all traces and an OCRDUMP, run the command as the
root user, diagcollection.pl --collect --crshome <GRID_HOME>, on all nodes of the cluster
and provide the collected traces to support or development.
# Grid_home/bin/diagcollection.pl
Production Copyright 2004, 2008, Oracle.
For more information about collection of IPD data please see section 6.4.
In case of a vendor clusterware installation it is important to collect and provide all related
vendor clusterware files to Oracle Support.
6.3.2
Beginning with Oracle Database 11g release 2 (11.2), certain Oracle Clusterware messages
contain a text identifier surrounded by "(:" and ":)". Usually, the identifier is part of the
message text that begins with "Details in..." and includes an Oracle Clusterware diagnostic
log file path and name similar to the following example. The identifier is called a DRUID, or
Diagnostic Record Unique ID:
2009-07-16 00:18:44.472
[/scratch/11.2/grid/bin/orarootagent.bin(13098)]CRS-5822:Agent
'/scratch/11.2/grid/bin/orarootagent_root' disconnected from server. Details at
(:CRSAGF00117:) in
/scratch/11.2/grid/log/stnsp014/agent/crsd/orarootagent_root/orarootagent_root.log.
DRUIDs are used to relate external product messages to entries in a diagnostic log file and to
internal Oracle Clusterware program code locations. They are not directly meaningful to
customers and are used primarily by Oracle Support when diagnosing problems.
6.4
There are several Java-based GUI tools which in case of errors should run with the following
trace levels set:
"setenv SRVM_TRACE true" (or "export SRVM_TRACE=true")
"setenv SRVM_TRACE_LEVEL 2" (or "export SRVM_TRACE_LEVEL=2")
The Oracle Universal Installer can run with the debug flag in case of installer errors (e.g.
"./runInstaller -debug" for install).
6.5 Reboot Advisory
Time is of the essence in most reboot scenarios, and the reboot usually occurs before the operating
system flushes buffered log data to disk. This means that an explanation of what led to the
reboot may be lost.
New in the 11.2 release of Oracle Clusterware is a feature called Reboot Advisory that
improves the chances of preserving an explanation for a Clusterware-initiated reboot. At
the moment a reboot decision is made by Clusterware, a short explanatory message is
produced and an attempt is made to publish it in two ways:
The reboot decision is written to a small file (normally on locally-attached storage) using a
direct, non-buffered I/O request. The file is created and preformatted in advance of the
failure (during Clusterware startup), so this I/O has a high probability of success, even on a
failing system. The reboot decision is also broadcast over all available network interfaces on
the failing system.
These operations are executed in parallel and are subject to an elapsed time limit so as not
to delay the impending reboot. Attempting both disk and network publication of the
message makes it likely that at least one succeeds, and often both will. Successfully stored
or transmitted Reboot Advisory messages ultimately appear in a Clusterware alert log on
one or more nodes of the cluster.
When network broadcast of a Reboot Advisory is successful, the associated messages
appear in the alert logs of other nodes in the cluster. This happens more or less
instantaneously, so the messages can be viewed immediately to determine the cause of the
reboot. The message includes the host name of the node that is being rebooted to distinguish it
from the normal flow of alert messages for that node. Only nodes in the same cluster as the
failing node will display these messages.
If the Reboot Advisory was successfully written to a disk file, then when Oracle Clusterware
starts the next time on that node, it will produce messages related to the prior reboot in the
Clusterware alert log. Reboot Advisories are timestamped, and the startup scan for these files
will announce any occurrences that are less than 3 days old. The scan doesn't empty or mark
already-announced files, so the same Reboot Advisory can appear in the alert log multiple
times if Clusterware is restarted on a node multiple times within a 3-day period.
Whether from a file or a network broadcast, Reboot Advisories use the same alert log
messages, normally two per advisory. The first is message CRS-8011, which displays the host
name of the rebooting node, a software component identifier, and a timestamp
(approximately the time of the reboot). An example looks like this:
[ohasd(24687)]CRS-8011:reboot advisory message from host: sta00129, component:
Following message CRS-8011 will be CRS-8013, which conveys the explanatory message for
the forced reboot, as in this example:
Note that everything in message CRS-8013 after "text:" originates in the Clusterware
component that instigated the reboot. Because of the critical circumstances in which it is
produced, this text does not come from an Oracle NLS message file: it is always in the English
language and the USASCII7 character set.
In some circumstances, Reboot Advisories may convey binary diagnostic data in addition to a
text message. If so, message CRS-8014 and one or more occurrences of message CRS-8015 will
also appear. This binary data is used only if the reboot situation is reported to Oracle for
resolution.
Because multiple components can write to the Clusterware alert log at the same time, it is
possible that the messages associated with a given Reboot Advisory may appear with other
(unrelated) messages interspersed. However, messages for different Reboot Advisories are
never interleaved: all of the messages for one Advisory are written before any message for
another Advisory.
For additional information, refer to the Oracle Errors manual discussion of messages CRS-8011 and CRS-8013.
7
Other Tools
7.1
ocrpatch
ocrpatch was developed in 2005 in order to provide Development and Support with a tool that
is able to fix corruptions or make other changes in OCR in cases where official tools such
as ocrconfig or crsctl are unable to handle such changes. ocrpatch is NOT distributed
as part of the software release. The functionality of ocrpatch is already well described in a
separate document, therefore we won't go into details in this paper; the ocrpatch document
is located in the public RAC Performance Group Folder on stcontent.
7.2
vdpatch
7.2.1
Introduction
vdpatch is a new, Oracle-internal tool developed for Oracle Clusterware 11g release 2
(11.2). vdpatch largely uses the same code as ocrpatch, i.e. the look & feel is very
similar. The purpose of this tool is to facilitate diagnosis of CSS-related issues where voting
file content is involved. vdpatch operates on a per-block basis, i.e. it can read (not write)
512-byte blocks from a voting file by block number or name. Similarly to ocrpatch, it
attempts to interpret the content in a meaningful way instead of just presenting columns of
hexadecimal values. vdpatch allows online (Clusterware stack and ocssd running) and offline
(Clusterware stack / ocssd not running) access. vdpatch works for voting files both on NAS
and in ASM. At this time, vdpatch is not actively distributed, just like ocrpatch;
Development and Support have to obtain a binary from a production ADE label.
7.2.2
General Usage
The filename/pathname of the voting file(s) can be obtained via the 'crsctl query css votedisk'
command; note that this command only works if ocssd is running. If ocssd is not up, crsctl
will signal:
# crsctl query css votedisk
Unable to communicate with the Cluster Synchronization Services daemon.
When ocssd is up, 'crsctl query css votedisk' reports the configured voting files; in the
example used for this paper, three voting files are defined in the diskgroup +VDDG, each
located on a particular raw device that is part of the ASM diskgroup. vdpatch allows
opening only ONE device at a time to read its content:
# vdpatch
VD Patch Tool Version 11.2 (20090724)
Oracle Clusterware Release 11.2.0.2.0
Copyright (c) 2008, 2009, Oracle. All rights reserved.
vdpatch> op /dev/raw/raw100
[OK] Opened /dev/raw/raw100, type: ASM
If the voting file is on a raw device, crsctl and vdpatch would show
$ crsctl query css votedisk
##  STATE    File Universal Id
--  -----    -----------------
 1. ONLINE   1de94f4db65a4f9bbf8b9bf3eba6f43b
 2. ONLINE   26d28a7311264f77bf8df6463420e614
 3. ONLINE   9f862a63239b4f52bfdbce6d262dc349 (/dev/raw/raw134) []
Located 3 voting file(s).
# vdpatch
VD Patch Tool Version 11.2 (20090724)
Oracle Clusterware Release 11.2.0.2.0
Copyright (c) 2008, 2009, Oracle. All rights reserved.
vdpatch> op /dev/raw/raw126
[OK] Opened /dev/raw/raw126, type: Raw/FS
The 'h' command lists all other available commands:
vdpatch> h
Usage: vdpatch
BLOCK operations
  op <path to voting file>        open voting file
  rb <block#>                     read block by block#
  rb status|kill|lease <index>    read named block
                                  index=[0..n] => Devenv nodes 1..(n-1)
                                  index=[1..n] => shiphome nodes 1..n
  rb toc|info|op|ccin|pcin|limbo  read named block
  du                              dump native block from offset
  di                              display interpreted block
  of <offset>                     set offset in block, range 0-511
MISC operations
  i                               show parameters, version, info
  h                               this help screen
  exit / quit                     exit vdpatch
7.2.3
The common use case for vdpatch is reading content. Voting file blocks can be read by
either block number or named block type. For the types TOC, INFO, OP, CCIN, PCIN and LIMBO,
only one block of each type exists in the voting file, so reading such a block is done by e.g.
running 'rb toc'; the output shows both a hex/ASCII dump of the 512-byte block and the
interpreted content of that block:
vdpatch> rb toc
[OK] Read block 4
[INFO] clssnmvtoc block
0 73734C63 6B636F54 01040000 00020000 00000000 ssLckcoT............
20 00000000 40A00000 00020000 00000000 10000000 ....@...............
40 05000000 10000000 00020000 10020000 00020000 ....................
...
For block types STATUS, KILL and LEASE, there exists one block per defined cluster node, so
the 'rb' command needs to be used in combination with an index that denotes the node
number. In a Development environment, the index starts with 0, while in a
shiphome/production environment, the index starts with 1. So in order to read the 5th
node's KILL block in a Development environment, submit 'rb kill 4', while in a production
environment, use 'rb kill 5'.
Example to read the STATUS block of node 3 (here: staiu03) in a Development environment:
vdpatch> rb status 2
[OK] Read block 18
[INFO] clssnmdsknodei vote block
  0 65746F56 02000000 01040B02 00000000 73746169 etoV............stai
 20 75303300 00000000 00000000 00000000 00000000 u03.................
 40 00000000 00000000 00000000 00000000 00000000 ....................
 60 00000000 00000000 00000000 00000000 00000000 ....................
 80 00000000 3EC40609 8A340200 03000000 03030303 ....>....4..........
100 00000000 00000000 00000000 00000000 00000000 ....................
120 00000000 00000000 00000000 00000000 00000000 ....................
We do not plan to allow vdpatch to make any changes to a voting file. The only
recommended way of modifying voting files is to drop and recreate them using the crsctl
command.
7.3
In 11.2, the creation and deletion of an application VIP (user VIP) can be managed via
Grid_home/bin/appvipcfg:
Production Copyright 2007, 2008, Oracle. All rights reserved
Usage: appvipcfg create -network=<network_number> -ip=<ip_address>
                        -vipname=<vipname> -user=<user_name> [-group=<group_name>]
       delete -vipname=<vipname>
The appvipcfg command line tool can only create an application VIP on the default network,
for which the resource ora.net1.network is created by default. If an application VIP is needed
on a different network or subnet, it must be created manually.
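For the default network, a typical invocation simply follows the usage shown above; the IP
address, VIP name and user in this sketch are illustrative placeholders, not values taken from
this paper:
# appvipcfg create -network=1 -ip=10.1.1.100 -vipname=appsvip -user=oracle11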
Example of creating a user VIP on a different network (ora.net2.network):
srvctl add vip -n node1 -k 2 -A appsvip1/255.255.252.0/eth2
crsctl add type coldfailover.vip.type -basetype ora.cluster_vip_net2.type
crsctl add resource coldfailover.vip -type coldfailover.vip.type -attr \
"DESCRIPTION=USRVIP_resource,RESTART_ATTEMPTS=0,START_TIMEOUT=0, STOP_TIMEOUT=0, \
CHECK_INTERVAL=10, USR_ORA_VIP=10.137.11.163, \
START_DEPENDENCIES=hard(ora.net2.network)pullup(ora.net2.network), \
STOP_DEPENDENCIES=hard(ora.net2.network), \
ACL='owner:root:rwx,pgrp:root:r-x,other::r--,user:oracle11:r-x'"
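Once the resource is registered, it can be started and checked with the standard crsctl
commands; a minimal usage sketch (node name as in the srvctl example above):
# crsctl start resource coldfailover.vip -n node1
# crsctl status resource coldfailover.vip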
There are a couple of known bugs in this area; for tracking purposes and completeness,
they are listed here:
8632344 srvctl modify nodeapps -a will modify the vip even if the interface is not
valid
8703112 appsvip should have the same behavior as ora.vip like vip failback
8761666 appsvipcfg should respect /etc/hosts entry for apps ip even if gns is
configured
8820801 using a second network (-k 2), it is possible to add and start the same IP twice
7.4
Application and Script Agent
The application or script agent manages the application/resource through application-specific
user code. Oracle Clusterware contains a special shared library (libagfw) which
allows users to plug in application-specific actions using a well-defined interface.
The following sections describe how to build an agent using Oracle Clusterware's agent
framework interface.
7.4.1
Action entry points refer to user defined code that needs to be executed whenever an
action has to be taken on a resource (start resource, stop resource etc.). For every resource
type, Clusterware requires that action entry points are defined for the following actions:
start : Actions to be taken to start the resource
stop : Actions to gracefully stop the resource
check : Actions taken to check the status of the resource
clean : Actions to forcefully stop the resource.
These action entry points can be defined using C++ code or in a script. If any of these actions
is not explicitly defined, Clusterware assumes by default that it is defined in a script.
This script is located via the ACTION_SCRIPT attribute for the resource type. Hence it is
possible to have hybrid agents, which define some action entry points using a script and other
action entry points using C++. It is possible to define action entry points for other actions too
(e.g. for changes in attribute values), but these are not mandatory.
7.4.2
Sample Agents
Consider a file as the resource that needs to be managed by Clusterware. An agent that
manages this resource has the following tasks:
On startup      : Create the file.
On shutdown     : Gracefully delete the file.
On check command: Detect whether the file is present or not.
On clean command: Forcefully delete the file.
To describe this particular resource to Oracle Clusterware, a specialized resource type is first
created that contains all the characteristic attributes for this resource class. In this case, the
only special attribute to be described is the filename to be monitored. This can be done with
the crsctl command. While defining the resource type, we can also specify the
ACTION_SCRIPT and AGENT_FILENAME attributes. These are used to refer to the shell script
and executables that contain the action entry points for the agents.
Once the resource type is defined, there are several options to write a specialized agent
which does the required tasks - the agent could be written as a script, as a C/C++ program or
as a hybrid.
Examples for each of them are given below.
7.4.3
(1) Start the Clusterware and create the action script (demoActionScript) that implements the
start, stop, check and clean entry points; a minimal sketch of such a script is shown at the
end of this example.
(2) Add a new resource type using the crsctl utility. The command to do this is:
$ crsctl add type test_type1 -basetype cluster_resource \
-attr "ATTRIBUTE=PATH_NAME,TYPE=string,DEFAULT_VALUE=default.txt" \
-attr \
"ATTRIBUTE=ACTION_SCRIPT,TYPE=string,DEFAULT_VALUE=/path/to/demoActionScript"
Modify the path to the file appropriately. This adds a new resource type to Clusterware.
Alternatively, the attributes can be added in a text file which is passed as a parameter to the
crsctl utility.
(3) Add new resources to the cluster using the crsctl utility. The commands to do this are:
$ crsctl add resource r1 -type test_type1 -attr "PATH_NAME=/tmp/r1.txt"
$ crsctl add resource r2 -type test_type1 -attr "PATH_NAME=/tmp/r2.txt"
Modify the PATH_NAME attribute for the resources as needed. This adds resources named
r1 and r2 to be monitored by clusterware. Here we are overriding the default value for the
PATH_NAME attribute for our resources.
(4) Start/stop the resources using the crsctl utility. The commands to do this are:
$ crsctl start res r1
$ crsctl start res r2
$ crsctl check res r1
$ crsctl stop res r2
The files /tmp/r1.txt and /tmp/r2.txt get created and deleted as the resources r1 and r2 get
started and stopped.
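The demoActionScript itself is not reproduced in this paper. The following is only a minimal
sketch of what such an action script could look like; it assumes that the agent framework
passes the requested entry point as the first argument and that the PATH_NAME attribute is
visible to the script as the environment variable _CRS_PATH_NAME:
#!/bin/sh
# Minimal sketch of an action script for the file-resource example.
# $1 is the entry point requested by the agent framework: start|stop|check|clean.
FILE=${_CRS_PATH_NAME:-/tmp/demo.txt}     # resource attribute, with a fallback default
case "$1" in
  start) touch "$FILE" || exit 1 ;;       # create the file
  stop)  rm -f "$FILE" ;;                 # gracefully delete the file
  clean) rm -f "$FILE" ;;                 # forcefully delete the file
  check) [ -f "$FILE" ] || exit 1 ;;      # non-zero exit code means the resource is OFFLINE
esac
exit 0
The exit status is what the framework evaluates: 0 indicates success for start/stop/clean and
ONLINE for check.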
7.4.4
(1) Compile the C++ agent using the provided source file demoagent1.cpp and makefile.
The makefile needs to be modified based on the local compiler/linker paths and install
locations. The output will be an executable named demoagent1.
(2) Start the Clusterware.
(3) Add a new resource type using the crsctl utility as below:
$ crsctl add type test_type1 -basetype cluster_resource \
-attr "ATTRIBUTE=PATH_NAME,TYPE=string,DEFAULT_VALUE=default.txt" \
-attr "ATTRIBUTE=AGENT_FILENAME,TYPE=string,DEFAULT_VALUE=/path/to/demoagent1"
Modify the path to the file appropriately. This adds a new resource type to Clusterware.
(4) Create a new resource based on the type that is defined above. The commands are as
follows:
$ crsctl add res r3 -type test_type1 -attr "PATH_NAME=/tmp/r3.txt"
$ crsctl add res r4 -type test_type1 -attr "PATH_NAME=/tmp/r4.txt"
The files /tmp/r3.txt and /tmp/r4.txt get created and deleted as the resources get started
and stopped.
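The resources defined for the C++ agent are driven with the same crsctl commands that were
used for the script agent; for example:
$ crsctl start res r3
$ crsctl check res r3
$ crsctl stop res r3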
7.4.5
-attr "ATTRIBUTE=AGENT_FILENAME,TYPE=string,DEFAULT_VALUE=/path/demoagent2" \
-attr "ATTRIBUTE=ACTION_SCRIPT,TYPE=string,DEFAULT_VALUE=/path/demoActionScript"
Modify the path to the files appropriately. This adds a new resource type to Clusterware.
(4) Create new resources based on the type that is defined above. The commands are as
follows:
$ crsctl add res r5 -type test_type1 -attr "PATH_NAME=/tmp/r5.txt"
$ crsctl add res r6 -type test_type1 -attr "PATH_NAME=/tmp/r6.txt"
The files /tmp/r5.txt and /tmp/r6.txt get created and deleted as the resources get started
and stopped.
7.5
7.5.1
The Oracle Cluster Health Monitor (formerly known as the Instantaneous Problem Detection
tool, IPD/OS) is designed to detect and analyze operating system (OS) and cluster
resource-related degradation and failures, in order to bring more explanatory power to many
Oracle Clusterware and Oracle RAC issues such as node evictions.
It continuously tracks OS resource consumption at the node, process, and device level, and
collects and analyzes the cluster-wide data. In real-time mode, an alert is shown to the
operator when thresholds are hit. For root cause analysis, historical data can be replayed to
understand what was happening at the time of failure.
The tool installation is straightforward and is described in the README shipped with the zip
file. The latest version for Linux and Windows is available on OTN under the following link:
http://www.oracle.com/technology/products/database/clustering/ipd_download_homepage.html
7.5.2
In order to install the tool on a list of nodes, run the basic installation steps described in
the README (the installation is driven by the crfinst.pl script). To finalize the install, log in
as root and run 'crfinst.pl -f' on all installed nodes.
7.5.3
The OS tool must be started via '/etc/init.d/init.crfd start'. This command spawns the
osysmond process, which spawns the ologgerd daemon. The ologgerd then picks a replica
node (if there are two or more nodes) and informs the osysmond on that node to spawn the
replica ologgerd.
The OS tool stack can be shut down on a node as follows:
# /etc/init.d/init.crfd disable
7.5.4
The osysmond (one daemon per cluster node) collects the data at node, process and device
level. The osysmond will alert on perceived node hangs (under-utilized resources despite
many potential consumer tasks), for example:
Disk I/Os per second < 10% of the maximum possible disk I/Os per second
7.5.5
CRFGUI
The Oracle Cluster Health Monitor is shipped with two data retrieval tools; one is crfgui,
which is the main GUI display.
Crfgui connects to the local or remote master LOGGERD. If the GUI is installed inside the
cluster, it auto-detects the LOGGERD; when running outside the cluster, a cluster node
must be specified with the -m switch.
The GUI alerts on critical resource usage events and perceived system hangs. After starting it,
different GUI views are supported, such as cluster view, node view and device view.
Usage: crfgui [-m <node>] [-d <time>] [-r <sec>] [-h <sec>]
  -d <time>
  -r <sec>     Refresh rate
  -h <sec>     Highlight rate
  -W <sec>
  -I
  -f <name>
  -D <int>
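For example, to attach the GUI to a cluster from a workstation outside the cluster with a
5-second refresh rate, an invocation along the following lines can be used (the node name is
a placeholder):
$ crfgui -m node1 -r 5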
7.5.6
oclumon
A command line tool is included in the package which can be used to query the Berkeley DB
backend to print out to the terminal the node specific metrics for a specified time period.
The tool also supports a query to print the durations and the states for a resource on a node
during a specified time period. These states are based on predefined thresholds for each
resource metric and are denoted as red, orange, yellow and green indicating decreasing
order of criticality. For example, you could ask how many seconds the CPU on node "node1"
remained in the RED state during the last hour. Oclumon can also be used to perform
miscellaneous administrative tasks such as changing the debug levels, querying the version of
the tool, changing the metrics database size, etc.
The usage of oclumon can be printed via 'oclumon -h'. To get more information about
each verb option, run 'oclumon <verb> -h'.
Currently supported verbs are:
showtrail, showobjects, dumpnodeview, manage, version, debug, quit and help
Below are some useful examples of options that can be passed to oclumon. The default location
for oclumon is /usr/lib/oracrf/bin/oclumon.
Showobjects
oclumon showobjects -n node -time "2009-10-07 15:11:00"
Dumpnodeview
oclumon dumpnodeview -n node
Showgaps
oclumon showgaps -n node1 -s "2009-10-07 02:40:00" \
-e "2009-10-07 03:59:00"
Showtrail
oclumon showtrail -n node1 -diskid sde qlen totalwaittime \
-s "2009-07-09 03:40:00" -e "2009-07-09 03:50:00" \
-c "red" "yellow" "green"
Parameter=QUEUE LENGTH
2009-07-09 03:40:00 TO 2009-07-09 03:41:31 GREEN
2009-07-09 03:41:31 TO 2009-07-09 03:45:21 GREEN
2009-07-09 03:45:21 TO 2009-07-09 03:49:18 GREEN
2009-07-09 03:49:18 TO 2009-07-09 03:50:00 GREEN
Parameter=TOTAL WAIT TIME
2009-07-09 03:40:00 TO 2009-07-09 03:41:31 GREEN
2009-07-09 03:41:31 TO 2009-07-09 03:45:21 GREEN
2009-07-09 03:45:21 TO 2009-07-09 03:49:18 GREEN
2009-07-09 03:49:18 TO 2009-07-09 03:50:00 GREEN
7.5.7
Run the 'Grid_home/bin/diagcollection.pl --collect --ipd --incidenttime <inc time> --incidentduration <duration>' command on the IPD master (LOGGERD) node, where the --incidenttime format is MM/DD/YYYY24HH:MM:SS and --incidentduration is HH:MM.
Collect data for at least 30 min before and after the incident.
masterloggerhost:$./bin/diagcollection.pl --collect --ipd --incidenttime
10/05/200909:10:11 --incidentduration 02:00
Starting with 11.2.0.2 and the CRS-integrated IPD/OS, the syntax to collect the IPD data is:
masterloggerhost:$./bin/diagcollection.pl --collect --crshome /scratch/grid_home_11.2/ \
--ipdhome /scratch/grid_home_11.2/ --ipd --incidenttime 01/14/201001:00:00 --incidentduration 04:00
7.5.8
Debugging
In order to turn on debugging for the osysmond or the ologgerd, run 'oclumon debug log all
allcomp:5' as the root user. This will turn on debugging for all components.
Starting with 11.2.0.2, the IPD/CHM log files will be located under:
Grid_home/log/<hostname>/crfmond
Grid_home/log/<hostname>/crfproxy
Grid_home/log/<hostname>/crflogd
7.5.9
osysmond usually starts immediately, while it may take seconds (minutes if your I/O
subsystem is slow) for ologgerd and oproxyd to start, due to the initialization of the Berkeley
Database (BDB). The first node to call 'runcrf' will be configured as master. The first node
after the master to run 'runcrf' will be configured as replica. From there on, these roles will
move to other nodes if required. Daemons to look out for are: osysmond (on all nodes),
ologgerd (on master and replica nodes), oproxyd (on all nodes).
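A quick, informal way to verify that the expected daemons are up on a node is a simple
process listing:
# ps -ef | egrep 'osysmond|ologgerd|oproxyd' | grep -v grep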
In a development environment, the IPD/OS processes do not run as root or in real time.
7.5.10
11.2.0.2
- The oproxyd process may or may not exist anymore. As of the time of publication of this
document, the oproxyd process is disabled.
- IPD/OS will be represented by the OHASD resource ora.crf, and the need for manual
installation and configuration for both development and production environments
will be eliminated.
Appendix
References
Oracle Clusterware 11g Release 2 (11.2) - Using standard NFS to support a third voting file for extended cluster configurations
Grid Infrastructure Installation Guide 11g Release 2 (11.2)
Clusterware Administration and Deployment Guide 11g Release 2 (11.2)
Storage Administrator's Guide 11g Release 2 (11.2)
Oracle Clusterware 11g Release 2 Technical Overview
http://www.oracle.com/technology/products/database/clustering/ipd_download_homepage.html
Functional Specification for CRS Resource Modeling Capabilities, Oracle Clusterware, 11gR2
Useful Notes
Note 294430.1 - CSS Timeout Computation in Oracle Clusterware
Note 1050693.1 - Troubleshooting 11.2 Clusterware Node Evictions (Reboots)
Note 1053010.1 - How to Dump the Contents of an Spfile on ASM when ASM/GRID is down
Note 338706.1 - Oracle Clusterware (formerly CRS) Rolling Upgrades
Note 785351.1 - Upgrade Companion 11g Release 2
http://www.oracle.com/technology/products/database/oracle11g/upgrade/index.html
This document is provided for information purposes only and the contents hereof are subject to change without notice. This document is not warranted to be error-free, nor subject to any other
warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or
fitness for a particular purpose. We specifically disclaim any liability with respect to this document and no contractual obligations are
formed either directly or indirectly by this document. This document may not be reproduced or transmitted in any form or by any
means, electronic or mechanical, for any purpose, without our prior written permission.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective
owners.