Disaster Recovery for IBM Systems
• This session will focus on the architectural and operational issues that need to be considered when planning
and implementing a Disaster Recovery plan with WebSphere Application Server and IBM BPM. Topics will
include use of multiple data centers, geographic separation constraints, supporting software components,
disaster recovery and other common deployment issues. Though focused primarily on WebSphere Application
Server and IBM BPM this session also applies to IBM Middleware that is deployed on WebSphere Application
Server as well as Pure Application System.
• While not a prerequisite, attendees should be familiar with the material covered in "Preparing to Fail: Practical WebSphere Application Server High Availability"
Agenda
• Concepts
• Disaster Recovery
– Multiple Cells and Data Centers
– WebSphere Application Server Recovery
– IBM BPM Recovery
• Final Thoughts
• Redundancy
– The provision of additional or duplicate systems, equipment, etc., that
function in case an operating part or system fails, as in a spacecraft.
• Isolated
– Separated from other persons or things; alone; solitary
• Independent
– Not dependent; not depending or contingent upon something else for
existence, operation, etc.
• All of the Above are Fundamental for Effective High Availability and
Disaster Recovery
High Availability
• Usually the goal is very brief disruptions, for only some users, during unplanned events
Continuous Operations
• Ensuring that the system is never unavailable during planned activities
• Very expensive
[Diagram: Cluster “B” spans Server 1 and Server 2; Cluster “C” has both members on Server 2; the database hardware is clustered with shared disk. With both of its members on a single server, Cluster “C” is no longer fault tolerant.]
IBM Cloud Technical Academy
Definitions
Disaster Recovery
• Ensuring that the system can be reconstituted and/or activated at another location and can process work after an unexpected catastrophic failure at one location
• Often a combination of multiple single failures (each normally handled by high availability techniques) is considered catastrophic
• This environment may be substantially smaller than the entire production environment, as only a subset of production applications demand DR
• Service Level Agreements (SLAs) cover many things; our focus is the availability aspects
• You need a clear set of requirements that define precisely the availability requirements of the system, taking into account:
– Components of the system
A system has many pieces and business aspects; how do their requirements differ?
– Responsiveness and throughput requirements
100% of requests aren't going to work perfectly 100% of the time
– Service timeframe (e.g., 7 x 24)
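As a rough illustration of the arithmetic behind an availability SLA, the sketch below converts an availability target into a yearly downtime budget. The percentages shown are illustrative, not figures from this session.

```python
# Sketch: translating an availability SLA percentage into a downtime budget.
# The SLA figures here are illustrative examples, not from the session.

def downtime_budget_minutes(availability_pct: float, days: int = 365) -> float:
    """Minutes of allowed downtime per period for a given availability percentage."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100.0)

for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% over a year -> {downtime_budget_minutes(sla):.1f} minutes of downtime")
```

Numbers like these are often the starting point for deciding which applications justify the cost of a full DR environment.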
• Concepts
• Disaster Recovery
– Multiple Cells and Data Centers
– WebSphere Application Server Recovery
– IBM BPM Recovery
• Final Thoughts
[Diagram: cell with WAS transaction logs, user registry database, and shared filesystem; the database is replicated for HA only, in a different data center]
• Otherwise, an HA event could force you to enact your DR procedure
Multiple Data Center Options (1/4)
• Classic DR
• Active/Passive
– Two Data Centers, one Serving Requests the other Idling
– Independent Cells
– Easier Than Active/Active
– User and Application State Synchronization are Less Critical
– Asynchronous Replication Is Likely Sufficient
• Application Data, WebSphere Transaction Logs in Single Consistency Group
– Lower Cost for Network and Hardware Capacity
– From a Capacity perspective One Data Center is Being Underutilized.
• Typically Does Not Incur S/W License Charges When Idle
– If You Don’t Pay for S/W Licenses Is Cost and Underutilization Still a Concern?
– WebSphere License Provides for
o Hot – Processing Requests License Required
o Warm – Started But Not Processing Requests, License Not Required
o Cold – Installed, But Not Started, License Not Required
– DB2 and MQ Require Less Than 100% of Hot Licenses for Replication
Multiple Data Center Options (2/4)
• Classic “Active/Active”
• Two Data Centers
– Independent Cells and Synchronized Resource Managers (DBs)
– Serving Requests for Same Applications
– Requires Shared Application Data
o Application Data Consistency is prerequisite to any other planning
o Simultaneous Reads/Writes = Geographic Synchronous Disk Replication
– Additional Hardware and Disk Capacity Required
– e.g. IBM High Availability Geographic Cluster (HAGEO), Sun Cluster Geographic Edition
• Provides Most of the Benefits of “Classic Active/Active” without the Cost and Complexity
http://en.wikipedia.org/wiki/CAP_theorem
Multiple Active Data Centers and the CAP Theorem
• Legend
– Active/Active Improves Utilization
• Reality
– An Active/Active Topology at 40-50% Utilization in Each DC Is Equivalent to an Active/Passive Deployment with One Data Center Active at 80% to 90% Utilization and the Other Passive
– Running Active/Active at Greater Than 50% of Total (Both Data Centers) Capacity Can Often Result in a Complete Loss of Service When a Data Center Outage Occurs
o Insufficient Capacity in the Remaining Data Center to Handle >100% of Its Capacity Results in:
– Poor Response Time (at best)
– Network and Server Overload, Resulting in a Complete Crash
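The capacity argument above can be sketched numerically. Assuming load is split evenly across two data centers, losing one means the survivor must absorb the full load; the utilization figures below are illustrative only.

```python
# Sketch: why running Active/Active above 50% of combined capacity is risky.
# Numbers are illustrative assumptions, not measurements from the session.

def surviving_dc_load(total_utilization_pct: float) -> float:
    """Given utilization of the COMBINED two-DC capacity with load split evenly,
    return the utilization the surviving DC sees after the other DC is lost.
    The survivor has half the combined capacity, so its utilization doubles."""
    return total_utilization_pct * 2

for total in (40, 50, 60):
    load = surviving_dc_load(total)
    status = "OK" if load <= 100 else "OVERLOAD: likely complete loss of service"
    print(f"combined utilization {total}% -> survivor at {load}%: {status}")
```

At exactly 50% combined utilization the survivor lands at 100%, with no headroom for failover transients, which is why staying below 50% is the usual guidance.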
• There’s an earthquake
• "Far Sync instance behaves just like any other Data Guard/Active Data Guard archive destination configured for SYNC transport. This means that the laws of physics regarding the speed of light that affect round-trip network latency have not been altered in any way; primary database performance will still be impacted by the network round-trip time between the primary database and the Far Sync instance."
http://www.oracle.com/technetwork/database/availability/farsync-2267608.pdf - page 6
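The physics the Oracle paper refers to can be sketched with back-of-the-envelope arithmetic. The fiber propagation speed and distances below are common rule-of-thumb assumptions, not figures from the paper, and real links add switching and queuing delay on top.

```python
# Sketch: how round-trip latency bounds synchronous-commit rates.
# Purely illustrative arithmetic; the constant and distances are assumptions.

SPEED_IN_FIBER_KM_PER_MS = 200  # roughly 2/3 of c in optical fiber

def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip time between sites, ignoring switching delays."""
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_MS

def max_serial_sync_commits_per_sec(distance_km: float) -> float:
    """Upper bound on strictly serialized synchronous commits per second,
    if each commit must wait one round trip to the remote site."""
    return 1000 / min_round_trip_ms(distance_km)

for km in (10, 100, 1000):
    print(f"{km:5d} km: RTT >= {min_round_trip_ms(km):.2f} ms, "
          f"<= {max_serial_sync_commits_per_sec(km):.0f} serialized commits/s")
```

The point is not the exact numbers but the shape: geographic separation for DR and synchronous replication pull in opposite directions.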
• Your Network Team Assures You That They Can (or Have) Constructed a Network Link Between Data Centers
– For Argument's Sake, We'll Agree: It Is Possible to Construct a Network So That Latency Is NOT an Issue Under Normal Conditions
– Even So, WANs Are Less Reliable Than LANs.
o And Much Harder to Fix!
• Question
– Do You Want to Have to Explain to Your CIO Why a Problem in One Data Center Impacted the Other and Resulted in an Outage Because You Didn't Have Cells Aligned to Data Center Boundaries?
[Diagram: two independent cells, one per data center; each cell has an IP sprayer, IHS, and two nodes with node agents, plus Messaging, AppTarget, and Support clusters (members 1 and 2, active/passive). Cells are kept in sync by file copy for install & config; each data center has its own NFS filesystem holding the WAS transaction logs and its own SAN storage, with SAN replication for application data.]
[Diagram: primary data center runs the full cell (IP sprayer, IHS, four nodes with node agents, Messaging/AppTarget/Support clusters with members 1-4, user registry, NFS filesystem with WAS transaction logs, SAN storage); the secondary data center holds a standby DMgr, web server, IHS, and IP sprayer, with DB-managed replication for application data between primary and secondary.]
IBM BPM: Licensing Guidance for HA/DR Configurations
• "Classic" Disaster Recovery (DR): SAN-based replication (DB2 off, WAS ND off, BPM off)
– Files in the backup data center are synchronized automatically by a SAN, but no DB2, WAS ND, or BPM program is active.
– No extra DB2, WAS ND, or BPM licenses needed for backup nodes.
• DR configuration with OS replication for config data & DB2 replication for runtime data (DB2 ON, WAS ND off, BPM off)
– Active DB2 HADR Standby setup is considered warm standby; licenses for 100 DB2 PVUs are required to cover warm standby servers.
– BPM and WAS ND are inactive; no extra WAS ND or BPM licenses needed for backup nodes.
• DR configuration with WAS ND replication for config data & DB2 replication for runtime data (DB2 ON, WAS ND ON, BPM off)
– Active DB2 HADR Standby setup is considered warm standby; licenses for 100 DB2 PVUs are required to cover warm standby servers.
– A WebSphere process in the backup nodes is used for synchronization; WAS ND licenses are required.
– BPM is inactive*; no extra BPM licenses needed for backup nodes.
• High Availability (HA) (DB2 ON, WAS ND ON, BPM ON)
– IBM BPM is active in all nodes; full BPM licenses are required.
– WAS ND and DB2 licensing is based on the Supporting Programs terms of BPM.
Note: For any HA or DR configuration, any node(s) running DMgr only will also require a WAS ND license.
(*Inactive: BPM server JVMs in the remote datacenter are not started)
REFERENCES
• IBM Program License Agreement licensing for Backup Use: This document explains that extra licenses are not required for cold or warm backup nodes.
However, if there is a program actively “doing work” to keep the backup node synchronized with the primary site, then that program must be licensed. E.g.,
when the DB2 and WAS ND node agents are actively “doing work” (replication), they must be licensed, but if IBM BPM services are not active, no BPM
licenses are required in backup nodes.
• DeveloperWorks article on “Stray Node” DR Configuration: This article describes a “better” Stray Node DR configuration. It is different from a Classic DR
configuration in that it keeps the WAS ND environment up-to-date as well as the DB2 environment, in order to reduce the server recovery time after a
disaster. In this “better” Stray Node DR configuration, WAS ND node agents are active, but IBM BPM is not active.
• A note about logical corruption – data integrity problems that get replicated to the DR environment
– We recommend using storage system tooling (for example, FlashCopy) to periodically copy the system state
– This can be done at the replica, to avoid interfering with normal operations
– If the Primary data and its replica are both corrupted, then state can be restored to a copy made before the corruption
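The recommendation above can be sketched as a simple keep-the-last-N retention policy over point-in-time copies. The timestamps and policy here are hypothetical; a real deployment would drive storage tooling such as FlashCopy rather than this illustrative Python.

```python
# Sketch: retaining periodic point-in-time copies so that a pre-corruption
# state can still be restored. Timestamps and the keep-the-last-N policy
# are illustrative assumptions, not part of the original material.

from datetime import datetime, timedelta

def prune_snapshots(snapshots: list, keep: int) -> list:
    """Return the most recent `keep` snapshot timestamps, oldest first."""
    return sorted(snapshots)[-keep:]

now = datetime(2024, 1, 10, 12, 0)
taken = [now - timedelta(hours=6 * i) for i in range(10)]  # one copy every 6 hours
kept = prune_snapshots(taken, keep=4)
print(f"retained {len(kept)} copies; oldest restore point: {kept[0]}")
```

The oldest retained copy bounds how far back you can roll to escape a logical corruption, so the retention window must exceed your expected corruption-detection time.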
• For DR purposes, why can’t I just make one WebSphere ND cell with members running in both of my datacenters? That way, if one
datacenter is lost, another can carry the load
– See ‘Active/Active Antipattern’ discussion on the following slides
[Diagram (Anti-Pattern*): a single WebSphere ND cell spanning both data centers; each site has its own web server, IHS, IP sprayer, and DMgr, with four nodes and node agents across the sites, a Messaging cluster whose members 1-4 span data centers (one active, three passive), user registries and databases in both sites, NFS filesystems holding WAS logs, and SAN replication for application data.]
* An anti-pattern (or antipattern) is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.
Why is this type of topology considered an Anti-Pattern?
• Active/Active approaches introduce new complexities that undermine the stability of the system
– Issues/Problems Can Propagate From One DC to the Other
– This Compromises Redundancy and Resiliency
– Worst Case, an Outage Cascades Across Both Data Centers
• Frequently these negate the advantages that led the customer to consider the approach in the first place
– Increased risk of network instability can lead to partitioned network (‘split brain’)
• Independent Transaction “Recovery” in Both Data Centers By HA Manager
– The two data centers could move to inconsistent transactional states!!
– Increased network latency can limit system performance during normal operations
• Latency between the Application Server and its databases
• Latency among cluster members communicating via the WebSphere HA Manager component
– Desire to automate failover increases risk of false failover & rapid cycling
– A system more than 50% utilized introduces the risk that losing a single component will compromise the
entire system, turning what could have been a (simple) HA event into a true disaster
• In practice, many Active/Active topologies do not deliver Disaster Recovery capability at all:
– Attempts to limit latency lead to datacenters physically near each other, increasing the risk that a single
disaster will eliminate the entire system
– Many disasters arise from human error and data corruption. Tight coupling between DR resources does not
provide protection from this type of failure at all
– A WAS-ND Cell Spanning Data Centers will actually interfere with Zero RTO
• Refer to
– http://www.ibm.com/developerworks/websphere/techjournal/0606_col_alcott/0606_col_alcott.html#sec1d
– http://www.ibm.com/developerworks/websphere/techjournal/1004_webcon/1004_webcon.html
• No
– The Same Fundamentals for Effective Redundancy and the Requirements for Isolation and Independence Apply
– Though Liberty May Make It Easier to Ignore Them, or to Believe That the Fundamentals Don't Apply
Agenda
• Concepts
• Disaster Recovery
– Multiple Cells and Data Centers
– WebSphere Application Server Recovery
– IBM BPM Recovery
• Final Thoughts
• Most RTO and RPO goals will deeply impact application and infrastructure architecture and can't be addressed "after the fact"
– e.g., if data is shared across data centers, your database and application design will have to be careful to avoid conflicting database updates and/or tolerate them
– e.g., application upgrades have to account for multiple versions of the application running at once, which can affect user interface design, database layout, etc.
• Extreme RTO and RPO goals tend to conflict
– e.g., using synchronous disk replication of data gives you a zero RPO, but that means the second system can't be operational, which raises RTO
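The RPO side of that trade-off can be sketched as follows; the 30-second asynchronous lag is an assumed illustrative value, not a figure from this session.

```python
# Sketch: the RPO consequence of the replication-mode choice.
# The asynchronous lag value is an illustrative assumption.

def worst_case_rpo_seconds(replication: str, async_lag_s: float = 30.0) -> float:
    """Synchronous replication loses no committed data (RPO 0), at the cost of
    a passive second site and commit latency; asynchronous replication can
    lose up to its replication lag's worth of committed work."""
    return 0.0 if replication == "synchronous" else async_lag_s

print("sync  worst-case RPO:", worst_case_rpo_seconds("synchronous"), "s")
print("async worst-case RPO:", worst_case_rpo_seconds("asynchronous"), "s (the replication lag)")
```

Stating the trade-off this explicitly helps force the business to pick a number for acceptable data loss rather than assuming both zero RPO and zero RTO.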
Example and (Very High Level) DR Plan
• Think!
• All programs running in backup mode must be under the customer's control, even if running at another enterprise's location.
• COLD - a copy of the program may be stored for backup purposes on a machine as long as the program has not been started.
– There is no charge for this copy.
• WARM - a copy of the program may reside for backup purposes on a machine and is started, but is "idling", and is not doing any work of any kind.
– There is no charge for this copy.
• HOT - a copy of the program may reside for backup purposes on a machine, is started and is doing work. However, this program must be ordered.
– There is a charge for this copy.
• "Doing Work" includes, for example, production, development, program maintenance, and testing. It could also include other activities such as mirroring of transactions, updating of files, synchronization of programs, data or other resources (e.g., active linking with another machine, program, database or other resource, etc.), or any activity or configurability that would allow an active hot-switch or other synchronized switch-over between programs, databases, or other resources to occur.