Disaster Recovery for IBM Systems

The document discusses planning for disasters with IBM WebSphere Application Server and IBM Business Process Manager. It covers using multiple data centers for redundancy, geographic separation, supporting software components, and common deployment issues. The session aims to help attendees avoid data loss and downtime from disasters like lightning strikes. It defines key terms like redundancy, isolated, independent, high availability, continuous operations, and disaster recovery. The agenda covers concepts, disaster recovery approaches, and final thoughts.

Uploaded by

Dmitry

IBM Cloud Technical Academy

© 2015 IBM Corporation


Planning for Catastrophe with
IBM WebSphere Application Server &
IBM Business Process Manager
Tom Alcott
STSM


This Session

• This session will focus on the architectural and operational issues that need to be considered when planning
and implementing a Disaster Recovery plan with WebSphere Application Server and IBM BPM. Topics will
include use of multiple data centers, geographic separation constraints, supporting software components,
disaster recovery, and other common deployment issues. Though focused primarily on WebSphere Application
Server and IBM BPM, this session also applies to IBM Middleware that is deployed on WebSphere Application
Server, as well as Pure Application System.

• While not a prerequisite, attendees should be familiar with the material covered in “Preparing to Fail, Practical
WebSphere Application Server High Availability”.



Introduction

• Why Are We Here?


• To Avoid This


….and this…
• Some people have permanently lost access to the
files on the affected disks as a result.
• A number of disks damaged following the lightning
strikes did, however, later become accessible.
• Generally, data centres require more lightning
protection than most other buildings.
• Google has said that lightning did not actually strike
the data centre itself, but the local power grid and
the BBC understands that customers, through
various backup technologies, were able to recover
all lost data.
• While four successive strikes might sound unlikely,
lightning does not need to repeatedly strike the
same place or the actual building to cause damage.
• Justin Gale, project manager for the lightning
protection service Orion, said lightning could strike
power or telecommunications cables connected to
a building at a distance and still cause disruptions.
• "The cabling alone can be struck anything up to a
kilometre away, bring [the shock] back to the data
centre and fuse everything that's in it," he said.
• Unlucky strike
• The Google Compute Engine (GCE) service allows
Google's clients to store data and run virtual
computers in the cloud. It's not known which clients
were affected, or what type of data was lost.
• In an online statement, Google said that data on
just 0.000001% of disk space was permanently
affected.
• "Although automatic auxiliary systems restored
power quickly, and the storage systems are
designed with battery backup, some recently
written data was located on storage systems which
were more susceptible to power failure from
extended or repeated battery drain," it said.
• The company added it would continue to upgrade
hardware and improve its response procedures to
make future losses less likely.
• A spokesman for data centre consultants
Future-Tech commented that while data centres were
designed to withstand lightning strikes via a
network of conductive lightning rods, it was not
impossible for strikes to get through.
• "Everything in the data centre is connected one
way or another," said James Wilman, engineering
sales director. "If you get four large strikes it
wouldn't surprise me that it has affected the
facility."
• Although the chances of data being wiped by
lightning strikes are incredibly low, users do have
the option of being able to back things up locally as
a safety measure.


Agenda

• Concepts
• Disaster Recovery
– Multiple Cells and Data Centers
– WebSphere Application Server Recovery
– IBM BPM Recovery
• Final Thoughts


Definitions

• Redundancy
– The provision of additional or duplicate systems, equipment, etc., that
function in case an operating part or system fails, as in a spacecraft.

• Isolated
– Separated from other persons or things; alone; solitary

• Independent
– Not dependent; not depending or contingent upon something else for
existence, operation, etc.

• All of the Above are Fundamental for Effective High Availability and
Disaster Recovery


Definitions

High Availability (HA)


•Ensuring that the system can continue to process work within one
location after routine single component failures

•Usually we assume a single failure

•Usually the goal is very brief disruptions for only some users for
unplanned events

Continuous Operations
•Ensuring that the system is never unavailable during planned
activities

•E.g., if the application is upgraded to a new version, we do it in a way
that avoids all downtime

Definitions

Continuous Availability (CA)


•High Availability coupled with Continuous Operations

•No tolerance for planned downtime

•As little unplanned downtime as possible

•Very expensive

•Note that while achieving CA almost always requires an
aggressive DR plan, they are not the same thing



Background: High Availability in one picture

• Clustered IP Sprayer and Firewalls (not depicted)
• Clustered HTTP Servers
• WAS-ND Cell with Clustered Application Servers
• User Registry (LDAP) Hardware Clustered with Shared Disk
• Database Hardware Clustered With Shared Disk
• JMS Provider (Not Depicted)
– WAS Messaging Engine with Shared Disks/DB
– External JMS with Hardware Cluster and Shared Disk
• Transaction Logs on Shared File System
• Clusters of “2” Provide High Availability
• Don’t Forget “Rule of 3”
• When Using Clusters of 2
– An Outage (Planned or Unplanned) Reduces Capacity by 50%
– Is No Longer Fault Tolerant

[Diagram: IP Sprayer; Web Servers (IHS); DMgr; Node1 and Node2 with Node Agents; Clusters “A”, “B”, and “C”; WAS Txn Logs; User Registry; Database; Shared Filesystem; Storage (SAN)]

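The capacity point about clusters of 2 can be checked with a little arithmetic. The sketch below is illustrative only: it assumes identical cluster members, and it reads the “Rule of 3” as sizing clusters with at least three members so that some redundancy remains after one is lost (an interpretation, not something the slide spells out). The function name is hypothetical.

```python
def after_one_outage(members):
    """Describe a cluster of identical members after losing one of them."""
    remaining = members - 1
    return {
        "capacity_fraction": remaining / members,  # share of original capacity left
        "fault_tolerant": remaining >= 2,          # can another member still fail?
    }

for size in (2, 3):
    print(size, after_one_outage(size))
```

With two members, one outage leaves half the capacity and no redundancy; with three, two thirds of the capacity remains and a further single failure can still be absorbed.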
Definitions

Disaster Recovery (DR)

•Ensuring that the system can be reconstituted and/or activated at another location and can process
work after an unexpected catastrophic failure at one location

•Often multiple single failures (which are normally handled by high availability techniques) are
considered catastrophic

•There may or may not be significant downtime as part of a disaster recovery

•This environment may be substantially smaller than the entire production environment, as only a
subset of production applications demand DR

•Normally based on justifiable business need.

•Recovery Time Objective (RTO)
– Service Recovery with little to no interruption

•Recovery Point Objective (RPO)
– Data Recovery and acceptable data loss

Definitions

• Service Levels (SLAs) cover many things; our focus is availability aspects
• You need a clear set of requirements that define precisely the availability
requirements of the system, taking into account
– Components of the system
• A system has many pieces and business aspects; how do their requirements differ?
– Responsiveness and throughput requirements
• 100% of requests aren't going to work perfectly 100% of the time
– Degraded services requirements
• Does everything have to meet the responsiveness requirements ALL the time?
– Dependent system requirements
• What are the implications if a system on which you depend is down?
– Data Loss
• Application Data
• Application State (Is this Critical in a Disaster?)
– Maintenance
• Change occurs; how does that affect availability?
– Disaster Recovery
• The unimaginable happens; then what?

HA Service Level Example

SLA                                   External SLA Commitment    Internal Target
Service Timeframe                     7 x 24                     7 x 24
Application Processing Availability   99.5% per month            99.7% per month
Recovery Time Objective               4 Hours                    1 Hour
Maintenance Window                    Tue-Thurs 3:00 - 6:00 am   Tue-Thurs 3:00 - 6:00 am
Downtime/Month                        99.5% = 3.60 Hours         99.7% = 2.16 Hours
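The downtime figures in this example follow directly from the availability percentages. A minimal sketch of that arithmetic, assuming a 30-day (720-hour) month, which is the period length that reproduces the slide's numbers; the function name is hypothetical:

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month

def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_MONTH):
    """Return the downtime budget implied by an availability target."""
    return period_hours * (100.0 - availability_percent) / 100.0

for target in (99.5, 99.7):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours/month")
```

Note how small the budget is: the 4-hour external Recovery Time Objective is already longer than the 3.60 hours of downtime that 99.5% allows in a month.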


DR Service Level Example

SLA                        External SLA Commitment   Internal Target
Recovery Time Objective    16 Hours                  4 Hours
Recovery Point Objective   ~ 0 (No Data Loss)        ~ 0 (No Data Loss)

Note: This is Recovery of an Entire Data Center with
100’s of Servers, Application, Database, Messaging, etc.



Agenda

• Concepts
• Disaster Recovery
– Multiple Cells and Data Centers
– WebSphere Application Server Recovery
– IBM BPM Recovery
• Final Thoughts


Stage 0 DR – a sound HA strategy

• HA is Cheaper and Less Complex Than DR
• A Robust HA Solution prevents small failures from becoming disasters
• Don’t let a (relatively) minor failure become a catastrophe
• Eliminate all single points of failure in your primary datacenter
– Spread Workloads Across Multiple Servers (and Hypervisors!)
– Add 2nd Production WAS-ND Cell
– Consider DB Replication, DB2 HADR, Oracle RAC in conjunction with hardware clustering
– LDAP/Registry Replication
• Otherwise, an HA event could force you to enact your DR procedure
– e.g., When the Database’s Only HA Replica Is in a Different Data Center
• Recently a Common HA Pattern is Two “Virtual” Data Centers within a Single Physical DC
– Two Completely Separate Sets of Infrastructure in One Physical DC

[Diagram: IP Sprayer; Web Servers (IHS); DMgr; Node1 and Node2 with Node Agents; Clusters “A”, “B”, and “C”; WAS Txn Logs; User Registry; Database; Shared Filesystem; Storage (SAN)]

Multiple Data Center Options (1/4)

• Classic DR
• Active/Passive
– Two Data Centers, one Serving Requests the other Idling
– Independent Cells
– Easier Than Active/Active
– User and Application State Synchronization are Less Critical
– Asynchronous Replication Is Likely Sufficient
• Application Data, WebSphere Transaction Logs in Single Consistency Group
– Lower Cost for Network and Hardware Capacity
– From a Capacity perspective One Data Center is Being Underutilized.
• Typically Does Not Incur S/W License Charges When Idle
– If You Don’t Pay for S/W Licenses Is Cost and Underutilization Still a Concern?
– WebSphere License Provides for
o Hot – Processing Requests License Required
o Warm – Started But Not Processing Requests, License Not Required
o Cold – Installed, But Not Started, License Not Required
– DB2 and MQ Require < 100 % of Hot Licenses for Replication


Multiple Data Center Options (2/4)

•“Active/Active” with Single Set of Active Databases


•Two Data Centers
– Independent Application Cells
– Serving Requests for Same Applications
– Database(s) Only Active in One Data Center
• Additional Latency for Application Data Requests from Remote Data Center
– Latency Will Also Limit DC Separation
• Request Processing Interruption (in both Data Centers)
– When a DB Outage Occurs

– While Replica DB is Promoted to Primary


• Data Consistency problematic
– WAS Transaction Logs from both DCs and Application Data from Active DB(s) must be in
the same consistency group


Multiple Data Center Options (3/4)

• Classic “Active/Active”
• Two Data Centers
– Independent Cells and Synchronized Resource Managers (DBs)
– Serving Requests for Same Applications
– Requires Shared Application Data
o Application Data Consistency is prerequisite to any other planning
o Simultaneous Reads/Writes = Geographic Synchronous Disk Replication
– Additional Hardware and Disk Capacity Required
– e.g. IBM High Availability Geographic Cluster (HAGEO), Sun Cluster Geographic Edition

– Expectation of Continuous Availability and Transparent Failover


o Requires Sharing Application State
– Expectation Seldom Realized
– Outage of One Data Center, Stops Disk Writes in Both, No Longer “Transparent”

• Synchronous Disk Replication Limits Geographic Separation


• Hardest and Costliest to Achieve
Note: Disk Replication is only employed for Application Data and Application State. WAS-ND cell configuration, software updates, and application
maintenance should be maintained independently in order to ensure isolation (and availability)

Multiple Data Center Options (4/4)

• Hybrid “Active/Active” (Partitioned by Applications)


• Two Data Centers
– Independent Cells with replicated Resource Managers
• Both DC’s Serving Requests, Both DC’s Configured for All Applications
– Running Different Applications (With Different Application Data)
• New Application Tests
• One DC Performing Updates, One DC Performing Inquiry Only (e.g. data warehouse)
– No Shared Application State, No Shared Application Data
• Asynchronous Replication Sufficient
– Global Network Switch Used to Partition/Distribute Traffic
• In the Event of a Disaster
– Users failover from one DC to the other
– Likely Some Interruption
• As Data Replica is Promoted to Primary
• During Failover Workload Startup

• Provides Most of the Benefits of “Classic Active/Active” without the Cost and Complexity


The CAP Theorem

• In a distributed environment, especially spanning data centers across LANs
and WANs, there are three core requirements for a service:
– Consistency
• Either the service works or fails
• Traditional ACID of databases provides consistency and isolation
– Availability
• Extremely important in web business model
• In a large distributed system, one may have to compromise on consistency for the
sake of availability
– Partition Tolerance
• Network partition will happen when not all machines are connected
• “No set of failures less than the total network failure is allowed to cause the system
to respond incorrectly” – Seth Gilbert and Nancy Lynch
• Quorum is used to guard against split brain syndrome

• Brewer’s CAP conjecture states that
– One can achieve only two, not all three, of the above-mentioned requirements

http://en.wikipedia.org/wiki/CAP_theorem

Multiple Active Data Centers and the CAP Theorem

• Active/Active requires you to sacrifice either consistency, availability or partition tolerance.
– All three aren’t possible
• If you choose full availability, then you are going to lose guaranteed consistency.
– So you need to design with this in mind, and build in mechanisms (typically involving queuing technologies)
that enable your system to “tend towards” consistency.
• Your data is going to be in two places, either partitioned or replicated.
– If the former, what happens when one site is down?
– If the latter, what happens when users hitting each site see slightly different versions of the current state?
– These are very complex problems.
– Which is why I try to steer customers away from active/active and into an active/passive model with DR
from active to passive.
– But they always feel like they are wasting hardware...!



Data Center Utilization Urban Legends

• Legend
– Active/Active Improves Utilization

• Reality
– An Active/Active Topology at 40-50% Utilization in Each DC Is Equivalent to An Active/Passive Datacenter
Deployment with One Active at 80% to 90 % Utilization and the Other Passive

– Running Active/Active at Greater Than 50% Of Total (both Datacenters) Capacity Can Often Result in a
Complete Loss of Service When a Data Center Outage Occurs
o Insufficient Capacity in Remaining Data Center to Handle > 100% Capacity Results in
– Poor Response Time (at best)
– Network and Server Overload, Resulting in a Complete Crash
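The utilization argument is simple addition, sketched below. This assumes two equally sized data centers and expresses utilization as a fraction of one data center's capacity; both the equal-size assumption and the function names are illustrative, not from the slide.

```python
def survivor_utilization(util_dc1, util_dc2):
    """Load on the surviving data center after it absorbs the failed one's work.

    Assumes both data centers have the same capacity, so the two
    utilization fractions can simply be added.
    """
    return util_dc1 + util_dc2

def failover_outcome(util_dc1, util_dc2):
    """Classify whether the survivor can carry the combined load."""
    load = survivor_utilization(util_dc1, util_dc2)
    return "survivable" if load <= 1.0 else "overloaded"

for utils in ((0.45, 0.45), (0.60, 0.60)):
    print(utils, round(survivor_utilization(*utils), 2), failover_outcome(*utils))
```

At 45% in each data center the survivor runs at 90%, which is the same position as an Active/Passive pair with one site at 90%. At 60% each, the survivor would need 120% of its capacity.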


Active/Active - What’s Wrong With This Picture ?

A former employer of mine had two data centers, running active/active
at two facilities approximately 2.6 miles (or 4.2 km) apart.
• Close Proximity Addressed Data Consistency Concerns... But...



What Happens When ?

• There’s an Earthquake

• There’s a Civil Insurrection

• A Hazardous Chemical Spill Occurs
– And The Wind Is Blowing the Chemical Cloud from West to East (or vice versa)

• Your DC May Not Be Located in a Locale Prone to Earthquakes
– But what about the other catastrophes?
– They can, *and* will, happen!

• There’s No Substitute For Isolation Between Data Centers

• Data Centers Should Be Sufficiently Distant So That a Single Event Doesn’t
Impact Both!
– This Likely Mandates Asynchronous Replication
– Active/Active No Longer Practical

Network Latency and Application Data Consistency – A 3rd Party Perspective

• “Far Sync instance behaves just like any other Data Guard/Active Data Guard archive
destination configured for SYNC transport. This means that the laws of physics
regarding the speed of light that affect round-trip network latency have not been
altered in any way; primary database performance will still be impacted by the network
round-trip time between the primary database and the Far Sync instance.”
http://www.oracle.com/technetwork/database/availability/farsync-2267608.pdf - page 6

• “High availability solutions must ultimately address business requirements. For
example, a business may use a zero-data-loss solution that synchronously mirrors
every transaction on the primary database to a remote database. However,
considering the speed-of-light limitations and the physical limitations associated with a
network, there are round-trip delays in the network transmission. These delays
increase with distance and vary based on network bandwidth, traffic congestion, router
latencies, and so on. Thus, this synchronous mirroring, if performed over large wide
area network (WAN) distances, will affect the primary site performance.”
http://www.oracle.com/technetwork/database/availability/ha-overview-112-452462.pdf - page 2-5
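The speed-of-light limit mentioned in both quotes can be put into rough numbers. The sketch below computes only the physical lower bound on round-trip time; it assumes light in optical fiber travels at about two thirds of its vacuum speed and that the cable runs in a straight line, and it ignores router, congestion, and protocol delays, so real latency will be higher. The function name and the sample distances are illustrative.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458  # speed of light in a vacuum
FIBER_FACTOR = 2 / 3               # assumption: propagation speed in fiber relative to vacuum

def min_round_trip_ms(distance_km):
    """Lower bound on network round-trip time over fiber, in milliseconds."""
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_seconds * 1000

for km in (4.2, 100, 1000):
    print(f"{km} km -> at least {min_round_trip_ms(km):.3f} ms per round trip")
```

Every synchronously replicated write pays at least this delay before it can commit, which is why synchronous replication limits how far apart the data centers can be.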


Multiple Cells and Data Centers

•Your Network Team Assures You That They Can Construct (or Have Constructed) a Network Link
Between Data Centers
– For Argument’s Sake, We’ll Agree It Is Possible to Construct a Network So That Latency
Is NOT an Issue Under Normal Conditions
– Even So, WANs Are Less Reliable Than LANs
o And Much Harder To Fix!

•But You’re Missing The Point!
– Network Interdependency Between Data Centers Means That the Data Centers Are Not
Independent

•Question
– Do You Want to Have to Explain to Your CIO Why a Problem in One Data Center
Impacted the Other and Resulted in an Outage Because You Didn’t Have Cells
Aligned to Data Center Boundaries?



How Do I Recover WebSphere Application Server ?

• File System or OS Backup and Recovery
– Disk or Tape
– WAS backupConfig/restoreConfig
• WAS_PROFILE/properties, ../etc, WAS_ROOT/java/jre/lib/*properties, WAS_ROOT/java/jre/lib/security
– Build from Scratch
o Only a Realistic Option with Complete Set of Scripts and Rigorous Change Control
• Best Options
– File System/OS Backup & Recovery
– backupConfig/restoreConfig for Deployment Manager
• WAS V8.0 and above, addNode -asExistingNode can Reconfigure Each Node After restoreConfig of Deployment Manager
Configuration
– Both From Last Known Working Production Configuration
– Otherwise No Assurance Recovery Will Succeed
– Same Concern with “Build From Scratch”
– If Using Virtualization Consider VM Cloning for Install and Configuration
– Consider Smart Cloud Orchestrator for Automated Install and Configuration
o Provision Both Primary Site and DR Site in a Consistent Manner with SCO
o Note: Don’t Deploy to Backup to DR Site over a WAN!
• May Need to Change Cell and Host Names
– Will the Original Data Center Be Restored, Or Is it Gone (for Good)?
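The backupConfig / restoreConfig / addNode -asExistingNode sequence lends itself to scripting. The sketch below only builds the command lines and does not run them. The profile paths, host name, port, and backup file name are hypothetical placeholders, and the exact options should be checked against the documentation for the WAS version in use.

```python
from pathlib import PurePosixPath

def backup_config_cmd(profile_root, backup_file, stop_servers=False):
    """Command to archive a profile's configuration with backupConfig."""
    cmd = [str(PurePosixPath(profile_root) / "bin" / "backupConfig.sh"), backup_file]
    if not stop_servers:
        cmd.append("-nostop")  # leave the servers running while the archive is taken
    return cmd

def restore_config_cmd(profile_root, backup_file):
    """Command to restore a previously archived configuration."""
    return [str(PurePosixPath(profile_root) / "bin" / "restoreConfig.sh"), backup_file]

def add_existing_node_cmd(profile_root, dmgr_host, dmgr_soap_port):
    """Command to re-attach a node already defined in the restored cell (WAS V8.0+)."""
    add_node = str(PurePosixPath(profile_root) / "bin" / "addNode.sh")
    return [add_node, dmgr_host, str(dmgr_soap_port), "-asExistingNode"]

# Hypothetical locations and names, for illustration only.
DMGR_PROFILE = "/opt/IBM/WebSphere/AppServer/profiles/Dmgr01"
NODE_PROFILE = "/opt/IBM/WebSphere/AppServer/profiles/AppSrv01"

for step in (
    backup_config_cmd(DMGR_PROFILE, "/backup/dmgr_config.zip"),
    restore_config_cmd(DMGR_PROFILE, "/backup/dmgr_config.zip"),
    add_existing_node_cmd(NODE_PROFILE, "dmgr.dr.example.com", 8879),
):
    print(" ".join(step))
```

Keeping the steps in a script rather than in an operator's memory supports the point that recovery should start from the last known working production configuration.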


WAS Full Profile DR Recovery
• Transaction Recovery on Separate (Physical) Server
– Access the Transaction Logs
• Move/Mount the Transaction Logs to Physical Server Hosting Application Server with Access to Same
Resources (e.g. JDBC, JMS)
• V8.x Optional Use of DB for Transaction Logs
– If Recovery Occurs in a Different Cell, Use wsadmin to Configure the Same JAAS Alias for Accessing
XA Resources
• With the Admin Console, the Node Name Gets Prefixed to the Alias
– No Longer Required to Have Same Hostname and IP Address
• Different IPs with Multiple Host Aliases Typical
http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/tjta_mvelog.html

• WAS Messaging Engine
– recoverMEConfig AdminTask Command Retrieves MEUUID From Persistent Message Store and
Updates Messaging Engine Configuration
– Allows Recovery of Stranded Messages After Catastrophic ME Failure in WAS V8.5.0 and above
http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/rjk_recoverme_config.html

• Some Small Capacity in DR Site Needs to Be Set Aside for Recovery
– In Addition to Production Workload
– Or Production Workload Not Processed Until Recovery Is Complete

WAS Liberty Profile DR Recovery


We Need Your Help
• Automated JMS Failover for WebSphere Liberty Profile
– http://www.ibm.com/developerworks/rfe/execute?use_case=viewChangeRequest&CR_ID=71179

• Transaction peer recovery for WAS Liberty
– https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=77832
– Comment in the RFE if you need this outside Bluemix (e.g. on premises)

• EJB IIOP WLM for Liberty Profile
– http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=78768

• Your Vote Counts
• Customer Interest (votes!) Helps with Prioritization



Classic DR for Stateful Applications: full cell replication

[Diagram: Primary Datacenter and Secondary Datacenter, each with its own cell: User Registry, IP Sprayer, Web Servers (IHS), DMgr, Node1 and Node2 with Node Agents, Messaging cluster (Msg.mem1, Msg.mem2; marked A and P), AppTarget cluster (App.mem1, App.mem2), Support cluster (Sup.mem1, Sup.mem2), Database, Filesystem (NFS) holding WAS Txn Logs, and Storage (SAN). File Copy for Install & Config Data between the datacenters; SAN Replication for Application Data between the Consistency Groups.]

DR via Stray Nodes & Database Managed replication

[Diagram: Primary Datacenter and Secondary Datacenter sharing one cell: DMgr; Node1 to Node4 with Node Agents; Messaging cluster (Msg.mem1 to Msg.mem4; marked A P A P), AppTarget cluster (App.mem1 to App.mem4), and Support cluster (Sup.mem1 to Sup.mem4). Each datacenter has an IP Sprayer, Web Servers (IHS), User Registry, Database, and WAS Txn Logs. DB-managed Replication for Application Data between the two Databases.]

IBM BPM: Licensing Guidance for HA/DR Configurations

Configuration: “Classic” Disaster Recovery (DR): SAN-based replication
What’s active? DB2: off, WAS ND: off, BPM: off
What licenses are needed for the backup nodes?
• Files in the backup data center are being synchronized automatically by a SAN. But there is no DB2, WAS ND, or BPM program active.
• No extra DB2, WAS ND, or BPM licenses needed for backup nodes.

Configuration: DR configuration with OS replication for config data & DB2 replication for runtime data
What’s active? DB2: ON, WAS ND: off, BPM: off
What licenses are needed for the backup nodes?
• Active DB2 HADR Standby setup considered warm standby – licenses for 100 DB2 PVUs required to cover warm standby servers
• BPM and WAS ND are inactive – no extra WAS ND or BPM licenses needed for backup nodes.

Configuration: DR configuration with WAS ND replication for config data & DB2 replication for runtime data
What’s active? DB2: ON, WAS ND: ON, BPM: off
What licenses are needed for the backup nodes?
• Active DB2 HADR Standby setup considered warm standby – licenses for 100 DB2 PVUs required to cover warm standby servers
• WebSphere process in the backup nodes used for synchronization – WAS ND licenses are required.
• BPM is inactive* – no extra BPM licenses needed for backup nodes.

Configuration: High Availability (HA)
What’s active? DB2: ON, WAS ND: ON, BPM: ON
What licenses are needed for the backup nodes?
• IBM BPM is active in all nodes – full BPM licenses are required.
• WAS ND and DB2 licensing based on Supporting Programs terms of BPM

Note: For any HA or DR configuration, any node(s) running DMGR only will also require a WAS ND license.
(*Inactive: BPM server JVMs in the remote datacenter are not started)

REFERENCES
• IBM Program License Agreement licensing for Backup Use: This document explains that extra licenses are not required for cold or warm backup nodes.
However, if there is a program actively “doing work” to keep the backup node synchronized with the primary site, then that program must be licensed. E.g.,
when the DB2 and WAS ND node agents are actively “doing work” (replication), they must be licensed, but if IBM BPM services are not active, no BPM
licenses are required in backup nodes.

• DeveloperWorks article on “Stray Node” DR Configuration: This article describes a “better” Stray Node DR configuration. It is different from a Classic DR
configuration in that it keeps the WAS ND environment up-to-date as well as the DB2 environment, in order to reduce the server recovery time after a
disaster. In this “better” Stray Node DR configuration, WAS ND node agents are active, but IBM BPM is not active.

• DeveloperWorks article on licensing DB2 10.1 servers in a HA environment

What Your Mother Didn’t Tell You
About Disaster Recovery
• Transaction Log replication & Network configuration requirements
– Historically, the WebSphere transaction service required IP Address & Hostname at the target server match the source
– This is because, for some types of transactions, the server network information is written into the logs themselves
– As of WAS v8 servers, this requirement is relaxed a bit – IP addresses no longer need to match

• High Availability and Disaster Recovery for the Deployment Manager


– Techniques are available that leverage hardware clustering or replication of the DMgr’s cell configuration to an alternate
server. These are described in detail at:
https://www.ibm.com/developerworks/websphere/techjournal/1001_webcon/1001_webcon.html
– Beginning in version 8.5.5 WebSphere supports High Availability features for the Deployment Manager, using a shared
filesystem. WAS installations (including BPM) running on applicable versions can leverage this feature:
http://pic.dhe.ibm.com/infocenter/wasinfo/v8r5/topic/com.ibm.websphere.nd.doc/ae/twve_xdsoconfig.html
– Because these HA techniques rely on DMgr replication, they apply unchanged to DR scenarios
– Generally, we recommend recovering application servers before bringing up a replacement DMgr

• A note about logical corruption – data integrity problems that get replicated to the DR environment
– We recommend using storage system tooling (for example, FlashCopy) to periodically copy the system state
– This can be done at the replica, to avoid interfering with normal operations
– If the Primary data and its replica are both corrupted, then state can be restored to a copy made before the corruption

• For DR purposes, why can’t I just make one WebSphere ND cell with members running in both of my datacenters? That way, if one
datacenter is lost, another can carry the load
– See ‘Active/Active Antipattern’ discussion on the following slides
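For the logical corruption note above, the recovery decision amounts to picking the newest periodic copy taken before the corruption began. A minimal sketch of that selection follows; the timestamps and function name are made up for illustration, and in practice the copies would come from the storage system's own tooling.

```python
from datetime import datetime

def latest_clean_copy(copy_times, corruption_began):
    """Return the newest point-in-time copy taken strictly before the corruption, or None."""
    clean = [t for t in copy_times if t < corruption_began]
    return max(clean) if clean else None

# Hypothetical copy schedule: one copy every six hours.
copies = [
    datetime(2015, 6, 1, 0, 0),
    datetime(2015, 6, 1, 6, 0),
    datetime(2015, 6, 1, 12, 0),
]
corruption_began = datetime(2015, 6, 1, 9, 30)

restore_point = latest_clean_copy(copies, corruption_began)
print("restore from:", restore_point)
print("work to redo:", corruption_began - restore_point)
```

The gap between the restore point and the moment of corruption is the data that has to be recreated, so the copy interval effectively sets the recovery point for this kind of failure.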


Active/Active anti-pattern* (Cells Spanning Data Centers)

[Diagram: a single cell spanning the Primary Datacenter and Secondary Datacenter: DMgr; Node1 to Node4 with Node Agents; Messaging cluster (Msg.mem1 to Msg.mem4; marked A P P P), AppTarget cluster (App.mem1 to App.mem4), and Support cluster (Sup.mem1 to Sup.mem4). Each datacenter has an IP Sprayer, Web Servers (IHS), User Registry, Database, Filesystem (NFS), and Storage (SAN); WAS Logs. SAN Replication for Application Data between the Consistency Groups.]

* An anti-pattern (or antipattern) is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive.

Why is this type of topology considered an Anti-Pattern?
• Active/Active approaches introduce new complexities that undermine the stability of the system
– Issues/Problems Can Propagate From One DC to the Other
– This Compromises Redundancy and Resiliency
– Worst Case, an Outage Cascades Across Both Data Centers
• Frequently these negate the advantages that led the customer to consider the approach in the first place
– Increased risk of network instability can lead to partitioned network (‘split brain’)
• Independent Transaction “Recovery” in Both Data Centers By HA Manager
– The two data centers could move to inconsistent transactional states!!
– Increased network latency can limit system performance during normal operations
• Latency between the Application Server and its databases
• Latency among cluster members communicating via the WebSphere HA Manager component
– Desire to automate failover increases risk of false failover & rapid cycling
– A system more than 50% utilized introduces the risk that losing a single component will compromise the
entire system, turning what could have been a (simple) HA event into a true disaster
• In practice, many Active/Active topologies do not deliver Disaster Recovery capability at all:
– Attempts to limit latency lead to datacenters physically near each other, increasing the risk that a single
disaster will eliminate the entire system
– Many disasters arise from human error and data corruption. Tight coupling between DR resources does not
provide protection from this type of failure at all
– A WAS-ND Cell Spanning Data Centers will actually interfere with Zero RTO
• Refer to
– http://www.ibm.com/developerworks/websphere/techjournal/0606_col_alcott/0606_col_alcott.html#sec1d
– http://www.ibm.com/developerworks/websphere/techjournal/1004_webcon/1004_webcon.html



What are the recommended alternatives?
• Properly plan a High Availability solution distinct from Disaster Recovery
• Eliminate single points of failure through redundancy in network and software components
– HA features allow rapid and automatic recovery from loss of a single component. Utilize them!
• Improve RTO by reducing complexity, scripting operational procedures, and running drills
– Automate Processes for Repeatability and Consistency
• Scripting
• “Point and Click” is Not Repeatable
– Discipline and Practice are Essential
– Well Defined Procedures for Every Contingency
• You Do Not Want to Learn During an Outage
• Practice Those Procedures
• So You Won’t Make Mistakes in a Crisis
• Validates that Procedures Actually Work

• Practice Backup and Recovery, System Failures, Disaster Recovery, etc.


– Goal: Make Daily Operations Boring
• Improve electricity distribution via Uninterruptible Power Supply
• Utilize application design patterns like loose coupling in order to improve application flexibility
• In cases where an RTO between 1 and 4 hours is necessary, without the requirement to process new work,
consider the Stray Node pattern

Is This Different for the Liberty Profile and Liberty Collectives?

• No
– Same Fundamentals for Effective Redundancy and the Requirements for Isolation and
Independence Apply

– Though Liberty May Make It Easier to Ignore the Fundamentals or to Believe That They Don’t Apply

– Controllers and Collectives Should Be Aligned Along Data Center Boundaries
• Collective Controllers Employ Quorum
• Quorum Requires Odd Number and Min Of 3
• With Two Data Centers, It Is Mathematically Impossible to Configure an Odd Number per Data Center and Also
an Odd Number Across Data Centers, as Required to Maintain Quorum (Odd + Odd = Even)
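
The quorum arithmetic can be checked directly. A minimal sketch in Python, assuming a simple strict-majority quorum rule (the helper names are illustrative and are not Liberty APIs):

```python
from itertools import product


def has_quorum(alive: int, total: int) -> bool:
    """A strict majority of all configured controllers must be reachable."""
    return alive > total // 2


def is_odd(n: int) -> bool:
    return n % 2 == 1


# Look for odd controller counts in each of two data centers that also
# give an odd total. None exist, because odd + odd is always even.
candidates = [
    (a, b)
    for a, b in product(range(1, 10), repeat=2)
    if is_odd(a) and is_odd(b) and is_odd(a + b)
]
print(candidates)  # []

# A 3-controller collective split 2 + 1 across two data centers:
print(has_quorum(alive=1, total=3))  # False: losing the 2-controller site loses quorum
print(has_quorum(alive=2, total=3))  # True: losing the 1-controller site does not
```

Under this assumption, any split across two data centers leaves one site whose loss takes quorum with it, which is why the collective is better kept inside a single data center.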

Agenda

• Concepts
• Disaster Recovery
– Multiple Cells and Data Centers
– WebSphere Application Server Recovery
– IBM BPM Recovery
• Final Thoughts



Disaster Recovery

• Develop a Disaster Recovery Plan


• Group Business Needs and Associated Applications into Tiers
• Group into tiers based on the hard/soft dollar impact on the
organization
• Categorize by RPO and RTO.
• The top tier likely includes zero data loss and either no downtime or
perhaps just a few minutes of down time
• Subsequent tiers have an RTO of 24 hours, then 48 to 72 hours,
then perhaps 72 to 96 hours
• Essential Part of Any Plan
– Who approves DR move/recovery?
– Automated site failover is a bad idea
o Typically triggering DR is very expensive
o You do not want to trigger a DR by accident because of some transient issue – just makes the
situation worse
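
The tiering step above can be sketched as a small lookup. This is an illustration only, in Python: the tier names and the `cheapest_tier` helper are hypothetical, and the RTO values are the example figures from the slide rather than recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    rto_hours: float  # worst-case time to restore service for this tier


# Ordered cheapest (slowest recovery) first.
TIERS = [
    Tier("Tier 4", 96),
    Tier("Tier 3", 72),
    Tier("Tier 2", 24),
    Tier("Tier 1", 0),  # no downtime, or at most a few minutes
]


def cheapest_tier(required_rto_hours: float) -> Tier:
    """Return the least expensive tier that still meets the required RTO."""
    if required_rto_hours < 0:
        raise ValueError("RTO cannot be negative")
    # Tier 1 has an RTO of 0, so a match always exists for valid input.
    return next(t for t in TIERS if t.rto_hours <= required_rto_hours)


print(cheapest_tier(50).name)  # Tier 2
```

An application that must be back within 50 hours cannot go in the 72-hour tier, so it lands in the 24-hour tier. The same lookup could be repeated for RPO.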



Disaster Recovery Objectives

• Recovery Time Objective


– How quickly the system will be able to accept traffic after the disaster
– Shorter times require progressively more expensive techniques
o e.g., a tape backup and restore is relatively inexpensive
o e.g., a fully redundant fully operational data center is very expensive

• One challenge is detection time


– It takes time to determine you are in a disaster state and trigger disaster procedures
o While you are deciding if you are down, you are probably missing your SLA.
o Does the RTO include detection time?
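
Whether detection time counts against the RTO changes the answer. A minimal sketch in Python, with made-up numbers and hypothetical function names:

```python
def effective_outage_minutes(detection: float, decision: float, recovery: float) -> float:
    """Outage as users experience it: time to notice, time to decide, time to recover."""
    return detection + decision + recovery


def meets_rto(rto: float, detection: float, decision: float, recovery: float) -> bool:
    """True only if the whole outage, not just the recovery step, fits in the RTO."""
    return effective_outage_minutes(detection, decision, recovery) <= rto


# Recovery alone (45 min) fits a 60 minute RTO, but the full outage does not.
print(effective_outage_minutes(20, 15, 45))                   # 80
print(meets_rto(60, detection=20, decision=15, recovery=45))  # False
```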



Disaster Recovery Objectives

• Recovery Point Objective


– How much data you are willing to lose when there is a disaster
– Limiting data loss raises costs
o e.g., restoring from tape is relatively inexpensive but you'll lose everything since the last backup
o e.g., asynchronous replication of data and system state requires significant network bandwidth to prevent falling
far behind
o e.g., synchronous replication to the backup data center guarantees no data loss but requires a VERY fast and
reliable network and will significantly harm performance
• Warning: results in increased latency which means capacity must be increased at all layers
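
The cost of each replication mode can be estimated with rough arithmetic. A minimal sketch in Python, assuming one connection committing serially and a constant write rate (the function names and all numbers are illustrative assumptions):

```python
def commits_per_second(local_commit_ms: float, round_trip_ms: float = 0.0) -> float:
    """Upper bound for one connection when each commit waits for the remote site."""
    return 1000.0 / (local_commit_ms + round_trip_ms)


def replication_backlog_mb(write_mb_s: float, link_mb_s: float, seconds: float) -> float:
    """Data not yet at the backup site after a burst; this is what a disaster would lose."""
    return max(0.0, write_mb_s - link_mb_s) * seconds


# Synchronous: a 10 ms round trip added to a 2 ms local commit.
print(commits_per_second(2.0))                  # 500.0
print(round(commits_per_second(2.0, 10.0), 1))  # 83.3

# Asynchronous: writing 50 MB/s over a 40 MB/s link for 10 minutes.
print(replication_backlog_mb(50, 40, 600))      # 6000.0
```

In the synchronous case the lost throughput has to be bought back with more capacity. In the asynchronous case the backlog is the effective RPO.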



Disaster Recovery Objectives

• Most RTO and RPO goals will deeply impact application and infrastructure architecture and can't be done
“after the fact”
– e.g., if data is shared across data centers, your database and application design will have to be careful to avoid
conflicting database updates and/or tolerate them
– e.g., application upgrades have to account for multiple versions of the application running at once which can affect
user interface design, database layout, etc.
• Extreme RTO and RPO goals tend to conflict
– e.g., using synchronous disk replication of data gives you a zero RPO but that means the second system can't be
operational, which raises RTO

• A Zero RTO *and* a Zero RPO Are Mutually Exclusive Goals



Disaster Recovery Testing

• The DR Hardware Should Be Put Into Actual Production Usage


– Otherwise How Can You Be Sure It Will Work When You REALLY Need It?
• A Corollary of Murphy’s Law
– The larger the number of machines, the less likely it is that all Tier 1 machines can be successfully
restored to operations.
• DR Testing Options
– “Saturday Afternoon Surprise”
• Unannounced DR Test
• Only If You Can Tolerate an Outage
– Progressively More Realistic and Complex Tests
• Startup of Remote Infrastructure
• Remote Startup with Simulated Workload
• Remote Startup with Production Workload Shift

• Other Issues In a Real Disaster


– Will your key staff want to travel?
– Will they be able to travel?

Example (and Very High Level) DR Plan

• Executive/Management Approval for Activation of DR

• Isolate Data Centers


– Halt Incoming Network Traffic
• Static “Temporarily Unavailable” Web Page
– Break Disk Synchronization
– Sever Network Links Between Data Centers

• Start and Recover the Surviving Data Center


– Restore/Recover Hardware and Middleware
– Start DB, Messaging and Application Servers
– Examine DB and Message provider logs for pending transactions
– Recover Pending transactions and messages

• Start Accepting New Work in Surviving Data Center


– Enable Network
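
The plan above lends itself to a scripted runbook with a manual approval gate. A minimal sketch in Python; the step actions are placeholders, and `run_dr_plan`, `placeholder`, and `PLAN` are hypothetical names rather than part of any WebSphere tooling:

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_dr_plan(approved_by: str, steps: List[Step]) -> List[str]:
    """Run the steps in order; refuse to start without a named approver."""
    if not approved_by.strip():
        raise PermissionError("DR activation requires explicit management approval")
    completed = []
    for name, action in steps:
        action()  # an exception here stops the plan at the failing step
        completed.append(name)
    return completed


def placeholder() -> None:
    """Stand-in for a real, site-specific script (hypothetical)."""


PLAN: List[Step] = [
    ("halt incoming network traffic", placeholder),
    ("break disk synchronization", placeholder),
    ("sever network links between data centers", placeholder),
    ("restore hardware and middleware", placeholder),
    ("start DB, messaging and application servers", placeholder),
    ("recover pending transactions and messages", placeholder),
    ("enable network and accept new work", placeholder),
]

for done in run_dr_plan("DR duty executive", PLAN):
    print("done:", done)
```

The approval check keeps activation a human decision while everything after it stays repeatable, and the same script can be exercised in drills.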



Other Aspects to Consider

• An HA, CA or DR Deployment Architecture is Not a Product Feature.


– WAS and the WebSphere portfolio products
• Provide HA Features and Function
• Can Be Employed in an HA Architecture
• The Appropriate Environment Varies by Customer
• One Size Does Not Fit All !

• Optimizing WebSphere HA Capabilities into a Robust Deployment


– Requires In-depth Understanding
• Of Environment
• Of Applications
• Of Operational Requirements (Service Levels)

• Architectural Advice May Require ISSW Assistance



Learn from Your Mistakes

• Mistakes and failures will occur; learn from them


– What separates mediocre organizations from the good and great isn't so much
perfection as it is the constant striving to get better – to not repeat mistakes

• After every outage perform


– Root cause analysis
• Capture diagnostic information
• Meet as a team including all key players to discuss
• Determine precisely what went wrong
– Wrong doesn't mean “Bob made an error.”

– Find the process flaw that led to the problem


• Determine a corrective action that will prevent this from happening again
– If you can't, determine what diagnostic information is needed next time this happens and ensure it is collected
• Implement that corrective action
– All too often this last step isn't done
– Verify that the action corrected the problem

• A senior manager must own this process



Indispensable When Planning for Catastrophe

• Think !



Questions?



Backup Slides



Shameless Self Promotion

IBM WebSphere Deployment and Advanced Configuration


By Roland Barcia, Bill Hines, Tom Alcott and Keys Botzum
ISBN: 0131468626



Another Recommended Book

IBM WebSphere v5.0 System Administration


By Leigh Williamson, Lavena Chan, Roger Cundiff, Shawn Lauzon and Christopher C. Mitchell
ISBN: 0131446045



Licensing Servers as Back Up Servers

From IBM Contracts and Practices Database


• The policy is to Charge for HOT, and not for WARM or COLD back ups. The following are definitions of what constitutes HOT-WARM-COLD backups:

• All programs running in backup mode must be under the customer's control, even if running at another enterprise's location.

• COLD - a copy of the program may be stored for backup purposes on a machine as long as the program has not been started.
– There is no charge for this copy.

• WARM - a copy of the program may reside for backup purposes on a machine and is started, but is "idling", and is not doing any work of any kind.
– There is no charge for this copy.

• HOT - a copy of the program may reside for backup purposes on a machine, is started and is doing work. However, this program must be ordered.
– There is a charge for this copy.

• "Doing Work", includes, for example, production, development, program maintenance, and testing. It also could include other activities such as mirroring of transactions,
updating of files, synchronization of programs, data or other resources (e.g. active linking with another machine, program, data base or other resource, etc.) or any activity or
configurability that would allow an active hot-switch or other synchronized switch-over between programs, data bases, or other resources to occur.

Refer to http://www-03.ibm.com/software/sla/sladb.nsf/pdf/policies/$file/Feb-2003-IPLA-backup.pdf for more information
