Introduction_ High Availability for SAP HANA
SAP HANA™ — High Availability
Business Continuity requires that the operation of business critical systems remain highly
available at all times, even in the presence of failures. This paper discusses the functionality of
SAP HANA in support of High Availability and Disaster Recovery.
Updated for HANA 2.0 SPS00
• SAP HANA 2.0 SPS00: new operation mode "logreplay_readaccess" for system replication with Active/Active (read enabled) feature
V9.0, March 2017
Table of Contents
Legal Disclaimer
1 Introduction
  SAP HANA
  About this Document
2 What is High Availability?
  Recovery - Key Performance Indicators
3 Eliminating Single Points of Failure
  Hardware Redundancy
  Network Redundancy
  Data Center Redundancy
4 SAP HANA High Availability Support
  Backups
  Storage Replication
  System Replication
  Service Auto-Restart
  Host Auto-Failover
5 Design for High Availability
  Planning for Failure
Glossary
  Industry Terms
  SAP HANA Terms
Legal Disclaimer
THIS DOCUMENT IS PROVIDED FOR INFORMATION PURPOSES ONLY AND DOES NOT MODIFY THE TERMS OF ANY AGREEMENT.
THE CONTENT OF THIS DOCUMENT IS SUBJECT TO CHANGE AND NO THIRD PARTY MAY LAY LEGAL CLAIM TO THE CONTENT OF
THIS DOCUMENT. IT IS CLASSIFIED AS "CUSTOMER" AND MAY ONLY BE SHARED WITH A THIRD PARTY IN VIEW OF AN ALREADY
EXISTING OR FUTURE BUSINESS CONNECTION WITH SAP. IF THERE IS NO SUCH BUSINESS CONNECTION IN PLACE OR
INTENDED AND YOU HAVE RECEIVED THIS DOCUMENT, WE STRONGLY REQUEST THAT YOU KEEP THE CONTENTS
CONFIDENTIAL AND DELETE AND DESTROY ANY ELECTRONIC OR PAPER COPIES OF THIS DOCUMENT. THIS DOCUMENT SHALL
NOT BE FORWARDED TO ANY OTHER PARTY THAN THE ORIGINALLY PROJECTED ADDRESSEE.
This document outlines our general product direction and should not be relied on in making a purchase decision. This document is not subject
to your license agreement or any other agreement with SAP. SAP has no obligation to pursue any course of business outlined in this
presentation or to develop or release any functionality mentioned in this document. This document and SAP's strategy and possible future
developments are subject to change and may be changed by SAP at any time for any reason without notice. This document is provided
without a warranty of any kind, either express or implied, including but not limited to, the implied warranties of merchantability, fitness for a
particular purpose, or non-infringement. SAP assumes no responsibility for errors or omissions in this document and shall have no liability for
damages of any kind that may result from the use of these materials, except if such damages were caused by SAP intentionally or grossly
negligent.
© Copyright 2017 SAP SE. All rights reserved.
1 Introduction
SAP HANA
SAP HANA™ is an innovative in-memory database and data management platform, specifically developed to
take full advantage of the capabilities provided by modern hardware to increase application performance. By
keeping all relevant data in main memory, data processing operations are significantly accelerated.
Design for scalability is a core SAP HANA principle. SAP HANA can be distributed across multiple hosts
to achieve scalability in terms of both data volume and user concurrency. Unlike clusters, distributed HANA
systems also distribute the data efficiently, achieving high scaling without I/O locks.
The key performance indicators of SAP HANA appeal to many of our customers, and thousands of deployments
are in progress. SAP HANA has become the fastest growing product in SAP's 40+ year history.
About this Document
Loss of business critical system resources and services, like SAP HANA, translates directly into lost revenue.
The goal therefore is Business Continuity, using systems designed for continuous operation even in the
presence of inevitable failures. Mission critical systems require High Availability; this is no longer optional.
SAP HANA is fully designed for High Availability, supporting a broad range of recovery scenarios from various
faults, from simple software errors to disasters that decommission an entire site.
This paper describes SAP HANA's High Availability support for Fault and Disaster Recovery. A comprehensive
High Availability solution offers more design choices and requires the discussion of more details than
can be covered in a short paper, and may therefore require additional consultations.
2 What is High Availability?
Availability, the measure of a system's operational continuity, is expressed as a percentage of time, inversely
proportional to downtime. For example, if a given system is designed to be available for 99.9% of the time
(sometimes called "three nines"), its downtime per year must be less than 0.1%, or about 9 hours.
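The arithmetic behind the "three nines" example can be checked with a few lines of Python. This is only an illustrative sketch: the function name and the choice of a 365-day year are assumptions made here, not part of the original paper.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, leap years ignored (assumption)


def max_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget in hours for a given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    # Downtime is the complement of availability, applied to one year.
    return (100.0 - availability_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% available -> {max_downtime_hours(target):.2f} hours of downtime per year")
```

For 99.9% the budget works out to 8.76 hours, which the text rounds to 9 hours.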
Downtime is the consequence of outages, which may be intentional (e.g. for system upgrades) or caused by
unplanned faults. A fault can be due to equipment malfunction, software or network failures, or due to a
major disaster such as a fire, a regional power loss or a construction accident, which may decommission the
entire data center.
High Availability is a set of techniques, engineering practices and design principles for Business Continuity.
This is achieved by eliminating single points of failure (fault tolerance), and providing the ability to rapidly
resume operations after a system outage with minimal business loss (fault resilience).
Fault Recovery is the process of recovering and resuming operations after an outage due to a fault. Disaster
Recovery is the process of recovering operations after an outage due to a prolonged data center or site failure.
Preparing for disasters may require backing up data across longer distances, and may thus be more complex
and costly.
Recovery - Key Performance Indicators
Customers commonly use two key measures to specify the recovery parameters of a system following an
outage: The Recovery Period Objective (RPO) and the Recovery Time Objective (RTO). The RPO and RTO of
a system are illustrated below:
[Figure: RPO and RTO]
• The RPO is the maximal permissible period of time during which operational data may be lost without
  ability to recover (time between the last backup and the crash).
• The RTO is the maximal permissible time it takes to recover the system, so that its operations can
  resume.
3 Eliminating Single Points of Failure
The key to achieving fault tolerance is to eliminate single points of failure by introducing redundancy. SAP
HANA appliance vendors deliver several levels of redundancy to avoid outage due to component failure, which
are briefly discussed here. Generally speaking, these techniques are transparent¹ to SAP HANA's operation,
but they form a crucial line of defense against avoidable system outage, and therefore greatly contribute to
Business Continuity.
Hardware Redundancy
SAP HANA appliance hardware vendors design multiple layers of redundant hardware components and
subsystems. These include redundant and hot-swappable power supply units (PSUs), fans, network interface
cards and enterprise-grade error-correcting protected memories. These subsystems are designed such that
the redundant component can sustain the operation of the system if the other component fails².
Particularly critical is the storage system. Enterprise-grade storage systems combine multiple physical drives
into logical units, with built-in standard (RAID) techniques for redundancy and error recovery. These include
mirroring, the writing of the same data to two different drives in parallel, and parity, extra bits written to
allow the detection and automatic correction of errors³.
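To make the parity idea concrete, the following minimal sketch shows XOR parity in the style of RAID 5: one parity block is computed over several data blocks, and any single lost block can be rebuilt from the survivors. The block contents and function names are invented for illustration; real storage controllers work on fixed-size stripes and do this in hardware or firmware.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Reconstruct the single missing data block from the survivors and the parity block."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    data = [b"HANA", b"redo", b"logs"]   # three "drives" holding one block each
    parity = xor_blocks(data)            # stored on a fourth drive
    # Simulate the loss of the second drive and rebuild its block.
    recovered = rebuild_missing([data[0], data[2]], parity)
    print(recovered)
```

Mirroring needs no computation at all (the second copy is simply read back), at the price of doubling the storage; parity trades some computation for less capacity overhead.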
Network Redundancy
Redundant networks, network equipment and network connectivity are required to prevent network failures from
affecting system availability. This is typically accomplished by deploying a completely redundant switch
topology, using the Spanning Tree Protocol to avoid loops. Routers can be configured with the Hot Standby
Router Protocol (HSRP) for automatic failover. BGP is commonly used to manage dual WAN connections.
Data Center Redundancy
Data centers that host SAP HANA solutions are equipped with Uninterruptible Power Supply (UPS) and backup
power generators, redundant cooling systems and multi-sourced providers of network connectivity and
electricity, achieving operational availability in the presence of individual failures, and significantly reducing
the probability of a business-impacting outage.
Some enterprises operate fully duplicated data centers, providing a high level of disaster tolerance.
4 SAP HANA High Availability Support
As an in-memory database, SAP HANA must not only concern itself with maintaining the reliability of its data
in the event of failures, but also with resuming operations with most of that data loaded back in memory as
quickly as possible.
The following figure shows the phases of High Availability. The first phase is readiness, being prepared for the
inevitable fault. During this time, data is backed up and standby systems are ready to take over. A fault must
be detected, either automatically or administratively (to avoid false positives), and a recovery process is put
in action. Finally, the fault must be repaired, and the system may need to be reverted to the original
configuration (failed back), to be ready again for the next fault.
¹ The SAP HANA software itself is a single point of failure, as it can cease to operate due to software errors or extreme
out-of-memory situations. Fault Recovery support is discussed in the next section.
² An example of high availability hardware design can be found here: pw eooks itm comedpane insted 406 pat
³ Read further: hp iwnoad inte comsuppormohertnardseeversblenisprse dacs versie deskop css hard drves
[Figure: phases of High Availability (prepare, detect, recover, fail back), with the performance of the primary and backup systems over time]
Different RPO/RTO values can be associated with different kinds of faults. Business critical systems are
expected to operate with an RPO of zero data loss in the case of local faults, and often even in the case of
disaster. But the challenges of disaster recovery are different from those of locally recoverable faults; to achieve zero
RPO and low RTO, data must be replicated synchronously over longer distances, which impacts regular system
performance and may require more expensive standby and failover solutions.
All of this leads to tradeoff decisions around the attributes of fault recovery functionality, cost and complexity.
SAP accordingly offers complementary design options, including three levels of Disaster Recovery support and
two automatic Fault Recovery support features, summarized in the following table and further discussed in
the sections below.
                                                                  Cost  RPO  RTO
DISASTER RECOVERY SUPPORT
  1. Backups                                                      $     >0   high
  2. Storage Replication                                          $$    ~0   med
  3. System Replication                                           $$    0    low
  4. System Replication Active/Active (read enabled)              $     0    low
  5. System Replication w/o data preload                          $$    0    med
FAULT RECOVERY SUPPORT
  1. Service Auto-Restart                                         0     0    med
  2. Host Auto-Failover                                           $$    0    med
Backups
SAP HANA uses in-memory technology, but of course it fully persists any transaction that changes the data,
such as row insertions, deletions and updates, so it can resume from a power outage without loss of data.
SAP HANA persists two types of data to storage: transaction redo logs, and data changes in the form of
savepoints.
A transaction redo log is used to record a change. To make a transaction durable, it is not required to persist
the complete data when the transaction is committed; instead it is sufficient to persist the redo log. Upon an
outage, the most recent consistent state of the database can be restored by replaying the changes recorded
in the log, redoing completed transactions and rolling back incomplete ones.
A savepoint is a periodic point in time when all the changed data is written to storage, in the form of pages.
One goal of performing savepoints is to speed up restart: when starting up the system, logs need not be
processed from the beginning, but only from the last savepoint position. Savepoints are coordinated across
all processes (called SAP HANA services) and instances of the database to ensure transaction consistency. By
default, savepoints are performed every five minutes, but this can be configured.
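The interplay of savepoints and redo logs can be sketched as follows. This is a deliberately simplified toy model, not SAP HANA's actual recovery algorithm: the record layout (`txn`, `op`, `key`, `value`) and the function name are assumptions made for illustration, and "rolling back" an incomplete transaction is modeled by simply never applying its changes.

```python
def recover(savepoint, redo_log):
    """Rebuild the latest consistent state from a savepoint plus the log written after it.

    savepoint: dict with the data as of the last savepoint.
    redo_log:  list of records in write order, each a dict with keys
               "txn", "op" ("set" or "commit") and, for "set", "key" and "value".
    """
    # Only transactions with a commit record in the log count as completed.
    committed = {rec["txn"] for rec in redo_log if rec["op"] == "commit"}
    state = dict(savepoint)
    for rec in redo_log:
        if rec["op"] == "set" and rec["txn"] in committed:
            state[rec["key"]] = rec["value"]   # redo a completed transaction
        # changes of uncommitted transactions are skipped (rolled back)
    return state


if __name__ == "__main__":
    savepoint = {"stock": 10}
    log = [
        {"txn": 1, "op": "set", "key": "stock", "value": 9},
        {"txn": 1, "op": "commit"},
        {"txn": 2, "op": "set", "key": "stock", "value": 0},  # crash before commit
    ]
    print(recover(savepoint, log))
```

Because replay starts from the savepoint rather than from an empty database, only the log written since the last savepoint has to be processed, which is the restart speed-up described above.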
Savepoints normally overwrite older savepoints, but it is possible to freeze a savepoint for future use; this is
called a snapshot. Snapshots can be replicated in the form of full data backups, which can be used to restore
a database to a specific point in time. This can be useful in the event of data corruption, for instance. In
addition to data backups and snapshots, smaller periodic log backups ensure the ability to recover from fatal
storage faults with minimal loss of data. While full data backups contain all current data, delta backups
can also be created (since HANA 1.0 SPS11), containing all data that was changed since the last data backup. Two
types of delta backups are to be distinguished: incremental backups contain all changed data since the last
full or delta backup, and differential backups contain all changed data since the last full backup.
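The difference between the two delta backup types shows up when choosing which backups are needed for a restore. The sketch below applies the definitions given above to a list of backups in creation order; the tuple format and the function name are hypothetical and do not correspond to any SAP HANA catalog structure.

```python
def restore_chain(backups):
    """Return the backups needed to restore the newest state.

    backups: list of (kind, label) tuples in creation order, where kind is
             "full", "differential" or "incremental".
    """
    # Start from the most recent full backup (raises ValueError if there is none).
    last_full = max(i for i, (kind, _) in enumerate(backups) if kind == "full")
    chain = [backups[last_full]]
    tail = backups[last_full + 1:]

    # A differential holds everything changed since the full backup, so the
    # newest one replaces all deltas taken before it.
    diff_positions = [i for i, (kind, _) in enumerate(tail) if kind == "differential"]
    if diff_positions:
        newest = diff_positions[-1]
        chain.append(tail[newest])
        tail = tail[newest + 1:]

    # Each remaining incremental only covers changes since the previous backup,
    # so all of them are required, in order.
    chain.extend(b for b in tail if b[0] == "incremental")
    return chain


if __name__ == "__main__":
    history = [("full", "F1"), ("incremental", "I1"), ("differential", "D1"),
               ("incremental", "I2"), ("incremental", "I3")]
    print(restore_chain(history))
```

Log backups taken after the last delta backup would then be replayed on top of this chain to get as close as possible to the point of failure.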
[Figure: Local Persistence and Backups (data backup, log backup)]
The above figure shows the savepoints, saved to local storage, and the additional backups, saved to backup
storage. Local recovery from the crash uses the latest savepoint, and then replays the last logs, to recover
the database without any data loss. If the local storage was corrupted by the crash, it is still possible to
recover the database from the data backup (or last snapshot) and log backups, possibly with some data loss.
Regularly shipping backups to a remote location over a network or via couriers can be a simple and relatively
inexpensive way to prepare for a disaster. Depending on the frequency and shipping method, this approach
may have an RPO of hours to days.
Storage Replication
One drawback of backups is the potential loss of data from the time of the last backup to the time of the
failure. A preferred solution, therefore, is to provide continuous replication of all persisted data. Several SAP
HANA hardware partners offer a storage-level replication solution, which delivers a backup of the volumes or
file system to a remote, networked storage system. In some of these vendor-specific solutions, which are
certified by SAP, the SAP HANA transaction only completes when the locally persisted transaction log has been
replicated remotely. This is called synchronous storage replication. Synchronous storage replication can be
used only where the distance between the primary and backup site is up to 100 kilometers (one or few hops,
with no more than ~5 µsec latency per kilometer), allowing for sub-millisecond round-trip latencies.
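The distance limit follows from simple propagation arithmetic. As a rough sketch, assuming the ~5 µsec per kilometer figure quoted above and ignoring switching, protocol and storage write delays (all of which add to the real number), the round-trip time grows linearly with distance. The function name and the sample distances are illustrative assumptions.

```python
def round_trip_latency_us(distance_km: float, latency_per_km_us: float = 5.0) -> float:
    """Estimate the network round-trip time in microseconds for a given site distance."""
    if distance_km < 0:
        raise ValueError("distance must not be negative")
    # The signal travels to the remote site and the acknowledgement travels back.
    return 2 * distance_km * latency_per_km_us


if __name__ == "__main__":
    for km in (10, 50, 100, 500):
        print(f"{km:>4} km -> about {round_trip_latency_us(km) / 1000:.2f} ms round trip")
```

Under these assumptions the estimate reaches roughly one millisecond at 100 km, which is where the text places the limit; since every synchronously replicated commit waits for this round trip, longer distances translate directly into slower transactions.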
[Figure: Storage Replication (Primary System, Replicated Storage)]
Due to its continuous nature, storage replication (sometimes also called remote storage mirroring) offers a
more attractive RPO than backups, but this solution of course requires a reliable, high bandwidth and low
latency connection between the primary site and the secondary site.
In the event of a disastrous failure that justifies full system failover, an administrator attaches a standby
system to the replicated storage, and then restarts the SAP HANA system. The administrator must take care
that the failed primary system can no longer write to the replicated storage (an action called fencing), or else
there is a risk of data corruption, with two systems writing to the same storage.
System Replication
System Replication is an alternative HA solution for SAP HANA, providing an extremely short RTO, and
compatible with all SAP HANA hardware partner solutions. System replication employs an "N+N" approach,
with a secondary standby SAP HANA system that has the same number of active nodes as the active, primary
system. Each service and instance of the primary SAP HANA system communicates pairwise with a counterpart
in the secondary system⁴.
[Figure: System Replication (Primary System, Secondary System)]
The secondary system can be located near the primary system to serve as a rapid failover solution for planned
downtime, or to handle storage corruption or other local faults. Alternatively, or additionally (multi-tiered or
cascaded), a secondary system can be installed in a remote site for disaster recovery. Like Storage Replication,
this Disaster Recovery option requires a reliable link between the primary and secondary sites.
The instances in the secondary system operate in live replication mode. In this mode, all secondary system
services constantly communicate with their primary counterparts, replicate and persist data and logs, and
typically load data to memory. The log and data can be compressed before shipping. The secondary system
can accept queries when system replication was set up as an Active/Active (read enabled) configuration;
otherwise it does not accept requests or queries. With the Active/Active setup the secondary system can be
used to handle reporting workload without disrupting the primary system.
In an alternative configuration, called system replication without data preload, the secondary system does
not preload data, and hence consumes very little memory. This allows the hosts of the secondary system to
serve dual purposes, for instance for development or test/QA with separate storage. Before takeover, these
activities must of course be turned off. The tradeoff is a longer RTO in case of failover.
⁴ From HANA 1.0 SPS09, SAP HANA supports multi-tenant database containers. System replication can only be set up for
the system as a whole, not per individual tenant.
Here is how system replication works. When the secondary system is brought up to start running in live
replication mode, each service component establishes a connection with its primary system counterpart, and
requests a snapshot of the data. From then on, all logged changes in the primary system are replicated.
Whenever logs are persisted in the primary system (i.e. written to the log volumes of each service), they are
also sent to the secondary system. A transaction in the primary system is not committed until the redo logs
are replicated, as determined by a log replication option:
• Synchronous: The primary system waits with committing the transaction until it receives a reply
  that the log is persisted in the secondary system. This mode guarantees immediate consistency
  between both systems, at the cost of delaying the transaction by the time for data transmission and
  persisting in the secondary system.
  The question of what to do if replication fails (for instance due to a network fault) is governed by the
  full sync configuration option. It can be set to commit the transaction, or to fail the commit on the
  primary system, until replication is restored.
• Synchronous in-memory: The primary system commits the transaction after it receives a reply
  that the log was received by the secondary system, but before it was persisted. The transaction delay
  in the primary system is shorter, because it only includes the data transmission time.
• Asynchronous: The primary system commits the transaction after sending the log, without waiting
  for a response. This eliminates the synchronization latency, at the risk of minor theoretical data loss
  during failure. This mode is most useful when the secondary site is hundreds of kilometers away
  from the primary site, or when reducing latency is critical.
If the connection to the secondary system is lost, or the secondary system crashes, the primary system (after
a brief, configurable timeout) will resume operations without the backup protection⁵.
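The trade-off between the three log replication options can be summarized in a small model of the extra commit latency each one adds. This is a sketch under simplifying assumptions (a single log write, fixed transmission and persistence times); the enum and function names are invented here and are not SAP HANA configuration identifiers.

```python
from enum import Enum


class ReplicationOption(Enum):
    SYNCHRONOUS = "synchronous"
    SYNCHRONOUS_IN_MEMORY = "synchronous in-memory"
    ASYNCHRONOUS = "asynchronous"


def added_commit_delay_ms(option: ReplicationOption,
                          transmission_ms: float,
                          remote_persist_ms: float) -> float:
    """Extra time a commit on the primary waits because of log replication.

    transmission_ms:   time to send the log and receive the reply.
    remote_persist_ms: time for the secondary to write the log to its storage.
    """
    if option is ReplicationOption.SYNCHRONOUS:
        # Wait until the secondary confirms the log is persisted.
        return transmission_ms + remote_persist_ms
    if option is ReplicationOption.SYNCHRONOUS_IN_MEMORY:
        # Wait only until the secondary confirms the log was received.
        return transmission_ms
    # Asynchronous: commit right after sending, no waiting.
    return 0.0


if __name__ == "__main__":
    for option in ReplicationOption:
        delay = added_commit_delay_ms(option, transmission_ms=1.0, remote_persist_ms=0.5)
        print(f"{option.value:<22} adds {delay:.1f} ms per commit")
```

The less the primary waits, the larger the window in which a log entry exists only on the primary, which is why the asynchronous option carries the (small) risk of data loss mentioned above.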
Handling of the received logs on the secondary site is done in different ways, depending on the configured
system replication operation mode:
• Delta data shipping: In this operation mode the secondary system persists, but does not
  immediately replay, the received logs. To avoid an ever-growing list of logs, incremental data
  snapshots are transmitted asynchronously from time to time from the primary to the secondary
  system. If the secondary system has to take over, only that part of the log needs to be replayed that
  represents changes that were made after the most recent data snapshot. In addition to snapshots,
  the primary system also transfers status information regarding which column table columns are
  currently loaded into memory. The secondary system correspondingly preloads these columns.
• Logreplay (as of HANA 1.0 SPS11): With this operation mode configured, the received log entries
  are replayed immediately in the secondary system. The takeover time is reduced because the log
  does not have to be replayed anymore. Additionally, there is much less traffic on the network between
  the primary and the secondary site, because no delta data shipping needs to take place.
• Logreplay read access (as of HANA 2.0 SPS00): In this operation mode the received log entries
  are also replayed immediately in the secondary system. Additionally, the replicated data are read
  accessible with a small delay compared to the primary's data. Read access is possible via direct
  connection to the secondary or by providing hinted SQL statements on the primary, which are routed
  to the secondary for execution. The takeover time is reduced further - not only because the log does
  not have to be replayed anymore, but also because this system is even more prepared for productive
  operation.
In the event of a failure that justifies full system takeover, an administrator instructs the secondary system
to switch from live replication mode to full operation. The secondary system, which already preloaded the
same column data as the primary system, and possibly is already read enabled, becomes the primary system
by replaying the last transaction logs, and then starts to accept queries.
When the original system can be restored to service, it can be configured as the new secondary system, or
reverted to the original configuration by "failing back".
HANA 1.0 SPS09 introduced a way to hook events and actions inside SAP HANA scale-out (such as Host Auto-
Failover) and system replication. An administrator can add required actions to a Python script, to be executed
before or after events (like startup, shutdown, failover, takeover, ...).
These so-called "HA/DR provider" hooks can be used to address issues that require integration attention, such
as how to handle connections from database clients that were configured to reach the primary system, and
need to be "diverted" to the secondary system after a takeover. For example, a hook could be written to
remap virtual IP addresses after a takeover in SAP HANA system replication.
IP redirection is the method of choice for end-to-end client reconnection support, as it uniformly and simply
handles the end-to-end recovery of both SQL and HTTP clients, with very short recovery times, and without
special client-side configuration. The principle of IP redirection (also known as VIP) is to define an additional
"logical" hostname (hana, in the picture below) with its separate logical IP address (for example,
10.68.104.51), and then map this initially to the MAC address of the original host in the primary system (by
binding it to one of the host's interfaces). As part of the takeover procedure, a script is executed which
remaps the unchanged logical IP address to the corresponding takeover host in the secondary system. This
must be done pair-wise, for each host in the primary system. The remapping affects the L2 switching, as can
be seen in step 4 of the following diagram:
[Figure: IP redirection between Primary System and Secondary System (logical IP address 10.68.104.51)]
⁵ As a result, the primary system and secondary system might get out of sync. Such a situation is detected by the
secondary system when it resumes, reestablishes the connection, and receives the next set of log entries. In such a case, the
secondary system requests a data backup delta, based on which the log replication can be restarted.
⁶ E.g., see here: hielo hog Hogepot com/20" UO Wialip actresses andthe tml
IP redirection can be implemented using a number of techniques, for instance with the use of Linux commands
which affect the network ARP tables, by configuring L2 network switches directly, or by using cluster
management software. Following the IP redirection configuration, the ARP caches should be flushed, to provide
an almost instantaneous recovery experience to clients.
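As one possible illustration of the Linux command route, the sketch below only assembles the commands a takeover script might run on the new host; it does not execute anything. The interface name `eth0`, the /24 prefix and the function name are assumptions, and the use of `ip addr add` plus a gratuitous ARP via `arping -U` (from the iputils package) is one common technique, not necessarily the one used in any given SAP HANA landscape.

```python
import ipaddress


def vip_takeover_commands(virtual_ip: str, prefix_len: int, interface: str):
    """Build the commands that bind a virtual IP locally and announce it on the L2 network."""
    ipaddress.ip_address(virtual_ip)  # raises ValueError for a malformed address
    return [
        # Bind the logical IP address to the local interface of the takeover host.
        ["ip", "addr", "add", f"{virtual_ip}/{prefix_len}", "dev", interface],
        # Send unsolicited ARP replies so neighbours refresh their ARP caches.
        ["arping", "-U", "-I", interface, "-c", "3", virtual_ip],
    ]


if __name__ == "__main__":
    # Example values: the logical address from the text, an assumed /24 network and NIC name.
    for command in vip_takeover_commands("10.68.104.51", 24, "eth0"):
        print(" ".join(command))
```

In a real deployment the address would first have to be released on the failed host (or that host fenced), otherwise two machines could answer for the same IP; cluster management software typically takes care of this ordering.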
IP redirection requires that both the primary and failover host(s) are on the same L2 network. This depends
on the customer network design, but networks are increasingly designed with L2-over-L3 (such as Ethernet
over MPLS), making this option a viable solution in many cases. If the standby system is in a completely
separate L3 network, then DNS redirection is the preferred alternative solution.
DNS is a binding from a logical domain name to an IP address. Clients contact a DNS server to obtain the IP
address of the HANA host (step 1 below) they wish to reach. Many DNS products support failover configuration
by using short (few minutes or less) TTL response fields, and can be set up with watchdog functionality and
automatically triggered switchover.
As part of the failover procedure, a script is executed that changes the DNS name-to-IP mapping from the
primary host to the corresponding host in the secondary system (pair-wise for all hosts in the system). From
that point in time, clients are redirected to the failover hosts, as in step 2 of the following diagram:
[Figure: DNS redirection between Primary System and Secondary System]
DNS and IP redirection share the advantage that there are no client-specific configurations or requirements.
Further, DNS redirection supports DR configurations where the primary and standby systems may be in two completely
different network domains (separated by routers). One drawback of this solution is that modifying DNS
mappings requires a vendor-proprietary solution. Further, due to DNS caching in nodes (both clients and
intermediate network equipment), it may take a while (up to hours) until the DNS changes are propagated,
causing clients to experience downtime despite the recovery of the system.
A special handling of virtual IP addresses is required in an Active/Active (read enabled) system replication
configuration, where a separate virtual IP address is needed for the read access connections to the secondary
system. In the takeover case the virtual IP address for primary access is rebound to the secondary system,
while the secondary's virtual IP address stays active. Two virtual IP addresses are then available for system access
to the active system after takeover.
Service Auto-Restart
In the event of a software failure (or an intentional intervention by an administrator) that disables one of the
configured SAP HANA services (Index Server, Name Server, etc.), the service will be restarted by the SAP
HANA Service Auto-Restart watchdog function, which automatically detects the failure and restarts the
stopped service process. Upon restart, the service loads data into memory and resumes its function. While all
data remains safe (RPO=0), the service recovery takes some time.
Host Auto-Failover
Host Auto-Failover is a local "N+m" (m is often 1) Fault Recovery solution that can be used as a
supplemental or alternative measure to the system replication solution described earlier. One (or more)
standby hosts are added to an SAP HANA system, and configured to work in standby mode. As long as they
are in standby mode the databases on these hosts do not contain any data and do not accept requests or
queries.
[Figure: Host Auto-Failover, before failure]
When an active (worker) host fails, a standby host automatically takes its place. Since the standby host may
take over operation from any of the primary hosts, it needs access to all the database volumes. This can be
accomplished by a shared networked storage server, by using a distributed file system, or with vendor-specific
solutions that use an SAP HANA programmatic interface (the so-called Storage Connector API) to dynamically
detach and attach (mount) networked storage (e.g. using block storage via Fibre Channel) upon failover.
[Figure: Host Auto-Failover, after recovery]
A topic that requires some attention is how to recover connections from SAP HANA clients that were configured
to reach the original host, and need to be "diverted" to the standby host after host auto-failover.
One approach is a network-based (IP or DNS) approach, exactly as discussed earlier. Alternatively, SQL/MDX
database clients can be configured with the connection information of multiple hosts, optionally including the
standby host (a multi-host list is provided in the connection string). The client connection code (ODBC/JDBC)
uses a "round-robin" approach to reconnect and ensures that these clients can reach the SAP HANA database,
even after failover. To support HTTP (web) clients, which use the SAP HANA XS application services⁷, it is
recommended to install an external, itself fault protected, HTTP load balancer (HLB), such as SAP's Web
Dispatcher, or a similar product from another vendor. The HLBs are configured to monitor the web servers on
all the hosts on both the primary and secondary sites.
⁷ If the XS services are not used by any application, they can be disabled, and no HLB is required. Currently only one XS
server can run on a distributed system. In the future, multiple load-sharing XS servers will be installable on a system,
making the use of an HLB even more valuable.
The HLB (which serves as a reverse web proxy) redirects the HTTP clients to the correct server upon HANA
instance failure. HTTP clients are configured to use the IP address of the HLB itself (obtained via DNS), and
remain unaware of any HANA failover activity.
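The multi-host connection list used by SQL clients can be pictured with the following sketch. It is not the actual ODBC/JDBC driver logic: the function, the host names and the injected `connect` callable are hypothetical, and the "round-robin" behaviour is reduced to trying each configured host once, in order.

```python
def connect_with_failover(hosts, connect):
    """Try each configured host in turn and return (host, connection) for the first that answers.

    hosts:   list of "hostname:port" strings, e.g. taken from a multi-host connection string.
    connect: callable that opens a connection to one host and raises OSError on failure.
    """
    errors = {}
    for host in hosts:
        try:
            return host, connect(host)
        except OSError as exc:
            errors[host] = exc   # remember why this host was skipped, then move on
    raise ConnectionError(f"no host reachable: {errors}")


if __name__ == "__main__":
    # Hypothetical landscape: host2 has failed, the standby has taken over its role.
    reachable = {"host1:30015", "standby:30015"}

    def fake_connect(host):
        if host not in reachable:
            raise OSError(f"{host} is down")
        return f"session on {host}"

    print(connect_with_failover(["host2:30015", "host1:30015", "standby:30015"], fake_connect))
```

Because the list, not a single address, is what the client knows, no network-level redirection is needed for these clients; HTTP clients have no such list, which is why the load balancer is recommended for them.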
One dangerous scenario that may occur with Host Auto-Failover is referred to as split-brain. A split-brain could
accidentally happen if, for instance, host2 did not really fail, but only lost all its network connections (causing
the standby host to decide to take over). In this case both non-communicating systems assume the host2
role, and may both write to the same storage, causing data corruption. Measures to prevent such data corruption due
to split-brain situations (fencing) must be implemented. The above-mentioned storage API supports fencing.
Once repaired, the failed host can be rejoined to the system as the new standby host, to reestablish the failure
recovery capability.
5 Design for High Availability
The following table summarizes the main advantages and limitations of the SAP HANA High Availability support
options.
Backups
  Advantages:
  • Allows Disaster Recovery
  • Lowest cost, simplest
  • Supports point-in-time recovery
  • Can also be used to "clone" or copy systems
  Limitations:
  • RPO of minutes to hours, depending on frequency of backup and shipping method (synchronous shipping,
    using 3rd party tool, is recommended)
  • In case of disaster, need to acquire and configure secondary system (hours-days)
  • Cold start - longer RTO (hour)
  • Extra time (up to hours) to load column data and return to full performance

Storage Replication
  Advantages:
  • Allows Disaster Recovery
  • RPO=0 with synchronous replication; RPO of a few seconds otherwise
  • Secondary system can be used for other purposes, until needed
  Limitations:
  • In case of disaster, need to possibly free up, boot up and reconfigure secondary system (hours)
  • Cold start - longer RTO (hour)
  • Not yet offered by all SAP HANA hardware partners
  • Extra time (up to hours) to return to full performance
  • Requires networked storage systems and efficient inter-site link
  • Synchronous replication only supports distances of up to 100 km
  • Doesn't protect against storage corruption
  • More bandwidth wasteful than System Replication

System Replication
  Advantages:
  • Allows Disaster Recovery, and can be used as main HA failover for near zero downtime
    maintenance or failures
  • In Active/Active (read enabled) configurations the secondary is usable for reporting workload
  • RPO=0 (synchronous)
  • RTO of only a minute (continuous log replay)
  • Full performance right after takeover
  • Compatible with all partner solutions
  • Supports single-host systems with local storage, no need for external network storage appliances
  • With no data preload configuration, secondary system(s) can be used for noncritical dual purposes
  Limitations:
  • Requires dedicated live standby system and efficient inter-site link
  • Requires a solution for client connection recovery upon failover (e.g. DNS or Virtual IP address based)

Host Auto-Failover
  Advantages:
  • Can be used to complement System Replication or by itself
  • Automatic detection and failover
  Limitations:
  • Requires access to database storage by the standby host (shared network storage or other
    partner-specific solution)
In addition to the aforementioned SAP HANA High Availability options, one other approach deserves to be
mentioned, for analytic "data mart" applications where the data in SAP HANA is the result of using SAP
Landscape Transformation (SLT) replication from another data source. In such a situation, High Availability
through redundancy can be achieved by setting up concurrent SLT replication streams from the common data
source to two separate SAP HANA systems. Both systems can actively operate independently; in the case of
a failure or disaster, the other system remains available.
Planning for Failure
Failures are inevitable. Planning a comprehensive High Availability solution for SAP HANA requires an
evaluation of the impact of potential failures, the company's tolerance and requirements for different RPO and
RTO values in the presence of common transient local failures vs. extremely rare disasters, and an
understanding of the benefits and cost of the different alternatives offered.
To recap, here is a brief summary of the main faults and how SAP HANA addresses them:
+ Service down (software fault): Service Auto-Restart. System Replication can also be used to fail over.
+ Power outage: Persistence of savepoints and transaction logs guarantees recovery without data loss.
+ Host crash (hardware fault): Host Auto-Failover. Alternatively, System Replication can be used to fail over.
+ Storage or data corruption: Backups and snapshots allow point-in-time recovery, applicable to all solutions.
+ Data center outage (disaster): System Replication supports rapid resumption of operation. Alternatively, Storage
  Replication or Backups can be used to bring up the system in an alternate data center.
Besides the high-level consideration of RPO/RTO in the different scenarios, other aspects will need to be
evaluated as well: the size of the system and database, the frequency and size of the logs and data files that
need to be replicated, the bandwidth availability, reliability and latency of the links between the systems, the
nature of the landscape management and availability solutions used for other non-SAP HANA systems, and
other considerations.
Small RTO requirements lead to the preferred system replication solution, which can also be used for rapid
failover in case of planned and unplanned outages. Tradeoffs may lead to other alternatives. The following
decision tree summarizes the main design choices:
[Figure: High Availability Decision Tree, with yes/no branches covering Disaster Recovery and Local Failure Recovery]
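
As a rough illustration of this kind of decision process, the following sketch maps two yes/no questions to an option. It is not a transcription of the decision tree: the function `suggest_option` and its two inputs are hypothetical, and the branches are assumptions drawn from the statements in this section (small RTO leads to System Replication; Storage Replication or Backups are the alternatives for disasters; Host Auto-Failover and Service Auto-Restart cover local faults).

```python
def suggest_option(need_disaster_recovery: bool, need_small_rto: bool) -> str:
    """Suggest an SAP HANA availability option from two yes/no requirements."""
    if need_small_rto:
        # Small RTO requirements lead to the preferred solution.
        return "System Replication"
    if need_disaster_recovery:
        # Longer RTO is acceptable, but a second site is still needed.
        return "Storage Replication or Backups"
    # Local failure recovery only.
    return "Host Auto-Failover with Service Auto-Restart"


for dr in (True, False):
    for small_rto in (True, False):
        print(dr, small_rto, "->", suggest_option(dr, small_rto))
```

A real evaluation would weigh more inputs than these two, such as distance between sites, available bandwidth and hardware partner support.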
Realistically, the above decision process will be further influenced by considerations like timelines, costs,
budgets and customer paradigm preferences, which are outside the scope of this short paper.
6 In Summary
SAP HANA supports a comprehensive range of High Availability options, designed to satisfy tradeoffs between
demanding High Availability and Disaster Recovery requirements, while also considering cost and complexity.
In particular, the SAP HANA System Replication solution supports an RPO of zero seconds, and an RTO
measured in minutes, and is SAP's recommended configuration for addressing SAP HANA outage reduction
due to planned maintenance, faults and disasters.
SAP HANA High Availability documentation: SAP Note 2407188
More information about SAP HANA can be found on http://help.sap.com.
Glossary
Industry Terms
Term                        Description
Fault                       A failure of a system or one of its component sub-systems (hardware, network, software)
Disaster                    Major fault: the failure of an entire data center/site
Outage                      A system's inability to operate (due to failure or planned downtime)
Availability                The measure of a system's operational continuity, expressed as a percentage of time
Downtime                    Inverse of availability: the duration of time that a system is not operational
High Availability (HA)      A framework of design principles, techniques and best practices to reduce downtime
Fault Recovery (FR)         Recovery of system operations after outage due to a local fault
Disaster Recovery (DR)      Recovery of system operations after outage due to a disaster
Failover/Takeover           Switching to backup (standby) system/host, upon failure of the primary system/host
Failback                    Process of restoring a system to its original state
Recovery Point              The maximal permissible period of time during which operational data may be lost without
Objective (RPO)             ability to recover (time between the last backup and the crash)
Recovery Time               The maximal permissible time it takes to recover the system, so that its operations can resume
Objective (RTO)
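
The Availability and Downtime entries above are two views of the same quantity, which a short calculation makes concrete. This is a minimal sketch; the function names are hypothetical and a year is assumed to be 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_percent(uptime_hours: float, downtime_hours: float) -> float:
    """Availability as the percentage of total time the system was operational."""
    return 100.0 * uptime_hours / (uptime_hours + downtime_hours)


def yearly_downtime_hours(availability: float) -> float:
    """Downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1.0 - availability / 100.0)


# "Three nines" (99.9%) leaves under nine hours of downtime per year.
print(round(yearly_downtime_hours(99.9), 2))             # 8.76
print(round(availability_percent(8751.24, 8.76), 2))     # 99.9
```

The same arithmetic shows why each extra "nine" is costly: 99.99% availability allows less than one hour of downtime per year.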
SAP HANA Terms
Term                        Description
SAP HANA System             A SAP HANA system is identified by a system ID (SID). It is perceived as one unit from the
                            perspective of the administrator, who can install, update, start up, shut down, or backup the
                            system as a whole. A distributed SAP HANA system is a system which is installed on more than
                            one host. The collection of elements of the system on each host is referred to as an instance.
SAP HANA Service            A SAP HANA service is an independent functional component of a SAP HANA system, such as
                            the Index Server, the Name Server, etc. They appear as separate processes from an Operating
                            System perspective.
Disaster Recovery Support
  Backup                    Periodic saving of database copies in safe place
  Storage Replication       Continuous replication (mirroring) between primary storage and backup
                            storage over a network (may be synchronous)
  System Replication        Continuous (synchronous and asynchronous) update of secondary system
                            by primary system, including in-memory table loading and continuous log
                            replay on the secondary system (if configured)
Fault Recovery Support
  Service Auto-Restart      Automatic restart of stopped services on host (watchdog)
  Host Auto-Failover        Automatic failover from crashed host to standby host in the same system