What is High Availability and Disaster Recovery?
In modern, digital ways of working, high availability (HA) and disaster
recovery (DR) both aim to reduce downtime and maintain business continuity
when something goes wrong. But what do they mean?
High Availability (HA) – This refers to a system, network or other part of an
infrastructure that is designed to stay operational for as much of the time as
possible.
Disaster Recovery (DR) – This refers to a set of policies and procedures that
enable the recovery or continuation of vital infrastructure and systems following a
natural or human-induced disaster.
High Availability
While high availability may sound like it simply means a system or network that
is continuously operational, it is more complex than it first sounds.
At its core, high availability means eliminating single points of failure so that
a system or application keeps running. Because the aim is to reduce points of
failure, redundancy is naturally built in, and it is applied to most systems in
three key areas: hardware, software and environmental.
1. Hardware redundancy
This was one of the first ways HA was introduced into the world of
computing. Before applications had a continuous internet connection and could be
backed up anywhere and at any time, hardware redundancy was vital. Today,
manufacturers continue to address points of failure by incorporating
redundant storage elements, power supplies and networking.
Redundant storage ensures that data is written to and read from multiple physical
disks. This prevents data loss and downtime if a disk fails.
Redundant power typically takes the form of multiple power sources, enabling
admins to fail over to a backup power supply if a single source fails.
Redundant networking allows connection to multiple independent networks, so
that a server remains online if the main network connection fails.
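As a rough illustration of the redundant storage idea, the sketch below mirrors
every write to several "disks" and serves reads from whichever copy is still
healthy. The class name MirroredStore, its methods and the use of dictionaries
as stand-ins for physical disks are all assumptions made for this example, not
a real storage API.

```python
class MirroredStore:
    """Toy mirror: writes go to every healthy disk, reads use any healthy copy."""

    def __init__(self, disk_count=2):
        # Each dict stands in for one physical disk (hypothetical model).
        self.disks = [{} for _ in range(disk_count)]
        self.healthy = [True] * disk_count

    def fail_disk(self, index):
        """Simulate the failure of one disk."""
        self.healthy[index] = False

    def write(self, key, value):
        targets = [d for d, ok in zip(self.disks, self.healthy) if ok]
        if not targets:
            raise OSError("no healthy disk available")
        for disk in targets:
            disk[key] = value

    def read(self, key):
        # Try each healthy disk in turn until one holds the data.
        for disk, ok in zip(self.disks, self.healthy):
            if ok and key in disk:
                return disk[key]
        raise KeyError(key)


store = MirroredStore(disk_count=2)
store.write("invoice-42", "paid")
store.fail_disk(0)                # the first disk fails
print(store.read("invoice-42"))   # the second disk still serves the data
```

Real mirroring is handled by the storage layer rather than application code,
but the principle is the same: no single disk holds the only copy.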
2. Software redundancy
As technology and demands developed, developers ensured that
applications themselves could tolerate failures in a system, whether caused by
hardware faults or configuration errors. Today this is often accomplished by:
Clustering technologies, which allow workloads to be spread across several
different servers.
Load balancing, which routes incoming requests to healthy application nodes
and raises alerts so that problems can be mitigated before they cause a failure.
Self-healing systems, which move workloads around or allocate
additional capacity when failures occur.
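The load balancing behaviour described above can be sketched as a round-robin
router that skips nodes marked unhealthy. This is a minimal sketch: the
LoadBalancer class, the node names and the manual mark_down and mark_up calls
are hypothetical, and a real load balancer would set node health from periodic
health checks rather than manual calls.

```python
import itertools


class LoadBalancer:
    """Round-robin router that only returns nodes currently marked healthy."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._cursor = itertools.cycle(self.nodes)

    def mark_down(self, node):
        self.healthy.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def route(self):
        # Look at each node at most once per request, skipping unhealthy ones.
        for _ in range(len(self.nodes)):
            node = next(self._cursor)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


lb = LoadBalancer(["app-1", "app-2", "app-3"])
lb.mark_down("app-2")                   # e.g. a health check failed
print([lb.route() for _ in range(4)])   # requests only reach app-1 and app-3
```

Once the failed node recovers, calling mark_up returns it to the rotation
without any change on the client side.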
3. Environmental redundancy
As cloud computing continues to grow, providers are taking HA to
another level in two key areas:
Hardware redundancy at the server rack level, allowing users to spread
workloads across racks to mitigate single points of failure without having to
move to another data centre.
Data centre redundancy, allowing users to run applications in separate data
centres that are located geographically close to each other, specifically to cover
events that are out of the user’s or the data centre operator’s hands.
When all of these measures fail and a system or application goes
down, disaster recovery comes into play.
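To put a rough number on how redundancy reduces the chance of everything
failing at once, the sketch below estimates the combined availability of
several redundant copies. It assumes the copies fail independently and that
any single copy is enough to keep the service running; both are simplifying
assumptions, and the 99% figure is purely illustrative.

```python
def combined_availability(single, copies):
    """Availability of `copies` independent components when any one suffices."""
    # The service is down only when every copy is down at the same time.
    return 1 - (1 - single) ** copies


HOURS_PER_YEAR = 365 * 24

for copies in (1, 2, 3):
    availability = combined_availability(0.99, copies)
    downtime_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"{copies} copies: {availability:.4%} available, "
          f"about {downtime_hours:.2f} hours down per year")
```

Under these assumptions, one component that is available 99% of the time
becomes roughly 99.99% available when duplicated. Failures that hit every copy
together, such as the loss of a whole site, break the independence assumption,
which is why disaster recovery is still needed.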
Disaster Recovery
Disaster recovery can take a number of different forms, from simply
restoring a backup to significantly more complex actions.
Like high availability, disaster recovery is multi-faceted, and it
rests on two core concepts:
Recovery time objective:
This is the maximum amount of time that a system can be down before it is
restored to its original operational state. Naturally, this period varies with
the system or application and its importance. For low-level systems, the
recovery time can be measured in hours or even days, but for
business-critical systems, it will usually be measured in seconds or minutes.
Recovery point objective:
This is the amount of data loss, measured in time, that can be tolerated in a
disaster. Returning to the example of low-level systems, losing a day or two's
worth of data may be acceptable, while for business-critical systems such as
transactional websites, the tolerable loss may be as short as minutes or even
seconds.
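The two objectives can be checked against an actual incident with some simple
date arithmetic. The sketch below is illustrative only: the function name
assess_recovery, the timestamps and the targets are hypothetical, and it
assumes that everything written after the last backup is lost.

```python
from datetime import datetime, timedelta


def assess_recovery(last_backup, failure, restored, rto, rpo):
    """Compare one incident against its recovery time and point objectives."""
    downtime = restored - failure      # how long the system was down
    data_loss = failure - last_backup  # window of data written but not backed up
    return {
        "downtime": downtime,
        "data_loss_window": data_loss,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss <= rpo,
    }


result = assess_recovery(
    last_backup=datetime(2024, 1, 1, 2, 0),   # nightly backup at 02:00
    failure=datetime(2024, 1, 1, 9, 30),      # outage starts at 09:30
    restored=datetime(2024, 1, 1, 10, 15),    # service back at 10:15
    rto=timedelta(hours=1),
    rpo=timedelta(hours=4),
)
print(result["rto_met"], result["rpo_met"])   # True False
```

Here the 45-minute outage meets the one-hour recovery time objective, but the
7.5 hours of data written since the last backup exceeds the four-hour recovery
point objective, so backups would need to run more often to meet it.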