Backup and DR in Hadoop
Lars George – Partner and Co-Founder @ OpenCore
DataWorks Summit Munich 2017
Distributed Problems
About Me
• Partner & Co-Founder at OpenCore
• Before that
• Lars: EMEA Chief Architect at Cloudera (5+ years)
• Hadoop since 2007
• Apache Committer & Apache Member
• HBase (also in PMC)
• Lars: O’Reilly Author: HBase – The Definitive Guide
• Contact
• lars.george@opencore.com
• @larsgeorge
Website: www.opencore.com
Agenda
• Context
• Data Backup Strategies
• Summary
Context
What do you have to look out for?
What is What?
• Backup
• Ability to restore data using previously taken, frozen in time data snapshots
• Allows recovery of deleted or erroneously modified data
• Backups are usually not current, as the most recent changes are not included
• Disaster Recovery
• Restore business and operations after a complete system failure
• Includes rebuilding the environment and restoring the data from the last (good)
backup
• Minimize the impact on the business (financial loss)
Many Systems
• Hadoop is a platform of many distributed systems
• Simple tools only cover simple topics
• Every system has data and/or metadata
• Amount of data ranges from a few terabytes
to multiple petabytes in practice
• A cluster contains few to hundreds of servers
→ What do you back up, how often, and how?
Evolution of the Hadoop Platform
[Figure: timeline of the Hadoop stack from 2006 to 2015. It starts with Core Hadoop (HDFS, MapReduce) and, in order of appearance, adds Solr, Pig, HBase, ZooKeeper, Hive, Mahout, Sqoop, Avro, Flume, Bigtop, Oozie, HCatalog, Hue, YARN, Spark, Tez, Impala, Kafka, Drill, Parquet, Sentry, Knox, Flink, Kudu, RecordService, Ibis, and Falcon.]
The stack evolves and grows continuously!
Why is backing up data difficult?
• Data at scale is difficult to move around!
• You cannot cheat physics
• The sheer inertia of data requires new approaches
• Do not move data, or move only as little as necessary
• If data is duplicated, use the copy for other purposes as well
• Multiple clusters with different workloads
• Traditional backup tools often require standardized APIs
• Hadoop does not necessarily provide those, or they are inefficient at this scale
• Included backup tools in Hadoop are often rudimentary
• Not all scenarios are covered, or are only partially covered
Databases in Hadoop
• Many components use databases to persist their state and metadata
• The selection of RDBMS may have a substantial impact on that functionality
→ Never use the “developer option” (e.g. Derby)!
→ The RDBMS should be highly available (HA)
• Databases should be backed up and archived on a regular basis
• But the question often remains: Is this a task of the Hadoop team or the
(often central) IT department?
• This also applies to other, external Hadoop stack systems (e.g. Storm)
Goals and Objectives
Backup and DR requirements are usually expressed as two objectives:
RTO – Recovery Time Objective
• Time to recover a service
• The hotter backup data is kept, the
shorter the RTO
• At scale, the RTO is foremost a
factor of infrastructure
RPO – Recovery Point Objective
• Measures how much data is lost in
case of a disastrous failure
• The more often data is backed up,
the shorter the RPO
→ RPO and RTO are the driving cost factors, and their effects multiply
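The two objectives can be made concrete with a little arithmetic. The following is a minimal sketch, not a sizing tool: the function names, the 500 TB volume, and the 10 Gbit/s link are assumptions chosen for illustration. It treats the worst-case RPO as the backup interval and estimates only the raw transfer share of the RTO.

```python
def worst_case_rpo_hours(backup_interval_hours: float) -> float:
    """Worst case: the failure hits right before the next backup would run,
    so everything written since the previous backup is lost."""
    return backup_interval_hours


def restore_time_hours(data_tb: float, throughput_gbit_s: float) -> float:
    """Hours needed to move data_tb (decimal terabytes) over a link that
    sustains throughput_gbit_s gigabits per second."""
    data_bits = data_tb * 1e12 * 8                 # terabytes -> bits
    seconds = data_bits / (throughput_gbit_s * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical figures, purely for illustration.
    print(f"Worst-case RPO with daily backups: {worst_case_rpo_hours(24):.0f} h")
    print(f"Transfer time for 500 TB at 10 Gbit/s: {restore_time_hours(500, 10):.1f} h")
```

With these made-up numbers the transfer alone takes roughly 111 hours, before any rebuild or validation work, which is why the RTO at scale is mostly a question of infrastructure.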
Failure Scenarios
• Node Degradation
• One or more nodes slow down or produce an increasing number of errors
(and with it fewer results) – coined “The John Wayne”
• May cause Byzantine errors, which are difficult to identify
→ Reasons: Failures or bugs in disks, NICs, device drivers, software
→ Hadoop can handle many such errors, but not all
• Partial Node Failure
• Single (redundant) components fail completely
• Example: A disk stops working
• Operators can swap the component at runtime
→ Hadoop is built to handle failures like this
→ Impact is limited to the failed component's share of total capacity
Failure Scenarios (cont.)
• Node Failure
• Assumes preparation, like enabling HA everywhere or configuring “Rack Awareness”
→ Reasons: Power or network outage
→ Hadoop can handle this just fine
• Network Partitioning
• The cluster is split into two or more parts at random points
• Causes the so-called “split brain” problem, where each now autonomous part has to
decide if it must fail, or can continue to serve requests
• Applications need to switch to one of the working parts of the cluster
→ Hadoop has some support for that, but there are external dependencies
→ What happens when the parts join the cluster again?
Failure Scenarios (cont.)
• Loss of an entire data center
• Complete loss of a data copy
• Either switch to a warm/hot standby cluster (blue-green deployment)
• Or, rebuild cluster and restore data
→ Reasons: Power or network outage
→ Has to be done outside of Hadoop
Data Sources
• Not all Hadoop components have persistent data (or metadata)
• Transient data can (should) be recomputed as needed
• The number of used Hadoop components varies a lot
• An “onboarding” checklist can help capture that
• Given a set of requirements the RTO and RPO can be different
• Question: How long does re-computing derived data take?
• Basic Rule: The more you have, the more costly and time consuming it is
• You can always omit parts, as long as everyone is OK with it (for realz!)
• Cost can be capped – but not without consequence (higher RTO)
Backup Strategies
• Replication
• Copy of data and modifications of one cluster to another
• Basically like the venerable rsync problem
• Some components in Hadoop support this (partially?)
• What do you do with deleted data?
• Snapshots
• Few tools have a built-in snapshot feature
• HDFS and HBase
• Special access to frozen-in-time data
• Using special paths or system tools
• Data is local and needs to be moved
• How do you do this incrementally?
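For HDFS, the snapshot bullets above can be sketched with the stock command-line tools. The Python helpers below only assemble the commands and, by default, print them instead of running them. The paths, snapshot names, and helper names are hypothetical. As far as I understand the DistCp documentation, the `-diff` mode also expects the target to still hold the older snapshot unchanged, so treat this as an outline of one incremental approach rather than a finished procedure.

```python
import subprocess
from typing import List


def allow_snapshot_cmd(path: str) -> List[str]:
    # An administrator marks the directory as snapshottable once.
    return ["hdfs", "dfsadmin", "-allowSnapshot", path]


def create_snapshot_cmd(path: str, name: str) -> List[str]:
    # Takes a named, read-only, frozen-in-time view of the directory.
    return ["hdfs", "dfs", "-createSnapshot", path, name]


def snapshot_path(path: str, name: str) -> str:
    # Snapshot contents are read through the special ".snapshot" directory.
    return f"{path.rstrip('/')}/.snapshot/{name}"


def incremental_copy_cmd(src: str, dst: str, old: str, new: str) -> List[str]:
    # Copies only what changed between snapshots `old` and `new` of `src`.
    return ["hadoop", "distcp", "-update", "-diff", old, new, src, dst]


def run(cmd: List[str], dry_run: bool = True) -> None:
    if dry_run:
        print(" ".join(cmd))          # show the command instead of executing it
        return
    subprocess.run(cmd, check=True)   # raises if the tool reports an error


if __name__ == "__main__":
    # Hypothetical source directory and backup cluster address.
    src = "/data/events"
    dst = "hdfs://backup-cluster/data/events"
    run(allow_snapshot_cmd(src))
    run(create_snapshot_cmd(src, "s2"))
    run(incremental_copy_cmd(src, dst, "s1", "s2"))
    print("Read-only view:", snapshot_path(src, "s2"))
```

Keeping the command construction separate from execution makes the plan reviewable before anything touches the cluster.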
Backup Strategies (cont.)
• Backup
• Storing data on cold media
• Not supplied with Hadoop
• A few components ship their own system tools
• But… Versioned? Complete? Consistent?
• HA and Rack-Awareness
• Cover neither backup nor DR
• Unless calling the HDFS trash functionality a backup... NOPE!
• Only valid within the cluster, within the same data center
Data Types
→ There are two main types of data: persisted data and metadata
→ There is also transient data
• Data concerns all user data, stored in HDFS, HBase, Solr, and so on
• Can be accessed using an interface
• Metadata are auxiliary information that helps to make sense of, or to
access, the user data
• Hive Schemas
• Cluster Information
• Transient data often is stored in temporary files, logs, or streams
Data Consistency
• An often missed (or ignored?) topic, describing what actually is inside a backup
• Is the contained data consistent in itself?
• Some components (NoSQL, including HDFS) cannot mark data across system
boundaries in a reliable and predictable manner
• Snapshots may also be of no help as they are taken asynchronously
• Per region server in HBase
• Open blocks are added in HDFS
• Move the task towards the application
• Which application was designed to do that?
• When restoring data, gaps or bulges can form!
• Question is: Who is responsible for handling that?
• You could be tempted to add transactions...
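One way an application could detect such gaps and bulges after a restore is to check a sequence number it writes itself. The sketch below assumes every record carries a monotonically increasing integer id, which nothing in Hadoop provides for you. It also reads a "gap" as a missing id and a "bulge" as an id that shows up more than once. Both readings are my interpretation of the terms.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_gaps_and_bulges(seq_ids: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Return (missing ids, duplicated ids) within the observed id range."""
    counts = Counter(seq_ids)
    if not counts:
        return [], []
    lo, hi = min(counts), max(counts)
    gaps = [i for i in range(lo, hi + 1) if i not in counts]   # never restored
    bulges = sorted(i for i, n in counts.items() if n > 1)     # restored twice
    return gaps, bulges


if __name__ == "__main__":
    # Hypothetical ids read back after a restore.
    restored = [1, 2, 2, 3, 5, 6]
    gaps, bulges = find_gaps_and_bulges(restored)
    print("gaps:", gaps)
    print("bulges:", bulges)
```

A check like this only sees records inside the observed range, so data lost at the very end of the stream still needs a separate high-water mark.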
Validation
• After taking a backup, its integrity needs to be checked
• Should consistency also be verified?
• HDFS has typical checks like CRCs
• Databases could be restored and checked
• Special test scripts?
• Applications should ideally supply their own verification tools or rule sets
• Make this part of the software engineering task
• Use Jenkins CI as a backup and restore pipeline?
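An integrity check of the kind listed above can be as simple as a checksum manifest written at backup time and compared after a restore. The sketch below works on a local directory tree with SHA-256. The function names are hypothetical, and applying it to HDFS or a database dump would require exporting the data to files first. It verifies integrity only, not application-level consistency.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Map each file's path relative to root onto its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(manifest: Dict[str, str], restored_root: Path) -> List[str]:
    """Compare a restored tree against the manifest; return found problems."""
    current = build_manifest(restored_root)
    problems = []
    for name, digest in sorted(manifest.items()):
        if name not in current:
            problems.append(f"missing: {name}")
        elif current[name] != digest:
            problems.append(f"changed: {name}")
    for name in sorted(current.keys() - manifest.keys()):
        problems.append(f"unexpected: {name}")
    return problems
```

A CI job could call `build_manifest` right after the backup, store the result next to it, and fail the pipeline whenever `verify` returns a non-empty list after a test restore.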
So far…
• Backup is a combination of already available techniques, or a special
implementation for systems that have no native support
• Snapshots alone only offer local versioning
• Replication is either a hot mirror, or a set of raw data structures that do not
allow an instantaneous restoration
• Consistency has to be handled on the application side
• The required RTO and RPO are crucial for how cluster environments have to
be built, and should be considered from the get go
• There does not seem to be a complete solution, requiring special
implementations!
Data Backup Strategies
Practical scenarios
Approaches
Scenario 1: Cold Export
[Diagram: Application → Cluster A → Export → Storage]
Scenario 2: Replication
[Diagram: Application → Cluster A → Replication → Cluster B]
Approaches (cont.)
• Applications write into two
(or more) clusters at the
same time
• Typically using Kafka
• An ACK requires both
clusters to confirm the write
• Could be controlled per
application
• See Google Spanner and
TrueTime
Scenario 3: Simultaneous Writes
[Diagram: Application → Cluster A and Cluster B]
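The acknowledgement rule for simultaneous writes can be illustrated with a small in-memory model. This is a sketch under stated assumptions: `Cluster` and `DualWriter` are hypothetical stand-ins, there is no Kafka involved, and there is no rollback, so a failed write can leave the clusters diverged. The `required_acks` parameter stands in for the per-application control mentioned above.

```python
from typing import Dict, List


class Cluster:
    """In-memory stand-in for a cluster; `available` simulates an outage."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True
        self.data: Dict[str, str] = {}

    def write(self, key: str, value: str) -> bool:
        if not self.available:
            return False              # no acknowledgement from this cluster
        self.data[key] = value
        return True


class DualWriter:
    """Sends every write to all clusters and reports success only if at
    least `required_acks` of them acknowledged it."""

    def __init__(self, clusters: List[Cluster], required_acks: int) -> None:
        self.clusters = clusters
        self.required_acks = required_acks

    def write(self, key: str, value: str) -> bool:
        # Every cluster is attempted; there is no rollback on partial failure.
        acks = sum(1 for c in self.clusters if c.write(key, value))
        return acks >= self.required_acks


if __name__ == "__main__":
    a, b = Cluster("A"), Cluster("B")
    strict = DualWriter([a, b], required_acks=2)
    print("both up:", strict.write("k1", "v1"))
    b.available = False
    print("B down:", strict.write("k2", "v2"))
    print("k2 present on A only:", "k2" in a.data and "k2" not in b.data)
```

The last line of the demo shows the catch: the strict writer rejects the write, yet Cluster A already holds it, which is the kind of inconsistency the application then has to resolve.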
Impact on Business
• The basic scenarios
are polar opposites
• Depending on
complexity extra
layers can be added
• Kafka as a buffer
• Cost varies greatly,
with #3 requiring
two clusters of the
same size
[Chart: RTO versus RPO, both axes running from low to high, with scenarios 1, 2, and 3 plotted]
Summary
Where to go from here?
Summary
Backup and DR must be part of planning and procurement from the start
Many systems handle data differently, requiring special treatment
Data backup and restoration has to be handled by the applications
Commercial offerings are few and not fully featured
Thank You!
@larsgeorge

Editor's Notes

  • #8 The rapid expansion of the Hadoop ecosystem is further evidence of its meteoric adoption.