Introduction To Map Reduce
Map Reduce: Motivation
Problem Scope
 Required functions
      Automatic parallelization & distribution
      Fault-tolerance
      Status and monitoring tools
      A clean abstraction for programmers
         Functional programming meets
          distributed computing
         A batch data processing system
Commodity Clusters
MapReduce & Hadoop - History
 2003: Google publishes about its cluster architecture & distributed file
  system (GFS)
 2004: Google publishes about its MapReduce model used on top of GFS
    Both GFS and MapReduce are written in C++ and are closed-source, with Python
     and Java APIs available to Google programmers only
 2006: Apache & Yahoo! -> Hadoop & HDFS
    open-source, Java implementations of Google MapReduce and GFS with a diverse
     set of API available to the public
    Evolved from Apache Lucene/Nutch open-source web search engine
 2008: Hadoop becomes an independent Apache project
    Yahoo! uses Hadoop in production
 Today: Hadoop is used as a general-purpose storage and analysis platform
  for big data
    Other Hadoop distributions from several vendors including EMC, IBM, Microsoft,
     Oracle, Cloudera, etc.
    Many users (http://wiki.apache.org/hadoop/PoweredBy)
    Research and development actively continues...
Google Cluster Architecture: Key Ideas
 What Makes MapReduce Unique?
 Its simplified programming model which allows the user to quickly write
  and test distributed systems
 Its efficient and automatic distribution of data and workload across
  machines
 Its flat scalability curve. Specifically, after a MapReduce program is
  written and functioning on 10 nodes, very little (if any) work is required
  to make that same program run on 1,000 nodes.
 MapReduce ties smaller and more reasonably priced machines together
  into a single cost-effective commodity cluster
Isolated Tasks
MapReduce in a Nutshell
 Given:
    a very large dataset
    a well-defined computation task to be performed on elements of this dataset
     (preferably, in a parallel fashion on a large cluster)
 Map Reduce framework:
    Just express what you want to compute (map() & reduce()).
    Don't worry about parallelization, fault tolerance, data distribution, load
     balancing (MapReduce takes care of these).
    What changes from one application to another is the actual computation; the
     programming structure stays similar.
 In simple terms
      Read lots of data.
      Map: extract something that you care about from each record.
      Shuffle and sort.
      Reduce: aggregate, summarize, filter, or transform.
      Write the results.
 One can use as many Maps and Reduces as needed to model a given
  problem.
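As a rough illustration, the five steps above can be sketched as a single in-memory pipeline (this is only an analogy for the model; the function name `run_mapreduce` is hypothetical, and a real framework distributes each phase across many machines):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    # Read lots of data + Map: extract (key, value) pairs from each record.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))
    # Shuffle and sort: bring all values for the same key together.
    intermediate.sort(key=itemgetter(0))
    # Reduce: aggregate, summarize, filter, or transform per key.
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        results.extend(reduce_fn(key, [v for _, v in group]))
    # Write the results (here: simply return them).
    return results
```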
Functional programming foundations
Note: There is no precise 1-1 correspondence. Please take this just as an analogy.
MapReduce Basic Programming Model
Word Count
map(k1, v1) → list(k2, v2)                 reduce(k2, list(v2)) → list(v2)
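Word count fits these signatures directly. A minimal sketch (Python stand-ins for illustration; in Hadoop these would be Mapper and Reducer classes written in Java):

```python
def wc_map(k1, v1):
    # map(k1, v1) -> list(k2, v2): k1 is e.g. a document name (unused here),
    # v1 is the document text; emit (word, 1) for every word seen.
    return [(word, 1) for word in v1.split()]

def wc_reduce(k2, values):
    # reduce(k2, list(v2)) -> list(v2): after the shuffle, values holds
    # every 1 emitted for this word; sum them to get the count.
    return [sum(values)]
```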
Parallel processing model
Execution overview
 Read as part of this lecture: Jeffrey Dean and Sanjay Ghemawat. 2008.
  MapReduce: simplified data processing on large clusters. Commun. ACM
  51, 1 (January 2008), 107-113.
 Master & Workers
 Master coordinates
 Local write / remote reads
MapReduce Scheduling
Data Distribution
 An underlying distributed file system (e.g., GFS) splits large data files
  into chunks which are managed by different nodes in the cluster
                             Input data: A large file
 Even though the file chunks are distributed across several machines,
  they form a single namespace
Partitions
Choosing M and R
MapReduce Fault Tolerance
 On worker failure:
    Master detects failure via periodic heartbeats.
    Both completed and in-progress map tasks on that worker are re-executed
     (their output is stored on local disk).
    Only in-progress reduce tasks on that worker are re-executed
     (their output is stored in the global file system).
    All reduce workers will be notified about any map re-executions.
 On master failure:
    State is check-pointed to GFS: new master recovers & continues.
 Robustness:
    Example: Lost 1600 of 1800 machines once, but finished fine.
MapReduce Data Locality
Stragglers & Backup Tasks
Other Practical Extensions
Basic MapReduce Program Design
MapReduce vs. Traditional RDBMS
More Hadoop details
Hadoop
 Hadoop MapReduce: A Closer Look
[Figure: two nodes, each loading files from its local HDFS store; on each node,
 InputFormat breaks the files into Splits, RecordReaders (RR) read them, and the
 final output is written back to the local HDFS store through OutputFormat.]
Input Files
 Input files are where the data for a MapReduce task is initially stored
 The input files typically reside in a distributed file system (e.g. HDFS)
 The format of input files is arbitrary
       Line-based log files
       Binary files
       Multi-line input records
       Or something else entirely
InputFormat
 How the input files are split up and read is defined by the InputFormat
 InputFormat is a class that does the following:
    Selects the files that should be used for input
    Defines the InputSplits that break up a file
    Provides a factory for RecordReader objects that read the file
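A rough sketch of the first two responsibilities (a hypothetical Python analogue; Hadoop's real InputFormat is a Java class with a different API, and the RecordReader factory is omitted here):

```python
class SimpleInputFormat:
    def __init__(self, split_size=64 * 1024 * 1024):
        # e.g. one split per HDFS block (block size is an assumption here)
        self.split_size = split_size

    def select_inputs(self, paths):
        # 1. Select the files that should be used for input
        #    (here: skip underscore-prefixed bookkeeping files).
        return [p for p in paths if not p.startswith("_")]

    def get_splits(self, path, file_size):
        # 2. Define the InputSplits that break up a file:
        #    (path, start offset, length) triples covering the whole file.
        return [(path, start, min(self.split_size, file_size - start))
                for start in range(0, file_size, self.split_size)]
```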
InputFormat Types
Input Splits
 An input split describes a unit of work that comprises a single map task in a
  MapReduce program
 If the file is very large, this can improve performance significantly through
  parallelism
 The input split defines a slice of work but does not describe how to access it
 The RecordReader class actually loads data from its source and converts it
  into (K, V) pairs suitable for reading by Mappers
 The Mapper performs the user-defined work of the first phase of the
  MapReduce program
 A new instance of Mapper is created for each split
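A minimal sketch of the RecordReader idea for line-based text (hypothetical code; Hadoop's real LineRecordReader also handles records that straddle split boundaries, which is ignored here):

```python
def line_records(data, offset, length):
    # Turn the raw contents of one split into (key, value) pairs for the
    # Mapper: key = byte offset of the line, value = the line's text.
    chunk = data[offset:offset + length]
    pos = offset
    for line in chunk.splitlines(keepends=True):
        yield (pos, line.rstrip("\n"))
        pos += len(line)
```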
Partitioner
 Each mapper may emit (K, V) pairs to any partition
 Therefore, the map nodes must all agree on where to send different pieces of
  intermediate data
 The partitioner class determines which partition a given (K, V) pair will go to
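For example, a hash partitioner sketch (Hadoop's default is the Java HashPartitioner, based on the key's hashCode; the version below is only an analogue):

```python
import zlib

def partition(key, num_reduce_tasks):
    # Every map node computes the same deterministic function of the key,
    # so all (K, V) pairs with the same key reach the same reduce partition.
    # CRC32 is used because Python's built-in hash() is salted per process
    # and would disagree across nodes.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks
```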
Sort
 Each Reducer is responsible for reducing the values associated with (several)
  intermediate keys
 The set of intermediate keys on a single node is automatically sorted before
  being presented to the Reducer
OutputFormat
Questions?
Exercise
MapReduce Use Case: Word Length
                                      Big 37
                                      Medium 148
                                      Small 200
                                      Tiny 9
MapReduce Use Case: Word Length
   Map output (one pair per word):   Big 1 / Big 1 / Big 1 / ...
                                     Medium 1 / Medium 1 / ...
                                     Small 1 / Small 1 / Small 1 / ...
                                     Tiny 1 / Tiny 1 / Tiny 1 / ...
   Shuffle & sort (per key):         Big 1,1,1,1,...    Medium 1,1,1,...
                                     Small 1,1,1,1,...  Tiny 1,1,1,1,...
   Reduce output:                    Big 37   Medium 148   Small 200   Tiny 9
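A sketch of the map and reduce functions for this use case (the bucket boundaries below are an assumption chosen for illustration; the slides do not state them):

```python
def length_category(word):
    # Hypothetical buckets: tiny (1-2), small (3-5), medium (6-9), big (10+).
    n = len(word)
    if n <= 2:
        return "Tiny"
    if n <= 5:
        return "Small"
    if n <= 9:
        return "Medium"
    return "Big"

def wordlength_map(doc_id, text):
    # Emit (category, 1) for every word, e.g. ("Tiny", 1), ("Big", 1), ...
    return [(length_category(w), 1) for w in text.split()]

def wordlength_reduce(category, counts):
    # After shuffle & sort, each category sees all of its 1s: sum them.
    return [sum(counts)]
```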
MapReduce Use Case: Inverted Indexing
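The classic MapReduce formulation of inverted indexing: map emits (word, document id) pairs, and reduce collects each word's posting list. A minimal sketch (function names are mine):

```python
def index_map(doc_id, text):
    # Emit (word, doc_id) for every distinct word in the document.
    return [(word, doc_id) for word in sorted(set(text.split()))]

def index_reduce(word, doc_ids):
    # The posting list: all documents that contain this word, de-duplicated.
    return [sorted(set(doc_ids))]
```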
Sources & References