
UNIT -4

Map Reduce and Yarn


Topics
 Map Reduce and Yarn: Hadoop Map Reduce paradigm
 Map and Reduce tasks, Job and Task trackers, Mapper, Reducer,
 Map Reduce workflows, classic Map-reduce
 YARN - failures in classic Map-reduce and
 YARN - job scheduling - shuffle and sort - task execution
 Map Reduce types -input formats - output formats.
4.1 Hadoop Map Reduce paradigm:
The MapReduce programming model is a programming paradigm for processing Big Data sets in a parallel,
distributed environment using map and reduce tasks.
 Big Data processing employs the MapReduce programming model. A job is a MapReduce program.
Each job consists of several smaller units called MapReduce tasks.
 A software execution framework in MapReduce programming defines and runs the parallel tasks.
 The Hadoop MapReduce implementation is a Java framework.
1. Map Phase:
1. The input data is split into independent chunks (Input Splits).
2. Each chunk is processed by a Mapper task.
3. The Mapper outputs key-value pairs.
2. Shuffle and Sort:
1. After the Map phase, intermediate data is shuffled and sorted.
2. Data with the same key is grouped together to be processed by the Reducer.
3. Reduce Phase:
1. Reducers take the grouped key-value pairs from the Shuffle and Sort phase.
2. They perform aggregation or summarization and produce the final output.
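 As an illustration of these three phases, the following word-count sketch (a hypothetical Java example, not taken from these slides) shows a Mapper emitting (word, 1) pairs in the Map phase and a Reducer summing the values that arrive grouped by key after Shuffle and Sort:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: values for the same word arrive grouped; sum them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}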
 Job Client:
 The job client is the one who submits the job. A job contains the mapper function,
the reducer function, and the configuration that drives the job.
 Job Tracker: The job tracker is the master of the task trackers, which are the slaves
running on the data nodes. The job tracker's responsibility is to come up with the
execution plan and to coordinate and schedule that plan across the task trackers.
It also handles phase coordination.
 Task Tracker:
 The job is broken down into map and reduce tasks, and the task trackers are the ones
that run them.
 Every task tracker has a number of task slots. The job tracker assigns map and reduce
tasks to free slots, and those slots actually carry out the execution of the map and
reduce functions.
Working of MapReduce:
 When a client submits a job, the JobTracker and TaskTracker carry out the succeeding
actions described below.
 The data for a MapReduce job initially resides in input files. The input files typically
live in HDFS.
 The files may be line-based log files, binary files, multi-line input records,
or something else entirely. In practice these input files are very large,
often hundreds of terabytes or more.
 JobTracker and TaskTracker: MapReduce consists of a single master
JobTracker and one slave TaskTracker per cluster node.
 The master is responsible for scheduling a job's component tasks onto the
slaves, monitoring them, and re-executing failed tasks. The slaves execute the
tasks as directed by the master.
4.2 Map Reduce workflows:
• Input: This is the input data / file to be processed.
• Split: Hadoop splits the incoming data into smaller pieces called “splits”.
• Map: In this step, MapReduce processes each split according to the logic defined in map()
function. Each mapper works on each split at a time. Each mapper is treated as a task and
multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
• Combine: This is an optional step and is used to improve the performance by reducing the
amount of data transferred across the network. Combiner is the same as the reduce step and is
used for aggregating the output of the map() function before it is passed to the subsequent
steps.
• Shuffle & Sort: In this step, the outputs from all the mappers are shuffled, sorted to put them in
order, and grouped before being sent to the next step.
• Reduce: This step is used to aggregate the outputs of mappers using the reduce() function.
Output of reducer is sent to the next and final step. Each reducer is treated as a task and
multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
• Output: Finally, the output of the reduce step is written to a file in HDFS.
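• The workflow above can be wired together in a small driver program. The sketch below is a hypothetical example using the new MapReduce API; it assumes the WordCount Mapper and Reducer shown earlier and reuses the Reducer as the optional Combiner:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");          // a job is one MapReduce program
    job.setJarByClass(WordCountDriver.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // Input / Split
    job.setMapperClass(WordCount.TokenizerMapper.class);    // Map
    job.setCombinerClass(WordCount.IntSumReducer.class);    // Combine (optional)
    job.setReducerClass(WordCount.IntSumReducer.class);     // Reduce
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output (directory must not already exist)

    System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit the job and wait for completion
  }
}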
4.3 Classic Map-reduce:
A job run in classic MapReduce is illustrated in Figure. At the highest level, there
are four independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application
whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers
are Java applications whose main class is TaskTracker.
• The distributed filesystem (normally HDFS), which is used for sharing job files
between the other entities.
1. Job Submission:
 The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal()
on it (step 1 in Figure).
 The job submission process implemented by JobSubmitter does the following:
  Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
  Computes the input splits for the job. Copies the resources needed to run the job, including the job
JAR file, the configuration file, and the computed input splits, to the jobtracker’s filesystem in a
directory named after the job ID. (step 3).
 Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).
2. Job Initialization:
 When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue from
where the job scheduler will pick it up and initialize it. Initialization involves creating an object to
represent the job being run (step 5).
 To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the client
from the shared filesystem (step 6). It then creates one map task for each split.
3. Task Assignment:
 Tasktrackers run a simple loop that periodically sends heartbeat method calls to
the jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive.
 As a part of the heartbeat, a tasktracker will indicate whether it is ready to run a
new task, and if it is, the jobtracker will allocate it a task, which it communicates
to the tasktracker using the heartbeat return value (step 7).
4. Task Execution:
 Now that the tasktracker has been assigned a task, the next step is for it to run the
task. First, it localizes the job JAR by copying it from the shared filesystem to the
tasktracker’s filesystem. It also copies any files needed by the application from the
distributed cache to the local disk (step 8).
 TaskRunner launches a new Java Virtual Machine (step 9) to run each task in it
(step 10).
5. Progress and Status Updates:
 MapReduce jobs are long-running batch jobs, taking anything from minutes to
hours to run. Because this is a significant length of time, it’s important for the
user to get feedback on how the job is progressing. A job and each of its tasks
have a status.
 When a task is running, it keeps track of its progress, that is, the proportion of the
task completed.
6. Job Completion:
 When the jobtracker receives a notification that the last task for a job is complete
(this will be the special job cleanup task), it changes the status for the job to
“successful.”
4.4 Failures in Classic MapReduce:
 In the MapReduce 1 runtime there are three failure modes to consider:
1. failure of the running task,
2. failure of the tasktracker, and
3. failure of the jobtracker.
1. Task Failure:
 Consider first the case of the child task failing. The most common way that this happens is when user
code in the map or reduce task throws a runtime exception. If this happens, the child JVM reports the
error back to its parent tasktracker, before it exits.
 The error ultimately makes it into the user logs. The tasktracker marks the task attempt as failed,
freeing up a slot to run another task.
 When the jobtracker is notified of a task attempt that has failed (by the tasktracker’s heartbeat call), it
will reschedule execution of the task. The jobtracker will try to avoid rescheduling the task on a
tasktracker where it has previously failed.
2. Tasktracker Failure:
 Failure of a tasktracker is another failure mode. If a tasktracker fails by crashing, or running very
slowly, it will stop sending heartbeats to the jobtracker (or send them very infrequently).
 The jobtracker will notice a tasktracker that has stopped sending heartbeats (if it hasn’t received
one for 10 minutes, configured via the mapred.tasktracker.expiry.interval property, in
milliseconds) and remove it from its pool of tasktrackers to schedule tasks on.
 A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not failed. If
more than four tasks from the same job fail on a particular tasktracker then the jobtracker records
this as a fault.
 Blacklisted tasktrackers are not assigned tasks, but they continue to communicate with the
jobtracker. Faults expire over time (at the rate of one per day), so tasktrackers get the chance to
run jobs again simply by leaving them running.
3. Jobtracker Failure:
 Failure of the jobtracker is the most serious failure mode. Hadoop has no mechanism for dealing
with failure of the jobtracker—it is a single point of failure—so in this case the job fails.
 However, this failure mode has a low chance of occurring, since the chance of a particular
machine failing is low. The good news is that the situation is improved in YARN, since one of its
design goals is to eliminate single points of failure in Map Reduce.
 After restarting a jobtracker, any jobs that were running at the time it was stopped will need to be
re-submitted. There is a configuration option that attempts to recover any running job
(mapred.jobtracker.restart.recover, turned off by default), but it is known not to work
reliably, so it should not be used.
4.5 YARN Architecture:
 YARN stands for Yet Another Resource Negotiator. It has two major
responsibilities:
1. Management of cluster resources such as compute, network, and memory
2. Scheduling and monitoring of jobs
 YARN achieves these goals through two long-running daemons:
1. Resource Manager
2. Node Manager
The two components work in a master-slave relationship, where the Resource
Manager (RM) is the master and the Node Managers are the slaves.
A single Resource Manager runs in the cluster, with one Node Manager per
machine. Together, these two components make up the data-computation
framework. The individual components are discussed below.
• 1. Client: It submits map-reduce jobs.
• 2. Resource Manager: It is the master daemon of YARN and is responsible for
resource assignment and management among all the applications. Whenever it
receives a processing request, it forwards it to the corresponding node manager
and allocates resources for the completion of the request accordingly.
• It has two major components:
• Scheduler: It performs scheduling based on the requirements of the applications and the
available resources. It is a pure scheduler, meaning it does not perform other tasks
such as monitoring or tracking and does not guarantee a restart if a task fails. The
YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler
to partition the cluster resources.
• Application Manager: It is responsible for accepting the application and
negotiating the first container for running the application's Application Master. It also
restarts the Application Master container if it fails.
3. Node Manager: It takes care of an individual node in the Hadoop cluster and manages
the applications and workflow on that particular node. Its primary job is to keep up with
the Resource Manager. It registers with the Resource Manager and sends heartbeats
with the health status of the node. It monitors resource usage, performs log
management and also kills a container based on directions from the Resource Manager.
4. Application Master: An application is a single job submitted to a framework. The
application master is responsible for negotiating resources with the resource manager,
tracking the status and monitoring progress of a single application.
The application master asks the node manager to launch a container by sending it a
Container Launch Context (CLC), which includes everything the application needs to
run. Once the application is started, it sends a health report to the resource manager
from time to time.
5. Container: It is a collection of physical resources such as RAM, CPU cores and
disk on a single node. A container is launched using a Container Launch Context (CLC),
which is a record that contains information such as environment variables, security
tokens, dependencies, etc.
4.6 YARN architecture for running a Map Reduce Job:
MapReduce on YARN involves more entities than classic MapReduce.
They are:
 The client, which submits the MapReduce job.
 The YARN resource manager, which coordinates the allocation of compute
resources on the cluster.
 The YARN node managers, which launch and monitor the compute containers on
machines in the cluster.
 The MapReduce application master, which coordinates the tasks running the
MapReduce job. The application master and the MapReduce tasks run in
containers that are scheduled by the resource manager, and managed by the node
managers.
 The distributed filesystem which is used for sharing job files between the other
entities.
1. Job Submission
2. Job Initialization
3. Task Assignment
4. Task Execution
5. Job Completion
1. Job Submission:
The job submission process implemented by JobSubmitter does the following:
• Asks the resource manager for a new application ID, used for the MapReduce job ID (step 2).
• Checks the output specification of the job. For example, if the output directory has not been specified or it already
exists, the job is not submitted, and an error is thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed (because the input paths don’t exist, for
example), the job is not submitted, and an error is thrown to the MapReduce program.
• Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input
splits, to the shared filesystem in a directory named after the job ID (step 3). The job JAR is copied with a high
replication factor so that there are lots of copies across the cluster for the node managers to access when they run
tasks for the job.
• Submits the job by calling submitApplication() on the resource manager (step 4).
 2. Job Initialization:
• When the resource manager receives a call to its submitApplication() method, it hands off the
request to the YARN scheduler. The scheduler allocates a container, and the resource manager
then launches the application master’s process.
• The application master for MapReduce jobs is a Java application. It initializes the job by
creating a number of bookkeeping objects to keep track of the job’s progress, as it will receive
progress and completion reports from the tasks (step 6).
• Next, it retrieves the input splits computed in the client from the shared filesystem (step 7). It
then creates a map task object for each split, as well as a number of reduce task. Tasks are
given IDs at this point.
 3. Task Assignment:
• If the job does not qualify for running as an uber task (a small job that the application master
runs in its own JVM), then the application master requests containers for all the map and
reduce tasks in the job from the resource manager (step 8).
• Requests for map tasks are made first and with a higher priority than those for reduce tasks,
since all the map tasks must complete before the sort phase of the reduce can start. Requests
for reduce tasks are not made until 5% of map tasks have completed.
• Reduce tasks can run anywhere in the cluster, but requests for map tasks have data locality
constraints that the scheduler tries to honor.
 4. Task Execution:
• Once a task has been assigned resources for a container on a particular node by the resource
manager’s scheduler, the application master starts the container by contacting the node manager
(steps 9a and 9b).
• The task is executed by a Java application whose main class is YarnChild. Before it can run the
task, it localizes the resources that the task needs, including the job configuration and JAR file,
and any files from the distributed cache.
• Finally, it runs the map or reduce task (step 11).
 5. Job Completion:
• When the application master receives a notification that the last task for a job is complete, it
changes the status for the job to “successful”.
• Then, when the Job polls for status, it learns that the job has completed successfully, so it prints
a message to tell the user and then returns from the waitForCompletion() method.
4.7 Job Scheduling:
1. FIFO Scheduler:
 First In First Out is the default scheduling policy used in Hadoop. The FIFO
Scheduler gives more preference to applications submitted earlier than to those
submitted later. It places the applications in a queue and executes them in the order
of their submission (first in, first out).
 Advantage:
• It is simple to understand and doesn’t need any configuration.
• Jobs are executed in the order of their submission.
 Disadvantage:
• It is not suitable for shared clusters. If the large application comes before the
shorter one, then the large application will use all the resources in the cluster, and
the shorter application has to wait for its turn. This leads to starvation.
• It does not take into account the balance of resource allocation between the long
applications and short applications.
2. Capacity Scheduler:
 The Capacity Scheduler allows multiple tenants to securely share a large Hadoop
cluster. It is designed to run Hadoop applications in a shared, multi-tenant cluster
while maximizing the throughput and the utilization of the cluster.
 It supports hierarchical queues to reflect the structure of the organizations or groups
that utilize the cluster resources. A queue hierarchy contains three types of
queues: root, parent, and leaf.
 Advantages:
• It maximizes the utilization of resources and throughput in the Hadoop cluster.
• Provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organizations utilizing the
cluster.
 Disadvantage:
• It is the most complex of the schedulers.
3.The Fair Scheduler:
 The Fair Scheduler aims to give every user a fair share of the cluster capacity
over time.
 If a single job is running, it gets all of the cluster. As more jobs are submitted,
free task slots are given to the jobs in such a way as to give each user a fair share
of the cluster.
 A short job belonging to one user will complete in a reasonable time even while
another user’s long job is running, and the long job will still make progress.
 Jobs are placed in pools, and by default, each user gets their own pool. A user
who submits more jobs than a second user will not get any more cluster resources
than the second, on average. It is also possible to define custom pools with
guaranteed minimum capacities defined in terms of the number of map and
reduce slots, and to set weightings for each pool.
 Advantages:
• It provides a reasonable way to share the Hadoop cluster between a number of users.
• Also, the Fair Scheduler can work with application priorities, where the priorities
are used as weights in determining the fraction of the total resources that each
application should get.
 Disadvantage:
• It requires configuration.
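• As a small illustration of working with these schedulers, the sketch below directs a job to a particular Capacity or Fair Scheduler queue at submission time. The property name mapreduce.job.queuename is the standard Hadoop 2 setting and the queue name analytics is hypothetical; neither comes from these slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmission {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Choose the scheduler queue this job should be submitted to ("analytics" is hypothetical).
    conf.set("mapreduce.job.queuename", "analytics");
    Job job = Job.getInstance(conf, "queued job");
    // ... set the mapper, reducer, input and output paths as in the earlier driver ...
  }
}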
4.8 Shuffle and Sort:
 MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by
which the system performs the sort—and transfers the map outputs to the reducers as inputs—is
known as the shuffle.
1. The Map Side
 When the map function starts producing output, it is not simply written to disk. Each map task
has a circular memory buffer that it writes the output to. The buffer is 100 MB by default, a size
which can be tuned by changing the io.sort.mb property.
 When the contents of the buffer reach a certain threshold size (io.sort.spill.percent, default 0.80,
or 80%), a background thread will start to spill the contents to disk.
 Map outputs will continue to be written to the buffer while the spill takes place, but if the buffer
fills up during this time, the map will block until the spill is complete.
 Spills are written in round-robin fashion to the directories specified by the mapred.local.dir
property, in a job-specific subdirectory.
2.The Reduce Side :
 Let’s turn now to the reduce part of the process.
 The map output file is sitting on the local disk of the machine that ran the map task, but now it is
needed by the machine that is about to run the reduce task for the partition. The reduce task needs
the map output for its particular partition from several map tasks across the cluster.
 This is the copy phase of the reduce task. The reduce task has a small number of copier threads so that it
can fetch map outputs in parallel. The default is five threads, but this number can be changed by
setting the mapred.reduce.parallel.copies property.
 The map outputs are copied to the reduce task JVM’s memory if they are small enough; otherwise,
they are copied to disk. When the in-memory buffer reaches a threshold size, or reaches a
threshold number of map outputs (mapred.inmem.merge.threshold), it is merged and spilled to
disk.
 When all the map outputs have been copied, the reduce task moves into the merge phase, which
merges the map outputs, maintaining their sort ordering.
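 The buffer and copier settings mentioned above can be tuned per job. The sketch below simply sets the properties named in this section (classic MR1 names); the values shown are illustrative, not recommendations.

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("io.sort.mb", 200);                    // map-side circular buffer size (default 100 MB)
    conf.setFloat("io.sort.spill.percent", 0.80f);     // spill threshold (default 0.80, i.e. 80%)
    conf.setInt("mapred.reduce.parallel.copies", 10);  // reduce-side copier threads (default 5)
    conf.setInt("mapred.inmem.merge.threshold", 1000); // map outputs held in memory before merge/spill
  }
}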
4.9 Failures in YARN (MapReduce 2):
 1. Task Failures:Failure of the running task is similar to the classic case.
Runtime exceptions and sudden exits of the JVM are propagated back to the
application master and the task attempt is marked as failed.
 The configuration properties for determining when a task is considered to be
failed are the same as the classic case: a task is marked as failed after four
attempts.
 2. Application Master Failure: An application master sends periodic
heartbeats to the resource manager, and in the event of application master failure,
the resource manager will detect the failure and start a new instance of the master
running in a new container (managed by a node manager).
 In the case of the MapReduce application master, it can recover the state of the
tasks that had already been run by the (failed) application so they don’t have to
be rerun.
 3. Node Manager Failure:
 If a node manager fails, then it will stop sending heartbeats to the resource manager, and the node
manager will be removed from the resource manager’s pool of available nodes.
 The property yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms, which defaults to 600000
(10 minutes), determines the minimum time the resource manager waits before considering a node
manager that has sent no heartbeat in that time as failed.
 Node managers may be blacklisted if the number of failures for the application is high. Blacklisting is
done by the application master, and for MapReduce the application master will try to reschedule tasks
on different nodes if more than three tasks fail on a node manager.
 4. Resource Manager Failure:
 Failure of the resource manager is serious, since without it neither jobs nor task containers can be
launched.
 After a crash, a new resource manager instance is brought up (by an administrator) and it recovers from
the saved state. The state consists of the node managers in the system as well as the running
applications.
4.10 TASK EXECUTION:
1. The Task Execution Environment
2. Speculative Execution
3. Output Committers
 1. The Task Execution Environment:
 Hadoop provides information to a map or reduce task
about the environment in which it is running. For
example, a map task can discover the name of the file
it is processing, and a map or reduce task can find out
the attempt number of the task.
 Task execution properties, such as the name of the input file and the task attempt ID, are exposed to the task through its context and configuration, as sketched below.
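 A minimal sketch (assuming the new MapReduce API; the class name and output types are hypothetical) of a map task discovering the file it is processing and its own task attempt ID:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class EnvironmentAwareMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  @Override
  protected void setup(Context context) {
    // Name of the file backing this task's input split (when the split is a FileSplit).
    String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
    // The task attempt ID includes the attempt number.
    String attemptId = context.getTaskAttemptID().toString();
    System.out.println("Processing " + fileName + " as attempt " + attemptId);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Trivial map logic: emit the byte offset of every line under a constant key.
    context.write(new Text("offset"), key);
  }
}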
2. Speculative Execution:
 The MapReduce model is to break jobs into tasks and run the tasks in parallel to make the overall
job execution time smaller than it would otherwise be if the tasks ran sequentially.
 This makes job execution time sensitive to slow-running tasks, as it takes only one slow task to
make the whole job take significantly longer than it would have done otherwise. When a job
consists of hundreds or thousands of tasks, the possibility of a few straggling tasks is very real.
 Tasks may be slow for various reasons, including hardware degradation or software mis-
configuration, but the causes may be hard to detect since the tasks still complete successfully,
albeit after a longer time than expected. Hadoop doesn’t try to diagnose and fix slow-running
tasks; instead, it tries to detect when a task is running slower than expected and launches another,
equivalent, task as a backup. This is termed speculative execution of tasks.
 A speculative task is launched only after all the tasks for a job have been launched, and then only
for tasks that have been running for some time (at least a minute) and have failed to make as
much progress, on average, as the other tasks from the job.
 When a task completes successfully, any duplicate tasks that are running are killed since they are
no longer needed. So if the original task completes before the speculative task, then the
speculative task is killed; on the other hand, if the speculative task finishes first, then the original
is killed.
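 Speculative execution is enabled by default and can be turned off per job. The property names below are the standard classic (MR1) settings; they are not listed on these slides.

import org.apache.hadoop.conf.Configuration;

public class SpeculationSettings {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Disable speculative execution for the map and reduce tasks of this job (both default to true).
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
  }
}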
3.Output Committers:
 Hadoop MapReduce uses a commit protocol to ensure that jobs and tasks either succeed, or fail
cleanly. The behavior is implemented by the OutputCommitter in use for the job, and this is set
in the old MapReduce API by calling the setOutputCommitter() on JobConf, or by setting
mapred.output.committer.class in the configuration.
 In the new MapReduce API, the OutputCommitter is determined by the OutputFormat, via its
getOutputCommitter() method. The default is FileOutputCommitter, which is appropriate for
file-based MapReduce.
 The setupJob() method is called before the job is run, and is typically used to perform initialization.
For FileOutputCommitter the method creates the final output directory, ${mapred.output.dir}, and a
temporary working space for task output, ${mapred.output.dir}/_temporary.
 If the job succeeds, then the commitJob() method is called, which in the default file-based
implementation deletes the temporary working space and creates a hidden empty marker file in the
output directory called _SUCCESS to indicate to filesystem clients that the job completed
successfully.
 If the job did not succeed, then abortJob() is called with a state object indicating whether the job
failed or was killed (by a user, for example). In the default implementation this will delete the job’s
temporary working space.
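 A minimal sketch of the old-API call mentioned above, explicitly selecting the default FileOutputCommitter (the class name CommitterSetup is hypothetical):

import org.apache.hadoop.mapred.FileOutputCommitter;
import org.apache.hadoop.mapred.JobConf;

public class CommitterSetup {
  public static void main(String[] args) {
    JobConf conf = new JobConf(CommitterSetup.class);
    // Equivalent to setting mapred.output.committer.class in the configuration.
    conf.setOutputCommitter(FileOutputCommitter.class);
  }
}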
4.11 Map Reduce Types:
 The map and reduce functions in Hadoop MapReduce have the following general form:
 map: (K1, V1) → list(K2, V2)
 reduce: (K2, list(V2)) → list(K3, V3)
 In general, the map input key and value types (K1 and V1) are different from the map output
types (K2 and V2). However, the reduce input must have the same types as the map output,
although the reduce output types may be different again (K3 and V3).
 If a combine function is used, then it has the same form as the reduce function (and is an
implementation of Reducer), except that its output types are the intermediate key and value types (K2
and V2), so they can feed the reduce function:
 map: (K1, V1) → list(K2, V2)
 combine: (K2, list(V2)) → list(K2, V2)
 reduce: (K2, list(V2)) → list(K3, V3)
 Often the combine and reduce functions are the same, in which case, K3 is the same as K2, and
V3 is the same as V2.
1. Input Formats: (Figure: Input Format class hierarchy diagram)
 Hadoop can process many different types of data formats, from flat text files to databases.
 The Relationship Between Input Splits and HDFS Blocks: the figure shows an example. A single file is
broken into lines, and the line boundaries do not correspond with the HDFS block boundaries. Splits honor
logical record boundaries, in this case lines, so we see that the first split contains line 5, even though it spans
the first and second block. The second split starts at line 6.
1. TextInputFormat :
 TextInputFormat is the default InputFormat. Each record is a line of input. The key, a LongWritable, is the
byte offset within the file of the beginning of the line. The value is the contents of the line.
 So a file containing the following text:
 On the top of the Crumpetty Tree
 The Quangle Wangle sat,
 But his face you could not see,
 On account of his Beaver Hat.
 With KeyValueTextInputFormat, where each line is interpreted as a key and a value separated by a tab
(for example, keys line1 to line4), the input is again a single split comprising four records, but this time
the keys are the Text sequences before the tab in each line:
 (line1, On the top of the Crumpetty Tree)
 (line2, The Quangle Wangle sat,)
 (line3, But his face you could not see,)
 (line4, On account of his Beaver Hat.)
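 A minimal sketch (assuming the new MapReduce API) of selecting KeyValueTextInputFormat for a job:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // The part of each line before the tab becomes the key; the rest becomes the value.
    job.setInputFormatClass(KeyValueTextInputFormat.class);
  }
}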
2. NLineInputFormat:
 With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of
input. The number depends on the size of the split and the length of the lines.
 N refers to the number of lines of input that each mapper receives. With N set to one (the default), each
mapper receives exactly one line of input.
 On the top of the Crumpetty Tree
 The Quangle Wangle sat,
 But his face you could not see,
 On account of his Beaver Hat.
 If, for example, N is two, then each split contains two lines. One mapper will receive the first two key-value
pairs:
 (0, On the top of the Crumpetty Tree)
 (33, The Quangle Wangle sat,)
 And another mapper will receive the second two key-value pairs:
 (57, But his face you could not see,)
 (89, On account of his Beaver Hat.)
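 A minimal sketch (new MapReduce API) of selecting NLineInputFormat with N set to two, as in the example above:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setInputFormatClass(NLineInputFormat.class);
    // Each mapper will receive exactly two input lines per split.
    NLineInputFormat.setNumLinesPerSplit(job, 2);
  }
}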
3. Binary Input :
 SequenceFileInputFormat
 Hadoop’s sequence file format stores sequences of binary key-value pairs.
Sequence files are well suited as a format for MapReduce data since they are
splittable and they support compression as a part of the format.
 SequenceFileAsTextInputFormat
 SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that
converts the sequence file’s keys and values to Text objects.
 SequenceFileAsBinaryInputFormat
 SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that
retrieves the sequence file’s keys and values as opaque binary objects.
 4. Multiple Inputs:
 Although the input to a MapReduce job may consist of multiple input files (constructed by a combination of
file globs, filters, and plain paths), all of the input is interpreted by a single InputFormat and a single
Mapper.
 These cases are handled elegantly by using the MultipleInputs class, which allows you to specify the
InputFormat and Mapper to use on a per-path basis. For example, if we had weather data from the UK Met
Office that we wanted to combine with the NCDC data for our maximum temperature analysis, then we
might set up the input as follows:
MultipleInputs.addInputPath(job, ncdcInputPath,
TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);
 5. Database Input:
 DBInputFormat is an input format for reading data from a relational database, using JDBC. It is best used
for loading relatively small datasets, perhaps for joining with larger datasets from HDFS, using
MultipleInputs.
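 A hedged sketch of setting up DBInputFormat; the JDBC driver, connection URL, credentials, and table details are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

public class DatabaseInputSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // JDBC connection details (all values here are placeholders).
    DBConfiguration.configureDB(conf,
        "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/weather",
        "user", "password");
    Job job = Job.getInstance(conf, "db input");
    job.setInputFormatClass(DBInputFormat.class);
    // DBInputFormat.setInput(job, StationRecord.class, "stations", null, "id", "id", "name");
    // StationRecord would be a Writable/DBWritable class describing one row of the table.
  }
}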
2. Output Formats: (Figure: Output Format class hierarchy diagram)
1.Text Output:
 The default output format, TextOutputFormat, writes records as lines of text. Its
keys and values may be of any type, since TextOutputFormat turns them to
strings by calling toString() on them.
 Each key-value pair is separated by a tab character. The counterpart to
TextOutputFormat for reading in this case is KeyValueTextInputFormat.
2.Binary Output:
 SequenceFileOutputFormat: As the name indicates,
SequenceFileOutputFormat writes sequence files for its output.
 SequenceFileAsBinaryOutputFormat: SequenceFileAsBinaryOutputFormat is
the counterpart to SequenceFileAsBinaryInput Format, and it writes keys and
values in raw binary format into a SequenceFile container.
 MapFileOutputFormat:
 MapFileOutputFormat writes MapFiles as output. The keys in a MapFile must be
added in order, so you need to ensure that your reducers emit keys in sorted order.
3. Multiple Outputs:
 FileOutputFormat and its subclasses generate a set of files in the output directory.
There is one file per reducer, and files are named by the partition number: part-r-00000,
part-r-00001, etc. There is sometimes a need to have more control over the
naming of the files or to produce multiple files per reducer.
4. Lazy Output:
 FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they
are empty. Some applications prefer that empty files not be created, which is
where LazyOutputFormat helps.
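 A minimal sketch (new MapReduce API) of wrapping the real output format in LazyOutputFormat so that empty part files are not created:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // Part files are only created when the first record is actually written.
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
  }
}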
5. Database Output:
 There are output formats for writing to relational databases and to HBase.
DBOutputFormat is useful for dumping job outputs (of modest size) into
a database.
