
UNIT-3 INTRODUCTION TO BIG DATA: Map-Reduce Programming

Unit-3
1. Map-Reduce Programming: Developing distributed programs and issues
2. Why Map-Reduce and conceptual understanding of Map-Reduce programming
3. Developing Map-Reduce programs in Java
4. Setting up the cluster with HDFS and understanding how Map-Reduce works on
HDFS
5. Running a simple word count Map-Reduce program on the cluster
6. Additional examples of M-R programming.

1. DEVELOPING DISTRIBUTED PROGRAMS AND ISSUES

Scalability: Scaling is one of the major issues in developing distributed programs. The scaling issue
spans several dimensions, such as communication capacity. The program should be designed so that
capacity can be increased as demand on the system grows.

Heterogeneity: Heterogeneity is an important design issue for distributed programming. The communication
infrastructure consists of channels of different capacities, and end systems use a wide variety of
data-representation techniques.

Object representation and translation: Selecting the best programming model for distributed
objects, such as CORBA or Java, is an important issue.

Resource management: In a distributed program, resources are located at different places. Routing is an
issue both at the network layer of the distributed system and at the application layer.

Security and privacy: How to apply security policies to an interdependent system is a major
issue in developing distributed programs. Since distributed programs deal with sensitive data and
information, they must have strong security and privacy measures. Protection of
distributed system assets, including base resources (storage, communications and user-interface I/O)
as well as higher-level composites of these resources (processes, files, messages, display
windows and more complex objects), is an important issue in distributed programs.

Transparency: To what extent should the distributed system appear to the
user as a single system? A distributed program must be designed to hide the complexity of the system
to a great extent.

Openness: To what extent should a system be designed using standard protocols to
support interoperability? It is desirable for developers to be able to add new features or replace subsystems in
the future. To accomplish this, a distributed program must have well-defined interfaces.

Quality of Service: How should the quality of service given to system users be specified, and what level
of quality of service is acceptable? The quality of service depends heavily on how
processes are allocated to processors in the system, on resource distribution, hardware, adaptability
of the program, the network, and so on. Good performance, availability and reliability are required to reflect
good quality of service.

Failure management: How can failures of the system be detected and repaired?

Synchronization: One of the most important issues facing engineers of distributed programs
is synchronizing computations consisting of thousands of components. Current methods of
synchronization, such as semaphores, monitors, barriers, remote procedure calls, object method
invocation, and message passing, do not scale well.

Resource identification: The resources are distributed across various computers, and a proper
naming scheme must be designed so that resources can be referenced exactly.

Communications: Distributed systems have become more effective with the advent of the Internet, but
they have demanding requirements for performance, reliability, etc. Effective approaches to
communication should be used.

Software architecture: The architecture reflects how application functionality is distributed over logical
components and across processors. Selecting the right architecture for an application is required
for good quality of service.

Performance analysis: Performance analysis of a distributed software program is a major issue. The
program is expected to be fast, fault tolerant and cost effective. Performance analysis is also essential for
evaluating alternative designs against the required quality of service. Moreover, the ability to estimate the
future performance of a large and complex distributed software program at design time can reduce
software cost and risk.

Generating test data: Generating test data that covers the respective test criteria for a
component is a difficult task. It becomes even more difficult for a distributed program because the
number of possible paths increases significantly. Test cases must cover the low-level elements.

Component selection for testing: Testing a distributed component requires the services of other
components. When a component is used with others, deadlocks and race conditions become possible.
No error may be detected when only one client is used, because only one thread
executes; with multithreading, using several clients to test the components may expose such errors.

Test sequence: A component has to be tested along with other components. In what order should components
be tested? If the components do not follow a layered architectural model,
there may be cycles among them.

Testing for system scalability and performance: Scalability of conventional test criteria is a
major issue in the context of testing. Threading may be used within components for
improved performance, but testing multithreaded components is a challenging task.

Redundant testing during integration of components: Components are first tested separately. When
the entire program is tested, a lot of component retesting occurs.

Availability of source code: Software components may be developed in-house or bought off-the-shelf.
Depending on the availability of source code, different testing techniques are used for system
testing.

Heterogeneous languages, platforms and architectures: Different languages may be used for
writing the components of the system, and the components may run on different hardware and
software platforms.

Monitoring and control mechanisms in testing distributed software: A distributed software system
involves multiple computers on a network. Testing monitoring and control mechanisms in a
distributed environment is complex compared with a centralized software system. Monitoring
distributed system services is also important for debugging during program development, and may be
required as part of the application itself, as in process control and automation.

Reproducibility of events: Reproducing events is a challenging task because of the concurrent
processing and asynchronous communication occurring in the distributed environment. Moreover, the
lack of full control over the environment is another hurdle in this regard.

Deadlocks and race conditions: Deadlocks and race conditions are other major issues while
developing distributed programs, especially in the context of testing. They become even more important
in shared-memory multiprocessor environments.

Testing for fault tolerance: The ability to tolerate faults is required in software for
applications such as nuclear plants, space missions, and medical equipment. Testing for fault tolerance is
challenging because fault-recovery code rarely gets executed during testing. Different fault-injection
techniques are used to test fault tolerance by injecting faults into the program under test.

Scheduling issues for distributed programs: These concern scheduling problems in homogeneous and
heterogeneous parallel distributed systems. The performance of distributed programs is affected by
broadcast/multicast processing, and a delivery procedure that completes the
processing in minimum time needs to be developed.

Controllability and observability issues: Controllability and observability are two important
issues in testing because they affect the capability of the test system to check the
conformance of an implementation under test. Controllability is the capability of the test system
to force the implementation under test to receive inputs in a given order; observability is the
capability to observe the corresponding outputs.

Distributed task allocation: Finding an optimal task allocation when developing a distributed program
is a challenging job, keeping reliability and performance in mind.

2. WHY MAP-REDUCE AND CONCEPTUAL UNDERSTANDING OF MAP-REDUCE PROGRAMMING

MapReduce is a programming model for writing applications that can process Big Data in
parallel on multiple nodes. MapReduce provides analytical capabilities for analyzing huge volumes
of complex data.

What is Big Data?

Big Data is a collection of large datasets that cannot be processed using traditional computing
techniques. For example, the volume of data that Facebook or YouTube needs to collect and
manage on a daily basis falls under the category of Big Data. However, Big Data is not only
about scale and volume; it also involves one or more of the following aspects: velocity, variety,
volume, and complexity.

Why MapReduce?

Traditional Enterprise Systems normally have a centralized server to store and process data. The
traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot
be accommodated by standard database servers. Moreover, the centralized system creates too much of
a bottleneck while processing multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task
into small parts and assigns them to many computers. Later, the results are collected at one place and
integrated to form the result dataset.

MapReduce programs work in two phases:

1. Map phase
2. Reduce phase.

The input to each phase is a set of key-value pairs. In addition, the programmer needs to specify two
functions: a map function and a reduce function.
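
Conceptually, the two user-supplied functions have the following shape (standard MapReduce notation, not code from this unit):

map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(k3, v3)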

The whole process goes through four phases of execution: splitting, mapping, shuffling, and reducing.

How MapReduce works

Let's understand this with an example.

Consider the following input data for your MapReduce program:

Welcome to Hadoop Class

Hadoop is good

Hadoop is bad

The final output of the MapReduce task is

bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome 1

The data goes through the following phases:

Input Splits:

The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is the
chunk of the input that is consumed by a single map task.

Mapping

This is the very first phase in the execution of a map-reduce program. In this phase, the data in each split is
passed to a mapping function to produce output values. In our example, the job of the mapping phase is to
count the number of occurrences of each word from the input splits (described above) and prepare a list in
the form of <word, frequency>.

Shuffling

This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from
the Mapping phase output. In our example, the same words are clubbed together along with their respective
frequencies.

Reducing

In this phase, output values from Shuffling phase are aggregated. This phase combines values from
Shuffling phase and returns a single output value. In short, this phase summarizes the complete
dataset.

In our example, this phase aggregates the values from the Shuffling phase, i.e., it calculates the total
occurrences of each word.
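
For the sample input above, the intermediate key-value pairs would look roughly as follows (the exact ordering within each phase is illustrative):

Mapping output:   <Welcome, 1> <to, 1> <Hadoop, 1> <Class, 1> <Hadoop, 1> <is, 1> <good, 1> <Hadoop, 1> <is, 1> <bad, 1>
Shuffling output: <bad, [1]> <Class, [1]> <good, [1]> <Hadoop, [1, 1, 1]> <is, [1, 1]> <to, [1]> <Welcome, [1]>
Reducing output:  <bad, 1> <Class, 1> <good, 1> <Hadoop, 3> <is, 2> <to, 1> <Welcome, 1>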

The overall process in detail

 One map task is created for each split which then executes map function for each record in
the split.
 It is always beneficial to have multiple splits, because the time taken to process a split is small
compared to the time taken to process the whole input. When the splits are smaller, the
processing is better load-balanced, since we process the splits in parallel.
 However, it is also not desirable to have splits that are too small. When splits are too small, the
overhead of managing the splits and of map task creation begins to dominate the total job
execution time.
 For most jobs, it is better to make split size equal to the size of an HDFS block (which is 64
MB, by default).
 Execution of a map task results in output being written to a local disk on the respective node, and
not to HDFS.
 The reason for choosing local disk over HDFS is to avoid the replication that takes place in the case
of an HDFS store operation.
 Map output is intermediate output which is processed by reduce tasks to produce the final
output.
 Once the job is complete, the map output can be thrown away, so storing it in HDFS with
replication would be overkill.
 In the event of node failure before the map output is consumed by the reduce task, Hadoop
reruns the map task on another node and re-creates the map output.
 Reduce tasks don't work on the concept of data locality. The output of every map task is fed to
the reduce task. Map output is transferred to the machine where the reduce task is running.
 On this machine the output is merged and then passed to the user defined reduce function.
 Unlike the map output, reduce output is stored in HDFS (the first replica is stored on the
local node and other replicas are stored on off-rack nodes), so writing the reduce output to HDFS with replication is justified.


How MapReduce Organizes Work?

Hadoop divides the job into tasks. There are two types of tasks:

1. Map tasks (Spilts & Mapping)


2. Reduce tasks (Shuffling, Reducing) as mentioned above.

The complete execution process (execution of Map and Reduce tasks, both) is controlled by two
types of entities called a

1. Jobtracker : Acts like a master (responsible for complete execution of submitted job)
2. Multiple Task Trackers : Acts like slaves, each of them performing the job

For every job submitted for execution in the system, there is one Jobtracker, which resides
on the Namenode, and there are multiple tasktrackers, which reside on Datanodes.

 A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
 It is the responsibility of the jobtracker to coordinate the activity by scheduling tasks to run on
different data nodes.
 Execution of an individual task is then looked after by the tasktracker, which resides on every data
node executing part of the job.
 The tasktracker's responsibility is to send progress reports to the jobtracker.
 In addition, the tasktracker periodically sends a 'heartbeat' signal to the jobtracker so as to notify
it of the current state of the system.
 Thus the jobtracker keeps track of the overall progress of each job. In the event of task failure, the
jobtracker can reschedule it on a different tasktracker.

3. DEVELOPING MAP-REDUCE PROGRAMS IN JAVA

Given below is the data regarding the electrical consumption of an organization. It contains the
monthly electrical consumption and the annual average for various years.

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45

If the above data is given as input, we have to write applications to process it and produce results
such as finding the year of maximum usage, the year of minimum usage, and so on. This is easy for
programmers when the number of records is finite: they will simply write the logic to produce the
required output and pass the data to the application.

But think of the data representing the electrical consumption of all the large-scale industries of a
particular state since its formation.

When we write applications to process such bulk data,

 they will take a lot of time to execute, and
 there will be heavy network traffic when we move data from the source to the network server, and
so on.

To solve these problems, we have the MapReduce framework.

Input Data

The above data is saved as sample.txt and given as input. The input file looks as shown
below.

1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45

Example Program

Given below is the program to process the sample data using the MapReduce framework.

package hadoop;

import java.util.*;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits


{
//Mapper class
public static class E_EMapper extends MapReduceBase implements
Mapper<LongWritable , /*Input key Type */
Text, /*Input value Type*/
Text, /*Output key Type*/
IntWritable> /*Output value Type*/
{

//Map function
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException
{
String line = value.toString();
String lasttoken = null;
StringTokenizer s = new StringTokenizer(line,"\t");
String year = s.nextToken();

while(s.hasMoreTokens())
{
lasttoken=s.nextToken();
}

int avgprice = Integer.parseInt(lasttoken);


output.collect(new Text(year), new IntWritable(avgprice));
}
}

//Reducer class
public static class E_EReduce extends MapReduceBase implements
Reducer< Text, IntWritable, Text, IntWritable >
{

//Reduce function
public void reduce( Text key, Iterator <IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter) throws
IOException
{
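// This reducer emits (year, value) for every value greater than maxavg (30).
// Since the mapper sends one value per year (the last column of each row),
// the job output lists the years whose annual average exceeds 30.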
int maxavg=30;
int val=Integer.MIN_VALUE;

while (values.hasNext())
{
if((val=values.next().get())>maxavg)
{
output.collect(key, new IntWritable(val));
}
}

}
}

//Main function
public static void main(String args[])throws Exception
{
JobConf conf = new JobConf(ProcessUnits.class);

conf.setJobName("max_eletricityunits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));


FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf);
}
}

Save the above program as ProcessUnits.java. The compilation and execution of the program is
explained below.

Compilation and Execution of Process Units Program

Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).

Follow the steps given below to compile and execute the above program.

Step 1

The following command is to create a directory to store the compiled java classes.

$ mkdir units

Step 2

Download the Hadoop core jar, which is used to compile and execute the MapReduce program (the compile
command below assumes hadoop-core-1.2.1.jar; use the core jar that matches your Hadoop installation). For
Hadoop 2.6.5, the release can be downloaded from
http://www-us.apache.org/dist/hadoop/common/hadoop-2.6.5/hadoop-2.6.5.tar.gz. Let us assume the
downloaded folder is /home/hadoop/.

Step 3

The following commands are used for compiling the ProcessUnits.java program and creating a jar
for the program.

$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java


$ jar -cvf units.jar -C units/ .

Step 4

The following command is used to create an input directory in HDFS.

$HADOOP_HOME/bin/hadoop fs -mkdir input_dir

Step 5

The following command is used to copy the input file named sample.txt into the input directory of
HDFS.

$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir

Step 6

The following command is used to verify the files in the input directory.

$HADOOP_HOME/bin/hadoop fs -ls input_dir/

Step 7

The following command is used to run the ProcessUnits application by taking the input files from the
input directory.

$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir

Wait for a while until the job finishes. After execution, as shown below, the output will contain
the number of input splits, the number of Map tasks, the number of reducer tasks, etc.

INFO mapreduce.Job: Job job_1414748220717_0002 completed successfully
14/10/31 06:02:52 INFO mapreduce.Job: Counters: 49
File System Counters

FILE: Number of bytes read=61
FILE: Number of bytes written=279400
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=546
HDFS: Number of bytes written=40
HDFS: Number of read operations=9

HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters

Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=146137
Total time spent by all reduces in occupied slots (ms)=441
Total time spent by all map tasks (ms)=14613
Total time spent by all reduce tasks (ms)=44120
Total vcore-seconds taken by all map tasks=146137
Total vcore-seconds taken by all reduce tasks=44120
Total megabyte-seconds taken by all map tasks=149644288
Total megabyte-seconds taken by all reduce tasks=45178880

Map-Reduce Framework

Map input records=5
Map output records=5
Map output bytes=45
Map output materialized bytes=67
Input split bytes=208
Combine input records=5
Combine output records=5
Reduce input groups=5
Reduce shuffle bytes=6
Reduce input records=5
Reduce output records=5
Spilled Records=10
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=948
CPU time spent (ms)=5160
Physical memory (bytes) snapshot=47749120
Virtual memory (bytes) snapshot=2899349504
Total committed heap usage (bytes)=277684224

File Output Format Counters

Bytes Written=40

Step 8

The following command is used to verify the resultant files in the output folder.

$HADOOP_HOME/bin/hadoop fs -ls output_dir/

Step 9

The following command is used to see the output in the Part-00000 file, which the MapReduce job writes to HDFS.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000

Below is the output generated by the MapReduce program.

1981 34
1984 40
1985 45

Step 10

The following command is used to copy the output folder from HDFS to the local file system for
analyzing.

$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop

Important Commands
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command. Running the
Hadoop script without any arguments prints the description for all commands.

Usage : hadoop [--config confdir] COMMAND

The following table lists the options available and their description.

Options Description
namenode -format Formats the DFS filesystem.
secondarynamenode Runs the DFS secondary namenode.
namenode Runs the DFS namenode.
datanode Runs a DFS datanode.
dfsadmin Runs a DFS admin client.
mradmin Runs a Map-Reduce admin client.
fsck Runs a DFS filesystem checking utility.
fs Runs a generic filesystem user client.
balancer Runs a cluster balancing utility.
oiv Applies the offline fsimage viewer to an fsimage.
fetchdt Fetches a delegation token from the NameNode.
jobtracker Runs the MapReduce job Tracker node.
pipes Runs a Pipes job.
tasktracker Runs a MapReduce task Tracker node.
historyserver Runs job history servers as a standalone daemon.
job Manipulates the MapReduce jobs.
queue Gets information regarding JobQueues.
version Prints the version.
jar <jar> Runs a jar file.
distcp <srcurl> <desturl> Copies file or directories recursively.
distcp2 <srcurl> <desturl> DistCp version 2.
archive -archiveName NAME -p <parent path> <src>* <dest> Creates a hadoop archive.
classpath Prints the class path needed to get the Hadoop jar and the
required libraries.
daemonlog Get/Set the log level for each daemon
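
For example, a few of these commands in use (the arguments shown are illustrative):

$HADOOP_HOME/bin/hadoop version
$HADOOP_HOME/bin/hadoop fsck /
$HADOOP_HOME/bin/hadoop job -list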


4. SETTING UP THE CLUSTER WITH HDFS AND UNDERSTANDING HOW MAP-REDUCE WORKS ON HDFS

a. Setting up SSH for a Hadoop cluster


The first step is to check whether SSH is installed on your nodes. We can easily do this by
use of the "which" UNIX command:
[hadoop-user@master]$ which ssh
/usr/bin/ssh
[hadoop-user@master]$ which sshd
/usr/bin/sshd
[hadoop-user@master]$ which ssh-keygen
/usr/bin/ssh-keygen
If you instead receive an error message such as this,

/usr/bin/which: no ssh in (/usr/bin:/bin:/usr/sbin...


install OpenSSH (www.openssh.com) via a Linux package manager or by downloading the
source directly. (Better yet, have your system administrator do it for you.)

Generate SSH key pair

Having verified that SSH is correctly installed on all nodes of the cluster, we use ssh-keygen
on the master node to generate an RSA key pair. Be certain to avoid entering a passphrase,
or you’ll have to manually enter that phrase every time the master node attempts to access
another node.

[hadoop-user@master]$ ssh-keygen -t rsa


Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop-user/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop-user/.ssh/id_rsa.
Your public key has been saved in /home/hadoop-user/.ssh/id_rsa.pub.
After creating your key pair, your public key will be of the form
[hadoop-user@master]$ more /home/hadoop-user/.ssh/id_rsa.pub ssh-rsa
AAAAB3NzaC1yc2EAAAABIwAAAQEA1WS3RG8LrZH4zL2/1oYgkV1OmVclQ2OO5
vRi0Nd51Sy3wWpBVHx82F3x3ddoZQjBK3uvLMaDhXvncJG31JPfU7CTAfmtgINYv0k
dUbDJq4TKG/fuO5q9CqHV71thN2M310gcJ0Y9YCN6grmsiWb2iMcXpy2pqg8UM3ZK
ApyIPx99O1vREWm+4moFTgYwIl5be23ZCyxNjgZFWk5MRlT1p1TxB68jqNbPQtU7fIa
fS7Sasy7h4eyIy7cbLh8x0/V4/mcQsY5dvReitNvFVte6onl8YdmnMpAh6nwCvog3UeWW
JjVZTEBFkTZuV1i9HeYHxpm1wAzcnf7az78jT IRQ== hadoop-user@master
and we next need to distribute this public key across your cluster.

Distribute public key and validate logins

Albeit a bit tedious, you’ll next need to copy the public key to every slave node as well as
the master node:
[hadoop-user@master]$ scp ~/.ssh/id_rsa.pub hadoop-user@target:~/master_key
Manually log in to the target node and set the master key as an authorized key (or append to
the list of authorized keys if you have others defined).
[hadoop-user@target]$ mkdir ~/.ssh
[hadoop-user@target]$ chmod 700 ~/.ssh
[hadoop-user@target]$ mv ~/master_key ~/.ssh/authorized_keys

[hadoop-user@target]$ chmod 600 ~/.ssh/authorized_keys
After generating the key, you can verify it’s correctly defined by attempting to log in to the
target node from the master:
[hadoop-user@master]$ ssh target
The authenticity of host 'target (xxx.xxx.xxx.xxx)' can’t be established.
RSA key fingerprint is 72:31:d8:1b:11:36:43:52:56:11:77:a4:ec:82:03:1d.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'target' (RSA) to the list of known hosts.
Last login: Sun Jan 4 15:32:22 2009 from master
After confirming the authenticity of a target node to the master node, you won’t be
prompted upon subsequent login attempts.
[hadoop-user@master]$ ssh target
Last login: Sun Jan 4 15:32:49 2009 from master
We’ve now set the groundwork for running Hadoop on your own cluster. Let’s discuss the
different Hadoop modes you might want to use for your projects.

Running Hadoop

We need to configure a few things before running Hadoop. Let’s take a closer look at the
Hadoop configuration directory:
[hadoop-user@master]$ cd $HADOOP_HOME
[hadoop-user@master]$ ls -l conf/
total 100
-rw-rw-r-- 1 hadoop-user hadoop 2065 Dec 1 10:07 capacity-scheduler.xml
-rw-rw-r-- 1 hadoop-user hadoop 535 Dec 1 10:07 configuration.xsl
-rw-rw-r-- 1 hadoop-user hadoop 49456 Dec 1 10:07 hadoop-default.xml
-rwxrwxr-x 1 hadoop-user hadoop 2314 Jan 8 17:01 hadoop-env.sh
-rw-rw-r-- 1 hadoop-user hadoop 2234 Jan 2 15:29 hadoop-site.xml
-rw-rw-r-- 1 hadoop-user hadoop 2815 Dec 1 10:07 log4j.properties
-rw-rw-r-- 1 hadoop-user hadoop 28 Jan 2 15:29 masters
-rw-rw-r-- 1 hadoop-user hadoop 84 Jan 2 15:29 slaves
-rw-rw-r-- 1 hadoop-user hadoop 401 Dec 1 10:07 sslinfo.xml.example

The first thing you need to do is specify the location of Java on all the nodes, including the master, in
hadoop-env.sh:

export JAVA_HOME=/usr/share/jdk

b. Operational Modes Of Hadoop

We have 3 operational modes for running Hadoop are,

1. Local (standalone) mode

2. Pseudo-distributed mode

3. Fully distributed mode

1. Local (standalone) mode

The standalone mode is the default mode for Hadoop. When you first uncompress
the Hadoop source package, it’s ignorant of your hardware setup. Hadoop chooses
to be conservative and assumes a minimal configuration. All three XML files (or
hadoop-site.xml before version 0.20) are empty under this default mode:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>

With empty configuration files, Hadoop will run completely on the local machine.
Because there’s no need to communicate with other nodes, the standalone mode
doesn’t use HDFS, nor will it launch any of the Hadoop daemons. Its primary use is
for developing and debugging the application logic of a MapReduce pro-gram
without the additional complexity of interacting with the daemons.

2. Pseudo-distributed mode

The pseudo-distributed mode is running Hadoop in a “cluster of one” with all daemons running on a
single machine. This mode complements the standalone mode for debugging your code, allowing you
to examine memory usage, HDFS input/output issues, and other daemon interactions. Listing 2.1
provides simple XML files to configure a single server in this mode.

Listing 2.1 Example of the three configuration files for pseudo-distributed mode
core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system. A
URI whose scheme and authority determine the
FileSystem implementation. </description>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
<description>The host and port that the MapReduce job
tracker runs at.</description>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->


<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>The actual number of replications can be specified
when the file is created.</description>
</property>
</configuration>

In core-site.xml and mapred-site.xml we specify the hostname and port of the
NameNode and the JobTracker, respectively. In hdfs-site.xml we specify the default
replication factor for HDFS, which should only be one because we’re running on
only one node. We must also specify the location of the Secondary NameNode in
the masters file and the slave nodes in the slaves file:
[hadoop-user@master]$ cat masters
localhost
[hadoop-user@master]$ cat slaves
localhost
While all the daemons are running on the same machine, they still communicate
with each other using the same SSH protocol as if they were distributed over a
cluster. For single-node operation simply check to see if your machine already
allows you to ssh back to itself.
[hadoop-user@master]$ ssh localhost
If it does, then you’re good. Otherwise setting up takes two lines.
[hadoop-user@master]$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
[hadoop-user@master]$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
You are almost ready to start Hadoop. But first you’ll need to format your HDFS by
using the command
[hadoop-user@master]$ bin/hadoop namenode -format
We can now launch the daemons by use of the start-all.sh script. The Java jps
command will list all daemons to verify the setup was successful.
[hadoop-user@master]$ bin/start-all.sh
[hadoop-user@master]$ jps
26893 Jps
26832 TaskTracker
26620 SecondaryNameNode
26333 NameNode
26484 DataNode
26703 JobTracker

3. Fully distributed mode

After continually emphasizing the benefits of distributed storage and distributed computation, it’s time
for us to set up a full cluster. In the discussion below we’ll use the following server names:
■ master—The master node of the cluster and host of the NameNode and JobTracker daemons
■ backup—The server that hosts the Secondary NameNode daemon
■ hadoop1, hadoop2, hadoop3, ...—The slave boxes of the cluster running both
DataNode and TaskTracker daemons

Listing 2.2 Example configuration files for fully distributed
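
(The listing itself is not reproduced in these notes; below is a representative sketch of the three site files for fully distributed mode, assuming the server names master and backup introduced above and a replication factor of 3.)

core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<description>The name of the default file system.</description>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
<description>The host and port of the MapReduce JobTracker.</description>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>The actual number of replications can be specified when the file is created.</description>
</property>
</configuration>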


The key differences are:

■ We explicitly state the hostname for the location of the NameNode and JobTracker daemons.
■ We increase the HDFS replication factor to take advantage of distributed storage. Recall that data is
replicated across HDFS to increase availability and reliability.
We also need to update the masters and slaves files to reflect the locations of the other daemons.
[hadoop-user@master]$ cat masters
backup
[hadoop-user@master]$ cat slaves
hadoop1
hadoop2
hadoop3
...
Once you have copied these files across all the nodes in your cluster, be sure to format
HDFS to prepare it for storage:
[hadoop-user@master]$ bin/hadoop namenode -format

Now you can start the Hadoop daemons:


[hadoop-user@master]$ bin/start-all.sh
and verify the nodes are running their assigned jobs.
[hadoop-user@master]$ jps
30879 JobTracker
30717 NameNode
30965 Jps
[hadoop-user@backup]$ jps
2099 Jps
1679 SecondaryNameNode
[hadoop-user@hadoop1]$ jps
7101 TaskTracker
7617 Jps
6988 DataNode
You have a functioning cluster!


5. RUNNING SIMPLE WORD COUNT MAP-REDUCE PROGRAM ON THE CLUSTER

WordCount is a simple application that counts the number of occurrences of each word in a given input set.

This works with a local-standalone, pseudo-distributed or fully-distributed Hadoop installation

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount


{

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>


{

private final static IntWritable one = new IntWritable(1);


private Text word = new Text();

public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
context.write(word, one);
}
}
}

public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable>


{
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,Context context ) throws


IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}

public static void main(String[] args) throws Exception


{
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);


job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Usage
Assuming environment variables are set as follows:

export JAVA_HOME=/usr/java/default
export PATH=${JAVA_HOME}/bin:${PATH}
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

Compile WordCount.java and create a jar:

$ bin/hadoop com.sun.tools.javac.Main WordCount.java


$ jar cf wc.jar WordCount*.class

Assuming that:

 /user/joe/wordcount/input - input directory in HDFS


 /user/joe/wordcount/output - output directory in HDFS

Sample text-files as input:

$ bin/hadoop fs -ls /user/joe/wordcount/input/
/user/joe/wordcount/input/file01
/user/joe/wordcount/input/file02

$ bin/hadoop fs -cat /user/joe/wordcount/input/file01


Hello World Bye World

$ bin/hadoop fs -cat /user/joe/wordcount/input/file02


Hello Hadoop Goodbye Hadoop

Run the application:

$ bin/hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output

Output:

$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000


Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2

Walk-through

The WordCount application is quite straight-forward.


public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}

The Mapper implementation, via the map method, processes one line at a time, as provided by the specified
TextInputFormat. It then splits the line into tokens separated by whitespaces, via the StringTokenizer, and emits a
key-value pair of < <word>, 1>.

For the given sample input the first map emits:

< Hello, 1>


< World, 1>
< Bye, 1>
< World, 1>

The second map emits:

< Hello, 1>


< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>

We’ll learn more about the number of maps spawned for a given job, and how to control them in a fine-grained
manner, a bit later in the tutorial.

job.setCombinerClass(IntSumReducer.class);

WordCount also specifies a combiner. Hence, the output of each map is passed through the local combiner (which is
the same as the Reducer, as per the job configuration) for local aggregation, after being sorted on the keys.

The output of the first map:

< Bye, 1>


< Hello, 1>
< World, 2>

The output of the second map:

< Goodbye, 1>


< Hadoop, 2>
< Hello, 1>

public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
{
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}

The Reducer implementation, via the reduce method, just sums up the values, which are the occurrence counts for
each key (i.e., the words in this example).

Thus the output of the job is:


< Bye, 1>


< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>

The main method specifies various facets of the job, such as the input/output paths (passed via the command line),
key/value types, input/output formats etc., in the Job. It then calls the job.waitForCompletion to submit the job and
monitor its progress.

6. ADDITIONAL EXAMPLES OF M-R PROGRAMMING.

Problem statement: I run a very busy website and need to pull the site down for an hour in order to apply some
patches and perform maintenance on the backend servers, which means the website will be completely unavailable for an hour.
While performing this activity, the primary concern is that the shutdown outage should affect the least number of
users. The game starts here: we need to identify at what hour of the day the web traffic is lowest for the website, so
that the maintenance activity can be scheduled for that time.
There is an Apache web server log for each day which records the activities happening on the website, but these are
huge files, up to 5 GB each.

Excerpt from Log file:


64.242.88.10 – – [07/Mar/2014:22:12:28 -0800] “GET /twiki/bin/attach/TWiki/WebSearch HTTP/1.1” 401 12846
64.242.88.10 – – [07/Mar/2014:22:15:57 -0800] “GET /mailman/listinfo/hs_rcafaculty HTTP/1.1” 200 6345

We are interested only in the date field i.e. [07/Mar/2014:22:12:28 -0800]


Solution: I need to consume the log files for one month and run my MapReduce code, which calculates the total number
of hits for each hour of the day. The hour which has the least number of hits is perfect for the downtime. It is as simple
as that!

A MapReduce program usually consists of the following 3 parts:

1. Mapper
2. Reducer
3. Driver

As the name itself states, the code is basically divided into two phases: one is Map and the second is
Reduce. Both phases have key-value pairs as input and output. The programmer has the liberty to choose the
data model for the input and output of both Map and Reduce. Depending upon the business problem, we need to use
the appropriate data model.

What does the Mapper do?

 The Map function reads the input files as key-value pairs, processes each, and generates zero or
more output key-value pairs.
 The Map class extends the Mapper class, which belongs to the org.apache.hadoop.mapreduce package.
 Class hierarchy: java.lang.Object : org.apache.hadoop.mapreduce.Mapper
 The input and output types of the map can be (and often are) different from each other.
 If the application is doing a word count, the map function would break the line into words and
output a key-value pair for each word. Each output pair would contain the word as the key and
the number of instances of that word in the line as the value.
 The Map function is also a good place to filter out any unwanted fields/data from the input file; we
take only the data we are interested in, to remove unnecessary workload.

I have used Hadoop 1.2.1 API, Java 1.7 to write this program.

package com.balajitk.loganalyzer;

import java.io.IOException;
import java.text.ParseException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.balajitk.loganalyzer.ParseLog;

public class LogMapper extends
        Mapper<LongWritable, Text, IntWritable, IntWritable> {

    private static Logger logger = LoggerFactory.getLogger(LogMapper.class);
    private IntWritable hour = new IntWritable();
    private final static IntWritable one = new IntWritable(1);
    private static Pattern logPattern = Pattern
            .compile("([^ ]*) ([^ ]*) ([^ ]*) \\[([^]]*)\\]"
                    + " \"([^\"]*)\""
                    + " ([^ ]*) ([^ ]*).*");

    public void map(LongWritable key, Text value, Context context)
            throws InterruptedException, IOException {
        logger.info("Mapper started");
        String line = ((Text) value).toString();
        Matcher matcher = logPattern.matcher(line);
        if (matcher.matches()) {
            String timestamp = matcher.group(4);
            try {
                hour.set(ParseLog.getHour(timestamp));
            } catch (ParseException e) {
                logger.warn("Exception", e);
            }
            context.write(hour, one);
        }
        logger.info("Mapper Completed");
    }
}
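
The mapper above relies on a ParseLog helper that is imported but not shown in these notes. A minimal sketch of such a helper, assuming Apache log timestamps of the form 07/Mar/2014:22:12:28 -0800 (this implementation is illustrative, not the author's original):

package com.balajitk.loganalyzer;

import java.text.ParseException;

public class ParseLog {

    // Returns the hour of the day (0-23) as recorded in an Apache log timestamp,
    // e.g. "07/Mar/2014:22:12:28 -0800" -> 22. The hour is the field that follows
    // the first ':' in the timestamp; no timezone conversion is applied.
    public static int getHour(String timestamp) throws ParseException {
        int colon = timestamp.indexOf(':');
        if (colon < 0 || colon + 3 > timestamp.length()) {
            throw new ParseException("Unexpected timestamp: " + timestamp, 0);
        }
        return Integer.parseInt(timestamp.substring(colon + 1, colon + 3));
    }
}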

The Mapper code written above processes a single record from the programmer's point of view. We never
write logic in MapReduce to deal with the entire data set; the framework is responsible for applying the
code to the entire data set by converting it into the desired key-value pairs.

The Mapper class has four parameters that specifies the input key, input value, output key, and output
values of the Map function.

Mapper<LongWritable, Text, IntWritable, IntWritable>
Mapper<Input key, Input value, Output key, Output value>
Mapper<Offset of the input file, Single line of the file, Hour of the day, Integer one>

Hadoop provides its own set of basic types that are optimized for network serialization which can be found
in the org.apache.hadoop.io package.
In my program I have used LongWritable, which corresponds to a Java Long, Text (like Java String), and
IntWritable (like Java Integer). Mappers write their output using an instance of the Context class, which is
used to communicate in Hadoop.
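
As a small illustration of these Writable types (this snippet is illustrative, not part of the log-analyzer code):

Text word = new Text("hadoop");          // comparable to a Java String
IntWritable count = new IntWritable(1);  // comparable to a Java Integer
// Inside a Mapper or Reducer, context.write(word, count) emits the pair <hadoop, 1>.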

What does the Reducer do?

1. The Reducer code reads the outputs generated by the different mappers as key-value pairs and
emits key-value pairs of its own.
2. The Reducer reduces a set of intermediate values which share a key to a smaller set of values.
3. java.lang.Object : org.apache.hadoop.mapreduce.Reducer
4. The Reducer has 3 primary phases: shuffle, sort and reduce.
5. Each reduce call processes the intermediate values for a particular key generated by the map
function; each key is handled by exactly one reducer.
6. Multiple reducers run in parallel, as they are independent of one another. The number of reducers
for a job is decided by the programmer. By default, the number of reducers is 1.
7. The output of the reduce task is typically written to the FileSystem, via
OutputCollector.collect(WritableComparable, Writable) in the old API or Context.write in the new API used here.

package com.balajitk.loganalyzer;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogReducer extends
        Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {

    private static Logger logger = LoggerFactory.getLogger(LogReducer.class);

    public void reduce(IntWritable key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {

        logger.info("Reducer started");
        int sum = 0;
        for (IntWritable value : values) {
            sum = sum + value.get();
        }
        context.write(key, new IntWritable(sum));
        logger.info("Reducer completed");
    }
}

Four parameters are used in the Reducer to specify input and output; they define the types of the input and
output key-value pairs. The output of the map task is the input to the reduce task. The first two parameters
are the input key-value pair types from the map task; in our example, IntWritable and IntWritable.

Reducer<IntWritable, IntWritable, IntWritable, IntWritable>
Reducer<Input key, Input value, Output key, Output value>
Reducer<Hour of the day, List of counts, Hour, Total count for the hour>

What does the Driver do?

The Driver class is responsible for configuring and submitting the MapReduce job. The Job object allows you to
configure the Mapper, Reducer, InputFormat, OutputFormat, etc.

package com.balajitk.loganalyzer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogDriver {

    private static Logger logger = LoggerFactory.getLogger(LogDriver.class);

    public static void main(String[] args) throws Exception {
        logger.info("Code started");

        Job job = new Job();
        job.setJarByClass(LogDriver.class);
        job.setJobName("Log Analyzer");

        job.setMapperClass(LogMapper.class);
        job.setReducerClass(LogReducer.class);

        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
        logger.info("Code ended");
    }
}

Job control is performed through the Job class in the new API, rather than the old
JobClient, which no longer exists in the new API.
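
Assuming the three classes are compiled and packaged into a jar (the jar name and HDFS paths below are illustrative), the job is run like any other MapReduce job:

$ bin/hadoop jar loganalyzer.jar com.balajitk.loganalyzer.LogDriver /user/hadoop/logs/input /user/hadoop/logs/output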

Output: The job produces one line per hour of the day with the total number of hits for that hour; the hour
with the smallest count is the best candidate for the maintenance window.
