
Name: Chinmay Vasant Pichad

UID No.: 2020300053

Experiment No.: 2

AIM: MapReduce Implementation

Program 1

PROBLEM STATEMENT: Develop a MapReduce implementation in Hadoop to analyze large datasets efficiently, addressing scalability, fault tolerance, and performance challenges in distributed data processing.

THEORY: A conventional Word Count program reads, parses, and counts words sequentially on a single machine, whereas the Hadoop MapReduce version distributes these tasks across a cluster for parallel processing, improving both efficiency and scalability.
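For contrast, a minimal sequential version might look like the sketch below (illustrative only, not part of this experiment); it keeps every count in a single in-memory map, which is precisely what stops working once the dataset outgrows one machine:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class SequentialWordCount {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> counts = new HashMap<>();
        // Read the entire file into memory and count tokens one at a time.
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        // Print each word with its total count.
        counts.forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}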

MapReduce is a programming model and processing framework developed by Google to handle large-scale data processing tasks in a distributed computing environment. Hadoop, an open-source project, has become the de facto standard for implementing the MapReduce paradigm in big data processing.

The core idea behind MapReduce is to divide a large dataset into smaller chunks and
distribute them across a cluster of commodity hardware. The processing of data is
divided into two main phases: the Map phase and the Reduce phase.

● Map Phase: In this phase, the input data is split into smaller parts, and a mapping
function is applied to each chunk independently. This function transforms the
data into a set of key-value pairs, where the key is often used for grouping related
data.
● Shuffling and Sorting: After the Map phase, the framework shuffles and sorts the
generated key-value pairs to group data with the same key together. This step is
critical for the efficiency of the Reduce phase.
● Reduce Phase: In this phase, another function, the reducing function, is applied to
each group of key-value pairs. This function aggregates, filters, or processes the
data to produce the final output.
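
To make the three phases concrete, consider a hypothetical two-line input (not the experiment's actual dataset) flowing through the word count pipeline:

Input lines:        hello world
                    hello hadoop
Map output:         (hello, 1), (world, 1), (hello, 1), (hadoop, 1)
After shuffle/sort: (hadoop, [1]), (hello, [1, 1]), (world, [1])
Reduce output:      (hadoop, 1), (hello, 2), (world, 1)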

Hadoop provides a distributed file system called Hadoop Distributed File System
(HDFS) that stores data across multiple nodes in the cluster. It also manages the
distribution of Map and Reduce tasks, monitors task progress, and handles task failures,
ensuring fault tolerance.
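On a running cluster, the block replication that underpins this fault tolerance can be inspected directly; for example (an illustrative command, with a hypothetical path):

hdfs fsck /user/chinmay/input/words.txt -files -blocks -locations

This lists each block of the file together with the DataNodes holding its replicas.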
Key advantages of MapReduce using Hadoop include:

● Scalability: Hadoop can handle datasets of virtually any size by adding more
nodes to the cluster.
● Fault Tolerance: Hadoop automatically replicates data and tasks to ensure that if
a node fails, processing can continue on another node.
● Data Locality: Hadoop tries to process data on the same node where it resides,
reducing network overhead.
● Programming Flexibility: Developers can write Map and Reduce functions in various programming languages (for example via Hadoop Streaming), making it accessible to a wide range of users; see the streaming sketch after this list.
● Ecosystem: Hadoop has a rich ecosystem of tools and libraries for various data
processing tasks, including Hive, Pig, and Spark.
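
As an illustration of that programming flexibility, the same word count could be driven by external scripts through Hadoop Streaming (a hedged sketch; the jar location wildcard and the mapper.py/reducer.py script names are placeholders, not part of this experiment):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/chinmay/input \
    -output /user/chinmay/streaming-output

Streaming pipes each input record to the mapper script's stdin and reads its key-value output from stdout, so any executable can play the Mapper or Reducer role.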

CODE:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: tokenizes each input line and emits (word, 1) for every token.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // emit (word, 1)
            }
        }
    }

    // Reducer: sums the counts for each word grouped by the shuffle phase.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as a combiner, pre-aggregating counts on each map node
        // to cut down the data shuffled across the network.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
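
The job can be compiled, packaged, and submitted as sketched below (a hypothetical session assuming a working Hadoop installation with HADOOP_CLASSPATH set for compilation; all paths and file names are illustrative):

hadoop com.sun.tools.javac.Main WordCount.java
jar cf wc.jar WordCount*.class
hdfs dfs -mkdir -p /user/chinmay/input
hdfs dfs -put words.txt /user/chinmay/input
hadoop jar wc.jar WordCount /user/chinmay/input /user/chinmay/output
hdfs dfs -cat /user/chinmay/output/part-r-00000

Note that the output directory must not already exist; Hadoop creates it and writes one part-r-* file per reducer.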
OUTPUT AND RESULT:
CONCLUSION:

In conclusion, the Hadoop Word Count program showcases the transformative power of MapReduce by
enabling distributed, parallel processing of vast datasets. This approach greatly improves performance and
scalability compared to traditional sequential methods, making it a cornerstone in the world of big data
analytics.
REFERENCES:

● https://www.youtube.com/watch?v=6sK3LDY7Pp4

● https://www.projectpro.io/hadoop-tutorial/hadoop-mapreduce-wordcount-tutorial

● https://www.youtube.com/watch?v=WoZ2KSAfujQ
