Bda Exp2 Chinmay
Program 1
THEORY: A conventional Word Count program reads, parses, and counts words sequentially on a
single machine, whereas Hadoop's MapReduce distributes these tasks across a cluster for
parallel processing, improving efficiency and scalability.
The core idea behind MapReduce is to divide a large dataset into smaller chunks and
distribute them across a cluster of commodity hardware. The processing of data is
divided into two main phases: the Map phase and the Reduce phase.
● Map Phase: In this phase, the input data is split into smaller parts, and a mapping
function is applied to each chunk independently. This function transforms the
data into a set of key-value pairs, where the key is often used for grouping related
data.
● Shuffling and Sorting: After the Map phase, the framework shuffles and sorts the
generated key-value pairs to group data with the same key together. This step is
critical for the efficiency of the Reduce phase.
● Reduce Phase: In this phase, another function, the reducing function, is applied to
each group of key-value pairs. This function aggregates, filters, or processes the
data to produce the final output, as traced on a toy input below.
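To make these phases concrete, here is a hand trace on a toy two-line input (the
sentences are invented for illustration):
Input lines: "deer bear river" and "car car river"
Map output: (deer,1) (bear,1) (river,1) and (car,1) (car,1) (river,1)
After shuffle and sort: (bear,[1]) (car,[1,1]) (deer,[1]) (river,[1,1])
Reduce output: (bear,1) (car,2) (deer,1) (river,2)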
Hadoop provides a distributed file system, the Hadoop Distributed File System
(HDFS), that stores data across multiple nodes in the cluster. The framework also
manages the distribution of Map and Reduce tasks, monitors task progress, and
reschedules failed tasks, ensuring fault tolerance.
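Input data is normally staged into HDFS before a job runs. A minimal sketch of doing
this with Hadoop's Java FileSystem API (the file name and HDFS path are assumptions
chosen for illustration):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);     // handle to the configured file system (HDFS)
        // Copy a local file into an HDFS directory (both paths are assumed examples)
        fs.copyFromLocalFile(new Path("input.txt"),
                             new Path("/user/hadoop/input/input.txt"));
        fs.close();
    }
}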
Key advantages of MapReduce using Hadoop include:
● Scalability: Hadoop can handle datasets of virtually any size by adding more
nodes to the cluster.
● Fault Tolerance: Hadoop automatically replicates data and tasks to ensure that if
a node fails, processing can continue on another node.
● Data Locality: Hadoop tries to process data on the same node where it resides,
reducing network overhead.
● Programming Flexibility: Developers can write Map and Reduce functions in
various programming languages, making it accessible to a wide range of users.
● Ecosystem: Hadoop has a rich ecosystem of tools and libraries for various data
processing tasks, including Hive, Pig, and Spark.
CODE:
import java.io.IOException;       // needed by the map/reduce method signatures
import java.util.StringTokenizer; // used to split each input line into words

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
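// NOTE: the original listing stops at the imports above. What follows is a
// minimal sketch of the rest of the program, following the standard Hadoop
// WordCount structure; class names such as TokenizerMapper and IntSumReducer
// are conventional choices, not requirements.
public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // key = word, value = 1
            }
        }
    }

    // Reducer: sums the 1s grouped under each word by the shuffle phase.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // key = word, value = total count
        }
    }

    // Driver: configures and submits the job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Once compiled and packaged into a jar, the job would typically be launched with
hadoop jar wordcount.jar WordCount <input> <output>, where both arguments are HDFS paths.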
In conclusion, the Hadoop Word Count program demonstrates the power of MapReduce by
enabling distributed, parallel processing of vast datasets. This approach greatly improves
performance and scalability compared to traditional sequential methods, making MapReduce a
cornerstone of big data analytics.
REFERENCES:
● https://www.youtube.com/watch?v=6sK3LDY7Pp4
● https://www.projectpro.io/hadoop-tutorial/hadoop-mapreduce-wordcount-tutorial
● https://www.youtube.com/watch?v=WoZ2KSAfujQ