Ravikant_Hadoop_file
THEORY:
Map Function:
The Map function processes input data and transforms it into key-value pairs. Each input element is
processed independently, making it possible to distribute tasks across many nodes.
Example: In a word-count program, the Map function might take a text and output pairs like
("word", 1) for every word in the text.
Reduce Function:
The Reduce function processes each key and its list of values. It aggregates or performs
calculations to produce a single output for each key.
For the word-count program, the Reduce function sums up all counts for each word, resulting in ("word",
total_count) for every word in the dataset.
WORKFLOW OF PROGRAM:
The logical flow of the MapReduce programming model proceeds through the following stages:
1. Input: This is the input data / file to be processed.
2. Split: Hadoop splits the incoming data into smaller pieces called “splits”.
3. Map: In this step, MapReduce processes each split according to the logic defined in the map() function.
Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are
executed across different TaskTrackers and coordinated by the JobTracker.
4. Combine: This is an optional step used to improve performance by reducing the amount
of data transferred across the network. The combiner applies the same logic as the reduce step and is used
for aggregating the output of the map() function before it is passed to the subsequent steps.
5. Shuffle & Sort: In this step, the outputs from all the mappers are shuffled, sorted to put them in order,
and grouped before being sent to the next step.
6. Reduce: This step aggregates the outputs of the mappers using the reduce() function. The output
of each reducer is sent to the next and final step. Each reducer is treated as a task, and multiple tasks are
executed across different TaskTrackers and coordinated by the JobTracker.
7. Output: Finally, the output of the reduce step is written to a file in HDFS.
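As a small illustration of how data moves through these stages for a word count (the input line here is just an assumed example):
Input:           car bus car train
Split:           one split containing the line "car bus car train"
Map:             ("car", 1), ("bus", 1), ("car", 1), ("train", 1)
Shuffle & Sort:  ("bus", [1]), ("car", [1, 1]), ("train", [1])
Reduce:          ("bus", 1), ("car", 2), ("train", 1)
Output:          bus 1
                 car 2
                 train 1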
Now Let’s See the Word Count Program in Java
Make sure that Hadoop is installed on your system along with the Java JDK. Steps to follow:
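A quick sanity check of the environment before starting (not part of the original steps, just a suggested verification):
java -version      # prints the installed JDK version
hadoop version     # prints the installed Hadoop version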
Step 1. Open Eclipse > File > New > Java Project > (Name it – MRProgramsDemo) > Finish.
Step 2. In the project, create a new package named PackageDemo and, inside it, a class named WordCount
(these names must match the program below).
Step 3. Right-click on your project, go to Build Path -> Add External JARs..., and add the Hadoop JARs to
your project.
Step 4. Add the following reference libraries –
Right-click on Project > Build Path > Add External JARs…
• /usr/lib/hadoop-0.20/hadoop-core.jar
• /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar
PROGRAM:
Step 5. Type the following program:
package PackageDemo;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: emits (WORD, 1) for every comma-separated word in a line
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words)
                con.write(new Text(word.toUpperCase().trim()), new IntWritable(1));
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values)
                sum += value.get();
            con.write(word, new IntWritable(sum));
        }
    }
}
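Before the job can be run, the project has to be exported as a jar (e.g. via Eclipse's Export > Java > JAR file dialog, named MRProgramsDemo.jar here) and a local input file named wordcountFile must exist. The original file does not show its contents; a small comma-separated example could be created like this (the actual contents that produced the sample output below are not given):
[training@localhost ~]$ cat > wordcountFile
car,bus,car,train
bus,car,train
(press Ctrl+D to save the file)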
To move the input file into HDFS, open the terminal and enter the following command:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
Run Jar file
(hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile PathToOutputDirectory)
[training@localhost ~]$ hadoop jar MRProgramsDemo.jar PackageDemo.WordCount wordCountFile MRDir1
Sample output (contents of the result file in the MRDir1 directory):
BUS 7
CAR 4
TRAIN 6
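The result can also be printed from the terminal with hadoop fs -cat; part-r-00000 is the usual name of a single reducer's output file in the new API, so treat the exact file name as an assumption:
[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000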
PROGRAM 3
AIM: Map-Reduce Character count.
THEORY:
In Hadoop, a MapReduce program for counting characters follows a structured approach based on two key
phases: Map and Reduce.
1. Map Phase:
• The Mapper takes input data (which could be large text files) and processes it line by line, with the
input split into chunks (as determined by Hadoop's file system).
• Each character in the input text is considered as a key, and the Mapper emits a key-value pair for
each character, where the key is the character and the value is 1 (indicating that this character has
appeared once).
• For example, if the input is "Hello", the Mapper will output:
('H', 1), ('e', 1), ('l', 1), ('l', 1), ('o', 1)
2. Shuffle and Sort Phase:
• Before passing the data to the Reducer, Hadoop automatically performs a shuffle and sort
operation.
• This step groups all identical keys (characters in this case) together across different Mappers,
ensuring that all occurrences of the same character are sent to the same Reducer.
• For example, if the input was split into multiple Mappers, the character 'l' from all Mappers would
be grouped together:
('H', [1]), ('e', [1]), ('l', [1, 1]), ('o', [1])
3. Reduce Phase:
• The Reducer receives the key-value pairs from the shuffling phase, where the key is the character
and the value is a list of counts from different Mappers.
• The Reducer sums up these counts to get the total number of occurrences of each character.
• For example, for the character 'l', the values [1, 1] would be summed up to produce ('l', 2).
4. Output:
• After the Reduce phase, the output is written to the Hadoop Distributed File System (HDFS) or
another specified storage location.
• The final output will be a list of characters along with their total counts, like:
('H', 1)
('e', 1)
('l', 2)
('o', 1)
PROCEDURE:
Step 1: First Open Eclipse -> then select File -> New -> Java Project ->Name it CharCount -> then select use
an execution environment -> choose JavaSE-1.8 then next -> Finish.
Step 2: Create three Java classes in the project. Name them CharCountDriver (having the main function),
CharCountMapper, and CharCountReducer.
Mapper Code: You have to copy and paste this program into the CharCountMapper Java Class file.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CharCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        String line = value.toString();
        // emit (character, 1) for every character in the line
        String[] characters = line.split("");
        for (String c : characters) {
            output.collect(new Text(c), new IntWritable(1));
        }
    }
}
Reducer Code: You have to copy-paste this below program into the CharCountReducer Java Class file.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CharCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext())        // sum all the 1s emitted for this character
            sum += values.next().get();
        output.collect(key, new IntWritable(sum));
    }
}
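Driver Code: The driver class named in Step 2 is not shown in the original file; the following is a minimal sketch using the same old mapred API (job name and class wiring are assumptions). Paste it into the CharCountDriver Java class file.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class CharCountDriver {
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(CharCountDriver.class);
        conf.setJobName("CharCount");                 // assumed job name
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(CharCountMapper.class);
        conf.setReducerClass(CharCountReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));    // input file in HDFS
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));   // output directory in HDFS
        JobClient.runJob(conf);
    }
}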
Step 4: Now we add the required Hadoop jars to our CharCount project. Right-click on CharCount -> then select
Build Path -> click on Configure Build Path, select Add External JARs…, add the jars from their download
location, then click -> Apply and Close.
Step 5: Now export the project as a jar file. Right-click on CharCount, choose Export… and go to Java > JAR
file, click -> Next and choose your export destination, then click -> Next. Choose the Main Class as
CharCountDriver (the class with the main function) by clicking -> Browse and then click -> Finish -> Ok.
Now the jar file is successfully created and saved in the /Documents directory
with the name charectercount.jar in my case.
Step 6: Create a simple text file and add some data to it:
nano test.txt
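Steps 7 and 8 are not present in the original write-up; at minimum, the test file must be copied into HDFS before the job can read it. One way to do that (the target path is chosen to match the command in Step 9):
hdfs dfs -put test.txt /test.txt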
Step 9: Now run your jar file with the below command to produce the output in the CharCountResult directory.
Syntax: hadoop jar /jar_file_location /dataset_location_in_HDFS /output-file_name
Command:
hadoop jar /home/dikshant/Documents/charectercount.jar /test.txt /CharCountResult
RESULT:
After moving to localhost:50070/, under Utilities select "Browse the file system" and download the part file
(part-00000) in the /CharCountResult directory to see the result. We can also check the result, i.e. that
part-00000 file, with the cat command as shown below.
hdfs dfs -cat /CharCountResult/part-00000
PROGRAM 4
AIM: HDFS commands.
THEORY:
Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop to manage and store
large datasets across distributed computing environments. It is designed to handle large amounts of data,
providing high-throughput access to data and ensuring fault tolerance across a cluster of commodity
hardware.
Key Concepts of HDFS:
1. Distributed Storage:
o HDFS splits large files into smaller blocks (default size is 128 MB or 256 MB).
o These blocks are distributed across multiple nodes in the Hadoop cluster, which allows parallel
processing of data.
2. Master-Slave Architecture:
o HDFS follows a master-slave architecture with two types of nodes:
▪ NameNode (Master): It manages the metadata of the file system. It keeps track of
where each block of a file is located in the cluster and manages operations like
opening, renaming, and closing files.
▪ DataNodes (Slaves): These store the actual data blocks. Each DataNode is
responsible for serving read and write requests from the clients. They also report
back to the NameNode with the status of the stored blocks.
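As an illustration of this block placement, the blocks and replica locations of a file already stored in HDFS can be inspected with the fsck tool (the path used here is only an assumed example, matching the files used in the commands below):
hdfs fsck /geeks/myfile.txt -files -blocks -locations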
HDFS (Hadoop Distributed File System) commands allow users to interact with the file system, perform
various operations, and manage files. These commands are similar to Unix shell commands but operate on
HDFS.
COMMANDS:
1. mv: This command is used to move files within HDFS.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
2. ls: This command is used to list all the files. Use lsr for a recursive listing. It is useful when we
want to see the hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. The bin directory contains executables, so bin/hdfs means we
want the hdfs executable, particularly its dfs (Distributed File System) commands.
3. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when you
want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It will delete all the content inside the directory and then the directory
itself.
7. copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(OR)
bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero
9. cp: This command is used to copy files within HDFS. Let's copy the folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
10. mkdir: To create a directory. In Hadoop dfs there is no home directory by default.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username -> write the username of your computer
Example:
bin/hdfs dfs -mkdir /geeks => '/' means absolute path
bin/hdfs dfs -mkdir geeks2 => Relative path -> the folder will be created relative to the home directory.
PROGRAM 5
AIM: HIVE (HQL)
THEORY:
HiveQL or HQL is the Hive query language that we use to process or query structured data on Hive. HQL
syntax is very similar to MySQL but has some significant differences.
Hive also allows programmers who are familiar with MapReduce to plug in custom mappers and reducers
to perform more sophisticated analysis.
Uses of Hive:
1. The Apache Hive data warehouse software facilitates querying and managing large datasets residing in
distributed storage.
2. Hive provides tools to enable easy data extract/transform/load (ETL).
3. It provides structure on a variety of data formats.
4. By using Hive, we can access files stored in the Hadoop Distributed File System (HDFS) or in other data
storage systems such as Apache HBase.
COMMANDS:
• Syntax to Create Database:
We can create a database with the help of the below command, but if the database already exists Hive will
throw an error.
Syntax:
CREATE DATABASE|SCHEMA <database name> # we can use DATABASE or SCHEMA for creation of DB
Example:
CREATE DATABASE Test;   # create a database with the name Test
SHOW DATABASES;         # this will show the existing databases
If we again try to create the Test database, Hive will throw an error/warning that a database with the name
Test already exists. In general, we do not want to get an error if the database already exists, so we use the
CREATE DATABASE command with the [IF NOT EXISTS] clause. This will not throw any error.
Syntax:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Example:
CREATE SCHEMA IF NOT EXISTS Test1;
SHOW DATABASES;