
PROGRAM 2

AIM: Map-Reduce Word Count.

THEORY:
Map Function:
The Map function processes input data and transforms it into key-value pairs. Each input element is
processed independently, making it possible to distribute tasks across many nodes.
Example: In a word-count program, the Map function might take a text and output pairs like
("word", 1) for every word in the text.
Reduce Function:
The Reduce function processes each key and its list of values. It aggregates or performs
calculations to produce a single output for each key.
For the word-count program, the Reduce function sums up all counts for each word, resulting in ("word",
total_count) for every word in the dataset.
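As a small worked illustration (using a hypothetical input line and the comma-separated, upper-cased convention of the program below): for the line car,bus,car the Map step emits (CAR, 1), (BUS, 1), (CAR, 1); after shuffle and sort, the Reduce step receives (BUS, [1]) and (CAR, [1, 1]) and writes (BUS, 1) and (CAR, 2).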
WORK FLOW OF PROGRAM:

The logical flow of the MapReduce programming model consists of the following stages –
1. Input: This is the input data / file to be processed.
2. Split: Hadoop splits the incoming data into smaller pieces called “splits”.
3. Map: In this step, MapReduce processes each split according to the logic defined in the map() function.
Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are
executed across different TaskTrackers and coordinated by the JobTracker.
4. Combine: This is an optional step and is used to improve performance by reducing the amount
of data transferred across the network. The combiner applies the same logic as the reduce step and
aggregates the output of the map() function before it is passed to the subsequent steps (see the note
after this list).
5. Shuffle & Sort: In this step, the outputs from all the mappers are shuffled, sorted to put them in order,
and grouped before being sent to the next step.
6. Reduce: This step aggregates the outputs of the mappers using the reduce() function. The output
of the reducer is sent to the next and final step. Each reducer is treated as a task, and multiple tasks are
executed across different TaskTrackers and coordinated by the JobTracker.
7. Output: Finally, the output of the reduce step is written to a file in HDFS.
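As mentioned in the Combine step above, the combiner is optional; the word-count driver below does not set one. If you wanted to enable it, a single extra line in main() would do so (shown here only as an illustration, and reasonable here because the word-count reduce logic is a simple sum):
j.setCombinerClass(ReduceForWordCount.class);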
Now let's see the word-count program in Java.
Make sure that Hadoop is installed on your system along with the Java JDK. Steps to follow:
Step 1. Open Eclipse > File > New > Java Project > (name it – MRProgramsDemo) > Finish.
Step 2. Inside the project, create a package named PackageDemo and, inside it, a class named WordCount
(these names must match the package and class declarations in the program below).
Step 3. Right-click on your project, go to Build Path -> Add External JARs..., and add the Hadoop JARs to
your project.
Step 4. Add the following reference libraries:
Right-click on Project > Build Path > Add External Archives

• /usr/lib/hadoop-0.20/hadoop-core.jar

• /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar

PROGRAM:
Step 5. Type the following program:
package PackageDemo;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    // Driver: configures the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each line on commas and emits (WORD, 1) for every token.
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums the 1s collected for each word and writes (WORD, total_count).
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}

Make Jar File

Right-click on Project > Export > select export destination as JAR file > Next > Finish

To move the input text file (here named wordcountFile) into HDFS, open the terminal and enter the following command:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile
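The contents of the local input file wordcountFile are not listed in this document. Since the mapper splits each line on commas and upper-cases every token, one hypothetical input that would reproduce the result shown at the end of this program is:
bus,car,train,bus,train
car,bus,train,bus,car
bus,train,train,bus,car
bus,train
(that is, 7 occurrences of bus, 4 of car and 6 of train).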
Run the JAR file:
(hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile PathToOutputDirectory)
[training@localhost ~]$ hadoop jar MRProgramsDemo.jar PackageDemo.WordCount wordCountFile MRDir1

RESULT: Open the output directory to view the result:

[training@localhost ~]$ hadoop fs -ls MRDir1
Found 3 items
-rw-r--r--   1 training supergroup          0 2016-02-23 03:36 /user/training/MRDir1/_SUCCESS
drwxr-xr-x   - training supergroup          0 2016-02-23 03:36 /user/training/MRDir1/_logs
-rw-r--r--   1 training supergroup         20 2016-02-23 03:36 /user/training/MRDir1/part-r-00000

[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000
BUS 7
CAR 4
TRAIN 6
PROGRAM 3
AIM: Map-Reduce Character count.
THEORY:
In Hadoop, a MapReduce program for counting characters follows a structured approach based on two key
phases: Map and Reduce.
1. Map Phase:
• The Mapper takes input data (which could be large text files) and processes it line by line, with the
input split into chunks as determined by Hadoop's file system.
• Each character in the input text is considered as a key, and the Mapper emits a key-value pair for
each character, where the key is the character and the value is 1 (indicating that this character has
appeared once).
• For example, if the input is "Hello", the Mapper will output:
('H', 1), ('e', 1), ('l', 1), ('l', 1), ('o', 1)
2. Shuffle and Sort Phase:
• Before passing the data to the Reducer, Hadoop automatically performs a shuffle and sort
operation.
• This step groups all identical keys (characters in this case) together across different Mappers,
ensuring that all occurrences of the same character are sent to the same Reducer.
• For example, if the input was split into multiple Mappers, the character 'l' from all Mappers would
be grouped together:
('H', [1]), ('e', [1]), ('l', [1, 1]), ('o', [1])
3. Reduce Phase:
• The Reducer receives the key-value pairs from the shuffling phase, where the key is the character
and the value is a list of counts from different Mappers.
• The Reducer sums up these counts to get the total number of occurrences of each character.
• For example, for the character 'l', the values [1, 1] would be summed up to produce ('l', 2).
4. Output:
• After the Reduce phase, the output is written to the Hadoop Distributed File System (HDFS) or
another specified storage location.
• The final output will be a list of characters along with their total counts, like:
('H', 1)
('e', 1)
('l', 2)
('o', 1)
PROCEDURE:
Step 1: First Open Eclipse -> then select File -> New -> Java Project ->Name it CharCount -> then select use
an execution environment -> choose JavaSE-1.8 then next -> Finish.

Step 2: Create three Java classes in the project. Name them CharCountDriver (having the main function),
CharCountMapper, and CharCountReducer.
Mapper Code: You have to copy and paste this program into the CharCountMapper Java Class file.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CharCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        // split("") breaks the line into single-character strings
        String[] tokenizer = line.split("");
        for (String singleChar : tokenizer) {
            // emit (character, 1) for every character in the line
            Text charKey = new Text(singleChar);
            IntWritable one = new IntWritable(1);
            output.collect(charKey, one);
        }
    }
}
Reducer Code: Copy and paste the program below into the CharCountReducer Java class file.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CharCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // sum up all the 1s emitted for this character
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Driver Code: Copy and paste the program below into the CharCountDriver Java class file.
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class CharCountDriver {
    public static void main(String[] args) throws IOException {
        // job configuration using the older org.apache.hadoop.mapred API
        JobConf conf = new JobConf(CharCountDriver.class);
        conf.setJobName("CharCount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(CharCountMapper.class);
        // the reducer is also used as a combiner to pre-aggregate counts on the map side
        conf.setCombinerClass(CharCountReducer.class);
        conf.setReducerClass(CharCountReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
Note: this program uses the older org.apache.hadoop.mapred API (MapReduceBase, JobConf, OutputCollector), whereas the word-count program in PROGRAM 2 used the newer org.apache.hadoop.mapreduce API; both work with current Hadoop releases, but the two sets of classes should not be mixed within one job.
Step 3: Now we need to add external JARs for the packages that we have imported. Download the JAR
packages Hadoop Common and Hadoop MapReduce Core matching your Hadoop version (for recent Hadoop
releases these are typically named hadoop-common-<version>.jar and hadoop-mapreduce-client-core-<version>.jar).
You can check your Hadoop version with the command: hadoop version

Step 4: Now we add these external JARs to our CharCount project. Right-click on CharCount -> select
Build Path -> Configure Build Path -> Add External JARs..., add the JARs from their download location,
then click Apply and Close.

Step 5: Now export the project as a JAR file. Right-click on CharCount, choose Export..., go to Java > JAR
file, click Next, choose your export destination, then click Next. Choose the Main Class as CharCountDriver
(the class containing the main function) by clicking Browse, then click Finish -> OK.
Now the JAR file is successfully created and saved in the /Documents directory
with the name charectercount.jar in my case.
Step 6: Create a simple text file and add some data to it.
nano test.txt
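For example, if test.txt contained only the single word Hello (a hypothetical input), the final output of the job would be the counts H 1, e 1, l 2, o 1, matching the worked example in the theory section above.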

Step 7: Start the Hadoop daemons:

start-dfs.sh
start-yarn.sh
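To verify that the daemons are running, you can optionally use the jps command; it should list processes such as NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (the exact list depends on how your Hadoop cluster is configured).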
Step 8: Move your test.txt file to the Hadoop HDFS.
Syntax: hdfs dfs -put /file_path /destination
In the command below, / refers to the root directory of HDFS.
hdfs dfs -put /home/dikshant/Documents/test.txt /
Check whether the file is present in the root directory of HDFS:
hdfs dfs -ls /

Step 9: Now run your JAR file with the below command and produce the output in the /CharCountResult directory.
Syntax: hadoop jar /jar_file_location /dataset_location_in_HDFS /output_directory_name
Command:
hadoop jar /home/dikshant/Documents/charectercount.jar /test.txt /CharCountResult

RESULT:
After the job finishes, open localhost:50070, go to Utilities > Browse the file system, and download the
part-00000 file in the /CharCountResult directory to see the result. We can also check the result (the
part-00000 file) with the cat command as shown below.
hdfs dfs -cat /CharCountResult/part-00000
PROGRAM 4
AIM: HDFS commands.
THEORY:
Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop to manage and store
large datasets across distributed computing environments. It is designed to handle large amounts of data,
providing high-throughput access to data and ensuring fault tolerance across a cluster of commodity
hardware.
Key Concepts of HDFS:
1. Distributed Storage:
o HDFS splits large files into smaller blocks (the default block size is 128 MB in recent Hadoop
versions, often configured to 256 MB).
o These blocks are distributed across multiple nodes in the Hadoop cluster, which allows parallel
processing of data (see the worked example after this list).
2. Master-Slave Architecture:
o HDFS follows a master-slave architecture with two types of nodes:
▪ NameNode (Master): It manages the metadata of the file system. It keeps track of
where each block of a file is located in the cluster and manages operations like
opening, renaming, and closing files.
▪ DataNodes (Slaves): These store the actual data blocks. Each DataNode is
responsible for serving read and write requests from the clients. They also report
back to the NameNode with the status of the stored blocks.
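As a rough illustration of how blocks are distributed (assuming the common 128 MB default block size and Hadoop's default replication factor of 3): a 300 MB file is stored as three blocks of 128 MB, 128 MB and 44 MB, and each block is replicated on three different DataNodes, so the cluster holds nine block replicas for that single file while the NameNode stores only the metadata describing where those replicas live.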
HDFS (Hadoop Distributed File System) commands allow users to interact with the file system, perform
various operations, and manage files. These commands are similar to Unix shell commands but operate on
HDFS.

COMMANDS:
1. mv: This command is used to move files within HDFS.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
2. ls: This command is used to list all the files. Use lsr for recursive approach. It is useful when we
want a hierarchy of a folder.
Syntax:
bin/hdfs dfs -ls <path>
Example:
bin/hdfs dfs -ls /
It will print all the directories present in HDFS. bin directory contains executables so, bin/hdfs means we
want the executables of hdfs particularly dfs(Distributed File System) commands.

3. rmr: This command deletes a file or directory from HDFS recursively. It is a very useful command when
you want to delete a non-empty directory.
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied  -> It will delete all the content inside the directory and then the directory
itself.

4. du: It will give the size of each file in the directory.

Syntax:
bin/hdfs dfs -du <dirName>
Example:
bin/hdfs dfs -du /geeks
5. cp: This command is used to copy files within hdfs. Let’s copy folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example: bin/hdfs dfs -cp /geeks /geeks_copied

6. cat: To print file contents.


Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present inside the geeks folder
bin/hdfs dfs -cat /geeks/AI.txt

7. copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero
(OR)
bin/hdfs dfs -get /geeks/myfile.txt ../Desktop/hero

8. moveFromLocal: This command will move file from local to hdfs.


Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks


9. mkdir: To create a directory. In Hadoop dfs there is no home directory by default.
Syntax:
bin/hdfs dfs -mkdir <folder name>
Creating the home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username  -> write the username of your computer
Example:
bin/hdfs dfs -mkdir /geeks  => '/' means absolute path
bin/hdfs dfs -mkdir geeks2  => Relative path -> the folder will be created relative to the home directory.
PROGRAM 5
AIM: HIVE (HQL)
THEORY:
HiveQL or HQL is the Hive query language used to process or query structured data on Hive. HQL
syntax is very similar to SQL/MySQL but has some significant differences.
Hive also allows programmers who are familiar with the MapReduce framework to plug in their own custom
mappers and reducers to perform more sophisticated analysis.
Uses of Hive:
1. Apache Hive works on top of Hadoop's distributed storage.
2. Hive provides tools to enable easy data extract/transform/load (ETL).
3. It imposes structure on a variety of data formats.
4. By using Hive, we can access files stored in the Hadoop Distributed File System (HDFS), which is used
for querying and managing large datasets, or in other data storage systems such as Apache HBase.

COMMANDS:
• Syntax to Create Database:
We can create a database with the help of the below command but if the database already exists then, in
that case, Hive will throw an error.
Syntax:
CREATE DATABASE|SCHEMA <database name> # we can use DATABASE or SCHEMA for creation of DB
Example:
CREATE DATABASE Test;  # create database with name Test
SHOW DATABASES;  # this will show the existing databases

If we again try to create the Test database, Hive will throw an error/warning that the database with the name
Test already exists. In general, we don't want to get an error if the database exists, so we use the create
database command with the [IF NOT EXISTS] clause, which will not throw any error.
Syntax:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
Example:
CREATE SCHEMA IF NOT EXISTS Test1;
SHOW DATABASES;

• Syntax To Drop Existing Databases:

DROP DATABASE <db_name>;
or
DROP DATABASE IF EXISTS <db_name>;  # the IF EXISTS clause is again used to suppress the error
Example:
DROP DATABASE IF EXISTS Test;
DROP DATABASE Test1;

• Syntax To Create Table in Hive:


CREATE TABLE [IF NOT EXISTS] <table-name> (
<column-name> <data-type>,
<column-name> <data-type> COMMENT 'Your Comment',
<column-name> <data-type>,
.
.
.
<column-name> <data-type>
)
COMMENT 'Add if you want'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 'Location On HDFS';
Example:
CREATE TABLE IF NOT EXISTS student_data(
Student_Name STRING COMMENT 'This col. Store the name of student',
Student_Rollno INT COMMENT 'This col. Stores the rollno of student',
Student_Marks FLOAT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
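Once the table exists, data can be loaded from a local file and queried with statements such as the ones below. The file path here is hypothetical; point it at wherever your comma-separated data actually lives:
LOAD DATA LOCAL INPATH '/home/user/student_data.csv' INTO TABLE student_data;
SELECT * FROM student_data;
SELECT COUNT(*) FROM student_data;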

• Syntax to show all tables in a database:
SHOW TABLES [IN <database_name>];
Example:
SHOW TABLES IN student_detail;
