
Assignment 11

Title:
To design a distributed application using MapReduce which processes a log file of a system.

Problem Statement:
Design a distributed application using MapReduce which processes a log file of a system.

Objective:
By completing this task, students will learn the following:
1. Hadoop Distributed File System.
2. MapReduce Framework.

Software/Hardware Requirements:
64-bit open-source OS (Linux), Java, Java Development Kit (JDK), Hadoop.

Theory:

Hadoop: Hadoop is an open-source distributed computing framework designed to handle and process large volumes of data across clusters of commodity hardware. It was inspired by Google's MapReduce and Google File System (GFS) papers and is written in Java. Apache Hadoop provides a scalable, reliable, and distributed computing environment for processing and analyzing big data.
Map Reduce: MapReduce is a programming model and framework for processing
and generating large datasets in a distributed and parallel manner. It consists of two
main phases: the Map phase, where input data is divided into smaller chunks and
processed independently, and the Reduce phase, where the results from the Map
phase are aggregated and combined to produce the final output.
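For instance, if the goal were to count HTTP status codes in an access log (an illustrative choice, not something fixed by the assignment), three log lines containing the codes 200, 404, and 404 would flow through the phases roughly as follows:

Map output: <200, 1>, <404, 1>, <404, 1>
Shuffle/grouping: <200, [1]>, <404, [1, 1]>
Reduce output: <200, 1>, <404, 2>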
Mapper Class: The Mapper class is a crucial component in a MapReduce job,
responsible for processing each input record and generating intermediate key-value
pairs.
In the context of processing a log file, the Mapper class parses each log entry and
extracts relevant information.
The typical steps involved in implementing a Mapper class for log file processing
include:
1. Input Parsing: Read each line of the log file.
2. Data Extraction: Extract relevant information from each log entry, such as
timestamps, error codes, or other data points of interest.
3. Data Transformation: Convert the extracted information into key-value pairs.
For example, if the goal is to analyze error frequencies, the Mapper might emit
<error_code, 1> pairs for each occurrence of an error code in the log entry.
4. Output Emission: Emit the key-value pairs to the MapReduce framework for
further processing.
The Mapper class typically extends the Mapper class provided by the MapReduce
framework and overrides the map() method to define custom logic for processing
input records.
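A minimal sketch of such a mapper is shown below, written against the newer org.apache.hadoop.mapreduce API (the assignment's own code near the end of this document uses the older org.apache.hadoop.mapred API instead). The class name LogErrorMapper, the sample access-log line, and the rule of taking the second-to-last whitespace-separated field as the HTTP status code are illustrative assumptions, not part of the assignment code.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogErrorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text statusCode = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Example input line (assumed Apache-style access log format):
        // 127.0.0.1 - - [10/Oct/2023:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326
        String[] fields = value.toString().trim().split("\\s+");
        if (fields.length >= 2) {
            // Treat the second-to-last field as the HTTP status code and emit <status_code, 1>.
            statusCode.set(fields[fields.length - 2]);
            context.write(statusCode, ONE);
        }
    }
}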
Reducer Class: The Reducer class is another crucial component in a MapReduce job,
responsible for aggregating and processing the intermediate key-value pairs
generated by the Mapper class.
In the context of processing a log file, the Reducer class receives key-value pairs
where the key represents a unique identifier (e.g., an error code) and the value
represents the count of occurrences.
The typical steps involved in implementing a Reducer class for log file processing
include:
1. Input Aggregation: Receive key-value pairs grouped by key from the
MapReduce framework.
2. Data Aggregation: Aggregate the counts of occurrences for each unique key.
3. Output Generation: Produce the final output, which may include aggregated
statistics, summaries, or any other desired analysis results.
The Reducer class typically extends the Reducer class provided by the MapReduce
framework and overrides the reduce() method to define custom logic for aggregating
intermediate results.
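Below is a matching sketch of a reducer that sums the <status_code, 1> pairs emitted by the illustrative mapper above, again using the newer org.apache.hadoop.mapreduce API; the class name LogErrorReducer is assumed for illustration only.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogErrorReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text statusCode, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Add up all the 1s emitted by the mapper for this status code.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        // Emit e.g. <404, 17>: the status code and its total number of occurrences.
        context.write(statusCode, total);
    }
}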
Driver Class : The Driver class in a MapReduce application is responsible for
configuring the job, setting up input and output paths, specifying mapper and
reducer classes, and submitting the job for execution. Here's a breakdown of the key
components typically found in a Driver class:
1. Configuration Setup: In the Driver class, you initialize a Hadoop configuration
object (Configuration) which holds various settings and parameters for the
MapReduce job. This includes properties such as input/output paths, mapper
and reducer classes, and any other job-specific configurations.
2. Job Initialization: Using the configuration object, you create a Job object (Job),
which represents the entire MapReduce job to be executed. This involves
specifying the name of the job, setting input/output formats, and configuring
the mapper and reducer classes.
3. Input and Output Paths: Specify the input and output paths for the job. This
tells Hadoop where to find the input data (e.g., log file) and where to write the
output of the job.
4. Mapper and Reducer Classes: Set the mapper and reducer classes to be used
in the MapReduce job. This involves specifying the Java classes that
implement the Mapper and Reducer interfaces and defining the logic for data
processing and aggregation.
5. InputFormat and OutputFormat: Configure the input and output formats for
the job, which define how input data is read and how output data is written.
Hadoop provides default input and output formats, but you can also use
custom formats if needed.
6. Output Key-Value Types: Specify the types of keys and values that the mapper
and reducer classes will emit. These types should match the output types of
the mapper and reducer classes.
7. Job Submission: Finally, submit the MapReduce job to the Hadoop or
MapReduce framework for execution. This involves calling the
job.waitForCompletion() method, which initiates the job execution and waits
for it to complete.
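The sketch below wires the illustrative mapper and reducer from the previous sections into a job, following the numbered steps above and using the newer org.apache.hadoop.mapreduce API. The assignment's own driver (SalesCountryDriver, shown later) uses the older JobConf-based API instead, and the class name LogErrorDriver is assumed for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogErrorDriver {

    public static void main(String[] args) throws Exception {
        // Configuration setup
        Configuration conf = new Configuration();

        // Job initialization
        Job job = Job.getInstance(conf, "log-error-count");
        job.setJarByClass(LogErrorDriver.class);

        // Mapper and Reducer classes
        job.setMapperClass(LogErrorMapper.class);
        job.setReducerClass(LogErrorReducer.class);

        // Output key-value types (must match what the mapper and reducer emit)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output formats: the defaults (TextInputFormat / TextOutputFormat) are used,
        // so they are not set explicitly here.

        // Input and output paths: args[0] = HDFS input directory, args[1] = output directory to be created
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Job submission: waitForCompletion() submits the job and blocks until it finishes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}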
Log file : A log file is a file that records events, actions, or messages that occur within
a software application, operating system, or system component. These files are
commonly used for troubleshooting, debugging, monitoring, auditing, and analysis
purposes. Here's some key information about log files:
Log files serve various purposes, including:
Recording system events: Log files often record events such as system startups,
shutdowns, errors, warnings, and informational messages.
Debugging: Developers use log files to debug software by analyzing logs to identify
and fix issues.
Monitoring and performance analysis: System administrators use log files to monitor
system health, performance, and resource usage.
Auditing and compliance: Log files are sometimes used to track user activities for
auditing and compliance purposes.
Log files can be stored in various formats, including plain text, XML, JSON, and
structured formats. The choice of format depends on the logging framework or
application generating the logs and the requirements of downstream analysis tools.

Steps to install hadoop (Linux) :

Step 1 : Install Java Development Kit


The default Ubuntu repositories contain both Java 8 and Java 11. This guide uses Java 8 because Hive only works with that version. Use the following command to install it:
sudo apt update && sudo apt install openjdk-8-jdk

Step 2 : Verify the Java version :


Once you have successfully installed it, check the current Java version:
java -version

Step 3 : Install SSH :


SSH (Secure Shell) installation is vital for Hadoop as it enables secure communication
between nodes in the Hadoop cluster. This ensures data integrity, confidentiality,
and allows for efficient distributed processing of data across the cluster.
sudo apt install ssh

Step 4 : Create the hadoop user :


All the Hadoop components will run as the user that you create for Apache Hadoop,
and the user will also be used for logging in to Hadoop’s web interface.
Run the following command to create the user and set a password:
sudo adduser hadoop

Step 5 : Switch user :


Switch to the newly created hadoop user:
su - hadoop

Step 6 : Configure SSH :


Now configure password-less SSH access for the newly created hadoop user. When prompted for the key file location and passphrase, just press Enter to accept the defaults. Generate an SSH key pair first:
ssh-keygen -t rsa
Step 7 : Set permissions :
Copy the generated public key to the authorized key file and set the proper
permissions:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys

Step 8 : SSH to the localhost


ssh localhost

You will be asked to authenticate hosts by adding RSA keys to known hosts. Type yes
and hit Enter to authenticate the localhost.

Step 9 : Switch user


Switch to the hadoop user again:
su - hadoop

Step 10 : Install hadoop


Download hadoop 3.3.6
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz

Once you have downloaded the file, extract it:


tar -xvzf hadoop-3.3.6.tar.gz

Rename the extracted folder to remove the version information. This step is optional, but if you skip it, adjust the remaining configuration paths accordingly.
mv hadoop-3.3.6 hadoop

Next, you will need to configure the Hadoop and Java environment variables on your system. Open the ~/.bashrc file in your favorite text editor. Here nano is used: paste with Ctrl+Shift+V, and to save and exit press Ctrl+X, then Y, then Enter:
nano ~/.bashrc

Append the below lines to the file.


export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Load the above configuration in the current environment.
source ~/.bashrc

You also need to configure JAVA_HOME in hadoop-env.sh file. Edit the Hadoop
environment variable file in the text editor:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Search for the "export JAVA_HOME" line and configure it as follows:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Step 11 : Configuring Hadoop :
First, you will need to create the namenode and datanode directories inside the
Hadoop user home directory. Run the following command to create both directories:
cd hadoop/

mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}

Next, edit the core-site.xml file and update with your system hostname:
nano $HADOOP_HOME/etc/hadoop/core-site.xml

Change the following name as per your system hostname:


<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Save and close the file.
Then, edit the hdfs-site.xml file:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Change the NameNode and DataNode directory paths as shown below:


<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
Then, edit the mapred-site.xml file:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Make the following changes:


<configuration>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
</configuration>

Then, edit the yarn-site.xml file:


nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Make the following changes:


<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Save the file and close it.

Step 12 : Start Hadoop cluster:


Before starting the Hadoop cluster, you will need to format the NameNode as the hadoop user.
Run the following command to format the Hadoop NameNode:
hdfs namenode -format

Once the NameNode directory has been successfully formatted with the HDFS file system, you will see the message "Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted".
Then start the Hadoop cluster with the following command.
start-all.sh

You can now check the status of all Hadoop services using the jps command:
jps
The following steps verify that the Hadoop services are running and then build and run the MapReduce application on the log file.

Step 1)jps

Step 2) cd

Step 3) sudo mkdir mapreduce_bhargavi

Step 4) sudo chmod 777 -R mapreduce_bhargavi/

Step 5) sudo chown -R bhargavi mapreduce_bhargavi/

Step 6) sudo cp /home/bhargavi/Desktop/logfiles1/* ~/mapreduce_bhargavi/

Step 7) cd mapreduce_bhargavi/

Step 8) ls

Step 9) sudo chmod +r *.*

Step 10) export CLASSPATH="/home/bhargavi/hadoop-3.3.6/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:/home/bhargavi/hadoop-3.3.6/share/hadoop/mapreduce/hadoop-mapreduce-client-common-3.3.6.jar:/home/bhargavi/hadoop-3.3.6/share/hadoop/common/hadoop-common-3.3.6.jar:~/mapreduce_bhargavi/SalesCountry/*:$HADOOP_HOME/lib/*"

Step 11) javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java

Step 12) ls
Step 13) cd SalesCountry/

Step 14) ls (check if the class files have been created)


Step 15) cd ..

Step 16) gedit Manifest.txt


(add the following line to it:
Main-Class: SalesCountry.SalesCountryDriver)

Step 17) jar -cfm mapreduce_bhargavi.jar Manifest.txt SalesCountry/*.class

Step 18) ls

Step 19) cd

Step 20) cd mapreduce_bhargavi/

Step 21) sudo mkdir /input200


Step 22) sudo cp access_log_short.csv /input200

Step 23) $HADOOP_HOME/bin/hdfs dfs -put /input200 /

Step 24) $HADOOP_HOME/bin/hadoop jar mapreduce_bhargavi.jar /input200 /output200

Step 25) hadoop fs -ls /output200

Step 26) hadoop fs -cat /output200/part-00000

Step 27) Now open the Mozilla browser and go to localhost:9870/dfshealth.html to check the NameNode interface.

Java Code to process logfile

Mapper Class:

package SalesCountry;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // Each value is one line of the log file; split it on "-" and emit the
        // first field (e.g. the client IP in an access log) with a count of 1.
        String valueString = value.toString();
        String[] singleCountryData = valueString.split("-");
        output.collect(new Text(singleCountryData[0]), one);
    }
}

Reducer Class:

package SalesCountry;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text t_key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        Text key = t_key;
        int frequencyForCountry = 0;
        // Sum all the counts received for this key.
        while (values.hasNext()) {
            IntWritable value = values.next();
            frequencyForCountry += value.get();
        }
        // Emit the key together with its total count.
        output.collect(key, new IntWritable(frequencyForCountry));
    }
}

Driver Class:

package SalesCountry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class SalesCountryDriver {

    public static void main(String[] args) {
        JobClient my_client = new JobClient();

        // Create a configuration object for the job
        JobConf job_conf = new JobConf(SalesCountryDriver.class);

        // Set a name for the job
        job_conf.setJobName("SalePerCountry");

        // Specify the data types of the output key and value
        job_conf.setOutputKeyClass(Text.class);
        job_conf.setOutputValueClass(IntWritable.class);

        // Specify the Mapper and Reducer classes
        job_conf.setMapperClass(SalesCountry.SalesMapper.class);
        job_conf.setReducerClass(SalesCountry.SalesCountryReducer.class);

        // Specify the input and output data formats
        job_conf.setInputFormat(TextInputFormat.class);
        job_conf.setOutputFormat(TextOutputFormat.class);

        // Set input and output directories from the command-line arguments:
        // args[0] = input directory on HDFS, args[1] = output directory to be created for the output file.
        FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));

        my_client.setConf(job_conf);
        try {
            // Run the job
            JobClient.runJob(job_conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Note: The paths and directory names will differ on your system. Change the names and paths accordingly.

Conclusion:
In this assignment, we learnt how to process a log file using the Hadoop framework on a distributed system.

