
Assignment 11

Title:
To design a distributed application using MapReduce which processes a log file of a system.

Problem Statement:
Design a distributed application using MapReduce which processes a log file of a system.

Objective:
By completing this task, students will learn the following:
1. Hadoop Distributed File System.
2. MapReduce Framework.

Software/Hardware Requirements:
64-bit open-source OS (Linux), Java, Java Development Kit (JDK), Hadoop.

Theory:

Hadoop: Hadoop is an open-source distributed computing framework designed to handle and process large volumes of data across clusters of commodity hardware. It was inspired by Google's MapReduce and Google File System (GFS) papers and is written in Java. Apache Hadoop provides a scalable, reliable, and distributed computing environment for processing and analyzing big data.
Map Reduce: MapReduce is a programming model and framework for processing
and generating large datasets in a distributed and parallel manner. It consists of two
main phases: the Map phase, where input data is divided into smaller chunks and
processed independently, and the Reduce phase, where the results from the Map
phase are aggregated and combined to produce the final output.
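For instance, if the goal were to count HTTP status codes in an access log (an illustrative choice, not something fixed by the assignment), three log lines containing the codes 200, 404, and 404 would flow through the phases roughly as follows:

Map output: <200, 1>, <404, 1>, <404, 1>
Shuffle/grouping: <200, [1]>, <404, [1, 1]>
Reduce output: <200, 1>, <404, 2>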
Mapper Class: The Mapper class is a crucial component in a MapReduce job,
responsible for processing each input record and generating intermediate key-value
pairs.
In the context of processing a log file, the Mapper class parses each log entry and
extracts relevant information.
The typical steps involved in implementing a Mapper class for log file processing
include:
1. Input Parsing: Read each line of the log file.
2. Data Extraction: Extract relevant information from each log entry, such as
timestamps, error codes, or other data points of interest.
3. Data Transformation: Convert the extracted information into key-value pairs.
For example, if the goal is to analyze error frequencies, the Mapper might emit
<error_code, 1> pairs for each occurrence of an error code in the log entry.
4. Output Emission: Emit the key-value pairs to the MapReduce framework for
further processing.
The Mapper class typically extends the Mapper class provided by the MapReduce
framework and overrides the map() method to define custom logic for processing
input records.
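A minimal sketch of such a mapper is shown below, written against the newer org.apache.hadoop.mapreduce API (the assignment's own code near the end of this document uses the older org.apache.hadoop.mapred API instead). The class name LogErrorMapper, the sample access-log line, and the rule of taking the second-to-last whitespace-separated field as the HTTP status code are illustrative assumptions, not part of the assignment code.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogErrorMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text statusCode = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Example input line (assumed Apache-style access log format):
        // 127.0.0.1 - - [10/Oct/2023:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326
        String[] fields = value.toString().trim().split("\\s+");
        if (fields.length >= 2) {
            // Treat the second-to-last field as the HTTP status code and emit <status_code, 1>.
            statusCode.set(fields[fields.length - 2]);
            context.write(statusCode, ONE);
        }
    }
}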
Reducer Class: The Reducer class is another crucial component in a MapReduce job,
responsible for aggregating and processing the intermediate key-value pairs
generated by the Mapper class.
In the context of processing a log file, the Reducer class receives key-value pairs
where the key represents a unique identifier (e.g., an error code) and the value
represents the count of occurrences.
The typical steps involved in implementing a Reducer class for log file processing
include:
1. Input Aggregation: Receive key-value pairs grouped by key from the
MapReduce framework.
2. Data Aggregation: Aggregate the counts of occurrences for each unique key.
3. Output Generation: Produce the final output, which may include aggregated
statistics, summaries, or any other desired analysis results.
The Reducer class typically extends the Reducer class provided by the MapReduce
framework and overrides the reduce() method to define custom logic for aggregating
intermediate results.
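Below is a matching sketch of a reducer that sums the <status_code, 1> pairs emitted by the illustrative mapper above, again using the newer org.apache.hadoop.mapreduce API; the class name LogErrorReducer is assumed for illustration only.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogErrorReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text statusCode, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Add up all the 1s emitted by the mapper for this status code.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        // Emit e.g. <404, 17>: the status code and its total number of occurrences.
        context.write(statusCode, total);
    }
}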
Driver Class : The Driver class in a MapReduce application is responsible for
configuring the job, setting up input and output paths, specifying mapper and
reducer classes, and submitting the job for execution. Here's a breakdown of the key
components typically found in a Driver class:
1. Configuration Setup: In the Driver class, you initialize a Hadoop configuration
object (Configuration) which holds various settings and parameters for the
MapReduce job. This includes properties such as input/output paths, mapper
and reducer classes, and any other job-specific configurations.
2. Job Initialization: Using the configuration object, you create a Job object (Job),
which represents the entire MapReduce job to be executed. This involves
specifying the name of the job, setting input/output formats, and configuring
the mapper and reducer classes.
3. Input and Output Paths: Specify the input and output paths for the job. This
tells Hadoop where to find the input data (e.g., log file) and where to write the
output of the job.
4. Mapper and Reducer Classes: Set the mapper and reducer classes to be used
in the MapReduce job. This involves specifying the Java classes that
implement the Mapper and Reducer interfaces and defining the logic for data
processing and aggregation.
5. InputFormat and OutputFormat: Configure the input and output formats for
the job, which define how input data is read and how output data is written.
Hadoop provides default input and output formats, but you can also use
custom formats if needed.
6. Output Key-Value Types: Specify the types of keys and values that the mapper
and reducer classes will emit. These types should match the output types of
the mapper and reducer classes.
7. Job Submission: Finally, submit the MapReduce job to the Hadoop or
MapReduce framework for execution. This involves calling the
job.waitForCompletion() method, which initiates the job execution and waits
for it to complete.
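The sketch below wires the illustrative mapper and reducer from the previous sections into a job, following the numbered steps above and using the newer org.apache.hadoop.mapreduce API. The assignment's own driver (SalesCountryDriver, shown later) uses the older JobConf-based API instead, and the class name LogErrorDriver is assumed for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogErrorDriver {

    public static void main(String[] args) throws Exception {
        // Configuration setup
        Configuration conf = new Configuration();

        // Job initialization
        Job job = Job.getInstance(conf, "log-error-count");
        job.setJarByClass(LogErrorDriver.class);

        // Mapper and Reducer classes
        job.setMapperClass(LogErrorMapper.class);
        job.setReducerClass(LogErrorReducer.class);

        // Output key-value types (must match what the mapper and reducer emit)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output formats: the defaults (TextInputFormat / TextOutputFormat) are used,
        // so they are not set explicitly here.

        // Input and output paths: args[0] = HDFS input directory, args[1] = output directory to be created
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Job submission: waitForCompletion() submits the job and blocks until it finishes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}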
Log file : A log file is a file that records events, actions, or messages that occur within
a software application, operating system, or system component. These files are
commonly used for troubleshooting, debugging, monitoring, auditing, and analysis
purposes. Here's some key information about log files:
Log files serve various purposes, including:
Recording system events: Log files often record events such as system startups,
shutdowns, errors, warnings, and informational messages.
Debugging: Developers use log files to debug software by analyzing logs to identify
and fix issues.
Monitoring and performance analysis: System administrators use log files to monitor
system health, performance, and resource usage.
Auditing and compliance: Log files are sometimes used to track user activities for
auditing and compliance purposes.
Log files can be stored in various formats, including plain text, XML, JSON, and
structured formats. The choice of format depends on the logging framework or
application generating the logs and the requirements of downstream analysis tools.

Steps to install hadoop (Linux) :

Step 1 : Install Java Development Kit


The default Ubuntu repositories contain both Java 8 and Java 11. This guide uses Java 8 because Hive only works with that version. Use the following command to install it:
sudo apt update && sudo apt install openjdk-8-jdk

Step 2 : Verify the Java version :


Once you have successfully installed it, check the current Java version:
java -version

Step 3 : Install SSH :


SSH (Secure Shell) installation is vital for Hadoop as it enables secure communication
between nodes in the Hadoop cluster. This ensures data integrity, confidentiality,
and allows for efficient distributed processing of data across the cluster.
sudo apt install ssh

Step 4 : Create the hadoop user :


All the Hadoop components will run as the user that you create for Apache Hadoop,
and the user will also be used for logging in to Hadoop’s web interface.
Run the following command to create the user and set a password:
sudo adduser hadoop

Step 5 : Switch user :


Switch to the newly created hadoop user:
su - hadoop

Step 6 : Configure SSH :


Now configure password-less SSH access for the newly created hadoop user. When prompted for the key file location and passphrase, just press Enter to accept the defaults. Generate an SSH key pair first:
ssh-keygen -t rsa
Step 7 : Set permissions :
Copy the generated public key to the authorized key file and set the proper
permissions:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys

Step 8 : SSH to the localhost


ssh localhost

You will be asked to authenticate hosts by adding RSA keys to known hosts. Type yes
and hit Enter to authenticate the localhost.

Step 9 : Switch user


Switch to the hadoop user again:
su - hadoop

Step 10 : Install hadoop


Download hadoop 3.3.6
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz

Once you have downloaded the file, extract it:


tar -xvzf hadoop-3.3.6.tar.gz

Rename the extracted folder to remove the version information. This step is optional, but if you skip it, adjust the remaining configuration paths accordingly.
mv hadoop-3.3.6 hadoop

Next, you will need to configure the Hadoop and Java environment variables on your system. Open the ~/.bashrc file in your favorite text editor. Here nano is used: paste with Ctrl+Shift+V, and to save and exit press Ctrl+X, then Y, then Enter:
nano ~/.bashrc

Append the below lines to the file.


export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
Load the above configuration in the current environment.
source ~/.bashrc

You also need to configure JAVA_HOME in hadoop-env.sh file. Edit the Hadoop
environment variable file in the text editor:
nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Search for the "export JAVA_HOME" line and configure it as follows:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Step 11 : Configuring Hadoop :
First, you will need to create the namenode and datanode directories inside the
Hadoop user home directory. Run the following command to create both directories:
cd hadoop/

mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}

Next, edit the core-site.xml file and update with your system hostname:
nano $HADOOP_HOME/etc/hadoop/core-site.xml

Change the following name as per your system hostname:


<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
Save and close the file.
Then, edit the hdfs-site.xml file:
nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Change the NameNode and DataNode directory paths as shown below:


<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>
Then, edit the mapred-site.xml file:
nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Make the following changes:


<configuration>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
</property>
</configuration>

Then, edit the yarn-site.xml file:


nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Make the following changes:


<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Save the file and close it.

Step 12 : Start Hadoop cluster:


Before starting the Hadoop cluster, you will need to format the NameNode as the hadoop user.
Run the following command to format the Hadoop NameNode:
hdfs namenode -format

Once the NameNode directory has been successfully formatted with the HDFS file system, you will see the message "Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted".
Then start the Hadoop cluster with the following command.
start-all.sh

You can now check the status of all Hadoop services using the jps command:
jps
The following steps verify that the Hadoop services are running and then build and run the MapReduce application on the log file.

Step 1)jps

Step 2) cd

Step 3) sudo mkdir mapreduce_bhargavi

Step 4) sudo chmod 777 -R mapreduce_bhargavi/

Step 5) sudo chown -R bhargavi mapreduce_bhargavi/

Step 6) sudo cp /home/bhargavi/Desktop/logfiles1/* ~/mapreduce_bhargavi/

Step 7) cd mapreduce_bhargavi/

Step 8) ls

Step 9) sudo chmod +r *.*

Step 10) export CLASSPATH="/home/bhargavi/hadoop-3.3.6/share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.6.jar:/home/bhargavi/hadoop-3.3.6/share/hadoop/mapreduce/hadoop-mapreduce-client-common-3.3.6.jar:/home/bhargavi/hadoop-3.3.6/share/hadoop/common/hadoop-common-3.3.6.jar:~/mapreduce_bhargavi/SalesCountry/*:$HADOOP_HOME/lib/*"

Step 11) javac -d . SalesMapper.java SalesCountryReducer.java SalesCountryDriver.java

Step 12) ls
Step 13) cd SalesCountry/

Step 14) ls (check if the class files have been created)


Step 15) cd ..

Step 16) gedit Manifest.txt


(add the following line to it:
Main-Class: SalesCountry.SalesCountryDriver)

Step 17) jar -cfm mapreduce_bhargavi.jar Manifest.txt SalesCountry/*.class

Step 18) ls

Step 19) cd

Step 20) cd mapreduce_bhargavi/

Step 21) sudo mkdir /input200


Step 22) sudo cp access_log_short.csv /input200

Step 23) $HADOOP_HOME/bin/hdfs dfs -put /input200 /

Step 24) $HADOOP_HOME/bin/hadoop jar mapreduce_bhargavi.jar /input200 /output200

Step 25) hadoop fs -ls /output200

Step 26) hadoop fs -cat /output200/part-00000

Step 27) Now open the Mozilla browser and go to localhost:9870/dfshealth.html to check the NameNode interface.

Java Code to process logfile

Mapper Class:

package SalesCountry;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // Each value is one line of the log file; split it on "-" and emit the
        // first field (e.g. the client IP in an access log) with a count of 1.
        String valueString = value.toString();
        String[] singleCountryData = valueString.split("-");
        output.collect(new Text(singleCountryData[0]), one);
    }
}

Reducer Class:

package SalesCountry;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SalesCountryReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text t_key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        Text key = t_key;
        int frequencyForCountry = 0;
        // Sum all the counts received for this key.
        while (values.hasNext()) {
            IntWritable value = values.next();
            frequencyForCountry += value.get();
        }
        // Emit the key together with its total count.
        output.collect(key, new IntWritable(frequencyForCountry));
    }
}

Driver Class:

package SalesCountry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class SalesCountryDriver {

    public static void main(String[] args) {
        JobClient my_client = new JobClient();

        // Create a configuration object for the job
        JobConf job_conf = new JobConf(SalesCountryDriver.class);

        // Set a name for the job
        job_conf.setJobName("SalePerCountry");

        // Specify the data types of the output key and value
        job_conf.setOutputKeyClass(Text.class);
        job_conf.setOutputValueClass(IntWritable.class);

        // Specify the Mapper and Reducer classes
        job_conf.setMapperClass(SalesCountry.SalesMapper.class);
        job_conf.setReducerClass(SalesCountry.SalesCountryReducer.class);

        // Specify the input and output data formats
        job_conf.setInputFormat(TextInputFormat.class);
        job_conf.setOutputFormat(TextOutputFormat.class);

        // Set input and output directories from the command-line arguments:
        // args[0] = input directory on HDFS, args[1] = output directory to be created for the output file.
        FileInputFormat.setInputPaths(job_conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(job_conf, new Path(args[1]));

        my_client.setConf(job_conf);
        try {
            // Run the job
            JobClient.runJob(job_conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Note: The paths and directory names will differ on your system. Change the names and paths accordingly.

Conclusion:
In this assignment, we learnt how to process a log file using the Hadoop framework on a distributed system.

