Lab Manual
Introduction:
Apache Hadoop is an open-source software framework written in Java for distributed storage and
distributed processing of very large data sets on computer clusters built from commodity hardware. The
core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and
a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across
nodes in a cluster. To process the data, Hadoop ships the packaged code to the nodes, which then process in parallel the data they store locally. This approach takes advantage of data locality.
The Apache Hadoop framework is composed of the following modules:
Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on
commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN – a resource-management platform responsible for managing computing
resources in clusters and using them for scheduling of users' applications; and
Hadoop MapReduce – an implementation of the MapReduce programming model for large
scale data processing.
Software installation: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
Steps:
[2] Install ssh (secure shell): Hadoop requires SSH access to manage its nodes. Hadoop uses SSH to access its nodes, which would normally require the user to enter a password. This requirement can be eliminated by creating and setting up SSH certificates (a passwordless key pair) with the following command. If asked for a filename, just leave it blank and press the Enter key to continue (the public key is written to /home/hadoop_user/.ssh/id_rsa.pub):
> ssh-keygen -t rsa -P ""
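The generated public key then has to be authorized for passwordless login; a minimal sketch, assuming the hadoop_user account and the default key path above:
> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
> ssh localhost    # should now log in without asking for a password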
Hadoop installation: Download a recent stable release from one of the Apache Download Mirrors:
https://www.apache.org/dyn/closer.cgi/hadoop/common/
http://www.eu.apache.org/dist/hadoop/common/stable/hadoop-2.7.1.tar.gz
[1] Copy and extract hadoop-2.7.1.tar.gz in the home folder.
There are three modes in which you can start a Hadoop cluster:
1) Local (Standalone) Mode
2) Pseudo-Distributed Mode
3) Fully Distributed Mode
Hadoop is configured by default to run in non-distributed mode, as a single Java process, which is useful for debugging. In pseudo-distributed mode, each Hadoop daemon runs in a separate Java process on a single node.
Setup Configuration Files : The following files will have to be modified to complete the Hadoop setup:
1) /usr/local/hadoop/etc/hadoop/hadoop-env.sh
2) /usr/local/hadoop/etc/hadoop/core-site.xml
3) /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
4) /usr/local/hadoop/etc/hadoop/hdfs-site.xml
5) ~/.bashrc
Before editing the .bashrc file in our home directory, we need to find the path where Java has been installed in order to set the JAVA_HOME environment variable, using the following command:
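The command itself is missing from the handout; a common way to locate the Java path on Ubuntu (an assumption, adjust for your distribution) is:
> update-alternatives --config java
# or print the JAVA_HOME prefix directly:
> readlink -f /usr/bin/java | sed "s:/bin/java::"
The resulting path is then exported as JAVA_HOME in ~/.bashrc, together with the Hadoop variables (e.g. HADOOP_HOME=/usr/local/hadoop and the corresponding PATH entries).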
[3] core-site.xml: This file holds configuration properties that Hadoop reads when starting up, such as the default file system URI and the base temporary directory. Edit core-site.xml as follows:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>
</configuration>
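mapred-site.xml is listed among the files to modify, but its contents are not shown in the handout; a minimal sketch, assuming MapReduce is to run on YARN (copy the template first):
> cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>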
[4] hdfs-site.xml:
Create the NameNode and DataNode directories for the Hadoop HDFS cluster:
> sudo mkdir -p /usr/local/HADOOP_STORE/hdfs/namenode
> sudo mkdir -p /usr/local/HADOOP_STORE/hdfs/datanode
> sudo chown -R hduser:hadoop /usr/local/HADOOP_STORE
Edit hdfs-site.xml :
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/HADOOP_STORE/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/HADOOP_STORE/hdfs/datanode</value>
</property>
</configuration>
Set JAVA_HOME in hadoop-env.sh, e.g. export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 (adjust the path to the local Java installation). Adding this statement to the hadoop-env.sh file ensures that the value of the JAVA_HOME variable is available to Hadoop whenever it is started up.
For example:
# Check the Java version
> java -version
java version "1.8.0"
# Check the Hadoop version
> hadoop version
Hadoop 2.9.0
# Display the usage documentation for the hadoop script
> bin/hadoop
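Before the daemons are started for the first time, the HDFS NameNode has to be formatted (a standard step from the Apache single-node guide that is not shown above):
> hdfs namenode -format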
> start-all.sh
[10] Check that all processes have started:
> jps
Conclusion: We have studied Hadoop installation on a single node and configured Hadoop on Ubuntu.
Reference: http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/SingleCluster.html
Group A: Assignments based on the Hadoop
Assignment No: 2
Aim:
Design a distributed application using MapReduce (in Java) which processes a log file of a system. List the users who have logged in to the system for the longest period. Use a simple log file from the Internet and process it in pseudo-distributed mode on the Hadoop platform.
Prerequisites:
Ensure that Hadoop is installed, configured and is running.
Single Node Setup.
Theory:
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.
The Algorithm
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Map stage: The map or mapper’s job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage : This stage is the combination of the Shuffle stage and the
Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be
stored in the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
Input (Sample) :
10.223.157.186 - - [15/Jul/2009:20:50:32 -0700] "GET /assets/js/the-associates.js
HTTP/1.1" 304 -
10.223.157.186 - - [15/Jul/2009:20:50:33 -0700] "GET /assets/img/home-logo.png
HTTP/1.1" 304 -
10.223.157.186 - - [15/Jul/2009:20:50:33 -0700] "GET /assets/img/dummy/primary-
news-2.jpg HTTP/1.1" 304 -
Output (Sample):
$ hadoop dfs -cat output/part-r-00000
10.1.1.236 7
10.1.181.142 14
10.1.232.31 5
10.10.55.142 14
10.102.101.66 1
10.103.184.104 1
10.103.190.81 53
Implementation:
1. Mapper:
public void map(Object key, Text value, OutputCollector<Text, IntWritable> output, Reporter rep) throws IOException
The Mapper implementation, via the map method, processes one line at a time, as provided by the specified TextInputFormat. It splits the line into tokens separated by whitespace (using the split method of String) and emits a key-value pair of <<nLog>, 1>, where <nLog> is the user/IP field extracted from the log line. The value 1 is hard-coded using new IntWritable(1).
The output of each map is passed through the local combiner (which is same as the Reducer
as per the job configuration) for local aggregation, after being sorted on the keys.
2. Reducer:
while( values.hasNext() )
{
IntWritable i = values.next();
count = count + i.get();
}
The Reducer sums the counts received for each key and emits one <user/IP, total> pair per user, for example:
10.1.1.236 7
10.1.181.142 14
10.1.232.31 5
10.10.55.142 14
10.102.101.66 1
10.103.184.104 1
10.103.190.81 53 … till the end of the job
3. Driver: The main method specifies various parameters of the job, such as the input/output paths (passed via the command line), key/value types, and input/output formats. It then calls JobClient.runJob (the old-API equivalent of job.waitForCompletion) to submit the job and monitor its progress:
conf.setMapperClass(mapper1.class);
conf.setReducerClass(reducer1.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
JobClient.runJob(conf);
}
}
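For reference, a minimal end-to-end sketch assembling the fragments above into one class, using the same old-style mapred API; the class name LogCount and the tokenization of the log line are assumptions, while mapper1 and reducer1 follow the names used in the job configuration above:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class LogCount {

    // Mapper: emits <ip, 1> for every log line
    public static class mapper1 extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter rep)
                throws IOException {
            // Assumption: the user/IP address is the first whitespace-separated token of the log line
            String[] tokens = value.toString().split(" ");
            if (tokens.length > 0 && !tokens[0].isEmpty()) {
                output.collect(new Text(tokens[0]), one);
            }
        }
    }

    // Reducer (also used as combiner): sums the counts for each IP address
    public static class reducer1 extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter rep)
                throws IOException {
            int count = 0;
            while (values.hasNext()) {
                count = count + values.next().get();
            }
            output.collect(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(LogCount.class);
        conf.setJobName("logcount");

        conf.setMapperClass(mapper1.class);
        conf.setCombinerClass(reducer1.class);
        conf.setReducerClass(reducer1.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // Input and output HDFS paths are passed on the command line
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}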
Conclusion:
In this assignment we have studied how to develop a distributed application using MapReduce.
Group A : Assignments based on the Hadoop
Assignment No: 3
Title : Write an application using HiveQL for a flight information system which will include
a. Creating, dropping, and altering database tables.
b. Creating an external Hive table.
c. Loading the table with data, inserting new values and fields into the table, and joining tables with Hive.
d. Creating an index on the flight information table.
e. Finding the average departure delay per day in 2008.
Prerequisites:
- Ensure that Hadoop is installed, configured and is running.
- Single Node Setup.
- Hive installed and working properly
- HBase installed and working properly
Theory:
Hive : Hive is a data warehouse infrastructure tool used to process data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Steps :
[1]Start Hadoop
[2] Start Hive
[3] Create a database
Example :
Hive>CREATE DATABASE ourfirstdatabase;
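Aim item (a) also asks for altering and dropping databases; minimal HiveQL sketches (the DBPROPERTIES key/value pair is illustrative, and DROP removes the database and its tables):
hive> USE ourfirstdatabase;
hive> ALTER DATABASE ourfirstdatabase SET DBPROPERTIES ('creator'='student');
hive> DROP DATABASE IF EXISTS ourfirstdatabase CASCADE;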
The flight information data set has the following columns:
No. Name Description
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes
Create a table:
CREATE TABLE IF NOT EXISTS FlightInfo2007
(
Year SMALLINT, Month TINYINT, DayofMonth TINYINT,
DayOfWeek TINYINT,
DepTime SMALLINT, CRSDepTime SMALLINT, ArrTime SMALLINT,CRSArrTime SMALLINT,
UniqueCarrier STRING, FlightNum STRING, TailNum STRING,
ActualElapsedTime SMALLINT, CRSElapsedTime SMALLINT,
AirTime SMALLINT, ArrDelay SMALLINT, DepDelay SMALLINT,
Origin STRING, Dest STRING,Distance INT,
TaxiIn SMALLINT, TaxiOut SMALLINT, Cancelled SMALLINT,
CancellationCode STRING, Diverted SMALLINT,
CarrierDelay SMALLINT, WeatherDelay SMALLINT,
NASDelay SMALLINT, SecurityDelay SMALLINT,
LateAircraftDelay
SMALLINT)
COMMENT 'Flight InfoTable'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
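Aim items (b) to (d) are not shown above; minimal HiveQL sketches (the file paths, the 2007.csv name, and the external-table column subset are assumptions; note that Hive indexes were removed in Hive 3.0):
hive> LOAD DATA LOCAL INPATH '/home/hduser/2007.csv' OVERWRITE INTO TABLE FlightInfo2007;
hive> CREATE EXTERNAL TABLE IF NOT EXISTS myFlightInfo2007_ext (
Year SMALLINT, Month TINYINT, FlightNum STRING, Origin STRING, Dest STRING, DepDelay SMALLINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hduser/flightdata/2007';
hive> CREATE INDEX flight_index ON TABLE FlightInfo2007 (FlightNum)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;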
JOIN
Hive>SELECT m8.Year, m8.Month, m8.FlightNum, m8.Origin, m8.Dest, m7.Year, m7.Month,
m7.FlightNum, m7.Origin, m7.Dest FROM myFlightinfo2008 m8 JOIN myFlightinfo2007
m7 ON m8.FlightNum=m7.FlightNum;
hive>SELECT
m8.Year,m8.Month,m8.FlightNum,m8.Origin,m8.Dest,m7.Year,m7.Month,m7.FlightNum,
m7.Origin,m7.Dest FROM myFlightinfo2008 m8 LEFT OUTER JOIN myFlightinfo2007
m7 ON m8.FlightNum=m7.FlightNum;
hive> SELECT Origin, COUNT(1) FROM flightinfo2008 WHERE Origin = 'SYR' GROUP BY
Origin;
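Aim item (e), the average departure delay per day of week in 2008, can be computed with a query along these lines (the myFlightinfo2008 table name is taken from the join examples above; the figures below were presumably produced by such a query):
hive> SELECT DayOfWeek, AVG(DepDelay) FROM myFlightinfo2008 GROUP BY DayOfWeek;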
3 8.289761053658728
6 8.645680904903614
1 10.269990244459473
4 9.772897177836702
7 11.568973392595312
2 8.97689712068735
5 12.158036387869656
According to the results above, day 5 had the highest average departure delay.
Assignment no – 01
Title : Perform the following operations using Python on the Facebook metrics data
sets :
a. Create data subsets
b. Merge Data
c. Sort Data
d. Transposing Data
e. Shape and reshape Data
Theory:
Pandas is a Python library used for working with data sets. It is designed for working with labeled and relational data easily and intuitively. It provides data structures (Series, DataFrame) and functions for analyzing, cleaning, exploring, and manipulating data. The Pandas library is built on top of the NumPy library.
Advantages:
Fast and efficient for data manipulation and data analysis.
Data from different sources can be loaded
It is easier to handle missing data ( Data represented as NaN)
In pandas data frame columns can be easily inserted or deleted
Data set merging
Provides functions for reshaping and pivoting of data sets
After pandas is installed on the system, you need to import the pandas library as:
import pandas as pd
The Pandas Series: A Pandas Series is a one-dimensional labeled array. It can hold data of any type (integer, string, float, Python object, etc.). It can be created from a list or array as follows:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
DataFrame: A DataFrame is a two-dimensional labeled data structure with columns of potentially different types, similar to a table.
Example:
import pandas as pd
df=pd.DataFrame({"Rollno":[1,2,3], "Name":["Sunil","Nitin","Ajay"]})
print(df)
Data Loading: First step in any data science project is to import data. pandas
provide functions such as read_csv( ), read_table(), read_excel() for reading data as a
DataFrame object.
Reading data from a csv file: To read data from a csv file, pandas read_csv() is
used.
Example:
df = pd.read_csv('dataset_Facebook.csv', sep=';')
[1] Create data subsets: Rows and columns can be selected with loc and iloc.
Example:
df.loc[[0, 2]] # returns rows 0 and 2
df.iloc[0:3, 2] # returns rows 0 to 2 of the column at position 2
Example:
# create subset1 with data from row 1 to row 5 and columns Category', 'like', 'share',
'Type' of Facebook metric data set
subset1 = df.loc[1:5,['Category','like','share','Type']]
[2] Combining and Merging Data Sets : Data contained in pandas objects
can be combined together using pandas following functions:
merge( ): connects rows in DataFrames based on one or more keys.
concat( ): concatenates or stacks together objects along an axis.
Example: merge_set = pd.concat([subset1, subset2], axis=0)
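A small sketch of merge( ) on a shared key column, reusing subset1 from above (the choice of 'Type' as the join key and the 'comment' column are assumptions about the Facebook metrics file):
subset2 = df.loc[1:5, ['Type', 'comment']]
merged = pd.merge(subset1, subset2, on='Type') # joins rows whose 'Type' values match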
[3] Sort Data: Pandas sort_values() function sorts a data frame in Ascending or
Descending order of passed Column
Syntax:
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
Example: Sort data in ascending order of 'like' column in Facebook metric dataset
df.sort_values(by='like',ascending=True)
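Section [4] Transposing Data is missing from the handout; a minimal sketch using the subset created earlier:
subset1_T = subset1.T # .T swaps rows and columns (the transpose of the DataFrame)
print(subset1_T)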
[5] Shape and reshape Data: The Python pandas library provides the melt( ) and pivot( ) functions:
melt() : Pandas melt( ) function is used to change the DataFrame format from
wide to long. melt( ) is used to create a specific format of the DataFrame object
where one or more columns work as identifiers. All the remaining columns are
treated as values and unpivoted to the row axis and only two columns – variable
and value.
Pivot( ) : We can use pivot() function to unmelt a DataFrame object and get the
original dataframe. The pivot() function ‘index’ parameter value should be same
as the ‘id_vars’ value. The ‘columns’ value should be passed as the name of the
‘variable’ column.
Syntax: pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value') and DataFrame.pivot(index=None, columns=None, values=None)
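A small worked example on a toy DataFrame (hypothetical data, not the Facebook metrics file):
scores = pd.DataFrame({'Name': ['A', 'B'], 'Math': [90, 80], 'Science': [85, 95]})
long_df = pd.melt(scores, id_vars=['Name'], var_name='Subject', value_name='Marks') # wide to long
wide_df = long_df.pivot(index='Name', columns='Subject', values='Marks') # back to wide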
Assignment no – 02
Title : Perform the following operations using Python on the Air quality and Heart
Diseases data sets :
a. Data cleaning
b. Data integration
c. Data transformation
d. Error correcting
e. Data model building
Theory:
Data Cleaning: Data cleaning is the process of fixing or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete data within a dataset.
[1] Detecting missing values: The Pandas df.isna( ) method returns True where a value is null. We can clean the data by adding or dropping data as needed, for example dropping columns that provide no useful information or dropping rows where data is not available.
[2] Dropping rows/columns with Null values: The Pandas dropna( ) method allows us to drop rows/columns with Null values.
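A minimal dropna sketch (toy values standing in for air-quality readings; the column names are assumptions):
import pandas as pd
import numpy as np
aq = pd.DataFrame({'CO': [2.6, np.nan, 2.2], 'NO2': [113.0, 92.0, np.nan]})
aq_rows = aq.dropna() # drop rows containing any NaN
aq_cols = aq.dropna(axis=1) # drop columns containing any NaN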
[3] Removing duplicates: The following example builds a small DataFrame that is used below to detect duplicate rows:
data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})
k1 k2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4
The DataFrame method duplicated returns a boolean Series indicating whether each
row is a duplicate or not:
data.duplicated( )
0 False
1 True
2 False
3 False
4 True
5 False
6 True
[4] Replacing values: The replace( ) method substitutes specified values, for example replacing a sentinel value with NaN:
data.replace(-999, np.nan)
If you want to replace multiple values at once, you instead pass a list and then the single substitute value:
data.replace([-999, -1000], np.nan)
To use a different replacement for each value, pass a list of substitutes:
data.replace([-999, -1000], [np.nan, 0])
Data model building : Steps for building a machine learning model are:
1) Loading dataset, 2) Data preprocessing 3) Data splitting into train and test set
4) Model building 5) Train model, 6) Test Model 7) Evaluate model
performance
Example: Build a machine learning model for heart disease prediction.
Use heart.csv data file.
import pandas as pd
df = pd.read_csv('heart.csv')
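A minimal sketch of the remaining steps (splitting, training, testing, evaluation) using scikit-learn; the 'target' column name follows the common heart.csv layout and is an assumption:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = df.drop('target', axis=1) # features
y = df['target'] # assumed label column: 1 = heart disease, 0 = no disease
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train) # train the model
y_pred = model.predict(X_test) # test the model
print('Accuracy:', accuracy_score(y_test, y_pred)) # evaluate model performance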
Title : Visualize the data using Python libraries matplotlib, seaborn by plotting the graphs
Theory:
Data Visualization: Data visualization refers to the techniques used to communicate
data or information by encoding it as visual objects (e.g., points, lines or bars) contained
in graphics. It involves the creation and study of the visual representation of data.
Making plots and static or interactive visualizations is one of the most important tasks
in data analysis.
Types of Data Visualization: Data can be visualized in the form of 1D, 2D or3D
structure. Different types of data visualization are:
1D/Linear
2D (Planar) Data Visualization
3D (Volumetric) Data Visualization
Temporal
Multidimensional: more than two dimensions are visualized at once. Examples: pie chart, histogram, tag cloud, bar chart, scatter plot, heat map, etc.
Tree/Hierarchical
Network data Visualization
Python libraries for data visualization: The Python libraries matplotlib and seaborn are used for data visualization.
Matplotlib: Matplotlib is a multiplatform data visualization library built on NumPy arrays. It was conceived by John Hunter in 2002 and is used for creating plots. Most of the functionality is in the submodule pyplot. In a Python program the matplotlib library is imported as:
import matplotlib.pyplot as plt
df = pd.read_csv('tips.csv')
df.head()
Bar plot: A bar chart or bar graph presents categorical data with rectangular bars whose heights or lengths are proportional to the values that they represent. The bars can be plotted vertically or horizontally.
x=df['day']
y=df['total_bill']
plt.bar(x,y)
plt.title('Tips Data set')
plt.xlabel('Day')
plt.ylabel('Total Bill')
plt.show( )
Scatter plot: seaborn's relplot( ) shows the relationship between two numeric variables.
# create visualization of total bill vs. tip
import seaborn as sns
sns.relplot(data=df, x="total_bill", y="tip")
plt.show( )
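The original heading here mentioned a histogram; a minimal seaborn sketch on the same data (histplot requires seaborn 0.11 or newer):
# Histogram of the total bill amounts
sns.histplot(data=df, x="total_bill", bins=20)
plt.show( )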
Theory:
Tableau is a Business Intelligence tool for visually analyzing data. Users can create and distribute interactive, shareable dashboards that depict the trends, variations, and density of the data in the form of graphs and charts. Tableau can connect to files, relational databases, and Big Data sources to acquire and process data. It is used by businesses, academic researchers, and many government organizations for visual data analysis, and it is positioned as a leader in Gartner's Magic Quadrant for Business Intelligence and Analytics Platforms. Its key strengths include:
Speed of Analysis
Self-Reliant
Visual Discovery
Blend Diverse Data Sets .
Real-Time Collaboration
Centralized Data
There are three basic steps involved in creating any Tableau data visualization:
[1] Connect to a data source: Tableau can connect to various data sources, from text and Excel files to databases and big data queries. To connect, select the data source type from the Connect pane, or use the Data menu. For example: Connect -> Microsoft Excel -> select the Microsoft Excel data file Sample-store.xls.
Example : Find the Sales and Profit Values over the years
1. Drag Order Date from Dimensions and Sales from Measures to Rows.
2. Right click on the green Sales Pill, and select Discrete, in place of Continuous, since
we want the explicit values and not the bar graphs.
3. Finally, drag Profit onto the ‘abc’ column to display the Profit values alongside Sales.