Lab Manual

Assignment No : 01

Aim: Hadoop installation on a) a single node b) multiple nodes

Theory: Single-node Hadoop installation

Prerequisites:
 GNU/Linux is supported as a development and production platform.
 Java must be installed.
 ssh must be installed and sshd must be running in order to use the Hadoop scripts that manage remote Hadoop daemons.
 Windows is also a supported platform; for this manual it requires VirtualBox with a Linux operating system.
 Internet connectivity is required to install the software.

Introduction:
Apache Hadoop is an open-source software framework written in Java for distributed storage and
distributed processing of very large data sets on computer clusters built from commodity hardware. The
core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and
a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across
nodes in a cluster. To process data, Hadoop transfers packaged code to the nodes so that each node
processes, in parallel, the data it stores locally. This approach takes advantage of data locality.
The Apache Hadoop framework is composed of the following modules:
 Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
 Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on
commodity machines, providing very high aggregate bandwidth across the cluster;
 Hadoop YARN – a resource-management platform responsible for managing computing
resources in clusters and using them for scheduling of users' applications; and
 Hadoop MapReduce – an implementation of the MapReduce programming model for large
scale data processing.

Software installation: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

Steps:

[1] Java Installation: Java is the main prerequisite for Hadoop.
First, update the package index and verify whether Java already exists on your system:

> sudo apt-get update
> java -version

Install Java:

> sudo apt-get install default-jdk

[2] Install ssh (secure shell): Hadoop requires SSH access to manage its nodes.

Hadoop uses SSH to access its nodes, which would normally require the user to enter a password.
This requirement can be eliminated by creating and setting up SSH keys with the following commands.
If asked for a filename, just leave it blank and press Enter to continue
(the key is written to /home/hadoop_user/.ssh/id_rsa.pub).

> ssh-keygen -t rsa -P ""

> cat /home/hadoop_user/.ssh/id_rsa.pub >> /home/hadoop_user/.ssh/authorized_keys

Hadoop installation: Download a recent stable release from one of the Apache Download Mirrors:
https://www.apache.org/dyn/closer.cgi/hadoop/common/
http://www.eu.apache.org/dist/hadoop/common/stable/hadoop-2.7.1.tar.gz

[1] Copy and extract hadoop-2.7.1.tar.gz into the home folder.

[2] Hadoop Configuration:

There are three modes in which you can start a Hadoop cluster:
1) Local (Standalone) Mode
2) Pseudo-Distributed Mode
3) Fully Distributed Mode

Hadoop is configured by default to run in non-distributed mode as a single Java process, which is
useful for debugging. In pseudo-distributed mode, each Hadoop daemon runs in a separate Java
process on a single node.

Fully distributed mode runs on two or more machines in a cluster.

Setup Configuration Files : The following files will have to be modified to complete the Hadoop setup:

1) /usr/local/hadoop/etc/hadoop/hadoop-env.sh
2) /usr/local/hadoop/etc/hadoop/core-site.xml
3) /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
4) /usr/local/hadoop/etc/hadoop/hdfs-site.xml
5) ~/.bashrc

Before editing the .bashrc file in our home directory, we need to find the path where Java has been
installed, in order to set the JAVA_HOME environment variable, using the following command:

> readlink -f /usr/bin/javac

[3] Edit core-site.xml: The /usr/local/hadoop/etc/hadoop/core-site.xml file contains configuration
properties that Hadoop uses when starting up.

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
</property>

</configuration>
[4] hdfs-site.xml:
Create Namenode and Datanode directories in hadoop hdfs cluster
> sudo mkdir -p /usr/local/HADOOP_STORE/hdfs/namenode
> sudo mkdir -p /usr/local/HADOOP_STORE/hdfs/datanode
> sudo chown -R hduser:hadoop /usr/local/HADOOP_STORE

Edit hdfs-site.xml :

<configuration>

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/HADOOP_STORE/hdfs/namenode</value>
</property>
<property>

<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/HADOOP_STORE/hdfs/datanode</value>
</property>

</configuration>

[5] Edit mapred-site.xml :


<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

[6] Edit hadoop-env.sh and add the JAVA_HOME path


Example : export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

Adding the above statement in the hadoop-env.sh file ensures that the value of JAVA_HOME
variable will be available to Hadoop whenever it is started up.

[7] Edit ~/.bashrc file :

Edit ~/.bashrc file and add Java path, hadoop path

For example :

> gedit ~/.bashrc

# set to the root of your Java installation


# set to the root of your Hadoop installation
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

Try the following commands:

> echo $JAVA_HOME
# displays the Java path
> echo $HADOOP_HOME
# displays the Hadoop path
> java -version
# displays the Java version, e.g. 1.8.0
> hadoop version
# displays the Hadoop version, e.g. 2.9.0
> bin/hadoop
# displays the usage documentation for the hadoop script

[8] Format Namenode:


> hadoop namenode -format
Or
> hdfs namenode -format

[9] Start hadoop:

>start-all.sh
[10] Check that all processes have started:
> jps

Conclusion: We have studied single-node Hadoop installation and configured Hadoop on Ubuntu.

Reference: http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/SingleCluster.html
Group A: Assignments based on Hadoop

Assignment No: 2

Aim:
Design a distributed application using MapReduce (in Java) which processes a log
file of a system. List out the users who have logged in for the maximum period on the
system. Use a simple log file from the Internet and process it using pseudo-distributed
mode on the Hadoop platform.

Prerequisites:
Ensure that Hadoop is installed, configured and is running.
 Single Node Setup.

Theory:

What is MapReduce?
MapReduce is a processing technique and a programming model for distributed
computing based on Java. The MapReduce algorithm contains two important tasks,
namely Map and Reduce. Map takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key/value pairs). The
reduce task takes the output from a map as its input and combines those data tuples
into a smaller set of tuples. As the name MapReduce implies, the reduce task is always
performed after the map job.

The Algorithm
MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.

 Map stage : The map or mapper’s job is to process the input data. Generally
the input data is in the form of a file or directory and is stored in the Hadoop
file system (HDFS). The input file is passed to the mapper function line by
line. The mapper processes the data and creates several small chunks of data.
 Reduce stage : This stage is the combination of the Shuffle stage and the
Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be
stored in the HDFS.

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.

Inputs and Outputs :


The MapReduce framework operates exclusively on <key, value> pairs, that is,
the framework views the input to the job as a set of <key, value> pairs and
produces a set of <key, value> pairs as the output of the job, conceivably of
different types.
The key and value classes have to be serializable by the framework and
hence need to implement the Writable interface. Additionally, the key classes
have to implement the WritableComparable interface to facilitate sorting by the
framework.

Input and Output types of a MapReduce job:


(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Input (Sample) :
10.223.157.186 - - [15/Jul/2009:20:50:32 -0700] "GET /assets/js/the-associates.js
HTTP/1.1" 304 -
10.223.157.186 - - [15/Jul/2009:20:50:33 -0700] "GET /assets/img/home-logo.png
HTTP/1.1" 304 -
10.223.157.186 - - [15/Jul/2009:20:50:33 -0700] "GET /assets/img/dummy/primary-
news-2.jpg HTTP/1.1" 304 -

Output (Sample):
$ hadoop dfs -cat output/part-r-00000
10.1.1.236 7
10.1.181.142 14
10.1.232.31 5
10.10.55.142 14
10.102.101.66 1
10.103.184.104 1
10.103.190.81 53
Implementation:
1. Mapper :

public void map( Object key, Text value, OutputCollector<Text, IntWritable> output,
                 Reporter rep) throws IOException
{
    // Each input record is one log line; splitting on '-' makes the
    // first token the client IP address.
    String line = value.toString();
    String[] map_data = line.split("-");

    // Emit <clientIP, 1> for every log line.
    output.collect(new Text(map_data[0]), new IntWritable(1));
}

The Mapper is implemented via the map method. The Mapper processes one line at a time,
as provided by the specified TextInputFormat. It splits each line on the '-' character using the
split method of String and takes the first token, which is the client IP. The Mapper then emits
a key-value pair of <clientIP, 1>; the value 1 is hard-coded using new IntWritable(1).

For each line of the log file, the map emits one such pair, for example:

<10.1.1.236 , 1>
<10.1.181.142 , 1>
<10.1.232.31 , 1>
<10.10.55.142 , 1> … till end of file

The output of each map is passed through the local combiner (which is same as the Reducer
as per the job configuration) for local aggregation, after being sorted on the keys.

public void reduce(Text key, Iterator<IntWritable> values,
                   OutputCollector<Text, IntWritable> output, Reporter rep) throws IOException
{
    int count = 0;

    // Sum the occurrence counts for this key (client IP).
    while (values.hasNext())
    {
        IntWritable i = values.next();
        count = count + i.get();
    }

    // Emit the total count once per key.
    output.collect(key, new IntWritable(count));
}
The Reducer implementation, via the reduce method just sums up the values, which are the
occurrence counts for each key.
Thus the output of the job is:

10.1.1.236 7
10.1.181.142 14
10.1.232.31 5
10.10.55.142 14
10.102.101.66 1
10.103.184.104 1
10.103.190.81 53……Till end of job
The main method specifies various parameters of the job, such as the input/output paths
(passed via the command line), key/value types, input/output formats etc., in the JobConf. It then
calls JobClient.runJob to submit the job and monitor its progress.

public class testdriver {

    public static void main(String[] args) throws Exception {

        JobConf conf = new JobConf(testdriver.class);

        // Input and output paths are taken from the command line.
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // mapper1 and reducer1 are the classes containing the map and
        // reduce methods shown above.
        conf.setMapperClass(mapper1.class);
        conf.setReducerClass(reducer1.class);

        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
    }
}

Conclusion:
In this assignment we have studied how to develop a distributed application using MapReduce.
Group A: Assignments based on Hadoop

Assignment No: 3

Aim: Study of Hive

Title : Write an application using HiveQL for flight information system which will include
a. Creating, Dropping, and altering Database tables.
b. Creating an external Hive table.
c. Load table with data, insert new values and field in the table, Join tables with Hive
d. Create index on Flight Information Table
e. Find the average departure delay per day in 2008.

Prerequisites:
- Ensure that Hadoop is installed, configured and is running.
- Single Node Setup.
- Hive installed and working properly
- HBase installed and working properly

Theory:

Hive : Hive is a data warehouse infrastructure tool used to process data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.

Steps :

[1]Start Hadoop
[2] Start Hive
[3] Create a database
Example :
Hive>CREATE DATABASE ourfirstdatabase;

Hive> USE ourfirstdatabase;

[4] Create a table


hive >CREATE TABLE our_first_table
(
FirstName STRING,
LastName STRING,
EmployeeId INT
);
Examples:

hive> DROP DATABASE ourfirstdatabase CASCADE;

[5] Download the flight data sets for 2007 & 2008 from:

http://stat-computing.org/dataexpo/2009/the-data.html
Dataset description: Different fields in the flight data set are

Name Description
1 Year 1987-2008
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes
Create a table:
CREATE TABLE IF NOT EXISTS FlightInfo2007
(
Year SMALLINT, Month TINYINT, DayofMonth TINYINT,
DayOfWeek TINYINT,
DepTime SMALLINT, CRSDepTime SMALLINT, ArrTime SMALLINT,CRSArrTime SMALLINT,
UniqueCarrier STRING, FlightNum STRING, TailNum STRING,
ActualElapsedTime SMALLINT, CRSElapsedTime SMALLINT,
AirTime SMALLINT, ArrDelay SMALLINT, DepDelay SMALLINT,
Origin STRING, Dest STRING,Distance INT,
TaxiIn SMALLINT, TaxiOut SMALLINT, Cancelled SMALLINT,
CancellationCode STRING, Diverted SMALLINT,
CarrierDelay SMALLINT, WeatherDelay SMALLINT,
NASDelay SMALLINT, SecurityDelay SMALLINT,
LateAircraftDelay
SMALLINT)
COMMENT 'Flight InfoTable'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

Load Data into table:

hive> load data local inpath '/home/hduser/Desktop/2007.csv' into table FlightInfo2007;

hive> CREATE TABLE IF NOT EXISTS FlightInfo2008 LIKE FlightInfo2007;


hive> load data local inpath '/home/hduser/Desktop/2008.csv' into table FlightInfo2008;

hive> CREATE TABLE IF NOT EXISTS myFlightInfo (


Year SMALLINT, DontQueryMonth TINYINT, DayofMonth
TINYINT, DayOfWeek TINYINT, DepTime SMALLINT, ArrTime SMALLINT,
UniqueCarrier STRING, FlightNum STRING,
AirTime SMALLINT, ArrDelay SMALLINT, DepDelay SMALLINT,
Origin STRING, Dest STRING, Cancelled SMALLINT,
CancellationCode STRING)
COMMENT 'Flight InfoTable'
PARTITIONED BY(Month TINYINT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' ;
hive> CREATE TABLE myflightinfo2007 AS SELECT Year, Month, DepTime, ArrTime,
FlightNum, Origin, Dest FROM FlightInfo2007 WHERE (Month = 7 AND DayofMonth =
3) AND (Origin='JFK' AND Dest='ORD');

hive>SELECT * FROM myFlightInfo2007;

hive> CREATE TABLE myFlightInfo2008 AS SELECT Year, Month, DepTime, ArrTime,


FlightNum, Origin, Dest FROM FlightInfo2008 WHERE (Month = 7 AND DayofMonth =
3) AND (Origin='JFK' AND Dest='ORD');
hive> SELECT * FROM myFlightInfo2008;

JOIN
Hive>SELECT m8.Year, m8.Month, m8.FlightNum, m8.Origin, m8.Dest, m7.Year, m7.Month,
m7.FlightNum, m7.Origin, m7.Dest FROM myFlightinfo2008 m8 JOIN myFlightinfo2007
m7 ON m8.FlightNum=m7.FlightNum;

hive> SELECT m8.FlightNum,m8.Origin,m8.Dest,m7.FlightNum,m7.Origin,m7.Dest FROM


myFlightinfo2008 m8 FULL OUTER JOIN myFlightinfo2007 m7 ON
m8.FlightNum=m7.FlightNum;

hive>SELECT
m8.Year,m8.Month,m8.FlightNum,m8.Origin,m8.Dest,m7.Year,m7.Month,m7.FlightNum,
m7.Origin,m7.Dest FROM myFlightinfo2008 m8 LEFT OUTER JOIN myFlightinfo2007
m7 ON m8.FlightNum=m7.FlightNum;

hive> CREATE INDEX f08_index ON TABLE flightinfo2008 (Origin) AS 'COMPACT' WITH DEFERRED REBUILD;

hive> ALTER INDEX f08_index ON flightinfo2008 REBUILD;

hive>SHOW INDEXES ON FlightInfo2008;

hive> SELECT Origin, COUNT(1) FROM flightinfo2008 WHERE Origin = 'SYR' GROUP BY
Origin;

hive> DESCRIBE default__flightinfo2008_f08_index__;

hive> CREATE VIEW avgdepdelay AS SELECT DayOfWeek, AVG(DepDelay) FROM FlightInfo2008 GROUP BY DayOfWeek;

hive> SELECT * FROM avgdepdelay;

3 8.289761053658728
6 8.645680904903614
1 10.269990244459473
4 9.772897177836702
7 11.568973392595312
2 8.97689712068735
5 12.158036387869656

Day 5 (Friday, since day 1 is Monday) shows the highest average departure delay in these results.

Conclusion: We have studied Hive for big data analysis.


Group B: Assignments based on Data Analytics using Python

Assignment no – 01

Aim : Study various operations on dataset using python

Title : Perform the following operations using Python on the Facebook metrics data
sets :
a. Create data subsets
b. Merge Data
c. Sort Data
d. Transposing Data
e. Shape and reshape Data

Prerequisites: Ensure that Python and pandas are installed, configured and running.


Theory:
Python for Data Analysis:
Pandas:

Pandas is a Python library used for working with data sets. It is designed for working with
labeled or relational data easily and intuitively. It provides data structures (Series, DataFrame)
and functions for analyzing, cleaning, exploring, and manipulating data. The pandas library is
built on top of the NumPy library.

Advantages:
 Fast and efficient for data manipulation and data analysis.
 Data from different sources can be loaded
 It is easier to handle missing data ( Data represented as NaN)
 In pandas data frame columns can be easily inserted or deleted
 Data set merging
 Provides functions for reshaping and pivoting of data sets

After pandas installed in the system, you need to import the pandas library as:

import pandas as pd

Pandas provide two data structures for manipulating data.


1) Series and
2) DataFrame

The Pandas Series: A Pandas Series is a one-dimensional labeled array. It can hold
data of any type( integer, string, float, python object etc) . It can be created from a list
or array as follows:
data = pd.Series( [0.25, 0.5, 0.75, 1.0] )

DataFrame:

Pandas DataFrame is a two-dimensional tabular data structure. Pandas DataFrame can


be created by loading data from different sources like data from CSV file, Excel file,
SQL database.

Example:

import pandas as pd
df=pd.DataFrame({"Rollno":[1,2,3], "Name":["Sunil","Nitin","Ajay"]})
print(df)

Data Loading: The first step in any data science project is to import data. pandas
provides functions such as read_csv( ), read_table( ) and read_excel( ) for reading data into a
DataFrame object.

Reading data from a csv file: To read data from a csv file, pandas read_csv() is
used.

Example:

df = pd.read_csv('dataset_Facebook.csv', sep=';')

Create a subset: There are different ways to create data subset.


We can create a subset of a Python dataframe using following functions:
1) loc()
2) iloc()

The loc[] indexer lets us create a subset of a DataFrame by selecting specific rows and
columns by label, or a combination of both. The iloc[] indexer creates a subset by choosing
specific rows and columns based on their integer positions. In short, loc[] works on labels,
while iloc[] works on index positions.

Example:
df.loc[[0, 2]] # returns the rows with labels 0 and 2
df.iloc[0:3, 2] # returns rows 0 to 2 and the column at position 2
Example:
# create subset1 with data from rows 1 to 5 and columns 'Category', 'like', 'share',
# 'Type' of the Facebook metrics data set

subset1 = df.loc[1:5,['Category','like','share','Type']]

[2] Combining and Merging Data Sets: Data contained in pandas objects
can be combined using the following pandas functions:
 merge( ) : connects rows in DataFrames based on one or more keys.
 concat( ) : concatenates or stacks together objects along an axis.
Syntax: merge_set = pd.concat([subset1, subset2], axis=0)
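A minimal sketch of both functions, assuming the Facebook metrics file and the column names
used above (subset2 is a second, illustrative subset):

import pandas as pd

df = pd.read_csv('dataset_Facebook.csv', sep=';')
subset1 = df.loc[1:5, ['Category', 'like', 'share', 'Type']]
subset2 = df.loc[6:10, ['Category', 'like', 'share', 'Type']]

# concat: stack the two subsets one below the other (rows are appended)
merge_set = pd.concat([subset1, subset2], axis=0)

# merge: join the two subsets on a common key column
joined = pd.merge(subset1, subset2, on='Type', how='inner')
print(merge_set.shape, joined.shape)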

[3] Sort Data: The pandas sort_values() function sorts a data frame in ascending or
descending order of the passed column(s).
Syntax:
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False,
kind='quicksort', na_position='last')

Example: Sort data in ascending order of 'like' column in Facebook metric dataset

df.sort_values(by='like',ascending=True)
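sort_values() also accepts a list of columns and per-column directions; a small hedged sketch
using columns from the same data set:

import pandas as pd

df = pd.read_csv('dataset_Facebook.csv', sep=';')

# Sort by 'Type' ascending, then by 'like' descending within each Type
# (column names taken from the Facebook metrics data set above).
df_sorted = df.sort_values(by=['Type', 'like'], ascending=[True, False])
print(df_sorted[['Type', 'like']].head())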

[4] Transposing Data: DataFrame.transpose( ) converts rows into columns and columns into rows.
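A brief illustrative example (df.T is shorthand for df.transpose(); the file name follows the
earlier read_csv example):

import pandas as pd

df = pd.read_csv('dataset_Facebook.csv', sep=';')

# Rows become columns and columns become rows.
transposed = df.head(3).transpose() # same as df.head(3).T
print(transposed)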

[5] Shape and reshape Data: The pandas library provides the melt( ) and
pivot( ) functions.

melt( ) : The pandas melt( ) function is used to change the DataFrame format from
wide to long. melt( ) creates a specific format of the DataFrame object in which
one or more columns work as identifiers. All the remaining columns are treated
as values and are unpivoted to the row axis, leaving only two columns: variable
and value.

Syntax : pandas.melt(frame, id_vars =None, value_vars = None, var_name=None,


value_name='value', col_level=None)

pivot( ) : We can use the pivot() function to unmelt a DataFrame object and get the
original dataframe back. The pivot() 'index' parameter should be the same as the
'id_vars' value, and the 'columns' value should be the name of the 'variable'
column.

Syntax:

df_unmelted = df_melted.pivot(index='ID', columns='Name of variable column')
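A minimal sketch (with made-up column names) showing melt( ) followed by pivot( ) to recover
the original layout:

import pandas as pd

# Small, hypothetical wide-format frame.
wide = pd.DataFrame({'ID': [1, 2], 'Math': [80, 90], 'Science': [75, 85]})

# Wide to long: 'ID' stays as the identifier; the remaining columns
# become (variable, value) pairs.
long_df = pd.melt(wide, id_vars=['ID'], var_name='Subject', value_name='Marks')

# Long back to wide: index matches id_vars, columns come from the
# 'variable' column used in melt().
unmelted = long_df.pivot(index='ID', columns='Subject', values='Marks').reset_index()
print(long_df)
print(unmelted)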

Conclusion: We have studied various operations on the Facebook metrics dataset using Python.
Group B: Assignments based on Data Analytics using Python

Assignment no – 02

Aim : Study Data Preprocessing and model building using python

Title : Perform the following operations using Python on the Air quality and Heart
Diseases data sets :

a. Data cleaning
b. Data integration
c. Data transformation
d. Error correcting
e. Data model building

Prerequisites: Ensure that Python, pandas and sklearn are installed, configured and running.

Theory:
Data Cleaning: Data cleaning is the process of fixing or removing incorrect, corrupted,
incorrectly formatted, duplicate, or incomplete data within a dataset.

The pandas df.isna() method returns True where a value is null. We can clean the data by adding
and dropping the needed and unwanted data: we can drop columns which do not provide much
information, or drop rows where data is not available.

Example: [1] Dropping columns which are not required:

df = df.drop( [list of columns which are not required] , axis = 1 )

df = df.drop( ['stn_code', 'station'] , axis = 1 )

[2] Dropping rows /columns with Null values: Pandas dropna( ) method allows
us to drop rows/columns with Null values.

Syntax: DataFrameName.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
axis: an int/string value for rows/columns (0 = row, 1 = column).
inplace: a boolean which makes the changes in the data frame itself if True.
Example:
df = df.dropna( )
df = df.dropna(subset=['date'])
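A hedged sketch of such a cleaning pass, assuming an air quality CSV containing the 'stn_code',
'station' and 'date' columns mentioned above (the file name is illustrative):

import pandas as pd

# 'airquality.csv' and the column names are assumptions for illustration.
df = pd.read_csv('airquality.csv')

# Count missing values per column.
print(df.isna().sum())

# Drop columns that add little information, then rows without a date.
df = df.drop(['stn_code', 'station'], axis=1)
df = df.dropna(subset=['date'])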
Data transformation: Transforms data from one form to another. Data transformation
includes Removing Duplicates , Replacing missing Values, Data binning, handling
categorical values , Transforming Data Using a Function or Mapping .
[ I ] Removing Duplicates: Duplicate rows may be found in a DataFrame.
drop_duplicates( ) is used to remove duplicate rows.

Example:
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})

k1 k2
0 one 1
1 one 1
2 one 2
3 two 3
4 two 3
5 two 4
6 two 4
The DataFrame method duplicated returns a boolean Series indicating whether each
row is a duplicate or not:

data.duplicated( )

0 False
1 True
2 False
3 False
4 True
5 False
6 True

drop_duplicates( ) returns a DataFrame keeping only the rows for which duplicated( ) is False:


data.drop_duplicates()
k1 k2
0 one 1
2 one 2
3 two 3
5 two 4
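drop_duplicates( ) can also consider only a subset of columns; a small sketch on the same frame:

from pandas import DataFrame

data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})

# Keep only the first row for each distinct value of 'k1'.
print(data.drop_duplicates(['k1']))

# keep='last' retains the last occurrence instead of the first.
print(data.drop_duplicates(['k1'], keep='last'))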
[ II ] Transforming Data Using a Function or Mapping: For some data sets, you may wish
to perform a transformation based on the values in an array, a Series, or a
column in a DataFrame.
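For instance, a hedged sketch that maps coded values in a column to readable labels (the frame
and the mapping are made up for illustration):

import pandas as pd

# Hypothetical frame with a coded column.
df = pd.DataFrame({'sex': [0, 1, 0, 1]})

# map() applies a dict (or a function) element-wise to a Series.
df['sex_label'] = df['sex'].map({0: 'female', 1: 'male'})

# apply() applies a function to each value of a column.
df['sex_upper'] = df['sex_label'].apply(str.upper)
print(df)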

[ III] Replacing missing Values: fillna( ) is used to fill in missing values.


replace( ) method provides simple and flexible way to replace values.
Example: consider a series: data = Series( [1 , -999 , 2, -999, -1000, 3 ])
The -999 values might be sentinel values for missing data. To replace
these with NA values that pandas understands, we can use replace,
producing a new Series

data.replace(-999, np.nan )

If you want to replace multiple values at once, you instead pass a list then the
substitute value:
data.replace([-999, -1000], np.nan)
To use a different replacement for each value, pass a list of substitutes:
data.replace([-999, -1000], [np.nan, 0])
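Data binning and handling of categorical values were listed above among the transformation
steps; a brief hedged sketch of both using pandas (the values are made up):

import pandas as pd

ages = pd.Series([22, 35, 58, 45, 13, 70])

# Binning: group continuous ages into labelled intervals.
age_bins = pd.cut(ages, bins=[0, 18, 40, 60, 100], labels=['child', 'young', 'middle', 'senior'])

# Categorical values: one-hot encode the binned column.
encoded = pd.get_dummies(age_bins, prefix='age')
print(encoded.head())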

Data model building: Steps for building a machine learning model are:
1) load the dataset, 2) data preprocessing, 3) split the data into train and test sets,
4) model building, 5) train the model, 6) test the model, 7) evaluate model performance.
Example: Build a machine learning model for heart disease prediction.
Use heart.csv data file.

import pandas as pd
df = pd.read_csv('heart.csv')

# Perform data preprocessing and build the model.
# NOTE: 'target' is assumed here to be the name of the label column in heart.csv.
X = df.drop('target', axis=1)
y = df['target']

from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import confusion_matrix, accuracy_score

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

# Train the model using the training sets
clf.fit(X_train, y_train)

# Predict the response for test dataset
y_pred = clf.predict(X_test)

# Evaluate model performance
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

Conclusion: We have learned various methods for data preprocessing and model building using Python.
Assignment no – 03

Aim : Data Visualization using python

Title : Visualize the data using Python libraries matplotlib, seaborn by plotting the graphs

Prerequisites: Ensure that Python, pandas, sklearn, matplotlib and seaborn are installed,
configured and running.

Theory:
Data Visualization: Data visualization refers to the techniques used to communicate
data or information by encoding it as visual objects (e.g., points, lines or bars) contained
in graphics. It involves the creation and study of the visual representation of data.
Making plots and static or interactive visualizations is one of the most important tasks
in data analysis.

Types of Data Visualization: Data can be visualized in the form of 1D, 2D or3D
structure. Different types of data visualization are:
 1D/Linear
 2D (Planar) Data Visualization
 3D (Volumetric) Data Visualization
 Temporal
 Multidimensional: Number of dimensions is used for visualization. Examples: Pie
Chart, histogram, tag cloud, bar chart, scatter plot, heat map etc
 Tree/Hierarchical
 Network data Visualization

Python libraries for data visualization: The Python matplotlib and seaborn libraries are used for
data visualization.
Matplotlib: Matplotlib is a multi-platform data visualization library built on NumPy
arrays. It was conceived by John Hunter in 2002 and is used for creating plots. Most of
the functionality is in the submodule pyplot. In a Python program, the matplotlib library
is imported as:
import matplotlib.pyplot as plt

# Example 1: Draw a line from (0,0) to (6,200)
import numpy as np

# initialize the data
x = np.array([0, 6])
y = np.array([0, 200])
plt.plot(x, y)
plt.show()
Reading a data set:
# reading tips.csv
import pandas as pd

df = pd.read_csv('tips.csv')
df.head()

Bar plot: A bar chart or bar graph presents categorical data with rectangular bars. The
heights or lengths of the bars are proportional to the values that they represent. The bars
can be plotted vertically or horizontally.

# initialize the data

x=df['day']
y=df['total_bill']

#plotting the data

plt.bar(x,y)
plt.title('Tips Data set')
plt.xlabel('Day')
plt.ylabel('Total Bill')
plt.show( )
Histogram:
# Histogram of the total bill values
plt.hist(df['total_bill'], bins=4, color='green', edgecolor='red')
plt.ylabel('Frequency')
plt.xlabel('Total Bill')
plt.show( )

Seaborn: Seaborn is an open-source Python library. It is built on top of matplotlib.


Seaborn is used for data visualization and exploratory data analysis. Seaborn works
easily with dataframes and the Pandas library. The graphs created can also be
customized easily.

Example Seaborn plot:

import seaborn as sns


# load dataset
tips = sns.load_dataset("tips")

# create visualization
sns.relplot( data =tips , x="total_bill", y="tip")
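One more hedged example with the same tips dataset, drawing a categorical box plot:

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# Distribution of the total bill for each day of the week.
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()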

Conclusion: We have learned data visualization using the Python matplotlib and seaborn libraries.
Assignment no – 04

Aim : Data Visualization using Tableau

Title: Perform the following data visualization operations using Tableau


on Adult and Iris datasets / superstore dataset.
a. 1D (Linear) Data visualization
b. 2D (Planar) Data Visualization
c. 3D (Volumetric) Data Visualization
d. Temporal Data Visualization
e. Multidimensional Data Visualization
f. Tree/ Hierarchical Data visualization
g. Network Data visualization

Prerequisites: Ensure that Tableau is installed and running.

Theory:
Tableau is a Business Intelligence tool for visually analyzing data. Users can create
and distribute interactive and shareable dashboards, which depict the trends,
variations, and density of the data in the form of graphs and charts. Tableau can connect
to files, relational databases and Big Data sources to acquire and process data. It is used by
businesses, academic researchers, and many government organizations for visual data
analysis. It is also positioned as a leader in Gartner's Magic Quadrant for Business
Intelligence and Analytics Platforms.

Tableau Features: Tableau features are:

 Speed of Analysis
 Self-Reliant
 Visual Discovery
 Blend Diverse Data Sets .
 Real-Time Collaboration
 Centralized Data

Tableau products are:


 Tableau Desktop: Made for individual use
 Tableau Server: Collaboration for any organization
 Tableau Online: Business Intelligence in the Cloud
 Tableau Reader: Let you read files saved in Tableau Desktop.
 Tableau Public: For anyone to publish interactive data online
Tableau Terminologies:
 Workbook : The name of a Tableau file that holds visualizations. It usually takes
the format of .twb or .twbx.
 Data source : A single data table compiled in Tableau. It is the source of the data
that individual visualizations will use as their basis. It can be created from a
single source connection or multiple of them, through joining or union.
 Sheet : A single tab within a workbook that holds a single visualization

Starting Tableau: Tableau runs on the Windows operating system.

Start → Programs → Tableau (or click on the Tableau shortcut on the desktop)

There are three basic steps involved in creating any Tableau data visualization:

1) Connect to a data source − this involves locating the data and using an
appropriate type of connection to read the data.
2) Choose dimensions and measures − this involves selecting the required
columns from the source data for analysis.
3) Apply visualization technique − this involves applying required visualization
methods, such as a specific chart or graph type to the data being analyzed.

[1] Connect to data source: Tableau can connect to various data sources, from text and Excel
files to databases and big data queries. To connect to a data source, select the data source type
from Connect, or use the Data menu. Connect → To Microsoft Excel → select the Microsoft Excel
data file Sample-store.xls

[2] Drag the data sheet onto the "Drag sheets here" area.


Data Visualizations:
Go to Worksheet. A worksheet is where you make all of your graphs, so click on that
tab to open the worksheet screen.

Visualization in Tableau is done by dragging and dropping Measures and Dimensions onto
the different shelves. Blue pills represent Discrete fields and green pills represent
Continuous fields.

Example : Find the Sales and Profit Values over the years

1. Drag Order Date from Dimensions and Sales from Measures to Rows.

2. Right click on the green Sales Pill, and select Discrete, in place of Continuous, since
we want the explicit values and not the bar graphs.
3. Finally, drag Profit onto the 'abc' column to get the result.

Conclusion : We have studied Data visualization using Tableau
