KNOWLEDGE INSTITUTE OF TECHNOLOGY

(An Autonomous Institution)

(Accredited by NAAC & NBA, Approved by AICTE, New Delhi and Affiliated to Anna University, Chennai)

KAKAPALAYAM (PO), SALEM – 637 504

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

LABORATORY MANUAL

FOR

CCS346 – EXPLORATORY DATA ANALYSIS LABORATORY

REGULATION 2021

VISION, MISSION, PEOs AND PSOs OF CSE DEPARTMENT

VISION
To create globally competent software professionals with social values to cater to the ever-
changing industry requirements.
MISSION
M1 To provide appropriate infrastructure to impart need-based technical education
through effective teaching and research
M2 To involve the students in collaborative projects on emerging technologies to fulfill
the industrial requirements
M3 To render value-based education to students so that they can make better engineering
decisions with social consciousness and meet global standards
M4 To inculcate leadership skills in students and encourage them to become globally
competent professionals
Programme Educational Objectives (PEOs)
The graduates of Computer Science and Engineering will be able to
PEO1 Pursue Higher Education and Research or have a successful career in industries
associated with Computer Science and Engineering, or as Entrepreneurs
PEO2 Ensure that graduates will have the ability and attitude to adapt to emerging
technological changes
PEO3 Acquire leadership skills to perform professional activities with social
consciousness
Programme Specific Outcome (PSOs)
The graduates will be able to
PSO1 The students will be able to analyze large volumes of data and make business
decisions that improve efficiency using different algorithms and tools
PSO2 The students will have the capacity to develop web and mobile applications for real-
time scenarios
PSO3 The students will be able to provide automation and smart solutions in various
forms to society with the Internet of Things

Course Code & Name: CCS346 & EXPLORATORY DATA ANALYSIS LABORATORY

REGULATION: R2021
YEAR/SEM: III/V

COURSE OBJECTIVES:
• To outline an overview of exploratory data analysis.
• To implement data visualization using Matplotlib.
• To perform univariate data exploration and analysis.
• To apply bivariate data exploration and analysis.
• To use data exploration and visualization techniques for multivariate and time series data.

LIST OF EXPERIMENTS:

1. Install the data Analysis and Visualization tool: R/ Python /Tableau Public/ Power BI.
2. Perform exploratory data analysis (EDA) with datasets like email data set. Export all your
emails as a dataset, import them inside a pandas data frame, visualize them and get different
insights from the data.
3. Working with Numpy arrays, Pandas data frames, and basic plots using Matplotlib.
4. Explore various variable and row filters in R for cleaning data. Apply various plot features in
R on sample data sets and visualize.
5. Perform Time Series Analysis and apply the various visualization techniques.
6. Perform Data Analysis and representation on a Map using various Map data sets with Mouse
Rollover effect, user interaction, etc.
7. Build cartographic visualization for multiple datasets involving various countries of the world;
states and districts in India etc.
8. Perform EDA on Wine Quality Data Set.
9. Use a case study on a data set and apply the various EDA and visualization techniques and
present an analysis report.

COURSE OUTCOMES
• Understand the fundamentals of exploratory data analysis.
• Implement the data visualization using Matplotlib.
• Perform univariate data exploration and analysis.
• Apply bivariate data exploration and analysis.
• Use Data exploration and visualization techniques for multivariate and time series data.

KNOWLEDGE INSTITUTE OF TECHNOLOGY

SALEM – 637 504

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CONTENTS

Ex. No. | Date | Name of the Experiment | Page No. | Mark Awarded | Signature

1. Installation and Setup of Data Analysis and Visualization Tools: R, Python, Tableau, and Power BI.
2. Perform exploratory data analysis (EDA) with datasets like email data set. Export all your emails as a dataset, import them inside a pandas data frame, visualize them and get different insights from the data.
3. Working with Numpy arrays, Pandas data frames, basic plots using Matplotlib.
4. Explore various variable and row filters in R for cleaning data. Apply various plot features in R on sample data sets and visualize.
5. Perform Time Series Analysis and apply the various visualization techniques.
6. Perform data analysis and representation on a map using various map datasets with mouse rollover effect, user interaction, etc.
7. Build cartographic visualization of multiple datasets involving various countries of the world; states, and districts in India.
8. Perform EDA on Wine Quality Data Set.
9. Use a case study on a dataset and apply the various EDA and visualization techniques and present an analysis report.

Ex No : 1	Installation and Setup of Data Analysis and Visualization Tools: R, Python, Tableau, and Power BI
Date :

Aim:
The aim of this project is to install and set up four essential data analysis and visualization tools,
namely R, Python, Tableau, and Power BI, to provide a robust environment for data analysis and
visualization tasks.
Algorithm:
Step 1: System Requirements Assessment:

- Before installation, ensure your system meets the minimum hardware and software requirements
for each tool. Check the official documentation of R, Python, Tableau, and Power BI for system
prerequisites.

Step 2: Installation of R:

- Download the R installation package from the official CRAN (Comprehensive R Archive Network)
website (https://cran.r-project.org/).

- Follow the installation wizard, choose your preferred options, and install R on your system.

Step 3: Installation of Python:

- Download the latest Python installer from the official Python website
(https://www.python.org/).

- During installation, make sure to check the option "Add Python to PATH" for easy command-line
access.

- After installation, use pip (Python package manager) to install essential data science libraries, such
as NumPy, Pandas, Matplotlib, and Jupyter Notebook, to enhance your data analysis capabilities.

Step 4: Installation of Tableau:

- Visit the official Tableau website (https://www.tableau.com/) to download the Tableau Desktop
installer.

- Follow the installation instructions and provide your licensing information or use the free trial
period.

- Ensure Tableau Desktop is successfully activated.

Step 5: Installation of Power BI:

- Go to the Microsoft Power BI website (https://powerbi.microsoft.com/) to download Power BI
Desktop.

- Download and install the Power BI Desktop application.

- You will need a Microsoft account to access certain features. Sign in or create an account if
required.

Step 6: Configuration and Integration:

- Configure R and Python with data science IDEs such as RStudio or Jupyter Notebook to harness
their full potential for data analysis.

- For Tableau and Power BI, explore their integration capabilities with databases, cloud services,
and data sources you plan to use for analysis and visualization.

Step 7: Testing and Documentation:

- Verify that all the installed tools are working correctly by creating a simple data analysis and
visualization project.

- Document the installation process, any challenges faced, and how they were overcome. This
documentation will be helpful for future reference and troubleshooting.
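
A quick way to carry out Step 7 for the Python tool chain is a short script that imports the core libraries, prints their versions, and draws a trivial plot. This is a minimal sketch, assuming the libraries from Step 3 (NumPy, Pandas, Matplotlib) were installed with pip:

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

# Print the installed versions to confirm each import works
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)
print("Matplotlib:", matplotlib.__version__)

# Draw a trivial plot; if a figure window (or inline figure) appears, the setup works
plt.plot([1, 2, 3], [1, 4, 9])
plt.title("Installation check")
plt.show()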

R:

(Screenshots of installer Steps 1–5 omitted.) Follow the on-screen prompts to completely install R.

Python:

(Screenshots of installer Steps 2–4 omitted.) Follow the on-screen prompts to completely install Python.

Tableau Public:

Step 1: Sign In to Tableau Public.

(Screenshots of Steps 2–4 omitted.) Follow the further instructions to install Tableau Public.

Power BI:

(Screenshots of Steps 1–3 omitted.) Follow the on-screen prompts to completely install Power BI.


Result:

Thus the steps to install R/Python/Tableau Public/Power BI were completed successfully.

Ex No : 2	Perform exploratory data analysis (EDA) with datasets like email data set. Export all your emails as a dataset, import them inside a pandas data frame, visualize them and get different insights from the data.
Date :

Aim :
Perform exploratory data analysis (EDA) with datasets like email data set. Export all your emails as
a dataset, import them inside a pandas data frame, visualize them and get different insights from
the data.
Algorithm
Step 1: Load the data and create a DataFrame.

Step 2: Convert the 'timestamp' column to datetime format.

Step 3: Check the data's structure and data types.

Step 4: Preview the first few rows of the DataFrame.

Step 5: Find the most active email senders.

Step 6: Visualize common keywords in email content with a word cloud.

Step 7: Analyze email frequency over time using a time series plot.

Step 8: Explore additional insights based on specific requirements.

Step 9: Display results and provide interpretations.

Step 10: Conclude the analysis.
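
Step 1 assumes the emails are already available as a flat dataset. Most mail clients can export a mailbox in mbox format (for Gmail, via Google Takeout); a minimal sketch for turning such an export into a DataFrame with Python's standard mailbox module is shown below. The file name 'my_emails.mbox' is an assumed placeholder:

import mailbox
import pandas as pd

# Open the exported mailbox file ('my_emails.mbox' is an assumed name)
mbox = mailbox.mbox('my_emails.mbox')

# Collect one row per message from the standard headers
rows = [{'sender': msg['from'],
         'receiver': msg['to'],
         'subject': msg['subject'],
         'timestamp': msg['date']} for msg in mbox]

df = pd.DataFrame(rows)
print(df.head())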

Program :
import pandas as pd

# Sample email data

emails = {

    'sender': ['john@example.com', 'mary@example.com', 'john@example.com'],

    'receiver': ['mary@example.com', 'john@example.com', 'mary@example.com'],

    'subject': ['Meeting Reminder', 'Regarding Project', 'Important Updates'],

    'timestamp': ['2023-06-28 09:30:00', '2023-06-28 14:45:00', '2023-06-29 10:15:00'],  # illustrative times

    'content': ['Dear Mary, This is a reminder for our meeting tomorrow. Regards, John',

                'Hi John, I wanted to discuss the project with you. Let\'s connect. Regards, Mary',

                'Dear Mary, Please find attached the important updates. Best, John']

}

# Create a DataFrame

df = pd.DataFrame(emails)

# Convert timestamp column to datetime

df['timestamp'] = pd.to_datetime(df['timestamp'])

# Print DataFrame information

print(df.info())

# Preview the first few rows

print(df.head())

Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sender 3 non-null object
1 receiver 3 non-null object
2 subject 3 non-null object
3 timestamp 3 non-null datetime64[ns]
4 content 3 non-null object
dtypes: datetime64[ns](1), object(4)
memory usage: 248.0+ bytes
None
sender receiver subject timestamp \
0 john@example.com mary@example.com Meeting Reminder 2023-06-28 09:30:00
1 mary@example.com john@example.com Regarding Project 2023-06-28 14:45:00
2 john@example.com mary@example.com Important Updates 2023-06-29 10:15:00

content
0 Dear Mary, This is a reminder for our meeting ...
1 Hi John, I wanted to discuss the project with ...
2 Dear Mary, Please find attached the important ...

import matplotlib.pyplot as plt

df['sender'].value_counts().plot(kind='bar')

plt.xlabel('Sender')

plt.ylabel('Email Count')

plt.title('Email Counts by Sender')

plt.show()

Output

(Bar chart of email counts per sender.)
df['timestamp'] = pd.to_datetime(df['timestamp'])

df['date'] = df['timestamp'].dt.date

email_counts = df['date'].value_counts().sort_index()

plt.plot(email_counts.index, email_counts.values)

plt.xlabel('Date')

plt.ylabel('Email Count')

plt.title('Email Counts over Time')

plt.xticks(rotation=45)

plt.show()

Output

(Line plot of daily email counts.)


most_active_senders = df['sender'].value_counts()

print(most_active_senders)

Output

john@[Link] 2
mary@[Link] 1
Name: sender, dtype: int64

from wordcloud import WordCloud

import matplotlib.pyplot as plt
# Concatenate the email content

all_content = ' '.join(df['content'])

# Generate the word cloud

wordcloud = WordCloud(width=800, height=400).generate(all_content)

# Plot the word cloud

plt.figure(figsize=(10, 5))

plt.imshow(wordcloud, interpolation='bilinear')

plt.axis('off')

plt.title('Word Cloud of Email Content')

plt.show()

Output

(Word cloud of the most frequent words in the email bodies.)

email_counts_per_day = df['date'].value_counts().sort_index()

plt.plot(email_counts_per_day.index, email_counts_per_day.values)

plt.xlabel('Date')

plt.ylabel('Email Count')

plt.title('Email Frequency Over Time')

plt.xticks(rotation=45)

plt.show()

Output :

(Line plot of email frequency per day.)
Result :
Thus, EDA on the email dataset was performed successfully.

EX NO: 3 Working with Numpy arrays, Pandas dataframes, Basic plots using Matplotlib

DATE:

Aim

To work with arrays in the numpy module, dataframes with the pandas module, and basic plots with the matplotlib
module in Python.

Algorithm for Numpy

Step 1: Start the program

Step 2: Import the required packages

Step 3: Create the 1-d array, 2-d array by using built-in methods

Step 4: Generate arrays using zeros, ones, arange and linspace.

Step 5: Check the number of dimensions, size of an array

Step 6: Compute the shape of an array and reshape an array and perform transpose of an array

Step 7: Do the required operations like slicing, iterating and splitting an array element.

Step 8: Stop the program

Numpy arrays

(i) Creating numpy 1d and 2d arrays

Program

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

print("1-d array:",arr)

print("2-d array:")

print(a)

Output

1-d array: [1 2 3 4 5 6]

2-d array:

[[ 1 2 3 4]

[ 5 6 7 8]

[ 9 10 11 12]]

(ii) Different ways of creating arrays

Program

zer = np.zeros(2)

on = np.ones(3)

ar = np.arange(2, 9, 2)

lin = np.linspace(0, 10, num=5, dtype=np.int64)

print(zer)

print(on)

print(ar)

print(lin)

Output

[0. 0.]

[1. 1. 1.]

[2 4 6 8]

[ 0  2  5  7 10]

(iii) Numpy array Dimension,Shape,Size,Transpose and Reshaping

Program

nums = np.array([[[1, 5, 2, 1],

[4, 3, 5, 6],

[6, 3, 0, 6],

[7, 3, 5, 0],

[2, 3, 3, 5]],

[[2, 2, 3, 1],

[4, 0, 0, 5],

[6, 3, 2, 1],

[5, 1, 0, 0],

[0, 1, 9, 1]],

[[3, 1, 4, 2],

[4, 1, 6, 0],

[1, 2, 0, 6],

[8, 3, 4, 0],

[2, 0, 2, 8]]])

print("Number of dimensions :",[Link])

print("Size :",[Link])

print("Shape of array :",[Link])

print("Transpose of array:",nums.T)

print("Array after reshaping :",[Link](3,4,5))

Output

Number of dimensions : 3

Size : 60

Shape of array : (3, 5, 4)

Transpose of array:

[[[1 2 3]

[4 4 4]

[6 6 1]

[7 5 8]

[2 0 2]]

[[5 2 1]

[3 0 1]

[3 3 2]

[3 1 3]

[3 1 0]]

[[2 3 4]

[5 0 6]

[0 2 0]

[5 0 4]

[3 9 2]]

[[1 1 2]

[6 5 0]

[6 1 6]

[0 0 0]

[5 1 8]]]

Array after reshaping :

[[[1 5 2 1 4]

[3 5 6 6 3]

[0 6 7 3 5]

[0 2 3 3 5]]

[[2 2 3 1 4]

[0 0 5 6 3]

[2 1 5 1 0]

[0 0 1 9 1]]

[[3 1 4 2 4]

[1 6 0 1 2]

[0 6 8 3 4]

[0 2 0 2 8]]]

(iv) Slicing and accessing arrays

Program

p = np.array([1, 2, 3, 4, 5, 6])

print(arr[2] + arr[3])  # uses arr defined earlier: 3 + 4

p[1:4]

Output

7

array([2, 3, 4])
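
Note that a basic NumPy slice such as p[1:4] is a view, not a copy: writing through the slice changes the original array. A small sketch continuing from the program above:

v = p[1:4]   # a view into p, not a copy
v[0] = 99    # writes through to p
print(p)     # [ 1 99  3  4  5  6]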
(v) Iterating Arrays

Program

import numpy as np

array1 = np.array([1, 2, 3])

for x in array1:

    print(x)

Output

1
2
3
(vi) Vstack,Hstack,split and flip functions on arrays

Program

arr1 = np.array([[1, 1],

                 [2, 2]])

arr2 = np.array([[3, 3],

                 [4, 4]])

print("Vertical Stacking :", np.vstack((arr1, arr2)))

print("Horizontal stacking :", np.hstack((arr1, arr2)))

print("Splitting of array :", np.array_split(arr1, 3))

print("Reversing of array :", np.flip(arr1))

Output

Vertical Stacking :

array([[1, 1],

[2, 2],

[3, 3],

[4, 4]])

Horizontal stacking :

array([[1, 1, 3, 3],

[2, 2, 4, 4]])

Splitting of array :

[array([[1, 1]]), array([[2, 2]]), array([], shape=(0, 2), dtype=int32)]

Reversing of array :

array([[2, 2],

[1, 1]])

(vii) Operations on array

Program

b = np.array([[0.45053314, 0.17296777, 0.34376245, 0.5510652],

[0.54627315, 0.05093587, 0.40067661, 0.55645993],

[0.12697628, 0.82485143, 0.26590556, 0.56917101]])

print("maximum :", b.max())

print("mean :", b.mean())

print("standard deviation :", b.std())

Output

maximum : 0.82485143

mean : 0.4049648666666667

standard deviation : 0.21392120766089617
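
These reductions also accept an axis argument to compute per-column or per-row statistics; a brief sketch continuing with the same array b:

print("column means :", b.mean(axis=0))  # one mean per column
print("row maxima :", b.max(axis=1))     # one maximum per row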

Algorithm for Pandas:

Step 1: Start the program

Step 2: Import the required packages

Step 3: Create a DataFrame using built-in methods and add a named index to it

Step 4: Load data into a DataFrame object, or load files (Excel/CSV) into a

DataFrame

Step 5: Display the information about the data using built-in method

Step 6: Display the first five rows and last five rows using head and tail method

Step 7: Do the operation like slicing by different slicing methods

Step 8: Stop the program

Pandas DataFrame

(i) Creating Pandas DataFrame

Program

import pandas as pd

data = {

    "Total marks": [420, 380, 390],

    "study hours": [5, 4, 4]

}

dataframe = pd.DataFrame(data)

print(dataframe)

Output

Total marks study hours

0 420 5

1 380 4

2 390 4
(ii) Creating named index

Program

dataframe1 = pd.DataFrame(data, index=['a', 'b', 'c'])

dataframe1

Output

   Total marks  study hours
a          420            5
b          380            4
c          390            4

(iii) Importing and adding into DataFrame

Program

d = pd.read_csv("C:\\Users\\ADMIN\\Downloads\\winequality-red.csv")  # file name assumed for the red wine dataset

#loading files into dataframe

df = pd.DataFrame(d)

df

Output

(First and last rows of the 1599 × 12 red wine quality DataFrame.)
(iv) Information about dataset

Program

df.info()

Output

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 1599 entries, 0 to 1598

Data columns (total 12 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 fixed acidity 1599 non-null float64

1 volatile acidity 1599 non-null float64

2 citric acid 1599 non-null float64

3 residual sugar 1599 non-null float64

4 chlorides 1599 non-null float64

5 free sulfur dioxide 1599 non-null float64

6 total sulfur dioxide 1599 non-null float64

7 density 1599 non-null float64

8 pH 1599 non-null float64

9 sulphates 1599 non-null float64

10 alcohol 1599 non-null float64

11 quality 1599 non-null int64

dtypes: float64(11), int64(1)

memory usage: 150.0 KB

(v) Basic operations on DataFrame

Program

df.head()

df.tail()

Output

(First five and last five rows of the DataFrame.)

(vi) Slicing using loc and iloc

Program

df.loc[200:220, ["residual sugar", "quality"]]

Output

(Rows 200 to 220 of the 'residual sugar' and 'quality' columns.)
Program

df.iloc[3:16, 0:7]

Output

(Rows 3 to 15 of the first seven columns.)
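
Besides label-based (loc) and position-based (iloc) slicing, rows are commonly selected with boolean masks. A short sketch on the same wine DataFrame df:

# Keep rows where quality is at least 7 and alcohol exceeds the overall mean
good = df[(df["quality"] >= 7) & (df["alcohol"] > df["alcohol"].mean())]
print(good.shape)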
Algorithm for Matplotlib:

Step 1: Start the program

Step 2: Import the required packages

Step 3: Visualize the data using simple line plot

Step 4: Create a bar plot, use the bar method and customize the appearance, labels, and title.

Step 5: Generate a scatter plot by scatter function to plot your data points and customize the size, color, and style
of the markers, as well as labels and title.

Step 6: Visualize the data using the pie chart and histogram with required parameters

Step 7: Add grid line to the plots by using built-in methods

Step 8: Stop the program

Basic plots using Matplotlib

(i) Creating of line plot

Program

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 5, 6], [1, 2, 3, 4, 6])

plt.axis([0, 7, 0, 10])

plt.show()

Output

(Simple line plot of the given points.)

(ii) Creating bar plot

Program

plt.bar([2, 3, 1, 5, 7], [300, 400, 200, 600, 700],

        label="Carpenter", color='b')

plt.legend()

plt.xlabel('Days')

plt.ylabel('Wage')

plt.title('Details')

plt.show()

Output

(Bar plot of days versus wage.)
(iii) Creating Scatter plot

Program

x1 = [1, 2.5,3,4.5,5,6.5,7]

y1 = [1,2, 3, 2, 1, 3, 4]

x2=[8, 8.5, 9, 9.5, 10, 10.5, 11]

y2=[3,3.5, 3.7, 4,4.5, 5, 5.2]

plt.scatter(x1, y1, label='high bp low heartrate', color='c')

plt.scatter(x2, y2, label='low bp high heartrate', color='g')

plt.title('Smart Band Data Report')

plt.xlabel('x')

plt.ylabel('y')

plt.legend()

plt.show()

Output

(Scatter plot of the two labelled point groups.)
(iv) Creating Pie chart

Program

slice = [12, 25, 50, 36, 19]

activities = ['NLP','Neural Network', 'Data analytics', 'Quantum Computing', 'Machine Learning']

cols = ['r', 'b', 'c', 'g', 'orange']

plt.pie(slice,

        labels=activities,

        colors=cols)

plt.title('Training Subjects')

plt.show()

Output

(Pie chart of the five training subjects.)
(v) Creating Histogram plot

Program

fig, ax = plt.subplots(1, 1)

a_new = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])

ax.hist(a_new, bins=[0, 25, 50, 75, 100])

ax.set_title("histogram of result")

ax.set_xticks([0,25,50,75,100])

ax.set_xlabel('marks')

ax.set_ylabel('no. of students')

plt.show()

Output

(Histogram of marks with four bins.)
(vi) Adding Gridlines to plot

Program

x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])

y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])

plt.title("Sports Watch Data")

plt.xlabel("Average Pulse")

plt.ylabel("Calorie Burnage")

plt.plot(x, y)

plt.grid()

plt.show()

Output

(Line plot with grid lines enabled.)

Result

Thus working with arrays in the numpy module, dataframes with the pandas module, and basic plots with the
matplotlib module in Python has been explored successfully.

Ex No : 4	Explore various variable and row filters in R for cleaning data. Apply various plot features in R on sample data sets and visualize.
DATE :

Aim:
The aim of this project is to explore data cleaning techniques in R, including variable and row filters,
and to apply various plot features to sample datasets to visualize data effectively.
Algorithm:
Step 1: Import the required libraries for data manipulation, such as dplyr for filtering and cleaning, and
ggplot2 for data visualization.

Step 2: Load one or more sample datasets for analysis. These datasets should have some missing values and
outliers for data cleaning demonstrations.

Step 3: Apply variable filters (e.g., removing unnecessary columns) to clean the dataset.

Step 4: Apply row filters (e.g., removing rows with missing values or outliers) to clean the dataset further.

Step 5: Perform EDA to gain insights into the data.

Step 6: Create summary statistics, histograms, and box plots to understand the distribution of variables.

Step 7: Use ggplot2 to create various types of plots, such as bar charts, scatter plots, and line plots, to visualize
the data.

Step 8: Display the created plots using R's built-in functionality.

Step 9: Optionally, save the plots as image files for further use or reporting.

Program:
# Load necessary libraries

library(dplyr)

library(ggplot2)

# Load sample dataset (e.g., 'mtcars')

data <- mtcars

# Data cleaning: Variable and row filters

cat("Original Data:\n")

print(data)

data_cleaned <- data %>%

select(-c(vs, am)) %>%

filter(![Link](hp), hp >= 50, hp <= 300)

cat("\nData After Cleaning:\n")

print(data_cleaned)

# Exploratory Data Analysis (EDA)

cat("\nSummary of Cleaned Data:\n")

print(summary(data_cleaned))

hist(data_cleaned$mpg, main="Histogram of MPG", xlab="MPG")

boxplot(data_cleaned$hp, main="Boxplot of HP")

# Data visualization with ggplot2

cat("\nData Visualization:\n")

gg <- ggplot(data_cleaned, aes(x = wt, y = mpg)) +

geom_point(aes(color = qsec)) +

labs(title = "Scatter Plot of MPG vs. Weight",

x = "Weight",

y = "MPG")

print(gg)

# Save the plot as an image file (optional)

ggsave("scatter_plot.png")

# Display the plot

print(gg)

Output:
Original Data:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1

Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Data After Cleaning:


mpg cyl disp hp drat wt qsec gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 3 4

Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 5 6
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 4 2

Summary of Cleaned Data:


mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.65 1st Qu.:4.000 1st Qu.:120.7 1st Qu.: 96.0
Median :19.20 Median :6.000 Median :167.6 Median :123.0
Mean :20.25 Mean :6.129 Mean :228.5 Mean :140.6
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:334.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :264.0
drat wt qsec gear
Min. :2.760 Min. :1.513 Min. :14.50 Min. :3.000
1st Qu.:3.080 1st Qu.:2.542 1st Qu.:16.96 1st Qu.:3.000
Median :3.700 Median :3.215 Median :17.82 Median :4.000
Mean :3.598 Mean :3.206 Mean :17.95 Mean :3.645
3rd Qu.:3.920 3rd Qu.:3.650 3rd Qu.:18.90 3rd Qu.:4.000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :5.000
carb
Min. :1.000
1st Qu.:2.000
Median :2.000
Mean :2.645
3rd Qu.:4.000
Max. :6.000
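
For cross-reference with the Python experiments, the same variable and row filters can be sketched with pandas. This is a hedged illustration: it assumes statsmodels is installed and fetches mtcars from the public Rdatasets repository, which requires internet access.

import statsmodels.api as sm

# Load the mtcars dataset used above (downloaded from the Rdatasets repository)
mtcars = sm.datasets.get_rdataset("mtcars").data

# Variable filter: drop the vs and am columns (R: select(-c(vs, am)))
cleaned = mtcars.drop(columns=["vs", "am"])

# Row filter: keep rows where hp is present and between 50 and 300 (R: filter(...))
cleaned = cleaned[cleaned["hp"].notna() & cleaned["hp"].between(50, 300)]

print(cleaned.describe())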

Result:
Thus the process of data manipulation and visualization using R was executed successfully.

Ex No : 5	Perform Time Series Analysis and apply the various visualization techniques
DATE :

Aim :

To write a program to implement Time Series Analysis.

Algorithm :

Step 1: Import the required libraries:

1. Pandas for data handling.
2. Numpy for mathematical processing.
3. Matplotlib for data visualization.
4. Statsmodels for time series decomposition.

Step 2: Set the random seed for reproducibility using np.random.seed().

Step 3: Set the date range using pd.date_range().

Step 4: Generate random values from 1 to 100 for each date using np.random.randint() and convert
the result into a DataFrame.

Step 5: Set the Date as index of the DataFrame using df.set_index('Date', inplace=True).

Step 6: From statsmodels import seasonal_decompose to obtain the different time series components
(trend, seasonal, residual, etc.).

Step 7: Visualize the DataFrame using Matplotlib and plt.show().

Step 8: Create a variable decomposition as seasonal_decompose(df['Value'], model='additive') and
access the component visuals (decomposition.trend, decomposition.seasonal, decomposition.resid).

Step 9: Finally, use plt.show() to display the visualization.

Program :

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
np.random.seed(0)
date_range = pd.date_range('2023-01-01', '2023-12-31', freq='D')
data = np.random.randint(low=1, high=100, size=len(date_range))
df = pd.DataFrame({'Date': date_range, 'Value': data})
df.set_index('Date', inplace=True)
df.head()

Output:
Value

Date

2023-01-01 45

2023-01-02 48

2023-01-03 65

2023-01-04 68

2023-01-05 68

df.plot(figsize=(10, 6))

plt.xlabel('Date')

plt.ylabel('Value')

plt.title('Time Series Data')

plt.show()

Output:

(Line plot of the daily time series.)
decomposition = seasonal_decompose(df['Value'], model='additive')

trend = decomposition.trend

seasonality = decomposition.seasonal

residuals = decomposition.resid

plt.figure(figsize=(12, 8))

plt.subplot(411)

plt.plot(df['Value'], label='Original')

plt.legend(loc='best')

plt.subplot(412)

plt.plot(trend, label='Trend')

plt.legend(loc='best')

plt.subplot(413)

plt.plot(seasonality, label='Seasonality')

plt.legend(loc='best')

plt.subplot(414)

plt.plot(residuals, label='Residuals')

plt.legend(loc='best')

plt.tight_layout()

plt.show()

Output:

(Four stacked panels: the original series, trend, seasonality, and residuals.)
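Another common time series visualization is a rolling mean that smooths short-term noise. A brief sketch continuing from the program above (the 7-day window is illustrative):

df['Value'].plot(figsize=(10, 6), alpha=0.5, label='Daily')
df['Value'].rolling(window=7).mean().plot(label='7-day rolling mean')
plt.legend()
plt.title('Rolling Mean Smoothing')
plt.show()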

RESULT :

Thus, the above Time Series program was written and executed successfully.

Ex No : 6	Perform data analysis and representation on a map using various map datasets with mouse rollover effect, user interaction, etc.
Date :

Aim:

To perform data analysis and representation on a map using various map datasets with mouse
rollover effect, user interaction, etc.

Algorithm:

Step 1: Collect and prepare various map datasets, including geographical information (latitude,
longitude), and relevant data attributes.

Step 2: Choose a suitable map library or framework (e.g., Leaflet, Google Maps API) for displaying
the maps.

Step 3: Create a web-based application that renders the map and overlays it with markers,
polygons, or other visual elements based on the dataset.

Step 4: Implement a mouse rollover effect, where users can hover over map elements (markers,
regions) to view additional information related to the data point.

Step 5: Enable user interaction by allowing users to interact with the map, such as zooming in/out,
panning, and selecting specific data points for more details.

Step 6: Perform data analysis on the selected dataset, which may include generating statistics,
clustering, or creating heatmaps based on geographical attributes.

Step 7: Implement filters and visualization options, such as color-coding or varying marker sizes to
represent different data attributes.

Step 8: Design a user-friendly interface with clear controls, legends, and tooltips to help users
understand the data and its representation on the map.

Program:

# Import necessary libraries or frameworks
# (pseudocode: map_library and data_processing are placeholder modules,
# not installable packages; a concrete sketch follows the listing)

import map_library

import data_processing

# Load map data

map_data = data_processing.load_map_data()

# Initialize the map

map = map_library.initialize_map()

# Add data markers and polygons to the map

for data_point in map_data:

map.add_marker(data_point)

# Implement mouse rollover effect

map.enable_rollover_effect()

# Implement user interaction (e.g., zoom, pan)

map.enable_user_interaction()

# Perform data analysis (e.g., calculate statistics, clusters)

data_analysis = data_processing.analyze_data(map_data)

# Visualize data on the map (e.g., color-coded markers)

map.visualize_data(data_analysis)

# Create a user-friendly interface with controls and legends

map.create_user_interface()

# Display the map and interface

map.show_map()
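
The listing above is tool-agnostic pseudocode. A concrete, minimal version using the folium library (an assumption; any Leaflet-based tool would do) is sketched below: tooltips give the mouse rollover effect, popups give click interaction, and the saved HTML map supports zooming and panning. The city coordinates and values are illustrative.

import folium

# Illustrative data points: (name, latitude, longitude, value)
points = [
    ("Chennai", 13.0827, 80.2707, 72),
    ("Salem", 11.6643, 78.1460, 55),
    ("Coimbatore", 11.0168, 76.9558, 64),
]

# Initialize the map centred on Tamil Nadu (the zoom level is illustrative)
m = folium.Map(location=[11.5, 78.5], zoom_start=7)

# One marker per data point: tooltip = rollover text, popup = click text
for name, lat, lon, value in points:
    folium.Marker(
        location=[lat, lon],
        tooltip=f"{name}: {value}",          # shown on mouse rollover
        popup=f"{name} has value {value}",   # shown on click
    ).add_to(m)

# Save to an interactive HTML page and open it in a browser
m.save("interactive_map.html")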

Output:

(Interactive map with markers, rollover tooltips, and zoom/pan controls.)
Result:

A web-based interactive map application that allows users to explore geographical data using
mouse rollover effects and interactive features has been created.

Ex No : 7	Build cartographic visualisation of multiple datasets involving various countries of the world; states, and districts in India
Date :

Aim:

To build cartographic visualisation of multiple datasets involving various countries of the world;
states, and districts in India etc.

Algorithm:

Step 1: Gather the multiple datasets related to countries, states, and districts, and ensure that each
dataset contains geographical information, such as latitude and longitude, to accurately plot the data
on maps.

Step 2: Choose a mapping library or tool suitable for the project, such as Leaflet, Mapbox, or Google
Maps.

Step 3: Develop a web-based application or interactive dashboard that renders the world map and
India's map. Overlay the map with relevant boundaries (country borders, state borders, district
boundaries).

Step 4: Merge the datasets with the map layers, linking data points to their geographical locations
(countries, states, districts).

Step 5: Implement data visualization techniques to represent the datasets visually on the maps. This
may include choropleth maps, bubble maps, or heatmaps, depending on the nature of the data.

Step 6: Apply color-coding to the map elements to represent different data attributes, making it easy
to distinguish and analyze the information.

Step 7: Enable user interaction by allowing users to zoom in/out, pan, and click on map elements to
access detailed information about the regions or data points.

Step 8: Include legends, labels, and tooltips to help users interpret the visualizations and understand
the meaning of different colors and symbols.

Program:

# Import necessary libraries or mapping tools
# (pseudocode: mapping_library and data_processing are placeholder modules;
# a concrete choropleth sketch follows the listing)

import mapping_library

import data_processing

# Load geographical data (countries, states, districts)

world_map_data = data_processing.load_world_map_data()

india_map_data = data_processing.load_india_map_data()

# Initialize the maps

world_map = mapping_library.initialize_map()

india_map = mapping_library.initialize_map()

# Add map layers (country, state, district boundaries)

world_map.add_layers(world_map_data)

india_map.add_layers(india_map_data)

# Merge datasets with map layers

world_data = data_processing.merge_data_with_map(world_map, world_map_data)

india_data = data_processing.merge_data_with_map(india_map, india_map_data)

# Visualize data on maps (e.g., choropleth, bubble maps)

world_map.visualize_data(world_data)

india_map.visualize_data(india_data)

# Apply color-coding to represent data attributes

world_map.apply_color_coding()

india_map.apply_color_coding()

# Implement user interaction (zoom, pan, click)

world_map.enable_user_interaction()

india_map.enable_user_interaction()

# Create legends and tooltips for data interpretation

world_map.create_legends_and_labels()

india_map.create_legends_and_labels()

# Display the maps and interface

world_map.show_map()

india_map.show_map()
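
A concrete sketch of one such color-coded layer using folium's Choropleth class (an assumption; the GeoJSON file name, its 'name' property, and the CSV column names are illustrative and must match your own datasets):

import folium
import pandas as pd

# Illustrative inputs: a world-countries GeoJSON file and a CSV of per-country values
geo = "world_countries.json"              # assumed boundary file
data = pd.read_csv("country_values.csv")  # assumed columns: country, value

world_map = folium.Map(location=[20, 0], zoom_start=2)

# Color-code each country by its value (choropleth layer)
folium.Choropleth(
    geo_data=geo,
    data=data,
    columns=["country", "value"],
    key_on="feature.properties.name",  # GeoJSON property holding the country name
    fill_color="YlGnBu",
    legend_name="Value by country",
).add_to(world_map)

world_map.save("world_choropleth.html")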

Output:

(Choropleth maps of the world and of India, color-coded by the chosen attributes.)

Result:

The result is an interactive web-based platform or application that provides cartographic
visualizations for multiple datasets related to countries worldwide, as well as states and districts in
India.

EX NO: 8 Perform EDA on Wine Quality Data Set
DATE:

Aim

To perform EDA on Wine quality dataset.

Algorithm

Step 1: Import necessary libraries such as pandas, numpy, matplotlib and seaborn.

Step 2: Read the wine quality CSV file into a Pandas DataFrame and store it in a variable.

Step 3: Take the first column of the DataFrame and split it into multiple columns based on the delimiter ';'.

Step 4: Change the data types of the columns in the DataFrame to their appropriate types.
Step 5: Check the datatypes, missing values, and summary of the data.

Step 6: Create a histogram of the 'quality' column using Seaborn with labels and a title, and display the plot.

Step 7: Calculate the correlation matrix for the dataset and create a heatmap using Seaborn with
annotations and a title, and display the plot.

Step 8: Create a scatter plot of 'alcohol' vs. 'quality' using Seaborn with labels and a title, and display the plot.

Program :

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

data = pd.read_csv('/content/sample_data/winequality-red.csv')  # file name assumed

print(data.columns)

Output:

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',

'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',

'pH', 'sulphates', 'alcohol', 'quality'],

dtype='object')

# Split the single column into multiple columns based on the delimiter (;)

data = data[data.columns[0]].str.split(';', expand=True)

# Rename the columns based on the original column names


column_names = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides',

'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates',

'alcohol', 'quality']

data.columns = column_names

# Convert the columns to appropriate data types

data = data.astype({'fixed acidity': float, 'volatile acidity': float, 'citric acid': float,

'residual sugar': float, 'chlorides': float, 'free sulfur dioxide': float,

'total sulfur dioxide': float, 'density': float, 'pH': float,

'sulphates': float, 'alcohol': float, 'quality': int})

# View the first few rows of the dataset

print(data.head())

Output

fixed acidity volatile acidity citric acid residual sugar chlorides \

0 7.4 0.70 0.00 1.9 0.076

1 7.8 0.88 0.00 2.6 0.098

2 7.8 0.76 0.04 2.3 0.092

3 11.2 0.28 0.56 1.9 0.075

4 7.4 0.70 0.00 1.9 0.076

free sulfur dioxide total sulfur dioxide density pH sulphates \

0 11.0 34.0 0.9978 3.51 0.56

1 25.0 67.0 0.9968 3.20 0.68

2 15.0 54.0 0.9970 3.26 0.65

3 17.0 60.0 0.9980 3.16 0.58

4 11.0 34.0 0.9978 3.51 0.56

alcohol quality

0 9.4 5

1 9.8 5

2 9.8 5

3 9.8 6

4 9.4 5
# Check the dimensions of the dataset

print(data.shape)

Output

(1599, 12)

# Check the data types of each column

print(data.dtypes)

Output

fixed acidity float64

volatile acidity float64

citric acid float64

residual sugar float64

chlorides float64

free sulfur dioxide float64

total sulfur dioxide float64

density float64

pH float64

sulphates float64

alcohol float64

quality int64

dtype: object

# Check for missing values

print(data.isnull().sum())

Output

fixed acidity 0

volatile acidity 0

citric acid 0

residual sugar 0

chlorides 0

free sulfur dioxide 0

total sulfur dioxide 0

density 0

pH 0

sulphates 0

alcohol 0

quality 0

dtype: int64

# Generate summary statistics

print(data.describe())

Output

fixed acidity volatile acidity citric acid residual sugar \

count 1599.000000 1599.000000 1599.000000 1599.000000

mean 8.319637 0.527821 0.270976 2.538806

std 1.741096 0.179060 0.194801 1.409928

min 4.600000 0.120000 0.000000 0.900000

25% 7.100000 0.390000 0.090000 1.900000

50% 7.900000 0.520000 0.260000 2.200000

75% 9.200000 0.640000 0.420000 2.600000

max 15.900000 1.580000 1.000000 15.500000

chlorides free sulfur dioxide total sulfur dioxide density \

count 1599.000000 1599.000000 1599.000000 1599.000000

mean 0.087467 15.874922 46.467792 0.996747

std 0.047065 10.460157 32.895324 0.001887

min 0.012000 1.000000 6.000000 0.990070

25% 0.070000 7.000000 22.000000 0.995600

50% 0.079000 14.000000 38.000000 0.996750

75% 0.090000 21.000000 62.000000 0.997835

max 0.611000 72.000000 289.000000 1.003690

pH sulphates alcohol quality

count 1599.000000 1599.000000 1599.000000 1599.000000

mean 3.311113 0.658149 10.422983 5.636023

std 0.154386 0.169507 1.065668 0.807569

min 2.740000 0.330000 8.400000 3.000000

25% 3.210000 0.550000 9.500000 5.000000

50% 3.310000 0.620000 10.200000 6.000000

75% 3.400000 0.730000 11.100000 6.000000

max 4.010000 2.000000 14.900000 8.000000

# Histogram of wine quality

sns.histplot(data['quality'])

plt.xlabel('Quality')

plt.ylabel('Frequency')

plt.title('Distribution of Wine Quality')

plt.show()

Output

(Histogram of the wine quality distribution.)
# Correlation matrix heatmap

correlation_matrix = data.corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

plt.title('Correlation Matrix')

plt.show()

Output

(Annotated heatmap of the correlation matrix.)
# Scatter plot of alcohol vs. quality

sns.scatterplot(x='alcohol', y='quality', data=data)

plt.xlabel('Alcohol')

plt.ylabel('Quality')

plt.title('Alcohol vs. Quality')

plt.show()

Output

(Scatter plot of alcohol versus quality.)
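
A further insight worth including in the analysis is how feature averages vary with the quality grade; a short sketch continuing from the program above:

# Average alcohol and volatile acidity for each quality grade
print(data.groupby('quality')[['alcohol', 'volatile acidity']].mean())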

Result :

Thus, EDA on the wine quality dataset was performed successfully.

Ex No : 9	Use a case study on a dataset and apply the various EDA and visualization techniques and present an analysis report
Date :

Aim:

To use a case study on a dataset and apply the various EDA and visualization techniques and present
an analysis report

Algorithm :

Step 1: Import the required libraries.

Step 2: Load the dataset.

Step 3: Display basic information about the dataset.

Step 4: Check for missing values.

Step 5: Visualise the distribution of scores using various plots.

Step 6: Check the correlation for variables.

Step 7: Visualise the correlation using heatmap.

Step 8: Give an overall analysis report.

Program:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from scipy import stats

import statsmodels.api as sms

%matplotlib inline

df = pd.read_csv('StudentsPerformance.csv')  # file name assumed for the student performance dataset

df

1000 rows × 8 columns

df.info()
Output :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 1000 non-null object
1 race/ethnicity 1000 non-null object
2 parental level of education 1000 non-null object
3 lunch 1000 non-null object
4 test preparation course 1000 non-null object
5 math score 1000 non-null int64
6 reading score 1000 non-null int64
7 writing score 1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB

df.isnull().sum()

Output :

gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64
df.describe()

Output :

(Summary statistics table of the numeric score columns.)

sns.set_style('whitegrid')

plt.figure(figsize=(10, 6))

sns.boxplot(data=[df['math score'], df['reading score'], df['writing score']])

plt.xticks([0, 1, 2], ['math', 'reading', 'writing'])

plt.xlabel('Subject')

plt.ylabel('Score')

plt.title('Boxplot for each subject score')

plt.show()

Output :

(Box plots of the math, reading, and writing scores.)
df['sum score'] = df[['math score', 'reading score', 'writing score']].sum(axis=1)

sns.histplot(df['sum score'], kde=True)

plt.title('Distribution of sum scores')

plt.show()

Output :

(Histogram with KDE of the summed scores.)

plt.figure(figsize=(10, 6))

plt.pie(df['gender'].value_counts(), autopct='%1.1f%%', labels=['Female', 'Male'])

plt.title('Students by gender')

plt.show()

Output :

(Pie chart of students by gender.)
labels = df['parental level of education'].value_counts().index

plt.figure(figsize=(10, 6))

plt.pie(df['parental level of education'].value_counts(), autopct='%1.1f%%', labels=labels)

plt.title('Students by parental level of education')

plt.show()

Output :

(Pie chart of students by parental level of education.)

# Encode the categorical columns so they can enter the correlation matrix
# (this line is reconstructed; the source omitted it)
df_encoded = pd.get_dummies(df, drop_first=True)

plt.figure(figsize=(10, 6))

sns.heatmap(df_encoded.corr(), annot=True)

plt.show()

Output :

(Correlation heatmap of the encoded variables.)
# Fit a simple regression to obtain residuals for the Q-Q plot
# (the model-fitting step is reconstructed; the source omitted it)
model = sms.OLS(df['math score'], sms.add_constant(df[['reading score', 'writing score']])).fit()

residuals = model.resid

sms.qqplot(residuals, line='s')

plt.show()

Output :

(Q-Q plot of the regression residuals against a standard normal.)
Result :

Thus, by applying various EDA and visualization techniques, we analysed the student score data and
identified important factors affecting the total scores.

EVALUATION SHEET
