EDA Lab Manual
(Accredited by NAAC & NBA, Approved by AICTE, New Delhi and Affiliated to Anna University, Chennai)
LABORATORY MANUAL
FOR
REGULATION 2021
VISION
To create globally competent software professionals with social values to cater to the ever-
changing industry requirements.
MISSION
M1 To provide appropriate infrastructure to impart need-based technical education
through effective teaching and research
M2 To involve the students in collaborative projects on emerging technologies to fulfill
the industrial requirements
M3 To render value-based education to students to make better engineering decisions
with social consciousness and to meet global standards
M4 To inculcate leadership skills in students and encourage them to become globally
competent professionals
Programme Educational Objectives (PEOs)
The graduates of Computer Science and Engineering will be able to
PEO1 Pursue Higher Education and Research or have a successful career in industries
associated with Computer Science and Engineering, or as Entrepreneurs
PEO2 Ensure that graduates will have the ability and attitude to adapt to emerging
technological changes
PEO3 Acquire leadership skills to perform professional activities with social
consciousness
Programme Specific Outcome (PSOs)
The graduates will be able to
PSO1 The students will be able to analyze large volumes of data and make business
decisions to improve efficiency using different algorithms and tools
PSO2 The students will have the capacity to develop web and mobile applications for real
time scenarios
PSO3 The students will be able to provide automation and smart solutions in various
forms to the society with Internet of Things
Course Code & Name: CCS346 & EXPLORATORY DATA ANALYSIS LABORATORY
REGULATION: R2021
YEAR/SEM: III/V
COURSE OBJECTIVES:
• To outline an overview of exploratory data analysis.
• To implement data visualization using Matplotlib.
• To perform univariate data exploration and analysis.
• To apply bivariate data exploration and analysis.
• To use Data exploration and visualization techniques for multivariate and time series data
LIST OF EXPERIMENTS:
1. Install the data Analysis and Visualization tool: R/ Python /Tableau Public/ Power BI.
2. Perform exploratory data analysis (EDA) with datasets like email data set. Export all your
emails as a dataset, import them inside a pandas data frame, visualize them and get different
insights from the data.
3. Working with Numpy arrays, Pandas data frames, basic plots using Matplotlib.
4. Explore various variable and row filters in R for cleaning data. Apply various plot features in
R on sample data sets and visualize.
5. Perform Time Series Analysis and apply the various visualization techniques.
6. Perform Data Analysis and representation on a Map using various Map data sets with Mouse
Rollover effect, user interaction, etc.
7. Build cartographic visualization for multiple datasets involving various countries of the world;
states and districts in India etc.
8. Perform EDA on Wine Quality Data Set.
9. Use a case study on a data set and apply the various EDA and visualization techniques and
present an analysis report.
COURSE OUTCOMES
• Understand the fundamentals of exploratory data analysis.
• Implement the data visualization using Matplotlib.
• Perform univariate data exploration and analysis.
• Apply bivariate data exploration and analysis.
• Use Data exploration and visualization techniques for multivariate and time series data.
KNOWLEDGE INSTITUTE OF TECHNOLOGY
CONTENTS
Ex No : 1 Installation and Setup of Data Analysis and Visualization Tools: R, Python,
Date :  Tableau, and Power BI
Aim:
The aim of this project is to install and set up four essential data analysis and visualization tools,
namely R, Python, Tableau, and Power BI, to provide a robust environment for data analysis and
visualization tasks.
Algorithm:
Step 1: System Requirements Assessment:
- Before installation, ensure your system meets the minimum hardware and software requirements
for each tool. Check the official documentation of R, Python, Tableau, and Power BI for system
prerequisites.
Step 2: Installation of R:
- Download the R installation package from the official CRAN (Comprehensive R Archive Network)
website ([Link]).
- Follow the installation wizard, choose your preferred options, and install R on your system.
Step 3: Installation of Python:
- Download the latest Python installer from the official Python website ([Link]).
- During installation, make sure to check the option "Add Python to PATH" for easy command-line
access.
- After installation, use pip (the Python package manager) to install essential data science libraries,
such as NumPy, Pandas, Matplotlib, and Jupyter Notebook, to enhance your data analysis capabilities.
Step 4: Installation of Tableau:
- Follow the installation instructions and provide your licensing information or use the free trial
period.
Step 5: Installation of Power BI:
- Go to the Microsoft Power BI website ([Link]) to download Power BI Desktop.
- You will need a Microsoft account to access certain features. Sign in or create an account if
required.
Step 6: Configuration and Verification:
- Configure R and Python with data science IDEs such as RStudio or Jupyter Notebook to harness
their full potential for data analysis.
- For Tableau and Power BI, explore their integration capabilities with the databases, cloud services,
and data sources you plan to use for analysis and visualization.
- Verify that all the installed tools are working correctly by creating a simple data analysis and
visualization project.
- Document the installation process, any challenges faced, and how they were overcome. This
documentation will be helpful for future reference and troubleshooting.
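Once Python and the libraries above are installed, a minimal sanity-check script for the verification step might look like the following (the output file name check.png is arbitrary):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen so the check also works without a display
import matplotlib.pyplot as plt

# Report the installed versions of the core data stack.
print("NumPy:", np.__version__)
print("pandas:", pd.__version__)
print("Matplotlib:", matplotlib.__version__)

# Tiny end-to-end test: build a DataFrame and plot it to a file.
df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5) ** 2})
df.plot(x="x", y="y")
plt.savefig("check.png")
print("Wrote check.png")
```

If all three versions print and check.png is written, the Python toolchain is working.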
R:
(installation screenshots for Steps 1-5 omitted)
Python:
(installation screenshots omitted)
Tableau Public:
(installation screenshots omitted)
Follow the further instructions to install Tableau Public.
Power BI:
(installation screenshots omitted)
Result:
Thus the data analysis and visualization tools R, Python, Tableau Public, and Power BI were
installed and set up successfully.
Ex No : 2 Perform exploratory data analysis (EDA) with datasets like email data set.
Export all your emails as a dataset, import them inside a pandas data
Date :  frame, visualize them and get different insights from the data.
Aim :
Perform exploratory data analysis (EDA) with datasets like email data set. Export all your emails as
a dataset, import them inside a pandas data frame, visualize them and get different insights from
the data.
Algorithm
Step 1: Load the data and create a DataFrame.
Step 7: Analyze email frequency over time using a time series plot.
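The program that follows builds the emails by hand. To actually export a real mailbox into a DataFrame, one hedged approach uses Python's standard mailbox module on an mbox export (e.g. from Google Takeout); the file name and helper name below are illustrative, not part of the manual's program:

```python
import mailbox
import pandas as pd

def mbox_to_dataframe(path):
    """Load an mbox export (e.g. from Google Takeout) into a pandas DataFrame."""
    rows = []
    for msg in mailbox.mbox(path):
        rows.append({
            "sender": msg.get("From"),
            "receiver": msg.get("To"),
            "subject": msg.get("Subject"),
            "timestamp": msg.get("Date"),
        })
    # fixed column list so an empty mailbox still yields the expected columns
    df = pd.DataFrame(rows, columns=["sender", "receiver", "subject", "timestamp"])
    # parse RFC 2822 date headers; unparseable values become NaT
    df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce", utc=True)
    return df
```

A call such as mbox_to_dataframe('All mail.mbox') (the file name is an assumption) would then feed the same visualizations shown below.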
Program :
import pandas as pd

# Sender/receiver addresses and timestamps below are placeholders; the
# original values were lost in the source (the output shows the real ones).
emails = {
    'sender': ['john@example.com', 'mary@example.com', 'john@example.com'],
    'receiver': ['mary@example.com', 'john@example.com', 'mary@example.com'],
    'subject': ['Meeting Reminder', 'Regarding Project', 'Important Updates'],
    'timestamp': ['2023-06-28 09:00:00', '2023-06-28 11:30:00', '2023-06-29 10:00:00'],
    'content': ['Dear Mary, This is a reminder for our meeting tomorrow. Regards, John',
                'Hi John, I wanted to discuss the project with you. Let\'s connect. Regards, Mary',
                'Dear Mary, Please find attached the important updates. Best, John']
}
# Create a DataFrame
df = pd.DataFrame(emails)
df['timestamp'] = pd.to_datetime(df['timestamp'])
print(df.info())
print(df.head())
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 sender 3 non-null object
1 receiver 3 non-null object
2 subject 3 non-null object
3 timestamp 3 non-null datetime64[ns]
4 content 3 non-null object
dtypes: datetime64[ns](1), object(4)
memory usage: 248.0+ bytes
None
sender receiver subject timestamp \
0 john@[Link] mary@[Link] Meeting Reminder 2023-06-28 [Link]
1 mary@[Link] john@[Link] Regarding Project 2023-06-28 [Link]
2 john@[Link] mary@[Link] Important Updates 2023-06-29 [Link]
content
0 Dear Mary, This is a reminder for our meeting ...
1 Hi John, I wanted to discuss the project with ...
2 Dear Mary, Please find attached the important ...
import matplotlib.pyplot as plt

df['sender'].value_counts().plot(kind='bar')
plt.xlabel('Sender')
plt.ylabel('Email Count')
plt.show()
Output
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['date'] = df['timestamp'].dt.date
email_counts = df['date'].value_counts().sort_index()
plt.plot(email_counts.index, email_counts.values)
plt.xlabel('Date')
plt.ylabel('Email Count')
plt.xticks(rotation=45)
plt.show()
Output
# word cloud generation was lost in the source, e.g.
# wordcloud = WordCloud().generate(' '.join(df['content']))
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Output
most_active_senders = df['sender'].value_counts()
print(most_active_senders)
Output
john@[Link] 2
mary@[Link] 1
Name: sender, dtype: int64
# Concatenate the email content
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
Output
# email_counts_per_day is computed as above (grouping step lost in the source)
plt.plot(email_counts_per_day.index, email_counts_per_day.values)
plt.xlabel('Date')
plt.ylabel('Email Count')
plt.xticks(rotation=45)
plt.show()
Output :
Result :
Thus, EDA on the email dataset has been performed successfully.
EX NO: 3 Working with Numpy arrays, Pandas data frames, basic plots using Matplotlib
DATE:
Aim
To work with arrays using the NumPy module, data frames using the Pandas module, and basic plots using
the Matplotlib module in Python.
Step 3: Create the 1-d array, 2-d array by using built-in methods
Step 6: Compute the shape of an array and reshape an array and perform transpose of an array
Step 7: Do the required operations like slicing, iterating and splitting an array element.
Numpy arrays
(i) Creating numpy 1d and 2d arrays
Program
import numpy as np

# array definitions reconstructed from the output below
arr = np.array([1, 2, 3, 4, 5, 6])
a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print("1-d array:", arr)
print("2-d array:")
print(a)
Output
1-d array: [1 2 3 4 5 6]
2-d array:
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]]
Program
zer = np.zeros(2)
on = np.ones(3)
ar = np.arange(2, 9, 2)
# linspace call reconstructed from the last output line below
li = np.linspace(0, 10, 5, dtype=np.int64)
print(zer)
print(on)
print(ar)
print(li)
Output
array([0. 0.])
array([2, 4, 6, 8])
array([ 0, 2, 5, 7, 10], dtype=int64)
Program
# first row of the array reconstructed from the reshape output below
nums = np.array([[[1, 5, 2, 1],
 [4, 3, 5, 6],
 [6, 3, 0, 6],
 [7, 3, 5, 0],
 [2, 3, 3, 5]],
 [[2, 2, 3, 1],
 [4, 0, 0, 5],
 [6, 3, 2, 1],
 [5, 1, 0, 0],
 [0, 1, 9, 1]],
 [[3, 1, 4, 2],
 [4, 1, 6, 0],
 [1, 2, 0, 6],
 [8, 3, 4, 0],
 [2, 0, 2, 8]]])
print("Number of dimensions :", nums.ndim)
print("Size :", nums.size)
print("Shape of array :", nums.shape)
print("Transpose of array:", nums.T)
print(nums.reshape(3, 4, 5))
Output
Number of dimensions : 3
Size : 60
Shape of array : (3, 5, 4)
Transpose of array:
[[[1 2 3]
[4 4 4]
[6 6 1]
[7 5 8]
[2 0 2]]
[[5 2 1]
[3 0 1]
[3 3 2]
[3 1 3]
[3 1 0]]
[[2 3 4]
[5 0 6]
[0 2 0]
[5 0 4]
[3 9 2]]
[[1 1 2]
[6 5 0]
[6 1 6]
[0 0 0]
[5 1 8]]]
[[[1 5 2 1 4]
[3 5 6 6 3]
[0 6 7 3 5]
[0 2 3 3 5]]
[[2 2 3 1 4]
[0 0 5 6 3]
[2 1 5 1 0]
[0 0 1 9 1]]
[[3 1 4 2 4]
[1 6 0 1 2]
[0 6 8 3 4]
[0 2 0 2 8]]]
Program
p = np.array([1, 2, 3, 4, 5, 6])
print(arr[2] + arr[3])
p[1:4]
Output
array([2, 3, 4])
(v) Iterating Arrays
Program
import numpy as np

# array1 reconstructed from the output below
array1 = np.array([1, 2, 3])
for x in array1:
    print(x)
Output
1
2
3
(vi) Vstack,Hstack,split and flip functions on arrays
Program
arr1 = np.array([[1, 1],
                 [2, 2]])
arr2 = np.array([[3, 3],
                 [4, 4]])
print("Vertical Stacking :")
print(np.vstack((arr1, arr2)))
print("Horizontal stacking :")
print(np.hstack((arr1, arr2)))
print("Splitting of array :")
print(np.hsplit(arr1, 2))
print("Reversing of array :")
print(np.flip(arr1))
Output
Vertical Stacking :
array([[1, 1],
[2, 2],
[3, 3],
[4, 4]])
Horizontal stacking :
array([[1, 1, 3, 3],
[2, 2, 4, 4]])
Splitting of array :
Reversing of array :
array([[2, 2],
[1, 1]])
Program
# creation of the random array was lost in the source; np.random.rand is assumed
ran = np.random.rand(3)
print("maximum :", ran.max())
print("mean :", ran.mean())
Output
maximum : 0.82485143
mean : 0.4049648666666667
Step 3: Create a DataFrame using built-in methods and add a named index to it
Step 4: Load data into a DataFrame object otherwise Load Files(excel/csv) into a
DataFrame
Step 5: Display the information about the data using built-in method
Step 6: Display the first five rows and last five rows using head and tail method
Pandas DataFrame
Program
import pandas as pd

# column names are assumptions; the values are taken from the output below
data = {
    'calories': [420, 380, 390],
    'duration': [5, 4, 4]
}
dataframe = pd.DataFrame(data)
print(dataframe)
Output
0 420 5
1 380 4
2 390 4
(ii) Creating named index
Program
dataframe1 = pd.DataFrame(data, index=['a', 'b', 'c'])
dataframe1
Output
Program
d=pd.read_csv("C:\\Users\\ADMIN\\Downloads\\[Link]")
df = pd.DataFrame(d)
df
Output
(iv) Information about dataset
Program
df.info()
Output
<class 'pandas.core.frame.DataFrame'>
9 sulphates 1599 non-null float64
Program
df.head()
df.tail()
Output
Program
Output
Program
df.iloc[3:16, 0:7]
Output
Algorithm for Matplotlib:
Step 4: Create a bar plot, use the bar method and customize the appearance, labels, and title.
Step 5: Generate a scatter plot by scatter function to plot your data points and customize the size, color, and style
of the markers, as well as labels and title.
Step 6: Visualize the data using the pie chart and histogram with required parameters
Program
import matplotlib.pyplot as plt

# the line plot call was lost in the source; example data assumed
plt.plot([1, 2, 3, 4], [2, 4, 6, 8])
plt.axis([0, 7, 0, 10])
plt.show()
Output
Program
plt.bar([2, 3, 1, 5, 7], [300, 400, 200, 600, 700],
        label="Carpenter", color='b')
plt.legend()
plt.xlabel('Days')
plt.ylabel('Wage')
plt.title('Details')
plt.show()
Output
(iii) Creating Scatter plot
Program
x1 = [1, 2.5, 3, 4.5, 5, 6.5, 7]
y1 = [1, 2, 3, 2, 1, 3, 4]
# scatter call reconstructed; it was lost in the source
plt.scatter(x1, y1, label='data')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
Output
(iv) Creating Pie chart
Program
# slice, activities and cols definitions were lost in the source; example values assumed
activities = ['Python', 'R', 'Tableau', 'Power BI']
slice = [30, 20, 30, 20]
cols = ['c', 'm', 'r', 'b']
plt.pie(slice,
        labels=activities,
        colors=cols)
plt.title('Training Subjects')
plt.show()
Output
(v) Creating Histogram plot
Program
fig, ax = plt.subplots(1, 1)
a_new = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])
# hist call reconstructed; it was lost in the source
ax.hist(a_new, bins=[0, 25, 50, 75, 100])
ax.set_title("histogram of result")
ax.set_xticks([0, 25, 50, 75, 100])
ax.set_xlabel('marks')
ax.set_ylabel('no. of students')
plt.show()
Output
(vi) Adding Gridlines to plot
Program
x = np.array([80, 85, 90, 95, 100, 105, 110, 115, 120, 125])
y = np.array([240, 250, 260, 270, 280, 290, 300, 310, 320, 330])
plt.xlabel("Average Pulse")
plt.ylabel("Calorie Burnage")
plt.plot(x, y)
plt.grid()
plt.show()
Output
Result
Thus, working with arrays in the NumPy module, data frames with the Pandas module, and basic plots with
the Matplotlib module in Python has been explored successfully.
Ex No : 4 Explore various variable and row filters in R for cleaning data. Apply various
DATE :  plot features in R on sample data sets and visualize.
Aim:
The aim of this project is to explore data cleaning techniques in R, including variable and row filters,
and to apply various plot features to sample datasets to visualize data effectively.
Algorithm:
Step 1: Import the required libraries for data manipulation, such as dplyr for filtering and cleaning, and
ggplot2 for data visualization.
Step 2: Load one or more sample datasets for analysis. These datasets should have some missing values and
outliers for data cleaning demonstrations.
Step 3: Apply variable filters (e.g., removing unnecessary columns) to clean the dataset.
Step 4: Apply row filters (e.g., removing rows with missing values or outliers) to clean the dataset further.
Step 6: Create summary statistics, histograms, and box plots to understand the distribution of variables.
Step 7: Use ggplot2 to create various types of plots, such as bar charts, scatter plots, and line plots, to visualize
the data.
Step 9: Optionally, save the plots as image files for further use or reporting.
Program:
# Load necessary libraries
library(dplyr)
library(ggplot2)

# Use the built-in mtcars dataset (the loading step was lost in the source)
data <- mtcars

cat("Original Data:\n")
print(data)

# Variable filter: drop the vs and am columns (inferred from the cleaned output below)
data_cleaned <- data %>%
  select(-vs, -am)

print(data_cleaned)
print(summary(data_cleaned))

cat("\nData Visualization:\n")
g <- ggplot(data_cleaned, aes(x = wt, y = mpg)) +
  geom_point(aes(color = qsec)) +
  labs(title = "Weight vs MPG",
       x = "Weight",
       y = "MPG")
print(g)
ggsave("scatter_plot.png")
Output:
Original Data:
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 5 6
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 4 2
Result:
Thus the process of data manipulation and visualization using R was executed successfully.
Ex No : 5 Perform Time Series Analysis and apply the various
DATE : visualization techniques
Aim :
To perform time series analysis and apply the various visualization techniques.
Algorithm :
Step 4: Generate random values from 1 to 100 for each date using np.random.randint() and convert
them into a DataFrame.
Step 5: Set the Date as the index of the DataFrame using df.set_index('Date', inplace=True)
Step 6: Import seasonal_decompose from statsmodels to show the different time series components,
such as trend, seasonal and residual.
Step 7: Plot the series using plt.plot(df.index, df['Value'])
Program :
import pandas as pd
import numpy as np
import [Link] as plt
from statsmodels.tsa.seasonal import seasonal_decompose
np.random.seed(0)
date_range = pd.date_range('2023-01-01', '2023-12-31', freq='D')
data = np.random.randint(low=1, high=100, size=len(date_range))
df = pd.DataFrame({'Date': date_range, 'Value': data})
df.set_index('Date', inplace=True)
df.head()
Output:
Value
Date
2023-01-01 45
2023-01-02 48
2023-01-03 65
2023-01-04 68
2023-01-05 68
plt.figure(figsize=(10, 6))
# line plot call reconstructed; it was lost in the source
plt.plot(df.index, df['Value'])
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
Output:
decomposition = seasonal_decompose(df['Value'], model='additive')
trend = decomposition.trend
seasonality = decomposition.seasonal
residuals = decomposition.resid
plt.figure(figsize=(12, 8))
plt.subplot(411)
plt.plot(df['Value'], label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonality, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residuals, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Output:
RESULT :
Thus, the above Time Series program was written and executed successfully.
Ex No : 6 Perform data analysis and representation on a map using various map
datasets with mouse rollover effect, user interaction, etc.
Date :
Aim:
To perform data analysis and representation on a map using various map datasets with mouse
rollover effect, user interaction, etc.
Algorithm:
Step 1: Collect and prepare various map datasets, including geographical information (latitude,
longitude), and relevant data attributes.
Step 2: Choose a suitable map library or framework (e.g., Leaflet, Google Maps API) for displaying
the maps.
Step 3: Create a web-based application that renders the map and overlays it with markers,
polygons, or other visual elements based on the dataset.
Step 4: Implement a mouse rollover effect, where users can hover over map elements (markers,
regions) to view additional information related to the data point.
Step 5: Enable user interaction by allowing users to interact with the map, such as zooming in/out,
panning, and selecting specific data points for more details.
Step 6: Perform data analysis on the selected dataset, which may include generating statistics,
clustering, or creating heatmaps based on geographical attributes.
Step 7: Implement filters and visualization options, such as color-coding or varying marker sizes to
represent different data attributes.
Step 8: Design a user-friendly interface with clear controls, legends, and tooltips to help users
understand the data and its representation on the map.
Program:
import map_library
import data_processing
map_data = data_processing.load_map_data()
map = map_library.initialize_map()
for data_point in map_data:
    map.add_marker(data_point)
map.enable_rollover_effect()
map.enable_user_interaction()
data_analysis = data_processing.analyze_data(map_data)
map.visualize_data(data_analysis)
map.create_user_interface()
map.show_map()
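The program above uses a generic map_library as pseudocode. The data-analysis side of Step 6 can be sketched concretely with pandas; the cities, coordinates and sales figures below are invented for illustration, and a real interactive layer (markers with rollover tooltips) would come from a library such as folium or Leaflet:

```python
import pandas as pd

# Hypothetical point dataset: city markers with coordinates and a value to map.
points = pd.DataFrame({
    "city":  ["Chennai", "Mumbai", "Delhi", "Salem", "Pune"],
    "lat":   [13.08, 19.08, 28.61, 11.66, 18.52],
    "lon":   [80.27, 72.88, 77.21, 78.15, 73.86],
    "sales": [120, 340, 280, 60, 150],
})

# Text that a rollover tooltip would display for each marker.
points["tooltip"] = points["city"] + ": " + points["sales"].astype(str)

# Coarse latitude bands to mimic the grid of a heatmap layer.
points["lat_band"] = pd.cut(points["lat"], bins=[10, 15, 20, 30])
print(points.groupby("lat_band", observed=True)["sales"].sum())
```

The per-band totals are exactly the statistics a heatmap or marker-size encoding would consume.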
Output:
Result:
The web-based interactive map application that allows users to explore geographical data using
mouse rollover effects, interactive features has been created.
Ex No : 7 Build cartographic visualization for multiple datasets involving various countries
Date :  of the world; states and districts in India etc.
Aim:
To build cartographic visualization of multiple datasets involving various countries of the world;
states and districts in India etc.
Algorithm:
Step 1: Gather the multiple datasets related to countries, states, and districts, and ensure that each
dataset contains geographical information, such as latitude and longitude, to accurately plot the data
on maps.
Step 2: Choose a mapping library or tool suitable for the project, such as Leaflet, Mapbox, or Google
Maps.
Step 3: Develop a web-based application or interactive dashboard that renders the world map and
India's map. Overlay the map with relevant boundaries (country borders, state borders, district
boundaries).
Step 4: Merge the datasets with the map layers, linking data points to their geographical locations
(countries, states, districts).
Step 5: Implement data visualization techniques to represent the datasets visually on the maps. This
may include choropleth maps, bubble maps, or heatmaps, depending on the nature of the data.
Step 6: Apply color-coding to the map elements to represent different data attributes, making it easy
to distinguish and analyze the information.
Step 7: Enable user interaction by allowing users to zoom in/out, pan, and click on map elements to
access detailed information about the regions or data points.
Step 8: Include legends, labels, and tooltips to help users interpret the visualizations and understand
the meaning of different colors and symbols.
Program:
import mapping_library
import data_processing
world_map_data = data_processing.load_world_map_data()
india_map_data = data_processing.load_india_map_data()
world_map = mapping_library.initialize_map()
india_map = mapping_library.initialize_map()
world_map.add_layers(world_map_data)
india_map.add_layers(india_map_data)
world_map.visualize_data(world_map_data)
india_map.visualize_data(india_map_data)
world_map.apply_color_coding()
india_map.apply_color_coding()
world_map.enable_user_interaction()
india_map.enable_user_interaction()
# Create legends and tooltips for data interpretation
world_map.create_legends_and_labels()
india_map.create_legends_and_labels()
world_map.show_map()
india_map.show_map()
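The colour-coding in Step 6 can be sketched with Matplotlib's colormap utilities. The per-country values below are invented for illustration; a real choropleth layer (e.g. in folium or Leaflet) would consume the resulting hex colours:

```python
from matplotlib import colormaps
from matplotlib.colors import Normalize, to_hex

# Hypothetical per-country metric to shade on a choropleth (values are made up).
values = {"India": 80, "USA": 55, "Brazil": 30, "Japan": 65}

# Normalize the metric into [0, 1] and map it through a colormap.
norm = Normalize(vmin=min(values.values()), vmax=max(values.values()))
cmap = colormaps["viridis"]

# One hex fill colour per country for the map layer.
fills = {country: to_hex(cmap(norm(v))) for country, v in values.items()}
for country, colour in fills.items():
    print(country, colour)
```

A sequential colormap like viridis keeps the shading perceptually ordered, so higher values read as "more" at a glance.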
Output:
Result:
Thus cartographic visualization of multiple datasets involving various countries of the world and
states and districts in India was built successfully.
EX NO: 8 Perform EDA on Wine Quality Data Set
DATE:
Aim
To perform exploratory data analysis (EDA) on the Wine Quality data set.
Algorithm
Step 1: Import necessary libraries such as pandas, numpy, matplotlib and seaborn.
Step 2: Read a CSV file ('[Link]') into a Pandas DataFrame and store it in the variable 'data'.
Step 4: Change the data types of the columns in the DataFrame to their appropriate types
Step 5: Check the datatypes, missing values, and summary of the data.
Step 6: Create a histogram of the 'quality' column using Seaborn with labels and a title, and display the plot.
Step 7: Calculate the correlation matrix for the dataset and create a heatmap using Seaborn with
annotations and a title, and display the plot.
Step 8: Create a scatter plot of 'alcohol' vs. 'quality' using Seaborn with labels and a title, and display the plot.
Program :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('/content/sample_data/[Link]')
print(data.columns)
Output:
dtype='object')
# Split the single column into multiple columns based on the delimiter (;)
data = data.iloc[:, 0].str.split(';', expand=True)
column_names = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
                'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
                'pH', 'sulphates', 'alcohol', 'quality']
data.columns = column_names
data = data.astype({'fixed acidity': float, 'volatile acidity': float, 'citric acid': float,
                    'residual sugar': float, 'chlorides': float, 'free sulfur dioxide': float,
                    'total sulfur dioxide': float, 'density': float, 'pH': float,
                    'sulphates': float, 'alcohol': float, 'quality': int})
print(data.head())
Output
alcohol quality
0 9.4 5
1 9.8 5
2 9.8 5
3 9.8 6
4 9.4 5
# Check the dimensions of the dataset
print(data.shape)
Output
(1599, 12)
print(data.dtypes)
Output
chlorides float64
density float64
pH float64
sulphates float64
alcohol float64
quality int64
dtype: object
print(data.isnull().sum())
Output
fixed acidity 0
volatile acidity 0
citric acid 0
residual sugar 0
chlorides 0
density 0
pH 0
sulphates 0
alcohol 0
quality 0
dtype: int64
print(data.describe())
Output
min 0.012000 1.000000 6.000000 0.990070
sns.histplot(data['quality'])
plt.xlabel('Quality')
plt.ylabel('Frequency')
plt.show()
Output
# Correlation matrix heatmap
correlation_matrix = data.corr()
sns.heatmap(correlation_matrix, annot=True)
plt.title('Correlation Matrix')
plt.show()
Output
# Scatter plot of alcohol vs. quality
sns.scatterplot(x='alcohol', y='quality', data=data)
plt.xlabel('Alcohol')
plt.ylabel('Quality')
plt.show()
Output
Result :
Thus EDA on the Wine Quality data set has been performed successfully.
Ex No : 9 Use a case study on a dataset and apply the various EDA and visualization
techniques and present an analysis report
Date :
Aim:
To use a case study on a dataset and apply the various EDA and visualization techniques and present
an analysis report
Algorithm :
Step 1: Import the necessary libraries and load the student performance dataset into a DataFrame.
Step 2: Inspect the data using info(), isnull().sum() and describe().
Step 3: Visualize the score distributions, gender counts and parental education levels.
Step 4: Compute a total score column and examine correlations using a heatmap.
Step 5: Check the distribution of residuals with a Q-Q plot and summarize the findings.
Program:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv('[Link]')
df
df.info()
Output :
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 1000 non-null object
1 race/ethnicity 1000 non-null object
2 parental level of education 1000 non-null object
3 lunch 1000 non-null object
4 test preparation course 1000 non-null object
5 math score 1000 non-null int64
6 reading score 1000 non-null int64
7 writing score 1000 non-null int64
dtypes: int64(3), object(5)
memory usage: 62.6+ KB
df.isnull().sum()
Output :
gender 0
race/ethnicity 0
parental level of education 0
lunch 0
test preparation course 0
math score 0
reading score 0
writing score 0
dtype: int64
df.describe()
Output :
sns.set_style('whitegrid')
plt.figure(figsize=(10, 6))
# the plotting call (e.g. a box plot of the three scores) was lost in the source
plt.xlabel('Subject')
plt.ylabel('Score')
plt.show()
Output :
df['sum score'] = df[['math score', 'reading score', 'writing score']].sum(axis=1)
df.head()
Output :
plt.figure(figsize=(10, 6))
# count plot call lost in the source, e.g. sns.countplot(x='gender', data=df)
plt.title('Students by gender')
plt.show()
Output :
labels = df['parental level of education'].value_counts().index
plt.figure(figsize=(10, 6))
# pie chart call lost in the source, e.g.
# plt.pie(df['parental level of education'].value_counts(), labels=labels)
plt.show()
Output :
plt.figure(figsize=(10, 6))
# df_encoded (a label-encoded copy of df) construction was lost in the source
sns.heatmap(df_encoded.corr(), annot=True)
plt.show()
Output :
import statsmodels.api as sm

# the regression fit producing these residuals was lost in the source;
# 'model' here is a placeholder name for the fitted statsmodels model
residuals = model.resid
sm.qqplot(residuals, line='s')
plt.show()
Output :
Result :
Thus by applying various EDA and visualization techniques we analysed the student score data and
identified important factors affecting the total scores.
EVALUATION SHEET