Ad3301 - Dev Lab
Mr/Ms. _________________________________________________________________ in
year 20 - 20
Vision
Mission
RVSCOE Strives:
Vision
Mission
1. By providing state-of-the-art facilities and cutting-edge technology, we create an optimal
learning and research environment, enhancing the capabilities of our Artificial
Intelligence and Data Science students and faculty.
PSO 1: Apply the concepts and practical knowledge in analysis, design and
development of Artificial Intelligence and Data Science solutions to address real-world
problems and meet the challenges of society.
• PEO1: Develop the next generation of highly skilled graduates equipped with strong
knowledge in Artificial Intelligence and Data Science for creating innovative solutions
to society’s pressing challenges.
• PEO3: Produce engineers who are professional entrepreneurs and capable of
self-learning to excel in their careers.
• PEO4: To prepare graduates who excel in diverse teams, upholding professional ethics
and societal responsibilities.
PO-2: Problem analysis: Identify, formulate, review research literature, and analyze
complex engineering problems reaching substantiated conclusions using first principles
of mathematics, natural sciences and engineering sciences.
PO-6: The engineering and society: Apply reasoning informed by the contextual
knowledge to assess societal, health, safety, legal and cultural issues and the consequent
responsibilities relevant to the professional engineering practice.
PO-8: Ethics: Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice.
PO-12: Life-long Learning: Recognize the need for, and have the preparation and
ability to engage in independent and life-long learning in the broadest context of
technological change.
AD3301 - DATA EXPLORATION AND VISUALIZATION
LIST OF EXPERIMENTS
1. Install the data analysis and visualization tool: R / Python / Tableau Public / Power BI.
2. Perform exploratory data analysis (EDA) with datasets like an email data set. Export all your
emails as a dataset, import them into a pandas data frame, visualize them and get different
insights from the data.
3. Working with NumPy arrays, pandas data frames, and basic plots using Matplotlib.
4. Explore various variable and row filters in R for cleaning data. Apply various plot features
in R on sample data sets and visualize.
5. Perform Time Series Analysis and apply the various visualization techniques.
6. Perform Data Analysis and representation on a Map using various Map data sets with Mouse
Rollover effect, user interaction, etc.
7. Build cartographic visualization for multiple datasets involving various countries of the
world, states and districts in India, etc.
8. Perform Exploratory Data Analysis (EDA) on the Wine Quality Data Set.
9. Use a case study on a data set and apply the various EDA and visualization techniques and
present an analysis report.
Aim:
Python releases new versions periodically, each identified by a version number. This
exercise installs Python 3.11.3.
Installation on Windows
Visit https://www.python.org to download the latest release of Python. In this exercise,
we install Python 3.11.3 on the Windows operating system. Opening the above link brings up
the following page.
The following window will open. Select the check box that adds Python to the PATH; this sets
the Python path automatically. Then select Customize installation and proceed; this option also
lets us choose the install location and features. Make sure the option to install the launcher
for all users is checked.
Under the advanced options, select the "Install Python 3.11 for all users" check box, which is
not checked by default. This automatically checks the "Precompile standard library" option and
changes the install location. The location can be changed later, so we keep the default
install location. Then click the Install button to complete the installation.
Setup is now in progress. All the Python libraries, packages, and other default files will
be installed on the system. Once the installation is complete, the following page appears with
the message "Setup was successful".
To verify whether Python is installed on the system, do the following.
o Go to the "Start" button and search for "cmd".
o Then type "python --version".
o If Python is installed successfully, the installed version is displayed.
o If it is not installed, the error "'python' is not recognized as an internal
or external command, operable program or batch file." is printed.
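The same check can also be made from inside Python. The following is a minimal sketch using only the standard library; the minimum version REQUIRED is a hypothetical value chosen for illustration and is not a requirement stated in this manual.

```python
import platform
import sys

# Version of the interpreter that is running this script
installed = platform.python_version()
print("Python version:", installed)

# Hypothetical minimum version, used only to illustrate the comparison
REQUIRED = (3, 8)
is_supported = sys.version_info >= REQUIRED
print("Meets the assumed minimum version:", is_supported)
```

Comparing sys.version_info with a tuple avoids parsing the version string by hand.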
Now, to work on our first Python program, we will go to the interactive interpreter prompt (IDLE).
To open it, go to "Start" and type idle. Then click Open to start working in IDLE.
Result:
AIM:
To perform exploratory data analysis (EDA) with datasets like an email data set. Export all
emails as a dataset, import them into a pandas data frame, visualize them and get different insights
from the data.
ALGORITHM:
STEP 1: Export your emails as a dataset: Export emails from your email client as a CSV file.
Alternatively, if using Gmail, export the data using Google Takeout and download it in MBOX
format.
STEP 2: Load the email data into a pandas data frame: Use Python's mailbox module to read the
MBOX file and convert it into a pandas DataFrame.
STEP 3: Data cleaning and preprocessing: Clean the dataset by handling missing values, renaming
columns and extracting useful information.
STEP 4: Perform initial data exploration: Check for unique senders, subjects and labels, and
count the number of emails from each sender.
STEP 5: Data visualization: Plot the number of emails received over time to identify trends, plot the
top senders based on the number of emails and analyse the distribution of email reception times
throughout the day.
PROGRAM:
import mailbox
import pandas as pd
# Load the mbox file
mbox = mailbox.mbox('path_to_your_mbox_file.mbox')
# Convert to DataFrame
data = {
    "from": [],
    "subject": [],
    "date": [],
    "body": []
}
for msg in mbox:
    data["from"].append(msg["from"])
    data["subject"].append(msg["subject"])
    data["date"].append(msg["date"])
    if msg.is_multipart():
        # Concatenate the text/plain parts of a multipart message
        body = ''
        for part in msg.get_payload():
            if part.get_content_type() == 'text/plain':
                body += part.get_payload()
        data["body"].append(body)
    else:
        data["body"].append(msg.get_payload())
df = pd.DataFrame(data)
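Real mailboxes often contain nested multipart messages and encoded bodies, which the loop above does not decode. The following is a minimal sketch of a more defensive body extraction using only the standard library. The helper name extract_plain_text and the sample message RAW are hypothetical and are used only for illustration.

```python
from email import message_from_string

# Hypothetical sample message with a plain-text and an HTML alternative
RAW = """\
From: alice@example.com
Subject: Hello
Date: Mon, 01 Jan 2024 09:30:00 +0000
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset="utf-8"

Plain text body
--XYZ
Content-Type: text/html; charset="utf-8"

<p>HTML body</p>
--XYZ--
"""


def extract_plain_text(msg):
    """Return the concatenated text/plain parts of a message as a string."""
    parts = []
    for part in msg.walk():  # walk() also visits parts nested inside other parts
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    return "".join(parts)


msg = message_from_string(RAW)
body = extract_plain_text(msg)
print(repr(body))
```

Messages read from a mailbox.mbox object can be passed to the same helper, since they offer the same walk() and get_payload() methods.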
EDA on the DataFrame:
Basic Statistics:
print(df.info())
print(df.describe())
print(df.isnull().sum())
Time Series Analysis:
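# Parse the Date header; unparseable values become NaT and mixed time zones are converted to UTC
df['date'] = pd.to_datetime(df['date'], errors='coerce', utc=True)
df = df.dropna(subset=['date'])
df.set_index('date', inplace=True)
df['subject'].resample('M').count().plot(title="Monthly Email Count")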
Sender analysis:
df['from'].value_counts().head(10).plot(kind='bar', title="Top 10 Email Senders")
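STEP 5 of the algorithm also asks for the distribution of email reception times throughout the day. The following is a minimal sketch of that count using only the standard library. The list date_headers is hypothetical sample data standing in for the raw Date headers of the mailbox.

```python
from collections import Counter
from email.utils import parsedate_to_datetime

# Hypothetical Date headers, standing in for the values read from the mailbox
date_headers = [
    "Mon, 01 Jan 2024 09:30:00 +0000",
    "Mon, 01 Jan 2024 09:45:10 +0000",
    "Tue, 02 Jan 2024 14:05:00 +0000",
    "Wed, 03 Jan 2024 21:15:30 +0000",
]

# Extract the hour of the day from each header and count the emails per hour
hours = [parsedate_to_datetime(header).hour for header in date_headers]
emails_per_hour = Counter(hours)

# Print a simple text histogram, one row per hour that has at least one email
for hour in sorted(emails_per_hour):
    print(f"{hour:02d}:00  {'#' * emails_per_hour[hour]}")
```

With a pandas DataFrame indexed by date, the same counts could be obtained from the hour attribute of the index.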
Word Cloud of Subjects or Bodies:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
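A word cloud is drawn from word frequencies. The following is a minimal sketch of how those frequencies could be computed from the subject lines using only the standard library. The subjects list and the STOPWORDS set are hypothetical sample values. The resulting mapping is the kind of input that the wordcloud library's WordCloud.generate_from_frequencies method accepts.

```python
import re
from collections import Counter

# Hypothetical subject lines, standing in for df['subject']
subjects = [
    "Meeting agenda for Monday",
    "Re: Meeting agenda",
    "Invoice for March",
    "Re: Invoice for March",
]

# Small illustrative stop-word list; a real analysis would use a fuller one
STOPWORDS = {"re", "fwd", "for", "the", "a", "an", "and", "of", "to"}

words = []
for subject in subjects:
    # Lower-case the subject and keep alphabetic tokens only
    for word in re.findall(r"[a-z']+", subject.lower()):
        if word not in STOPWORDS:
            words.append(word)

freq = Counter(words)
print(freq.most_common(3))
```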
OUTPUT:
RESULT:
Thus exploratory data analysis (EDA) was performed on the email dataset, and the data was
visualized to get different insights.
Exp No: 3
NUMPY ARRAYS, PANDAS DATA FRAMES, BASIC PLOTS USING MATPLOTLIB
Date:
AIM:
To understand and implement fundamental data manipulation and visualization techniques using
the Python libraries NumPy, pandas and Matplotlib.
ALGORITHM:
NumPy
STEP 1: Create an array using NumPy.
STEP 2: Access the elements in the array.
STEP 3: Retrieve elements using the slice operation.
STEP 4: Perform calculations on the array.
Pandas
STEP 1: Create a dataframe from a dictionary.
STEP 2: Display the first and last few rows of the dataframe.
STEP 3: Select a specific column and filter rows based on a condition.
STEP 4: Calculate descriptive statistics for numeric columns.
Matplotlib
STEP 1: Import the matplotlib library.
STEP 2: Define the x and y axes.
STEP 3: Label the axes.
STEP 4: Visualize the data using a line plot, bar chart, scatter plot, etc.
PROGRAM:
NUMPY:
1)
import numpy as np
np.random.seed(0) # seed for reproducibility
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(2, 3)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array
print("X1,", x1)
print("X2,", x2)
print("X3,", x3)
#1D
print("x1 ndim: ", x1.ndim)
print("x1 shape:", x1.shape)
print("x1 size: ", x1.size)
#2D
print("x2 ndim: ", x2.ndim)
print("x2 shape:", x2.shape)
print("x2 size: ", x2.size)
#3D
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
2) To get input from a user and store it as a NumPy array, you can use Python's input() function.
import numpy as np
# Define how many numbers the user should enter
n = int(input("How many numbers do you want to enter? "))
# Collect the numbers in a list
numbers = []
for i in range(n):
    num = float(input(f"Enter number {i+1}: ")) # Use int() for integers
    numbers.append(num)
# Convert the list to a NumPy array
arr = np.array(numbers)
print(arr)
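import numpy as np
# Creating a 1D array
arr = np.array([1, 2, 3, 4, 5])
print(arr)
# Creating a 2D array
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr2d)
# Creating arrays and merging them along an axis (a and b are assumed names)
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
merged_vertically = np.concatenate((a, b), axis=0)
merged_horizontally = np.concatenate((a, b), axis=1)
print(merged_vertically)
print(merged_horizontally)
#Concatenating 1D arrays
concatenated_arr = np.concatenate((np.array([1, 2, 3]), np.array([4, 5, 6])))
print(concatenated_arr)
# Checking and converting data types (the conversions below are assumed)
print(arr.dtype)
arr_float = arr.astype(float)
print(arr_float.dtype)
arr_int = arr_float.astype(int)
print(arr_int.dtype)
print(arr_float.dtype)
arr_bool = arr.astype(bool)
print(arr_int)
print(arr_bool)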
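STEPS 2 to 4 of the NumPy algorithm (access, slice, compute) can be sketched as follows. The array values are illustrative and are not taken from this manual.

```python
import numpy as np

# Illustrative values, chosen so the results are easy to check by hand
arr = np.array([10, 20, 30, 40, 50])

# STEP 2: access single elements by index
first = arr[0]    # first element
last = arr[-1]    # last element

# STEP 3: retrieve elements with slices
middle = arr[1:4]       # elements at index 1, 2 and 3
every_other = arr[::2]  # every second element

# STEP 4: compute on the whole array at once
total = arr.sum()
mean = arr.mean()
doubled = arr * 2       # element-wise multiplication

print(first, last)
print(middle, every_other)
print(total, mean)
print(doubled)
```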
Pandas
1)
import pandas as pd
import numpy as np
ser = pd.Series()
print("Pandas Series: ", ser)
# simple array
data = np.array(['r', 'a', 'g', 'z'])
ser = pd.Series(data)
print("Pandas Series:\n", ser)
#creating dataframe
import pandas as pd
df = pd.DataFrame()
print(df)
# list of strings
lst = ['apple', 'orange', 'kiwi', 'grapes']
df = pd.DataFrame(lst)
print(df)
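The pandas steps listed in the algorithm (create a dataframe from a dictionary, show the first and last rows, select and filter, describe) can be sketched as follows. The column names and values are hypothetical sample data.

```python
import pandas as pd

# STEP 1: create a dataframe from a dictionary (hypothetical sample data)
data = {
    "name": ["Asha", "Bala", "Chitra", "Dinesh", "Esha"],
    "age": [21, 25, 19, 30, 22],
    "score": [88.0, 92.5, 79.0, 65.5, 90.0],
}
df = pd.DataFrame(data)

# STEP 2: display the first and last rows
print(df.head(2))
print(df.tail(2))

# STEP 3: select one column and filter rows with a condition
scores = df["score"]
older = df[df["age"] > 21]
print(older)

# STEP 4: descriptive statistics for the numeric columns
print(df.describe())
```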
Matplotlib
#Line plot
import matplotlib.pyplot as plt
x=[1,2,3,4]
y=[3,5,2,7]
plt.plot(x,y,marker='o',linestyle='-',color='blue')
plt.title('sample line chart')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.show()
#Barchart
import matplotlib.pyplot as plt
categories=['A','B','C']
values=[10,20,30]
plt.bar(categories,values,color='red')
plt.title('sample bar chart')
plt.xlabel(' categories')
plt.ylabel('values')
plt.show()
#Scatterplot
import matplotlib.pyplot as plt
import numpy as np
# Generate random data
x = np.random.rand(50) # 50 random points for x-axis
y = np.random.rand(50) # 50 random points for y-axis
# Create a scatter plot
plt.scatter(x, y, color='blue', marker='o', alpha=0.7)
plt.title("Simple Scatter Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
# Show the plot
plt.show()
#Histogram
import matplotlib.pyplot as plt
import numpy as np
# Generate random data
data = np.random.randn(1000) # 1000 random values from a normal distribution
# Create a histogram
plt.hist(data, bins=30, color='skyblue', edgecolor='black', alpha=0.7)
plt.title("Histogram of Random Data")
plt.xlabel("Value")
plt.ylabel("Frequency")
# Show the plot
plt.show()
OUTPUT:
#Numpy
1)
X1, [5 0 3 3 7 9]
X2, [[3 5 2]
[4 7 6]]
X3, [[[8 8 1 6 7]
[7 8 1 5 9]
[8 9 4 3 0]
[3 5 0 2 3]]
[[8 1 3 3 3]
[7 0 1 9 9]
[0 4 7 3 2]
[7 2 0 0 4]]
[[5 5 6 8 4]
[1 4 9 8 1]
[1 7 9 9 3]
[6 7 2 0 3]]]
#1D
x1 ndim: 1
x1 shape: (6,)
x1 size: 6
#2D
x2 ndim: 2
x2 shape: (2, 3)
x2 size: 6
#3D
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
2) How many numbers do you want to enter?
#Creating a NumPy Array
[1 2 3 4 5]
[[1 2 3]
[4 5 6]]
#Concatenation along an axis
[[1 2]
[3 4]
[5 6]
[7 8]]
[[1 2 5 6]
[3 4 7 8]]
#Concatenating 1D arrays
[1 2 3 4 5 6]
#Pandas
#Matplotlib
RESULT:
Thus the Python programs using NumPy, pandas and Matplotlib were executed successfully.
EXP NO: 04
EXPLORING VARIOUS VARIABLE AND ROW FILTERS IN R FOR CLEANING DATA
DATE:
AIM:
To explore various variable and row filters in R for cleaning data and apply various plot features
in R on sample datasets and visualize.
ALGORITHM:
STEP 1: Load the data: Load the iris dataset and ensure it is in the right format.
STEP 2: Data exploration and variable conversion: Verify the type of each variable and convert
Species to a factor if necessary.
STEP 3: Filter data by rows: Apply row filters to select rows that meet specific criteria and combine
multiple conditions for refined filtering.
STEP 4: Filter data by columns: Use select() to isolate specific variables for analysis and remove
unnecessary columns to simplify the dataset.
STEP 5: Data cleaning and transformation: Handle missing values by replacing or removing
unavailable (NA) values.
STEP 6: Data visualization: Use ggplot2 to create various plots and explore relationships between
variables in the iris dataset.
PROGRAM:
# Import necessary libraries
library(dplyr)
library(ggplot2)
library(plotly)
data(iris)
ggplot(data = iris) + geom_point(aes(x = Sepal.Length, y = Sepal.Width, color = Species))
#Selecting Variables
selected_vars <- select(iris, Sepal.Length, Sepal.Width)
selected_vars
ggplot(data = selected_vars) + geom_point(aes(x = Sepal.Length, y = Sepal.Width))
#dropping Variables
dropped_vars <- select(iris, -Species)
dropped_vars
ggplot(data = dropped_vars) + geom_point(aes(x = Sepal.Length, y = Sepal.Width))
# Renaming Variables
renamed_vars <- rename(iris, Length = Sepal.Length, Width = Sepal.Width)
renamed_vars
ggplot(data = renamed_vars) + geom_point(aes(x = Length, y = Width))
# Filtering Rows based on Conditions
filtered_data <- filter(iris, Petal.Width > 0.2)
filtered_data
ggplot(data = filtered_data) + geom_point(aes(x = Sepal.Length, y = Sepal.Width))
# Filtering Rows based on Multiple Conditions
filtered_data <- filter(iris, Petal.Width > 0.2 & Sepal.Length > 5)
filtered_data
ggplot(data = filtered_data) + geom_point(aes(x = Sepal.Length, y = Sepal.Width))
# Sorting Rows based on Variables
sorted_data <- arrange(iris, Sepal.Length)
sorted_data
ggplot(data = sorted_data) + geom_point(aes(x = Sepal.Length, y = Sepal.Width))
# Plot Features
# Base R - Scatter Plot
plot(iris$Sepal.Length, iris$Sepal.Width)
# Base R - Bar Plot
barplot(iris$Sepal.Width, names.arg = iris$Species)
# Base R - Histogram
hist(iris$Sepal.Length)
# ggplot2 - Scatter Plot
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
# ggplot2 - Bar Plot
ggplot(iris, aes(x = factor(Species), y = Sepal.Width)) +
  geom_bar(stat = "summary", fun = "mean") +
  labs(title = "Mean Sepal Width by Species", x = "Species", y = "Mean Sepal Width") +
  theme_minimal()
# ggplot2 - Histogram
ggplot(iris, aes(x = Sepal.Length)) + geom_histogram()
OUTPUT:
RESULT:
Thus the program for cleaning and visualizing the data using R was executed.
Exp No: 5
TIME SERIES ANALYSIS
Date:
AIM:
To perform Time Series Analysis using a temperature dataset and apply the various
visualization techniques.
ALGORITHM:
STEP 1: Load the Data: Load a time series dataset and ensure it is in the right format (e.g., datetime
index).
STEP 2: Visualize Raw Data: Use line plots to understand the overall trend and patterns.
STEP 3: Decomposition: Break down the series into trend, seasonal, and residual components.
STEP 4: Seasonality Analysis: Identify recurring seasonal patterns.
STEP 5: Autocorrelation and Partial Autocorrelation: Visualize lag relationships to detect
seasonality and trend.
STEP 6: Rolling Statistics: Smooth the series to better understand trends and identify anomalies.
PROGRAM:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
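STEPS 1, 5 and 6 of the algorithm can be sketched with pandas alone. The series below is synthetic (a linear trend plus a 12-month seasonal cycle) and is an assumption standing in for the temperature dataset, which is not included here.

```python
import numpy as np
import pandas as pd

# STEP 1: build a monthly series with a datetime index (synthetic data, not the real dataset)
dates = pd.date_range("2020-01-01", periods=48, freq="MS")
t = np.arange(48)
temperature = pd.Series(
    20 + 0.05 * t + 5 * np.sin(2 * np.pi * t / 12),  # trend plus yearly cycle
    index=dates,
    name="temperature",
)

# STEP 6: rolling statistics over a 12-month window smooth out the seasonal cycle
rolling_mean = temperature.rolling(window=12).mean()
rolling_std = temperature.rolling(window=12).std()

# STEP 5: autocorrelation at selected lags; a peak at lag 12 indicates yearly seasonality
lag_12 = temperature.autocorr(lag=12)
lag_6 = temperature.autocorr(lag=6)

print(rolling_mean.dropna().head())
print("Autocorrelation at lag 12:", round(lag_12, 3))
print("Autocorrelation at lag 6:", round(lag_6, 3))
```

The first eleven values of the rolling mean are missing because a full 12-month window is not yet available. Calling rolling_mean.plot() would draw the smoothed trend.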
Exp No: 6
DATA ANALYSIS AND REPRESENTATION ON A MAP
Date:
AIM:
To perform data analysis and visualization on a map using Python. This includes
creating an interactive map representation that responds to user interactions, such as mouse rollover,
displaying information, and customizing visual styles.
ALGORITHM:
STEP 1: Load and Inspect Dataset: Load a map-based dataset (e.g., locations with geographical data
like latitude and longitude).
STEP 2: Data Cleaning and Preprocessing: Handle missing values, format data types, and filter
relevant data.
STEP 3: Data Analysis: Analyze data to get insights for visualization (e.g., counts by region,
clustering, or heat maps).
STEP 4: Create Base Map: Initialize a base map centered on a relevant region or with global
coordinates.
STEP 6: Display and Save Map: Display the interactive map in a Jupyter notebook or save it as an
HTML file.
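STEPS 2 and 3 of the algorithm (cleaning and a simple analysis) can be sketched with pandas as follows. The table is hypothetical sample data with approximate coordinates, and the column names city, state, lat and lng are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; the last row is deliberately invalid
raw = pd.DataFrame({
    "city": ["Chennai", "Coimbatore", "Mumbai", "Pune", "Unknown"],
    "state": ["Tamil Nadu", "Tamil Nadu", "Maharashtra", "Maharashtra", None],
    "lat": ["13.08", "11.02", "19.08", "18.52", "not available"],
    "lng": [80.27, 76.96, 72.88, 73.86, 500.0],
})

# STEP 2: format data types; text that is not a number becomes NaN
raw["lat"] = pd.to_numeric(raw["lat"], errors="coerce")

# STEP 2: drop rows with missing values and keep only valid coordinate ranges
clean = raw.dropna(subset=["lat", "lng", "state"])
clean = clean[clean["lat"].between(-90, 90) & clean["lng"].between(-180, 180)]

# STEP 3: count the locations per region
counts = clean.groupby("state").size().sort_values(ascending=False)

print(clean)
print(counts)
```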
PROCEDURE:
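import pandas as pd
import folium
from folium import plugins  # needed for HeatMap and MarkerCluster below
# Here, we use a sample dataset with geographic locations. Replace the file name with the actual file path.
# The code below expects the columns 'lat', 'lng' and 'country'.
data = pd.read_csv('India Cities LatLng.csv')
# Step 3: Perform Data Analysis (e.g., count per location, grouping, etc.)
location_counts = data['country'].value_counts().reset_index()
location_counts.columns = ['Location', 'Count']  # names used in the popup lookup below
# Create the base map (the centre and zoom level are assumed values)
m = folium.Map(location=[data['lat'].mean(), data['lng'].mean()], zoom_start=5)
# Add a circle marker for each location
for _, row in data.iterrows():
    folium.CircleMarker(
        location=(row['lat'], row['lng']),
        radius=6,
        color='blue',
        fill=True,
        fill_color='blue',
    ).add_to(m)
# Step 6: Add a Heatmap for Visual Representation (optional, for density-based insights)
heat_data = data[['lat', 'lng']].values.tolist()  # assumed: one [lat, lng] pair per row
plugins.HeatMap(heat_data).add_to(m)
# Add clustered markers with a popup showing the count for each location
marker_cluster = plugins.MarkerCluster().add_to(m)
for _, row in data.iterrows():
    folium.Marker(
        location=(row['lat'], row['lng']),
        popup=f"<b>{row['country']}</b><br>Count: {location_counts[location_counts['Location'] == row['country']]['Count'].values[0]}"
    ).add_to(marker_cluster)
# Display the map (in a Jupyter notebook)
m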
OUTPUT:
RESULT:
The interactive map can be viewed in a web browser as an HTML file, allowing for easy
sharing and use in presentations.
Exp No: 7
BUILD CARTOGRAPHIC VISUALIZATION FOR MULTIPLE DATASETS INVOLVING VARIOUS COUNTRIES OF THE WORLD; STATES AND DISTRICTS IN INDIA ETC.
Date:
AIM:
ALGORITHM:
STEP 1: Import Libraries: Import geopandas and matplotlib.pyplot for handling geospatial data and
plotting maps.
STEP 3: Initialize Plot: Create a side-by-side plot layout with two subplots.
PROGRAM:
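import geopandas as gpd
import matplotlib.pyplot as plt
world = gpd.read_file('/content/ne_110m_admin_0_countries.shp')
india_states = gpd.read_file('/content/india_India_Country_Boundary.shp')
# Load India districts shapefile
fig, ax = plt.subplots(1, 2, figsize=(15, 8))  # assumed: two side-by-side subplots, as in STEP 3
world.plot(ax=ax[0])
ax[0].set_title('World Map')
india_states.plot(ax=ax[1])  # assumed: India boundary drawn on the second subplot
plt.tight_layout()
plt.show()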
OUTPUT:
RESULT:
Thus cartographic visualizations were built: an India states map showing the individual states
with clear state boundaries, and code that includes districts within India, with clear distinctions
among world countries, Indian states, and districts.
Exp No: 8
Perform Exploratory Data Analysis (EDA) on Wine Quality Data Set
Date:
AIM
To perform Exploratory Data Analysis (EDA) on the Wine Quality dataset to understand data
distributions, correlations, and patterns that may affect wine quality.
ALGORITHM
STEP 1: Load the Dataset: Import and load the Wine Quality dataset.
STEP 2: Data Overview: Inspect dataset structure, types, and basic statistics.
STEP 3: Data Cleaning: Check for and handle any missing or duplicate values.
STEP 5: Visualization:
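PROGRAM:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset (the file name is an assumption; use the path of your Wine Quality CSV file)
data = pd.read_csv('winequality-red.csv')
print("Dataset Overview:")
print(data.info())
print("\nDescriptive Statistics:")
print(data.describe())
# Missing values per column
print(data.isnull().sum())
data.drop_duplicates(inplace=True)
# Distribution of Quality
plt.figure(figsize=(8, 5))
sns.countplot(x='quality', data=data)  # assumed plot call; 'quality' is the assumed target column
plt.show()
# Histogram of Features
data.hist(bins=20, figsize=(15, 10))  # assumed plot call
plt.suptitle("Feature Distributions")
plt.show()
# Correlation Matrix
plt.figure(figsize=(12, 8))
sns.heatmap(data.select_dtypes('number').corr(), annot=True, cmap='coolwarm')  # assumed plot call
plt.show()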
plt.figure(figsize=(15, 10))
sns.boxplot(data=data, palette="Set2")
plt.xticks(rotation=45)
plt.show()
plt.figure(figsize=(12, 6))
plt.show()
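The cleaning step and the search for features related to quality can be checked on a small table before running them on the full dataset. The following is a minimal sketch with pandas. The values and the column names alcohol, volatile_acidity and quality are hypothetical sample data, not rows from the Wine Quality dataset.

```python
import pandas as pd

# Hypothetical sample rows; the third row duplicates the second on purpose
wine = pd.DataFrame({
    "alcohol": [9.4, 9.8, 9.8, 10.5, 11.2, 12.0],
    "volatile_acidity": [0.70, 0.88, 0.88, 0.60, 0.40, 0.30],
    "quality": [5, 5, 5, 6, 7, 8],
})

# Data cleaning: count missing and duplicate values, then drop the duplicates
n_missing = int(wine.isnull().sum().sum())
n_duplicates = int(wine.duplicated().sum())
wine = wine.drop_duplicates()

# Correlation of each feature with the quality column
corr_with_quality = wine.corr()["quality"].drop("quality").sort_values()

print("Missing values:", n_missing)
print("Duplicate rows removed:", n_duplicates)
print(corr_with_quality)
```

Features with a strongly positive or negative correlation with quality are natural candidates for closer inspection in the plots.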
OUTPUT:
RESULT:
This EDA helps identify important features and relationships, providing insights for further
analysis or model building.
Exp No: 9
USE A CASE STUDY ON A DATA SET AND APPLY THE VARIOUS EDA AND VISUALIZATION TECHNIQUES AND PRESENT AN ANALYSIS REPORT
Date:
AIM:
Use a case study on a data set and apply the various EDA and visualization techniques and present an
analysis report.
PROCEDURE:
Import Libraries:
Descriptive Statistics:
Correlation Heatmap:
Violin Plots:
Combine box plots with kernel density estimation for better visualization.
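The two ingredients of a violin plot can be computed by hand to see what the plot shows. The following is a minimal sketch with NumPy: the five-number summary is the box plot part, and a Gaussian kernel density estimate is the outline of the violin. The sample values are hypothetical, and the bandwidth uses a common normal-reference rule of thumb, which is an assumption and may differ from what a plotting library chooses.

```python
import numpy as np

# Hypothetical sample values for one numeric feature
values = np.array([4.8, 5.0, 5.1, 5.4, 5.6, 5.9, 6.0, 6.3, 6.7, 7.1])

# Box plot part: five-number summary
minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])

# KDE part: average of Gaussian kernels centred on each observation
n = values.size
bandwidth = 1.06 * values.std(ddof=1) * n ** (-1 / 5)  # rule-of-thumb bandwidth (assumed choice)
grid = np.linspace(values.min() - 4 * bandwidth, values.max() + 4 * bandwidth, 400)
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# The estimated density should integrate to approximately 1
area = density.sum() * (grid[1] - grid[0])

print("Five-number summary:", minimum, q1, median, q3, maximum)
print("Area under the density curve:", round(area, 3))
```

A violin plot draws this density mirrored on both sides of an axis with the box summary inside it, for example through seaborn's violinplot function.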