NAAN MUDHALVAN DATA ANALYTICS
COURSE FOR ENGINEERING STUDENTS
SUBMITTED BY:
SURYA.R(au422522104305)
NM1069-DATA ANALYTICS
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Year/Semester-III/VI
UNIVERSITY COLLEGE OF ENGINEERING VILLUPURAM
(A CONSTITUENT COLLEGE OF ANNA UNIVERSITY CHENNAI)
VILLUPURAM – 605 103
ANNA UNIVERSITY: CHENNAI 600 025
APRIL 2025
Department of Computer Science and
Engineering
Bonafide record of work done in the Computer Laboratory of University College Of
Engineering Villupuram for NM1069-NAAN MUDHALVAN Data Analytics
Course by Google during the year 2024-2025
by………………………………………........Reg.No: .......................................................
Studying in the Sixth Semester B.E. (Computer Science and Engineering).
Staff In-Charge Head of the Department
Submitted for the practical examination held at University College of Engineering
Villupuram on ………………….
Internal Examiner External Examiner
INDEX
S.NO TOPIC SIGN
1. EDA ON GLOBAL SUPERSTORE SALES DATASET
2. EDA ON COVID-19 GLOBAL DATASET
3. EDA ON YOUTUBE TRENDING VIDEOS DATASET
EX.No:1 EDA ON GLOBAL SUPERSTORE SALES DATASET
EXPLORATORY DATA ANALYSIS (EDA):
Exploratory Data Analysis (EDA) is the process of examining and understanding a
dataset before applying any modeling or predictive techniques. It involves summarizing the
dataset’s main characteristics using statistical measures and visualizations to uncover patterns,
spot anomalies, test hypotheses, and check assumptions. EDA typically includes cleaning the
data (handling missing values and duplicates), generating descriptive statistics (like mean,
median, and standard deviation), and using plots such as histograms, bar charts, and line graphs
to visualize trends and relationships. This step is crucial for gaining insights and making
informed decisions about the direction of further analysis or modeling.
DATA SOURCE:
Dataset link: https://www.kaggle.com/datasets/fatihilhan/global-superstore-dataset
STEP 1: LOAD THE DATASET
PROGRAM:
import pandas as pd
file_path = "/content/GLOBAL DATASTORE.csv"
df = pd.read_csv(file_path)
OUTPUT:
STEP 2: DATA CLEANING
Check and remove missing values
Remove duplicates
PROGRAM:
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)
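The step description mentions checking for missing values, while the program above only removes them. A minimal sketch of the check is shown here on a small made-up table; the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data standing in for the real file; all values are invented.
toy = pd.DataFrame({
    "Sales": [100.0, None, 250.0, 250.0],
    "Profit": [10.0, 5.0, 40.0, 40.0],
})

# Count missing values per column and fully duplicated rows before removing anything.
missing_per_column = toy.isna().sum()
duplicate_rows = int(toy.duplicated().sum())
print(missing_per_column)
print("Duplicate rows:", duplicate_rows)

# Apply the same cleaning as in the program above, leaving the original frame untouched.
cleaned = toy.dropna().drop_duplicates()
print("Rows before:", len(toy), "Rows after:", len(cleaned))
```

Printing the counts first makes it visible how many rows the cleaning step discards, which matters when interpreting the totals computed later.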
STEP 3: SUMMARY STATISTICS
PROGRAM:
sales_summary = df["Sales"].describe()[["mean", "50%", "std"]]
profit_summary = df["Profit"].describe()[["mean", "50%", "std"]]
print("Sales Summary:\n", sales_summary)
print("Profit Summary:\n", profit_summary)
OUTPUT:
Sales Summary:
mean 246.498440
50% 85.000000
std 487.567175
Name: Sales, dtype: float64
Profit Summary:
mean 28.610982
50% 9.240000
std 174.340972
Name: Profit, dtype: float64
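In the output above, the mean of Sales (246.50) is almost three times the median (85.00), which usually points to a right-skewed distribution where a few large orders pull the mean upward. The sketch below illustrates the effect on invented numbers; it is not computed from the dataset.

```python
import pandas as pd

# Invented values: four small orders and one very large one.
toy_sales = pd.Series([10, 20, 30, 40, 1000], name="Sales")

# Same three statistics as in the program above, requested by name.
summary = toy_sales.agg(["mean", "median", "std"])
print(summary)

# Positive skewness indicates a long right tail.
skewness = toy_sales.skew()
print("Skewness:", skewness)
```

When the mean and median differ this much, the median is often the more representative "typical" value to report.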
STEP 4: ANALYSIS
Total Sales per Region
PROGRAM:
sales_per_region = df.groupby("Region")["Sales"].sum()
print(sales_per_region)
OUTPUT:
Region
Africa 783776
Canada 66932
Caribbean 324281
Central 2822399
Central Asia 752839
EMEA 806184
East 678834
North 1248192
North Asia 848349
Oceania 1100207
South 1600960
Southeast Asia 884438
West 725514
Name: Sales, dtype: int64
Top 5 Most Profitable Product Categories (the cleaned data contains only three categories, so all three are listed)
PROGRAM:
top_profitable_categories=df.groupby("Category")["Profit"].sum().nlargest(5)
print(top_profitable_categories)
OUTPUT:
Category
Technology 663778.73318
Office Supplies 518473.83430
Furniture 285204.72380
Name: Profit, dtype: float64
Year-wise Sales Trend
PROGRAM:
df["Order.Date"] = pd.to_datetime(df["Order.Date"])
df["Year"] = df["Order.Date"].dt.year
yearly_sales = df.groupby("Year")["Sales"].sum()
print(yearly_sales)
OUTPUT:
Year
2025 12642905
Name: Sales, dtype: int64
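The output above contains a single year, which suggests the date column may not have been parsed as intended. One way to check is to pass an explicit format and inspect the parsed range. The sketch below uses invented date strings in day-month-year form; the actual format of `Order.Date` in the file is an assumption here and should be confirmed against the raw data.

```python
import pandas as pd

# Invented rows; the last date string is deliberately invalid.
toy = pd.DataFrame({
    "Order.Date": ["31-12-2011", "15-06-2012", "01-01-2013", "not a date"],
    "Sales": [100, 200, 300, 400],
})

# An explicit format avoids ambiguous guessing; unparseable values become NaT.
toy["Order.Date"] = pd.to_datetime(
    toy["Order.Date"], format="%d-%m-%Y", errors="coerce"
)

# Inspect how many values failed to parse and what range the rest cover.
unparsed = int(toy["Order.Date"].isna().sum())
print("Unparsed dates:", unparsed)
print("Date range:", toy["Order.Date"].min(), "to", toy["Order.Date"].max())

# Group only the rows with a valid date.
valid = toy.dropna(subset=["Order.Date"]).copy()
valid["Year"] = valid["Order.Date"].dt.year
yearly = valid.groupby("Year")["Sales"].sum()
print(yearly)
```

If the parsed range on the real data spans only one year, the format string or the source column is the first thing to re-examine.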
STEP 5: VISUALIZATIONS
Bar Chart: Sales by Region
PROGRAM:
import matplotlib.pyplot as plt
sales_per_region.plot(kind="bar", color="skyblue")
plt.title("Total Sales by Region")
plt.xlabel("Region")
plt.ylabel("Total Sales")
plt.show()
OUTPUT:
Line Chart: Year-wise Sales Trend
PROGRAM:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
yearly_sales = df.groupby('Year')['Sales'].sum()
sns.lineplot(x=yearly_sales.index, y=yearly_sales.values, marker='o',
color='orange')
plt.title("Year-wise Sales Trend")
plt.xlabel("Year")
plt.ylabel("Total Sales")
plt.tight_layout()
plt.show()
OUTPUT:
GOOGLE COLAB LINK:
https://colab.research.google.com/drive/12ok_SXN84wnqSQL9AzV4OA4e7kDohCWQ?usp=sharing
STEP 6: INSIGHTS
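Bar Chart – Sales by Region:
Central shows the highest total sales (2,822,399, about 22% of the total), followed by
South (1,600,960) and North (1,248,192).
Canada (66,932) and Caribbean (324,281) have the lowest totals, indicating potential for
growth or marketing focus.
Line Chart – Year-wise Sales Trend:
The year-wise output contains a single year (2025) holding the entire sales total, so no
year-over-year trend can be read from this chart.
The Order.Date column should be re-checked for how it was parsed before drawing
conclusions about growth over time.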
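The regional ranking can be checked directly against the totals printed in Step 4. The sketch below re-enters those printed totals by hand (rather than reloading the file) and ranks them with each region's percentage share.

```python
import pandas as pd

# Totals copied from the "Total Sales per Region" output in Step 4.
sales_per_region = pd.Series({
    "Africa": 783776, "Canada": 66932, "Caribbean": 324281,
    "Central": 2822399, "Central Asia": 752839, "EMEA": 806184,
    "East": 678834, "North": 1248192, "North Asia": 848349,
    "Oceania": 1100207, "South": 1600960, "Southeast Asia": 884438,
    "West": 725514,
}, name="Sales")

# Rank regions from highest to lowest and compute each region's share.
ranked = sales_per_region.sort_values(ascending=False)
share_pct = (ranked / ranked.sum() * 100).round(1)
print(pd.DataFrame({"Sales": ranked, "Share (%)": share_pct}))
```

The sum of the regional totals equals the single yearly total in the year-wise output (12,642,905), which confirms both tables were computed from the same rows.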
EX.No:2 EDA ON COVID-19 GLOBAL DATASET
INTRODUCTION:
The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has had a profound global
impact since early 2020, affecting millions of lives and disrupting economies. To better
understand the spread, trends, and regional impact of the virus, data-driven approaches such as
Exploratory Data Analysis (EDA) are essential. By exploring confirmed cases, recoveries, and
deaths, this analysis aims to uncover insights into the progression of the pandemic, identify the
most affected states, and visualize daily trends in new infections.
DATA SOURCE:
Dataset link: https://www.kaggle.com/datasets/ COVID-19 in India
GOOGLE COLAB LINK:
https://colab.research.google.com/drive/1VEuFN6gRCyIMnIEkwccqlMqENFi11BRv?usp=sharing
STEP 1: LOAD AND INSPECT THE DATASET
PROGRAM:
import pandas as pd
df = pd.read_csv('path_to_covid_dataset.csv')
print(df.head())
OUTPUT:
PROGRAM:
print(df.columns)
print(df.info())
OUTPUT:
STEP 2: HANDLE MISSING DATA AND CONVERT DATES
PROGRAM:
df.fillna(0, inplace=True)
df['Date'] = pd.to_datetime(df['Date'])
STEP 3: COMPUTE METRICS
a) Total confirmed, recovered, and death cases per state:
PROGRAM:
statewise_total=df.groupby('State/UnionTerritory')[['Confirmed','Cured',
'Deaths']].max().reset_index()
print(statewise_total)
OUTPUT:
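Taking `.max()` rather than `.sum()` per state in the program above only gives a sensible total if the counts are cumulative running totals for each state. That appears to be how this dataset is organized, but it is an assumption here. The sketch below shows the difference on invented numbers for two hypothetical states, A and B.

```python
import pandas as pd

# Invented cumulative counts for two hypothetical states over three days.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "State/UnionTerritory": ["A"] * 3 + ["B"] * 3,
    "Confirmed": [10, 15, 30, 5, 5, 8],
})

# With cumulative data, the largest value per state is its total so far.
latest_total = toy.groupby("State/UnionTerritory")["Confirmed"].max()

# Summing cumulative values counts earlier cases again on every later day.
overcounted = toy.groupby("State/UnionTerritory")["Confirmed"].sum()

print(pd.DataFrame({"max (total)": latest_total, "sum (overcount)": overcounted}))
```

If the source data held daily new cases instead of running totals, the roles would reverse and `.sum()` would be the correct aggregate.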
b) State with the highest number of confirmed cases:
PROGRAM:
top_state=statewise_total[statewise_total['Confirmed']==
statewise_total['Confirmed'].max()]
print("State with highest confirmed cases:\n", top_state)
OUTPUT:
State with highest confirmed cases:
State/UnionTerritory Confirmed Cured Deaths
27 Maharashtra 6363442 6159676 134201
c) Daily trend of new cases:
PROGRAM:
daily_cases = df.groupby('Date')['Confirmed'].sum().diff().fillna(0)
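The `diff()` call above turns a cumulative national series into day-to-day changes. Two details are worth seeing on a small example: the first day has no previous value (so it is filled with 0), and a downward data correction shows up as a negative "new cases" value. The sketch below uses invented numbers; clipping at zero and the 3-day rolling average are optional smoothing choices, not part of the original program.

```python
import pandas as pd

# Invented cumulative national totals; day 3 contains a downward correction.
cumulative = pd.Series(
    [100, 130, 125, 180, 260],
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
    name="Confirmed",
)

# Day-to-day change; the first day has no predecessor and is filled with 0.
daily_new = cumulative.diff().fillna(0)

# Optional: treat negative corrections as zero new cases.
daily_new_clipped = daily_new.clip(lower=0)

# Optional: smooth with a short rolling mean to make waves easier to see.
smoothed = daily_new_clipped.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"new": daily_new, "clipped": daily_new_clipped, "smoothed": smoothed}))
```

On real data a 7-day window is a common choice because it averages out weekly reporting patterns.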
STEP 4: VISUALIZATIONS
a) Pie Chart: Top 5 States by Confirmed Cases
PROGRAM:
import matplotlib.pyplot as plt
top5_states = statewise_total.sort_values('Confirmed', ascending=False).head(5)
plt.figure(figsize=(8, 8))
plt.pie(top5_states['Confirmed'],labels=top5_states['State/UnionTerritory'],
autopct='%1.1f%%', startangle=140)
plt.title('Top 5 Indian States by Confirmed COVID-19 Cases')
plt.show()
OUTPUT:
b) Line Graph: Daily Trend of Confirmed Cases
PROGRAM:
plt.figure(figsize=(10, 6))
plt.plot(daily_cases.index, daily_cases.values, color='blue')
plt.title('Daily New Confirmed COVID-19 Cases in India')
plt.xlabel('Date')
plt.ylabel('New Cases')
plt.grid(True)
plt.show()
OUTPUT:
STEP 5: OBSERVATION
The most affected states (e.g., Maharashtra, Kerala, Karnataka) account for a
large share of confirmed cases, with Maharashtra highest at 6,363,442.
The trend graph shows multiple waves: sharp increases followed by
declines.
Lockdown periods and vaccination rollouts may align with noticeable trend
changes, although this dataset does not record them directly.
Death and recovery rates vary by region and wave, which may reflect
healthcare disparities.
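Recovery and death rates can be derived from the per-state totals. As a worked example, the sketch below uses the Maharashtra row printed in Step 3(b); the same two ratios can be applied to every row of `statewise_total`.

```python
# Figures copied from the Maharashtra row in the Step 3(b) output.
confirmed, cured, deaths = 6363442, 6159676, 134201

# Share of confirmed cases recorded as cured, and as deaths, in percent.
recovery_rate = cured / confirmed * 100
fatality_rate = deaths / confirmed * 100

print(f"Recovery rate: {recovery_rate:.2f}%")
print(f"Case fatality rate: {fatality_rate:.2f}%")
```

These ratios depend on the snapshot date, since cases still active at that time are counted as neither cured nor deceased.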
EX.No:3 EDA ON YOUTUBE TRENDING VIDEOS DATASET
INTRODUCTION:
YouTube has become a dominant platform for video sharing, content
creation, and audience engagement worldwide. The YouTube Trending Videos
Dataset provides a snapshot of videos that were trending in various regions over
time, offering valuable insights into user preferences, content popularity, and
engagement metrics.
This Exploratory Data Analysis (EDA) aims to uncover trends in video
categories, the frequency of trending videos across different channels, and
patterns in user interactions such as views, likes, and comments. By analyzing
this data, we can better understand what makes a video trend, which content types
perform best, and how users engage with trending content.
DATA SOURCE:
Dataset link: https://www.kaggle.com/datasets/anushabellam/Trending videos on Youtube
GOOGLE COLAB LINK:
https://colab.research.google.com/drive/1xkMAoAhJsC8CxQH-ZaAoNUXe9g2HoFxs?usp=sharing
STEP 1: LOAD AND INSPECT THE DATASET
PROGRAM:
import pandas as pd
df = pd.read_csv('USvideos.csv')
print(df.info())
print(df.head())
OUTPUT:
STEP 2: DATA CLEANING
PROGRAM:
df = df.drop_duplicates()
df = df.dropna()
df['publishedAt'] = pd.to_datetime(df['publishedAt'], errors='coerce')
STEP 3: KEY CALCULATIONS
PROGRAM:
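# Ten most frequent category IDs among the trending videos
most_common_categories = df['videoCategoryId'].value_counts().head(10)
# Note: this counts repeated video titles, not channels; for channel counts,
# use the dataset's channel-name column in place of 'videoTitle'
top_channels = df['videoTitle'].value_counts().head(5)
# Mean engagement per video
avg_likes = df['likeCount'].mean()
avg_views = df['viewCount'].mean()
avg_comments = df['commentCount'].mean()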
average_metrics = {
'Average Likes': avg_likes,
'Average Views': avg_views,
'Average Comments': avg_comments
}
print(average_metrics)
OUTPUT:
{'Average Likes': np.float64(182.2095238095238), 'Average Views':
np.float64(9999.657142857142), 'Average Comments':
np.float64(82.97142857142858)}
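The `top_channels` variable in the program above is built from the `videoTitle` column, so it ranks titles rather than channels. A sketch of a channel-level count follows. The column name `channelTitle` is hypothetical (the record does not show which column holds the channel name), and the rows are invented.

```python
import pandas as pd

# Invented rows; 'channelTitle' is a hypothetical column name.
toy = pd.DataFrame({
    "videoTitle": ["v1", "v2", "v3", "v4", "v5"],
    "channelTitle": ["Alpha", "Beta", "Alpha", "Gamma", "Alpha"],
})

# Number of trending videos per channel, most frequent first.
top_channels = toy["channelTitle"].value_counts().head(5)
print(top_channels)
```

Running `print(df.columns)` on the real file would show which column, if any, carries the channel name.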
STEP 4: VISUALIZATIONS
a) Bar chart: Video count by category
PROGRAM:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 6))
most_common_categories.plot(kind='bar', color='skyblue')
plt.title('Top 10 Video Categories by Count')
plt.xlabel('Category ID')
plt.ylabel('Number of Videos')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
b) Scatter plot: Likes vs Views
PROGRAM:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='viewCount', y='likeCount', data=df, alpha=0.5)
plt.title('Likes vs. Views')
plt.xlabel('Views')
plt.ylabel('Likes')
plt.xscale('log')
plt.yscale('log')
plt.tight_layout()
plt.show()
OUTPUT:
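The bar chart labels categories by numeric ID. Assuming the IDs follow YouTube's standard video category numbering (for example 10 for Music, 24 for Entertainment, 25 for News & Politics), a small lookup can make the axis readable. The mapping below is a hand-written subset rather than something loaded from the dataset, and the counts are invented.

```python
import pandas as pd

# Partial, hand-written lookup of standard YouTube category IDs (assumed to apply here).
CATEGORY_NAMES = {
    10: "Music",
    17: "Sports",
    20: "Gaming",
    24: "Entertainment",
    25: "News & Politics",
}

# Invented counts shaped like the result of value_counts() on 'videoCategoryId'.
counts = pd.Series({24: 40, 10: 30, 25: 12, 99: 3})

# Replace known IDs with names; keep unknown IDs visible instead of dropping them.
labels = [CATEGORY_NAMES.get(cid, f"ID {cid}") for cid in counts.index]
named_counts = pd.Series(counts.values, index=labels)
print(named_counts)
```

Calling `named_counts.plot(kind='bar')` in place of the ID-indexed series would give a chart with category names on the x-axis.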
STEP 5: OBSERVATION
Top Categories: Certain categories like music, entertainment, and news
dominate the trending list.
Channel Popularity: A few channels consistently produce trending content.
Engagement Patterns: There's a strong positive correlation between views
and likes.
Outliers: Some videos have extremely high views but relatively low
likes/comments, suggesting passive viewing.
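The engagement and outlier observations can be quantified. The sketch below computes a correlation on log-scaled values (matching the log axes of the scatter plot) and flags videos with few likes relative to views. The rows are invented, and the threshold of 5 likes per 1,000 views is an arbitrary illustrative cut-off.

```python
import numpy as np
import pandas as pd

# Invented rows; the last one has many views but few likes.
toy = pd.DataFrame({
    "viewCount": [1_000, 5_000, 20_000, 100_000, 2_000_000],
    "likeCount": [50, 240, 900, 5_200, 1_000],
})

# Pearson correlation on log10 values (requires strictly positive counts).
log_views = np.log10(toy["viewCount"])
log_likes = np.log10(toy["likeCount"])
corr = log_views.corr(log_likes)
print("Correlation of log(views) and log(likes):", round(corr, 3))

# Likes per 1,000 views as a simple engagement ratio.
toy["likes_per_1000_views"] = toy["likeCount"] / toy["viewCount"] * 1000

# Flag videos below the illustrative threshold.
low_engagement = toy[toy["likes_per_1000_views"] < 5]
print(low_engagement)
```

On the real data, rows with zero views or likes would need to be filtered out before taking logarithms.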