
Report MSA Practice02

The document outlines a lab assignment for a Multivariate Statistical Analysis course at Vietnam National University, focusing on data visualization using Python libraries such as Matplotlib and Pandas. It details the preparation of COVID-19 datasets, including confirmed cases and death statistics, along with methodologies for data analysis and visualization. The assignment aims to uncover trends and insights from the data through various graphical representations.

Uploaded by

Long Nguyễn

VIETNAM NATIONAL UNIVERSITY

HO CHI MINH CITY

UNIVERSITY OF SCIENCE

Faculty of Information Technology

Multivariate Statistical Analysis


PRACTICE 02

Nguyễn Bảo Long - 22127243

INSTRUCTORS
Lý Quốc Ngọc
Nguyễn Mạnh Hùng
Phạm Thanh Tùng

February 18, 2025


University of Science - VNUHCM Multivariate Statistical Analysis

INFORMATION Names, IDs, E-mails of members

ID Name Email
22127243 Nguyễn Bảo Long [email protected]

SELF EVALUATION

No. Function Completion Level


1 Prepare Dataset 100%
2 Visualization with at least 3 graphs 100%
3 Analysis, Comments on Graphs 100%
4 Another Dataset Preparation 100%
5 Compare Plotly vs Matplotlib & Standard Deviation 100%
6 Visualize another data with analysis 100%

Table 1: Function completion levels

Nguyễn Bảo Long - 22127243 Page 2 of 29



Contents

1 Introduction - Problem Statement

2 Methodology
2.1 Used Libraries
2.2 Prepare Dataset
2.2.1 Confirmed Cases
2.2.2 Death Cases
2.2.3 OWID Dataset
2.3 Graph 1 - Random countries
2.3.1 Confirmed cases
2.3.2 Death Cases
2.4 Graph 2 - Top 10 Countries
2.4.1 Confirmed Cases
2.4.2 Death Cases
2.5 Graph 3 - Daily New Confirmed + 7-Day Moving Average
2.6 Graph 4: 7-Day moving average of Daily New Cases
2.7 Bonus
2.8 Another Dataset

A Reference


1 Introduction - Problem Statement

This lab assignment addresses the challenge of transforming raw case data into meaningful visual
representations. The primary goal is to utilize Python’s Matplotlib library to explore trends, distributions,
and correlations within the dataset. Key tasks include reading and preprocessing CSV data, generating
diverse plots (e.g., line charts, histograms, scatter plots), and interpreting the results to uncover patterns
such as case progression, regional disparities, or unemployment rate correlations.
The assignment proceeds through the following steps:

• Environment Setup: Configure a Python virtual environment with libraries like Pandas for data
manipulation and Matplotlib for visualization.

• Data Loading & Preprocessing: Import the COVID-19 dataset using Pandas, clean missing
values, and structure the data for analysis.

• Visualization: Create at least three distinct plots:

– A line chart to track case trends over time.

– A bar chart to compare case numbers across countries.

– A scatter plot to analyze relationships between variables (e.g., median earnings vs. unemploy-
ment rates).

• Analysis & Insights: Extract observations from each plot, such as identifying outbreak peaks,
high-risk regions, or socioeconomic correlations.

• Bonus: Extend the analysis by integrating additional datasets and alternative libraries (e.g.,
Seaborn, Plotly) and compare their usability with Matplotlib.

2 Methodology

2.1 Used Libraries

The libraries used in this lab are listed below, each with a short description.

• Matplotlib.pyplot
This is the core plotting library for static, publication-quality visualizations in this Lab
Assignment. It is used for highly customizable plots (line, bar, scatter, histograms) and fine-grained
control over axes, labels, and styles.
Here are the functions used in this lab:

– plt.figure(): Create new figure


– plt.title()/plt.xlabel()/plt.ylabel(): Add labels
– plt.legend(): Display plot legend
– plt.show(): Render plots
– plt.bar()/plt.barh(): Create bar charts
– plt.xticks(): Customize x-axis ticks
– plt.pie(): Create pie charts
– plt.hist(): Generate histograms


• Pandas
This is a library for data manipulation and analysis. It introduces the DataFrame (tabular data) and
Series (1D array) structures, and provides tools for reading/writing data (CSV, Excel), handling
missing values, and aggregating data. I chose it because DataFrames can be passed directly to
Matplotlib and Seaborn for plotting.
Here are the functions from this library that I used:

– pd.read_csv(): Read CSV files into DataFrames


– DataFrame.drop(): Remove specified columns
– DataFrame.groupby(): Group data by specified columns
– DataFrame.sum(): Aggregate data with sum
– pd.to_datetime(): Convert columns to datetime format
– DataFrame.diff(): Calculate differences between rows
– DataFrame.rolling(): Apply rolling window calculations
– DataFrame.shape: Check DataFrame dimensions
– DataFrame.dropna(): Handle missing values

• Seaborn
This is a high-level statistical visualization library built on Matplotlib. Seaborn simplifies complex
plots (heatmaps, violin plots, pair plots) and ships with built-in themes and color palettes.
Its main advantage over plain Matplotlib is that it needs less boilerplate code, which makes it
well suited to exploratory data analysis (EDA). I only use one function:
– sns.heatmap(): Generate correlation heatmaps

• Plotly.express & Plotly.graph_objects

These modules are used for creating interactive, web-based visualizations. Plotly Express is a
high-level API for quick interactive charts (scatter, line, bar), while graph_objects gives lower-level
control over individual traces. Their advantage over Matplotlib/Seaborn lies in dynamic
presentations, dashboards, interactive reports, and real-time data exploration.

– px.line(): Interactive line plots


– px.scatter(): Interactive scatter plots
– go.Figure(): Initialize plotly figure
– go.Scatter(): Add line traces
– go.Scatter3d(): Create 3D scatter plots
– px.choropleth(): Creates interactive geographical maps where regions are colored based on data
values.
– px.bar(): Creates bar charts for categorical data comparisons.
– px.histogram(): Shows distribution of numerical data through bins.
– fig.add_trace(): Add plot components
– fig.update_layout(): Customize plot appearance

2.2 Prepare Dataset

Before running the analysis, we need to ensure the datasets are ready for the implementation. The
analysis is based on three datasets:


2.2.1 Confirmed Cases

covid-19-cases.csv [2]: this dataset contains time-series data of confirmed COVID-19 cases
worldwide. It includes columns such as Province/State, Country/Region, Lat, Long, and date-wise case
counts.
DATA STRUCTURE

• Rows: Each row represents a geographic unit. Some rows may include a “Province/State” if data
is available at a subnational level.

• Columns:

– Province/State: (Optional) The subnational division or state name (if available).


– Country/Region: The country name.
– Lat, Long: The geographical coordinates (latitude and longitude) of the location.
– Date Columns (e.g., 1/22/20, 1/23/20, . . . ): Each subsequent column represents a specific
date. The cell values are the cumulative counts of confirmed cases recorded on that date.

DATA PREPARATION STEPS

1. First, read the CSV file into a DataFrame, using the first column as the index:
df_confirmed = pd.read_csv('data/covid-19-cases.csv', index_col=0)

2. Then drop the columns that are not needed. Because the analysis focuses on country-level trends,
“Province/State”, “Lat”, and “Long” are removed:
df_confirmed.drop(columns=['Province/State', 'Lat', 'Long'], inplace=True)

3. Group by country. If a country has multiple rows (because of province-level data), the counts are
summed to get country-level cumulative cases:
df_confirmed = df_confirmed.groupby(['Country/Region']).sum()

4. Convert the date columns to datetime objects, which simplifies date-based operations and
plotting.

5. Calculate daily new cases by subtracting the previous day's cumulative value from the current
day's value. The first day uses a fill value of 0:
df_confirmed_daily = df_confirmed - df_confirmed.shift(1, axis=1, fill_value=0)

6. Calculate a 7-day moving average, which smooths out daily fluctuations and makes trends easier
to interpret:
df_confirmed_daily_moving = df_confirmed_daily.rolling(window=7, axis=1).mean()
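
The six steps can be tried end to end on a toy table. The sketch below is only an illustration under stated assumptions, not the code used in this lab: the miniature `raw` table, its country names, and the `prepare` helper are invented for the example, and the real file would be loaded with pd.read_csv instead. It computes the daily counts with diff(axis=1) and transposes before rolling, because newer pandas releases warn about the axis argument of rolling().

```python
import pandas as pd

# Hypothetical miniature of covid-19-cases.csv; all values are made up.
raw = pd.DataFrame({
    "Province/State": ["North", "South", None],
    "Country/Region": ["Aland", "Aland", "Borduria"],
    "Lat": [0.0, 1.0, 2.0],
    "Long": [0.0, 1.0, 2.0],
    "1/22/20": [1, 2, 0],
    "1/23/20": [3, 4, 5],
    "1/24/20": [6, 4, 9],
})


def prepare(raw: pd.DataFrame, window: int = 7):
    """Return (cumulative, daily, moving average) tables, one row per country."""
    # Steps 2-3: drop location columns, then sum provinces into countries.
    cumulative = (
        raw.drop(columns=["Province/State", "Lat", "Long"])
        .groupby("Country/Region")
        .sum()
    )
    # Step 4: turn the '1/22/20'-style column labels into datetimes.
    cumulative.columns = pd.to_datetime(cumulative.columns, format="%m/%d/%y")
    # Step 5: daily new cases; the first day keeps its cumulative value.
    daily = cumulative.diff(axis=1)
    daily.iloc[:, 0] = cumulative.iloc[:, 0].to_numpy()
    # Step 6: rolling mean along the date axis (transpose, roll, transpose back).
    moving = daily.T.rolling(window=window).mean().T
    return cumulative, daily, moving


# A window of 2 is used here only because the toy table has three dates.
cumulative, daily, moving = prepare(raw, window=2)
print(daily)
```

The first window - 1 columns of the moving average are NaN, since a full window is not yet available for those dates.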

As can be seen in the dataset description below, the DataFrame has 195 rows and 540 columns. In this
case, each row represents a country (after grouping by “Country/Region”), and each column represents a
specific date (from January 22, 2020, to July 14, 2021).
All columns represent dates (properly converted to datetime objects) and all rows have numeric (integer)
data.


Figure 1: Overview of Confirmed Cases Dataset

The overview reports the mean, standard deviation (std), minimum, 25th, 50th, and 75th percentiles,
and maximum for each date.
For example, for 2020-01-22:
• Mean ≈ 2.86: On average, there were very few cases per country on this day.

• Std ≈ 39.24: The spread is large relative to the mean because most countries reported 0 cases
while one reported several hundred.

• Min = 0 and Max = 548: The lowest and highest cumulative counts on that day.
Similar statistics for other dates show how the numbers changed as the pandemic progressed, which
helps in understanding the distribution and scale of the case counts over time.
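
To see why the standard deviation dwarfs the mean, consider a hypothetical day (an assumption for illustration, not the actual data) on which 194 of 195 countries report zero cases and one reports 548. The sketch below runs describe() on such a column; it gives a mean of about 2.81 and a standard deviation of about 39.24, close to the figures reported above.

```python
import pandas as pd

# Hypothetical day: 194 countries report 0 cases and one reports 548.
counts = pd.Series([0] * 194 + [548])

# describe() returns count, mean, std (sample, ddof=1), min, quartiles, max.
stats = counts.describe()
print(stats[["mean", "std", "min", "50%", "max"]].round(2))
```

A single large value among many zeros is enough to produce this pattern, which is why the median (0) describes a typical country on that day better than the mean.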

2.2.2 Death Cases

DATA STRUCTURE
The covid-19-death.csv file has the same structure:
• Rows and Columns: Very similar to the cases dataset:

– Province/State, Country/Region, Lat, Long: Geographical identifiers.


– Date Columns: Each date column contains the cumulative number of deaths reported up to
that day.

DATA PREPARATION STEPS


The preparation is the same as for the confirmed cases dataset: drop the non-essential
columns, aggregate subnational data to country level, convert the date columns for easier
time-series manipulation, and then calculate both daily differences (new deaths) and 7-day
moving averages to better understand trends.


2.2.3 OWID Dataset

The file owid-covid-latest.csv is a snapshot of the most recent COVID-19 statistics for each
country, provided by Our World In Data. It contains a broad range of indicators beyond just cases and
deaths. Here are some key parameters and their meanings:

• iso_code: The ISO 3166-1 alpha-3 code that uniquely identifies each country.

• continent: The continent on which the country is located (e.g., Asia, Europe).

• location: The name of the country.

• last_updated_date: The date when the data for that country was last refreshed.

• Epidemiological Metrics:

– total_cases: Cumulative confirmed COVID-19 cases.


– new_cases: Number of new cases reported in the most recent update.
– ...

• Per Capita Metrics:

– total_cases_per_million: Total cases normalized per one million people.


– new_cases_per_million: New cases per one million people.
– ...

• Healthcare Capacity and Severity:

– reproduction_rate: The effective reproduction number (R), indicating how many people, on
average, one infected person will pass the virus to.
– icu_patients: Number of patients currently in intensive care units (ICU).
– ...

• Vaccination Metrics:

– total_vaccinations: Cumulative number of vaccination doses administered.


– people_vaccinated: Number of people who have received at least one vaccine dose.
– ...

A wide range of epidemiological, testing, vaccination, and socio-economic indicators is provided. These
parameters offer insights into the progression of the pandemic, the effectiveness of public health measures,
and the demographic and economic context of each country.
Here is the overall description for this dataset:


Figure 2: Overview of OWID Dataset

There are 194 rows, each of which typically represents a country or region (after filtering or cleaning
the data), and 60 variables or features. These include epidemiological statistics (such as total
cases, new cases, and deaths), testing metrics, vaccination data, government response indicators,
and socio-economic/demographic information.
The summary statistics describe the central tendency and spread of each variable. For instance, the
mean of total_cases is around 3,084,054, but the high standard deviation (15,470,020) indicates a very
wide range between countries with few cases and those with very many cases.
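
One way to express how wide that range is, is the coefficient of variation (standard deviation divided by mean). The short calculation below uses only the two figures quoted above; the variable names are my own.

```python
# Summary statistics quoted above for total_cases.
mean_total_cases = 3_084_054
std_total_cases = 15_470_020

# Coefficient of variation: spread expressed in units of the mean.
cv = std_total_cases / mean_total_cases
print(f"std is {cv:.1f}x the mean")
```

A standard deviation roughly five times the mean is typical of a heavily right-skewed variable, where a few very large values pull the mean far above what most rows look like.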

2.3 Graph 1 - Random countries

2.3.1 Confirmed cases

The first graph is a line plot of Covid-19 confirmed cases for 10 countries over time. It plots the
7-day moving average of daily new cases to smooth out short-term fluctuations.
countries = [
    'Vietnam', 'China', 'Japan', 'US', 'France',
    'India', 'Brazil', 'Russia', 'United Kingdom',
    'Spain',
]

This creates a list of 10 country names to be plotted on the graph. The figure size is then set to
16 inches wide and 6 inches tall for better readability.

• The code iterates through each country in the countries list.

• df_confirmed_daily_moving.loc[country] retrieves the 7-day moving average of daily new cases
for each country.

• plt.plot(...) plots the data as a line graph for each country.

• label=country ensures that each country's line is labeled in the legend.
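
The loop described in these bullets can be sketched as follows. This is a minimal, self-contained illustration rather than the lab's own listing: `demo_moving` is a made-up stand-in for df_confirmed_daily_moving, and `plot_countries` is a hypothetical helper name.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_confirmed_daily_moving: rows are countries,
# columns are dates, values are made-up smoothed daily counts.
dates = pd.date_range("2020-01-22", periods=60, freq="D")
rng = np.random.default_rng(0)
demo_moving = pd.DataFrame(
    rng.integers(0, 1000, size=(3, len(dates))),
    index=["Vietnam", "Japan", "US"],
    columns=dates,
)


def plot_countries(moving, countries):
    """Draw one labeled line per country and return the Axes."""
    fig, ax = plt.subplots(figsize=(16, 6))
    for country in countries:
        if country in moving.index:  # skip names missing from the table
            ax.plot(moving.columns, moving.loc[country], label=country)
    ax.set_xlabel("Date")
    ax.set_ylabel("Daily New Cases (7-Day MA)")
    ax.legend()
    fig.tight_layout()
    return ax


ax = plot_countries(demo_moving, ["Vietnam", "Japan", "US", "Atlantis"])
# plt.show() would display the figure in an interactive session.
```

Checking membership in the index before plotting avoids a KeyError when a name in the list is spelled differently in the dataset (for example "US" versus "United States").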


The result is:

Figure 3: Confirmed cases of 10 random countries

The US (red line) and India (brown line) show the highest spikes in Covid-19 cases, with India
reaching a massive peak around mid-2021. The US had multiple surges, with a significant peak around
late 2020 to early 2021 (Dec 2020 to Feb 2021). Brazil (pink line) had a prolonged period of high cases,
with multiple waves but never dropping to very low levels. France (purple line) had fluctuations, with
peaks occurring periodically. The UK (yellow line) saw a clear peak at the beginning of 2021 but had
fluctuations throughout. China (orange line) remained relatively flat, indicating very few confirmed cases
over time despite being the country where the outbreak was first reported. Vietnam (blue line) also
remained low until mid-2021, showing a small rise later.

2.3.2 Death Cases

To clarify the relationship between confirmed cases and deaths, here is a line plot showing the
7-day moving average of daily new Covid-19 deaths for the same 10 countries over time.
Code for the death cases analysis:

Algorithm 1 Plot 7-Day Moving Average of Daily New Deaths

import matplotlib.pyplot as plt

# Countries to plot
countries = ['Vietnam', 'China', 'Japan', 'US', 'France', 'India', 'Brazil',
             'Russia', 'United Kingdom', 'Spain']

# Create the figure, then set labels and title
plt.figure(figsize=(16, 6))
plt.xlabel('Date', fontsize=16)
plt.ylabel('Daily New Deaths (7-Day MA)', fontsize=16)
plt.title('Covid-19 Deaths (7-Day Moving Average) - 10 Countries', fontsize=16)

# Plot one line per country, skipping names missing from the index
for country in countries:
    if country in df_deaths_daily_moving.index:
        plt.plot(df_deaths_daily_moving.loc[country], label=country)

# Add a legend, adjust the layout, and display the graph
plt.legend(fontsize=12)
plt.tight_layout()
plt.show()


This is the resulting plot of deaths in the 10 countries:

Figure 4: Death cases of 10 random countries

As can be seen, the US experienced its first major wave in spring 2020 (April-May), with ≈ 2,200 daily
deaths corresponding to 30,000 cases. The largest wave occurred in winter 2020-2021 (December-
January), with ≈ 3,500 daily deaths following 250,000 daily cases. The pattern is clear: death peaks
followed confirmed case peaks by approximately 2-3 weeks.
India shows the same strong correlation between confirmed cases and deaths, with a lag of about
2 weeks: in the massive wave of April-May 2021, cases peaked at 400,000 per day and deaths peaked at
4,000 per day.
Here are the conclusions I drew:

• Time Lag Pattern: Consistently across countries, death peaks followed case peaks by approxi-
mately 2-3 weeks. This lag remained relatively constant throughout the pandemic.

• Changing Case Fatality Rates:

– Early 2020: Higher death rates relative to case numbers.

– Later waves: Generally lower death rates per case, particularly visible in developed countries.

– This suggests improved treatment protocols and healthcare preparation.

• Regional Patterns: Western countries showed similar wave timing, while Asian countries (except
India) maintained lower numbers. Brazil and India showed different patterns from other regions.

• Countries with large case spikes (e.g., US, India, Brazil) often show subsequent increases
in daily deaths. Countries with lower reported cases (e.g., Vietnam, Japan) typically show
lower deaths, reflecting both their smaller infection counts and possibly stronger containment
measures.
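
The time lag in the first conclusion was read off the plots by eye. One possible way to estimate it numerically is to shift the case series by different numbers of days and keep the shift that correlates best with deaths. The sketch below is an assumption-laden illustration, not part of the lab's analysis: the two series are synthetic, with a 14-day lag and a 2% death ratio built in, and `best_lag` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Synthetic example: a bell-shaped wave of cases, and deaths that copy
# the wave 14 days later at 2% of its height. Both numbers are assumptions.
days = pd.date_range("2021-03-01", periods=120, freq="D")
t = np.arange(len(days))
cases = pd.Series(1000 * np.exp(-((t - 50) ** 2) / 200), index=days)
deaths = (0.02 * cases).shift(14).fillna(0.0)


def best_lag(cases, deaths, max_lag=30):
    """Return the shift (in days) of cases that best correlates with deaths."""
    correlations = {
        lag: cases.shift(lag).corr(deaths) for lag in range(max_lag + 1)
    }
    return max(correlations, key=correlations.get)


print(best_lag(cases, deaths))
```

On this synthetic input the function should recover the 14-day lag that was built in. On real data the correlation peak is flatter, so the result is better read as a rough range than as an exact number of days.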

To support this analysis, a scatter plot of Covid-19 total deaths against total confirmed cases,
drawn from another dataset, reveals the same trend.
Using the provided owid-covid-latest.csv, I created a scatter plot with Total Cases on the x-axis
and Total Deaths on the y-axis. Here is the listing for this scatter plot:


Algorithm 2 Scatter Plot of Total Cases vs. Total Deaths

import matplotlib.pyplot as plt

# Create a 12x6 inch figure
plt.figure(figsize=(12, 6))

# Plot total_cases against total_deaths from df_latest
plt.scatter(df_latest['total_cases'], df_latest['total_deaths'], alpha=0.7)

# Label the axes and add a title
plt.xlabel('Total Cases', fontsize=14)
plt.ylabel('Total Deaths', fontsize=14)
plt.title('Total Cases vs. Total Deaths (OWID Covid Latest)', fontsize=16)

# Adjust the layout to prevent overlapping elements, then display the plot
plt.tight_layout()
plt.show()

Figure 5: OWID Dataset - total deaths and total cases

As can be seen, the scatter plot shows a generally positive correlation: countries with higher
total Covid-19 cases also tend to report higher total Covid-19 deaths. This aligns with the
previous analysis, where death patterns followed case patterns.
Many points are clustered in the lower left, forming a dense cluster near (0,0), because many countries
have relatively low total cases and low total deaths. The few points that stand out at the top-right
corner are outliers. These represent countries with very large total case counts and significant total
deaths (e.g., the United States, India, or Brazil).
The plot shows total cases reaching up to ≈ 175 million (1.75 × 10^8) and deaths up to 4 million. This
aggregate view supports the massive scale of outbreaks seen in the time series, particularly in countries
like India, the US, and Brazil.
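
The positive correlation can also be quantified rather than judged by eye. The sketch below is an illustration on invented numbers, not on the OWID file: it computes the Pearson correlation on the raw totals and again on a log10 scale, which spreads out the dense cluster near the origin.

```python
import numpy as np
import pandas as pd

# Hypothetical totals for five locations; not taken from the OWID file.
df_demo = pd.DataFrame({
    "total_cases": [2_000, 50_000, 900_000, 20_000_000, 33_000_000],
    "total_deaths": [30, 400, 25_000, 500_000, 600_000],
})

# Pearson correlation on the raw totals is dominated by the largest points.
r_raw = df_demo["total_cases"].corr(df_demo["total_deaths"])

# On a log10 scale the small countries near (0, 0) contribute as well.
log_df = np.log10(df_demo)
r_log = log_df["total_cases"].corr(log_df["total_deaths"])

print(round(r_raw, 3), round(r_log, 3))
```

The same idea applies to the plot itself: calling plt.xscale('log') and plt.yscale('log') before plt.show() would make the clustered points distinguishable, provided no row has a zero total.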


2.4 Graph 2 - Top 10 Countries

Given the large scale shown above, we now look at the top 10 countries, which account for the
majority of confirmed cases and deaths.

2.4.1 Confirmed Cases

I created two visualizations: a bar chart showing absolute numbers and a pie chart showing relative
proportions. This complements the time series and scatter plots by providing a snapshot view of cumu-
lative cases.

1. To create these graphs, I first identify the latest date by taking the last column of
df_confirmed (using [-1]) and storing it in latest_date. Each column in df_confirmed represents
a date, so the last column is the most recent date available in the dataset.

2. Next, to extract the top 10 countries, I use:

• df_confirmed[latest_date]: Selects the column corresponding to latest_date, which contains
the cumulative confirmed cases for each country on that date.
• .sort_values(ascending=False): Sorts the values in descending order so that countries with
the highest confirmed cases appear first.
• .head(10): Retrieves the top 10 rows from this sorted list.

Here is the pseudocode for both graphs:

Algorithm 3 Plot Top 10 Countries by Cumulative Confirmed Cases

Input: DataFrame df_confirmed with date columns.
Output: Bar chart and pie chart for the top 10 countries by cumulative confirmed cases.
Step 1: Determine Latest Date
latest_date ← last column of df_confirmed
Step 2: Extract Top 10 Countries
top10 ← sort df_confirmed[latest_date] in descending order
top10 ← select the first 10 entries
Step 3: Create Bar Chart
Initialize figure with size (10, 6)
Plot a bar chart with:
x-axis: top10.index (country names)
y-axis: top10.values (cumulative confirmed cases)
Color: purple
Set title: “Top 10 Countries by Cumulative Confirmed Cases (as of latest_date)”
Label x-axis as “Country” and y-axis as “Cumulative Confirmed Cases”
Rotate x-axis tick labels by 45 degrees
Apply layout adjustments (e.g., tight_layout())
Display the bar chart
Step 4: Create Pie Chart
Initialize figure with size (8, 8)
Plot a pie chart with:
Values: top10.values
Labels: top10.index
Percentage format: autopct = '%1.1f%%'
Start angle: 140 degrees
Set title: “Share of Cumulative Confirmed Cases (Top 10 Countries as of latest_date)”
Apply layout adjustments
Display the pie chart
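
A runnable sketch of the steps in Algorithm 3 is given below. It is an illustration under assumptions: `demo_confirmed` is a made-up four-country table, `top_n_latest` is a hypothetical helper name, and for compactness the bar and pie charts are drawn side by side in one figure instead of in two separate figures.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical cumulative table: rows are countries, columns are dates.
demo_confirmed = pd.DataFrame(
    {"2021-07-13": [90, 40, 10, 70], "2021-07-14": [100, 45, 12, 80]},
    index=["A", "B", "C", "D"],
)


def top_n_latest(df, n=10):
    """Return the n largest values of the last (most recent) column."""
    latest_date = df.columns[-1]
    return df[latest_date].sort_values(ascending=False).head(n)


top = top_n_latest(demo_confirmed, n=3)

# Bar chart of absolute numbers and pie chart of relative shares.
fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(14, 6))
ax_bar.bar(top.index, top.values, color="purple")
ax_bar.set_xlabel("Country")
ax_bar.set_ylabel("Cumulative Confirmed Cases")
ax_bar.tick_params(axis="x", labelrotation=45)
ax_pie.pie(top.values, labels=top.index, autopct="%1.1f%%", startangle=140)
fig.tight_layout()
# plt.show() would display the figure in an interactive session.
```

Note that the pie chart percentages are shares of the selected group only, so they always add up to 100% regardless of how many countries were left out.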


Figure 6: Top 10 countries with the most confirmed cases

These charts show the final outcome of the trends observed in the previous time series graphs. The bar
chart provides absolute comparisons, while the pie chart shows relative contributions. They would
likely show:

• US - which had multiple large waves, with the highest bar at over 30 million cases.

• India - in second place after its massive spike in 2021, with around 20-25 million cases.


• Brazil - which maintained high cases throughout, following closely with 15-20 million cases.

• European countries such as Russia, the UK, and France, with progressively smaller bars in the
range of 5-10 million cases.

• In general, the top 3 countries (United States, India, Brazil) would account for the majority of the
cases (possibly 60-70% combined). The remaining 7 countries would have smaller shares, reflecting
lower case numbers.

2.4.2 Death Cases

Similarly, these charts illustrate the correlation between confirmed cases and deaths and highlight
the characteristics and trends of the data.

Algorithm 4 Visualization of Top 10 Countries by Cumulative Death Cases

Input: DataFrame df_deaths containing cumulative death cases by country and date.
Output: Bar chart and pie chart visualizing top 10 countries by cumulative deaths.
Step 1: Extract the most recent date
latest_date ← df_deaths.columns[-1] ▷ Get the last column (most recent date)
Step 2: Extract top 10 countries by cumulative deaths
top10_cum_deaths ← df_deaths[latest_date].sort_values(ascending=False).head(10) ▷ Sort and select top 10
Step 3: Create a bar chart
Initialize a figure with size (10, 6)
Plot a bar chart with:
X-axis: top10_cum_deaths.index (country names)
Y-axis: top10_cum_deaths.values (cumulative deaths)
Color: Purple
Add title: "Top 10 Countries by Cumulative Deaths (as of latest_date)"
Label X-axis: "Country"
Label Y-axis: "Cumulative Deaths"
Rotate X-axis labels by 45 degrees
Adjust layout for better visualization
Display the bar chart
Step 4: Create a pie chart
Initialize a figure with size (8, 8)
Plot a pie chart with:
Values: top10_cum_deaths.values (cumulative deaths)
Labels: top10_cum_deaths.index (country names)
Percentages: Display percentages with 1 decimal place
Start angle: 140 degrees
Add title: "Share of Cumulative Deaths (Top 10 Countries as of latest_date)"
Adjust layout for better visualization
Display the pie chart

As can be seen, the bar chart displays the absolute cumulative death numbers for the top 10 countries.
Each bar represents a country, with the height corresponding to the total number of deaths. The pie
chart shows the percentage share of deaths among these countries: each slice corresponds
to a country's proportion of total deaths within this group.


Figure 7: Top 10 countries with the most death cases

The death figures follow much the same trend as the confirmed cases: countries such as the United
States, India, Brazil, Russia, and the United Kingdom are likely to appear in both lists, indicating a strong
correlation between high confirmed cases and high death counts.
The pie chart shares the same characteristic. For example, if a country like the United
States has a high percentage of confirmed cases (e.g., 30%), it is also likely to have a high percentage of
deaths (e.g., 20-25%). This suggests that countries with a larger number of confirmed cases tend to have
a higher number of deaths, reflecting the direct relationship between case numbers and fatalities.
However, some countries deviate from this trend. Peru and Mexico, for example, have a notable
share of cumulative deaths (e.g., 20.6% for Mexico), which might be disproportionately high
compared to their share of confirmed cases. This could indicate a higher case fatality rate
(CFR) in these countries, possibly due to healthcare system limitations or other factors.
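
The case fatality rate is simply cumulative deaths divided by cumulative confirmed cases. The sketch below shows how a country can hold a much larger slice of the deaths pie than of the cases pie when its CFR is high. All numbers and country names are invented for illustration and do not come from the datasets.

```python
import pandas as pd

# Hypothetical cumulative totals; the numbers are invented for illustration.
totals = pd.DataFrame(
    {"cases": [1_000_000, 200_000, 500_000], "deaths": [18_000, 18_000, 5_000]},
    index=["Country A", "Country B", "Country C"],
)

# Case fatality rate: deaths as a percentage of confirmed cases.
totals["cfr_percent"] = 100 * totals["deaths"] / totals["cases"]

# Shares of the group total, i.e. what the two pie charts display.
totals["case_share"] = 100 * totals["cases"] / totals["cases"].sum()
totals["death_share"] = 100 * totals["deaths"] / totals["deaths"].sum()

print(totals.round(1))
```

Here Country B has far fewer cases than Country A but the same number of deaths, so its share of deaths is several times its share of cases. A high CFR computed this way can also reflect limited testing, since undetected cases shrink the denominator.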
To support the idea that Western and American countries tend to have higher death tolls,
the same approach is applied to the OWID dataset:

• The code sorts the DataFrame df_latest by the column ’total_deaths’ in descending order.

• It then takes the first 10 rows (.head(10)) to get the top 10 locations with the highest total Covid-19
deaths.

• This subset is stored in top10_deaths.

And here is the result:

Figure 8: Total Death rate all around the world

The bar chart displays the 10 locations with the highest total Covid-19 deaths according to df_latest.
Note that “World,” continents (e.g., “Europe”), or unions (e.g., “European Union”) may appear alongside
actual countries (e.g., “United States,” “Brazil,” “India”). The bars are sorted from left to right in descend-
ing order of total deaths. The leftmost bar has the highest deaths, and each subsequent bar decreases in
height.
However, because the dataset includes broad regions (e.g., continents, unions) and the entire “World”
category, it’s not strictly “Top 10 Countries” in the traditional sense—it’s more accurately the top 10
entities (countries/regions/aggregates) by total deaths.
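
If a strict country-level ranking is wanted, the aggregate rows can be filtered out before sorting. The sketch below relies on an assumption that should be verified against the actual file: that aggregate rows such as "World" or "Europe" leave the continent column empty. The rows in `df_demo` are made up and only mimic the shape of owid-covid-latest.csv.

```python
import pandas as pd

# Hypothetical rows shaped like owid-covid-latest.csv; values are made up.
df_demo = pd.DataFrame({
    "iso_code": ["OWID_WRL", "OWID_EUR", "USA", "BRA", "IND"],
    "continent": [None, None, "North America", "South America", "Asia"],
    "location": ["World", "Europe", "United States", "Brazil", "India"],
    "total_deaths": [4_000_000, 1_100_000, 600_000, 530_000, 410_000],
})

# Assumption: aggregate rows leave 'continent' empty, so keeping only rows
# with a continent removes World, continents, and unions.
countries_only = df_demo.dropna(subset=["continent"])

top_countries = countries_only.sort_values("total_deaths", ascending=False).head(10)
print(top_countries["location"].tolist())
```

Filtering on the continent column rather than on a list of names avoids having to enumerate every aggregate by hand.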
As can be seen, Europe accounted for the largest share of deaths of any continent, followed by
the Americas (both North and South), where the United States was the largest contributor.


Figure 9: Top 10 Regions by Citizens Vaccinations

The visualizations above prompt further questions about why certain countries have higher
death tolls. Factors might include population size, healthcare system capacity, reporting standards,
demographics, the timing and intensity of Covid-19 waves, and, most importantly, vaccination.
The figure above therefore shows the top 10 countries/regions by total vaccinations, a factor
closely tied to the death toll.
Here is the listing for this implementation:

Algorithm 5 Top 10 Regions/Countries by Total Vaccinations

import matplotlib.pyplot as plt

# Input: DataFrame df_latest with vaccination data
# Output: bar plot of the top 10 regions/countries by total vaccinations

# Step 1: Drop rows with missing values in the 'total_vaccinations' column
df_latest_vacc = df_latest.dropna(subset=['total_vaccinations'])

# Step 2: Sort by 'total_vaccinations' in descending order and select the top 10
top10_vaccinations = df_latest_vacc.sort_values(by='total_vaccinations', ascending=False).head(10)

# Step 3: Create a bar plot
plt.figure(figsize=(10, 6))
plt.bar(top10_vaccinations['location'], top10_vaccinations['total_vaccinations'], color='green')
plt.title('Top 10 Regions/Countries by Total Vaccinations', fontsize=14)
plt.xlabel('Region/Country', fontsize=12)
plt.ylabel('Total Vaccinations', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

As can be seen above, the “World” bar is significantly higher than that of any single
region or country. This is expected, since it represents the sum of all vaccinations globally.
Entries like “Asia,” “Europe,” “North America,” and “South America” appear alongside individual
countries such as “China,” “India,” “United States,” and “Brazil.” “China” stands out
for having administered a large number of vaccinations, on par with entire continents. This could
help explain the low death rate in China despite it being recognised as the country where Covid-19
originated. “India,” “United States,” and “Brazil” also appear in the top 10, each with hundreds of
millions (or more) of administered doses.
From the previous Total Deaths chart (also dominated by “World,” regions, and populous countries like
the United States, Brazil, and India), we can observe:

• High Population, High Numbers: The same large-population entities—such as “India,” “Brazil,”
and the “United States”—appear in both charts. They have high total deaths but also high total
vaccinations, largely because of their sheer population sizes.

• Importance of Vaccination: Higher vaccination levels typically correlate with lower Covid-19
mortality (or at least a significant reduction in the risk of severe outcomes). The presence of coun-
tries in both top 10 lists (deaths and vaccinations) underscores that large populations can accumu-
late high totals in both categories. However, effective vaccination efforts help mitigate the worst
outcomes by decreasing death rates over time.

2.5 Graph 3 - Daily New Confirmed + 7-Day Moving Average

After the broad overview of the top 10 countries, I noticed that the figures for the United States
were the largest across all categories. This section therefore analyses daily new confirmed Covid-19
cases and deaths, together with the 7-day moving average of each. Line charts, a heatmap
and a bar chart are used to present these figures.
The idea behind plotting the daily new cases/deaths and the 7-day moving average is the same:

• X-axis: dates (all the date columns).

• Y-axis: df_confirmed_daily.loc[country], which represents the daily new confirmed cases
for the specified country, or df_confirmed_daily_moving.loc[country], which holds the 7-day
moving average of daily new cases for the country.

These graphs give a day-by-day view of how many new Covid-19 cases are confirmed for
the selected country (here, the US), along with a version smoothed over a 7-day window. Spikes
or dips in the first line indicate surges or drops in daily new infections, while the smoothed second
line filters out short-term spikes or reporting anomalies, making trends easier to observe.
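
The two series described above can be derived from a cumulative table in a couple of lines. This is a minimal sketch under assumptions: the names `df_confirmed_daily` and `df_confirmed_daily_moving` come from the text, but how they were computed is not shown there, so the `diff` and `rolling` steps below are one plausible way, and the numbers are made up.

```python
import pandas as pd

# Made-up cumulative counts: one row per country, one column per date.
dates = pd.date_range("2020-03-01", periods=10, freq="D")
df_confirmed = pd.DataFrame(
    [[0, 1, 3, 6, 10, 15, 21, 28, 36, 45]],
    index=["US"], columns=dates,
)

# Daily new cases: difference between consecutive date columns.
df_confirmed_daily = df_confirmed.diff(axis=1).fillna(0)

# 7-day moving average along the date axis (transpose so dates become rows).
df_confirmed_daily_moving = df_confirmed_daily.T.rolling(window=7).mean().T

country = "US"
daily = df_confirmed_daily.loc[country]
smoothed = df_confirmed_daily_moving.loc[country]
```

Both `daily` and `smoothed` are indexed by date, so they can be drawn on the same axes with `plt.plot(daily.index, daily)` and `plt.plot(smoothed.index, smoothed)`. The first six values of `smoothed` are missing because a full 7-day window is not yet available.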

Figure 10: Daily New Cases and 7-Day Moving Average in the US

The same chart and procedure are applied to death cases, with the following results:

Figure 11: Daily New Death Cases and 7-Day Moving Average of Death Cases in the US

As can be seen, the two figures (daily new cases and daily new deaths in the US, each with its 7-day
moving average) show a fairly consistent relationship: larger case surges go with higher death tolls,
although the death-to-case ratio may decline over time due to improved treatments, vaccination, or
broader testing that captures milder cases.
It is also noticeable that deaths typically trail confirmed cases by 2-4 weeks, as severe outcomes develop
after diagnosis. This lag is visible in the delayed peaks of the death curve compared to the case curve.
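
One way to put a number on such a lag is to shift the case curve by different numbers of days and see which shift correlates best with the death curve. The sketch below illustrates the idea on synthetic data only: the helper `best_lag` is hypothetical, the 14-day delay is built into the fake series, and none of this is the analysis behind the figures.

```python
import numpy as np
import pandas as pd


def best_lag(cases: pd.Series, deaths: pd.Series, max_lag: int = 42) -> int:
    """Return the shift (in days) of `cases` that correlates best with `deaths`."""
    correlations = {
        lag: cases.shift(lag).corr(deaths) for lag in range(max_lag + 1)
    }
    return max(correlations, key=correlations.get)


# Synthetic example: deaths are a scaled copy of cases delayed by 14 days.
days = pd.date_range("2020-10-01", periods=150, freq="D")
t = np.arange(150)
cases = pd.Series(1000 * np.exp(-((t - 60) / 15.0) ** 2), index=days)
deaths = 0.02 * cases.shift(14).fillna(0)

lag = best_lag(cases, deaths)
```

On real data the smoothed 7-day series would be a better input than the raw daily counts, since weekday reporting effects would otherwise dominate the correlation.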

2.6 Graph 4: 7-Day moving average of Daily New Cases

This heatmap shows the 7-day moving average of daily new Covid-19 cases for selected countries
('US', 'India', 'Brazil', 'Russia', 'Vietnam', 'France', 'Italy', 'Spain', 'Germany', 'China'). A few
points to note when reading it:

• Each row corresponds to one of the selected countries.

• Each column corresponds to a specific date.

• The color intensity in each cell indicates the 7-day moving average of new cases:

– Lighter (yellow) generally means lower daily new cases.


– Darker (orange/red) indicates higher daily new cases.

• By scanning across a row, we can see how the daily cases for a single country change
over time.

• By scanning down a column, we can compare different countries on a specific date.
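
A heatmap with this layout can be sketched with Matplotlib alone. This is a rough illustration under assumptions: the function name `plot_heatmap` is hypothetical, the `YlOrRd` colormap is chosen only because it matches the yellow-to-red description above, and the values are random placeholders rather than the real moving averages.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_heatmap(moving_avg: pd.DataFrame):
    """Draw countries as rows and dates as columns, coloured by value."""
    fig, ax = plt.subplots(figsize=(12, 5))
    image = ax.imshow(moving_avg.to_numpy(), aspect="auto", cmap="YlOrRd")
    ax.set_yticks(range(len(moving_avg.index)))
    ax.set_yticklabels(moving_avg.index)
    # Label only a handful of dates to keep the x-axis readable.
    step = max(1, moving_avg.shape[1] // 8)
    ticks = list(range(0, moving_avg.shape[1], step))
    ax.set_xticks(ticks)
    ax.set_xticklabels(
        [moving_avg.columns[i].strftime("%Y-%m-%d") for i in ticks],
        rotation=45, ha="right",
    )
    fig.colorbar(image, ax=ax, label="7-day moving average of new cases")
    ax.set_title("7-Day Moving Average of Daily New Cases")
    fig.tight_layout()
    return fig, ax


# Random placeholder values only; they stand in for the real moving averages.
rng = np.random.default_rng(0)
dates = pd.date_range("2020-03-01", periods=60, freq="D")
countries = ["US", "India", "Brazil"]
demo = pd.DataFrame(rng.integers(0, 1000, size=(3, 60)),
                    index=countries, columns=dates)
fig, ax = plot_heatmap(demo)
```

Calling `plt.show()` afterwards displays the figure. Because one colour scale is shared by all rows, countries with very large counts dominate the darker end of the scale.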

And here are the results:

Figure 12: Heatmap shows the 7-Day moving average for 10 countries

As can be seen, the US shows particularly intense periods (orange to red) during late 2020 and
early 2021, while India shows a notably intense red period around early to mid 2021. Most other countries
show varying intensities of yellow, indicating lower relative case counts. These figures are consistent with
the death and confirmed-case curves above. The winter surge (late 2020/early 2021) shows this clearly:
cases peaked around January 2021, while deaths peaked shortly after. The heatmap also reveals the major
waves, especially in the US:

• First wave: Spring 2020 (relatively smaller)

• Summer wave: July-August 2020 (moderate)

• Winter surge: November 2020 - January 2021 (largest)

• Gradual decline: Spring 2021

2.7 Bonus

In this lab assignment, besides Matplotlib, I also use Plotly, a powerful, high-level visualization library
for creating interactive, web-ready plots.
Here are some reasons why I chose Plotly to visualize the same datasets, as its output is more
interactive, responsive and intuitive:

• Interactivity: Plotly charts are interactive by default. I can zoom, pan, and hover over data points
to see additional information. This makes it especially useful for dashboards and presentations
where user engagement is important.

• Ease of Use with Plotly Express: Plotly Express is a high-level interface to Plotly, offering simple
syntax for creating complex visualizations. This often results in less code compared to Mat-
plotlib for similar plots.

• Web Integration: Plotly outputs can be saved as HTML files, which makes them easy to embed in
web pages and share online.

• Aesthetic and Modern Visuals: Plotly offers attractive, modern visual styles and smooth color scales,
which can be more appealing right out-of-the-box than Matplotlib’s default settings.

For example, to plot the total confirmed cases in Matplotlib, I had to list the countries explicitly, even
without knowing whether each country had data or not. For the Plotly version, I can instead aggregate
the COVID-19 confirmed cases data for every country by using:

• df_confirmed.sum(axis=1) sums up all values in each row. Each row represents a country, so this
calculates the total confirmed cases for each country.

• .reset_index() converts the result into a DataFrame and turns the row indices into a column.

• The columns are then renamed to [’Country/Region’, ’Confirmed’] to clearly represent the data.
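
The three steps above, followed by the map itself, can be sketched as below. This is an illustration under assumptions: the sample table is made up, `df_total` and `build_world_map` are hypothetical names, and the `px.choropleth` arguments (country names as locations, a red colour scale) are one reasonable choice rather than the exact call used for the figure.

```python
import pandas as pd

# Made-up wide table: one row per country, one column per date.
df_confirmed = pd.DataFrame(
    {"1/22/20": [1, 0, 2], "1/23/20": [3, 1, 2]},
    index=pd.Index(["Italy", "Vietnam", "France"], name="Country/Region"),
)

# Row-wise sum, then turn the index into a regular column and rename.
df_total = df_confirmed.sum(axis=1).reset_index()
df_total.columns = ["Country/Region", "Confirmed"]


def build_world_map(totals: pd.DataFrame):
    """Build a choropleth of totals per country (requires plotly)."""
    import plotly.express as px  # imported lazily so the aggregation runs without it

    return px.choropleth(
        totals,
        locations="Country/Region",
        locationmode="country names",   # match rows by country name
        color="Confirmed",
        color_continuous_scale="Reds",
        title="Total confirmed cases",
    )
```

Calling `build_world_map(df_total).show()` opens the interactive map. Countries whose names Plotly does not recognise are simply left uncoloured.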

As a result, rather than being limited to a handful of selected countries, I can plot the whole
world map with each country's corresponding figure:

Figure 13: Total confirmed cases of the world

Plotly has specialized functions (like px.choropleth) to create geographic visualizations, which can be
more straightforward than creating similar maps in Matplotlib.
Similarly, for dashboards and live data visualizations, Plotly's interactivity and support for dynamic
updates provide a robust solution compared to the more static nature of Matplotlib plots, as the graphs
below show:

Figure 14: Top 10 countries with the most confirmed cases shown in Plotly

Figure 15: Top 10 countries with the most death cases shown in Plotly

Figure 16: Daily New Cases with 7-Day Moving Average and other categories shown in the same plot

2.8 Another Dataset

The second dataset [1] is the file combined_matches.csv, which contains all matches combined: 244,038
scraped matches in total. There are 244,038 rows, with the columns described below.
The file has the following header: League,Date,HomeTeam,AwayTeam,HomeGoals,AwayGoals,Result

• League: league match is from

• Date: date match was played on

• HomeTeam: match home team

• AwayTeam: match away team

• HomeGoals: goals scored by home team

• AwayGoals: goals scored by away team

• Result: match result. KEY: A = Away team win. H = Home team win. D = Draw

As before, the first step is to prepare the dataset. I use pandas to read the CSV file
combined_matches.csv from the data folder into a DataFrame called df.

• Here are some parameters that I pass in when parsing the dates:

– dayfirst=True: Indicates that the date format uses day first (common in European formats).
– infer_datetime_format=True: Allows pandas to automatically detect the date format for
faster conversion.
– errors=’coerce’: Any date that cannot be parsed correctly will be set as NaT (Not a Time)
instead of causing an error.

• Then I clean the Result column, which is meant to indicate whether the match was won by the
home team (H), away team (A), or if it was a draw (D). Inconsistent formatting (like extra spaces
or punctuation) can cause issues when analyzing this data, so this cleaning step ensures consistency.

• After that I calculate the total goals per match with df['TotalGoals'] = df['HomeGoals']
+ df['AwayGoals']. This line creates a new column named TotalGoals by summing the goals
scored by the home team (HomeGoals) and the away team (AwayGoals) for each match.
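
The preparation steps above can be sketched as follows. This is a minimal illustration under assumptions: a small inline CSV with made-up rows stands in for the real file, the path in the comment is a guess at the layout of the data folder, and the exact cleaning rule for Result is one possible choice. `infer_datetime_format` is left out here because it is deprecated since pandas 2.0, where consistent formats are inferred by default.

```python
import io
import pandas as pd

# Small inline stand-in for the real file (made-up rows).
csv_text = """League,Date,HomeTeam,AwayTeam,HomeGoals,AwayGoals,Result
LeagueX,15/08/2020,Team A,Team B,2,1, H
LeagueX,16/08/2020,Team C,Team A,0,0,d.
LeagueX,not a date,Team B,Team C,1,3,A
"""
# With the real file this would be: pd.read_csv("data/combined_matches.csv")
df = pd.read_csv(io.StringIO(csv_text))

# Day-first parsing; unparseable dates become NaT instead of raising.
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True, errors="coerce")

# Normalise Result: trim spaces, upper-case, keep only the letters H, A, D.
df["Result"] = (
    df["Result"].astype(str)
                .str.strip()
                .str.upper()
                .str.replace(r"[^HAD]", "", regex=True)
)

# Total goals per match.
df["TotalGoals"] = df["HomeGoals"] + df["AwayGoals"]
```

Rows whose Date ended up as NaT can then be dropped with `df.dropna(subset=["Date"])` before any time-based plot.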

Here are some figures that I produced in both Plotly and Matplotlib, to give an intuitive
comparison of the two libraries and the strengths of each depending on the user's needs.

• The first graph illustrates home team wins. I first filter the
original DataFrame df to include only matches where the Result column is 'H' (i.e., the home
team won).

• After filtering, it groups the matches by the HomeTeam column. Each group now corresponds to
a single home team. Then, size() returns the number of rows (i.e., matches) in each group. In this
context, it’s the total number of wins that a specific home team has when playing at home.

• .reset_index(name=’Wins’): This turns the grouped data into a DataFrame with two columns:

– HomeTeam: The name of the home team.


– Wins: The total count of home wins for that team.

• Finally, the code sorts the DataFrame in descending order by the number of Wins and keeps only
the top 10 teams. This results in a list of the 10 teams with the most home wins.
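
Put together, the steps in this list form a single pandas chain. The sketch below runs on a few made-up matches; `top_home` is a hypothetical variable name, while the column names follow the dataset header.

```python
import pandas as pd

# Made-up matches purely for illustration.
df = pd.DataFrame({
    "HomeTeam": ["A", "A", "B", "C", "B", "A"],
    "AwayTeam": ["B", "C", "A", "A", "C", "B"],
    "Result":   ["H", "H", "H", "A", "D", "H"],
})

top_home = (
    df[df["Result"] == "H"]            # keep home wins only
      .groupby("HomeTeam")
      .size()                          # number of home wins per team
      .reset_index(name="Wins")
      .sort_values(by="Wins", ascending=False)
      .head(10)
)
```

The away-team ranking follows the same pattern, filtering on `Result == "A"` and grouping by `AwayTeam` instead.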

In the first image, a Plotly bar chart shows the top 10 home teams by wins. It is easy to embed in
web applications or notebooks and provides interactive features such as zoom, pan, hover, and
tooltips, an advantage over the static charts produced by Matplotlib. For example, hovering over
a bar shows the exact numbers (e.g., “Home Team = Barcelona, Number of Wins = 450”).

Figure 17: Top 10 Home Teams by Win

In the second image, a Matplotlib bar chart presents the same information: the x-axis lists the top
10 home teams, and the y-axis shows the number of home wins. Matplotlib is straightforward for quick
plots in Python scripts and is highly customizable for print and publication.
As shown in the graph, Barcelona (450) edges out Real Madrid (444) by a small margin, highlighting
the fierce competition between these two giants in La Liga. There is a relatively tight cluster among several
teams (Rangers, Juventus, PSV all at 411; Porto at 407) indicating they share very similar home-win
records. These clubs are generally the most successful teams in their respective leagues, often dominating
domestically over many years. A high number of home wins is partly a reflection of long-term success
and consistency.
The same is done for the top 10 away teams by wins:

Figure 18: Top 10 Away Teams by Win

The same clubs dominate both lists, just in a different order. These are historically successful teams
in their respective leagues (Spain, Scotland, England, Portugal, Germany, Italy, the Netherlands).
Celtic leads in away wins (359) and is 3rd at home (433). Rangers ranks 2nd away (329) and 5th at home
(411). Their relatively smaller “home-away gap” (e.g., 433 home vs. 359 away for Celtic) suggests they
travel well in a league they dominate.
Barcelona has 450 home wins vs. 318 away while Real Madrid has 444 home wins vs. 318 away.
Both are near the top in home performance, but the difference (home minus away) is around 130
wins—highlighting a more pronounced home advantage in La Liga for these two giants.
The remaining clubs have substantial home success, but also rank among the top away perform-
ers—testament to their dominance and consistency over many seasons.
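
The home-away gaps discussed above can be checked with a few lines of arithmetic. The snippet only uses the win counts quoted in the text; the dictionary layout is an illustrative choice.

```python
# Home and away win counts quoted in the text above.
wins = {
    "Barcelona":   {"home": 450, "away": 318},
    "Real Madrid": {"home": 444, "away": 318},
    "Celtic":      {"home": 433, "away": 359},
    "Rangers":     {"home": 411, "away": 329},
}

# Home-away gap per club: home wins minus away wins.
gaps = {club: w["home"] - w["away"] for club, w in wins.items()}

# Print clubs from the smallest gap to the largest.
for club, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{club}: {gap}")
```

The two Scottish clubs come out with gaps of 74 and 82, against 126 and 132 for the two Spanish clubs, which is the "around 130" difference mentioned above.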

The next graph shows the total number of matches per day. To achieve this, I need to group all the rows
(each representing a single match) in the DataFrame by their Date column using df.groupby(’Date’).
Then, once grouped, .size() returns the total number of rows (i.e., matches) in each group. Essentially,
for each unique date, it counts how many matches occurred on that day. Finally, I convert the grouped
data into a new DataFrame with two columns:

• Date: The unique date of the match.

• Matches: The number of matches played on that date.

The argument name=’Matches’ assigns a column label to the counts.
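
As a small sketch of this counting step (the three sample rows are made up, and `matches_per_day` is a hypothetical variable name):

```python
import pandas as pd

# Made-up matches: two on the first date, one on the second.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-08-15", "2020-08-15", "2020-08-16"]),
    "HomeTeam": ["A", "C", "B"],
})

# Count rows per date and label the count column "Matches".
matches_per_day = df.groupby("Date").size().reset_index(name="Matches")
```

Dates on which no match was played do not appear in the result at all, so a line chart drawn from it connects the neighbouring dates directly instead of dropping to zero.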

Figure 19: Matches Played per Day

It can be seen that from around 1986 to the mid-2000s, the data is sparser (lower match counts).
There’s a noticeable jump in frequency and density of matches around 2016–2017, where the data becomes
much denser on the chart. This might indicate:

• Increased coverage of leagues over time.

• Multiple leagues all playing on the same dates, leading to higher daily match counts.

Football matches often follow seasonal patterns. Therefore, we may see spikes in the chart for specific
days (weekends, holidays, or specific tournaments). That is the reason why the charts have many spikes
and fluctuations. Off-seasons or winter breaks in some leagues might show dips or even zero matches on
certain dates.
The last graph plots home goals against away goals for each match. Here I use a DataFrame df where each row
represents a match. The columns HomeGoals and AwayGoals contain the goals scored by the home and
away teams, respectively.

Figure 20: Home Goals vs. Away Goals by Result

As can be seen, most of the points in the scatter plot are home wins (H, blue dots). These occur
where home goals exceed away goals, and they become more frequent as home goals increase,
particularly beyond 2-3 goals. A high Avg_HomeGoals (2.6) combined with a low Avg_AwayGoals (e.g., 0.9)
indicates strong home-team performance. Away wins (A, red dots) occur where away goals exceed
home goals. A higher Avg_AwayGoals (e.g., 2.5) than Avg_HomeGoals (e.g., 1.2)
suggests effective away teams; these points are more common when home teams score few goals (0-2 range).
Draws (D, green dots) lie along the diagonal where home and away goals are equal (e.g.,
0-0, 1-1, 2-2) and are less frequent in very high-scoring games. There are also some extreme scores: a few
matches have unusually high away goals (e.g., 12-13 goals).
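
The text does not show how Avg_HomeGoals and Avg_AwayGoals were obtained. One plausible reading, sketched here as an assumption on made-up matches, is the mean of each goal column within every result class.

```python
import pandas as pd

# Made-up matches: two home wins, two away wins, two draws.
df = pd.DataFrame({
    "HomeGoals": [3, 2, 0, 1, 1, 0],
    "AwayGoals": [1, 0, 2, 3, 1, 0],
    "Result":    ["H", "H", "A", "A", "D", "D"],
})

# Mean home and away goals within each result class (H, A, D).
avg_goals = (
    df.groupby("Result")[["HomeGoals", "AwayGoals"]]
      .mean()
      .rename(columns={"HomeGoals": "Avg_HomeGoals",
                       "AwayGoals": "Avg_AwayGoals"})
)
```

In this toy table the home-win class averages 2.5 home goals against 0.5 away goals, and the away-win class shows the mirror image, which is the same pattern described for the real data.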

A References

[1] Aiden Flynn. European Football Matches. 8 Feb, 2025. url: https://www.kaggle.com/datasets/flynn28/european-football-matches?select=combined_matches.csv.
[2] RehamFawzy. COVID-19 Global Analysis. 24 Nov, 2024. url: https://www.kaggle.com/code/rehamfaw/covid-19-global-analysis.
