Report MSA Practice02
UNIVERSITY OF SCIENCE
INSTRUCTORS
Lý Quốc Ngọc
Nguyễn Mạnh Hùng
Phạm Thanh Tùng
ID Name Email
22127243 Nguyễn Bảo Long [email protected]
SELF EVALUATION
Table of Contents
2 Methodology
2.1 Used Libraries
2.2 Prepare Dataset
2.2.1 Confirmed Cases
2.2.2 Death Cases
2.2.3 OWID Dataset
2.3 Graph 1 - Random countries
2.3.1 Confirmed Cases
2.3.2 Death Cases
2.4 Graph 2 - Top 10 Countries
2.4.1 Confirmed Cases
2.4.2 Death Cases
2.5 Graph 3 - Daily New Confirmed + 7-Day Moving Average
2.6 Graph 4 - 7-Day Moving Average of Daily New Cases
2.7 Bonus
2.8 Another Dataset
A Reference
This lab assignment addresses the challenge of transforming raw case data into meaningful visual
representations. The primary goal is to utilize Python’s Matplotlib library to explore trends, distributions,
and correlations within the dataset. Key tasks include reading and preprocessing CSV data, generating
diverse plots (e.g., line charts, histograms, scatter plots), and interpreting the results to uncover patterns
such as case progression, regional disparities, or unemployment rate correlations.
Here are some proposed methods that I will present in this Lab Assignment:
• Environment Setup: Configure a Python virtual environment with libraries like Pandas for data
manipulation and Matplotlib for visualization.
• Data Loading & Preprocessing: Import the COVID-19 dataset using Pandas, clean missing
values, and structure the data for analysis.
– A scatter plot to analyze relationships between variables (e.g., median earnings vs. unemploy-
ment rates).
• Analysis & Insights: Extract observations from each plot, such as identifying outbreak peaks,
high-risk regions, or socioeconomic correlations.
• Bonus: Extend the analysis by integrating additional datasets and alternative libraries (e.g.,
Seaborn, Plotly) and compare their usability with Matplotlib.
2 Methodology
• Matplotlib.pyplot
This library is the core plotting library for static, publication-quality visualizations in this Lab
Assignment. It is used for highly customizable plots (line, bar, scatter, histograms) and fine-grained
control over axes, labels, and styles.
Here are some functions that are used in this Lab:
• Pandas
This is a library for data manipulation and analysis. It introduces the DataFrame (tabular data) and
Series (1D array) structures, along with tools for reading and writing data (CSV, Excel), handling
missing values, and aggregating data. I chose this library because it integrates directly with
Matplotlib and Seaborn for plotting from DataFrames.
Here are some functions in this library that I used:
• Seaborn
This is a high-level statistical visualization library built on Matplotlib. Seaborn simplifies complex
plots (heatmaps, violin plots, pair plots) and uses built-in themes and color palettes for aesthetics.
Its advantage over plain Matplotlib is that it needs less boilerplate code, which makes it well
suited to exploratory data analysis (EDA) with minimal effort. I only use this
function:
sns.heatmap(): to generate correlation heatmaps
Before running the analysis, we need to ensure the datasets are ready for the implementation. The
analysis is based on three datasets:
covid-19-cases.csv [2]: this dataset contains time-series data of confirmed COVID-19 cases worldwide.
It includes columns such as Province/State, Country/Region, Lat, Long, and date-wise case
counts.
DATA STRUCTURE
• Rows: Each row represents a geographic unit. Some rows may include a “Province/State” if data
is available at a subnational level.
• Columns:
1. First, we read the CSV file into a DataFrame using
df_confirmed = pd.read_csv('data/covid-19-cases.csv', index_col=0). The CSV is loaded
into a DataFrame whose index is taken from the first column.
2. Then we drop unnecessary columns. Because the analysis focuses on country-level trends, columns such as
“Province/State”, “Lat”, and “Long” are removed using
df_confirmed.drop(columns=['Province/State', 'Lat', 'Long'], inplace=True)
3. After removing unnecessary columns, we group by country. If a country has multiple rows
(because of province-level data), the counts are summed to get country-level cumulative cases:
df_confirmed = df_confirmed.groupby(['Country/Region']).sum()
4. Next, I converted Date Columns to Datetime Objects, as this conversion simplifies date-based
operations and plotting.
5. Calculating Daily New Cases: Daily new cases are calculated by subtracting the previous day’s
cumulative value from the current day’s value. The first day uses a fill value of 0. This is shown by:
df_confirmed_daily = df_confirmed - df_confirmed.shift(1, axis=1, fill_value=0)
6. Calculating a 7-Day Moving Average: a 7-day rolling average smooths out daily fluctuations and
makes trends easier to interpret, using this:
df_confirmed_daily_moving = df_confirmed_daily.rolling(window=7, axis=1).mean()
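The six steps above can be sketched end to end on a tiny in-memory CSV. This is only an illustration: the sample rows and the helper name `prepare_cases` are invented, the date labels are assumed to look like `1/22/20`, and the table is transposed so that dates run down the rows, which sidesteps the `axis=1` argument of `rolling` (deprecated in recent pandas releases, as far as I know).

```python
import io

import pandas as pd

# Tiny stand-in for data/covid-19-cases.csv (all values are invented).
SAMPLE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20
Hubei,China,30.9,112.2,10,15,25,30,40,50,60,80
Beijing,China,40.1,116.4,1,1,4,5,5,6,8,10
,Vietnam,14.0,108.2,0,2,2,2,5,5,9,16
"""


def prepare_cases(csv_source, window=7):
    """Return (cumulative, daily, moving) frames with one row per date."""
    raw = pd.read_csv(csv_source)
    # Steps 2-3: drop sub-national columns, then sum provinces per country.
    cumulative = (
        raw.drop(columns=["Province/State", "Lat", "Long"])
        .groupby("Country/Region")
        .sum()
    )
    # Step 4: parse the date labels (assumed format: month/day/2-digit year).
    cumulative.columns = pd.to_datetime(cumulative.columns, format="%m/%d/%y")
    # Transpose: one row per date, one column per country.
    cumulative = cumulative.T
    # Step 5: daily new cases; the first day is compared against 0.
    daily = cumulative - cumulative.shift(1, fill_value=0)
    # Step 6: moving average (NaN until a full window is available).
    moving = daily.rolling(window=window).mean()
    return cumulative, daily, moving


cumulative, daily, moving = prepare_cases(io.StringIO(SAMPLE_CSV))
print(daily["China"].tolist())
print(moving["China"].round(2).tolist())
```

Because the first six days do not fill a 7-day window, their moving average is NaN; passing `min_periods=1` to `rolling` would be one way to fill them if that is preferred.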
As can be seen in the dataset description below, the DataFrame has 195 rows and 540 columns. In this
case, each row represents a country (after grouping by “Country/Region”), and each column represents a
specific date (from January 22, 2020, to July 14, 2021).
All columns represent dates (properly converted to datetime objects) and all rows have numeric (integer)
data.
Mean, Standard Deviation (std), Min, 25th, 50th, 75th percentiles, and Max:
For example, for 2020-01-22:
• Mean ≈ 2.86: On average, there were very few cases on this day.
• Std ≈ 39.24: There is some variation, but note that many countries reported 0 cases.
• Min = 0 and Max = 548: The lowest and highest cumulative counts on that day.
Similar statistics for other dates show how the numbers change as the pandemic progressed. This helps
you understand the distribution and scale of the case counts over time.
DATA STRUCTURE
The covid-19-death.csv file has the same data structure:
• Rows and Columns: Very similar to the cases dataset:
The file owid-covid-latest.csv is a snapshot of the most recent COVID-19 statistics for each
country, provided by Our World In Data. It contains a broad range of indicators beyond just cases and
deaths. Here are some key parameters and their meanings:
• iso_code: The ISO 3166-1 alpha-3 code that uniquely identifies each country.
• continent: The continent on which the country is located (e.g., Asia, Europe).
• last_updated_date: The date when the data for that country was last refreshed.
• Epidemiological Metrics:
– reproduction_rate: The effective reproduction number (R), indicating how many people, on
average, one infected person will pass the virus to.
– icu_patients: Number of patients currently in intensive care units (ICU).
– ...
• Vaccination Metrics:
A wide range of epidemiological, testing, vaccination, and socio-economic indicators is provided. These
parameters offer insights into the progression of the pandemic, the effectiveness of public health measures,
and the demographic and economic context of each country.
Here is the overall description for this dataset:
There are 194 rows, each typically representing a country or region (after filtering or cleaning the
data), and 60 variables or features. These include epidemiological statistics
(such as total cases, new cases, and deaths), testing metrics, vaccination data, government response indicators,
and socio-economic/demographic information.
The summary statistics also describe the central tendency and spread of the data. For instance, the mean of
total_cases is around 3,084,054, but the high standard deviation (15,470,020) indicates a very
wide range between countries with few cases and those with very many.
The first graph is a line plot of Covid-19 confirmed cases for 10 countries over time. It plots the
7-day moving average of daily new cases to smooth out short-term fluctuations.
countries = [
    'Vietnam', 'China', 'Japan', 'US', 'France',
    'India', 'Brazil', 'Russia', 'United Kingdom',
    'Spain',
]
This list names the 10 countries that will be plotted on the graph. The figure size is then set to
16 inches wide and 6 inches tall for better readability.
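A minimal sketch of how such a multi-country line plot might be drawn is shown here. The `moving` frame is filled with random placeholder values and only three countries are used, so the shape of the curves carries no meaning; the titles and axis labels are likewise assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder smoothed series: one row per date, one column per country.
dates = pd.date_range("2020-01-22", periods=60, freq="D")
rng = np.random.default_rng(0)
countries = ["Vietnam", "US", "India"]
moving = pd.DataFrame(
    rng.random((60, 3)).cumsum(axis=0), index=dates, columns=countries
)

fig, ax = plt.subplots(figsize=(16, 6))  # 16 inches wide, 6 inches tall
for country in countries:
    # One line per country; Matplotlib cycles through colors automatically.
    ax.plot(moving.index, moving[country], label=country)
ax.set_title("7-day moving average of daily new confirmed cases")
ax.set_xlabel("Date")
ax.set_ylabel("Cases")
ax.legend()
fig.tight_layout()
```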
The US (red line) and India (brown line) show the highest spikes in Covid-19 cases, with India
reaching a massive peak around mid-2021. The US had multiple surges, with a significant peak around
late 2020 to early 2021 (Dec 2020 to Feb 2021). Brazil (pink line) had a prolonged period of high cases,
with multiple waves but never dropping to very low levels. France (purple line) had fluctuations, with
peaks occurring periodically. The UK (yellow line) saw a clear peak at the beginning of 2021 but had
fluctuations throughout. China (orange line) remained relatively flat, indicating very few confirmed cases
over time despite being the country where the outbreak was first reported. Vietnam (blue line) also remained low until
mid-2021, showing a small rise later.
To clarify the relationship between confirmed cases and deaths,
here is a line plot showing the 7-day moving average of daily new Covid-19 deaths for 10 countries over
time.
Pseudocode for Death cases analysis:
As can be seen, the US witnessed its first major wave in spring 2020 (April-May), with ≈ 2,200 daily
deaths corresponding to 30,000 cases. The largest wave occurred in winter 2020-2021 (December-January),
with ≈ 3,500 daily deaths following 250,000 daily cases. There is a clear pattern in which death peaks
follow confirmed-case peaks by approximately 2-3 weeks.
India shows the same strong correlation between confirmed cases and deaths, with about a 2-week lag: during
its massive wave in April-May 2021, cases peaked at 400,000 daily and deaths peaked at
4,000 daily.
Here are some conclusions that I have drawn:
• Time Lag Pattern: Consistently across countries, death peaks followed case peaks by approximately
2-3 weeks. This lag remained relatively constant throughout the pandemic.
– Later waves: Generally lower death rates per case, particularly visible in developed countries.
• Regional Patterns: Western countries showed similar wave timing, while Asian countries (except
India) maintained lower numbers. Brazil and India showed different patterns from other regions.
• Countries with large case spikes (e.g., US, India, Brazil) often show subsequent increases
in daily deaths, whereas countries with lower reported cases (e.g., Vietnam, Japan) typically show
lower deaths, reflecting both their smaller infection counts and possibly stronger containment
measures.
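One way to check the time-lag observation numerically is to shift the case curve forward by a candidate number of days and keep the shift that correlates best with the death curve. The sketch below does this on synthetic data in which the lag is set to 14 days by construction; `estimate_lag_days` is a hypothetical helper, and this is not the procedure used to produce the figures in this report.

```python
import numpy as np
import pandas as pd


def estimate_lag_days(cases, deaths, max_lag=35):
    """Return the shift (in days) of `cases` that best correlates with `deaths`."""
    scores = {
        lag: cases.shift(lag).corr(deaths)  # Pearson r, NaN pairs are ignored
        for lag in range(max_lag + 1)
    }
    return max(scores, key=scores.get)


# Synthetic wave: a bell-shaped case curve and deaths that trail it by 14 days.
days = pd.date_range("2020-10-01", periods=150, freq="D")
t = np.arange(150)
cases = pd.Series(250_000 * np.exp(-((t - 60) / 20) ** 2), index=days)
deaths = 0.014 * cases.shift(14, fill_value=0.0)

lag = estimate_lag_days(cases, deaths)
print(f"Estimated lag: {lag} days")
```

On real data the 7-day moving averages would be the natural inputs, since raw daily counts carry weekly reporting artifacts that can distort the correlation.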
To support this analysis, a scatter plot built from another dataset, which records total Covid-19 deaths
and total confirmed cases, reveals the same trend.
Using the provided owid-covid-latest.csv, I created a scatter plot with Total Cases on the x-axis and
Total Deaths on the y-axis. Here is the pseudocode for this scatter plot:
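A minimal runnable sketch of such a scatter plot might look as follows. It assumes the OWID column names `total_cases` and `total_deaths`; the four rows are invented placeholders rather than values from owid-covid-latest.csv, and the Pearson coefficient is added only as a single-number summary of the trend.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented rows that mimic two columns of owid-covid-latest.csv.
df_latest = pd.DataFrame(
    {
        "location": ["A", "B", "C", "D"],
        "total_cases": [1_000, 50_000, 2_000_000, 30_000_000],
        "total_deaths": [10, 900, 45_000, 600_000],
    }
)
# Rows missing either value cannot be placed on the plot.
points = df_latest.dropna(subset=["total_cases", "total_deaths"])

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(points["total_cases"], points["total_deaths"], alpha=0.6)
ax.set_xlabel("Total Cases")
ax.set_ylabel("Total Deaths")
ax.set_title("Total Covid-19 deaths vs. total cases")

# Pearson correlation between the two columns.
pearson_r = points["total_cases"].corr(points["total_deaths"])
print(f"Pearson r = {pearson_r:.3f}")
```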
As can be seen, the scatter plot also generally shows a positive correlation: countries with higher
total Covid-19 cases also tend to report higher total Covid-19 deaths. This aligns with the
previous analysis where we saw death patterns following case patterns.
Besides, many points are clustered in the lower left, as many countries
have relatively low total cases and low total deaths, forming a dense cluster near (0,0). The
points that stand out at the top-right corner are outliers. These represent countries with very
large total case counts and significant total deaths (e.g., the United States, India, or Brazil).
The plot shows total cases reaching up to ≈ 175 million (1.75×10^8) and deaths up to 4 million. This
aggregate view supports the massive scale of outbreaks we saw in the time series, particularly in countries
like India, the US, and Brazil.
Given this large scale, we now turn to the top 10 countries that account for the majority of
confirmed cases and deaths.
I created two visualizations: a bar chart showing absolute numbers and a pie chart showing relative
proportions. This complements the time series and scatter plots by providing a snapshot view of cumu-
lative cases.
1. To create these graphs, I first identify the latest date by taking the last column of
df_confirmed (using [-1]) and storing it in latest_date. Each column in df_confirmed represents
a date, so the last column is the most recent date available in the dataset.
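The selection that follows from this step can be sketched as below, assuming `df_confirmed` holds cumulative counts with countries as rows and dates as columns. The numbers are invented, and only the top 3 are taken so the output stays short.

```python
import pandas as pd

# Invented cumulative counts: rows are countries, columns are dates.
df_confirmed = pd.DataFrame(
    {
        pd.Timestamp("2021-07-13"): [900, 400, 300, 50],
        pd.Timestamp("2021-07-14"): [1000, 500, 400, 100],
    },
    index=["US", "India", "Brazil", "Vietnam"],
)

latest_date = df_confirmed.columns[-1]       # most recent date column
top = df_confirmed[latest_date].nlargest(3)  # use 10 for the real data
share = (top / top.sum() * 100).round(1)     # percentages for the pie chart

print(top.to_dict())
print(share.to_dict())
```

`top` holds the bar heights and `share` the pie-slice percentages, so both charts can be drawn from the same selection.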
This shows the final outcome of the trends observed in the previous time series graphs. The bar
chart provides absolute comparisons, while the pie chart shows relative contributions. The charts
highlight countries such as:
• US - which had multiple large waves and has the highest bar, with over 30 million cases.
• India - in second place after its massive spike in 2021, with around 20-25 million cases.
• Brazil - which maintained high cases throughout and follows closely with 15-20 million cases.
• European countries such as Russia, the UK, and France have progressively smaller bars, ranging from
5-10 million cases.
• In general, the top 3 countries (United States, India, Brazil) account for the majority of the
cases (possibly 60-70% combined). The remaining 7 countries have smaller shares, reflecting
lower case numbers.
Similarly, the following diagrams illustrate the correlation between confirmed cases and deaths, and point
out the characteristics and trends of the data.
As can be seen, the bar chart displays the absolute cumulative death numbers for the top 10 countries.
Each bar represents a country, with the height corresponding to the total number of deaths. The pie
chart shows the percentage share of deaths among these countries: each slice corresponds
to a country’s proportion of total deaths within this group.
The death figures follow much the same trend as confirmed cases: countries such as the United
States, India, Brazil, Russia, and the United Kingdom are likely to appear in both lists, indicating a strong
correlation between high confirmed cases and high death counts.
The pie chart shares the same characteristic. For example, if a country like the United
States has a high percentage of confirmed cases (e.g., 30%), it is also likely to have a high percentage of
deaths (e.g., 20-25%). This suggests that countries with a larger number of confirmed cases tend to have
a higher number of deaths, reflecting the direct relationship between case numbers and fatalities.
However, some countries show a disproportion between these trends. Peru and Mexico, for example,
have a notable share of cumulative deaths (e.g., 20.6% for Mexico), which might be disproportionately
high compared to their share of confirmed cases. This could indicate a higher case fatality rate
(CFR) in these countries, possibly due to healthcare system limitations or other factors.
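For reference, the case fatality rate is the share of confirmed cases that ended in death, CFR = total deaths / total confirmed cases × 100. A small sketch of the calculation is given below; the totals are placeholders chosen for easy arithmetic, not figures from the dataset, and `case_fatality_rate` is a hypothetical helper.

```python
def case_fatality_rate(total_deaths, total_cases):
    """Return the CFR in percent, or None when there are no confirmed cases."""
    if total_cases <= 0:
        return None
    return 100.0 * total_deaths / total_cases


# Placeholder totals, not taken from the dataset.
examples = {
    "Country A": {"cases": 2_000_000, "deaths": 180_000},
    "Country B": {"cases": 30_000_000, "deaths": 540_000},
}
for name, totals in examples.items():
    cfr = case_fatality_rate(totals["deaths"], totals["cases"])
    print(f"{name}: CFR = {cfr:.1f}%")
```

In this made-up example Country A has far fewer deaths in absolute terms but a much higher CFR, which is the kind of disproportion described above.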
To support the idea that Western and American countries tend to have higher death tolls,
the same approach as above is applied to the OWID dataset:
• The code sorts the DataFrame df_latest by the column ’total_deaths’ in descending order.
• It then takes the first 10 rows (.head(10)) to get the top 10 locations with the highest total Covid-19
deaths.
The bar chart displays the 10 locations with the highest total Covid-19 deaths according to df_latest.
Note that “World,” continents (e.g., “Europe”), or unions (e.g., “European Union”) may appear alongside
actual countries (e.g., “United States,” “Brazil,” “India”). The bars are sorted from left to right in descend-
ing order of total deaths. The leftmost bar has the highest deaths, and each subsequent bar decreases in
height.
However, because the dataset includes broad regions (e.g., continents, unions) and the entire “World”
category, it’s not strictly “Top 10 Countries” in the traditional sense—it’s more accurately the top 10
entities (countries/regions/aggregates) by total deaths.
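If a strict country ranking is wanted, the aggregate rows can be filtered out before sorting. The sketch below assumes that aggregate rows (World, continents, unions) have an empty `continent` field, which is my understanding of the OWID files but should be checked against the actual CSV; the rows themselves are invented.

```python
import pandas as pd

# Invented rows shaped like owid-covid-latest.csv.
df_latest = pd.DataFrame(
    {
        "iso_code": ["OWID_WRL", "OWID_EUR", "USA", "BRA", "IND"],
        "continent": [None, None, "North America", "South America", "Asia"],
        "location": ["World", "Europe", "United States", "Brazil", "India"],
        "total_deaths": [4_000_000, 1_100_000, 600_000, 540_000, 410_000],
    }
)

# Assumption: only real countries carry a continent value.
countries_only = df_latest[df_latest["continent"].notna()]
top_countries = (
    countries_only.sort_values("total_deaths", ascending=False)
    .head(10)[["location", "total_deaths"]]
)
print(top_countries["location"].tolist())
```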
As can be seen, Europe accounted for the largest share of deaths, making it the continent with the highest
death toll in the world, followed by the Americas (both North and South), where the United States
was the largest contributor.
These visualizations prompt further questions about why certain countries have higher
death tolls. Factors might include population size, healthcare system capacity, reporting standards,
demographics, the timing and intensity of Covid-19 waves and, most importantly, vaccination. Therefore,
the next figure shows the top 10 countries/regions by total vaccinations, a factor closely tied
to the death toll.
Here is the Pseudocode of this implementation:
As can be seen above, the “World” bar is significantly higher than that of any single
region or country. This is expected, since it represents the sum of all vaccinations globally.
Entries like “Asia,” “Europe,” “North America,” and “South America” appear alongside individual
countries such as “China,” “India,” “United States,” and “Brazil.” “China” stands out
for having administered a large number of vaccinations, on par with entire continents, which could
help explain its low death rate despite being the country where Covid-19 was first reported. “India,” “United
States,” and “Brazil” also appear in the top 10, each with hundreds of millions (or more) of administered
doses.
From the previous Total Deaths chart (also dominated by “World,” regions, and populous countries like
the United States, Brazil, and India), we can observe:
• High Population, High Numbers: The same large-population entities—such as “India,” “Brazil,”
and the “United States”—appear in both charts. They have high total deaths but also high total
vaccinations, largely because of their sheer population sizes.
• Importance of Vaccination: Higher vaccination levels typically correlate with lower Covid-19
mortality (or at least a significant reduction in the risk of severe outcomes). The presence of coun-
tries in both top 10 lists (deaths and vaccinations) underscores that large populations can accumu-
late high totals in both categories. However, effective vaccination efforts help mitigate the worst
outcomes by decreasing death rates over time.
After this broad overview of 10 countries, I noticed that the figures for the United States
were the largest in all categories. This section therefore analyzes daily new confirmed Covid-19
cases and deaths in the US, together with their 7-day moving averages. Line charts, a heatmap,
and a bar chart are used to present these figures.
The idea behind plotting the daily new cases/deaths and the 7-day moving average is the same:
the graphs give a day-by-day look at how many new Covid-19 cases are confirmed for
the selected country (here, the US), along with a smoothed version over a 7-day period. Spikes
or dips in the first line indicate surges or drops in daily new infections, while the smoothed second
line filters out short-term spikes and reporting anomalies, making trends easier to observe.
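A sketch of this overlay for a single country is shown here. The daily series is synthetic (a rising trend with an artificial weekend reporting dip), and the raw counts are drawn as bars rather than as a line, which is one common way to separate them visually from the smoothed curve.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily counts: rising trend with weekend under-reporting.
dates = pd.date_range("2020-11-01", periods=28, freq="D")
daily_us = pd.Series(1000.0 + 50.0 * np.arange(28), index=dates)
daily_us.loc[dates.dayofweek >= 5] *= 0.6
moving_us = daily_us.rolling(window=7).mean()  # NaN for the first 6 days

fig, ax = plt.subplots(figsize=(16, 6))
ax.bar(daily_us.index, daily_us.values, color="lightgray", label="Daily new cases")
ax.plot(moving_us.index, moving_us.values, color="red", label="7-day moving average")
ax.set_title("Daily new cases and 7-day moving average (synthetic data)")
ax.set_xlabel("Date")
ax.set_ylabel("Cases")
ax.legend()
```

Every 7-day window contains exactly two weekend days, so the weekend dip cancels out in the average and the red line follows the underlying trend.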
Figure 10: Daily New Cases and 7-Day Moving Average in the US
The same figure and algorithm are applied to death cases, with the following results:
Figure 11: Daily New Death Cases and 7-Day Moving Average of Death Cases in the US
As can be seen, both figures show a fairly
consistent correlation: higher case surges correlate with higher death tolls, but the death-to-case ratio
may decline over time due to improved treatments, vaccination, or broader testing capturing milder cases.
It is also noticeable that deaths typically trail confirmed cases by 2-4 weeks, as severe outcomes develop
after diagnosis. This lag is visible in the delayed peaks of the death curve compared to the case curve.
This heatmap shows the 7-day moving average of daily new Covid-19 cases for selected countries
('US', 'India', 'Brazil', 'Russia', 'Vietnam', 'France', 'Italy', 'Spain', 'Germany', 'China'). Some points
to note about this heatmap:
• The color intensity in each cell indicates the 7-day moving average of new cases:
• By scanning across a row, we can see how the daily cases for a single country change
over time.
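The layout described by these points (one row per country, one column per date, color encoding the moving average) can be sketched as follows. This report uses `sns.heatmap()`; the sketch swaps in Matplotlib’s `imshow` to keep dependencies small, and the values are random placeholders.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder moving averages: rows are countries, columns are dates.
dates = pd.date_range("2020-03-01", periods=90, freq="D")
rng = np.random.default_rng(1)
scale = np.array([[1000.0], [300.0], [10.0]])  # different magnitude per country
moving = pd.DataFrame(
    rng.random((3, 90)) * scale, index=["US", "India", "Vietnam"], columns=dates
)

fig, ax = plt.subplots(figsize=(16, 4))
image = ax.imshow(moving.to_numpy(), aspect="auto", cmap="YlOrRd")
ax.set_yticks(range(len(moving.index)))
ax.set_yticklabels(list(moving.index))
tick_positions = list(range(0, len(dates), 30))  # label roughly once a month
ax.set_xticks(tick_positions)
ax.set_xticklabels([dates[i].strftime("%Y-%m") for i in tick_positions])
fig.colorbar(image, ax=ax, label="7-day average of new cases")
```

Because all countries share one color scale, rows with small counts look uniformly pale next to rows with large counts.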
Figure 12: Heatmap showing the 7-day moving average for 10 countries
As can be seen, the US shows particularly intense periods (orange to red) during late 2020 and
early 2021, while India shows a notably intense red period around early to mid 2021. Most other countries
show varying intensities of yellow, indicating lower relative case counts. These figures are consistent with
the death and confirmed-case figures above. The winter surge (late 2020/early 2021) shows this clearly:
cases peaked around January 2021, while deaths peaked shortly after. The heatmap also reveals the major waves,
especially in the US:
2.7 Bonus
In this Lab Assignment, besides Matplotlib, I also used Plotly, a powerful, high-level visualization library
for creating interactive, web-ready plots.
Here are some reasons why I chose Plotly to visualize the same datasets, as its output is more
interactive, responsive and intuitive:
• Interactivity: Plotly charts are interactive by default. I can zoom, pan, and hover over data points to see additional
information. This makes it especially useful for dashboards and presentations where user engagement
is important.
• Ease of Use with Plotly Express: Plotly Express is a high-level interface to Plotly, offering simple
syntax for creating complex visualizations. This often results in less code compared to Mat-
plotlib for similar plots.
• Web Integration: Plotly outputs can be saved as HTML files, which makes them easy to embed in
web pages and share online.
• Aesthetic and Modern Visuals: Plotly offers attractive, modern visual styles and smooth color scales,
which can be more appealing right out-of-the-box than Matplotlib’s default settings.
For example, to plot the total confirmed cases in Matplotlib, I have to list all the countries, even when I
don’t know whether a country has data. With Plotly, however, I can simply aggregate the
COVID-19 confirmed cases data using:
• df_confirmed.sum(axis=1) sums up all values in each row. Each row represents a country, so this
gives the total confirmed cases for each country, provided the columns hold daily new counts (for
cumulative columns, the last column is already the total).
• .reset_index() converts the result into a DataFrame and turns the row indices into a column.
• The columns are then renamed to ['Country/Region', 'Confirmed'] to clearly represent the data.
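The three steps can be sketched on a small frame as follows. The frame name `df_daily` and its values are invented, and the sketch assumes the columns hold daily new counts; a row sum over cumulative columns would overcount, so the last lines check the result against the last cumulative column.

```python
import pandas as pd

# Invented daily new cases: rows are countries, columns are dates.
df_daily = pd.DataFrame(
    {"2021-07-13": [5, 2], "2021-07-14": [7, 3]},
    index=pd.Index(["US", "Vietnam"], name="Country/Region"),
)

totals = df_daily.sum(axis=1).reset_index()        # one row per country
totals.columns = ["Country/Region", "Confirmed"]   # readable column names
print(totals)

# Cross-check: the last cumulative column equals the sum of daily counts.
df_cumulative = df_daily.cumsum(axis=1)
matches_last_column = (
    df_cumulative.iloc[:, -1].to_numpy() == totals["Confirmed"].to_numpy()
).all()
print(matches_last_column)
```

A two-column frame in this shape is what plotting helpers such as `px.choropleth` typically take as input.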
As a result, rather than being limited to a few selected countries, I can plot the whole
world map with the corresponding figures:
Plotly has specialized functions (like px.choropleth) to create geographic visualizations, which can be
more straightforward than creating similar maps in Matplotlib.
Similarly, for dashboards and live data visualizations, Plotly’s interactivity and support for dynamic
updates provide a robust solution compared to the more static nature of Matplotlib plots, as the graphs
below show:
Figure 14: Top 10 countries with the most confirmed cases shown in Plotly
Figure 15: Top 10 countries with the most death cases shown in Plotly
Figure 16: Daily New Cases with 7-Day Moving Average and other categories shown in the same plot
The file combined_matches.csv [1] contains all matches combined: 244,038 total matches scraped.
There are 244,038 rows with 8 columns, as described below.
Each file contains the following header: League,Date,HomeTeam,AwayTeam,HomeGoals,AwayGoals,Result
• Result: match result. KEY: A = Away team win. H = Home team win. D = Draw
As above, the first step is to prepare the dataset. I use Pandas to read the CSV file
combined_matches.csv from the data folder into a DataFrame called df.
– dayfirst=True: Indicates that the date format uses day first (common in European formats).
– infer_datetime_format=True: Allows pandas to automatically detect the date format for
faster conversion.
– errors=’coerce’: Any date that cannot be parsed correctly will be set as NaT (Not a Time)
instead of causing an error.
• Then I clean the Result column, which is meant to indicate whether the match was won by the
home team (H), away team (A), or if it was a draw (D). Inconsistent formatting (like extra spaces
or punctuation) can cause issues when analyzing this data, so this cleaning step ensures consistency.
• After that I calculate total goals per Match, by using the line df[’TotalGoals’] = df[’HomeGoals’]
+ df[’AwayGoals’] - This line creates a new column named TotalGoals by summing the goals
scored by the home team (HomeGoals) and the away team (AwayGoals) for each match.
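The preparation steps for this dataset can be sketched on a few invented rows as below. Two choices here are assumptions rather than the code used in this report: the Result column is cleaned by upper-casing and extracting the first H, A, or D character, and `infer_datetime_format` is left out because, as far as I know, pandas 2.x deprecates it and infers the format by default.

```python
import io

import pandas as pd

# Invented rows using the header of combined_matches.csv.
SAMPLE = """League,Date,HomeTeam,AwayTeam,HomeGoals,AwayGoals,Result
League X,31/12/2020,Team A,Team B,1,1, d
League X,02/01/2021,Team C,Team D,2,0,H
League Y,not a date,Team E,Team F,4,1,H.
"""

df = pd.read_csv(io.StringIO(SAMPLE))

# Day-first parsing; anything unparseable becomes NaT instead of raising.
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True, errors="coerce")

# One possible cleaning rule: keep only the H/A/D letter, ignoring case,
# surrounding spaces, and stray punctuation.
df["Result"] = df["Result"].str.upper().str.extract(r"([HAD])", expand=False)

# Total goals per match.
df["TotalGoals"] = df["HomeGoals"] + df["AwayGoals"]
print(df[["Date", "Result", "TotalGoals"]])
```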
Here are some figures that I produced in both Plotly and Matplotlib, to give an intuitive
view of the two libraries and the strengths of each according to the user’s needs.
• The first graph illustrates home team wins. I filter the
original DataFrame df to include only matches where the Result column is 'H' (i.e., the home
team won).
• After filtering, it groups the matches by the HomeTeam column. Each group now corresponds to
a single home team. Then, size() returns the number of rows (i.e., matches) in each group. In this
context, it’s the total number of wins that a specific home team has when playing at home.
• .reset_index(name=’Wins’): This turns the grouped data into a DataFrame with two columns:
• Finally, the code sorts the DataFrame in descending order by the number of Wins and keeps only
the top 10 teams. This results in a list of the 10 teams with the most home wins.
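The filter, group, count, and sort steps described in this list can be chained as in the sketch below, shown on a few invented matches with placeholder team names.

```python
import pandas as pd

# Invented matches; only the two columns needed for the count are included.
df = pd.DataFrame(
    {
        "HomeTeam": ["Team A", "Team B", "Team A", "Team C", "Team A", "Team B"],
        "Result": ["H", "H", "H", "A", "H", "D"],
    }
)

home_wins = (
    df[df["Result"] == "H"]        # keep home wins only
    .groupby("HomeTeam")           # one group per home team
    .size()                        # number of wins in each group
    .reset_index(name="Wins")      # columns: HomeTeam, Wins
    .sort_values("Wins", ascending=False)
    .head(10)                      # top 10 (fewer here, the sample is tiny)
)
print(home_wins)
```

Teams without a single home win, such as Team C here, do not appear in the result at all because their rows are removed by the filter before grouping.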
In the first image, a Plotly bar chart shows the top 10 home teams by wins. It is easy to embed in
web applications or notebooks, and it supports interactivity such as zoom,
pan, hover, and tooltips. These are advantages over the traditional approach with
Matplotlib: the interactive bar chart makes it easy to hover over each bar and see
exact numbers (e.g., “Home Team = Barcelona, Number of Wins = 450”).
In the second image, a Matplotlib bar chart presents the same information: the x-axis lists the top
10 home teams, and the y-axis shows the number of home wins. It is straightforward for quick plots in
Python scripts. It is also highly customizable for print and publication.
As shown in the graph, Barcelona (450) edges out Real Madrid (444) by a small margin, highlighting
the fierce competition between these two giants in La Liga. There is a relatively tight cluster among several
teams (Rangers, Juventus, PSV all at 411; Porto at 407) indicating they share very similar home-win
records. These clubs are generally the most successful teams in their respective leagues, often dominating
domestically over many years. A high number of home wins is partly a reflection of long-term success
and consistency.
The same applies to the top 10 away teams by wins:
The same clubs dominate both lists, just in a different order. These are historically successful teams
in their respective leagues (Spain, Scotland, England, Portugal, Germany, Italy, the Netherlands).
Celtic leads in away wins (359) and is 3rd at home (433). Rangers ranks 2nd away (329) and 5th at home
(411). Their relatively small “home-away gap” (e.g., 433 home vs. 359 away for Celtic) suggests they
travel well in a league they dominate.
Barcelona has 450 home wins vs. 318 away while Real Madrid has 444 home wins vs. 318 away.
Both are near the top in home performance, but the difference (home minus away) is around 130
wins—highlighting a more pronounced home advantage in La Liga for these two giants.
The remaining clubs have substantial home success, but also rank among the top away perform-
ers—testament to their dominance and consistency over many seasons.
The next graph shows the total number of matches per day. To achieve this, I need to group all the rows
(each representing a single match) in the DataFrame by their Date column using df.groupby(’Date’).
Then, once grouped, .size() returns the total number of rows (i.e., matches) in each group. Essentially,
for each unique date, it counts how many matches occurred on that day. Finally, I convert the grouped
data into a new DataFrame with two columns:
It can be seen that from around 1986 to the mid-2000s, the data is sparser (lower match counts).
There’s a noticeable jump in frequency and density of matches around 2016–2017, where the data becomes
much denser on the chart. This might indicate:
• Multiple leagues all playing on the same dates, leading to higher daily match counts.
Football matches often follow seasonal patterns. Therefore, we may see spikes in the chart for specific
days (weekends, holidays, or specific tournaments). That is the reason why the charts have many spikes
and fluctuations. Off-seasons or winter breaks in some leagues might show dips or even zero matches on
certain dates.
The last graph plots home and away goals for each match. Here I use a DataFrame df where each row
represents a match. The columns HomeGoals and AwayGoals contain the goals scored by the home and
away teams, respectively.
As can be seen, the majority of the points in the scatter plot are home wins (H - blue dots). These tend to occur when
home goals are significantly higher than away goals, and they appear more frequently as home goals increase,
particularly beyond 2-3 goals. A high Avg_HomeGoals (2.6) and low Avg_AwayGoals (e.g., 0.9) indicate
strong home-team performance. Away wins (A - red dots) occur when away goals are significantly
higher than home goals. A higher Avg_AwayGoals (e.g., 2.5) compared to Avg_HomeGoals (e.g., 1.2)
suggests effective away teams, which is more common when home teams score few goals (0-2 range).
Draws (D - green dots) appear along the diagonal where home and away goals are equal (e.g.,
0-0, 1-1, 2-2), and are less frequent in very high-scoring games. There are also some extreme scores: some matches
have unusually high away goals (e.g., 12-13 goals).
A Reference
References
[1] Aiden Flynn. European Football Matches. 8 Feb, 2025. url: https://www.kaggle.com/datasets/flynn28/european-football-matches?select=combined_matches.csv.
[2] RehamFawzy. COVID-19 Global Analysis. 24 Nov, 2024. url: https://www.kaggle.com/code/rehamfaw/covid-19-global-analysis.