4/4/25, 7:13 AM PROJECT5: Exploring COVID-19 Data using Databricks
PROJECT5: Exploring COVID-19 Data using Databricks
Due Sunday by 11:59pm
Points 10
Submitting a text entry box or a file upload
Attempts 0
Allowed Attempts 3
Available Mar 25 at 6:30pm - Apr 8 at 11:59pm
Research Question:
Are COVID-19 vaccination rates associated with fewer COVID-19 cases and reduced mortality
(excess mortality)?
Project Overview
This project explores the relationship between COVID-19 vaccination rates, case rates, and mortality
across different continents during the pandemic. Using data from Our World in Data (use the data
provided), the World Bank, and Wikipedia, students will utilize Databricks to prepare, analyze, and
visualize data to determine whether higher vaccination rates are associated with reductions in
COVID-19 cases and mortality rates globally.
Data Sources (minimum):
1. Use the condensed Our World in Data file provided: OWID_COVID19_data_4_Project5-1.csv
(https://uc.instructure.com/courses/1737794/files/191737413?wrap=1)
(https://uc.instructure.com/courses/1737794/files/191737413/download?download_frd=1)
Our World in Data for more information: https://ourworldindata.org/covid-vaccinations
(https://ourworldindata.org/covid-vaccinations)
2. Please use any website, including the ones listed below, for Question 8 to find the missing GDP
(PPP) per capita:
World Bank Data: https://data.worldbank.org (https://data.worldbank.org/)
Wikipedia: https://en.wikipedia.org (https://en.wikipedia.org/)
Simple Google Search
Submission:
Submit 2 files named with your email username:
1. Requested screenshot information: e.g., KANSAKRI_results.doc
2. Databricks notebook with your code/commands, exported as a .dbc file: e.g., KANSAKRI_project5.dbc
Hints:
https://uc.instructure.com/courses/1737794/assignments/22128506 1/6
Please note that the Databricks Notebook (DBC) file can only be opened within a Databricks
notebook environment. Refer to the attachments for more information.
1. Getting started with Databricks.pdf
(https://uc.instructure.com/courses/1737794/modules/items/75219925)
2. W11C1-introduction-to-python-for-data-science-and-data-engineering-1.2.0.dbc
(https://uc.instructure.com/courses/1737794/files/191737465?wrap=1)
(https://uc.instructure.com/courses/1737794/files/191737465/download?download_frd=1)
3. W11C2 Deep dive session on Databricks.mp4
(https://uc.instructure.com/courses/1737794/modules/items/75220707)
4. Data Sources: There are some differences in the column names when using the dataset
from the website compared to the attached CSV.
The CSV contains columns named people_fully_vaccinated_phun and
new_cases_pmil.
The website data contains people_fully_vaccinated_per_hundred and
new_cases_per_million.
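Because the examples in the steps below use the website column names, one option is to rename the CSV columns right after loading. The following is a minimal pandas sketch, assuming a tiny made-up DataFrame in place of the real file (the values are invented for illustration only):

```python
import pandas as pd

# Toy stand-in for the provided CSV; values are invented for illustration.
df = pd.DataFrame({
    "continent": ["Asia", "Europe"],
    "people_fully_vaccinated_phun": [40.0, 60.0],
    "new_cases_pmil": [120.5, 300.2],
})

# Map the CSV column names to the names used on the Our World in Data website.
rename_map = {
    "people_fully_vaccinated_phun": "people_fully_vaccinated_per_hundred",
    "new_cases_pmil": "new_cases_per_million",
}
df = df.rename(columns=rename_map)
print(list(df.columns))
```

With a PySpark DataFrame, the same idea can be expressed with df.withColumnRenamed(old_name, new_name), applied once per column.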
Basic Data Preparation Instructions (Databricks, PySpark, and Pandas)
1. Install Databricks and upload the data. (Point: 1)
Set up a free version of Databricks: either Databricks Community Edition
(https://databricks.com/product/faq/community-edition) or the Databricks Free Trial
(https://docs.databricks.com/en/getting-started/free-trial.html).
Upload the project data file (OWID_COVID19_data_4_Project5-1.csv
(https://uc.instructure.com/courses/1737794/files/191737413?wrap=1)
(https://uc.instructure.com/courses/1737794/files/191737413/download?download_frd=1) ) to
Databricks.
Provide screenshots of completion:
You might need to enable DBFS in the Databricks Community Edition as follows:
Steps to upload files in Databricks:
2. Create a Databricks cluster and load the data file using Pandas or PySpark as needed.
(Point: 1)
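# Example with PySpark
# Adjust the file name to match your upload (the provided file is named
# OWID_COVID19_data_4_Project5-1.csv).
path = "dbfs:/FileStore/OWID_COVID19_data_4_Project5.csv"
dbutils.fs.ls(path)  # raises an error if the file is not in DBFS
df = (spark.read.format("csv")
      .option("header", "true")       # first row holds the column names
      .option("inferSchema", "true")  # let Spark infer the column types
      .load(path))
display(df)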
3. Filter Records and display the row count by continent (Point: 1)
Remove any records with missing, zero, or negative values in the columns
people_fully_vaccinated_per_hundred or new_cases_per_million.
# Example with PySpark:
from pyspark.sql import functions as F
# A "> 0" comparison also drops nulls, so missing values are removed as well.
filtered_df = df.filter(
    (F.col('people_fully_vaccinated_per_hundred') > 0) &
    (F.col('new_cases_per_million') > 0)
)
continent_counts = filtered_df.groupBy("continent").count()
display(continent_counts)
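If you work in pandas instead, the same filter can be written with a boolean mask. This is a minimal sketch on invented toy data, included mainly to show that a single "> 0" condition handles missing, zero, and negative values together:

```python
import pandas as pd

# Toy data (invented values) with one missing, one zero, and one negative entry.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Africa"],
    "people_fully_vaccinated_per_hundred": [40.0, None, 60.0, 0.0, 10.0],
    "new_cases_per_million": [120.5, 80.0, 300.2, 50.0, -5.0],
})

# NaN > 0 evaluates to False, so this one condition drops missing,
# zero, and negative values in either column.
mask = (df["people_fully_vaccinated_per_hundred"] > 0) & (df["new_cases_per_million"] > 0)
filtered_df = df[mask]

# Row count per continent after filtering.
continent_counts = filtered_df.groupby("continent").size()
print(continent_counts)
```

Continents whose rows are all filtered out (Africa in this toy data) do not appear in the counts at all.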
4. Create Month and Year Columns and display the total record count (Point: 1)
Use the date column to create month and year columns, and filter for data from September and
October 2021.
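# Example with PySpark:
filtered_df = (filtered_df
    .withColumn('month', F.month('date'))
    .withColumn('year', F.year('date')))
filtered_df = filtered_df.filter(
    (F.col('year') == 2021) & (F.col('month').isin([9, 10])))
print(filtered_df.count())  # total record count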
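For a pandas DataFrame, the date column must be parsed as a datetime before the .dt accessor can be used. A minimal sketch on invented toy data (the name sep_oct_2021 is only illustrative):

```python
import pandas as pd

# Toy data (invented values); dates arrive as strings, as they would from a CSV.
df = pd.DataFrame({
    "date": ["2021-08-31", "2021-09-15", "2021-10-01", "2021-11-02", "2020-09-15"],
    "new_cases_per_million": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Parse the strings so that .dt.month and .dt.year are available.
df["date"] = pd.to_datetime(df["date"])
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year

# Keep only September and October 2021.
sep_oct_2021 = df[(df["year"] == 2021) & (df["month"].isin([9, 10]))]
print(len(sep_oct_2021))  # total record count
```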
5. Calculate Averages and display the row count by continent by month (Point: 1)
For the predictor and dependent variables, calculate the average values for September and
October 2021.
# Example with PySpark:
avg_df = filtered_df.groupBy('continent', 'month').agg(
    F.mean('people_fully_vaccinated_per_hundred').alias('average_people_fully_vaccinated'),
    F.mean('new_cases_per_million').alias('average_new_cases'),
    F.mean('excess_mortality').alias('average_excess_mortality')
)
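The same aggregation can be sketched in pandas with named aggregations, which also makes it easy to add the row count per continent and month. The data below is invented, and the row_count column name is an assumption for illustration:

```python
import pandas as pd

# Toy data (invented values), already filtered to September/October.
filtered_df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe"],
    "month": [9, 9, 10, 9],
    "people_fully_vaccinated_per_hundred": [40.0, 50.0, 60.0, 70.0],
    "new_cases_per_million": [100.0, 200.0, 300.0, 400.0],
})

# One output row per (continent, month) with the averages and a row count.
avg_df = (
    filtered_df.groupby(["continent", "month"])
    .agg(
        average_people_fully_vaccinated=("people_fully_vaccinated_per_hundred", "mean"),
        average_new_cases=("new_cases_per_million", "mean"),
        row_count=("new_cases_per_million", "count"),
    )
    .reset_index()
)
print(avg_df)
```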
6. Plotting a Bar Chart. (Point: 1)
Plot the average vaccination rate and new cases per million for each continent for September and
October 2021.
# Use Databricks' built-in plot options:
display(avg_df.orderBy('continent', 'month'))
Adjust the plot options to configure the plot properly, as shown below:
When you first run the cell, you'll get an HTML table as the result. To configure the plot:
1. Click the graph button.
2. If the plot doesn't look correct, click the Plot Options button.
3. Configure the plot similar to the example screenshot (example only; your graph will look different).
Hint: Order your results by continent, then month.
7. Run Correlation Analysis (Point: 1)
Perform a correlation analysis between people_fully_vaccinated_per_hundred and new_cases_per_million.
# Example with PySpark:
correlation = avg_df.corr('average_people_fully_vaccinated', 'average_new_cases')
print(correlation)
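To see what the correlation value measures, the sketch below computes Pearson's r from its definition and compares it with the pandas built-in. The numbers are invented toy values, not project results:

```python
import math
import pandas as pd

# Toy averages (invented values) standing in for the per-continent results.
avg_df = pd.DataFrame({
    "average_people_fully_vaccinated": [10.0, 20.0, 30.0, 40.0],
    "average_new_cases": [400.0, 310.0, 190.0, 100.0],
})
x = avg_df["average_people_fully_vaccinated"]
y = avg_df["average_new_cases"]

# Pearson's r: sum of products of deviations, divided by the square root
# of the product of the summed squared deviations.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / math.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Built-in equivalent (Pearson is the default method).
r_pandas = x.corr(y)
print(round(r_manual, 4), round(r_pandas, 4))
```

A value near -1 or +1 indicates a strong linear association; it does not by itself show causation, and with only a handful of continent-month averages the estimate is sensitive to individual points.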
8. Fill missing GDP (PPP) per capita. (Point: 1)
Identify the 14 countries with missing GDP data and fill in the values using World Bank
Data, or Wikipedia or a simple Google search if a World Bank figure is unavailable. Merge
this data with the main dataset.
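One possible way to approach this step in pandas is to list the countries with missing values, build a small lookup table by hand, and merge it back. In the sketch below, the column names location and gdp_per_capita are assumptions about the dataset, the helper column gdp_lookup_value is hypothetical, and all countries and numbers are placeholders rather than real figures:

```python
import pandas as pd

# Toy main dataset; column names are assumed, values are placeholders.
df = pd.DataFrame({
    "location": ["Country A", "Country A", "Country B", "Country C"],
    "gdp_per_capita": [12000.0, 12000.0, None, None],
})

# Countries that still need a GDP value.
missing = sorted(df.loc[df["gdp_per_capita"].isna(), "location"].unique())
print(missing)

# Hand-built lookup table; replace the placeholders with values you looked up.
gdp_lookup = pd.DataFrame({
    "location": ["Country B", "Country C"],
    "gdp_lookup_value": [1111.0, 2222.0],  # placeholders, not real figures
})

# Left merge keeps every original row; fill only where GDP was missing.
merged = df.merge(gdp_lookup, on="location", how="left")
merged["gdp_per_capita"] = merged["gdp_per_capita"].fillna(merged["gdp_lookup_value"])
merged = merged.drop(columns="gdp_lookup_value")
print(merged)
```

Using a left merge followed by fillna leaves existing GDP values untouched, so only the gaps are filled from the lookup table.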
9. Create Summary/Descriptive Statistics Table. (Point: 1)
Generate a table summarizing the mean, standard deviation, and frequency for all relevant
variables (e.g., people_fully_vaccinated_per_hundred, GDP PPP, new_cases_per_million,
excess_mortality).
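A summary table of this kind can be sketched in pandas with a single agg call. The example below uses invented toy data and interprets "frequency" as the number of non-missing observations, which is an assumption about what is being asked:

```python
import pandas as pd

# Toy data (invented values) with a few missing entries.
df = pd.DataFrame({
    "people_fully_vaccinated_per_hundred": [10.0, 20.0, 30.0, None],
    "new_cases_per_million": [100.0, 200.0, 300.0, 400.0],
    "excess_mortality": [5.0, None, 15.0, 25.0],
})

# agg returns one row per statistic; transpose so each variable is a row.
summary = df.agg(["mean", "std", "count"]).T
summary.columns = ["mean", "std_dev", "frequency"]
print(summary)
```

PySpark DataFrames offer describe() and summary() methods that produce similar count, mean, and standard deviation figures.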
10. Reporting Results. (Point: 1)
Interpret Findings: Write a paragraph summarizing whether COVID-19 vaccination rates are
associated with lower case numbers and reduced excess mortality. You can support this
analysis using regression, correlation, graphical results, or all of the above.
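If you choose to support the interpretation with a regression, a simple least-squares line is one option. The sketch below fits new cases against vaccination rate with numpy on invented toy values; it illustrates the mechanics only and says nothing about what the real data will show:

```python
import numpy as np

# Toy values (invented): vaccination rate per hundred vs. new cases per million.
vacc = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
cases = np.array([520.0, 410.0, 330.0, 190.0, 100.0])

# Fit cases = slope * vacc + intercept by ordinary least squares.
slope, intercept = np.polyfit(vacc, cases, deg=1)

# R^2: share of the variance in cases explained by the fitted line.
predicted = slope * vacc + intercept
ss_res = np.sum((cases - predicted) ** 2)
ss_tot = np.sum((cases - cases.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

A negative slope would be consistent with higher vaccination rates being associated with fewer cases, but an association in aggregated data does not establish that vaccination caused the difference.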