Cloud Project 5

This project investigates the correlation between COVID-19 vaccination rates and the incidence of cases and mortality across continents using Databricks. Students will analyze data from Our World in Data and other sources to determine if higher vaccination rates lead to fewer cases and reduced mortality. The project includes data preparation, analysis, visualization, and reporting of findings, with specific submission requirements and deadlines.

Uploaded by

boyapallib
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views6 pages

Cloud Project 5

This project investigates the correlation between COVID-19 vaccination rates and the incidence of cases and mortality across continents using Databricks. Students will analyze data from Our World in Data and other sources to determine if higher vaccination rates lead to fewer cases and reduced mortality. The project includes data preparation, analysis, visualization, and reporting of findings, with specific submission requirements and deadlines.

Uploaded by

boyapallib
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

4/4/25, 7:13 AM PROJECT5: Exploring COVID-19 Data using Databricks

PROJECT5: Exploring COVID-19 Data using Databricks

Due Sunday by 11:59pm
Points 10
Submitting a text entry box or a file upload
Attempts 0
Allowed Attempts 3
Available Mar 25 at 6:30pm - Apr 8 at 11:59pm

Research Question:
Are COVID-19 vaccination rates associated with fewer COVID-19 cases and reduced mortality
(excess mortality)?

Project Overview

This project explores the relationship between COVID-19 vaccination rates, case rates, and mortality
across different continents during the pandemic. Using data from Our World in Data (use the data
provided), the World Bank, and Wikipedia, students will utilize Databricks to prepare, analyze, and
visualize data to determine whether higher vaccination rates are associated with reductions in
COVID-19 cases and mortality rates globally.

Data Sources (minimum):

1. Use the condensed Our World in Data file provided: OWID_COVID19_data_4_Project5-1.csv
(https://uc.instructure.com/courses/1737794/files/191737413?wrap=1)
(https://uc.instructure.com/courses/1737794/files/191737413/download?download_frd=1)
For more information, see Our World in Data: https://ourworldindata.org/covid-vaccinations
2. Use any website, including the ones listed below, for Question 8 to find the missing GDP
(PPP) per capita:
World Bank Data: https://data.worldbank.org
Wikipedia: https://en.wikipedia.org
A simple Google search

Submission:

Submit 2 files named with your email username:

1. Requested screenshot information: e.g., KANSKRI_results.doc

2. Databricks notebook with your code/commands, exported as a .dbc file: e.g., KANSAKRI_project5.dbc

Hints:


Please note that the Databricks Notebook (DBC) file can only be opened within a Databricks
notebook environment. Refer to the attachments for more information.


1. Getting started with Databricks.pdf
(https://uc.instructure.com/courses/1737794/modules/items/75219925)
2. W11C1-introduction-to-python-for-data-science-and-data-engineering-1.2.0.dbc
(https://uc.instructure.com/courses/1737794/files/191737465?wrap=1)
(https://uc.instructure.com/courses/1737794/files/191737465/download?download_frd=1)
3. W11C2 Deep dive session on Databricks.mp4
(https://uc.instructure.com/courses/1737794/modules/items/75220707)
4. Data Sources: The column names in the dataset from the website differ from those in the
attached CSV.

The CSV contains columns named people_fully_vaccinated_phun and new_cases_pmil.

The website data contains people_fully_vaccinated_per_hundred and new_cases_per_million.
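Because the examples later in this assignment use the website's column names, one option is to rename the CSV columns once after loading. The following is a minimal pandas sketch of that idea; the helper name rename_owid_columns and the sample rows are hypothetical, and the values are made up for illustration rather than taken from the dataset.

```python
import pandas as pd

# Mapping from the condensed CSV's column names to the website's names,
# as listed in hint 4.
CSV_TO_WEBSITE = {
    "people_fully_vaccinated_phun": "people_fully_vaccinated_per_hundred",
    "new_cases_pmil": "new_cases_per_million",
}


def rename_owid_columns(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose CSV-style column names use the website's names."""
    # Columns that are not in the mapping are left unchanged.
    return pdf.rename(columns=CSV_TO_WEBSITE)


# Made-up rows, for illustration only (not real OWID values).
sample = pd.DataFrame({
    "continent": ["Asia", "Europe"],
    "people_fully_vaccinated_phun": [40.0, 60.0],
    "new_cases_pmil": [120.0, 300.0],
})

renamed = rename_owid_columns(sample)
print(list(renamed.columns))
```

With a PySpark DataFrame, the same renaming can be done with one withColumnRenamed call per column.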

Basic Data Preparation Instructions (Databricks, PySpark, and Pandas)
1. Install Databricks and upload the data. (Point: 1)
Set up the free version of Databricks: either Databricks Community Edition
(https://databricks.com/product/faq/community-edition) or the Databricks Free Trial
(https://docs.databricks.com/en/getting-started/free-trial.html).
Upload the project data file (OWID_COVID19_data_4_Project5-1.csv,
https://uc.instructure.com/courses/1737794/files/191737413/download?download_frd=1) to
Databricks.
Provide screenshots of completion.
You might need to enable DBFS in the Databricks Community Edition as follows:


Steps to upload files in Databricks:

2. Create a Databricks cluster and load the data file using Pandas or PySpark as needed.
(Point: 1)


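# Example with PySpark

# Confirm the file exists in DBFS (adjust the path to match the name of your uploaded file)
dbutils.fs.ls("dbfs:/FileStore/OWID_COVID19_data_4_Project5.csv")

# Read the CSV with a header row and inferred column types
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("dbfs:/FileStore/OWID_COVID19_data_4_Project5.csv")

display(df)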
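Step 2 also allows loading the file with Pandas. The following is a small sketch of a pandas load that parses the date column up front; the inline CSV text stands in for the uploaded file, its values are made up, and the location column name is an assumption about the file's layout.

```python
import io

import pandas as pd

# Tiny stand-in for the uploaded CSV; the values are made up for illustration,
# and the "location" column is an assumption about the file's layout.
csv_text = """continent,location,date,new_cases_pmil
Asia,Country A,2021-09-01,120.5
Europe,Country B,2021-10-15,300.0
"""

# In a notebook, pass the path of the uploaded file instead of the StringIO object.
# parse_dates converts the date column to a datetime type while reading.
pdf = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(pdf.dtypes)
```

If the data is already loaded as a Spark DataFrame, df.toPandas() is another way to obtain a pandas DataFrame.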

3. Filter Records and display the row count by continent (Point: 1)

Remove any records with missing, zero, or negative values in the columns
people_fully_vaccinated_per_hundred or new_cases_per_million.

# Example with PySpark:

from pyspark.sql import functions as F

# A "> 0" comparison evaluates to null for missing values, so filter() drops those rows too
filtered_df = df.filter(
    (F.col('people_fully_vaccinated_per_hundred') > 0) &
    (F.col('new_cases_per_million') > 0)
)

# Row count per continent
continent_counts = filtered_df.groupBy("continent").count()
display(continent_counts)
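For those working in pandas rather than PySpark, the same filter can be sketched as below. The helper name filter_positive is hypothetical and the sample rows are made up for illustration; the point is that a strictly-positive comparison removes missing, zero, and negative values in one pass.

```python
import pandas as pd


def filter_positive(pdf: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Keep rows where every listed column is present and strictly positive."""
    mask = pd.Series(True, index=pdf.index)
    for col in columns:
        # NaN > 0 evaluates to False, so missing values are dropped
        # together with zeros and negatives.
        mask &= pdf[col] > 0
    return pdf[mask]


# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Africa"],
    "people_fully_vaccinated_per_hundred": [40.0, None, 60.0, 0.0, 5.0],
    "new_cases_per_million": [120.0, 80.0, -3.0, 50.0, 10.0],
})

filtered = filter_positive(
    sample,
    ["people_fully_vaccinated_per_hundred", "new_cases_per_million"],
)

# Row count per continent after filtering.
continent_counts = filtered.groupby("continent").size()
print(continent_counts)
```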

4. Create Month and Year Columns and display the total record count (Point: 1)
Use the date column to create month and year columns, and filter for data from September and
October 2021.

# Example with PySpark (F is imported in step 3):

filtered_df = filtered_df \
    .withColumn('month', F.month('date')) \
    .withColumn('year', F.year('date')) \
    .filter((F.col('year') == 2021) & (F.col('month').isin(9, 10)))

print(filtered_df.count())

5. Calculate Averages and display the row count by continent by month (Point: 1)
For the predictor and dependent variables, calculate the average values for September and
October 2021.

# Example with PySpark:

avg_df = filtered_df.groupBy('continent', 'month').agg(
    F.mean('people_fully_vaccinated_per_hundred').alias('average_people_fully_vaccinated'),
    F.mean('new_cases_per_million').alias('average_new_cases'),
    F.mean('excess_mortality').alias('average_excess_mortality')
)
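A pandas version of the same aggregation can be sketched as follows, with a row count per group added since this step asks for the row count by continent by month. The sample rows are made up for illustration, and the row_count column name is an assumption. Note that mean skips missing values by default, so a group's average excess mortality is computed only from the rows that report it.

```python
import pandas as pd

# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe"],
    "month": [9, 9, 10, 9],
    "people_fully_vaccinated_per_hundred": [40.0, 50.0, 55.0, 60.0],
    "new_cases_per_million": [100.0, 200.0, 150.0, 300.0],
    "excess_mortality": [10.0, None, 8.0, 5.0],
})

# One output row per (continent, month) pair.
avg_pdf = (
    sample.groupby(["continent", "month"], as_index=False)
    .agg(
        average_people_fully_vaccinated=("people_fully_vaccinated_per_hundred", "mean"),
        average_new_cases=("new_cases_per_million", "mean"),
        average_excess_mortality=("excess_mortality", "mean"),
        # "size" counts all rows in the group, including rows with missing values.
        row_count=("new_cases_per_million", "size"),
    )
)

print(avg_pdf)
```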

6. Plotting a Bar Chart. (Point: 1)

Plot the average vaccination rate and new cases per million for each continent for September and
October 2021.

# Use Databricks' built-in plot options:


display(avg_df.orderBy('continent', 'month'))

Adjust the plot options to configure the plot properly, as shown below:


When you first run the cell, you'll get an HTML table as the result. To configure the plot:

1. Click the graph button.
2. If the plot doesn't look correct, click the Plot Options button.
3. Configure the plot similar to the following example.

Hint: Order your results by continent, then month.

Example only. Your graph will look different.

7. Run Correlation Analysis (Point: 1)


Perform a correlation analysis between people_fully_vaccinated_per_hundred and new_cases_per_million.

#Example with PySpark:

correlation = avg_df.corr('average_people_fully_vaccinated', 'average_new_cases')
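To make the output of this step easier to read, here is a small pandas sketch of a Pearson correlation. The numbers are made up and deliberately fall on a straight line, so the coefficient comes out at essentially -1; this shows only how to read the sign and magnitude and says nothing about the real data. A negative coefficient means that higher vaccination averages go together with lower case averages, and a value near 0 means there is little linear association.

```python
import pandas as pd

# Made-up averages, for illustration only. They are constructed so that
# cases fall by 100 for every 10-point rise in vaccination.
avg_pdf = pd.DataFrame({
    "average_people_fully_vaccinated": [10.0, 20.0, 30.0, 40.0],
    "average_new_cases": [400.0, 300.0, 200.0, 100.0],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = avg_pdf["average_people_fully_vaccinated"].corr(avg_pdf["average_new_cases"])

print(round(r, 3))
```

Keep in mind that a correlation computed on continent-level averages rests on only a handful of points, so it is worth reporting how many rows went into it.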


8. Fill missing GDP (PPP) per capita. (Point: 1)
Identify the 14 countries with missing GDP (PPP) per capita and fill in the values using
World Bank Data, Wikipedia, or a simple Google search. Merge this data with the main dataset.
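One way to approach this step in pandas is to keep the looked-up values in a small dictionary and use it to fill only the missing entries. The sketch below assumes the dataset has columns named location and gdp_per_capita, which may differ in the provided CSV; the helper name fill_missing_gdp, the country names, and the numbers are placeholders, not real GDP figures.

```python
import pandas as pd


def fill_missing_gdp(pdf: pd.DataFrame, lookup: dict) -> pd.DataFrame:
    """Fill missing gdp_per_capita values from a {location: value} lookup."""
    out = pdf.copy()
    # map() builds a per-row candidate value from the lookup; fillna() only
    # uses it where the existing value is missing.
    out["gdp_per_capita"] = out["gdp_per_capita"].fillna(out["location"].map(lookup))
    return out


# Placeholder rows and values, for illustration only.
# The column names "location" and "gdp_per_capita" are assumptions.
sample = pd.DataFrame({
    "location": ["Country A", "Country B", "Country C"],
    "gdp_per_capita": [12000.0, None, None],
})
manual_gdp = {"Country B": 3500.0}

filled = fill_missing_gdp(sample, manual_gdp)

# Locations that still need a value after the fill.
still_missing = filled.loc[filled["gdp_per_capita"].isna(), "location"].tolist()
print(still_missing)
```

Listing the locations that are still missing after the fill is a quick check that all 14 countries were covered.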

9. Create Summary/Descriptive Statistics Table. (Point: 1)


Generate a table summarizing the mean, standard deviation, and frequency for all relevant
variables (e.g., people_fully_vaccinated_per_hundred, GDP PPP, new_cases_per_million,
excess_mortality).
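A table like this can be sketched in pandas with a single agg call. The rows below are made up for illustration, and "frequency" is interpreted here as the number of non-missing observations per variable, which is an assumption about what the assignment intends.

```python
import pandas as pd

# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "people_fully_vaccinated_per_hundred": [40.0, 60.0, 50.0],
    "new_cases_per_million": [100.0, 300.0, 200.0],
    "excess_mortality": [10.0, None, 6.0],
})

# agg returns one row per statistic; transpose so each variable is a row.
# "std" is the sample standard deviation and "count" ignores missing values.
summary = sample.agg(["mean", "std", "count"]).T
summary.columns = ["mean", "std_dev", "frequency"]

print(summary)
```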

10. Reporting Results. (Point: 1)

Interpret Findings: Write a paragraph summarizing whether COVID-19 vaccination rates are
associated with lower case numbers and reduced excess mortality. You can support this
analysis using regression, correlation, graphical results, or all of the above.
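If you choose to support the paragraph with a regression, a simple least-squares line is one possible starting point. The sketch below uses NumPy's polyfit on made-up numbers that lie exactly on a line, so the fitted slope and intercept are known in advance; it illustrates how to read the slope, not what the real data show.

```python
import numpy as np

# Made-up values, for illustration only: cases = 500 - 10 * vaccination rate.
vaccination_rate = np.array([10.0, 20.0, 30.0, 40.0])
new_cases = np.array([400.0, 300.0, 200.0, 100.0])

# Fit a degree-1 polynomial (a straight line) by least squares.
# polyfit returns the coefficients from highest degree to lowest.
slope, intercept = np.polyfit(vaccination_rate, new_cases, deg=1)

print(round(slope, 2), round(intercept, 2))
```

A negative slope would indicate that higher vaccination rates are associated with fewer cases in the fitted data. An association of this kind does not by itself establish that vaccination caused the difference.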
