Cloud Project 5

This project investigates the correlation between COVID-19 vaccination rates and the incidence of cases and mortality across continents using Databricks. Students will analyze data from Our World in Data and other sources to determine if higher vaccination rates lead to fewer cases and reduced mortality. The project includes data preparation, analysis, visualization, and reporting of findings, with specific submission requirements and deadlines.

Uploaded by

boyapallib

4/4/25, 7:13 AM PROJECT5: Exploring COVID-19 Data using Databricks

PROJECT5: Exploring COVID-19 Data using Databricks

Due: Sunday by 11:59pm
Points: 10
Submitting: a text entry box or a file upload
Allowed Attempts: 3
Available: Mar 25 at 6:30pm - Apr 8 at 11:59pm

Research Question:
Are COVID-19 vaccination rates associated with fewer COVID-19 cases and reduced mortality
(excess mortality)?

Project Overview

This project explores the relationship between COVID-19 vaccination rates, case rates, and mortality
across different continents during the pandemic. Using data from Our World in Data (use the data
provided), the World Bank, and Wikipedia, students will utilize Databricks to prepare, analyze, and
visualize data to determine whether higher vaccination rates are associated with reductions in
COVID-19 cases and mortality rates globally.

Data Sources (minimum):

1. Use the condensed Our World in Data file provided: OWID_COVID19_data_4_Project5-1.csv
(https://uc.instructure.com/courses/1737794/files/191737413?wrap=1)
(https://uc.instructure.com/courses/1737794/files/191737413/download?download_frd=1)
Our World in Data for more information: https://ourworldindata.org/covid-vaccinations
2. Please use any website, including the ones listed below, for Question 8 to find the missing GDP
(PPP) per capita:
World Bank Data: https://data.worldbank.org
Wikipedia: https://en.wikipedia.org
Simple Google Search

Submission:

Submit 2 files named with your email username:

1. Requested screenshot information: e.g., KANSKRI_results.doc

2. Databricks notebook with your code/commands, exported as a .dbc file: e.g., KANSAKRI_project5.dbc

Hints:

https://uc.instructure.com/courses/1737794/assignments/22128506

Please note that the Databricks Notebook (DBC) file can only be opened within a Databricks
notebook environment. Refer to the attachments for more information.


1. Getting started with Databricks.pdf
(https://uc.instructure.com/courses/1737794/modules/items/75219925)
2. W11C1-introduction-to-python-for-data-science-and-data-engineering-1.2.0.dbc
(https://uc.instructure.com/courses/1737794/files/191737465?wrap=1)
(https://uc.instructure.com/courses/1737794/files/191737465/download?download_frd=1)
3. W11C2 Deep dive session on Databricks.mp4
(https://uc.instructure.com/courses/1737794/modules/items/75220707)
4. Data Sources: There are some differences in the column names when using the dataset
from the website compared to the attached CSV.

The CSV contains columns named people_fully_vaccinated_phun and new_cases_pmil.

The website data contains people_fully_vaccinated_per_hundred and new_cases_per_million.
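
Because the example commands in this handout use the website-style names, one option is to rename the CSV columns once, right after loading. The following is a minimal pandas sketch of that idea. The helper name `harmonize_columns` and the sample frame are made up for illustration, and only the two column pairs named in the hint are mapped.

```python
import pandas as pd

# Mapping from the short names in the provided CSV to the names used on the
# Our World in Data website (both pairs come from the hint above).
COLUMN_RENAMES = {
    "people_fully_vaccinated_phun": "people_fully_vaccinated_per_hundred",
    "new_cases_pmil": "new_cases_per_million",
}


def harmonize_columns(pdf: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of pdf with CSV-style column names renamed (hypothetical helper)."""
    # rename() ignores keys that are absent, so this is safe on either naming scheme.
    return pdf.rename(columns=COLUMN_RENAMES)


# Tiny made-up frame standing in for the real CSV (values are placeholders).
sample = pd.DataFrame({
    "continent": ["Europe"],
    "people_fully_vaccinated_phun": [60.0],
    "new_cases_pmil": [150.0],
})
renamed = harmonize_columns(sample)
print(list(renamed.columns))
```

On a PySpark DataFrame, the same effect can be had by calling `withColumnRenamed(old_name, new_name)` once per column.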

Basic Data Preparation Instructions (Databricks, PySpark, and Pandas)
1. Install Databricks and upload the data. (Point: 1)
Set up the free version of Databricks: either Databricks Community Edition
(https://databricks.com/product/faq/community-edition) or Databricks Free Trial
(https://docs.databricks.com/en/getting-started/free-trial.html).
Upload the project data file OWID_COVID19_data_4_Project5-1.csv
(https://uc.instructure.com/courses/1737794/files/191737413?wrap=1)
(https://uc.instructure.com/courses/1737794/files/191737413/download?download_frd=1)
to Databricks.
Provide screenshots of completion:
You might need to enable DBFS in the Databricks Community Edition as follows:


Steps to upload files in Databricks:

2. Create a Databricks cluster and load the data file using Pandas or PySpark as needed. (Point: 1)


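# Example with PySpark

# Confirm the uploaded file is visible in DBFS.
# Adjust the file name if your upload kept the "-1" suffix.
dbutils.fs.ls("dbfs:/FileStore/OWID_COVID19_data_4_Project5.csv")

# Read the CSV with a header row and let Spark infer the column types
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("dbfs:/FileStore/OWID_COVID19_data_4_Project5.csv")

display(df)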

3. Filter Records and display the row count by continent (Point: 1)

Remove any records with missing, zero, or negative values in the columns
people_fully_vaccinated_per_hundred or new_cases_per_million.

#Example with PySpark:

from pyspark.sql import functions as F

# A "> 0" comparison also drops missing values, because null > 0 is never true
filtered_df = df.filter( (F.col('people_fully_vaccinated_per_hundred') > 0) &
                         (F.col('new_cases_per_million') > 0) )

# Row count by continent
continent_counts = filtered_df.groupBy("continent").count()
display(continent_counts)

4. Create Month and Year Columns and display the total record count (Point: 1)
Use the date column to create month and year columns, and filter for data from September and
October 2021.

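#Example with PySpark:

# Derive month and year from the date column, then keep September and October 2021
filtered_df = filtered_df.withColumn('month', F.month('date')) \
    .withColumn('year', F.year('date'))

filtered_df = filtered_df.filter( (F.col('year') == 2021) & (F.col('month').isin(9, 10)) )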
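
If you loaded the data with pandas rather than PySpark, the same step might look like the sketch below. It assumes the `date` column holds strings such as "2021-09-15" that need parsing before the `.dt` accessor works. The helper names and the sample rows are made up for illustration.

```python
import pandas as pd


def add_month_year(pdf: pd.DataFrame) -> pd.DataFrame:
    """Add month and year columns derived from a 'date' column."""
    out = pdf.copy()
    # Parse date strings into datetimes so the .dt accessor is available.
    out["date"] = pd.to_datetime(out["date"])
    out["month"] = out["date"].dt.month
    out["year"] = out["date"].dt.year
    return out


def keep_sep_oct_2021(pdf: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows dated September or October 2021."""
    return pdf[(pdf["year"] == 2021) & (pdf["month"].isin([9, 10]))]


# Placeholder rows, not real data.
sample = pd.DataFrame({"date": ["2021-08-31", "2021-09-01", "2021-10-31", "2022-09-01"]})
result = keep_sep_oct_2021(add_month_year(sample))
print(len(result))  # total record count after filtering
```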

5. Calculate Averages and display the row count by continent by month (Point: 1)
For the predictor and dependent variables, calculate the average values for September and
October 2021.

#Example with PySpark:

avg_df = filtered_df.groupBy('continent', 'month').agg(
    F.mean('people_fully_vaccinated_per_hundred').alias('average_people_fully_vaccinated'),
    F.mean('new_cases_per_million').alias('average_new_cases'),
    F.mean('excess_mortality').alias('average_excess_mortality') )

6. Plotting a Bar Chart. (Point: 1)

Plot the average vaccination rate and new cases per million for each continent for September and
October 2021.

# Use Databricks' built-in plot options:

display(avg_df.orderBy('continent', 'month'))

When you first run the cell, you'll get an HTML table as the result. To configure the plot:

1. Click the graph button.
2. If the plot doesn't look correct, click the Plot Options button.
3. Configure the plot similar to the following example.

[Figure: example plot configuration. Example only. Your graph will look different.]

Hint: Order your results by continent, then month.

7. Run Correlation Analysis (Point: 1)

Perform a correlation analysis between people_fully_vaccinated_per_hundred and new_cases_per_million.

#Example with PySpark:

correlation = avg_df.corr('average_people_fully_vaccinated', 'average_new_cases')
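
To make the returned number easier to interpret, here is a small pure-Python sketch of the Pearson correlation coefficient, which is the measure that PySpark's `DataFrame.corr` computes by default. The function name `pearson_r` and the input numbers are made up for illustration and do not come from the dataset.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up averages for illustration only.
vaccination = [10.0, 20.0, 30.0, 40.0]
cases = [80.0, 60.0, 40.0, 20.0]
r = pearson_r(vaccination, cases)
print(r)
```

A value near -1 means higher vaccination averages go with lower case averages, a value near +1 means the opposite, and a value near 0 means no linear association. With only a handful of continent-by-month averages the estimate is rough, and a correlation alone does not show causation.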

8. Fill missing GDP (PPP) per capita. (Point: 1)
Identify the 14 countries with missing GDP data and fill in their values using World Bank Data, or
Wikipedia or a simple Google search if a value is unavailable there. Merge this data with the main dataset.
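
One way to approach this step is to keep the looked-up values in a small table and merge it into the main data. The pandas sketch below shows that pattern. The column names `location` and `gdp_per_capita` are assumptions (check the actual CSV header), `gdp_lookup` is a made-up helper column, and every value shown is a placeholder rather than a real figure.

```python
import pandas as pd

# Main dataset stand-in. Column names are assumptions; values are placeholders.
main = pd.DataFrame({
    "location": ["Country A", "Country B", "Country C"],
    "gdp_per_capita": [25000.0, None, None],
})

# Values looked up by hand (World Bank, Wikipedia, ...). Placeholders here.
lookup = pd.DataFrame({
    "location": ["Country B", "Country C"],
    "gdp_lookup": [12000.0, 8000.0],
})

# A left merge keeps every row of the main dataset; then fill only the gaps.
merged = main.merge(lookup, on="location", how="left")
merged["gdp_per_capita"] = merged["gdp_per_capita"].fillna(merged["gdp_lookup"])
merged = merged.drop(columns="gdp_lookup")

# Sanity check: how many countries still lack a GDP value?
still_missing = merged.loc[merged["gdp_per_capita"].isna(), "location"].nunique()
print(still_missing)
```

Filling only the missing entries, rather than overwriting the whole column, leaves the values already present in the dataset untouched.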

9. Create Summary/Descriptive Statistics Table. (Point: 1)

Generate a table summarizing the mean, standard deviation, and frequency for all relevant
variables (e.g., people_fully_vaccinated_per_hundred, GDP PPP, new_cases_per_million,
excess_mortality).
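
A table like this could be produced in pandas roughly as sketched below. The sketch reads "frequency" as the number of non-missing observations per variable, which is an assumption about what the handout intends. The values are placeholders and only two of the variables are shown.

```python
import pandas as pd

# Placeholder values; column names follow the website naming used in this handout.
pdf = pd.DataFrame({
    "people_fully_vaccinated_per_hundred": [10.0, 20.0, 30.0],
    "new_cases_per_million": [100.0, 200.0, None],
})

# agg() with a list of functions yields one row per statistic; transpose so
# that each variable becomes a row. 'count' ignores missing values.
summary = pdf.agg(["mean", "std", "count"]).T
summary = summary.rename(columns={"count": "frequency"})
print(summary)
```

On a PySpark DataFrame, `df.select(...).summary("mean", "stddev", "count")` gives comparable statistics, with one row per statistic.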

10. Reporting Results. (Point: 1)

Interpret Findings: Write a paragraph summarizing whether COVID-19 vaccination rates are
associated with lower case numbers and reduced excess mortality. You can support this
analysis using regression, correlation, graphical results, or all of the above.
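
If you choose regression as supporting evidence, a simple one-predictor least-squares fit is enough to report a slope. The pure-Python sketch below shows the calculation. The function name `fit_line` and the numbers are made up for illustration and are not results from the dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("the predictor must not be constant")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up averages for illustration only.
vaccination = [10.0, 20.0, 30.0, 40.0]
excess_mortality = [9.0, 7.0, 5.0, 3.0]
slope, intercept = fit_line(vaccination, excess_mortality)
print(slope, intercept)
```

A negative slope would be consistent with higher vaccination rates going along with lower excess mortality. It describes an association in the aggregated data and should be reported as such in the paragraph.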

