Notebook 1 - Basic R & Data Exploration

May 23, 2024

0.1 Data Science Project: Use data to determine the best and worst colleges
for conquering student debt.
0.1.1 Notebook 1: Basic R Commands & Data Exploration
Does college pay off? We’ll use some of the latest data from the US Department of Education’s
College Scorecard Database to answer that question. In this first notebook, you’ll get a gentle
introduction to R - a coding language used by data scientists to analyze large datasets. Then,
you’ll begin diving into the college scorecard data yourself. By the end of this notebook, you’ll get
a general sense of which colleges set up their graduates for success and which colleges … don’t.
[3]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command loads a useful package of R commands
library(coursekata)

0.1.2 1.0 - Exploring the dataset


To begin, let’s download our data. Our full dataset is included in a file named colleges.csv,
which we’re retrieving from the skewthescript.org website. The command below downloads the
data from the file and stores it into an R dataframe object called dat.
[4]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command downloads data and stores it in the object `dat`
dat <- read.csv('https://skewthescript.org/s/colleges.csv')

The <- operator is used to store values. For example, x <- 10 stores the value 10 in the object
named x.
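To make the assignment idea concrete, here is a minimal sketch. The object names x and y are just illustrations and are not used elsewhere in this notebook.

```r
# Store the value 10 in an object named x
x <- 10

# Use the stored value in a calculation and store the result in y
y <- x * 2 + 5

# Typing an object's name prints the value it holds
x
y
```

Nothing is printed when a value is stored; you only see output when you ask for the object by name.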
To get a quick view of the dataframe (dat), we can use the head command to print out its first
several rows.
[5]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command prints out the first several rows of the dataset
head(dat)

A data.frame: 6 × 26 (output cut off at the right edge)
  OPEID   name                                 city        state  region  m
  <int>   <chr>                                <chr>       <chr>  <chr>   <d
1 100200  Alabama A & M University             Normal      AL     South   15
2 105200  University of Alabama at Birmingham  Birmingham  AL     South   15
3 2503400 Amridge University                   Montgomery  AL     South   10
4 105500  University of Alabama in Huntsville  Huntsville  AL     South   14
5 100500  Alabama State University             Montgomery  AL     South   17
6 105100  The University of Alabama            Tuscaloosa  AL     South   17
The vertical columns of the dataframe are called variables, and their elements are called values.
For example, the variable city has values Normal, Birmingham, Montgomery, Huntsville, etc.
The horizontal rows of the dataframe are called observations. For example, the first observation
is Alabama A & M University, which is located in AL (Alabama), in the city of Normal, and has a
median student debt of $15,250. For this dataframe, each observation describes a specific college.
1.1 - Of the variables displayed, identify one that is quantitative, one that is categorical, and one
that is a unique identifier.
Double-click to type a response: city -> categorical, median_debt -> quantitative, OPEID ->
unique identifier
The head command only displays several rows of the dataframe. To see the full dimensions of the
dataframe, we can use the dim command.
1.2 - Use the dim command on dat to display the dimensions of the dataframe.
[6]: # Your code goes here
dim(dat)

1. 4435 2. 26
Check yourself: Your code should have printed out two numbers: 4435 and 26.
The first number outputted by dim is the number of horizontal rows in the dataframe. This
represents the number of observations (number of colleges). The second number is the number of
vertical columns in the dataframe. This represents the number of variables. What are all these
variables? See the description of the dataset below, along with links to descriptions of all the
variables.
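If you'd like to see rows, columns, and dim in miniature, the sketch below builds a tiny made-up dataframe by hand. The dataframe toy and all of its values are hypothetical; they only stand in for dat so the example runs on its own.

```r
# Hypothetical toy dataframe: 3 observations (rows) and 4 variables (columns)
toy <- data.frame(
  name        = c("College A", "College B", "College C"),
  city        = c("Springfield", "Riverton", "Lakeside"),
  state       = c("AL", "AL", "AL"),
  median_debt = c(15.2, 12.0, 10.9)
)

dim(toy)    # number of rows first, then number of columns
nrow(toy)   # just the number of observations
ncol(toy)   # just the number of variables
names(toy)  # the variable names
```

nrow, ncol, and names are base R commands, so they should work on dat in the same way.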

0.1.3 The Dataset


General description - The US Department of Education’s College Scorecard Database shows
various metrics of cost, enrollment, size, student debt, student demographics, and alumni success. It
describes almost every University, college, community college, trade school, and certificate program
in the United States. The data is current as of the 2020-2021 school year.
Description of all variables: See here
Detailed data file description: See here
With such a large dataset, to make your life easier, you may want to work with only a few
variables at a time. In the following code, we use the select command to select only the variables
name, median_debt, ownership, admit_rate, and hbcu and save them in a new dataframe called
example_dat.
[7]: ## Run this code but do not edit it
# Select certain columns from dat, store into example_dat
example_dat <- select(dat, name, median_debt, ownership, admit_rate, hbcu)

# Display head of example_dat
head(example_dat)

A data.frame: 6 × 5 (output cut off at the right edge)
  name                                 median_debt  ownership          admit_rate
  <chr>                                <dbl>        <chr>              <dbl>
1 Alabama A & M University             15.250       Public             89.65
2 University of Alabama at Birmingham  15.085       Public             80.60
3 Amridge University                   10.984       Private nonprofit  NA
4 University of Alabama in Huntsville  14.000       Public             77.11
5 Alabama State University             17.500       Public             98.88
6 The University of Alabama            17.671       Public             80.39
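The select command used above comes from the package loaded at the top of the notebook. As a rough sketch of the same idea in base R (an alternative, not the notebook's own method), you can pick columns by name with square brackets or with the select argument of subset. The dataframe toy and its values are made up for illustration.

```r
# Hypothetical toy dataframe with six variables
toy <- data.frame(
  name        = c("College A", "College B"),
  median_debt = c(12.5, 18.0),
  ownership   = c("Public", "Private nonprofit"),
  admit_rate  = c(80.5, NA),
  hbcu        = c("No", "Yes"),
  region      = c("South", "Midwest")
)

# Keep only three of the six columns, chosen by name
small <- toy[, c("name", "median_debt", "ownership")]

# subset() can do the same thing through its `select` argument
small2 <- subset(toy, select = c(name, median_debt, ownership))

names(small)
```

Either way, the original dataframe is left unchanged; the smaller version only exists because we stored it in a new object.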
1.3 - Use the select command to select the variables name, region, default_rate, ownership,
and pct_PELL from dat. Store your new dataframe in an object called my_dat and display its head.
[8]: # Your code goes here
my_dat <- select(dat, name, region, default_rate, ownership, pct_PELL)
head(my_dat)

A data.frame: 6 × 5 (output cut off at the right edge)
  name                                 region  default_rate  ownership          pct
  <chr>                                <chr>   <dbl>         <chr>              <db
1 Alabama A & M University             South   12.1          Public             70.9
2 University of Alabama at Birmingham  South   4.8           Public             33.9
3 Amridge University                   South   12.9          Private nonprofit  74.5
4 University of Alabama in Huntsville  South   4.7           Public             24.0
5 Alabama State University             South   12.8          Public             73.6
6 The University of Alabama            South   4.0           Public             17.1
In addition to filtering out columns (variables), we can also filter out rows (observations). For
example, if I only wanted to analyze colleges that are HBCUs and that have an admissions rate
below 40%, I can use the subset command on example_dat like this:
[9]: ## Run this code but do not edit it
# Subset example_dat to only HBCUs with admissions rates lower than 40%
subset(example_dat, hbcu == "Yes" & admit_rate < 40)

A data.frame: 7 × 5 (output cut off at the right edge)
     name                                            median_debt  ownership
     <chr>                                           <dbl>        <chr>
461  Delaware State University                       18.264       Public
473  Howard University                               19.500       Private nonprofit
491  Florida Agricultural and Mechanical University  18.750       Public
503  Florida Memorial University                     17.155       Private nonprofit
1376 Alcorn State University                         16.895       Public
1401 Rust College                                    11.226       Private nonprofit
2747 Hampton University                              18.500       Private nonprofit
A total of 7 colleges fit these conditions.
Note that R has different conventions for comparative statements. For example:
- == means equals exactly
- != means does not equal
- < means less than
- > means greater than
- <= means less than or equal to
- >= means greater than or equal to
Here are some other common conditional symbols:
- | means or
- & means and
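Here is a small sketch of how these symbols behave inside subset. The dataframe toy and its values are hypothetical. One detail worth noticing: when a comparison involves a missing value (NA) and R cannot decide whether the condition is true, subset leaves that row out.

```r
# Hypothetical toy dataframe; college D has a missing admissions rate
toy <- data.frame(
  name       = c("A", "B", "C", "D"),
  hbcu       = c("Yes", "Yes", "No", "No"),
  admit_rate = c(35, 60, 20, NA)
)

# & keeps rows where BOTH conditions are true
both <- subset(toy, hbcu == "Yes" & admit_rate < 40)

# | keeps rows where AT LEAST ONE condition is true
either <- subset(toy, hbcu == "Yes" | admit_rate < 40)

# != keeps rows that do NOT match the value
not_hbcu <- subset(toy, hbcu != "Yes")

both$name      # "A"
either$name    # "A" "B" "C"  (D is dropped: its condition evaluates to NA)
not_hbcu$name  # "C" "D"
```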
1.4 - Use the subset command to find the colleges in my_dat that are located in the Midwest
region of the United States and have more than a third of their students (greater than 33%) default
on their loans.
[10]: # Your code goes here
subset(my_dat, region == "Midwest" & default_rate > 33)

Check yourself: You should find that 2 schools match your selection criteria.
1.5 - What do you notice about the observations that fit your selection criteria? What do you
wonder?
Double-click to type a response:
Suppose you’re interested in a particular college, such as Howard University. We can use the subset
command to filter the example_dat dataframe and focus solely on the information pertaining to
that college.
[11]: ## Run this code but do not edit it
# Subset example_dat to only show Howard University
subset(example_dat, name == "Howard University")

A data.frame: 1 × 5
    name               median_debt  ownership          admit_rate  hbcu
    <chr>              <dbl>        <chr>              <dbl>       <chr>
473 Howard University  19.5         Private nonprofit  38.64       Yes
1.6 - Select a college that interests you. Then use the subset command to locate and extract
information about the college from my_dat. Note: The exact spelling of the names of all the
colleges in the dataset can be found here.
[12]: # Your code goes here
subset(my_dat, name == "Harvard University")

A data.frame: 1 × 5
     name                region     default_rate  ownership          pct_PELL
     <chr>               <chr>      <dbl>         <chr>              <dbl>
1164 Harvard University  Northeast  0.9           Private nonprofit  11.33
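Because == only matches an exact spelling, a lookup returns zero rows if the name is even slightly off. As a hedged sketch of one workaround, the base R command grepl searches for a piece of text inside each name. The dataframe toy and the college names in it are made up for illustration.

```r
# Hypothetical toy dataframe of college names
toy <- data.frame(
  name       = c("Maple University", "Maple College", "Oak University"),
  admit_rate = c(40, 75, 90)
)

# Exact match: the full name must be spelled exactly
exact <- subset(toy, name == "Maple University")

# Partial match: keep every row whose name contains "Maple"
partial <- subset(toy, grepl("Maple", name))

# Partial match that ignores upper/lower case
any_case <- subset(toy, grepl("oak", name, ignore.case = TRUE))

exact$name
partial$name
any_case$name
```

A partial match can return several colleges at once, so it works best as a way to discover the exact spelling before switching back to ==.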
One further way to explore a dataset is to reorder its observations. For example, we can use the
arrange command to order the colleges in example_dat by their admission rate:

[13]: ## Run this code but do not edit it


# Arrange data in order of their admission rates
arrange(example_dat, admit_rate)

A data.frame: 4435 × 5 (output cut off at the right edge)
name                                             median_debt  ownersh
<chr>                                            <dbl>        <chr>
Curtis Institute of Music                        16.250       Private
Harvard University                               12.072       Private
Stanford University                              11.000       Private
Princeton University                             10.355       Private
Yale University                                  12.000       Private
Columbia University in the City of New York      19.250       Private
California Institute of Technology               9.867        Private
Massachusetts Institute of Technology            12.000       Private
University of Chicago                            13.000       Private
The Juilliard School                             25.000       Private
Brown University                                 12.000       Private
Duke University                                  12.500       Private
Pomona College                                   10.000       Private
University of Pennsylvania                       14.000       Private
Swarthmore College                               14.000       Private
Bowdoin College                                  14.000       Private
Dartmouth College                                14.500       Private
Northwestern University                          14.000       Private
Colby College                                    17.500       Private
Cornell University                               13.108       Private
Rice University                                  10.500       Private
Johns Hopkins University                         11.750       Private
Tulane University of Louisiana                   19.000       Private
Vanderbilt University                            12.420       Private
Amherst College                                  12.000       Private
Circle in the Square Theatre School              16.000       Private
Claremont McKenna College                        12.070       Private
Colorado College                                 15.045       Private
Barnard College                                  16.250       Private
Bates College                                    12.610       Private
⋮                                                ⋮            ⋮
National Personal Training Institute-Tampa       6.333        Private
Mobile Technical Training                        3.800        Private
California Institute of Arts & Technology        9.500        Private
Elite Cosmetology Barber & Spa Academy           6.054        Private
Gwinnett Institute                               9.500        Private
Manuel and Theresa’s School of Hair Design       6.494        Private
Peloton College                                  9.500        Private
Ross Medical Education Center - Kalamazoo        8.089        Private
Ross College-Canton                              8.347        Private
Ross College-Grand Rapids North                  7.125        Private
American Institute-Somerset                      9.176        Private
Bull City Durham Beauty and Barber College       9.833        Private
Fortis College-Cutler Bay                        12.667       Private
Unitech Training Academy-Baton Rouge             6.991        Private
Empire Beauty School-Tampa                       7.917        Private
Empire Beauty School-Lakeland                    7.667        Private
Galen College of Nursing-ARH                     16.500       Private
Tricoci University of Beauty Culture-Janesville  8.468        Private
Lynnes Welding Training-Bismarck                 3.385        Private
No Grease Barber School                          9.833        Private
As we can see, the most selective schools now top the list. You’ll see some NA values from
admit_rate at the bottom of the arranged dataset. These are missing values, which we’ll
discuss later.
To arrange the data in descending order of admission rates (highest admission rates on top), we
can use the desc argument within our arrange command:

[14]: ## Run this code but do not edit it


# Arrange data in descending order of their admission rates
arrange(example_dat, desc(admit_rate))

A data.frame: 4435 × 5 (output cut off at the right edge)
name                                                                       med
<chr>                                                                      <db
University of Arkansas Community College-Morrilton                         6.25
Design Institute of San Diego                                              31.0
Naropa University                                                          16.3
VanderCook College of Music                                                27.0
Saint Elizabeth School of Nursing                                          20.2
Maharishi International University                                         13.0
Grace Christian University                                                 9.70
Sacred Heart Major Seminary                                                7.34
JFK Muhlenberg Harold B. and Dorothy A. Snyder Schools                     15.7
Arnot Ogden Medical Center                                                 11.7
Neighborhood Playhouse School of the Theater                               12.0
Samaritan Hospital School of Nursing                                       14.2
Trinity Bible College and Graduate School                                  12.8
Trinity Health System School of Nursing                                    13.6
Warner Pacific University                                                  24.3
New Castle School of Trades                                                8.72
Saint Charles Borromeo Seminary-Overbrook                                  16.5
Universidad Adventista de las Antillas                                     11.8
Greene County Career and Technology Center                                 16.3
Western Area Career & Technology Center                                    16.5
Hussian College-Daymar College Clarksville                                 9.50
Eastern Center for Arts and Technology                                     8.55
Greater Lowell Technical School                                            5.50
Cass Career Center                                                         9.50
Orange Ulster BOCES-Practical Nursing Program                              11.8
Washington Saratoga Warren Hamilton Essex BOCES-Practical Nursing Program  12.8
Mifflin County Academy of Science and Technology                           11.7
Living Arts College                                                        10.0
Cayuga Onondaga BOCES-Practical Nursing Program                            7.70
Delaware County Technical School-Practical Nursing Program                 16.5
⋮                                                                          ⋮
National Personal Training Institute-Tampa                                 6.33
Mobile Technical Training                                                  3.80
California Institute of Arts & Technology                                  9.50
Elite Cosmetology Barber & Spa Academy                                     6.05
Gwinnett Institute                                                         9.50
Manuel and Theresa’s School of Hair Design                                 6.49
Peloton College                                                            9.50
Ross Medical Education Center - Kalamazoo                                  8.08
Ross College-Canton                                                        8.34
Ross College-Grand Rapids North                                            7.12
American Institute-Somerset                                                9.17
Bull City Durham Beauty and Barber College                                 9.83
Fortis College-Cutler Bay                                                  12.6
Unitech Training Academy-Baton Rouge                                       6.99
Empire Beauty School-Tampa                                                 7.91
Empire Beauty School-Lakeland                                              7.66
Galen College of Nursing-ARH                                               16.5
Tricoci University of Beauty Culture-Janesville                            8.46
Lynnes Welding Training-Bismarck                                           3.38
No Grease Barber School                                                    9.83
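Notice that the same colleges sit at the bottom of both arranged outputs: rows with a missing admit_rate go last whichever direction we sort. The sketch below shows the same behavior with the base R command order, which is an alternative to arrange rather than the notebook's own method. The dataframe toy and its values are hypothetical.

```r
# Hypothetical toy dataframe; college B has a missing admissions rate
toy <- data.frame(
  name       = c("A", "B", "C", "D"),
  admit_rate = c(75, NA, 5, 40)
)

# Ascending order: lowest admissions rate first
asc <- toy[order(toy$admit_rate), ]

# Descending order: highest admissions rate first
dsc <- toy[order(toy$admit_rate, decreasing = TRUE), ]

asc$name  # "C" "D" "A" "B"
dsc$name  # "A" "D" "C" "B"
```

In both results, college B (the one with NA) is last, because order places missing values at the end by default.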
1.7 - Use the arrange command to organize the colleges in my_dat such that the colleges with the
highest student loan default rates are at the top.
[15]: # Your code goes here
arrange(my_dat, desc(default_rate))

1.8 - What patterns do you notice among the programs that have the highest student loan default
rates? What do you wonder?
Double-click to type a response:
Reference Guide for R (student resource) - Now that you’ve seen a number of different
commands in R, check out our reference guide for a full listing of useful R commands for this
project.

0.1.4 2.0 - Finding summary statistics


When analyzing variables of interest, it’s often helpful to calculate summary statistics. For
quantitative variables, we can use the summary command to find the five-number summary (minimum,
Q1, median, Q3, maximum) and the average (mean) of the values. The code block shows how we
find these summary statistics for the admit_rate variable.
Note: The $ sign in R is used to isolate a single variable (admit_rate) from a full dataframe
(dat).

[16]: ## Run this code but do not edit it


# Find summary statistics for admit_rate
summary(dat$admit_rate)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


2.44 59.79 74.68 70.81 86.11 100.00 2731
A few interesting facts about admit_rate that are revealed by this summary:
- As expected, no schools have a 0% admissions rate (the minimum admissions rate is 2.44%).
- The maximum admissions rate was 100%. So, there’s at least one school that admits every
applicant.
- The first quartile (Q1) is a 59.79% admissions rate. This means only 25% of schools have
admissions rates lower than 59.79%.
- For 2,731 schools, we have missing data. R uses the symbol NA to represent missing values. If
we use admit_rate in future analyses, we should pay attention to which schools have missing
data and, ideally, investigate why their data is missing.
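To see how missing values interact with summary statistics, here is a minimal sketch on a made-up vector (the name rates and its values are hypothetical). summary reports the NA count separately, while commands like mean return NA unless we explicitly ask them to skip missing values with na.rm = TRUE.

```r
# Hypothetical admissions rates, two of them missing
rates <- c(2.5, 60, 75, 86, 100, NA, NA)

# Five-number summary, mean, and a count of NA values
summary(rates)

# Count the missing values directly
n_missing <- sum(is.na(rates))

# mean() returns NA if any value is missing...
mean_with_na <- mean(rates)

# ...unless we tell it to remove missing values first
mean_clean   <- mean(rates, na.rm = TRUE)
median_clean <- median(rates, na.rm = TRUE)

n_missing     # 2
mean_with_na  # NA
mean_clean    # 64.7
median_clean  # 75
```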
2.1 - Use the summary command to get summary statistics for the default_rate variable in the
dat dataframe.
[17]: # Your code goes here
summary(dat$default_rate)

Min. 1st Qu. Median Mean 3rd Qu. Max.


0.00 4.40 8.20 9.06 12.30 57.10
Check yourself: The median should be 8.20
2.2 - Comment on what these summary statistics reveal about the default_rate values in our
dataset.
Double-click to type a response:
For categorical data, it doesn’t make sense to find means and medians. Instead, it’s helpful to look
at value counts and proportions. We can use the table command to find the counts of the different
values for highest_degree:

[18]: ## Run this code but do not edit it
# Find counts of values for highest_degree, store in object 'degree_counts'
degree_counts <- table(dat$highest_degree)

# Print table stored in 'degree_counts'


degree_counts

Associates Bachelors Certificate Graduate


1096 501 1374 1464
1464 of the institutions in our dataset are Universities that offer graduate degrees. On the other end
of the spectrum, 1374 of the institutions aren’t Universities at all. Rather, they are career-oriented
programs that offer trade certificates.
To get a better sense of scale, we can turn these raw counts into proportions by dividing them by
the total:
[19]: ## Run this code but do not edit it
# Sum all counts in table, store in object 'total'
total <- sum(degree_counts)

# Print the value stored in 'total'


total

4435
[20]: ## Run this code but do not edit it
# Divide the table by the total to get proportions
degree_counts / total

Associates Bachelors Certificate Graduate


0.2471251 0.1129651 0.3098083 0.3301015
As you can see, you can use R just like a calculator. Addition, subtraction, multiplication, division
… it’s all there. Universities offering graduate degrees make up about 33% of the institutions in our
dataset. These are about three times more prevalent than 4-year colleges (Bachelors) that don’t
offer graduate degrees.
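The "about three times" comparison can be checked with the same calculator-style arithmetic. This sketch retypes the four counts printed above into a named vector, so it runs without the dataset; the object names props and ratio are just illustrations.

```r
# Counts copied from the highest_degree table above
degree_counts <- c(Associates = 1096, Bachelors = 501,
                   Certificate = 1374, Graduate = 1464)

# Proportions: each count divided by the total (4435)
props <- degree_counts / sum(degree_counts)
round(props, 3)

# How many times more common are Graduate institutions than Bachelors-only ones?
ratio <- degree_counts[["Graduate"]] / degree_counts[["Bachelors"]]
round(ratio, 2)  # 2.92
```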
2.3 - Use the table command to get the value counts for the ownership variable.
[23]: # Your code goes here
degCounts <- table(dat$ownership)
degCounts

Private for-profit Private nonprofit Public


1684 1212 1539
Check yourself: There are 1539 public schools in the dataset

2.4 - Find the proportion of all institutions that are public, private nonprofit, and private for-profit.
[24]: # Your code goes here
total2 <- sum(degCounts)
total2

4435
Check yourself: About 34.7% of the schools in the dataset are public schools
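Finding the total is only the first half of the job; the proportions come from dividing each count by that total. One possible way to finish, sketched here in base R, uses prop.table, which does the division for us. To keep the example self-contained, it rebuilds a stand-in ownership vector from the three counts printed above instead of reading dat; the object names are hypothetical.

```r
# Rebuild a stand-in ownership variable from the counts shown in 2.3
ownership <- rep(
  c("Private for-profit", "Private nonprofit", "Public"),
  times = c(1684, 1212, 1539)
)

# Counts of each value
own_counts <- table(ownership)

# Proportions: prop.table divides every count by the table's total
own_props <- prop.table(own_counts)
round(own_props, 3)

# Proportion of public institutions
public_share <- as.numeric(own_props["Public"])
round(public_share, 3)  # 0.347
```

Dividing the table by its sum, as in the highest_degree example, gives the same proportions.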

0.1.5 3.0 - Visualizing data (histograms, barplots, and boxplots)


In addition to summary statistics, a great way to get an overall impression of our data is to visualize
it. In this section, we’ll walk through different types of visualizations we can create in R. Note:
We’re saving scatterplots for the next notebook in our series.
One of the most useful visualizations for displaying a quantitative variable is a histogram. Here,
we use the gf_histogram command to display the histogram for admit_rate.

[25]: ## Run this code but do not edit it


# Create histogram for admit_rate
gf_histogram(~admit_rate, data = dat)

Warning message:
“Removed 2731 rows containing non-finite outside the scale range
(`stat_bin()`).”

Note: A warning message was displayed about removing rows. This is R telling us that it’s
choosing not to visualize the missing data values (NA) that we discovered for admit_rate earlier in
the notebook.
As we suspected from the summary statistics, it appears that most programs have admissions rates
well above 50%, and only a small subset of programs have highly selective admissions rates. In
statistics, we call this distribution left skew, since there’s a tail on the left side. So, institutions with
low values (low admissions rates) are relatively unusual compared to most of the other institutions
in our dataset.
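One numeric hint of left skew is that the long left tail pulls the mean below the median, which matches the admit_rate summary earlier (mean 70.81, median 74.68). The sketch below illustrates this on a made-up vector. It uses the base R command hist with plot = FALSE to count values per bin without drawing anything; that is a stand-in for gf_histogram, not the notebook's own method.

```r
# Hypothetical admissions rates with a tail of low values
x <- c(5, 40, 60, 70, 75, 80, 85, 90, 95, 100)

# Count how many values fall in each bin of width 25 (no plot is drawn)
h <- hist(x, breaks = seq(0, 100, by = 25), plot = FALSE)
h$counts   # 1 1 3 5  (bars get taller toward the right)

# In a left-skewed distribution, the mean tends to sit below the median
mean(x)    # 70
median(x)  # 77.5
```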
3.1 - Create a histogram to visualize all the default_rate values in the dat dataframe.
[26]: # Your code goes here
gf_histogram(~default_rate, data = dat)

3.2 - Describe the distribution and note any features of interest.


Double-click to type a response: skewed right, with potential high outliers
To visualize categorical variables, we can use the gf_bar command to make bar plots. Here we
create a bar plot for highest_degree:

[27]: ## Run this code but do not edit it
# Create bar plot for highest_degree
gf_bar(~highest_degree, data = dat)

As shown here, most of the institutions in our dataset are Universities that offer graduate degrees
or trade programs that offer professional certificates. There are about 500 colleges that only offer
bachelors degrees (without offering graduate degrees).
3.3 - Create a bar plot to visualize the ownership values from the dat dataframe.
[28]: # Your code goes here
gf_bar(~ownership, data = dat)

3.4 - Describe the distribution and note any features of interest.
Double-click to type a response:
Sometimes, we may want to explore the relationship between two variables by visualizing them both
at once. When we want to visualize the relationship between a categorical variable and quantitative
variable, we can use boxplots. Here, we show how to use gf_boxplot to visualize the relationship
between highest_degree (categorical) and admit_rate (quantitative).

[29]: ## Run this code but do not edit it


# Create boxplots for admit_rates of institutions with different highest_degree values
gf_boxplot(admit_rate ~ highest_degree, data = dat)

Warning message:
“Removed 2731 rows containing non-finite outside the scale range
(`stat_boxplot()`).”

In this case, we’re using highest_degree as the predictor variable and admit_rate as the
outcome variable. In other words, we can use the degree level of an institution (certificate,
associates, bachelors, etc.) to help predict its admission rate. That’s because certain levels of
institutions typically have lower admissions rates than others. So, knowing the level of an institution
can help us better predict its admissions rate.
Note: This predictor-outcome relationship is coded in R through the syntax outcome ~
predictor, as in gf_boxplot(admit_rate ~ highest_degree,...).
We see that admission rates tend to be lower (lower medians) for colleges / Universities that grant
bachelors and graduate degrees. However, it’s worth noting that for every institution type, the
first quartile is higher than a 50% admissions rate. So, at least 75% of programs of each type admit
more than half their applicants. Indeed, we see that the most prestigious Universities
with admissions rates lower than 25% are outliers (visualized as dots on the boxplot) among other
Universities that offer graduate degrees.
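To see the numbers behind a boxplot, here is a hedged sketch that uses the base R command boxplot with plot = FALSE, which returns each group's five-number summary and any outliers without drawing anything. It is a stand-in for gf_boxplot, and the dataframe toy with its values is made up. The outcome ~ predictor syntax is the same.

```r
# Hypothetical toy data: 5 certificate programs and 5 graduate institutions
toy <- data.frame(
  highest_degree = rep(c("Certificate", "Graduate"), each = 5),
  admit_rate     = c(70, 80, 90, 95, 100,
                     10, 60, 65, 70, 75)
)

# outcome ~ predictor: admit_rate summarized within each highest_degree group
b <- boxplot(admit_rate ~ highest_degree, data = toy, plot = FALSE)

b$names       # group labels: "Certificate" "Graduate"
b$stats[3, ]  # medians of each group: 90 65
b$out         # values flagged as outliers: 10
b$group       # which group each outlier belongs to: 2 (Graduate)
```

In this toy example, the one graduate institution with a 10% admissions rate falls far enough below the rest of its group to be flagged as an outlier, which is what the dots on the boxplot represent.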
3.5 - Create boxplots to visualize the relationship between ownership and default_rate from the
dat dataframe.
[30]: # Your code goes here
gf_boxplot(default_rate ~ ownership, data = dat)

3.6 - Using your boxplot visualization, describe the relationship between institution ownership and
student loan default rates.
Double-click to type a response:

0.1.6 Feedback (Required)


Please take 2 minutes to fill out this anonymous notebook feedback form, so we can continue
improving this notebook for future years!
