Notebook 1 - Basic R & Data Exploration

May 23, 2024

0.1 Data Science Project: Use data to determine the best and worst colleges
for conquering student debt.
0.1.1 Notebook 1: Basic R Commands & Data Exploration
Does college pay off? We’ll use some of the latest data from the US Department of Education’s
College Scorecard Database to answer that question. In this first notebook, you’ll get a gentle
introduction to R - a coding language used by data scientists to analyze large datasets. Then,
you’ll begin diving into the college scorecard data yourself. By the end of this notebook, you’ll get
a general sense of which colleges set up their graduates for success and which colleges … don’t.
[3]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command loads a useful package of R commands
library(coursekata)

0.1.2 1.0 - Exploring the dataset

To begin, let’s download our data. Our full dataset is included in a file named colleges.csv,
which we’re retrieving from the skewthescript.org website. The command below downloads the
data from the file and stores it into an R dataframe object called dat.
[4]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command downloads data and stores it in the object `dat`
dat <- read.csv('https://skewthescript.org/s/colleges.csv')

The <- operator is used to store values. For example, x <- 10 stores the value 10 in the object x.
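As a minimal sketch of assignment (the object names x and y below are illustrative and are not part of the project data):

```r
# Store the value 10 in the object x
x <- 10

# Use the stored value in a calculation and store the result in y
y <- x * 2

# Typing an object's name prints its stored value
x
y
```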
To get a quick view of the dataframe (dat), we can use the head command to print out its first
several rows.
[5]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command prints out the first several rows of the dataset
head(dat)

A data.frame: 6 × 26
  OPEID   name                                 city        state region …
  <int>   <chr>                                <chr>       <chr> <chr>  …
1 100200  Alabama A & M University             Normal      AL    South
2 105200  University of Alabama at Birmingham  Birmingham  AL    South
3 2503400 Amridge University                   Montgomery  AL    South
4 105500  University of Alabama in Huntsville  Huntsville  AL    South
5 100500  Alabama State University             Montgomery  AL    South
6 105100  The University of Alabama            Tuscaloosa  AL    South
The vertical columns of the dataframe are called variables, and their elements are called values.
For example, the variable city has values Normal, Birmingham, Montgomery, Huntsville, etc.
The horizontal rows of the dataframe are called observations. For example, the first observation
is Alabama A & M University, which is located in AL (Alabama), in the city of Normal, and has a
median student debt of $15,250. For this dataframe, each observation describes a specific college.
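To make the vocabulary concrete, here is a small sketch using a hypothetical three-row dataframe called toy. The college names and values are made up for illustration and do not come from colleges.csv.

```r
# A tiny hypothetical dataframe (toy values, not taken from colleges.csv)
toy <- data.frame(
  name  = c("College A", "College B", "College C"),
  state = c("AL", "AL", "NY"),
  debt  = c(15.2, 10.9, 12.0)
)

# The column names are the variables
names(toy)

# Each row is one observation; this prints the first observation
toy[1, ]

# The values of a single variable
toy$state
```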
1.1 - Of the variables displayed, identify one that is quantitative, one that is categorical, and one
that is a unique identifier.
Double-click to type a response: city -> categorical, median_debt -> quantitative, OPEID ->
unique identifier
The head command only displays several rows of the dataframe. To see the full dimensions of the
dataframe, we can use the dim command.
1.2 - Use the dim command on dat to display the dimensions of the dataframe.
[6]: # Your code goes here
dim(dat)

1. 4435 2. 26
Check yourself: Your code should have printed out two numbers: 4435 and 26.
The first number outputted by dim is the number of horizontal rows in the dataframe. This
represents the number of observations (number of colleges). The second number is the number of
vertical columns in the dataframe. This represents the number of variables. What are all these
variables? See the description of the dataset below, along with links to descriptions of all the
variables.
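If you only need one of the two numbers, base R also provides nrow and ncol. The sketch below uses a hypothetical dataframe (toy_scores) rather than dat, so it runs without downloading anything.

```r
# Hypothetical dataframe with 4 observations and 2 variables
toy_scores <- data.frame(id = 1:4, score = c(90, 85, 70, 60))

# dim returns the number of rows first, then the number of columns
dim(toy_scores)

# nrow and ncol return each number on its own
nrow(toy_scores)
ncol(toy_scores)
```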

0.1.3 The Dataset

General description - The US Department of Education’s College Scorecard Database shows
various metrics of cost, enrollment, size, student debt, student demographics, and alumni success. It
describes almost every University, college, community college, trade school, and certificate program
in the United States. The data is current as of the 2020-2021 school year.
Description of all variables: See here
Detailed data file description: See here
With such a large dataset, to make your life easier, you may want to work with only a few
variables at a time. In the following code, we use the select command to select only the variables
name, median_debt, ownership, admit_rate, and hbcu and save them in a new dataframe called
example_dat.
[7]: ## Run this code but do not edit it
# Select certain columns from dat, store into example_dat
example_dat <- select(dat, name, median_debt, ownership, admit_rate, hbcu)

# Display head of example_dat

head(example_dat)

A data.frame: 6 × 5
  name                                 median_debt ownership         admit_rate …
  <chr>                                <dbl>       <chr>             <dbl>      …
1 Alabama A & M University             15.250      Public            89.65
2 University of Alabama at Birmingham  15.085      Public            80.60
3 Amridge University                   10.984      Private nonprofit NA
4 University of Alabama in Huntsville  14.000      Public            77.11
5 Alabama State University             17.500      Public            98.88
6 The University of Alabama            17.671      Public            80.39
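For comparison, the same idea can be sketched in base R by indexing a dataframe with a vector of column names. The dataframe toy_cols and its values are hypothetical; this is an alternative illustration, not the method the notebook uses.

```r
# Hypothetical dataframe with four variables
toy_cols <- data.frame(
  name  = c("College A", "College B"),
  debt  = c(15.2, 10.9),
  state = c("AL", "NY"),
  hbcu  = c("Yes", "No")
)

# Keep only the name and debt columns (all rows are kept)
toy_small <- toy_cols[, c("name", "debt")]

# Display the head of the smaller dataframe
head(toy_small)
```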
1.3 - Use the select command to select the variables name, region, default_rate, ownership,
and pct_PELL from dat. Store your new dataframe in an object called my_dat and display its head.
[8]: # Your code goes here
my_dat <- select(dat, name, region, default_rate, ownership, pct_PELL)
head(my_dat)

A data.frame: 6 × 5
  name                                 region default_rate ownership         …
  <chr>                                <chr>  <dbl>        <chr>             …
1 Alabama A & M University             South  12.1         Public
2 University of Alabama at Birmingham  South  4.8          Public
3 Amridge University                   South  12.9         Private nonprofit
4 University of Alabama in Huntsville  South  4.7          Public
5 Alabama State University             South  12.8         Public
6 The University of Alabama            South  4.0          Public
In addition to selecting columns (variables), we can also filter rows (observations). For
example, if I only wanted to analyze colleges that are HBCUs and that have an admissions rate
below 40%, I could use the subset command on example_dat like this:
[9]: ## Run this code but do not edit it
# Subset example_dat to only HBCUs with admissions rates lower than 40%
subset(example_dat, hbcu == "Yes" & admit_rate < 40)

A data.frame: 7 × 5
     name                                            median_debt ownership         …
     <chr>                                           <dbl>       <chr>             …
461  Delaware State University                       18.264      Public
473  Howard University                               19.500      Private nonprofit
491  Florida Agricultural and Mechanical University  18.750      Public
503  Florida Memorial University                     17.155      Private nonprofit
1376 Alcorn State University                         16.895      Public
1401 Rust College                                    11.226      Private nonprofit
2747 Hampton University                              18.500      Private nonprofit
A total of 7 colleges fit these conditions.
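One detail worth knowing when a condition involves a variable with missing values: subset drops rows where the condition evaluates to NA. The sketch below shows this on a hypothetical dataframe (toy_hbcu, with made-up values), alongside an equivalent base R indexing approach.

```r
# Hypothetical data: college D has a missing admissions rate
toy_hbcu <- data.frame(
  name       = c("A", "B", "C", "D"),
  hbcu       = c("Yes", "Yes", "No", "Yes"),
  admit_rate = c(35, 80, 20, NA)
)

# subset keeps only rows where the condition is TRUE (row A here)
subset(toy_hbcu, hbcu == "Yes" & admit_rate < 40)

# The same condition as a logical vector: TRUE FALSE FALSE NA
keep <- toy_hbcu$hbcu == "Yes" & toy_hbcu$admit_rate < 40
keep

# which() returns the positions that are TRUE, skipping the NA
toy_hbcu[which(keep), ]
```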
Note that R has its own conventions for comparison operators. For example:
- == means equals exactly
- != means does not equal
- < means less than
- > means greater than
- <= means less than or equal to
- >= means greater than or equal to
Here are some other common conditional symbols:
- | means or
- & means and
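These operators can be tried directly on a small vector. The vector rates below is a hypothetical example; each comparison returns one TRUE or FALSE per element.

```r
# Three hypothetical admissions rates
rates <- c(25, 40, 55)

rates == 40               # FALSE TRUE FALSE
rates != 40               # TRUE FALSE TRUE
rates < 40                # TRUE FALSE FALSE
rates >= 40               # FALSE TRUE TRUE

# Combine conditions with | (or) and & (and)
rates < 30 | rates > 50   # TRUE FALSE TRUE
rates > 30 & rates < 50   # FALSE TRUE FALSE
```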
1.4 - Use the subset command to find the colleges in my_dat that are located in the Midwest
region of the United States and have more than a third of their students (greater than 33%) default
on their loans.
[10]: # Your code goes here
subset(my_dat, region == "Midwest" & default_rate > 33)

Check yourself: You should find that 2 schools match your selection criteria.
1.5 - What do you notice about the observations that fit your selection criteria? What do you
wonder?
Double-click to type a response:
Suppose you’re interested in a particular college, such as Howard University. We can use the subset
command to filter the example_dat dataframe and focus solely on the information pertaining to
that college.
[11]: ## Run this code but do not edit it
# Subset example_dat to only show Howard University
subset(example_dat, name == "Howard University")

A data.frame: 1 × 5
    name              median_debt ownership         admit_rate hbcu
    <chr>             <dbl>       <chr>             <dbl>      <chr>
473 Howard University 19.5        Private nonprofit 38.64      Yes
1.6 - Select a college that interests you. Then use the subset command to locate and extract
information about the college from my_dat. Note: The exact spelling of the names of all the
colleges in the dataset can be found here.
[12]: # Your code goes here
subset(my_dat, name == "Harvard University")

A data.frame: 1 × 5
     name               region    default_rate ownership         pct_PELL
     <chr>              <chr>     <dbl>        <chr>             <dbl>
1164 Harvard University Northeast 0.9          Private nonprofit 11.33
One further way to explore a dataset is to reorder its observations. For example, we can use the
arrange command to order the colleges in example_dat by their admission rate:

[13]: ## Run this code but do not edit it

# Arrange data in order of their admission rates
arrange(example_dat, admit_rate)

A data.frame: 4435 × 5
name                                             median_debt ownership …
<chr>                                            <dbl>       <chr>     …
Curtis Institute of Music                        16.250      Private…
Harvard University                               12.072      Private…
Stanford University                              11.000      Private…
Princeton University                             10.355      Private…
Yale University                                  12.000      Private…
Columbia University in the City of New York      19.250      Private…
California Institute of Technology               9.867       Private…
Massachusetts Institute of Technology            12.000      Private…
University of Chicago                            13.000      Private…
The Juilliard School                             25.000      Private…
Brown University                                 12.000      Private…
Duke University                                  12.500      Private…
Pomona College                                   10.000      Private…
University of Pennsylvania                       14.000      Private…
Swarthmore College                               14.000      Private…
Bowdoin College                                  14.000      Private…
Dartmouth College                                14.500      Private…
Northwestern University                          14.000      Private…
Colby College                                    17.500      Private…
Cornell University                               13.108      Private…
Rice University                                  10.500      Private…
Johns Hopkins University                         11.750      Private…
Tulane University of Louisiana                   19.000      Private…
Vanderbilt University                            12.420      Private…
Amherst College                                  12.000      Private…
Circle in the Square Theatre School              16.000      Private…
Claremont McKenna College                        12.070      Private…
Colorado College                                 15.045      Private…
Barnard College                                  16.250      Private…
Bates College                                    12.610      Private…
⋮                                                ⋮           ⋮
National Personal Training Institute-Tampa       6.333       Private…
Mobile Technical Training                        3.800       Private…
California Institute of Arts & Technology        9.500       Private…
Elite Cosmetology Barber & Spa Academy           6.054       Private…
Gwinnett Institute                               9.500       Private…
Manuel and Theresa’s School of Hair Design       6.494       Private…
Peloton College                                  9.500       Private…
Ross Medical Education Center - Kalamazoo        8.089       Private…
Ross College-Canton                              8.347       Private…
Ross College-Grand Rapids North                  7.125       Private…
American Institute-Somerset                      9.176       Private…
Bull City Durham Beauty and Barber College       9.833       Private…
Fortis College-Cutler Bay                        12.667      Private…
Unitech Training Academy-Baton Rouge             6.991       Private…
Empire Beauty School-Tampa                       7.917       Private…
Empire Beauty School-Lakeland                    7.667       Private…
Galen College of Nursing-ARH                     16.500      Private…
Tricoci University of Beauty Culture-Janesville  8.468       Private…
Lynnes Welding Training-Bismarck                 3.385       Private…
No Grease Barber School                          9.833       Private…
As we can see, the most selective schools now top the list. You’ll see some NA values from
admit_rate at the bottom of the arranged dataset. These are missing values, which we’ll
discuss later.
To arrange the data in descending order of admission rates (highest admission rates on top), we
can use the desc argument within our arrange command:

[14]: ## Run this code but do not edit it

# Arrange data in descending order of their admission rates
arrange(example_dat, desc(admit_rate))

A data.frame: 4435 × 5
name   …
<chr>  …
University of Arkansas Community College-Morrilton
Design Institute of San Diego
Naropa University
VanderCook College of Music
Saint Elizabeth School of Nursing
Maharishi International University
Grace Christian University
Sacred Heart Major Seminary
JFK Muhlenberg Harold B. and Dorothy A. Snyder Schools
Arnot Ogden Medical Center
Neighborhood Playhouse School of the Theater
Samaritan Hospital School of Nursing
Trinity Bible College and Graduate School
Trinity Health System School of Nursing
Warner Pacific University
New Castle School of Trades
Saint Charles Borromeo Seminary-Overbrook
Universidad Adventista de las Antillas
Greene County Career and Technology Center
Western Area Career & Technology Center
Hussian College-Daymar College Clarksville
Eastern Center for Arts and Technology
Greater Lowell Technical School
Cass Career Center
Orange Ulster BOCES-Practical Nursing Program
Washington Saratoga Warren Hamilton Essex BOCES-Practical Nursing Program
Mifflin County Academy of Science and Technology
Living Arts College
Cayuga Onondaga BOCES-Practical Nursing Program
Delaware County Technical School-Practical Nursing Program
⋮
National Personal Training Institute-Tampa
Mobile Technical Training
California Institute of Arts & Technology
Elite Cosmetology Barber & Spa Academy
Gwinnett Institute
Manuel and Theresa’s School of Hair Design
Peloton College
Ross Medical Education Center - Kalamazoo
Ross College-Canton
Ross College-Grand Rapids North
American Institute-Somerset
Bull City Durham Beauty and Barber College
Fortis College-Cutler Bay
Unitech Training Academy-Baton Rouge
Empire Beauty School-Tampa
Empire Beauty School-Lakeland
Galen College of Nursing-ARH
Tricoci University of Beauty Culture-Janesville
Lynnes Welding Training-Bismarck
No Grease Barber School
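In both arranged outputs above, the same schools appear at the bottom: rows with a missing admit_rate stay last whether the order is ascending or descending. The base R order function behaves the same way by default, as this sketch on a hypothetical dataframe (toy_rates, made-up values) shows. It is offered as an illustration, not as the command the notebook uses.

```r
# Hypothetical data: college B has a missing admissions rate
toy_rates <- data.frame(
  name       = c("A", "B", "C"),
  admit_rate = c(80, NA, 35)
)

# Ascending order: C (35), A (80), then B (NA) last
toy_rates[order(toy_rates$admit_rate), ]

# Descending order: A (80), C (35), and B (NA) is still last
toy_rates[order(toy_rates$admit_rate, decreasing = TRUE), ]
```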
1.7 - Use the arrange command to organize the colleges in my_dat such that the colleges with the
highest student loan default rates are at the top.
[15]: # Your code goes here
arrange(my_dat, desc(default_rate))

1.8 - What patterns do you notice among the programs that have the highest student loan default
rates? What do you wonder?
Double-click to type a response:
Reference Guide for R (student resource) - Now that you’ve seen a number of different
commands in R, check out our reference guide for a full listing of useful R commands for this
project.

0.1.4 2.0 - Finding summary statistics

When analyzing variables of interest, it’s often helpful to calculate summary statistics. For
quantitative variables, we can use the summary command to find the five-number summary (minimum,
Q1, median, Q3, maximum) and the average (mean) of the values. The code block shows how we
find these summary statistics for the admit_rate variable.
Note: The $ sign in R is used to isolate a single variable (admit_rate) from a full dataframe
(dat).

[16]: ## Run this code but do not edit it

# Find summary statistics for admit_rate
summary(dat$admit_rate)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   2.44   59.79   74.68   70.81   86.11  100.00    2731
A few interesting facts about admit_rate that are revealed by this summary:
- As expected, no schools have a 0% admissions rate (the minimum admissions rate is 2.44%).
- The maximum admissions rate was 100%. So, there’s at least one school that admits every applicant.
- The first quartile (Q1) is a 59.79% admissions rate. This means only 25% of the schools with
reported admissions rates have rates lower than 59.79%.
- For 2,731 schools, we have missing data. R uses the symbol NA to represent missing values. If we
use admit_rate in future analyses, we should pay attention to which schools have missing data and,
ideally, investigate why their data is missing.
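Missing values matter beyond summary: functions such as mean and median return NA when any value is missing, unless told to remove the missing values first. A minimal sketch on a hypothetical vector (toy_admit, made-up values):

```r
# Hypothetical admissions rates with one missing value
toy_admit <- c(10, 20, NA, 40, 50)

# summary reports the count of missing values in a separate NA's column
summary(toy_admit)

# Count the missing values directly
sum(is.na(toy_admit))              # 1

# mean returns NA if any value is missing ...
mean(toy_admit)                    # NA

# ... unless na.rm = TRUE removes the missing values first
mean(toy_admit, na.rm = TRUE)      # 30
median(toy_admit, na.rm = TRUE)    # 30
```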
2.1 - Use the summary command to get summary statistics for the default_rate variable in the
dat dataframe.
[17]: # Your code goes here
summary(dat$default_rate)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    4.40    8.20    9.06   12.30   57.10
Check yourself: The median should be 8.20
2.2 - Comment on what these summary statistics reveal about the default_rate values in our
dataset.
Double-click to type a response:
For categorical data, it doesn’t make sense to find means and medians. Instead, it’s helpful to look
at value counts and proportions. We can use the table command to find the counts of the different
values for highest_degree:

[18]: ## Run this code but do not edit it
# Find counts of values for highest_degree, store in object 'degree_counts'
degree_counts <- table(dat$highest_degree)

# Print table stored in 'degree_counts'

degree_counts

 Associates   Bachelors Certificate    Graduate
       1096         501        1374        1464
1464 of the institutions in our dataset are Universities that offer graduate degrees. On the other end
of the spectrum, 1374 of the institutions aren’t Universities at all. Rather, they are career-oriented
programs that offer trade certificates.
To get a better sense of scale, we can turn these raw counts into proportions by dividing them by
the total:
[19]: ## Run this code but do not edit it
# Sum all counts in table, store in object 'total'
total <- sum(degree_counts)

# Print the value stored in 'total'

total

4435
[20]: ## Run this code but do not edit it
# Divide the table by the total to get proportions
degree_counts / total

 Associates   Bachelors Certificate    Graduate
  0.2471251   0.1129651   0.3098083   0.3301015
As you can see, you can use R just like a calculator. Addition, subtraction, multiplication, division
… it’s all there. Universities offering graduate degrees make up about 33% of the institutions in our
dataset. These are about three times more prevalent than 4-year colleges (Bachelors) that don’t
offer graduate degrees.
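Base R also has a prop.table function that turns a table of counts into proportions in one step. The sketch below compares it with dividing by the total, using a hypothetical vector of degree labels (toy_degrees) rather than the real data.

```r
# Hypothetical categorical values
toy_degrees <- c("Graduate", "Certificate", "Graduate", "Associates")

# Counts of each value
toy_counts <- table(toy_degrees)
toy_counts

# Proportions by dividing the counts by their total
toy_counts / sum(toy_counts)

# prop.table gives the same proportions
prop.table(toy_counts)

# Percentages rounded to one decimal place
round(100 * prop.table(toy_counts), 1)
```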
2.3 - Use the table command to get the value counts for the ownership variable.
[23]: # Your code goes here
degCounts <- table(dat$ownership)
degCounts

Private for-profit  Private nonprofit             Public
              1684               1212               1539
Check yourself: There are 1539 public schools in the dataset

2.4 - Find the proportion of all institutions that are public, private nonprofit, and private for-profit.
[24]: # Your code goes here
total2 <- sum(degCounts)
total2
degCounts / total2

4435
Private for-profit  Private nonprofit             Public
         0.3797069          0.2732807          0.3470124
Check yourself: About 34.7% of the schools in the dataset are public schools

0.1.5 3.0 - Visualizing data (histograms, barplots, and boxplots)

In addition to summary statistics, a great way to get an overall impression of our data is to visualize
it. In this section, we’ll walk through different types of visualizations we can create in R. Note:
We’re saving scatterplots for the next notebook in our series.
One of the most useful visualizations for displaying a quantitative variable is a histogram. Here,
we use the gf_histogram command to display the histogram for admit_rate.

[25]: ## Run this code but do not edit it

# Create histogram for admit_rate
gf_histogram(~admit_rate, data = dat)

Warning message:
“Removed 2731 rows containing non-finite outside the scale range
(`stat_bin()`).”

Note: A warning message was displayed about removing rows. This is R telling us that it’s
choosing not to visualize the missing data values (NA) that we discovered for admit_rate earlier in
the notebook.
As we suspected from the summary statistics, it appears that most programs have admissions rates
well above 50%, and only a small subset of programs have highly selective admissions rates. In
statistics, we call this distribution left skew, since there’s a tail on the left side. So, institutions with
low values (low admissions rates) are relatively unusual compared to most of the other institutions
in our dataset.
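Left skew also tends to show up in the summary statistics: the tail of low values typically pulls the mean below the median, which matches the admit_rate summary earlier (mean 70.81, median 74.68). A small sketch with a hypothetical vector (toy_skew, made-up values):

```r
# Hypothetical left-skewed values: one unusually low value, the rest high
toy_skew <- c(5, 60, 70, 75, 80, 85, 90)

# The low value drags the mean down but barely affects the median
mean(toy_skew)
median(toy_skew)

# TRUE for this vector: the mean sits below the median
mean(toy_skew) < median(toy_skew)
```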
3.1 - Create a histogram to visualize all the default_rate values in the dat dataframe.
[26]: # Your code goes here
gf_histogram(~default_rate, data = dat)

3.2 - Describe the distribution and note any features of interest.

Double-click to type a response: skewed right, with potential high outliers
To visualize categorical variables, we can use the gf_bar command to make bar plots. Here we
create a bar plot for highest_degree:

[27]: ## Run this code but do not edit it

# Create bar plot for highest_degree

gf_bar(~highest_degree, data = dat)

As shown here, most of the institutions in our dataset are Universities that offer graduate degrees or
trade programs that offer professional certificates. There are about 500 colleges that only offer
bachelors degrees (without offering graduate degrees).
3.3 - Create a bar plot to visualize the ownership values from the dat dataframe.
[28]: # Your code goes here
gf_bar(~ownership, data = dat)

3.4 - Describe the distribution and note any features of interest.
Double-click to type a response:
Sometimes, we may want to explore the relationship between two variables by visualizing them both
at once. When we want to visualize the relationship between a categorical variable and quantitative
variable, we can use boxplots. Here, we show how to use gf_boxplot to visualize the relationship
between highest_degree (categorical) and admit_rate (quantitative).

[29]: ## Run this code but do not edit it

# Create boxplots for admit_rates of institutions with different highest_degree values

gf_boxplot(admit_rate ~ highest_degree, data = dat)

Warning message:
“Removed 2731 rows containing non-finite outside the scale range
(`stat_boxplot()`).”

In this case, we’re using highest_degree as the predictor variable and admit_rate as the
outcome variable. In other words, we can use the degree level of an institution (certificate,
associates, bachelors, etc.) to help predict its admission rate. That’s because certain levels of
institutions typically have lower admissions rates than others. So, knowing the level of an institution
can help us better predict its admissions rate.
Note: This predictor-outcome relationship is coded in R through the syntax outcome ~
predictor, as in gf_boxplot(admit_rate ~ highest_degree,...).
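The outcome ~ predictor syntax is a general R feature called a formula, and base R functions accept it too. As a sketch, the code below uses a formula with aggregate to compute the median outcome for each predictor group. The dataframe toy_box and its values are hypothetical.

```r
# Hypothetical data: two institution levels with two admissions rates each
toy_box <- data.frame(
  degree     = c("Graduate", "Graduate", "Certificate", "Certificate"),
  admit_rate = c(40, 60, 90, 100)
)

# A formula is an object in its own right
f <- admit_rate ~ degree
class(f)       # "formula"
all.vars(f)    # "admit_rate" "degree"

# Median outcome for each group of the predictor
group_medians <- aggregate(f, data = toy_box, FUN = median)
group_medians
```

The medians computed here correspond to the center lines a boxplot would draw for each group.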
We see that admission rates tend to be lower (lower medians) for colleges / Universities that grant
bachelors and graduate degrees. However, it’s worth noting that for every institution-type, the
first quartile is higher than a 50% admissions rate. So, most programs admit more than half
their applicants, regardless of institution-type. Indeed, we see that the most prestigious Universities
with admissions rates lower than 25% are outliers (visualized as dots on the boxplot) among other
Universities that offer graduate degrees.
3.5 - Create boxplots to visualize the relationship between ownership and default_rate from the
dat dataframe.
[30]: # Your code goes here
gf_boxplot(default_rate ~ ownership, data = dat)

3.6 - Using your boxplot visualization, describe the relationship between institution ownership and
student loan default rates.
Double-click to type a response:

0.1.6 Feedback (Required)

Please take 2 minutes to fill out this anonymous notebook feedback form, so we can continue
improving this notebook for future years!

17 pages
Osprey, Men-At-Arms #051 Spanish Armies of The Napoleonic Wars (1975) OCR 8.12
100% (10)
Osprey, Men-At-Arms #051 Spanish Armies of The Napoleonic Wars (1975) OCR 8.12
50 pages
Chapter 06 - Heteroskedasticity
100% (1)
Chapter 06 - Heteroskedasticity
30 pages
AS Extended Buisnesss Report
No ratings yet
AS Extended Buisnesss Report
25 pages
Block 1-Data Handling Using Pandas DataFrame
No ratings yet
Block 1-Data Handling Using Pandas DataFrame
17 pages
Question and Answers For Pyplots
No ratings yet
Question and Answers For Pyplots
11 pages
Statistical Methods For Decision Making
100% (1)
Statistical Methods For Decision Making
15 pages
SWM VHF PDF
No ratings yet
SWM VHF PDF
6 pages
Introduction To Regression Models For Panel Data Analysis Indiana University Workshop in Methods October 7, 2011 Professor Patricia A. Mcmanus
No ratings yet
Introduction To Regression Models For Panel Data Analysis Indiana University Workshop in Methods October 7, 2011 Professor Patricia A. Mcmanus
42 pages
An R Tutorial Starting Out
No ratings yet
An R Tutorial Starting Out
9 pages
SMDM Report
No ratings yet
SMDM Report
12 pages
1st QE Answer Key Plate No. 1
No ratings yet
1st QE Answer Key Plate No. 1
7 pages
Wooldridge Session 4
No ratings yet
Wooldridge Session 4
64 pages
