Notebook 1 - Basic R & Data Exploration
0.1 Data Science Project: Use data to determine the best and worst colleges
for conquering student debt.
0.1.1 Notebook 1: Basic R Commands & Data Exploration
Does college pay off? We’ll use some of the latest data from the US Department of Education’s
College Scorecard Database to answer that question. In this first notebook, you’ll get a gentle
introduction to R - a coding language used by data scientists to analyze large datasets. Then,
you’ll begin diving into the college scorecard data yourself. By the end of this notebook, you’ll get
a general sense of which colleges set up their graduates for success and which colleges … don’t.
[3]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command loads a useful package of R commands
library(coursekata)
The <- operator is used to store values. For example, x <- 10 stores the value 10 in the object
named x.
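Here is a minimal sketch of the <- operator in action. The object names x and y are just examples, not objects the notebook defines.

```r
# Store the value 10 in an object named x
x <- 10

# Typing the object's name prints its stored value
x

# Stored values can be reused in later calculations
y <- x * 2
y
```

Running the name of an object on its own line is a quick way to check what it currently holds.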
To get a quick view of the dataframe (dat), we can use the head command to print out its first
several rows.
[5]: ## Run this code but do not edit it. Hit Ctrl+Enter to run the code
# This command prints out the first several rows of the dataset
head(dat)
A data.frame: 6 × 26
  OPEID   name                                city       state region m
  <int>   <chr>                               <chr>      <chr> <chr>  <d
1 100200  Alabama A & M University            Normal     AL    South  15
2 105200  University of Alabama at Birmingham Birmingham AL    South  15
3 2503400 Amridge University                  Montgomery AL    South  10
4 105500  University of Alabama in Huntsville Huntsville AL    South  14
5 100500  Alabama State University            Montgomery AL    South  17
6 105100  The University of Alabama           Tuscaloosa AL    South  17
The vertical columns of the dataframe are called variables, and their elements are called values.
For example, the variable city has values Normal, Birmingham, Montgomery, Huntsville, etc.
The horizontal rows of the dataframe are called observations. For example, the first observation
is Alabama A & M University, which is located in AL (Alabama), in the city of Normal, and has a
median student debt of $15,250. For this dataframe, each observation describes a specific college.
1.1 - Of the variables displayed, identify one that is quantitative, one that is categorical, and one
that is a unique identifier.
Double-click to type a response: city –> categorical, median_debt –> quantitative, OPEID –>
unique identifier
The head command only displays several rows of the dataframe. To see the full dimensions of the
dataframe, we can use the dim command.
1.2 - Use the dim command on dat to display the dimensions of the dataframe.
[6]: # Your code goes here
dim(dat)
1. 4435 2. 26
Check yourself: Your code should have printed out two numbers: 4435 and 26.
The first number outputted by dim is the number of horizontal rows in the dataframe. This
represents the number of observations (number of colleges). The second number is the number of
vertical columns in the dataframe. This represents the number of variables. What are all these
variables? See the description of the dataset below, along with links to descriptions of all the
variables.
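To make the row-and-column reading of dim concrete, here is a small sketch on a made-up dataframe. toy_dat and its values are invented for illustration and are not part of the College Scorecard data.

```r
# A small made-up dataframe: 3 colleges (rows) and 2 variables (columns)
toy_dat <- data.frame(
  name = c("College A", "College B", "College C"),
  median_debt = c(15.25, 18.5, 11.0)
)

dim(toy_dat)   # rows first, then columns
nrow(toy_dat)  # number of observations only
ncol(toy_dat)  # number of variables only
```

nrow and ncol are base R helpers that return each dimension separately, which can be handy when you only need one of the two numbers.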
Next, we use the select command to pull the columns name, median_debt, ownership, admit_rate,
and hbcu out of dat and save them in a new dataframe called example_dat.
[7]: ## Run this code but do not edit it
# Select certain columns from dat, store into example_dat
example_dat <- select(dat, name, median_debt, ownership, admit_rate, hbcu)
A data.frame: 7 × 5
     name                                           median_debt ownership
     <chr>                                          <dbl>       <chr>
461  Delaware State University                      18.264      Public
473  Howard University                              19.500      Private nonprofit
491  Florida Agricultural and Mechanical University 18.750      Public
503  Florida Memorial University                    17.155      Private nonprofit
1376 Alcorn State University                        16.895      Public
1401 Rust College                                   11.226      Private nonprofit
2747 Hampton University                             18.500      Private nonprofit
A total of 7 colleges fit these conditions.
Note that R has different conventions for comparative statements. For example:
- == means equals exactly
- != means does not equal
- < means less than
- > means greater than
- <= means less than or equal to
- >= means greater than or equal to
Here are some other common conditional symbols:
- | means or
- & means and
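The difference between & and | is easiest to see side by side. This is a minimal sketch on a made-up dataframe: toy_dat, its column values, and the cutoff of 15 are all invented for illustration.

```r
# Made-up data: four colleges with a region and a median debt (in $1000s)
toy_dat <- data.frame(
  name = c("College A", "College B", "College C", "College D"),
  region = c("Midwest", "South", "Midwest", "South"),
  median_debt = c(12, 20, 18, 9)
)

# & keeps rows where BOTH conditions are true
both <- subset(toy_dat, region == "Midwest" & median_debt > 15)

# | keeps rows where AT LEAST ONE condition is true
either <- subset(toy_dat, region == "Midwest" | median_debt > 15)

both$name
either$name
```

Only College C is in the Midwest and above the cutoff, so both has a single row. Three colleges satisfy at least one of the conditions, so either has three rows.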
1.4 - Use the subset command to find the colleges in my_dat that are located in the Midwest
region of the United States and have more than a third of their students (greater than 33%) default
on their loans.
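[10]: # Your code goes here
# Keep Midwest colleges whose loan default rate is above 33%
# (assumes the default rate column in my_dat is named default_rate, as in dat)
subset(my_dat, region == "Midwest" & default_rate > 33)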
A data.frame: 653 × 5
name region default
<chr> <chr> <dbl>
655 American Academy of Art College Midwest 5.7
658 Aurora University Midwest 4.2
661 Blackburn College Midwest 8.6
663 Paul Mitchell the School-Bradley Midwest 5.0
664 Cameo Beauty Academy Midwest 3.5
666 Capri Beauty College Midwest 4.7
667 Carl Sandburg College Midwest 9.0
668 Chicago State University Midwest 8.7
670 City Colleges of Chicago-Kennedy-King College Midwest 14.9
671 City Colleges of Chicago-Malcolm X College Midwest 9.4
672 City Colleges of Chicago-Olive-Harvey College Midwest 12.4
675 City Colleges of Chicago-Harold Washington College Midwest 11.6
677 Columbia College Chicago Midwest 6.1
678 Concordia University-Chicago Midwest 3.1
681 Cosmetology & Spa Academy Midwest 4.3
683 East-West University Midwest 21.9
684 Eastern Illinois University Midwest 6.1
686 Elmhurst University Midwest 3.2
687 Eureka College Midwest 7.8
688 First Institute of Travel Inc. Midwest 10.6
689 Fox College Midwest 6.4
690 Gem City College Midwest 5.0
691 Governors State University Midwest 6.2
692 Greenville University Midwest 4.9
693 G Skin & Beauty Institute Midwest 2.7
694 Hair Professionals Career College Midwest 0.0
695 Hair Professionals School of Cosmetology Midwest 5.3
697 Highland Community College Midwest 10.0
698 University of Illinois Chicago Midwest 2.5
699 Benedictine University Midwest 4.9
⋮
4290 Paul Mitchell the School-Toledo Midwest 10.4
4292 SAE Institute of Technology-Chicago Midwest 21.6
4312 Valor Christian College Midwest 12.6
4313 Bethany Global University Midwest 4.7
4314 Bella Academy of Cosmetology Midwest 28.5
4329 Ross Medical Education Center-Muncie Midwest 18.0
4335 Davines Professional Academy of Beauty and Business Midwest 11.7
4338 Paul Mitchell the School-Madison Midwest 12.5
4341 Protege Academy Midwest 16.0
4343 Fortis College-Cuyahoga Falls Midwest 13.9
4358 Aveda Institute-Madison Midwest 10.8
4360 Tricoci University of Beauty Culture-Elgin Midwest 4.3
4369 CAAN Academy of Nursing Midwest 0.0
4375 Ea La Mar’s Cosmetology & Barber College Midwest 25.0
4382 Kenny’s Academy of Barbering Midwest 44.4
4384 Indiana Wesleyan University-National & Global Midwest 5.0
4387 Ross Medical Education Center-Elyria Midwest 19.4
4388 Ross Medical Education Center-Lafayette Midwest 18.6
4389 Ross Medical Education Center-Midland Midwest 18.0
4392 Academy of Beauty Professionals Midwest 5.3
Check yourself: You should find that 2 schools match your selection criteria.
1.5 - What do you notice about the observations that fit your selection criteria? What do you
wonder?
Double-click to type a response:
Suppose you’re interested in a particular college, such as Howard University. We can use the subset
command to filter the example_dat dataframe and focus solely on the information pertaining to
that college.
[11]: ## Run this code but do not edit it
# Subset example_dat to only show Howard University
subset(example_dat, name == "Howard University")
A data.frame: 4435 × 5
name median_debt ownersh
<chr> <dbl> <chr>
Curtis Institute of Music 16.250 Private
Harvard University 12.072 Private
Stanford University 11.000 Private
Princeton University 10.355 Private
Yale University 12.000 Private
Columbia University in the City of New York 19.250 Private
California Institute of Technology 9.867 Private
Massachusetts Institute of Technology 12.000 Private
University of Chicago 13.000 Private
The Juilliard School 25.000 Private
Brown University 12.000 Private
Duke University 12.500 Private
Pomona College 10.000 Private
University of Pennsylvania 14.000 Private
Swarthmore College 14.000 Private
Bowdoin College 14.000 Private
Dartmouth College 14.500 Private
Northwestern University 14.000 Private
Colby College 17.500 Private
Cornell University 13.108 Private
Rice University 10.500 Private
Johns Hopkins University 11.750 Private
Tulane University of Louisiana 19.000 Private
Vanderbilt University 12.420 Private
Amherst College 12.000 Private
Circle in the Square Theatre School 16.000 Private
Claremont McKenna College 12.070 Private
Colorado College 15.045 Private
Barnard College 16.250 Private
Bates College 12.610 Private
⋮
National Personal Training Institute-Tampa 6.333 Private
Mobile Technical Training 3.800 Private
California Institute of Arts & Technology 9.500 Private
Elite Cosmetology Barber & Spa Academy 6.054 Private
Gwinnett Institute 9.500 Private
Manuel and Theresa’s School of Hair Design 6.494 Private
Peloton College 9.500 Private
Ross Medical Education Center - Kalamazoo 8.089 Private
Ross College-Canton 8.347 Private
Ross College-Grand Rapids North 7.125 Private
American Institute-Somerset 9.176 Private
Bull City Durham Beauty and Barber College 9.833 Private
Fortis College-Cutler Bay 12.667 Private
Unitech Training Academy-Baton Rouge 6.991 Private
Empire Beauty School-Tampa 7.917 Private
Empire Beauty School-Lakeland 7.667 Private
Galen College of Nursing-ARH 16.500 Private
Tricoci University of Beauty Culture-Janesville 8.468 Private
Lynnes Welding Training-Bismarck 3.385 Private
No Grease Barber School 9.833 Private
As we can see, the most selective schools now top the list. You’ll see some NA values from
admit_rate at the bottom of the arranged dataset. These are missing values, which we’ll
discuss later.
To arrange the data in descending order of admission rates (highest admission rates on top), we
can wrap the variable in the desc function within our arrange command:
A data.frame: 4435 × 5
name med
<chr> <db
University of Arkansas Community College-Morrilton 6.25
Design Institute of San Diego 31.0
Naropa University 16.3
VanderCook College of Music 27.0
Saint Elizabeth School of Nursing 20.2
Maharishi International University 13.0
Grace Christian University 9.70
Sacred Heart Major Seminary 7.34
JFK Muhlenberg Harold B. and Dorothy A. Snyder Schools 15.7
Arnot Ogden Medical Center 11.7
Neighborhood Playhouse School of the Theater 12.0
Samaritan Hospital School of Nursing 14.2
Trinity Bible College and Graduate School 12.8
Trinity Health System School of Nursing 13.6
Warner Pacific University 24.3
New Castle School of Trades 8.72
Saint Charles Borromeo Seminary-Overbrook 16.5
Universidad Adventista de las Antillas 11.8
Greene County Career and Technology Center 16.3
Western Area Career & Technology Center 16.5
Hussian College-Daymar College Clarksville 9.50
Eastern Center for Arts and Technology 8.55
Greater Lowell Technical School 5.50
Cass Career Center 9.50
Orange Ulster BOCES-Practical Nursing Program 11.8
Washington Saratoga Warren Hamilton Essex BOCES-Practical Nursing Program 12.8
Mifflin County Academy of Science and Technology 11.7
Living Arts College 10.0
Cayuga Onondaga BOCES-Practical Nursing Program 7.70
Delaware County Technical School-Practical Nursing Program 16.5
⋮
National Personal Training Institute-Tampa 6.33
Mobile Technical Training 3.80
California Institute of Arts & Technology 9.50
Elite Cosmetology Barber & Spa Academy 6.05
Gwinnett Institute 9.50
Manuel and Theresa’s School of Hair Design 6.49
Peloton College 9.50
Ross Medical Education Center - Kalamazoo 8.08
Ross College-Canton 8.34
Ross College-Grand Rapids North 7.12
American Institute-Somerset 9.17
Bull City Durham Beauty and Barber College 9.83
Fortis College-Cutler Bay 12.6
Unitech Training Academy-Baton Rouge 6.99
Empire Beauty School-Tampa 7.91
Empire Beauty School-Lakeland 7.66
Galen College of Nursing-ARH 16.5
Tricoci University of Beauty Culture-Janesville 8.46
Lynnes Welding Training-Bismarck 3.38
No Grease Barber School 9.83
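The two orderings can be compared on a small made-up dataframe. This sketch assumes that arrange and desc come from the dplyr package, which is loaded directly here. toy_dat and its values are invented for illustration.

```r
library(dplyr)  # assumption: arrange() and desc() are the dplyr versions

# Made-up data: four colleges, one with a missing admission rate
toy_dat <- data.frame(
  name = c("College A", "College B", "College C", "College D"),
  admit_rate = c(0.65, 0.08, NA, 0.97)
)

# Ascending order: lowest admission rate (most selective) first
arrange(toy_dat, admit_rate)

# Descending order: highest admission rate first
arrange(toy_dat, desc(admit_rate))
```

In both orderings, the row with the NA value ends up at the bottom, which matches the missing admit_rate values seen at the end of the arranged output.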
1.7 - Use the arrange command to organize the colleges in my_dat such that the colleges with the
highest student loan default rates are at the top.
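[15]: # Your code goes here
# Sort so the highest loan default rates come first
# (assumes the default rate column in my_dat is named default_rate, as in dat)
arrange(my_dat, desc(default_rate))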
A data.frame: 4435 × 5
name region
<chr> <chr>
PJ’s College of Cosmetology-Glasgow South
Mitchells Academy South
Charles and Sues School of Hair Design Rockies & Southw
Victoria Beauty College Inc Rockies & Southw
Nuvani Institute Rockies & Southw
New Community Career & Technical Institute Northeast
CBT Technology Institute-Hialeah South
SABER College South
Hands on Therapy Rockies & Southw
CBT Technology Institute-Cutler Bay South
CyberTex Institute of Technology Rockies & Southw
Bos-Man’s Barber College South
Barber Institute of Texas Rockies & Southw
Dewey University-Juana Diaz Territories
Automeca Technical College-Caguas Territories
Choffin Career and Technical Center Midwest
Academy of Hair Design-Jasper Rockies & Southw
More Tech Institute South
Trend Barber College Rockies & Southw
Reflections Academy of Beauty Midwest
Clinton College South
CBT Technology Institute-Main Campus South
Palladium Technical Academy Inc Far West
South Florida Institute of Technology South
Lee Professional Institute South
Nuvani Institute Rockies & Southw
Erie 2 Chautauqua Cattaraugus BOCES-Practical Nursing Program Northeast
Automeca Technical College-Ponce Territories
CEM College-Humacao Territories
Career Center of Southern Illinois Midwest
⋮
Fairfield University Northeast
Arapahoe Community College Rockies & Southw
Worcester Polytechnic Institute Northeast
Bucknell University Northeast
Kenyon College Midwest
Oberlin College Midwest
Wake Forest University South
American Beauty Academy-West Valley Campus Rockies & Southw
Elon University South
South Seattle College Far West
Bellevue College Far West
University of New Mexico-Los Alamos Campus Rockies & Southw
Cascadia College Far West
Aveda Institute-Tucson Rockies & Southw
United States Merchant Marine Academy US Service School
Hope College of Arts and Sciences South
Western Texas College Rockies & Southw
L3Harris Flight Academy South
Curtis Institute of Music Northeast
Foothill College Far West
1.8 - What patterns do you notice among the programs that have the highest student loan default
rates? What do you wonder?
Double-click to type a response:
Reference Guide for R (student resource) - Now that you’ve seen a number of different
commands in R, check out our reference guide for a full listing of useful R commands for this
project.
[18]: ## Run this code but do not edit it
# Find counts of values for highest_degree, store in object 'degree_counts'
degree_counts <- table(dat$highest_degree)
4435
[20]: ## Run this code but do not edit it
# Divide the table by the total to get proportions
degree_counts / total
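The count-then-divide pattern used above can be tried on a tiny made-up vector. The names toy_degrees, toy_counts, and toy_total are invented for illustration and do not come from the dataset.

```r
# Made-up values: the highest degree offered at five institutions
toy_degrees <- c("Certificate", "Bachelors", "Certificate", "Graduate", "Certificate")

toy_counts <- table(toy_degrees)  # how many times each value appears
toy_total <- sum(toy_counts)      # total number of observations

toy_counts / toy_total            # proportions, which add up to 1
```

Base R also offers prop.table(toy_counts), which performs the same division in a single step.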
2.4 - Find the proportion of all institutions that are public, private nonprofit, and private for-profit.
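[24]: # Your code goes here
# Count how many institutions fall under each ownership type
ownership_counts <- table(dat$ownership)
total2 <- sum(ownership_counts)
total2
# Divide the counts by the total to get proportions
ownership_counts / total2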
4435
Check yourself: About 34.7% of the schools in the dataset are public schools.
Warning message:
“Removed 2731 rows containing non-finite outside the scale range
(`stat_bin()`).”
Note: A warning message was displayed about removing rows. This is R telling us that it’s
choosing not to visualize the missing data values (NA) that we discovered for admit_rate earlier in
the notebook.
As we suspected from the summary statistics, it appears that most programs have admissions rates
well above 50%, and only a small subset of programs have highly selective admissions rates. In
statistics, we call this distribution left-skewed, since there’s a tail on the left side. So, institutions
with low values (low admissions rates) are relatively unusual compared to most of the other
institutions in our dataset.
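One common rule of thumb, which the notebook itself does not state, is that a tail on the left tends to pull the mean below the median. Here is a minimal sketch with made-up admission rates. toy_rates is invented for illustration and is not drawn from dat.

```r
# Made-up admission rates: most are high, a few are very low (a tail on the left)
toy_rates <- c(0.05, 0.10, 0.60, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.00)

mean(toy_rates)    # pulled down toward the low values in the tail
median(toy_rates)  # stays near the bulk of the data
```

Comparing the two numbers is only a rough check. The histogram remains the more reliable way to judge the shape of a distribution.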
3.1 - Create a histogram to visualize all the default_rate values in the dat dataframe.
[26]: # Your code goes here
gf_histogram(~default_rate, data = dat)
gf_bar(~highest_degree, data = dat)
As shown here, most of the institutions in our dataset are universities that offer graduate degrees or
trade programs that offer professional certificates. There are about 500 colleges that only offer
bachelor’s degrees (without offering graduate degrees).
3.3 - Create a bar plot to visualize the ownership values from the dat dataframe.
[28]: # Your code goes here
gf_bar(~ownership, data = dat)
3.4 - Describe the distribution and note any features of interest.
Double-click to type a response:
Sometimes, we may want to explore the relationship between two variables by visualizing them both
at once. When we want to visualize the relationship between a categorical variable and quantitative
variable, we can use boxplots. Here, we show how to use gf_boxplot to visualize the relationship
between highest_degree (categorical) and admit_rate (quantitative).
Warning message:
“Removed 2731 rows containing non-finite outside the scale range
(`stat_boxplot()`).”
In this case, we’re using highest_degree as the predictor variable and admit_rate as the
outcome variable. In other words, we can use the degree level of an institution (certificate,
associates, bachelors, etc.) to help predict its admission rate. That’s because certain levels of
institutions typically have lower admissions rates than others. So, knowing the level of an institution
can help us better predict its admissions rate.
Note: This predictor-outcome relationship is coded in R through the syntax outcome ~
predictor, as in gf_boxplot(admit_rate ~ highest_degree,...).
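The same outcome ~ predictor formula also works in base R. As a rough sketch, the base boxplot function can compute the per-group summaries without drawing anything. This is a swapped-in base R stand-in for gf_boxplot, and toy_dat with its values is invented for illustration.

```r
# Made-up data: admission rates for two institution levels
toy_dat <- data.frame(
  highest_degree = c("Certificate", "Certificate", "Certificate",
                     "Graduate", "Graduate", "Graduate"),
  admit_rate = c(0.90, 0.95, 1.00, 0.40, 0.60, 0.80)
)

# outcome ~ predictor: admit_rate is summarized within each highest_degree group
# plot = FALSE returns the summary numbers instead of drawing the boxplot
box_summary <- boxplot(admit_rate ~ highest_degree, data = toy_dat, plot = FALSE)

box_summary$names       # the groups, in the order they would be plotted
box_summary$stats[3, ]  # the median admit_rate for each group
```

Each column of box_summary$stats holds the five numbers behind one box, and the third row is the median line drawn inside it.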
We see that admission rates tend to be lower (lower medians) for colleges and universities that grant
bachelor’s and graduate degrees. However, it’s worth noting that for every institution type, the
first quartile is higher than a 50% admissions rate. So, most programs admit more than half
their applicants, regardless of institution type. Indeed, we see that the most prestigious universities
with admissions rates lower than 25% are outliers (visualized as dots on the boxplot) among other
universities that offer graduate degrees.
3.5 - Create boxplots to visualize the relationship between ownership and default_rate from the
dat dataframe.
[30]: # Your code goes here
gf_boxplot(default_rate ~ ownership, data = dat)
3.6 - Using your boxplot visualization, describe the relationship between institution ownership and
student loan default rates.
Double-click to type a response: