Introducing Data Science Techniques by Connecting Database Concepts and Dplyr
To cite this article: Jennifer E. Broatch, Suzanne Dietrich & Don Goelman (2019) Introducing
Data Science Techniques by Connecting Database Concepts and dplyr, Journal of Statistics
Education, 27:3, 147-153, DOI: 10.1080/10691898.2019.1647768
DATA SCIENCE
KEYWORDS
Data science; Databases; Education; Teaching tool
ABSTRACT
Early exposure to data science skills, such as relational databases, is essential for students in statistics as
well as many other disciplines in an increasingly data driven society. The goal of the presented pedagogy is
to introduce undergraduate students to fundamental database concepts and to illuminate the connection
between these database concepts and the functionality provided by the dplyr package for R. Specifically,
students are introduced to relational database concepts using visualizations that are specifically designed
for students with no data science or computing background. These educational tools, which are freely
available on the Web, engage students in the learning process through a dynamic presentation that gently
introduces relational databases and how to ask questions of data stored in a relational database. The
visualizations are specifically designed for self-study by students, including a formative self-assessment
feature. Students are then assigned a corresponding statistics lesson to utilize statistical software in R
within the dplyr framework and to emphasize the need for these database skills. This article describes
a pilot experience of introducing this pedagogy into a calculus-based introductory statistics course for
mathematics and statistics majors, and provides a brief evaluation of the student perspective of the
experience. Supplementary materials for this article are available online.
CONTACT Jennifer E. Broatch [email protected] School of Mathematics and Natural Sciences, Arizona State University, PO Box 37100, Mail Code 2352, Phoenix, AZ
85069-7100.
Supplementary materials for this article are available online. Please go to www.tandfonline.com/ujse.
© 2019 The Author(s). Published with license by Taylor and Francis Group, LLC.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution,
and reproduction in any medium, provided the original work is properly cited. The moral rights of the named author(s) have been asserted.
In the Sports Statistics customization of the visualization, baseball statistics relate Hitters and Pitchers with information on the match-ups between the hitter and pitcher. The visualization illustrates how spreadsheets with redundant data may have anomalous situations when updating, deleting, and inserting data. Databases avoid these issues by breaking down the data into separate tables without unnecessary repetition, as illustrated by the Breakdown topic of the visualization. Figure 2 shows that the given spreadsheet combines three concepts: Hitters, Pitchers, and MatchUps. The Hitters button has been selected, showing how the concept would be stored in a table, where the green rows represent the resulting data (in the Hitters table), the red strike through rows are redundant data, and the gray data are not relevant to that particular concept. By selecting each button, the user sees how the data in the database tables are created.

Thus, a database is a collection of tables without unnecessary repetition. These tables must be combined to answer certain questions over the data. The associations between the tables are formed with the concept of primary and foreign keys. Figure 3 shows a screen shot of the Baseball Statistics database, indicating the primary keys with a gold key and foreign keys with a different-shaped orange key. Primary keys are attribute(s) that uniquely identify a row in a table, such as HName in Hitters and PName in Pitchers. Typically, every table has a primary key, which may consist of one attribute or a combination of several attributes. Note that the MatchUps table has a composite primary key consisting of the combination of HName and PName (shown linked in the screen shot), uniquely identifying the matchup average for that hitter and pitcher combination. A foreign key consists of attribute(s) in a table that are referencing the value of a primary key in another table, such as HName in MatchUps and PName in MatchUps. Hitters and Pitchers do not have any foreign keys because they do not contain a primary key of another table. In the screen shot, the orange foreign key next to HName has been selected, highlighting the HName column in MatchUps in orange and highlighting the HName column in Hitters in gold. This visualizes that a value of the foreign key (HName in MatchUps) must appear as a value of the primary key in the related table (HName in Hitters). Databases use this relationship between the primary and foreign keys to combine tables together when needed to answer a query, which is further elaborated on in the querying visualization.

2.2. QueryDB: Introduction to Querying

The Introduction to Querying module provides a conceptual introduction to the various operations required to retrieve data from a database to answer a question. The visualization of these operations and their corresponding specification in SQL provides a strong foundation for students to use SQL to query relational databases. In this module students learn various operations for combining data to answer queries, such as common set operators, horizontal and vertical filtering, and joins. In addition, these operations are mapped to the SQL industry standard query language so students can understand basic SQL syntax.

QueryDB assumes that students have already viewed the IntroDB visualization, and focuses more on how to answer queries by filtering and combining tables. When designing a query, it is important to know the tables, their attributes, and the primary and foreign keys. Figure 4 provides this abstraction of a database, known as a schema, for the Baseball Statistics application. This is a visual schema, showing the primary and foreign keys with these associations between the tables illustrated through links. The highlighted rows illustrate the design of the query mentioned in the screen shot, which finds the match-up average for a hitter with the given batting average against a particular pitcher. The highlighting is built dynamically as the student is walked through the design of that query using the given information and the primary-foreign key associations between the tables. The visualization then introduces by example the fundamental set operators (union, intersection, negation) that operate on tables having the same format. Students
are then presented with new operators that filter the tables both horizontally and vertically as well as operators that combine tables on the primary-foreign keys to provide a larger table needed to answer a query. The latter are called joins. After introducing these various operators, the industry standard SQL for querying databases is introduced and a textual query is built up incrementally along with a visual representation of the query, as shown in Figure 5. The SQL query provides the answer to the query designed in Figure 4, which incorporates horizontal filtering (batting average and pitcher name) and vertical filtering (matchup average) in combination with a join (hitter's name).

The key concepts from IntroDB and the operations for manipulating and combining data in QueryDB are definitely applicable to statistics as an essential part of the data analysis cycle. The dplyr package for R also uses these concepts for data manipulation, as illustrated in the next section.

3. dplyr Connection

This introduction to data science concepts was embedded within the context of an introductory statistics course, by exposing students to questions from datasets requiring some “data wrangling” skills, which are important in both databases and statistics. In the database field, SQL is the industry standard language that provides extensive support for asking questions over data stored in the database, including some fundamental operations for data analysis, such as sum, minimum, maximum, average, and count. In statistics, R is a tool that offers extensive data visualization and analysis tools, and dplyr is the language that manipulates the data for analysis. The learning objective of the activities is for students to understand the fundamental database concepts and data manipulation operations illustrated in the visualizations and to apply these techniques to statistics activities. Although students are learning how to apply these operations in the context of the syntax of SQL and R, specifically dplyr, the ultimate goal is to have students truly understand the operations so that they can apply these concepts to other contexts and languages even as technology changes. Thus, other popular packages, such as SAS, can also be used to analyze and access a relational database utilizing the basic tools taught within the tool (SAS Institute Inc. 2019).

The data wrangling connections emphasized in the activities highlighted essential data manipulation of a dataset and the combination of datasets. A first step in data wrangling is the ability to perform basic data queries and to reduce the dataset to only the variables and characteristics of interest, arranging the data for communication of results. A second essential step in data wrangling is the ability to combine various datasets correctly to answer more complicated questions. Table 1 shows the visualization concept and its correspondence in SQL and dplyr. Most of these connections, with the exception of the set based operations (union, intersection, difference), were incorporated within the activities assigned to the students.

The first student activity focuses on essential data wrangling without the combination of datasets. This activity should be assigned as early in the term as possible after a discussion on basic descriptive statistics and basic visualizations. The experience with the integration of the activity is described briefly in Section 4 and the activity itself is included in the supplementary materials. Students were asked to “Find the average height of human Starwars Characters” using the Star Wars characters dataset included in dplyr as the “starwars” tibble. This question requires the student to select the variable height, filter the
Table 1. Connections between visualizations, SQL, dplyr.
Visualization          SQL              dplyr
vertical filtering     SELECT           select
horizontal filtering   WHERE            filter
join                   FROM ... JOIN    inner_join
ordering               ORDER BY         arrange
union                  UNION            union
intersection           INTERSECT        intersect
difference             EXCEPT           setdiff

Table 2. Syntax comparison: average height of human starwars characters.
dplyr   starwars %>%
          filter(species == 'Human') %>%
          select(height) %>%
          summarize(avgheight = mean(height, na.rm = T))
SQL     select avg(height) as avgheight
        from Starwars
        where species = 'Human'
NOTE: Starwars is included in dplyr, which was exported as a data frame Starwars=as.data.frame(starwars) and imported into a relational database to create the corresponding table.

Table 3. Syntax comparison: joining batting with salaries.
dplyr   Batting %>%
          filter(yearID >= 1985) %>%
          inner_join(Salaries, by = c("playerID", "yearID", "teamID"))
SQL     select b.*, s.salary
        from batting b inner join salaries s
        on b.playerID = s.playerID and b.yearID = s.yearID and b.teamID = s.teamID
        where b.yearID >= 1985
NOTE: The Batting and Salaries datasets are found in Friendly (2017).

Table 4. Student responses to the integration of the visualizations and dplyr.
Question                                                                            Number of students with positive responses
How much did the online visualization help your learning of SQL?                   20 of 25
How much did the online visualization help your learning of the basics of dplyr?   19 of 25
How much did the dplyr example help your understanding of the material?            23 of 25
NOTE: A positive response is defined as either moderately help (3), much help (4), or great help (5). Five students did not respond.
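
The correspondences in Table 1 can be tried on a toy version of the baseball schema. The sketch below is illustrative only and is not taken from the activities: the three small data frames, the player and pitcher names, and the columns BattingAvg and MatchupAvg are invented, and only the key columns HName and PName come from the article. It checks the primary and foreign key rules, runs a query that combines a join with horizontal filtering, vertical filtering, and ordering, and then applies the three set operators.

```r
library(dplyr)

# Hypothetical miniature tables; every value here is invented for illustration.
hitters <- data.frame(
  HName = c("Able", "Baker", "Cruz"),            # primary key of hitters
  BattingAvg = c(0.310, 0.275, 0.298),
  stringsAsFactors = FALSE
)
pitchers <- data.frame(
  PName = c("Diaz", "Evans"),                    # primary key of pitchers
  stringsAsFactors = FALSE
)
matchups <- data.frame(
  HName = c("Able", "Able", "Baker", "Cruz"),    # foreign key to hitters
  PName = c("Diaz", "Evans", "Diaz", "Evans"),   # foreign key to pitchers
  MatchupAvg = c(0.333, 0.250, 0.200, 0.400),
  stringsAsFactors = FALSE
)

# Foreign key rule: every key value in matchups must appear in the related table.
fk_ok <- all(matchups$HName %in% hitters$HName) &&
  all(matchups$PName %in% pitchers$PName)

# Composite primary key rule: each (HName, PName) pair occurs exactly once.
pk_ok <- nrow(distinct(matchups, HName, PName)) == nrow(matchups)

# join + horizontal filtering + vertical filtering + ordering
query <- matchups %>%
  inner_join(hitters, by = "HName") %>%             # FROM ... JOIN
  filter(BattingAvg > 0.29, PName == "Evans") %>%   # WHERE
  select(HName, MatchupAvg) %>%                     # SELECT
  arrange(desc(MatchupAvg))                         # ORDER BY ... DESC
# query lists Cruz (0.400) first, then Able (0.250)

# Set operators need tables with the same columns.
strong_hitters <- hitters %>% filter(BattingAvg > 0.29) %>% select(HName)
faced_diaz <- matchups %>% filter(PName == "Diaz") %>% select(HName)

either <- union(strong_hitters, faced_diaz)           # UNION: Able, Cruz, Baker
both <- intersect(strong_hitters, faced_diaz)         # INTERSECT: Able
only_strong <- setdiff(strong_hitters, faced_diaz)    # EXCEPT: Cruz
```

Base data frames are used here rather than tibbles so that the same objects could also be handed to a SQL tool that expects data frames. The set operators are the part of Table 1 that the student activities did not cover.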
dataset to only include human characters, and summarize to find the mean or average height. Table 2 displays the corresponding dplyr and SQL syntax. Both dplyr and SQL rename the resulting column as avgheight. Note that the activities do not require the use of the pipe operator %>%. The actual activity is broken down into smaller pieces. The pipe operator is shown here to provide a unified syntax that corresponds to the SQL query.

Although not explored as part of this initial pilot activity, the students can explore entering SQL syntax directly in R utilizing the sqldf package (Grothendieck 2017). Since sqldf uses data frames in R and not dplyr tibbles, the Starwars tibble should be exported to a data frame for use in sqldf [Starwars = as.data.frame(starwars)]. The SQL syntax shown in Table 2 would be a quoted parameter of the function sqldf. This would be an interesting connection to add to future versions of the assignment.

A second activity was created to reinforce the concept of a join, which combines datasets, within a statistical question. The second activity should be introduced after a discussion of one-way analysis of variance or linear regression. Again, the activity itself is included in the supplementary materials. A question is posed to the students using the Lahman package (Friendly 2017), which provides the tables from the Sean Lahman Baseball Database as R data frames: “Are higher salaries in Major League Baseball related to higher batting statistics post 1985?” To answer this question, students must first filter the Batting dataset for seasons post 1985 and then combine the Batting and Salaries data frames on the combination of attributes linking the data using a join in dplyr. Table 3 compares the syntax for dplyr and SQL for the question posed to the students. Again, the pipe operator is introduced for correspondence with SQL. Also, the join connection can be further explored by asking students to provide a corresponding SQL specification of their dplyr commands, which is then tested using the sqldf package.

The connection between database and statistics is an integral component of data science. By learning fundamental database concepts and querying, statistics students will be able to apply these techniques in both contexts. Students are able to retrieve the data that they need from databases and further analyze and visualize that data using dplyr in R. The experience of integrating the above activities in the classroom, along with students' perspectives on the experience, is discussed in the next section.

4. Discussion

To briefly assess the perceived impact of the integrated materials, students (n = 30) were given a survey to answer both Likert-type and open-ended questions about the integration experience after the first activity. Student responses to the integration activity presented in Table 4 were mostly positive, and most students reported that the visualizations support their learning and understanding of the problem. Representative feedback is included below.

• “I liked how the class involved another aspect to our course such as SQL. I feel like more self taught projects and examples would be really beneficial for us the students to expand our horizons of what is out there.”
• “I appreciated the chance to get a flavor of data science in the class.”

The skills learned in the animations are transferable to other platforms. For example, the first animation introduces tables that are conceptually linked using primary and foreign keys. In fact, 96% of students identified the proper composite primary key and successfully joined the two tables for analysis. The concepts presented provide a foundation for learning database querying in any language including R (see Wickham 2014 for an R example). Although the querying visualization focuses on SQL, it was noted by Hardin et al. (2015) (when referring to a course by Wickham) “while each language may have its own syntax the underlying operation that is being performed on the data is the same.” The animations can also assist students conducting Senior Capstone courses, like those reviewed in
Martonosi and Williams (2016), to bridge the gap between students’ statistical training and the data manipulation and management challenges of the real world.

This integration of database skills in an introductory statistics course is promising. The pilot described in this article provided an initial experience with the introduction of database visualizations of fundamental concepts into an introductory statistics course. Students appreciated the visualization of the concepts and the ability to apply these concepts in R with dplyr, and were able to successfully perform a more complicated data manipulation by the second activity.

There are additional benefits of introducing database skills early in this course. After the students were able to perform basic data manipulation, larger, nontrivial datasets were used in homework assignments. Students were required to manipulate the data using the five verbs to accurately answer the questions in the homework, expanding their experience beyond the textbook.

The synergy of databases and statistics is an important learning objective for an introductory statistics course. Future work will explore additional avenues and opportunities to revise and incorporate synergistic activities into the curriculum.

Supplementary Materials

The Database handout includes a brief description of key dplyr “verbs” to query a dataset just like the DBMM visualization. The five key data manipulation “verbs” are presented with their corresponding SQL commands. The Verb Activity is a series of questions that require the application of all five verbs presented. The Join Activity is an active learning activity that connects the Database Concepts and dplyr utilizing a sports statistics example in the R package: Lahman. This activity reinforces the concept of a join, which combines datasets, within a statistical question.

Funding

This material is based upon work supported by the National Science Foundation under grant nos. DUE-1431848, DUE-1431661, DUE-0941584, and DUE-0941401. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

References

American Statistical Association (ASA) (2014), “Curriculum Guidelines for Undergraduate Programs in Statistical Science.”
Baumer, B. (2015), “A Data Science Course for Undergraduates: Thinking With Data,” The American Statistician, 69, 334–342.
Dietrich, S. W., and Goelman, D. (2017), “Formative Self-Assessment for Customizable Database Visualizations: Checkpoints for Learning,” in 2017 ASEE Annual Conference & Exposition, Columbus, OH: ASEE Conferences.
Dietrich, S. W., Goelman, D., Borror, C. M., and Crook, S. M. (2015), “An Animated Introduction to Relational Databases for Many Majors,” IEEE Transactions on Education, 58, 81–89.
Friendly, M. (2017), “Lahman: Sean ‘Lahman’ Baseball Database,” R Package Version 6.0-0.
Goelman, D., and Dietrich, S. W. (2018), “A Visual Introduction to Conceptual Database Design for all,” in SIGCSE ’18 Proceedings of the 49th ACM Technical Symposium on Computer Science Education, Baltimore, MD. ACM, pp. 320–325.
Grothendieck, G. (2017), “sqldf: Manipulate R Data Frames Using SQL,” R Package Version 0.4-11.
Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B., Hall-Holt, O., Murrell, P., Peng, R., Roback, P., Lang, D. T., and Ward, M. D. (2015), “Data Science in Statistics Curricula: Preparing Students to ‘Think With Data’,” The American Statistician, 69, 343–353.
Higgins, J. J. (1999), “Nonmathematical Statistics: A New Direction for the Undergraduate Discipline,” The American Statistician, 53, 1–6.
Horton, N. J., Baumer, B. S., and Wickham, H. (2015), “Setting the Stage for Data Science: Integration of Data Management Skills in Introductory and Second Courses in Statistics,” CHANCE, 28, 40–50.
Martonosi, S. E., and Williams, T. D. (2016), “A Survey of Statistical Capstone Projects,” Journal of Statistics Education, 24, 127–135. doi:10.1080/10691898.2016.1257927
R Core Team (2017), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing.
Rudo, P. (2014), “6 Important Stages in the Data Processing Cycle.”
SAS Institute Inc. (2019), “Methods for Accessing Relational Database Data.”
Wickham, H. (2014), “Tidy Data,” Journal of Statistical Software, 59, 1–23.
Wickham, H., Francois, R., Henry, L., and Müller, K. (2017), “dplyr: A Grammar of Data Manipulation,” R Package Version 0.7.4.
Wickham, H., and Grolemund, G. (2017), R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (1st ed.), Newton, MA: O’Reilly Media, Inc.