Introducing Data Science Techniques by Connecting Database Concepts and Dplyr

Journal of Statistics Education

ISSN: (Print) 1069-1898 (Online) Journal homepage: https://www.tandfonline.com/loi/ujse20


Jennifer E. Broatch, Suzanne Dietrich & Don Goelman

To cite this article: Jennifer E. Broatch, Suzanne Dietrich & Don Goelman (2019) Introducing
Data Science Techniques by Connecting Database Concepts and dplyr, Journal of Statistics
Education, 27:3, 147-153, DOI: 10.1080/10691898.2019.1647768

To link to this article: https://doi.org/10.1080/10691898.2019.1647768

Published online: 16 Sep 2019.
JOURNAL OF STATISTICS EDUCATION
2019, VOL. 27, NO. 3, 147–153
https://doi.org/10.1080/10691898.2019.1647768

DATA SCIENCE



Jennifer E. Broatch (a), Suzanne Dietrich (a), and Don Goelman (b)
(a) School of Mathematics and Natural Sciences, Arizona State University, Phoenix, AZ; (b) Department of Computing Sciences, Villanova University, Villanova, PA

ABSTRACT
Early exposure to data science skills, such as relational databases, is essential for students in statistics as well as many other disciplines in an increasingly data driven society. The goal of the presented pedagogy is to introduce undergraduate students to fundamental database concepts and to illuminate the connection between these database concepts and the functionality provided by the dplyr package for R. Specifically, students are introduced to relational database concepts using visualizations that are specifically designed for students with no data science or computing background. These educational tools, which are freely available on the Web, engage students in the learning process through a dynamic presentation that gently introduces relational databases and how to ask questions of data stored in a relational database. The visualizations are specifically designed for self-study by students, including a formative self-assessment feature. Students are then assigned a corresponding statistics lesson to utilize statistical software in R within the dplyr framework and to emphasize the need for these database skills. This article describes a pilot experience of introducing this pedagogy into a calculus-based introductory statistics course for mathematics and statistics majors, and provides a brief evaluation of the student perspective of the experience. Supplementary materials for this article are available online.

KEYWORDS
Data science; Databases; Education; Teaching tool

CONTACT Jennifer E. Broatch [email protected] School of Mathematics and Natural Sciences, Arizona State University, PO Box 37100, Mail Code 2352, Phoenix, AZ 85069-7100.
Supplementary materials for this article are available online. Please go to www.tandfonline.com/ujse.
© 2019 The Author(s). Published with license by Taylor and Francis Group, LLC.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The moral rights of the named author(s) have been asserted.

1. Introduction

Data wrangling and database skills are essential to statistics careers, yet many students (both majors and nonmajors) are not exposed to these concepts. Horton, Baumer, and Wickham (2015) suggest that students develop data science skills early and often, beginning with the introductory course, and that early exposure is critical. Similarly, the Curriculum Guidelines for Undergraduate Programs in Statistical Science prepared by the American Statistical Association (ASA) (2014) emphasize the increased importance of data science and the need for students to be “facile with database systems” (for an early discussion see Higgins (1999)). Since database systems provide efficient, shared access to persistent data, the understanding of these systems is critical to asking questions about the data. Data wrangling or “data processing” converts the raw data contained in the database into meaningful information (Rudo 2014), and it is an essential component in the data analysis cycle highlighted by “Tidy” and “Transform” in Figure 1 (Wickham and Grolemund 2017, p. 3). One must strategically manipulate and process the data prior to visualization, modeling, and the communication of results, yet this part of the cycle is often omitted in an introductory course. This article will highlight an integrated set of activities to assist in introducing these essential data manipulation concepts.

Although omitted in many statistics courses, the “Tidy” and “Transform” aspects of the data cycle (Wickham 2014, p. 3) are included in newly developed data science courses. In a review and discussion of seven exemplar data science courses, Hardin et al. (2015) note that these examples all include relational databases and SQL as a topic in a semester-long data science course. The data science course discussed in Baumer (2015) includes a three week section on data manipulation/wrangling where students “learn to perform the most fundamental data operations in both R (R Core Team 2017) and SQL and are asked to think about their connection” (Baumer 2015, p. 37). An introductory statistics class does not have three weeks to focus on data manipulation.

This article reports on the experience of incorporating fundamental aspects of data manipulation within the context of an introductory statistics course. The activities presented will focus on basic data manipulation in R and its relationship to SQL. These elementary database skills are introduced by integrating the computer science education product from the Databases for Many Majors (DBMM) project (http://databasesmanymajors.faculty.asu.edu/) into a statistics classroom. The skills are then applied in the context of the dplyr statistical package (Wickham et al. 2017) for R to focus on the statistical application of these skills. This experience should also be useful for data science courses by providing a conceptual overview of the connection
between fundamental data operations in R and SQL before delving into the deeper details of data manipulation and analysis.

Figure 1. Featured aspects of the data analysis cycle (Wickham and Grolemund 2017, p. 3) are highlighted in the rectangle.

The goal of the Databases for Many Majors project is to provide engaging modules to visually introduce fundamental database concepts to students with diverse backgrounds. There are three visualizations with supporting curricula that cover various database aspects. The first introduces relational databases and how they differ from spreadsheets. The second covers querying of relational databases; and the third discusses the conceptual design of data, which explains how to model data and then map the design to a relational database schema. Each of these topics is contextualized within multiple STEM domains that utilize databases, specifically in Astronomy, Computational Molecular Biology, Environmental Science/Ecology, Forensics, Geographic Information Systems, and Sports Statistics, to attract students of all majors (not just statistics/data science majors) and to promote relevance to a variety of students. This early exposure and introduction to database topics has a broad audience and can be used in any course that wants to promote early data science skills, including introductory statistics.

Data can be manipulated and queried using a number of tools. SQL is a tool used in the database field for database manipulation and querying. In statistics, R is a comprehensive open source software that is capable of all aspects of the data analysis cycle, and the dplyr package supports the process of manipulating, sorting, summarizing, and joining data frames and efficiently storing and accessing large amounts of data within R. “dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges” (Wickham et al. 2017; see https://dplyr.tidyverse.org/ for more details). Although one can directly use SQL within R in package sqldf (Grothendieck 2017), the desired learning objective is for students to understand how dplyr supports these fundamental database concepts and to apply this knowledge in the context of problem solving in statistics.

This article describes an experience of incorporating the database visualizations for relational databases and querying within the context of an introductory statistics course, followed by the application of these concepts in exercises within the dplyr framework. The integration of the visualizations and dplyr activities was piloted in a calculus-based Introductory Statistics Course for mathematics and statistics majors/minors.¹ Section 2 overviews the concepts introduced within the database visualizations that students were asked to complete outside of class-time. Section 3 describes the connections between these database concepts and the dplyr package for R, and the corresponding statistics activities that utilize the five main dplyr verbs: select, filter, arrange, summarize, and mutate. The presented activities allow students to translate database skills learned in the visualization into the dplyr framework. The article concludes with a discussion of the experience, student perspectives, and future research directions.

¹ This course is a year-long introduction to statistics and probability with a calculus requirement, often referred to as Engineering Statistics. This is the students’ first course in University-level Statistics.

2. Database Visualizations

The database visualizations are three separate but related interactive animations that introduce the concepts of relational databases and querying using SQL (Dietrich et al. 2015), and database design (Goelman and Dietrich 2018). The objective is to provide all faculty, including introductory statistics instructors, with self-contained database animations that they can use to supplement their curricula. The visualizations can be assigned as an out-of-class activity, with each animation taking about an hour to complete. Each module contains a formative self-assessment component, known as a checkpoint, that can be assigned to students to complete before class. It is recommended to provide students with a performance goal for the checkpoints. There are additional instructor resources available that consist of cooperative learning exercises to use in class, if desired. The visualizations are also customizable and are available in different application domains to promote relevance to a variety of students. The visualizations and resources are freely available at http://databasesmanymajors.faculty.asu.edu/.

This article reports on an experience that incorporates the first two modules on relational databases and querying into an introductory statistics course. Both of these modules have been previously introduced into a computational molecular biology class as well as database courses for nonmajors and majors across two major universities. Dietrich et al. (2015) support the pedagogical effectiveness of the visualizations. After this study, the checkpoint was added to the visualizations so that students can check the status of their learning (Dietrich and Goelman 2017). Students appreciate the opportunity to quiz themselves on the topics within the animations and typically use the visualizations multiple times to reinforce concepts and review for exams. The reader is encouraged to run the visualizations customized for sports statistics, specifically baseball, that are described in this section. An appendix, within the online visualization, provides the information on how to run the animations.

2.1. IntroDB: Introduction to Relational Databases

The IntroDB module promotes the basic understanding behind relational databases. Databases provide a powerful tool to ask different questions, or queries, of the data without changing the data. In this module students learn: the limitations of spreadsheets, the breakdown of spreadsheets into smaller tables to avoid redundancies, the introduction of primary and foreign keys and how a database uses keys to identify and relate information, as well as a brief introduction to asking questions over a database.
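
The breakdown of a spreadsheet into keyed tables can also be tried directly in R. The following is a minimal sketch rather than part of the visualization: it assumes a tiny invented spreadsheet, borrows the table names Hitters, Pitchers, and MatchUps and the key names HName and PName from the Sports Statistics customization, and treats every other column name (BatAvg, ERA, MatchAvg), the helper is_key, and all values as hypothetical.

```r
library(dplyr)

# Hypothetical flat "spreadsheet": hitter and pitcher details repeat on every row.
# HName and PName follow the visualization; all other names and values are invented.
sheet <- tibble(
  HName    = c("Ann", "Ann", "Bo", "Bo"),
  BatAvg   = c(0.310, 0.310, 0.275, 0.275),
  PName    = c("Cy", "Dee", "Cy", "Dee"),
  ERA      = c(2.90, 3.40, 2.90, 3.40),
  MatchAvg = c(0.250, 0.400, 0.300, 0.200)
)

# Break the sheet down into three tables without unnecessary repetition.
hitters  <- sheet %>% distinct(HName, BatAvg)
pitchers <- sheet %>% distinct(PName, ERA)
matchups <- sheet %>% select(HName, PName, MatchAvg)

# Primary key check: the key columns must identify each row uniquely.
is_key <- function(tbl, ...) nrow(distinct(tbl, ...)) == nrow(tbl)
hitters_key_ok  <- is_key(hitters, HName)
pitchers_key_ok <- is_key(pitchers, PName)
matchups_key_ok <- is_key(matchups, HName, PName)  # composite key

# Foreign key check: every key value used in matchups must exist in the parent table.
orphan_hitters  <- anti_join(matchups, hitters,  by = "HName")
orphan_pitchers <- anti_join(matchups, pitchers, by = "PName")

# Joining on the keys rebuilds the original rows, so no information was lost.
rebuilt <- matchups %>%
  inner_join(hitters,  by = "HName") %>%
  inner_join(pitchers, by = "PName")
```

If either orphan table had rows, the corresponding foreign key would point at a hitter or pitcher that does not exist, which is the kind of anomaly the keyed design is meant to rule out.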

In the Sports Statistics customization of the visualization, baseball statistics relate Hitters and Pitchers with information on the match-ups between the hitter and pitcher. The visualization illustrates how spreadsheets with redundant data may have anomalous situations when updating, deleting, and inserting data. Databases avoid these issues by breaking down the data into separate tables without unnecessary repetition, as illustrated by the Breakdown topic of the visualization. Figure 2 shows that the given spreadsheet combines three concepts: Hitters, Pitchers, and MatchUps. The Hitters button has been selected, showing how the concept would be stored in a table, where the green rows represent the resulting data (in the Hitters table), the red strike-through rows are redundant data, and the gray data are not relevant to that particular concept. By selecting each button, the user sees how the data in the database tables are created.

Figure 2. IntroDB: Sports statistics spreadsheet breakdown.

Thus, a database is a collection of tables without unnecessary repetition. These tables must be combined to answer certain questions over the data. The associations between the tables are formed with the concept of primary and foreign keys. Figure 3 shows a screen shot of the Baseball Statistics database, indicating the primary keys with a gold key and foreign keys with a different-shaped orange key. Primary keys are attribute(s) that uniquely identify a row in a table, such as HName in Hitters and PName in Pitchers. Typically, every table has a primary key, which may consist of one attribute or a combination of several attributes. Note that the MatchUps table has a composite primary key consisting of the combination of HName and PName (shown linked in the screen shot), uniquely identifying the matchup average for that hitter and pitcher combination. A foreign key consists of attribute(s) in a table that are referencing the value of a primary key in another table, such as HName in MatchUps and PName in MatchUps. Hitters and Pitchers do not have any foreign keys because they do not contain a primary key of another table. In the screen shot, the orange foreign key next to HName has been selected, highlighting the HName column in MatchUps in orange and highlighting the HName column in Hitters in gold. This visualizes that a value of the foreign key (HName in MatchUps) must appear as a value of the primary key in the related table (HName in Hitters). Databases use this relationship between the primary and foreign keys to combine tables together when needed to answer a query, which is further elaborated on in the querying visualization.

Figure 3. IntroDB: Baseball statistics tables with keys.

2.2. QueryDB: Introduction to Querying

The Introduction to Querying module provides a conceptual introduction to the various operations required to retrieve data from a database to answer a question. The visualization of these operations and their corresponding specification in SQL provides a strong foundation for students to use SQL to query relational databases. In this module students learn various operations for combining data to answer queries, such as common set operators, horizontal and vertical filtering, and joins. In addition, these operations are mapped to the SQL industry standard query language so students can understand basic SQL syntax.

Figure 4. QueryDB: Baseball statistics query design.

QueryDB assumes that students have already viewed the IntroDB visualization, and focuses more on how to answer queries by filtering and combining tables. When designing a query, it is important to know the tables, their attributes, and the primary and foreign keys. Figure 4 provides this abstraction of a database, known as a schema, for the Baseball Statistics application. This is a visual schema, showing the primary and foreign keys with these associations between the tables illustrated through links. The highlighted rows illustrate the design of the query mentioned in the screen shot, which finds the match-up average for a hitter with the given batting average against a particular pitcher. The highlighting is built dynamically as the student is walked through the design of that query using the given information and the primary-foreign key associations between the tables. The visualization then introduces by example the fundamental set operators (union, intersection, negation) that operate on tables having the same format. Students
are then presented with new operators that filter the tables both horizontally and vertically as well as operators that combine tables on the primary-foreign keys to provide a larger table needed to answer a query. The latter are called joins. After introducing these various operators, the industry standard SQL for querying databases is introduced and a textual query is built up incrementally along with a visual representation of the query, as shown in Figure 5. The SQL query provides the answer to the query designed in Figure 4, which incorporates horizontal filtering (batting average and pitcher name) and vertical filtering (matchup average) in combination with a join (hitter name).

Figure 5. QueryDB: Baseball statistics SQL query with visual representation.

The key concepts from IntroDB and the operations for manipulating and combining data in QueryDB are definitely applicable to statistics as an essential part of the data analysis cycle. The dplyr package for R also uses these concepts for data manipulation, as illustrated in the next section.

3. dplyr Connection

This introduction to data science concepts was embedded within the context of an introductory statistics course, by exposing students to questions from datasets requiring some “data wrangling” skills, which are important in both databases and statistics. In the database field, SQL is the industry standard language that provides extensive support for asking questions over data stored in the database, including some fundamental operations for data analysis, such as sum, minimum, maximum, average, and count. In statistics, R is a tool that offers extensive data visualization and analysis tools, and dplyr is the package that manipulates the data for analysis. The learning objective of the activities is for students to understand the fundamental database concepts and data manipulation operations illustrated in the visualizations and to apply these techniques to statistics activities. Although students are learning how to apply these operations in the context of the syntax of SQL and R, specifically dplyr, the ultimate goal is to have students truly understand the operations so that they can apply these concepts to other contexts and languages even as technology changes. Thus, other popular packages, such as SAS, can also be used to analyze and access a relational database utilizing the basic operations taught within the visualizations (SAS Institute Inc. 2019).

The data wrangling connections emphasized in the activities highlighted essential data manipulation of a dataset and the combination of datasets. A first step in data wrangling is the ability to perform basic data queries and to reduce the dataset to only the variables and characteristics of interest, arranging the data for communication of results. A second essential step in data wrangling is the ability to combine various datasets correctly to answer more complicated questions. Table 1 shows the visualization concept and its correspondence in SQL and dplyr. Most of these connections, with the exception of the set based operations (union, intersection, difference), were incorporated within the activities assigned to the students.

The first student activity focuses on essential data wrangling without the combination of datasets. This activity should be assigned as early in the term as possible after a discussion on basic descriptive statistics and basic visualizations. The experience with the integration of the activity is described briefly in Section 4 and the activity itself is included in the supplementary materials. Students were asked to “Find the average height of human Starwars Characters” using the Star Wars characters dataset included in dplyr as the “starwars” tibble. This question requires the student to select the variable height, filter the
Table 1. Connections between visualizations, SQL, dplyr.

Visualization          SQL              dplyr
vertical filtering     SELECT           select
horizontal filtering   WHERE            filter
join                   FROM ... JOIN    inner_join
ordering               ORDER BY         arrange
union                  UNION            union
intersection           INTERSECT        intersect
difference             EXCEPT           setdiff

Table 3. Syntax comparison: joining batting with salaries.

dplyr   Batting %>%
          filter(yearID >= 1985) %>%
          inner_join(Salaries, by=c("playerID", "yearID", "teamID"))

SQL     select b.*, s.salary
        from batting b inner join salaries s
          on b.playerID = s.playerID and b.yearID = s.yearID and b.teamID = s.teamID
        where b.yearID >= 1985

NOTE: The Batting and Salaries datasets are found in Friendly (2017).
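
The rows of Table 1 can be exercised on a toy version of the baseball tables. The sketch below is illustrative only and assumes invented data: the table names and the HName and PName columns follow the visualizations, while BatAvg, MatchAvg, and every value are hypothetical. It chains a join, horizontal filtering, vertical filtering, and ordering, and then applies the three set operators to tables that share the same format.

```r
library(dplyr)

# Toy tables; all values and the BatAvg / MatchAvg column names are invented.
hitters <- tibble(
  HName  = c("Ann", "Bo", "Cal"),
  BatAvg = c(0.310, 0.275, 0.310)
)
matchups <- tibble(
  HName    = c("Ann", "Ann", "Bo", "Cal"),
  PName    = c("Cy", "Dee", "Cy", "Cy"),
  MatchAvg = c(0.250, 0.400, 0.300, 0.350)
)

# Match-up averages against pitcher "Cy" for hitters batting above .300.
result <- hitters %>%
  inner_join(matchups, by = "HName") %>%     # join (FROM ... JOIN)
  filter(BatAvg > 0.300, PName == "Cy") %>%  # horizontal filtering (WHERE)
  select(HName, MatchAvg) %>%                # vertical filtering (SELECT)
  arrange(desc(MatchAvg))                    # ordering (ORDER BY)

# Set operators require tables with the same columns.
faced_cy  <- matchups %>% filter(PName == "Cy")  %>% select(HName)
faced_dee <- matchups %>% filter(PName == "Dee") %>% select(HName)

either  <- union(faced_cy, faced_dee)      # UNION
both    <- intersect(faced_cy, faced_dee)  # INTERSECT
only_cy <- setdiff(faced_cy, faced_dee)    # EXCEPT
```

Reading the pipeline top to bottom mirrors the SQL clauses in the middle column of Table 1, although SQL states them in a different order (SELECT first, ORDER BY last).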

Table 2. Syntax comparison: average height of human starwars characters.

dplyr   starwars %>%
          filter(species=='Human') %>%
          select(height) %>%
          summarize(avgheight=mean(height, na.rm=T))

SQL     select avg(height) as avgheight
        from Starwars
        where species = 'Human'

NOTE: Starwars is included in dplyr, which was exported as a data frame Starwars=as.data.frame(starwars) and imported into a relational database to create the corresponding table.

Table 4. Student responses to the integration of the visualizations and dplyr.

Question                                                                                   Number of students with positive responses
How much did the online visualization help your learning of SQL?                           20 of 25
How much did the online visualization help your learning of the basics of dplyr?           19 of 25
How much did the dplyr example help your understanding of the material?                    23 of 25

NOTE: A positive response is defined as either moderately help (3), much help (4), or great help (5). Five students did not respond.
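
Because the starwars tibble ships with dplyr, the query in Table 2 can be run as written. The sketch below, which is one possible way to stage the steps and not the activity handout itself, performs the three verbs one at a time without the pipe and then repeats them as a single piped expression; the intermediate names humans, heights, avg, and piped are illustrative.

```r
library(dplyr)

# Step by step, without the pipe operator.
humans  <- filter(starwars, species == "Human")   # horizontal filtering (WHERE)
heights <- select(humans, height)                 # vertical filtering (SELECT)
avg     <- summarize(heights, avgheight = mean(height, na.rm = TRUE))

# The same query written as one piped expression, as in Table 2.
piped <- starwars %>%
  filter(species == "Human") %>%
  select(height) %>%
  summarize(avgheight = mean(height, na.rm = TRUE))
```

Both forms return a one-row tibble with the single column avgheight; na.rm = TRUE drops characters whose height is missing before the mean is taken.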

dataset to only include human characters, and summarize to find the mean or average height. Table 2 displays the corresponding dplyr and SQL syntax. Both dplyr and SQL rename the resulting column as avgheight. Note that the activities do not require the use of the pipe operator %>%. The actual activity is broken down into smaller pieces. The pipe operator is shown here to provide a unified syntax that corresponds to the SQL query.

Although not explored as part of this initial pilot activity, the students can explore entering SQL syntax directly in R utilizing the sqldf package (Grothendieck 2017). Since sqldf uses data frames in R and not dplyr tibbles, the starwars tibble should be exported to a data frame for use in sqldf [Starwars = as.data.frame(starwars)]. The SQL syntax shown in Table 2 would be a quoted parameter of the function sqldf. This would be an interesting connection to add to future versions of the assignment.

A second activity was created to reinforce the concept of a join, which combines datasets, within a statistical question. The second activity should be introduced after a discussion of one-way analysis of variance or linear regression. Again, the activity itself is included in the supplementary materials. A question is posed to the students using the Lahman package (Friendly 2017), which provides the tables from Sean Lahman's Baseball Database as R data frames: “Are higher salaries in Major League Baseball related to higher batting statistics post 1985?” To answer this question, students must first filter the Batting dataset for seasons post 1985 and then combine the Batting and Salaries data frames on the combination of attributes linking the data using a join in dplyr. Table 3 compares the syntax for dplyr and SQL for the question posed to the students. Again, the pipe operator is introduced for correspondence with SQL. Also, the join connection can be further explored by asking students to provide a corresponding SQL specification of their dplyr commands, which is then tested using the sqldf package.

The connection between database and statistics is an integral component of data science. By learning fundamental database concepts and querying, statistics students will be able to apply these techniques in both contexts. Students are able to retrieve the data that they need from databases and further analyze and visualize that data using dplyr in R. The experience of integrating the above activities in the classroom, along with students’ perspectives on the experience, is discussed in the next section.

4. Discussion

To briefly assess the perceived impact of the integrated materials, students (n = 30) were given a survey to answer both Likert-type and open-ended questions about the integration experience after the first activity. Student responses to the integration activity presented in Table 4 were mostly positive, and most students reported that the visualizations support their learning and understanding of the problem. Representative feedback is included below.

• “I liked how the class involved another aspect to our course such as SQL. I feel like more self taught projects and examples would be really beneficial for us the students to expand our horizons of what is out there.”
• “I appreciated the chance to get a flavor of data science in the class.”

The skills learned in the animations are transferable to other platforms. For example, the first animation introduces tables that are conceptually linked using primary and foreign keys. In fact, 96% of students identified the proper composite primary key and successfully joined the two tables for analysis. The concepts presented provide a foundation for learning database querying in any language including R (see Wickham 2014 for an R example). Although the querying visualization focuses on SQL, it was noted by Hardin et al. (2015) (when referring to a course by Wickham) “while each language may have its own syntax the underlying operation that is being performed on the data is the same.” The animations can also assist students in Senior Capstone courses, like those reviewed in
Martonosi and Williams (2016), to bridge the gap between students’ statistical training and the data manipulation and management challenges of the real world.

This integration of database skills in an introductory statistics course is promising. The pilot described in this article provided an initial experience with the introduction of database visualizations of fundamental concepts into an introductory statistics course. Students appreciated the visualization of the concepts and the ability to apply these concepts in R with dplyr, and were able to successfully perform a more complicated data manipulation by the second activity.

There are additional benefits of introducing database skills early in this course. After the students were able to perform basic data manipulation, larger, nontrivial datasets were used in homework assignments. Students were required to manipulate the data using the five verbs to accurately answer the questions in the homework, expanding their experience beyond the textbook.

The synergy of databases and statistics is an important learning objective for an introductory statistics course. Future work will explore additional avenues and opportunities to revise and incorporate synergistic activities into the curriculum.

Supplementary Materials

The Database handout includes a brief description of key dplyr “verbs” to query a dataset just like the DBMM visualization. The five key data manipulation “verbs” are presented with their corresponding SQL commands. The Verb Activity is a series of questions that require the application of all five verbs presented. The Join Activity is an active learning activity that connects the Database Concepts and dplyr utilizing a sports statistics example in the R package: Lahman. This activity reinforces the concept of a join, which combines datasets, within a statistical question.

Funding

This material is based upon work supported by the National Science Foundation under grant nos. DUE-1431848, DUE-1431661, DUE-0941584, and DUE-0941401. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

References

American Statistical Association (ASA) (2014), “Curriculum Guidelines for Undergraduate Programs in Statistical Science.”
Baumer, B. (2015), “A Data Science Course for Undergraduates: Thinking With Data,” The American Statistician, 69, 334–342.
Dietrich, S. W., and Goelman, D. (2017), “Formative Self-Assessment for Customizable Database Visualizations: Checkpoints for Learning,” in 2017 ASEE Annual Conference & Exposition, Columbus, OH: ASEE Conferences.
Dietrich, S. W., Goelman, D., Borror, C. M., and Crook, S. M. (2015), “An Animated Introduction to Relational Databases for Many Majors,” IEEE Transactions on Education, 58, 81–89.
Friendly, M. (2017), “Lahman: Sean ‘Lahman’ Baseball Database,” R Package Version 6.0-0.
Goelman, D., and Dietrich, S. W. (2018), “A Visual Introduction to Conceptual Database Design for all,” in SIGCSE ’18 Proceedings of the 49th ACM Technical Symposium on Computer Science Education, Baltimore, MD. ACM, pp. 320–325.
Grothendieck, G. (2017), “sqldf: Manipulate R Data Frames Using SQL,” R Package Version 0.4-11.
Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B., Hall-Holt, O., Murrell, P., Peng, R., Roback, P., Lang, D. T., and Ward, M. D. (2015), “Data Science in Statistics Curricula: Preparing Students to ‘Think With Data’,” The American Statistician, 69, 343–353.
Higgins, J. J. (1999), “Nonmathematical Statistics: A New Direction for the Undergraduate Discipline,” The American Statistician, 53, 1–6.
Horton, N. J., Baumer, B. S., and Wickham, H. (2015), “Setting the Stage for Data Science: Integration of Data Management Skills in Introductory and Second Courses in Statistics,” CHANCE, 28, 40–50.
Martonosi, S. E., and Williams, T. D. (2016), “A Survey of Statistical Capstone Projects,” Journal of Statistics Education, 24, 127–135. doi:10.1080/10691898.2016.1257927
R Core Team (2017), R: A Language and Environment for Statistical Computing, Vienna, Austria: R Foundation for Statistical Computing.
Rudo, P. (2014), “6 Important Stages in the Data Processing Cycle.”
SAS Institute Inc. (2019), “Methods for Accessing Relational Database Data.”
Wickham, H. (2014), “Tidy Data,” Journal of Statistical Software, 59, 1–23.
Wickham, H., Francois, R., Henry, L., and Müller, K. (2017), “dplyr: A Grammar of Data Manipulation,” R Package Version 0.7.4.
Wickham, H., and Grolemund, G. (2017), R for Data Science: Import, Tidy, Transform, Visualize, and Model Data (1st ed.), Newton, MA: O’Reilly Media, Inc.
