Curriculum Guidelines For Undergraduate Programs in Data Science
ST04CH02-De-Veaux ARI 16 December 2016 16:19
1. INTRODUCTION
Data science is experiencing rapid and unplanned growth, spurred by the proliferation of complex
and rich data in science, industry, and government. Fueled in part by reports, such as the widely
cited McKinsey report (McKinsey Global Inst. 2011), that forecast a need for hundreds of thousands
of data science jobs in the next decade, data science programs have proliferated in academia
as university administrators have rushed to meet the demand. The website
http://datascience.community/colleges currently lists 530 programs in data science, analytics, and related fields
at more than 200 universities around the world. The vast majority of these are master’s degree
and certificate programs offered both traditionally and online. Although PhD programs in data
science (or data analytics) are still relatively rare, there has been rapid growth of undergraduate
programs at both research institutions and liberal arts colleges. We expect this number to increase
significantly in the near future.
The 2016 Park City Mathematics Institute (PCMI), sponsored by the National Science
Foundation (NSF) and the Institute for Advanced Study at Princeton, held a workshop focused
on the task of producing curriculum guidelines for an undergraduate degree in data science.
Twenty-five faculty members, including computer scientists, statisticians, and mathematicians from a
variety of liberal arts colleges and research universities, met for three weeks to discuss our vision
for data science in an undergraduate context, what activities and skills we thought would be
necessary for a data science program, and how we could imagine implementing such a major both
currently and in the future. These guidelines are the product of that effort.
We have based our guidelines for an undergraduate data science major on a ten semester-
course major common among the liberal arts colleges, realizing that research universities typically
add several courses to that. We do not intend that these guidelines be prescriptive, but rather
we hope that they will serve to inform and enumerate the core skills that a data science major
should have before graduation. We started with the reports from the NSF Workshop on Data
Science Education (Cassel & Topi 2015), the AALAC (Alliance to Advance Liberal Arts Colleges)
conference “Teaching Big Data in the Liberal Arts Context,” and the guidelines for undergraduate
majors in mathematics, statistics, and computer science (see the sidebar Curriculum Guidelines
in Related Disciplines).
We begin by discussing the background and some guiding principles that informed our think-
ing in Section 2, then consider skills that students should develop while pursuing the major in
Section 3, and finally summarize key curriculum topics in Sections 4 and 5. We show a possible
selection of current courses that cover most of the basics of our identified skills in Section 6.
However, it is important to point out that this smorgasbord approach to course selection is less
than ideal. We believe that many of the courses traditionally found in computer science, statistics,
and mathematics offerings should be redesigned for the data science major in the interests of
efficiency and the potential synergy that integrated courses would offer. Relying on existing courses
at most institutions, a student might have to take 14 or more courses in order to obtain all the
skills one would expect from a data science major. With some significant course redesign, we think
that this number could be substantially reduced to fit into the constraints of a typical ten-course
liberal arts major. Details of those courses are found in the Supplemental Appendix (provided
in the supplemental material; follow the Supplemental Material link from the Annual Reviews
home page at http://www.annualreviews.org).
CURRICULUM GUIDELINES IN RELATED DISCIPLINES
2015 CUPM Curriculum Guide to Majors in the Mathematical Sciences (MAA 2015):
http://www.maa.org/sites/default/files/pdf/CUPM/pdf/CUPMguide_print.pdf
Computer Science Curricula 2013: Curriculum Guidelines for Undergraduate Degree Programs in Computer Science
(ACM/IEEE 2013): https://www.acm.org/education/CS2013-final-report.pdf
Curriculum Guidelines for Undergraduate Programs in Statistical Science (ASA 2014b):
http://www.amstat.org/education/pdfs/guidelines2014-11-15.pdf
There is still considerable debate about exactly what the science of data science is, but prominent scientists such as
David Donoho, Michael Jordan, and others suggest that there is a science at the core and that it will continue to
evolve. As Donoho says, “Fortunately, there is a solid case for some entity called ‘Data Science’ to be created, which
would be a true science: facing essential questions of a lasting nature and using scientifically rigorous techniques
to attack those questions” (Donoho 2015, p. 10). Regardless of the consensus (or lack thereof ) surrounding the
evolution of the science of data science, a data science program at the undergraduate level provides a synergistic
approach to problem solving, one that leverages the content in all three disciplines. We believe that a data science
program will serve students well whether they join the marketplace or continue on to more advanced study.
of data, the curriculum can be significantly streamlined and enhanced. The integration of courses,
focused on data, is a fundamental feature of an effective data science program and results in a
synergistic approach to problem solving.
This document outlines the core knowledge and methods that data science students should
master. Our position is that, ideally, new courses should be developed to take advantage of the
efficiencies and synergies that an integrated approach to data science would provide. However,
because not all institutions will be able to create many new courses immediately, we suggest which
traditional courses might provide coverage of the basic topics of the major. We also propose a
model of an integrated curriculum to serve as a possible blueprint for the future.
the analysis of data provides an opportunity for students to gain experience with the interplay between
abstraction and context that is critical for the mathematical sciences major to master. Experience with
data analysis is particularly important for majors entering the workforce directly after graduation, for
students with interests in allied disciplines, and for students preparing to teach secondary mathematics.
(MAA 2004, p. 45)
Statisticians, naturally, feel the same. In 2002, a report by the ASA recommended that undergraduate
statistics curricula include a heavy emphasis on data analysis (perhaps more weight
should be given to the data than the analysis) (ASA 2002). By the same logic, students learn data
science by doing data science. The recursive data cycle should be a featured component of most
data science learning experiences, and projects involving group analysis and presentation should
be common throughout the curriculum. Capstone projects are also an essential component of the
experience, and internships fit naturally in a data science program.
Tukey accorded algorithmic models the same foundational status as the algebraic (data) models
that statisticians had favored in the previous half-century. The two pillars of computational and
statistical thinking should not be taught separately. The balance between them may change from
one course to another, but both should be present for the most effective and efficient teaching.
2.6. Flexibility
We must prepare students to learn new techniques and methods that may not exist today. They
will need to work with increasingly varied forms of data, or they will not be prepared for the jobs of
the future. We need to pay attention to the core foundations of mathematical, computational, and
statistical thinking and practice while incorporating the practical and important data science skills.
Data science, at all levels, is evolving and changing quickly. Most institutions will implement
a data science major from current courses in existing disciplines, perhaps transitioning to more
fully integrated courses as outlined in the Supplemental Appendix (provided in the supplemental
material; follow the Supplemental Material link from the Annual Reviews home page at
http://www.annualreviews.org) at a future date. Our hope is that institutions use these guidelines
in their planning to meet the needs of their students both now and in the future. We fully
expect that institutions will regularly review their programs to reflect new developments in this
fast-evolving field.
efficiency. Data scientists must be capable of adapting smoothly to such changes. Data
scientists should understand both the computational and modeling challenges in their work,
and how they might be intertwined. For example, data scientists should recognize that fitting
a given model on a particular set of data will engender computational challenges, and they
should have some facility for implementing a solution that may involve either a modification
of the model or a change in the computing environment, or both. When integrated with
statistical thinking, computational thinking greatly amplifies the ability of data scientists to
distribute solutions to clients, understand many modern statistical modeling approaches,
and achieve scientific reproducibility.
seeing the value of mathematical methods while understanding their limitations. Data science
students should also develop a geometric, intuitive, visual way of thinking throughout their
mathematical training. We propose a two-semester Mathematics for Data Science sequence
that begins with students who would have placed into Calculus 1. This sequence emphasizes
mathematical modeling, especially linear and polynomial models (see the Supplemental
Appendix), and would include the following topics.
- Mathematical structures (e.g., functions, sets, relations, and logic)
- Linear modeling and matrix computation (e.g., matrix algebra and factorization,
eigenvalues/eigenvectors, and projection/least-squares)
- Optimization (e.g., calculus concepts related to differentiation)
- Multivariate thinking (e.g., concepts and numerical computation of multivariate derivatives
and integrals)
- Probabilistic thinking and modeling (e.g., counting principles, univariate and multivariate
distributions, and independence, relying often on computational simulations)
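As an illustration of the projection/least-squares topic listed above, the following minimal sketch fits a straight line by solving the normal equations in closed form. The helper names `fit_line` and `residuals` are hypothetical and chosen only for this example; the guidelines do not prescribe any particular implementation. The geometric point is that the fitted residuals are orthogonal to both the constant column and the predictor column, which is what "projection" means here.

```python
# Minimal sketch, assuming a single predictor: fit y ≈ intercept + slope * x.
# fit_line and residuals are illustrative names, not part of the guidelines.

def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Centered sums of squares and cross-products.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed minus fitted values for the line intercept + slope * x."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]
```

Calling `fit_line([0, 1, 2, 3], [1, 3, 5, 7])` recovers the line with intercept 1 and slope 2. For noisy data, summing the residuals, or the residuals times the x values, gives zero up to rounding error.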
Algorithms and Software Foundations. To develop a grounded computational ability,
a data science undergraduate should study foundational computer science topics and build
facility in algorithmic problem solving and development of software/programming.
- Algorithm design: Students must develop the skill set to understand the problem, break
it into manageable pieces, assess alternative problem solving strategies, and arrive at an
algorithm that efficiently solves the problem.
- Programming concepts and data structures: Students should have the knowledge to im-
plement their algorithms using procedural and functional programming techniques and
their associated data structures, including lists, vectors, data frames, dictionaries, trees,
and graphs.
- Tools and environments: Students should understand the appropriate use of tools and
packages available. Such packages enable programmatic access to data services and
input/output; perform data transformations, explorations, visualization, and analysis; and
assist in the development and maintenance of software, including development environ-
ments, and tools for versioning and tracking.
- Scaling for big data: As the data and processing associated with data science continue to
scale, data science undergraduates should develop the capacity to work with larger data
sets. They should be able to apply techniques in concurrent programming to build systems
that perform parallel processing of data. They must also be able to work with current and
new forms of distributed data storage as a part of the data management areas discussed
above. They should be knowledgeable in how to work with streaming data.
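To make the streaming-data point above concrete, here is a small sketch of a single-pass summary that never holds the full data set in memory. It uses Welford's algorithm for the running mean and variance, which is our choice for illustration and is not named in the guidelines. The class name `RunningStats` and the function `summarize` are hypothetical.

```python
# Minimal sketch, assuming numeric values arrive one at a time from any iterable.
# Welford's algorithm is an illustrative choice, not one named in the guidelines.

class RunningStats:
    """Single-pass mean and sample variance (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the running summaries."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); needs at least two values."""
        if self.n < 2:
            raise ValueError("variance needs at least two observations")
        return self._m2 / (self.n - 1)


def summarize(stream):
    """Consume an iterable (for example a generator) and return its RunningStats."""
    stats = RunningStats()
    for value in stream:
        stats.update(value)
    return stats
```

Because `summarize` accepts any iterable, the same code works on a list, a file read line by line, or a generator wrapping a network feed. Memory use stays constant regardless of how many values pass through.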
Data Curation—Databases and Data Management. A data science undergraduate major
must understand and be able to effectively apply principles of data management. This is much
broader than traditional database management and must include systems supporting the
volume and velocity attributed to big data. Thus a data science major must apply knowledge
of data query languages to relational databases and emerging large store NoSQL (not only
SQL) data systems, and must be able to access data from less-structured systems through
web services, lower-level access to data available across the Internet, and data sourced from
streams. Once data are collected, data management includes cleaning and initial structuring,
using the software knowledge and skills outlined above, and then transforming data into
structured forms required for exploration, visualization, and analysis.
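The pipeline described above (collect, clean, structure, then query) can be sketched with Python's built-in `sqlite3` module. Everything specific here is an assumption made for illustration: the `people` table, its columns, the sample rows, and the helper names `clean_row`, `load`, and `mean_age_by_city` are invented and do not come from the guidelines.

```python
import sqlite3

# Hypothetical messy input: stray whitespace, inconsistent case, bad ages.
RAW_ROWS = [
    ("  Alice ", "34", "Boston"),
    ("Bob", "", "boston"),        # missing age
    ("Carol", "29", " Chicago"),
    ("Dan", "n/a", "Chicago"),    # unparseable age
]


def clean_row(name, age, city):
    """Strip whitespace, normalize city case, and turn bad ages into None."""
    age = age.strip()
    return (name.strip(), int(age) if age.isdigit() else None, city.strip().title())


def load(rows):
    """Clean the rows and store them in an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER, city TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)",
                     [clean_row(*r) for r in rows])
    conn.commit()
    return conn


def mean_age_by_city(conn):
    """Return (city, row count, mean age) per city; AVG skips NULL ages."""
    query = """
        SELECT city, COUNT(*) AS n, AVG(age) AS mean_age
        FROM people
        GROUP BY city
        ORDER BY city
    """
    return conn.execute(query).fetchall()
```

Storing unusable ages as `None` (SQL `NULL`) rather than dropping the rows keeps the record count honest while letting the aggregate ignore the missing values. Handling missing data explicitly in this way is one instance of the judgment calls that data cleaning requires.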
Introduction to Statistical Models. This serves to introduce students to the statistical
analysis of data and the elements of a framework for inference. The foundation is linear
models, which are then compared to nonlinear approaches. The course builds on important
concepts introduced in the first-year data science courses that form the foundation of any
statistical analysis. All the ideas are firmly grounded in and inspired by real-world data.
- Exploratory data analysis approaches and graphical data analysis methods
- Estimation and testing: exposure to statistical (e.g., basic central limit theory and law of
large numbers) and algorithmic (e.g., bootstrap resampling methods) approaches to point
and interval estimation and hypothesis testing; likelihood theory; and Bayesian methods
- Simulation and resampling: Monte Carlo simulation of stochastic systems; resampling-
based inference (e.g., bootstrap, jackknife, permutations); basic understanding of design
of studies, surveys, and experiments (e.g., random assignment, random selection, data
collection, and efficiency) and issues of bias, causality, confounding, and coincidence
- Introduction to models: simple linear, multivariate, and generalized linear models; algo-
rithmic models (e.g., regression trees and nearest neighbors); and unsupervised learning
(e.g., clustering)
- Introduction to model selection and performance: regularization, parsimony and
bias/variance tradeoff; loss functions and model selection (e.g., cross-validation, penal-
ized regression, and ridge regression)
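The algorithmic approach to interval estimation mentioned in the list above can be sketched as a percentile bootstrap. This is one simple variant among several, chosen for brevity. The function names `bootstrap_ci` and `mean` and the default settings (2000 resamples, a 95% interval, a fixed seed) are assumptions for this example, not recommendations from the guidelines.

```python
import random


def mean(xs):
    """Arithmetic mean of a non-empty sequence."""
    return sum(xs) / len(xs)


def bootstrap_ci(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for stat(data).

    Resamples the data with replacement n_boot times, recomputes the
    statistic on each resample, and returns the empirical alpha/2 and
    1 - alpha/2 quantiles of those recomputed values.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    n = len(data)
    estimates = sorted(
        stat([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    # Clamp the quantile positions so they are always valid indices.
    lo_idx = max(0, round((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1)
    return estimates[lo_idx], estimates[hi_idx]
```

A call such as `bootstrap_ci(sample, mean)` returns a pair of endpoints. Passing a different function for `stat` (a median, a trimmed mean) reuses the same resampling machinery, which is part of the appeal of the approach for teaching.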
Statistical and Machine Learning. This course blends the algorithmic perspective of ma-
chine learning in computer science and the predictive perspective of statistical thinking.
Its focus is on the common machine learning methods and their application to problems
in various disciplines. The student will gain not only an understanding of the theoretical
foundations of statistical learning, but also the practical skills necessary for their successful
application to new problems in science and industry.
- Further exploration of alternatives to classical regression and classification
- Algorithmic analysis of models, addressing issues of scalability and implementation
- Performance metrics and prediction, and cross-validation
- Data transformations: re-expression of variables and feature creation, techniques of di-
mension reduction (e.g., principal component analysis), and smoothing and aggregating
- Supervised learning versus unsupervised learning
- Ensemble methods (e.g., boosting, bagging, and model averaging)
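Two of the topics above, an algorithmic alternative to classical classification and cross-validated performance, can be combined in a short sketch. It pairs a k-nearest-neighbors classifier with k-fold cross-validation. The function names, the interleaved fold assignment, and the toy data set `TOY_DATA` are all assumptions made for this illustration.

```python
from collections import Counter


def knn_predict(train, point, k=3):
    """Majority label among the k training points closest to `point`.

    `train` is a list of (features, label) pairs, with features as tuples.
    """
    def sq_dist(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], point))

    nearest = sorted(train, key=sq_dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def k_fold_accuracy(data, n_folds=5, k=3):
    """Average held-out accuracy over n_folds interleaved folds.

    Fold j holds out rows j, j + n_folds, j + 2 * n_folds, ... and trains
    on the rest. Assumes len(data) >= n_folds so no fold is empty.
    """
    accuracies = []
    for fold in range(n_folds):
        test = data[fold::n_folds]
        train = [row for i, row in enumerate(data) if i % n_folds != fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n_folds


# Hypothetical toy data: two well-separated clusters, alternating labels.
TOY_DATA = [
    ((0.0, 0.0), "a"), ((10.0, 10.0), "b"),
    ((0.0, 1.0), "a"), ((10.0, 11.0), "b"),
    ((1.0, 0.0), "a"), ((11.0, 10.0), "b"),
    ((1.0, 1.0), "a"), ((11.0, 11.0), "b"),
    ((0.5, 0.5), "a"), ((10.5, 10.5), "b"),
]
```

On real data the rows would normally be shuffled before being assigned to folds. The interleaved assignment is used here only to keep the example deterministic.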
Data in Context—Capstone Experience. A capstone experience in which students consider
scientific questions, collect and analyze data, and communicate the results.
A possible path through the major is shown in Figure 1.
5. ADDITIONAL CONSIDERATIONS
1. Graduate study. Students interested in graduate study in mathematics, statistics, or com-
puter science may consider taking more advanced courses in theoretical foundations. The
courses in mathematics for data science will not likely prepare a student for immediate
acceptance into a PhD program in one of the three disciplines.
2. Articulation with community colleges. Community colleges attempt to prepare students
for many different purposes and institutions and, as a result, institutional change may be
slow. In the meantime, given the existing course structure, students can prepare themselves
to transfer to a college or university data science degree program.
Students can prepare by taking Calculus 1 and 2 as well as an Introduction to Computer
Science course. Additional computer science courses, if offered, would be very
helpful preparation. A few community colleges teach introductory statistics courses that
emphasize data analysis and statistical thinking, and such courses should be required
for transfer. More mathematical statistics courses that emphasize a rote, methods-based
approach to statistics may not be an optimal preparation for data science.
Figure 1
A flow chart displaying a possible path through the data science major. Box labels that survive
extraction include: First-year sequences; Intro to Data Science I; Mathematical Foundations I;
Other Discipline Course; Capstone Experience.
Not all community colleges will have the resources within a single department to develop
a course such as the Data Science I and II courses proposed here. However, institutions
should encourage collaboration between departments of mathematics and computer
science in order to develop introductory statistics courses that (a) emphasize statistical
thinking in the context of real and complex data sets, (b) develop fundamental computa-
tional thinking through learning and using statistical software, and (c) develop basic data
handling skills, such as creating new variables through transformations, uploading data
with different delimiter types and different basic row/column structures, and developing
habits of reproducibility.
The Statway (http://www.carnegiefoundation.org/resources/videos/introducing-statway/)
and the New Mathways (http://www.utdanacenter.org/higher-education/new-mathways-project/)
course sequences offered at some institutions may, depending
on the local implementation, provide students with a strong grounding in statistical
thinking and, if students learn to explore statistical concepts through simulations, also
develop basic computational thinking.
3. Prerequisites and preparation in high school. As students exposed to Common Core
standards for statistics enter college, some introductory material may need to be reexamined.
To be prepared for the data science curriculum, students should
- Be calculus ready (i.e., have a good precalculus course)
SUMMARY POINTS
In summary, the key points of our proposal are as follows.
1. Data science is a fast-evolving discipline centered on the acquisition, curation, and analysis
of data.
2. Courses from the traditional disciplines of mathematics, statistics, and computer science
provide the basic infrastructure for the major at present.
3. A redesign of the curriculum, integrating the elements of mathematical foundations and
computational and statistical thinking at all levels, will provide a rich and effective series
of courses to prepare graduates for a career in data science.
We realize that the field is evolving rapidly but hope that the basic areas we have outlined will
be useful. During our discussions, several issues arose that were outside the scope of our meeting,
including the following.
FUTURE ISSUES
1. Faculty development. The courses outlined in the Supplemental Appendix are clearly
bold steps toward a new integrated program in data science. To be effective they will
require many iterations. Resources for faculty, including notes, examples, case studies,
and, perhaps most importantly, new textbooks, will be essential.
2. Engagement with two-year colleges and high schools. The data science major will
be attractive to many students coming from both high school and two-year colleges.
Interactions with these institutions will be crucial in order to coordinate courses and
instruction to facilitate transfer to four-year institutions.
3. Periodic revision. This is a first attempt at providing concrete guidelines for this emerg-
ing field. We realize that revisions will be necessary as the field continues to evolve, and
we welcome feedback on these guidelines.
DISCLOSURE STATEMENT
The authors are not aware of any affiliations, memberships, funding, or financial holdings that
might be perceived as affecting the objectivity of this review.
ACKNOWLEDGMENTS
The authors would like to thank the PCMI for supporting us in this effort. In addition, we would
like to thank the National Science Foundation and the Institute for Advanced Study for supporting
PCMI.
1. Department of Mathematics and Statistics, Williams College, Williamstown, Massachusetts 01267
2. Department of Mathematics and Statistics, University of Michigan, Dearborn, Michigan 48128-2406
3. Department of Mathematics and Computer Science, Mills College, Oakland, California 94613
4. Department of Statistical & Data Sciences, Smith College, Northampton, Massachusetts 01063
5. Department of Mathematics, Reed College, Portland, Oregon 97202
6. Department of Mathematics and Computer Science, Denison University, Granville, Ohio 43023
7. Department of Mathematics, Shippensburg University, Shippensburg, Pennsylvania 17257
8. Department of Mathematics, Olivet Nazarene University, Bourbonnais, Illinois 60914
9. Department of Mathematics, Brigham Young University, Provo, Utah 84601
10. Department of Statistics, University of California, Los Angeles, Los Angeles, California 90095-1554
11. Department of Mathematics, Middlebury College, Middlebury, Vermont 05753
12. Department of Mathematics and Computer Science, Denison University, Granville, Ohio 43023
13. Department of Mathematics, Lafayette College, Easton, Pennsylvania 18042-1780
14. Department of Mathematics and Computer Science, Rhode Island College, Providence, Rhode Island 02908
15. Department of Statistics, University of California, Berkeley, California 94720
16. Department of Mathematics, University of Hawaii, Hilo, Hawaii 96720-4091
17. Department of Mathematics, Westminster College, Salt Lake City, Utah 84105
18. Department of Computer Science, Fitchburg State University, Fitchburg, Massachusetts 01420
19. Department of Mathematics, New York University, New York, New York 10012
20. Department of Mathematics, University of Southern California, Los Angeles, California 90089
21. Department of Mathematics, St. Mary’s University, San Antonio, Texas 78228
22. Department of Mathematics, Howard University, Washington, DC 20059
23. Department of Mathematics, LeTourneau University, Longview, Texas 75602
24. Department of Mathematics and Computer Science, Denison University, Granville, Ohio 43023
25. Department of Mathematics, University of North Georgia, Oakwood, Georgia 30566
LITERATURE CITED
ACM/IEEE (Assoc. Comput. Mach./Inst. Electr. Electron. Eng.). 2013. Computer Science Curricula 2013: Curriculum Guidelines for Undergraduate Degree Programs in Computer Science. New York: ACM. https://www.acm.org/education/CS2013-final-report.pdf
ASA (Am. Stat. Assoc.). 2002. Curriculum guidelines for bachelor of arts degrees in statistical science. J. Stat. Educ. 10(2). http://ww2.amstat.org/publications/jse/v10n2/tarpey.html
ASA (Am. Stat. Assoc.). 2014a. Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society. Arlington, VA: ASA. http://www.amstat.org/asa/files/pdfs/POL-BigDataStatisticsJune2014.pdf
ASA (Am. Stat. Assoc.). 2014b. Curriculum Guidelines for Undergraduate Programs in Statistical Science. Alexandria, VA: ASA. http://www.amstat.org/education/pdfs/guidelines2014-11-15.pdf
Breiman L. 2001. Statistical modeling: the two cultures. Stat. Sci. 16(3):199–231
Bryce GR, Gould R, Notz WI, Peck RL. 2001. Curriculum guidelines for bachelor of science degrees in statistical science. Am. Stat. 55(1):7–13
Cassel B, Topi H. 2015. Strengthening Data Science Education Through Collaboration. Rep. on Workshop on Data Science Education Funded by the Natl. Sci. Found., Award #: DOE 1545135, Oct. 3–5, Arlington, VA
Donoho D. 2015. 50 Years of Data Science. Presented at Tukey Centennial Worksh., Princeton, NJ, Sept. 18
Horton NJ, Hardin JS. 2015. Teaching the next generation of statistics students to “think with data”: special issue on statistics and the undergraduate curriculum. Am. Stat. 69:259–65
MAA (Math. Assoc. Am.). 2004. Undergraduate Programs and Courses in the Mathematical Sciences: CUPM Curriculum Guide 2004. Washington, DC: MAA. http://www.maa.org/programs/faculty-and-departments/curriculum-department-guidelines-recommendations/cupm/cupm-guide-2004
MAA (Math. Assoc. Am.). 2015. 2015 CUPM Curriculum Guide to Majors in the Mathematical Sciences. Washington, DC: MAA. http://www.maa.org/sites/default/files/pdf/CUPM/pdf/CUPMguide_print.pdf
McKinsey Global Inst. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity. New York: McKinsey & Co. http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/big-data-the-next-frontier-for-innovation
NSF (Natl. Sci. Found.). 2014. Data Science at NSF. Draft Report of StatSNSF Committee: Revisions Since January MPSAC Meeting. April. https://www.nsf.gov/attachments/130849/public/Stodden-StatsNSF.pdf
Shron M. 2014. Thinking with Data: How to Turn Information into Insights. Sebastopol, CA: O’Reilly Media Inc.
Wild CJ, Pfannkuch M. 1999. Statistical thinking in empirical enquiry. Int. Stat. Rev. 67(3):223–65
Wilkinson L. 2008. The future of statistical computing. Technometrics 50(4):418–35
Wing J. 2006. Computational thinking. Comm. ACM 49(3):33–35