Data For Geoscience

This paper reviews the integration of data science into geoscience, highlighting its significance and recent advancements through the lens of a data life cycle. It outlines key steps such as concept, collection, preprocessing, analysis, and repurposing, making the information accessible for geoscientists with limited data science experience. The paper also discusses future trends in data science applications within geoscience, emphasizing the importance of open science and interdisciplinary collaboration.

The Geological Society of America


Special Paper 558
OPEN ACCESS

Data science for geoscience: Recent progress and future trends from the perspective of a data life cycle

Xiaogang Ma*
Department of Computer Science, University of Idaho, 875 Perimeter Drive, MS 1010, Moscow, Idaho 83844-1010, USA

ABSTRACT

Data science is receiving increased attention in a variety of geoscience disciplines and applications. Many successful data-driven geoscience discoveries have been reported recently, and the number of geoinformatics and data science sessions at many geoscience conferences has begun to increase. Across academia, industry, and government, there is strong interest in knowing more about current progress as well as the potential of data science for geoscience. To address that need, this paper provides a review from the perspective of a data life cycle. The key steps in the data life cycle include concept, collection, preprocessing, analysis, archive, distribution, discovery, and repurpose. Those subjects are intuitive and easy to follow even for geoscientists with very limited experience with cyberinfrastructure, statistics, and machine learning. The review includes two key parts. The first addresses the fundamental concepts and theoretical foundation of data science, and the second summarizes highlights and sharable experience from existing publications centered on each step in the data life cycle. At the end, a vision about the future trends of data science applications in geoscience is provided that includes discussion of open science, smart data, and the science of team science. We hope this review will be useful to data science practitioners in the geoscience community and will lead to more discussions on the best practices and future trends of data science for the geosciences.

1. INTRODUCTION

Data-driven discovery has received a lot of attention in geoscience research in the past decade, as reflected in the increasing number of projects funded, facilities constructed, data sets shared, and scientific findings published. Cyberinfrastructure, data portals, databases, workflow platforms, statistical models, machine learning algorithms, data management, and data sharing are becoming the new normal in many geoscientists’ daily work. Various success stories of data-driven geoscience discovery in recent years have demonstrated the enormous potential of the data revolution. It is obvious that to scale up the innovation and accelerate new findings in geoscience, data science will play an important role in the coming decades. Nevertheless, as the theoretical foundation of data science is still under development, discussion and review of data science in the geosciences is limited. In contrast, data science methods and tools are currently in high demand among geoscientists. To address that need, this paper reviews progress in both data science and data-driven geoscience and discusses the future trends.

*[email protected]

Ma, X., 2022, Data science for geoscience: Recent progress and future trends from the perspective of a data life cycle, in Ma, X., Mookerjee, M., Hsu, L., and Hills, D., eds., Recent Advancement in Geoinformatics and Data Science: Geological Society of America Special Paper 558, p. 57–69, https://doi.org/10.1130/2022.2558(05). © The Author. Gold Open Access: This chapter is published under the terms of the CC-BY license and is available open access on www.gsapubs.org.


Downloaded from http://pubs.geoscienceworld.org/gsa/books/book/2377/chapter-pdf/5743375/spe558-05.pdf by INGEMMET user

Data science is the study of extracting value from data (Wing, 2019). A primary driving force of data science in geoscience is the fast-growing volume, velocity, and variety of data, i.e., big data. Hey et al. (2009) stated that data exploration is the key feature of “the fourth paradigm” in science for tackling the data deluge, as compared with the previous three scientific paradigms in which empirical, theoretical, and computational approaches were the key features. There are several factors in their vision of this new paradigm. Big data are captured by instruments or generated by simulators. Advanced infrastructures are deployed to store and transmit data, along with data analysis software and knowledge systems. Scientists, with the support and assistance of those resources, will focus more on scientific discovery in the midstream to downstream of the data flow. Another point raised by Hey et al. (2009) is that data-intensive science in the fourth paradigm is not only computational science but should also incorporate theories and methods from many other disciplines. Many later publications (Drineas and Huo, 2016; Kelleher and Tierney, 2018; NASEM, 2018a) resonate with Hey et al.’s (2009) vision of the theoretical foundation of data science. It is now commonly understood that data science will set its root in the basic research of computer science, mathematics, statistics, information science, and other disciplines. Successful data-driven scientific discovery also requires an open cyberinfrastructure and innovative pathways to enable the synergy of data science methods and domain-specific research questions.

Researchers of geoinformatics and geomathematics have also reviewed and discussed the evolution of information technologies in their work. Merriam (2004) listed six stages for the history of quantitative geology: origins (1650–1833), formative (1833–1895), exploration (1895–1941), development (1941–1958), automated (1958–1982), and integration (1982). Ma (2018) added that since the early 2010s, geoinformatics has been in the intelligent stage. Recently, there have been several review articles summarizing the latest trends of different aspects of data science in geoscience. Chan et al. (2016) and Shipley and Tikoff (2019) analyzed the changes that open data and cyberinfrastructure can bring to the workflow of geoscience, such as sedimentary geology and structural geology. Gil et al. (2019) analyzed the characteristics of research challenges in geoscience and then proposed a roadmap for developing and deploying knowledge-rich intelligent systems to address those challenges. In Karpatne et al. (2019), Bergen et al. (2019), and Reichstein et al. (2019), the challenges and opportunities of machine learning and deep learning for geoscience were thoroughly reviewed. Each of those three articles also has its own highlights. Karpatne et al. (2019) pointed out the synergistic advancement that such applications can bring to both machine learning and geoscience. Bergen et al. (2019) analyzed the larger function space and data-processing capability of machine learning in comparison to the conventional approaches in geoscience. Reichstein et al. (2019) asserted that data-driven machine learning should be coupled with the spatial and temporal context to obtain better understanding of Earth system processes and thus to improve prediction.

The quick progress of big data and data science has inspired plans and schemes for data-driven geoscience research at a larger scale. In 2018, the Carnegie Institution for Science started the Deep-time Data Driven Discovery (4D) Initiative (4D Initiative, 2018). In 2019, the International Union of Geological Sciences initiated the Deep-Time Digital Earth (DDE) big science program (Cheng et al., 2020). In the vision (NASEM, 2020) for the next decade of Earth science priorities within the U.S. National Science Foundation (NSF), key recommendations were made regarding open data and community practices for cyberinfrastructure needs and advances. We are now at a dramatic tipping point in science, a time when the open data resources, cyberinfrastructure facilities, and new data science methods for analysis and visualization will change the way geoscientists conduct their research. Keys to discovery lie in the continued development, integration, and exploitation of facilities, data, and expertise to build and explore pathways for a deeper understanding of the evolving Earth (Hazen et al., 2019). The review and analysis presented in this paper aim to answer questions such as, “What changes can data science bring to geoscience?” “What are the fundamental data science skills that a geoscientist should learn?” “What will be the patterns of data science applications in the next five or ten years?” and “As a student of geoscience, how can I quickly learn the data science methods and use them in my work?”

The perspective of this paper is from the point of view of a data life cycle. The data life cycle includes key steps such as concept, data collection, preprocessing, archive, distribution, discovery, analysis, and repurposing. The theme of each step is intuitive and easy to follow. Through this structure, this article summarizes sharable experience from existing studies with regard to data science workflows in geoscience. In the writing, the author has tried to present a comprehensive list and review of existing publications; however, the analysis presented may not cover all of the highlights of the cited publications. The remainder of the paper is organized as follows. Section 2 summarizes key concepts in data science. Section 3 reviews a number of recent publications on each step of a data life cycle. Section 4 analyzes the trends of data science in geoscience, and Section 5 offers a conclusion.

2. THE SCIENCE OF DATA SCIENCE

To better understand the workflows in data science, it is necessary to know a few fundamental concepts. The author has taught database and data science classes for senior undergraduate and graduate students in recent years. The experience has shown that even students majoring in computer science may confuse the meanings of data, metadata, information, and knowledge. Data are the recorded representation of facts. In the current digital era, the records are normally presented in a digital form, such as plain text, spreadsheet, relational database, and graph database. In addition to a hard disk, data can also be recorded on other types of media, such as paper and tape. Archived records from the old days, such as literature printed on hardcopies, can be digitized. Metadata are data about data. Metadata are important in data sharing and reuse
because they give an overview of the background of the data. An end user can get a quick summary and understanding of a piece of data just by reading the metadata. Structured metadata can improve the performance of search engines and enable them to accurately index records and find the best match for a request. Information is the meaning or message extracted from data. The information extraction process often depends on the purpose of data analysis, the methods and tools used, and the interpretation of data analysis results. It is not strange to see the same piece of data used in studies of different topics to generate varied information. Knowledge is the expertise and familiarity with a topic. In traditional understanding, a human can attain knowledge by learning, practice, and experience. In data science, there are now knowledge bases that can save knowledge in quantitative and qualitative formats, which can in turn be used in the data analysis process. The three concepts of data, information, and knowledge are also used in combination with other concepts, such as wisdom and action, to form a pyramid or flowchart and depict the ability of using knowledge and insight gained from data to think and act in real-world practices (Fig. 1A).

Many researchers and communities have depicted the data life cycle and the data science process. Figure 1 presents a number of diagrams from the existing publications (Chapman et al., 2000; Schutt and O’Neil, 2013; Berman et al., 2018; Wing, 2019; DDI Alliance, 2021). Most are easy to read and understand, and we will omit the detailed description for each of them. Nevertheless, some shared topics in those diagrams are worthy of highlighting. For instance, the data life cycles presented in Figures 1B and 1E both include the steps of data sharing, publication, and reuse. The step of data processing in Figures 1B and 1C actually means data cleansing, wrangling, and munging, which is similar to the step of data preprocessing in Figure 1F. In Figures 1D and 1F, the steps of visualization and interpretation address the needs of meaningful data science, i.e., to appropriately interpret the results of data analysis. This includes not only the precision and efficiency of algorithms but also the domain-specific meaning in the outputs of those algorithms. Also, the issues of data privacy and ethics have received more attention and discussion in recent publications to highlight data science as an ecosystem (Figs. 1D–1E).

Interdisciplinary collaboration led to the emergence and evolution of data science. Donoho (2017) offered a thorough review of data science’s evolution over the past decades. In particular, he summarized the perspectives of several statisticians on

[Figure 1 image not reproduced. Labels recoverable from the extracted text: (A) Data, Information, Knowledge, Wisdom; (B) Concept, Collection, Processing, Archive, Distribution, Discovery, Analysis, Repurposing; (C) Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment; (D) Generation, Collection, Processing, Storage, Management, Analysis, Visualization, Interpretation, with privacy and ethical concerns throughout; (E) and (F) Acquire, Clean, Use/Reuse, Publish, Preserve/Destroy; Environment {Ethics, Policy, Regulatory, Stewardship, Platform, Domain}; Data Collection, Data Pre-processing, Clean Data, Exploratory Data Analysis, Confirmatory Data Analysis, Derived Data, Visualization/Communication, Scientific Findings, Decision Making.]

Figure 1. Different depictions of the data life cycle and the data science process are shown. (A) The DIKW model; (B) the Data Documentation Initiative (DDI) data life cycle (DDI Alliance, 2021); (C) the cross-industry standard process for data mining (CRISP-DM) (Chapman et al., 2000); (D) the data life cycle in data science (Wing, 2019); (E) the data life cycle and surrounding data ecosystem (Berman et al., 2018); and (F) the data science process (Schutt and O’Neil, 2013).
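
The steps in Figure 1B can also be written down as an ordered data structure, which is sometimes handy when tagging data sets or workflow stages. The following Python sketch is only an illustration under stated assumptions: it treats the life cycle as strictly linear (the figure itself places Archive as a side branch, and repurposing may feed a new cycle), and the names `LifeCycleStep` and `next_step` are hypothetical, not taken from the DDI specification.

```python
from enum import Enum
from typing import Optional


class LifeCycleStep(Enum):
    """Steps of the data life cycle, in the order shown in Figure 1B.

    Assumption: a strictly linear ordering, used here for illustration only.
    """
    CONCEPT = 1
    COLLECTION = 2
    PROCESSING = 3
    ARCHIVE = 4
    DISTRIBUTION = 5
    DISCOVERY = 6
    ANALYSIS = 7
    REPURPOSING = 8


def next_step(step: LifeCycleStep) -> Optional[LifeCycleStep]:
    """Return the step that follows `step`, or None after the last step."""
    members = list(LifeCycleStep)  # Enum members iterate in definition order
    index = members.index(step)
    return members[index + 1] if index + 1 < len(members) else None
```

Under these assumptions, `next_step(LifeCycleStep.ANALYSIS)` returns `LifeCycleStep.REPURPOSING`, and `next_step(LifeCycleStep.REPURPOSING)` returns `None`.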


the need to expand the boundaries of classical statistics to cover topics of data preparation, presentation, and prediction. In the review it was mentioned that the term “Data Science” had been used two decades ago by Cleveland (2001) for the envisioned new field. Recent discussions have made clear that the field of data science should be interdisciplinary, including computer science, statistics, mathematics, information science, and progress in subject matter applications (Drineas and Huo, 2016; Kelleher and Tierney, 2018). Those discussions were reflected in the list of data science courses and the curricula of those courses. A recent National Academies of Sciences, Engineering, and Medicine report (NASEM, 2018a) stated that a critical task of data science education is to establish data acumen, which includes these key concepts: mathematical foundations, computational foundations, statistical foundations, data management and curation, data description and visualization, data modeling and assessment, workflow and reproducibility, communication and teamwork, domain-specific considerations, and ethical problem solving. Those topics of data acumen are reflected in the data life cycle and data science process (Fig. 1) to address the real-world needs of data science applications. Several universities already offer data science courses. For example, the University of California at Berkeley offers Data 8: Foundations of Data Science to entry-level undergraduates in any major (Adhikari and DeNero, 2017). Its curriculum covers most of the subjects in the above data acumen list.

Many geoscience and geoinformatics researchers have analyzed the science of data science from the perspective of their experiences with real-world practices. Mattmann (2013) discussed four advancements that are necessary to tackle the challenges of big data: algorithm integration, software development and stewardship, automated data format identification and reading, and the training of data scientists. Fox and Hendler (2014) noted that the field of data science includes not only the disciplinary foundations but also strategies for real-world challenges. They provided details about four cross-cutting data science challenges: understanding scale in systems, sparse systems with incomplete and heterogeneous data, abductive reasoning, and next-generation semantic data infrastructure. Here, abductive reasoning is similar to the “Exploratory Data Analysis” proposed by Tukey (1977, p. v), “It regards whatever appearances we have recognized as partial descriptions, and tries to look beneath them for new insights.” Ho (1994) summarized that abduction creates, deduction explicates, and induction verifies. This means that abduction is a good way of finding clues to scientific questions through the activities of data exploration. Hazen (2014), based on his experience of data-driven studies in mineralogy, further summarized that deduction and induction are to discover what we know we do not know while abduction is to discover what we do not know we do not know. Recognition of the data science myths pointed out by Kitchin (2014) and Kelleher and Tierney (2018) is important for avoiding unrealistic expectations. The myths are: (1) data science is an autonomous process without human oversight; (2) every data science project needs big data and machine learning; (3) data science software is easy to use, and data science is an easy job; and (4) data science pays for itself quickly. Awareness of those myths will help geoscientists understand the limitations of data science and be better prepared to problem solve in the real world.

3. A REFLECTION ON THE KEY STEPS OF A DATA LIFE CYCLE

Focusing on the theme of data science for geoscience, the following sub-sections review a list of recent publications for each key step in the data life cycle and summarize the shareable experiences from them.

3.1. Business Understanding and Concept

The steps labeled “concept” in Figure 1B and “business understanding” in Figure 1C are intended to determine the objectives of a data science project and estimate the data needs (Chapman et al., 2000; DDI Alliance, 2021). They are about turning business goals into data science plans. If the planned activities include database construction, this step will also include the work of developing data structures, such as a conceptual model, logical model, physical model, as well as controlled vocabularies for data standardization. Cyberinfrastructure researchers recognize that consideration and action regarding data semantics in the early stage will help improve data interoperability when data are generated, collected, integrated, and shared in a later stage (Reitsma et al., 2009; Narock and Shepherd, 2017).

The Semantic Web extends the World Wide Web by adding structures and meaning to terms in documents on the web (Berners-Lee et al., 2001). The key technical approach to enable the Semantic Web is the use of ontologies, which are formal specifications of a shared conceptualization of a domain (Gruber, 1995). Researchers have suggested a semantic spectrum that consists of a sequence of items such as catalog, glossary, taxonomy, thesaurus, conceptual schema, and formal logical models, for constructing and implementing ontology in practice (Welty, 2002; McGuinness, 2003; Obrst, 2003; Uschold and Gruninger, 2004). The items in this spectrum provide a roadmap for increasing the semantic precision and interoperability of data in a variety of applications.

Data interoperability has received tremendous attention in recent years. The widely accepted FAIR (Findable, Accessible, Interoperable, and Reusable) data principles (Wilkinson et al., 2016; Stall et al., 2019) are closely related to the discussion of data interoperability in the past decades (Fig. 2). Several researchers presented the layered structure of data interoperability, including systems, syntax, schematics, semantics, and pragmatics (Bishr, 1998; Sheth, 1999; Ludäscher et al., 2003; Brodaric, 2007, 2018). A few other researchers explained those layers in layman’s terms, including discoverable, accessible, decodable, understandable, and usable (Wood et al., 2010; Ma et al., 2011). The layered structures of data interoperability and
the FAIR principles can also be compared with the technical architecture of the Semantic Web (Berners-Lee, 2000). Many best practices of data interoperability can be seen in the domain of geoscience. The U.S. National Geologic Map Database of the U.S. Geological Survey (USGS) has adopted the North American Geologic Map Data Model (NADM) (NADM Steering Committee, 2004) as a common schema for coordinating state-level geologic map databases. Such efforts to determine standards are continuously active at USGS, such as the recently released Geologic Map Schema (GeMS) (USGS NCGMP, 2020). Similarly, NASA has implemented the Global Change Master Directory (GCMD) Keywords as a hierarchical set of controlled vocabularies to ensure the interoperability of its data and services (GCMD, 2020). In Europe, the INSPIRE Directive aims to create a European Union spatial data infrastructure (Bartha and Kocsis, 2011; Ma and Fox, 2014). Its data and metadata specifications cover 34 data themes in Earth and environmental sciences, with full implementation required by 2021 across all of the participating European nations. Scientific communities such as the World Wide Web Consortium and the Open Geospatial Consortium have also summarized best practices for publishing and serving data on the web (Loscio et al., 2017; Tandy et al., 2017).

3.2. Data Understanding, Generation, and Collection

Along with the quick development of hardware and software in the cyberinfrastructure, data are now generated at an ever increasing speed. Sensor networks (Martinez et al., 2004; Hart and Martinez, 2006) greatly facilitate the generation, transmission, and integration of Earth and environmental data. NASA organizes ~100 missions and thousands of platforms, instruments, and sensors around the Earth and nearby space and is one of the biggest geoscience data producers worldwide. It was reported (Shannon, 2019) that in 2016, NASA was already generating 12.1 TB of data every day. The same article also reported that NASA is deploying new sensors that alone will be able to generate 24 TB of data daily. Similar advances in instruments and facilities for data generation, transmission, and management were also seen in field-based geological surveys (Mookerjee et al., 2015). Wing (2019) made a distinction between data generation and collection and pointed out that not all data generated are collected (Fig. 1D). That may be because we only want to collect a certain part of the data or because the velocity of data streams is too high to be processed with existing tools.

Crowd-sourcing platforms, such as social media and community portals, are generating massive data. Many of our daily activities, such as posting on Twitter or Facebook, watching and commenting on a video on YouTube, and searching on Google, all generate digital records in a way of which many of us are not even aware. A great deal of social media data are used for scientific studies. For example, Twitter data were used for wildfire disaster management (Wang et al., 2016). Google search data were used to predict seasonal influenza trends (Carneiro and Mylonakis, 2009). The community collaboration on OpenStreetMap greatly helped the rescue work after the 2010 Haitian earthquake (Ahmouda et al., 2018). Images on Flickr were used for ecosystem assessment in remote areas (Rossi et al., 2020). Besides the public social media, another type of crowd-sourcing platform focuses on a certain subject and is normally maintained by a community of enthusiasts. For example, Mindat.org is such a community platform focused on mineral species. It has a small team of database administrators and data reviewers and is open to thousands of data contributors and users across the world. Researchers have used Mindat data in many recent studies on mineral evolution and mineral ecology (Hazen et al., 2011; Morrison et al., 2020).

The massive collection of geoscience literature is another good source of data. For example, GeoDeepDive (Zhang et al., 2013; Peters et al., 2014) is a machine learning package for discovering data and knowledge from published documents. By

[Figure 2 image not reproduced. Columns recoverable from the extracted text, each listed from bottom layer to top layer: Semantic Web (Berners-Lee, 2000): Unicode and URI; XML; RDF, RDFS; Ontology; Logic; Proof; Trust, with Digital signature alongside. Data Interoperability (Bishr, 1998; Sheth, 1999; Ludäscher et al., 2003; Brodaric, 2007, 2018): Systems; Syntax; Schematics; Semantics; Pragmatics. Plain-language layers (Wood et al., 2010; Ma et al., 2011): Discoverable; Accessible; Decodable; Understandable; Usable. FAIR Principles (Wilkinson et al., 2016; Stall et al., 2019): Findable; Accessible; Interoperable; Reusable. A “Legal & Ethical” label also appears in the diagram.]

Figure 2. Comparison shows the layered structure of data interoperability with the Semantic Web architecture and the FAIR data principles (from Ma et al., 2020; CC BY 4.0 license). For sources of sub-diagrams, see description in text.
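
The correspondence between the five interoperability layers and their plain-language counterparts can be captured in a small lookup table. The following Python sketch assumes that the two five-item lists given in the text pair up in the order in which they are listed (systems with discoverable, syntax with accessible, and so on); the names `LAYER_TO_PLAIN` and `plain_term` are hypothetical and introduced only for illustration.

```python
# Layers of data interoperability, from lowest to highest, as listed in the text.
LAYERS = ["systems", "syntax", "schematics", "semantics", "pragmatics"]

# Plain-language terms, assumed to pair with the layers in the same order.
PLAIN_TERMS = ["discoverable", "accessible", "decodable", "understandable", "usable"]

# Lookup table from interoperability layer to plain-language term.
LAYER_TO_PLAIN = dict(zip(LAYERS, PLAIN_TERMS))


def plain_term(layer: str) -> str:
    """Return the plain-language term for an interoperability layer.

    The lookup is case-insensitive; an unknown layer raises KeyError.
    """
    return LAYER_TO_PLAIN[layer.lower()]
```

For example, `plain_term("Semantics")` returns `"understandable"` under the pairing assumed above.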


January 2021 it had preprocessed more than 13 million documents. Peters et al. (2014) successfully used the fossil records extracted from GeoDeepDive to enhance the Paleobiology Database. GeoDeepDive also allows other researchers to use the data resource to explore their own scientific topics. Recently, there have also been many studies on using text mining technologies to extract knowledge graphs from the geoscience literature (Wang et al., 2018; Qiu et al., 2019; Fan et al., 2020b).

3.3. Data Preprocessing and Preparation

Data preprocessing is an increasingly important step in data science. It is also referred to by several alternative names such as data cleansing, data wrangling, and data munging. The general purpose of data preprocessing is to ensure the quality of data before any data analysis is conducted. In real-world practice, it may involve tasks such as removing noisy and unreliable records, reducing data dimensionality, transforming data formats, selecting records of interest, enriching the existing data with additional attributes, and combining data from different sources to build a new piece of data (Wang et al., 2018). Many researchers (Press, 2016; Mons, 2018), including geoscientists (Fox, 2019), spend 80% of their time cleansing and preparing data before analyzing the data (i.e., the 80/20 rule). Good data preprocessing can significantly increase the efficiency of data analysis and lead to remarkable scientific discoveries. For example, the above-mentioned Mindat data portal was used as a source for the Mineral Evolution Database (Golden et al., 2019). Nevertheless, a limitation of the original Mindat is that it does not include an age attribution for a mineral species’ first occurrence on Earth. Golden et al. (2019) searched over 1600 publications and several existing databases to extract such age data and then used them to enrich the Mineral Evolution Database. The updated database underpinned many new research discoveries, including mineral evolution and ecology (Morrison et al., 2019, 2020) and the co-evolution between the geosphere and the biosphere (Spielman and Moore, 2020). The database also led to new designs of mineral species databases and discussions on better methods for data curation and sharing (Prabhu et al., 2021).

Applying data standards to transform existing data or mediate between databases is also a widely used approach in data preprocessing and preparation. The above-mentioned metadata and data specifications in the INSPIRE Directive are a good use case for that approach. Another example is the global “OneGeology” project for improving the accessibility of geological maps on the Internet (Jackson, 2010). OneGeology has developed a tool kit to set up online geologic map services. More than 110 countries have participated in the project, and about half of them are serving map data to a web map portal. The original maps are heterogeneous because they are recorded in different formats and use different data models, terminology, and language. Through the OneGeology map service tool kit, the online services of those maps are made consistent, and they can be browsed in a centralized map window. The “OneGeology-Europe” project (Laxton, 2017) has utilized multilingual vocabularies to develop innovative capabilities for the online geologic maps of participating European nations. New functions of OneGeology-Europe include the multilingual user interface, federated queries across distributed geologic map services, consistency with other regional and international data standards, and more. As reflected in those examples, well-organized data preprocessing preparation can significantly change the 80/20 rule in data science activities or even reverse it.

3.4. Data Archive, Distribution, and Discovery

Nowadays, it is a new normal that funding agencies require researchers to include a data management plan in their grant proposals (Dietrich et al., 2012; NSF, 2015). Increasingly, data are treated as a formal research output and receive the same attention as paper publications. The FAIR data principles (Wilkinson et al., 2016) are now well received in almost all scientific disciplines, including geoscience (Stall et al., 2019; Lannom et al., 2020). The FAIR data principles represent many preceding efforts on data management and stewardship and represent a systematic approach to sharing and reusing scientific data in an open scientific environment. Those efforts include data infrastructure construction (Cutcher-Gershenfeld et al., 2016), persistent and resolvable identifiers for data publication (Klump et al., 2016), metadata standardization (Starr and Gastl, 2011), provenance documentation (Lebo et al., 2013), data citation (Parsons et al., 2010), and more. There are many general-purpose data portals where researchers can upload and share their data. Moreover, there are specific data portals that only focus on one or a few subjects, such as petrology, geochemistry, and geophysics. Data-producing agencies such as NASA, USGS, the National Oceanic and Atmospheric Administration (NOAA), and the U.S. Department of Agriculture (USDA) all have their own data archives and data portals that allow users to search and access data of interest. For instance, USGS enables federated query to a long list of mineral resource spatial databases through a central portal (USGS MRDATA, 2021). As workflow platforms such as Jupyter Notebook and R Markdown are increasingly used, many data portals have also developed packages to enable data access from workflow platforms, such as the paleobioDB R package for the Paleobiology Database (Varela et al., 2015) and the neotoma R package for the Neotoma Paleoecology Database (Goring et al., 2015).

The FAIR data principles prioritize findability. It is true that from the perspective of a user, data discovery is a key step if the user’s work needs to access data on external databases or data portals. A top-down approach can be used to search records in data portals with specific themes, such as EarthChem (earthchem.org), PANGAEA (pangaea.de), Neotoma (neotomadb.org), PaleoBioDB (paleobiodb.org), and many data portals organized by the federal agencies. Moreover, there are also registries for metadata from multiple data portals, such as DataONE (dataone.org), as well as registries of data portals, such as RE3DATA

(re3data.org). On those data portals, a user can quickly narrow down the scope of a search by selecting disciplines, subjects, geospatial range, time span, and other attributes. Another approach to data discovery is the free-style search, such as that enabled by Schema.org (Noy et al., 2019). By providing metadata through the Schema.org specifications, the records in a data portal will be made indexable to search engines. For example, Google has indexed millions of data sets on thousands of data portals and made them searchable through the Google Dataset Search engine (Noy et al., 2019). A user can search data sets with any combination of keywords. Once a data set is identified on the search engine, the user can access the data set through the web address provided in the metadata. Recently, there have also been discussions about dataguides, which are a type of computer-aided analysis that can inform researchers about what data to collect and where to find them (Shipley and Tikoff, 2019).

3.5. Data Analysis and Result Interpretation

Many people think of data science simply as data analysis. Indeed, data analysis is a key step in the data life cycle, but it is just a part of the process. In past decades, many studies focused on the theories and applications of statistical models and data mining in geoscience (Merriam, 2004; Sagar et al., 2018). In recent years, the fast-growing methods and technologies of big data (Yang et al., 2017, 2019), cloud computing (Li et al., 2015; He et al., 2019), machine learning (Lary et al., 2016; Bergen et al., 2019; Karpatne et al., 2019), and deep learning (Reichstein et al., 2019) have been widely used in geoscience and have achieved significant outcomes. Many innovative, data-driven discoveries were seen in paleobiology (Peters et al., 2017), paleontology (Fan et al., 2020a), mineralogy (Hystad et al., 2015, 2019), water resources (Wen et al., 2018; Sun and Scanlon, 2019), forest cover change (Hansen et al., 2013), and public health (Goovaerts, 2008, 2021). Data analysis often includes two steps: exploratory and confirmatory data analysis (Fig. 1F). This conventional statistical method can still be very useful for data science applications today. Exploratory data analysis is used to get a better understanding of the data and draw plausible research questions or hypotheses (Tukey, 1977; Camizuli and Carranza, 2018). Confirmatory data analysis, in contrast, is where the complicated models and/or algorithms are applied to prove or disprove the hypotheses.

Data visualization has been increasingly discussed as an efficient way to improve the understandability of a data science process and the interpretability of the data science results (Fox and Hendler, 2011; Ma et al., 2015; Wing, 2019). Data visualization means not only making the information visible but also making it easy for a reader to perceive. Many may think of visualization just as a way to present data science results, but in actual practice, many data visualization techniques can also be used in data preprocessing and analysis. For example, the box plot is a widely used visualization in exploratory data analysis. Ma et al. (2017) used a three-dimensional cube matrix to explore the co-relationship between elements and mineral species and generated new research questions for detailed analyses. Morrison et al. (2017) applied network analysis to visualize the patterns of co-existence of minerals. In Dutkiewicz et al. (2015), machine learning was used to generate new hypotheses based on the analysis of big seafloor sediment data. GPlates software (Müller et al., 2018) was used as a data visualization platform in that study, which generated impressive results. These examples show that data visualization is an efficient approach for facilitating collaboration among geoscientists, data scientists, mathematicians, and data managers and for making the data science process and results understandable to a broader audience.

3.6. Repurposing

Repurposing means that a piece of data can be reused in other projects either by external users or the data producers themselves. Data interoperability and reusability will be the focus in this step. The FAIR data principles as well as the open data and open science campaigns suggest that metadata should include the provenance information of the original research activities that generated the data (Di et al., 2013; Gil et al., 2016; Wilkinson et al., 2016; Zeng et al., 2019; Lehmann et al., 2020). According to best practice, besides sharing data, researchers should also document their software packages, workflow setup, and the context information that interconnects the entities, agents, and activities involved in a research program. Open data and open science are helping to change the culture of research and create a virtuous data ecosystem in geoscience (Sinha et al., 2010; Welle Donker and van Loenen, 2017; Caron, 2020). Many new scientific discoveries are based on research activities that use “other people’s data.” For example, the work of Muscente et al. (2018) on the ecological impacts of mass extinctions used fossil community data from the Paleobiology Database. The work of Keller and Schoene (2012) on disruption in secular lithospheric evolution and Keller et al. (2015) on volcanic–plutonic parity and continental crust both used data from EarthChem. The work of Hazen et al. (2019) on mineral evolution used data from Mindat and other open data resources. To promote a healthy open data ecosystem, legal and ethical issues are also discussed (Berman et al., 2018; Kelleher and Tierney, 2018; Wing, 2019).

4. FROM BIG DATA TO DATA SCIENCE ECOSYSTEM: A VISION FOR THE NEXT DECADE

Along with the evolution of data science theory and methodology, the upgrading of computational facilities and capabilities, the thriving of big data and open data in geoscience, and the training of geoscientists with data science skill sets, it is certain that data science will be applied more frequently in geoscience, which will lead to more scientific discoveries. What will be the trends in methodology and technology, and what should a geoscientist be aware of to be better prepared for the data revolution? This section offers a few thoughts.
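
To make the Schema.org approach to data discovery (Section 3.4) more concrete, the following Python sketch builds a minimal Dataset description of the kind a data portal could expose to search engines. It is only an illustration under stated assumptions: the helper name make_dataset_record, the data set name, the example.org address, and the coverage values are all hypothetical, and a real portal would follow the metadata profile agreed on by its own community.

```python
import json


def make_dataset_record(name, description, url, keywords, bbox, start, end):
    """Build a Schema.org Dataset description as a JSON-LD-ready dict.

    bbox is (south, west, north, east) in decimal degrees; start and end
    are ISO 8601 dates. This helper is hypothetical, not part of any portal.
    """
    south, west, north, east = bbox
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": list(keywords),
        "spatialCoverage": {
            "@type": "Place",
            "geo": {
                "@type": "GeoShape",
                # Lower corner then upper corner, separated by spaces.
                "box": f"{south} {west} {north} {east}",
            },
        },
        # ISO 8601 interval: start and end dates separated by a slash.
        "temporalCoverage": f"{start}/{end}",
    }


# Made-up example values, for illustration only.
record = make_dataset_record(
    name="Example seafloor sediment samples",
    description="Illustrative record describing a hypothetical data set.",
    url="https://example.org/datasets/sediment-demo",
    keywords=["geochemistry", "sediment"],
    bbox=(-10.0, 100.0, 10.0, 120.0),
    start="2010-01-01",
    end="2015-12-31",
)

jsonld = json.dumps(record, indent=2)
print(jsonld)
```

A portal would typically embed the resulting JSON-LD string in the landing page of the data set so that a crawler can read the name, keywords, and spatial and temporal coverage without parsing the page layout.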

4.1. Open Data and Open Science Will Be the New Normal

The concept of open science is being widely accepted in academia (Donoho, 2017; NASEM, 2018b; Aspesi and Brand, 2020). Open science is an umbrella concept for a long list of “open” activities, including open access to publications, open source software programs, open data, open samples, and open workflows, just to name a few. Many open science activities will take place on the internet and the web (Berendt et al., 2020). For example, data will become more open, accessible, and interactive through various protocols and interfaces, such as those maintained by the World Wide Web Consortium and the Open Geospatial Consortium. Facilitated by the FAIR principles and other associated efforts, the shared data will be better curated, which will save researchers time on data preprocessing and preparation. The USGS mineral resources spatial data portal is an example of that trend (USGS MRDATA, 2021). The Giovanni infrastructure of NASA (Acker and Leptoukh, 2007) has also been working toward cooperation among NASA’s distributed data archives to enable federated data exploration and comparison (Lynnes, 2020). For reflection, a key idea in the vision of the Semantic Web (Berners-Lee et al., 2001) is the persistence and traceability of resources on the web. Similar to the digital object identifier (DOI) for publications, many other entities and agents in open science, such as data, software packages, samples, researchers, organizations, and research grants, will also have their persistent and resolvable identifiers on the web. By connecting those identifiers, we can easily weave a graph for all of the objects, steps, and workflows involved in generating a scientific finding.

Workflow platforms such as Jupyter Notebook, R Markdown, and others will be widely used in geoscience from research projects to classroom education. Those workflow platforms are not only good tools for collaborative and reproducible research activities, they also provide well-organized environments for students to learn and use programming languages. Many geoscience data portals now have Python or R packages to enable users to search and access data directly from a workflow, and there have been various successful applications in geoscience (Varela et al., 2015; Peters and McClennen, 2016; Choi et al., 2021; Rosenberg et al., 2020). We anticipate that workflow platforms will become more popular in geoscience in the future. Similar to the needs of computer scientists and data scientists for trustworthy artificial intelligence (Floridi, 2019; Wing, 2020), geoscientists have also called for provenance in their workflows (Gil et al., 2019). Recently, packages have been developed in workflow platforms to capture provenance. For example, the MetaClip (Bedia et al., 2019) framework is able to capture the provenance description of a climate product and then append the provenance information inside the resulting image. Once that image is loaded to the MetaClip Web portal, the provenance information inside it will be read and visualized. To tackle large data sets, researchers have begun to deploy workflow platforms in the cloud environment (Hamman et al., 2018; Sun et al., 2020). This will be a trend in big geoscience data processing in the near future.

4.2. Big Data, Smart Data, Data Science, and the Changes They Bring to Geoscience

Big data does not mean we can dump and share data while simply relying on machine learning to identify patterns in the chaos. Many researchers have discussed the idea of smart data (Iafrate, 2014; Sheth, 2014; Maskey et al., 2020), that is, the application of metadata and semantics to add more machine-readable structures in data generation and collections and the deployment of intelligent algorithms to improve the precision of data discovery and analysis. Smart data will bring refreshing changes to the data life cycle and help researchers quickly identify the data to be used and extract value from the data. Many geoscience data portals, such as EarthChem, Neotoma, and the Paleobiology Database, have already applied controlled vocabularies to improve the precision of data search and query. The Google Dataset Search engine, enabled by Schema.org, offers a playground for developing more innovative functions in data search. The geoscience community has already begun to work on approaches to expose Schema.org-compatible metadata on their data portals (Shepherd et al., 2019; Valentine et al., 2020) and make the metadata indexable to the Google Dataset Search engine. When more data portals enable such functions, an end user will be able to search a variety of data on the Google Dataset Search engine. Metadata portals for specific geoscience disciplines or subjects such as deep time (Stephenson et al., 2020) can also be built with those indexable metadata from various data portals. Those improved functionalities will greatly benefit end users (Chapman et al., 2020). With more provenance information about workflows documented and shared, smart search engines can be developed that use such information to provide recommendations not only on data, but also on software packages that can be used to analyze the data, potential research topics for the data, and researchers with whom to collaborate. For example, Mookerjee et al. (this volume) discussed that by using machine learning, data management systems will be able to make connections to other data sets that can potentially build collaborations or suggest other geographical areas to study.

Smart data will save researchers time on data discovery and allow them to put more effort toward proposing research questions and conducting data analysis. This will be possible whether working with a small amount of data and identified research questions or a large amount of data that requires exploratory data analysis and hypothesis generation (Kitchin and Lauriault, 2015). Ma (2018) compared the data science process with conventional science approaches and pointed out that a unique feature of data science in the big data era is that while a lot of data are collected, we may not yet have formed a specific research question. Bergen et al. (2019) discussed that machine learning provides the means to discover high-dimensional and complex relationships in data and enables exploration of more scientific hypotheses. If the conventional approaches are small data and small knowledge (i.e., domain experts and personal computers), then the data science process can enable big data and big

knowledge (i.e., domain experts, smart data, machine learning, and cloud environment). In big data–enabled, multidisciplinary geoscience research projects, interpretability of the workflow will help people from different disciplinary backgrounds better understand the results and findings (Reichstein et al., 2019). This overlaps with the work on explainable and meaningful artificial intelligence in computer science (Hagras, 2018; Holzinger, 2018; Chari et al., 2020). In the geoscience community, there has been some initial work on this topic in workflow platforms, such as the “Meaning Spatial Statistics” initiative (Stasch et al., 2014), and we anticipate that more projects will be launched in the near future.

4.3. Science of Team Science to Facilitate Data-Driven Geoscience Discovery

In the data ecosystem underpinned by open science, there will be small data science projects that only require a small team, personal computers, and open source software packages. There will also be large-scale data science projects that cross disciplinary boundaries and require the collaboration of researchers from different institutions, high-performance computing facilities, efficient infrastructure for data storage and transmission, and large software programs for data management and processing. To succeed in such data science projects, the science of team science is recommended by many communities (NASEM, 2015). Key elements of the science of team science include (1) clear communication to reach consensus on the objective among team members, (2) regular brainstorming activities to identify and specify research questions, (3) complementary expertise from team members on problem solving, (4) regular team meetings to review progress and seek alternative approaches, and (5) positive and supportive working relationships within the team. The recent collaboration on a data-driven mineral evolution study (Hazen et al., 2019) shows successful real-world practices of team science. In that work, a list of activities was organized to create an environment where people from different knowledge backgrounds could quickly step out of their comfort zones, get familiar with each other, and work together on focused scientific topics.

Geoscience communities also need some cultural change to fully embrace open data and open science. The NASEM (2020) “Earth in Time” report envisioned a list of science priority questions for the NSF Earth science programs in the next decade. The report also made two recommendations on cyberinfrastructure. One is about a strategy to support FAIR data practices in community data efforts and the other is about the initiation of a community-based standing committee to provide advice on cyberinfrastructure needs and advances. Community of practice has received increasing attention in many academic associations and has been discussed as a catalyst for open science (Cutcher-Gershenfeld et al., 2017). Many researchers have been actively promoting open science in geoscience (Caron, 2020). For instance, the Earth Science Information Partners, through collaboration with EarthCube, the American Geophysical Union, European Geosciences Union, Geological Society of America, American Meteorological Society, and other organizations, has recently organized many successful Data Help Desk activities and archived a long list of reusable resources (ESIP, 2020). Hundreds of researchers across the world have joined those activities as volunteers to answer questions and share research outcomes. We anticipate that more such activities will be organized in the future to promote data science applications and cultural change in the geosciences.

5. CONCLUDING REMARKS

This paper presents a review of recent data science activities in geoscience from the perspective of a data life cycle. It first provides a description of the basic concepts and theoretical foundation of data science. Then, by following the process of the data life cycle, it reviews a number of the latest publications on each step in the data life cycle and summarizes the shareable experience from them. Finally, a vision of the trends in data science applications in geoscience is discussed, including open science, smart data, and the science of team science. The author hopes the review from the perspective of a data life cycle will lower the barrier to data science for geoscientists, especially newcomers to data science applications. Individual geoscientists can gain awareness of resources available in the cyberinfrastructure, explore representative examples of data science, and initiate ideas for their own work. Research teams can learn methods for collaboration and team science. Geoscientists have been successfully embracing the strategy of community of practice to share data science resources and promote best practices. The author hopes the open science campaign will further facilitate data science applications in geoscience and lead to more data-driven scientific discoveries.

ACKNOWLEDGMENTS

The work presented in this paper was supported by the National Science Foundation under grants 1835717, 2019609, and 2126315. Additional support was provided by the International Union of Geological Sciences Deep-Time Digital Earth (DDE) Big Science program, the Deep Carbon Observatory, the Alfred P. Sloan Foundation, and the Carnegie Institution for Science for communicating research progress at several workshops and meetings.

REFERENCES CITED

4D Initiative, 2018, White Paper of the 4D Initiative: Deep-Time Data Driven Discovery: https://4d.carnegiescience.edu/sites/default/files/4D_materials/4D_WhitePaper.pdf (March 04, 2020).
Acker, J.G., and Leptoukh, G., 2007, Online analysis enhances use of NASA earth science data: Eos (Transactions, American Geophysical Union), v. 88, no. 2, p. 14–17, https://doi.org/10.1029/2007EO020003.
Adhikari, A., and DeNero, J., 2017, Computational and inferential thinking: The foundations of data science: https://www.inferentialthinking.com (accessed January 2021).

Ahmouda, A., Hochmair, H.H., and Cvetojevic, S., 2018, Analyzing the effect of earthquakes on OpenStreetMap contribution patterns and tweeting activities: Geo-Spatial Information Science, v. 21, no. 3, p. 195–212, https://doi.org/10.1080/10095020.2018.1498666.
Aspesi, C., and Brand, A., 2020, In pursuit of open science, open access is not enough: Science, v. 368, no. 6491, p. 574–577, https://doi.org/10.1126/science.aba3763.
Bartha, G., and Kocsis, S., 2011, Standardization of geographic data: The European INSPIRE Directive: European Journal of Geography, v. 2, no. 2, p. 79–89.
Bedia, J., San-Martín, D., Iturbide, M., Herrera, S., Manzanas, R., and Gutiérrez, J.M., 2019, The METACLIP semantic provenance framework for climate products: Environmental Modelling & Software, v. 119, p. 445–457, https://doi.org/10.1016/j.envsoft.2019.07.005.
Berendt, B., Gandon, F., Halford, S., Hall, W., Hendler, J., Kinder-Kurlanda, K., Ntoutsi, E., and Staab, S., eds., 2020, Web futures: Inclusive, intelligent, sustainable: The 2020 manifesto for web science: http://webscience.org/the-2020-manifesto-for-web-science/ (accessed January 2021).
Bergen, K.J., Johnson, P.A., de Hoop, M.V., and Beroza, G.C., 2019, Machine learning for data-driven discovery in solid Earth geoscience: Science, v. 363, no. 6433, https://doi.org/10.1126/science.aau0323.
Berman, F., Rutenbar, R., Hailpern, B., Christensen, H., Davidson, S., Estrin, D., Franklin, M., Martonosi, M., Raghavan, P., Stodden, V., and Szalay, A.S., 2018, Realizing the potential of data science: Communications of the Association for Computing Machinery, v. 61, no. 4, p. 67–72, https://doi.org/10.1145/3188721.
Berners-Lee, T., 2000, Semantic Web on XML. Presentation at XML 2000 Conference: Washington, D.C., World Wide Web Consortium, http://www.w3.org/2000/Talks/1206-xml2k-tbl (accessed 24 January 2021).
Berners-Lee, T., Hendler, J., and Lassila, O., 2001, The Semantic Web: Scientific American, v. 284, no. 5, p. 34–43, https://doi.org/10.1038/scientificamerican0501-34.
Bishr, Y., 1998, Overcoming the semantic and other barriers to GIS interoperability: International Journal of Geographical Information Science, v. 12, no. 4, p. 299–314, https://doi.org/10.1080/136588198241806.
Brodaric, B., 2007, Geo-pragmatics for the Geospatial Semantic Web: Transactions in GIS, v. 11, no. 3, p. 453–477, https://doi.org/10.1111/j.1467-9671.2007.01055.x.
Brodaric, B., 2018, Interoperability of representations, in Richardson, D., Castree, N., Goodchild, M.F., Kobayashi, A., Liu, W., and Marston, R.A., eds., The International Encyclopedia of Geography: Hoboken, New Jersey, John Wiley & Sons, 18 p., https://doi.org/10.1002/9781118786352.wbieg0894.pub2.
Camizuli, E., and Carranza, E.J., 2018, Exploratory Data Analysis (EDA), in Varela, S.L.L., ed., The Encyclopedia of Archaeological Sciences: Hoboken, New Jersey, Wiley Online Library, 7 p., https://doi.org/10.1002/9781119188230.saseas0271.
Carneiro, H.A., and Mylonakis, E., 2009, Google trends: A web-based tool for real-time surveillance of disease outbreaks: Clinical Infectious Diseases, v. 49, no. 10, p. 1557–1564, https://doi.org/10.1086/630200.
Caron, B.C., 2020, Open Scientist Handbook, 305 p., https://doi.org/10.21428/8bbb7f85.35a0e14b.
Chan, M.A., Peters, S.E., and Tikoff, B., 2016, The future of field geology, open data sharing and cybertechnology in Earth science: The Sedimentary Record, v. 14, p. 4–10, https://doi.org/10.2110/sedred.2016.1.4.
Chapman, A., Simperl, E., Koesten, L., Konstantinidis, G., Ibáñez, L.D., Kacprzak, E., and Groth, P., 2020, Dataset search: A survey: The VLDB Journal, v. 29, no. 1, p. 251–272, https://doi.org/10.1007/s00778-019-00564-x.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., and Wirth, R., 2000, CRISP-DM 1.0: Step-by-Step Data Mining Guide: CRISP-DM Consortium, 78 p.
Chari, S., Seneviratne, O., Gruen, D.M., Foreman, M.A., Das, A.K., and McGuinness, D.L., 2020, Explanation ontology: A model of explanations for user-centered AI, in Pan, J.Z., Tamma, V., d’Amato, C., Janowicz, K., Fu, B., and Polleres, A., eds., The Semantic Web – ISWC 2020: Cham, Switzerland, Springer, p. 228–243.
Cheng, Q., Oberhänsli, R., and Zhao, M., 2020, A new international initiative for facilitating data-driven Earth science transformation, in Hill, P.R., Lebel, D., Hitzman, M., Smelror, M., and Thorleifson, H., eds., The Changing Role of Geological Surveys: Geological Society, London, Special Publication 499, p. 225–240, https://doi.org/10.1144/SP499-2019-158.
Choi, Y.D., Goodall, J.L., Sadler, J.M., Castronova, A.M., Bennett, A., Li, Z., Nijssen, B., Wang, S., Clark, M.P., Ames, D.P., and Horsburgh, J.S., 2021, Toward open and reproducible environmental modeling by integrating online data repositories, computational environments, and model Application Programming Interfaces: Environmental Modelling & Software, v. 135, https://doi.org/10.1016/j.envsoft.2020.104888.
Cleveland, W.S., 2001, Data science: An action plan for expanding the technical areas of the field of statistics: International Statistical Review, v. 69, no. 1, p. 21–26, https://doi.org/10.1111/j.1751-5823.2001.tb00477.x.
Cutcher-Gershenfeld, J., Baker, K.S., Berente, N., Carter, D.R., DeChurch, L.A., Flint, C.G., Gershenfeld, G., Haberman, M., King, J.L., Kirkpatrick, C., and Knight, E., 2016, Build it, but will they come? A geoscience cyberinfrastructure baseline analysis: Data Science Journal, v. 15, p. 8, https://doi.org/10.5334/dsj-2016-008.
Cutcher-Gershenfeld, J., Baker, K.S., Berente, N., Flint, C., Gershenfeld, G., Grant, B., Haberman, M., King, J.L., Kirkpatrick, C., Lawrence, B., and Lewis, S., 2017, Five ways consortia can catalyse open science: Nature, v. 543, no. 7647, p. 615–617, https://doi.org/10.1038/543615a.
DDI Alliance, 2021, Why use DDI?: https://ddialliance.org/training/why-use-ddi (accessed January 2021).
Di, L., Yue, P., Ramapriyan, H.K., and King, R.L., 2013, Geoscience data provenance: An overview: IEEE Transactions on Geoscience and Remote Sensing, v. 51, no. 11, p. 5065–5072, https://doi.org/10.1109/TGRS.2013.2242478.
Dietrich, D., Adamus, T., Miner, A., and Steinhart, G., 2012, De-mystifying the data management requirements of research funders: Issues in Science & Technology Librarianship, no. 70, Summer, https://doi.org/10.5062/F44M92G2.
Donoho, D., 2017, 50 years of data science: Journal of Computational and Graphical Statistics, v. 26, no. 4, p. 745–766, https://doi.org/10.1080/10618600.2017.1384734.
Drineas, P., and Huo, X., 2016, NSF Workshop Report: Theoretical Foundations of Data Science (TFoDS): http://www.cs.rpi.edu/TFoDS/TFoDS_v5.pdf (accessed January 2021).
Dutkiewicz, A., Müller, R.D., O’Callaghan, S., and Jónasson, H., 2015, Census of seafloor sediments in the world’s ocean: Geology, v. 43, no. 9, p. 795–798, https://doi.org/10.1130/G36883.1.
ESIP (Earth Science Information Partners), 2020, Data Help Desk: Connecting researchers and data experts to enhance research and make data and software more open and FAIR: https://www.esipfed.org/data-help-desk (accessed January 2021).
Fan, J.X., Shen, S.Z., Erwin, D.H., Sadler, P.M., MacLeod, N., Cheng, Q.M., Hou, X.D., Yang, J., Wang, X.D., Wang, Y., and Zhang, H., 2020a, A high-resolution summary of Cambrian to Early Triassic marine invertebrate biodiversity: Science, v. 367, no. 6475, p. 272–277, https://doi.org/10.1126/science.aax4953.
Fan, R., Wang, L., Yan, J., Song, W., Zhu, Y., and Chen, X., 2020b, Deep learning-based named entity recognition and knowledge graph construction for geological hazards: ISPRS International Journal of Geo-Information, v. 9, no. 1, p. 15, https://doi.org/10.3390/ijgi9010015.
Floridi, L., 2019, Establishing the rules for building trustworthy AI: Nature Machine Intelligence, v. 1, no. 6, p. 261–262, https://doi.org/10.1038/s42256-019-0055-y.
Fox, P., 2019, Disruption in biogeosciences: Conceptual, methodological, digital, and technological: Acta Geologica Sinica, v. 93, no. S3, p. 17–18, https://doi.org/10.1111/1755-6724.14231.
Fox, P., and Hendler, J., 2011, Changing the equation on scientific data visualization: Science, v. 331, no. 6018, p. 705–708, https://doi.org/10.1126/science.1197654.
Fox, P., and Hendler, J., 2014, Science of data science: Big Data, v. 2, no. 2, p. 68–70, https://doi.org/10.1089/big.2014.0011.
GCMD (Global Change Master Directory), 2020, GCMD Keywords, Version 9.1. Earth Science Data and Information System, Earth Science Projects Division, Goddard Space Flight Center (GSFC), National Aeronautics and Space Administration (NASA): https://wiki.earthdata.nasa.gov/display/gcmdkey (accessed January 2021).
Gil, Y., David, C.H., Demir, I., Essawy, B.T., Fulweiler, R.W., Goodall, J.L., Karlstrom, L., Lee, H., Mills, H.J., Oh, J.H., and Pierce, S.A., 2016, Toward the geoscience paper of the future: Best practices for documenting and sharing research from data to software to provenance: Earth and Space Science, v. 3, no. 10, p. 388–415, https://doi.org/10.1002/2015EA000136.
Gil, Y., Pierce, S.A., Babaie, H., Banerjee, A., Borne, K., Bust, G., Cheatham, M., Ebert-Uphoff, I., Gomes, C., Hill, M., and Horel, J., 2019, Intelligent systems for geosciences: An essential research agenda: Communications of the Association for Computing Machinery, v. 62, no. 1, p. 76–84, https://doi.org/10.1145/3192335.

Downloaded from http://pubs.geoscienceworld.org/gsa/books/book/2377/chapter-pdf/5743375/spe558-05.pdf


by INGEMMET user
Geodata science 67

Golden, J.J., Downs, R.T., Hazen, R.M., Pires, A.J., and Ralph, J., 2019, Min- Kelleher, J.D., and Tierney, B., 2018, Data Science: Cambridge, Massachusetts,
eral Evolution Database: Data-driven age assignment, how does a mineral MIT Press, 280 p., https://doi.org/10.7551/mitpress/11140.001.0001.
get an age?: Geological Society of America Abstracts with Programs, Keller, C.B., and Schoene, B., 2012, Statistical geochemistry reveals disrup-
v. 51, no. 5, https://doi.org/10.1130/abs/2019AM-334056. tion in secular lithospheric evolution about 2.5 Gyr ago: Nature, v. 485,
Goovaerts, P., 2008, Geostatistical analysis of health data: State-of-the-art and no. 7399, p. 490–493, https://doi.org/10.1038/nature11024.
perspectives, in Soares, A., Pereira, M.J., and Dimitrakopoulos, R., eds., Keller, C.B., Schoene, B., Barboni, M., Samperton, K.M., and Husson, J.M.,
geoENV VI—Geostatistics for Environmental Applications: Dordrecht, 2015, Volcanic–plutonic parity and the differentiation of the continen-
Netherlands, Springer, p. 3–22. tal crust: Nature, v. 523, no. 7560, p. 301–307, https://doi.org/10.1038
Goovaerts, P., 2021, From natural resources evaluation to spatial epidemiology: /nature14584.
25 years in the making: Mathematical Geosciences, v. 53, p. 239–266, Kitchin, R., 2014, The Data Revolution: Big Data, Open Data, Data Infra-
https://doi.org/10.1007/s11004-020-09886-x. structures & Their Consequences: London, Sage, 222 p., https://doi
Goring, S., Dawson, A., Simpson, G., Ram, K., Graham, R., Grimm, E., and .org/10.4135/9781473909472.
Williams, J., 2015, Neotoma: A programmatic interface to the Neotoma Kitchin, R., and Lauriault, T.P., 2015, Small data in the era of big data: GeoJour-
Paleoecological Database: Open Quaternary, v. 1, no. 1, p. 2, https://doi nal, v. 80, no. 4, p. 463–475, https://doi.org/10.1007/s10708-014-9601-7.
.org/10.5334/oq.ab. Klump, J., Huber, R., and Diepenbroek, M., 2016, DOI for geoscience data—
Gruber, T.R., 1995, Toward principles for the design of ontologies used for How early practices shape present perceptions: Earth Science Informat-
knowledge sharing?: International Journal of Human-Computer Studies, ics, v. 9, no. 1, p. 123–136, https://doi.org/10.1007/s12145-015-0231-5.
v. 43, no. 5–6, p. 907–928, https://doi.org/10.1006/ijhc.1995.1081. Lannom, L., Koureas, D., and Hardisty, A.R., 2020, FAIR data and services
Hagras, H., 2018, Toward human-understandable, explainable AI: Computer, in biodiversity science and geoscience: Data Intelligence, v. 2, no. 1–2,
v. 51, no. 9, p. 28–36, https://doi.org/10.1109/MC.2018.3620965. p. 122–130, https://doi.org/10.1162/dint_a_00034.
Hamman, J., Rocklin, M., and Abernathy, R., 2018. Pangeo: A big-data eco- Lary, D.J., Alavi, A.H., Gandomi, A.H., and Walker, A.L., 2016, Machine learn-
system for scalable earth system science: 2014 EGU General Assembly ing in geosciences and remote sensing: Geoscience Frontiers, v. 7, no. 1,
Conference Abstracts, v. 20, no. EGU2018-12146. p. 3–10, https://doi.org/10.1016/j.gsf.2015.07.003.
Hansen, M.C., Potapov, P.V., Moore, R., Hancher, M., Turubanova, S.A., Tyu- Laxton, J.L., 2017, Geological map fusion: OneGeology-Europe and INSPIRE,
kavina, A., Thau, D., Stehman, S.V., Goetz, S.J., Loveland, T.R., and in Riddick, A.T., Kessler, H., and Giles, J.R.A., eds., Integrated Envi-
Kommareddy, A., 2013, High-resolution global maps of 21st-century for- ronmental Modelling to Solve Real World Problems: Methods, Vision
est cover change: Science, v. 342, p. 850–853. and Challenges: Geological Society, London, Special Publication 408,
Hart, J.K., and Martinez, K., 2006, Environmental sensor networks: A revolu- p. 147–160, https://doi.org/10.1144/SP408.16.
tion in the earth system science?: Earth-Science Reviews, v. 78, no. 3–4, Lebo, T., Sahoo, S., and McGuinness, D., 2013, PROV-O: The PROV Ontol-
p. 177–191, https://doi.org/10.1016/j.earscirev.2006.05.001. ogy. W3C recommendation: https://www.w3.org/TR/2013/REC-prov-o
Hazen, R.M., 2014, Data-driven abductive discovery in mineralogy: The Amer- -20130430 (accessed January 2021).
ican Mineralogist, v. 99, no. 11–12, p. 2165–2170, https://doi.org/10.2138 Lehmann, A., Nativi, S., Mazzetti, P., Maso, J., Serral, I., Spengler, D., Niamir,
/am-2014-4895. A., McCallum, I., Lacroix, P., Patias, P., and Rodila, D., 2020, GEO-
Hazen, R.M., Bekker, A., Bish, D.L., Bleeker, W., Downs, R.T., Farqu- Essential—Mainstreaming workflows from data sources to environment
har, J., Ferry, J.M., Grew, E.S., Knoll, A.H., Papineau, D., and Ralph, policy indicators with essential variables: International Journal of Digital
J.P., 2011, Needs and opportunities in mineral evolution research: The Earth, v. 13, no. 2, p. 322–338, https://doi.org/10.1080/17538947.2019
American Mineralogist, v. 96, no. 7, p. 953–963, https://doi.org/10.2138 .1585977.
/am.2011.3725. Li, Z., Yang, C., Jin, B., Yu, M., Liu, K., Sun, M., and Zhan, M., 2015, Enabling
Hazen, R.M., Downs, R.T., Eleish, A., Fox, P., Gagne, O., Golden, J.J., Grew, big geoscience data analytics with a cloud-based, MapReduce-enabled
E.S., Hummer, D.R., Hystad, G., Krivovichev, S.V., Li, C., Liu, C., Ma, and service-oriented workflow framework: PLoS One, v. 10, no. 3, https://
X., Morrison, S.M., Pan, F., Pires, A.J., Prabhu, A., Ralph, J., Rumyon, doi.org/10.1371/journal.pone.0116781.
S.E., and Zhong, H., 2019, Data-driven discovery in mineralogy: Recent Loscio, B.F., Burle, C., and Calegari, N., eds., 2017, Data on the web best prac-
advances in data resources, analysis, and visualization: Engineering, v. 5, tices, W3C recommendation: https://www.w3.org/TR/dwbp.
no. 3, p. 397–405, https://doi.org/10.1016/j.eng.2019.03.006. Ludäscher, B., Lin, K., Brodaric, B., and Baru, C., 2003, GEON: Toward a
He, Z., Liu, G., Ma, X., and Chen, Q., 2019, GeoBeam: A distributed com- cyberinfrastructure for the geosciences—A prototype for geological map
puting framework for spatial data: Computers & Geosciences, v. 131, interoperability via domain ontologies, in Soller, D.R., ed., Digital Map-
p. 15–22, https://doi.org/10.1016/j.cageo.2019.06.003. ping Techniques ’03—Workshop Proceedings, 1–4 June, Millersville,
Hey, T., Tansley, S., and Tolle, K., eds., 2009, The Fourth Paradigm: Data-Intensive Pennsylvania: U.S. Geological Survey Open-File Report 03-471 p. 223–
Scientific Discovery: Redmond, Washington, Microsoft Corporation, 252 p. 229, https://pubs.usgs.gov/of/2003/of03-471/report.pdf.
Ho, Y.C., 1994, Abduction? Deduction? Induction? Is there a logic of exploratory Lynnes, C., 2020, Federated Giovanni for multi-sensor data exploration: https://
data analysis?, in Proceedings of the Annual Meeting of the American Edu- earthdata.nasa.gov/esds/competitive-programs/access/federated-giovanni
cational Research Association, New Orleans, Louisiana, 28 p. (accessed January 2021).
Holzinger, A., 2018, From machine learning to explainable AI, in Proceedings Ma, X., 2018, Data science for geoscience: Leveraging mathematical geosci-
of the 2018 World Symposium on Digital Intelligence for Systems and ences with semantics and open data, in Sagar, B.S.D., Cheng, Q., and
Machines (DISA), Kosice, Slovakia, p. 55–66. Agterberg, F.D., eds., Handbook of Mathematical Geosciences: Fifty
Hystad, G., Downs, R.T., and Hazen, R.M., 2015, Mineral species frequency Years of IAMG: Cham, Switzerland, Springer, p. 687–702, https://doi
distribution conforms to a large number of rare events model: Predic- .org/10.1007/978-3-319-78999-6_34.
tion of Earth’s missing minerals: Mathematical Geosciences, v. 47, no. 6, Ma, X., and Fox, P., 2014, A jigsaw puzzle layer cake of spatial data: Eos
p. 647–661, https://doi.org/10.1007/s11004-015-9600-3. (Transactions, American Geophyscial Union), v. 95, no. 19, p. 161–162,
Hystad, G., Eleish, A., Hazen, R.M., Morrison, S.M., and Downs, R.T., 2019, https://doi.org/10.1002/2014EO190006.
Bayesian estimation of Earth’s undiscovered mineralogical diversity Ma, X., Asch, K., Laxton, J.L., Richard, S.M., Asato, C.G., Carranza, E.J.M.,
using noninformative priors: Mathematical Geosciences, v. 51, no. 4, van der Meer, F.D., Wu, C., Duclaux, G., and Wakita, K., 2011, Data
p. 401–417, https://doi.org/10.1007/s11004-019-09795-8. exchange facilitated: Nature Geoscience, v. 4, no. 12, p. 814, https://doi
Iafrate, F., 2014, A journey from big data to smart data, in Benghozi, P.-J., .org/10.1038/ngeo1335.
Krob, D., Lonjon, A., and Panetto, H., eds., Digital Enterprise Design Ma, X., Chen, Y., Wang, H., Zheng, J., Fu, L., West, P., Erickson, J.S., and Fox,
& Management: Cham, Switzerland, Springer, p. 25–33, https://doi P., 2015, Data visualization in the Semantic Web, in Narock, T., and Fox,
.org/10.1007/978-3-319-04313-5_3. P., eds., The Semantic Web in Earth and Space Science: Current Status
Jackson, I., 2010, OneGeology: Improving access to geoscience globally: and Future Directions: Berlin, IOS Press, p. 149–167.
Earthwise, v. 26, p. 14–15. Ma, X., Hummer, D., Golden, J.J., Fox, P.A., Hazen, R.M., Morrison, S.M.,
Karpatne, A., Ebert-Uphoff, I., Ravela, S., Babaie, H.A., and Kumar, V., 2019, Downs, R.T., Madhikarmi, B.L., Wang, C., and Meyer, M.B., 2017, Using
Machine learning for the geosciences: Challenges and opportunities: visual exploratory data analysis to facilitate collaboration and hypothesis
IEEE Transactions on Knowledge and Data Engineering, v. 31, no. 8, generation in cross-disciplinary research: ISPRS International Journal of
p. 1544–1554, https://doi.org/10.1109/TKDE.2018.2861006. Geo-Information, v. 6, no. 11, p. 368, https://doi.org/10.3390/ijgi6110368.

Downloaded from http://pubs.geoscienceworld.org/gsa/books/book/2377/chapter-pdf/5743375/spe558-05.pdf


by INGEMMET user
68 X. Ma

Ma, X., Ma, C., and Wang, C., 2020, A new structure for representing and track- NASEM (National Academies of Sciences, Engineering, and Medicine), 2020,
ing version information in a deep time knowledge graph: Computers & A Vision for NSF Earth Sciences 2020–2030: Earth in Time: Wash-
Geosciences, v. 145, https://doi.org/10.1016/j.cageo.2020.104620. ington, D.C., The National Academies Press, 172 p., https://doi.org
Martinez, K., Hart, J.K., and Ong, R., 2004, Environmental sensor networks: /10.17226/25761.
Computer, v. 37, no. 8, p. 50–56, https://doi.org/10.1109/MC.2004.91. Noy, N., Burgess, M., and Brickley, D., 2019, Google Dataset Search: Building
Maskey, M., Alemohammad, H., Murphy, K.J., and Ramachandran, R., a search engine for datasets in an open web ecosystem, in Proceedings of
2020, Advancing AI for Earth science: A data systems perspective: Eos the 2019 World Wide Web Conference, San Francisco, California: New
(Transactions, American Geophysical Union), v. 101, https://doi.org York, Association for Computing Machinery, p. 1365–1375.
/10.1029/2020EO151245. NSF (National Science Foundation), 2015, NSF Public Access Plan: Today’s
Mattmann, C.A., 2013, A vision for data science: Nature, v. 493, no. 7433, Data, Tomorrow’s Discoveries—Increasing Access to the Results of
p. 473–475, https://doi.org/10.1038/493473a. Research Funded by the National Science Foundation, 31 p., https://www
McGuinness, D.L., 2003, Ontologies come of age, in Fensel, D., Hendler, .nsf.gov/pubs/2015/nsf15052/nsf15052.pdf (accessed May 2019).
J., Lieberman, H., and Wahlster, W., eds., Spinning the Semantic Web: Obrst, L., 2003, Ontologies for semantically interoperable systems, in Proceed-
Bringing the World Wide Web to Its Full Potential: Cambridge, Massa- ings, Twelfth International Conference on Information and Knowledge
chusetts, MIT Press, p. 171–196. Management, 3–8 November, New Orleans, Louisiana: New York, Asso-
Merriam, D., 2004, The quantification of geology: From abacus to pentium: ciation for Computing Machinery, p. 366–369.
A chronicle of people, places, and phenomena: Earth-Science Reviews, Parsons, M.A., Duerr, R., and Minster, J.B., 2010, Data citation and peer
v. 67, no. 1–2, p. 55–89, https://doi.org/10.1016/j.earscirev.2004.02.002. review: Eos (Transactions, American Geophysical Union), v. 91, no. 34,
Mons, B., 2018, Data Stewardship for Open Science: Implementing FAIR p. 297–298, https://doi.org/10.1029/2010EO340001.
Principles: New York, Chapman and Hall, 244 p., https://doi.org Peters, S.E., and McClennen, M., 2016, The Paleobiology Database applica-
/10.1201/9781315380711. tion programming interface: Paleobiology, v. 42, no. 1, p. 1–7, https://doi
Mookerjee, M., Vieira, D., Chan, M.A., Gil, Y., Pavlis, T.L., Spear, F.S., and .org/10.1017/pab.2015.39.
Tikoff, B., 2015, Field data management: Integrating cyberscience and Peters, S.E., Zhang, C., Livny, M., and Ré, C., 2014, A machine reading sys-
geoscience: Eos (Transactions, American Geophysical Union), v. 96, tem for assembling synthetic paleontological databases: PLoS One, v. 9,
no. 20, p. 18–21, https://doi.org/10.1029/2015EO036703. no. 12, https://doi.org/10.1371/journal.pone.0113523.
Mookerjee, M., Chan, M.A., Gil, Y., Gill, G., Goodwin, C., Pavlis, T.L., Ship­ Peters, S.E., Husson, J.M., and Wilcots, J., 2017, The rise and fall of stromato-
ley, T.F., Swain, T., Tikoff, B., and Vieira, D., 2022, this volume, Cyber- lites in shallow marine environments: Geology, v. 45, no. 6, p. 487–490,
infrastructure for collecting and integrating geology field data: Com- https://doi.org/10.1130/G38931.1.
munity priorities and research agenda, in Ma, X., Mookerjee, M., Hsu, Prabhu, A., Morrison, S.M., Eleish, A., Zhong, H., Huang, F., Golden, J.J.,
L., and Hills, D., eds., Recent Advancement in Geoinformatics and Data Perry, S.N., Hummer, D.R., Ralph, J., Runyon, S.E., and Fontaine, K.,
Science: Geological Society of America Special Paper 558, https://doi 2021, Global earth mineral inventory: A data legacy: Geoscience Data
.org/10.1130/2022.2558(01). Journal, v. 8, no. 1, p. 74–89, https://doi.org/10.1002/gdj3.106.
Morrison, S.M., Liu, C., Eleish, A., Prabhu, A., Li, C., Ralph, J., Downs, R.T., Press, G., 2016 (23 March), Cleaning big data: Most time-consuming, least
Golden, J.J., Fox, P., Hummer, D.R., and Meyer, M.B., 2017, Network enjoyable data science task, survey says: Forbes, https://www.forbes.com
analysis of mineralogical systems: The American Mineralogist, v. 102, /sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least
no. 8, p. 1588–1596, https://doi.org/10.2138/am-2017-6104CCBYNCND. -enjoyable-data-science-task-survey-says/ (accessed January 2021).
Morrison, S.M., Prabhu, A., Eleish, A., Pan, F., Zhong, H., Huang, F., Fox, P., Qiu, Q., Xie, Z., Wu, L., and Li, W., 2019, Geoscience keyphrase extraction
Ma, X., Ralph, J., Golden, J.J., and Downs, R.T., 2019, Application of algorithm using enhanced word embedding: Expert Systems with Appli-
advanced analytics and visualization in mineral systems: Acta Geologica cations, v. 125, p. 157–169, https://doi.org/10.1016/j.eswa.2019.02.001.
Sinica, v. 93, no. S3, p. 55, https://doi.org/10.1111/1755-6724.14243. Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J., and Carv-
Morrison, S.M., Buongiorno, J., Downs, R.T., Eleish, A., Fox, P., Giovannelli, alhais, N., 2019, Deep learning and process understanding for data-driven
D., Golden, J.J., Hummer, D.R., Hystad, G., Kellogg, L.H., and Kreylos, Earth system science: Nature, v. 566, no. 7743, p. 195–204, https://doi
O., 2020, Exploring carbon mineral systems: Recent advances in C min- .org/10.1038/s41586-019-0912-1.
eral evolution, mineral ecology, and network analysis: Frontiers of Earth Reitsma, F., Laxton, J., Ballard, S., Kuhn, W., and Abdelmoty, A., 2009, Seman-
Science, v. 8, p. 208, https://doi.org/10.3389/feart.2020.00208. tics, ontologies and eScience for the geosciences: Computers & Geosci-
Müller, R.D., Cannon, J., Qin, X., Watson, R.J., Gurnis, M., Williams, S., ences, v. 35, no. 4, p. 706–709, https://doi.org/10.1016/j.cageo.2008.03.014.
Pfaffelmoser, T., Seton, M., Russell, S.H., and Zahirovic, S., 2018, Rosenberg, D.E., Filion, Y., Teasley, R., Sandoval-Solis, S., Hecht, J.S., Van
GPlates: Building a virtual Earth through deep time: Geochemistry, Zyl, J.E., McMahon, G.F., Horsburgh, J.S., Kasprzyk, J.R., and Tarboton,
Geophysics, Geosystems, v. 19, no. 7, p. 2243–2261, https://doi.org D.G., 2020, The next frontier: Making research more reproducible: Jour-
/10.1029/2018GC007584. nal of Water Resources Planning and Management, v. 146, no. 6, https://
Muscente, A.D., Prabhu, A., Zhong, H., Eleish, A., Meyer, M.B., Fox, P., Hazen, doi.org/10.1061/(ASCE)WR.1943-5452.0001215.
R.M., and Knoll, A.H., 2018, Quantifying ecological impacts of mass Rossi, S.D., Barros, A., Walden-Schreiner, C., and Pickering, C., 2020, Using
extinctions with network analysis of fossil communities in Proceedings social media images to assess ecosystem services in a remote protected
of the National Academy of Sciences of the United States of America, area in the Argentinean Andes: Ambio, v. 49, p. 1146–1160, https://doi
v. 115, no. 20, p. 5217–5222, https://doi.org/10.1073/pnas.1719976115. .org/10.1007/s13280-019-01268-w.
NADM Steering Committee, 2004, NADM Conceptual Model 1.0—A Con- Sagar, D.B.S., Cheng, Q., and Agterberg, F., eds., 2018, Handbook of Mathe-
ceptual Model for Geologic Map Information: U.S. Geological Survey matical Geosciences: Fifty Years of IAMG: Cham, Switzerland, Springer,
Open-File Report 2004-1334, 58 p., https://doi.org/10.3133/ofr20041334. 914 p., https://doi.org/10.1007/978-3-319-78999-6.
Narock, T., and Shepherd, A., 2017, Semantics all the way down: The Semantic Schutt, R., and O’Neil, C., 2013, Doing Data Science: Straight Talk from the
Web and open science in big earth data: Big Earth Data, v. 1, no. 1–2, Frontline: New York, O’Reilly, 406 p.
p. 159–172, https://doi.org/10.1080/20964471.2017.1397408. Shannon, M., 2019, How does NASA use big data?: Big Data Made Simple,
NASEM (National Academies of Sciences, Engineering, and Medicine), 2015, https://bigdata-madesimple.com/how-does-nasa-use-big-data (accessed
Enhancing the Effectiveness of Team Science: Washington, D.C., The January 2021).
National Academies Press, 268 p., https://doi.org/10.17226/19007. Shepherd, A., Minnett, R., Jarboe, N., Koppers, A., Tauxe, L., Constable, C.,
NASEM (National Academies of Sciences, Engineering, and Medicine), and Jonestrask, L., 2019, Thorough Annotation of Magnetics Information
2018a, Data Science for Undergraduates: Opportunities and Options: Consortium (MagIC) contributions with Schema.org structured metadata:
Washington, D.C., The National Academies Press, 107 p., https://doi.org Abstract IN22B-01 presented at 2019 Fall Meeting, American Geophysi-
/10.17226/25104. cal Union, San Francisco, California, 9–13 December.
NASEM (National Academies of Sciences, Engineering, and Medicine), 2018b, Sheth, A., 2014, Transforming big data into smart data: Deriving value via har-
Open Science by Design: Realizing a Vision for 21st Century Research: nessing volume, variety, and velocity using semantic techniques and tech-
Washington, D.C., The National Academies Press, 216 p., https://doi.org nologies, in Proceedings of the 2014 IEEE 30th International Conference
/10.17226/25116. on Data Engineering (ICDE), Chicago, p. 2–2.

Downloaded from http://pubs.geoscienceworld.org/gsa/books/book/2377/chapter-pdf/5743375/spe558-05.pdf


by INGEMMET user
Geodata science 69

Sheth, A.P., 1999, Changing focus on interoperability in information systems: Varela, S., González‐Hernández, J., Sgarbi, L.F., Marshall, C., Uhen, M.D.,
From system, syntax, structure to semantics, in Goodchild, M., Egen- Peters, S., and McClennen, M., 2015, paleobioDB: An R package for
hofer, M., Fegeas, R., and Kottman, C., eds., Interoperating Geographic downloading, visualizing and processing data from the Paleobiology
Information Systems: Dordrecht, Netherlands, Kluwer Academic Pub- Database: Ecography, v. 38, no. 4, p. 419–425, https://doi.org/10.1111
lishers, p. 5–29, https://doi.org/10.1007/978-1-4615-5189-8_2. /ecog.01154.
Shipley, T.F., and Tikoff, B., 2019, Collaboration, cyberinfrastructure, and Wang, C., Ma, X., Chen, J., and Chen, J., 2018, Information extraction and
cognitive science: The role of databases and dataguides in 21st century knowledge graph construction from geoscience literature: Comput-
structural geology: Journal of Structural Geology, v. 125, p. 48–54,
­ ers & Geosciences, v. 112, p. 112–120, https://doi.org/10.1016/j.cageo
https://doi.org/10.1016/j.jsg.2018.05.007. .2017.12.007.
Sinha, A.K., Malik, Z., Rezgui, A., Barnes, C.G., Lin, K., Heiken, G., Thomas, Wang, Z., Ye, X., and Tsou, M.H., 2016, Spatial, temporal, and content analysis
W.A., Gundersen, L.C., Raskin, R., Jackson, I., and Fox, P., 2010, Geoin- of Twitter for wildfire hazards: Natural Hazards, v. 83, no. 1, p. 523–540,
formatics: Transforming data to knowledge for geosciences: GSA Today, https://doi.org/10.1007/s11069-016-2329-6.
v. 20, no. 12, p. 4–10, https://doi.org/10.1130/GSATG85A.1. Welle Donker, F., and van Loenen, B., 2017, How to assess the success of the
Spielman, S.J., and Moore, E.K., 2020, dragon: A new tool for exploring redox open data ecosystem?: International Journal of Digital Earth, v. 10, no. 3,
evolution preserved in the mineral record: Frontiers of Earth Science, v. 8, p. 284–306, https://doi.org/10.1080/17538947.2016.1224938.
https://doi.org/10.3389/feart.2020.585087. Welty, C., 2002, Ontology-driven conceptual modeling, in Pidduck, A.B.,
Stall, S., Yarmey, L., Cutcher-Gershenfeld, J., Hanson, B., Lehnert, K., Nosek, Mylopoulos, J., Woo, C.C., and Ozsu, M.T., eds., Advanced Information
B., Parsons, M., Robinson, E., and Wyborn, L., 2019, Make scientific Systems Engineering, Lecture Notes in Computer Science, Volume 2348:
data FAIR: Nature, v. 570, p. 27–29, https://doi.org/10.1038/d41586-019 Berlin, Springer, p. 3.
-01720-7. Wen, T., Niu, X., Gonzales, M., Zheng, G., Li, Z., and Brantley, S.L., 2018,
Starr, J., and Gastl, A., 2011, isCitedBy: A metadata scheme for DataCite: Big groundwater data sets reveal possible rare contamination amid other-
D-Lib Magazine: The Magazine of the Digital Library Forum, v. 17, wise improved water quality for some analytes in a region of Marcellus
no. 1/2, https://doi.org/10.1045/january2011-starr. shale development: Environmental Science & Technology, v. 52, no. 12,
Stasch, C., Scheider, S., Pebesma, E., and Kuhn, W., 2014, Meaningful spatial p. 7149–7159, https://doi.org/10.1021/acs.est.8b01123.
prediction and aggregation: Environmental Modelling & Software, v. 51, Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M.,
p. 149–165, https://doi.org/10.1016/j.envsoft.2013.09.006. Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E.,
Stephenson, M.H., Cheng, Q., Wang, C., Fan, J., and Oberhänsli, R., 2020, and Bouwman, J., 2016, The FAIR Guiding Principles for scientific data
Progress towards the establishment of the IUGS Deep-Time Digital Earth management and stewardship: Scientific Data, v. 3, no. 1, 160018, https://
(DDE) programme: Episodes Journal of International Geoscience, v. 43, doi.org/10.1038/sdata.2016.18.
no. 4, p. 1057–1062, https://doi.org/10.18814/epiiugs/2020/020057. Wing, J.M., 2019, The data life cycle: Harvard Data Science Review, v. 1, no. 1,
Sun, A.Y., and Scanlon, B.R., 2019, How can big data and machine learning https://doi.org/10.1162/99608f92.e26845b4.
benefit environment and water management: A survey of methods, appli- Wing, J.M., 2020, Ten research challenge areas in data science: Harvard Data
cations, and future directions: Environmental Research Letters, v. 14, Science Review, v. 2, no. 3, https://doi.org/10.1162/99608f92.c6577b1f.
no. 7, 073001, https://doi.org/10.1088/1748-9326/ab1b7d. Wood, J., Andersson, T., Bachem, A., Best, C., Genova, F., Lopez, D.R., Los,
Sun, Z., Di, L., Burgess, A., Tullis, J.A., and Magill, A.B., 2020, Geoweaver: W., Marinucci, M., Romary, L., Van de Sompel, H., and Vigen, J., 2010,
Advanced cyberinfrastructure for managing hybrid geoscientific AI work- Riding the wave: How Europe can gain from the rising tide of scientific
flows: ISPRS International Journal of Geo-Information, v. 9, no. 2, p. 119, data, in Final Report of the High Level Expert Group on Scientific Data–
https://doi.org/10.3390/ijgi9020119. A Submission to the European Commission: European Union, 36 p.
Tandy, J., van den Brink, L., and Barnaghi, P., eds., 2017, Spatial data on the Yang, C., Huang, Q., Li, Z., Liu, K., and Hu, F., 2017, Big data and cloud com-
web best practices, W3C Working Group note, https://www.w3.org/TR puting: Innovation opportunities and challenges: International Journal of
/sdw-bp. Digital Earth, v. 10, no. 1, p. 13–53, https://doi.org/10.1080/17538947
Tukey, J.W., 1977, Exploratory Data Analysis: Reading, Pennsylvania, .2016.1239771.
­Addison-Wesley, 688 p. Yang, C., Yu, M., Li, Y., Hu, F., Jiang, Y., Liu, Q., Sha, D., Xu, M., and Gu,
Uschold, M., and Gruninger, M., 2004, Ontologies and semantics for seam- J., 2019, Big Earth data analytics: A survey: Big Earth Data, v. 3, no. 2,
less connectivity: SIGMOD Record, v. 33, no. 4, p. 58–64, https://doi p. 83–107, https://doi.org/10.1080/20964471.2019.1611175.
.org/10.1145/1041410.1041420. Zeng, Y., Su, Z., Barmpadimos, I., Perrels, A., Poli, P., Boersma, K.F., Frey,
USGS MRDATA, 2021, Mineral resources online spatial data, https://mrdata A., Ma, X., de Bruin, K., Goosen, H., and John, V.O., 2019, Towards a
.usgs.gov (accessed January 2021). traceable climate service: Assessment of quality and usability of essen-
USGS NCGMP (U.S. Geological Survey National Cooperative Geologic Map- tial climate variables: Remote Sensing, v. 11, no. 10, p. 1186, https://doi
ping Program), 2020, GeMS (Geologic Map Schema)—A Standard For- .org/10.3390/rs11101186.
mat for the Digital Publication of Geologic Maps: Reston, Virginia, U.S. Zhang, C., Govindaraju, V., Borchardt, J., Foltz, T., Ré, C., and Peters, S., 2013,
Geological Survey, 74 p., https://doi.org/10.3133/tm11B10. GeoDeepDive: Statistical inference using familiar data-processing lan-
Valentine, D., Zaslavsky, I., Richard, S., Meier, O., Hudman, G., Peucker‐ guages, in Proceedings of the 2013 ACM SIGMOD International Confer-
Ehrenbrink, B., and Stocks, K., 2020, EarthCube Data Discovery Studio: ence on Management of Data, New York, p. 993–996.
A gateway into geoscience data discovery and exploration with Jupy-
ter notebooks: Concurrency and Computation, v. 33, no. 19, https://doi Manuscript Accepted by the Society 17 March 2022
.org/10.1002/cpe.6086. Manuscript Published Online 23 November 2022
