Topical Review
From DFT to Machine Learning: recent approaches to Materials Science – a review

Gabriel R. Schleder∗1,2, Antonio C. M. Padilha2, Carlos Mera Acosta1,2, Marcio Costa2 and Adalberto Fazzio∗2,1
1 Center for Natural and Human Sciences, Federal University of ABC, 09210-580, Santo André, São Paulo, Brazil
2 Brazilian Nanotechnology National Laboratory/CNPEM, 13083-970, Campinas, São Paulo, Brazil
∗ Corresponding authors.
Submitted to: J. Phys. Materials
Contents

1 Introduction
  1.1 Science paradigms: Data science
  1.2 Development of computational materials science
2 Fundamentals of methods
  2.1 Density Functional Theory (DFT)
    2.1.1 Historical developments
    2.1.2 Current status
      2.1.2.1 Structure Prediction
  2.2 High-throughput (HT)
    2.2.1 (Big data) Screening and Mining
  2.3 Machine Learning (ML)
    2.3.1 Types of machine learning problems
    2.3.2 Learning Algorithms
    2.3.3 Materials Informatics
      2.3.3.1 Representations and descriptors
      2.3.3.2 Novel ML methods in physics and materials
3 Applications in Materials Science
  3.1 High-Throughput
    3.1.1 Materials discovery, design, and characterization
    3.1.2 Topological ordered materials
    3.1.3 2D materials
  3.2 Machine Learning for materials
    3.2.1 Discovery, energies, and stability
    3.2.2 Electronic Properties
    3.2.3 Magnetic properties
    3.2.4 Topological ordered materials
      3.2.4.1 Quantum phase transition in topological insulator models
      3.2.4.2 Topological materials classification
    3.2.5 Superconductivity
1 Introduction
In the last three decades, we have witnessed the generation of huge amounts of theoretical and experimental data in several areas of knowledge. Within the field of computational materials science, such abundance of data is possible mainly due to the success of density functional theory (DFT) and the fast development of computational capabilities. At the same time, advances in instrumentation and electronics have enabled experiments to produce large quantities of results. Therefore, along with the high-throughput (HT) approach, we have obtained a huge amount of theoretical as well as experimental data, and the logical next step is the emergence of novel tools capable of extracting knowledge from such data. Among such tools, the field of statistical learning has coined the so-called machine learning (ML) techniques, which are currently steering research into a new data-driven science paradigm.
In this review, we strive to present the historical development, state of the art, and synergy between the concepts of theoretical and computational materials science and statistical learning. We choose to focus on DFT and HT methods for the former and on ML for the latter. A chronological evolution of science, with emphasis on the specific area of materials research, is presented in Section 1. Next, in Section 2 we describe the development and current status of the methods used to generate data within the DFT and HT frameworks and to analyze it via ML. We also discuss how these ingredients merged into the field of Materials Informatics (MI). In section 2.1, we chose to discuss DFT, since it has become the cornerstone simulation procedure in theoretical materials science. The High-Throughput (HT) and Machine Learning (ML) approaches, discussed in sections 2.2 and 2.3 respectively, follow a logical sequence: the former is used to generate large amounts of data, while the latter requires the existence of such data in order to extract knowledge from it. In Section 3 we review the progress of current research applying those methods to materials science problems, including materials discovery, design, properties, and applications. Finally, in Section 4 we present an overview and perspectives for future research. A simplified presentation of the topics discussed in this work and their complex relationships is summarized in Figure 1.
Figure 1. Schematic presentation of the topics discussed in this review and their relationships. The possible atomic combinations form a great number of compounds, which can be studied by means of experimental, theoretical, or computational approaches, especially with high-throughput calculations. The large quantities of data generated are then stored in databases, which can be used by means of materials screening or machine learning, both of which lead to promising materials candidates. Data-driven or traditional routes select materials suited for specific applications. We illustrate these relationships in the context of two-dimensional materials. Panels adapted from [1] (experiments), [2] (high-throughput and machine learning), and [3] (materials prediction).
1.1 Science paradigms: Data science

As part of the human endeavor, Science is subject to constant reshaping owing to historical circumstances. The present “data deluge” resulting from advances in information technologies [4] is deeply affecting the way we do Science. Experimental, theoretical, and computational sciences are also responsible for generating huge amounts of data and can benefit from a new perspective. Jim Gray, the 1998 Turing Award winner, presented this idea historically in his last presentation:
“Originally, there was just experimental science, and then there was theoretical science, with Kepler’s Laws, Newton’s Laws of Motion, Maxwell’s equations, and so on. Then, for many problems, the theoretical models grew too complicated to solve analytically, and people had to start simulating. These simulations have carried us through much of the last half of the last century. At this point, these simulations are generating a whole lot of data, along with a huge increase in data from the experimental sciences. People now do not actually look through telescopes. Instead, they are ‘looking’ through large-scale, complex instruments which relay data to datacenters, and only then do they look at the information on their computers.

The world of science has changed, and there is no question about this. The new model is for the data to be captured by instruments or generated by simulations before being processed by software and for the resulting information or knowledge to be stored in computers. Scientists only get to look at their data fairly late in this pipeline. The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration [4].” – Jim Gray, 2007 [5].
The amount of data being generated by experiments and simulations has led us in recent years to the fourth paradigm of science, the so-called (big) data-driven science. Such a paradigm naturally follows from the first three paradigms of experiment, theory, and computation/simulation, as shown in Figure 2. Its impact in the field of materials science has led to the emergence of the new field of Materials Informatics (MI).
Figure 2. The four science paradigms: empirical, theoretical, computational, and data-driven. Each paradigm both benefits from and contributes to the others. Adapted from [6]. CC BY 4.0
Within this new data-driven point of view, a variety of pieces, such as Big Data and Data Science, come together to make possible the extraction of knowledge from data. Big Data is defined as a collection of data that is unfeasible to process, search, or analyze with on-hand database tools, due to its large size and complexity. It is characterized by its diverse and huge volume, usually ranging from terabytes to petabytes of data, created in or near real time. Such data is found both structured and unstructured in nature, and it is exhaustive, usually aiming to capture entire populations in a scalable manner [7]. Simple tasks represent challenges at this scale: capture, curation, storage, search, sharing, analysis, and visualization of the data cannot be accomplished without the proper tools. Big Data can thus be effectively summarized by the popular “five V’s”: volume, velocity, variety, veracity, and value, shown in Figure 3b. A related sixth V, not exclusive to big data, is visualization, which requires different techniques to handle data with various characteristics.
[Figure 3. (a) Data Science; (b) the “five V’s” of Big Data: Volume (size of data), Velocity, Variety, Veracity (accuracy and reliability of data), and Value (extracting useful information from data).]
The analysis process within Data Science is challenging, as the techniques differ greatly from those for traditional static and rigid datasets, which are generated and analyzed under a predetermined hypothesis. The distinction from traditional data is based on the larger abundance, exhaustivity, and variety of Big Data, which is also much more dynamic, messy, and uncertain, being highly relational [7]. Recently, the possibility of overcoming this challenge slowly started to be envisaged, owing to advances in high-performance computation and the discovery of new analytical techniques that enable one to deal with the complexity and vastness of the data. Originally, these techniques were developed in the artificial intelligence (AI) and expert systems fields. Their objective was to produce machine learning (ML) algorithms that could automatically mine and detect patterns, and then build predictive models and optimize outcomes [7]. The number of different algorithms that can be applied to a dataset is huge, which makes their performance comparison possible, letting one choose the best model or explanation, or even a combination of those (ensemble approach). This approach differs from the traditional selection based on knowledge specific to the technique and data. Thus, the combination of Big Data and Data Science, or simply Big Data analytics, can be seen as a new epistemological approach, where insights can be “born from the data”.
The contrast with traditional methods of testing a theory by analyzing relevant data (e.g., fitting the data to theory) is striking [7].

A new research paradigm is related to the way we produce knowledge. As stated by the philosopher Thomas Kuhn, “a paradigm constitutes an accepted way of interrogating the world and synthesizing knowledge common to a substantial proportion of researchers in a discipline at any one moment in time” [9]. Periodically, the accepted theories and approaches are challenged by a new way of thinking, and the framework encompassed by Big Data and ML embodies such a paradigm in multiple disciplines.
1.2 Development of computational materials science

Novel materials enable the development of technological applications that are key to overcoming challenges faced by society. Even though the impact of materials discovery along history is hard to quantify, from the stone age through bronze and iron up to modern silicon technologies, it is easily grasped [10]. Furthermore, it is estimated that materials development enabled two-thirds of the advancements in computation and transformed other industries as well, such as energy storage [11].
The time to market for new technologies based on novel materials is approximately 20 years, while their development can span an even longer period [12]. Moreover, once a material is consolidated for a technology, it is rarely substituted, owing to the costs associated with establishing large-scale production infrastructure [13]. Silicon in the semiconductor industry is an enduring example. Therefore, the early introduction of a material into a specific sector is increasingly important for its successful establishment, and recently several new technological niches face demands for potential materials.
Given the fast-growing demand for novel materials and their relatively slow development, at the same time that computational resources and algorithms undergo huge improvements, it sounds almost natural to ask: how can computational science improve the efficiency of materials discovery? Other areas, such as the pharmaceutical and biotechnology industries, have already given some hints [14, 15]. However, within the fourth, data-driven science paradigm, the computational materials community finds itself somewhat delayed in comparison to these fields. This late arrival is related to bottlenecks in computational capability, but since the first materials simulations were carried out, an ever-increasing amount of research has taken place within this paradigm. In Figure 4, the number of publications indicates this situation. Novel emerging approaches usually face an initial period of slow adoption; a landmark effort to change this scenario was the Materials Genome Initiative (MGI), announced in 2011 in the US. The task to accelerate the time from discovery to commercialization of novel technologies is a central one in MGI.
Traditional approaches to theoretical and computational materials science, termed the direct approach, rely on the calculation of properties given the structural and compositional data of materials. The search for candidate materials presenting target properties is, in this scenario, a tedious process performed case by case or by fortuitous sampling of the right example. The search space can be restricted based on prior knowledge about similar materials.
Figure 4. Chronological evolution of the number of publications for DFT, HT, ML, and Materials Informatics. Initial developments of each discipline date to many decades before actual adoption by the community. Data compiled from the Web of Science platform, using each keyword in the “Topic” search term.
Nonetheless, the search is still a structure- and composition-to-property mapping. This trial-and-error experimentation is now being complemented and guided by computational science in an attempt to narrow this search space [20].
Sheer massive data generation is no assurance of converting it into information and then into knowledge. Moreover, converting this knowledge into benefits for society, which is the ultimate goal, is an even larger challenge. In Figure 5, Glick [21] represents these ideas as gaps between data creation and storage and the capability to obtain knowledge and usable technologies. This gap tends to increase over time. Therefore, the usage of data-driven approaches is paramount to reduce the gap and advance research in this scenario.
Figure 5. The increasing gap between data, information, knowledge, and utility, which calls for more efficient approaches to accelerate this conversion. Adapted from [21], with permission from Elsevier.
2 Fundamentals of methods

Recent advances in experimental and computational methods have resulted in massive quantities of generated data of increasing complexity. Machine learning techniques aim to extract knowledge and insight from this data by identifying its correlations and patterns. Although we focus on computational techniques, the general concepts are not restricted to them. In this section we present the fundamental approaches, following a logical timeline from DFT to HT to ML. As we focus here on materials science research using computational methods, the first topic is DFT. It is a natural representative of the general class of methods used to generate data, owing to its widespread use in materials science. Next, the HT approach is presented, in which any experimental or computational methodology (such as DFT) can be employed to generate massive amounts of data in an automated fashion. The resulting data, irrespective of its origin, is then used as a substrate for the learning process within the ML approach, resulting in the extraction of knowledge from the patterns discovered.
Considering the historical development of research in computational materials science, we can classify the different problems and the methods used to tackle them into three generations related to the topics mentioned above [22]. The first generation is related to obtaining materials properties given a structure, using local optimization algorithms, usually based on DFT calculations performed one at a time. It is still the most widespread approach, owing to the great improvements enabled by large-scale high-throughput calculations. The second generation is related to crystal structure prediction given a fixed composition, using global optimization methods such as genetic and evolutionary algorithms. Such an approach requires a considerable number of calculations performed in a systematic manner, thus relying heavily on HT methods. Finally, the third generation is based on statistical learning. It enables the discovery of novel compositions, as well as much faster predictions of properties and crystalline structures, given the vast amount of available physical and chemical data, via ML algorithms.
2.1 Density Functional Theory (DFT)

2.1.1 Historical developments

In the first half of the 20th century, with the formulation of Quantum Mechanics, it became possible to understand the microscopic properties of materials. Many of the empirical models used by chemists, for example the concept of the bond proposed in the Lewis model, emerged from the solution of the Schrödinger equation [23]. However, precisely solving that equation for systems involving the electron-electron interaction is intrinsically difficult, leading to the famous remark by Dirac in 1929 [24]: “The fundamental laws necessary for the mathematical treatment of a large part of physics and the whole of chemistry are thus completely known, and the difficulty lies only in the fact that application of these laws leads to equations that are too complex to be solved”. There was a shift such that major efforts were now needed in computational aspects rather than theoretical ones.
In the late 1920s and early 1930s, before computers were in use, some approximate methods were born, with the goal of making many-electron systems treatable. Examples are the Hartree model [25], which seeks to obtain the observables via approximate wave-function construction, and the Thomas-Fermi-Dirac model [26], which attempted to describe systems via their electronic density. In 1964, Hohenberg and Kohn [27] published an article that became the paradigm for the understanding of materials properties, today known as Density Functional Theory (DFT). DFT is based on two theorems elegantly demonstrated in Ref. [27]. They showed that, in a system with N electrons, (i) the external potential V(r) felt by the electrons is a unique functional of the electronic density n(r), and (ii) the ground-state energy E[n] is minimal for the exact density.
In other words, by knowing the electron density, we can obtain the precise energy of the ground state:

E = E[n(r)]    (1)
The question of how to write down the density was answered by Kohn and Sham a year later [28]. They proposed the addition of an exchange-correlation term E_xc[n] to the energy, capable of mapping the kinetic energy of the interacting-electron system T[n] into a non-interacting picture T_s[n],

E[n] = T_s[n] + U_H[n] + V_ext[n] + E_xc[n]    (2)

where U_H is the Hartree energy and V_ext is the external potential contribution. This new formulation leads to the famous Kohn-Sham (KS) equations,
[−(1/2)∇² + v_eff(r)] φ_j(r) = ε_j φ_j(r)    (3)

v_eff(r) = v_ext(r) + ∫ dr′ n(r′)/|r − r′| + v_xc(r)    (4)

n(r) = Σ_i |φ_i(r)|²    (5)
where ε_j and φ_j are the Lagrange multipliers of the variational problem that leads to the KS equation (equation 3), usually interpreted as the energy levels of the many-electron system and the Kohn-Sham orbitals, respectively, while v_eff and v_xc = δE_xc/δn are the Kohn-Sham effective potential and the exchange-correlation potential, respectively. With this set of equations, a self-consistent cycle can be envisaged: one starts with a tentative density n(r), plugs in a functional form of v_xc, and builds the effective potential v_eff. Next, one obtains the eigenvalues ε_j and eigenvectors φ_j of the Kohn-Sham equations. The electronic density is then obtained from the set of φ_j, and the process is repeated until a convergence criterion (usually on the total energy of the system) is reached.
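To make the cycle concrete, the sketch below runs a schematic KS-style self-consistency loop for a toy one-dimensional system (two electrons in a harmonic well). The grid, the soft-Coulomb Hartree kernel, and the LDA-like exchange expression are illustrative assumptions for the example, not a production DFT implementation.

```python
# A minimal, schematic Kohn-Sham self-consistency loop in one dimension.
# The system (two electrons in a harmonic well), the soft-Coulomb Hartree
# kernel, and the LDA-like exchange are illustrative choices only.
import numpy as np

npts, L = 200, 10.0                       # real-space grid
x = np.linspace(-L / 2, L / 2, npts)
dx = x[1] - x[0]
n_orb = 1                                 # one doubly occupied orbital (2 electrons)

# Kinetic operator: finite-difference form of -(1/2) d^2/dx^2
T = -0.5 * (np.diag(np.ones(npts - 1), 1) - 2 * np.eye(npts)
            + np.diag(np.ones(npts - 1), -1)) / dx**2
v_ext = 0.5 * x**2                        # external (harmonic) potential

n = np.full(npts, 2.0 * n_orb / L)        # tentative starting density
for it in range(100):
    # Hartree potential with a regularized (soft-Coulomb) kernel
    v_H = dx * (n[None, :] / np.sqrt((x[:, None] - x[None, :])**2 + 1.0)).sum(1)
    v_xc = -(3.0 * n / np.pi) ** (1.0 / 3.0)   # schematic LDA-like exchange
    v_eff = v_ext + v_H + v_xc

    # Solve the KS equation for the current effective potential
    eps, phi = np.linalg.eigh(T + np.diag(v_eff))
    phi /= np.sqrt(dx)                    # normalize orbitals on the grid
    n_new = 2.0 * (phi[:, :n_orb] ** 2).sum(axis=1)

    if np.abs(n_new - n).max() < 1e-6:    # convergence criterion on the density
        break
    n = 0.7 * n + 0.3 * n_new             # linear mixing stabilizes the cycle

print(f"converged in {it} iterations, lowest KS eigenvalue = {eps[0]:.4f}")
```

Linear mixing of old and new densities, as in the last line of the loop, is the simplest of the convergence-stabilization schemes used by real DFT codes.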
It is important to note that although the work of H-K and K-S was published in the 1960s, broad recognition of its importance came only in the 1980s. This delay in recognition by the community, especially by chemists, ended mainly for two reasons: first, the increase in the computational capacity available to the scientific community, and second, the continuous development of theoretical methods, which made it possible to deal with more complex problems using algorithms with more predictive capacity.

DFT is formally exact; however, in practice, a series of approximations is required in order to solve the K-S equations. First, one needs to select the exchange-correlation term contained in equation 2. A large variety of functionals can be found in the literature, some parameter-free and others semi-empirical, i.e., containing parameters which are fitted from data.
Next, one has to choose how to treat the valence and core electrons. In the early days of DFT, only the so-called all-electron treatment was available, and its drawback was the restriction on the size of the systems that could be simulated at that time. However, valence orbitals determine the properties of solids. In 1940, with that in mind, Herring proposed a powerful method for the determination of electronic states in crystalline materials. In Herring's approach, known as orthogonalized plane waves (OPW), an orbital basis is proposed as a linear combination of core states and plane waves [29]. From the formal point of view, it was a success, but it presented severe convergence problems due to the need to orthogonalize the plane waves to the orbitals of the core states. Phillips and Kleinman elegantly solved this inconvenience: they showed that it is possible to obtain the same eigenvalues from the secular equation of the OPW method in an effortless way, known as the pseudopotential method [30].
The pseudopotential method opened the possibility of simulating the whole periodic table. The method describes the core electrons and corresponding nuclei in a simplified manner, by means of an effective potential to which the valence electrons are subject. Some popular approaches are the projector augmented waves (PAW) [31] and the norm-conserving and ultrasoft pseudopotentials developed by Troullier and Martins [32] and Vanderbilt [33]. These approximations reach accuracy comparable to all-electron methods [34]. Therefore, in the 1970s pseudopotential ab initio methods became the most powerful tool for the accurate description of many-electron systems.

Another important advance in DFT was the treatment of materials imposing translational symmetry via Bloch's Theorem [35], known at the time as the “Large Unit Cell” approach. This procedure allowed the study of more realistic systems such as surfaces, defects, and impurities in amorphous systems, clusters, etc. Owing to the seminal work by Ihm, Zunger and Cohen [36, 37], the calculation of the total energy also became possible in the early 1980s.
2.1.2 Current status

Since its initial development, DFT has evolved from limited calculations capable of providing approximate results to an increasingly accurate and predictive methodology, leading to important contributions in several areas such as materials discovery and design, drug design, solar cells, water-splitting materials, etc.

As we mentioned earlier, DFT is an exact formulation. However, we are not fully aware of how the electron-electron interactions contained in the exchange-correlation functional occur. The pursuit of the “exact” functional is still a subject of research, elegantly summarized by Perdew through the analogy of climbing the so-called Jacob's ladder of DFT approximations [38].
In its first implementation, DFT codes employed the Local Spin Density Approximation (LSDA, or simply LDA) for the exchange-correlation functional, described by the corresponding energy,
E_xc^LDA[n↑, n↓] = ∫ dr n(r) ε_xc^LDA[n↑(r), n↓(r)]    (6)

where n↑/↓ are the uniform spin densities of an electron gas, and ε_xc^LDA is the exchange-correlation energy per electron of that system.
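As an illustration of how a local functional such as equation 6 is evaluated in practice, the sketch below integrates the Dirac exchange part of the LDA, E_x = −(3/4)(3/π)^(1/3) ∫ n(r)^(4/3) dr, on a radial grid; the Gaussian model density is an arbitrary assumption for the example, and correlation is omitted.

```python
# Evaluating a local (LDA-type) functional: the exchange energy is a grid
# integral of a function of the local density. Spin-unpolarized, spherically
# symmetric model density; the Gaussian profile is an illustrative choice.
import numpy as np

r = np.linspace(1e-6, 10.0, 2000)              # radial grid (bohr)
n = 2.0 * np.exp(-r**2) / np.pi**1.5           # model density, integrates to 2 electrons

eps_x = -(3.0 / 4.0) * (3.0 / np.pi)**(1.0 / 3.0) * n**(1.0 / 3.0)  # per-electron exchange
E_x = np.trapz(4.0 * np.pi * r**2 * n * eps_x, r)                   # E_x = integral of n * eps_x
N = np.trapz(4.0 * np.pi * r**2 * n, r)                             # sanity check: electron count
print(f"N = {N:.3f} electrons, E_x(LDA) = {E_x:.4f} hartree")
```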
The LDA was very successful in describing systems where the electronic density varies slowly, such as bulk metals, and was in great part responsible for the growing popularity of DFT methods among physicists during the 1970s. On the other hand, the chemistry community did not embrace LDA due to a few systematic errors, such as the overestimation of molecular atomization energies and the underestimation of bond lengths. Such shortcomings were alleviated in great part when the generalized gradient approximation (GGA) was introduced in the 1980s. In this approximation, the exchange-correlation energy is rewritten taking into account not only the spin densities but also their spatial variation,
E_xc^GGA[n↑, n↓] = ∫ dr n(r) ε_xc^GGA[n↑(r), n↓(r), ∇n↑(r), ∇n↓(r)]    (7)

where ε_xc^GGA is the corresponding GGA energy density. One interesting characteristic of the GGA is that it does not require any particular functional form of the exchange-correlation energy density; in fact, only a number of constraints are imposed in the construction of GGA functionals. Owing to that, a number of flavours of exchange-correlation functionals are available within this approximation; the Perdew-Burke-Ernzerhof (PBE) [39], Perdew-Wang (PW91) [40], and Becke-Lee-Yang-Parr (BLYP) [41, 42] functionals are some very successful examples.
The next step in the complexity of exchange-correlation functionals is usually referred to as the advent of the meta-GGA approximation. Its new ingredient is the introduction of the so-called Kohn-Sham kinetic energy density τ↑/↓(r),

E_xc^MGGA[n↑, n↓] = ∫ dr n(r) ε_xc^MGGA[n↑(r), n↓(r), ∇n↑(r), ∇n↓(r), τ↑(r), τ↓(r)]    (8)

where the implicit dependence of the kinetic energy density on the spin density should be noted, i.e., τ↑/↓ = τ↑/↓[n↑/↓(r)]. The meta-GGA approximation represented an improvement over many issues known to plague GGA functionals, for example delivering better atomization energies as well as metal surface energies. Popular functionals within this approximation comprise the Tao-Perdew-Staroverov-Scuseria (TPSS) functional [43] and the more recent non-empirical strongly constrained and appropriately normed (SCAN) functional of Sun et al. [44]. Successful attempts at semilocal functionals with improved band gaps for different materials include the Tran-Blaha modified Becke-Johnson (mBJ) [45] and ACBN0 [46, 47] functionals.
Up to this point of the DFT approximations Jacob's ladder, one finds only local (LDA) or semilocal (GGA and meta-GGA) functionals of the density. Representing a step further, a proposal inspired by the Hartree-Fock formulation introduced non-locality in DFT by mixing a fraction of the exchange term

E_x^HF = −(1/2) Σ_{i,j} ∫ dr ∫ dr′ φ*_i(r) φ*_j(r′) φ_i(r′) φ_j(r) / |r − r′|    (9)

into the exchange-correlation energy within the GGA,

E_xc^hyb = (1 − α) E_xc^GGA + α E_x^HF    (10)

where α ∈ [0, 1] is a mixing parameter, usually chosen in the range between 0.15 and 0.25. Such an approach is known as hybrid functionals, which partially mended the serious problem of materials band-gap underestimation known to plague GGA functionals. Its main shortcoming is its computational requirements, since the calculation of the non-local term in equation 9 is an intensive task, as it involves the exchange of each orbital φ_j with all other orbitals in the system. Nonetheless, some hybrid functionals were widely adopted in both the solid state physics and quantum chemistry communities. Examples are the PBE0 [48, 49] and the Coulomb-interaction-screened Heyd-Scuseria-Ernzerhof (HSE) [50] hybrid functionals based on the PBE E_xc, and the B3LYP functional [42, 51], which introduced mixing as well as other empirical parameters into its precursor BLYP.
Finally, by considering both occupied and unoccupied orbitals in the theory, one reaches what could be considered the furthermost degree of complexity of DFT. Within this level of approximation, one finds the Random Phase Approximation (RPA) [52, 53], which can successfully describe long-range correlation effects such as dispersion interactions.

A wide range of ground-state properties, as well as band gaps and optical transitions, are available from DFT. Structural properties include stress tensors, bulk moduli, and phonon spectra, which can help identify the structural stability of materials. The dispersion interaction is not an intrinsic ingredient of LSDA or GGA; however, many parametrized models of such forces have been included in DFT codes [54–57], allowing a good description of non-covalent bonding between molecules.
A fundamental limitation of DFT arises from its mathematical construction: it works only for the ground-state density. Thus, the study of excited states is hindered within this method, even though workarounds such as time-dependent DFT (TDDFT) [58–60] have been proposed. Moreover, despite the fact that they are usually interpreted as physical quantities, the KS eigenvalues and eigenvectors do not correspond, at least formally, to the energy levels and eigenstates of the system, respectively. Strongly correlated systems, such as d electrons in transition metal oxides, also have to be tackled with auxiliary theories such as the Hubbard U parameter [61, 62]. Many other methods, usually referred to as post-KS, have been proposed in order to overcome DFT deficiencies; the GW approximation [63, 64] and the solution of the Bethe-Salpeter equation for exciton dynamics [65, 66] are famous examples. Moreover, strongly correlated phenomena, which are not captured by the standard DFT approach, are now being investigated using Dynamical Mean Field Theory (DMFT) [67, 68], which can be integrated into the DFT self-consistent cycle [69] or used at the post-processing level [70]. However, the greater precision delivered by such methods is accompanied by greater computational demands, hindering their widespread use. Roughly speaking, an O(N³) scaling law impedes the application of DFT calculations to very large systems (presently, beyond thousands of atoms). Linear-scaling O(N) methods [71, 72] enable the calculation of much larger systems, currently up to millions of atoms [73].
An important strategy to extend beyond the capabilities of the DFT method is to use auxiliary codes. For example, quantities that require large reciprocal-space sampling, such as the electrical conductivity, spin Hall conductivity (SHC), and anomalous Hall conductivity (AHC), to cite a few, are cumbersome to obtain. The electrical conductivity can be calculated using interpolation methods based on DFT calculations implemented in BoltzTraP, BoltzWann, ShengBTE, and PAOFLOW [74–77]. PAOFLOW can also calculate the SHC, AHC, Fermi surfaces, topological invariants, and other properties. Topological invariants are also calculated from DFT using Z2Pack [78] and WannierTools [79], which are integrated with many different DFT codes. Investigation of ballistic transport phenomena is possible via SIESTA-based codes [80], namely the TranSIESTA [81], TRANSAMPA [82], and Smeagol [83] packages. Excitation properties can be addressed with YAMBO [84] and BerkeleyGW [85]. Vibrational properties are mainly obtained via perturbation theory or the finite displacement approach. The first is not general and is implemented primarily in Quantum Espresso. The second approach is compatible with several DFT codes that can optimize crystal structures; nevertheless, such calculations are very computationally demanding, due to the large supercells involved. The Phonopy code is a helpful resource to obtain vibration-related quantities such as the phonon band structure and density of states, dynamic structure factor, and Grüneisen parameters [86].
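As a sketch of the finite-displacement route, the snippet below uses ASE's Phonons helper with the toy EMT calculator for fcc Al; the calculator, supercell, displacement, and band path are illustrative choices, and the exact API may differ between ASE versions.

```python
# Finite-displacement phonons sketched with ASE's helper and the toy EMT
# calculator for fcc Al. Supercell, displacement, and band path are
# illustrative; production studies would use a DFT calculator instead.
from ase.build import bulk
from ase.calculators.emt import EMT
from ase.phonons import Phonons

atoms = bulk('Al', 'fcc', a=4.05)
ph = Phonons(atoms, EMT(), supercell=(7, 7, 7), delta=0.05)
ph.run()                    # displaces each atom and stores the forces
ph.read(acoustic=True)      # builds force constants, enforcing the acoustic sum rule
ph.clean()

path = atoms.cell.bandpath('GXULGK', npoints=100)
bs = ph.get_band_structure(path)   # phonon dispersion along the chosen path
```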
In summary, DFT is a mature theory which is currently the undisputed method of choice for electronic structure calculations. A number of papers and reviews are available in the literature [87–92], facilitating the spread of the theory and, thus, the entry of researchers into the fields of computational solid state physics, materials science, and quantum chemistry. Although DFT is implemented in many codes and scopes (see Table 1), it has been shown recently that the results are consistent as a whole [34].
2.1.2.1 Structure Prediction

DFT calculations provide a reliable method to study materials once the crystalline or molecular structure is known. Based on the Hellmann-Feynman theorem [131], one can use DFT calculations to find a local structural minimum of materials and molecules. However, a global optimization of such systems is a much more involved process. The possible number of structures for a system containing N atoms inside a box of volume V is huge, given by the combinatorial expression

|Ω| = C(Vδ⁻³, N) ∏_i C(N, n_i)    (11)
Table 1. Selection of DFT codes according to their basis types. GPL stands for GNU Public License.

Name | License | Ref.

Plane-waves basis sets:
VASP | commercial (a) | [93–96]
Quantum Espresso | GPL | [97, 98]
CASTEP | commercial (b) | [99, 100]
ABINIT | GPL | [101–103]
CP2K (d) | GPL | [104–108]
CPMD | free | [109–111]
ONETEP | commercial | [112]
BigDFT | GPL | [113]

Atom-centered basis sets:
Gaussian | commercial | [114]
GAMESS | free | [115, 116]
Molpro | commercial | [117]
SIESTA | free (c) | [80]
Turbomole | commercial | [118]
ORCA | free | [119]
CRYSTAL | commercial (b) | [120]
Q-Chem | commercial | [121]
FHI-aims | commercial | [122]

Real-space grids:
octopus | GPL | [123–125]
GPAW (e) | GPL | [126, 127]

Linearized augmented plane waves:
WIEN2k | commercial | [128]
exciting | GPL | [129]
FLEUR | free | [130]

(a) Free for academic institutions in Austria
(b) Free for academic institutions in the UK
(c) For academics
(d) CP2K employs mixed plane-waves and atom-centered basis sets
(e) GPAW can also employ plane-waves or atom-centered basis sets
where C(m, k) denotes the binomial coefficient, δ is the side of a discrete box which partitions the volume V, and n_i is the number of atoms of species i in the compound. This number becomes very large (≈10^N) even for small systems (N < 20) and a large discretization box (δ = 1 Å). In order to probe such a potential energy surface, one has to visit states in a (3N + 3)-dimensional space (3N − 3 degrees of freedom for the atomic positions and 6 degrees of freedom for the lattice constants) and assess their feasibility, usually by calculating the total energy of that particular configuration. This is a global optimization problem in a high-dimensional space, which has been tackled by several authors. Here we discuss two of the most popular methods proposed in the literature, namely evolutionary algorithms and basin hopping optimization.
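To get a feeling for the scale implied by equation 11, the following quick check evaluates |Ω| for a hypothetical A5B5 cluster in a 1000 Å³ box with δ = 1 Å (all values are arbitrary illustrative choices):

```python
# Order-of-magnitude check of equation 11 for a hypothetical A5B5 system
# in a 1000 A^3 box discretized with delta = 1 A (illustrative values).
from math import comb, prod

V_over_d3 = 1000          # V * delta^-3: number of discrete sites
N = 10                    # total number of atoms
n_i = [5, 5]              # atoms per species (A5B5)

omega = comb(V_over_d3, N) * prod(comb(N, n) for n in n_i)
print(f"|Omega| ~ 10^{len(str(omega)) - 1}")   # roughly 10^28 configurations
```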
Owing to the fact that not all configurations in this landscape are physically acceptable (e.g., pairs of atoms might be too close) and that some of them are more feasible than others, some authors realized that the search space should be restricted somehow. One way of achieving such a restriction is by means of evolutionary algorithms, where the survival of the fittest candidate structures is taken into account, thus restricting the search to a small region of the state space. By introducing mating operations between pairs of candidate structures and mutation operators on single samples, a series of generations of candidate structures is created, and in each generation only the fittest candidates survive. The search is optimized by allowing local relaxation of the candidate structures, via DFT or MD calculations, thus avoiding nonphysical configurations such as too-short bond lengths. Evolutionary algorithms have been used to find new materials, such as a new high-pressure phase of Na [132–134].
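The generational loop can be sketched in a few lines; below, a toy pairwise energy stands in for the DFT relaxation, simple atom-wise mixing stands in for cut-and-splice mating, and random displacements act as mutations (all of these ingredients are illustrative assumptions, not the operators used by production evolutionary codes):

```python
# Schematic evolutionary structure search: a toy Lennard-Jones-like energy
# replaces the DFT relaxation, and simple crossover/mutation operators act
# on atomic coordinates. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
N_ATOMS, POP, GENS = 8, 20, 30

def energy(pos):
    """Toy pairwise (Lennard-Jones) energy; stands in for a relaxed DFT energy."""
    d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
    d = d[np.triu_indices(len(pos), k=1)]
    return float(np.sum(4.0 * (d**-12 - d**-6)))

def mate(a, b):
    """Crude atom-wise mixing of two parents (stand-in for cut-and-splice)."""
    mask = rng.random(N_ATOMS) < 0.5
    return np.where(mask[:, None], a, b)

def mutate(pos, amplitude=0.1):
    """Random displacements act as the mutation operator."""
    return pos + amplitude * rng.normal(size=pos.shape)

population = [3.0 * rng.random((N_ATOMS, 3)) for _ in range(POP)]
for _ in range(GENS):
    population.sort(key=energy)            # survival of the fittest
    fittest = population[: POP // 2]
    children = []
    while len(fittest) + len(children) < POP:
        i, j = rng.choice(len(fittest), size=2, replace=False)
        children.append(mutate(mate(fittest[i], fittest[j])))
    population = fittest + children

best = min(population, key=energy)
print(f"best toy energy after {GENS} generations: {energy(best):.3f}")
```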
cri
10 Another popular method of theoretical structure prediction is basin hopping [135, 136]. In
11 this approach, the optimization starts with a random structure that is deformed randomly given
12 a threshold, which is in turn brought to an energy minima, via e.g. DFT calculations. If the
13 reached minima are distinct from the previous configuration, the Metropolis criterion [137] is
14 used to decide if the move is accepted or not. If the answer is yes, it is said that the system
15 hopped between neighboring basins. Owing to the fact that distinct basins represent distinct local
16 structural minima, this algorithm probes the space state in an efficient way.
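The same deform/minimize/accept loop is available off the shelf in SciPy; the sketch below applies it to a toy one-dimensional potential, with SciPy's local minimizer playing the role of the DFT relaxation (the potential, temperature, and step size are arbitrary illustrative choices):

```python
# Basin hopping on a toy rugged 1D potential: scipy deforms the coordinates,
# relaxes to the nearest local minimum, and applies the Metropolis criterion,
# mirroring the loop described above (a local optimizer stands in for DFT).
import numpy as np
from scipy.optimize import basinhopping

def potential(x):
    """Toy potential with many local minima."""
    return float(x[0]**2 + 10.0 * np.sin(3.0 * x[0]))

result = basinhopping(potential, x0=[2.0], niter=200, T=1.0, stepsize=0.5, seed=0)
print(f"global minimum near x = {result.x[0]:.3f}, E = {result.fun:.3f}")
```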
Other methods of global optimization and theoretical structure prediction of molecules and materials comprise random structure searching (AIRSS) [138], particle-swarm optimization methods [139, 140], parallel tempering, minima hopping [141], and simulated annealing.

The so-called Inverse Design is an inversion of the traditional direct approach discussed in Section 1.2. Strategies for direct design usually fall into three categories: descriptive, which in general interprets or confirms experimental evidence; predictive, which predicts novel materials or properties; or predictive for a material class, which predicts novel functionalities by sweeping the candidate compound space. The inverse mapping, from target properties to the material, was proposed by Zunger [142] as a means to drive the discovery of materials presenting specific functionalities. According to his inverse design framework, one could find the desired property in known materials, as well as discover new materials while searching for the functionality. This can be seen as another global optimization task, but instead of finding the minimum-energy structure, it searches for the structure that maximizes the target functionality (figure of merit). This can be done in three ways: (i) searching for a global optimum using optimization methods, e.g. evolutionary algorithms, aimed at selecting the best-fitted candidates based on the property of interest; (ii) materials database querying and subsequent hierarchical screening based on design principles of properties, in order to uncover properties in known compounds (materials screening is discussed in section 2.2.1); and (iii) screening of novel compounds on the convex hull of stable compositions obtained by high-throughput calculations (section 2.2). A number of examples have been reported as successful applications of inverse design principles, such as the discovery of non-toxic, highly efficient halide perovskite solar absorbers [143].
2.2 High-throughput (HT)

Advances in computational capabilities and in workflow automation have made it possible to automate input creation and perform several (even millions of) simulations in parallel or sequentially. This development is presented in Figure 6, and the approach is called high-throughput (HT) [144]. The idea is to generate and store large quantities of thermodynamic and electronic properties, by means of either simulations or experiments, for both existing and hypothetical materials, and then perform the discovery or selection of materials with desired properties from these databases [13]. This approach does not necessarily involve ML; however, there is an increasing tendency to combine these two methodologies in materials science, as already shown in Figure 1.
Figure 6. Time spent on calculations (and similarly on experiments) as a function of technological developments. With advances in computer technology, the calculation step can become less time consuming than the setup construction and the analysis of results. Adapted from [145].
Importantly, the HT approach is compatible with theoretical, computational, and experimental methodologies. The main hindrance of a given method is the time necessary to perform a single calculation or measurement: the HT engine has to be fast and accurate in order to produce massive amounts of data in a reasonable time, otherwise its purpose is lost. Despite the generality of HT, here we are mainly interested in its use in the context of first-principles DFT calculations and its adapted strategies, discussed in Section 2.
The implementation of HT-DFT methods is usually performed in three main steps: i) thermodynamic or electronic structure calculations for a large number of synthesized and hypothetical materials; ii) systematic information storage in databases; and iii) materials characterization and selection: data analysis to select novel materials or extract new physical insight [13]. The great interest in the use of this methodology, the strong diffusion of methods and algorithms for data processing, and the wide acceptance of ML as a new paradigm of science have resulted in intensive implementation work to create codes to manage calculations and simulations, as well as materials repositories that allow sharing and distributing the results obtained in these simulations, i.e., steps i) and ii). In general, this is performed on high-performance computers (HPC) with multi-level parallel architectures managing hundreds of simulations at once. A principled approach to database construction and dissemination, related to step ii), is the FAIR concept, which stands for findable, accessible, interoperable, and reusable [146, 147]. A profusion of computational materials databases built along these lines is now available, such as AFLOWlib [148], the Materials Project [149], and the OQMD [150] (see Table 2).
On the other hand, the profusion of experimental materials databases is less diverse. In this area, we can highlight the Inorganic Crystal Structure Database (ICSD) [152] and the Crystallography Open Database (COD) [153], with ≈200,000 and ≈400,000 crystal structure entries, respectively. The main difference between the two databases is the inclusion of organic and metal-organic compounds and minerals in the COD database.
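As a minimal illustration of steps i) and ii), the sketch below loops over a few bulk metals with ASE, computes a property with the toy EMT calculator, and stores each result in a queryable database file (the file name, structure set, and calculator are arbitrary choices; production HT workflows would use managers such as AiiDA or Atomate driving DFT engines):

```python
# Minimal HT-style loop with ASE: generate structures, compute a property,
# and store the results in a database for later screening. The EMT toy
# calculator and the handful of metals are illustrative stand-ins for a
# DFT engine and a real chemical space.
from ase.build import bulk
from ase.calculators.emt import EMT
from ase.db import connect

db = connect('ht_example.db')            # hypothetical output database
for symbol in ['Al', 'Cu', 'Ni', 'Pd', 'Pt']:
    atoms = bulk(symbol)                 # step i): candidate structure
    atoms.calc = EMT()
    energy = atoms.get_potential_energy()
    # step ii): systematic storage with searchable key-value pairs
    db.write(atoms, metal=symbol, energy_per_atom=energy / len(atoms))

# The stored entries can later be queried, e.g. during screening:
for row in db.select():
    print(row.metal, row.energy_per_atom)
```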
Despite the complexities involved in steps i) and ii), the third step is the most significant: in iii), the researcher queries the database in order to discover novel materials with a given property, guiding subsequent investigations.
Table 2. High-throughput databases, codes, and tools according to source and purpose. All listed databases are open source for academic applications. We define a complete package for HT as a multi-engine code that can generate, manipulate, manage, and analyze the simulation results.

Name | Description | URL | Ref.

HT databases:
ICSD | Inorganic, experimental | http://www.fiz-karlsruhe.com/icsd.html | [152]
COD | Organic and inorganic, experimental | http://www.crystallography.net | [154]
AFLOWlib | Multi-purpose repository | http://aflowlib.org/ | [148]
Materials Project | Multi-purpose repository | https://materialsproject.org/ | [149]
OQMD | Multi-purpose repository | http://www.oqmd.org/ | [150]
CMR | Multi-purpose (3D and 2D materials) repository | https://cmr.fysik.dtu.dk/ | [155]
OMDB | Organic materials database | https://omdb.diracmaterials.org/ | [156]
MaterialsWeb | 2D materials (derived from Materials Project) | https://materialsweb.org/twodmaterials | [157]
JARVIS-DFT | 2D materials | https://www.ctcms.nist.gov/~knc6/JVASP.html | [158, 159]
NOMAD | | | [160]
Materials Cloud | | | [161]
Citrination | | |
Clean Energy Project | for solar cells | https://cleanenergyprojectnv.org/ |
C2DB | 2D materials (derived from CMR) | https://cmrdb.fysik.dtu.dk/?project=c2db | [162]

HT codes and tools:
ASE | Complete package for HT | https://wiki.fysik.dtu.dk/ase/ | [163]
Pymatgen | Complete package for HT | http://pymatgen.org/ | [164]
AiiDA | Framework for HT | http://www.aiida.net/ | [165]
AFLOWπ | Framework for HT | http://aflowlib.org/src/aflowpi/index.html | [166]
Atomate | Complete package for HT | https://atomate.org/ | [167]
Pylada | Framework for HT | http://pylada.github.io/pylada/ |
These investigations may or may not involve further calculations. The quality of the query will determine the success of the search. It is usually performed via a constraint filter or a descriptor, which is used to separate the materials exhibiting the desired property or a proxy variable for it. We extend the discussion of this process in the next section.
2.2.1 (Big data) Screening and Mining

Materials screening or mining can be seen as an integral part of a HT workflow, but here we highlight it as a step of its own. In a rigorous definition, HT concerns the high-volume data generation step, whereas the screening or mining process refers to the application of constraints to the database in order to filter or select the best candidates according to the desired attributes. The database is generally screened in sequence through a funnel-like approach, where materials satisfying each constraint pass to the next step, while those that fail to meet one or more of them are eliminated [171]. A final step may be to evaluate which characteristics make the top candidates perform best in the desired property, and then predict whether these features can be improved further. Thus, every material that satisfies the various criteria can optionally be ranked according to a problem-defined figure of merit, and this subgroup of selected materials can then be additionally investigated or used in applications.
Figure 7. The materials screening process as a systematic materials selection strategy based on constraint filters.
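In code, a funnel of this kind is simply a sequence of boolean filters applied to a property table; the sketch below illustrates the idea with pandas on a made-up set of candidates, where the column names, thresholds, and data are all hypothetical:

```python
# Funnel-style screening: each constraint removes candidates that fail it.
# The dataset, column names, and thresholds are hypothetical illustrations.
import pandas as pd

candidates = pd.DataFrame({
    'formula':      ['MoS2', 'WSe2', 'SnO', 'GaN', 'CuI'],
    'band_gap_eV':  [1.7,     1.4,    0.7,   3.2,   3.1],
    'hull_dist_eV': [0.00,    0.01,   0.08,  0.00,  0.12],   # stability proxy
    'magnetic':     [False,   False,  False, False, True],
})

funnel = (
    candidates['band_gap_eV'].between(1.0, 2.0)    # filter 1: target gap window
    & (candidates['hull_dist_eV'] < 0.05)          # filter 2: thermodynamic stability
    & ~candidates['magnetic']                      # filter 3: exclude magnetic compounds
)
selected = candidates[funnel].sort_values('hull_dist_eV')  # rank by figure of merit
print(selected[['formula', 'band_gap_eV']])
```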
The constraints can be descriptors derived from ML processes, filters guided by previous understanding of the phenomena and properties, or even filters guided by human intuition. Traditionally, descriptor construction requires an intimate knowledge of the problem. The descriptor can be as simple as the free energy of hydrogen adsorbed on a surface, which is a reasonable predictor of good metal alloys for hydrogen catalysis [172], or more complex, such as the variational ratio of spin-orbit distortion versus non-spin-orbit derivative strain, which was used to predict new topological insulators using the AFLOWLIB database [173]. Although the materials screening procedure has materials prediction and selection as its final objective, more complex properties, e.g. those that depend on specific symmetries, require direct interaction between ML and materials screening, as represented in Figure 1. Specifically, the filters used for the screening can be descriptors obtained via ML techniques. In Section 2.3.3.1 we discuss descriptors of increasing degrees of complexity. In the same way, the ML process can, in turn, depend on an initial selection of materials. This initial step restricts the data set exclusively to materials that potentially exhibit the property of interest. For example, in the prediction of topological insulators protected by time-reversal symmetry, compounds featuring a non-zero magnetic moment are excluded from the database, as we discuss in Section 3.2.4.
Screening can also be used to identify materials with extreme values of the desired behavior. After passing through the filters, if there are candidates
that satisfy the criteria, a set of selected materials will be obtained, which could lead to novel technological or scientific applications.
2.3 Machine Learning (ML)

Having presented the approaches most used to generate large volumes of data, we now examine the next step: dealing with and extracting knowledge from the information obtained. Exploring the evolution of the fourth paradigm of science, a parallel can be made between Wigner's 1960 paper “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” [174] and the present-day “The Unreasonable Effectiveness of Data” [175]. What makes data so unreasonably effective in recent times? A case can be made for the fifth “V” of big data (Figure 3): extracting value from the large quantity of data accumulated. How is this accomplished? Through machine learning techniques, which can identify relationships in the data, however complex they might be, even in arbitrarily high-dimensional spaces inaccessible to human reasoning.
Machine learning (ML) can be defined as a class of methods for automated data analysis which are capable of detecting patterns in data. These extracted patterns can be used to predict unknown data or to assist in decision-making processes under uncertainty [176]. The traditional definition states that machine learning, i.e. progressive performance improvement on a task directed by available data, takes place without explicit programming [177]. This research field evolved from the broader area of artificial intelligence (AI), inspired by the 1950s developments in statistics, computer science and technology, and neuroscience. Figure 8 shows the hierarchical relationship between the broader AI area and ML.
Figure 8. Hierarchical description and example techniques of artificial intelligence and its machine learning and deep learning sub-fields.
Many of the learning algorithms developed have been applied in areas as diverse as finance, navigation control and locomotion, speech processing, game playing, computer vision, personality profiling, bioinformatics, and many others. In contrast, a loose definition of AI is any technique that enables computers to mimic human intelligence. This can be achieved not only by ML, but also by “less intelligent” rigid strategies such as decision trees, if-then rules, knowledge bases, and computer logic. Recently, an ML subfield that is increasingly gaining attention due to its successes in several areas is deep learning (DL) [178]. It is a kind of representation learning, loosely inspired by biological neural networks, having multiple layers between its input and output layers.

A closely related field and very important component of ML is the source of the data that the algorithms learn from. This is the field of data science, which we introduced in Section 1.1 and Figure 3a.
2.3.1 Types of machine learning problems
Formally, the learning problem can be described [179] as: given a known set X, predict or approximate the unknown function y = f(X). The set X is named the feature space, and an element x of it is called a feature (or attribute) vector, or simply an input. With the learned approximate function ŷ = f̂(X), the model can then predict the output for unknown examples outside the training data; its ability to do so is called the generalization of the model. There are a few categories of ML problems based on the types of inputs and outputs handled; the two main ones are supervised and unsupervised learning.
In unsupervised learning, also known as descriptive learning, the goal is to find structure in the data given only unlabeled inputs x_i ∈ X, for which the output is unknown. If f(X) is finite, the learning is called clustering, which groups data into a (known or unknown) number of clusters by the similarity of their features. On the other hand, if f(X) lies in [0, ∞), the learning is called density estimation, which learns the marginal distribution of the features. Another important type of unsupervised learning is dimensionality reduction, which compresses the number of input variables representing the data, useful when the data has high dimensionality and therefore a structure too complex for patterns to be detected directly.
In contrast, in predictive or supervised learning the goal is to learn the function that maps inputs to outputs, given a set of labeled data (x_i, y_i) ∈ (X, f(X)), known as the training set (as opposed to the unknown test set), with i = 1, ..., N examples. If the output y_i is of a categorical or nominal finite type (for example, metal or insulator), it is called a classification problem, which predicts the class label for unknown samples. Otherwise, if the outputs are continuous real-valued scalars y_i ∈ ℝ, it is called a regression problem, which predicts the output values for the unknown examples. These types of problems, and the related algorithms which we introduce in Section 2.3.2, are summarized in Figure 9. Other types of ML problems are semi-supervised learning, where a large number of unlabeled data is combined with a small number of labeled examples; multi-task and transfer learning, where information from related problems is exploited to improve the learning task (usually one with little data available [180]); and reinforcement learning, in which no input/output pairs are given, but feedback on decisions serves to maximize a reward signal toward learning desired actions in an environment.
A typical ML workflow can be summarized as follows [182]:
(i) Data collection and curation: generating and selecting the relevant and useful subset of the available data for the problem at hand.
(ii) Data preprocessing: presenting the data understandably, which consists of converting it to a proper format; cleaning corrupt and missing data; transforming data as needed by normalizing, discretizing, averaging, smoothing, or differentiating; converting uniformly to integers, doubles, or strings; and sampling properly to optimize the representativeness of the set.
(iii) Data representation and transformation: choosing and transforming the input data (often a table) to suit the problem at hand by feature engineering, such as scaling, decomposition, or a combination of these. Especially for materials science applications this is an important issue, which we discuss in Section 2.3.3.1.
(iv) Learning algorithm training: splitting the dataset from the previous step into three sets: training, validation, and test datasets. The first is used in the learning process, where the model parameters are obtained. This split is usually not necessary for unsupervised learning tasks.
(v) Model testing and optimization: evaluating effectiveness and performance by means of the validation set. Parameters that cannot be learned (the so-called hyperparameters) are optimized using this dataset. Once an optimal set of parameters is obtained, the test set is used to assess the performance of the model. If the obtained model is unsuccessful, the previous steps are repeated with improved data selection, representation, transformation, or sampling, removing outliers, or changing the algorithm altogether.
(vi) Applications: using the validated model to make predictions on unknown data. The model can be continually retrained whenever new data become available.

Figure 9. Machine learning algorithms and usage diagram, divided into the main types of problems: unsupervised (dimensionality reduction and clustering) and supervised (classification and regression) learning. Adapted from [181]. Copyright © SAS Institute Inc., Cary, NC, USA. All Rights Reserved. Used with permission.
In the present context of materials science, we explore these steps as follows: (i) data collection in Sections 2.1 to 2.1.2.1, covering the methods used to generate data, whether experimental or theoretical, with critical examples shown in Section 3.1.1; (iii) data representation and transformation in Section 2.3.3.1, discussing how to represent materials at increasing degrees of complexity; (iv) learning algorithms in the next Section, 2.3.2, presenting the most common and useful algorithms for the different types of ML problems; and (vi) applications throughout Section 3.2, showing the progress, challenges, and perspectives of ML applications in materials science research.
2.3.2 Learning Algorithms
According to the "No Free Lunch" theorems [183, 184], no ML algorithm is universally superior; the task of constructing such an algorithm is a case-by-case study. In particular, the choice of the learning algorithm is a key step in building an ML pipeline, and many choices are available, each suited to a particular problem and/or dataset. Such a dataset can be of two types: labeled or unlabeled. In the first case, the task at hand is to find the mapping between data points and corresponding labels, {x^{(i)}} → {y^{(i)}}, by means of a supervised learning algorithm. On the other hand, if no labels are present in the dataset, the task is to find structure within the data, and unsupervised learning takes place.
Owing to the large abundance of data, one can easily obtain feature vectors of overwhelmingly large size, leading to what is referred to as "the curse of dimensionality". As an example, imagine an ML algorithm that receives as input images of n × n greyscale pixels, each represented as a numeric value. In this case, the matrix containing these numbers is flattened into an array of length n², which is the feature vector describing a point in a high-dimensional space. Due to this quadratic dependency, a huge number of dimensions is easily reached for average-sized images, and memory or processing power become limiting factors in this scenario.
A key point is that within the high-dimensional data cloud spanned by the dataset, one might find a lower-dimensional structure. The set of points can be projected onto a hyperplane or manifold, reducing its dimensionality while preserving most of the information contained in the original data cloud. A number of procedures with that aim, such as principal component analysis (PCA) in conjunction with the singular value decomposition (SVD), are routinely employed in ML algorithms [185]. In a few words, PCA is a rotation of the coordinate system of the space where the data points reside, such that the variance along the new axes is maximized. The direction of each new axis is found by obtaining the eigenvectors of X^T X, where X is the data matrix, ordered by decreasing eigenvalue. Once the largest-variance eigenvector, also referred to as the first principal component, is found, data points are projected onto it, resulting in a compression of the data, as depicted in Figure 10.
Figure 10. Principal component analysis (PCA) performed over a 3D dataset with 3 labels given by the color code (left), resulting in a 2D dataset (right).
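As a minimal illustration of this procedure, the sketch below (NumPy, with hypothetical random data) centers a 3D point cloud, obtains the principal components via the singular value decomposition, and projects onto the two leading ones:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 3D data cloud, stretched along one dominant direction.
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)                  # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
# Rows of Vt are the eigenvectors of X^T X, ordered by decreasing variance.
X2d = Xc @ Vt[:2].T                      # projection onto the 2 leading components

print("explained variance ratios:", (S**2 / np.sum(S**2)).round(3))
```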
A variety of ML methods is available for unsupervised learning. One of the most popular is k-means [186], which is widely used to find classes within a dataset. k-means is an algorithm capable of clustering n data points into k subgroups (k < n) by direct calculation of the distances of points to each group's centroid. Once the number of centroids k is chosen and their starting positions are selected (µ_0^{(j)}, 1 ≤ j ≤ k), e.g. at random, the algorithm iterates over two steps. First, the distances of the data points to each centroid are calculated, and each point is labeled y^{(i)} as belonging to the subgroup of the closest centroid. Next, a new set of centroids ({µ_t^{(j)}}, t > 0) is computed by averaging the positions of the members of each group. The two steps are described by equations 12 and 13,
y_t^{(i)} = \operatorname{argmin}_j \left\| x^{(i)} - \mu_t^{(j)} \right\|_p \qquad (12)

\mu_{t+1}^{(j)} = \frac{1}{n_j} \sum_{i=1}^{n} x^{(i)} \, \delta_{y_t^{(i)},\, j} \qquad (13)
where p ∈ ℕ represents the choice of metric (p = 2, the Euclidean metric, being the most popular), n_j is the number of points assigned to the cluster with centroid µ_t^{(j)}, δ_{n,m} is the Kronecker delta function (1 if m = n and zero otherwise), and t is the iteration index. Convergence is reached when no change in the assigned labels is observed. The choice of the starting positions for the centroids is a source of problems in k-means clustering, leading to different final clusters depending on the initial configuration. A common practice is to run the clustering algorithm several times and take the most representative final configuration as the clustering.
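The two-step iteration of equations 12 and 13 fits in a few lines of NumPy; the following is a sketch (hypothetical data; empty clusters and multiple restarts are not handled):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means with the Euclidean metric (p = 2 in equation 12)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]      # random initial centroids
    for _ in range(n_iter):
        # Equation 12: label each point by its closest centroid.
        labels = np.argmin(np.linalg.norm(X[:, None] - mu[None], axis=2), axis=1)
        # Equation 13: move each centroid to the mean of its members.
        new_mu = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_mu, mu):                        # assignments stopped changing
            break
        mu = new_mu
    return labels, mu

# Hypothetical data: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```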
Hierarchical clustering is another method employed in unsupervised learning, which comes in two flavors: agglomerative and divisive. The former can be described by a simple algorithm: one starts with n clusters, each containing a single example x^{(i)} from the training set, and measures the dissimilarity d(A, B) between pairs of clusters labeled A and B. The two clusters with the smallest dissimilarity, i.e. the most similar, are merged into a new cluster. The process is repeated recursively until only one cluster, containing all the training set elements, remains. The process is best visualized by plotting a dendrogram, shown in Figure 12. In order to cluster the data into k clusters, 1 < k < n, the user is required to cut the hierarchical structure obtained at some intermediate clustering step. There is a certain freedom in choosing the measure of dissimilarity d(A, B), and three main measures are popular. First, single linkage takes into account the closest pair of cluster members,
d_{SL}(A, B) = \min_{i \in A,\, j \in B} d_{ij} \qquad (14)

Second, complete linkage considers the farthest pair of cluster members,

d_{CL}(A, B) = \max_{i \in A,\, j \in B} d_{ij} \qquad (15)

and finally, group averaging clustering considers the average dissimilarity, representing a compromise between the two former measures,

d_{GA}(A, B) = \frac{1}{|A||B|} \sum_{i \in A} \sum_{j \in B} d_{ij}. \qquad (16)
The particular form of d_ij can also be chosen, the Euclidean distance usually being considered for numerical data. Unless the data at hand is highly clustered, the choice of the dissimilarity measure can result in distinct dendrograms and, thus, distinct clusters.
As the name suggests, divisive clustering performs the opposite operation: it starts from a single cluster containing all examples of the dataset and divides it recursively such that cluster dissimilarity is maximized. Similarly, it requires the user to determine the cut line in order to cluster the data.
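Using SciPy's hierarchical clustering routines, the three dissimilarity measures of equations 14 to 16 can be compared directly; a sketch on hypothetical data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Hypothetical data: three loose groups in 2D.
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in ((0, 0), (4, 0), (2, 3))])

# Agglomerative clustering; "single", "complete", and "average" correspond
# to equations 14, 15, and 16, respectively.
for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram at 3 clusters
    print(method, np.bincount(labels)[1:])
```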
In the case where not only the features X but also the labels y_i are present in the dataset, one is faced with a supervised learning task. Within this scenario, if the labels are continuous variables, the most used learning algorithm is linear regression, a regression method capable of learning the continuous mapping between the data points and the labels. Its basic assumption is that the data points are normally distributed with respect to a fitted expression,

\hat{y}^{(i)} = \theta^T x^{(i)} \qquad (17)

where the superscript T denotes the transpose of a vector, ŷ^{(i)} is the predicted label, and θ is a vector of parameters. In order to obtain the parameters θ, one plugs a cost function, given by a sum of least-squares error terms, into the model,

J(\theta) = \sum_{i=1}^{n} L\left[\hat{y}^{(i)}(x^{(i)}, \theta),\, y^{(i)}\right] = \frac{1}{2} \sum_{i=1}^{n} \left(\theta^T x^{(i)} - y^{(i)}\right)^2. \qquad (18)
By minimizing the above function with respect to its parameters, one finds the best set of θ for the problem at hand, thus leading to a trained ML model. In this case, a closed-form solution for the parameter vector θ exists,

\theta = (X^T X)^{-1} X^T y \qquad (19)

where X is a matrix whose rows contain the training set examples x^{(i)} and y is the corresponding vector of labels.
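Equation 19 translates almost directly into code; the sketch below (NumPy, hypothetical noisy linear data) solves the normal equations as a linear system, which is numerically safer than forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])   # bias column + one feature
theta_true = np.array([0.5, 2.0])
y = X @ theta_true + rng.normal(0, 0.1, n)                 # hypothetical noisy labels

# Equation 19: theta = (X^T X)^{-1} X^T y, solved without an explicit inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # should be close to [0.5, 2.0]
```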
Once the ML model is considered trained, its performance can be assessed on a test set, a smaller sample (compared to the training set) that is not used during training. Two main problems might then arise: (i) if the descriptor vectors contain an insufficient number of features, i.e. they are not general enough to capture the trends in the data, the regression model is said to be plagued by bias; and (ii) if the descriptor carries too much information, making the regression model fit the training data exceedingly well but struggle to generalize to new data, the model is said to suffer from overfitting, or variance. Roughly speaking, these are the two extremes of model complexity, which is in turn directly related to the number of parameters of the ML model, as depicted in Figure 11. In this case, a regularization parameter λ is usually introduced in order to decrease the complexity of the model in a systematic way and find the optimum spot.
Figure 11. Bias × variance trade-off. The optimum model complexity is evaluated against the prediction error given by the test set. Adapted from [187].
Ridge and LASSO regression are extensions of linear regression in which a regularization parameter λ is inserted into the cost function,

J(\theta) = \frac{1}{2} \sum_{i=1}^{n} \left(\theta^T x^{(i)} - y^{(i)}\right)^2 + \lambda \|\theta\|_p \qquad (20)

where p denotes the metric in this case: p = 0 is simply the number of non-zero elements (usually called the ℓ0 "norm"), p = 1 corresponds to LASSO, and p = 2 to ridge regression, in which large parameter values are penalized, adding to the cost function. In both the LASSO and ridge regression, the λ
parameter controls the complexity of the model by shrinking and/or selecting features. Thus, in both cases it is advisable to start with a very specialized (complex) model and use λ to decrease its complexity. The λ parameter, however, cannot be learned in the same way as θ; it is referred to as a hyperparameter, to be fine-tuned, e.g. by grid search, in order to find the value that maximizes the prediction power without introducing too much bias. One is not restricted to a single metric for the regularization term in equation 20: methods that interpolate between them, such as the elastic net [188, 189], are capable of finding an optimal combination of regularization parameters.
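In practice such a grid search takes a few lines; a sketch with scikit-learn on hypothetical data, where λ is called alpha:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))
# Hypothetical sparse ground truth: only 2 of the 10 features matter.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 200)

params = {"alpha": np.logspace(-4, 2, 13)}         # candidate regularization strengths
for model in (Ridge(), Lasso()):
    search = GridSearchCV(model, params, cv=5)     # 5-fold cross-validated grid search
    search.fit(X, y)
    print(type(model).__name__, search.best_params_)
```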
Another class of supervised learning algorithms, known as classification algorithms, is broadly used when the dataset carries discrete labels. A very popular algorithm for classification is logistic regression, which can be interpreted as a mapping of the predictions made by linear regression onto the [0, 1] interval. Let us suppose that the classification task at hand is to decide whether a given data point x^{(i)} belongs to a particular class (y^{(i)} = 1) or not (y^{(i)} = 0). The desired binary prediction can be obtained from

\hat{y} = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \qquad (21)
where θ is again a parameter vector and σ is referred to as the logistic or sigmoid function. As an example, the sigmoid function, along with predictions for a fictitious dataset, is presented in Figure 12. Usually one considers that a data point x^{(i)} belongs to the class labeled by y^{(i)} if ŷ^{(i)} ≥ 0.5, even though the predicted label can also be interpreted as a probability, ŷ = P(y = 1|x, θ).
In the case of classification, the cost function is obtained from the negative log-likelihood. Thus, obtaining the best parameters θ requires the minimization of this quantity, given by
J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] \qquad (22)
where y^{(i)} and ŷ^{(i)} = σ(θ^T x^{(i)}) are the actual and predicted binary labels. A regularization parameter λ can be inserted in equation 22 with the same intent of selecting features as in linear regression. Notice that logistic regression can also be used when the data presents multiple classes. In this case, one should employ the one-vs-all strategy, which consists in training one logistic regression model per class and predicting labels using the classifier that outputs the highest probability.
By proposing a series of changes to logistic regression, Cortes and Vapnik introduced one of the most popular ML classification algorithms, support vector machines (SVMs) [190]. Such changes can be summarized by the introduction of the following cost function,

J(\theta) = C \sum_{i=1}^{n} \left[ y^{(i)} \max\!\left(0,\, 1 - \theta^T x^{(i)}\right) + (1 - y^{(i)}) \max\!\left(0,\, 1 + \theta^T x^{(i)}\right) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \qquad (23)
where C is a hyperparameter. Insertion of max(z, 0) into the cost function leads to the maximization of a classification gap containing the decision boundary in the data space. The optimization problem described above can also be interpreted as the minimization of ‖θ‖² subject to the constraints y^{(i)}(θ^T x^{(i)} + b) ≥ 1 for all (x^{(i)}, y^{(i)}) belonging to the training set. In this case, the labels y^{(i)} are either +1 or −1, signaling that example i is or is not, respectively, a member of a particular class. In fact, by writing the Lagrangian for this constrained minimization problem, one ends up with an expression that corresponds to the cost function given by equation 23.
One of the most powerful features of SVMs is the kernel trick. It can be proved that the parameter vector θ can be written in terms of the training samples, θ = Σ_i α_i y^{(i)} x^{(i)}. This makes it possible to express the decision rule as a function of dot products between data vectors,

\theta^T x + b = \sum_i \alpha_i y^{(i)}\, x^{(i)} \cdot x + b \geq 0 \;\Rightarrow\; y = +1 \qquad (24)
where b and {α_i} are the parameters to be learned. The kernel trick consists in transforming the vectors in the dot products x^{(i)} · x using a mapping φ(x): ℝⁿ → ℝᵐ that takes the data points into a higher-dimensional space, where a decision boundary can be envisaged. Moreover, any transformation that maps the dot product into a vector-pair function has been proven to work similarly to what was described above. Two of the most popular kernels are the polynomial kernel, K(x^{(i)}, x^{(j)}) = φ(x^{(i)}) · φ(x^{(j)}) = (x^{(i)} · x^{(j)} + 1)^d, d ∈ ℕ, and the Gaussian kernel, also known as the radial basis function (RBF) kernel,

K(x^{(i)}, x^{(j)}) = \exp\!\left( -\frac{\| x^{(i)} - x^{(j)} \|^2}{2\sigma^2} \right) \qquad (25)
where σ is a hyperparameter to be adjusted. The use of the Gaussian kernel is usually interpreted as a pattern-matching process, measuring the similarity between data points in a high-dimensional space.
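A sketch of a kernelized SVM in scikit-learn, where C and the kernel width gamma (gamma = 1/(2σ²) in the notation of equation 25) are the hyperparameters discussed above; the data are hypothetical:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
# Hypothetical non-linearly separable data: a blob inside a ring.
inner = rng.normal(0, 0.5, (100, 2))
angle = rng.uniform(0, 2 * np.pi, 100)
outer = np.column_stack([3 * np.cos(angle), 3 * np.sin(angle)]) \
        + rng.normal(0, 0.3, (100, 2))
X = np.vstack([inner, outer])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="rbf", C=1.0, gamma=0.5)   # Gaussian kernel of equation 25
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```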
Up to this point, all classification algorithms presented are based on discriminative models, where the task is to model the probability of a label given the data points or features, p(y|x). Another class of algorithms is capable of performing the same task using a different approach, that of a generative model, in which one aims to learn the probability of the features given the label, p(x|y). It can be derived from the famous Bayes formula for the calculation of a posterior probability,

p(y|x) = \frac{p(x|y)\, p(y)}{p(x)} = \frac{p(x|y)\, p(y)}{\sum_i p(x|y=i)\, p(y=i)} \qquad (26)
where p(y) is the prior probability, i.e. the probability one infers before any additional knowledge about the problem is presented. By assuming that the components of the feature vectors x^{(i)} are conditionally independent given the labels y^{(i)}, a very popular ML algorithm, the Naïve Bayes classifier, is obtained [191]. This assumption enables one to rewrite the posterior probability from equation 26 as

p(y|x) = \frac{p(y) \prod_{j=1}^{n} p(x_j|y)}{p(x)} \qquad (27)
where x_j are the components of the feature vector x. Usually the denominator in this equation is disregarded, since it is a constant for all possible values of y, and the probability is renormalized. The training step for this classifier comprises the tabulation of the priors p(y) for all labels in the training set, as well as of the conditional probabilities p(x_j|y) from the same source. Once trained, the Naïve Bayes algorithm predicts the label y by selecting the largest posterior probability p(y|x) over all possible labels y.
Another popular and simple classification algorithm is k-nearest neighbors (kNN). Based on similarity by distance, this algorithm does not require a training step, which makes it attractive for quick tasks. In short, given a training set composed of data points in a d-dimensional space, {x^{(i)}}, kNN calculates the distance between these points and an unseen data point x,

d^{(i)} = \left\| x - x^{(i)} \right\| \qquad (28)

and assigns to x the most frequent label among its k nearest training points. No parameters are learned in this case, leaving the task of choosing a sensible k to the user. For classification tasks, different
choices of this hyperparameter might result in distinct partitionings of the data cloud, which can be visualized as the Voronoi tessellation diagrams in Figure 12.
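A brief sketch of kNN for a few values of k (scikit-learn, hypothetical data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2.5, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 15):   # different k values partition the data cloud differently
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k}: test accuracy {clf.score(X_te, y_te):.2f}")
```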
Finally, some ML algorithms are suited to both classification and regression. Decision trees are popular and fast algorithms that can be used in both cases. Since they can be implemented in a variety of flavors, we briefly explain the workings of two of the most popular implementations: classification and regression trees (CART), and the C4.5 algorithm [192, 193]. Both methods are based on the partitioning of the data space, i.e. the creation of nodes, in order to optimize a certain splitting criterion. Each node of the tree contains a question which defines such a partition. When no further partitioning of the space is possible, each disjoint subspace, referred to as a leaf, contains the data points one wishes to classify or predict.
C4.5 performs a series of multinary partitioning operations over the training set S. This is done so as to maximize the ratio between the information gain and the potential information that can be obtained from a particular partitioning, or test, B,
\operatorname{argmax}_B \frac{G(S, B)}{P(S, B)} \qquad (29)
where the information gain G(S, B) is

G(S, B) = -\sum_{i=1}^{k} f_i \log(f_i) + \sum_{j=1}^{l} \frac{|S_j|}{|S|} \sum_{i=1}^{k} f_i^{(j)} \log(f_i^{(j)}) \qquad (30)

where f_i is the relative frequency of elements belonging to class C_i in the training set S, while f_i^{(j)} is the same relative frequency with respect to a particular partition S_j of the training set after performing the test B. The potential information P(S, B) that such a partitioning can provide is
given by

P(S, B) = -\sum_{i=1}^{l} \frac{|S_i|}{|S|} \log\!\left(\frac{|S_i|}{|S|}\right). \qquad (31)
Partitioning takes place up to the point where the nodes contain only examples of one class, or examples of distinct classes that cannot be distinguished by their attributes.
On the other hand, CART is a decision tree method capable of binary partitioning only. For classification tasks, it uses a splitting criterion based on the minimization of the Gini impurity coefficient,

I_G(S) = 1 - \sum_{j=1}^{k} f_j^2 \qquad (32)
where S is the training set and f_j is the relative frequency of members of the j-th class in this set. If one is interested in using CART for a regression task, there are two main differences to be considered. First, the nodes predict real numbers instead of classes. Second, the splitting criterion in this case is the minimization of the resubstitution estimate, which is basically a mean squared error,

R = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2 \qquad (33)

While decision trees are fast and interpretable, they suffer from overfitting. A couple of strategies to overcome this problem have been proposed,
such as pruning the trees' structures in order to increase their generalization power, losing, however, some accuracy. More advanced methods include random forests, an ensemble method based on training several decision trees and averaging their predictions [194]. In this case, the trees are smaller versions of the structures described previously, each trained using a randomly chosen subset of the features of the dataset, and usually a bootstrap sample of the same set. In some sense, building a series of weaker learners and combining their predictions enables the algorithm to learn particular features of the dataset and generalize better to new, unseen data.
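A sketch contrasting a single, overfitting-prone decision tree with a random forest ensemble (scikit-learn, hypothetical data):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 8))
# Hypothetical noisy rule: the class depends non-linearly on 2 of 8 features.
y = ((X[:, 0] * X[:, 1] + rng.normal(0, 0.3, 300)) > 0).astype(int)

models = {"tree": DecisionTreeClassifier(random_state=0),
          "forest": RandomForestClassifier(n_estimators=200, random_state=0)}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(name, round(scores.mean(), 2))
```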
Figure 12. (a) Example of the sigmoid function and the classification of negative (red) and positive (blue) examples in logistic regression. The gray arrow points to the incorrectly classified points in the dataset. (b) A Voronoi diagram depicting the k-means classification algorithm. The data point labels correspond to the distinct colors of the scatter points, while the assignment to each cluster, defined by their centroids (black crosses), corresponds to the color patches. (c) A dendrogram explaining the hierarchical clustering algorithm, where the color code is a guide to visualize the clusters, and each vertical line represents a cluster. Horizontal lines denote the merging process of two clusters. The number of cuts between a horizontal line and the cluster lines denotes the number of clusters at a given height, which in the case of the gray dashed line is five. (d) k-nearest neighbors Voronoi diagram showing the data point labels and classification patches as color codes.
Artificial neural networks (ANNs) correspond to a class of algorithms that were, at least in their early stages, inspired by the structure of the brain. An ANN can be described as a directed weighted graph, i.e., a structure composed of layers containing processing units called neurons, which are in turn connected to other such layers, as depicted in Figure 13. Many kinds of ANNs are used for a variety of tasks, namely regression and classification, and some of the most popular architectures for such networks are feed-forward, recurrent, and convolutional ANNs. The main difference between these architectures lies in the connection patterns and in the operations performed between adjacent layers. In a feed-forward network, for instance, each neuron i of layer k first combines the outputs y_j^{(k−1)} of the previous layer linearly,

z_i^{(k)} = \omega_{i0}^{(k)} + \sum_j y_j^{(k-1)} \omega_{ij}^{(k)} \qquad (34)

where ω_{ij}^{(k)} is the matrix element which connects the adjacent layers. The element ω_{i0}^{(k)} is referred to as the bias, because it is not part of the linear combination of inputs.
Figure 13. Example of a feed-forward ANN with N hidden layers and a single neuron in the output layer. Red neurons represent sigmoid-activated units (see equation 35) while yellow ones correspond to the ReLU activation (equation 37).

The input is then passed through a non-linear activation function; a common choice is the sigmoid,

y_i^{(k)} = \frac{1}{1 + e^{-z_i^{(k)}}} \qquad (35)

another popular choice being the rectified linear unit, ReLU(z) = max(0, z) (equation 37). For a regression task, the network output is trained by minimizing the least-squares
error given by equation 18. For a single-class classification task, an ANN should output a single sigmoid-activated neuron, corresponding to the probability that the input example belongs to the particular class. In this case, the measure of accuracy is the same as in the logistic regression algorithm, the cross-entropy given by equation 22. The difference is that the parameters to be learned are now the interlayer matrix elements ω_{ij}^{(k)} instead of a single parameter vector θ, and the predicted labels are a complicated compound non-linear function. In case one is interested in multi-class classification, a softmax activation should be used, corresponding to the probability of the output vector y^{(k−1)} = [y_1^{(k−1)}, ..., y_n^{(k−1)}] representing a member of class y_i,
y_i^{(k)} = \frac{e^{y_i^{(k-1)}}}{\sum_{j=1}^{n} e^{y_j^{(k-1)}}}, \qquad (38)
and the loss function to be minimized is the cross-entropy,

L\left(\{\omega^{(k)}\}\right) = -\sum_{ij} y_{ij} \log\!\left[\hat{y}_{ij}(\{\omega^{(k)}\})\right] \qquad (39)
where {ω^{(k)}} is the set of matrices containing the weights one is interested in learning, y_{ij} is the i-th entry of the label vector corresponding to the j-th training example, and ŷ_{ij} is the corresponding predicted value. Optimal values for the parameters ω_{ij}^{(k)} are found by calculating the gradient of L with respect to these parameters and performing gradient descent minimization. This process is referred to as back-propagation.
In a nutshell, using ANNs for machine learning tasks comprises a series of steps: (i) random initialization of the weights {ω_{ij}^{(k)}}; (ii) forward-passing training examples and computing their outcomes; (iii) calculating their deviations from the corresponding labels via the loss function; (iv) obtaining the gradients of that function with respect to the network weights via back-propagation; and finally (v) adjusting the weights in order to minimize the loss function. This process might be performed for one example of the training set at a time, which is called online learning, or using samples of the set at each step, referred to as mini-batch or simply batch learning.
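These five steps fit in a short NumPy sketch. Assuming a hypothetical two-feature binary problem, a single-hidden-layer network with sigmoid activations, trained by full-batch gradient descent on the cross-entropy of equation 22, could read:

```python
import numpy as np

rng = np.random.default_rng(8)
# Hypothetical XOR-like data that no linear model can separate.
X = rng.uniform(-1, 1, (200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# (i) random initialization of the weights (one hidden layer of 8 neurons)
W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)

lr = 2.0
for step in range(5000):                 # full-batch gradient descent
    # (ii) forward pass (equations 34 and 35)
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    # (iii)-(iv) deviations and gradients via back-propagation;
    # for a sigmoid output with cross-entropy loss, dL/dz2 = (y_hat - y)/n
    d2 = (y_hat - y) / len(X)
    dW2, db2 = h.T @ d2, d2.sum(axis=0)
    d1 = (d2 @ W2.T) * h * (1 - h)
    dW1, db1 = X.T @ d1, d1.sum(axis=0)
    # (v) weight update to decrease the loss
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("training accuracy:", np.mean((y_hat > 0.5) == (y > 0.5)))
```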
An ML supervised learning algorithm is considered trained when its optimal parameters given the training data are found, by minimizing a loss function or negative log-likelihood. However, the hyperparameters usually cannot be learned in this manner, and a study of the performance of the model over a separate set, referred to as the validation set, as a function of these parameters is in order. This process is referred to as validation. The usual way of doing so is separating the dataset into three separate sets: the training, validation, and test sets. It is expected that their contents are of the same nature, i.e. come from the same statistical distribution. The learning process is then performed several times in order to optimize the model. Finally, by using the test set, one can confront the predictions with the actual labels and measure how far off the model is. The optimal balance is represented in Figure 11. When a limited amount of data is available for training, removing a fraction of that set in order to create the test set might negatively impact the training process, and alternative strategies should be employed. One of the most popular methods in this scenario is k-fold cross-validation, which consists in partitioning the training set into k subsets, training the model using k − 1 of the subsets, and validating the trained model on the subset that was not used for training. This process is performed k times, and the average over the validation steps is used to estimate the performance,
E_{cv} = \frac{1}{K} \sum_{k=1}^{K} \sum_{i=1}^{n_k} L\!\left(\hat{y}_k^{(i)},\, y^{(i)}\right) \qquad (40)
where L is the loss function and ŷ_k^{(i)} is the predicted label of the i-th training example of the model trained using the subset of the training data excluding subset k, which is of size n_k. The particular case where K = n, i.e. the number of subsets equals the number of elements in the training set, is called leave-one-out cross-validation.
Cross-validation can also be used to evaluate the performance of the trained model with respect to some hyperparameter, such as λ when one introduces regularization, or σ for SVMs with a Gaussian kernel. Other parameters that might not seem so obvious, such as the pruning level of binary trees or the number of features one selects in order to create the ensemble for a random forest, can also be optimized in the same way. The error is evaluated for a series of parameter values, and the value that minimizes the prediction (test) error is selected.
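A sketch of this procedure, scanning the RBF width of an SVM by 5-fold cross-validation (scikit-learn, hypothetical data):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(9)
X = rng.normal(size=(150, 4))
y = (np.linalg.norm(X, axis=1) > 2).astype(int)   # hypothetical radial rule

for gamma in (0.01, 0.1, 1.0, 10.0):              # gamma = 1/(2 sigma^2)
    scores = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5)
    print(f"gamma={gamma}: CV accuracy {scores.mean():.2f}")
```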
There are many different ways of evaluating performance. As an example, in binary or multinary classification tasks the use of confusion matrices is very common: the numbers of correctly predicted elements appear in the diagonal entries, while incorrectly predicted elements are counted in the off-diagonal entries. One can think of the vertical index as the actual labels and the horizontal index as the predictions; false (F) positives (P) or negatives (N) are positive predictions for negative cases and the converse, respectively. The receiver operating characteristic (ROC) curve is also routinely used, being the plot of the true (T) positive rate TPR = TP/(TP + FN) versus the false positive rate FPR = FP/(FP + TN) as the decision threshold changes.
In the case of regression tasks, there are several measures of fitting accuracy. The mean absolute error, MAE = (1/n) Σ_i |y_i − ŷ_i|, measures deviations in the same unit as the variable and is not very sensitive to outliers. Its normalized version, expressed as a percentage, is MAPE = (100%/n) Σ_i |(y_i − ŷ_i)/y_i|. The mean squared error, MSE = (1/n) Σ_i (y_i − ŷ_i)², combines bias and variance measurements of the prediction: from a frequentist point of view, the estimate θ̂_m of a distribution parameter θ is intimately related to the MSE via the formula MSE = E[(θ̂_m − θ)²] = Bias(θ̂_m)² + Var(θ̂_m). The MSE, i.e. the cost function given in equation 18 or equation 20 (when a regularization parameter λ is introduced), would ideally add up to zero for data points lying exactly on top of the function obtained via regression. The MSE is usually reported as its square root (RMSE), which recovers the original unit and facilitates the interpretation of model accuracy. Finally, the statistical coefficient of determination R² is also used, defined as R² = 1 − SS_res/SS_tot, where the total sum of squares is SS_tot = Σ_i (y_i − ȳ)² and the residual sum of squares is SS_res = Σ_i (y_i − f̂_i)².
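These regression metrics are one-liners in NumPy; a sketch with hypothetical predictions:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # hypothetical true values
y_hat = np.array([1.1, 1.9, 3.2, 3.7, 5.3])    # hypothetical predictions

mae  = np.mean(np.abs(y - y_hat))
mape = 100 * np.mean(np.abs((y - y_hat) / y))
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
r2   = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(mae, mape, rmse, r2)
```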
2.3.3 Materials Informatics
Inspired by the success of applied information sciences such as bioinformatics, the application of machine learning and data-driven techniques to materials science has developed into a new sub-field called "materials informatics" [195], which aims to discover the relations between known standard features and materials properties. These features are usually restricted to the structure, composition, symmetry, and properties of the constituent elements. Recasting the learning problem stated in Section 2.3.1 in this context, we usually want to answer a question of the type: given a material x_i, what is its property y_i = f(x_i)? Or, {material → property}? Naturally, this question has always been at the heart of materials science; what changes here is the way of solving it. Specifically, one has to provide a known example dataset to train an approximate ML model and then make predictions on materials of interest outside the dataset. Ultimately, the inverse question can also be answered (see Section 2.1.2.1): given the desired property y, what material can present it?
A model must be constructed to predict properties or functional relationships from the data. The model is an approximate function that maps the inputs (materials features) to the outputs (properties). As such, it can be seen as a phenomenological or empirical model, because it arrives at a heuristic function that describes the available data. The ML process is expected to provide feature-property relationships that are hidden from human capacities. In the context of the science paradigms (discussed in Section 1.1), this contrasts with theoretical models, which discover the fundamental underlying physics behind the data. Even so, these approximate models can lead to better understanding and ultimately aid in the construction of theories. In Feynman's words: "We do not know what the rules of the game are; all we are allowed to do is to watch the playing. Of course, if we watch long enough, we may eventually catch on to a few of the rules. The rules of the game are what we mean by fundamental physics." [196].
The machine learning task of constructing models for materials is an applied version of the general ML workflow presented in Section 2.3.1. As discussed, the supervised tasks can be divided into two groups: learning a numerical material property, or materials classification. In the first case, the ML process aims to find a functional form f(x) for a numeric target property, requiring the use of methods such as regression. Otherwise, classification aims to create "materials maps", in which compounds or molecules exhibiting different categories of the same property are accordingly identified by class labels. For example, magnetic and non-magnetic systems (non-zero and zero magnetic moment), or compounds stable in the zinc blende or rock salt structures, form two different classes. In these maps, the overlap between the classes must be zero, as schematically represented by the Voronoi diagram depicting the k-means classification (see Figure 12). Thus, the class of a
material outside the training set can be identified simply by its position on the map. In Section 3 we discuss examples and progress based on these kinds of materials informatics tasks. Here, we first outline the process usually followed.
The materials informatics workflow combines essentially the same general components (see Section 2.3.1):
0) Problem definition: one of the most important tasks. Here the desired outcome (classification, regression, clustering, optimization, probability estimation, etc.) must be defined and translated into a specific, measurable, attainable, relevant, and timely (SMART) goal that will be the learning algorithm's target. Besides the desired output, the possible inputs (data and representations) needed to describe the goal must be considered. We briefly discuss the types of problems that are or are not suited to ML at the end of this section.
1) Data: the essential component of any data-driven strategy. It must be sufficient to describe the defined problem. A minimum dataset consists of a measured material property for the set of available examples, i.e. the ML target output. Typically (but not always, if the problem is to find such information) this set is also accompanied by an identification of each example, which can be used as input. We presented approaches capable of data generation in previous sections, but one is not restricted to them: any data source can work.
2) Representations: perhaps the most demanding task. The representation of materials will determine the machine learning capacity and performance. The process consists in mapping the accessible descriptive input quantities that identify a material into a vector connected to the property of interest. In statistical learning, this set of variables identifying materials features is called a descriptor [197], or fingerprint. Due to the importance of this topic, it is discussed in greater detail in Subsection 2.3.3.1.
3) ML algorithms and model selection, evaluation, and optimization: according to the problem goal, a suitable algorithm must be chosen and evaluated, paying special attention to its characteristics regarding accuracy/performance, training time, and the complexity/interpretability of the model, using evaluation and optimization methods such as CV combined with RMSE, MAE, R², etc. The ultimate evaluation should always be performed on the unseen test data, which will reveal whether bias/variance has been modeled, resulting in under-/overfitting (Figure 11). We presented a selection of algorithms and their evaluation in the previous Subsection 2.3.2.
Therefore, the model creation can be synthesized in the following equation:

(goal +) data + representation + learning algorithm = model

as summarized in Figure 14. A classic example of a powerful descriptor is the periodic table, in which Mendeleev organized the elements into a two-dimensional arrangement of rows and columns grouping elements with similar chemical properties.
Only 50 years later did quantum mechanics provide the physical reasoning behind this two-dimensional descriptor: the shell structure of the electrons. Despite this delayed interpretation, the periodic table anticipated undiscovered elements and their properties, attesting to its predictive power [199]. On the other hand, the challenge of sorting all materials is much more complex, since there are potentially millions of materials instead of only 118 elements. Additionally, only a small fraction of these compounds have their basic properties determined [200]. The problem is harder still for the infinitely large dataset formed by all possible combinations of surfaces, interfaces, nanostructures, and organic materials, where the complexity of materials properties is much higher. Therefore, it is reasonable to suppose that materials with promising properties are still to be discovered in almost every field [199].
Figure 14. Materials informatics workflow summarized as: (goal +) data + representation + learning algorithm and optimization. Adapted from [198]. CC BY 4.0
In practice, several software packages and tools for different types of ML tasks exist; a selection is presented in Table 3. General-purpose codes work for the various types of problems (Section 2.3.1) irrespective of the data source, provided it is in the right format, and implement the most common algorithms discussed in Section 2.3.2. Materials-specific codes aid in the different steps of the MI workflow: data curation and representation, by transforming general materials information (compositional, structural, electronic, etc.) into feature vectors (details in Section 2.3.3.1); algorithm training and validation; and deployment of the generated ML models, as is the case for ML atomistic potentials, generally interfaced with an MD package or HT framework.
31
Finally, we now discuss an essential question regarding ML research: when ML should or not be
32
employed and what kind of problems it tackles. An obvious crucial prerequisite is the availability
33
of data, which should be consistent, sufficient, validated, and representative of the behavior of
34
interest to be described. Once more we emphasize this requirement and thus, the common data
35
generation process is generally better suited to traditional or HT approaches, at least initially.
36
Additionally, one has to consider the strengths of machine learning methods, which can manage
37
high-dimensional spaces in searching for relationships in data. The patterns discovered are then
38
explicit encoded, rendering computational models that can be manipulated. In contrast, if human
pte
39 intuition can produce a physical model, ML is probably not needed by the problem. Therefore,
40 ML methods are best suited to problems where traditional approaches have difficulties. Although
41 it is not always clear to specify, if a problem can be identified into one of the general ML problem
42 types described in Section 2.3.1, ML can be a useful tool. In order of increasing added value and
43 difficulty, the general problems tackled are replacing the collection of difficult, complex or expensive
44 properties/data; generalizing a pattern present in a data set for a similar data class; obtaining a
45 relationship between correlated variables but with unknown or indirect links, which is beyond
46 intuition or domain knowledge; obtaining a general approximate model for a complex unknown
ce
47 property or phenomena which have no fundamental theory or equations [198]. Historically, areas
48 which have questions with these characteristics have had successful applications of ML methods,
49 such as in automation, image and language processing, social, chemical and biological sciences,
50 and in recent times many more examples are emerging.
Based on these characteristics, we now glimpse the common types of applied materials science problems which make use of data-driven strategies, exemplified in Section 3.2.
Table 3. Selection of Materials Informatics and machine learning codes and tools. Adapted from [22].

Name | Description | URL | Ref.
General purpose
scikit-learn | General purpose ML | http://scikit-learn.org | [201]
TensorFlow | General purpose ML | www.tensorflow.org | [202]
PyTorch/Caffe2 | Open source deep learning platform | https://pytorch.org/ | [203]
Weka | General purpose ML | https://www.cs.waikato.ac.nz/ml/weka/ | [204]
Materials specific
SISSO | General purpose ML | https://github.com/rouyang2017/SISSO | [205]
Magpie | General purpose ML | https://bitbucket.org/wolverton/magpie | [206]
MatMiner | Feature construction library | https://hackingmaterials.github.io/matminer | [207]
AFLOW-ML | General purpose ML | http://aflowlib.org/aflow-ml/ | [208]
PROPhet | Neural networks to materials predictions | https://biklooost.github.io/PROPhet/ | [209]
COMBO | Bayesian Optimization Library | | [210]
Phoenics | | https://github.com/aspuru-guzik-group/phoenics | [211]
JARVIS-ML | | https://www.ctcms.nist.gov/jarvisml/ | [212]
OMDB-ML | | https://omdb.mathub.io/ml | [213]
ML atomistic potentials
SchNetPack | Neural networks | https://github.com/atomistic-machine-learning/schnetpack | [214]
GAP/SOAP | Gaussian Approximation Potentials (GAPs) | http://libatoms.org/Home/Software | [215, 216]
TensorMol | Neural networks | https://github.com/jparkhill/TensorMol | [217]
ANI | Neural networks | https://github.com/isayev/ASE_ANI | [218]
Amp | Complete package | https://bitbucket.org/andrewpeterson/amp | [219]
DeePMD-kit | Neural networks | https://github.com/deepmodeling/deepmd-kit | [220]
ænet | Neural networks | http://ann.atomistic.net/ | [221]
The first is the evident attainment of models for phenomena with unknown relationships or mechanisms. A related strategy is to replace the description of a very complex or expensive property (that is somewhat known, at least for a small class of materials) by a simpler ML model, rendering its calculation less expensive. If properly validated, this model can then predict the complex property for unknown examples, expanding the dataset. In the context of materials discovery and design, this strategy can be employed as a way of extending the dataset before screening, where the initial expensive data leads to more data through the ML model, which can then be screened for novel promising candidates. Other problems use feature selection techniques to discover approximate models and descriptors, which aid in the phenomenological understanding of the problem. Another, and perhaps the most abundant, type comprises the clearly advantageous problems in which expensive calculations can be replaced by a much more efficient model, such as replacing DFT calculations altogether with ML models, e.g. in obtaining atomistic potentials for MD simulations or in predicting the values of different properties (gap, formation and total energies, conductivity, magnetization, etc.).
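As an illustration of this last class of problem, the sketch below fits a random forest regressor mapping simple composition-derived features to a target property; the three feature columns and the target are hypothetical placeholders for, e.g., DFT-computed formation energies:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(10)
# Hypothetical descriptors per compound (e.g. mean electronegativity,
# mean atomic radius, mean number of valence electrons).
X = rng.uniform(size=(500, 3))
# Hypothetical target standing in for a DFT formation energy (eV/atom).
y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] ** 2 + rng.normal(0, 0.05, 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("test MAE:", round(mean_absolute_error(y_te, model.predict(X_te)), 3))
```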
2.3.3.1 Representations and descriptors
The representation of materials is a crucial component determining machine learning performance. Only if the necessary variables are sufficiently represented will the learning algorithm be able to describe the desired relationship. The objective of a representation is to transform materials characteristics such as composition, stoichiometry, structure, and properties into a quantitative numerical list, i.e. a vector or a matrix, to be used as input for the ML model. The variables used to represent materials characteristics are called features, descriptors, or even fingerprints. A general guideline can be expressed by a variant of Occam's razor, a paraphrase famously attributed to Einstein: a representation should be "as simple as possible, but not simpler". For any new ML problem, the feature engineering process is responsible for most of the effort and time spent on the project [222].
To represent materials in a systematic, physics-based, and computation-friendly way,
some universal desirable requisites have been proposed [197, 223, 224], such as: the representation should be i) complete (sufficient to differentiate the examples), ii) unique (two materials have the same representation only if they are in fact the same), iii) discriminative (similar/different systems are characterized by accordingly similar/different representations), and iv) efficient and simple (the representation is fast to compute). Other helpful characteristics are a high target similarity (similarity between the representation and the original represented function) and, from a computational perspective, fixed dimensionality and smoothness and continuity, which ensure differentiability. These requisites act to assure that the models will be efficient, carrying only the essential information.
The relationship between the structure and properties of molecules and materials has been studied for more than a hundred years [225], and a whole research field called quantitative structure-activity/property relationship (QSAR/QSPR) developed with the aim of finding heuristic functions connecting the two. This field has shown a relative degree of success, but also inconsistent performance of its models, arising from a lack of either a proper domain of applicability, satisfactory descriptors, or machine learning validation [226]. Recent ML research for materials and molecules is bridging the gap between more traditional simulation methods, such as DFT and MD, and the QSAR/QSPR and related bio- and cheminformatics fields.
Generally, a material can be described in several ways of increasing degree of complexity, depending on the needs of each problem. The simplest way uses only chemical features, such as atomic element types and stoichiometric information, which involves no structural characterization; it is therefore more general, but less specific to distinct polymorphs, which can present different properties. This kind of rough description manages to capture general trends among very different types of materials. In order to increase the descriptive capability of the ML models, higher complexity can be handled by introducing more of the relevant information available [206]. For descriptors based on elemental properties [197, 199, 227, 228], this involves including and combining element properties and statistics of these, such as the mean, mean absolute deviation, range, minimum, maximum, and mode. Stoichiometric attributes can include the number of elements, fractions, and norms. Beyond that, ionic character [206, 229], electronic structure attributes [156, 230, 231], fingerprints [232], and statistics of these can be included to account for more intricate relationships.
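A sketch of such an elemental-statistics featurization, with a hypothetical two-entry property lookup (a real workflow would draw on a library such as MatMiner, listed in Table 3):

```python
import numpy as np

# Hypothetical elemental property table (here: Pauling electronegativities).
ELEM_PROPS = {"Ga": 1.81, "As": 2.18, "In": 1.78, "P": 2.19}

def featurize(composition):
    """Map {element: fraction} to statistics of an elemental property."""
    vals = np.array([ELEM_PROPS[el] for el in composition])
    fracs = np.array(list(composition.values()))
    mean = np.sum(fracs * vals)                     # composition-weighted mean
    return [mean,
            vals.max() - vals.min(),                # range
            np.sum(fracs * np.abs(vals - mean)),    # mean absolute deviation
            len(composition)]                       # number of elements

print(featurize({"Ga": 0.5, "As": 0.5}))
```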
Including the structural information of the high-dimensional space of atomic configurations [233] is not a simple task, as common structural representations are not directly applicable to computational descriptions. Materials, especially solids, are commonly represented by their Bravais matrix and a basis, containing the information of the translation vectors and of the atom types and positions, respectively. For machine learning purposes, this representation is not suitable because it is not unique. In the case of structural input, the requisites presented above indicate that the chemical species and atomic coordinates should suffice for an efficient representation. As such, the models should preserve the system's symmetries, such as translational, rotational, and permutational invariance. Ultimately, the representation objective is to ensure accuracy comparable or superior to quantum mechanics calculations for a wide range of systems, but at reduced computational cost.
These so-called structural fingerprints are increasingly used to describe the potential energy surfaces (PES) of different systems, leading to force fields for classical atomistic simulations with QM accuracy, but with computational costs orders of magnitude lower and linear-scaling O(n) behavior with the number of atoms. Most of these potentials benefit from chemical locality, i.e. the system's total energy can be described as a sum over local (atomic) environment contributions, E = Σ_atom E_atom, which improves transferability. Commonly, these ML potentials use as learning algorithms kernel ridge regression (KRR), neural networks (NN), or even support vector machines (SVM) [234], which are very efficient in mapping the complex PES. Notable examples are the Gaussian Approximation Potentials (GAPs) [215, 235], Behler–Parrinello high-dimensional neural network potentials [236, 237], and Deep Potential molecular dynamics [238]. Related to the structural representation, a scoring parameter to identify the dimensionality of materials was recently developed [239]. There are also methods for structural similarity measurement improving upon the commonly used root-mean-square distance (RMSD) [240], such as fingerprint distances [241, 242], the functional representation of an atomic configuration (FRAC) [243], the distance matrix and eigensubspace projection function (EPF) [244], and the regularized entropy match (REMatch) [245], used with SOAP.
dM
There is a vast collection of descriptor proposals in the literature that include more complex representations than the simple elemental properties discussed above, ranging from molecule-oriented fingerprints to descriptors for extended materials systems and tensorial properties. We now present a roughly chronological list of descriptors used in recent materials research, which is considerable but not exhaustive. These include: bond-orientational order parameters (BOP) [246]; Behler–Parrinello atom-centered symmetry functions (ACSF) [236, 247], and their modified [248] and weighted (wACSF) [249] versions; Gaussian Approximation Potentials (GAP) [215, 235] using the smooth overlap of atomic positions (SOAP) [216], also extended for tensorial properties [250]; the Coulomb matrix [251] and Bag of Bonds (BOB) [252], and the subsequent interatomic many-body expansions (MBE) [253, 254] such as the so-called BAML (bonds, angles machine learning) [255] and fixed-size inverse distances [256]; metric fingerprints [241]; the bispectrum [216]; the atomic local frame (ALF) [257]; partial radial and angular distribution functions (PRDF, ADF) [258] and generalized radial distribution functions (GRDF) [227]; Fourier series of radial distribution functions [259]; force vector representations [260]; the spectral neighbor analysis potential (SNAP) [261]; permutation invariant polynomials [248]; particle densities [262]; angular Fourier series (AFS) [216]; topological polyhedra [263], Voronoi [264] and Voronoi–Dirichlet [265] tessellations; spherical harmonics [266]; histograms of distances, angles, or dihedral angles [267]; classical force-field-inspired descriptors (CFID) [212]; graph-based descriptors such as the Graph Approximated Energy (GRAPE) [268]; constant-complexity descriptors based on Chebyshev polynomials [269]; symmetrized gradient-domain machine learning (sGDML) [270]; generalized crystal graph convolutional neural networks (CGCNN) [271]; and grid-based real-space local environment properties such as the potential [272].
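As one concrete example from this list, the Coulomb matrix [251] is simple enough to state in a few lines. The sketch below follows its standard definition (diagonal $0.5\,Z_i^{2.4}$, off-diagonal $Z_i Z_j/|R_i - R_j|$), with row-norm sorting as one common way to obtain an ordering that does not depend on how the atoms are listed.

```python
# Sketch of the Coulomb matrix descriptor of Rupp et al. [251].
import numpy as np

def coulomb_matrix(Z, R):
    """Z: (n,) nuclear charges; R: (n, 3) atomic positions."""
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            M[i, j] = 0.5 * Z[i]**2.4 if i == j else \
                Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    order = np.argsort(-np.linalg.norm(M, axis=1))  # sort by descending row norm
    return M[order][:, order]

# Water-like toy geometry: O at the origin, two H atoms.
print(coulomb_matrix(np.array([8, 1, 1]),
                     np.array([[0.0, 0.0, 0.0],
                               [0.96, 0.0, 0.0],
                               [-0.24, 0.93, 0.0]])))
```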
An important open discussion regards the interpretability [273, 274] of the descriptors and consequently of the models obtained with ML [197, 199]. As already stated, one of the objectives of materials science is to discover the governing relationships for the different materials properties, which enable predictive capacity over a wide materials space. A choice can be made when selecting and designing the descriptors to be used. When prediction accuracy is the main goal, ML methods can be used as a black box, and the descriptor interpretation and dimensionality are secondary. On the other hand, if the goal in addition to accuracy is understanding, physically meaningful descriptors can provide insight into the relationship described and even help to formulate approximate phenomenological models [275]. This cycle is presented in Figure 15. Regarding algorithms, dimensionality reduction and the regularization techniques already presented, such as LASSO and SISSO, can assist in this quest.
The apparent distinction can be seen as a version of the Keplerian empirical/phenomenological first science paradigm (descriptive laws without a fundamental physical reason for them to be that way) in contrast to the Newtonian theoretical second science paradigm. In the ML case, the debate questions whether ML models can be purely interpolative (closer to the first, empirical science paradigm) or also extrapolative (closer to the second, fundamental theoretical science paradigm), predicting more fundamental relationships beyond the given data class. Recently, Sahoo et al. presented a novel approach capable of accurate extrapolation, identifying and generalizing the fundamental relations to unknown regions of the parameter space [276]. No consensus exists on this discussion, and advances in research may make the debate obsolete. A pragmatic view on the causation vs. correlation debate is to acknowledge that while discovering the underlying physical laws is the ideal goal, it is not guaranteed to happen. Meanwhile, obtaining association patterns can be done much more quickly and could be an acceptable substitute for many practical problems [8].
Figure 15. The revised connectivity between the four science paradigms (data, machine learning, theory, and simulations). From empirical data to fundamental theories, which are realized in computational simulations, generating even more data. Statistical learning in turn can obtain simple phenomenological models that aid theoretical understanding.
2.3.3.2 Novel ML methods in physics and materials
We discussed ways in which machine learning can be used to directly predict materials properties or even to discover novel materials. Another, broader strategy is to use ML methods to bypass or replace the calculations necessary to obtain the data in the first place. Here we briefly discuss the use of ML to extend and advance current methods for a variety of problems. Works in this direction have a broader interface with physics in general, developing methods applicable to and inspired by different areas.
There are several strategies that can be employed to circumvent the expensive solution of the Schrödinger equation and optimize computational resources by using ML without sacrificing accuracy. The general idea is presented in Figure 16a. A prominent and intuitive approach is using ML to predict novel density functionals to be used within DFT, which can be readily used with current implementations [262, 277–280]. The functionals to be predicted can be the exchange-correlation functional, as used in the traditional DFT Kohn–Sham (KS) mapping, or of the orbital-free type. Another approach, which bypasses KS–DFT, is to use ML to directly predict the electronic density [281–283], a form of the Hohenberg–Kohn (HK) map from potential to density. These three forms of mapping are presented in Figure 16b.
Figure 16. (Left) An alternative to costly theoretical calculations by more efficient ML predictions. Reproduced from [284], © 2016 John Wiley & Sons, Inc. (Right) The three DFT mappings that can be learned with ML: Kohn–Sham (KS), orbital-free, and Hohenberg–Kohn. Reproduced from [281]. CC BY 4.0.
For the acceleration of molecular dynamics simulations, ML has been used to predict the properties of configurations already evaluated and of similar ones, leaving only the expensive calculations of unseen configurations to be made on the fly [260, 285]. ML can also be used to generate coarse-grained models for large-scale MD simulations [286] and to obtain adaptive basis sets for DFT-based MD [287]. Regarding the ML training process, the dataset generation can be done with active learning [288] instead of more traditional approaches such as MD or metadynamics [289]. Quantum "intuition" can also be incorporated in the ML training process by using a density-functional tight-binding (DFTB) or other model processing layer in neural networks [290].
Wider ML applications include obtaining corrections for non-covalent interactions [291, 292], finding transition-state configurations [293] more efficiently than the nudged elastic band (NEB) method, and determining parameters for semiempirical quantum chemical calculations [294] and density-functional tight-binding (DFTB) [295] models. Machine learning has also been used to obtain tight-binding-like approximations of Hamiltonians [296], and to solve the quantum many-body problem [297–299] and the Schrödinger equation [300] directly. The applications in physics also involve the important problems of determining partition functions [301], finding phase transitions and order parameters [302–306], and obtaining the Green's function of models [307]. These examples show promising strategies to extend the frontiers of materials science research, which can be applied to study a variety of systems and phenomena.
3 Applications in Materials Science
We use DFT as a representative of the general class of methods used to generate data, since it is the most widely used method in materials science. The data, irrespective of where it came from, is then used in the HT and ML approaches. Therefore, we choose to highlight HT and ML applications and only briefly comment on DFT applications.
Much has been written on DFT applications, and articles and reviews of general [308] and specific scope appear constantly. DFT has been used for almost every kind of system, ranging from atomic [309], molecular [310, 311], and chemical systems to extended solids, surfaces [312], defects [313], and 0D [314–316], 1D [317–323], and 2D [324] systems. In terms of properties, structural [325], electronic/transport [326, 327], thermal [328–330], electron-phonon [331], optical [332], catalytic [333], magnetic [334–336], topological [337–339], and many others have been studied.
3.1 High-Throughput
The HT methods for novel materials discovery are directly related to the generation and storage of massive amounts of data. The availability of this data (most theoretical databases are open access) to the general scientific community is an important collaborative strategy for accelerating the discovery of innovative applications. DFT-HT calculation is a relatively new and rapidly growing field. In Table 2 some examples of the largest databases are highlighted. These theoretical and experimental databases have been used for several applications: battery technologies, high-entropy alloys, water splitting, high-performance optoelectronic materials [340], topological materials, and others. Here we show some examples of their usage, focusing mainly on the usage of large databases. Nevertheless, several groups generate their own databases, not relying only on those reported in Table 2.
3.1.1 Materials discovery, design, and characterization
Castelli et al. screened 2400 materials from the Materials Project for solar-light photoelectrochemical water-splitting materials. The Materials Project is fully calculated with PBE, i.e., its calculated band gaps are underestimated. To circumvent this problem, the GLLB-SC [341] correction was applied. They compared GLLB-SC against GW (and HSE06) band gaps for a smaller subset of materials, finding that GLLB-SC improves the band gaps. With the improved band-gap description they created a descriptor based on the materials' stability, band gap in the visible-light region, and band-edge alignment. They found five possible candidates: Ca2PbO4, Cu2PbO2, AgGaO2, AgInO2, and NaBiO3.
The simplest definition of high-entropy alloys (HEAs) is based on the number and concentration of their components and on the formation of a single phase (solid solution). Some authors adopt more restrictive definitions based also on the microstructural arrangement [342]. HEAs have attracted attention recently due to their promise of high stability against precipitation of their components; precipitation is undesirable because it may modify the properties of the alloy. The mechanism behind HEA stability relies on the high entropy, which results in a dominance of the entropic (TS) term over the enthalpic (H) one in the Gibbs free energy, with the effect of avoiding phase separation so that a solid solution is formed. The existing models to predict phase transitions in HEAs are, in general, unsatisfactory, mainly because of the absence of experimental and theoretical data. Even the tremendous computational capabilities of modern days cannot handle DFT calculations for multi-component HEAs: the combinatorial rules are unforgiving, and a 5-component HEA with an 8-atom unit cell would require more than 100,000 DFT total-energy calculations. Recently, Lederer et al. proposed a novel methodology to better predict phase separation in HEAs [343], the so-called LTVC model (Lederer–Toher–Vecchio–Curtarolo). The approach starts from DFT energies sampled over a configurational subspace, followed by cluster-expansion calculations to increase the available energetics data and a mean-field statistical analysis. Finally, an order parameter is proposed to determine possible phase transitions. For quaternary and quinary HEAs the most stable phase (bcc or fcc) results are in perfect agreement (100%) with experimental data. They also predict that almost 50% of the investigated quaternary and quinary HEAs will present a single phase, i.e., a solid solution. They used the AFLOW framework for all steps of the process.
Thermoelectric materials are able to generate electrical current via a temperature gradient. A promising application is to recover dissipated energy (heat). Their ability to generate power is measured by the so-called figure of merit, $ZT = \sigma S^2 T/\kappa$, where σ, S, T, and κ are the electrical conductivity, Seebeck coefficient, temperature, and thermal conductivity, respectively. The last term, in general, has electronic and lattice contributions. DFT is able to calculate the components of ZT; nevertheless, it is extremely computationally costly, since it requires a fine sampling of the reciprocal space [344, 345]. Only recently have HT investigations of thermoelectric materials become feasible, owing to interpolation schemes capable of circumventing the computational cost [74–77]. Wang et al. calculated the thermoelectric properties of ≈2500 materials. They found several large-power-factor materials, as well as a direct relation between the power factor and the material's band gap. Bhattacharya et al. explored alloys as possible novel thermoelectric materials [346]. Chen et al. performed calculations over 48,000 entries of the MP database [347]. They found a good agreement between experimental and theoretical Seebeck coefficients; the power factor, however, is less accurate. They also determined correlations between the crystal structure and specific band-structure characteristics (valley degeneracy) that could guide material modification for enhanced performance.
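For reference, the figure of merit defined above is a one-line computation once the four transport quantities are known; the sketch below uses illustrative order-of-magnitude inputs only, not data for any specific compound.

```python
# Toy evaluation of the thermoelectric figure of merit ZT = sigma * S^2 * T / kappa.
def figure_of_merit(sigma, seebeck, temperature, kappa_el, kappa_lat):
    """sigma in S/m, seebeck in V/K, temperature in K, kappas in W/(m*K)."""
    power_factor = sigma * seebeck**2                  # sigma * S^2
    return power_factor * temperature / (kappa_el + kappa_lat)

# A Bi2Te3-like regime: sigma ~ 1e5 S/m, S ~ 200 uV/K, kappa ~ 1.5 W/(m K).
print(figure_of_merit(1e5, 200e-6, 300, 0.6, 0.9))     # ~0.8
```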
The identification of suitable optoelectronic materials [340] as well as solar absorbers [348, 349] has also been possible via HT calculations. Another important study based on HT methods is the obtention of elastic properties of inorganic materials [18, 159] and the subsequent structuring of the data into publicly available databases. Additionally, Mera Acosta et al. [350] performed a screening of the AFLOWLIB database, showing that three-dimensional materials can exhibit non-magnetic spin splittings similar to the splitting found in the Zeeman effect.
3.1.2 Topologically ordered materials
Topological materials can be classified into topological insulators (TIs), topological crystalline insulators, topological Dirac semimetals, topological Weyl semimetals, topological nodal-line semimetals, and others [337, 351–353]. The topologically nontrivial nature is tied to the appearance of inverted bands in the electronic structure. For most topological materials, band inversion has been sought by a trial-and-error strategy: candidate compounds are proposed by analogy with known TIs and then, employing DFT calculations, one verifies whether the proposed materials feature band inversions at symmetry-protected k-points or non-zero topological invariants. These calculations typically have a high computational cost, and hence this trial-and-error process is not usually feasible.
In the seminal work of Yang et al., it was shown that semi-empirical descriptors can aid the selection of materials, allowing the efficient use of HT topological-invariant calculations to predict TIs [357]. The proposed descriptor represents the derivative of the bandgap without SOC with respect to the lattice constant [357], and requires band-structure calculations at various values of the lattice constant. Thus, material screening and high-throughput calculations were combined to study the bandgap evolution as a function of hydrostatic strain. These semi-empirical descriptors capture the evolution of the states involved in the band inversion for a given compound. The authors thus predicted 29 novel TIs. In order to avoid the complex calculations of
the topological invariants, a simple and efficient criterion that allows ready screening of potential topological insulators was proposed by Cao et al. [358]. A band inversion is typically observed in compounds in which the SOC of the constituent elements is comparable with the bandgap. This was precisely the criterion proposed by Cao et al.: representing the strength of the interaction through the average of the atomic numbers ($\tilde{Z}$) and the bandgap in terms of the difference of Pauling electronegativities ($\Delta\xi$), the band inversion is indicated by a single parameter, $\delta = 0.0184\,\tilde{Z}/\Delta\xi$, i.e., a band inversion is expected in compounds with δ > 1. The validity and predictive power of this criterion were demonstrated by rationalizing many known topological insulators and potential candidates in the tetradymite and half-Heusler families [359, 360]. This is an unusual example, since the use of atomic properties for the prediction of complex properties has mostly been explored through ML techniques, such as the SISSO method (see Section 3.2.4).
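The δ criterion is easily evaluated from tabulated atomic data alone, as the sketch below shows. The reduction of each compound to a single pair of elements, with standard Pauling electronegativities, is our simplification for illustration.

```python
# Sketch of the Cao et al. screening criterion delta = 0.0184 * Zbar / d_xi [358]:
# a compound is flagged as a band-inversion (TI) candidate when delta > 1.
def delta_descriptor(z_avg, electronegativity_diff):
    """z_avg: average atomic number; electronegativity_diff: Pauling scale."""
    return 0.0184 * z_avg / electronegativity_diff

candidates = {
    "Bi2Se3-like (Bi/Se)": ((83 + 34) / 2, abs(2.02 - 2.55)),
    "ZnO-like (Zn/O)":     ((30 + 8) / 2,  abs(1.65 - 3.44)),
}
for name, (z_avg, dxi) in candidates.items():
    d = delta_descriptor(z_avg, dxi)
    print(f"{name}: delta = {d:.2f} -> {'TI candidate' if d > 1 else 'trivial'}")
```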
Despite the great influence that the understanding of nontrivial topological phases has had on condensed matter physics, and the great efforts to find novel TI candidates, the predicted systems are restricted to a few groups of TIs. For instance, only 17 potential TIs were identified by carrying out HT electronic band-structure calculations for 60,000 materials [361]. Using novel approaches, this problem was recently addressed by three different works [362–364], in which thousands of compounds have been predicted to behave as TIs. Here we briefly discuss these works.
In the first work, Bradlyn et al. [365] advanced the understanding of topologically protected materials by posing a general question: "Out of 200,000 stoichiometric compounds extant in material databases, only several hundred of them are topologically nontrivial. Are TIs that esoteric, or does this reflect a fundamental problem with the current piecemeal approach to finding them?" In this work, the authors introduced the generalization of the theory of elementary band representations to SOC systems with TR-symmetry and proved that all bands that do not transform as band representations are topological. This theory gives a complete description of periodic materials, unifying the chemical orbitals described by local degrees of freedom and band theory in momentum space [365, 366]. Using this theory, Vergniory et al. found 2861 TIs and 2936 topological semimetals in the ICSD database [362]. The recently proposed elementary band representations are an example of general descriptors for performing material screening; however, details related to the atomic composition still require the band-structure calculation. A feature space including the elementary band representations could be a strategy to find ML-based models for novel hypothetical TI candidates. Zhang et al. [364] designed a fully automated algorithm for obtaining the topological invariants of all non-magnetic materials, comparing the bands describing occupied states with the elementary band representations [365, 366]. The authors thereby produced what is known as the first catalog of topological electronic materials. In the same spirit, using the recently developed symmetry-indicators method [367], Tang et al. found 258 TIs and 165 topological crystalline insulators with robust topological character [363], i.e., a considerable full or direct band gap. The authors also found 489 topological semimetals with band-crossing points
located near the Fermi level [363]. Finally, Choudhary et al. performed HT calculations of the SOC spillage, a method for comparing wave functions at a given k-point with and without SOC, reporting more than 1699 high-spillage TI candidates [368]. The authors extended the original definition of the spillage, which was only defined for insulators [369], by including the number of occupied electrons $n_{occ}(k)$, i.e., $\eta(k) = n_{occ}(k) - \mathrm{Tr}(P\tilde{P})$, where $P = \sum_{n=1}^{n_{occ}(k)} |\psi_{n,k}\rangle\langle\psi_{n,k}|$ for wave functions without SOC and $\tilde{P}$ is the analogous projector from calculations with SOC. Thus, this screening method is not only suitable to identify topological semimetals, but is also applicable to the investigation of disordered or distorted materials. We consider the prediction of new TIs to be one of the greatest contributions and victories of HT methods and materials screening. In spite of these great advances, there is still a very long route toward the full comprehension of phenomena in nontrivial topological states and the discovery of materials presenting phases not yet investigated.
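A minimal numerical sketch of the spillage defined above is given below, with random orthonormal vectors standing in for the occupied Bloch states of the two calculations; a real workflow would read these from DFT runs with and without SOC.

```python
# Sketch of the spillage eta(k) = n_occ(k) - Tr(P Pt) at a single k-point.
import numpy as np

def spillage(psi_nosoc, psi_soc):
    """psi_*: (n_bands, n_basis) arrays of orthonormal occupied states at k."""
    overlap = psi_nosoc.conj() @ psi_soc.T                    # <psi_n | psi~_m>
    n_occ = psi_nosoc.shape[0]
    return n_occ - np.trace(overlap @ overlap.conj().T).real  # n_occ - Tr(P Pt)

rng = np.random.default_rng(1)
basis = np.linalg.qr(rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8)))[0]
print(spillage(basis[:4], basis[:4]))   # identical occupied subspaces -> 0
print(spillage(basis[:4], basis[2:6]))  # half the states changed       -> 2
```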
3.1.3 2D materials
The 2D materials era was initiated with the isolation of graphene by Novoselov and Geim [370]. Graphene has shown how quantum confinement can significantly alter the 2D allotrope in comparison with its 3D counterpart. Following the discovery of graphene, a profusion of 2D materials has been proposed and synthesized: transition metal dichalcogenides (TMDC), h-BN, silicene, germanene, stanene, borophene, II-VI semiconductors, metal oxides, MXenes, and many others, including recently non-van der Waals materials [371–374]. The first approach using data mining and HT calculations to discover novel 2D materials was performed by Björkman et al. [375, 376]. Using a descriptor based on symmetry, packing ratio, crystallographic gaps, and covalent radii, they screened the ICSD database [152], and 92 possible two-dimensional compounds were identified. The interlayer binding energy, which is closely related to the exfoliation energy, was calculated using a very accurate scheme based on the nonlocal correlation functional method (NLCF) [377], the adiabatic-connection fluctuation-dissipation theorem within the random-phase approximation (RPA) [378], and different van der Waals functionals [379–381], along with the traditional LDA and GGA functionals. Despite this pioneering work, the results were still communicated in the traditional narrative form.
Only recently has the construction of large databases of 2D materials become popular. In general, these databases are constructed via DFT calculations using experimental information as prototypes. In the following, we briefly describe some of these 2D databases and their construction strategies.
Choudhary et al. made publicly available a 2D database with hundreds of single-layered materials [158]. They used the ingenious idea of comparing PBE lattice parameters from the Materials Project database [149] against the experimental values of the ICSD. The PBE functional is known to overestimate the lattice constant, and this overestimation is larger for van der Waals systems; for example, PBE is unable to describe the graphite structure, since it yields no energy minimum as a function of the interlayer distance [382]. Their strategy was to calculate the relative error between the PBE and experimental lattice parameters and separate the subset with values larger than 5%. After this initial screening, they computed the exfoliation energy, with proper vdW functionals, to identify possible 2D candidates. This simple descriptor correctly predicts a layered material 88.9% of the time. The exfoliation criterion used was 200 meV/atom. Another distinctive feature of this database is the large plane-wave cutoff and reciprocal-space sampling.
Ashton and coworkers [157] proposed the topology-scaling algorithm (TSA) to identify layered materials in the ICSD database. The TSA first determines the material's bonding, based on covalent radii, to identify atomic clusters. If only one cluster is present, the structure is unlikely to be layered. If the TSA finds cluster structures, the supercell is increased (n times in each direction) and a new search for clustering is performed. If the number of atoms per cluster increases quadratically with n, the system is layered (as sketched below). They adopted an exfoliation criterion of 150 meV/atom and found 680 stable monolayers.
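The sketch below illustrates the TSA idea on a toy two-sheet structure: atoms are bonded when closer than the sum of their covalent radii (times a tolerance), and the largest bonded cluster is tracked as the cell is replicated in-plane. The geometry, radii, and replication scheme are simplified stand-ins, not the implementation of ref. [157]; it requires NumPy and SciPy.

```python
# Simplified sketch of the topology-scaling algorithm (TSA) idea [157].
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def largest_cluster(positions, radii, tol=1.1):
    """Number of atoms in the largest covalently bonded cluster."""
    cut = tol * (radii[:, None] + radii[None, :])
    adj = (cdist(positions, positions) < cut) & ~np.eye(len(radii), dtype=bool)
    _, labels = connected_components(adj, directed=False)
    return np.bincount(labels).max()

def replicate(positions, radii, cell, n):
    """Tile an orthorhombic cell n x n x 1 (enough to probe in-plane growth)."""
    shifts = [(i, j, 0) for i in range(n) for j in range(n)]
    pos = np.concatenate([positions + np.array(s) * cell for s in shifts])
    return pos, np.tile(radii, len(shifts))

# Toy layered structure: two square-lattice sheets separated by a wide gap.
sheet = 1.4 * np.array([[x, y, 0.0] for x in range(2) for y in range(2)])
atoms = np.concatenate([sheet, sheet + [0.0, 0.0, 5.0]])
radii = np.full(len(atoms), 0.76)        # carbon-like covalent radius (angstrom)
cell = np.array([2.8, 2.8, 10.0])
for n in (1, 2, 3):
    pos, rad = replicate(atoms, radii, cell, n)
    print(n, largest_cluster(pos, rad))  # cluster size grows ~ n**2 => layered
```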
Mounet et al. [3] used an algorithm in the same spirit as the TSA approach to search for layered compounds. They mined the ICSD and COD databases and found 1036 easily exfoliable materials; the adopted exfoliation criterion was 35 meV/Å². Further, they calculated vibrational, electronic, magnetic, and topological properties for a subset of 258 materials, finding 56 magnetically ordered systems and two topological insulators.
Haastrup et al. released one of the largest 2D databases [162], with more than 3000 materials. The adopted strategy differs from the previous databases: they implemented a combinatorial decoration approach of known crystal-structure prototypes, more than 30 different ones. The thermodynamic stability is determined via the convex-hull approach, and the dynamical stability is assessed using Γ-point phonon calculations with the finite-displacement method. They used information about the formation energies and phonon frequencies of known 2D materials to conceive a stability criterion: the prototypes are classified as having low, medium, or high stability depending on their hull energy and the minimum eigenvalue of the dynamical matrix. Other calculated properties include elastic, electronic, magnetic (including magnetocrystalline anisotropy), and optical properties. They also employed, for a smaller subset, more sophisticated schemes such as hybrid functionals, the GW approximation, and RPA calculations.
These databases are now being screened for different properties. Ashton et al. [383] discovered a new family of Fe-based large spin-gap (as large as 6.4 eV) half-metallic 2D materials with magnetic moments around 4 µB. Four new topological insulators were predicted by Li et al. [384] by screening 641 2D materials of the Materials Web database; the largest gap found was 48 meV, for TiNiI. Olsen et al. discovered several 2D nontrivial materials, including topological insulators, topological crystalline insulators, quantum anomalous Hall insulators, and dual topological insulators (which possess both time-reversal and mirror symmetry) [385]. The 3D databases are well established and have been widely used; in contrast, the number of works using the proposed 2D databases is still relatively small. This current status provides great opportunities for further exploration in the near future.
3.2 Machine Learning for materials
In this section we present a selection of research applying machine learning techniques to materials science problems, illustrating the materials informatics capabilities explored in the literature. The research questions studied involve different types of ML problems, as described generally in Section 2.3.1 and specifically for MI in Section 2.3.3, for a wide range of materials properties, discovery, and evaluation.
As the application of ML techniques to materials problems is relatively recent, articles, perspectives, and reviews are increasingly emerging in the literature. Some works that illustrate the ML concepts and examples applied to diversified materials problems are given in refs. [22, 198, 223, 386–391]. We therefore focus here on selected examples which present recent advances in this area.
3.2.1 Discovery, energies, and stability
A common topic for ML applied to materials research is the accelerated discovery of compounds guided by data. Specifically, the prediction of compound formation energies can be effectively accelerated by ML, elucidating the thermodynamic stability of materials. One of the first works predicting crystal structures was reported by Curtarolo et al. [392] in the Ceder group, combining simple ML methods such as frequency ordering, PCA, linear regression, and correlation matrices to predict formation energies and optimize HT calculations [393, 394]. Hautier et al. used an ML model based on experimental data (ICSD) to accelerate the discovery of ternary oxides by predicting possible novel compositions, which are then simulated by a HT approach [395]. Using the well-known binary compounds as an example, Saad et al. discussed general ML concepts and examples of dimensionality reduction techniques and of supervised and unsupervised learning [396].
Crystal-structure classification between brittle and ductile phases of intermetallic compounds using only atomic radii was also studied [397]. Patra et al. described a new strategy called the neural-network-biased genetic algorithm (NBGA) to accelerate the discovery of materials with desired properties [398]; it uses artificial neural networks to bias the evolution of a genetic algorithm with fitness evaluations performed via direct simulation or experiments. The prediction of intermetallic Heusler compounds was studied with a random forest algorithm using composition-only descriptors, resulting in a 0.94 true-positive rate; predicted compounds were then experimentally synthesized [399]. Faber et al. performed kernel ridge regression of the formation energies of the most abundant prototype in the ICSD database (millions of possible compositions), the elpasolite crystals, finding 90 structures on the convex hull with an MAE of 0.1 eV/atom [400]; this result is presented in Figure 17.
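A minimal sketch of this kind of kernel-ridge-regression workflow is shown below, with synthetic descriptors and energies in place of the elpasolite data of ref. [400]; it requires scikit-learn.

```python
# Sketch of kernel ridge regression for formation energies on toy data.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(size=(500, 20))                  # toy composition descriptors
y = np.sin(X[:, 0] * 6) + X[:, 1:5].sum(axis=1)  # synthetic "formation energy"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = KernelRidge(kernel="laplacian", alpha=1e-3, gamma=0.5).fit(X_tr, y_tr)
mae = np.abs(model.predict(X_te) - y_te).mean()  # held-out mean absolute error
print(f"toy MAE: {mae:.3f}")
```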
Figure 17. Formation energies of elpasolite crystals obtained with kernel ridge regression. Reproduced from [400]. CC BY 3.0.
From an initial set of 3200 materials, Balachandran et al. predicted 242 novel noncentrosymmetric compounds by integrating group theory, which indicates inversion-symmetry breaking; informatics, which recommends systems; and density functional theory, which computes the structures' energies [401]. Illustrating how to apply a Bayesian approach to combinatorial problems in materials science and chemistry, Okamoto found the stable structures of lithium–graphite intercalation compounds by exploring only 6% of the search space [402].
The prediction of crystal structures and their stability [403, 404] has also been performed for several materials classes, such as perovskites [290, 405–407], superhard materials [408], bcc materials and Fe alloys [409], binary alloys [410], phosphor hosts [411], Heuslers [412, 413], catalysts [414], and other systems.
A recent work reported the identification of lattice symmetries by representing crystals via diffraction-image calculations, which then serve to construct a deep-learning neural network model for classification [422]. Beyond structural properties, the vibrational free energies and entropies of compounds were recently studied with ML models, achieving good accuracy from chemical compositions alone [423]. Going further, ML was used to predict interatomic force constants, which can then be used to obtain the vibrational properties of metastable structures, good indicators of finite-temperature stability [424].
Several works report the use of atomistic potentials obtained via the different ML methods discussed in Section 2.3.3.1. These are trained for systems ranging from molecular to materials science applications and greatly expand the current capabilities of atomistic simulations such as MD. A comparison of different atomistic ML potentials (presented in Section 2.3.3.1) was carried out for water interactions [425]. Gaussian approximation potentials (GAPs) have been extensively used to study different systems, such as elemental boron [426], amorphous carbon [427, 428], silicon [429], the thermal properties of amorphous GeTe and carbon [430], the thermomechanics and defects of iron [431], the prediction of inorganic crystal structures by combining ML with random search [432], the λ-SOAP method for tensorial properties of atomistic systems [250], and a unified framework to predict the properties of materials and molecules such as silicon, organic molecules, and protein ligands [433]. A recent review of applications of high-dimensional neural network potentials [434] summarized the notable number of molecular and materials systems studied, which range from simple semiconductors such as silicon [236, 435, 436] and ZnO [437] to more complex systems such as water and metallic clusters [438], molecules [439–441], surfaces [442, 443], and liquid/solid interfaces [418, 444]. Force fields for nanoclusters have been developed with 2-, 3-, and many-body descriptors [445], and hydrogen adsorption on nanoclusters has been described with structural descriptors such as SOAP [446].
A common and important research focus is to use feature-selection techniques to guide the descriptor selection process, which is usually performed by means of regularization techniques and algorithms such as LASSO. In this line of reasoning, Ghiringhelli et al. developed a methodology able to extract the best low-dimensional and physically meaningful descriptors by an extensive systematic analysis, using compressed-sensing methods for feature selection [197, 199]. The implementation of this methodology, called sure independence screening and sparsifying operator (SISSO) [205, 447], is presented in Table 3. As a proof of concept, the methodology was applied to quantitatively predict the crystal structure of binary compound semiconductors between the zinc blende (ZB) and rock salt (RS) structures, which have very small energy differences, as shown in Figure 18. Bartel et al. used SISSO to obtain a tolerance-factor descriptor to predict the stability of perovskites using only the atomic oxidation states and ionic radii, achieving an overall accuracy of 92% [448]. Bartel et al. also used SISSO to find a physical descriptor for the Gibbs energy and temperature-related properties of inorganic crystalline solids [449]; their simple descriptor, based only on atomic volume, reduced mass, and temperature, reached a 61 meV/atom RMSD, almost comparable to the much more expensive quasi-harmonic approximation. LASSO has been used to predict the stability of monolayer metal-oxide coatings and to understand which features influence this property [450]. It was found that for stoichiometric oxides the substrate surface energy, orbital radii, and ionization energies are important, while for nonstoichiometric oxides the parent-oxide stability of the coating material, as well as the oxidation-state differences between coating and support, are important descriptors. In a related case, the bootstrapped projected gradient descent (BoPGD) algorithm was used to obtain interpretable models from small datasets, being recommended when the LASSO algorithm presents instabilities due to correlations in the input features [451].
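The feature-selection role of LASSO described above can be seen in a few lines: with an L1 penalty, only the informative columns of a feature matrix retain non-zero coefficients. The data below are synthetic; only the first and fourth features actually matter.

```python
# Minimal sketch of LASSO-based feature selection on synthetic data.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))                  # 10 candidate descriptors
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=200)

Xs = StandardScaler().fit_transform(X)          # L1 penalty needs scaled features
lasso = Lasso(alpha=0.1).fit(Xs, y)
selected = np.flatnonzero(lasso.coef_)          # descriptors that survive
print("selected features:", selected,
      "coefficients:", lasso.coef_[selected].round(2))
```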
Finally, relevant methods are also being developed to tackle issues related to ML research applied to materials. These include the Δ-ML approach, which, in order to increase prediction accuracy, uses as the learning target the difference between a lower-quality model and the property of interest [452]. Another technique is subgroup discovery, which finds local structure in data, as opposed to mapping a unique global relationship [453]. A further example is the recent multi-fidelity learning, aimed at small datasets: in order to enhance the sampling and therefore the learning capacity, one can combine lower-precision data to overcome the scarcity of higher-precision data [454].
3.2.2 Electronic Properties
Among the great number of materials properties predicted by ab initio calculations, electronic properties such as bandgaps and electronic conductivity are considered key quantities in describing materials. Applications such as photocatalysis, electronic and optical devices, as well as charge storage rely on the electronic bandgap being properly characterized. As described previously, DFT calculations at the LDA or GGA level of approximation present a chronic problem known as the underestimation of the electronic bandgap [38]. The introduction of hybrid functionals in DFT, as well as TD-DFT or GW-based electronic structure calculations, enabled theoretical predictions compatible with experimental values [64, 455]. However, their application demands greater computational resources, making their wide use unfeasible for most systems. Owing to that, and also due to the availability of ab initio data from online repositories (see Table 2), much faster interpolative prediction of the electronic properties of materials via ML algorithms is now a reality. Training times and descriptor-selection processes, although time-consuming, correspond to an initial computational effort, while the subsequent prediction of electronic properties from properly trained models such as linear regressors, support vector machines, random forests, or neural networks is a much more facile task.
Roughly two classes of properties can be predicted, or classified, using machine learning methods: bandgaps and electronic conductivity. The former has been widely explored by regression techniques capable of producing a numerical value for the gap [209, 213, 256, 267, 456–466], or by classification methods, which simply answer the question "is this compound or material a metal?" [467]. The use of neural networks to predict the bandgap of inorganic materials dates back to the end of the last century [468].
Figure 18. SISSO classification of energy differences ∆E_AB between rock salt and zinc blende structures of 82 binary AB materials, as arranged by the optimal two-dimensional descriptor. Reproduced from [199]. CC BY 3.0.
More recent examples can be found in the literature where the authors make use of both methods [265, 465, 469], first classifying the materials as metals or insulators/semiconductors and subsequently predicting the bandgap for the latter class, thereby avoiding the nonphysical prediction of negative values of Eg; a minimal version of this two-stage strategy is sketched below. Figure 19 shows a few examples of bandgap predictions using a variety of ML algorithms.
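The sketch below implements that two-stage strategy on synthetic data: predicted metals get a gap of exactly zero, and the regressor is trained only on non-metals, so no negative gaps can be produced.

```python
# Two-stage bandgap model: classify metal vs. non-metal, then regress the gap.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.uniform(size=(600, 8))                          # toy material descriptors
gap = np.clip(2.0 * X[:, 0] - 1.0 + 0.05 * rng.normal(size=600), 0, None)
is_metal = gap == 0.0

clf = RandomForestClassifier(random_state=0).fit(X, is_metal)
reg = RandomForestRegressor(random_state=0).fit(X[~is_metal], gap[~is_metal])

def predict_gap(x):
    """Return 0 for predicted metals, a non-negative regressed gap otherwise."""
    x = np.atleast_2d(x)
    return np.where(clf.predict(x), 0.0, reg.predict(x))

print(predict_gap(rng.uniform(size=(3, 8))))
```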
Figure 19. Several examples of bandgap prediction using ML methods. (a) Isayev et al. predicted the bandgap of inorganic materials using a gradient boosting decision tree algorithm trained with 26,674 entries from the ICSD and AFLOW repositories. Reproduced from [265]. CC BY 4.0. (b) Mannodi-Kanakkithodi et al. predicted the bandgap of dielectric polymers using a kernel ridge regression model trained with a set of 284 polymers built from the basic blocks CH2, NH, CO, C6H4, C4H2S, CS, and O. Reproduced from [462]. CC BY 4.0. (c) Using LASSO-selected primary and compound descriptors, Pilania et al. used a database of 1306 insulating double perovskite oxides to train a regressor capable of predicting the bandgaps of these materials. Reproduced from [464]. CC BY 4.0.
The focus of ML methods in the prediction of materials conductivity lies mainly in thermoelectric applications. For this class of materials, not only the electronic conductivity σ (or resistivity ρ = σ⁻¹) is of interest: the thermal conductivity κ and the Seebeck coefficient S also need to be predicted in order to obtain the figure of merit ZT. Thermoelectric efficiency, along with the aforementioned properties, could be predicted by decision trees [470] as well as by Bayesian optimization [471, 472], for example, and Gaultois et al. proposed a recommendation engine for screening new thermoelectric materials. The descriptors employed in these works range from composition and structural attributes to quantities derived from the electronic density [209, 275, 456, 460, 465, 475, 476]. Frequently, a combination of two or more classes of descriptors [457, 461, 464, 477], as well as experimental data as features [275, 469], is found in the literature. The overall picture is that the community is aware of the importance of careful selection of the descriptor set. However, no consensus has emerged yet on which ones to pick or whether a systematic procedure to build compound features should be employed, even though recent efforts have been reported on that front [205, 267, 456, 457].
Several works report comparisons between different ML algorithms for the prediction of electronic properties of materials and organic molecules. Even though there is no consensus on which method performs best, given the heterogeneity of the available data and the variety of properties one is interested in predicting, the performance metrics are in many cases similar.
3.2.3 Magnetic properties
Magnetic materials are at the heart of several modern technological applications: they are used for data storage, energy harvesting, magnetic cooling, and more. Nevertheless, the occurrence of magnetic ordering can be considered a rare phenomenon, with only around 4% of the known inorganic compounds presenting such a property [478]. The search for novel magnetic materials is thus not only a scientifically interesting problem but an economic necessity. The specificity of each application requires a broad search over the chemical, compositional, and structural space. For example, energy-harvesting devices need permanent magnets (PMs) with high coercivity, i.e., large magnetic anisotropy energy, and large saturation magnetic moments (MS) [479]. PMs for magnetic refrigeration applications are more efficient when the magnetic phase-transition temperature is close to the operating environment temperature [479]. Here we comment on two important papers in the field of ML applied to magnetism.
Sanvito and coworkers [478] used high-throughput DFT calculations to construct a Heusler alloy database containing 236,115 entries. In order to search for novel high-performing magnetic Heusler alloys, the convex hull (binary and ternary) for 36,540 compounds was calculated to determine possible stable candidates. The calculation scope was narrowed (from the 236,115 compounds) by considering only 3d, 4d, and 5d elements, which is a reasonable choice given the focus on magnetic properties. They found 8 highly stable magnetic candidates. The Curie temperature (TC) was estimated via linear regression (see Section 2.3.2), using the equilibrium volume, magnetic moment per formula unit, spin decomposition, and number of valence electrons as input features. The regression was performed with a training set containing 60 experimental TC values, and the average error was around 50 K. They also synthesized some compounds and found an impressive agreement between the ML-estimated and experimental TC for Co2MnTi, with 940 and 938 K, respectively. Later they used an ML classification scheme, validated with the receiver operating characteristic (ROC) curve, to investigate soft and hard magnets within the same Heusler alloy set. Their feature vector contained the atomic number, number of valence electrons, local magnetic moment, and a quantity associated with the spin-orbit coupling strength [480].
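For illustration, the sketch below reproduces the shape of such a linear TC regression on a random placeholder dataset of 60 entries; the feature ranges and coefficients are invented, not those of ref. [478].

```python
# Sketch of a Curie-temperature linear regression on scalar features.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
# columns: equilibrium volume, moment per formula unit, spin-decomposition
# proxy, number of valence electrons (all toy ranges)
X = rng.uniform([40, 0, 0, 20], [60, 6, 1, 35], size=(60, 4))
tc = 120 * X[:, 1] + 8 * X[:, 3] + 30 * rng.normal(size=60)  # synthetic TC (K)

model = LinearRegression().fit(X, tc)
mae = np.abs(model.predict(X) - tc).mean()
print(f"in-sample MAE: {mae:.0f} K; coefficients: {model.coef_.round(1)}")
```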
Lam Pham et al. used KRR analyses to correctly predict the DFT magnetic moments of lanthanide–transition metal alloys [230]. They proposed the so-called orbital field matrix (OFM) descriptor, which is based on the electronic configuration, coordination number, and local structure (defined as the weighted sum of the neighbor vectors). The obtained local-magnetic-moment RMSE, MAE, and R² were 0.18 µB, 0.05 µB, and 0.93, respectively. The OFM results were also shown to be superior to the CM descriptor regarding local magnetic moments and formation energies. Later they proposed an extended descriptor, OFM1, which includes the information of the central atom in the local structure [481]. This new descriptor improved the magnetic-moment RMSE, MAE, and R² to 0.12 µB, 0.03 µB, and 0.97, respectively.
Despite the fundamental importance of magnetic phenomena in science and technology, the scarce number of ML papers applied to magnetic materials shows that the field is still in its infancy.

3.2.4 Topological materials
Topological phases of matter cannot be characterized in terms of a local order parameter [483], as has previously been done for a great variety of physical properties. The topological classification requires the calculation of a topological invariant, which typically depends on the Berry curvature associated with all occupied states satisfying a given symmetry [483], e.g., time-reversal, mirror, or rotation symmetry. For instance, quantum spin Hall insulators (QSHIs) and topological crystalline insulators (TCIs) are two-dimensional (2D) materials protected by the TR and mirror symmetry, respectively. These systems are characterized, in turn, by a non-zero topological invariant $Z_2 = 1$ [484–486] and a mirror Chern number $C_M \neq 0$ [352, 487, 488]. In TCIs, the mirror Chern number is calculated from the Berry curvature $\Omega_n^{\pm i}(k_x, k_y)$, i.e., $C_{\pm i} = \frac{1}{2\pi} \sum_{n<E_f} \int_{\mathrm{BZ}} \Omega_n^{\pm i}(k_x, k_y)\,dk_x\,dk_y$, where the sum runs only over the occupied states ($n < E_f$). The topological phase can thus be changed by modifying the orbital character of the occupied states [489] or by breaking the symmetry that protects the topological phase. Consequently, quantum phase transitions from trivial (non-topological) insulators to topological insulators can be induced by external perturbations. These topological transitions are usually visualized in the band structure as a band inversion at the symmetry-protected k-points. Since the nontrivial topological classes result from the ground-state many-body wave function, and all occupied states are involved in the topological invariant calculation, the ML prediction of novel TI materials is in some sense counter-intuitive to one of the ideas on which materials prediction is based, namely that physical intuition and experience suggest that many important material properties are primarily determined by just a few key variables.
that many important material properties are primarily determined by just a few key variables.
For its part intuition is not necessarily the best strategy, because there is no rule that allows
a priori to define whether a system features non-zero topological invariants. Although materials
formed by atoms with a large spin-orbit tend to exhibit non-trivial topological phases, this intuitive
belief is not a general trend. For example, topological nodal-line semimetals can be formed by
dM
light atoms. With the modern computing power and access to larger datasets for topological
29 materials (see Section 3.1.2), ML is a natural strategy to be explored. However, from the previous
30 discussion, a fundamental question arose: can the ML algorithms classify topologically ordered
31 states and topological phase transitions? Works addressing this problem can be classified into two
32 different approaches. The first approach is the direct prediction of topological transitions using
33 neural networks. This approach is usually focused on the prediction of invariants for topological
34 phase models. The second approach is based on the material classification of trivial and topological
35 insulators in terms of descriptors. So far, these descriptors are required to depend on the atomic
36 properties and material properties, providing trends in the chemical space for a set of materials in
37 a specific family, e.g., same point group, similar formula, and isoelectronic.
38
pte
39
3.2.4.1 Quantum phase transitions in topological insulator models
Classifying phases of condensed matter models has been a historical task for the understanding of physical phenomena. However, the use of machine learning techniques as an approach to these problems is very recent [490–498]. Supervised learning requires labeling the different topological classes by computing the topological invariant. Remarkably, unsupervised learning also allows for the prediction of phase transitions, opening the way for the discovery of novel quantum phases [491, 492, 499–502]. The Ising model has been widely used as the starting point to demonstrate the success of these techniques in the prediction of phase transitions [490, 495, 498–501]. Topological states can also be learned by artificial neural networks, as discussed below.
The capacity of neural networks to capture the information contained in the wave function of topological and trivial insulators was presented in a pedagogical way by van Nieuwenburg, Liu, and Huber [491]. The authors predicted very accurately the topological transitions of the one-dimensional Kitaev chain using unsupervised learning based on principal component analysis, supervised learning via neural networks, and a scheme combining both supervised and unsupervised methods, referred to as a confusion scheme.
Figure 20. General descriptors for topological insulators (TIs) protected by the time reversal (TR) symmetry: 2D TIs, 3D TIs, and Weyl semimetals. 2D and 3D TIs adapted from [503], © 2013 The Physical Society of Japan (J. Phys. Soc. Jpn. 82, 102001); Weyl semimetal adapted from [504], CC BY 3.0.
The ground state of this model has a topological transition as the chemical potential µ is tuned across µ = ±2t, where t is the hopping term. This result is well established, and it is not necessary to use ML to predict it; however, the demonstration of the technique was relevant for subsequent works. Remarkably, Deng, Li, and Das Sarma demonstrated mathematically that certain topological states can be represented with classical artificial neural networks [505]. They introduced a further-restricted restricted Boltzmann machine, in which the hidden neurons connect only locally to the visible neurons, enabling the use of artificial-neural-network quantum states to represent topological states, for instance the ground state of the 2D toric code model introduced by Kitaev, which is the simplest spin-liquid ground state exhibiting Z2 topological order [506, 507]. By introducing quantum loop topography, Zhang and Kim showed that a fully connected neural network can be trained to distinguish the Chern insulator and the fractional Chern insulator from trivial insulators [508]. Therefore, artificial-neural-network quantum states are not strictly necessary to apply ML to topological states. In the same spirit, Zhang, Melko, and Kim showed that the phase boundary between the topological and trivial phases of the Z2 quantum spin liquid can be identified by feed-forward neural networks by defining a quantum loop topography sensitive to quasiparticle statistics.
The success of ML techniques in topological phase models aroused interest in experimentally fabricated systems, which in turn gave rise to the study of topological band-insulator models [509], e.g., the Su–Schrieffer–Heeger model. Thus, using the Hamiltonian in momentum space as the input of convolutional neural networks, Zhang, Shen, and Zhai found a model for the topological invariant of general one-dimensional models, i.e., the winding number [509]. Although the winding number for a one-dimensional Hamiltonian $H(k) = \tilde{h}_x\sigma_x + \tilde{h}_y\sigma_y$ (with $\tilde{h}_i = h_i/|h(k)|$) is very well established, $w = -\frac{i}{2\pi}\oint_{0}^{2\pi} U^{*}(k)\,\partial_k U(k)\,dk$ with $U(k) = \tilde{h}_x + i\tilde{h}_y$, in ref. [509] the authors found an equivalent neural-network-based expression for more general Hamiltonians. In another work, the authors extended this methodology to four-band insulators in the AIII class and two-dimensional two-band insulators in the A class, arguing that the output of some intermediate hidden layers leads to either the winding angle for models in the AIII class or the solid angle (Berry curvature) for models in the A class, respectively [509]. This suggests that neural networks can capture the mathematical formula of topological invariants. However, the application of these methods to realistic materials with specific symmetries and atomic compositions is still a challenge, which we discuss in the next section.
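The winding-number formula above is easy to verify numerically. The sketch below discretizes the integral as a sum of phase increments of U(k) for an SSH-like two-band model with h(k) = (t1 + t2 cos k, t2 sin k), which has w = 1 for t2 > t1 and w = 0 otherwise.

```python
# Numerical check of the winding number for an SSH-like Hamiltonian.
import numpy as np

def winding_number(t1, t2, nk=2001):
    k = np.linspace(0, 2 * np.pi, nk)
    U = (t1 + t2 * np.cos(k)) + 1j * (t2 * np.sin(k))  # U(k) = h_x + i h_y
    dphase = np.angle(U[1:] / U[:-1])                  # increments of arg U(k)
    return round(dphase.sum() / (2 * np.pi))

print(winding_number(1.0, 0.5))  # trivial phase     -> 0
print(winding_number(0.5, 1.0))  # topological phase -> 1
```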
15
16
us
3.2.4.2 Topological materials classification
17
In the perfect scenario, one wishes to find an n-dimensional space defined by descriptors
18
separating all fabricated materials into regions related to all topological and trivial insulators.
19
Thus, systems characterized by more than one non-zero topological invariant [510], e.g., dual
20
topological insulators, would be in the intersection of the regions describing different topological
21
phases. In these “materials maps”, the boundary of such regions should then be related to the
22
23
24
25
26
27
28
an
topological transitions. Naturally, it is expected that systems protected by different symmetries
will have different trends in the chemical space. The dimensionality is another factor that must be
taken into account, i.e., two-dimensional and three-dimensional systems formed by the same atoms
are not necessarily protected by the same symmetry neither part of the same topological class.
Indeed, topologically protected states in different dimensions have different electronic properties,
as shown in Figure 20 for topological insulators protected by the TR-symmetry. Additionally, it has
dM
not yet been demonstrated that a descriptor classifying TIs from trivial insulators with a specific
29
symmetry and dimension is transferable to another material family (see discussion in section 2.3.2).
30
Here, we will comment on some of the progress that has been made in the classification of these
31
materials.
32
As previously discussed, the ML classification provides “materials maps” whose axes are defined
33
by descriptors. These descriptors are expected to be related to the key properties that are behind
34
the material property that differentiates material classes, e.g., metals and non-metals [205]. The
35
material map separating QSHIs from trivial insulators and metals (see Figure 21) developed by
36
Mera Acosta et al., is an example of the success of machine learning to create models to classify
37
systems with different topological phases [2]. Using the SISSO method, the authors selected a
38
descriptor for functionalized honeycomb lattice materials from a feature space of ten millions of
pte
39
combinations of atomic properties. Besides confirming the QSHI character of known materials,
40
the study revealed several other yet unreported QSHIs. Additionally, the authors found that the
41
descriptors are proportional to the separation between the states involved in the band inversion.
42
Thus, not only the band inversion can be predicted considering only atomic properties, but also
43
the topological bandgap. This study combines high-throughput DFT calculations for 220 materials
44
with ML classification to understand the topological transition in two-dimensional systems. Cao et
45
al. extended this approach to classify tetradymite compounds, demonstrating that the topological
46
transition in three-dimensional materials can be learned and described in terms of a few atomic
ce
47
properties (see Figure 21) [511], i.e., the atomic number and electronegativity. The authors found
48
a predictive accuracy as high as 97%, which suggests that the descriptor capture the essential
49
nature of TIs, and hence, it could be used to fast screen other potential TIs. Subsequently, also
50
using the SISSO method, Liu et al. shows that a one-dimensional descriptor is capable to classify
51
materials as trivial and TIs in half-Heusler family [512]. This descriptor is defined by the atomic
52 number, the valence electron number, and the Pauli electronegativity of the constituent atoms. The
Ac
53
54
55
56
57
58
59
60
Page 51 of 74 AUTHOR SUBMITTED MANUSCRIPT - JPMATER-100093.R1
1
2
3
51
4
pt
5
6
7
8
9
cri
10
11
12
13
14
15
16
us
17
18
19
20 Figure 21. Materials maps for two- and three-dimensional compounds protected by the time
21 reversal (TR) symmetry. For two-dimensional TIs (left), i.e., quantum spin Hall insulators
22 (QSHIs), we show as an example the honeycomb functionalized compounds [2]. Here, there
23
24
25
26
27
28
superposition [511]. an
are materials whose topological class depends on functionalization (FD-QSHIs) and systems in
which the topological character is independent of functionalization (FI-QSHIs). The descriptors
D1 and D2 classifying tetradymite compounds (right) result in a convex-hull with very small
authors performed DFT calculations to verify the reliability and predictive power of the proposed
dM
descriptor, discovering 161 potential TIs within the half-Heusler family. Although the atomic
29
number is a common feature in all the descriptors found using the SISSO method [2, 511, 512], only
30
this parameter is not enough to explain the topological transitions. Certainly, the demonstration
31
of the existence of a general descriptor is still an open question.
32
33
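The screening step at the heart of these SISSO-based searches can be illustrated with a toy example. The sketch below is a schematic stand-in, not the actual SISSO code of ref. [205]: the primary features, the operator set, and the synthetic labeling rule are arbitrary assumptions, and real applications screen millions of candidate combinations with a sparsifying-operator step on top.

import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical per-compound primary features (atomic numbers and
# electronegativities of two constituent atoms) and a synthetic binary
# "topological" label standing in for a DFT-computed band inversion.
n = 200
X = {"Z_A": rng.uniform(10, 80, n), "Z_B": rng.uniform(10, 80, n),
     "chi_A": rng.uniform(0.8, 3.5, n), "chi_B": rng.uniform(0.8, 3.5, n)}
y = (X["Z_A"] * X["chi_B"] - X["Z_B"] * X["chi_A"] > 0).astype(int)

# Step 1 (feature construction): combine primary features with simple
# algebraic operations, as SISSO does on a vastly larger scale.
candidates = dict(X)
for (na, a), (nb, b) in itertools.combinations(X.items(), 2):
    candidates[f"{na}+{nb}"] = a + b
    candidates[f"{na}-{nb}"] = a - b
    candidates[f"{na}*{nb}"] = a * b

# Step 2 (screening): rank candidates by correlation with the label and
# keep the best one as a one-dimensional descriptor for a linear model.
scores = {name: abs(np.corrcoef(f, y)[0, 1]) for name, f in candidates.items()}
best = max(scores, key=scores.get)
d = candidates[best].reshape(-1, 1)
clf = LogisticRegression().fit(d, y)
print(best, clf.score(d, y))

In the published studies, the axes of the resulting materials maps are exactly such screened combinations of atomic properties, and the classifier boundary plays the role of the topological-transition line.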
3.2.5 Superconductivity

The discovery of superconductivity dates to 1911, and 46 years passed until the BCS theory managed to explain its properties. The first unconventional, or high-TC, superconductor was reported in 1975 and, despite enormous efforts from the scientific community, to the present date no theory has managed to capture the problem's full complexity. This lack of a comprehensive theory, capable not only of explaining but also of predicting novel superconductors, opens a wide road for modern computational science. Modern attempts used support vector regression to develop a model estimating the TC of doped MgB2 superconductors [513]. More recently, Stanev et al. [514] presented the most comprehensive study of superconductivity using MI. The methodology combines data mining and ML with the random forest algorithm (see Section 2.3.2) to investigate ≈16,400 superconductors harvested from the SuperCon experimental database [515]. They obtained the experimental chemical composition and TC from the database and compared the ML results using only elemental features, constructed with the Materials Agnostic Platform for Informatics and Exploration (Magpie) [206]. The resulting regression model reaches R² = 0.88, which is impressive given the compositional diversity of the dataset. With this model, 35 new non-cuprate and non-iron-based oxides were identified as possible superconductors. The range of crystal symmetries among them is an interesting and unanticipated outcome: 14 orthorhombic, 9 monoclinic, 6 hexagonal, 5 cubic, and 1 trigonal crystals. This path opens several possibilities for the discovery of novel superconductors.
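The workflow of such a study condenses to a few steps: assemble compositions and measured TC values, featurize each composition with elemental statistics, train an ensemble regressor, and screen new candidates. The sketch below is our schematic with random placeholder arrays, not Stanev et al.'s pipeline; in the actual study, the features are Magpie elemental-property statistics [206] and the targets come from SuperCon [515]:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Placeholder feature matrix: one row per compound, one column per
# composition-derived elemental statistic (mean, range, ... of elemental
# properties). A synthetic Tc stands in for the experimental values.
n_compounds, n_features = 1000, 20
X = rng.normal(size=(n_compounds, n_features))
Tc = 10.0 * np.abs(X[:, :5].sum(axis=1)) + rng.normal(scale=2.0, size=n_compounds)

X_tr, X_te, y_tr, y_te = train_test_split(X, Tc, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)
print("held-out R2:", r2_score(y_te, model.predict(X_te)))

The trained model can then be applied to featurized entries of a materials database to flag compositions with high predicted TC, which is how the 35 candidate oxides above were singled out.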
4 Conclusions and outlook

The data-driven era for materials discovery has been established by the Materials Genome Initiative, and the scientific community has just started embracing it. Electronic structure methods, led by density functional theory, and statistical learning methods, or simply machine learning algorithms, underwent great improvements over the last decades. These are a consequence of the advances in computational capabilities, the development of novel algorithms, and the availability of data storage infrastructures. Their convergence represents a very fruitful and promising scenario for materials discovery. The major outcome of such convergence is the approximation between computational predictions and experimental realizations of novel materials. Therefore, the goal of reducing the time-to-market of new materials is starting to become a reality.

Successful applications exploring the above techniques have started to appear in the form of regression and classification models for the prediction of basic properties, such as electronic band gaps, formation energies, and crystalline structures. The area of atomistic potentials has benefited early from machine learning methods and, as a consequence, shows relative maturity. Conversely, niches such as magnetic, superconductive, and other complex phenomena have just begun to be addressed. Nevertheless, they show great potential for further breakthroughs. Disentangling high-dimensional correlations is precisely where machine learning algorithms excel. Propelled by the recent creation of large databases, we also foresee intense activity in the areas of 2D materials and symmetry-protected topological materials in the near future, regarding machine learning applications.

Lastly, the materials research field is shifting into a new paradigm of data-driven science. Relative success has been shown; nevertheless, the construction of a broader route is an ongoing process. The possibilities and limitations are only starting to be grasped by the community, and the ever-increasing amount of scientific data invites theoretical, computational, and experimental scientists to explore it.
Acknowledgments

GRS, ACMP, CMA, MC, and AF acknowledge financial support from the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), project numbers 2017/18139-6, 18/05565-0, 18/11856-7, 16/14011-2, and 17/02317-2.
References

[1] J. Simon and M. Greiner, Nature 483, 282 (2012).
[2] C. Mera Acosta, R. Ouyang, A. Fazzio, M. Scheffler, L. M. Ghiringhelli, and C. Carbogno, ArXiv e-prints (2018), arXiv:1805.10950 [cond-mat.mtrl-sci].
[3] N. Mounet, M. Gibertini, P. Schwaller, D. Campi, A. Merkys, A. Marrazzo, T. Sohier, I. E. Castelli, A. Cepellotti, G. Pizzi, and N. Marzari, Nature Nanotechnology 13, 246 (2018), arXiv:1611.05234.
[4] G. Bell, T. Hey, and A. Szalay, Science 323, 1297 (2009).
[5] J. Gray, in The Fourth Paradigm: Data-Intensive Scientific Discovery, edited by T. Hey, S. Tansley, and K. Tolle (Microsoft Research, Redmond, 2009) pp. xvii–xxxi.
[6] A. Agrawal and A. Choudhary, APL Materials 4, 053208 (2016).
[8] B. Sun, M. Fernandez, and A. S. Barnard, Nanoscale Horizons 1, 89 (2016).
[9] T. Kuhn, The Structure of Scientific Revolutions (University of Chicago Press, Chicago, 1962).
[10] A. Jain, K. A. Persson, and G. Ceder, APL Materials 4, 053102 (2016).
[11] C. L. Magee, Complexity 18, 10 (2012).
[12] T. W. Eagar, Technology Review 98, 42 (1995).
[13] S. Curtarolo, G. L. W. Hart, M. B. Nardelli, N. Mingo, S. Sanvito, and O. Levy, Nature Materials 12, 191 (2013).
[14] P. Gribbon and S. Andreas, Drug Discovery Today 10, 17 (2005).
[15] D. A. Pereira and J. A. Williams, British Journal of Pharmacology 152, 53, https://bpspubs.onlinelibrary.wiley.com/doi/pdf/10.1038/sj.bjp.0707373.
[16] J. Allison, JOM 63, 15 (2011).
[17] J. A. Warren, MRS Bulletin 43, 452 (2018).
[18] M. de Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. Krishna Ande, S. van der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data 2, 150009 (2015).
[19] J. J. de Pablo, B. Jones, C. L. Kovacs, V. Ozolins, and A. P. Ramirez, Curr. Opin. Solid State Mater. Sci. 18, 99 (2014).
[20] R. Dehghannasiri, D. Xue, P. V. Balachandran, M. R. Yousefi, L. A. Dalton, T. Lookman, and E. R. Dougherty, Computational Materials Science 129, 311 (2017).
[21] J. Glick, in Informatics for Materials Science and Engineering (Elsevier, 2013) pp. 147–187.
[22] K. T. Butler, D. W. Davies, H. Cartwright, O. Isayev, and A. Walsh, Nature 559, 547 (2018).
[23] E. Schrödinger, Phys. Rev. 28, 1049 (1926).
[24] P. A. M. Dirac, Proc. R. Soc. A Math. Phys. Eng. Sci. 123, 714 (1929).
[25] D. R. Hartree, Math. Proc. Cambridge Philos. Soc. 24, 111 (1928).
[26] L. H. Thomas, Math. Proc. Cambridge Philos. Soc. 23, 542 (1927).
[27] P. Hohenberg and W. Kohn, Physical Review 136, B864 (1964).
[28] W. Kohn and L. J. Sham, Physical Review 140, A1133 (1965).
[35] … K. F. Garrity, L. Genovese, P. Giannozzi, M. Giantomassi, S. Goedecker, X. Gonze, O. Gr, D. R. Hamann, P. J. Hasnip, N. A. W. Holzwarth, D. Ius, D. B. Jochym, D. Jones, G. Kresse, K. Koepernik, K. Emine, I. L. M. Locht, S. Lubeck, M. Marsman, N. Marzari, U. Nitzsche, L. Nordstr, T. Ozaki, L. Paulatto, C. J. Pickard, W. Poelmans, and I. J. Matt, Science 351, 1 (2016).
[36] J. Ihm, A. Zunger, and M. L. Cohen, J. Phys. C Solid State Phys. 12, 4409 (1979).
[37] J. Ihm, A. Zunger, and M. L. Cohen, J. Phys. C Solid State Phys. 13, 516 (1980).
[38] J. P. Perdew, AIP Conference Proceedings 577, 1 (2001).
[39] J. P. Perdew, K. Burke, and M. Ernzerhof, Phys. Rev. Lett. 77, 3865 (1996).
[40] J. P. Perdew and Y. Wang, Phys. Rev. B 45, 13244 (1992).
[41] A. D. Becke, Phys. Rev. A 38, 3098 (1988).
[42] C. Lee, W. Yang, and R. G. Parr, Phys. Rev. B 37, 785 (1988).
[43] J. Tao, J. P. Perdew, V. N. Staroverov, and G. E. Scuseria, Phys. Rev. Lett. 91, 146401 (2003), arXiv:cond-mat/0306203.
[44] J. Sun, R. C. Remsing, Y. Zhang, Z. Sun, A. Ruzsinszky, H. Peng, Z. Yang, A. Paul, U. Waghmare, X. Wu, M. L. Klein, and J. P. Perdew, Nat. Chem. 8, 831 (2016), arXiv:1511.01089.
[45] F. Tran and P. Blaha, Phys. Rev. Lett. 102, 226401 (2009).
[46] L. A. Agapito, S. Curtarolo, and M. Buongiorno Nardelli, Phys. Rev. X 5, 011006 (2015), arXiv:1406.3259.
[56] S. Grimme, J. Comput. Chem. 25, 1463 (2004).
[57] A. Tkatchenko and M. Scheffler, Physical Review Letters 102, 073005 (2009).
[58] E. Runge and E. K. U. Gross, Phys. Rev. Lett. 52, 997 (1984).
[59] M. Petersilka, U. J. Gossmann, and E. K. U. Gross, Phys. Rev. Lett. 76, 1212 (1996).
[60] C. A. Ullrich and Z.-h. Yang, Brazilian J. Phys. 44, 154 (2014), arXiv:1305.1388.
[61] A. I. Liechtenstein, V. I. Anisimov, and J. Zaanen, Phys. Rev. B 52, R5467 (1995).
[62] S. Dudarev and G. Botton, Phys. Rev. B 57, 1505 (1998).
[63] L. Hedin, Phys. Rev. 139, A796 (1965).
[64] F. Aryasetiawan and O. Gunnarsson, Reports Prog. Phys. 61, 237 (1998).
[65] X. Blase, I. Duchemin, and D. Jacquemin, Chem. Soc. Rev. 47, 1022 (2018).
[67] G. Kotliar, S. Y. Savrasov, K. Haule, V. S. Oudovenko, O. Parcollet, and C. A. Marianetti, Rev. Mod. Phys. 78, 865 (2006).
[68] A. Paul and T. Birol, "Applications of DFT+DMFT in materials science," (2018), arXiv:1809.09246.
[69] M. Costa, P. Thunström, I. Di Marco, A. Bergman, A. B. Klautau, A. I. Lichtenstein, M. I. Katsnelson, and O. Eriksson, Phys. Rev. B 87, 115142 (2013).
[70] M. Aichhorn, L. Pourovskii, P. Seth, V. Vildosola, M. Zingl, O. E. Peil, X. Deng, J. Mravlje, G. J. Kraberger, C. Martins, M. Ferrero, and O. Parcollet, Computer Physics Communications 204, 200 (2016).
[71] S. Goedecker, Rev. Mod. Phys. 71, 1085 (1999).
[72] D. R. Bowler and T. Miyazaki, Reports Prog. Phys. 75, 036503 (2012).
[73] L. E. Ratcliff, S. Mohr, G. Huhs, T. Deutsch, M. Masella, and L. Genovese, Wiley Interdiscip. Rev. Comput. Mol. Sci. 7, e1290 (2017).
[74] G. K. Madsen, J. Carrete, and M. J. Verstraete, Computer Physics Communications 231, 140 (2018).
[75] G. Pizzi, D. Volja, B. Kozinsky, M. Fornari, and N. Marzari, Computer Physics Communications 185, 422 (2014).
[76] W. Li, J. Carrete, N. A. Katcho, and N. Mingo, Comp. Phys. Commun. 185, 1747–1758 (2014).
[82] F. D. Novaes, A. J. R. da Silva, and A. Fazzio, Brazilian Journal of Physics 36, 799 (2006).
[83] A. R. Rocha, V. M. García-Suárez, S. Bailey, C. Lambert, J. Ferrer, and S. Sanvito, Phys. Rev. B 73, 085414 (2006).
[84] A. Marini, C. Hogan, M. Grüning, and D. Varsano, Computer Physics Communications 180, 1392 (2009).
[85] J. Deslippe, G. Samsonidze, D. A. Strubbe, M. Jain, M. L. Cohen, and S. G. Louie, Computer Physics Communications 183, 1269 (2012).
… arXiv:1412.8405v1.
[90] W. Kohn, Rev. Mod. Phys. 71, 1253 (1999).
[91] J. P. Perdew and A. Ruzsinszky, Int. J. Quantum Chem. 110, 2801 (2010).
[92] K. Burke, J. Chem. Phys. 136 (2012), 10.1063/1.4704546, arXiv:1201.3679.
[93] G. Kresse and J. Hafner, Phys. Rev. B 47, 558 (1993).
[94] G. Kresse and J. Furthmüller, Phys. Rev. B 54, 11169 (1996).
[95] G. Kresse and J. Hafner, Phys. Rev. B 49, 14251 (1994).
[96] G. Kresse and J. Furthmüller, Comput. Mater. Sci. 6, 15 (1996).
[97] P. Giannozzi, S. Baroni, N. Bonini, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, G. L. Chiarotti, M. Cococcioni, I. Dabo, A. Dal Corso, S. de Gironcoli, S. Fabris, G. Fratesi, R. Gebauer, U. Gerstmann, C. Gougoussis, A. Kokalj, M. Lazzeri, L. Martin-Samos, N. Marzari, F. Mauri, R. Mazzarello, S. Paolini, A. Pasquarello, L. Paulatto, C. Sbraccia, S. Scandolo, G. Sclauzero, A. P. Seitsonen, A. Smogunov, P. Umari, and R. M. Wentzcovitch, J. Phys. Condens. Matter 21, 395502 (2009), arXiv:0906.2569.
[98] P. Giannozzi, O. Andreussi, T. Brumme, O. Bunau, M. Buongiorno Nardelli, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, M. Cococcioni, N. Colonna, I. Carnimeo, A. Dal Corso, S. de Gironcoli, P. Delugas, R. A. DiStasio, A. Ferretti, A. Floris, G. Fratesi, G. Fugallo, R. Gebauer, U. Gerstmann, F. Giustino, T. Gorni, J. Jia, M. Kawamura, H.-Y. Ko, A. Kokalj, E. Küçükbenli, M. Lazzeri, M. Marsili, N. Marzari, F. Mauri, N. L. Nguyen, H.-V. Nguyen, A. Otero-de-la Roza, L. Paulatto, S. Poncé, D. Rocca, R. Sabatini, B. Santra, M. Schlipf, A. P. Seitsonen, A. Smogunov, I. Timrov, T. Thonhauser, P. Umari, N. Vast, X. Wu, and S. Baroni, J. Phys. Condens. Matter 29, 465901 (2017).
[99] S. J. Clark, M. D. Segall, C. J. Pickard, P. J. Hasnip, M. I. J. Probert, K. Refson, and M. C. Payne, Zeitschrift für Krist. - Cryst. Mater. 220, 567 (2005).
[100] M. D. Segall, P. J. D. Lindan, M. J. Probert, C. J. Pickard, P. J. Hasnip, S. J. Clark, and M. C. Payne, J. Phys. Condens. Matter 14, 2717 (2002).
[101] X. Gonze, F. Jollet, F. Abreu Araujo, D. Adams, B. Amadon, T. Applencourt, C. Audouze, J. M. Beuken, J. Bieder, A. Bokhanchuk, E. Bousquet, F. Bruneval, D. Caliste, M. Côté, F. Dahm, F. Da Pieve, M. Delaveau, M. Di Gennaro, B. Dorado, C. Espejo, G. Geneste, …
[104] S. Goedecker, M. Teter, and J. Hutter, Phys. Rev. B 54, 1703 (1996), arXiv:mtrl-th/9512004.
[105] J. VandeVondele, M. Krack, F. Mohamed, M. Parrinello, T. Chassaing, and J. Hutter, Comput. Phys. Commun. 167, 103 (2005).
[106] M. Krack, Theor. Chem. Acc. 114, 145 (2005).
[107] J. VandeVondele and J. Hutter, J. Chem. Phys. 127, 114105 (2007).
[108] J. Hutter, M. Iannuzzi, F. Schiffmann, and J. VandeVondele, Wiley Interdiscip. Rev. Comput. Mol. Sci. 4, 15 (2014).
[109] D. Marx and J. Hutter, "Ab-initio molecular dynamics: Theory and implementation," in Modern Methods and Algorithms of Quantum Chemistry, NIC, edited by J. Grotendorst (Forschungszentrum Jülich, 2000) Chap. 13, pp. 301–449, publicly available at http://www2.fz-juelich.de/nic-series/Volume3/marx.pdf.
[110] W. Andreoni and A. Curioni, Parallel Computing 26, 819 (2000).
[111] D. Marx and J. Hutter, Ab Initio Molecular Dynamics (Cambridge University Press, 2009), http://www.cambridge.org/gb/knowledge/isbn/item2327682/.
[112] C.-K. Skylaris, P. D. Haynes, A. A. Mostofi, and M. C. Payne, J. Chem. Phys. 122, 084119 (2005).
[113] S. Mohr, L. E. Ratcliff, L. Genovese, D. Caliste, P. Boulanger, S. Goedecker, and T. Deutsch, Phys. Chem. Chem. Phys. 17, 31360 (2015), arXiv:1501.0588.
[114] M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, G. Scalmani, V. Barone, G. A. Petersson, H. Nakatsuji, X. Li, M. Caricato, A. V. Marenich, J. Bloino, B. G. Janesko, R. Gomperts, B. Mennucci, H. P. Hratchian, J. V. Ortiz, A. F. Izmaylov, J. L. Sonnenberg, D. Williams-Young, F. Ding, F. Lipparini, F. Egidi, J. Goings, B. Peng, A. Petrone, T. Henderson, D. Ranasinghe, V. G. Zakrzewski, J. Gao, N. Rega, G. Zheng, W. Liang, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, T. Vreven, K. Throssell, J. A. Montgomery, Jr., J. E. Peralta, F. Ogliaro, M. J. Bearpark, J. J. Heyd, E. N. Brothers, K. N. Kudin, V. N. Staroverov, T. A. Keith, R. Kobayashi, J. Normand, K. Raghavachari, A. P. Rendell, J. C. Burant, S. S. Iyengar, J. Tomasi, M. Cossi, J. M. Millam, M. Klene, C. Adamo, R. Cammi, J. W. Ochterski, R. L. Martin, K. Morokuma, O. Farkas, J. B. Foresman, and D. J. Fox, "Gaussian 16 Revision B.01," (2016), Gaussian Inc., Wallingford CT.
[115] M. W. Schmidt, K. K. Baldridge, J. A. Boatz, S. T. Elbert, M. S. Gordon, …
… Sci. 8, 1 (2018).
[121] Y. Shao, Z. Gan, E. Epifanovsky, A. T. Gilbert, M. Wormit, J. Kussmann, A. W. Lange, A. Behn, J. Deng, X. Feng, D. Ghosh, M. Goldey, P. R. Horn, L. D. Jacobson, I. Kaliman, R. Z. Khaliullin, T. Kuś, A. Landau, J. Liu, E. I. Proynov, Y. M. Rhee, R. M. Richard, M. A. Rohrdanz, R. P. Steele, E. J. Sundstrom, H. L. W. III, P. M. Zimmerman, D. Zuev, B. Albrecht, E. Alguire, B. Austin, G. J. O. Beran, Y. A. Bernard, E. Berquist, K. Brandhorst, K. B. Bravaya, S. T. Brown, D. Casanova, C.-M. Chang, Y. Chen, S. H. Chien, K. D. Closser, D. L. Crittenden, M. Diedenhofen, R. A. D. Jr., H. Do, A. D. Dutoi, R. G. Edgar, S. Fatehi, L. Fusti-Molnar, A. Ghysels, A. Golubeva-Zadorozhnaya, J. Gomes, M. W. Hanson-Heine, P. H. Harbach, A. W. Hauser, E. G. Hohenstein, Z. C. Holden, T.-C. Jagau, H. Ji, B. Kaduk, K. Khistyaev, J. Kim, J. Kim, R. A. King, P. Klunzinger, D. Kosenkov, T. Kowalczyk, C. M. Krauter, K. U. Lao, A. D. Laurent, K. V. Lawler, S. V. Levchenko, C. Y. Lin, F. Liu, E. Livshits, R. C. Lochan, A. Luenser, P. Manohar, S. F. Manzer, S.-P. Mao, N. Mardirossian, A. V. Marenich, S. A. Maurer, N. J. Mayhall, E. Neuscamman, C. M. Oana, R. Olivares-Amaya, D. P. O'Neill, J. A. Parkhill, T. M. Perrine, R. Peverati, A. Prociuk, D. R. Rehn, E. Rosta, N. J. Russ, S. M. Sharada, S. Sharma, D. W. Small, A. Sodt, T. Stein, D. Stück, Y.-C. Su, A. J. Thom, T. Tsuchimochi, V. Vanovschi, L. Vogt, O. Vydrov, T. Wang, M. A. Watson, J. Wenzel, A. White, C. F. Williams, J. Yang, S. Yeganeh, S. R. Yost, Z.-Q. You, I. Y. Zhang, X. Zhang, Y. Zhao, B. R. Brooks, G. K. Chan, D. M. Chipman, C. J. Cramer, W. A. G. III, M. S. Gordon, W. J. Hehre, A. Klamt, H. F. S. III, M. W. Schmidt, C. D. Sherrill, D. G. Truhlar, A. Warshel, X. Xu, A. Aspuru-Guzik, R. Baer, A. T. Bell, N. A. Besley, J.-D. Chai, A. Dreuw, B. D. Dunietz, T. R. Furlani, S. R. Gwaltney, C.-P. Hsu, Y. Jung, J. Kong, D. S. Lambrecht, W. Liang, C. Ochsenfeld, V. A. Rassolov, L. V. Slipchenko, J. E. Subotnik, T. V. Voorhis, J. M. Herbert, A. I. Krylov, P. M. Gill, and M. Head-Gordon, Molecular Physics 113, 184 (2015), https://doi.org/10.1080/00268976.2014.952696.
[122] V. Blum, R. Gehrke, F. Hanke, P. Havu, V. Havu, X. Ren, K. Reuter, and M. Scheffler, Comput. Phys. Commun. 180, 2175 (2009).
[123] X. Andrade, D. Strubbe, U. De Giovannini, A. H. Larsen, M. J. T. Oliveira, J. Alberdi-Rodriguez, A. Varas, I. Theophilou, N. Helbig, M. J. Verstraete, L. Stella, F. Nogueira, A. Aspuru-Guzik, A. Castro, M. A. L. Marques, and A. Rubio, Phys. Chem. Chem. Phys. 17, 31371 (2015), arXiv:1501.05654.
[124] A. Castro, H. Appel, M. Oliveira, C. A. Rozzi, X. Andrade, F. Lorenzen, M. A. L. Marques, E. K. U. Gross, and A. Rubio, Phys. status solidi 243, 2465 (2006).
[125] M. Marques, A. Castro, G. F. Bertsch, and A. Rubio, Comput. Phys. Commun. 151, 60 (2003).
[126] J. J. Mortensen, L. B. Hansen, and K. W. Jacobsen, Phys. Rev. B 71, 1 (2005), arXiv:cond-mat/0411218.
[127] J. Enkovaara, C. Rostgaard, J. J. Mortensen, J. Chen, M. Dułak, L. Ferrighi, J. Gavnholt, C. Glinsvad, V. Haikola, H. A. Hansen, H. H. Kristoffersen, M. Kuisma, A. H. Larsen, L. Lehtovaara, M. Ljungberg, O. Lopez-Acevedo, P. G. Moses, J. Ojanen, T. Olsen, V. Petzold, N. A. Romero, J. Stausholm-Møller, M. Strange, G. A. Tritsaris, M. Vanin, M. Walter, B. Hammer, H. Häkkinen, G. K. H. Madsen, R. M. Nieminen, J. K. Nørskov, M. Puska, T. T. Rantala, J. Schiøtz, K. S. Thygesen, and K. W. Jacobsen, J. Phys. Condens. Matter 22, 253202 (2010).
[128] K. Schwarz and P. Blaha, Comput. Mater. Sci. 28, 259 (2003).
[129] A. Gulans, S. Kontur, C. Meisenbichler, D. Nabok, P. Pavone, S. Rigamonti, S. Sagmeister, U. Werner, and C. Draxl, J. Phys. Condens. Matter 26, 363202 (2014).
[130] S. Blügel and G. Bihlmayer, "The Full-Potential Linearized Augmented Plane Wave Method," in Computational Nanoscience: Do It Yourself!, NIC series, Vol. 31 (John von Neumann Institute for Computing, Jülich, 2006) pp. 85–129.
[131] R. P. Feynman, Physical Review 56, 340 (1939).
[132] A. R. Oganov and C. W. Glass, The Journal of Chemical Physics 124, 244704 (2006), arXiv:1604.08746.
[133] A. R. Oganov, A. O. Lyakhov, and M. Valle, Acc. Chem. Res. 44, 227 (2011).
[134] A. O. Lyakhov, A. R. Oganov, H. T. Stokes, and Q. Zhu, Comput. Phys. Commun. 184, 1172 (2013).
[135] S. Heiles and R. L. Johnston, Int. J. Quantum Chem. 113, 2091 (2013).
[136] Z. Li and H. A. Scheraga, Proc. Natl. Acad. Sci. 84, 6611 (1987).
[137] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller, J. Chem. Phys. 21, 1087 (1953).
[138] C. J. Pickard and R. J. Needs, Journal of Physics: Condensed Matter 23, 053201 (2011), arXiv:1101.3987.
[139] Y. Wang, J. Lv, L. Zhu, and Y. Ma, Phys. Rev. B 82, 094116 (2010), arXiv:1008.3601.
[140] Y. Wang, J. Lv, L. Zhu, and Y. Ma, Comput. Phys. Commun. 183, 2063 (2012), arXiv:1205.2264 [cond-mat.mtrl-sci].
[141] S. Goedecker, The Journal of Chemical Physics 120, 9911 (2004), https://doi.org/10.1063/1.1724816.
[142] A. Zunger, Nat. Rev. Chem. 2, 0121 (2018).
[143] D. Yang, J. Lv, X. Zhao, Q. Xu, Y. Fu, Y. Zhan, A. Zunger, and L. Zhang, Chem. Mater. 29, 524 (2017), arXiv:1611.08032.
[144] N. Nosengo, Nature 533, 22 (2016).
[145] G. N. Simm, A. C. Vaucher, and M. Reiher, J. Phys. Chem. A 123, 385 (2019).
[146] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, J. Bouwman, A. J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. T. Evelo, R. Finkers, A. Gonzalez-…
[150] J. E. Saal, S. Kirklin, M. Aykol, B. Meredig, and C. Wolverton, JOM 65, 1501 (2013).
[151] NOMAD, "The Novel Materials Discovery (NOMAD) Repository," (2017).
[152] M. Hellenbrandt, Crystallography Reviews 10, 17 (2004), https://doi.org/10.1080/08893110410001664882.
[153] S. Gražulis, D. Chateigner, R. T. Downs, A. F. T. Yokochi, M. Quirós, L. Lutterotti, E. Manakova, J. Butkus, P. Moeck, and A. Le Bail, Journal of Applied Crystallography 42, 726 (2009).
[154] S. Gražulis, D. Chateigner, R. T. Downs, A. F. T. Yokochi, M. Quirós, L. Lutterotti, E. Manakova, J. Butkus, P. Moeck, and A. Le Bail, J. Appl. Crystallogr. 42, 726 (2009).
[155] D. D. Landis, J. S. Hummelshoj, S. Nestorov, J. Greeley, M. Dulak, T. Bligaard, J. K. Norskov, and K. W. Jacobsen, Computing in Science & Engineering 14, 51 (2012), https://aip.scitation.org/doi/pdf/10.1109/MCSE.2012.16.
[156] S. S. Borysov, R. M. Geilhufe, and A. V. Balatsky, PLoS One 12, e0171501 (2017).
[157] M. Ashton, J. Paul, S. B. Sinnott, and R. G. Hennig, Physical Review Letters 118, 106101 (2017).
[158] K. Choudhary, I. Kalish, R. Beams, and F. Tavazza, Scientific Reports 7, 5179 (2017).
[159] K. Choudhary, G. Cheon, E. Reed, and F. Tavazza, Phys. Rev. B 98, 014107 (2018).
[160] J. Hill, A. Mannodi-Kanakkithodi, R. Ramprasad, and B. Meredig, "Materials data infrastructure and materials informatics," in Computational Materials System Design, edited by D. Shin and J. Saal (Springer International Publishing, Cham, 2018) pp. 193–225.
[161] J. Hachmann, R. Olivares-Amaya, S. Atahan-Evrenk, C. Amador-Bedolla, R. S. Sánchez-Carrera, A. Gold-Parker, L. Vogt, A. M. Brockway, and A. Aspuru-Guzik, The Journal of Physical Chemistry Letters 2, 2241 (2011), https://doi.org/10.1021/jz200866s.
[162] S. Haastrup, M. Strange, M. Pandey, T. Deilmann, P. S. Schmidt, N. F. Hinsche, M. N. Gjerding, D. Torelli, P. M. Larsen, A. C. Riis-Jensen, J. Gath, K. W. Jacobsen, J. Jørgen Mortensen, T. Olsen, and K. S. Thygesen, 2D Materials 5, 042002 (2018), arXiv:1806.03173.
[163] A. H. Larsen, J. J. Mortensen, J. Blomqvist, I. E. Castelli, R. Christensen, M. Dułak, J. Friis, M. N. Groves, B. Hammer, C. Hargus, E. D. Hermes, P. C. Jennings, P. B. Jensen, J. Kermode, J. R. Kitchin, E. L. Kolsbjerg, J. Kubal, K. Kaasbjerg, S. Lysgaard, J. B. Maronsson, T. Maxson, T. Olsen, L. Pastewka, A. Peterson, C. Rostgaard, J. Schiøtz, O. Schütt, M. Strange, K. S. Thygesen, T. Vegge, L. Vilhelmsen, M. Walter, Z. Zeng, and …
[169] H. Lambert, A. Fekete, J. Kermode, and A. D. Vita, Comput. Phys. Commun. 232, 256 (2018).
[170] A. Jain, S. P. Ong, W. Chen, B. Medasani, X. Qu, M. Kocher, M. Brafman, G. Petretto, G.-M. Rignanese, G. Hautier, D. Gunter, and K. A. Persson, Concurrency and Computation: Practice and Experience 27, 5037 (2015).
[171] J. Glick, in Informatics for Materials Science and Engineering (Elsevier, 2013) pp. 147–187.
[172] J. Greeley, T. F. Jaramillo, J. Bonde, I. Chorkendorff, and J. K. Nørskov, Nature Materials 5, 909 (2006).
[173] K. Yang, W. Setyawan, S. Wang, M. Buongiorno Nardelli, and S. Curtarolo, Nature Materials 11, 614 (2012).
[174] E. P. Wigner, Communications on Pure and Applied Mathematics 13, 1 (1960).
[175] A. Halevy, P. Norvig, and F. Pereira, IEEE Intelligent Systems 24, 8 (2009).
[176] K. P. Murphy, Machine Learning: A Probabilistic Perspective (MIT Press, Cambridge, MA, 2012).
[177] A. L. Samuel, IBM Journal of Research and Development 3, 210 (1959).
[178] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016), http://www.deeplearningbook.org.
[179] S. W. Knox, Machine Learning, Wiley Series in Probability and Statistics (John Wiley & Sons, Inc., Hoboken, NJ, USA, 2018).
[180] M. L. Hutchinson, E. Antono, B. M. Gibbons, S. Paradiso, J. Ling, and B. Meredig, (2017), arXiv:1711.05099.
[181] H. Li, "Which machine learning algorithm should I use?".
[182] M. Awad and R. Khanna, Efficient Learning Machines (Apress, Berkeley, CA, 2015).
[183] D. H. Wolpert and W. G. Macready, Mach. Learn. 20, 273 (1995).
[184] D. H. Wolpert, Neural Comput. 8, 1341 (1996).
[185] M. van Heel, R. V. Portugal, and M. Schatz, Open J. Stat. 6, 701 (2016).
[186] H. H. Bock, Electron. Journ@l Hist. Probab. Stat. 4, 1 (2008).
[187] F. Brunet, Contributions to Parametric Image Registration and 3D Surface Reconstruction, Ph.D. thesis, Université d'Auvergne, Technische Universität München (2010).
[188] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer Series in Statistics (Springer New York Inc., New York, NY, USA, 2001).
[189] K. P. Murphy, Machine Learning: A Probabilistic Perspective (The MIT Press, 2012).
[190] C. Cortes and V. Vapnik, Mach. Learn. 20, 273 (1995).
[191] R. R. Bouckaert, Proc. 17th Austrian Conference on AI, 1089 (2005).
[192] J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993).
[193] R. Kohavi and R. Quinlan, Lecture Notes 3 (1999).
[194] L. Breiman, Mach. Learn. 45, 5 (2001).
[195] K. Rajan, Mater. Today 8, 38 (2005).
[196] R. P. Feynman, R. B. Leighton, and M. Sands, The Feynman Lectures on Physics, Vol. I: The New Millennium Edition: Mainly Mechanics, Radiation, and Heat, The Feynman …
[197] L. M. Ghiringhelli, J. Vybiral, S. V. Levchenko, C. Draxl, and M. Scheffler, Physical Review Letters 114, 1 (2015).
[198] R. Ramprasad, R. Batra, G. Pilania, A. Mannodi-Kanakkithodi, and C. Kim, npj Computational Materials 3 (2017), 10.1038/s41524-017-0056-5, arXiv:1707.07294.
[199] L. M. Ghiringhelli, J. Vybiral, E. Ahmetcik, R. Ouyang, S. V. Levchenko, C. Draxl, and M. Scheffler, New Journal of Physics 19, 023017 (2017).
[200] "Springer Materials, the most comprehensive collection of data in the fields of physics, physical and inorganic chemistry, materials science, and related fields," (2017).
[201] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, Journal of Machine Learning Research 12, 2825 (2011).
[202] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," (2015), software available from tensorflow.org.
[203] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, in NIPS-W (2017).
[204] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, ACM SIGKDD Explor. Newsl. 11, 10 (2009).
[205] R. Ouyang, S. Curtarolo, E. Ahmetcik, M. Scheffler, and L. M. Ghiringhelli, Physical Review Materials 2, 083802 (2018), arXiv:1710.03319.
[206] L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, npj Computational Materials 2, 1 (2016), arXiv:1606.09551.
[207] L. Ward, A. Dunn, A. Faghaninia, N. E. Zimmermann, S. Bajaj, Q. Wang, J. Montoya, J. Chen, K. Bystrom, M. Dylla, K. Chard, M. Asta, K. A. Persson, G. J. Snyder, I. Foster, and A. Jain, Computational Materials Science 152, 60 (2018).
[208] E. Gossett, C. Toher, C. Oses, O. Isayev, F. Legrain, F. Rose, E. Zurek, J. Carrete, N. Mingo, …
… (2010), arXiv:0910.1019.
[216] A. P. Bartók, R. Kondor, and G. Csányi, Physical Review B 87, 184115 (2013), arXiv:1209.3140.
[217] K. Yao, J. E. Herr, D. W. Toth, R. Mckintyre, and J. Parkhill, Chem. Sci. 9, 2261 (2018), arXiv:1711.06385.
[218] J. S. Smith, O. Isayev, and A. E. Roitberg, Chem. Sci. 8, 3192 (2017), arXiv:1610.08935.
[219] A. Khorshidi and A. A. Peterson, Comput. Phys. Commun. 207, 310 (2016).
[220] H. Wang, L. Zhang, J. Han, and W. E, Computer Physics Communications 228, 178 (2018), arXiv:1712.03641.
[221] N. Artrith and A. Urban, Computational Materials Science 114, 135 (2016).
[222] P. Domingos, Commun. ACM 55, 78 (2012).
[223] L. Ward and C. Wolverton, Current Opinion in Solid State and Materials Science 21, 167 (2016).
[224] B. Huang, N. O. Symonds, and O. A. von Lilienfeld, in Handbook of Materials Modeling (Springer International Publishing, Cham, 2018) pp. 1–27, arXiv:1807.04259.
[225] A. C. Brown and T. R. Fraser, Trans. R. Soc. Edinburgh 25, 151 (1868).
[226] K. Wu, B. Natarajan, L. Morkowchuk, M. Krein, and C. M. Breneman, in Informatics Mater. Sci. Eng. (Elsevier, 2013) pp. 385–422.
[227] A. Seko, H. Hayashi, K. Nakayama, A. Takahashi, and I. Tanaka, Physical Review B 95, 144110 (2017), arXiv:1611.08645.
[228] J. E. Herr, K. Koh, K. Yao, and J. Parkhill, (2018), arXiv:1811.00123.
[229] B. Meredig, A. Agrawal, S. Kirklin, J. E. Saal, J. W. Doak, A. Thompson, K. Zhang, A. Choudhary, and C. Wolverton, Phys. Rev. B 89, 094104 (2014).
[230] T. Lam Pham, H. Kino, K. Terakura, T. Miyake, K. Tsuda, I. Takigawa, and H. Chi Dam, Sci Technol Adv Mater 18, 756 (2017).
[231] T.-L. Pham, N.-D. Nguyen, V.-D. Nguyen, H. Kino, T. Miyake, and H.-C. Dam, The Journal of Chemical Physics 148, 204106 (2018).
[232] O. Isayev, D. Fourches, E. N. Muratov, C. Oses, K. Rasch, A. Tropsha, and S. Curtarolo, Chemistry of Materials 27, 735 (2015).
[233] A. Seko, A. Togo, and I. Tanaka, "Descriptors for machine learning of materials data," in Nanoinformatics, edited by I. Tanaka (Springer Singapore, Singapore, 2018) pp. 3–23.
[234] R. M. Balabin and E. I. Lomakina, Phys. Chem. Chem. Phys. 13, 11710 (2011).
[235] A. P. Bartók and G. Csányi, International Journal of Quantum Chemistry 115, 1051 (2015), arXiv:1502.01366.
[236] J. Behler and M. Parrinello, Phys. Rev. Lett. 98, 146401 (2007).
[237] J. Behler, Int. J. Quantum Chem. 115, 1032 (2015), arXiv:1609.02815.
[238] L. Zhang, J. Han, H. Wang, R. Car, and W. E, Physical Review Letters 120, 143001 (2018), arXiv:1707.09571.
[239] P. M. Larsen, M. Pandey, M. Strange, and K. W. Jacobsen, (2018), arXiv:1808.02114.
[240] W. Kabsch, Acta Crystallographica Section A 32, 922 (1976).
[241] A. Sadeghi, S. A. Ghasemi, B. Schaefer, S. Mohr, M. A. Lill, and S. Goedecker, Journal of Chemical Physics 139 (2013), 10.1063/1.4828704, arXiv:1302.2322.
[242] L. Zhu, M. Amsler, T. Fuhrer, B. Schaefer, S. Faraji, S. Rostami, S. A. Ghasemi, A. Sadeghi, …
[243] G. Ferré, J.-B. Maillet, and G. Stoltz, J. Chem. Phys. 143, 104114 (2015).
[244] X.-T. Li, S.-G. Xu, X.-B. Yang, and Y.-J. Zhao, J. Chem. Phys. 147, 144106 (2017).
[245] S. De, A. P. Bartók, G. Csányi, and M. Ceriotti, Phys. Chem. Chem. Phys. 18, 13754 (2016), arXiv:1601.04077.
[246] P. J. Steinhardt, D. R. Nelson, and M. Ronchetti, Phys. Rev. B 28, 784 (1983).
[247] J. Behler, Journal of Chemical Physics 134 (2011), 10.1063/1.3553717, arXiv:1609.02815.
[248] B. Jiang, J. Li, and H. Guo, International Reviews in Physical Chemistry 35, 479 (2016).
[249] M. Gastegger, L. Schwiedrzik, M. Bittermann, F. Berzsenyi, and P. Marquetand, The Journal of Chemical Physics 148, 241709 (2018), arXiv:1712.05861.
[250] A. Grisafi, D. M. Wilkins, G. Csányi, and M. Ceriotti, Physical Review Letters 120, 036002 (2018), arXiv:1709.06757.
[251] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, Phys. Rev. Lett. 108, 058301 (2012).
[252] K. Hansen, F. Biegler, R. Ramakrishnan, W. Pronobis, O. A. von Lilienfeld, K.-R. Müller, and A. Tkatchenko, The Journal of Physical Chemistry Letters 6, 2326 (2015), arXiv:1109.2618.
[253] R. M. Richard and J. M. Herbert, Journal of Chemical Physics 137 (2012), 10.1063/1.4742816.
[254] K. Yao, J. E. Herr, and J. Parkhill, The Journal of Chemical Physics 146, 014106 (2017).
[255] B. Huang and O. A. von Lilienfeld, The Journal of Chemical Physics 145, 161102 (2016).
[256] W. Pronobis, A. Tkatchenko, and K.-R. Müller, Journal of Chemical Theory and Computation 14, 2991 (2018).
[257] S. M. Kandathil, T. L. Fletcher, Y. Yuan, J. Knowles, and P. L. A. Popelier, Journal of Computational Chemistry 34, 1850 (2013).
[258] K. T. Schütt, H. Glawe, F. Brockherde, A. Sanna, K. R. Müller, and E. K. U. Gross, Physical Review B 89, 205118 (2014), arXiv:1307.1266.
[259] O. A. von Lilienfeld, R. Ramakrishnan, M. Rupp, and A. Knoll, International Journal of Quantum Chemistry 115, 1084 (2015), arXiv:1307.2918.
[260] Z. Li, J. R. Kermode, and A. De Vita, Physical Review Letters 114, 096405 (2015).
[261] A. Thompson, L. Swiler, C. Trott, S. Foiles, and G. Tucker, J. Comput. Phys. 285, 316 (2015), arXiv:1409.3880.
[262] L. Li, T. E. Baker, S. R. White, and K. Burke, Physical Review B 94, 1 (2016), arXiv:1609.03705.
[263] T. Schablitzki, J. Rogal, and R. Drautz, Model. Simul. Mater. Sci. Eng. 21, 075008 (2013).
[264] L. Ward, R. Liu, A. Krishna, V. I. Hegde, A. Agrawal, A. Choudhary, and C. Wolverton, Physical Review B 96, 024104 (2017).
[265] O. Isayev, C. Oses, C. Toher, E. Gossett, S. Curtarolo, and A. Tropsha, Nature Communications 8, 15679 (2017), arXiv:1608.04782.
[266] S. Jindal, S. Chiriki, and S. S. Bulusu, The Journal of Chemical Physics 146, 204301 (2017).
[267] F. A. Faber, L. Hutchison, B. Huang, J. Gilmer, S. S. Schoenholz, G. E. Dahl, O. Vinyals, S. Kearnes, P. F. Riley, and O. A. von Lilienfeld, Journal of Chemical Theory and Computation 13, 5255 (2017), arXiv:1706.08566.
[268] G. Ferré, T. Haut, and K. Barros, The Journal of Chemical Physics 146, 114107 (2017).
[269] N. Artrith, A. Urban, and G. Ceder, Phys. Rev. B 96, 014112 (2017).
[270] S. Chmiela, H. E. Sauceda, K.-R. Müller, and A. Tkatchenko, Nature Communications 9, 3887 (2018), arXiv:1802.09238.
[271] T. Xie and J. C. Grossman, Phys. Rev. Lett. 120, 145301 (2018), arXiv:1710.10324.
[272] H. Ji and Y. Jung, J. Chem. Phys. 148, 241742 (2018).
[273] F. Doshi-Velez and B. Kim, (2017), arXiv:1702.08608v2.
[274] Z. C. Lipton, ACM Queue 16, 30:31 (2018), arXiv:1606.03490v3.
[275] C. Kim, G. Pilania, and R. Ramprasad, Chem. Mater. 28, 1304 (2016).
[276] S. S. Sahoo, C. H. Lampert, and G. Martius, in Proc. 35th Int. Conf. Mach. Learn., Proceedings of Machine Learning Research, Vol. 80, edited by J. Dy and A. Krause (PMLR, Stockholmsmässan, Stockholm, Sweden, 2018) pp. 4442–4450, arXiv:1806.07259.
[277] J. C. Snyder, M. Rupp, K. Hansen, K.-R. Müller, and K. Burke, Physical Review Letters 108, 253002 (2012), arXiv:1112.5441.
[278] J. C. Snyder, M. Rupp, K. Hansen, L. Blooston, K.-R. Müller, and K. Burke, The Journal of Chemical Physics 139, 224104 (2013), arXiv:1306.1812.
[279] L. Li, J. C. Snyder, I. M. Pelaschier, J. Huang, U.-N. Niranjan, P. Duncan, M. Rupp, K.-R. Müller, and K. Burke, International Journal of Quantum Chemistry 116, 819 (2016), arXiv:1404.1333.
[280] J. Seino, R. Kageyama, M. Fujinami, Y. Ikabata, and H. Nakai, J. Chem. Phys. 148, 241705 (2018), arXiv:1712.06113.
[281] F. Brockherde, L. Vogt, L. Li, M. E. Tuckerman, K. Burke, and K.-R. Müller, Nature Communications 8, 872 (2017), arXiv:1609.02815.
[282] E. Schmidt, A. T. Fowler, J. A. Elliott, and P. D. Bristowe, Computational Materials Science 149, 250 (2018).
[283] M. Bogojeski, F. Brockherde, L. Vogt-Maranto, L. Li, M. E. Tuckerman, and K. Burke, (2018), arXiv:1811.06255v1.
[284] T. Mueller, A. G. Kusne, and R. Ramprasad, "Machine Learning in Materials Science," in Reviews in Computational Chemistry, Vol. 29 (John Wiley & Sons, Ltd, 2016) Chap. 4, pp. 186–273.
[285] V. Botu and R. Ramprasad, Int. J. Quantum Chem. 115, 1074 (2015), arXiv:1410.3353.
[286] L. Zhang, J. Han, H. Wang, R. Car, and W. E, Journal of Chemical Physics 149 (2018), arXiv:1802.08549v3.
[287] O. Schütt and J. VandeVondele, Journal of Chemical Theory and Computation 14, 4168 (2018).
[288] J. S. Smith, B. Nebgen, N. Lubbers, O. Isayev, and A. E. Roitberg, J. Chem. Phys. 148, 241733 (2018), arXiv:1801.09319.
[289] J. E. Herr, K. Yao, R. McIntyre, D. W. Toth, and J. Parkhill, J. Chem. Phys. 148, 241710 (2018), arXiv:1712.07240.
[290] H. Li, C. Collins, M. Tanha, G. J. Gordon, and D. J. Yaron, Journal of Chemical Theory and Computation 14, 5764 (2018), arXiv:1808.04526.
[291] T. Gao, H. Li, W. Li, L. Li, C. Fang, H. Li, L. Hu, Y. Lu, and Z.-M. Su, Journal of Cheminformatics 8, 24 (2016).
[292] Q. Liu, J. C. Wang, P. L. Du, L. H. Hu, X. Zheng, and G. H. Chen, Journal of Physical …
[293] A. A. Peterson, The Journal of Chemical Physics 145, 074106 (2016).
[294] P. O. Dral, O. A. von Lilienfeld, and W. Thiel, Journal of Chemical Theory and Computation 11, 2120 (2015).
[295] J. J. Kranz, M. Kubillus, R. Ramakrishnan, O. A. von Lilienfeld, and M. Elstner, Journal of Chemical Theory and Computation 14, 2341 (2018).
[296] G. Hegde and R. C. Bowen, Nature Publishing Group, 1 (2017).
[297] I. Lagaris, A. Likas, and D. Fotiadis, Comput. Phys. Commun. 104, 1 (1997), arXiv:quant-ph/9705029.
[298] G. Carleo and M. Troyer, Science 355, 602 (2017).
[299] P. Teng, Phys. Rev. E 98, 033305 (2018), arXiv:1710.03213.
[300] K. Mills, M. Spanner, and I. Tamblyn, Physical Review A 96, 042113 (2017), arXiv:1702.01361.
[301] C. Desgranges and J. Delhommelle, The Journal of Chemical Physics 149, 044118 (2018).
[302] L. Wang, Physical Review B 94, 195105 (2016), arXiv:1606.00318.
[303] J. Carrasquilla and R. G. Melko, Nature Physics 13, 431 (2017), arXiv:1605.01735.
[304] P. Ponte and R. G. Melko, Physical Review B 96, 205146 (2017), arXiv:1704.05848.
[305] P. Broecker, J. Carrasquilla, R. G. Melko, and S. Trebst, Sci. Rep. 7, 8823 (2017), arXiv:1608.07848.
[318] S. B. Fagan, R. J. Baierle, R. Mota, A. J. R. da Silva, and A. Fazzio, Phys. Rev. B 61, 9994 (2000).
[319] T. M. Schmidt, R. J. Baierle, P. Piquini, and A. Fazzio, Phys. Rev. B 67, 113407 (2003).
[320] E. Z. da Silva, F. D. Novaes, A. J. R. da Silva, and A. Fazzio, Phys. Rev. B 69, 115411 (2004).
[321] S. B. Fagan, A. J. R. da Silva, R. Mota, R. J. Baierle, and A. Fazzio, Phys. Rev. B 67, 033405 (2003).
[322] R. G. Amorim, A. Fazzio, A. Antonelli, F. D. Novaes, and A. J. R. da Silva, Nano Letters 7, 2459 (2007), pMID: 17630813, https://doi.org/10.1021/nl071217v.
[323] S. B. Fagan, R. Mota, A. J. R. da Silva, and A. Fazzio, Nano Letters 4, 975 (2004), https://doi.org/10.1021/nl049805l.
[324] J. T. Paul, A. K. Singh, Z. Dong, H. Zhuang, B. C. Revard, B. Rijal, M. Ashton, A. Linscheid, M. Blonsky, D. Gluhovic, J. Guo, and R. G. Hennig, Journal of Physics: Condensed Matter 29, 473001 (2017).
[325] R. Wu, A. J. Freeman, and G. B. Olson, Science 265, 376 (1994), http://science.sciencemag.org/content/265/5170/376.full.pdf.
[326] T. B. Martins, R. H. Miwa, A. J. R. da Silva, and A. Fazzio, Phys. Rev. Lett. 98, 196803 (2007).
[327] J. E. Padilha, A. Fazzio, and A. J. R. da Silva, Physical Review Letters 114, 066803 (2015).
[328] S. Baroni, S. de Gironcoli, A. Dal Corso, and P. Giannozzi, Rev. Mod. Phys. 73, 515 (2001).
[329] … 5386 (2015).
[330] E. O. Wrasse, A. Torres, R. J. Baierle, A. Fazzio, and T. M. Schmidt, Phys. Chem. Chem. Phys. …
[337] A. Bansil, H. Lin, and T. Das, Rev. Mod. Phys. 88, 021004 (2016).
[338] C. M. Acosta, M. P. Lima, R. H. Miwa, A. J. R. da Silva, and A. Fazzio, Phys. Rev. B 89, 155438 (2014).
[339] C. Mera Acosta, O. Babilonia, L. Abdalla, and A. Fazzio, Phys. Rev. B 94, 041302 (2016).
[340] K. Choudhary, Q. Zhang, A. C. Reid, S. Chowdhury, N. Van Nguyen, Z. Trautt, M. W. Newrock, F. Y. Congo, and F. Tavazza, Sci. Data 5, 180082 (2018).
[341] M. Kuisma, J. Ojanen, J. Enkovaara, and T. T. Rantala, Phys. Rev. B 82, 115106 (2010).
[342] M. C. Gao, J.-W. Yeh, P. K. Liaw, and Y. Zhang, High-Entropy Alloys: Fundamentals and Applications (Springer, 2016).
[343] Y. Lederer, C. Toher, K. S. Vecchio, and S. Curtarolo, Acta Materialia 159, 364 (2018).
[344] G. K. H. Madsen, Journal of the American Chemical Society 128, 12140 (2006).
[345] P. Gorai, V. Stevanović, and E. S. Toberer, Nature Reviews Materials 2, 17053 (2017).
[346] S. Bhattacharya and G. K. H. Madsen, Phys. Rev. B 92, 085205 (2015).
[347] W. Chen, J.-H. Pöhls, G. Hautier, D. Broberg, S. Bajaj, U. Aydemir, Z. M. Gibbs, H. Zhu, M. Asta, G. J. Snyder, B. Meredig, M. A. White, K. Persson, and A. Jain, Journal of Materials Chemistry C 4, 4414 (2016).
[348] L. Yu and A. Zunger, Phys. Rev. Lett. 108, 068701 (2012).
[349] D. J. Baquião and G. M. Dalpian, Computational Materials Science 158, 382 (2019).
[350] C. Mera Acosta, A. Fazzio, and G. M. Dalpian, arXiv e-prints (2019), arXiv:1901.02276 [cond-mat.mtrl-sci].
[351] M. Z. Hasan and C. L. Kane, Rev. Mod. Phys. 82, 3045 (2010).
[352] Y. Ando and L. Fu, Annual Review of Condensed Matter Physics 6, 361 (2015), https://doi.org/10.1146/annurev-conmatphys-031214-014501.
[353] N. P. Armitage, E. J. Mele, and A. Vishwanath, Rev. Mod. Phys. 90, 015001 (2018).
[354] C. Weeks, J. Hu, J. Alicea, M. Franz, and R. Wu, Phys. Rev. X 1, 021001 (2011).
[355] C.-C. Liu, H. Jiang, and Y. Yao, Phys. Rev. B 84, 195430 (2011).
[356] M. Zhou, W. Ming, Z. Liu, Z. Wang, Y. Yao, and F. Liu, Sci. Rep. 4, 7102 (2014).
[357] K. Yang, W. Setyawan, S. Wang, M. Buongiorno Nardelli, and S. Curtarolo, Nature Materials 11, 614 (2012).
[358] G. Cao, H. Liu, X.-Q. Chen, Y. Sun, J. Liang, R. Yu, and Z. Zhang, Science Bulletin 62, 1649 (2017).
[359] H. Zhang, C.-X. Liu, X.-L. Qi, X. Dai, Z. Fang, and S.-C. Zhang, Nature Physics 5, 438 (2009).
[360] D. Xiao, Y. Yao, W. Feng, J. Wen, W. Zhu, X.-Q. Chen, G. M. Stocks, and Z. Zhang, Phys. Rev. Lett. 105, 096404 (2010).
[361] M. Klintenberg, J. T. Haraldsen, and A. V. Balatsky, Applied Physics Research 6, 31 (2014).
[362] M. G. Vergniory, L. Elcoro, C. Felser, B. A. Bernevig, and Z. Wang, ArXiv e-prints (2018), arXiv:1807.10271 [cond-mat.mtrl-sci].
[363] F. Tang, H. C. Po, A. Vishwanath, and X. Wan, ArXiv e-prints (2018), arXiv:1807.09744 [cond-mat.mes-hall].
[364] T. Zhang, Y. Jiang, Z. Song, H. Huang, Y. He, Z. Fang, H. Weng, and C. Fang, ArXiv e-prints (2018), arXiv:1807.08756 [cond-mat.mtrl-sci].
[365] B. Bradlyn, L. Elcoro, J. Cano, M. G. Vergniory, Z. Wang, C. Felser, M. I. Aroyo, and B. A. Bernevig, Nature 547, 298 (2017).
[366] J. Cano, B. Bradlyn, Z. Wang, L. Elcoro, M. G. Vergniory, C. Felser, M. I. Aroyo, and B. A. Bernevig, Phys. Rev. B 97, 035139 (2018).
[367] H. C. Po, A. Vishwanath, and H. Watanabe, Nature Communications 8, 50 (2017).
[368] K. Choudhary, K. F. Garrity, and F. Tavazza, arXiv e-prints (2018), arXiv:1810.10640 [cond-mat.mtrl-sci].
[369] J. Liu and D. Vanderbilt, Phys. Rev. B 90, 125133 (2014).
[370] K. S. Novoselov, Science 306, 666 (2004).
[371] K. S. Novoselov, D. Jiang, F. Schedin, T. J. Booth, V. V. Khotkevich, S. V. Morozov, and A. K. Geim, Proceedings of the National Academy of Sciences 102, 10451 (2005), http://www.pnas.org/content/102/30/10451.full.pdf.
[372] J. C. Alvarez-Quiceno, G. R. Schleder, E. Marinho, and A. Fazzio, J. Phys. Condens. Matter 29, 305302 (2017).
[373] V. Kochat, A. Samanta, Y. Zhang, S. Bhowmick, P. Manimunda, S. A. S. Asif, A. S. Stender, R. Vajtai, A. K. Singh, C. S. Tiwary, and P. M. Ajayan, Sci. Adv. 4, e1701373 (2018).
[374] A. Puthirath Balan, S. Radhakrishnan, C. F. Woellner, S. K. Sinha, L. Deng, C. D. L. Reyes, B. M. Rao, M. Paulose, R. Neupane, A. Apte, V. Kochat, R. Vajtai, A. R. Harutyunyan, C.-W. Chu, G. Costin, D. S. Galvao, A. A. Martí, P. A. van Aken, O. K. Varghese, C. S. Tiwary, A. Malie Madom Ramaswamy Iyer, and P. M. Ajayan, Nat. Nanotechnol. 13, 602 (2018).
[375] T. Björkman, A. Gulans, A. V. Krasheninnikov, and R. M. Nieminen, Phys. Rev. Lett. 108, 235502 (2012).
[376] S. Lebègue, T. Björkman, M. Klintenberg, R. M. Nieminen, and O. Eriksson, Phys. Rev. X 3, 031002 (2013).
[377] A. Gulans, M. J. Puska, and R. M. Nieminen, Phys. Rev. B 79, 201105 (2009).
[378] J. Harl and G. Kresse, Phys. Rev. Lett. 103, 056401 (2009).
[379] M. Dion, H. Rydberg, E. Schröder, D. C. Langreth, and B. I. Lundqvist, Phys. Rev. Lett. 92, 246401 (2004).
[380] K. Lee, E. D. Murray, L. Kong, B. I. Lundqvist, and D. C. Langreth, Phys. Rev. B 82, 081101 (2010).
[381] O. A. Vydrov and T. Van Voorhis, The Journal of Chemical Physics 133, 244103 (2010), https://doi.org/10.1063/1.3521275.
[382] Z. Wang, S. M. Selbach, and T. Grande, RSC Adv. 4, 4069 (2014).
[383] M. Ashton, D. Gluhovic, S. B. Sinnott, J. Guo, D. A. Stewart, and R. G. Hennig, Nano Letters 17, 5251 (2017), pMID: 28745061, https://doi.org/10.1021/acs.nanolett.7b01367.
[384] X. Li, Z. Zhang, Y. Yao, and H. Zhang, 2D Materials 5, 045023 (2018).
[385] T. Olsen, E. Andersen, T. Okugawa, D. Torelli, T. Deilmann, and K. S. Thygesen, arXiv e-prints (2018), arXiv:1812.06666 [cond-mat.mtrl-sci].
[386] Y. Liu, T. Zhao, W. Ju, and S. Shi, Journal of Materiomics 3, 159 (2017), arXiv:1704.03983.
[387] A. Jain, G. Hautier, S. P. Ong, and K. Persson, Journal of Materials Research 31, 977 (2016).
[388] J. Hill, G. Mulholland, K. Persson, R. Seshadri, C. Wolverton, and B. Meredig, MRS Bulletin 41, 399 (2016).
[389] B. Sanchez-Lengeling and A. Aspuru-Guzik, Science 361, 360 (2018).
[390] M. Rupp, O. A. von Lilienfeld, and K. Burke, J. Chem. Phys. 148, 241401 (2018), 10.1063/1.5043213, arXiv:1806.02690.
[391] L. Ward, M. Aykol, B. Blaiszik, I. Foster, B. Meredig, J. Saal, and S. Suram, MRS Bull. 43, 683 (2018).
[392] S. Curtarolo, D. Morgan, K. Persson, J. Rodgers, and G. Ceder, Physical Review Letters 91, 135503 (2003), arXiv:cond-mat/0307262.
[393] D. Morgan, G. Ceder, and S. Curtarolo, Meas. Sci. Technol. 16, 296 (2005), arXiv:cond-mat/0502465.
[394] C. C. Fischer, K. J. Tibbetts, D. Morgan, and G. Ceder, Nat. Mater. 5, 641 (2006).
[395] G. Hautier, C. C. Fischer, A. Jain, T. Mueller, and G. Ceder, Chem. Mater. 22, 3762 (2010).
[396] Y. Saad, D. Gao, T. Ngo, S. Bobbitt, J. R. Chelikowsky, and W. Andreoni, Physical Review B 85, 104104 (2012).
[397] P. V. Balachandran, J. Theiler, J. M. Rondinelli, and T. Lookman, Sci. Rep. 5, 13285 (2015).
[398] T. K. Patra, V. Meenakshisundaram, J.-H. Hung, and D. S. Simmons, ACS Comb. Sci. 19, 96 (2017).
[399] A. O. Oliynyk, E. Antono, T. D. Sparks, L. Ghadbeigi, M. W. Gaultois, B. Meredig, and A. Mar, Chem. Mater. 28, 7324 (2016).
[400] F. A. Faber, A. Lindmaa, O. A. von Lilienfeld, and R. Armiento, Phys. Rev. Lett. 117, 135502 (2016), arXiv:1508.05315.
[401] P. V. Balachandran, J. Young, T. Lookman, and J. M. Rondinelli, Nat. Commun. 8, 14282 (2017).
[402] Y. Okamoto, J. Phys. Chem. A 121, 3299 (2017).
[403] J. Schmidt, J. Shi, P. Borlido, L. Chen, S. Botti, and M. A. L. Marques, Chem. Mater. 29, 5090 (2017), arXiv:1310.1546v1.
[404] W. Ye, C. Chen, Z. Wang, I.-H. Chu, and S. P. Ong, Nat. Commun. 9, 3800 (2018).
[405] P. V. Balachandran, A. A. Emery, J. E. Gubernatis, T. Lookman, C. Wolverton, and A. Zunger, Phys. Rev. Mater. 2, 043802 (2018).
[406] S. Lu, Q. Zhou, Y. Ouyang, Y. Guo, Q. Li, and J. Wang, Nat. Commun. 9, 3405 (2018).
[407] G. Pilania, P. V. Balachandran, C. Kim, and T. Lookman, Front. Mater. 3, 1 (2016).
[408] A. Mansouri Tehrani, A. O. Oliynyk, M. Parry, Z. Rizvi, S. Couper, F. Lin, L. Miyagi, T. D. Sparks, and J. Brgoch, J. Am. Chem. Soc. 140, 9844 (2018).
[409] K. Takahashi and Y. Tanaka, Computational Materials Science 112, 364 (2016).
[410] C. Nyshadham, M. Rupp, B. Bekker, A. V. Shapeev, T. Mueller, C. W. Rosenbrock, G. Csányi, D. W. Wingate, and G. L. W. Hart, (2018), arXiv:1809.09203.
[411] Y. Zhuo, A. Mansouri Tehrani, A. O. Oliynyk, A. C. Duke, and J. Brgoch, Nat. Commun. 9, 4377 (2018).
[412] F. Legrain, J. Carrete, A. van Roekeghem, G. K. Madsen, and N. Mingo, J. Phys. Chem. B 122, 625 (2018), arXiv:1706.00192.
[413] K. Kim, L. Ward, J. He, A. Krishna, A. Agrawal, and C. Wolverton, Phys. Rev. Mater. 2, 123801 (2018).
[414] B. R. Goldsmith, J. Esterhuizen, J.-X. Liu, C. J. Bartel, and C. Sutton, AIChE J. 64, 2311 (2018).
[415] K. Takahashi and Y. Tanaka, Phys. Rev. B 95, 054110 (2017).
[416] J. R. Hattrick-Simpers, K. Choudhary, and C. Corgnale, Mol. Syst. Des. Eng. 3, 509 (2018).
[417] S. Ubaru, A. Miȩdlar, Y. Saad, and J. R. Chelikowsky, Phys. Rev. B 95, 214102 (2017).
[418] A. R. Natarajan and A. Van der Ven, npj Comput. Mater. 4, 56 (2018).
[419] A. Jain and T. Bligaard, Phys. Rev. B 98, 214112 (2018), arXiv:1809.03960.
[420] F. Ren, L. Ward, T. Williams, K. J. Laws, C. Wolverton, J. Hattrick-Simpers, and A. Mehta, Sci. Adv. 4, eaaq1566 (2018).
[421] L. Ward, S. C. O'Keeffe, J. Stevick, G. R. Jelbert, M. Aykol, and C. Wolverton, Acta Mater. …
[422] A. Ziletti, D. Kumar, M. Scheffler, and L. M. Ghiringhelli, Nat. Commun. 9, 2775 (2018), arXiv:1709.02298.
[423] F. Legrain, J. Carrete, A. van Roekeghem, S. Curtarolo, and N. Mingo, Chem. Mater. 29, 6220 (2017), arXiv:1703.02309.
[424] F. Legrain, A. van Roekeghem, S. Curtarolo, J. Carrete, G. K. H. Madsen, and N. Mingo, J. Chem. Inf. Model. 58, 2460 (2018).
[425] T. T. Nguyen, E. Székely, G. Imbalzano, J. Behler, G. Csányi, M. Ceriotti, A. W. Götz, and F. Paesani, J. Chem. Phys. 148, 241725 (2018), arXiv:1802.00564.
[426] V. L. Deringer, C. J. Pickard, and G. Csányi, Phys. Rev. Lett. 120, 156001 (2018), arXiv:1710.10475.
[427] V. L. Deringer and G. Csányi, Phys. Rev. B 95, 094203 (2017), arXiv:1611.03277.
[428] M. A. Caro, V. L. Deringer, J. Koskinen, T. Laurila, and G. Csányi, Phys. Rev. Lett. 120, 166101 (2018), arXiv:1804.07463.
[429] V. L. Deringer, N. Bernstein, A. P. Bartók, M. J. Cliffe, R. N. Kerber, L. E. Marbella, C. P. Grey, S. R. Elliott, and G. Csányi, J. Phys. Chem. Lett. 9, 2879 (2018), arXiv:1803.02802.
[430] G. C. Sosso, V. L. Deringer, S. R. Elliott, and G. Csányi, Mol. Simul. 44, 866 (2018).
[431] D. Dragoni, T. D. Daff, G. Csányi, and N. Marzari, Phys. Rev. Mater. 2, 013808 (2018), arXiv:1706.10229.
[432] V. L. Deringer, D. M. Proserpio, G. Csányi, and C. J. Pickard, Faraday Discuss. 211, 45 (2018).
[433] A. P. Bartók, S. De, C. Poelking, N. Bernstein, J. R. Kermode, G. Csányi, and M. Ceriotti, Sci. Adv. 3, e1701816 (2017).
[434] J. Behler, Angew. Chem. Int. Ed. 56, 12828 (2017).
[435] J. Behler, R. Martoňák, D. Donadio, and M. Parrinello, Phys. Status Solidi B 245, 2618 (2008).
[436] J. Behler, R. Martoňák, D. Donadio, and M. Parrinello, Phys. Rev. Lett. 100, 185501 (2008).
[437] N. Artrith, T. Morawietz, and J. Behler, Phys. Rev. B 83, 153101 (2011), arXiv:1512.09110.
[438] N. Artrith and A. M. Kolpak, Nano Lett. 14, 2670 (2014).
[439] K. V. J. Jose, N. Artrith, and J. Behler, J. Chem. Phys. 136, 194111 (2012).
[440] M. Gastegger and P. Marquetand, J. Chem. Theory Comput. 11, 2187 (2015).
[441] M. Gastegger, C. Kauffmann, J. Behler, and P. Marquetand, J. Chem. Phys. 144, 194110 (2016), arXiv:1609.07072.
[442] J. R. Boes, M. C. Groenenboom, J. A. Keith, and J. R. Kitchin, Int. J. Quantum Chem. 116, 979 (2016).
[443] J. R. Boes and J. R. Kitchin, J. Phys. Chem. C 121, 3479 (2017).
[444] V. Quaranta, M. Hellström, and J. Behler, J. Phys. Chem. Lett. 8, 1476 (2017).
[445] C. Zeni, K. Rossi, A. Glielmo, Á. Fekete, N. Gaston, F. Baletto, and A. De Vita, J. Chem. Phys. 148, 241739 (2018).
[446] M. O. J. Jäger, E. V. Morooka, F. Federici Canova, L. Himanen, and A. S. Foster, npj Comput. Mater. 4, 37 (2018).
[447] R. Ouyang, E. Ahmetcik, C. Carbogno, M. Scheffler, and L. M. Ghiringhelli, J. Phys. Mater., 1 (2019).
[448] C. J. Bartel, C. Sutton, B. R. Goldsmith, R. Ouyang, C. B. Musgrave, L. M. Ghiringhelli, and M. Scheffler, ArXiv e-prints (2018), arXiv:1801.07700.
[449] C. J. Bartel, S. L. Millican, A. M. Deml, J. R. Rumptz, W. Tumas, A. W. Weimer, S. Lany, V. Stevanović, C. B. Musgrave, and A. M. Holder, Nat. Commun. 9, 4168 (2018), arXiv:1805.08155.
[450] A. S. M. Jonayat, A. C. T. van Duin, and M. J. Janik, ACS Appl. Energy Mater. 1, 6217 (2018).
[451] N. Kumar, P. Rajagopalan, P. Pankajakshan, A. Bhattacharyya, S. Sanyal, J. Balachandran, and U. V. Waghmare, Chem. Mater. 31, 314 (2019).
[452] R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, J. Chem. Theory Comput. 11, 2087 (2015), arXiv:1503.04987.
[453] B. R. Goldsmith, M. Boley, J. Vreeken, M. Scheffler, and L. M. Ghiringhelli, New J. Phys. 19, 013031 (2017).
[454] Y. Zhang and C. Ling, npj Comput. Mater. 4, 28 (2018).
[455] M. Gerosa, C. E. Bottani, C. Di Valentin, G. Onida, and G. Pacchioni, J. Phys.: Condens. Matter 30, 044003 (2018).
[456] P. Dey, J. Bible, S. Datta, S. Broderick, J. Jasinski, M. Sunkara, M. Menon, and K. Rajan, Comput. Mater. Sci. 83, 185 (2014).
[457] S. A. Tawfik, O. Isayev, C. Stampfl, J. Shapter, D. A. Winkler, and M. J. Ford, Adv. Theory Simul. 2, 1800128 (2019).
[458] L. Bassman, P. Rajak, R. K. Kalia, A. Nakano, F. Sha, J. Sun, D. J. Singh, M. Aykol, P. Huck, K. Persson, and P. Vashishta, npj Comput. Mater. 4, 74 (2018).
[459] P. C. S. John, C. Phillips, T. W. Kemper, A. N. Wilson, M. F. Crowley, R. Mark, and R. E. Larsen, ArXiv e-prints (2018), arXiv:1807.10363.
[460] J. Lee, A. Seko, K. Shitara, K. Nakayama, and I. Tanaka, Phys. Rev. B 93, 1 (2016), arXiv:1509.00973.
[461] G. Montavon, M. Rupp, V. Gobre, A. Vazquez-Mayagoitia, K. Hansen, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld, New J. Phys. 15, 095003 (2013), arXiv:1305.7074.
[462] A. Mannodi-Kanakkithodi, G. Pilania, T. D. Huan, T. Lookman, and R. Ramprasad, Sci. Rep. 6, 20952 (2016).
[463] G. Pilania, C. Wang, X. Jiang, S. Rajasekaran, and R. Ramprasad, Sci. Rep. 3, 1 (2013).
[464] G. Pilania, A. Mannodi-Kanakkithodi, B. P. Uberuaga, R. Ramprasad, J. E. Gubernatis, and T. Lookman, Sci. Rep. 6, 1 (2016).
[465] A. C. Rajan, A. Mishra, S. Satsangi, R. Vaish, H. Mizuseki, K.-R. Lee, and A. K. Singh, Chem. Mater. 30, 4031 (2018).
[466] Z. Zhu, B. Dong, T. Yang, and Z. Zhang, ArXiv e-prints (2017), arXiv:1708.04766.
[467] Y. He, E. D. Cubuk, M. D. Allendorf, and E. J. Reed, J. Phys. Chem. Lett. 9, 4562 (2018).
[468] Z. Zhaochun, P. Ruiwu, and C. Nianyi, Mater. Sci. Eng. B 54, 149 (1998).
[469] Y. Zhuo, A. Mansouri Tehrani, and J. Brgoch, J. Phys. Chem. Lett. 9, 1668 (2018).
[470] J. Carrete, N. Mingo, S. Wang, and S. Curtarolo, Adv. Funct. Mater. 24, 7427 (2014), arXiv:1408.5859.
[471] S. Ju, T. Shiga, L. Feng, Z. Hou, K. Tsuda, and J. Shiomi, Phys. Rev. X 7, 021024 (2017).
[472] M. Yamawaki, M. Ohnishi, S. Ju, and J. Shiomi, Sci. Adv. 4, 1 (2018).
[473] M. W. Gaultois, A. O. Oliynyk, A. Mar, T. D. Sparks, and G. J. Mulholland, APL Mater. 4, 053213 (2016), 10.1063/1.4952607.
[474] F. Häse, S. Valleau, E. Pyzer-Knapp, and A. Aspuru-Guzik, Chem. Sci. 7, 5139 (2016), arXiv:1511.07883.
[475] K. Fujimura, A. Seko, Y. Koyama, A. Kuwabara, I. Kishida, K. Shitara, C. A. Fisher, H. Moriwake, and I. Tanaka, Adv. Energy Mater. 3, 980 (2013).
[476] X. Ma, Z. Li, L. E. Achenie, and H. Xin, J. Phys. Chem. Lett. 6, 3528 (2015).
[477] W. Pronobis, K. T. Schütt, A. Tkatchenko, and K.-R. Müller, Eur. Phys. J. B 91, 178 (2018).
[478] S. Sanvito, C. Oses, J. Xue, A. Tiwari, M. Žic, T. Archer, P. Tozman, M. Venkatesan, M. Coey, and S. Curtarolo, Sci. Adv. 3, e1602241 (2017).
[479] J. M. D. Coey, Magnetism and Magnetic Materials (Cambridge University Press, 2010).
[480] S. Sanvito, M. Žic, J. Nelson, T. Archer, C. Oses, and S. Curtarolo, “Machine learning and high-throughput approaches to magnetism,” in Handbook of Materials Modeling: Applications: Current and Emerging Materials, edited by W. Andreoni and S. Yip (Springer International Publishing, Cham, 2018) pp. 1–23.
[481] T.-L. Pham, N.-D. Nguyen, V.-D. Nguyen, H. Kino, T. Miyake, and H.-C. Dam, J. Chem. Phys. 148, 204106 (2018).
[482] D. J. Thouless, M. Kohmoto, M. P. Nightingale, and M. den Nijs, Phys. Rev. Lett. 49, 405 (1982).
[483] L. Fu and C. L. Kane, Phys. Rev. B 74, 195312 (2006).
[484] C. L. Kane and E. J. Mele, Phys. Rev. Lett. 95, 226801 (2005).
[485] C. L. Kane and E. J. Mele, Phys. Rev. Lett. 95, 146802 (2005).
[486] L. Fu and C. L. Kane, Phys. Rev. B 76, 045302 (2007).
[487] L. Fu, Phys. Rev. Lett. 106, 106802 (2011).
[488] T. H. Hsieh, H. Lin, J. Liu, W. Duan, A. Bansil, and L. Fu, Nat. Commun. 3, 982 (2012).
[489] W.-J. Shi, J. Liu, Y. Xu, S.-J. Xiong, J. Wu, and W. Duan, Phys. Rev. B 92, 205118 (2015).
[490] J. Carrasquilla and R. G. Melko, Nat. Phys. 13, 431 (2017).
[491] E. P. L. van Nieuwenburg, Y.-H. Liu, and S. D. Huber, Nat. Phys. 13, 435 (2017).
[492] X. L. Zhao and L. B. Fu, ArXiv e-prints (2018), arXiv:1808.01731 [cond-mat.dis-nn].
[493] W. Zhang, J. Liu, and T.-C. Wei, ArXiv e-prints (2018), arXiv:1804.02709 [cond-mat.stat-mech].
[494] P. Suchsland and S. Wessel, Phys. Rev. B 97, 174435 (2018).
[495] P. Huembeli, A. Dauphin, and P. Wittek, Phys. Rev. B 97, 134109 (2018).
[496] K. Ch’ng, J. Carrasquilla, R. G. Melko, and E. Khatami, Phys. Rev. X 7, 031038 (2017).
[497] L. Li, T. E. Baker, S. R. White, and K. Burke, Phys. Rev. B 94, 245129 (2016).
[498] C. Wang and H. Zhai, Phys. Rev. B 96, 144432 (2017).
[499] W. Hu, R. R. P. Singh, and R. T. Scalettar, Phys. Rev. E 95, 062122 (2017).
[501] L. Wang, Phys. Rev. B 94, 195105 (2016).
[502] J. Venderley, V. Khemani, and E.-A. Kim, Phys. Rev. Lett. 120, 257204 (2018), arXiv:1711.00020 [cond-mat.dis-nn].
[503] Y. Ando, J. Phys. Soc. Jpn. 82, 102001 (2013).
[504] B. Q. Lv, H. M. Weng, B. B. Fu, X. P. Wang, H. Miao, J. Ma, P. Richard, X. C. Huang, L. X. Zhao, G. F. Chen, Z. Fang, X. Dai, T. Qian, and H. Ding, Phys. Rev. X 5, 031013 (2015).
[505] D.-L. Deng, X. Li, and S. Das Sarma, Phys. Rev. B 96, 195145 (2017).
[506] A. Kitaev, Ann. Phys. 321, 2 (2006), January Special Issue.
[507] A. Kitaev, Ann. Phys. 303, 2 (2003).
[508] Y. Zhang and E.-A. Kim, Phys. Rev. Lett. 118, 216401 (2017), arXiv:1611.01518 [cond-mat.str-el].
[509] P. Zhang, H. Shen, and H. Zhai, Phys. Rev. Lett. 120, 066401 (2018), arXiv:1708.09401 [cond-mat.mes-hall].
[510] C. Mera Acosta and A. Fazzio, ArXiv e-prints (2018), arXiv:1811.11014 [cond-mat.mes-hall].