Methods, tools, and examples (Part II):
High-throughput computation and machine learning
applied to materials design
Anubhav Jain
Energy Technologies Area
Lawrence Berkeley National Laboratory
Berkeley, CA
LLNL Computational Chemistry Materials Science
Summer Institute, 2018
Slides (already) posted to http://www.slideshare.net/anubhavster
Summary of yesterday’s talk
Quantum mechanics Density functional theory High-throughput DFT
Materials databases Machine learning
Outline
①  High-throughput DFT: thermoelectrics design
②  High-throughput DFT: do-it-yourself!
③  Future of the Materials Project database
④  Machine learning: examples
⑤  Machine learning: do-it-yourself!
⑥  Conclusion
Thermoelectric materials convert heat to electricity
•  A thermoelectric material
generates a voltage based
on thermal gradient
•  Applications
–  Heat to electricity
–  Refrigeration
•  Advantages include:
–  Reliability
–  Easy to scale to different
sizes (including compact)
www.alphabetenergy.com
Alphabet Energy – 25kW generator
Thermoelectric figure of merit
•  Many materials properties are important for thermoelectrics
•  Focus is usually on finding materials that possess a high “figure
of merit”, or zT, for high efficiency
•  Target: zT at least 1, ideally >2
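$zT = \alpha^2 \sigma T / \kappa$
•  Power factor ($\alpha^2 \sigma$): > 2 mW/(m·K²) (PbTe = 10 mW/(m·K²))
•  Seebeck coefficient ($\alpha$): > 100 µV/K (from band structure + BoltzTraP)
•  Electrical conductivity ($\sigma$): > 10³ /(ohm·cm) (from band structure + BoltzTraP)
•  Thermal conductivity ($\kappa = \kappa_e + \kappa_l$): < 1 W/(m·K)
–  $\kappa_e$ from BoltzTraP
–  $\kappa_l$ difficult (phonon-phonon scattering)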
Very difficult to balance these properties using intuition
alone!
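As a rough illustration of how the quantities in the figure of merit combine, the sketch below evaluates zT = α²σT/κ from values given in common lab units. The function name, argument names, and input numbers are illustrative assumptions, not values from the talk.

```python
def figure_of_merit(seebeck_uV_per_K, sigma_S_per_cm, kappa_W_per_mK, temperature_K):
    """Return (power factor in mW/(m*K^2), zT) from common lab units."""
    alpha = seebeck_uV_per_K * 1e-6      # µV/K -> V/K
    sigma = sigma_S_per_cm * 100.0       # S/cm -> S/m
    power_factor = alpha ** 2 * sigma    # W/(m*K^2)
    zt = power_factor * temperature_K / kappa_W_per_mK
    return power_factor * 1e3, zt


# Hypothetical material: 200 µV/K, 10^3 S/cm, 1 W/(m*K), evaluated at 600 K.
pf, zt = figure_of_merit(200.0, 1000.0, 1.0, 600.0)
print(f"PF = {pf:.2f} mW/(m*K^2), zT = {zt:.2f}")
```

Because the Seebeck coefficient enters squared, halving it at fixed conductivity cuts zT by a factor of four, which is one reason the properties are hard to balance by intuition.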
Example: Seebeck and e– conductivity tradeoff
Heavy band:
✓  Large DOS
(higher Seebeck and more carriers)
✗ Large effective mass
(poor mobility)
Light band:
✓  Small effective mass
(improved mobility)
✗ Small DOS
(lower Seebeck, fewer carriers)
Multiple bands, off symmetry:
✓  Large DOS with small effective
mass
✗ Difficult to design!
[Figure: schematic band diagrams, energy E versus wavevector k]
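The tradeoff can be made explicit with the textbook parabolic-band picture, which is an added simplification rather than something stated on the slide. Here $m^*_b$ is the effective mass of a single band (valley), $N_v$ is the number of equivalent valleys, $\tau$ is the scattering time, and $m^*_{\mathrm{DOS}}$ is the density-of-states effective mass.

```latex
\begin{align}
g(E) &= \frac{\left(2 m^*_{\mathrm{DOS}}\right)^{3/2}}{2\pi^2\hbar^3}\,\sqrt{E}, &
\mu &= \frac{e\,\tau}{m^*_b}, &
m^*_{\mathrm{DOS}} &= N_v^{2/3}\, m^*_b .
\end{align}
```

A single heavy band raises $g(E)$ only by raising $m^*_b$, which lowers the mobility $\mu$. Several light valleys away from high-symmetry points raise $m^*_{\mathrm{DOS}}$ through $N_v$ while keeping $m^*_b$ small.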
Finding good thermoelectrics is tough –
can computations help?
•  Thermoelectric (TE) materials must exhibit properties that are
difficult to obtain simultaneously
•  Can theory / computation help? As proposed as early as 2003 by
Blake and Metiu [1]:
“With the cost of computing become relatively inexpensive one can
envisage a time where one runs multiple computer test tube
reactions like these on large Beowulf clusters - as a means of
screening for new TE materials. Certainly it appears that in the
future theory may be a very competent dance partner for what has
previously been a solo experimental effort in searching for ever
better TE materials.”
[1] Blake and Metiu. Can theory help in the search for better thermoelectric materials? Chemistry, Physics,
and Materials Science of Thermoelectric Materials: Beyond Bismuth Telluride, 2003.
We’ve initiated a search for new bulk thermoelectrics
Initial procedure similar to
Madsen (2006)
On top of this traditional
procedure we add:
•  thermal conductivity
model of Pohl-Cahill
•  targeted defect
calculations to assess
doping
•  Today - ~50,000
compounds screened!
Madsen, G. K. H. Automated search for new
thermoelectric materials: the case of LiZnSb.
J. Am. Chem. Soc., 2006, 128, 12140–6
Chen, W. et al. Understanding thermoelectric properties from high-throughput
calculations: trends, insights, and comparisons with experiment.
J. Mater. Chem. C 4, 4414–4426 (2016).
Known limitations to theory,
but useful for estimation / pre-screening
•  Limitations of our theoretical
approach include:
–  constant electronic relaxation time
–  Cahill/Pohl minimum thermal conductivity
–  “standard” DFT limitations –
e.g., band gap
•  We are “upgrading” the theory
–  new model for electronic transport
•  But: our actual problems (shown
later) are usually not in known
theory problems, but properties
that were not estimated at all
Chen, W., et al., Understanding thermoelectric
properties from high-throughput calculations:
trends, insights, and comparisons with
experiment. J. Mater. Chem. C 4, 4414–4426 (2016).
[Figure: computed vs. experimental power factors]
Database of calculated transport properties
All data (~300GB total) is
available for direct download
through the Dryad repository
linked in the following
publication:
F. Ricci, W. Chen, U. Aydemir, G.J. Snyder, G.-M.
Rignanese, A. Jain, et al., An ab initio electronic
transport database for inorganic materials, Sci.
Data. 4 (2017) 170085.
New Materials from screening – TmAgTe2 (calcs)
Zhu, H.; Hautier, G.; Aydemir, U.; Gibbs, Z. M.; Li, G.; Bajaj, S.; Pöhls, J.-H.; Broberg, D.; Chen, W.; Jain, A.; White, M. A.; Asta,
M.; Snyder, G. J.; Persson, K.; Ceder, G. Computational and experimental investigation of TmAgTe2 and XYZ2 compounds, a
new group of thermoelectric materials identified by first-principles high-throughput screening, J. Mater. Chem. C, 2015, 3
•  Calculations: trigonal p-TmAgTe2 could have power factor up to 8 mW/(m·K²)
•  requires 10²⁰/cm³ carriers
TmAgTe2 (experiments)
Zhu, H.; Hautier, G.; Aydemir, U.; Gibbs, Z. M.; Li, G.; Bajaj, S.; Pöhls, J.-H.; Broberg, D.; Chen, W.; Jain, A.; White, M. A.; Asta,
M.; Snyder, G. J.; Persson, K.; Ceder, G. Computational and experimental investigation of TmAgTe2 and XYZ2 compounds, a
new group of thermoelectric materials identified by first-principles high-throughput screening, J. Mater. Chem. C, 2015, 3
•  Expt: p-type zT only 0.35 despite very low thermal conductivity (~0.25 W/(m·K))
•  Limitation: carrier concentration (~10¹⁷/cm³)
•  likely limited by Tm_Ag defects, as determined by follow-up calculations
YCuTe2 – friendlier elements, higher zT (0.75)
Aydemir, U.; Pöhls, J.-H.; Zhu, H.; Hautier, G.; Bajaj, S.; Gibbs, Z. M.; Chen, W.; Li, G.; Broberg, D.;
Kang, S.D.; White, M. A.; Asta, M.; Ceder, G.; Persson, K.; Jain, A.; Snyder, G. J. YCuTe2: A Member of
a New Class of Thermoelectric Materials with CuTe4-Based Layered Structure. J. Mater. Chem. C, 2016
[Figure: experiment vs. computation]
•  Calculations: p-YCuTe2 could only reach PF of 0.4 mW/(m·K²)
•  SOC inhibits PF
•  if thermal conductivity is low (e.g., 0.4 W/(m·K)), we get zT ~1
•  Expt: zT ~0.75, not too far from calculation limit
•  carrier concentration of 10¹⁹/cm³
•  Decent performance, but unlikely to be improved with further optimization
Bournonites – CuPbSbS3 and analogues (no expt)
A. Faghaninia, G. Yu, U. Aydemir, M. Wood, W. Chen, G.-M. Rignanese, et al., A computational assessment of the electronic,
thermoelectric, and defect properties of bournonite (CuPbSbS3) and related substitutions, Phys. Chem. Chem. Phys. 19 (2017)
6743–6756.
•  Previously studied TE material
–  Measured thermal conductivity < 1 W/m*K
–  Measured Seebeck coefficient ~ 400 µV/K
–  BUT electrical conductivity requires improvement – can
calculations help?
•  Calculations: p-PF is 13.8, but we know
electronic conductivity will be lower than
estimated
–  Try ~320 chemical substitutions, see whether
electronic scattering time is reduced in
computational models or whether favorable defect
diagram can be found for high carrier concentration
–  Computations suggest a few interesting candidates,
including CuPbSnSe3 and CuPbAsSe3
•  Experiments: Preliminary experiments are
unsuccessful in synthesizing bournonite
CuPbSnSe3; stopped investigations.
Thermoelectrics screening: lessons so far
•  When considering our screening strategy in the abstract, the
major limitations appeared to be:
–  no modeling of electron relaxation time
–  limited modeling of thermal conductivity
•  However, in reality the biggest limitation has been estimating
dopability. This has been the major limitation for all our picks.
–  The materials we pick just aren’t very dopable
–  Computing doping limits is hard; we also learned not to trust GGA
defect diagrams for this purpose (at least shift the band edges with
an HSE band gap estimate)
•  So the problems are often not in the known limitations of the
theory, but in optimizing aspects of the material you are not
computing at all
Outline
①  High-throughput DFT: thermoelectrics design
②  High-throughput DFT: do-it-yourself!
③  Future of the Materials Project database
④  Machine learning: examples
⑤  Machine learning: do-it-yourself!
⑥  Conclusion
With HT-DFT, we can generate data rapidly – what to do next?
•  >4500 elastic tensors: M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. Van Der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data, 2015, 2, 150009.
•  >900 piezoelectric tensors: M. de Jong, W. Chen, H. Geerlings, M. Asta, and K. A. Persson, Sci. Data, 2015, 2, 150053.
•  >48000 Seebeck coefficients + cRTA transport: Ricci, Chen, Aydemir, Snyder, Rignanese, Jain, & Hautier (in submission)
Atomate’s goal: make
it easy to generate
comparable data sets
on your own
A “black-box” view of performing a calculation
[Figure: researcher asks "What is the GGA-PBE elastic tensor of GaAs?" → "something" → results]
Unfortunately, the inside of the “black box”
is usually tedious and “low-level”
[Figure: researcher asks "What is the GGA-PBE elastic tensor of GaAs?" → lots of tedious, low-level work (input file flags, SLURM format, how to fix ZPOTRF?) → results]
☐  set up the structure coordinates
☐  write input files, double-check all the flags
☐  copy to supercomputer
☐  submit job to queue
☐  deal with supercomputer headaches
☐  monitor job
☐  fix error jobs, resubmit to queue, wait again
☐  repeat process for subsequent calculations in workflow
☐  parse output files to obtain results
☐  copy and organize results, e.g., into Excel
What would be a better way?
[Figure: researcher asks "What is the GGA-PBE elastic tensor of GaAs?" → workflows to run → results]
Workflows to run:
☐  band structure
☐  surface energies
✓  elastic tensor
☐  Raman spectrum
☐  QH thermal expansion
Ideally the method should scale to millions of calculations
[Figure: researcher request ("Start with all binary oxides, replace O->S, run several different properties") → workflows to run → results]
Workflows to run:
✓  band structure
✓  surface energies
✓  elastic tensor
☐  Raman spectrum
☐  QH thermal expansion
☐  spin-orbit coupling
Atomate tries to make it easy, automatic, and flexible to
generate data with existing simulation packages
[Figure: researcher asks to "run many different properties of many different materials!" → results]
Atomate contains a library of simulation procedures
VASP-based
•  band structure
•  spin-orbit coupling
•  hybrid functional
calcs
•  elastic tensor
•  piezoelectric tensor
•  Raman spectra
•  NEB
•  GIBBS method
•  QH thermal
expansion
•  AIMD
•  ferroelectric
•  surface adsorption
•  work functions
Other
•  BoltzTraP
•  FEFF method
•  LAMMPS MD
Mathew, K. et al Atomate: A high-level interface to generate, execute, and analyze
computational materials science workflows, Comput. Mater. Sci. 139 (2017) 140–152.
Each simulation procedure translates high-level instructions
into a series of low-level tasks
Quickly and automatically translate PI-style (minimal)
specifications into well-defined FireWorks workflows
[Figure: the request "What is the GGA-PBE elastic tensor of GaAs?" translated into a workflow]
M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, et al.,
Charting the complete elastic properties of inorganic crystalline compounds,
Sci. Data 2 (2015).
Atomate thus encodes and standardizes knowledge about
running various kinds of simulations from domain experts
All past and present knowledge, from everyone in the group,
everyone previously in the group, and our collaborators,
about how to run calculations
Contributors: K. Mathew, J. Montoya, S. Dwaraknath, A. Faghaninia, M. Aykol, S.P. Ong, B. Bocklund, T. Smidt, H. Tang, I.H. Chu, M. Horton, J. Dagdelen, B. Wood, Z.K. Liu, J. Neaton, K. Persson, A. Jain
Full operation diagram
[Figure: structure → workflow (jobs 1–4) → database of all workflows → automatically submit + execute → output files + database]
Crystal structure generation via pymatgen
•  Pymatgen can retrieve crystal
structures from the Materials
Project database (MPRester class)
•  It can also manipulate crystal
structures
–  substitutions
–  supercell creation
–  order-disorder (shown at right)
–  interstitial finding
–  surface / slab generation
•  A visual interface to many of the
tools is in Materials Project’s
“Crystal Toolkit” app
Example: Order-disorder
resolve partial or mixed
occupancies into a fully
ordered crystal structure
(e.g., mixed oxide-fluoride site
into separate oxygen/fluorine)
Atomate’s main goal – convert structures to workflows
Workflows consist of a series of jobs (“FireWorks”), each
with multiple tasks. Atomate jobs typically (i) run a
calculation and (ii) store the results in a database
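The idea of a workflow as a set of dependent jobs, each with a "run" task and a "store" task, can be sketched in plain Python. This is only a conceptual illustration using the standard library; it does not use the FireWorks or atomate APIs, and the job names and placeholder tasks are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def make_job(name):
    """Build a hypothetical job: task (i) 'runs' a calculation, task (ii) stores it."""
    def run_calculation(database):
        # Placeholder for a real simulation; records which earlier results exist.
        return {"job": name, "inputs_seen": sorted(database)}

    def store_result(database, result):
        database[name] = result

    return run_calculation, store_result


def run_workflow(parents):
    """Execute jobs so that every job runs only after all of its parents."""
    database = {}
    order = list(TopologicalSorter(parents).static_order())
    for name in order:
        run_calculation, store_result = make_job(name)
        store_result(database, run_calculation(database))
    return order, database


# Each key is a job; each value lists the jobs that must finish first.
parents = {
    "structure_optimization": [],
    "static": ["structure_optimization"],
    "band_structure": ["static"],
    "elastic_tensor": ["structure_optimization"],
}
order, database = run_workflow(parents)
print(order)
```

Keeping the dependency graph separate from the job bodies is what lets the same workflow definition be reused for many structures.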
FireWorks allows you to write your workflow once and
execute (almost) anywhere
•  Execute workflows
locally or at a
supercomputing
center
•  Queue systems
supported
–  PBS
–  SGE
–  SLURM
–  IBM LoadLeveler
–  NEWT (a REST-based
API at NERSC)
–  Cobalt (Argonne LCF)
Dashboard with status of all jobs
Other features
•  Job provenance and automatic metadata storage
•  Detect and rerun failures
•  “Dynamic” workflows that change behavior based on
results
•  Customize job priorities
•  Much more…
Atomate – builders framework
“Builders” start with base
collections in a database and
create higher-level collections
that summarize information or
add metadata
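A builder can be pictured as a function that reads a base collection and writes a summarized one. The sketch below uses plain Python lists of dictionaries in place of database collections; the field names and energy values are made-up placeholders, and this is not the atomate builder API.

```python
from collections import defaultdict


def build_materials(tasks):
    """Summarize a base 'tasks' collection into one document per material."""
    grouped = defaultdict(list)
    for doc in tasks:
        grouped[doc["formula"]].append(doc)

    materials = []
    for formula, docs in sorted(grouped.items()):
        # Keep the lowest-energy calculation as the representative result.
        best = min(docs, key=lambda d: d["energy_per_atom"])
        materials.append({
            "formula": formula,
            "n_tasks": len(docs),
            "energy_per_atom": best["energy_per_atom"],
            "source_task": best["task_id"],  # metadata linking back to the base collection
        })
    return materials


# Hypothetical base collection: one document per finished calculation.
tasks = [
    {"task_id": 1, "formula": "GaAs", "energy_per_atom": -4.10},
    {"task_id": 2, "formula": "GaAs", "energy_per_atom": -4.12},
    {"task_id": 3, "formula": "Si", "energy_per_atom": -5.42},
]
materials = build_materials(tasks)
for doc in materials:
    print(doc)
```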
The atomate database makes it easy to perform various
analyses with pymatgen
[Figure: atomate output database(s) feeding pymatgen analyses: phase diagrams, Pourbaix diagrams, diffusivity via MD, band structure analysis]
Many research groups have run tens of thousands of
materials science workflows with atomate
also used by:
•  Persson research group, UC Berkeley
•  Ong research group, UC San Diego
•  Neaton research group, UC Berkeley
•  Liu research group, Penn State
•  Groups not developing on atomate!
•  e.g., see “Thermal expansion of quaternary nitride coatings” by
Tasnadi et al.
atomate now powers the Materials
Project and will be used to run
hundreds of thousands of
simulations in the next year
(www.materialsproject.org)
Outline
①  High-throughput DFT: thermoelectrics design
②  High-throughput DFT: do-it-yourself!
③  Future of the Materials Project database
④  Machine learning: examples
⑤  Machine learning: do-it-yourself!
⑥  Conclusion
Materials Project database
•  Online resource of density
functional theory simulation data
for ~85,000 inorganic materials
•  Includes band structures, elastic
tensors, piezoelectric tensors,
battery properties and more
•  Nearly 55,000 registered users
•  Free
•  www.materialsproject.org
Jain et al. Commentary: The Materials Project: A
materials genome approach to accelerating
materials innovation. APL Mater. 1, 11002 (2013).
Today when you search for materials with bulk modulus, you
get back a table of results
A long table of data
is difficult to get a
“feel” for
Only shows 500
entries out of
6000+ possibilities
Here’s an MP example we put together two years ago that
hasn’t yet made it to the web site
We also want to enable more sophisticated
crystal structure searches
•  “Find all compounds where a given cation is square
planar coordinated by oxygen, has a bandgap between
1.1 and 3eV, and is stable in water at pH = 2”
•  Here there are:
–  structural constraints (4-fold coordination w/oxygen)
–  materials property constraints (bandgap = 1.1 – 3 eV)
–  complex analysis constraints (stable in water at pH=2)
•  For now, focus on the first problem – how to analyze
crystal structures to generate “features”?
Defining local order parameters for various environments
Use a given local order parameter with a threshold for motif recognition:
If q_tet > q_thresh,
    then motif is a tetrahedron.
Else
    not (too much) a tetrahedron.
Tetrahedral order parameter, q_tet [1]
[1] Zimmermann et al., J. Am. Chem. Soc., 2017, 10.1021/jacs.5b08098
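The thresholding rule can be sketched with any scalar measure of "tetrahedral-ness". The example below substitutes a well-known orientational tetrahedral order parameter of the Errington-Debenedetti form, q = 1 − (3/8) Σ (cos ψ_jk + 1/3)² over the six neighbor pairs; this is not necessarily the q_tet defined in the cited work, and the threshold value of 0.8 is an arbitrary assumption.

```python
import math
from itertools import combinations


def tetrahedral_order(center, neighbors):
    """Orientational order parameter: 1.0 for four neighbors in a perfect tetrahedron."""
    unit_vectors = []
    for n in neighbors:
        v = [n[i] - center[i] for i in range(3)]
        norm = math.sqrt(sum(c * c for c in v))
        unit_vectors.append([c / norm for c in v])

    total = 0.0
    for a, b in combinations(unit_vectors, 2):
        cos_psi = sum(x * y for x, y in zip(a, b))  # cosine of the angle between two bonds
        total += (cos_psi + 1.0 / 3.0) ** 2
    return 1.0 - 3.0 / 8.0 * total


def is_tetrahedron(center, neighbors, q_thresh=0.8):
    """Apply the threshold rule; q_thresh is a hypothetical choice."""
    return tetrahedral_order(center, neighbors) > q_thresh


origin = (0.0, 0.0, 0.0)
tetrahedron = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
square_planar = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]

q_tet = tetrahedral_order(origin, tetrahedron)
q_sq = tetrahedral_order(origin, square_planar)
print(q_tet, is_tetrahedron(origin, tetrahedron))
print(q_sq, is_tetrahedron(origin, square_planar))
```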
We have now developed mathematical order parameters for
various types of local environments
Key step: how well do these work?
1. Order parameters clearly
distinguish different environments
even after thermal distortion
2. Work well in applications (defect site
finding, diffusion characterization)
[1] Zimmermann et al., Frontiers in Materials, 2017, doi: 10.3389/fmats.2017.00034
Describe crystals by local environment
•  Describe each site in a crystal as a vector of all order
parameter values
–  this tells you how much each site consists of different local
env. characters, e.g. tetrahedral, octahedral, square
pyramid, etc.
•  Describe “crystal fingerprints” based on site
fingerprint statistics
–  e.g., this essentially tells you things like “spinel is 1/3
tetrahedral, 2/3 octahedral cation sites”
•  Turns each structure into a numerical vector that
describes its geometric local environments
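A minimal sketch of going from site fingerprints to a crystal fingerprint, assuming the statistics are simply the column-wise mean and standard deviation (the statistics actually used may differ). The site vectors are idealized, hypothetical values for a structure with one tetrahedral and two octahedral cation sites.

```python
from statistics import mean, pstdev


def crystal_fingerprint(site_fingerprints):
    """Column-wise mean and standard deviation over all site fingerprints."""
    columns = list(zip(*site_fingerprints))
    return [mean(c) for c in columns] + [pstdev(c) for c in columns]


# Columns are (tetrahedral character, octahedral character) for each cation site.
sites = [(1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
fingerprint = crystal_fingerprint(sites)
print(fingerprint)
```

The first two entries are the average tetrahedral and octahedral character (1/3 and 2/3 here), and the fingerprint has the same length no matter how many sites the structure contains.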
Application: crystal structure similarity
Goal: determine crystal structure “similarity” between all
structure pairs in MP database
Example: BCC,
CsCl, and
Heusler are all
orderings into the
same essential
crystal
Difficulty:
different bond
lengths, # of
atoms, small
distortions, etc
Can cluster
crystal structures
by “local
environment
similarity”
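Once every structure is a numerical vector, similarity reduces to a distance between vectors. The sketch below assumes a plain Euclidean distance and uses invented three-component fingerprints; the structure labels and numbers are hypothetical.

```python
import math


def fingerprint_distance(fp_a, fp_b):
    """Euclidean distance between two crystal fingerprints of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp_a, fp_b)))


def rank_by_similarity(target, candidates):
    """Return candidate names sorted from most to least similar to the target."""
    return sorted(candidates, key=lambda name: fingerprint_distance(target, candidates[name]))


# Hypothetical fingerprints: (8-fold cubic, tetrahedral, octahedral) character.
target = [1.0, 0.0, 0.0]
candidates = {
    "bcc_like_ordering": [1.0, 0.0, 0.0],
    "slightly_distorted": [0.98, 0.01, 0.01],
    "rocksalt_like": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity(target, candidates)
print(ranking)
```

Because the fingerprint depends only on local environments, two orderings of the same underlying lattice can have a distance near zero even when their compositions and cell sizes differ.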
Results on MP web site, e.g. for BCC-like structures
https://www.materialsproject.org/materials/mp-91/
Target: W
Similar structures (distance near 0): Cs3Sb, TiGaFeCo, CeMg2Cu
More tools coming, e.g., more sophisticated tools to
design and submit structures for computation
[Figure: pipeline from a custom material submission ("Submit!") through input generation (parameter choice), workflow mapping, supercomputer submission / monitoring, error handling, file transfer, and file parsing / DB insertion]
www.materialsproject.org
“Crystal Toolkit”
Anyone can find, edit,
and submit (suggest)
structures
Currently, this feature is available for:
•  structure optimization
•  band structures
•  elastic tensors
Outline
①  High-throughput DFT: thermoelectrics design
②  High-throughput DFT: do-it-yourself!
③  Future of the Materials Project database
④  Machine learning: examples
⑤  Machine learning: do-it-yourself!
⑥  Conclusion
Future: rationally control the band structure
example:
•  understanding the character of states that form the VBM / CBM
•  in TmAgTe2, increased hybridization lowers the valley degeneracy
•  Can we predict the orbital character of arbitrary materials?
Jain, A., Hautier, G., Ong, S. P. & Persson, K. New opportunities for materials informatics: Resources and data
mining techniques for uncovering hidden relationships. J. Mater. Res. 1–18 (2016).
[Figure: DFT/GGA+U projected DOS for MoO3]
Procedure for ranking likelihood to form VBM/CBM
•  Data set of 2558 materials
–  ionic materials evaluated via Bond Valence Sum method
–  band gap of 0.2 eV or higher (clear VBM and CBM)
–  avoid f-electron materials
–  limited pool of elements/orbitals competing for VBM/CBM
•  For each material:
–  determine the “ionic orbitals” (e.g., Mn3+:d, O2-:p, P5+:p) that are present
–  determine the contribution of each ionic orbital to VBM/CBM using
projected DOS
–  For each pair of ionic orbitals (e.g., Mn3+:d versus O2-:p), score a “win” for the
ionic orbital that contributes more to VBM/CBM
•  Use model to determine universal ranking from the series of pairwise
competitions (Bradley-Terry model)
Jain, A., Hautier, G., Ong, S. P. & Persson, K. New opportunities for
materials informatics: Resources and data mining techniques for
uncovering hidden relationships. J. Mater. Res. 1–18 (2016).
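The ranking step can be sketched with a basic Bradley-Terry fit, using the standard iterative update p_i ← W_i / Σ_j n_ij / (p_i + p_j), where W_i is the total number of wins of item i and n_ij is the number of contests between i and j. The win counts below are invented for illustration and are not the data from the paper; the fitting details used there may also differ.

```python
def bradley_terry(wins, n_iter=200):
    """wins[(a, b)] = number of times a beat b. Returns strengths that sum to 1."""
    items = sorted({name for pair in wins for name in pair})
    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            total_wins = sum(w for (winner, _loser), w in wins.items() if winner == i)
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                denom += n_ij / (strength[i] + strength[j])
            updated[i] = total_wins / denom
        norm = sum(updated.values())
        strength = {i: v / norm for i, v in updated.items()}
    return strength


# Hypothetical pairwise "wins" for contributing more to the VBM.
wins = {
    ("Cu1+:d", "O2-:p"): 8, ("O2-:p", "Cu1+:d"): 2,
    ("O2-:p", "Fe3+:d"): 7, ("Fe3+:d", "O2-:p"): 3,
    ("Cu1+:d", "Fe3+:d"): 9, ("Fe3+:d", "Cu1+:d"): 1,
}
strength = bradley_terry(wins)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
```

The fitted strengths give a single universal ordering, and the ratio p_i / (p_i + p_j) is the model's estimate of how often orbital i would win against orbital j.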
Results: likelihood to form VBM/CBM
•  Example interpretation: in a material with Cu1+:d, Fe3+:d, and O2-:p states,
the Cu is likely to be VBM and Fe likely to be CBM (this is true for FeCuO2)
•  There are also problems with such a universal ranking (discussed in paper)
that require refinement
Jain, A., Hautier, G., Ong, S. P. & Persson, K. New opportunities for materials informatics: Resources and data
mining techniques for uncovering hidden relationships. J. Mater. Res. 1–18 (2016).
Can we build a general optimizer?
[Figure (J. Mueller): a general optimizer combines a generalizable forward solver (FireWorks), supercomputing power (NERSC), and statistical optimization (various optimization libraries)]
Rocketsled: Automatic materials screening that selects
materials to compute AND submits them to supercomputer
•  Screening space of ~20,000 potential ABX3 perovskite combinations as
water splitting materials, precomputed in DFT by a different group
•  If a machine learning algorithm were in charge of picking the next compound
based on past data, how efficient would it be?
Text mining: learning from scientific abstracts
[Figure: text-mining pipeline with stages: Matstract corpus (unlabeled data, data labels); text processing (text cleaning, tokenization); feature engineering (POS tag labels, word2vec word embeddings, hand-crafted features); supervised learning (LSTM neural network, logistic regression; train/test sets); output: named entities]
“Learning” what a scientific study is about from >1 million materials science abstracts
Application: a revised materials search engine
Auto-generated summaries of materials based on text mining
Application: guessing materials for an application
•  We asked our text mining engine
to guess 6 compositions that
could be associated with the
word “thermoelectric”, but not
studied as thermoelectric in our
corpus
•  We then independently tested
these guesses against our
database of computed
thermoelectric quantities
–  5/6 were better than 80% of the
compounds in the DB
–  4/6 beat the 90th percentile
–  3/6 beat the 95th percentile
–  1/6 beat the 99.5th percentile, i.e.
is a “1 in 200” compound
•  More results to come …
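The percentile comparison can be made concrete with a short sketch. It assumes a percentile is the share of database entries scoring strictly below the candidate; the synthetic scores stand in for the computed thermoelectric quantities and are not real data.

```python
from bisect import bisect_left


def percentile_rank(value, population):
    """Percentage of the population with a score strictly below `value`."""
    ordered = sorted(population)
    return 100.0 * bisect_left(ordered, value) / len(ordered)


def one_in_n(percentile):
    """Convert 'beats the p-th percentile' into a '1 in N' statement."""
    return 100.0 / (100.0 - percentile)


# Synthetic database of 200 scores, 1 through 200.
database_scores = list(range(1, 201))
top_guess = percentile_rank(200, database_scores)
typical_guess = percentile_rank(161, database_scores)
print(top_guess, one_in_n(top_guess))
print(typical_guess)
```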
Outline
①  High-throughput DFT: thermoelectrics design
②  High-throughput DFT: do-it-yourself!
③  Future of the Materials Project database
④  Machine learning: examples
⑤  Machine learning: do-it-yourself!
⑥  Conclusion
Machine learning: the big problem in my view is connecting
data to ML algorithms through features
•  Lots of data on complex objects that you want to interrelate
•  Well-developed data-mining routines (clustering, regression, feature
extraction, model-building, etc.) that work only on numbers (ideally ones
with high relevance to your problem)
Need to transform materials science objects into a set of
physically relevant numerical data (“features” or “descriptors”)
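As a toy example of turning a materials science object into numbers, the sketch below converts a composition into three descriptors based on Pauling electronegativity. The function name and the choice of descriptors are illustrative assumptions and do not reproduce any particular matminer featurizer; the small property table covers only the elements used here.

```python
# Pauling electronegativities for a few elements (standard tabulated values).
PAULING_EN = {"Ti": 1.54, "O": 3.44, "Ga": 1.81, "As": 2.18}


def featurize(composition):
    """Map {element: amount} to [mean EN, EN range, number of elements]."""
    total = sum(composition.values())
    fractions = {el: amount / total for el, amount in composition.items()}
    values = [PAULING_EN[el] for el in composition]
    mean_en = sum(fractions[el] * PAULING_EN[el] for el in composition)
    return [mean_en, max(values) - min(values), float(len(composition))]


features_tio2 = featurize({"Ti": 1, "O": 2})
features_gaas = featurize({"Ga": 1, "As": 1})
print(features_tio2)
print(features_gaas)
```

Both compounds end up as vectors of the same length, which is the form that generic regression or clustering routines expect.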
Goal of matminer: connect materials data with data mining
algorithms and data visualization libraries
Ward, L. et al. Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
>40 featurizer classes can
generate thousands of
potential descriptors
Matminer contains a library of descriptors for various
materials science entities
feat = EwaldEnergy([options])
y = feat.featurize([input_data])
•  compatible with
scikit-learn
pipelining
•  automatically deploy
multiprocessing to
parallelize over data
•  include citations to
methodology papers
matminer also contains easy integration with Plotly for
quickly creating interactive, shareable HTML graphs
Interactive Jupyter notebooks demonstrate use cases
https://github.com/hackingmaterials/matminer_examples
Example 1: combining data from Citrine and MP to plot
computed vs. experimental band gap
[Figure: data retrieval from materials databases (Citrine, Materials Project) into a DataFrame, followed by data visualization]
MATERIAL        PROPERTY
TiO2 rutile     gap = 3.0 eV
C diamond       gap = 5.5 eV
…               …
PbTe rocksalt   gap = 0.3 eV
Run the full Jupyter notebook (experiment_vs_computed_bandgap.ipynb):
https://github.com/hackingmaterials/matminer_examples
Example 2: predicting bulk modulus from MP data
[Figure: data retrieval from materials databases (Materials Project), data featurization, then Python ML libraries]
MATERIAL        FEATURES            PROPERTY
TiO2 rutile     F11 F12 … F1N       E = 400
C diamond       F21 F22 … F2N       E = 230
…               …                   …
PbTe rocksalt   FM1 FM2 … FMN       E = 120
Result: mean RMSE of 20 GPa (10-fold CV)
Run the full Jupyter notebook (intro_predicting_bulk_modulus.ipynb):
https://github.com/hackingmaterials/matminer_examples
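To show what a "mean RMSE (10-fold CV)" figure means, the sketch below runs 10-fold cross-validation with a one-feature least-squares line on synthetic data. The model, the single feature, and the data are stand-in assumptions; the notebook itself uses real Materials Project data and a richer feature set.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def cross_validated_rmse(xs, ys, k=10):
    """Mean RMSE over k folds; fold f holds out points f, f + k, f + 2k, ..."""
    fold_rmses = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]
        slope, intercept = fit_line(*zip(*train))  # fit on training points only
        mse = sum((slope * x + intercept - y) ** 2 for x, y in test) / len(test)
        fold_rmses.append(math.sqrt(mse))
    return sum(fold_rmses) / k


# Synthetic data: a linear trend plus an alternating +/-1 perturbation.
xs = [float(i) for i in range(20)]
ys = [5.0 * x + 20.0 + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(xs)]
mean_rmse = cross_validated_rmse(xs, ys)
print(f"mean RMSE over 10 folds: {mean_rmse:.2f}")
```

Each point is predicted by a model that never saw it during fitting, so the averaged error estimates performance on new materials rather than on the training set.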
Other examples
•  Making interactive plots
•  Predicting formation energies:
–  from composition alone
–  with Voronoi-based structure features included
–  with Coulomb matrix and Orbital Field matrix
descriptors (reproducing previous studies in the
literature)
•  Creating an ML pipeline
https://github.com/hackingmaterials/matminer_examples
Outline
①  High-throughput DFT: thermoelectrics design
②  High-throughput DFT: do-it-yourself!
③  Future of the Materials Project database
④  Machine learning: examples
⑤  Machine learning: do-it-yourself!
⑥  Conclusion
Next steps
•  The purpose of this presentation was to explain
the benefits of the software tools available to
you
•  You will need to take more steps to actually
make use of the software
•  There are multiple resources to help you with
this!
Video tutorials are available
www.youtube.com/user/MaterialsProject
For general information / overview
•  There are papers on the software tools for
general information
Ward, L. et al. Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
Jain, A. et al. FireWorks: a dynamic workflow system designed for high-throughput applications. Concurr. Comput. Pract. Exp. 27, 5037–5059 (2015).
Mathew, K. et al. Atomate: A high-level interface to generate, execute, and analyze computational materials science workflows. Comput. Mater. Sci. 139, 140–152 (2017).
For installing / usage / examples
•  The online documentation is best for practical
usage
–  https://materialsproject.github.io/pymatgen/
–  https://materialsproject.github.io/fireworks/
–  https://materialsproject.github.io/custodian/
–  https://hackingmaterials.github.io/atomate/
–  https://hackingmaterials.github.io/matminer/
•  The online documentation includes installation,
examples, tutorials, and descriptions of how to
use the code
Conclusions
•  High-throughput computations, materials
databases, and machine learning are a new set of
tools for doing materials science
•  There are now various resources to help you get
started much more quickly than the last “PhD
generation”
•  If you are interested, give this software a try!
•  Thermoelectrics discovery
–  G.J. Snyder, M.A. White, G. Hautier & team
•  Atomate
–  K. Mathew (project lead) & team
•  Materials Project
–  K. Persson and MP Center team
•  Structure order parameters
–  N. Zimmermann (project lead) & team
•  Rocketsled
–  A. Dunn
•  Matminer
–  L. Ward (project lead) & team
•  Text mining
–  V. Tshitoyan, J. Dagdelen, L. Weston
•  All who provided feedback & contributed code to open-source software efforts!
•  Funding: DOE-BES, Computing: NERSC
Thank you!

Methods, tools, and examples (Part II): High-throughput computation and machine learning applied to materials design

  • 1.
    Methods, tools, andexamples (Part II): High-throughput computation and machine learning applied to materials design Anubhav Jain Energy Technologies Area Lawrence Berkeley National Laboratory Berkeley, CA LLNL Computational Chemistry Materials Science Summer Institute, 2018 Slides (already) posted to http://www.slideshare.net/anubhavster
  • 2.
    2 Summary of yesterday’stalk Quantum mechanics Density functional theory High-throughput DFT Materials databases Machine learning e– e– e– e– e– e–
  • 3.
    Outline 3 ①  High-throughput DFT:thermoelectrics design ②  High-throughput DFT: do-it-yourself! ③  Future of the Materials Project database ④  Machine learning: examples ⑤  Machine learning: do-it-yourself! ⑥  Conclusion
  • 4.
    Thermoelectric materials convertheat to electricity •  A thermoelectric material generates a voltage based on thermal gradient •  Applications –  Heat to electricity –  Refrigeration •  Advantages include: –  Reliability –  Easy to scale to different sizes (including compact) 4 www.alphabetenergy.com Alphabet Energy – 25kW generator
  • 5.
    Thermoelectric figure ofmerit 5 •  Many materials properties are important for thermoelectrics •  Focus is usually on finding materials that possess a high “figure of merit”, or zT, for high efficiency •  Target: zT at least 1, ideally >2 ZT = α2σT/κ power factor >2 mW/mK2 (PbTe=10 mW/mK2) Seebeck coefficient > 100 V/K Band structure + Boltztrap electrical conductivity > 103 /(ohm-cm) Band structure + Boltztrap thermal conductivity < 1 W/(m*K) •  e from Boltztrap •  l difficult (phonon-phonon scattering) Very difficult to balance these properties using intuition alone!
  • 6.
    Example: Seebeck ande– conductivity tradeoff 6 Heavy band: ü  Large DOS (higher Seebeck and more carriers) ✗ Large effective mass (poor mobility) Light band: ü  Small effective mass (improved mobility) ✗ Small DOS (lower Seebeck, fewer carriers) Multiple bands, off symmetry: ü  Large DOS with small effective mass ✗ Difficult to design! E k
  • 7.
    Finding good thermoelectricsis tough – can computations help? •  Thermoelectric (TE) materials must exhibit properties that are difficult to obtain simultaneously •  Can theory / computation help? As proposed as early as 2003 by Blake and Metiu1: 7 “With the cost of computing become relatively inexpensive one can envisage a time where one runs multiple computer test tube reactions like these on large Beowulf clusters - as a means of screening for new TE materials. Certainly it appears that in the future theory may be a very competent dance partner for what has previously been a solo experimental effort in searching for ever better TE materials.” 1. Blake and Metiu. Can theory help in the search for better thermoelectric materials? Chemistry, Physics, and Materials Science of Thermoelectric Materials: Beyond Bismuth Telluride, 2003 !
  • 8.
    We’ve initiated asearch for new bulk thermoelectrics 8 Initial procedure similar to Madsen (2006) On top of this traditional procedure we add: •  thermal conductivity model of Pohl-Cahill •  targeted defect calculations to assess doping •  Today - ~50,000 compounds screened! Madsen, G. K. H. Automated search for new thermoelectric materials: the case of LiZnSb. J. Am. Chem. Soc., 2006, 128, 12140–6 Chen, W. et al. Understanding thermoelectric properties from high- throughput calculations: trends, insights, and comparisons with experiment. J. Mater. Chem. C 4, 4414–4426 (2016).
  • 9.
    Known limitations totheory, but useful for estimation / pre-screening •  Limitations of our theoretical approach include: –  constant electronic relaxation time –  Cahill/Pohl minimum thermal conductivity –  “standard” DFT limitations – e.g., band gap •  We are “upgrading” the theory –  new model for electronic transport •  But: our actual problems (shown later) are usually not in known theory problems, but properties that were not estimated at all 9 Chen, W., et al., Understanding thermoelectric properties from high-throughput calculations: trends, insights, and comparisons with experiment. J. Mater. Chem. C 4, 4414–4426 (2016).! computed vs expt power factors
  • 10.
    Database of calculated transport properties •  All data (~300 GB total) is available for direct download through the Dryad repository linked in the following publication: F. Ricci, W. Chen, U. Aydemir, G.J. Snyder, G.-M. Rignanese, A. Jain, et al., An ab initio electronic transport database for inorganic materials, Sci. Data. 4 (2017) 170085.
  • 11.
    New materials from screening – TmAgTe2 (calcs) •  Calculations: trigonal p-TmAgTe2 could have power factor up to 8 mW/mK2 •  requires 10^20/cm^3 carriers. Zhu, H.; Hautier, G.; Aydemir, U.; Gibbs, Z. M.; Li, G.; Bajaj, S.; Pöhls, J.-H.; Broberg, D.; Chen, W.; Jain, A.; White, M. A.; Asta, M.; Snyder, G. J.; Persson, K.; Ceder, G. Computational and experimental investigation of TmAgTe2 and XYZ2 compounds, a new group of thermoelectric materials identified by first-principles high-throughput screening, J. Mater. Chem. C, 2015, 3
  • 12.
    TmAgTe2 (experiments) •  Expt: p-zT only 0.35 despite very low thermal conductivity (~0.25 W/mK) •  Limitation: carrier concentration (~10^17/cm^3) •  likely limited by Tm_Ag defects, as determined by follow-up calculations. Zhu, H.; Hautier, G.; Aydemir, U.; Gibbs, Z. M.; Li, G.; Bajaj, S.; Pöhls, J.-H.; Broberg, D.; Chen, W.; Jain, A.; White, M. A.; Asta, M.; Snyder, G. J.; Persson, K.; Ceder, G. Computational and experimental investigation of TmAgTe2 and XYZ2 compounds, a new group of thermoelectric materials identified by first-principles high-throughput screening, J. Mater. Chem. C, 2015, 3
  • 13.
    YCuTe2 – friendlier elements, higher zT (0.75) •  Calculations: p-YCuTe2 could only reach PF of 0.4 mW/mK2 •  SOC inhibits PF •  if thermal conductivity is low (e.g., 0.4 W/mK), we get zT ~1 •  Expt: zT ~0.75 – not too far from calculation limit •  carrier concentration of 10^19/cm^3 •  Decent performance, but unlikely to be improved with further optimization •  (Figure panels: experiment, computation) Aydemir, U.; Pöhls, J.-H.; Zhu, H.; Hautier, G.; Bajaj, S.; Gibbs, Z. M.; Chen, W.; Li, G.; Broberg, D.; Kang, S.D.; White, M. A.; Asta, M.; Ceder, G.; Persson, K.; Jain, A.; Snyder, G. J. YCuTe2: A Member of a New Class of Thermoelectric Materials with CuTe4-Based Layered Structure. J. Mater. Chem. C, 2016
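The zT estimate on this slide follows from the figure of merit defined earlier in the talk, zT = (power factor) × T / κ. Below is a minimal sketch of that arithmetic. The slide does not state the temperature, so the temperatures in the loop are assumptions chosen for illustration, and the function name is hypothetical.

```python
def figure_of_merit(power_factor, thermal_conductivity, temperature):
    """Return zT = PF * T / kappa, with all inputs in SI units.

    power_factor:          W/(m*K^2)
    thermal_conductivity:  W/(m*K)
    temperature:           K
    """
    return power_factor * temperature / thermal_conductivity


pf = 0.4e-3   # W/(m*K^2): the computed PF limit quoted on the slide (0.4 mW/mK2)
kappa = 0.4   # W/(m*K): the "low thermal conductivity" example value

# Assumed temperatures (not given on the slide).
for temperature in (600.0, 800.0, 1000.0):
    print(temperature, round(figure_of_merit(pf, kappa, temperature), 2))
```

With these two inputs zT reduces to T/1000, so zT ~1 corresponds to a temperature near 1000 K under this simple estimate.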
  • 14.
    Bournonites – CuPbSbS3 and analogues (no expt) •  Previously studied TE material –  Measured thermal conductivity < 1 W/m*K –  Measured Seebeck coefficient ~ 400 µV/K –  BUT electrical conductivity requires improvement – can calculations help? •  Calculations: p-PF is 13.8, but we know electronic conductivity will be lower than estimated –  Try ~320 chemical substitutions, see whether electronic scattering time is reduced in computational models or whether favorable defect diagram can be found for high carrier concentration –  Computations suggest a few interesting candidates, including CuPbSnSe3 and CuPbAsSe3 •  Experiments: preliminary experiments were unsuccessful in synthesizing bournonite CuPbSnSe3; investigations stopped. A. Faghaninia, G. Yu, U. Aydemir, M. Wood, W. Chen, G.-M. Rignanese, et al., A computational assessment of the electronic, thermoelectric, and defect properties of bournonite (CuPbSbS3) and related substitutions, Phys. Chem. Chem. Phys. 19 (2017) 6743–6756.
  • 15.
    Thermoelectrics screening: lessons so far •  When considering our screening strategy in the abstract, the major limitations appeared to be: –  no modeling of electron relaxation time –  limited modeling of thermal conductivity •  However, in reality the biggest limitation has been estimating dopability. This has been the major limitation for all our picks. –  The materials we pick just aren’t very dopable –  Computing doping limits is hard; we also learned not to trust GGA defect diagrams for this purpose (at least shift the band edges with an HSE band gap estimate) •  So the problems are often not in the known limitations of the theory, but in optimizing aspects of the material you are not computing at all
  • 16.
    Outline ①  High-throughput DFT: thermoelectrics design ②  High-throughput DFT: do-it-yourself! ③  Future of the Materials Project database ④  Machine learning: examples ⑤  Machine learning: do-it-yourself! ⑥  Conclusion
  • 18.
    With HT-DFT, we can generate data rapidly – what to do next? •  >4500 elastic tensors •  >900 piezoelectric tensors •  >48000 Seebeck coefficients + cRTA transport •  Atomate’s goal: make it easy to generate comparable data sets on your own. M. de Jong, W. Chen, H. Geerlings, M. Asta, and K. A. Persson, Sci. Data, 2015, 2, 150053. M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, M. Sluiter, C. K. Ande, S. Van Der Zwaag, J. J. Plata, C. Toher, S. Curtarolo, G. Ceder, K. A. Persson, and M. Asta, Sci. Data, 2015, 2, 150009. Ricci, Chen, Aydemir, Snyder, Rignanese, Jain, & Hautier (in submission)
  • 19.
    A “black-box” view of performing a calculation •  researcher: “What is the GGA-PBE elastic tensor of GaAs?” → “something” → results
  • 20.
    Unfortunately, the inside of the “black box” is usually tedious and “low-level” •  researcher: “What is the GGA-PBE elastic tensor of GaAs?” → lots of tedious, low-level work… → results •  Typical questions: Input file flags? SLURM format? How to fix ZPOTRF? •  The to-do list: ☐  set up the structure coordinates ☐  write input files, double-check all the flags ☐  copy to supercomputer ☐  submit job to queue ☐  deal with supercomputer headaches ☐  monitor job ☐  fix error jobs, resubmit to queue, wait again ☐  repeat process for subsequent calculations in workflow ☐  parse output files to obtain results ☐  copy and organize results, e.g., into Excel
  • 21.
    What would be a better way? •  researcher: “What is the GGA-PBE elastic tensor of GaAs?” → “something” → results
  • 22.
    What would be a better way? •  researcher: “What is the GGA-PBE elastic tensor of GaAs?” → results •  Workflows to run: ☐  band structure ☐  surface energies ✓  elastic tensor ☐  Raman spectrum ☐  QH thermal expansion
  • 23.
    Ideally the method should scale to millions of calculations •  researcher: “Start with all binary oxides, replace O->S, run several different properties” → results •  Workflows to run: ✓  band structure ✓  surface energies ✓  elastic tensor ☐  Raman spectrum ☐  QH thermal expansion ☐  spin-orbit coupling
  • 24.
    Atomate tries to make it easy, automatic, and flexible to generate data with existing simulation packages •  researcher: “Run many different properties of many different materials!” → results
  • 25.
    Atomate contains a library of simulation procedures •  VASP-based: –  band structure –  spin-orbit coupling –  hybrid functional calcs –  elastic tensor –  piezoelectric tensor –  Raman spectra –  NEB –  GIBBS method –  QH thermal expansion –  AIMD –  ferroelectric –  surface adsorption –  work functions •  Other: –  BoltzTraP –  FEFF method –  LAMMPS MD. Mathew, K. et al. Atomate: A high-level interface to generate, execute, and analyze computational materials science workflows, Comput. Mater. Sci. 139 (2017) 140–152.
  • 26.
    Each simulation procedure translates high-level instructions into a series of low-level tasks •  Example request: “What is the GGA-PBE elastic tensor of GaAs?” •  Quickly and automatically translate PI-style (minimal) specifications into well-defined FireWorks workflows. M. De Jong, W. Chen, T. Angsten, A. Jain, R. Notestine, A. Gamst, et al., Charting the complete elastic properties of inorganic crystalline compounds, Sci. Data. 2 (2015).
  • 27.
    Atomate thus encodes and standardizes knowledge about running various kinds of simulations from domain experts •  K. Mathew, J. Montoya, S. Dwaraknath, A. Faghaninia, M. Aykol, S.P. Ong, B. Bocklund, T. Smidt, H. Tang, I.H. Chu, M. Horton, J. Dagdelen, B. Wood, Z.K. Liu, J. Neaton, K. Persson, A. Jain + all past and present knowledge, from everyone in the group, everyone previously in the group, and our collaborators, about how to run calculations
  • 28.
    Full operation diagram: structure → workflow (job 1, job 2, job 3, job 4) → database of all workflows → automatically submit + execute → output files + database
  • 30.
    Crystal structure generation via pymatgen •  Pymatgen can retrieve crystal structures from the Materials Project database (MPRester class) •  It can also manipulate crystal structures –  substitutions –  supercell creation –  order-disorder (shown at right) –  interstitial finding –  surface / slab generation •  A visual interface to many of the tools is available in Materials Project’s “Crystal Toolkit” app •  Example: order-disorder – resolve partial or mixed occupancies into a fully ordered crystal structure (e.g., mixed oxide-fluoride site into separate oxygen/fluorine)
  • 32.
    Atomate’s main goal – convert structures to workflows •  Workflows consist of a series of jobs (“FireWorks”), each with multiple tasks. •  Atomate jobs typically (i) run a calculation and (ii) store the results in a database
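The job/task structure described on this slide can be sketched in plain Python. This is a toy illustration only: it does not use the atomate or FireWorks APIs, the job names and helper functions are hypothetical, and the "calculation" is a placeholder string rather than a real simulation.

```python
from graphlib import TopologicalSorter


def make_job(name):
    """Return a job as a list of two tasks: (i) run, (ii) store."""

    def run_calculation(state, database):
        # Placeholder for launching a real simulation code.
        state["result"] = f"output of {name}"

    def store_result(state, database):
        database.append({"job": name, "result": state["result"]})

    return [run_calculation, store_result]


def run_workflow(jobs, parents):
    """Run every job after its parents; each job runs its tasks in order."""
    database = []  # stand-in for the results database
    for name in TopologicalSorter(parents).static_order():
        state = {}
        for task in jobs[name]:
            task(state, database)
    return database


job_names = ["structure optimization", "static", "band structure"]
jobs = {name: make_job(name) for name in job_names}
# Each key maps a job to the set of jobs that must finish first.
parents = {
    "structure optimization": set(),
    "static": {"structure optimization"},
    "band structure": {"static"},
}
database = run_workflow(jobs, parents)
print([entry["job"] for entry in database])
```

Expressing dependencies as data (the `parents` mapping) rather than as hard-coded call order is what lets a workflow engine decide when and where each job runs.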
  • 34.
    FireWorks allows you to write your workflow once and execute (almost) anywhere •  Execute workflows locally or at a supercomputing center •  Queue systems supported –  PBS –  SGE –  SLURM –  IBM LoadLeveler –  NEWT (a REST-based API at NERSC) –  Cobalt (Argonne LCF)
  • 35.
    Dashboard with status of all jobs
  • 36.
    Other features •  Job provenance and automatic metadata storage •  Detect and rerun failures •  “Dynamic” workflows that change behavior based on results •  Customize job priorities •  Much more…
  • 38.
    Atomate – builders framework •  “Builders” start with base collections in a database and create higher-level collections that summarize information or add metadata
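The builder idea can be sketched as follows: group the documents of a base collection and write one summary document per material. This is a sketch under stated assumptions, not atomate's builder code; the document fields, the function name, and all numbers are invented placeholders.

```python
from collections import defaultdict

# Hypothetical base collection: one document per finished calculation.
# All values are placeholders, not computed results.
tasks = [
    {"formula": "GaAs", "task_type": "structure optimization", "energy_per_atom": -4.0},
    {"formula": "GaAs", "task_type": "static", "energy_per_atom": -4.1},
    {"formula": "Si", "task_type": "structure optimization", "energy_per_atom": -5.4},
]


def build_materials(task_docs):
    """Create a higher-level collection with one summary per formula."""
    grouped = defaultdict(list)
    for doc in task_docs:
        grouped[doc["formula"]].append(doc)

    materials = []
    for formula, docs in sorted(grouped.items()):
        materials.append({
            "formula": formula,
            "n_tasks": len(docs),
            "task_types": sorted({d["task_type"] for d in docs}),
            "lowest_energy_per_atom": min(d["energy_per_atom"] for d in docs),
        })
    return materials


materials = build_materials(tasks)
for doc in materials:
    print(doc)
```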
  • 39.
    The atomate database makes it easy to perform various analyses with pymatgen •  atomate output database(s) → phase diagrams, Pourbaix diagrams, diffusivity via MD, band structure analysis
  • 40.
    Many research groups have run tens of thousands of materials science workflows with atomate •  atomate now powers the Materials Project and will be used to run hundreds of thousands of simulations in the next year (www.materialsproject.org) •  Also used by: –  Persson research group, UC Berkeley –  Ong research group, UC San Diego –  Neaton research group, UC Berkeley –  Liu research group, Penn State –  Groups not developing on atomate! e.g., see “Thermal expansion of quaternary nitride coatings” by Tasnadi et al.
  • 42.
    Materials Project database •  Online resource of density functional theory simulation data for ~85,000 inorganic materials •  Includes band structures, elastic tensors, piezoelectric tensors, battery properties and more •  Nearly 55,000 registered users •  Free •  www.materialsproject.org. Jain et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
  • 43.
    Today when you search for materials by bulk modulus, you get back a table of results •  A long table of data is difficult to get a “feel” for •  Only shows 500 entries out of 6000+ possibilities
  • 44.
    Here’s an MP example we put together two years ago that hasn’t yet made it to the web site
  • 45.
    We also want to enable more sophisticated crystal structure searches •  “Find all compounds where a given cation is square planar coordinated by oxygen, has a bandgap between 1.1 and 3 eV, and is stable in water at pH = 2” •  Here there are: –  structural constraints (4-fold coordination w/oxygen) –  materials property constraints (bandgap = 1.1 – 3 eV) –  complex analysis constraints (stable in water at pH = 2) •  For now, focus on the first problem – how to analyze crystal structures to generate “features”?
  • 46.
    Defining local order parameters for various environments •  Tetrahedral order parameter, q_tet [1] (equation shown on slide) •  Use a given local order parameter with a threshold for motif recognition: if q_tet > q_thresh, then the motif is a tetrahedron; otherwise it is not (or not sufficiently) a tetrahedron. [1] Zimmermann et al., J. Am. Chem. Soc., 2015, 10.1021/jacs.5b08098
  • 47.
    We have now developed mathematical order parameters for various types of local environments
  • 48.
    Key step: how well do these work? 1. Order parameters clearly distinguish different environments even after thermal distortion 2. Work well in applications (defect site finding, diffusion characterization) [1] Zimmermann et al., Frontiers in Materials, 2017, doi: 10.3389/fmats.2017.00034
  • 49.
    Describe crystals by local environment •  Describe each site in a crystal as a vector of all order parameter values –  this tells you how much each site resembles different local environments, e.g. tetrahedral, octahedral, square pyramid, etc. •  Describe “crystal fingerprints” based on statistics of the site fingerprints –  e.g., this essentially tells you things like “spinel is 1/3 tetrahedral, 2/3 octahedral cation sites” •  Turns each structure into a numerical vector that describes its geometric local environments
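Turning per-site vectors into a single crystal fingerprint can be sketched as below. The choice of statistics (mean and population standard deviation per motif) is an assumption for illustration, the site values are idealized 0/1 placeholders for a hypothetical crystal, and the function name is not from any library.

```python
from statistics import mean, pstdev

MOTIFS = ("tetrahedral", "octahedral")

# Hypothetical site fingerprints: one row per site, one column per motif.
# Idealized values for a made-up crystal with one tetrahedral site and
# two octahedral sites.
site_fingerprints = [
    (1.0, 0.0),
    (0.0, 1.0),
    (0.0, 1.0),
]


def crystal_fingerprint(sites):
    """Concatenate per-motif means and standard deviations over all sites."""
    columns = list(zip(*sites))
    means = [mean(column) for column in columns]
    spreads = [pstdev(column) for column in columns]
    return means + spreads


fingerprint = crystal_fingerprint(site_fingerprints)
for motif, value in zip(MOTIFS, fingerprint):
    print(f"mean {motif} character: {value:.3f}")
```

Because the fingerprint is built from site statistics, its length does not depend on the number of atoms in the cell, which is what makes structures of different sizes comparable.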
  • 50.
    Application: crystal structure similarity •  Goal: determine crystal structure “similarity” between all structure pairs in MP database •  Example: BCC, CsCl, and Heusler are all orderings into the same essential crystal •  Difficulty: different bond lengths, # of atoms, small distortions, etc.
  • 51.
    Can cluster crystal structures by “local environment similarity”
  • 52.
    Results on MP web site, e.g. for BCC-like structures •  Target: W (https://www.materialsproject.org/materials/mp-91/) •  Similar structures (distance near 0): Cs3Sb, TiGaFeCo, CeMg2Cu
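Ranking structures by fingerprint distance can be sketched as follows. The slides do not say which distance metric is used, so the Euclidean distance here is an assumption, and the fingerprints and candidate names are invented placeholders.

```python
import math

# Hypothetical fingerprints (placeholders, not Materials Project data).
target = (0.0, 0.0, 1.0)
candidates = {
    "candidate A": (0.0, 0.05, 0.95),
    "candidate B": (0.9, 0.1, 0.0),
    "candidate C": (0.3, 0.3, 0.4),
}


def rank_by_similarity(target_fp, candidate_fps):
    """Return (name, distance) pairs sorted from most to least similar."""
    distances = {
        name: math.dist(target_fp, fp) for name, fp in candidate_fps.items()
    }
    return sorted(distances.items(), key=lambda item: item[1])


ranked = rank_by_similarity(target, candidates)
for name, distance in ranked:
    print(f"{name}: distance = {distance:.3f}")
```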
  • 53.
    More tools coming, e.g., more sophisticated tools to design and submit structures for computation •  www.materialsproject.org “Crystal Toolkit”: anyone can find, edit, and submit (suggest) structures •  Custom material → Submit → input generation (parameter choice), workflow mapping, supercomputer submission / monitoring, error handling, file transfer, file parsing / DB insertion •  Currently, this feature is available for: –  structure optimization –  band structures –  elastic tensors
  • 55.
    Future: rationally control the band structure •  Example: understanding the character of states that form the VBM / CBM •  in TmAgTe2, increased hybridization lowers the valley degeneracy •  Can we predict the orbital character of arbitrary materials? •  (Figure: DFT/GGA+U projected DOS for MoO3) Jain, A., Hautier, G., Ong, S. P. & Persson, K. New opportunities for materials informatics: Resources and data mining techniques for uncovering hidden relationships. J. Mater. Res. 1–18 (2016).
  • 56.
    Procedure for ranking likelihood to form VBM/CBM •  Data set of 2558 materials –  ionic materials evaluated via Bond Valence Sum method –  band gap of 0.2 eV or higher (clear VBM and CBM) –  avoid f-electron materials –  limited pool of elements/orbitals competing for VBM/CBM •  For each material: –  determine the “ionic orbitals” (e.g., Mn3+:d, O2-:p, P5+:p) that are present –  determine the contribution of each ionic orbital to VBM/CBM using projected DOS –  for each pair of ionic orbitals (e.g., Mn3+:d versus O2-:p), score a “win” for the ionic orbital that contributes more to VBM/CBM •  Use a Bradley-Terry model to determine a universal ranking from the series of pairwise competitions. Jain, A., Hautier, G., Ong, S. P. & Persson, K. New opportunities for materials informatics: Resources and data mining techniques for uncovering hidden relationships. J. Mater. Res. 1–18 (2016).
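Going from pairwise "wins" to a single ranking with a Bradley-Terry model can be sketched as below. The fitting routine is the standard minorization-maximization update for Bradley-Terry strengths, which may differ from what the cited paper used; the contest list uses invented counts for three anonymous orbitals, not data from the study.

```python
from collections import defaultdict


def bradley_terry(contests, iterations=200):
    """Fit Bradley-Terry strengths from a list of (winner, loser) pairs.

    Assumes every item has at least one win; otherwise its strength
    collapses to zero and the update can divide by zero.
    """
    wins = defaultdict(int)         # total wins per item
    pair_counts = defaultdict(int)  # contests per unordered pair
    items = set()
    for winner, loser in contests:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {item: 1.0 / len(items) for item in items}
    for _ in range(iterations):
        updated = {}
        for i in items:
            denominator = 0.0
            for pair, count in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}
                    denominator += count / (strength[i] + strength[j])
            updated[i] = wins[i] / denominator
        total = sum(updated.values())
        strength = {item: value / total for item, value in updated.items()}
    return strength


# Invented win counts for three anonymous orbitals A, B, C.
contests = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 4 + [("C", "A")]
)
strength = bradley_terry(contests)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
```

In this model the probability that item i beats item j is strength[i] / (strength[i] + strength[j]), so one number per orbital is enough to predict every pairwise competition.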
  • 57.
    Results: likelihood to form VBM/CBM •  Example interpretation: in a material with Cu1+:d, Fe3+:d, and O2-:p states, Cu1+:d is likely to form the VBM and Fe3+:d the CBM (this is true for FeCuO2) •  There are also problems with such a universal ranking (discussed in paper) that require refinement. Jain, A., Hautier, G., Ong, S. P. & Persson, K. New opportunities for materials informatics: Resources and data mining techniques for uncovering hidden relationships. J. Mater. Res. 1–18 (2016).
  • 58.
    Can we build a general optimizer? •  Generalizable forward solver (FireWorks) •  Supercomputing power (NERSC) •  Statistical optimization (various optimization libraries) (Figure: J. Mueller)
  • 59.
    Rocketsled: automatic materials screening that selects materials to compute AND submits them to supercomputer •  Test case: screening space of ~20,000 potential ABX3 perovskite combinations as water splitting materials, precomputed in DFT by a different group •  If a machine learning algorithm were in charge of picking the next compound based on past data, how efficient would it be?
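The question on this slide can be posed as a small experiment on a precomputed search space: let a model choose the next candidate from past evaluations and count how many evaluations it takes to reach the best one. The sketch below is not Rocketsled's algorithm or API. It uses a deliberately simple 1-nearest-neighbor surrogate, a synthetic 10 × 10 grid, and an invented score function, all of which are assumptions for illustration.

```python
import math
import random


def guided_search(candidates, score, seeds, target):
    """Evaluate candidates one at a time until `target` has been evaluated.

    The next pick is the unevaluated candidate whose nearest evaluated
    neighbor has the highest known score (a 1-nearest-neighbor surrogate).
    Returns the total number of evaluations, including the seeds.
    """
    known = {c: score(c) for c in seeds}
    while target not in known:
        remaining = [c for c in candidates if c not in known]

        def predicted(candidate):
            nearest = min(known, key=lambda k: math.dist(candidate, k))
            return known[nearest]

        pick = max(remaining, key=predicted)
        known[pick] = score(pick)  # "run the calculation"
    return len(known)


def random_search(candidates, target, rng):
    """Number of evaluations needed when candidates are tried in random order."""
    order = list(candidates)
    rng.shuffle(order)
    return order.index(target) + 1


# Synthetic search space and score (placeholders for precomputed DFT data).
candidates = [(x, y) for x in range(10) for y in range(10)]


def score(c):
    return -((c[0] - 7) ** 2 + (c[1] - 3) ** 2)


target = max(candidates, key=score)
seeds = [(0, 0), (9, 9)]

n_guided = guided_search(candidates, score, seeds, target)
n_random = random_search(candidates, target, random.Random(0))
print("guided evaluations:", n_guided)
print("random evaluations:", n_random)
```

Comparing the two counts over many random seeds and starting points gives a rough measure of how much a model-guided choice helps on a given search space.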
  • 60.
    Text mining: learning from scientific abstracts •  “Learning” what a scientific study is about from >1 million materials science abstracts •  Diagram labels: Matstract corpus; unlabeled data; data labels; feature engineering; text cleaning; tokenization; POS tag labels; word embeddings (word2vec); text processing; hand-crafted features; supervised learning; neural network (LSTM); logistic regression; train/test sets; named entities
  • 61.
    Application: a revised materials search engine •  Auto-generated summaries of materials based on text mining
  • 62.
    Application: guessing materials for an application •  We asked our text mining engine to guess 6 compositions that could be associated with the word “thermoelectric”, but not studied as thermoelectric in our corpus •  We then independently tested these guesses against our database of computed thermoelectric quantities –  5/6 were better than 80% of the compounds in the DB –  4/6 beat the 90th percentile –  3/6 beat the 95th percentile –  1/6 beat the 99.5th percentile, i.e. is a “1 in 200” compound •  More results to come …
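The percentile statements above amount to asking what fraction of the database scores below a given compound. A minimal sketch, using a synthetic list of 200 placeholder scores rather than the real database of computed thermoelectric quantities:

```python
from bisect import bisect_left


def percentile_rank(value, population):
    """Percentage of the population with a score strictly below `value`."""
    ordered = sorted(population)
    return 100.0 * bisect_left(ordered, value) / len(ordered)


# Synthetic stand-in for a database of scores: the integers 1..200.
database = list(range(1, 201))

top_score = percentile_rank(200, database)   # beats 199 of 200 entries
good_score = percentile_rank(161, database)  # beats 160 of 200 entries
print(top_score, good_score)
```

In this synthetic database the single best entry sits at the 99.5th percentile, which is the sense in which such a compound is "1 in 200".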
  • 64.
    Machine learning: the big problem in my view is connecting data to ML algorithms through features •  Lots of data on complex objects that you want to interrelate •  Well developed data-mining routines that work only on numbers (ideally ones with high relevance to your problem): clustering, regression, feature extraction, model-building, etc. •  Need to transform materials science objects into a set of physically relevant numerical data (“features” or “descriptors”)
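The transformation from a materials object to numbers can be sketched with a toy featurizer. This is not matminer code: the class, its method names, and the choice of feature are illustrative assumptions, and only a handful of elements are included in the lookup table.

```python
class MeanAtomicNumber:
    """Toy featurizer: turns a composition into its mean atomic number."""

    ATOMIC_NUMBERS = {"C": 6, "O": 8, "Ti": 22, "Te": 52, "Pb": 82}

    def feature_labels(self):
        return ["mean atomic number"]

    def featurize(self, composition):
        """`composition` maps element symbol -> number of atoms per formula unit."""
        total_atoms = sum(composition.values())
        weighted = sum(
            self.ATOMIC_NUMBERS[element] * amount
            for element, amount in composition.items()
        )
        return [weighted / total_atoms]

    def featurize_many(self, compositions):
        return [self.featurize(c) for c in compositions]


featurizer = MeanAtomicNumber()
materials = {
    "TiO2": {"Ti": 1, "O": 2},
    "C": {"C": 1},
    "PbTe": {"Pb": 1, "Te": 1},
}
feature_matrix = featurizer.featurize_many(materials.values())
for name, row in zip(materials, feature_matrix):
    print(name, dict(zip(featurizer.feature_labels(), row)))
```

The resulting rows of numbers are what generic clustering or regression routines can consume directly.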
  • 65.
    Goal of matminer: connect materials data with data mining algorithms and data visualization libraries. Ward, L. et al. Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152, 60–69 (2018).
  • 66.
    >40 featurizer classes can generate thousands of potential descriptors •  Matminer contains a library of descriptors for various materials science entities •  Usage pattern: feat = EwaldEnergy([options]); y = feat.featurize([input_data]) •  compatible with scikit-learn pipelining •  automatically deploys multiprocessing to parallelize over data •  includes citations to methodology papers
  • 67.
    matminer also contains easy integration with Plotly for quickly creating interactive, shareable HTML graphs
  • 68.
    Interactive Jupyter notebooks demonstrate use cases: https://github.com/hackingmaterials/matminer_examples
  • 69.
    Example 1: combining data from Citrine and MP to plot computed vs. experimental band gap •  Pipeline: materials databases (Citrine, Materials Project) → data retrieval → DataFrame → data visualization •  DataFrame (MATERIAL | PROPERTY): TiO2 rutile | gap = 3.0 eV; C diamond | gap = 5.5 eV; …; PbTe rocksalt | gap = 0.3 eV •  Run the full Jupyter notebook: https://github.com/hackingmaterials/matminer_examples (experiment_vs_computed_bandgap.ipynb)
  • 70.
    Example 2: predicting bulk modulus from MP data •  Pipeline: materials databases (Materials Project) → data retrieval → data featurization → Python ML libraries •  DataFrame (MATERIAL | FEATURES | PROPERTY): TiO2 rutile | F11 F12 … F1N | E = 400; C diamond | F21 F22 … F2N | E = 230; …; PbTe rocksalt | FM1 FM2 … FMN | E = 120 •  Mean RMSE: 20 GPa (10-fold CV) •  Run the full Jupyter notebook: https://github.com/hackingmaterials/matminer_examples (intro_predicting_bulk_modulus.ipynb)
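The "mean RMSE over 10-fold CV" metric quoted above can be sketched in plain Python. This is not the notebook's code: the model is a one-feature least-squares line instead of a model built with matminer features and Python ML libraries, and the data are synthetic placeholders rather than Materials Project bulk moduli.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_rmse(xs, ys, n_folds=10):
    """Mean of the per-fold RMSE values; fold k holds out every n_folds-th point."""
    fold_rmses = []
    for fold in range(n_folds):
        test_idx = set(range(fold, len(xs), n_folds))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in test_idx]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i in test_idx]
        slope, intercept = fit_line(*zip(*train))
        squared_errors = [(slope * x + intercept - y) ** 2 for x, y in test]
        fold_rmses.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(fold_rmses) / n_folds


# Synthetic data: a linear trend plus Gaussian noise (placeholder values).
rng = random.Random(0)
features = [rng.uniform(0.0, 10.0) for _ in range(100)]
targets = [20.0 * x + 50.0 + rng.gauss(0.0, 5.0) for x in features]

rmse = cross_validated_rmse(features, targets)
print(f"mean RMSE over 10 folds: {rmse:.2f}")
```

Each point is held out exactly once, so the reported error reflects predictions on data the model did not see during fitting.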
  • 71.
    Other examples •  Making interactive plots •  Predicting formation energies: –  from composition alone –  with Voronoi-based structure features included –  with Coulomb matrix and Orbital Field matrix descriptors (reproducing previous studies in the literature) •  Creating an ML pipeline •  https://github.com/hackingmaterials/matminer_examples
  • 73.
    Next steps •  The purpose of this presentation was to explain the benefits of the software tools available to you •  You will need to take more steps to actually make use of the software •  There are multiple resources to help you with this!
  • 74.
    Video tutorials are available: www.youtube.com/user/MaterialsProject
  • 75.
    For general information / overview •  There are papers on the software tools for general information: Ward et al. Matminer: An open source toolkit for materials data mining. Computational Materials Science, 152, 60–69 (2018). Jain, A. et al. FireWorks: a dynamic workflow system designed for high-throughput applications. Concurr. Comput. Pract. Exp. 27, 5037–5059 (2015). Mathew, K. et al. Atomate: A high-level interface to generate, execute, and analyze computational materials science workflows. Comput. Mater. Sci. 139, 140–152 (2017).
  • 76.
    For installing / usage / examples •  The online documentation is best for practical usage –  https://materialsproject.github.io/pymatgen/ –  https://materialsproject.github.io/fireworks/ –  https://materialsproject.github.io/custodian/ –  https://hackingmaterials.github.io/atomate/ –  https://hackingmaterials.github.io/matminer/ •  The online documentation includes installation, examples, tutorials, and descriptions of how to use the code
  • 77.
    Conclusions •  High-throughput computations, materials databases, and machine learning are a new set of tools for doing materials science •  There are now various resources to help you get started much more quickly than the last “PhD generation” •  If you are interested, give this software a try!
  • 78.
    Thank you! •  Thermoelectrics discovery –  G.J. Snyder, M.A. White, G. Hautier & team •  Atomate –  K. Mathew (project lead) & team •  Materials Project –  K. Persson and MP Center team •  Structure order parameters –  N. Zimmermann (project lead) & team •  Rocketsled –  A. Dunn •  Matminer –  L. Ward (project lead) & team •  Text mining –  V. Tshitoyan, J. Dagdelen, L. Weston •  All who provided feedback & contributed code to open-source software efforts! •  Funding: DOE-BES, Computing: NERSC