Accelerating New Materials Design with
Supercomputing and Machine Learning
Anubhav Jain
Lawrence Berkeley National Laboratory
Alvarez seminar series
Slides (already) posted to hackingmaterials.lbl.gov
Outline
What I did BEFORE the Alvarez fellowship
What I did DURING the Alvarez fellowship
What I've been doing AFTER the Alvarez fellowship
• High school research program
• SULI internships
• PhD research
• High-throughput workflows for
materials science
• Open source software
• Launching the Materials Project
• Growing the Materials Project
• Screening for functional materials
• Applying AI / ML to materials science
• Integrating NLP
• Future directions and other projects
What’s next?
I began my research career with a high school
internship program at Brookhaven NL
• The project was supposed to
be about how to build large
rectifier circuits to ensure a
stable power supply to
superconducting magnets
• But mostly, we just goofed
around
Spending half a day getting a pen cap out of a colleague’s cast
Fun with the Plan9 operating system and circuit diagrams
In college, I did an initial SULI internship on
functional MRI (fMRI) research
• We wanted to see if all fMRI
signals were real or if some
signals were spurious
• I also did a lot of manual data
cleaning, and I always wondered
why they didn’t just write a
program to do this kind of grunt
work (but was too shy to ask)
• Sometimes I actually ran live
scans on subjects with little
assistance, something that would
definitely not pass safety checks
today … (and maybe didn’t at the
time either)
increasing rotation leads to more
spurious signals
I then did another SULI internship, this time
engaging for a couple of years and publishing!
• Goal was to have a robotic system for
performing protein crystallography –
allowing experiments to run
autonomously / overnight and at
higher rates
• My portion of the robot system was
to automatically align protein crystals
in an x-ray using computer vision
• Lots of Java programming and
traditional computer vision
Today, I’ve advised a lot of SULI interns myself – many of whom are now
PhD students pushing forward their own research boundaries!
For my PhD application, I applied (and was
admitted) to do biology / materials science
research
• I was targeting drug delivery materials
• But I didn’t get a position in any labs related to biomaterials
• So I ended up taking a position with my thermodynamics teacher (G.
Ceder), who seemed smart and also played with dinosaur figurines in
class
In the end, I focused on automating
calculations for materials screening for my PhD
• Traditionally, theoretical
calculations on materials
were performed one at a
time, mostly manually
• We built a complex system
to automate these
calculations and used it to
screen Li-ion battery
cathode materials
• We also began a process of
putting the data online …
$$i\hbar \frac{d}{dt} \Psi(\{r_i\}; t) = \hat{H} \Psi(\{r_i\}; t)$$
Total energy
Optimized structure
Magnetic ground state
Charge density
Band structure / DOS
$$H = \sum_{i=1}^{N_e} \nabla_i^2 + \sum_{i=1}^{N_e} V_{nuclear}(r_i) + \sum_{i=1}^{N_e} V_{effective}(r_i)$$
Setyawan, Curtarolo
Comp. Mat. Sci (2010) 384
Xeon cores
134,000
lines of code
50
core tables
Chemistry | Novelty | Energy density vs. LiFePO4 | % of theoretical capacity already achieved in the lab
Li9V3(P2O7)3(PO4)2 | New | 20% greater | ~65%
Origin: V to Fe substitution in Li9Fe3(P2O7)3(PO4)2*
Remarks:
• Structure has “layers” and “tunnels”
• Pyrophosphate-phosphate mixture
• Potential 2-electron material
My thesis defense talk would hint at my
Alvarez postdoc work …
Yet another workflow software ….
• By 2011, our computing infrastructure at MIT used for battery screening
was showing a lot of wear and tear / barnacles. It was also not suitable
for running at LBNL’s supercomputers
• We essentially rebuilt 5 years of work from scratch
• Part of this was creating new workflow software that also merged with
LBNL work called “rockets” for launching jobs. We called the new
software “FireWorks” for naming harmony, a name I now regret …
• This was most of what I did during the Alvarez fellowship
• Supported by the infinite patience of Kristin Persson, who co-advised my
Alvarez fellowship and provided supplemental funding, and of D.
Bailey, who served as my CRD host
• Note that in the end, things were better by scrapping almost all of
“rockets” and designing FireWorks from scratch
https://xkcd.com/927/
Note: we did think a lot about whether to use an existing
workflow package, but none met our needs for (i) ease of use
(could be operated by scientists), (ii) good documentation,
and (iii) compatibility with error-prone and dynamic high-
throughput workflows
I spent a lot of time developing FireWorks and associated
infrastructure for high-throughput computing
• We did several things that were new at
the time
• Based everything off MongoDB
• Really planned for job failures and reruns
• Took into account that duplicate steps of
workflows may be submitted, but should
run only once
• Allowed jobs to modify their own workflow
graph or create new workflows
• Spent a lot of time on documentation and
support
• FireWorks continues to be an active
project and is now largely community
supported
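The design points listed above (planning for failures and reruns, running duplicate steps only once, and letting jobs extend their own workflow) can be illustrated with a toy in-memory sketch. This is not the FireWorks API and it uses no MongoDB; every name here (`ToyLaunchPad`, `add_job`, `run_all`, `relax`, `static_run`) is hypothetical and chosen only for illustration.

```python
import hashlib
import json
from collections import deque


class ToyLaunchPad:
    """Toy in-memory job queue; hypothetical, not the FireWorks API."""

    def __init__(self, max_retries=2):
        self.max_retries = max_retries
        self.queue = deque()
        self.results = {}  # job fingerprint -> result, used to detect duplicates
        self.failed = []   # specs that exhausted their retries

    @staticmethod
    def fingerprint(spec):
        # Identical specs hash to the same key, so duplicates are detectable.
        return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()

    def add_job(self, func, spec):
        self.queue.append((func, spec, 0))

    def run_all(self):
        while self.queue:
            func, spec, attempts = self.queue.popleft()
            key = self.fingerprint(spec)
            if key in self.results:
                continue  # duplicate step: keep the stored result, do not rerun
            try:
                result, new_jobs = func(spec)
            except RuntimeError:
                if attempts < self.max_retries:
                    self.queue.append((func, spec, attempts + 1))  # rerun later
                else:
                    self.failed.append(spec)
                continue
            self.results[key] = result
            for new_func, new_spec in new_jobs:  # dynamic workflow: jobs add jobs
                self.add_job(new_func, new_spec)
        return self.results


calls = {"relax": 0, "static": 0}


def static_run(spec):
    calls["static"] += 1
    return f"static energy for {spec['formula']}", []


def relax(spec):
    calls["relax"] += 1
    if calls["relax"] == 1:
        raise RuntimeError("simulated node failure")  # the first attempt fails
    # On success, the job appends a follow-up step to its own workflow.
    follow_up = {"formula": spec["formula"], "step": "static"}
    return f"relaxed {spec['formula']}", [(static_run, follow_up)]


pad = ToyLaunchPad()
pad.add_job(relax, {"formula": "LiFePO4", "step": "relax"})
pad.add_job(relax, {"formula": "LiFePO4", "step": "relax"})  # duplicate submission
results = pad.run_all()
print(calls, len(results), pad.failed)
```

In this run the relaxation is attempted twice (one simulated failure, one success), the duplicate submission is skipped once a result exists, and the follow-up static step is created by the job itself rather than declared up front.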
I spent a lot of time programming …
Rocketsled: Use FireWorks to perform virtual active
learning, even when simulations are expensive
and require supercomputers
Borealis: Run FireWorks in the cloud via GCP
[[externally developed and maintained]]
atomate: Use FireWorks to run materials
science calculations
Growing the Materials Project
• Apart from the workflow software
itself, I was running a lot of density
functional calculations to populate a
public database of calculations (The
Materials Project, headed by K.
Persson)
• Interest in this resource grew steadily,
but the team was only a few core members
• Each of us was wearing lots of hats –
materials scientists, web developers,
workflow programmers, REST API
developers
“I am so incredibly happy an
effort like this exists now... I
have been lamenting for years
that despite the importance of
materials we have remained
relatively unaided by the
information age. Please please
don't stop growing!” Cymbet
A continuing challenge has been that every
mistake in high-throughput is magnified …
“I’m overly paranoid probably because I (and others on
the Materials Project team) spend inordinate
amounts of time fixing problems in the Materials Project
data. A search for the word “bug” in my email gives ~500
results in the past year (and there are additional
“issues”, “problems”, and “errors”).
… trying to exterminate the Materials Project’s bugs can
be somewhat maddening – the past few years have
demonstrated that the infestation will always return,
usually based on something that appears innocent at
first glance …
For example, on multiple occasions, code that
incorrectly set (or failed to set) a single input tag
ruined tens of thousands of dollars worth of computing
and several weeks of work. Currently, we’re struggling
to find out whether old bugs in a crystal structure
matching code may have affected what we’ve computed and
potentially any of the reported results …”
(myself in a blog post about MP work)
Transitioning to LBNL staff
• I became staff in 2013 after being hired by K. Persson
• At first, this mainly meant that I spent more time training new
postdocs in some of the things we were doing and helping launch a
new project on multivalent batteries
• The real career game changer was when I got a DOE Early Career
Award in 2015, which came with enough funding to make me an
independent researcher essentially overnight
• Nevertheless, I have continued working on past projects like the Materials
Project to this day (first as co-PI, now as Associate Director)
The Materials Project continues to grow
• The Materials Project has grown
beyond what most of us imagined
• The team now includes ~3-4 staff
dedicated to infrastructure and
scaling
• Staff web developer currently needed!
• FireWorks is still used to run the
calculations
• We’ve begun new outreach efforts,
like the MP seminar series
• https://materialsproject.org/seminars
> 180,000 registered
users
A new phase of Materials Project: researchers can contribute
their own data sets to MP
1. Google links to
Materials Project page
2. Materials Project links
to your contribution
3. Your data set and
paper are linked
Today, the Materials Project has led to
many examples of “computer to lab”
success stories
MP for p-type transparent conductors
References
✦ Hautier, G., Miglio, A., Ceder, G., Rignanese, G.-M. & Gonze, X. Identification and
design principles of low hole effective mass p-type transparent conducting oxides.
Nature Communications 4, (2013)
✦ Bhatia, A. et al. High-Mobility Bismuth-based Transparent p-Type Oxide from High-
Throughput Material Screening. Chemistry of Materials 28, 30–34 (2015)
✦ Ricci, F. et al. An ab initio electronic transport database for inorganic materials.
Scientific Data 4, (2017)
Prediction
Screening based on band
gap, transport properties
and band alignments.
Experiment
Predictions revealed
material with s–p
hybridized valence band
(thought to correlate
well with dopability).
When synthesized, the
material has excellent
transparency and is readily
dopable with K.
Ba2BiTaO6
MP for thermoelectrics
References
✦ Aydemir, U. et al. YCuTe2: a member of a new class of thermoelectric materials with
CuTe4-based layered structure. Journal of Materials Chemistry A 4, 2461–2472 (2016)
✦ Zhu, H. et al. Computational and experimental investigation of TmAgTe2 and
XYZ2 compounds, a new group of thermoelectric materials identified by first-principles
high-throughput screening. Journal of Materials Chemistry C 3, 10554–10565 (2015).
✦ Pöhls, J.-H. et al. Metal phosphides as potential thermoelectric materials. Journal of
Materials Chemistry C 5, 12441–12456 (2017).
Prediction
Screening of tens of
thousands of materials
with predicted electron
transport properties
revealed a family of
promising XYZ2
candidates
Experiment
Several materials made:
YCuTe2 (zT = 0.75),
TmAgTe2 (zT = 0.47, 1.8
theoretical), novel NiP2
phosphide
TmAgTe2
MP for phosphors
References
✦ Wang, Z. et al. Mining Unexplored Chemistries for Phosphors for High-Color-
Quality White-Light-Emitting Diodes. Joule 2, 914–926 (2018)
✦ Li, S. et al. Data-Driven Discovery of Full-Visible-Spectrum Phosphor. Chemistry of
Materials 31, 6286–6294 (2019)
✦ Ha, J. et al. Color tunable single-phase Eu2+ and Ce3+ co-activated Sr2LiAlO4
phosphors. Journal of Materials Chemistry C 7, 7734–7744 (2019)
Prediction
Statistical analysis of existing
materials that co-occur with
word ‘phosphor’ followed
by structure prediction for
new materials
Experiment
Predicted first known Sr-Li-
Al-N quaternary, showed
green-yellow/blue emission
with quantum efficiency of
25% (Eu), 40% (Ce), 55%
(co-activated Eu, Ce)
Sr2LiAlN4
One of the applications we looked into was
thermoelectric materials
• A thermoelectric material
generates a voltage from a
thermal gradient
• Applications
• Heat to electricity
• Refrigeration
• Advantages include:
• Reliability
• Easy to scale to different sizes
(including compact)
www.alphabetenergy.com
It is difficult to balance trade-offs in
thermoelectric properties, so we use screening
$ZT = \alpha^2 \sigma T / \kappa$
power factor ($\alpha^2 \sigma$)
> 2 mW/mK² (PbTe = 10 mW/mK²)
Seebeck coefficient ($\alpha$)
> 100 μV/K
Band structure + BoltzTraP
electrical conductivity ($\sigma$)
> 10³ /(ohm-cm)
Band structure + BoltzTraP
thermal conductivity ($\kappa$)
< 1 W/(m*K)
• κ_e from BoltzTraP
• κ_l difficult (phonon-phonon scattering)
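As a worked example of the figure of merit ZT = α²σT/κ, the sketch below evaluates the power factor and ZT for a hypothetical material near the screening thresholds. The function names and the input values (a Seebeck coefficient of 200 μV/K, room temperature) are assumptions made for illustration, not values from the screening study.

```python
def power_factor(seebeck_v_per_k, conductivity_s_per_m):
    """Power factor alpha^2 * sigma, in W/(m K^2)."""
    return seebeck_v_per_k ** 2 * conductivity_s_per_m


def figure_of_merit(seebeck_v_per_k, conductivity_s_per_m, kappa_w_per_mk, temperature_k):
    """Dimensionless ZT = alpha^2 * sigma * T / kappa."""
    pf = power_factor(seebeck_v_per_k, conductivity_s_per_m)
    return pf * temperature_k / kappa_w_per_mk


# Hypothetical inputs (assumed for illustration only)
alpha = 200e-6     # Seebeck coefficient in V/K, i.e. 200 uV/K
sigma = 1e3 * 100  # 10^3 /(ohm-cm) converted to S/m
kappa = 1.0        # thermal conductivity in W/(m K)
T = 300.0          # temperature in K

pf = power_factor(alpha, sigma)
zt = figure_of_merit(alpha, sigma, kappa, T)
print(f"PF = {pf * 1e3:.1f} mW/(m K^2), ZT = {zt:.2f}")
# prints "PF = 4.0 mW/(m K^2), ZT = 1.20"
```

Because the Seebeck coefficient enters squared, doubling it at fixed conductivity quadruples the power factor, which is one reason the trade-off with carrier concentration matters so much.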
Heavy band:
✓ Large DOS
(higher Seebeck and more carriers)
✗ Large effective mass
(poor mobility)
Light band:
✓ Small effective mass
(improved mobility)
✗ Small DOS
(lower Seebeck, fewer carriers)
Multiple bands, off symmetry:
✓ Large DOS with small effective
mass
✗ Difficult to design!
[Band diagrams: E vs. k]
~50,000 crystal
structures and
band structures
from Materials
Project are used
as a source
F. Ricci, et al., An ab initio electronic transport
database for inorganic materials, Sci. Data. 4
(2017) 170085.
We compute electronic
transport properties
with BoltzTraP and
minimum thermal
conductivity (Cahill-
Pohl) for some
compounds
About 300GB of
electronic transport
data is generated. All
data is available free
for download.
We found several compounds with promising
figure-of-merit, but no breakthroughs
TmAgTe2
• Calculations: trigonal p-TmAgTe2 could have power factor up to 8 mW/mK²
• requires 10^20/cm³ carriers
• Expt: p-type zT only 0.35 despite very low thermal conductivity (~0.25 W/mK)
• Limitation: carrier concentration (~10^17/cm³)
• likely limited by TmAg defects, as determined by follow-up calculations
• Later, we achieved zT ~ 0.47 using Zn-doping
YCuTe2
• Calculations: p-YCuTe2 could only reach PF of 0.4 mW/mK²
• SOC inhibits PF
• if thermal conductivity is low (e.g., 0.4 W/mK), we get zT ~1
• Expt: zT ~0.75 – not too far from calculation limit
• carrier concentration of 10^19
• Decent performance, but unlikely to be improved with further optimization
[Plot legend: experiment vs. computation]
We also developed a new method for more
accurately screening electronic transport
Old method (BoltzTraP – screening is qualitative w/pitfalls)
New method (AMSET – screening is more quantitative)
Ganose, A. M.; Park, J.; Faghaninia, A.; Woods-Robinson, R.; Persson, K. A.; Jain, A. Efficient Calculation of Carrier Scattering Rates from First
Principles. Nat Commun 2021, 12 (1), 2222.
Scattering mechanisms and the inputs each one needs:
• acoustic deformation potential (ad): deformation potential, elastic tensor
• ionized impurity (ii): dielectric tensor
• piezoelectric (pi): dielectric tensor, piezoelectric tensor
• polar optical phonon (po): dielectric tensor, polar phonon frequency
• The method, AMSET, was in development for
~5 years and took a very talented postdoc (A.
Ganose) to finalize everything
• Can calculate e- mobility + Seebeck coefficient
much more accurately than standard models
What about machine learning?
• “Simulation-only” screening is
becoming rarer
• More common now is to integrate
machine learning models before
performing expensive calculations
• Our group developed a popular
open-source library called
“matminer” to help with ML in
materials
• Since then, we’ve been interested in
benchmarking methods from the
community
MATERIAL | FEATURES | PROPERTY
TiO2 rutile | F11 F12 … F1N | gap = 3.0 eV
C diamond | F21 F22 … F2N | gap = 5.5 eV
… | … | …
PbTe rocksalt | FM1 FM2 … FMN | gap = 0.3 eV
[Diagram: Materials Databases (MPDS, Citrine, Materials Project), Data Retrieval, Data Featurization, Data Visualization, Python ML Libraries]
Proper benchmarking is becoming more of an
issue in materials ML
New algorithms are constantly reported!
But it is very difficult to compare
algorithms
[Figure: data set used in study A, data set used in study B, data set used in study C]
• Different data sets
• Source (e.g., OQMD vs MP vs JARVIS)
• Quantity (e.g., MP 2019 vs MP 2022)
• Subset / data filtering (e.g., ehull<X)
• Different evaluation metrics
• Test set vs. cross validation?
• Different test set fraction?
• Can be difficult to install and retrain
many of these algorithms
MAE 5-Fold CV = 0.102 eV vs. RMSE Test set = 0.098 eV ?
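To see why an MAE and an RMSE cannot be compared directly, the minimal sketch below computes both metrics on the same set of predictions. The numbers are made up for illustration and do not come from any published model.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical band gaps in eV (illustrative values only)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.3, 3.5]

print(f"MAE = {mae(y_true, y_pred):.3f} eV, RMSE = {rmse(y_true, y_pred):.3f} eV")
```

The same model scores 0.25 eV by MAE and 0.30 eV by RMSE here, so a gap of a few meV between two papers that report different metrics on different splits says little about which algorithm is better.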
Matbench includes 13 different ML tasks
Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference
Algorithm. npj Comput Mater 2020, 6 (1), 138. https://doi.org/10.1038/s41524-020-00406-3.
How to read the Matbench leaderboard
[Leaderboard figure annotations: bigger datasets; better relative performance]
• A scaled error of 0.0 means all
predictions are correct
• A scaled error of 1.0 is equal
to always predicting the
average value
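The scaled error described above can be sketched as the model's MAE divided by the MAE of a baseline that always predicts the average. For simplicity this sketch takes the average over the evaluated values themselves, which is an assumption; the function names are hypothetical and this is not the Matbench code.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def scaled_error(y_true, y_pred):
    """Model MAE divided by the MAE of always predicting the mean value."""
    mean_value = sum(y_true) / len(y_true)
    baseline_mae = mae(y_true, [mean_value] * len(y_true))
    return mae(y_true, y_pred) / baseline_mae


# Hypothetical target values and model predictions (illustrative only)
y_true = [0.0, 1.0, 2.0, 3.0]
y_model = [0.5, 1.0, 2.0, 2.5]

print(scaled_error(y_true, y_true))       # perfect predictions
print(scaled_error(y_true, [1.5] * 4))    # always predicting the mean
print(scaled_error(y_true, y_model))      # a model in between
```

Dividing by the baseline makes tasks with different units and value ranges comparable on one axis, which is what lets a single leaderboard plot mix formation energies, band gaps, and yield strengths.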
Magpie + SCM Model
• Composition features using
chemical descriptors, e.g.,
averages/stdevs of elemental
properties such as melting
point and electronegativity
• Structure features using sine
Coulomb matrix
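The composition features described above can be sketched as composition-weighted statistics of an elemental property. The example below is a simplified illustration in the spirit of those descriptors, not the Magpie or matminer implementation; the helper names and the three-element property table are assumptions, and the formula parser handles only simple formulas without parentheses.

```python
import math
import re

# Pauling electronegativities for a few elements (small illustrative table)
ELECTRONEGATIVITY = {"Ti": 1.54, "O": 3.44, "Ba": 0.89}


def parse_formula(formula):
    """Parse a simple formula such as 'BaTiO3' into {element: amount}."""
    counts = {}
    for element, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[element] = counts.get(element, 0.0) + (float(amount) if amount else 1.0)
    return counts


def composition_features(formula, table):
    """Composition-weighted mean and standard deviation of one elemental property."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    fractions = {el: n / total for el, n in counts.items()}
    mean = sum(frac * table[el] for el, frac in fractions.items())
    variance = sum(frac * (table[el] - mean) ** 2 for el, frac in fractions.items())
    return {"mean": mean, "std": math.sqrt(variance)}


features = composition_features("TiO2", ELECTRONEGATIVITY)
print(features)
```

Repeating this for many elemental properties yields a fixed-length feature vector for any composition, which is what lets a standard model such as a random forest be trained on chemical formulas.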
Ward, L., Agrawal, A., Choudhary, A. et al. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2, 16028 (2016).
Faber, Felix, et al. "Crystal structure representations for machine learning models of formation energies." International Journal of Quantum Chemistry 115.16 (2015): 1094-1101.
https://matbench.materialsproject.org
MODNet Model
De Breuck, P.-P.; Evans, M. L.; Rignanese, G.-M. Robust Model Benchmarking and Bias-Imbalance in Data-Driven Materials Science: A Case Study on MODNet. Journal of Physics:
Condensed Matter, Volume 33, Number 40, 2021
https://matbench.materialsproject.org
CGCNN Model
Xie, T.; Grossman, J. C. Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. Phys. Rev. Lett. 2018, 120 (14), 145301.
https://matbench.materialsproject.org
ALIGNN Model
Choudhary, Kamal, and Brian DeCost. "Atomistic Line Graph Neural Network for improved materials property predictions." npj Computational Materials 7.1 (2021): 1-8.
https://matbench.materialsproject.org
How much have we
improved overall?
• In some cases (e.g., Ef DFT) we
have made a lot of
improvement
• In contrast, for others (e.g., σy
steel alloys) we have barely
improved
• Possible reasons
• Amount of attention paid to
certain problems
• Small vs large data emphasis –
there is a lot more room for
improvement for small data
How else can machine learning be used?
Flood of information
Important things get missed
Useful data, but unstructured
NLP algorithms
The types of features that would be very
helpful for materials research
5
Zinc oxide
ZnO
OZn
Chemistry aware search
(same input, same results)
Summary data
• Physical properties
• Synthesis information
• Known applications
ferroelectrics All known compositions
(PbTiO3, BaTiO3, etc.)
Links to computational databases
User annotates a small
number of example text for
data extraction
annotation
source text
Train custom model for
completing annotations
Apply to entire literature (millions of
articles) or internal text database
+ question and answer, e.g.
• What is the band gap of
“Si”?
• What are all the known
dopants into GaAs?
• What are all materials
studied as thermoelectrics?
We developed a pipeline to extract data from
materials science abstracts
Weston, L. et al. Named Entity
Recognition and Normalization Applied
to Large-Scale Information Extraction
from the Materials Science Literature. J.
Chem. Inf. Model. (2019).
The resulting model can label abstracts
Named Entity Recognition
• Custom machine learning models to
extract the most valuable materials-related
information.
• Utilizes a long short-term memory (LSTM)
network trained on ~1000 hand-annotated
abstracts.
• f1 scores of ~0.9. f1 score for inorganic
materials extraction is >0.9.
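For readers unfamiliar with the f1 score quoted above, the sketch below shows how it combines precision and recall for extracted entities. The entity spans and label names are hypothetical examples invented for illustration; they are not output from the model.

```python
def precision_recall_f1(predicted, reference):
    """Precision, recall, and F1 for sets of (text span, label) entities."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical hand annotations and model output for one abstract
reference = {("ZnO", "material"), ("thin film", "descriptor"),
             ("band gap", "property"), ("sputtering", "synthesis")}
predicted = {("ZnO", "material"), ("thin film", "descriptor"),
             ("band gap", "property"), ("annealing", "synthesis")}

precision, recall, f1 = precision_recall_f1(predicted, reference)
print(f"precision = {precision:.2f}, recall = {recall:.2f}, f1 = {f1:.2f}")
```

Here three of the four predicted entities match the annotations and three of the four annotated entities are found, so precision, recall, and F1 are all 0.75.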
And enables new kinds of searches …
www.matscholar.com
We also found that word embeddings trained on
literature have hidden chemical information
• We use the word2vec
algorithm (Google) to turn
each unique word in our
corpus into a 200-
dimensional vector
• These vectors encode the
meaning of each word
based on trying to
predict context words
around the target
When we train word2vec on
inorganic materials science
abstracts, we get representations
in-line with chemical knowledge
crystal structures of the elements
This hidden information can be used to
predict compounds that might be interesting
• Dot product of a composition word with
the word “thermoelectric” essentially
predicts how likely that word is to appear
in an abstract with the word
thermoelectric
• Compositions with high dot products are
typically known thermoelectrics
• Sometimes, compositions have a high dot
product with “thermoelectric” but have
never been studied as a thermoelectric
• These compositions usually have high
computed power factors! (DFT+BoltzTraP)
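The ranking idea above can be sketched with toy vectors. The three-dimensional embeddings below are invented for illustration (the real vectors are 200-dimensional and learned from abstracts), and normalizing the vectors before taking the dot product, i.e. using cosine similarity, is an assumption of this sketch.

```python
import math

# Hypothetical toy embeddings, not learned from any corpus
EMBEDDINGS = {
    "thermoelectric": [0.9, 0.1, 0.3],
    "Bi2Te3": [0.8, 0.2, 0.4],
    "PbTe": [0.7, 0.1, 0.5],
    "NaCl": [0.1, 0.9, 0.2],
    "SiO2": [0.2, 0.8, 0.1],
}


def cosine_similarity(u, v):
    """Dot product of the two vectors after scaling each to unit length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(target, candidates, embeddings):
    """Sort candidate words from most to least similar to the target word."""
    return sorted(
        candidates,
        key=lambda word: cosine_similarity(embeddings[word], embeddings[target]),
        reverse=True,
    )


candidates = ["NaCl", "PbTe", "SiO2", "Bi2Te3"]
ranking = rank_by_similarity("thermoelectric", candidates, EMBEDDINGS)
print(ranking)
```

With these toy vectors the two tellurides rank ahead of NaCl and SiO2. In the real setting, the interesting candidates are highly ranked compositions that have never appeared alongside the word "thermoelectric".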
For example, we can rank which compounds are
likely to co-occur with “thermoelectric” in the
future
• For every year since 2001,
see which compounds we
would have predicted using
only literature data until that
point in time
• Make predictions of what
materials are the most
promising thermoelectrics
for data until that year
• See if those materials were
actually studied as
thermoelectrics in
subsequent years
Investigated as thermoelectrics
(independently of our study)
Investigated by our own collaborators
(as a result of our study)
We’ve since also applied NLP to synthesis and
are working actively in this area
Moving from the virtual world to the physical
world – automated synthesis
In operation:
XRD
Robot
Box furnace
x 4
Tube furnace
x 4
Arriving soon:
SEM/EDS (Early June)
Labman dosing and mixing
LBNL bldg. 30
Dosing and mixing
Lab starting to take shape …
Courtesy Y. Fei,
Ceder Group
And once again, we need workflow software!
• Monitor the lab and run experiments on
different devices
• Collect data generated in the experiment
• Handle exceptions in the lab
Conclusions
• A lot of things can change in 15 years
• 15 years ago, the idea of high-throughput DFT was scoffed at by many researchers (“too
computationally expensive”, “theory not good enough”, “people will be confused”)
• Today it has become a standard procedure for materials design
• I got to see the technique grow from being used by a handful of people in the smallest
possible conference sessions to filling large, standing-room-only symposia at major
conferences
• I see similar changes happening, or already complete, in emerging areas
• Machine learning in materials was a niche subject, now it’s potentially bigger than DFT-based
screening
• NLP in materials is still small, but the trajectory looks on-track to become big (at a slower
pace)
• Automated synthesis is still small, but that trajectory is growing very rapidly
• I’ve been fortunate to be a part of many great projects and teams at the lab and
am looking forward to the next iteration of materials design!
Acknowledgements
• My mentors and advisors, without whom I wouldn’t have a job
• Vivian Stojanoff (SULI adviser), Gerd Ceder (PhD advisor), Kristin Persson
(postdoc advisor + early staff supervisor)
• Our research group, without whom there’d be no exciting research
results
• Our collaborators
• Entire Materials Project team
• J. Snyder and J. Pohls who took time-consuming experimental leaps on
computational screening results for thermoelectrics
• Our funders
• DOE BES, DOE EERE, Toyota Research Institutes, LBNL LDRD
48

More Related Content

PDF
Discovering new functional materials for clean energy and beyond using high-t...
PDF
The Materials Project: Applications to energy storage and functional materia...
PDF
The Materials Project: Experiences from running a million computational scien...
PDF
Open-source tools for generating and analyzing large materials data sets
PDF
Discovering and Exploring New Materials through the Materials Project
PDF
ICME Workshop Jul 2014 - The Materials Project
PDF
Software tools to facilitate materials science research
PDF
Computational Materials Design and Data Dissemination through the Materials P...
Discovering new functional materials for clean energy and beyond using high-t...
The Materials Project: Applications to energy storage and functional materia...
The Materials Project: Experiences from running a million computational scien...
Open-source tools for generating and analyzing large materials data sets
Discovering and Exploring New Materials through the Materials Project
ICME Workshop Jul 2014 - The Materials Project
Software tools to facilitate materials science research
Computational Materials Design and Data Dissemination through the Materials P...

Similar to Accelerating New Materials Design with Supercomputing and Machine Learning (20)

PDF
Machine learning for materials design: opportunities, challenges, and methods
PDF
Automating materials science workflows with pymatgen, FireWorks, and atomate
PDF
Open Source Tools for Materials Informatics
PDF
Software tools for calculating materials properties in high-throughput (pymat...
PDF
The Materials Project: A Community Data Resource for Accelerating New Materia...
PDF
Overview of accelerated materials design efforts in the Hacking Materials res...
PDF
Materials Project computation and database infrastructure
PDF
NANO266 - Lecture 12 - High-throughput computational materials design
PDF
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
PDF
Discovering advanced materials for energy applications (with high-throughput ...
PDF
The Materials Project: An Electronic Structure Database for Community-Based M...
PDF
Combining density functional theory calculations, supercomputing, and data-dr...
PDF
The Materials Project: overview and infrastructure
PDF
Materials discovery through theory, computation, and machine learning
PDF
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
PDF
Software tools, crystal descriptors, and machine learning applied to material...
PDF
How Web APIs and Data Centric Tools Power the Materials Project (PyData SV 2013)
PPTX
Summary of June 2014 Workshop Report: Building a Materials Accelerator Network
PDF
Atomate: a tool for rapid high-throughput computing and materials discovery
PDF
An AI-driven closed-loop facility for materials synthesis
Machine learning for materials design: opportunities, challenges, and methods
Automating materials science workflows with pymatgen, FireWorks, and atomate
Open Source Tools for Materials Informatics
Software tools for calculating materials properties in high-throughput (pymat...
The Materials Project: A Community Data Resource for Accelerating New Materia...
Overview of accelerated materials design efforts in the Hacking Materials res...
Materials Project computation and database infrastructure
NANO266 - Lecture 12 - High-throughput computational materials design
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
Discovering advanced materials for energy applications (with high-throughput ...
The Materials Project: An Electronic Structure Database for Community-Based M...
Combining density functional theory calculations, supercomputing, and data-dr...
The Materials Project: overview and infrastructure
Materials discovery through theory, computation, and machine learning
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Software tools, crystal descriptors, and machine learning applied to material...
How Web APIs and Data Centric Tools Power the Materials Project (PyData SV 2013)
Summary of June 2014 Workshop Report: Building a Materials Accelerator Network
Atomate: a tool for rapid high-throughput computing and materials discovery
An AI-driven closed-loop facility for materials synthesis

More from Anubhav Jain (20)

PDF
A Career at a U.S. National Lab: Perspective from a Mid-Career Scientist
PDF
Research opportunities in materials design using AI/ML
PDF
Accelerating materials discovery with big data and machine learning
PDF
Predicting the Synthesizability of Inorganic Materials: Convex Hulls, Literat...
PDF
Discovering advanced materials for energy applications: theory, high-throughp...
PDF
Applications of Large Language Models in Materials Discovery and Design
PDF
Best practices for DuraMat software dissemination
PDF
Best practices for DuraMat software dissemination
PDF
Available methods for predicting materials synthesizability using computation...
PDF
Efficient methods for accurately calculating thermoelectric properties – elec...
PDF
Natural Language Processing for Data Extraction and Synthesizability Predicti...
PDF
Machine Learning for Catalyst Design
PDF
Natural language processing for extracting synthesis recipes and applications...
PDF
DuraMat CO1 Central Data Resource: How it started, how it’s going …
PDF
The Materials Project
PDF
Evaluating Chemical Composition and Crystal Structure Representations using t...
PDF
Perspectives on chemical composition and crystal structure representations fr...
PDF
Machine Learning Platform for Catalyst Design
PDF
Applications of Natural Language Processing to Materials Design
PDF
Assessing Factors Underpinning PV Degradation through Data Analysis
A Career at a U.S. National Lab: Perspective from a Mid-Career Scientist
Research opportunities in materials design using AI/ML
Accelerating materials discovery with big data and machine learning
Predicting the Synthesizability of Inorganic Materials: Convex Hulls, Literat...
Discovering advanced materials for energy applications: theory, high-throughp...
Applications of Large Language Models in Materials Discovery and Design
Best practices for DuraMat software dissemination
Best practices for DuraMat software dissemination
Available methods for predicting materials synthesizability using computation...
Efficient methods for accurately calculating thermoelectric properties – elec...
Natural Language Processing for Data Extraction and Synthesizability Predicti...
Machine Learning for Catalyst Design
Natural language processing for extracting synthesis recipes and applications...
DuraMat CO1 Central Data Resource: How it started, how it’s going …
The Materials Project
Evaluating Chemical Composition and Crystal Structure Representations using t...
Perspectives on chemical composition and crystal structure representations fr...
Machine Learning Platform for Catalyst Design
Applications of Natural Language Processing to Materials Design
Assessing Factors Underpinning PV Degradation through Data Analysis


Accelerating New Materials Design with Supercomputing and Machine Learning

  • 1. Accelerating New Materials Design with Supercomputing and Machine Learning Anubhav Jain Lawrence Berkeley National Laboratory Alvarez seminar series Slides (already) posted to hackingmaterials.lbl.gov
  • 2. Outline 2 What I did BEFORE the Alvarez fellowship What I did DURING the Alvarez fellowship What I've been doing AFTER the Alvarez fellowship • High school research program • SULI internships • PhD research • High-throughput workflows for materials science • Open source software • Launching the Materials Project • Growing the Materials Project • Screening for functional materials • Applying AI / ML to materials science • Integrating NLP • Future directions and other projects What’s next?
  • 3. I began my research career with a high school internship program at Brookhaven NL • The project was supposed to be about how to build large rectifier circuits to ensure a stable power supply to superconducting magnets • But mostly, we just goofed around 3 Spending half a day getting a pen cap out of a colleague’s cast Fun with the Plan9 operating system and circuit diagrams
  • 4. In college, I did an initial SULI internship on functional MRI (fMRI) research • We wanted to see if all fMRI signals were real or if some signals were spurious • I also did a lot of manual data cleaning, and I always wondered why they didn’t just write a program to do this kind of grunt work (but was too shy to ask) • Sometimes I actually ran live scans on subjects with little assistance, something that would definitely not pass safety checks today … (and maybe didn’t at the time either) 4 increasing rotation leads to more spurious signals
  • 5. I then did another SULI internship, this time engaging for a couple of years and publishing! • Goal was to have a robotic system for performing protein crystallography – allowing experiments to run autonomously / overnight and at higher rates • My portion of the robot system was to automatically align protein crystals in an x-ray using computer vision • Lots of Java programming and traditional computer vision 5
  • 6. Today, I’ve advised a lot of SULI interns myself – many of whom are now PhD students pushing forward their own research boundaries! 6
  • 7. For my PhD application, I applied (and was admitted) for doing biology / materials science research • I was targeting drug delivery materials • But I didn’t get a position in any labs related to biomaterials • So I ended up taking a position with my thermodynamics teacher (G. Ceder), who seemed smart and also played with dinosaur figurines in class 7
  • 8. In the end, I focused on automating calculations for materials screening for my PhD • Traditionally, theoretical calculations on materials were performed one at a time, mostly manually • We built a complex system to automate these calculations and used it to screen Li-ion battery cathode materials • We also began a process of putting the data online … • $i\hbar \, \frac{d}{dt}\Psi(\{r_i\}; t) = \hat{H}\,\Psi(\{r_i\}; t)$, with $H = \sum_{i=1}^{N_e} \nabla_i^2 + \sum_{i=1}^{N_e} V_{\mathrm{nuclear}}(r_i) + \sum_{i=1}^{N_e} V_{\mathrm{effective}}(r_i)$ • Outputs: total energy, optimized structure, magnetic ground state, charge density, band structure / DOS (Setyawan, Curtarolo, Comp. Mat. Sci. (2010)) • 384 Xeon cores; 134,000 lines of code; 50 core tables • Chemistry: Li9V3(P2O7)3(PO4)2; Novelty: new; Energy density vs. LiFePO4: 20% greater; % of theoretical capacity already achieved in the lab: ~65% • Origin: V to Fe substitution in Li9Fe3(P2O7)3(PO4)2* • Remarks: structure has “layers” and “tunnels”; pyrophosphate-phosphate mixture; potential 2-electron material
  • 9. My thesis defense talk would hint at my Alvarez postdoc work … 9
  • 10. Outline 10 What I did BEFORE the Alvarez fellowship What I did DURING the Alvarez fellowship What I've been doing AFTER the Alvarez fellowship • High school research program • SULI internships • PhD research • High-throughput workflows for materials science • Open source software • Launching the Materials Project • Growing the Materials Project • Screening for functional materials • Applying AI / ML to materials science • Integrating NLP • Future directions and other projects What’s next?
  • 11. Yet another workflow software … • By 2011, our computing infrastructure at MIT used for battery screening was showing a lot of wear and tear / barnacles. It was also not suitable for running at LBNL’s supercomputers • We essentially rebuilt 5 years of work from scratch • Part of this was creating a new workflow software that also merged with LBNL work called “rockets” for launching jobs. We called the new software “FireWorks” for naming harmony, but it is a name I regret now … • This was most of what I did during the Alvarez fellowship • Supported by the infinite patience of Kristin Persson, who co-advised my Alvarez fellowship and provided supplemental funding, in addition to D. Bailey, who served as my CRD host • Note that in the end, things were better by scrapping almost all of “rockets” and designing FireWorks from scratch • https://xkcd.com/927/ • Note: we did think a lot about whether to use an existing workflow package, but none met our needs for (i) ease of use (could be operated by scientists), (ii) good documentation, and (iii) compatibility with error-prone and dynamic high-throughput workflows
  • 12. I spent a lot of time developing FireWorks and associated infrastructure for high-throughput computing • We did several things that were new at the time • Based everything off MongoDB • Really planned for job failures and reruns • Took into account that duplicate steps of workflows may be submitted, but should run only once • Allowed jobs to modify their own workflow graph or create new workflows • Spent a lot of time on documentation and support • FireWorks continues to be an active project and is now largely community supported 12 I spent a lot of time programming … Rocketsled: Use FireWorks to perform virtual active learning, even when simulations are expensive and require supercomputers Borealis: Run FireWorks in the cloud via GCP [[externally developed and maintained]] atomate: Use FireWorks to run materials science calculations
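
Three of the design points on this slide (planning for failures and reruns, running duplicate steps only once, and letting jobs add to their own workflow) can be illustrated with a toy job runner. This is a minimal sketch in plain Python and is not FireWorks' API: the function `run_workflow`, the job names `relax` and `band_structure`, and the simulated node failure are all hypothetical.

```python
from collections import deque


def run_workflow(initial_jobs, max_attempts=3):
    """Toy runner: reruns failed jobs, skips duplicates, accepts dynamic jobs."""
    queue = deque()
    submitted = set()   # keys of every job ever submitted (duplicate check)
    results = {}        # job key -> result
    log = []

    def submit(key, func):
        # A step submitted twice is queued only once.
        if key in submitted:
            log.append(f"skipped duplicate: {key}")
            return
        submitted.add(key)
        queue.append((key, func, 1))

    for key, func in initial_jobs:
        submit(key, func)

    while queue:
        key, func, attempt = queue.popleft()
        try:
            result, new_jobs = func(attempt)
        except RuntimeError as err:
            # Failures are expected: requeue until attempts run out.
            if attempt < max_attempts:
                log.append(f"rerun {key} after: {err}")
                queue.append((key, func, attempt + 1))
            else:
                log.append(f"gave up on {key}")
            continue
        results[key] = result
        # A finished job may extend the workflow with follow-up jobs.
        for new_key, new_func in new_jobs:
            submit(new_key, new_func)
    return results, log


def band_structure(attempt):
    return "bands", []


def relax(attempt):
    if attempt == 1:
        raise RuntimeError("node failure")  # simulated transient failure
    return "relaxed", [("band_structure", band_structure)]


results, log = run_workflow([("relax", relax), ("relax", relax)])
print(results)
print(log)
```

In this sketch the duplicate check happens at submission time, so a rerun of a failed job is not mistaken for a duplicate. A production system would keep this state in a database rather than in memory.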
  • 13. Growing the Materials Project • Apart from the workflow software itself, I was running a lot of density functional calculations to populate a public database of calculations (The Materials Project, headed by K. Persson) • Interest grew steadily in this resource and a few core members • Each of us was wearing lots of hats – materials scientists, web developers, workflow programmers, REST API developers 13 “I am so incredibly happy an effort like this exists now... I have been lamenting for years that despite the importance of materials we have remained relatively unaided by the information age. Please please don't stop growing!” Cymbet
  • 14. A continuing challenge has been that every mistake in high-throughput is magnified … “I’m overly paranoid probably because I (and others on the Materials Project team) spend inordinate amounts of time fixing problems in the Materials Project data. A search for the word “bug” in my email gives ~500 results in the past year (and there are additional “issues”, “problems”, and “errors”). … trying to exterminate the Materials Project’s bugs can be somewhat maddening – the past few years have demonstrated that the infestation will always return, usually based on something that appears innocent at first glance … For example, on multiple occasions, code that incorrectly set (or failed to set) a single input tag ruined tens of thousands of dollars worth of computing and several weeks of work. Currently, we’re struggling to find out whether old bugs in a crystal structure matching code may have affected what we’ve computed and potentially any of the reported results …” 14 (myself in a blog post about MP work)
  • 15. Outline 15 What I did BEFORE the Alvarez fellowship What I did DURING the Alvarez fellowship What I've been doing AFTER the Alvarez fellowship • High school research program • SULI internships • PhD research • High-throughput workflows for materials science • Open source software • Launching the Materials Project • Growing the Materials Project • Screening for functional materials • Applying AI / ML to materials science • Integrating NLP • Future directions and other projects What’s next?
  • 16. Transitioning to LBNL staff • I became staff in 2013 after being hired by K. Persson • At first, this mainly meant that I spent more time training new postdocs in some of the things we were doing and helping launch a new project on multivalent batteries • The real career game changer was when I got a DOE Early Career Award in 2015, which came with enough funding to make me an independent researcher essentially overnight • Nevertheless, I have continued working on past projects like the Materials Project to this day (first as co-PI, now Associate Director)
  • 17. The Materials Project continues to grow • The Materials Project has grown beyond what most of us imagined • The team now includes ~3-4 staff dedicated to infrastructure and scaling • Staff web developer currently needed! • FireWorks is still used to run the calculations • We’ve begun new outreach efforts, like the MP seminar series • https://materialsproject.org/seminars • > 180,000 registered users
  • 18. 2. Materials Project links to your contribution 3. Your data set and paper are linked 1. Google links to Materials Project page 18 A new phase of Materials Project: researchers can contribute their own data sets to MP
  • 19. Today, the Materials Project has led to many examples of “computer to lab” success stories • MP for p-type transparent conductors (Ba2BiTaO6) • Prediction: Screening based on band gap, transport properties and band alignments. • Experiment: Predictions revealed material with s–p hybridized valence band (thought to correlate well with dopability). When synthesized, material has excellent transparency and is readily dopable with K. • References ✦ Hautier, G., Miglio, A., Ceder, G., Rignanese, G.-M. & Gonze, X. Identification and design principles of low hole effective mass p-type transparent conducting oxides. Nature Communications 4, (2013) ✦ Bhatia, A. et al. High-Mobility Bismuth-based Transparent p-Type Oxide from High-Throughput Material Screening. Chemistry of Materials 28, 30–34 (2015) ✦ Ricci, F. et al. An ab initio electronic transport database for inorganic materials. Scientific Data 4, (2017) • MP for thermoelectrics (TmAgTe2) • Prediction: Screening of tens of thousands of materials with predicted electron transport properties revealed a family of promising XYZ2 candidates • Experiment: Several materials made: YCuTe2 (zT = 0.75), TmAgTe2 (zT = 0.47, 1.8 theoretical), novel NiP2 phosphide • References ✦ Aydemir, U. et al. YCuTe2: a member of a new class of thermoelectric materials with CuTe4-based layered structure. Journal of Materials Chemistry A 4, 2461–2472 (2016) ✦ Zhu, H. et al. Computational and experimental investigation of TmAgTe2 and XYZ2 compounds, a new group of thermoelectric materials identified by first-principles high-throughput screening. Journal of Materials Chemistry C 3, 10554–10565 (2015) ✦ Pöhls, J.-H. et al. Metal phosphides as potential thermoelectric materials. Journal of Materials Chemistry C 5, 12441–12456 (2017) • MP for phosphors (Sr2LiAlN4) • Prediction: Statistical analysis of existing materials that co-occur with word ‘phosphor’ followed by structure prediction for new materials • Experiment: Predicted first known Sr-Li-Al-N quaternary, showed green-yellow/blue emission with quantum efficiency of 25% (Eu), 40% (Ce), 55% (co-activated Eu, Ce) • References ✦ Wang, Z. et al. Mining Unexplored Chemistries for Phosphors for High-Color-Quality White-Light-Emitting Diodes. Joule 2, 914–926 (2018) ✦ Li, S. et al. Data-Driven Discovery of Full-Visible-Spectrum Phosphor. Chemistry of Materials 31, 6286–6294 (2019) ✦ Ha, J. et al. Color tunable single-phase Eu2+ and Ce3+ co-activated Sr2LiAlO4 phosphors. Journal of Materials Chemistry C 7, 7734–7744 (2019)
  • 20. One of the applications we looked into was thermoelectric materials • A thermoelectric material generates a voltage from a thermal gradient • Applications • Heat to electricity • Refrigeration • Advantages include: • Reliability • Easy to scale to different sizes (including compact) www.alphabetenergy.com
  • 21. It is difficult to balance trade-offs in thermoelectric properties, so use screening • $ZT = \alpha^2 \sigma T / \kappa$ • Power factor ($\alpha^2 \sigma$) > 2 mW/(m·K²) (PbTe = 10 mW/(m·K²)) • Seebeck coefficient ($\alpha$) > 100 µV/K: band structure + BoltzTraP • Electrical conductivity ($\sigma$) > 10³ /(ohm·cm): band structure + BoltzTraP • Thermal conductivity ($\kappa$) < 1 W/(m·K): $\kappa_e$ from BoltzTraP; $\kappa_l$ difficult (phonon-phonon scattering) • Heavy band: ✓ Large DOS (higher Seebeck and more carriers) ✗ Large effective mass (poor mobility) • Light band: ✓ Small effective mass (improved mobility) ✗ Small DOS (lower Seebeck, fewer carriers) • Multiple bands, off symmetry: ✓ Large DOS with small effective mass ✗ Difficult to design! • ~50,000 crystal structures and band structures from the Materials Project are used as a source • We compute electronic transport properties with BoltzTraP and minimum thermal conductivity (Cahill-Pohl) for some compounds • About 300 GB of electronic transport data is generated. All data is available free for download. • F. Ricci, et al., An ab initio electronic transport database for inorganic materials, Sci. Data. 4 (2017) 170085.
  • 22. We found several compounds with promising figure-of-merit, but no breakthroughs • TmAgTe2 • Computation: trigonal p-TmAgTe2 could have a power factor up to 8 mW/(m·K²); requires 10²⁰/cm³ carriers • Experiment: p-type zT only 0.35 despite very low thermal conductivity (~0.25 W/(m·K)); limitation: carrier concentration (~10¹⁷/cm³), likely limited by TmAg defects, as determined by follow-up calculations; later, we achieved zT ~ 0.47 using Zn-doping • YCuTe2 • Computation: p-YCuTe2 could only reach a PF of 0.4 mW/(m·K²); SOC inhibits PF; if thermal conductivity is low (e.g., 0.4), we get zT ~1 • Experiment: zT ~0.75, not too far from the calculation limit; carrier concentration of 10¹⁹; decent performance, but unlikely to be improved with further optimization
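
The figure of merit on this slide is mostly a unit-conversion exercise, which a short sketch can make concrete. The function names and the example input values below are hypothetical illustrations and are not data from the talk. The thresholds in `passes_screen` are the ones quoted on the slide.

```python
def power_factor(seebeck_uV_per_K, sigma_S_per_cm):
    """Power factor alpha^2 * sigma in W/(m K^2)."""
    alpha = seebeck_uV_per_K * 1e-6   # microvolts per kelvin -> V/K
    sigma = sigma_S_per_cm * 100.0    # S/cm -> S/m
    return alpha ** 2 * sigma


def figure_of_merit(seebeck_uV_per_K, sigma_S_per_cm, kappa_W_per_mK, temperature_K):
    """zT = alpha^2 * sigma * T / kappa (dimensionless)."""
    return power_factor(seebeck_uV_per_K, sigma_S_per_cm) * temperature_K / kappa_W_per_mK


def passes_screen(seebeck_uV_per_K, sigma_S_per_cm, kappa_W_per_mK):
    """Apply the slide's thresholds: PF > 2 mW/(m K^2), |Seebeck| > 100 uV/K,
    conductivity > 1e3 S/cm, thermal conductivity < 1 W/(m K)."""
    pf_mW = power_factor(seebeck_uV_per_K, sigma_S_per_cm) * 1e3
    return (pf_mW > 2.0
            and abs(seebeck_uV_per_K) > 100.0
            and sigma_S_per_cm > 1e3
            and kappa_W_per_mK < 1.0)


# Hypothetical material: 200 uV/K, 1500 S/cm, 0.8 W/(m K), evaluated at 600 K.
pf = power_factor(200.0, 1500.0)
zt = figure_of_merit(200.0, 1500.0, 0.8, 600.0)
print(f"power factor = {pf * 1e3:.2f} mW/(m K^2), zT = {zt:.2f}")
```

Note that a material sitting exactly at the Seebeck and conductivity thresholds (100 µV/K and 10³ S/cm) has a power factor of only 1 mW/(m·K²), so the power factor criterion is the stricter of the three.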
  • 23. We also developed a new method for more accurately screening electronic transport • Old method (BoltzTraP): screening is qualitative w/pitfalls • New method (AMSET): screening is more quantitative • Scattering mechanisms and the materials parameters they use: acoustic deformation potential (ad): deformation potential, elastic tensor; ionized impurity (ii): dielectric tensor; piezoelectric (pi): dielectric tensor, piezoelectric tensor; polar optical phonon (po): dielectric tensor, polar phonon frequency • The method, AMSET, was in development for ~5 years and took a very talented postdoc (A. Ganose) to finalize everything • Can calculate e- mobility + Seebeck coefficient much more accurately than standard models • Ganose, A. M.; Park, J.; Faghaninia, A.; Woods-Robinson, R.; Persson, K. A.; Jain, A. Efficient Calculation of Carrier Scattering Rates from First Principles. Nat Commun 2021, 12 (1), 2222.
  • 24. What about machine learning? • “Simulation-only” screening is becoming rarer • More common now is to integrate machine learning models before performing expensive calculations • Our group developed a popular open-source library called “matminer” to help with ML in materials • Since then, we’ve been interested in benchmarking methods from the community • Featurized data (MATERIAL | FEATURES | PROPERTY): TiO2 rutile | F11 F12 … F1N | gap = 3.0 eV; C diamond | F21 F22 … F2N | gap = 5.5 eV; …; PbTe rocksalt | FM1 FM2 … FMN | gap = 0.3 eV • Diagram labels: Data Retrieval (Materials Databases: MPDS, Citrine, Materials Project), Data Featurization, Data Visualization, Python ML Libraries
  • 25. Proper benchmarking is becoming more of an issue in materials ML New algorithms are constantly reported! 25
  • 26. But it is very difficult to compare algorithms 26 Data set used in study A Data set used in study B Data set used in study C • Different data sets • Source (e.g., OQMD vs MP vs JARVIS) • Quantity (e.g., MP 2019 vs MP 2022) • Subset / data filtering (e.g., ehull<X) • Different evaluation metrics • Test set vs. cross validation? • Different test set fraction? • Can be difficult to install and retrain many of these algorithms MAE 5-Fold CV = 0.102 eV RMSE Test set = 0.098 eV vs. ? ?
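
The "MAE 5-Fold CV vs. RMSE Test set" mismatch on this slide can be reproduced with a small sketch. It scores the same trivial model (always predict the training-set mean) on the same synthetic data under both protocols and gets two numbers that cannot be compared directly. The data, the model, and all function names are illustrative assumptions and come from none of the studies mentioned.

```python
import math
import random


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def mean_model(train_y):
    """Trivial baseline model: always predict the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda n: [mean] * n


def cv_mae(y, k=5, seed=0):
    """Protocol A: mean absolute error averaged over k cross-validation folds."""
    scores = []
    for fold in kfold_indices(len(y), k, seed):
        held_out = set(fold)
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        test_y = [y[i] for i in fold]
        scores.append(mae(test_y, mean_model(train_y)(len(test_y))))
    return sum(scores) / k


def holdout_rmse(y, test_fraction=0.2, seed=0):
    """Protocol B: root mean squared error on a single held-out test set."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(y) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    test_y = [y[i] for i in test_idx]
    return rmse(test_y, mean_model([y[i] for i in train_idx])(n_test))


rng = random.Random(42)
band_gaps = [rng.uniform(0.0, 6.0) for _ in range(100)]  # synthetic targets in eV
print(f"5-fold CV MAE : {cv_mae(band_gaps):.3f} eV")
print(f"test-set RMSE : {holdout_rmse(band_gaps):.3f} eV")
```

Because RMSE is never smaller than MAE on the same predictions, and the two protocols also evaluate on different samples, a lower number under one protocol says little about which algorithm is better.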
  • 27. Matbench includes 13 different ML tasks • Dunn, A.; Wang, Q.; Ganose, A.; Dopp, D.; Jain, A. Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm. npj Comput Mater 2020, 6 (1), 138. https://doi.org/10.1038/s41524-020-00406-3.
  • 28. How to read the Matbench leaderboard 28 Bigger datasets Better relative performance • A scaled error of 0.0 means all predictions are correct • A scaled error of 1.0 is equal to always predicting the average value
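
The two reference points on this slide (0.0 for perfect predictions, 1.0 for always predicting the average) are consistent with dividing a model's mean absolute error by the error of a predict-the-mean baseline. The sketch below assumes that definition and computes the baseline mean from the same targets being scored. The leaderboard's exact procedure may differ, and the numbers are made up.

```python
def scaled_error(y_true, y_pred):
    """MAE of the model divided by the MAE of always predicting the mean."""
    n = len(y_true)
    mean = sum(y_true) / n
    baseline_mae = sum(abs(y - mean) for y in y_true) / n
    model_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return model_mae / baseline_mae


targets = [1.0, 2.0, 3.0, 6.0]          # made-up property values
perfect = list(targets)
always_mean = [3.0] * len(targets)      # 3.0 is the mean of the targets
decent = [1.5, 2.0, 3.5, 5.5]

print(scaled_error(targets, perfect))      # 0.0
print(scaled_error(targets, always_mean))  # 1.0
print(scaled_error(targets, decent))       # 0.25
```

Scaling by the baseline is what allows tasks with different units and ranges, such as formation energy and yield strength, to share one axis.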
  • 29. Magpie + SCM Model • Composition features using chemical descriptors such as averages/stdevs of elemental properties such as melting point, electronegativity • Structure features using sine Coulomb matrix (SCM) • Ward, L., Agrawal, A., Choudhary, A. et al. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2, 16028 (2016). Faber, Felix, et al. "Crystal structure representations for machine learning models of formation energies." International Journal of Quantum Chemistry 115.16 (2015): 1094-1101. https://matbench.materialsproject.org
  • 30. MODNet Model • De Breuck, P.-P.; Evans, M. L.; Rignanese, G.-M. Robust Model Benchmarking and Bias-Imbalance in Data-Driven Materials Science: A Case Study on MODNet. Journal of Physics: Condensed Matter, Volume 33, Number 40, 2021 https://matbench.materialsproject.org
  • 31. CGCNN Model • Xie, T.; Grossman, J. C. Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. Phys. Rev. Lett. 2018, 120 (14), 145301. https://matbench.materialsproject.org
  • 32. ALIGNN Model • Choudhary, Kamal, and Brian DeCost. "Atomistic Line Graph Neural Network for improved materials property predictions." npj Computational Materials 7.1 (2021): 1-8. https://matbench.materialsproject.org
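
The composition features described on this slide (averages and standard deviations of elemental properties) can be sketched in a few lines. This is an illustration of the idea only, not the Magpie feature set or matminer's API. The two property tables hold a handful of standard Pauling electronegativities and atomic numbers, and the function name `featurize` is hypothetical.

```python
# Small illustrative lookup tables (Pauling electronegativity, atomic number).
ELECTRONEGATIVITY = {"Ti": 1.54, "O": 3.44, "C": 2.55, "Pb": 2.33, "Te": 2.10}
ATOMIC_NUMBER = {"Ti": 22, "O": 8, "C": 6, "Pb": 82, "Te": 52}


def featurize(composition):
    """Map {element: amount} to [mean EN, std EN, mean Z, std Z],
    weighting each element by its atomic fraction."""
    total = sum(composition.values())
    fractions = {el: amount / total for el, amount in composition.items()}
    features = []
    for table in (ELECTRONEGATIVITY, ATOMIC_NUMBER):
        mean = sum(frac * table[el] for el, frac in fractions.items())
        variance = sum(frac * (table[el] - mean) ** 2 for el, frac in fractions.items())
        features.extend([mean, variance ** 0.5])
    return features


materials = {
    "TiO2": {"Ti": 1, "O": 2},
    "C": {"C": 1},
    "PbTe": {"Pb": 1, "Te": 1},
}
feature_matrix = {name: featurize(comp) for name, comp in materials.items()}
for name, row in feature_matrix.items():
    print(name, [round(x, 3) for x in row])
```

Each material becomes a fixed-length row regardless of how many elements it contains, which is what lets standard ML libraries consume the result. Structure-based features such as the sine Coulomb matrix need atomic positions and are not covered here.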
  • 33. How much have we improved overall? 33 • In some cases (e.g., Ef DFT) we have made a lot of improvement • In contrast, for others (e.g., σy steel alloys) we have barely improved • Possible reasons • Amount of attention paid to certain problems • Small vs large data emphasis – there is a lot more room for improvement for small data
  • 34. How else can machine learning be used? 34 Flood of information Important things get missed Useful data, but unstructured NLP algorithms
  • 35. The types of features that would be very helpful for materials research • Chemistry-aware search (same input, same results), e.g., “Zinc oxide”, “ZnO”, “OZn” • Summary data • Physical properties • Synthesis information • Known applications • For a query such as “ferroelectrics”: all known compositions (PbTiO3, BaTiO3, etc.) • Links to computational databases • Custom data extraction: the user annotates a small number of example texts, a custom model is trained to complete the annotations, and the model is applied to the entire literature (millions of articles) or an internal text database • Question and answer, e.g. • What is the band gap of “Si”? • What are all the known dopants into GaAs? • What are all materials studied as thermoelectrics?
  • 36. We developed a pipeline to extract data from materials science abstracts • Weston, L. et al. Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature. J. Chem. Inf. Model. (2019).
  • 37. The resulting model can label abstracts • Named Entity Recognition • Custom machine learning models to extract the most valuable materials-related information. • Utilizes a long short-term memory (LSTM) network trained on ~1000 hand-annotated abstracts. • f1 scores of ~0.9. f1 score for inorganic materials extraction is >0.9.
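
For readers unfamiliar with the f1 metric quoted on this slide, here is a minimal sketch of how it can be computed for entity extraction when predicted entities are compared to hand annotations as exact (start, end, label) matches. The matching rule, the label names, and the example spans are assumptions for illustration. The evaluation protocol used in the paper may differ.

```python
def f1_score(gold_entities, predicted_entities):
    """Exact-match F1 between two sets of (start, end, label) entity spans."""
    gold, predicted = set(gold_entities), set(predicted_entities)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)  # fraction of predictions that are right
    recall = true_positives / len(gold)          # fraction of annotations that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical token spans for one abstract; the labels are illustrative.
gold = {(0, 1, "MATERIAL"), (3, 4, "PROPERTY"), (6, 7, "APPLICATION")}
predicted = {(0, 1, "MATERIAL"), (3, 4, "PROPERTY"), (8, 9, "MATERIAL")}

print(round(f1_score(gold, predicted), 3))
```

Here two of three predictions are correct and two of three annotations are recovered, so precision, recall, and F1 all equal 2/3.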
  • 38. And enables new kinds of searches … 38 www.matscholar.com
  • 39. We also found that word embeddings trained on literature have hidden chemical information • We use the word2vec algorithm (Google) to turn each unique word in our corpus into a 200-dimensional vector • These vectors encode the meaning of each word, based on trying to predict the context words around the target word • When we train word2vec on inorganic materials science abstracts, we get representations in line with chemical knowledge (figure: crystal structures of the elements)
  • 40. This hidden information can be used to predict compounds that might be interesting 40 • Dot product of a composition word with the word “thermoelectric” essentially predicts how likely that word is to appear in an abstract with the word thermoelectric • Compositions with high dot products are typically known thermoelectrics • Sometimes, compositions have a high dot product with “thermoelectric” but have never been studied as a thermoelectric • These compositions usually have high computed power factors! (DFT+BoltzTraP)
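
The ranking step described on this slide can be sketched with toy vectors: score every composition by its dot product with the embedding of “thermoelectric”, then keep the high-scoring ones that have not yet been studied. The 3-dimensional vectors, the placeholder name `candidate_X`, and the `already_studied` set are invented for illustration. The real embeddings are 200-dimensional and learned from abstracts.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rank_by_dot_product(target_vector, embeddings):
    """Return composition names sorted by dot product with the target, highest first."""
    return sorted(embeddings, key=lambda name: dot(embeddings[name], target_vector), reverse=True)


# Toy 3-dimensional embeddings (made up; not from a trained model).
thermoelectric = [1.0, 0.2, 0.0]
embeddings = {
    "Bi2Te3": [0.9, 0.1, 0.1],
    "PbTe": [0.8, 0.3, 0.0],
    "candidate_X": [0.7, 0.4, 0.1],   # placeholder for an unstudied composition
    "NaCl": [0.0, 0.1, 0.9],
}
already_studied = {"Bi2Te3", "PbTe"}  # hypothetical literature knowledge

ranking = rank_by_dot_product(thermoelectric, embeddings)
predictions = [name for name in ranking if name not in already_studied]
print(ranking)
print(predictions[0])
```

The second filter is what turns a similarity ranking into a prediction: known thermoelectrics score highest by construction, so the interesting entries are the high scorers that the literature has not yet linked to the application.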
  • 41. For example, we can rank which compounds are likely to co-occur with “thermoelectric” in the future 41 • For every year since 2001, see which compounds we would have predicted using only literature data until that point in time • Make predictions of what materials are the most promising thermoelectrics for data until that year • See if those materials were actually studied as thermoelectrics in subsequent years Investigated as thermoelectrics (independently of our study) Investigated by our own collaborators (as a result of our study)
  • 42. We’ve since also applied NLP to synthesis and are working actively in this area 42
  • 43. Outline 43 What I did BEFORE the Alvarez fellowship What I did DURING the Alvarez fellowship What I've been doing AFTER the Alvarez fellowship • High school research program • SULI internships • PhD research • High-throughput workflows for materials science • Open source software • Launching the Materials Project • Growing the Materials Project • Screening for functional materials • Applying AI / ML to materials science • Integrating NLP • Future directions and other projects What’s next?
  • 44. Moving from the virtual world to the physical world – automated synthesis 44 In operation: XRD Robot Box furnace x 4 Tube furnace x 4 Arriving soon: SEM/EDS (Early June) Labman dosing and mixing LBNL bldg. 30 Dosing and mixing
  • 45. Lab starting to take shape … 45 Courtesy Y. Fei, Ceder Group
  • 46. And once again, we need workflow software! • Monitor the lab and run experiments on different devices • Collect data generated in the experiments • Handle exceptions in the lab
  • 47. Conclusions • A lot of things can change in 15 years • 15 years ago, the idea of high-throughput DFT was scoffed at by many researchers (“too computationally expensive”, “theory not good enough”, “people will be confused”) • Today it has become a standard procedure for materials design • I got to see the technique grow from being used by a handful of people with the smallest possible conference sessions to now filling large, standing-room-only symposia at large conferences • I see similar changes happening, or having already happened, in emerging areas • Machine learning in materials was a niche subject, now it’s potentially bigger than DFT-based screening • NLP in materials is still small, but the trajectory looks on-track to become big (at a slower pace) • Automated synthesis is still small, but that trajectory is growing very rapidly • I’ve been fortunate to be a part of many great projects and teams at the lab and am looking forward to the next iteration of materials design!
  • 48. Acknowledgements • My mentors and advisors, without whom I wouldn’t have a job • Vivian Stojanoff (SULI adviser), Gerd Ceder (PhD advisor), Kristin Persson (postdoc advisor + early staff supervisor) • Our research group, without whom there’d be no exciting research results • Our collaborators • Entire Materials Project team • J. Snyder and J. Pohls who took time-consuming experimental leaps on computational screening results for thermoelectrics • Our funders • DOE BES, DOE EERE, Toyota Research Institutes, LBNL LDRD 48