Machine Learning in Scientific Discovery
All content following this page was uploaded by Paola Cinnella on 08 May 2024.
4 Robotics, Perception and Learning, KTH Royal Institute of Technology, Stockholm, Sweden
5 TUM School of Computation, Information and Technology, Technical University Munich, Munich, Germany
6 Helmholtz AI, Helmholtz Center Munich, Munich, Germany
7 Department of Biology, University of Washington, Seattle, WA 98195, USA
8 Dept. of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University, 171 21 Solna
9 Dept. Mathematics, KTH Royal Institute of Technology, 100 44 Stockholm, Sweden
10 Department of Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden
11 Dept. Molecular Medicine and Surgery, Karolinska Institutet, 171 77 Stockholm, Sweden
12 Inst. for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
13 Institut Jean le Rond D’Alembert, Sorbonne Université, France
14 Department of Mechanical Engineering, University of Washington, Seattle, WA 98195, USA
* E-mail for correspondence: rvinuesa@[Link]
ABSTRACT
Technological advancements have substantially increased computational power and data availability, enabling the application of
powerful machine-learning (ML) techniques across various fields. However, our ability to leverage ML methods for scientific
discovery, i.e. to obtain fundamental and formalized knowledge about natural processes, is still in its infancy. In this review, we
explore how the scientific community can increasingly leverage ML techniques to achieve scientific discoveries. We observe
that the applicability and opportunity of ML depend strongly on the nature of the problem domain, and on whether we have full
(e.g., turbulence), partial (e.g., computational biochemistry), or no (e.g., neuroscience) a-priori knowledge about the governing
equations and physical properties of the system. Although challenges remain, principled use of ML is opening up new avenues
for fundamental scientific discoveries. Throughout these diverse fields, there is a theme that ML is enabling researchers to
embrace complexity in observational data that was previously intractable to classic analysis and numerical investigations.
Keywords: machine learning (ML); deep learning (DL); artificial intelligence (AI); scientific discovery; physics; life sciences;
computer science
Introduction
Machine learning (ML) has shown great potential to transform a broad range of domains [1–5], and it is increasingly
being applied to problems in science and engineering. ML has been widely used for predictive tasks in these areas,
and despite an initial promising phase in which ML methods outperformed well-established techniques [6–10],
such predictive applications are starting to exhibit diminishing returns. There are, by contrast, increasing
opportunities in the academic usage of ML for scientific discovery, i.e. answering challenging scientific questions
while leveraging existing fundamental knowledge. Such focus on scientific discovery can move the frontiers of
science forward when progress in more traditional methods has slowed. Furthermore, the development of novel and
more powerful ML methods can help to tackle some open subjects in the context of predictions from scarce, noisy,
or incomplete data, out-of-sample generalization, extreme-event predictions and predictions under uncertainty.
Science is fundamentally interested in identifying structure and explaining the world, the many systems that
constitute it, and the laws that govern it. The development of the scientific method as we know it has taken
time and been progressive, from early forms found in antiquity (Aristotle, first forms of causality [11], etc.) to
rigorous methodologies based on objective, quantitative, and mathematical evidence [12]. This approach has
encountered great success, although the “unreasonable effectiveness of mathematics in natural sciences” [13]
increasingly appears to encounter difficulty making further progress with the “complexity” observed in more
challenging problems and real-world data. Complexity is hard to define but often arises from non-linearity, high
dimensionality, and multiscale dynamics in space and time, leading to a system comprising a very large number
of parts and mechanisms that cannot easily be simplified or approximated. Traditional tools inherited from the
rigorous mathematical tradition of the 1700s to late 1900s are challenged by these problems: still today, we can
observe and, to some degree, simulate and reproduce in numerical models complex problems such as turbulence,
processes in the brain, biological systems, etc. But arguably, we do not fully understand these phenomena at a
deeper level.
Machine learning comprises a growing set of algorithms, enabled by high-performance computing and increasingly
vast data, that show incredible promise for handling complexity [14, 15]. Interestingly, artificial neural
networks are themselves “complex” artificial systems that can now perform “complex” tasks for which no traditional
algorithms are known. Even though their governing laws are based on simple operations, they are fully
observable, and they run deterministically on microprocessors, we cannot in general explain the decisions and
outputs generated by neural networks. Regardless, neural networks are enabling novel discoveries with profound
impact, such as the first new class of antibiotics in decades [16]. This lack of interpretability gives rise to several
challenges, such as the need for explainable artificial intelligence (XAI) [17, 18].
Symbolic approaches, such as gene-expression programming, sparse regression, and sparse Bayesian learning [19–23],
have been applied successfully over the years (see e.g. the work on “machine scientists” from the
1980s [24]). Such approaches are limited by a cost that grows exponentially with the size of the search space, motivating
recent attempts to combine symbolic and deep-learning-based strategies [25]; however, they can still be used to
enable advances in science, including the discovery of novel materials [26]. This observation also raises several
fundamental, quasi-philosophical questions: is a complex system complex enough to understand its complexity, or
can it only understand lesser complexities? This is also related to the question of how much AI can discover that
is not already contained in the training data [27]. These considerations naturally raise the fundamental question
of what opportunities (and challenges) are offered by the growing impact of data-driven methods and machine
learning in science: will we, as some authors have suggested, observe the “unreasonable effectiveness of data” [28],
and of deep learning [29]? And as a consequence, what does this mean in practice for future scientific discoveries?
How and to what degree can we extract fundamental understanding, not just computational recipes, from
modern machine learning, and by using which approaches? In this sense, and for this article, we do not consider
optimization or automation tasks as “scientific discovery”. Still, being able to effectively solve governing equations
(for instance, partial differential equations, PDEs) through ML would be considered as “scientific discovery”, as
long as this enables addressing fundamental questions that cannot be answered with other tools.
There have been some recent studies on the possible impact of ML on science. For instance, Wang et al. [30]
focus on the potential of self-supervised learning and generative models for generally achieving improvements in
scientific experiments and simulations. Zenil et al. [31] are more focused on how AI can help formulate scientific
questions and answer them, emphasizing the potential of large language models (LLMs) in this context. The latter
are raising a number of open questions in terms of potential benefits and threats of their application in science [32].
Fajardo-Fontiveros et al. [33] discuss the interesting question of when it is possible to learn models from data and
what is the maximum level of noise acceptable for learning the correct model.
In the present work we adopt a different approach, analyzing separately the potential of AI to enable/facilitate
scientific discovery in three types of problems: (i) problems where the governing phenomenological equations are
entirely known, i.e. cases where it would be directly possible, given enough computational power,
to simulate and reproduce the system entirely; (ii) problems where we have some partial knowledge regarding the
governing equations and/or some physical properties that hold; and (iii) problems where nothing is known about
the governing equations or the physical properties of the system under study. We illustrate these categories by
considering examples from two concrete application areas: Physical Sciences and Life Sciences. A summary of the
various categories and applications is presented in Figure 1, where we discuss the examples of turbulent flows,
dark matter, drug discovery and brain research, sorted from more to less knowledge of the governing equations
and/or their underlying properties. While the proposed categorization is a convenient schematic view, the lines
between full knowledge and partial or no knowledge become increasingly blurred as system complexity grows.
Examples given in the following sections illustrate how different uses of ML may support scientific discovery by
helping to tackle complexity.
Complete information is available
There are several applications within Physical Sciences where the underlying governing equations are known
perfectly but the high-level global dynamics are still not well understood. Even if, technically, we know the
governing quantum equations of chemistry underlying the dynamics of bio-molecules, and subsequently of full
organisms, complexity makes biology a partial-knowledge problem, because only very limited parts of complex
biological systems can be actually measured or simulated, while extreme complexity makes it impossible to simulate
the full human brain. Similarly, turbulent flows of continuous media (i.e. flows characterized by very small values
of the Knudsen number $\mathrm{Kn} = l_{\mathrm{m.f.p.}}/L$, where $l_{\mathrm{m.f.p.}}$ is the mean free path of the molecules constituting the fluid and $L$
is a macroscopic length scale) are well described by the Navier–Stokes equations, which in turn can be derived
from the Boltzmann equations describing the kinetics of gases at the microscopic level [34]. Nevertheless, the
tiny and highly chaotic small scales in turbulent flows can no longer be measured or simulated as the advection
forces become increasingly dominant over diffusion, i.e. as the Reynolds number increases, making turbulence a
major unsolved problem [35]. On the other hand, the mean-flow properties or the largest, coherent scales can still
be simulated and observed, thus providing a partial knowledge of the system, but we do not have a consistent
theory to prove fundamental flow properties based on these, nor describe the detailed interactions among turbulent
structures and phenomena they imply.
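To make the role of the Reynolds number concrete, the following back-of-the-envelope estimate uses the classical Kolmogorov scaling, a standard textbook result that is not derived in this review. The symbols $U$ (characteristic velocity), $\nu$ (kinematic viscosity), $\eta$ (smallest dynamically relevant length scale) and $N$ (number of grid points of a fully resolved three-dimensional simulation) are introduced here for illustration only.

```latex
\[
  \mathrm{Re} = \frac{U L}{\nu}, \qquad
  \frac{L}{\eta} \sim \mathrm{Re}^{3/4}, \qquad
  N \sim \left(\frac{L}{\eta}\right)^{3} \sim \mathrm{Re}^{9/4}.
\]
```

Under this estimate, a tenfold increase in the Reynolds number requires roughly $10^{9/4} \approx 178$ times more grid points, before even accounting for the correspondingly smaller time step.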
In this context, supervised ML holds promise for scientific discovery, as summarized in Figure 2. Indeed, it
is possible to generate large amounts of fully resolved direct-numerical-simulation (DNS) data, yielding large training datasets of high quality.
Such data can be used, for example, to learn what the most important structures are through explainable deep
learning [36], sensitivity analysis [37] or information theory [38]. In such a context, a broad range of ML techniques,
such as symbolic regression, reduced-order modeling and autoencoders, can be leveraged to discover novel
structures and relationships in flow features and to uncover previously unknown physical properties of turbulence.
Similarly, turbulence closure modeling has seen significant advances using supervised-ML techniques [39, 40].
In particular, novel closures have been developed for Reynolds-averaged Navier–Stokes (RANS) [41–43] and
large-eddy simulation (LES) [44] turbulence models. Importantly, these closures are analytic expressions that
are interpretable and generalizable, built on sparse symbolic-regression techniques [22]. The broader family of
symbolic-regression methods [41, 45, 46] promises to further aid in the development of these improved closure
models. Furthermore, accelerated computational-fluid-dynamics codes are also being developed using machine
learning, for example to discover improved stencils and numerical schemes for computations [47, 48]. In the
context of astrophysical sciences, a supervised classifier named SPOCK has made it possible to predict, from short-term
simulations, the long-term stability of compact multi-planet systems, a task that would otherwise require integrating the laws of
gravitation over billions of orbital periods [49]. Not only was SPOCK able to predict the long-term stability of small
systems similar to those used for its training, but it was also able to generalize to large multi-planet systems.
An additional challenge associated with complex physical and biological systems is to perform optimal control,
which is a key aspect of many scientific questions as well as industrial applications [50, 51]. In this context there
are also many promising applications of ML: specifically, deep reinforcement learning (DRL) is now leading to
discoveries similar to those of AlphaGo [52], but in Physics. It is common to use simulation data, based on the known
governing equations of the system, to define the environment with which the DRL agent interacts to develop the
best-performing policy. In the case of experimental environments, numerical data can also be used to enhance
the measurements. The use of DRL has spread across different physics applications in recent years. Several
examples can be found in quantum physics, where DRL was able to find the ground state and describe the unitary
time evolution of complex interacting quantum systems [53]. Reinforcement learning has also been employed in
astronomy for the adaptive control of astronomy systems [54]. In the field of turbulence research, DRL has enabled
the discovery of novel and previously unknown flow-control strategies which outperform the previous state of the art [55].
Similar advances have been made in, e.g., tokamak-instability control [14, 56]. This allows the discovery of novel
and more effective ways to tune instability control, which is of great importance in practice. It may also
allow a better understanding of the complexity arising in turbulent systems. By performing an a-posteriori analysis of the
novel control laws using traditional scientific methods, one can analyze and explain how such control laws work,
providing novel insights into the underlying physics. Moreover, novel physical regimes that are not spontaneously
observed, for example, because they lie behind an energy barrier or they are intrinsically unstable, can be discovered
through DRL. For instance, the application of RL in thermodynamics [57] has made it possible to learn previously unknown
thermodynamic cycles.
Intractable simulations and optimization of complex systems with known governing equations can be dramatically
accelerated using ML. For example, autoencoders can be leveraged to discover effective latent-space
representations of complex systems [58]. Such latent spaces can then be used to design effective time integrators
and faster simulations [59], or to perform optimization at a lower computational cost [60]. While speeding up
simulations is not, per se, a scientific discovery, faster simulations make systematic studies possible, and the
improved understanding that results can lead to breakthroughs. For example, in high-energy physics, particle
discovery relies on the ability to accurately compare observed detector response data with expectations based
on physical models [61]. While the processes of subatomic particle interactions with matter are known, the
calculation of the detector response is analytically intractable, and Monte-Carlo methods must be used
to simulate the propagation of particles in detectors for comparison with the data [62, 63]. As a consequence,
recent advances in high-fidelity fast generative models, such as generative adversarial networks (GANs) [64] or
variational autoencoders (VAEs) [65], offer a promising alternative for simulation, gaining orders of magnitude
in simulation speed over existing techniques, provided that these methodologies can be developed to achieve
the required accuracy, which is a subject of ongoing research [66]. A striking example is given by the recent
introduction of LLMs for weather and climate modeling, which is not only revolutionizing weather forecasting [67],
but also has the potential to accelerate scientific discoveries in climate change. Studies of paleoclimates can also
be accelerated by leveraging AI to replace or complement the cumbersome coupled resolution of several complex
climate processes (e.g. convection, clouds, atmospheric chemistry), and by enabling access to the finer resolutions
required to understand regional to local changes in climate [15].
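As a rough illustration of the latent-space representations mentioned above, the sketch below compresses a set of synthetic snapshots into two latent coordinates. It deliberately swaps the autoencoder for proper orthogonal decomposition (POD), which behaves like a linear autoencoder and needs no training loop; this is a simplification chosen for brevity, not the method of the cited works. The function names `pod_encode` and `pod_decode` and the synthetic field are assumptions made for this example.

```python
import numpy as np

def pod_encode(snapshots, rank):
    """Compress snapshots of shape (n_points, n_times) into `rank` latent coordinates per time."""
    mean = snapshots.mean(axis=1, keepdims=True)       # temporal mean field
    fluct = snapshots - mean
    u, _, _ = np.linalg.svd(fluct, full_matrices=False)
    modes = u[:, :rank]                                 # orthonormal spatial modes (linear "decoder")
    latent = modes.T @ fluct                            # latent trajectory, shape (rank, n_times)
    return mean, modes, latent

def pod_decode(mean, modes, latent):
    """Map latent coordinates back to the full state."""
    return mean + modes @ latent

# Synthetic data (hypothetical): two spatial patterns with different time dependence.
x = np.linspace(0.0, 2.0 * np.pi, 200)
t = np.linspace(0.0, 10.0, 80)
snapshots = (np.outer(np.sin(x), np.cos(t))
             + 0.5 * np.outer(np.sin(2.0 * x), np.sin(3.0 * t)))

mean, modes, latent = pod_encode(snapshots, rank=2)
rel_error = (np.linalg.norm(snapshots - pod_decode(mean, modes, latent))
             / np.linalg.norm(snapshots))
print(latent.shape, rel_error)
```

Because the synthetic field is built from only two patterns, two latent coordinates reconstruct it to round-off. Real flow data would generally need more modes, or a nonlinear autoencoder as discussed in the text.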
In applied Mathematics and scientific computing, approaches based on DRL similar to AlphaGo [52] can also
lead to discoveries. Fawzi et al. [8] discovered novel algorithmic optimizations to accelerate matrix operations.
This is important for both practical applications (faster programs, in particular for numerical simulations) and
scientific advancement, opening up new algorithmic optimization possibilities that were previously unknown.
Large language models (LLMs) are now also being leveraged to generate possible algorithms to approach solutions
for classical problems in combinatorics, e.g. the cap set problem [5]. As in the matrix-acceleration task, the
validation of suggested solutions can be performed with a classical algorithm, which allows the discovery of strategies
and algorithms that are both previously unknown and can be formally validated and proved correct. In
both cases, AI is proving useful as a heuristic method to provide good candidate solutions for a problem where the
verification of a solution candidate can be done cheaply, e.g. in deterministic linear [8] or polynomial [5] time. Note
however that deciding how to find, or discover, an adequate candidate is challenging.
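The asymmetry between finding and checking a candidate can be sketched for the cap set problem. A cap set is a subset of Z_3^n in which no three distinct points lie on a line, which is equivalent to no three distinct points summing to zero modulo 3. The checker below is only the cheap verification step; the candidate generator (an LLM in [5]) is out of scope, and the function name `is_cap_set` is an assumption made for this example.

```python
from itertools import combinations

def is_cap_set(points):
    """Return True if no three distinct points of Z_3^n sum to zero modulo 3.

    For distinct a and b, the unique point completing a line is -(a + b) mod 3,
    and it always differs from both a and b, so a set lookup per pair suffices.
    The cost is quadratic in the number of points.
    """
    pts = {tuple(x % 3 for x in p) for p in points}
    for a, b in combinations(pts, 2):
        third = tuple((-(x + y)) % 3 for x, y in zip(a, b))
        if third in pts:
            return False
    return True

print(is_cap_set([(0, 0), (0, 1), (1, 0), (1, 1)]))  # no line among these four points
print(is_cap_set([(0, 0), (1, 1), (2, 2)]))          # these three points form a line
```

Any candidate set proposed by a heuristic, however it was produced, can be accepted or rejected by such a deterministic check.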
makes it possible to train ROMs more effectively without having to learn the symmetries and it enables examining
the physics of the system in the latent space more effectively. A general framework to impose and discover
symmetries in physical systems for ML applications was proposed by Otto et al. [76]. Such approaches, combining
data-driven methods and physics, are essential to achieve novel techniques for scientific discovery. Recent work has
shown the potential of novel deep-learning methods for geometric reasoning [9], a possibility that can be critical
when dealing with symmetries and other physical properties in systems containing intrinsic structures.
Machine learning can also be used to obtain physical knowledge where partial knowledge is available through
the discovery of constitutive laws. Many materials are characterized by complex rheologies that are difficult or
impossible to describe using standard modeling approaches. However, costly high-fidelity or ab initio simulations
can be produced under various loading configurations and used to indirectly infer the constitutive equations or
rheological behaviour. For instance, De Lorenzis and coworkers have recently proposed the EUCLID hybrid finite-
element/neural-network framework for learning constitutive equations in hyperelastic solids [77], and the SpaRTA
framework initially introduced in Ref. [41] for data-driven discovery of turbulence models has been adapted to
the discovery of constitutive equations for elastic solids [78]. Sparse identification has also been used to discover
constitutive equations of crystal structures, learning from ab initio calculations or interatomic potentials [79]. Finally,
a deep-learning method incorporating strong inductive biases, such as objectivity, consistency, and stability, was
developed by As’ad et al. [80] to learn constitutive laws for complex nonlinear materials. In all these examples,
which are schematically represented in Figure 3, the macroscopic behaviour of the material follows the conservation
laws of Mechanics, and only the constitutive equations are unknown. Mahmoudabadbozchelou et al. [81] introduced
the notion of “digital rheometer twin”, where ML methodologies are leveraged to learn the hidden rheology of
complex fluids through a limited number of experiments.
Multiscale modeling and stochastic simulations are other areas where learning from simulated data can lead to
a real discovery. In multiscale simulations, an appropriate model is available at a small scale (e.g., the fundamental
laws of molecular dynamics), and the goal is to learn a model at another scale (e.g., a continuum-scale partial
differential equation) from the data generated at the first scale. In stochastic simulations, the governing equations
contain uncertain parameters or are driven by randomly fluctuating forcing terms representing subgrid variability
and processes (e.g., Langevin equations). Solutions to such problems are typically characterized using probability
density functions (PDFs). The goal is to learn the deterministic dynamics of either the PDF of a system state or its
statistical moments. While PDF equations can be exact under certain conditions, their derivation requires closure
approximations based on field-specific knowledge and can introduce uncontrollable errors. Machine learning
can then infer closure terms from the databases, e.g. based on sparse regression [82]. The D-CIPHER method
has recently been shown to discover many ordinary and partial differential equations [83]. Similarly,
when studying biomolecules, their interactions, and generally their functions that serve the mechanisms of a cell,
Newton’s laws of motion are commonly used to model molecular dynamics at the atomistic level. However, in
these cases, the exact energy functions are not known as these are generally optimized in smaller systems using a
combination of techniques [84]. Therefore, the exact energy function governing these simulations always contains
some noise and includes necessary compromises to achieve computational efficiency. Furthermore, even if the
precise energy functions were known, such simulations are prohibitively expensive [85], and we do not yet have
complete predictive rules for larger-scale interactions at the size of biological macromolecules. Here, ML methods
provide alternative paths to analyse the energies and dynamics of these systems, with both molecular-dynamics
simulations and experiments contributing to generate databases for learning large-scale interactions [86]. They
can also serve as a regulariser for ML models, thus enabling the use of a more coarse-grained, and therefore faster,
representation [87–89].
In the context of Life Sciences, structural biology is a very important example where only partial knowledge
of the phenomena is available. Yet, ML is leading to significant new scientific discoveries. Fast identification of
free-energy minima has been developed for small molecules [90], proteins [7], ligand-binding [91], and mixtures
of these [10]. More concretely, one notable recent advance is AlphaFold [7], which employs a complex deep
architecture with several embedded inductive biases to fold proteins from their one-dimensional amino-acid
sequence into their three-dimensional (3D) native form. Although AlphaFold does not provide any insights into
how proteins fold, it has proven very valuable for the structural-biology community. Among AlphaFold’s
architectural biases, the most notable are: (i) multiple sequence alignment (MSA), a source of co-evolutionary signal
for protein folding, which is processed by a novel transformer architecture, the EvoFormer; and (ii) structural
equivariance, which is guaranteed by using a 3D equivariant transformer trained with a novel loss function, the FAPE
loss.
There are several additional examples of ML leading to scientific discovery in systems where partial information
is available, for instance through generative models. Such models constitute a powerful tool which has crossed
over into popular culture in recent years, especially due to their capability of generating artistic pictures or videos.
Recently, generative AI has been used to learn physical models from large datasets by incorporating prior knowledge
expressed as constraints on the functional form of the learned model or from axiomatic knowledge and experimental
data by combining logical reasoning with symbolic regression [92]. The discovered models enable generalizing
known phenomena to new configurations, for example, new geometries or operating conditions. This can help to
shed light on the physical phenomena in new scenarios beyond those in which data is available. When it comes
to quantum systems, deep reinforcement learning (DRL) has enabled the discovery of novel approaches to put a
quantum system in a given state, which provides novel insights into the underlying physics [93, 94]. Furthermore,
ML is currently helping to reduce the noise in quantum-computing systems, while quantum computing is in turn
helping to improve ML performance with reduced energy consumption [95]. Another example is climate, where ML
is helping to develop climate models enabling the characterization of novel physics. This includes establishing new
large-eddy simulation (LES) models for climate that ensure stable behavior in long-term forecasts [96]. Note that in
LES only the larger scales are resolved, whereas the smaller ones need to be represented by a model, which, in this
case, is developed through ML. There are also several studies which reflect how ML can help to improve classical
weather-prediction systems [97, 98].
AlphaFold [7] has also spurred many innovations in the ML field, e.g. single-sequence methods such as
ESMFold [99]. These single-sequence methods build on foundation models trained to recover a masked
region of a protein sequence [100], a technique also used in AlphaFold to improve its performance. They have also
been used for protein design [101], but have lately been replaced by generative models that take both sequence and
structure into account [102]. This highlights that a method not trained directly to uncover scientific insights can
be utilised to provide such insights in another scientific discipline. Another such example is the use of diffusion
models [103, 104] to generate protein backbone structures suitable for protein design that are routinely tested
using AlphaFold. Lately, diffusion and flow-matching models have also been used to bypass altogether the need for
simulations when generating a molecular ensemble [105].
No information is available
In various scientific fields, certain phenomena exist whose origins and descriptions remain largely unknown, and
we lack known governing equations or foundational physical models to accurately capture their critical aspects.
For example, the field of neuroscience has no governing equations from first principles, as there are no known
conservation laws, symmetries, or other physics that may be used to derive generalizable differential equations.
Even if we have access to equations to describe the behavior of individual molecules or cells, the complexity of the
brain makes it unfeasible to simulate at every scale; furthermore, it is not tractable to take enough measurements to
initialize such a simulation. At the same time, rapid progress is being made that advances our ability to acquire
neural data. For instance, large-scale neural recording and imaging techniques now routinely produce datasets of
unprecedented spatial and temporal resolution [106, 107], and advances in connectomics offer detailed views of cell
types and neural circuit architecture [108, 109]. Taken together, the state of what is possible in data-driven modeling
promises a new era of empirical models that leverage machine learning for discovery in our understanding of
neuroscience and behavior.
Full numerical simulations of the brain and behavior of an animal are impossible to construct and infeasible to
initialize, but data-driven models that recapitulate key input-output relationships and generate predictions can
be crucial tools in future discovery. In scientific fields focused on complex phenomena for which we lack known
governing equations, a common approach to gaining understanding is through perturbation experiments. For
instance, in systems neuroscience, experimentally activating a specific population of neurons at a particular phase
of a visual discrimination task can directly bias the animal’s perception. Such causal experiments allow us to gain
insights into this part of the visual perception pathway. Thus, data-driven models are particularly useful when
they are constructed in a space that matches experimentally measurable quantities and are capable of generating
novel, testable hypotheses. What would happen if we combinatorially activated populations of neurons, and at
several different phases of the task? Because data-driven models are not based on known physical equations, such
predictions can diverge quite drastically from reality, so we emphasize the importance of a rapid iteration between
generating hypotheses and performing experiments, which in turn generate data that continually refine models.
Another way in which ML contributes to discovery is through providing an approach to synthesize diverse
experimental data that measure parts of the same system, but collected separately and at different resolutions. For
instance, the MICrONS dataset [110] includes both functional-imaging data and structural reconstructions of the
same cortical neurons. Although we know that morphological and physiological features of these neurons are
related, each dataset is incomplete, and the relationships between them are difficult to describe. In such cases, a
data-driven model is a valuable way to formalize these relationships.
Where no a-priori knowledge of governing equations is available, ML methods can serve as an approach to learn
dynamics directly from observations. There is a long history of learning differential equations to model neural
dynamics, exemplified by the famed Hodgkin–Huxley equations written to describe the electrical action potential
based on changes in the conductance of ion channels [111]. There are several categories of modern ML approaches
to learn dynamical models. For instance, the SINDy (sparse identification of non-linear dynamics) family of
techniques [22] identifies the underlying dynamics of a system from the data and trajectories it generates by using
sparsity-promoting regression of the transition matrix, or related sparsity-generating techniques. Related approaches
have been used for the same purpose, based on either genetic algorithms [112] or reinforcement learning [113]. Where
we have no governing equations but are aware of some underlying structure, incorporating such constraints into the
learning process can yield faster and more robust discovery of equations describing the dynamics. Once a set of
equations that governs a system is identified, it can be applied to discover novel mechanisms and to efficiently
perform engineering tasks on complex systems.
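The sparse-regression idea behind SINDy can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the reference implementation of [22]: the test system (a damped linear oscillator), the polynomial candidate library, the threshold value and all function names are choices made here for demonstration. Derivatives are estimated from the trajectory by finite differences, and sequentially thresholded least squares selects the few active terms.

```python
import numpy as np

def rhs(state):
    """Hypothetical ground-truth system: a damped linear oscillator."""
    x, y = state
    return np.array([-0.1 * x + 2.0 * y, -2.0 * x - 0.1 * y])

def simulate(x0, dt, n_steps):
    """Generate a trajectory with a classical fourth-order Runge-Kutta scheme."""
    traj = np.empty((n_steps + 1, 2))
    traj[0] = x0
    for k in range(n_steps):
        s = traj[k]
        k1 = rhs(s)
        k2 = rhs(s + 0.5 * dt * k1)
        k3 = rhs(s + 0.5 * dt * k2)
        k4 = rhs(s + dt * k3)
        traj[k + 1] = s + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return traj

def library(X):
    """Candidate terms: 1, x, y, x^2, x*y, y^2."""
    x, y = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x * x, x * y, y * y])

def stlsq(theta, dxdt, threshold=0.05, n_iter=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for j in range(dxdt.shape[1]):
            active = ~small[:, j]
            if active.any():
                xi[active, j] = np.linalg.lstsq(theta[:, active], dxdt[:, j], rcond=None)[0]
    return xi

dt = 0.01
X = simulate(np.array([2.0, 0.0]), dt, 1000)
dXdt = np.gradient(X, dt, axis=0, edge_order=2)   # derivatives estimated from data
Xi = stlsq(library(X), dXdt)
print(np.round(Xi, 3))                            # rows: library terms, columns: dx/dt, dy/dt
```

With clean data only the linear terms survive the thresholding. With noisy measurements the derivative estimate and the threshold become the delicate choices, which is where prior structural knowledge helps.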
Another approach to equation discovery is the neural ordinary differential equation model (NODE) [114].
Unlike traditional neural networks that map inputs to outputs through a series of layers with fixed parameters,
NODEs formulate the transformation from input to output as the solution to an ODE and learn the dynamics of this
transformation over continuous time. While NODE is a powerful and general method, it cannot provide direct
scientific insights into the functional relationships learned from the dataset. To address this problem, there has
been recent interest in hybrid approaches that rely on transformer backbones [115] or on shallow neural-network
architectures [116] inspired by SINDy [22] for learning differential equations from data. Note that interpretable
models may lead to new scientific insights. For instance, such interpretable ML models can be trained on patient
data with specific diagnoses or prognoses and then inspected to discover novel biomarkers for the
detection of diseases or the prediction of treatment outcomes [117, 118].
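The NODE formulation described above can be illustrated with a forward pass only. The sketch below assumes a small, randomly initialized MLP as the vector field and a fixed-step fourth-order Runge–Kutta integrator; the names vector_field and odeint_rk4, the layer sizes, and the integration interval are hypothetical choices for illustration. Training, which would additionally require differentiating through the solver (for example with automatic differentiation or an adjoint method), is deliberately omitted.

```python
import numpy as np

def vector_field(z, params):
    """Small MLP f_theta(z) defining dz/dt; the weights are the learnable parameters."""
    w1, b1, w2, b2 = params
    return np.tanh(z @ w1 + b1) @ w2 + b2

def odeint_rk4(f, z0, params, t0=0.0, t1=1.0, n_steps=50):
    """Integrate dz/dt = f(z) from t0 to t1 with fixed-step RK4."""
    h = (t1 - t0) / n_steps
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        k1 = f(z, params)
        k2 = f(z + 0.5 * h * k1, params)
        k3 = f(z + 0.5 * h * k2, params)
        k4 = f(z + h * k3, params)
        z = z + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return z

# Untrained, randomly initialized parameters (assumption, for illustration only).
rng = np.random.default_rng(0)
dim, hidden = 2, 16
params = (rng.normal(scale=0.3, size=(dim, hidden)), np.zeros(hidden),
          rng.normal(scale=0.3, size=(hidden, dim)), np.zeros(dim))

z0 = np.array([[1.0, 0.0], [0.0, 1.0]])                        # batch of two inputs
z1 = odeint_rk4(vector_field, z0, params)                      # model output = state at t1
z1_fine = odeint_rk4(vector_field, z0, params, n_steps=200)    # finer discretization
print(np.max(np.abs(z1 - z1_fine)))
```

The input-to-output map is defined by the continuous-time flow rather than by a fixed stack of layers, so refining the solver step changes the output only through discretization error. The learned vector field remains a generic neural network, which is why it does not by itself yield interpretable equations.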
From a statistical perspective, a key element that enables the learning of the underlying mechanisms is whether,
in addition to observational data (e.g. electronic health records), interventional data are available. As an example
of interventional data, CRISPR-knockout perturbation screens in single-cell biology [119] already include genome-wide
experiments featuring millions of perturbations [120]. With increasingly automated setups becoming
available [121, 122], a key research question is how to design experiments and select the next intervention so as to
facilitate identifying the underlying mechanisms and governing equations. While decades of research have focused
on active learning for system identification in linear cases [123], more research is needed on learning
an underlying equation from data in such a setting when the parametrized model is a NODE, and more generally in
unknown contexts [124–126].
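As a simple illustration of what selecting the next intervention can mean in the linear case, the sketch below uses a classical greedy D-optimal design step for a linear-in-parameters model. This technique is swapped in here for illustration and is not taken from the cited works. The candidate interventions, the small regularizing prior, and the function name next_experiment are all hypothetical.

```python
import numpy as np

def next_experiment(candidates, info):
    """Index of the candidate x maximizing x^T info^{-1} x (greedy D-optimal step)."""
    gains = np.einsum("ij,jk,ik->i", candidates, np.linalg.inv(info), candidates)
    return int(np.argmax(gains))

# Hypothetical candidate interventions, each exciting the two unknown parameters differently.
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Information matrix starts from a weak prior (assumption) so that it is invertible.
info = 1e-3 * np.eye(2)

chosen = []
for _ in range(3):
    i = next_experiment(candidates, info)
    chosen.append(i)
    info += np.outer(candidates[i], candidates[i])   # rank-one update after the experiment
print(chosen)
```

By the matrix determinant lemma, adding an experiment x multiplies the determinant of the information matrix by 1 + x^T info^{-1} x, so each greedy step picks the candidate that is most informative given what has already been measured. Extending this kind of criterion to nonlinear parametrizations such as NODEs is part of the open question raised above.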
With no prior knowledge of the functional form of the equations, one often does not observe the variables of
interest directly. In these cases, representation learning and latent-space identification are essential for understanding
and predicting complex dynamical systems. A common assumption underlying the idea of representation learning
is that while the observations might exhibit complex behaviour, the underlying dynamics might be expressible in a
simple form in some abstract space. Thus, many techniques for representation learning from high-dimensional data
rely on variational autoencoders and latent-space embedding by a generative model [127, 128]. By identifying the
correct latent spaces, one often hopes to achieve: i) dimensionality reduction, ii) a simpler characterization of the
system, and iii) easier interpretation through the identification of some form of causal structure [129].
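The assumption that complex observations hide simple latent dynamics can be illustrated with a deliberately simplified, linear stand-in for the autoencoders cited above: principal component analysis via the singular value decomposition. In the sketch below, the two-dimensional latent oscillator, the random sensor map, the noise level, and the 99.9% variance cutoff are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hidden 2-D dynamics observed through a hypothetical 50-channel linear sensor map.
t = np.linspace(0.0, 10.0, 500)
latent = np.column_stack([np.cos(2.0 * t), np.sin(2.0 * t)])
mixing = rng.normal(size=(2, 50))
obs = latent @ mixing + 1e-3 * rng.normal(size=(500, 50))

# Linear "encoder/decoder" from the SVD of the centered observations.
mean = obs.mean(axis=0)
u, s, vt = np.linalg.svd(obs - mean, full_matrices=False)
explained = s**2 / np.sum(s**2)
n_latent = int(np.searchsorted(np.cumsum(explained), 0.999) + 1)

def encode(x):
    """Project observations onto the leading principal directions."""
    return (x - mean) @ vt[:n_latent].T

def decode(z):
    """Map latent coordinates back to observation space."""
    return z @ vt[:n_latent] + mean

rel_err = np.linalg.norm(decode(encode(obs)) - obs) / np.linalg.norm(obs)
print(n_latent, rel_err)
```

A linear projection suffices here only because the observations were generated by a linear map. Nonlinear encoders such as variational autoencoders play the same role when the relationship between latent state and measurements is nonlinear.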
Causality and causal formulations of dynamical systems [130] have been of significant interest in recent
years, especially in Earth sciences [131–133] and molecular biology [134, 135]. Using data from biology, Pfister
et al. [136] identified causal ODEs using invariance across heterogeneous experiments as a learning signal, while
other approaches focus on understanding latent causal factors of a dynamical system by learning disentangled
representations from time series [137, 138]. These approaches aim to model not only the latent space but also the
temporal dependencies within a sequence [139, 140]. Learning these structured latent spaces is of crucial importance
since it provides an effective coordinate system in which the dynamics have a simple representation, which is a key
requirement for generalization and interpretability [141]. In Figure 4 we provide a schematic representation of the
identification of an underlying causal structure from observations where the variables of interest are not directly
observed.
Orthogonal to the approaches of using AI to learn and identify dynamical systems is the current research trend
of investigating different ODEs and solvers for improving the general class of diffusion models [142]. The model
class of diffusion, denoising autoencoders, or flow-matching models [143] has the key advantage that it is amenable
to theoretical investigations while providing state-of-the-art results in various domains. This is an example where
not only can system identification profit from machine learning, but general ML approaches also profit from
insights from, and cross-pollination with, control theory and existing knowledge of dynamical systems [144, 145]. While
these models have found widespread success not only in image generation [146] but also in many scientific applications,
from protein modeling [147] to materials science [148], we currently have very limited causal or scientific insights
from these large pre-trained models [149], which leaves ample opportunity for further research.
In acquiring data from complex systems with unknown governing equations, our measurements are often
indirect and incomplete, so that they require extensive processing before they are suitable for modeling and
learning using the approaches described above. In such cases, applications of ML have critically facilitated data
collection and have the potential to further advance scientific discovery by expanding the realm of the observable. For
instance, where neural recordings are incomplete or corrupted, machine-learning approaches to imputation and
generative modeling can produce more complete time-series data that improve downstream applications [150–152].
In large-scale image analysis, several notable examples highlight how once-laborious manual annotation has been
successfully automated by advanced computer vision, turning large quantities of unstructured images or movies
into structured scientific data. In microscopy, automated segmentation and profiling have made tractable the
analysis of heterogeneous cell populations as well as their development in time [153, 154]. Automated segmentation
and image analysis have also enabled the assembly of high-value datasets for the community, including brain connectomes
at cellular and synaptic resolution [108–110, 155, 156]. In animal behavior and ethology, tracking the body
movements of one or many individuals transforms video data into analyzable kinematics and poses, which can then
be related to the underlying neural computation and to their impact on social interactions and behavior [157–161].
Taken together, it is hard to overstate the ongoing impact of machine learning as a critical tool that, when used in
conjunction with other approaches, catalyzes scientific discovery.
Lastly, when employing ML for scientific discovery, a challenge lies in overcoming the black-box nature of
standard applied deep-learning frameworks to gain formal knowledge. A black-box setup utilizing a trained
network is by construction not directly open to human insight [18]: in particular, the information internalized by
the neural network is hidden in the network weights and not easily interpreted, making it challenging to gain formal
knowledge and understanding. In the context of scientific discovery, this is a major challenge for the formalization
of knowledge, as well as for public dissemination and acceptance. Both explainable [36] and interpretable [162]
machine learning offer an alternative in which human oversight is preserved; however, the development and
applicability of these methods are still challenging in many applications. If explainable and/or interpretable ML
can be deployed, the newly discovered knowledge or proposed scientific advances can be grounded in and validated
against existing notions and know-how, securing scientifically sound progress in knowledge.
Scientific communities across a wide range of disciplines are now, following the rapid development of ML
techniques, benefiting from incorporating these methods into their research methodology and toolbox, opening
new opportunities for scientific discoveries. While this comes with new challenges, we believe that recent
developments show that we can expect further evolution of ML techniques adapted to various specific needs across
disciplines. These are likely to provide opportunities to mitigate the limitations outlined in recent works, enabling
new discoveries to take place.
References
1. Deng, J. et al. Imagenet: A large-scale hierarchical image database. 2009 IEEE conference on computer vision
pattern recognition 248–255 (2009).
2. Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
3. Vinyals, O. et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575, 350–354
(2019).
4. Vinuesa, R. et al. The role of artificial intelligence in achieving the Sustainable Development Goals. Nat.
Commun. 11, 233 (2020).
5. Romera-Paredes, B. et al. Mathematical discoveries from program search with large language models. Nature
625, 468–475 (2024).
6. Guastoni, L. et al. Convolutional-network models to predict wall-bounded turbulence from wall quantities. J.
Fluid Mech. 928, A27 (2021).
7. Jumper, J. et al. Highly accurate protein structure prediction with alphafold. Nature 596, 583–589 (2021).
8. Fawzi, A. et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 610,
47–53 (2022).
9. Trinh, T. H., Wu, Y., Le, Q. V., He, H. & Luong, T. Solving olympiad geometry without human demonstrations.
Nature 625, 476–482 (2024).
10. Krishna, R. et al. Generalized biomolecular modeling and design with RoseTTAFold all-atom. Science eadl2528,
DOI: 10.1126/science.adl2528 (2024).
11. Camps-Valls, G. et al. Discovering causal relations and equations from data. Phys. Reports 1044, 1–68 (2023).
12. Castillo, M. The scientific method: a need for something better? Am. J. Neuroradiol. 34, 1669–1671 (2013).
13. Wigner, E. P. The unreasonable effectiveness of mathematics in the natural sciences. In Mathematics and science,
291–306 (World Scientific, 1990).
14. Degrave, J. et al. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 602,
414–419 (2022).
15. Wong, C. How AI is improving climate forecasts. Nature (2024).
16. Wong, F. et al. Discovery of a structural class of antibiotics with explainable deep learning. Nature 626, 177–185,
DOI: 10.1038/s41586-023-06887-8 (2024).
17. Hoffman, R. R., Mueller, S. T., Klein, G. & Litman, J. Metrics for explainable AI: Challenges and prospects.
arXiv preprint arXiv:1812.04608 (2018).
18. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable
models instead. Nat. Mach. Intell. 1, 206–215 (2019).
19. Ferreira, C. Gene expression programming: mathematical modeling by an artificial intelligence, vol. 21 (Springer,
2006).
20. Schmidt, M. & Lipson, H. Distilling free-form natural laws from experimental data. Science 324, 81–85 (2009).
21. McConaghy, T. Ffx: Fast, scalable, deterministic symbolic regression technology. Genet. Program. Theory Pract.
IX 235–260 (2011).
22. Brunton, S. L., Proctor, J. L. & Kutz, J. N. Discovering governing equations from data by sparse identification
of nonlinear dynamical systems. Proc. Natl. Acad. Sci. 113, 3932–3937 (2016).
23. Fuentes, R. et al. Equation discovery for nonlinear dynamical systems: A bayesian viewpoint. Mech. Syst.
Signal Process. 154, 107528 (2021).
24. Langley, P., Bradshaw, G. L. & Simon, H. A. Bacon. 5: The discovery of conservation laws. In IJCAI, vol. 81,
121–126 (1981).
25. Guimerà, R. et al. A Bayesian machine scientist to aid in the solution of challenging scientific problems. Sci.
Adv. 6, eaav6971 (2020).
26. Merchant, A. et al. Scaling deep learning for materials discovery. Nature 624, 80–85, DOI: 10.1038/
s41586-023-06735-9 (2023).
27. Leslie, D. Does the sun rise for ChatGPT? Scientific discovery in the age of generative AI. AI Ethics 1–6 (2023).
28. Halevy, A., Norvig, P. & Pereira, F. The unreasonable effectiveness of data. IEEE intelligent systems 24, 8–12
(2009).
29. Sejnowski, T. J. The unreasonable effectiveness of deep learning in artificial intelligence. Proc. Natl. Acad. Sci.
117, 30033–30038 (2020).
30. Wang, H. et al. Scientific discovery in the age of artificial intelligence. Nature 620, 47–60 (2023).
31. Zenil, H. et al. The future of fundamental science led by generative closed-loop artificial intelligence. Prepr.
arXiv:2307.07522 (2023).
32. Birhane, A., Kasirzadeh, A., Leslie, D. & Wachter, S. Science in the age of large language models. Nat. Rev.
Phys. 5, 277–280 (2023).
33. Fajardo-Fontiveros, O. et al. Fundamental limits to learning closed-form mathematical models from data. Nat.
Commun. 14, 1043 (2023).
34. Chapman, S. & Cowling, T. G. The mathematical theory of non-uniform gases: an account of the kinetic theory of
viscosity, thermal conduction and diffusion in gases (Cambridge university press, 1990).
35. Fefferman, C. L. Existence and smoothness of the Navier-Stokes equation. The millennium prize problems 57, 67
(2000).
36. Cremades, A. et al. Identifying regions of importance in wall-bounded turbulence through explainable deep
learning. Nat. Commun. To Appear. Prepr. arXiv:2302.01250 (2023).
37. Encinar, M. P. & Jiménez, J. Identifying causally significant features in three-dimensional isotropic turbulence.
J. Fluid Mech. 965, A20 (2023).
38. Lozano-Durán, A. & Arranz, G. Information-theoretic formulation of dynamical systems: Causality, modeling,
and control. Phys. Rev. Res. 4, 023195 (2022).
39. Duraisamy, K., Iaccarino, G. & Xiao, H. Turbulence modeling in the age of data. Annu. Rev. Fluid Mech. 51,
357–377 (2019).
40. Brunton, S. L., Noack, B. R. & Koumoutsakos, P. Machine learning for fluid mechanics. Annu. Rev. Fluid Mech.
52, 477–508 (2020).
41. Schmelzer, M., Dwight, R. P. & Cinnella, P. Discovery of algebraic Reynolds-stress models using sparse
symbolic regression. Flow, Turbul. Combust. 104, 579–603 (2020).
42. Beetham, S. & Capecelatro, J. Formulating turbulence closures using sparse regression with embedded form
invariance. Phys. Rev. Fluids 5, 084611 (2020).
43. Beetham, S., Fox, R. O. & Capecelatro, J. Sparse identification of multiphase turbulence closures for coupled
fluid–particle flows. J. Fluid Mech. 914 (2021).
44. Zanna, L. & Bolton, T. Data-driven equation discovery of ocean mesoscale closures. Geophys. Res. Lett. 47,
e2020GL088376 (2020).
45. Bongard, J. & Lipson, H. Automated reverse engineering of nonlinear dynamical systems. Proc. Natl. Acad. Sci.
104, 9943–9948 (2007).
46. Cranmer, M. Interpretable machine learning for science with PySR and SymbolicRegression.jl. arXiv preprint
arXiv:2305.01582 (2023).
47. Bar-Sinai, Y., Hoyer, S., Hickey, J. & Brenner, M. P. Learning data-driven discretizations for partial differential
equations. Proc. Natl. Acad. Sci. 116, 15344–15349 (2019).
48. Kochkov, D. et al. Machine learning accelerated computational fluid dynamics. Proc. Natl Acad. Sci. USA 118,
e2101784118 (2021).
49. Tamayo, D. et al. Predicting the long-term stability of compact multiplanet systems. Proc. Natl. Acad. Sci. 117,
18194–18205 (2020).
50. Sipp, D., Marquet, O., Meliga, P. & Barbagallo, A. Dynamics and Control of Global Instabilities in Open-Flows:
A Linearized Approach. Appl. Mech. Rev. 63, 030801, DOI: 10.1115/1.4001478 (2010).
51. Al-Housseiny, T. T., Tsai, P. A. & Stone, H. A. Control of interfacial instabilities using flow geometry. Nat. Phys.
8, 747–750 (2012).
52. Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489
(2016).
53. Carleo, G. & Troyer, M. Solving the quantum many-body problem with artificial neural networks. Science 355,
602–606 (2017).
54. Nousiainen, J., Rajani, C., Kasper, M. & Helin, T. Adaptive optics control using model-based reinforcement
learning. Opt. Express 29, 15327–15344 (2021).
55. Guastoni, L., Rabault, J., Schlatter, P., Azizpour, H. & Vinuesa, R. Deep reinforcement learning for turbulent
drag reduction in channel flows. The Eur. Phys. J. E 46, 27 (2023).
56. Seo, J. et al. Feedforward beta control in the kstar tokamak by deep reinforcement learning. Nucl. Fusion 61,
106010 (2021).
57. Beeler, C. et al. Optimizing thermodynamic trajectories using evolutionary and gradient-based reinforcement
learning. Phys. Rev. E 104, 064128 (2021).
58. Solera-Rico, A. et al. β-Variational autoencoders and transformers for reduced-order modelling of fluid flows.
Nat. Commun. 15, 1361 (2024).
59. Wiewel, S., Becher, M. & Thuerey, N. Latent space physics: Towards learning the temporal evolution of fluid
flow. In Computer graphics forum, vol. 38, 71–82 (Wiley Online Library, 2019).
60. Park, S. et al. Optimization of physical quantities in the autoencoder latent space. Sci. Reports 12, 9003 (2022).
61. A detailed map of Higgs boson interactions by the ATLAS experiment ten years after the discovery. Nature
607, 52–59 (2022).
62. Sjöstrand, T., Mrenna, S. & Skands, P. Pythia 6.4 physics and manual. J. High Energy Phys. 2006, 026 (2006).
63. ATLAS Collaboration. Observation of a new particle in the search for the Standard Model Higgs boson with
the ATLAS detector at the LHC. Phys. Lett. B 716, 1–29 (2012).
64. Goodfellow, I. et al. Generative adversarial networks. Prepr. arXiv:1406.2661 (2014).
65. Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Prepr. arXiv:1312.6114 (2014).
66. Albertsson, K. et al. Machine learning in high energy physics community white paper. In Journal of Physics:
Conference Series, vol. 1085, 022008 (IOP Publishing, 2018).
67. Pathak, J. et al. Fourcastnet: A global data-driven high-resolution weather model using adaptive Fourier neural
operators. arXiv preprint arXiv:2202.11214 (2022).
68. Neumann, J. v. Theory of self-reproducing automata. Ed. by Arthur W. Burks (1966).
69. Souza, L. F., Rocha Filho, T. M. & Moret, M. A. Relating SARS-CoV-2 variants using cellular automata imaging.
Sci. Reports 12, 10297 (2022).
70. Alert, R., Casademunt, J. & Joanny, J.-F. Active turbulence. Annu. Rev. Condens. Matter Phys. 13, 143–170 (2022).
71. Cranmer, M. et al. Discovering symbolic models from deep learning with inductive biases. 34th Conf. on Neural
Inf. Process. Syst. (NeurIPS 2020), Vancouver, Can. (2020).
72. Liu, Z., Chen, Y., Du, Y. & Tegmark, M. Physics-augmented learning: A new paradigm beyond physics-
informed learning. arXiv preprint arXiv:2109.13901 (2021).
73. Moya, B., Badías, A., González, D., Chinesta, F. & Cueto, E. A thermodynamics-informed active learning
approach to perception and reasoning about fluids. Comput. Mech. 72, 577–591 (2023).
74. Kapteyn, M. G., Pretorius, J. V. R. & Willcox, K. E. A probabilistic graphical model foundation for enabling
predictive digital twins at scale. Nat. Comput. Sci. 1, 337–347 (2021).
75. Kneer, S., Sayadi, T., Sipp, D., Schmid, P. & Rigas, G. Symmetry-aware autoencoders: s-PCA and s-NLPCA.
Prepr. arXiv:2111.02893v3 (2022).
76. Otto, S. E., Zolman, N., Kutz, J. N. & Brunton, S. L. A unified framework to enforce, discover, and promote
symmetry in machine learning. Prepr. arXiv:2311.00212 (2023).
77. Flaschel, M., Kumar, S. & De Lorenzis, L. Automated discovery of generalized standard material models with
euclid. Comput. Methods Appl. Mech. Eng. 405, 115867 (2023).
78. Wang, M., Chen, C. & Liu, W. Establish algebraic data-driven constitutive models for elastic solids with a
tensorial sparse symbolic regression method and a hybrid feature selection technique. J. Mech. Phys. Solids 159,
104742 (2022).
79. Im, S., Kim, H., Kim, W., Chung, H. & Cho, M. Discovering constitutive equations of crystal structures by
sparse identification. Int. J. Mech. Sci. 236, 107756 (2022).
80. As’ad, F., Avery, P. & Farhat, C. A mechanics-informed artificial neural network approach in data-driven
constitutive modeling. Int. J. for Numer. Methods Eng. 123, 2738–2759 (2022).
81. Mahmoudabadbozchelou, M., Kamani, K. M., Rogers, S. A. & Jamali, S. Digital rheometer twins: Learning the
hidden rheology of complex fluids through rheology-informed graph neural networks. Proc. Natl. Acad. Sci.
119, e2202234119 (2022).
82. Bakarji, J. & Tartakovsky, D. M. Data-driven discovery of coarse-grained equations. J. Comput. Phys. 434,
110219 (2021).
83. Kacprzyk, K., Qian, Z. & van der Schaar, M. D-cipher: Discovery of closed-form partial differential equations.
Prepr. arXiv:2206.10586 (2023).
84. Adcock, S. A. & McCammon, J. A. Molecular dynamics: survey of methods for simulating the activity of
proteins. Chem. Rev. 106, 1589–1615 (2006).
85. Freddolino, P. L., Harrison, C. B., Liu, Y. & Schulten, K. Challenges in protein-folding simulations. Nat. Phys. 6,
751–758 (2010).
86. Glielmo, A. et al. Unsupervised learning methods for molecular simulation data. Chem. Rev. 121, 9722–9758
(2021).
87. Noé, F., Tkatchenko, A., Müller, K.-R. & Clementi, C. Machine learning for molecular simulation. Annu. Rev.
Phys. Chem. 71, 361–390 (2020).
88. Noé, F., Olsson, S., Köhler, J. & Wu, H. Boltzmann generators: Sampling equilibrium states of many-body
systems with deep learning. Science 365, eaaw1147 (2019).
89. Noé, F., Tkatchenko, A., Müller, K.-R. & Clementi, C. Machine learning for molecular simulation. Annu. Rev.
Phys. Chem. 71, 361–390 (2020).
90. Nagai, R., Akashi, R. & Sugino, O. Completing density functional theory by machine learning hidden messages
from molecules. NPJ Comput. Mater 6, 43 (2020).
91. Corso, G., Stärk, H., Jing, B., Barzilay, R. & Jaakkola, T. DiffDock: Diffusion steps, twists, and turns for
molecular docking. Prepr. arXiv:2210.01776 (2023).
92. Cornelio, C. et al. Combining data and theory for derivable scientific discovery with ai-descartes. Nat. Commun.
14, 1777 (2023).
93. Ma, H., Dong, D., Ding, S. X. & Chen, C. Curriculum-based deep reinforcement learning for quantum control.
IEEE Transactions on Neural Networks Learn. Syst. (2022).
94. Melnikov, A. A. et al. Active learning machine learns to create new quantum experiments. Proc. Natl. Acad. Sci.
115, 1221–1226 (2018).
95. Melnikov, A., Kordzanganeh, M., Alodjants, A. & Lee, R.-K. Quantum machine learning: from physics to
software engineering. Adv. Physics: X 8, 2165452 (2023).
96. Frezat, H., Sommer, J., Fablet, R., Balarac, G. & Lguensat, R. A posteriori learning for quasi-geostrophic
turbulence parametrization. J. Adv. Model. Earth Syst. 14, e2022MS003124 (2022).
97. Molina, M. J. et al. A review of recent and emerging machine learning applications for climate variability and
weather phenomena. Artif. Intell. for Earth Syst. 2, 220086 (2023).
98. de Burgh-Day, C. O. & Leeuwenburg, T. Machine learning for numerical weather and climate modelling: a
review. Geosci. Model. Dev. 16, 6433–6477 (2023).
99. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science
379, 1123–1130 (2023).
100. Elnaggar, A. et al. Prottrans: Toward understanding the language of life through self-supervised learning.
IEEE Trans Pattern Anal Mach Intell 44, 7112–7127, DOI: 10.1109/TPAMI.2021.3095381 (2022).
101. Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat
Biotechnol 41, 1099–1106, DOI: 10.1038/s41587-022-01618-2 (2023).
102. Dauparas, J. et al. Robust deep learning-based protein sequence design using ProteinMPNN. Science 378,
49–56, DOI: 10.1126/science.add2187 (2022).
103. Watson, J. et al. De novo design of protein structure and function with RFdiffusion. Nature 620, 1089–1100,
DOI: 10.1038/s41586-023-06415-8 (2023).
104. Ingraham, J. et al. Illuminating protein space with a programmable generative model. Nature 623, 1070–1078,
DOI: 10.1038/s41586-023-06728-8 (2023).
105. Jing, B., Berger, B. & Jaakkola, T. Alphafold meets flow matching for generating protein ensembles. Prepr.
arXiv:2402.04845 (2024).
106. Steinmetz, N. A., Zatka-Haas, P., Carandini, M. & Harris, K. D. Distributed coding of choice, action and
engagement across the mouse brain. Nature 576, 266–273, DOI: 10.1038/s41586-019-1787-x (2019).
107. Zhou, Y. et al. Distributed functions of prefrontal and parietal cortices during sequential categorical decisions.
eLife 10, e58782, DOI: 10.7554/eLife.58782 (2021).
108. Dorkenwald, S. et al. Flywire: online community for whole-brain connectomics. Nat. methods 19, 119–128
(2022).
109. Yao, S. et al. A whole-brain monosynaptic input connectome to neuron classes in mouse visual cortex. Nat.
neuroscience 26, 350–364 (2023).
110. MICrONS Consortium et al. Functional connectomics spanning multiple areas of mouse visual cortex. bioRxiv 2021–07
(2021).
111. Nelson, M. & Rinzel, J. The Hodgkin–Huxley model. In The Book of GENESIS (1995).
112. Chen, Y., Luo, Y., Liu, Q., Xu, H. & Zhang, D. Symbolic genetic algorithm for discovering open-form partial
differential equations (sga-pde). Phys. Rev. Res. 4, 023174 (2022).
113. Du, M., Chen, Y. & Zhang, D. Discover: Deep identification of symbolic open-form pdes via enhanced
reinforcement-learning. Prepr. arXiv:2210.02181 (2022).
114. Chen, R. T., Rubanova, Y., Bettencourt, J. & Duvenaud, D. K. Neural ordinary differential equations. Adv.
neural information processing systems 31 (2018).
115. Becker, S., Klein, M., Neitz, A., Parascandolo, G. & Kilbertus, N. Predicting ordinary differential equations
with transformers. In International Conference on Machine Learning, 1978–2002 (PMLR, 2023).
116. Sahoo, S., Lampert, C. & Martius, G. Learning equations for extrapolation and control. In International
Conference on Machine Learning, 4442–4450 (PMLR, 2018).
117. Qiu, S. et al. Development and validation of an interpretable deep learning framework for alzheimer’s disease
classification. Brain 143, 1920–1933 (2020).
118. Jin, T., Nguyen, N. D., Talos, F. & Wang, D. Ecmarker: interpretable machine learning model identifies gene
expression biomarkers predicting clinical outcomes and reveals molecular mechanisms of human disease in
early stages. Bioinformatics 37, 1115–1124 (2021).
119. Ji, Y., Lotfollahi, M., Wolf, F. A. & Theis, F. J. Machine learning for perturbational single-cell omics. Cell Syst.
12, 522–537 (2021).
120. Gasperini, M. et al. A genome-wide framework for mapping gene regulation via cellular genetic screens. Cell
176, 377–390 (2019).
121. Häse, F., Roch, L. M. & Aspuru-Guzik, A. Next-generation experimentation with self-driving laboratories.
Trends Chem. 1, 282–291 (2019).
122. MacLeod, B. P. et al. A self-driving laboratory advances the pareto front for material properties. Nat. Commun.
13, 995 (2022).
123. Wagenmaker, A. & Jamieson, K. Active learning for identification of linear dynamical systems. In Conference
on Learning Theory, 3487–3582 (PMLR, 2020).
124. Pauwels, E., Lajaunie, C. & Vert, J.-P. A Bayesian active learning strategy for sequential experimental design in
systems biology. BMC Syst. Biol. 8, 1–11 (2014).
125. Du, J., Futoma, J. & Doshi-Velez, F. Model-based reinforcement learning for semi-Markov decision processes
with neural ODEs. Adv. Neural Inf. Process. Syst. 33, 19805–19816 (2020).
126. Wu, K., O’Leary-Roseberry, T., Chen, P. & Ghattas, O. Large-scale Bayesian optimal experimental design with
derivative-informed projected neural network. J. Sci. Comput. 95, 30 (2023).
127. Bengio, Y., Courville, A. & Vincent, P. Representation learning: A review and new perspectives. IEEE
transactions on pattern analysis machine intelligence 35, 1798–1828 (2013).
128. Pandarinath, C. et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nat.
methods 15, 805–815 (2018).
129. Schölkopf, B. et al. Toward causal representation learning. Proc. IEEE 109, 612–634 (2021).
130. Peters, J., Bauer, S. & Pfister, N. Causal models for dynamical systems. In Probabilistic and Causal Inference: The
Works of Judea Pearl, 671–690 (2022).
131. Runge, J. et al. Inferring causation from time series in earth system sciences. Nat. Commun. 10, 2553 (2019).
132. Nowack, P., Runge, J., Eyring, V. & Haigh, J. D. Causal networks for climate model evaluation and constrained
projections. Nat. communications 11, 1415 (2020).
133. Nguyen, T., Brandstetter, J., Kapoor, A., Gupta, J. K. & Grover, A. Climax: A foundation model for weather
and climate. Prepr. arXiv:2301.10343 (2023).
134. Tejada-Lapuerta, A. et al. Causal machine learning for single-cell genomics. Prepr. arXiv:2310.14935 (2023).
135. Lobentanzer, S., Rodriguez-Mier, P., Bauer, S. & Saez-Rodriguez, J. Molecular causality in the advent of
foundation models. Prepr. arXiv:2401.09558 (2024).
136. Pfister, N., Bauer, S. & Peters, J. Learning stable and predictive structures in kinetic systems. Proc. Natl. Acad.
Sci. 116, 25405–25411 (2019).
137. Lippe, P. et al. Citris: Causal identifiability from temporal intervened sequences. In International Conference on
Machine Learning, 13557–13603 (PMLR, 2022).
138. Song, X. et al. Temporally disentangled representation learning under unknown nonstationarity. Adv. Neural
Inf. Process. Syst. 36 (2024).
139. Yildiz, C., Heinonen, M. & Lahdesmaki, H. Ode2vae: Deep generative second order odes with bayesian neural
networks. Adv. Neural Inf. Process. Syst. 32 (2019).
140. Girin, L. et al. Dynamical variational autoencoders: A comprehensive review. Prepr. arXiv:2008.12595 (2020).
141. Champion, K., Lusch, B., Kutz, J. N. & Brunton, S. L. Data-driven discovery of coordinates and governing
equations. Proc. Natl. Acad. Sci. 116, 22445–22451 (2019).
142. Lu, C. et al. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Adv.
Neural Inf. Process. Syst. 35, 5775–5787 (2022).
143. Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M. & Le, M. Flow matching for generative modeling. Prepr.
arXiv:2210.02747 (2022).
144. Berner, J., Richter, L. & Ullrich, K. An optimal control perspective on diffusion-based generative modeling.
Prepr. arXiv:2211.01364 (2022).
145. Karras, T. et al. Analyzing and improving the training dynamics of diffusion models. Prepr. arXiv:2312.02696
(2023).
146. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent
diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 10684–10695
(2022).
147. Watson, J. L. et al. De novo design of protein structure and function with rfdiffusion. Nature 620, 1089–1100
(2023).
148. Zeni, C. et al. Mattergen: a generative model for inorganic materials design. Prepr. arXiv:2312.03687 (2023).
149. Nichani, E., Damian, A. & Lee, J. D. How transformers learn causal structure with gradient descent. Prepr.
arXiv:2402.14735 (2024).
150. Peterson, S. M. et al. Ajile12: Long-term naturalistic human intracranial neural recordings and pose. Sci. data 9,
184 (2022).
151. Talukder, S., Sun, J. J., Leonard, M., Brunton, B. W. & Yue, Y. Deep neural imputation: A framework for
recovering incomplete brain recordings. arXiv preprint arXiv:2206.08094 (2022).
152. Vetter, J., Macke, J. H. & Gao, R. Generating realistic neurophysiological time series with denoising diffusion
probabilistic models. bioRxiv 2023–08 (2023).
153. Kirillov, A. et al. Segment anything. Prepr. arXiv:2304.02643 (2023).
154. Greenwald, N. F. et al. Whole-cell segmentation of tissue images with human-level performance using
large-scale data annotation and deep learning. Nat. biotechnology 40, 555–565 (2022).
155. Scheffer, L. K. et al. A connectome and analysis of the adult drosophila central brain. elife 9, e57443 (2020).
156. Takemura, S.-Y. et al. A connectome of the male drosophila ventral nerve cord. bioRxiv 2023–06 (2023).
157. Mathis, A. et al. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nat.
Neurosci. 21, 1281–1289, DOI: 10.1038/s41593-018-0209-y (2018).
158. Pereira, T. D. et al. SLEAP: A deep learning system for multi-animal pose tracking. Nat. Methods 19, 486–495,
DOI: 10.1038/s41592-022-01426-1 (2022).
159. Karashchuk, P. et al. Anipose: A toolkit for robust markerless 3D pose estimation. Cell Reports 36, 109730, DOI:
10.1016/[Link].2021.109730 (2021).
160. Schweihoff, J. F. et al. DeepLabStream enables closed-loop behavioral experiments using deep learning-based
markerless, real-time posture detection. Commun. Biol. 4, 1–11, DOI: 10.1038/s42003-021-01654-9 (2021).
161. Dunn, T. W. et al. Geometric deep learning enables 3D kinematic profiling across species and environments.
Nat. Methods 18, 564–573, DOI: 10.1038/s41592-021-01106-6 (2021).
162. Vinuesa, R. & Sirmacek, B. Interpretable deep-learning models to help achieve the Sustainable Development
Goals. Nat. Mach. Intell. 3, 926 (2021).
163. Vinuesa, R. et al. Turbulent boundary layers around wing sections up to Re_c = 1,000,000. Int. J. Heat Fluid Flow
72, 86–99 (2018).
164. Institute, S. T. S. Dark matter even darker than once thought. Sci. Release – ESA (2015).
165. Fleming, N. Computer-calculated compounds. Nature 557, S55–S57 (2015).
166. Eivazi, H., Le Clainche, S., Hoyas, S. & Vinuesa, R. Towards extraction of orthogonal and parsimonious
non-linear modes from turbulent flows. Expert. Syst. with Appl. 202, 117038 (2022).
167. Suárez, P. et al. Active flow control for three-dimensional cylinders through deep reinforcement learning. Prepr.
arXiv:2309.02462 (2023).
Acknowledgements
The following researchers are acknowledged for helpful discussions during the preparation of this article: Frida
Bender, Annica Ekman, Inga Koszalka, Romit Maulik, Henrik Nielsen, Gunilla Svensson, Björn Wallner. RV and
HA acknowledge SeRC and Digital Futures for funding the workshop that initiated this work. RV acknowledges
financial support from ERC grant no. 2021-CoG-101043998, DEEPCONTROL. DM acknowledges financial support
from ERC grant no. 2022-StG-101075494, MultiPRESS. Views and opinions expressed are, however, those of the
author(s) only and do not necessarily reflect those of the European Union or the European Research Council. Neither
the European Union nor the granting authority can be held responsible for them. AE was funded by Vetenskapsrådet
Grant No. 2021-03979, the Knut and Alice Wallenberg Foundation, and SeRC. SLB acknowledges funding
support from the US National Science Foundation AI Institute in Dynamic Systems (grant number 2112085) and
from The Boeing Company.
Author contributions
RV and HA initiated the idea for this article following a workshop held in November 2022 at KTH. All the
authors contributed equally to the rest of this work.
Competing interests
The authors declare no competing interests.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Figure 1. Schematic representation of the various applications of ML for scientific discovery, depending on the
amount of knowledge available in each category. A number of examples are provided, including turbulent
flows [163], dark matter [164], drug discovery [165] and brain research. Other relevant figures were adapted from
Refs. [71, 166].
Figure 2. Schematic representation of ML directions to enable scientific discoveries when complete information
about the governing equations is available. In such a case, supervised, unsupervised, and
reinforcement-learning methodologies can all be leveraged. Supervised and unsupervised methodologies are made
possible by the ability to generate large synthetic datasets by simulating the governing equations. This
makes it possible to deploy a variety of ML techniques that can discover complex hidden relations, nonlinear coordinate
systems, and hidden dynamics, or solve problems that are otherwise intractable. Reinforcement learning can also be
used by coupling it to the physics simulator, an approach that has already proven successful at discovering previously
unknown control strategies and regimes of complex systems. It can likewise generate high-quality heuristic guesses
for problems where verifying a solution is easy but proposing good candidate solutions
is hard. Some panels were adapted from Refs. [7, 167].
Figure 3. Example of machine learning applied to a case where partial knowledge is available about the
underlying system. The model (for instance a flow with complex rheology or a flow through a porous
medium) depends on a set of known inputs x (e.g. geometry, boundary conditions, etc.) as well as on a set
of hidden (unobservable) variables α describing, e.g., the fluid constitutive behavior. The latter may involve
small-scale phenomena that can be difficult or impossible to describe. In such conditions, experimental or
numerical data for observable quantities (e.g. velocity fields or stresses) can be used to infer the unknown field by
training a machine-learning model (here represented as a neural network, although other ML approaches are
possible). The model is subjected to a set of available physical constraints (e.g. positivity, symmetries or
invariances). The whole process makes it possible, on the one hand, to train a data-driven closure model for the hidden
variables α and, on the other hand, to gain a-posteriori physical knowledge of the fluid constitutive properties.
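As a rough illustration of the constrained inference described in this caption, the following toy sketch infers a single hidden positive coefficient from observable data. It is not the authors' method: the linear model y = α x, the value ALPHA_TRUE, the function name fit_positive_alpha and the learning-rate settings are all assumptions made for illustration, and a scalar parameter stands in for the neural-network closure. Positivity is enforced by construction through the parameterization α = exp(θ).

```python
import math

# Synthetic "observations": known inputs x (e.g. strain rates) and an observable
# y (e.g. stresses) generated with a hidden coefficient ALPHA_TRUE.
# ALPHA_TRUE is a hypothetical value chosen only for this sketch.
ALPHA_TRUE = 0.5
xs = [0.1 * k for k in range(1, 11)]
ys = [ALPHA_TRUE * x for x in xs]


def fit_positive_alpha(xs, ys, lr=1.0, n_steps=500):
    """Fit y = alpha * x with alpha = exp(theta), so alpha > 0 by construction."""
    theta = 0.0  # initial guess: alpha = exp(0) = 1
    n = len(xs)
    for _ in range(n_steps):
        alpha = math.exp(theta)
        # Gradient of the mean squared error with respect to theta (chain rule:
        # dL/dtheta = dL/dalpha * dalpha/dtheta, and dalpha/dtheta = alpha).
        grad = sum(2.0 * (alpha * x - y) * x for x, y in zip(xs, ys)) / n * alpha
        theta -= lr * grad
    return math.exp(theta)


alpha_hat = fit_positive_alpha(xs, ys)
print(f"inferred alpha = {alpha_hat:.4f}")
```

Because the constraint is built into the parameterization rather than added as a penalty, every iterate is physically admissible; symmetries or invariances would need a different treatment, for example invariant input features.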
Figure 4. Schematic representation of a model (for instance the observed symptoms of an unknown or complex
disease within a population, or observed opinion dynamics within a social network) where the behavior
observed in data depends on an unknown dynamic or causal structure. The observed behavior or dynamics
might occur on several different spatial and temporal scales, and the observed data might reflect more or fewer
aspects of the underlying system. In such conditions, representation-learning methods can be employed to distill
an explanation of the observed data, in the form of a system of ODEs or a causal-graph representation.
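One way to make the idea of distilling a system of ODEs from data concrete is sparse regression over a library of candidate terms. The sketch below is an assumption-laden toy rather than the procedure behind this figure: it applies sequentially thresholded least squares (the regression step used in sparse-identification approaches such as SINDy) to noise-free samples of x(t) = exp(-2t), with a hypothetical library [1, x, x^2], a hypothetical threshold of 0.1, and helper names invented for this example.

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(columns, target):
    """Least-squares fit via the normal equations (adequate for a tiny library)."""
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(p * y for p, y in zip(ci, target)) for ci in columns]
    return solve_linear(gram, rhs)


def sparse_fit(columns, target, threshold=0.1, n_iter=5):
    """Sequentially thresholded least squares: fit, drop small terms, refit."""
    active = list(range(len(columns)))
    coef = [0.0] * len(columns)
    for _ in range(n_iter):
        if not active:
            break
        sol = least_squares([columns[k] for k in active], target)
        coef = [0.0] * len(columns)
        for k, v in zip(active, sol):
            coef[k] = v
        active = [k for k in active if abs(coef[k]) >= threshold]
    return [c if abs(c) >= threshold else 0.0 for c in coef]


# Noise-free samples of x(t) = exp(-2 t), whose governing ODE is dx/dt = -2 x.
h = 0.01
t = [i * h for i in range(101)]
x = [math.exp(-2.0 * ti) for ti in t]

# Central finite differences on the interior points approximate dx/dt.
dxdt = [(x[i + 1] - x[i - 1]) / (2.0 * h) for i in range(1, 100)]
xi = x[1:100]

# Candidate library: constant, linear and quadratic terms.
library = [[1.0] * len(xi), xi, [v * v for v in xi]]
coef = sparse_fit(library, dxdt)
print("coefficients for [1, x, x^2]:", coef)
```

In this toy setting only the linear term survives the thresholding, with a coefficient close to -2. Real data would add noise, multiple scales and partial observation, which is where the choice of library, derivative estimation and threshold becomes delicate.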
Table 1. Overview of the opportunities for machine learning in scientific discovery. Based on the
differentiation presented in our work, the level of prior deterministic knowledge (left) can be used to differentiate
methods (second from right) and applications (right) across which scientific advancements can be made by means of
dedicated AI systems. This also allows for various modes of discovery (second from left), ranging from cases where
machine learning enables discovery by allowing for efficient computational usage, parametric sweeps, etc., to
cases where machine learning is used to causally infer underlying mechanistic behaviours in complex
multidisciplinary systems.