Abstract
We introduce physics-informed neural networks – neural networks that are trained to solve supervised learning tasks while respecting any given laws of physics described by general nonlinear partial differential equations. In this work, we present our developments in the context of solving two main classes of problems: data-driven solution and data-driven discovery of partial differential equations. Depending on the nature and arrangement of the available data, we devise two distinct types of algorithms, namely continuous time and discrete time models. The first type of models forms a new family of data-efficient spatio-temporal function approximators, while the latter type allows the use of arbitrarily accurate implicit Runge-Kutta time stepping schemes with an unlimited number of stages. The effectiveness of the proposed framework is demonstrated through a collection of classical problems in fluids, quantum mechanics, reaction-diffusion systems, and the propagation of nonlinear shallow-water waves.
Keywords: Data-driven scientific computing, Machine learning, Predictive
modeling, Runge-Kutta methods, Nonlinear dynamics
1. Introduction
With the explosive growth of available data and computing resources, recent advances in machine learning and data analytics have yielded
transformative results across diverse scientific disciplines, including image recognition [1], cognitive science [2], and genomics [3]. However, more often than not, in the course of analyzing complex physical, biological, or engineering systems, the cost of data acquisition is prohibitive, and we are inevitably faced with the challenge of drawing conclusions and making decisions under partial information. In this small data regime, the vast majority of state-of-the-art machine learning techniques (e.g., deep/convolutional/recurrent neural networks) lack robustness and fail to provide any guarantees of convergence.
bility of the proposed methods to discrete-time domains and compromising the accuracy of their predictions in strongly nonlinear regimes. Secondly, the Bayesian nature of Gaussian process regression requires certain prior assumptions that may limit the representation capacity of the model and give rise to robustness/brittleness issues, especially for nonlinear problems [10].
2. Problem setup
In this work we take a different approach by employing deep neural networks and leverage their well-known capability as universal function approximators [11]. In this setting, we can directly tackle nonlinear problems without the need for committing to any prior assumptions, linearization, or local time-stepping. We exploit recent developments in automatic differentiation [12] – one of the most useful but perhaps under-utilized techniques in scientific computing – to differentiate neural networks with respect to their input coordinates and model parameters to obtain physics-informed neural networks. Such neural networks are constrained to respect any symmetries, invariances, or conservation principles originating from the physical laws that govern the observed data, as modeled by general time-dependent and nonlinear partial differential equations. This simple yet powerful construction allows us to tackle a wide range of problems in computational science and introduces a potentially transformative technology leading to the development of new data-efficient and physics-informed learning machines, new classes of numerical solvers for partial differential equations, as well as new data-driven approaches for model inversion and systems identification.
The general aim of this work is to set the foundations for a new paradigm in modeling and computation that enriches deep learning with the longstanding developments in mathematical physics. To this end, our manuscript is divided into two parts that aim to present our developments in the context of two major classes of problems: data-driven solution and data-driven discovery of partial differential equations. All code and data sets accompanying this manuscript are available on GitHub at https://github.com/maziarraissi/PINNs. Throughout this work we have been using relatively simple deep feed-forward neural network architectures with hyperbolic tangent activation functions and no additional regularization (e.g., L1/L2 penalties, dropout, etc.). Each numerical example in the manuscript is accompanied by a detailed discussion of the neural network architecture we employed as well as details about its training process (e.g., optimizer, learning rates, etc.). Finally, a comprehensive series of systematic studies that aims to demonstrate the performance of the proposed methods is provided in Appendix A and Appendix B.
3.1. Continuous Time Models
We define f(t, x) to be given by the left-hand side of equation (2); i.e.,

f := u_t + \mathcal{N}[u],    (3)

and proceed by approximating u(t, x) by a deep neural network. This assumption, along with equation (3), results in a physics-informed neural network f(t, x). This network can be derived by applying the chain rule for differentiating compositions of functions using automatic differentiation [12], and has the same parameters as the network representing u(t, x), albeit with different activation functions due to the action of the differential operator \mathcal{N}. The shared parameters between the neural networks u(t, x) and f(t, x) can be learned by minimizing the mean squared error loss

MSE = MSE_u + MSE_f,    (4)

where

MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} |u(t_u^i, x_u^i) - u^i|^2,

and

MSE_f = \frac{1}{N_f} \sum_{i=1}^{N_f} |f(t_f^i, x_f^i)|^2.

Here, \{t_u^i, x_u^i, u^i\}_{i=1}^{N_u} denote the initial and boundary training data on u(t, x) and \{t_f^i, x_f^i\}_{i=1}^{N_f} specify the collocation points for f(t, x). The loss MSE_u corresponds to the initial and boundary data while MSE_f enforces the structure imposed by equation (2) at a finite set of collocation points. Although similar ideas for constraining neural networks using physical laws have been explored in previous studies [15, 16], here we revisit them using modern computational tools, and apply them to more challenging dynamic problems described by time-dependent nonlinear partial differential equations.
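To make the continuous time construction concrete, the following is a minimal sketch of how automatic differentiation produces the residual network f(t, x) and the composite loss. It is written in PyTorch and is not the code released with this manuscript; the network size is an illustrative assumption, and the Burgers' operator N[u] = u u_x − (0.01/π) u_xx from Appendix A is used as a stand-in for a generic N.

```python
import math
import torch

# Fully connected approximation u_theta(t, x) with hyperbolic tangent activations.
net = torch.nn.Sequential(
    torch.nn.Linear(2, 20), torch.nn.Tanh(),
    torch.nn.Linear(20, 20), torch.nn.Tanh(),
    torch.nn.Linear(20, 1),
)

def f_residual(t, x):
    """Physics-informed network f(t, x) = u_t + N[u], obtained by automatic differentiation."""
    t, x = t.requires_grad_(True), x.requires_grad_(True)
    u = net(torch.cat([t, x], dim=1))
    grad = lambda y, z: torch.autograd.grad(y, z, torch.ones_like(y), create_graph=True)[0]
    u_t, u_x = grad(u, t), grad(u, x)
    u_xx = grad(u_x, x)
    return u_t + u * u_x - (0.01 / math.pi) * u_xx   # Burgers' operator as an example of N[u]

def loss(t_u, x_u, u_data, t_f, x_f):
    """MSE = MSE_u (data misfit at the N_u training points) + MSE_f (residual at the N_f collocation points)."""
    mse_u = torch.mean((net(torch.cat([t_u, x_u], dim=1)) - u_data) ** 2)
    mse_f = torch.mean(f_residual(t_f, x_f) ** 2)
    return mse_u + mse_f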
employ machine learning algorithms like support vector machines, random forests, Gaussian processes, and feed-forward/convolutional/recurrent neural networks merely as black-box tools. As described above, the proposed work aims to go one step further by revisiting the construction of “custom” activation and loss functions that are tailored to the underlying differential operator. This allows us to open the black box by understanding and appreciating the key role played by automatic differentiation within the deep learning field. Automatic differentiation in general, and the back-propagation algorithm in particular, is currently the dominant approach for training deep models by taking their derivatives with respect to the parameters (e.g., weights and biases) of the models. Here, we use the exact same automatic differentiation techniques, employed by the deep learning community, to physics-inform neural networks by taking their derivatives with respect to their input coordinates (i.e., space and time) where the physics is described by partial differential equations. We have empirically observed that this structured approach introduces a regularization mechanism that allows us to use relatively simple feed-forward neural network architectures and train them with small amounts of data. The effectiveness of this simple idea may be related to the remarks put forth by Lin, Tegmark and Rolnick [30] and raises many interesting questions to be quantitatively addressed in future research. To this end, the proposed work draws inspiration from the early contributions of Psichogios and Ungar [16], Lagaris et al. [15], as well as the contemporary works of Kondor [31, 32], Hirn [33], and Mallat [34].
equation (4), and defines an open question for research that is in sync with recent theoretical developments in deep learning [38, 39]. To this end, we will test the robustness of the proposed methodology using a series of systematic sensitivity studies that are provided in Appendix A and Appendix B.
MSE_b = \frac{1}{N_b} \sum_{i=1}^{N_b} \left( |h^i(t_b^i, -5) - h^i(t_b^i, 5)|^2 + |h_x^i(t_b^i, -5) - h_x^i(t_b^i, 5)|^2 \right),

and

MSE_f = \frac{1}{N_f} \sum_{i=1}^{N_f} |f(t_f^i, x_f^i)|^2.
Here, \{x_0^i, h_0^i\}_{i=1}^{N_0} denotes the initial data, \{t_b^i\}_{i=1}^{N_b} corresponds to the collocation points on the boundary, and \{t_f^i, x_f^i\}_{i=1}^{N_f} represents the collocation points on f(t, x). Consequently, MSE_0 corresponds to the loss on the initial data, MSE_b enforces the periodic boundary conditions, and MSE_f penalizes the Schrödinger equation not being satisfied on the collocation points.
In order to assess the accuracy of our method, we have simulated equation (5) using conventional spectral methods to create a high-resolution data set. Specifically, starting from an initial state h(0, x) = 2 sech(x) and assuming periodic boundary conditions h(t, -5) = h(t, 5) and h_x(t, -5) = h_x(t, 5), we have integrated equation (5) up to a final time t = π/2 using the Chebfun package [40] with a spectral Fourier discretization with 256 modes and a fourth-order explicit Runge-Kutta temporal integrator with time-step Δt = (π/2) · 10^{-6}. Under our data-driven setting, all we observe are measurements \{x_0^i, h_0^i\}_{i=1}^{N_0} of the latent function h(t, x) at time t = 0. In particular, the training set consists of a total of N_0 = 50 data points on h(0, x) randomly parsed from the full high-resolution data set, as well as N_b = 50 randomly sampled collocation points \{t_b^i\}_{i=1}^{N_b} for enforcing the periodic boundaries. Moreover, we have assumed N_f = 20,000 randomly sampled collocation points used to enforce equation (5) inside the solution domain. All randomly sampled point locations were generated using a space-filling Latin Hypercube Sampling strategy [41].
Here our goal is to infer the entire spatio-temporal solution h(t, x) of the Schrödinger equation (5). We chose to jointly represent the latent function h(t, x) = [u(t, x), v(t, x)] using a 5-layer deep neural network with 100 neurons per layer and a hyperbolic tangent activation function. In general, the neural network should be given sufficient approximation capacity in order to accommodate the anticipated complexity of u(t, x). Although more systematic procedures such as Bayesian optimization [42] can be employed in order to fine-tune the design of the neural network, in the absence of theoretical error/convergence estimates, the interplay between the neural architecture/training procedure and the complexity of the underlying differential equation is still poorly understood. One viable path towards assessing the accuracy of the predicted solution could come by adopting a Bayesian approach and monitoring the variance of the predictive posterior distribution, but this goes beyond the scope of the present work and will be investigated in future studies.
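As an illustration of how the joint representation h = [u, v] enters the physics-informed network, here is a minimal sketch of the residual computation. Because equation (5) itself is not reproduced above, the cubic nonlinear Schrödinger form i h_t + 0.5 h_xx + |h|^2 h = 0 is an assumption here, and the depth and width of the network are illustrative.

```python
import torch

# Network with two outputs [u, v] representing h = u + i v; sizes are illustrative.
layers = [torch.nn.Linear(2, 100), torch.nn.Tanh()]
for _ in range(4):
    layers += [torch.nn.Linear(100, 100), torch.nn.Tanh()]
layers += [torch.nn.Linear(100, 2)]
net = torch.nn.Sequential(*layers)

def grad(y, z):
    return torch.autograd.grad(y, z, torch.ones_like(y), create_graph=True)[0]

def schrodinger_residual(t, x):
    """Real and imaginary parts of the assumed residual i h_t + 0.5 h_xx + |h|^2 h."""
    t, x = t.requires_grad_(True), x.requires_grad_(True)
    uv = net(torch.cat([t, x], dim=1))
    u, v = uv[:, :1], uv[:, 1:]
    u_t, v_t = grad(u, t), grad(v, t)
    u_xx, v_xx = grad(grad(u, x), x), grad(grad(v, x), x)
    mag2 = u ** 2 + v ** 2
    f_u = -v_t + 0.5 * u_xx + mag2 * u    # real part of the residual
    f_v = u_t + 0.5 * v_xx + mag2 * v     # imaginary part of the residual
    return f_u, f_v
```

The mean squared residual of f_u and f_v at the collocation points then plays the role of MSE_f in the composite loss.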
In this example, our setup aims to highlight the robustness of the proposed method with respect to the well-known issue of over-fitting. Specifically, the term MSE_f in equation (6) acts as a regularization mechanism that penalizes solutions that do not satisfy equation (5). Therefore, a key property of physics-informed neural networks is that they can be effectively trained using small data sets; a setting often encountered in the study of physical systems for which the cost of data acquisition may be prohibitive.
Figure 1 summarizes the results of our experiment. Specifically, the top panel of figure 1 shows the magnitude of the predicted spatio-temporal solution |h(t, x)| = \sqrt{u^2(t, x) + v^2(t, x)}, along with the locations of the initial and boundary training data. The resulting prediction error is validated against the test data for this problem, and is measured at 1.97 · 10^{-3} in the relative L^2-norm. A more detailed assessment of the predicted solution is presented in the bottom panel of Figure 1. In particular, we present a comparison between the exact and the predicted solutions at different time instants t = 0.59, 0.79, 0.98. Using only a handful of initial data, the physics-informed neural network can accurately capture the intricate nonlinear behavior of the Schrödinger equation.
One potential limitation of the continuous time neural network models considered so far stems from the need to use a large number of collocation points N_f in order to enforce physics-informed constraints in the entire spatio-temporal domain. Although this poses no significant issues for problems in one or two spatial dimensions, it may introduce a severe bottleneck in higher-dimensional problems, as the total number of collocation points needed to globally enforce a physics-informed constraint (i.e., in our case a partial differential equation) will increase exponentially. Although this limitation could be addressed to some extent using sparse grid or quasi-Monte Carlo sampling schemes [43, 44], in the next section, we put forth a different approach that circumvents the need for collocation points by in-
Figure 1: Schrödinger equation: Top: Predicted solution |h(t, x)| along with the initial and boundary training data. In addition we are using 20,000 collocation points generated using a Latin Hypercube Sampling strategy. Bottom: Comparison of the predicted and exact solutions corresponding to the three temporal snapshots depicted by the dashed vertical lines in the top panel. The relative L^2 error for this case is 1.97 · 10^{-3}.
Here, u^{n+c_j}(x) = u(t^n + c_j \Delta t, x) for j = 1, \ldots, q. This general form encapsulates both implicit and explicit time-stepping schemes, depending on the choice of the parameters \{a_{ij}, b_j, c_j\}. Equations (7) can be equivalently expressed as

u^n = u^n_i, \quad i = 1, \ldots, q,
u^n = u^n_{q+1},    (8)

where

u^n_i := u^{n+c_i} + \Delta t \sum_{j=1}^{q} a_{ij} \mathcal{N}[u^{n+c_j}], \quad i = 1, \ldots, q,
u^n_{q+1} := u^{n+1} + \Delta t \sum_{j=1}^{q} b_j \mathcal{N}[u^{n+c_j}].    (9)

We proceed by placing a multi-output neural network prior on

[u^{n+c_1}(x), \ldots, u^{n+c_q}(x), u^{n+1}(x)].    (10)

This prior assumption along with equations (9) results in a physics-informed neural network that takes x as an input and outputs

[u^n_1(x), \ldots, u^n_q(x), u^n_{q+1}(x)].    (11)
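A minimal sketch of this discrete time construction is given below. The Runge-Kutta coefficients a_ij and b_j are left as placeholders to be filled with tabulated Gauss-Legendre values, the layer sizes mirror the Allen-Cahn example of this section, and N_of_u stands for a hypothetical user-supplied routine that evaluates N[u^{n+c_j}] column-wise; none of this is the released implementation.

```python
import torch

q, dt = 100, 0.8                       # number of stages and time-step (values from this example)
A = torch.zeros(q, q)                  # placeholder for the IRK coefficients a_ij
b = torch.zeros(q)                     # placeholder for the IRK coefficients b_j

# Multi-output network: [u^{n+c_1}(x), ..., u^{n+c_q}(x), u^{n+1}(x)], cf. equation (10).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 200), torch.nn.Tanh(),
    torch.nn.Linear(200, 200), torch.nn.Tanh(),
    torch.nn.Linear(200, q + 1),
)

def stage_predictions(x, N_of_u):
    """Map the network outputs to [u^n_1(x), ..., u^n_{q+1}(x)] via equations (9)."""
    x = x.requires_grad_(True)
    U = net(x)                                  # shape (N, q+1)
    U_stages, U_next = U[:, :q], U[:, q:]
    NU = N_of_u(U_stages, x)                    # N[u^{n+c_j}] evaluated column-wise, shape (N, q)
    u_n = U_stages + dt * NU @ A.T              # u^n_i     = u^{n+c_i} + dt * sum_j a_ij N[u^{n+c_j}]
    u_n_q1 = U_next + dt * NU @ b[:, None]      # u^n_{q+1} = u^{n+1}   + dt * sum_j b_j  N[u^{n+c_j}]
    return torch.cat([u_n, u_n_q1], dim=1)
```

The sum of squared errors loss then compares every column of stage_predictions against the same snapshot data u^{n,i}, which is what makes a single very large time step possible.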
The Allen-Cahn equation is a well-known equation from the area of reaction-diffusion systems. It describes the process of phase separation in multi-component alloy systems, including order-disorder transitions. For the Allen-Cahn equation, the nonlinear operator in equation (9) is given by

\mathcal{N}[u^{n+c_j}] = -0.0001 u^{n+c_j}_{xx} + 5 (u^{n+c_j})^3 - 5 u^{n+c_j},

and the shared parameters of the neural networks (10) and (11) can be learned by minimizing the sum of squared errors

SSE = SSE_n + SSE_b,

where

SSE_n = \sum_{j=1}^{q+1} \sum_{i=1}^{N_n} |u^n_j(x^{n,i}) - u^{n,i}|^2,

and

SSE_b = \sum_{i=1}^{q} |u^{n+c_i}(-1) - u^{n+c_i}(1)|^2 + |u^{n+1}(-1) - u^{n+1}(1)|^2 + \sum_{i=1}^{q} |u^{n+c_i}_x(-1) - u^{n+c_i}_x(1)|^2 + |u^{n+1}_x(-1) - u^{n+1}_x(1)|^2.
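For completeness, a sketch of the Allen-Cahn operator evaluated column-wise on the stage outputs (to be passed as N_of_u to the stage_predictions helper sketched above) could look as follows; the loop over stages is written for clarity rather than speed, and the sketch continues the assumptions of the previous code block.

```python
def allen_cahn_N(U_stages, x):
    """N[u^{n+c_j}] = -0.0001 u_xx + 5 u^3 - 5 u, applied to each of the q stage outputs."""
    cols = []
    for j in range(U_stages.shape[1]):
        u = U_stages[:, j:j + 1]
        u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
        u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
        cols.append(-0.0001 * u_xx + 5.0 * u ** 3 - 5.0 * u)
    return torch.cat(cols, dim=1)
```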
In classical numerical analysis, these time-steps are usually confined to be small due to stability constraints for explicit schemes or computational complexity constraints for implicit formulations [45]. These constraints become more severe as the total number of Runge-Kutta stages q is increased, and, for most problems of practical interest, one needs to take thousands to millions of such steps until the solution is resolved up to a desired final time. In sharp contrast to classical methods, here we can employ implicit Runge-Kutta schemes with an arbitrarily large number of stages at effectively very little extra cost.¹ This enables us to take very large time steps while retaining stability and high predictive accuracy, therefore allowing us to resolve the entire spatio-temporal solution in a single step.

¹To be precise, it is only the number of parameters in the last layer of the neural network that increases linearly with the total number of stages.
In this example, we have generated a training and test data set by simulating the Allen-Cahn equation (12) using conventional spectral methods. Specifically, starting from an initial condition u(0, x) = x^2 \cos(\pi x) and assuming periodic boundary conditions u(t, -1) = u(t, 1) and u_x(t, -1) = u_x(t, 1), we have integrated equation (12) up to a final time t = 1.0 using the Chebfun package [40] with a spectral Fourier discretization with 512 modes and a fourth-order explicit Runge-Kutta temporal integrator with time-step \Delta t = 10^{-5}.
Our training data set consists of N_n = 200 initial data points that are randomly sub-sampled from the exact solution at time t = 0.1, and our goal is to predict the solution at time t = 0.9 using a single time-step of size \Delta t = 0.8. To this end, we employ a discrete time physics-informed neural network with 4 hidden layers and 200 neurons per layer, while the output layer predicts 101 quantities of interest corresponding to the q = 100 Runge-Kutta stages u^{n+c_i}(x), i = 1, \ldots, q, and the solution at final time u^{n+1}(x). The theoretical error estimates for this scheme predict a temporal error accumulation of O(\Delta t^{2q}) [45], which in our case translates into an error way below machine precision, i.e., \Delta t^{2q} = 0.8^{200} \approx 10^{-20}. To our knowledge, this is the first time that an implicit Runge-Kutta scheme of such high order has ever been used. Remarkably, starting from smooth initial data at t = 0.1 we can predict the nearly discontinuous solution at t = 0.9 in a single time-step with a relative L^2 error of 6.99 · 10^{-3}, as illustrated in Figure 2. This error is entirely attributed to the neural network's capacity to approximate u(t, x), as well as to the degree that the sum of squared errors loss allows interpolation of the training data.
The key parameters controlling the performance of our discrete time algorithm are the total number of Runge-Kutta stages q and the time-step size \Delta t. As we demonstrate in the systematic studies provided in Appendix A and Appendix B, low-order methods, such as the case q = 1 corresponding to the classical trapezoidal rule, and the case q = 2 corresponding to the fourth-order Gauss-Legendre method, cannot retain their predictive accuracy for large time-steps, thus mandating a solution strategy with multiple time-steps of small size. On the other hand, the ability to push the number of Runge-Kutta stages to 32 and even higher allows us to take very large time steps, and effectively resolve the solution in a single step without sacrificing the accuracy of our predictions. Moreover, numerical stability is not sacrificed either, as implicit Gauss-Legendre is the only family of time-stepping schemes that remain A-stable regardless of their order, thus making them ideal for stiff problems [45]. These properties are unprecedented for an algorithm of such implementation simplicity, and illustrate one of the key highlights of our discrete time approach.
Figure 2: Allen-Cahn equation: Top: Solution u(t, x) along with the location of the initial training snapshot at t = 0.1 and the final prediction snapshot at t = 0.9. Bottom: Initial training data and final prediction at the snapshots depicted by the white vertical lines in the top panel. The relative L^2 error for this case is 6.99 · 10^{-3}.
and discrete time models, and highlight their properties and performance through the lens of various canonical problems.
work f(t, x). This network can be derived by applying the chain rule for differentiating compositions of functions using automatic differentiation [12]. It is worth highlighting that the parameters of the differential operator λ turn into parameters of the physics-informed neural network f(t, x).
where u(t, x, y) denotes the x-component of the velocity field, v(t, x, y) the y-component, and p(t, x, y) the pressure. Here, λ = (λ_1, λ_2) are the unknown parameters. Solutions to the Navier-Stokes equations are sought in the set of divergence-free functions; i.e.,

u_x + v_y = 0.    (16)

This extra equation is the continuity equation for incompressible fluids, which describes the conservation of mass of the fluid. We make the assumption that

u = \psi_y, \quad v = -\psi_x,    (17)

for some latent function \psi(t, x, y).³ Under this assumption, the continuity equation (16) will be automatically satisfied. Given noisy measurements

\{t^i, x^i, y^i, u^i, v^i\}_{i=1}^{N}
²It is straightforward to generalize the proposed framework to the Navier-Stokes equations in three dimensions (3D).
³This construction can be generalized to three-dimensional problems by employing the notion of vector potentials.
of the velocity field, we are interested in learning the parameters λ as well as the pressure p(t, x, y). We define f(t, x, y) and g(t, x, y) to be given by
Here we consider the prototype problem of incompressible flow past a circular cylinder, a problem known to exhibit rich dynamic behavior and transitions for different regimes of the Reynolds number Re = u_∞ D/ν. Assuming a non-dimensional free-stream velocity u_∞ = 1, cylinder diameter D = 1, and kinematic viscosity ν = 0.01, the system exhibits a periodic steady-state behavior characterized by an asymmetric vortex shedding pattern in the cylinder wake, known as the Kármán vortex street [46].
To generate a high-resolution data set for this problem we have employed the spectral/hp-element solver NekTar [47]. Specifically, the solution domain is discretized in space by a tessellation consisting of 412 triangular elements, and within each element the solution is approximated as a linear combination of a tenth-order hierarchical, semi-orthogonal Jacobi polynomial expansion [47]. We have assumed a uniform free-stream velocity profile imposed at the left boundary, a zero-pressure outflow condition imposed at the right boundary located 25 diameters downstream of the cylinder, and periodicity for the top and bottom boundaries of the [−15, 25] × [−8, 8] domain. We integrate equation (15) using a third-order stiffly stable scheme [47] until the system reaches a periodic steady state, as depicted in figure 3(a). In what follows, a small portion of the resulting data set corresponding to this steady-state solution will be used for model training, while the remaining data will be used to validate our predictions. For simplicity, we have chosen to confine our sampling to a rectangular region downstream of the cylinder, as shown in figure 3(a).
Given scattered and potentially noisy data on the stream-wise u(t, x, y) and transverse v(t, x, y) velocity components, our goal is to identify the unknown parameters λ_1 and λ_2, as well as to obtain a qualitatively accurate reconstruction of the entire pressure field p(t, x, y) in the cylinder wake, which by definition can only be identified up to a constant. To this end, we have created a training data set by randomly sub-sampling the full high-resolution data set. To highlight the ability of our method to learn from scattered and scarce training data, we have chosen N = 5,000, corresponding to a mere 1% of the total available data, as illustrated in figure 3(b). Also plotted are representative snapshots of the predicted velocity components u(t, x, y) and v(t, x, y) after the model was trained. The neural network architecture used here consists of 9 layers with 20 neurons in each layer.
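Because equations (15) and (18) are not reproduced above, the sketch below assumes momentum residuals of the standard parameterized form f = u_t + λ_1(u u_x + v u_y) + p_x − λ_2(u_xx + u_yy) and g = v_t + λ_1(u v_x + v v_y) + p_y − λ_2(v_xx + v_yy). What it is meant to illustrate is only the stream-function construction of equation (17), in which u and v are derived from a single latent ψ so that continuity holds exactly, and the fact that λ_1, λ_2 are trained jointly with the network weights.

```python
import torch

hidden = []
for _ in range(7):
    hidden += [torch.nn.Linear(20, 20), torch.nn.Tanh()]
# 9 layers of 20 neurons; outputs [psi, p].
net = torch.nn.Sequential(torch.nn.Linear(3, 20), torch.nn.Tanh(), *hidden, torch.nn.Linear(20, 2))
lam = torch.nn.Parameter(torch.zeros(2))   # unknown (lambda_1, lambda_2), optimized jointly with net.parameters()

def grad(y, z):
    return torch.autograd.grad(y, z, torch.ones_like(y), create_graph=True)[0]

def ns_residuals(t, x, y):
    """Velocities from the stream function and the assumed momentum residuals f, g."""
    t, x, y = t.requires_grad_(True), x.requires_grad_(True), y.requires_grad_(True)
    psi_p = net(torch.cat([t, x, y], dim=1))
    psi, p = psi_p[:, :1], psi_p[:, 1:]
    u, v = grad(psi, y), -grad(psi, x)      # u = psi_y, v = -psi_x, so u_x + v_y = 0 by construction
    f = (grad(u, t) + lam[0] * (u * grad(u, x) + v * grad(u, y))
         + grad(p, x) - lam[1] * (grad(grad(u, x), x) + grad(grad(u, y), y)))
    g = (grad(v, t) + lam[0] * (u * grad(v, x) + v * grad(v, y))
         + grad(p, y) - lam[1] * (grad(grad(v, x), x) + grad(grad(v, y), y)))
    return u, v, p, f, g
```

The training loss is then the data misfit of u and v at the measurement points plus the mean squared residuals of f and g; the pressure p is never supervised, which is why it is only recovered up to a constant.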
A more intriguing result stems from the network's ability to provide a qualitatively accurate prediction of the entire pressure field p(t, x, y) in the absence of any training data on the pressure itself. A visual comparison against the exact pressure solution is presented in figure 4 for a representative pressure snapshot. Notice that the difference in magnitude between the exact and the predicted pressure is justified by the very nature of the incompressible Navier-Stokes system, as the pressure field is only identifiable up to a constant. This result of inferring a continuous quantity of interest from auxiliary measurements by leveraging the underlying physics is a great example of the enhanced capabilities that physics-informed neural networks have to offer, and highlights their potential in solving high-dimensional in-
Figure 3: Navier-Stokes equation: Top: Incompressible flow and dynamic vortex shedding past a circular cylinder at Re = 100. The spatio-temporal training data correspond to the depicted rectangular region in the cylinder wake. Bottom: Locations of training data points for the stream-wise and transverse velocity components, u(t, x, y) and v(t, x, y), respectively.
Our approach so far assumes availability of scattered data throughout the entire spatio-temporal domain. However, in many cases of practical interest, one may only be able to observe the system at distinct time instants. In the next section, we introduce a different approach that tackles the data-driven discovery problem using only two data snapshots. We will see how, by leveraging the classical Runge-Kutta time-stepping schemes, one can construct discrete time physics-informed neural networks that can retain high predictive accuracy even when the temporal gap between the data snapshots is very large.
[Figure 4: Predicted pressure and exact pressure snapshots.]
Here, u^{n+c_j}(x) = u(t^n + c_j \Delta t, x) is the hidden state of the system at time t^n + c_j \Delta t for j = 1, \ldots, q. This general form encapsulates both implicit and explicit time-stepping schemes, depending on the choice of the parameters \{a_{ij}, b_j, c_j\}. Equations (20) can be equivalently expressed as

u^n = u^n_i, \quad i = 1, \ldots, q,
u^{n+1} = u^{n+1}_i, \quad i = 1, \ldots, q.    (21)
where

u^n_i := u^{n+c_i} + \Delta t \sum_{j=1}^{q} a_{ij} \mathcal{N}[u^{n+c_j}; λ], \quad i = 1, \ldots, q,
u^{n+1}_i := u^{n+c_i} + \Delta t \sum_{j=1}^{q} (a_{ij} - b_j) \mathcal{N}[u^{n+c_j}; λ], \quad i = 1, \ldots, q.    (22)

This prior assumption along with equations (22) results in two physics-informed neural networks

[u^n_1(x), \ldots, u^n_q(x), u^n_{q+1}(x)],    (24)

and

[u^{n+1}_1(x), \ldots, u^{n+1}_q(x), u^{n+1}_{q+1}(x)].    (25)
Given noisy measurements at two distinct temporal snapshots \{x^n, u^n\} and \{x^{n+1}, u^{n+1}\} of the system at times t^n and t^{n+1}, respectively, the shared parameters of the neural networks (23), (24), and (25), along with the parameters λ of the differential operator, can be trained by minimizing the sum of squared errors

SSE = SSE_n + SSE_{n+1},    (26)

where

SSE_n := \sum_{j=1}^{q} \sum_{i=1}^{N_n} |u^n_j(x^{n,i}) - u^{n,i}|^2,

and

SSE_{n+1} := \sum_{j=1}^{q} \sum_{i=1}^{N_{n+1}} |u^{n+1}_j(x^{n+1,i}) - u^{n+1,i}|^2.

Here, x^n = \{x^{n,i}\}_{i=1}^{N_n}, u^n = \{u^{n,i}\}_{i=1}^{N_n}, x^{n+1} = \{x^{n+1,i}\}_{i=1}^{N_{n+1}}, and u^{n+1} = \{u^{n+1,i}\}_{i=1}^{N_{n+1}}.
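A compact sketch of this two-snapshot construction is given below: a single network outputs the q stage values u^{n+c_j}(x), equation (22) maps them to the two sets of snapshot predictions, and the operator parameters λ are ordinary trainable variables. The Runge-Kutta coefficients are again placeholders to be filled with tabulated Gauss-Legendre values, the value of q is arbitrary here, and N_of_u is a hypothetical user-supplied routine (see the KdV sketch in the next section).

```python
import torch

q, dt = 50, 0.6                        # q is illustrative; dt = 0.6 matches the KdV example below
A = torch.zeros(q, q)                  # placeholder a_ij
b = torch.zeros(q)                     # placeholder b_j

net = torch.nn.Sequential(             # outputs [u^{n+c_1}(x), ..., u^{n+c_q}(x)]
    torch.nn.Linear(1, 50), torch.nn.Tanh(),
    torch.nn.Linear(50, 50), torch.nn.Tanh(),
    torch.nn.Linear(50, q),
)
lam = torch.nn.Parameter(torch.zeros(2))   # unknown parameters (lambda_1, lambda_2), trained jointly

def snapshot_predictions(x, N_of_u):
    """Equation (22): stage values -> predictions compared against the data at t^n and t^{n+1}."""
    x = x.requires_grad_(True)
    U = net(x)                              # (N, q) stage values u^{n+c_j}(x)
    NU = N_of_u(U, x, lam)                  # N[u^{n+c_j}; lambda], shape (N, q)
    u_n = U + dt * NU @ A.T                 # u^n_i     = u^{n+c_i} + dt * sum_j a_ij       N[...]
    u_np1 = U + dt * NU @ (A - b).T         # u^{n+1}_i = u^{n+c_i} + dt * sum_j (a_ij-b_j) N[...]
    return u_n, u_np1
```

The sum of squared errors (26) is then the misfit of u_n against the t^n snapshot plus the misfit of u_np1 against the t^{n+1} snapshot, minimized jointly over the network weights and λ.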
equation has several connections to physical problems. It describes the evolution of long one-dimensional waves in many physical settings. Such physical settings include shallow-water waves with weakly non-linear restoring forces, long internal waves in a density-stratified ocean, ion acoustic waves in a plasma, and acoustic waves on a crystal lattice. Moreover, the KdV equation is the governing equation of the string in the Fermi-Pasta-Ulam problem [48] in the continuum limit. The KdV equation reads as

with (λ_1, λ_2) being the unknown parameters. For the KdV equation, the nonlinear operator in equations (22) is given by

and the shared parameters of the neural networks (23), (24), and (25) along with the parameters λ = (λ_1, λ_2) of the KdV equation can be learned by minimizing the sum of squared errors (26).
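Because the explicit forms of equation (27) and of the corresponding operator are not reproduced above, the sketch below assumes the standard KdV form u_t + λ_1 u u_x + λ_2 u_xxx = 0, so that N[u; λ] = λ_1 u u_x + λ_2 u_xxx. It is written as the N_of_u routine expected by the two-snapshot sketch of the previous section.

```python
def kdv_N(U, x, lam):
    """Assumed KdV operator N[u; lambda] = lambda_1 u u_x + lambda_2 u_xxx, applied per stage."""
    cols = []
    for j in range(U.shape[1]):
        u = U[:, j:j + 1]
        u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
        u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
        u_xxx = torch.autograd.grad(u_xx, x, torch.ones_like(u_xx), create_graph=True)[0]
        cols.append(lam[0] * u * u_x + lam[1] * u_xxx)
    return torch.cat(cols, dim=1)
```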
To obtain a set of training and test data we simulated (27) using conventional spectral methods. Specifically, starting from an initial condition u(0, x) = \cos(\pi x) and assuming periodic boundary conditions, we have integrated equation (27) up to a final time t = 1.0 using the Chebfun package [40] with a spectral Fourier discretization with 512 modes and a fourth-order explicit Runge-Kutta temporal integrator with time-step \Delta t = 10^{-6}. Using this data set, we then extract two solution snapshots at times t^n = 0.2 and t^{n+1} = 0.8, and randomly sub-sample them using N_n = 199 and N_{n+1} = 201 points to generate a training data set. We then use these data to train a discrete time physics-informed neural network by minimizing the sum of squared errors loss of equation (26) using L-BFGS [35]. The network architecture used here comprises 4 hidden layers, 50 neurons per layer, and an output layer predicting the solution at the q Runge-Kutta stages, i.e., u^{n+c_j}(x), j = 1, \ldots, q, where q is empirically chosen to yield a temporal error accumulation of the order of machine precision by setting⁴

where the time-step for this example is \Delta t = 0.6.

⁴This is motivated by the theoretical error estimates for implicit Runge-Kutta schemes suggesting a truncation error of O(\Delta t^{2q}) [45].
The results of this experiment are summarized in figure 5. In the top panel, we present the exact solution u(t, x), along with the locations of the two data snapshots used for training. A more detailed overview of the exact solution and the training data is given in the middle panel. It is worth noticing how the complex nonlinear dynamics of equation (27) causes dramatic differences in the form of the solution between the two reported snapshots. Despite these differences, and the large temporal gap between the two training snapshots, our method is able to correctly identify the unknown parameters regardless of whether the training data is corrupted with noise or not. Specifically, for the case of noise-free training data, the errors in estimating λ_1 and λ_2 are 0.023% and 0.006%, respectively, while the case with 1% noise in the training data returns errors of 0.057% and 0.017%, respectively.
5. Conclusions
We have introduced physics-informed neural networks, a new class of universal function approximators that are capable of encoding any underlying physical laws that govern a given data set and can be described by partial differential equations. In this work, we design data-driven algorithms for inferring solutions to general nonlinear partial differential equations, and constructing computationally efficient physics-informed surrogate models. The resulting methods showcase a series of promising results for a diverse collection of problems in computational science, and open the path for endowing deep learning with the powerful capacity of mathematical physics to model the world around us. As deep learning technology continues to grow rapidly both in terms of methodological and algorithmic developments, we believe that this is a timely contribution that can benefit practitioners across a wide range of scientific domains. Specific applications that can readily enjoy these benefits include, but are not limited to, data-driven forecasting of physical processes, model predictive control, and multi-physics/multi-scale modeling and simulation.
We must note, however, that the proposed methods should not be viewed as replacements of classical numerical methods for solving partial differential equations (e.g., finite elements, spectral methods, etc.). Such methods have matured over the last 50 years and, in many cases, meet the robustness and
Figure 5: KdV equation: Top: Solution u(t, x) along with the temporal locations of the two training snapshots. Middle: Training data and exact solution corresponding to the two temporal snapshots depicted by the dashed vertical lines in the top panel. Bottom: Correct partial differential equation along with the identified one obtained by learning λ_1, λ_2.
algorithms. Moreover, the implementation simplicity of the latter greatly favors rapid development and testing of new ideas, potentially opening the path for a new era in data-driven scientific computing.
Although a series of promising results was presented, the reader may perhaps agree that this work creates more questions than it answers. How deep/wide should the neural network be? How much data is really needed? Why does the algorithm converge to unique values for the parameters of the differential operators, i.e., why is the algorithm not suffering from local optima for the parameters of the differential operator? Does the network suffer from vanishing gradients for deeper architectures and higher-order differential operators? Could this be mitigated by using different activation functions? Can we improve on initializing the network weights or normalizing the data? Are the mean squared error and the sum of squared errors the appropriate loss functions? Why are these methods seemingly so robust to noise in the data? How can we quantify the uncertainty associated with our predictions? Throughout this work, we have attempted to answer some of these questions, but we have observed that specific settings that yielded impressive results for one equation could fail for another. Admittedly, more work is needed collectively to set the foundations in this field.
In a broader context, and along the way of seeking answers to those questions, we believe that this work advocates a fruitful synergy between machine learning and classical computational physics that has the potential to enrich both fields and lead to high-impact developments.
Acknowledgements

This work received support from the DARPA EQUiPS grant N66001-15-2-4055 and the AFOSR grant FA9550-17-1-0013.
Appendix A.1. Continuous Time Models

In one space dimension, the Burgers' equation along with Dirichlet boundary conditions reads as

f := u_t + u u_x - (0.01/\pi) u_{xx},
where

MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} |u(t_u^i, x_u^i) - u^i|^2,

and

MSE_f = \frac{1}{N_f} \sum_{i=1}^{N_f} |f(t_f^i, x_f^i)|^2.
Here, \{t_u^i, x_u^i, u^i\}_{i=1}^{N_u} denote the initial and boundary training data on u(t, x) and \{t_f^i, x_f^i\}_{i=1}^{N_f} specify the collocation points for f(t, x). The loss MSE_u corresponds to the initial and boundary data while MSE_f enforces the structure imposed by equation (A.1) at a finite set of collocation points. Although similar ideas for constraining neural networks using physical laws have been explored in previous studies [15, 16], here we revisit them using modern computational tools, and apply them to more challenging dynamic problems described by time-dependent nonlinear partial differential equations.
In all benchmarks considered in this work, the total number of training data N_u is relatively small (a few hundred up to a few thousand points), and we chose to optimize all loss functions using L-BFGS, a quasi-Newton, full-batch gradient-based optimization algorithm [35]. For larger data sets a more computationally efficient mini-batch setting can be readily employed using stochastic gradient descent and its modern variants [36, 37]. Despite the fact that there is no theoretical guarantee that this procedure converges to a global minimum, our empirical evidence indicates that, if the given partial differential equation is well-posed and its solution is unique, our method is capable of achieving good prediction accuracy given a sufficiently expressive neural network architecture and a sufficient number of collocation points N_f. This general observation deeply relates to the resulting optimization landscape induced by the mean squared error loss of equation (4), and defines an open question for research that is in sync with recent theoretical developments in deep learning [38, 39]. Here, we will test the robustness of the proposed methodology using a series of systematic sensitivity studies that accompany the numerical results presented in the following.
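As a usage note, the full-batch L-BFGS optimization described above can be reproduced with a standard closure-based optimizer. The sketch below continues the continuous time sketch from Section 3.1 (the tensors t_u, x_u, u_data, t_f, x_f are assumed to hold the N_u training points and N_f collocation points), and the optimizer settings are illustrative rather than the ones used for the reported results.

```python
optimizer = torch.optim.LBFGS(net.parameters(), max_iter=50000,
                              tolerance_grad=1e-9, history_size=50,
                              line_search_fn="strong_wolfe")

def closure():
    # L-BFGS re-evaluates the full-batch loss and its gradient at each trial point.
    optimizer.zero_grad()
    value = loss(t_u, x_u, u_data, t_f, x_f)   # MSE_u + MSE_f from the earlier sketch
    value.backward()
    return value

optimizer.step(closure)
```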
Figure A.6 summarizes our results for the data-driven solution of the Burgers' equation. Specifically, given a set of N_u = 100 randomly distributed initial and boundary data, we learn the latent solution u(t, x) by training all 3021 parameters of a 9-layer deep neural network using the mean squared error loss of (A.2). Each hidden layer contained 20 neurons and a hyperbolic tangent activation function. The top panel of Figure A.6 shows the predicted spatio-temporal solution u(t, x), along with the locations of the initial and boundary training data. We must underline that, unlike any classical numerical method for solving partial differential equations, this prediction is obtained without any sort of discretization of the spatio-temporal domain. The exact solution for this problem is analytically available [13], and the resulting prediction error is measured at 6.7 · 10^{-4} in the relative L^2-norm. Note that this error is about two orders of magnitude lower than the one reported in our previous work on data-driven solution of partial differential equations using Gaussian processes [8]. A more detailed assessment of the predicted solution is presented in the bottom panel of figure A.6. In particular, we present a comparison between the exact and the predicted solutions at different time instants t = 0.25, 0.50, 0.75. Using only a handful of initial and boundary data, the physics-informed neural network can accurately capture the intricate nonlinear behavior of the Burgers' equation that leads to the development of a sharp internal layer around t = 0.4. The latter is notoriously hard to accurately resolve with classical numerical methods and requires a laborious spatio-temporal discretization of equation (A.1).
Figure A.6: Burgers' equation: Top: Predicted solution u(t, x) along with the initial and boundary training data. In addition we are using 10,000 collocation points generated using a Latin Hypercube Sampling strategy. Bottom: Comparison of the predicted and exact solutions corresponding to the three temporal snapshots depicted by the white vertical lines in the top panel. The relative L^2 error for this case is 6.7 · 10^{-4}. Model training took approximately 60 seconds on a single NVIDIA Titan X GPU card.
  N_u \ N_f   2000      4000      6000      7000      8000      10000
  20          2.9e-01   4.4e-01   8.9e-01   1.2e+00   9.9e-02   4.2e-02
  40          6.5e-02   1.1e-02   5.0e-01   9.6e-03   4.6e-01   7.5e-02
  60          3.6e-01   1.2e-02   1.7e-01   5.9e-03   1.9e-03   8.2e-03
  80          5.5e-03   1.0e-03   3.2e-03   7.8e-03   4.9e-02   4.5e-03
  100         6.6e-02   2.7e-01   7.2e-03   6.8e-04   2.2e-03   6.7e-04
  200         1.5e-01   2.3e-03   8.2e-04   8.9e-04   6.1e-04   4.9e-04

Table A.1: Burgers' equation: Relative L^2 error between the predicted and the exact solution u(t, x) for different numbers of initial and boundary training data N_u and collocation points N_f. Here, the network architecture is fixed to 9 layers with 20 neurons per hidden layer.
  Layers \ Neurons   10        20        40
  2                  7.4e-02   5.3e-02   1.0e-01
  4                  3.0e-03   9.4e-04   6.4e-04
  6                  9.6e-03   1.3e-03   6.1e-04
  8                  2.5e-03   9.6e-04   5.6e-04

Table A.2: Burgers' equation: Relative L^2 error between the predicted and the exact solution u(t, x) for different numbers of hidden layers and neurons per layer. Here, the total number of training and collocation points is fixed to N_u = 100 and N_f = 10,000, respectively.
ical law through the collocation points N_f, one can obtain a more accurate and data-efficient learning algorithm.⁵ Finally, table A.2 shows the resulting relative L^2 error for different numbers of hidden layers and neurons per layer, while the total number of training and collocation points is kept fixed to N_u = 100 and N_f = 10,000, respectively. As expected, we observe that as the number of layers and neurons is increased (hence the capacity of the neural network to approximate more complex functions), the predictive accuracy is increased.

⁵Note that the case N_f = 0 corresponds to a standard neural network model, i.e., a neural network that does not take into account the underlying governing equation.
Appendix A.2. Discrete Time Models

Let us apply the general form of Runge-Kutta methods with q stages [45] to a general equation of the form

u_t + \mathcal{N}[u] = 0, \quad x \in \Omega, \ t \in [0, T],    (A.3)

and obtain

Here, u^{n+c_j}(x) = u(t^n + c_j \Delta t, x) for j = 1, \ldots, q. This general form encapsulates both implicit and explicit time-stepping schemes, depending on the choice of the parameters \{a_{ij}, b_j, c_j\}. Equations (A.4) can be equivalently expressed as

u^n = u^n_i, \quad i = 1, \ldots, q,
u^n = u^n_{q+1},    (A.5)

where

u^n_i := u^{n+c_i} + \Delta t \sum_{j=1}^{q} a_{ij} \mathcal{N}[u^{n+c_j}], \quad i = 1, \ldots, q,
u^n_{q+1} := u^{n+1} + \Delta t \sum_{j=1}^{q} b_j \mathcal{N}[u^{n+c_j}].    (A.6)

We proceed by placing a multi-output neural network prior on

[u^{n+c_1}(x), \ldots, u^{n+c_q}(x), u^{n+1}(x)].    (A.7)

This prior assumption along with equations (A.6) results in a physics-informed neural network that takes x as an input and outputs

[u^n_1(x), \ldots, u^n_q(x), u^n_{q+1}(x)].    (A.8)

To highlight the key features of the discrete time representation we revisit the problem of data-driven solution of the Burgers' equation. For this case, the nonlinear operator in equation (A.6) is given by

\mathcal{N}[u^{n+c_j}] = u^{n+c_j} u^{n+c_j}_x - (0.01/\pi) u^{n+c_j}_{xx},

and the shared parameters of the neural networks (A.7) and (A.8) can be learned by minimizing the sum of squared errors

SSE = SSE_n + SSE_b,    (A.9)

where

SSE_n = \sum_{j=1}^{q+1} \sum_{i=1}^{N_n} |u^n_j(x^{n,i}) - u^{n,i}|^2,

and

SSE_b = \sum_{i=1}^{q} \left( |u^{n+c_i}(-1)|^2 + |u^{n+c_i}(1)|^2 \right) + |u^{n+1}(-1)|^2 + |u^{n+1}(1)|^2.
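Continuing the discrete time sketch from Section 3.2 (with stage_predictions and the (q+1)-output network net defined there, and with a Burgers' version of N_of_u in place of the Allen-Cahn one), the two loss terms above can be assembled as follows; this is only an illustrative sketch of equation (A.9).

```python
def discrete_burgers_sse(x_n, u_n_data, burgers_N):
    """SSE = SSE_n + SSE_b for the discrete time Burgers' solver."""
    preds = stage_predictions(x_n, burgers_N)    # (N_n, q+1) predictions u^n_j(x^{n,i})
    sse_n = torch.sum((preds - u_n_data) ** 2)   # data misfit, u_n_data broadcast over the q+1 columns
    x_bnd = torch.tensor([[-1.0], [1.0]])
    U_bnd = net(x_bnd)                           # stage and final-time values at x = -1 and x = +1
    sse_b = torch.sum(U_bnd ** 2)                # homogeneous Dirichlet boundary conditions
    return sse_n + sse_b
```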
scheme now allows us to infer the latent solution u(t, x) in a sequential fashion. Starting from initial data \{x^{n,i}, u^{n,i}\}_{i=1}^{N_n} at time t^n and data at the

The result of applying this process to the Burgers' equation is presented in figure A.7. For illustration purposes, we start with a set of N_n = 250 initial data at t = 0.1, and employ a physics-informed neural network induced by an implicit Runge-Kutta scheme with 500 stages to predict the solution at time t = 0.9 in a single step. The theoretical error estimates for this scheme predict a temporal error accumulation of O(\Delta t^{2q}) [45], which in our case translates into an error way below machine precision, i.e., \Delta t^{2q} = 0.8^{1000} \approx 10^{-97}. To our knowledge, this is the first time that an implicit Runge-Kutta scheme of such high order has ever been used. Remarkably, starting from smooth initial data at t = 0.1 we can predict the nearly discontinuous solution at t = 0.9 in a single time-step with a relative L^2 error of 8.2 · 10^{-4}. This error is two orders of magnitude lower than the one reported in [8], and it is entirely attributed to the neural network's capacity to approximate u(t, x), as well as to the degree that the sum of squared errors loss allows interpolation of the training data. The network architecture used here consists of 4 layers with 50 neurons in each hidden layer.
Figure A.7: Burgers' equation: Top: Solution u(t, x) along with the location of the initial training snapshot at t = 0.1 and the final prediction snapshot at t = 0.9. Bottom: Initial training data and final prediction at the snapshots depicted by the white vertical lines in the top panel. The relative L^2 error for this case is 8.2 · 10^{-4}.
varied the number of hidden layers and the number of neurons per layer, and monitored the resulting relative L^2 error for the predicted solution at time t = 0.9. Evidently, as the neural network capacity is increased the predictive accuracy is enhanced.

The key parameters controlling the performance of our discrete time algorithm are the total number of Runge-Kutta stages q and the time-step size \Delta t. In table A.4 we summarize the results of an extensive systematic study where we fix the network architecture to 4 hidden layers with 50 neurons
  Layers \ Neurons   10        25        50
  1                  4.1e-02   4.1e-02   1.5e-01
  2                  2.7e-03   5.0e-03   2.4e-03
  3                  3.6e-03   1.9e-03   9.5e-04

Table A.3: Burgers' equation: Relative final prediction error measured in the L^2 norm for different numbers of hidden layers and neurons in each layer. Here, the number of Runge-Kutta stages is fixed to 500 and the time-step size to \Delta t = 0.8.
per layer, and vary the number of Runge-Kutta stages q and the time-step size \Delta t. Specifically, we see how cases with low numbers of stages fail to yield accurate results when the time-step size is large. For instance, the case q = 1 corresponding to the classical trapezoidal rule, and the case q = 2 corresponding to the fourth-order Gauss-Legendre method, cannot retain their predictive accuracy for time-steps larger than 0.2, thus mandating a solution strategy with multiple time-steps of small size. On the other hand, the ability to push the number of Runge-Kutta stages to 32 and even higher allows us to take very large time steps, and effectively resolve the solution in a single step without sacrificing the accuracy of our predictions. Moreover, numerical stability is not sacrificed either, as implicit Gauss-Legendre is the only family of time-stepping schemes that remain A-stable regardless of their order, thus making them ideal for stiff problems [45]. These properties are unprecedented for an algorithm of such implementation simplicity, and illustrate one of the key highlights of our discrete time approach.
Finally, in table A.5 we provide a systematic study to quantify the accuracy of the predicted solution as we vary the spatial resolution of the input data. As expected, increasing the total number of training data results in enhanced prediction accuracy.
  q \ Δt   0.2       0.4       0.6       0.8
  1        3.5e-02   1.1e-01   2.3e-01   3.8e-01
  2        5.4e-03   5.1e-02   9.3e-02   2.2e-01
  4        1.2e-03   1.5e-02   3.6e-02   5.4e-02
  8        6.7e-04   1.8e-03   8.7e-03   5.8e-02
  16       5.1e-04   7.6e-02   8.4e-04   1.1e-03
  32       7.4e-04   5.2e-04   4.2e-04   7.0e-04
  64       4.5e-04   4.8e-04   1.2e-03   7.8e-04
  100      5.1e-04   5.7e-04   1.8e-02   1.2e-03
  500      4.1e-04   3.8e-04   4.2e-04   8.2e-04

Table A.4: Burgers' equation: Relative final prediction error measured in the L^2 norm for different numbers of Runge-Kutta stages q and time-step sizes \Delta t. Here, the network architecture is fixed to 4 hidden layers with 50 neurons in each layer.
of the neural networks u(t, x) and f(t, x) along with the parameters λ = (λ_1, λ_2) of the differential operator can be learned by minimizing the mean squared error loss

MSE = MSE_u + MSE_f,    (B.3)

where

MSE_u = \frac{1}{N} \sum_{i=1}^{N} |u(t_u^i, x_u^i) - u^i|^2,

and

MSE_f = \frac{1}{N} \sum_{i=1}^{N} |f(t_u^i, x_u^i)|^2.
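Since the explicit definition of f for this identification problem is not reproduced above, the sketch below assumes the parameterized Burgers' form f := u_t + λ_1 u u_x − λ_2 u_xx (consistent with the target values λ_1 = 1.0 and λ_2 = 0.01/π reported later in this appendix). The only structural change relative to the data-driven solution setting is that λ_1, λ_2 become trainable variables optimized jointly with the network weights.

```python
import torch

lam = torch.nn.Parameter(torch.zeros(2))   # trainable (lambda_1, lambda_2)

def f_identify(t, x):
    """Assumed identification residual f = u_t + lambda_1 u u_x - lambda_2 u_xx."""
    t, x = t.requires_grad_(True), x.requires_grad_(True)
    u = net(torch.cat([t, x], dim=1))          # same u-network as in the solution setting
    grad = lambda y, z: torch.autograd.grad(y, z, torch.ones_like(y), create_graph=True)[0]
    u_t, u_x = grad(u, t), grad(u, x)
    u_xx = grad(u_x, x)
    return u_t + lam[0] * u * u_x - lam[1] * u_xx
```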
Figure B.8: Burgers' equation: Top: Predicted solution u(t, x) along with the training data. Middle: Comparison of the predicted and exact solutions corresponding to the three temporal snapshots depicted by the dashed vertical lines in the top panel. Bottom: Correct partial differential equation along with the identified one obtained by learning λ_1 and λ_2.
            % error in λ_1                      % error in λ_2
  noise    0%      1%      5%      10%         0%       1%      5%      10%
  N_u
  500      0.131   0.518   0.118   1.319       13.885   0.483   1.708   4.058
  1000     0.186   0.533   0.157   1.869       3.719    8.262   3.481   14.544
  1500     0.432   0.033   0.706   0.725       3.093    1.423   0.502   3.156
  2000     0.096   0.039   0.190   0.101       0.469    0.008   6.216   6.391

Table B.6: Burgers' equation: Percentage error in the identified parameters λ_1 and λ_2 for different numbers of training data N_u corrupted by different noise levels. Here, the neural network architecture is kept fixed to 9 layers and 20 neurons per layer.
                     % error in λ_1              % error in λ_2
  Layers \ Neurons   10       20      40         10        20       40
  2                  11.696   2.837   1.679      103.919   67.055   49.186
  4                  0.332    0.109   0.428      4.721     1.234    6.170
  6                  0.668    0.629   0.118      3.144     3.123    1.158
  8                  0.414    0.141   0.266      8.459     1.902    1.552

Table B.7: Burgers' equation: Percentage error in the identified parameters λ_1 and λ_2 for different numbers of hidden layers and neurons per layer. Here, the training data is considered to be noise-free and fixed to N = 2,000.
u^n = u^n_i, \quad i = 1, \ldots, q,
u^{n+1} = u^{n+1}_i, \quad i = 1, \ldots, q,    (B.4)

where

u^n_i := u^{n+c_i} + \Delta t \sum_{j=1}^{q} a_{ij} \mathcal{N}[u^{n+c_j}; λ], \quad i = 1, \ldots, q,
u^{n+1}_i := u^{n+c_i} + \Delta t \sum_{j=1}^{q} (a_{ij} - b_j) \mathcal{N}[u^{n+c_j}; λ], \quad i = 1, \ldots, q.    (B.5)
This prior assumption along with equations (B.5) results in two physics-informed neural networks

[u^n_1(x), \ldots, u^n_q(x), u^n_{q+1}(x)],    (B.7)

and

[u^{n+1}_1(x), \ldots, u^{n+1}_q(x), u^{n+1}_{q+1}(x)].    (B.8)

Given noisy measurements at two distinct temporal snapshots \{x^n, u^n\} and \{x^{n+1}, u^{n+1}\} of the system at times t^n and t^{n+1}, respectively, the shared parameters of the neural networks (B.6), (B.7), and (B.8) along with the parameters λ of the differential operator can be trained by minimizing the sum of squared errors

SSE = SSE_n + SSE_{n+1},    (B.9)

where

SSE_n := \sum_{j=1}^{q} \sum_{i=1}^{N_n} |u^n_j(x^{n,i}) - u^{n,i}|^2,

and

SSE_{n+1} := \sum_{j=1}^{q} \sum_{i=1}^{N_{n+1}} |u^{n+1}_j(x^{n+1,i}) - u^{n+1,i}|^2.

Here, x^n = \{x^{n,i}\}_{i=1}^{N_n}, u^n = \{u^{n,i}\}_{i=1}^{N_n}, x^{n+1} = \{x^{n+1,i}\}_{i=1}^{N_{n+1}}, and u^{n+1} = \{u^{n+1,i}\}_{i=1}^{N_{n+1}}.
and notice that the nonlinear spatial operator in equation (B.5) is given by

Given merely two training data snapshots, the shared parameters of the neural networks (B.6), (B.7), and (B.8) along with the parameters λ = (λ_1, λ_2) of the Burgers' equation can be learned by minimizing the sum of squared errors in equation (B.9). Here, we have created a training data set comprising N_n = 199 and N_{n+1} = 201 spatial points by randomly sampling the exact solution at time instants t^n = 0.1 and t^{n+1} = 0.9, respectively. The training data are shown in the top and middle panels of figure B.9. The neural network architecture used here consists of 4 hidden layers with 50 neurons each, while the number of Runge-Kutta stages is empirically chosen to yield a temporal error accumulation of the order of machine precision by setting⁶

where the time-step for this example is \Delta t = 0.8. The bottom panel of figure B.9 summarizes the identified parameters λ = (λ_1, λ_2) for the cases of noise-free data, as well as noisy data with 1% of uncorrelated Gaussian noise corruption. For both cases, the proposed algorithm is able to learn the correct parameter values λ_1 = 1.0 and λ_2 = 0.01/π with remarkable accuracy, despite the fact that the two data snapshots used for training are very far apart, and potentially describe different regimes of the underlying dynamics.
⁶This is motivated by the theoretical error estimates for implicit Runge-Kutta schemes suggesting a truncation error of O(\Delta t^{2q}) [45].
Figure B.9: Burgers' equation: Top: Solution u(t, x) along with the temporal locations of the two training snapshots. Middle: Training data and exact solution corresponding to the two temporal snapshots depicted by the dashed vertical lines in the top panel. Bottom: Correct partial differential equation along with the identified one obtained by learning λ_1, λ_2.
References
[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, pp. 1097–1105.
            % error in λ_1                      % error in λ_2
  noise    0%      1%      5%      10%         0%      1%      5%       10%
  Δt
  0.2      0.002   0.435   6.073   3.273       0.151   4.982   59.314   83.969
  0.4      0.001   0.119   1.679   2.985       0.088   2.816   8.396    8.377
  0.6      0.002   0.064   2.096   1.383       0.090   0.068   3.493    24.321
  0.8      0.010   0.221   0.097   1.233       1.918   3.215   13.479   1.621

Table B.8: Burgers' equation: Percentage error in the identified parameters λ_1 and λ_2 for different gap sizes Δt between the two data snapshots and for different noise levels.
                     % error in λ_1             % error in λ_2
  Layers \ Neurons   10      25      50         10        25        50
  1                  1.868   4.868   1.960      180.373   237.463   123.539
  2                  0.443   0.037   0.015      29.474    2.676     1.561
  3                  0.123   0.012   0.004      7.991     1.906     0.586
  4                  0.012   0.020   0.011      1.125     4.448     2.014

Table B.9: Burgers' equation: Percentage error in the identified parameters λ_1 and λ_2 for different numbers of hidden layers and neurons in each layer.
[7] C. E. Rasmussen, C. K. Williams, Gaussian processes for machine learning, volume 1, MIT Press, Cambridge, 2006.
[18] Y. Zhu, N. Zabaras, Bayesian deep convolutional encoder-decoder networks for surrogate modeling and uncertainty quantification, arXiv preprint arXiv:1801.06879 (2018).

[20] R. Tripathy, I. Bilionis, Deep UQ: Learning deep neural network surrogate models for high dimensional uncertainty quantification, arXiv preprint arXiv:1802.00850 (2018).

[26] M. Milano, P. Koumoutsakos, Neural network modeling for near wall turbulent flow, Journal of Computational Physics 182 (2002) 1–26.
identification, in: Neural Networks for Signal Processing IV: Proceedings of the 1994 IEEE Workshop, IEEE, pp. 596–605.

[30] H. W. Lin, M. Tegmark, D. Rolnick, Why does deep and cheap learning work so well?, Journal of Statistical Physics 168 (2017) 1223–1247.

[35] D. C. Liu, J. Nocedal, On the limited memory BFGS method for large scale optimization, Mathematical Programming 45 (1989) 503–528.

[36] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT Press, 2016.

[37] D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).

[39] R. Shwartz-Ziv, N. Tishby, Opening the black box of deep neural networks via information, arXiv preprint arXiv:1703.00810 (2017).
[41] M. Stein, Large sample properties of simulations using Latin hypercube sampling, Technometrics 29 (1987) 143–151.

[43] H.-J. Bungartz, M. Griebel, Sparse grids, Acta Numerica 13 (2004) 147–269.

[45] A. Iserles, A first course in the numerical analysis of differential equations, 44, Cambridge University Press, 2009.

[48] T. Dauxois, Fermi, Pasta, Ulam and a mysterious lady, arXiv preprint arXiv:0801.1590 (2008).