The Confluence of Networks, Games and Learning: A Game-Theoretic Framework For Multi-Agent Decision Making Over Networks
Abstract
Recent years have witnessed significant advances in technologies and services in
modern network applications, including smart grid management, wireless communi-
cation, cybersecurity as well as multi-agent autonomous systems. Considering the
heterogeneous nature of networked entities, emerging network applications call for
game-theoretic models and learning-based approaches in order to create distributed
network intelligence that responds to uncertainties and disruptions in a dynamic or an
adversarial environment. This paper articulates the confluence of networks, games and
learning, which establishes a theoretical underpinning for understanding multi-agent
decision-making over networks. We provide a selective overview of game-theoretic
learning algorithms within the framework of stochastic approximation theory, and as-
sociated applications in some representative contexts of modern network systems, such
as the next generation wireless communication networks, the smart grid and distributed
machine learning. In addition to existing research works on game-theoretic learning
over networks, we highlight several new angles and research endeavors on learning in
games that are related to recent developments in artificial intelligence. Some of the
new angles extrapolate from our own research interests. The overall objective of the
paper is to provide the reader a clear picture of the strengths and challenges of adopting
game-theoretic learning methods within the context of network systems, and further
to identify fruitful future research directions on both theoretical and applied studies.
1 Introduction
Multi-agent decision making over networks has recently attracted an exponentially growing
number of researchers from the systems and control community. The area has gained in-
creasing momentum in various fields including engineering, social sciences, economics, urban
∗Prepared for the IEEE Control Systems Magazine, as part of the special issue “Distributed Nash Equilibrium Seeking over Networks”.
†Corresponding author.
‡Department of Electrical and Computer Engineering, New York University, NY, USA; Email: {tl2636, gp1363, qz494}@nyu.edu.
§Department of Electrical and Computer Engineering & Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, IL, USA; Email: {basar1}@illinois.edu.
science, and artificial intelligence, as it serves as a prevalent framework for studying large
and complex systems, and has been widely applied in tackling many problems arising in
these fields, such as social networks analysis [1], smart grid management [2, 3], traffic con-
trol [4], wireless and communication networks [5–7], cybersecurity [8,9], as well as multi-agent
autonomous systems [10].
Due to the proliferation of advanced technologies and services in modern network appli-
cations, solving the decision-making problems in multi-agent networks calls for novel models
and approaches that can capture the following characteristics of emerging network systems
and the design of autonomous controls:
1. the heterogeneous nature of the underlying network, where multiple entities, repre-
sented by the set of nodes, aim to pursue their own goals with independent decision-
making capabilities;
2. the need for distributed or decentralized operation of the system, when the underlying
network is of a complex topological structure and is too large to be managed in a
centralized approach;
3. the need for creating network intelligence that is responsive to changes in the network
and the environment, as the system oftentimes operates in a dynamic or an adversarial
environment.
Game theory provides a natural set of tools and frameworks addressing these challenges,
and bridging networks to decision making. It entails development of mathematical models
that both qualitatively and quantitatively depict how the interactions of self-interested agents
with different information and rationalities can attain a global objective or lead to emergent
behaviors at a system level. Moreover, with the underlying network, game-theoretic models
capture the impact of the topology on the process of distributed decision making, where
agents plan their moves independently according to their goals and local information available
to them, such as their observations of their neighbors.
In addition to game-theoretic models over networks, learning theory is indispensable when
designing decentralized management mechanisms for network systems, in order to equip
networks with distributed intelligence. Through the combination of game-theoretic models
and associated learning schemes, such network intelligence allows heterogeneous agents to
interact strategically with each other and learn to respond to uncertainties, anomalies, and
disruptions, leading to desired collective behavior patterns over the network or an optimal
system-level performance. The key feature of such network intelligence is that even though
each agent’s own decision-making process is influenced by the others’ decisions, the agents
reach an equilibrium state, that is, a Nash equilibrium as we elucidate later, in an online and
decentralized manner. To equip networks with distributed intelligence, networked agents
should adapt themselves to the dynamic environment with limited and local observations
over a large network that may be unknown to them. Computationally, decentralized learning
scales efficiently to large and complex networks, and requires no global information regarding
the entire network, which is more practical compared with centralized control laws.
This paper articulates the confluence of networks, games and learning, which establishes
a theoretical underpinning for understanding multi-agent decision-making over networks.
Figure 1: The confluence of networks, games and learning. The combination of game-
theoretic modelling and learning theories leads to resilient and agile network controls for
various networked systems.
The main objectives of this paper are as follows:
1. to introduce noncooperative game models and the associated solution concepts, in particular the Nash equilibrium and its variants, for multi-agent decision making over networks;
2. to present the key analytical tool based on stochastic approximation and Lyapunov theory for studying learning processes in games, and pinpoint some extensively studied learning dynamics;
3. to introduce various multi-agent systems and network applications that can be addressed through game-theoretic learning.
We aim to provide the reader a clear picture of the strengths and challenges of adopting
novel game-theoretic learning methods within the context of network systems. Besides the
highlighted contents, we also provide the reader with references for further reading. In this
paper, complete-information games are the basis of the subject, for which we give a brief
introduction to both static and dynamic games. More comprehensive treatments on this
Table 1: Summary of frequently used notation.

N -- The set of players
i, j ∈ N -- Subscript indices denoting players
N(i) -- The set of neighbors of player i
A_i -- The set of actions available to player i
∆(A_i) -- The set of Borel probability measures over A_i (the probability simplex in R^{|A_i|} for a finite action set A_i)
s ∈ S -- State variable
u_i : ∏_{j∈N} A_j → R -- Player i's utility function
a_i ∈ A_i -- Action of player i
a_{-i} ∈ ∏_{j∈N, j≠i} A_j -- Joint actions of players other than i
a ∈ ∏_{i∈N} A_i -- Joint actions of all players
π_i ∈ ∆(A_i) -- Strategy of player i
π_{-i} ∈ ∏_{j∈N, j≠i} ∆(A_j) -- Joint strategy of players other than i
u_i(π_{-i}) or u_i ∈ R^{|A_i|} -- Player i's utility vector in finite games
D_i(a) -- Individual payoff gradient of player i
D(a) -- The concatenation of {D_i(a)}_{i∈N}
I_i^k -- Feedback of player i at time k
U_i^k ∈ R -- The payoff feedback received by player i at time k
û_i^k ∈ R^{|A_i|} -- Estimated utility vector at time k
Û_i^k ∈ R^{|A_i|} -- Estimator of u_i(π_{-i}^k) at time k
BR_i -- Best response mapping for player i
QR_ϵ -- Regularized best response or quantal response
topic as well as other game models, such as incomplete information games, can be found
in [11–13]. As most of the network topologies can be characterized by the structure of the
utility function of the game [1, 14], we do not articulate the influence of network topologies
on the game itself. Instead, we focus on its influence on the learning process in games,
where players’ information feedback depends on the network structures, and we present
representative network applications to showcase this influence. We refer the reader to [1, 14]
for further reading on games over various networks.
We structure our discussions as follows. In Section 2, we introduce non-cooperative games
and associated solution concepts, including Nash equilibrium and its variants, which capture
the strategic interactions of self-interested players. Then, in Section 3, we move to the main
focus of this paper: learning dynamics in games that converge to Nash equilibrium. Within
the stochastic approximation framework, a unified description of various dynamics is pro-
vided, and the analytical properties can be studied by ordinary differential equation (ODE)
methods. In Section 4, we discuss applications of these learning algorithms in networks,
leading to distributed and learning-based controls for network systems. Finally, Section 5
concludes the paper. For the reader’s convenience, we summarize the notations that are
frequently used in Table 1.
2 Noncooperative Game Theory
Game theory constitutes a mathematical framework with two main branches: noncooper-
ative game theory and cooperative game theory. Noncooperative game theory focuses on
the strategic decision-making process of independent entities or players that aim to optimize
their distinct objective functions, without any external enforcement of cooperative behav-
iors. The term noncooperative does not necessarily mean that players are not engaged in
cooperative behaviors. As a matter of fact, induced cooperative or coordinated behaviors
do arise in noncooperative circumstances, within the context of Nash equilibrium, a solution
concept of noncooperative games. However, such coordination is self-enforcing and arises
from decentralized decision-making processes of self-interested players, and will be further
discussed in Section 4, where we introduce game-theoretic methods for distributed machine
learning.
As briefly discussed above, noncooperative game theory naturally characterizes the decision-
making process of heterogeneous entities acting independently over networks, which is the
main focus of this paper. In the following, we introduce various game models and related
solution concepts, including Nash equilibrium and its variants. Generally speaking, a game
involves the following elements: decision makers (players); choices available to each player
(actions); knowledge that a player acquires for making decisions (information) and each
player’s preference ordering among its actions, affected also by others’ actions (utilities or
costs). Below we provide a short list of these concepts that will be further discussed and
explained in this section.
1. Players are participants in a game, where they compete for their own good. A player
can be an individual or encapsulation of a set of individuals.
2. Actions of a player, in the terminology of control theory, are the implementations of
the player’s control.
3. Information in games refers to the structure regarding the knowledge players acquire
about the game and its history when they decide on their moves. The information
structure can vary considerably. For some games, the information is static and does
not change during the play, while for other games, new information will be revealed
after players’ moves, as the “state” of the game, a concept to be elucidated later, is
determined by players’ actions during the play. In the latter case, the information is
dynamic. We shall address both types of games in this paper.
4. A strategy is a mapping that associates a player’s move with the information available
to him at the time when he decides which move to choose.
5. A utility or payoff is oftentimes a real-valued function capturing a player’s preference
ordering among possible outcomes of the game. Using the terminology in control
theory, this can also be viewed as a cost function for the player’s controls.
The above list refers to elements of games in relatively imprecise common language terms,
and more formal definitions are presented below. To facilitate this discussion, we categorize
noncooperative games into two main classes: static and dynamic games, based on the nature
of the information structure.
2.1 Static Games
Static games are one-shot, where players make decisions simultaneously based on the prior
information on the games, such as sets of players’ actions, and their payoffs. In such games,
each player’s knowledge about the game is static and does not evolve during the play. Math-
ematically speaking, a static noncooperative game is defined as follows.
Definition 1 (Static Games) A static game is defined by a triple G := ⟨N, (A_i)_{i∈N}, (u_i)_{i∈N}⟩, where
1. N denotes the set of players;
2. A_i, with some specified topology, denotes the set of actions available to player i ∈ N;
3. u_i : ∏_{j∈N} A_j → R defines player i's utility, and u_i(a_i, a_{-i}) gives the payoff of player i when taking action a_i, given other players' actions a_{-i} := (a_j)_{j∈N, j≠i}.
In static games, each player develops its strategy, a probability distribution over its action set, with the objective of maximizing the expected value of its own utility. If players have finite action sets, then such a static game is called a finite game. In this case, a strategy is a finite-dimensional vector in the probability simplex over the action set, that is, π_i ∈ ∆(A_i) := {π ∈ R^{|A_i|} : π(a) ≥ 0 for all a ∈ A_i, Σ_{a∈A_i} π(a) = 1}. If π_i is a unit vector e_a, a ∈ A_i, with the a-th entry being 1 and all others 0, then it is referred to as a pure strategy, selecting action a with probability 1; otherwise, it is a mixed strategy, choosing actions randomly
under the selected probability distribution. Similarly, for infinite action sets, the strategy is
defined as a Borel probability measure over the action set, with Dirac measure being the pure
strategy. By a possible abuse of notation, we denote the set of Borel probability measures
over Ai by ∆(Ai ). Unless specified otherwise, static games considered in this paper are all
assumed to be finite, where the player set and the action sets are all finite.
As a special case of games with infinite actions, the mixed extension of finite games is
introduced in the sequel. Consider a two-player finite game G = ⟨N , (Ai )i∈N , (ui )i∈N ⟩, where
N = {1, 2}, and the action sets are finite |Ai | < ∞, i ∈ N . Given the mixed strategies of
players, πi ∈ ∆(Ai ), the expected utility of player i is Ea1 ∼π1 ,a2 ∼π2 [ui (a1 , a2 )]. With a slight
abuse of notation, we denote this expected utility by ui (π1 , π2 ) := Ea1 ∼π1 ,a2 ∼π2 [ui (a1 , a2 )].
Then, studying the players’ strategic interactions is equivalent to considering the following
infinite game G∞ = ⟨N , (∆(Ai ))i∈N , (ui )i∈N ⟩, where ui denotes the expected utility. In G∞ ,
an action is a vector from the corresponding probability simplex, a convex and compact set
with a continuum of elements. Similar to the notations used in the definition, for the mixed
extension G∞ , we denote the joint action of players other than i by π−i := (πj )j∈N ,j̸=i .
Furthermore, we let ui (π−i ) ∈ R|Ai | be the utility vector of player i, given other players’
strategy profiles, π−i , whose a-th entry is defined as ui (π−i )(a) := ui (ea , π−i ). Due to the
definition of expectation, ui (πi , π−i ) can be expressed as an inner product ⟨πi , ui (π−i )⟩, which
will be frequently used later when discussing learning algorithms in finite games. This mixed
extension allows us to give a geometric characterization to Nash equilibria of finite games,
based on variational inequalities, as discussed in Section 2.3. Meanwhile, this inner product
expression connects learning theory in finite games with online linear optimization [15], where
the generic player’s decision variable is πi and the loss function specified by ⟨·, ui (π−i )⟩ is
linear in πi .
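To make the inner product expression concrete, the following minimal sketch (with hypothetical payoff numbers, not taken from the paper) computes a player's expected utility in a two-player finite game both as the bilinear form and as the inner product ⟨π_1, u_1(π_2)⟩:

```python
# A minimal sketch (hypothetical payoff numbers, not from the paper) of the inner-product
# form u_1(pi_1, pi_2) = <pi_1, u_1(pi_2)> in the mixed extension of a two-player finite game.
import numpy as np

U1 = np.array([[3.0, 0.0],    # player 1's payoff matrix: rows index player 1's actions,
               [5.0, 1.0]])   # columns index player 2's actions
pi1 = np.array([0.5, 0.5])    # mixed strategy of player 1
pi2 = np.array([0.2, 0.8])    # mixed strategy of player 2

u1_vec = U1 @ pi2             # utility vector u_1(pi_2): expected payoff of each pure action
expected_u1 = pi1 @ u1_vec    # inner product <pi_1, u_1(pi_2)>

# The same quantity written directly as the expectation over (a1, a2) ~ (pi1, pi2):
assert np.isclose(expected_u1, pi1 @ U1 @ pi2)
print(expected_u1)            # 1.2 for these numbers
```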
Even though widely applied in modeling behaviors of self-interested players, the static
game model is far from being sufficient to cover multi-agent decision making problems arising
in different fields. For instance, when playing poker games, new information will be revealed
during the game play, such as cards played at each round, based on which players can adjust
their moves. There are many games where players’ information about the game changes over
time during the play, which cannot be suitably described by static games. Therefore, we
must resort to another model for capturing the underlying dynamics.
3. a set Ai with some specified topology, defined for each i ∈ N , corresponding to the set
of actions or controls available to player i;
4. a set S with some specified topology, denoting the state space of the game, where s^k ∈ S, k ∈ N+, represents the state of the game at time k;
5. a transition kernel T : S × ∏_{i∈N} A_i → ∆(S), according to which the next state is sampled, that is, s^{k+1} ∼ T(s^k, a^k), where a^k = (a_1^k, . . . , a_N^k) is the N-tuple of actions at time k ∈ N+, and s^1 ∈ S has a given distribution;
6. an instantaneous payoff u_i : S × ∏_{i∈N} A_i → R, defined for each i ∈ N and k ∈ N+, determining the payoff u_i(s^k, a^k) received by player i at time k;
7. a discounting factor γ. Given {s^1, . . . , s^k, . . . ; a^1, . . . , a^k, . . .}, the discounted cumulative payoff for player i is Σ_{k=1}^∞ γ^k u_i(s^k, a^k).
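As a concrete illustration of the transition kernel, instantaneous payoff, and discounting factor above, the following sketch (with an assumed toy transition kernel and payoff table, not from the paper) rolls out a two-player finite Markov game and accumulates player 1's discounted payoff:

```python
# A minimal sketch (toy random numbers, not from the paper) of rolling out a finite Markov
# game and accumulating player 1's discounted payoff sum_{k>=1} gamma^k u_1(s^k, a^k).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, horizon = 3, 2, 0.9, 200

# Hypothetical transition kernel: T[s, a1, a2] is a distribution over next states.
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
# Hypothetical instantaneous payoff of player 1: u1[s, a1, a2].
u1 = rng.standard_normal((n_states, n_actions, n_actions))

s = 0                                  # initial state s^1 (a fixed distribution would also do)
discounted_payoff = 0.0
for k in range(1, horizon + 1):
    a1, a2 = rng.integers(n_actions), rng.integers(n_actions)   # e.g. uniform strategies
    discounted_payoff += gamma**k * u1[s, a1, a2]
    s = rng.choice(n_states, p=T[s, a1, a2])                    # s^{k+1} ~ T(s^k, a^k)

print(discounted_payoff)
```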
The above definition only characterizes one special case of dynamic games. Based on this
definition, we can derive many other game models. For example, we can make state transi-
tions independent of players’ actions as well as the current state, yielding a special case of
stochastic games, which will be further discussed in another paper in this special issue [17].
We can also consider continuous-time dynamic games where the transition is described by
a differential equation, leading to a differential game model. For an extensive coverage of
dynamic game models, we refer the reader to [11].
With the full observation of states, we can consider the stationary strategy πi : S →
∆(Ai ), by which players plan their moves only based on the current state s ∈ S. In this case,
we say the state variable s characterizes players’ knowledge of the game, since the actions,
utilities and next possible states are all determined by the current state. For dynamic games
under partial observation and/or non-Markovian transition, we refer the reader to [11], since
these topics are beyond the scope of this paper.
Given player 2's strategy π_2, a strategy π_1 ∈ BR_1(π_2) := arg max_{π∈∆(A_1)} {⟨π, u_1(π_2)⟩} is referred to as a best response of player 1 to player 2's strategy π_2, and BR_1(·) is called the best response set of player 1. Similarly, given player 1's strategy π_1, a best response
of player 2 is π2 ∈ BR2 (π1 ) := arg maxπ∈∆(A2 ) {⟨π, u2 (π1 )⟩}. Therefore, we can define a point-
to-set mapping BR : ∆(A_1) × ∆(A_2) → 2^{∆(A_1)×∆(A_2)}, which is the concatenation of BR_1 and BR_2. Given a joint strategy profile π = (π_1, π_2), BR(π) is defined as

BR(π) := BR_1(π_2) × BR_2(π_1).
If we can find π ∗ = (π1∗ , π2∗ ), a fixed point of this best-response mapping, that is, π ∗ ∈ BR(π ∗ ),
then when both players adopt the corresponding strategy in this profile, they could do no
better by unilaterally deviating from the current strategy. In other words, this fixed point
corresponds to an equilibrium outcome of the game, which further leads to the definition of
Nash equilibrium, which we introduce below for the general N -player game.
Definition 3 (Nash Equilibrium) For a static game ⟨N, (A_i)_{i∈N}, (u_i)_{i∈N}⟩, Nash equilibrium is a strategy profile π∗ = (π_i∗, π_{-i}∗) with the property that for all i ∈ N,

u_i(π_i∗, π_{-i}∗) ≥ u_i(π_i, π_{-i}∗),    (3)

where π_i is an arbitrary strategy of player i and π_{-i}∗ = (π_j∗)_{j∈N, j≠i} denotes the joint strategy
profile of the other players. If the inequality holds strictly for all πi ̸= πi∗ , then it is referred
to as a strict Nash equilibrium.
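As a quick illustration of Definition 3, the following sketch (using a hypothetical 2×2 coordination game, not an example from the paper) checks condition (3) at a pure strategy profile and confirms that it is a strict Nash equilibrium:

```python
# A minimal sketch (hypothetical 2x2 coordination game, not from the paper) verifying the
# Nash equilibrium condition (3) at the pure profile (e_0, e_0); the inequalities are strict,
# so this is a strict Nash equilibrium. For finite games it suffices to check pure deviations,
# since any mixed deviation is a convex combination of pure ones.
import numpy as np

U1 = np.array([[2.0, 0.0], [0.0, 1.0]])   # player 1's payoffs
U2 = U1.T                                  # player 2's payoffs (identical interests)
e0, e1 = np.eye(2)

u1_eq = e0 @ U1 @ e0                       # equilibrium payoff of player 1 (= 2)
u2_eq = e0 @ U2 @ e0                       # equilibrium payoff of player 2 (= 2)
print(e1 @ U1 @ e0 < u1_eq)                # True: player 1 strictly loses by deviating
print(e0 @ U2 @ e1 < u2_eq)                # True: player 2 strictly loses by deviating
```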
Note that the preceding definition naturally carries over to games with infinite action sets,
and we refer the reader to [11, Chapter 4] for more details. Furthermore, for infinite games,
if we impose some topological structures on the action sets and regularity conditions on the
utility functions, we can come up with a geometric interpretation of Nash equilibrium derived
from the inequality in (3). Toward that end, we consider a (static) game with compact and
convex action sets (Ai )i∈N and smooth concave utilities:
u_i(a_i, a_{-i}) is concave in a_i for all a_{-i} ∈ ∏_{j∈N, j≠i} A_j,  i ∈ N.
In such a game, each player's action set is a continuum, and the utility function is continuous; such games are referred to as continuous-kernel games or continuous games. In this case, a pure-strategy Nash equilibrium a∗ = (a_i∗, a_{-i}∗) ∈ ∏_{i∈N} A_i is defined by the following inequality:

u_i(a_i∗, a_{-i}∗) ≥ u_i(a_i, a_{-i}∗), for all a_i ∈ A_i, i ∈ N.    (4)

Further assuming that u_i(a_i, a_{-i}) is continuously differentiable in a_i ∈ A_i, for all a_{-i}, by the first-order condition, the Nash equilibrium in (4) can be characterized by

⟨D_i(a∗), a_i − a_i∗⟩ ≤ 0, for all a_i ∈ A_i, i ∈ N,
where Di (a) := ∇ai ui (ai , a−i ) denotes the individual payoff gradient of player i, and ∇ai ui (ai , a−i )
denotes differentiation with respect to the variable ai . By rewriting the inequality above in a
more compact form, we obtain the following variational characterization of Nash equilibrium
⟨D(a∗), a − a∗⟩ ≤ 0, for all a ∈ ∏_{i∈N} A_i,    (5)
where D(a) is the concatenation of {Di (a)}i∈N , that is, D(a) = (D1 (a), . . . , DN (a)). Geo-
metrically speaking, (5) states that for concave games, a∗ is a Nash equilibrium if and only if D(a∗) lies within the polar cone of the set ∏_{i∈N} A_i − a∗ := {a − a∗ | a ∈ ∏_{i∈N} A_i}, as shown in Figure 2.

Figure 2: Geometric illustration of the variational inequality (5): at a Nash equilibrium a∗, the payoff gradient D(a∗) lies in the polar cone PC(a∗) of the shifted action set ∏_{i∈N} A_i − a∗; TC(a∗) denotes the corresponding tangent cone.
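The variational characterization (5) can also be checked numerically. The sketch below uses a hypothetical two-player quadratic (Cournot-style) game, which is not an example from the paper, and verifies the Stampacchia-type inequality at the game's interior equilibrium:

```python
# A minimal sketch (hypothetical quadratic game, not from the paper) checking the
# variational characterization (5) at a Nash equilibrium of a two-player Cournot-style
# game with A_i = [0, 1] and u_i(a) = a_i * (1 - a_1 - a_2), which is concave in a_i.
import numpy as np

def D(a):
    """Individual payoff gradients D_i(a) = d u_i / d a_i, concatenated."""
    a1, a2 = a
    return np.array([1.0 - 2.0 * a1 - a2, 1.0 - a1 - 2.0 * a2])

a_star = np.array([1.0 / 3.0, 1.0 / 3.0])   # interior Nash equilibrium of this game

# Sample feasible joint actions and verify <D(a*), a - a*> <= 0 for all of them.
grid = np.linspace(0.0, 1.0, 21)
violations = [float(D(a_star) @ (np.array([x, y]) - a_star)) for x in grid for y in grid]
print(max(violations) <= 1e-12)   # True: a* satisfies the Stampacchia-type inequality (5)
```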
In addition to concave games, such variational inequality characterization has been stud-
ied in much broader contexts, such as monotone games [18], which bridges the gap between
the theory of monotone operators and Nash equilibrium seeking. For a detailed discussion,
we refer the reader to another paper in this special issue [19]. The variational inequality (5)
is referred to as the Stampacchia-type inequality in the literature [20], and a similar varia-
tional inequality of this type can also be derived in the context of the mixed extension. As a
special case of continuous games, the mixed extension of finite games also satisfies the regu-
larity conditions: the action spaces are probability simplex regions, which are compact and
convex, and the utility function, due to its linearity with respect to any player’s mixed strat-
egy, is naturally smooth and concave. Therefore, the mixed strategy Nash equilibrium can
be characterized by variational inequality as well. Thanks to the inner product expression
of the utility in the mixed extension, the individual payoff gradient is simply ui (π−i ), and
we denote the concatenation of {u_i(π_{-i})}_{i∈N} by u(π) := [u_1(π_{-1}), u_2(π_{-2}), . . . , u_N(π_{-N})],
which we also refer to as the joint utility vector under the strategy profile π. In the same
spirit of (5), a strategy profile π ∗ is Nash equilibrium of the underlying finite game if and
only if the following Stampacchia-type inequality holds
⟨u(π∗), π − π∗⟩ ≤ 0, for all π ∈ ∏_{i∈N} ∆(A_i).    (SVI)
As we will later see in Section 3.4.2, this variational characterization of Nash equilibrium
bridges the equilibrium concept of games and the equilibrium concept of dynamical systems
induced by learning algorithms.
In the same spirit of (3), Nash equilibrium in dynamic games can also be defined ac-
cordingly. For Markov games, given players’ stationary strategy profile π, the cumulative
expected utility of player i, starting from the initial state s1 = s, is
V_i^π(s) := E_{s^{k+1}∼T, a^k∼π}[ Σ_{k=1}^∞ γ^k u_i(s^k, a^k) | s^1 = s ],    (6)
which is referred to as state-value function in Markov decision process [21]. If we view Viπ as a
function of the strategy profile, following (3), we can define Nash equilibrium for the Markov
game, where the inequality holds for every state. In other words, regardless of previous play,
as long as players follow π ∗ from the current state s, they achieve the best outcome for the
rest of the game, and no player has any incentive to deviate from the strategy dictated by
π ∗ . Hence, this kind of Nash equilibrium is referred to as subgame perfect Nash equilibrium
(SPNE), which is widely used in the study of dynamic games [22, 23].
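For concreteness, one standard way to write the resulting equilibrium condition in terms of the state-value functions (a sketch in the notation above, not quoted verbatim from the paper) is:

```latex
% Sketch: (subgame perfect) Nash equilibrium condition for a Markov game with stationary
% strategies, stated via the state-value functions V_i^pi defined in (6).
\begin{equation*}
  V_i^{(\pi_i^{*},\,\pi_{-i}^{*})}(s) \;\geq\; V_i^{(\pi_i,\,\pi_{-i}^{*})}(s),
  \qquad \forall\, \pi_i,\ \forall\, s \in S,\ \forall\, i \in \mathcal{N}.
\end{equation*}
```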
The Nash equilibrium serves as a building block for noncooperative games. One of its
major advantages is that it characterizes a stable state of a noncooperative game, in which
no rational player has the incentive to move unilaterally. This stability idea will be further
discussed when we focus on learning in games, which relates stability theory of differential
equations to the convergence of learning algorithms in Nash equilibrium seeking.
3 Learning in Games
Learning in games refers to a long-run non-equilibrium process of learning, adaptation,
and/or imitation that leads to some equilibrium [24]. Different from pure equilibrium anal-
ysis based on the definition, learning in games accounts for how players behave adaptively
during repeated game play under uncertainties and partial observations. From a computational standpoint, computing a Nash equilibrium directly from equilibrium analysis is challenging due to its computational complexity [25], and such analysis hardly accounts for the decision-making process in practice, where players have limited computational power and information. Hence, learning models are needed to describe how less than fully rational players behave in order to reach an equilibrium. In short, equilibrium seeking and computation motivate learning in games [23].
If we view the learning process as a dynamical system, then the learning model can
predict how each player adjusts its behavior in response to other players over time to search
for strategies that will lead to higher payoffs. From this perspective, a Nash equilibrium can
also be interpreted as the steady state of the learning process, which serves as a prediction of
the limiting behavior of the dynamical system induced by the learning model. This viewpoint
has been widely adopted in the study of population biology and evolutionary game theory,
as we shall see more clearly when we later discuss reinforcement learning and replicator dynamics [26].
In this section, various learning dynamics are presented in the context of infinitely re-
peated games for Nash equilibrium seeking. We consider a number of players repeatedly
playing the game ⟨N , (Ai )i∈N , (ui )i∈N ⟩ infinitely many times. At time k, players determine
their moves based on their observations up to time k − 1. Then, they receive feedback
from the environment, which provides information on the past actions. For example, in
finite games, based on the information available to it, player i constructs a mixed strategy
πik ∈ ∆(Ai ), from which it samples an action aki and implements it. Then it will receive
a payoff feedback related to ui (aki , ak−i ), which evaluates the performance of aki and helps
the player shape its strategy for future plays. In such a repeated game, the amount of in-
formation that players acquire in repeated plays directly determines how players plan their
moves at each round and further influences the resulting learning dynamics. Besides being of
theoretical importance, the information feedback in the learning process, such as players’ ob-
servations of their opponents’ moves, is also of vital importance in designing learning-based
methods for solving network problems. As we shall see more clearly in Section 4, in many
network applications, networked agents only observe their surroundings, without any access
to the global information regarding the whole network. Therefore, due to its significance
in learning processes, we first present existing feedback structures that are of wide use in
learning, before moving to the details of learning algorithms.
indicating the completeness of the feedback in both the temporal and the spatial sense. Furthermore, we can also consider the noisy feedback of payoffs, U_i^k, defined as

U_i^k = u_i(a_i^k, a_{-i}^k) + ξ_i^k,

where ξ_i^k is a zero-mean martingale noise process with finite second moment, that is, E[ξ_i^k | F^{k−1}] = 0 and E[(ξ_i^k)^2 | F^{k−1}] is bounded by a constant, where the expectation is taken with respect to the σ-field F^{k−1} generated by the history of play up to time k − 1. Simply put, the noisy feedback
Uik is a conditionally unbiased estimator of uki with respect to the history, which is a standing
assumption when dealing with the convergence of learning dynamics in games. For noisy
feedback in general, or equivalently ξik being a generic random variable, the discussion will be
carried out in a different context. In that case, a system state should be introduced, which
accounts for the uncertainty in the environment, and the learning problem becomes Nash
equilibrium seeking in stochastic games (see Definition 2). For more detailed discussions, we
refer the reader to another paper in this special issue [17].
The perfect global feedback is of limited use in practice when designing learning algo-
rithms, as the global information is difficult or even impossible to acquire for individuals
in large-scale network systems. For example, in distributed or decentralized learning over
heterogeneous networks, players may have no access to others’ utilities due to physical limita-
tions. Therefore, we are interested in the scenario where players only have direct or indirect
access to their own utilities as well as their neighbors’, and hence players’ feedback can be
dependent on the topological structure of the underlying network that connects them.
Consider a repeated game over a graph G := (N , E), where N = {1, 2, . . . , N } is the
set of nodes, representing the players in the game, who are connected via the edges in
E = {(i, j)|i, j are connected}. To simplify the exposition, we assume that the graph is
undirected. Note that the direction of the edges does not affect our discussion as long as the
neighborhood is properly defined. For example, in a directed graph, when in-neighbors or
out-neighbors specify to which player(s) the player in question can pass information, then
the following characterizations of feedback structures still apply. For a more comprehensive
treatment of games over networks, we refer the reader to [14].
Each player is allowed to exchange payoff feedback with its neighbors through the edges
and observe their actions during the repeated play, whereas the information regarding the
rest is hidden from him. In this case, the feedback structure for player i is

I_i^k = {{u_j^{1:k}}_{j∈{i}∪N(i)}, {a_j^{1:k}}_{j∈{i}∪N(i)}}.

Note that the player's feedback regarding the payoffs and actions may not be consistent. For example, in a multi-agent robotic system where only the sensor network is operational, each agent can only observe its neighbors' movements through sensors. In this case, without any information of others' utilities, the information feedback of agent i reduces to I_i^k = {{u_i^{1:k}}, {a_j^{1:k}}_{j∈{i}∪N(i)}}. To sum up, if the players can only receive feedback from their
neighbors, then players’ feedback structures are related to the underlying topology, leading
to what is referred to as the local feedback. In accordance with this, the extreme case of
local feedback is one where the player is isolated in the network, and no information other
than its own payoff feedback and actions is available to it. We refer to this extreme case as
individual feedback, which is a typical information feedback considered in fully decentralized
learning and will be further elaborated on when discussing specific learning dynamics later
in this section.
In addition to the refinements from the spatial side, we can also consider feedback with
various temporal structures. If the player has perfect recall of previous plays, the resulting
feedback is said to be perfect, and those we have introduced above all fall within this class.
Otherwise, players have access to imperfect feedback, and we discuss two common cases of
imperfect information feedback in the following, namely windowed and delayed feedback.
For the sake of simplicity, we use perfect feedback I_i^k = {u_i^{1:k}, a_i^{1:k}} as a baseline to illustrate that different missing parts of I_i^k lead to different kinds of imperfect feedback. If the head of u_i^{1:k} and/or a_i^{1:k} is not available to the player, that is, there exists a window 0 < m < k such that the player only recalls u_i^{(k−m):k}, a_i^{(k−m):k}, then the corresponding feedback I_i^{(k−m):k} = {u_i^{(k−m):k}, a_i^{(k−m):k}} is referred to as the windowed feedback with a window size m. Similarly, if the tail of u_i^{1:k} and/or a_i^{1:k} is not available, that is, the player only recalls u_i^{1:(k−m)}, a_i^{1:(k−m)}, then the imperfect information feedback is I_i^{1:(k−m)} = {u_i^{1:(k−m)}, a_i^{1:(k−m)}}, which is called m-step delayed feedback.
For learning in games, each player learns to select actions by updating the strategy based
on the available feedback at each round. To describe this in mathematical terms, let F_i^k denote the strategy learning policy of player i. The learning policy produces a new strategy π_i^{k+1} for the next play according to

π_i^{k+1} = (1 − λ_i^k) π_i^k + λ_i^k F_i^k(I_i^k),    (7)
where λki is the learning rate, indicating the player’s capabilities of information retrieval.
Different feedback structures lead to different learning dynamics in repeated games. Under
the global or the local feedback structure, each player’s feedback is influenced by its oppo-
nents’ actions and/or payoffs, which makes the players’ learning processes coupled, as shown
in Figure 3.
Figure 3: Player’s strategy learning with the corresponding feedback. Under the global or
the local feedback structure, players’ learning processes are coupled, as their feedback is
influenced by their opponents’ moves. By contrast, players learn to play the game indepen-
dently under the individual feedback.
In the case of fully decentralized learning under individual information feedback, players
learn to play the game independently, and such a learning process is said to be uncoupled.
Uncoupled learning processes are of great significance in both theoretical studies [27] and
practical applications. Theoretically, learning with such limited information feedback is
much more transferable in the sense that learning algorithms under this feedback also apply
to online optimization problems, where the online decision-making process is viewed as a
repeated game played between a player and the nature [15].
Considering its theoretical importance, we focus on learning with individual feedback
in the sequel, and we refer the reader to [28] for a survey on learning methods under other
kinds of feedback. We first present reinforcement learning for finite games, where the learning
algorithms are characterized into two main classes, due to their distinct nature in exploration.
Then, we proceed to gradient play for infinite games, and elaborate on its connection to
reinforcement learning. The convergence results of presented algorithms are discussed in
Section 3.4 based on stochastic approximation [29, 30] and Lyapunov stability theory.
it can construct an estimator ûki ∈ R|Ai | based on Iik to evaluate actions a ∈ Ai . By using
this estimator, the player can compare its actions and choose the one that can achieve higher
payoffs in the next round. In mathematical terms, the estimator (score function) is given by
the following discrete-time dynamical system
û_i^{k+1} = (1 − µ_i^k) û_i^k + µ_i^k G_i^k(π_i^k, û_i^k, U_i^k, a_i^k),    (8)
where Gki : ∆(Ai ) × R|Ai | × R × Ai → R|Ai | is the learning policy for utility learning, πik is
the policy employed at time k, and µki is the learning rate. Based on the score function, the
player can modify its strategy accordingly in the sense that better actions shall be played
more frequently in the future. With slight abuse of notations, the strategy update is
π_i^{k+1} = (1 − λ_i^k) π_i^k + λ_i^k F_i^k(π_i^k, û_i^{k+1}, U_i^k, a_i^k),    (9)
where Fik : ∆(Ai ) × R|Ai | × R × Ai → ∆(Ai ) is the learning policy for strategy learning,
yielding a new policy for the next play. Compared with (7), the above discrete-time systems
(8) (9) explicitly show how the feedback shapes the player’s future play. According to (8),
the player recursively updates its estimate of the utility function based on the feedback it
receives after playing πik , and then the player determines its move in the next round, following
(9). Intuitively, we can view (π_i^k, û_i^{k+1}) as the information extracted from I_i^k for updating
the player’s strategy.
In reinforcement learning, the choice mapping plays an important role in achieving the
balance between exploitation and exploration. On one hand, the player would like to choose
the best action that is supposed to incur the highest payoff based on the score function.
However, this pure exploitation oftentimes leads to myopic behaviors, as the score function
may return a poor estimate of the utility function at the beginning of the learning process.
Hence, to gather more information for a better estimator, the player also needs some ex-
perimental moves for exploration, where suboptimal actions are implemented. To sum up,
the trade-off between exploitation and exploration is of vital importance to the success of
reinforcement learning, and it depends on the construction of the choice mapping. Different
choice mappings result in different reinforcement learning algorithms. Based on their distinct
natures in exploration, the algorithms can be categorized into two main classes: exploitative
reinforcement learning and exploratory reinforcement learning.
Recall that in the strategy learning (9), the next strategy produced by the corresponding
choice mapping is
π_i^{k+1} = (1 − λ_i^k) π_i^k + λ_i^k F_i^k(π_i^k, û_i^{k+1}, U_i^k, a_i^k),
where (1−λki )πik is referred to as the cognitive inertia or simply inertia, describing the player’s
tendency to repeat previous choices independently of the outcome. When determining its
next move πik+1 , the player takes into account both its previous strategy πik and the increment
update using the strategy learning policy Fik . Therefore, players’ exploration at (k + 1)-th
round stems either from this inertia or the strategy learning policy Fik . The former is called
passive exploration, as it relies on the player’s tendency to repeat previous choices, while the
latter one is referred to as active exploration, as the player deliberately tries actions based
on what he has learned from previous plays.
As the new strategy is a convex combination of the inertia term πik and the learned
incremental update F_i^k(π_i^k, û_i^{k+1}, U_i^k, a_i^k), there is no clear-cut boundary between passive and
active exploration. In fact, reinforcement learning is a continuum of learning algorithms.
In the following, we illustrate such a continuum by three prominent learning schemes. The
first one is the best response dynamics, located on the left endpoint, which is an example
of exploitative reinforcement learning. Solely relying on the inertia for passive exploration,
the best response dynamics adopts a purely exploitative learning policy: the best response
mapping in (1). In contrast to this exploitative scheme, we present dual averaging as
an example of exploratory reinforcement learning, which only leverages the learning policy
for exploring suboptimal actions without any cognitive inertia. In between, there lies the
smoothed best response dynamics, where both the inertia and the strategy learning policy
come into play for achieving the balance between exploration and exploitation.
In general, the best response mapping is a point-to-set mapping, and to analyze the as-
sociated learning dynamics, differential inclusion theory [30] is needed, which makes the convergence analysis more involved, as discussed in Section 3.4.2.
Under the noisy feedback I_i^k = {U_i^{1:k}, a_i^{1:k}}, the score function of player i is the estimated utility û_i^k, which is updated according to the following moving average scheme [32]:

û_i^{k+1}(a) = (1 − µ_i^k) û_i^k(a) + µ_i^k (1{a = a_i^k} / π_i^k(a)) U_i^k,   a ∈ A_i,    (11)

where 1{·} is an indicator function. Note that in (11), the importance sampling technique, which is common in bandit algorithms [15], is utilized to construct an unbiased estimator of u_i(π_{-i}^k). To see this, define a vector Û_i^k ∈ R^{|A_i|}, whose a-th entry is Û_i^k(a) := 1{a = a_i^k} U_i^k / π_i^k(a); we then obtain E[Û_i^k(a) | F^{k−1}] = u_i(a, π_{-i}^k). Hence, (11) can be rewritten as

û_i^{k+1} = (1 − µ_i^k) û_i^k + µ_i^k Û_i^k,    (12)

and û_i^{k+1}(a) gives the averaged payoff incurred by action a in the first k rounds. This importance
sampling technique can be viewed as compensating for the fact that actions played with
a low probability do not receive frequent updates of the corresponding estimates, so that
when they are played, any estimation error Uik − ûki (aki ) must have greater influence on the
estimated value than if frequent updates occur. We refer the reader to [15, 24] for more
details on importance sampling and its use in learning processes.
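The following minimal sketch (with hypothetical payoff values, not from the paper) implements the importance-sampled estimator and the moving-average update (11)-(12) for a single player facing a fixed opponent strategy:

```python
# A minimal sketch (hypothetical payoffs, not from the paper) of the importance-sampled
# estimator and the moving-average update (11)-(12): only the played action's entry is
# nonzero in the estimator, rescaled by 1/pi_i^k(a) to keep it conditionally unbiased.
import numpy as np

rng = np.random.default_rng(1)
n_actions = 2
u_hat = np.zeros(n_actions)            # score function \hat{u}_i^k
pi = np.array([0.3, 0.7])              # current mixed strategy pi_i^k (held fixed for illustration)
true_u = np.array([1.0, 0.5])          # hypothetical expected payoffs u_i(a, pi_{-i}^k)

for k in range(1, 5001):
    mu = 1.0 / k                                   # learning rate mu_i^k
    a = rng.choice(n_actions, p=pi)                # sampled action a_i^k
    U = true_u[a] + 0.1 * rng.standard_normal()    # noisy payoff feedback U_i^k
    U_hat = np.zeros(n_actions)
    U_hat[a] = U / pi[a]                           # estimator \hat{U}_i^k of u_i(., pi_{-i}^k)
    u_hat = (1.0 - mu) * u_hat + mu * U_hat        # moving-average update (12)

print(u_hat)   # close to true_u = [1.0, 0.5] on average
```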
With a slight abuse of the notation of the best response mapping in (2), we define the corresponding best response under the noisy feedback as BR_i(û_i^k) := arg max_{π∈∆(A_i)} {⟨π, û_i^k⟩}.
The resulting dynamical system under the noisy feedback is a coupled system, as shown below:

û_i^{k+1} = (1 − µ_i^k) û_i^k + µ_i^k Û_i^k,
π_i^{k+1} ∈ (1 − λ_i^k) π_i^k + λ_i^k BR_i(û_i^k).    (BR-d)
Originally proposed as a computational method for Nash equilibrium seeking [32, 33], the
best response dynamics (BR-d) is directly built upon the best response idea and has been
widely applied to evolutionary game problems [34]. One prominent example of best response
dynamics is fictitious play [35], where a player’s empirical play follows (BR-d); and more
details are included in Appendix A. As shown above, best response dynamics adopts passive
exploration, and the best response mapping BRi (·) encourages greedy actions that might be
myopic. As a result, exploitative reinforcement learning may fail to converge [24, 36].
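To illustrate the structure of (BR-d), the sketch below runs a best-response-style update on a hypothetical 2×2 coordination game; for clarity it uses exact utility vectors in place of the noisy score functions, so it should be read as a schematic rather than the paper's exact algorithm:

```python
# A schematic sketch (hypothetical 2x2 coordination game, not the paper's algorithm) of the
# best-response-style strategy update in (BR-d); exact utility vectors replace the noisy
# score functions to keep the example short.
import numpy as np

U1 = np.array([[2.0, 0.0], [0.0, 1.0]])   # player 1's payoffs
U2 = U1.T                                  # player 2's payoffs (identical interests)
pi1, pi2 = np.array([0.9, 0.1]), np.array([0.4, 0.6])

def best_response(u_vec):
    br = np.zeros_like(u_vec)
    br[int(np.argmax(u_vec))] = 1.0       # pure strategy on a payoff-maximizing action
    return br

for k in range(1, 201):
    lam = 1.0 / (k + 1)                   # decreasing learning rate lambda_i^k
    pi1 = (1 - lam) * pi1 + lam * best_response(U1 @ pi2)   # inertia + best response
    pi2 = (1 - lam) * pi2 + lam * best_response(U2 @ pi1)

print(pi1, pi2)   # both concentrate on action 0, the payoff-dominant equilibrium
```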
To promote active exploration, the best response can be regularized, leading to the quantal response mapping

QR_ϵ(u_i) := arg max_{π∈∆(A_i)} {⟨π, u_i⟩ − ϵ h(π)},    (15)

where h(·) is a penalty function or regularizer and ϵ is the regularization parameter. According to [38], a proper regularizer h(·) defined on the probability simplex should be continuous over the simplex and smooth on the relative interior of every face of the simplex. Besides, h should be a strongly convex function, and these assumptions ensure that QR_ϵ(·) always returns a unique maximizer. The mapping QR_ϵ is referred to as a quantal response mapping [39], which allows players to choose suboptimal actions with positive probability. To see how this regularization contributes to active exploration, consider the entropy regularizer h(x) = Σ_i x_i log x_i. In this case, QR_ϵ is

QR_ϵ(u_i)(a) = exp(u_i(a)/ϵ) / Σ_{a′∈A_i} exp(u_i(a′)/ϵ),   a ∈ A_i,    (16)

which is also known as the Boltzmann-Gibbs strategy mapping [40] or the soft-max function parameterized by ϵ > 0. On the one hand, the Boltzmann-Gibbs mapping produces a
strategy that assigns more weight to the actions leading to higher payoffs, that is, the larger
ui (a) = ui (a, π−i ) is, the larger QRϵ (ui )(a) becomes. On the other hand, it always retains
positive probabilities for every action, when ϵ > 0. Note that QRϵ can induce different levels
of exploration by adjusting the parameter ϵ. When ϵ tends to 0, the strategy (16) simply
returns the action that yields the highest payoff, implying that QRϵ reduces to the best
response mapping BRi (·) in (2). As ϵ gets larger, 1/ϵ tends to 0, and the strategy does not
distinguish among actions, leading to equal weights for all actions.
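The following sketch (with a hypothetical utility vector) shows the Boltzmann-Gibbs quantal response in (16) and how the exploration parameter ϵ interpolates between the best response and the uniform strategy:

```python
# A minimal sketch (hypothetical utility vector) of the Boltzmann-Gibbs quantal response in
# (16): small eps approximates the best response, large eps approaches the uniform strategy.
import numpy as np

def quantal_response(u, eps):
    z = u / eps
    z = z - z.max()                # numerical stabilization; does not change the result
    p = np.exp(z)
    return p / p.sum()

u = np.array([1.0, 0.5, 0.1])      # hypothetical utility vector u_i
for eps in (1e-3, 0.5, 100.0):
    print(eps, quantal_response(u, eps))
```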
Similar to the previous argument, with the noisy feedback, we replace u_i by the estimator û_i^k, and the definition of the quantal response mapping is modified accordingly as

QR_ϵ(û_i^k) := arg max_{π∈∆(A_i)} {⟨π, û_i^k⟩ − ϵ h(π)}.
Due to the active exploration brought up by QRϵ , we can consider an inertia-free reinforce-
ment learning scheme, where the choice map is simply the strategy learning policy QRϵ . The
corresponding strategy learning scheme is then π_i^{k+1} = QR_ϵ(û_i^{k+1}), where the score function û_i^k is updated according to the following [41]:

û_i^{k+1} = û_i^k + µ_i^k Û_i^k.    (17)
To recap, the learning algorithm operates in the following fashion: at each time k, an
unbiased estimator Ûki is constructed as introduced in (11), using importance sampling, and
the score function is updated according to (17). Then, the next strategy is produced by the
mapping QR_ϵ, acting on the score function û_i^{k+1}, as shown below:

û_i^{k+1} = û_i^k + µ_i^k Û_i^k,
π_i^{k+1} = QR_ϵ(û_i^{k+1}).    (DA-d)
(DA-d) is also referred to as dual averaging, pioneered by Nesterov [41], which was originally
proposed as a variant of gradient methods for solving convex programming problems. We
elucidate the term “dual averaging” later when we discuss the relation between dual averaging
and gradient play, where we demonstrate that (DA-d) can be viewed as a gradient-based
algorithm in finite games with ûki being the gradient. Finally, as a remark, we note that in
(DA-d), the score function is updated in a manner different than in best response dynamics
(BR-d). However, this is merely a matter of presentation, and by selecting a proper ϵ, the
moving average scheme (12) is essentially the same as the discounted accumulation (17),
for which we refer the reader to [41, 42]. By adopting the discounted accumulation (17), we
later can draw a connection between dual averaging and gradient play.
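As an illustration of (DA-d), the sketch below runs dual averaging with the entropy regularizer on matching pennies (a hypothetical example, not from the paper); for clarity, exact utility vectors replace the importance-sampled estimators Û_i^k, and the time-averaged strategies are reported:

```python
# A minimal sketch (matching pennies, not an example from the paper) of dual averaging
# (DA-d) with the entropy regularizer: scores accumulate as in (17) and are mapped to
# strategies by the soft-max quantal response (16). Exact utility vectors replace the
# importance-sampled estimators to keep the example short.
import numpy as np

U1 = np.array([[1.0, -1.0], [-1.0, 1.0]])   # player 1's payoffs
U2 = -U1.T                                   # player 2's payoffs (zero-sum)
eps = 1.0

def qr(score):                               # Boltzmann-Gibbs quantal response
    z = score / eps
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

score1, score2 = np.array([0.5, 0.0]), np.array([0.0, 0.3])   # asymmetric initial scores
avg1, avg2 = np.zeros(2), np.zeros(2)
K = 5000
for k in range(1, K + 1):
    mu = 1.0 / np.sqrt(k)                    # learning rate mu^k
    pi1, pi2 = qr(score1), qr(score2)
    score1 = score1 + mu * (U1 @ pi2)        # score update (17)
    score2 = score2 + mu * (U2 @ pi1)
    avg1 += pi1
    avg2 += pi2

print(avg1 / K, avg2 / K)   # time-averaged strategies close to the mixed equilibrium (1/2, 1/2)
```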
At first glance, the discrete-time system (DA-d) does not depict how π_i evolves in ∆(A_i), and it is not straightforward to tell how the good actions yielding higher payoffs are “reinforced” in the sense that the probabilities of choosing them increase as the learning process proceeds. In Appendix B, we show that, when choosing the entropy regularization,
(DA-d) is equivalent to the replicator dynamics, one of the well-known evolutionary dynamics
[43–45], which explicitly displays a gradual adjustment of strategies based on the quality of
each action. Meanwhile, with an example of population games, we show that this connection
brings learning in games to the broader context of evolutionary game theory [34, 44].
As we have mentioned, reinforcement learning is a continuum of learning algorithms, and
the best response dynamics (BR-d) and dual averaging (DA-d) are the two endpoints of the
continuum. Naturally, we can consider reinforcement learning methods with a blend of both
passive and active exploration, where the exploration stems from both the inertia term and
the strategy learning policy, as we present in the following.
Instead of choosing actions greedily, we replace the best response BR_i(·) in (14) by QR_ϵ(·), the quantal response for active exploration, and then we obtain the following strategy learning scheme [24]:

π_i^{k+1} = (1 − λ_i^k) π_i^k + λ_i^k QR_ϵ(û_i^k).
Similar to the best response dynamics in (BR-d), if utility learning follows the moving average
scheme in (11), the resulting reinforcement learning has the following discrete-time learning
dynamics
û_i^{k+1} = (1 − µ_i^k) û_i^k + µ_i^k Û_i^k,
π_i^{k+1} = (1 − λ_i^k) π_i^k + λ_i^k QR_ϵ(û_i^k).    (SBR-d)
Figure 4: Relationships of reinforcement learning algorithms. For 0 < λ < 1 and ϵ > 0, we
obtain the exploratory reinforcement learning: smoothed best response dynamics (SBR-d),
where exploration arises from both the inertia and the learning policy. If the active explo-
ration vanishes as ϵ goes to zero, smoothed best response reduces to best response dynamics
(BR-d), an example of exploitative reinforcement learning. By contrast, we obtain dual averaging (DA-d) if λ tends to 1. Finally, if ϵ goes to zero while λ tends to 1, players always
choose their actions greedily according to follow-the-leader policy.
learning (BR-d)(SBR-d) and dual averaging (DA-d) is through the corresponding continuous-
time learning dynamics in Section 3.4.1.
Even though (DA-d) is not an actor-critic learning, its trajectory is closely related to
that of (BR-d)(SBR-d). Intuitively speaking, dual averaging only differs from the smoothed best response in that (DA-d) does not include an inertia term, as the weight on the previous strategy is zero.
Hence, πik in (SBR-d) can be seen as the moving average of QRϵ (ûki ) in (DA-d). Therefore,
it is reasonable to expect that the time average of the trajectory produced by (DA-d) is
related to the one produced by the smoothed best response. This intuition has been verified
in [38, 51], where it has been shown that the time averaged trajectory of (DA-d) follows
(SBR-d) with a time-dependent perturbation ϵ(t).
Apart from the difference in the learning rates, learning algorithms also display distinct
asymptotic behavior due to the difference in the exploration parameter. The exploration
parameter ϵ has less drastic consequence under (DA-d) than under the actor-critic learning
(BR-d)(SBR-d). As observed in [38], adding a positive ϵ is equivalent to rescaling the
regularizer, that is, replacing h(·) with ϵh(·). As long as ϵ > 0, the regularization ϵh(·) is
still proper (see (15) and the following discussion). This implies that even though the choice
of ϵ affects the speed at which (DA-d) evolves, the qualitative results remain the same. We
refer the reader to [38, 52] for a detailed discussion. When ϵ = 0, there is neither exploration nor inertia for dual averaging, and in this case, players always choose their actions greedily according to the best response mapping

π_i^{k+1} ∈ arg max_{π∈∆(A_i)} ⟨π, û_i^k⟩,    (FTL)
where ûki is the score function of player i, based on its history of play up to round k, and it
can be updated following (11) or (17). In the online learning literature [15], (FTL) is known
as follow-the-leader (FTL) policy, which can also be obtained by eliminating the inertia term
in the best response dynamics (BR-d). Due to lack of exploration, (FTL) is too aggressive
and can be exploited by the adversary, resulting in a positive, non-diminishing regret [15].
The regret is a measure of the performance gap between the cumulative payoffs of current
policy (FTL) and that of the best policy in hindsight.
The exploration parameter plays a more important role in the actor-critic learning which
balances exploration and exploitation [31]. The smoothed best response (SBR-d), which is a
perturbed version of the best response, can only use the regularization ϵh(·) for encouraging
active exploration. Thanks to the positive exploration parameter, the smoothed best re-
sponse (SBR-d) enjoys an ϵ-no-regret property, a weak form of external consistency studied
in [51,53], which is desired in an adversarial environment [15]. In contrast, the best response
dynamics (BR-d), due to the myopic nature of the best response mapping (2), does not
possess similar properties.
a_i^{k+1} = proj_{A_i}[a_i^k + µ_i^k D_i^k] := arg min_{a∈A_i} ∥a_i^k + µ_i^k D_i^k − a∥_2^2,    (GD)
where projAi (·) is the Euclidean projection operator, and (GD) is called online gradient
descent or projected gradient descent [42]. One extensively studied variant of (GD) [42, 59]
Figure 5: Illustration of the difference between (GD) and (LGD). $a_i^k(\mathrm{GD})$ and $a_i^k(\mathrm{LGD})$ denote, respectively, the iterates generated by (GD) and (LGD). (LGD) first aggregates the gradient steps, and then projects the aggregate onto the primal space to generate a new iterate.
One extensively studied variant of (GD) [42, 59] is
$$Y_i^{k+1} = Y_i^k + \mu_i^k D_i^k, \qquad a_i^{k+1} = \operatorname{proj}_{A_i}\!\big(Y_i^{k+1}\big), \qquad \text{(LGD)}$$
where Yik is an auxiliary variable that aggregates the gradient steps. Such an algorithm
is referred to as the lazy gradient descent (LGD) [41], since the algorithm aggregates the
gradient steps “lazily”, without transporting them to the action space as (GD) does. We
illustrate the difference between the two algorithms in Figure 5. We note that, as both are based on the gradient descent idea, (LGD) and (GD) share the same asymptotic behavior [15], and the two coincide when $A_i$ is an affine subspace of $\mathbb{R}^n$.
Different from a purely primal-based algorithm, such as (GD), where the trajectory of
the algorithm only evolves in the primal space (the action space), (LGD) is a primal-dual
scheme, and the interplay between the primal variables (the actions $a_i^k$) and the dual variables (the gradients $D_i(a^k)$) is of great significance. The main idea of (LGD) is as follows. At the k-th round, each player computes the gradient $D_i(a^k)$ based on the knowledge of its utility function and observations of the opponents' moves. Subsequently, the player takes a step along this gradient in the dual space (where gradients live) and "mirrors" the output back to the primal space (the action space) using the Euclidean projection.
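The following sketch (again a toy Python illustration of ours, not code from the references) contrasts the two updates when $A_i$ is taken to be the probability simplex; `proj_simplex` is a standard Euclidean projection routine, and the payoff gradient is a placeholder supplied by the caller. The only difference is where the running state lives: (GD) keeps it in the action space, while (LGD) keeps it in the dual space.

```python
import numpy as np

def proj_simplex(y):
    """Euclidean projection of y onto the probability simplex."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, y.size + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

def gd_step(a_i, D_i, mu_k):
    """(GD): step in the primal space, then project."""
    return proj_simplex(a_i + mu_k * D_i)

def lgd_step(Y_i, D_i, mu_k):
    """(LGD): aggregate the gradient steps 'lazily' in the dual variable Y_i,
    and only then mirror the aggregate back to the action space."""
    Y_next = Y_i + mu_k * D_i
    return Y_next, proj_simplex(Y_next)
```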
Gradient based learning algorithms are further investigated in another paper of this
special issue in the context of generalized Nash equilibrium seeking [19]. In the following,
we present a generalization of (LGD): mirror descent [41, 59]. Starting with some arbitrary
initialization $Y_i^1$, the mirror descent scheme can be described via the recursion
$$Y_i^{k+1} = Y_i^k + \mu_i^k D_i^k, \qquad a_i^{k+1} = \mathrm{QR}_\epsilon\!\big(Y_i^{k+1}\big), \qquad \text{(MD)}$$
where QRϵ is the quantal response mapping in the context of the continuous game, defined
as
$$\mathrm{QR}_\epsilon(Y) = \arg\max_{a\in A_i}\big\{\langle Y, a\rangle - \epsilon\, h(a)\big\}.$$
When we choose the Euclidean regularizer, that is, $h(x) = \tfrac{1}{2}\|x\|_2^2$ and $\epsilon = 1$, $\mathrm{QR}_\epsilon$ reduces to the projection operator $\operatorname{proj}_{A_i}$. Geometrically, the gradient search step is
performed in the dual space, and then the primal update is produced by the mapping QRϵ .
Since QRϵ “mirrors” the gradient update in the dual space back to the primal space, it is
also referred to as the mirror map in the online optimization literature [15].
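As a concrete illustration (a sketch under two specific choices of regularizer, not code from [41, 42]), the two most common instances of $\mathrm{QR}_\epsilon$ can be written as follows: the entropic regularizer on the simplex yields the Boltzmann-Gibbs (softmax) map, while the Euclidean regularizer with $\epsilon = 1$ recovers the projection.

```python
import numpy as np

def qr_entropic(Y, eps):
    """QR_eps with the entropic regularizer h(a) = sum_j a_j*log(a_j) over the simplex:
    arg max_a { <Y, a> - eps*h(a) } = softmax(Y / eps)."""
    z = Y / eps
    z = z - z.max()                    # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def qr_euclidean(Y, proj):
    """QR_eps with h(a) = 0.5*||a||_2^2 and eps = 1 reduces to the Euclidean projection."""
    return proj(Y)
```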
Interestingly, the scheme (MD) is also connected to the principle of optimality in hindsight. Consider the action selection rule
$$a_i^{k+1} = \arg\max_{a\in A_i}\Big\{\sum_{\tau=1}^{k} u_i(a, a_{-i}^\tau) - \epsilon\, h(a)\Big\}, \qquad \text{(FTRL)}$$
where ϵh(·) is the regularization introduced in (15), encouraging exploration in the learning
process. Based on the optimality in hindsight, this action selection (FTRL) is known as
follow-the-regularized-leader (FTRL) [60]. Moreover, if $u_i$ is well-behaved in the sense that it can be approximated by its first-order Taylor expansion, that is, $u_i(a, a_{-i}^\tau) \approx u_i(a_i^\tau, a_{-i}^\tau) + \langle D_i(a^\tau), a - a_i^\tau\rangle$, then (FTRL) is equivalent to
$$a_i^{k+1} = \arg\max_{a\in A_i}\Big\{\sum_{\tau=1}^{k} \big\langle D_i(a^\tau), a\big\rangle - \epsilon\, h(a)\Big\} = \arg\max_{a\in A_i}\Big\{\Big\langle \sum_{\tau=1}^{k} D_i(a^\tau),\, a\Big\rangle - \epsilon\, h(a)\Big\} = \mathrm{QR}_\epsilon\Big(\sum_{\tau=1}^{k} D_i(a^\tau)\Big),$$
which is exactly the mirror descent scheme in (MD), except that (MD) aggregates the gradients, weighted by the learning rates $\mu_i^k$, in the auxiliary variable $Y_i^k$. In other words, by the first-order expansion, the sum of gradients living in the dual space serves as a linear functional for evaluating the quality of the actions. Hence, the sum, or equivalently $Y_i^k$, can be treated as a "score function", based on which the mirror map outputs a better action in hindsight, yielding a reinforcement procedure.
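The "score function" viewpoint translates into a few lines of code. The sketch below (illustrative only, with the entropic mirror map and a user-supplied gradient oracle as assumptions) accumulates payoff gradients into a score Y and lets the mirror map pick the next action.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()

def ftrl_play(gradient_oracle, n_actions, rounds, eps=0.1):
    """Linearized FTRL: Y^k accumulates past payoff gradients; a^{k+1} = QR_eps(Y^k)."""
    Y = np.zeros(n_actions)
    a = np.ones(n_actions) / n_actions
    for k in range(rounds):
        D = gradient_oracle(a)       # D_i(a^k): payoff gradient observed at round k
        Y += D                       # update the score (dual variable)
        a = softmax(Y / eps)         # mirror the score back to the action space
    return a
```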
[Figure: illustration of the choice map $\mathrm{QR}_\epsilon$, which maps score vectors $\hat u_i^k \in \mathbb{R}^{|A_i|}$ in the dual space to mixed strategies $\pi_i^k \in \Delta(A_i)$ in the primal space.]
pointing the reader to [37,58,61–63] for the treatment in continuous games. The discussion in
this subsection is primarily based on stochastic approximation theory and Lyapunov stabil-
ity theory [30,64], and a generic procedure of applying such analytical tools consists of three
steps: 1) develop the mean-field continuous-time dynamics using stochastic approximation
theory; 2) study the continuous-time learning dynamics using ODE methods, relating its Lya-
punov stability to Nash equilibria of the underlying game; 3) derive the convergence results
of discrete-time algorithms using asymptotic convergence of corresponding continuous-time
dynamics. Since the third step is a direct corollary of the results of the first and second steps, we articulate the first two steps in analyzing the asymptotic behaviors of reinforcement learning in the sequel. We refer the reader to Appendix C and references therein for details on
the relation between discrete-time trajectory and its continuous counterpart.
Recall the generic reinforcement learning scheme (8)-(9):
$$\hat u_i^{k+1} = (1-\mu_i^k)\,\hat u_i^k + \mu_i^k\, G_i^k\big(\pi_i^k, \hat u_i^k, U_i^k, a_i^k\big), \qquad \pi_i^{k+1} = (1-\lambda_i^k)\,\pi_i^k + \lambda_i^k\, F_i^k\big(\pi_i^k, \hat u_i^{k+1}, U_i^k, a_i^k\big).$$
In the following, the continuous-time dynamics associated with (8) and (9) is obtained via
stochastic approximation, which paves the way for the ODE-based convergence analysis.
We begin with a generic description of the learning dynamics under reinforcement learning,
and then we specify the learning dynamics corresponding to (BR-d), (DA-d), and (SBR-d). For
more details regarding stochastic approximation, we refer the reader to Appendix C and the
references therein.
For the sake of simplicity in exposition, we assume that learning policies in (8) and (9)
are time-invariant, denoted by Fi and Gi , respectively. When the learning policies are time-
variant, stochastic approximation theory still applies, and we refer the reader to [47] for
more details. Let the mean-field components of (8) and (9) be denoted by $f_i(\pi_i^k, \hat u_i^{k+1}) = \mathbb{E}\big[F_i(\pi_i^k, \hat u_i^{k+1}, U_i^k, a_i^k)\,\big|\,\mathcal{F}^{k-1}\big]$ and $g_i(\pi_i^k, \hat u_i^k) = \mathbb{E}\big[G_i(\pi_i^k, \hat u_i^k, U_i^k, a_i^k)\,\big|\,\mathcal{F}^{k-1}\big]$, respectively. We can then write down the following coupled differential equations
$$\frac{d\hat u_i(t)}{dt} = g_i\big(\pi_i(t), \hat u_i(t)\big), \qquad \frac{d\pi_i(t)}{dt} = f_i\big(\pi_i(t), \hat u_i(t)\big),$$
which are closely related to (8) and (9). By stochastic approximation theory (see Ap-
pendix C), the linear interpolations of the sequences $\{\pi_i^k\}$ and $\{\hat u_i^k\}$ are perturbed solutions to the differential equations above, which become arbitrarily close to the true solutions as
time goes to infinity. In other words, the convergence results of (8) and (9) can be obtained
by studying the limiting behavior of the associated differential equations.
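A minimal numerical illustration of this ODE viewpoint (our own toy example, assuming a single player whose opponents hold their strategies fixed so that the mean payoff is constant) compares the noisy utility-estimate recursion with an Euler discretization of its mean-field limit dû/dt = u − û:

```python
import numpy as np

rng = np.random.default_rng(0)
u_mean = 1.0                     # fixed expected payoff u_i(pi_{-i}) (toy assumption)
uhat_sa, uhat_ode = 0.0, 0.0
for k in range(1, 5001):
    mu = 1.0 / k                 # vanishing learning rate
    U = u_mean + rng.normal(scale=0.5)        # noisy payoff observation U_i^k
    uhat_sa  += mu * (U - uhat_sa)            # stochastic recursion, cf. (8)
    uhat_ode += mu * (u_mean - uhat_ode)      # Euler step of the mean-field ODE
print(uhat_sa, uhat_ode)         # both approach u_mean; their gap vanishes
```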
Following the same argument, the learning dynamics of the best response (BR-d) can be
written as
$$\frac{d\hat u_i(t)}{dt} = u_i\big(\pi_{-i}(t)\big) - \hat u_i(t), \qquad \frac{d\pi_i(t)}{dt} \in \mathrm{BR}_i\big(\hat u_i(t)\big) - \pi_i(t). \qquad \text{(BR-c)}$$
If the best response dynamics is adopted by every player, we can consider the continuous-time
dynamics of the strategy profile of all players π(t) = [π1 (t), π2 (t), . . . , πN (t)] under best re-
sponse. Denote the joint utility vector by u(π(t)) := [u1 (π−1 (t)), u2 (π−2 (t)), . . . , uN (π−N (t))],
and similarly, the joint estimated utility vector by û(t) := [û1 (t), û2 (t), . . . , ûN (t)]. Then,
for the strategy profile π(t), the continuous-time learning dynamics under the best response
algorithm is
$$\frac{d\hat u(t)}{dt} = u\big(\pi(t)\big) - \hat u(t), \qquad (18)$$
$$\frac{d\pi(t)}{dt} \in \mathrm{BR}\big(\hat u(t)\big) - \pi(t). \qquad (19)$$
From its associated learning dynamics, we can see that the best response algorithm (BR-d)
or equivalently its continuous-time mean-field dynamics (BR-c) is in fact an actor-critic learning scheme [31], where the estimate û(t) given by (18) plays the role of the critic, evaluating the performance of the current strategy profile, while the strategy update (19) plays the role of the actor that improves the strategy.
As observed in the literature [31], the performance of actor-critic learning relies on the quality of the evaluation provided by the critic. One approach to obtain a satisfying critic in learning
is to leverage the two-timescale idea [29], according to which (18) should operate at a faster
timescale than (19). Intuitively speaking, in order to obtain a û(t) that can approximately
evaluate the current strategy profile π(t), the player must wait until û(t) nearly converges
before it updates the strategy using (19). To analyze the convergence of the two-timescale
dynamics, one can study its equivalent single-timescale dynamics. Since the actor (18) runs
at a faster timescale, the system (18) and (19) can be “decoupled” in the following way: by
fixing π(t) = π, the faster timescale update (18) converges to u(π), where π is viewed as a parameter. Then, after the convergence of the fast dynamics to an equilibrium u(π), the slow dynamics (19) is set into motion, where û(t) is replaced by its equilibrium point u(π(t)), and the resulting learning dynamics is
$$\frac{d\pi(t)}{dt} \in \mathrm{BR}\big(\pi(t)\big) - \pi(t). \qquad (20)$$
As we illustrate in Appendix C, the coupled dynamics (18)-(19) and the single-timescale dynamics (20) share similar asymptotic behaviors. Hence, we can focus on the much simpler dynamics
(20) for the derivation of the convergence results. For more details about the two-timescale
learning and the derivation of the equivalent dynamics, we refer the reader to Appendix C
and references therein.
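The decoupling argument can be mimicked numerically. The sketch below (a toy discretization we add for illustration, not the algorithm of any cited reference) runs the payoff estimate on a faster step size than the strategy update in a 2 × 2 zero-sum game, where best response dynamics is known to converge to the mixed equilibrium:

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])    # matching pennies: player 2's payoff is -A

def br(uhat):
    """Best response to an estimated action-payoff vector (a vertex of the simplex)."""
    e = np.zeros_like(uhat)
    e[np.argmax(uhat)] = 1.0
    return e

pi1, pi2 = np.array([0.9, 0.1]), np.array([0.2, 0.8])
u1, u2 = A @ pi2, -(A.T @ pi1)               # initial payoff estimates
for k in range(1, 50001):
    mu, lam = k ** -0.6, 1.0 / k             # fast estimate, cf. (18); slow strategy, cf. (19)
    u1 += mu * (A @ pi2 - u1)                # track player 1's expected action payoffs
    u2 += mu * (-(A.T @ pi1) - u2)           # track player 2's expected action payoffs
    pi1 += lam * (br(u1) - pi1)              # move toward the best response
    pi2 += lam * (br(u2) - pi2)
print(pi1, pi2)                              # both drift toward the mixed equilibrium (0.5, 0.5)
```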
Applying the same argument to the smoothed best response (SBR-d), we obtain
$$\frac{d\hat u_i(t)}{dt} = u_i\big(\pi_{-i}(t)\big) - \hat u_i(t), \qquad \frac{d\pi_i(t)}{dt} = \mathrm{QR}_\epsilon\big(\hat u_i(t)\big) - \pi_i(t), \qquad \text{(SBR-c)}$$
and its equivalent dynamics regarding the joint strategy profile is
$$\frac{d\pi(t)}{dt} = \mathrm{QR}_\epsilon\big(u(\pi(t))\big) - \pi(t). \qquad (21)$$
Different from the best response (BR-d) and the smoothed best response (SBR-d), dual
averaging (DA-d) does not belong to the class of actor-critic methods. To see this, let us
write down its continuous-time dynamics
$$\frac{d\hat u_i(t)}{dt} = u_i\big(\pi_{-i}(t)\big), \qquad \pi_i(t) = \mathrm{QR}_\epsilon\big(\hat u_i(t)\big). \qquad \text{(DA-c)}$$
Similar to the previous argument, the learning dynamics for the strategy profile is
$$\frac{d\hat u(t)}{dt} = u\big(\pi(t)\big), \qquad \pi(t) = \mathrm{QR}_\epsilon\big(\hat u(t)\big), \qquad \text{(DA)}$$
where the dynamics of û(t) does not produce an approximation of u(π(t)). Instead, it gives the cumulative payoff: $\hat u(t) = \int_0^t u(\pi(\tau))\, d\tau + \hat u(0)$. It is straightforward to see
that as there is only one differential equation in (DA), the resulting autonomous dynamical
system is only related to û(t). Hence, there is no additional dynamics regarding the strategy
update, which makes (DA) fundamentally different from (BR-c) and (SBR-c).
Dual Averaging Consider the learning dynamics of the joint strategy profile and the
estimated utility vector under dual averaging
$$\frac{d\hat u(t)}{dt} = u\big(\pi(t)\big), \qquad \pi(t) = \mathrm{QR}_\epsilon\big(\hat u(t)\big). \qquad \text{(DA)}$$
This compact form implies that (DA) is an autonomous system evolving in the dual space.
Here, similar to the discussion in Section 3.3, we adopt the terminology in [41, 42], where
the gradient u(π(t)) is the dual variable and the corresponding space is termed the dual
space. As shown in [38], (DA) is a well-posed dynamical system in the dual space in that it
admits a unique global solution for every initial û(0). Furthermore, it can be shown that the
dynamics of π(t) on the game’s strategy space induced by (DA) under steep regularizers is
also well-posed [38, 52]. However, the well-posedness of the induced dynamics under generic
regularizers remains unclear [38]. The reason lies in the fact that under steep regularizers,
such as the entropy regularizer, the projected dynamics regarding π(t) evolves within the
interior of the simplex, and the resulting ODE is also well posed in the primal space, which
need not hold for nonsteep regularizers. For more generic choices of QR and related stability
analysis, we refer the reader to [38].
Even though studying the stability of the induced dynamics in the primal space may not
be viable due to the well-posedness issue, the asymptotic behavior of π(t) can be character-
ized by investigating its dual û(t). Toward that end, we call π(t) = QRϵ (û(t)) the induced
orbit of (DA), or simply the orbit, and we introduce the following notions regarding the stability and stationarity of π(t), which are adapted from [38].
Definition 4 Denote by $\mathrm{im}(\mathrm{QR}_\epsilon)$ the image of $\mathrm{QR}_\epsilon$. For an orbit $\pi(t) = \mathrm{QR}_\epsilon(\hat u(t))$ of (DA), we say that a fixed $\pi^* \in \prod_{i\in\mathcal{N}} \Delta(A_i)$ is
4. globally attracting, if $\pi^*$ is attracting with the attracting basin being the entire image $\mathrm{im}(\mathrm{QR}_\epsilon)$;
Similar to the Folk Theorem of evolutionary game theory [34], there is an equivalence
between the stationary points of (DA) and the Nash equilibria [34, 38]: any stationary point
is a Nash equilibrium and conversely, every Nash equilibrium that is within the image of the
mirror map (15) is a stationary point. In addition to the relation between Nash equilibrium
and the stationary point, another important question is the following:
Are Nash equilibria of the underlying game (globally) asymptotically stable under (DA)?
To answer this question, we shall revisit the variational characterization of Nash equilibrium,
which bridges the equilibrium concepts associated with two different mathematical models:
games and dynamical systems. Recall that the Nash equilibrium is equivalent to the solution
of the variational inequality
$$\langle u(\pi^*), \pi - \pi^*\rangle \le 0, \qquad \text{for all } \pi \in \prod_{i\in\mathcal{N}}\Delta(A_i). \qquad \text{(SVI)}$$
Since the utility function $u_i(\pi_i, \pi_{-i})$ is linear in $\pi_i$, the Stampacchia-type variational inequality (SVI) is equivalent to the following Minty-type variational inequality
$$\langle u(\pi), \pi - \pi^*\rangle \le 0, \qquad \text{for all } \pi \in \prod_{i\in\mathcal{N}}\Delta(A_i), \qquad \text{(MVI)}$$
which implies that the Nash equilibrium π ∗ is the solution to (MVI) [20]. Then, to answer
the question of interest, it suffices to investigate whether the solution to (MVI) is attracting
under (DA). As discussed in [61], the answer is negative: not every Nash equilibrium of an N-player general-sum game is attracting. To ensure the convergence of (DA), an additional
condition has to be imposed on (MVI).
Definition 5 (Variational Stability [38]) $\pi^*$ is said to be variationally stable if there exists a neighborhood $U$ of $\pi^*$ such that
$$\langle u(\pi), \pi - \pi^*\rangle \le 0 \quad \text{for all } \pi \in U, \text{ with equality if and only if } \pi = \pi^*. \qquad \text{(VS)}$$
What has been presented above provides a generic criterion for examining the convergence
of gradient-based dynamics (DA), and in the following, based on the notion of variational
stability, we discuss some concrete cases, where the learning dynamics converges either locally
or globally to Nash equilibria. As shown in [37], for any finite game, every strict Nash equilibrium satisfies (VS) and hence is a LASNE. Therefore, every strict Nash equilibrium in finite games is locally attracting. On the other hand, to ensure global convergence, the underlying Nash equilibrium has to be a GASNE or, equivalently, satisfy global variational stability. For finite games, the existence of a potential implies monotonicity, which further implies the existence of globally variationally stable Nash equilibria [37]. Hence, for potential games [38, 65] and monotone games [37, 66], regardless of the initial points, the orbit of (DA) always converges to the set of Nash equilibria. We summarize our discussions in the following,
where 1) and 2) are direct extensions of the folk theorem of evolutionary dynamics [34], while
3)-5) are corollaries of variational characterization of Nash equilibria in [38] and [61].
For every finite game, we have the following characterization of Nash equilibria using the language of Lyapunov stability [38, 52]. For a fixed $\pi^* \in \prod_{i\in\mathcal{N}}\Delta(A_i)$,
3. if $\pi^*$ is a Nash equilibrium and it falls within the image of the mirror map, then it is stationary;
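As a sanity check of these statements (our own toy computation, not taken from [37, 38]), one can verify numerically that a strict equilibrium of a 2 × 2 coordination game, a potential game, satisfies the variational stability inequality in a neighborhood, and that an Euler discretization of (DA) with the entropic mirror map converges to a pure equilibrium of the same game:

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0]])       # identical-interest coordination game
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

def v(pi1, pi2):                              # concatenated payoff vectors u(pi)
    return np.concatenate([A @ pi2, A.T @ pi1])

pi_star = np.concatenate([[1.0, 0.0], [1.0, 0.0]])   # a strict NE (both play action 1)

# (VS) check: <u(pi), pi - pi*> <= 0 for pi sampled in a neighborhood of pi*
rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.uniform(0.6, 1.0, size=2)
    pi = np.concatenate([[x, 1 - x], [y, 1 - y]])
    assert v(pi[:2], pi[2:]) @ (pi - pi_star) <= 1e-12

# (DA) orbit: d uhat/dt = u(pi(t)), pi(t) = QR_eps(uhat(t)), Euler-discretized
eps, dt = 0.1, 0.05
Y1, Y2 = np.array([0.0, 0.2]), np.array([0.0, 0.1])  # initial scores favor action 2
for _ in range(4000):
    p1, p2 = softmax(Y1 / eps), softmax(Y2 / eps)
    Y1 += dt * (A @ p2)
    Y2 += dt * (A.T @ p1)
print(softmax(Y1 / eps), softmax(Y2 / eps))   # the induced orbit settles at a pure NE
```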
Best response dynamics The analysis of the best response dynamics (20) is more in-
volved than that of dual averaging (DA). The theoretical challenge is mainly due to the
discontinuous, set-valued nature of the best response mapping (2). In general, as a differ-
ential inclusion, (20) typically admits non-unique solutions through every initial point [30].
Early works have established the convergence results on (20) for games with special struc-
tures: best response dynamics converges to Nash equilibrium in zero-sum games, where the
Nash equilibrium is essentially a saddle point [33, 62, 67], in two-player strictly supermodu-
lar games [44], and in finite potential games [30, 33]. However, we note that these works, even though most of them still rely on a Lyapunov argument [30, 33, 62, 67], do not directly reveal any generic relation between Lyapunov stability and Nash equilibria in general multi-player nonzero-sum games, and they are more or less ad hoc.
Recent endeavors on the study of best response dynamics have shed some light on its asymptotic behavior by relating the best response vector field BR(π) − π to the gradient field u(π), which renders the best response dynamics in some potential games [68, 69] an approximation of a gradient-based dynamical system [68].
For the finite potential games considered in [68], additional regularity conditions are imposed,
which are closely related to the notion of variational stability introduced above. Therefore,
the variational characterization of Nash equilibria and variational stability becomes relevant under the best response dynamics. Following this line of reasoning, it is shown in [68], in
regular potential games, that the best response dynamics is well-posed for almost every
initial condition, and converges to the set of Nash equilibria.
Smoothed Best Response As we can see from the explicit expression, smoothed best
response dynamics (21) only differs from the best response dynamics (20) in the operator
QRϵ (·), which serves as a perturbed best response [70], and the perturbation is determined
by ϵ [51]. Hence, if ϵ tends to zero, it is straightforward to see that the smoothed best
response (21) will enjoy the same asymptotic property as the best response (20), which
implies that identical results should also be achievable for smoothed best response with
vanishing exploration. This intuition has been verified in [46, 63], where smoothed best
response (21) is shown to converge in zero-sum games, potential games and supermodular
games.
On the other hand, with a constant ϵ, it is not realistic to expect the smoothed best response, essentially a fixed-point iteration, to always converge to an exact Nash equilibrium.
Hence, a new equilibrium concept has been introduced in the literature, which is termed
perturbed Nash equilibrium in [71, 72] or Nash distribution in [32, 50]. The new equilibrium
is defined as the fixed point of the smoothed best response. We do not carry out a detailed discussion of it in this paper, since the convergence analysis still rests on the standard Lyapunov argument, and the epistemic justification of such an equilibrium [24, 33] is beyond
the scope of this paper. We refer the reader to [24, 50, 63, 72] for a rigorous treatment of the
smoothed best response.
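A small Euler discretization of (21) illustrates this point (a toy example of ours with a fixed ϵ and the Boltzmann-Gibbs choice map, not an implementation from [46, 63]): in a 2 × 2 coordination game, the smoothed dynamics settles at a perturbed equilibrium close to, but not exactly at, the pure Nash equilibrium.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0]])       # identical-interest coordination game
eps, dt = 0.25, 0.05
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

pi1, pi2 = np.array([0.8, 0.2]), np.array([0.7, 0.3])
for _ in range(5000):
    # d pi_i/dt = QR_eps(u_i(pi_{-i})) - pi_i, Euler-discretized
    pi1 = pi1 + dt * (softmax((A @ pi2) / eps) - pi1)
    pi2 = pi2 + dt * (softmax((A.T @ pi1) / eps) - pi2)
print(pi1, pi2)   # a Nash distribution: near, but not exactly, the pure equilibrium (1, 0)
```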
This mood-based trial and error is different from the reinforcement learning introduced in the previous subsection, in that the exploration is not determined explicitly by a score function and a choice mapping. Hence, LTE does not fit the stochastic approximation framework
introduced above, and instead, the associated convergence proof relies on perturbed Markov
Chain theory [73, 76]. It is shown in [74] that in a two-player finite game, if there exists at least one pure Nash equilibrium, then LTE guarantees that a pure Nash equilibrium is played at least a fraction 1 − ϵ of the time, where ϵ is the probability of exploring new strategies. For an N-player finite game, if the game is interdependent [74] and there exists at least one pure Nash equilibrium, the same guarantee as in the two-player case holds. It is not
surprising that LTE does not achieve convergence in conventional ways, that is, almost sure
convergence and convergence in the mean, since players will always explore new strategies
with positive probability at least ϵ. The proposed learning method and its variants have
also been applied to learning efficient equilibrium [77] (Pareto dominant, maximizing social
welfare), learning efficient correlated equilibria [78], achieving the Pareto optimality [79] and
other related works in engineering applications, especially in cognitive radio problems [28].
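To give a flavor of the mechanism (a deliberately simplified caricature in the spirit of the simplified trial-and-error scheme analyzed in [80]; the full LTE dynamics in [74] additionally keeps track of moods), each player holds on to a benchmark action and payoff, experiments with probability ϵ, and adopts the experiment only if it improves upon the benchmark:

```python
import numpy as np

def trial_and_error(payoff, n_actions, rounds, eps=0.05, seed=0):
    """Simplified trial-and-error learning for one player; `payoff(a)` is an assumed
    callable returning the realized payoff of action a in the current environment."""
    rng = np.random.default_rng(seed)
    bench_a = int(rng.integers(n_actions))     # benchmark action
    bench_u = payoff(bench_a)                  # benchmark payoff
    history = []
    for _ in range(rounds):
        if rng.random() < eps:                 # experiment with probability eps
            a = int(rng.integers(n_actions))
            u = payoff(a)
            if u > bench_u:                    # adopt the experiment only if it improves
                bench_a, bench_u = a, u
        else:                                  # otherwise play the benchmark action
            a = bench_a
        history.append(a)
    return bench_a, history
```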
The idea of trial and error in LTE leads to many important variants, such as sample
experimentation dynamics in [76] and optimal dynamical learning [75,79], which also rely on
perturbed Markov processes for equilibrium seeking. Even though the convergence results
of these algorithms all rest on Markov chain (MC) theory [73], the analysis of their performance remains unclear, due to the computational complexity of the inherent MC generated by these algorithms. To circumvent the dimensionality issue regarding the number of states
in the original MC, an approximation-based dimension reduction method is proposed in [75],
which allows numerical convergence analysis for LTE and its variants based on Monte Carlo
simulations. We also note that a much simplified trial-and-error algorithm has been theoretically analyzed in [80], where the optimal exploration rate is identified and the associated convergence rate is discussed. It is not unrealistic to expect that a similar argument may
apply to LTE and its variants, but the technical challenges regarding the dimensionality
should not be downplayed.
modeling using deep learning methods, involving automatically discovering and learning the
patterns of input data in such a way that newly generated examples output by the generative
model (generator) cannot be distinguished from the input. In game-theoretic language, the
training process of GAN is essentially a learning process in a zero-sum game between the
generator and the discriminator, where the generator tries to generate new samples that
plausibly could have been drawn from the original dataset, while the discriminator tries to pick out the fake samples produced by the generator. We do not intend to provide a comprehensive survey of these machine learning applications; instead, we refer the reader to [81, 82].
Despite different contexts under which the learning theory is studied, recent research
efforts mainly revolve around the following three aspects:
The first research direction is a natural follow-up to the study of evolutionary dynamics
[34, 44], which aims to bring learning in games to a broad range of ML applications, since in
ML, the game structure is specified by the underlying data and may not enjoy any desired
properties. Recall that the convergence results and asymptotic behaviors of the three dynamics (BR-c), (SBR-c), and (DA-c) are discussed under the assumption that the underlying game possesses special structures, such as potential games, supermodular games, and zero-sum
games. However, for games with fewer assumptions on the utility function, there is still
a lack of understanding of the dynamics and the limiting behavior of learning algorithms.
One of the central questions in this direction concerns the relations between Nash equilibria and the stationary points, as well as the attracting sets, of the learning dynamics. Recent
attempts try to answer this question from a variational perspective [85], and provide various
characterizations of Nash equilibria with desired properties under gradient-based dynamics
[52,61,86]. Furthermore, considering its applications in ML problems, learning algorithms in
stochastic settings are of great significance in recent studies, and we refer the reader to [61,87]
for more details as well as to [17] for an introduction to stochastic Nash equilibrium seeking.
The second research direction, which attracts attention from the ML community, the
optimization community as well as the control community, is directly related to the design
of ML algorithms. The goal is to develop acceleration techniques that improve the per-
formance of learning algorithms. Based on the understanding of first-order gradient-based dynamics in games, such as (GD) and (LGD), recent research efforts have focused on higher-order gradient methods, which can be traced back to Nesterov's momentum idea [42], and researchers endeavor to propose a general framework that generalizes momentum for generating accelerated gradient-based algorithms [85]. On account of the close relationships among Nash equilibria, variational problems, and dynamical systems [20], one approach to developing acceleration is to generalize the concept of momentum by formulating equilibrium seeking as a variational (optimization) problem [20, 88], and then to investigate acceleration methods within the optimization context using, for example, variational analysis [85], extragradient methods [88], and differential equations [89]. In addition to these research works,
we refer the reader to [19] for a review on the optimization-based approach. On the other
hand, as depicted in Figure 3, a learning process in general is a feedback system, and it is not
surprising that control theory can play a part in designing the acceleration. For example,
recent studies on reinforcement learning demonstrate that passivity-based control theory can
be leveraged in designing high-order learning algorithms [66, 90], where the learning rule is
treated as the control law to be designed. Another paper [91] promotes the use of memory in
best response maps to accelerate convergence in Nash seeking, and demonstrates substantial
improvements in doing so. In addition to the mentioned references, we further refer the
reader to [92] for a review of control-theoretic approaches to distributed Nash equilibrium seeking, and to [93] for the use of extremum seeking in the learning process.
The recent advance on the third research direction is in part driven by multi-agent re-
inforcement learning and its applications such as multi-agent robotic control [10, 94, 95].
Different from the first two directions where the learning dynamics is primarily studied in
the context of repeated games, the third research direction focuses on games with dynamic
information (see Section 2.2). In this context, the appropriate learning objective, out of
practical consideration [16], is to obtain stationary strategies that are subgame perfect [96]
(see Section 2.2 for the definition of subgame perfectness). Different from the first two directions, where the change in payoffs caused by a certain action comes entirely from the opponents' moves, in dynamic games the feedback each player receives depends not only on the other players' moves but also on the dynamic environment. Moreover, when making decisions at each state, players have to trade off the current stage payoff against estimated future payoffs while forming predictions of the opponents' strategies. This dynamic trade-off makes the analysis of learning in stochastic games challenging [97].
Earlier works on seeking such Markov perfect Nash equilibria are largely based on dynamic programming [98, 99], which requires global information feedback, a restrictive assumption in practice. Recent efforts focus on various approaches to relax this requirement.
Currently, there are mainly three lines of research regarding learning in dynamic games. The
first approach is to extend learning dynamics in repeated games to dynamic games. Built
upon similar ideas in best response dynamics (BR-d), two-timescale best response dynamics
for zero-sum Markov games have been considered in [97, 100], while gradient play has been investigated in linear-quadratic dynamic zero-sum games [61, 101, 102]. The key challenge in this approach, particularly in the case of Markov games, is to properly construct the score function, which balances current stage payoffs and future payoffs, and
we refer the reader to the mentioned references for more details and to [81] for an overview.
The second approach is to extend learning methods in single-agent Markov decision pro-
cess to Markov games. However, the direct extension of methods such as Q-learning [103], policy gradient [31], and actor-critic [49] often fails to deliver the desired results due to the non-stationarity issue [104]. One natural way to overcome the non-stationarity issue is to allow players to exchange information with their neighbors [105, 106], which enables them to jointly identify the non-stationarity created by the dynamic environment. For more details regarding this approach, we refer the reader to recent reviews [81, 104]. Finally, the third approach
is about a unilateral viewpoint of dynamic games. Different from the first two approaches
where learning processes are still investigated in a competitive environment, the third one
interprets learning in Markov games as an online optimization problem [107,108], where play-
ers independently make decisions based on the received feedback. This approach accounts for fully decentralized learning, where, from each player's perspective, the other players are
considered as part of the environment. The key idea of this approach is to leverage the
regret minimization technique [15], which has led to many successes in solving extensive
form games of incomplete information [109]. Despite recent advances regarding the first two
approaches [61, 81, 97, 100, 110] and positive results for the last one [107, 108, 111], we still lack a unified framework and a thorough understanding of the learning process in general Markov games, which remains an open area for researchers from diverse communities.
Figure 7: The next generation of communication network: macrocells (bands < 3 GHz);
small cells (millimeter-wave); femtocells and Wi-Fi (millimeter-wave); massive multiple-
input, multiple-output with beamforming; and device-to-device (D2D) and machine-to-
machine (M2M) communications. Solid arrows indicate wireless links, whereas the dashed
arrows indicate backhaul links.
• increased indoor and small cell/hotspot traffic, which will make up the majority of
mobile traffic volume, leading to complex network structures;
• improved energy consumption or efficient power control for reducing carbon footprint.
From a system science perspective, these requirements impose a large-scale, time-variant,
and heterogeneous network topology on modern wireless communication systems, as shown
in Figure 7. Hence, it is impractical to manage/secure the wireless communications network
centrally. Game-theoretic learning provides a scalable distributed solution with adaptive
attributes to deal with this challenge. In the following, we take the dynamic secure routing
mechanism as an example to illustrate how game-theoretic learning contributes to a resilient
and agile communication system.
Security of routing in a distributed cognitive radio (CR) network is a prime issue, as the
routing may be compromised by unknown attacks, malicious behaviors, and unintentional
misconfigurations, which makes it inherently fragile. Even with appropriate cryptographic
techniques, routing in CR networks is still vulnerable to attacks in the physical layer, which
can critically compromise performance and reliability. Most of the existing work focuses on
the resource allocation perspective, which fails to capture the user’s lack of knowledge of
the attacker due to the distributed mechanism. To address these issues, [113] provides a
learning-based secure scheme, which allows the network to defend against unknown attacks
with a minimum level of deterioration in performance.
Consider Gw := (Nw , Ew ), which is a topology graph for a multi-hop CR network, where
Nw = {n1 , n2 , ..., nN } is a set of secondary users, and Ew is a set of links connecting these
users. The system state s indicates whether the primary users occupy nodes. The objective
of the secondary user is to find an optimal path to its destination. In multi-hop routing, a
secondary user ni starts with exploring neighboring nodes that are not occupied and then
chooses a node among them to which the user routes data. The selected node initializes
another exploration process for discovering the next node, and the same process is repeated
until the destination is reached.
Let Pi (0, Li ) := {(ni , li ), li ∈ {0, 1, 2, ..., Li }} be the multi-hop path from the node ni to
its destination, where Li is the total number of explorations until it reaches its destination.
Suppose there are J jammers in the network, the set of which is given by J := {1, 2, ..., J}.
Let Rj , j ∈ J , be the set of nodes under the influence of jammer j. Denote the joint action
of the jammers by r = [rj ]j∈J , where rj ∈ Rj . A zero-sum game formulation is proposed
in [113], where the secondary users aim to find an optimal routing path by selecting Pi (0, Li ),
while the jammers aim to compromise the data transmission by choosing r. The expected
utility function is
"L #
X i
(n ,l ) (n ,l )
Es [ui (s, Pi (0, Li ), r)] = −Es ln q(nii,lii−1) + λτ(nii,lii−1) ,
li =1
(n ,l )
where q(nii,lii−1) is the probability of successful transmissions from node (ni , li − 1) to node
(n ,l )
(ni , li ), and λ(nii ,lii −1) is the transmission delay between these two nodes. Here, the expectation
Es [·] is taken over all the possible system states.
Due to a lack of complete knowledge of adversaries and payoff structures, Boltzmann-
Gibbs reinforcement learning (SBR-d) is utilized to find the optimal path because of its
capability of estimating the expected utility. The resulting secure routing algorithm can
spatially circumvent jammers along the routing path and learn to defend against malicious
attackers as the state changes. As shown in Figure 8, the routing path generated from
the proposed routing algorithm in [113, 129] can avoid the nodes compromised by the jam-
mers. Thus, the routing algorithm stemming from the proposed game-theoretic formulation
provides more resilience, security, and agility than the ad-hoc on-demand distance vector
(AODV) algorithm, as AODV fails to dynamically adjust the routing path in the case of
a malicious attack. Moreover, the proposed routing algorithm can reduce the delay time
incurred by the attack due to its adaptive and dynamic feature, and hence, is more efficient
than AODV.
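For illustration, the next-hop selection rule can be sketched as follows (a hypothetical sketch of ours, not the exact algorithm of [113]): each node keeps a score û for every available neighbor, picks the next hop through the Boltzmann-Gibbs map, and reinforces the score with the realized hop payoff.

```python
import numpy as np

def boltzmann_gibbs(scores, eps):
    """Boltzmann-Gibbs (smoothed best response) distribution over candidate next hops."""
    z = np.asarray(scores, dtype=float) / eps
    z -= z.max()
    w = np.exp(z)
    return w / w.sum()

def select_next_hop(uhat, eps, rng):
    """Sample a next hop from the smoothed best response over estimated utilities."""
    return rng.choice(len(uhat), p=boltzmann_gibbs(uhat, eps))

def update_score(uhat, hop, realized_payoff, mu):
    """Reinforce the chosen hop's score with the realized payoff, in the spirit of (SBR-d)."""
    uhat[hop] += mu * (realized_payoff - uhat[hop])
    return uhat
```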
Figure 8: Illustration of a random network topology for 500 secondary users with a source
(S) and a destination (D), and routes of AODV and the proposed secure routing algorithm
in a 2 km by 2 km area. The PU footprint denotes the set of nodes unavailable to secondary
users. Without an attacker, AODV establishes the route path (a), described by the solid
line, while the route path (b), the blue dashed line, is generated by the Boltzmann-Gibbs
learning method. Even though the AODV path is the shortest path between the source
and the destination, it is disrupted by malicious attacks. By contrast, the learning method
can develop a new route path (c) that circumvents jammers, leading to a resilient routing
mechanism.
the integration of microgrids can enhance the stability, resiliency, and reliability of the power
system, as they can operate autonomously and independently of the main power grid. Such
integration, together with smart meters and appliances, produces the so-called smart grid,
a modern infrastructure for the reliable delivery of electricity.
The future smart grid is envisioned as a large-scale cyber-physical system comprising
advanced power, communications, control, and computing technologies. To accommodate
these technologies employed by different parties in the grid and to ensure an efficient and ro-
bust operation of such heterogeneous and large-scale cyber-physical systems, game-theoretic
methods have been widely employed in smart grid management problems. In the grid, mi-
crogrids are modeled as self-interested players who can operate, communicate, and interact
autonomously to deliver power and electricity to their consumers efficiently. Here, we discuss
a microgrid management mechanism developed in [117], which is built on game-theoretic learning and enables autonomous management of renewable resources.
The system model considered in [117] includes the generators, microgrids, and communi-
cations. As shown in Figure 10, generators in the upper layer determine the amount of power
to be generated, along with the electricity price, and send them to the bottom layer. A microgrid can generate renewable energy and make decisions by responding to the strategies of the generators and other microgrids to optimize its payoff, as specified in the following
Figure 9: The integration of microgrids. A microgrid consists of a controller, consumers,
generators, and energy storage. In the grid, microgrids can either be connected to the main
grid or other microgrids, and these networked microgrids can operate, communicate, and
interact autonomously to deliver power and electricity to their consumers efficiently.
game-theoretic model.
Let $\mathcal{N}_d = \{r, 1, 2, \ldots, N_d\}$ be the set of $N_d + 1$ buses in a power grid, where r denotes the slack bus. Assume that the smart grid is composed of load buses and generator buses, and let $p_i^g$, $p_i^l$, and $\theta_i$ be, respectively, the power generation, power load, and voltage angle at the i-th bus. Note that the active power injection at the i-th bus satisfies
$$p_i = p_i^g - p_i^l, \qquad \forall\, i \in \mathcal{N}_d,$$
while the power balance of the grid gives $\sum_{i\in\mathcal{N}_d} p_i^g = \sum_{i\in\mathcal{N}_d} p_i^l$. Let $\mathcal{N} := \{1, 2, \ldots, N\} \subseteq \mathcal{N}_d$ be the
set of N buses that can generate renewable energies, such as wind power, solar power, etc.
In the game considered in [117], the utility function of the i-th bus measures not only
economic factors related to power generation but also the efficiency of the microgrids. Before
giving the mathematical definition of the utility function, we first introduce the following
notations. Let ci be the unit cost of generated power for the i-th player, and c the unit
price of renewable energy for sale defined by the power market; $c_i$ and $c$ are quantities relevant to the profit gained by the bus. For the efficiency part, denote by $r_i$ a weighting parameter that measures the importance of regulating the voltage angle at the i-th bus. Further, let $[s_{ij}]_{i,j\in\mathcal{N}_d} = -[b_{ij}]^{-1}_{i,j\in\mathcal{N}_d}$, where $b_{ij}$ is the imaginary part of the element $(i, j)$ of the admittance matrix of the power grid. Moreover, each microgrid has a maximum generation, denoted by
$\bar p_i^g$. Finally, we note that, as a physical constraint, $[s_{ij}]$ and $[p_i]$ satisfy (23) due to the power flow equation [117]
$$\sum_{j\in\mathcal{N}_d\setminus\mathcal{N}} s_{ij}\, p_j + \sum_{j\in\mathcal{N},\, j\neq i} s_{ij}\, p_j = \theta_i - s_{ii}\, p_i, \qquad \forall\, i \in \mathcal{N}, \qquad (23)$$
where θi is the voltage angle of the i-th bus. With all the notations above, the utility function
of the i-th bus is defined as
$$u_i(p_i^g, p_{-i}^g) := -\,c_i\, p_i^g - c\left(p_i^l - p_i^g\right) - \frac{1}{2}\, r_i^2 \left(\sum_{j\in\mathcal{N}_d} s_{ij}\, p_j\right), \qquad 0 \le p_i^g \le \bar p_i^g,\ \ i \in \mathcal{N}.$$
Figure 10: Smart grid hierarchy model. The upper layer containing conventional generators
forms a generator network, and the distributed renewable energy generators in the bottom
layer constitute the microgrid network; the information exchange between the two layers, such as the electricity market price and the amount of power generation, takes place through the communication network layer in the middle.
Three learning methods are proposed in the paper to seek the Nash equilibrium, all
based on best response dynamics (10). The first two algorithms are parallel-update algo-
rithm (PUA) and the random-update algorithm (RUA) studied in [112]. PUA is essentially the best response algorithm presented in (10), with the learning rate $\lambda_i^k$ set to zero for all i and with all players updating their strategies in parallel. As its name suggests, RUA incorporates
randomness into the best response algorithm, resulting in an ϵ-greedy best response algo-
rithm: players update their strategies according to (10) with probability 1 − ϵ, where ϵ ∈ (0, 1), and retain their previous strategies otherwise. When ϵ = 0, players constantly update their
strategies in every round; in this case, RUA reduces to PUA.
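In pseudocode, one RUA round can be sketched as follows (an illustrative snippet of ours; `best_response_i` stands for the best response computation in (10), which for PUA/RUA requires the global grid information discussed next):

```python
import numpy as np

def rua_round(p_g, best_response_i, eps, rng):
    """One random-update (RUA) round: every microgrid updates to its best response
    with probability 1 - eps and keeps its previous generation level otherwise.
    Setting eps = 0 recovers the parallel-update algorithm (PUA)."""
    p_new = p_g.copy()
    for i in range(len(p_g)):
        if rng.random() < 1.0 - eps:
            p_new[i] = best_response_i(i, p_g)   # best response to the current profile
    return p_new
```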
However, as special cases of (10), PUA and RUA require global information regarding the
grid, including the specific generated power of generators and other players’ active power in-
jections, which are assumed to be private in practice. Hence, to implement these algorithms,
communication networks are needed to broadcast information to players, which is costly and
not confidential. As a possible remedy, we can consider incorporating utility estimation and
using smoothed best response dynamics (SBR-d) as in the wireless setting. Another more
straightforward approach, as shown in the paper, is to modify the best response algorithm
by using the power flow equations in the smart grid. Based on a phasor measurement unit
(PMU), the third algorithm, termed PMU-enabled distributed algorithm (PDA), enables
each player to compute the aggregation of others’ actions, and the only information needed
is the player’s voltage angle θi . Therefore, by taking into account the power flow equation
(23), a player does not need other players’ private information of active power injection when
using PDA, as shown in Figure 11. Compared with the other two, PDA requires much less
information and is more self-dependent as players only need their current voltage angles θi ,
and the common knowledge of the electricity price.
Figure 11: The framework to implement the PMU-enabled distributed algorithm. PMU
measures the voltage angle at the bus, and the controller generates a command regarding
the amount of microgrid renewable energy injection from the local storage to the grid based
on the received voltage angle.
As indicated in [117], the effectiveness and resiliency of the algorithm have been vali-
dated via case studies based on the IEEE 14-bus system: the game-theory-based distributed
algorithm not only can converge to the unique Nash equilibrium but also provides strong
resilience against fault models (generator breakdown, microgrid turn-off, and open-circuit
of the transmission line, etc.) and attack models (data injection attacks, unavailability of
PMU data and jamming attacks, etc.). The strong resilience enables the microgrids to oper-
ate appropriately in unanticipated situations. Moreover, the distributed algorithm enables
autonomous management of renewable resources and the plug-and-play feature of the smart
grid. The proposed learning algorithm only requires the players to have common knowledge
without revealing their private information, which increases security and privacy and reduces
communication overhead.
Distributed machine learning over networks (DMLON) aims to develop efficient and scalable algorithms with reasonable requirements on memory and computation resources, by allocating the learning processes among several networked computing units with distributed data sets.
The key feature of DMLON is that data sets are stored and processed locally on these
computing units, which enables distributed and parallel computing schemes in large-scale ma-
chine learning systems. Compared with centralized approaches, distributed machine learning
avoids maintaining and mining a central data set and preserves data privacy, as these net-
worked units exchange knowledge about the learned models without exchanging raw private
data.
Based on the idea of “local learning and global integration,” DMLON can utilize different
learning processes to train several models from distributed data sets and then produce an
integration of learning models that can increase the possibility of achieving higher accuracy,
especially on a large-size domain. For example, in federated learning [130], the global integration is created by a third-party coordinator rather than by the computing units themselves, which lets networked computing units collaboratively train a machine learning model while keeping their data secure. On the other hand, as indicated in [131], such a global integration can also
stem from the collective patterns of local learning without external enforcement. The key idea behind this bottom-up integration is that each computing unit is modeled as a self-interested player that learns a model based on its local data set and the feedback from its
neighbors. It has been shown in the paper that by modeling DMLON as a noncooperative
game, game-theoretic learning methods lead to a communication-efficient distributed ma-
chine learning, where the global outcome is characterized by the Nash equilibrium, resulting
from players’ self-adaptive behaviors.
Specifically, the networked system of computing units is described by a graph with the
set of nodes Nm := {1, 2, . . . , N } representing these units. Each node i ∈ Nm possesses
local data that cannot be transferred to other nodes. In the game model considered in [131,
132], instead of fixing the network topology, nodes can determine the network’s connectivity
based on their attributes when they perform learning tasks, resulting in a network formation
game. In mathematical terms, the action of node i consists of two components: the learning
parameter θi ∈ Rd , and the network formation parameter ei ∈ RN −1 . The first component
θi corresponds to the weights or parameters of the machine learning model, which captures
the local learning process at node i, and the corresponding empirical loss, given the local
data, is denoted by Li (θi ). In addition to this learning parameter θi , the network formation
parameter ei plays an important role in bringing up the global integration. The parameter
ei := (eji )j̸=i,j∈N ∈ [0, 1]N −1 denotes concatenation of weights on the directed edges from
node i to other nodes, where eji can be interpreted as the attention node i pays to the
local learning at node j, and this further influence the communication between the nodes.
Each node can communicate with its neighbors during the distributed learning process to
exchange learning parameters if their objectives are aligned. Otherwise, the corresponding
edge weight eji is set to zero. For node i, the communication cost is Ci (θi , θ−i , ei ). In the
game considered in [131], each node aims to maximize a utility function composed of two terms: the local empirical loss $L_i(\theta_i)$ and the communication cost $C_i(\theta_i, \theta_{-i}, e_i)$.
Figure 12: A schematic representation of two-layer learning. The directed red lines stand for
the communication between nodes. In the network formation layer, the nodes learn to elim-
inate/establish links with other nodes to achieve efficient communication. In the distributed
machine learning layer, the nodes communicate their parameters with their neighbors and
perform their learning tasks.
Here, the first term $L_i(\theta_i)$ captures the local learning process at node i, whereas the second term $C_i(\theta_i, \theta_{-i}, e_i)$ depicts the interactions among nodes. The objective of each
node is to improve the performance of learning while reducing the communication overhead.
A two-layer learning approach is proposed in [131] to find the Nash equilibrium of the
game, and a schematic representation is provided in Figure 12. The outer layer corresponds
to network formation learning, where each node decides its network formation parameter ei
with the learning parameter fixed, and the joint parameters of all nodes $e = (e_i)_{i\in\mathcal{N}_m}$ give rise to a new network topology, leading to efficient communication. In network formation learning, each node decides its optimal parameter $e_i$ by gradient play (GD), and computing the individual payoff gradient $\nabla_{e_i} u_i(\theta_i, \theta_{-i}, e_i, e_{-i})$ relies on the stabilized learning parameters
θi , θ−i given by the inner layer: distributed learning layer. In this inner learning, the network
formation parameter is fixed, and each node implements online mirror descent (MD) for
seeking the Nash equilibrium with the local feedback under the current network topology,
as the networked nodes can exchange information with their neighbors.
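Schematically, the two-layer procedure can be written as the following nested loop (a pseudocode-level sketch of ours; the oracles `md_step` and `grad_e_utility` are placeholders for the mirror-descent update and the payoff-gradient computation described in [131]):

```python
def two_layer_learning(theta, e, md_step, grad_e_utility,
                       outer_rounds, inner_rounds, step_outer=0.01):
    """Outer layer: network formation by gradient play on e_i with theta fixed.
    Inner layer: distributed learning of theta by online mirror descent under the
    current topology e, using only neighbor feedback."""
    for _ in range(outer_rounds):
        # inner layer: run mirror descent on the learning parameters until they stabilize
        for _ in range(inner_rounds):
            theta = md_step(theta, e)
        # outer layer: one gradient-play step on the network formation parameters,
        # keeping every edge weight inside [0, 1]
        for i in range(len(e)):
            grads = grad_e_utility(i, theta, e)
            e[i] = [min(1.0, max(0.0, w + step_outer * g))
                    for w, g in zip(e[i], grads)]
    return theta, e
```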
Compared with existing works on distributed machine learning, the game-theoretic method
studied in [131] enables distributed machine learning over strategic networks. On the one
hand, the global outcome characterized by the Nash equilibrium is self-enforcing, resulting from the coordinated behaviors of independent computing units, in contrast with the externally enforced integration in federated learning. This bottom-up approach scales efficiently when additional
computing units are introduced into the system. On the other hand, the strategic interac-
tions over the network, described by the network formation decision of each node, create
a network intelligence that allows each computing unit to adaptively adjust the underlying
topology, resulting in a desired distributed learning pattern that minimizes communication
costs during the learning process.
4.4 Emerging Network Applications
From the examples above, game-theoretic learning provides a natural scalable design frame-
work to create network intelligence for autonomous control, management, and coordination
of large-scale complex network systems with heterogeneous parties. In the following, we
offer some thoughts regarding various applications of game-theoretic learning in a broader
context, showing that such a design framework is pervasive for diverse network problems.
Interdependent infrastructure networks, including wireless communication networks and
the smart grid, play a significant role in modern society, where Internet-of-Things (IoT)
devices are massively deployed and interconnected. These devices are connected to cellu-
lar/cloud networks, creating multi-layer networks, referred to as networks-of-networks [133].
The smart grid is one prominent example, where wireless sensors collect the data of buses
and power transmission lines, forming a sensor network built on the power networks for
grid monitoring and decision planning purposes [134]. Besides, the networks-of-networks
model has also been extensively studied in other infrastructure networks. For instance, in
an intelligent transportation network, apart from vehicle-to-vehicle (V2V) communications,
vehicles can also communicate with roadside infrastructures or units belonging to one or
several service providers to exchange various types of data related to different applications,
such as GPS navigation. In this case, the vehicles form one network while the infrastructure
nodes form another network, and the interconnections between the two networks lead to the
intelligent management and operation of modern transportation networks.
Due to interdependent networks’ heterogeneous and multi-tier features, the required man-
agement mechanisms or controls can vary for different networks. For example, the connec-
tivity of sensor networks in smart grids or V2V communication networks requires higher
security levels than the infrastructure networks, as cyberspace is more likely to be targeted
by adversaries [135]. Therefore, to manage and secure interdependent infrastructure net-
works, game-theoretic learning methods, especially heterogeneous learning [40, 47], can be
used to design decentralized and resilient mechanisms that are responsive to attacks and
adaptive to the dynamic environment, as different parties in interdependent infrastructure
networks may acquire different information. For further readings on this topic, we refer the
reader to [47, 133] and references therein.
Similar to distributed optimization and machine learning based on game-theoretic learn-
ing, the control of autonomous mobile robots can also be cast as a Nash equilibrium seeking
problem over networks, where the equilibrium is viewed as the desired coordination of all
robots [94,95]. For applications of this kind, where the nature of robot movements determines
the network topologies, dynamic games over networks are considered, and corresponding
learning algorithms are employed. Based on their observations of the surroundings, robots rely on game-theoretic learning, for example reinforcement learning, to develop autonomous policies, addressing the need for decentralized and scalable control laws for multi-agent robotic systems. Moreover, reinforcement learning has proven effective for real-world multi-agent
robotic control when combined with powerful function approximators, such as deep neural
networks. This area of research, termed deep multi-agent reinforcement learning [81, 136], is
growing rapidly and attracting the attention of researchers from machine learning, robotics
as well as control communities.
In addition to these prescriptive mechanisms in engineering practices, game-theoretic
learning also provides a descriptive model for studying human decision-making and strategic
interactions in epidemiology and social sciences, where the Nash equilibrium represents a
stable state of the underlying noncooperative game. For example, a differential game model
has been proposed in [137] to study virus or disease spreading over networks, where the authors developed a decentralized mitigation mechanism for controlling the spread. Such
an approach has been further explored in [138], where an optimal quarantining strategy
of suppressing two interdependent epidemics spreading over complex networks has been
proposed and proven robust against random changes in network connections.
5 Summary
This article provides a comprehensive overview of game theory basics and related learning
theories, which serve as building blocks for systematically treating multi-agent decision-
making over networks. We have elaborated on the game-theoretic learning methods for net-
work applications drawn from emerging areas such as the next-generation wireless
networks, the smart grid, and networked machine learning. In each area, we have identified
the main technical challenges and discussed how game theory can be applied to address them
in a bottom-up approach.
From the surveyed works, we conclude that noncooperative game theory is the cornerstone
of decentralized mechanisms for large-scale complex networks with heterogeneous entities,
where each node is modeled as an independent decision-maker. The resulting collective
behaviors of these rational decision-makers over the network can be mathematically depicted
by the solution concept: Nash equilibrium. In addition to various game models, learning
in games is of great significance for creating distributed network intelligence, which enables
each entity in the network to respond to unanticipated situations, such as malicious attacks
from adversaries in cyber-physical systems [134]. Under local or individual feedback, the
introduced learning dynamics lead to a decentralized and self-adaptive procedure, resulting
in desired collective behavior patterns without external enforcement.
Beyond the existing successes of game-theoretic learning, which mainly focuses on learn-
ing in static repeated games, it is also of interest to investigate dynamic game models and
associated learning dynamics, in order to better understand the decision-making process in
dynamic environments. The motivation for studying dynamic models and related learning
theory stems, on the one hand, from the pervasive presence of time-varying network struc-
tures, such as generation and demand in the smart grid [117]. On the other hand, by defining
auxiliary state variables, the problem of decision-making under uncertainties can be modeled
as a dynamic game, where the state of the game includes the hidden information players do
not have access to when making decisions. For example, the state variable can capture the
uncertainty of the environment, as we have discussed in the context of the dynamic routing
problem [113], or it can describe the global status of the entire system, as we have shown in
the example of distributed optimization [139]. The dynamic game models not only simplify
the construction of players’ utilities and actions, providing a clear picture of the strategic
interactions under uncertainties in the dynamic environment, but can also offer a scalable
design framework for prescribing players’ self-adaptive behaviors that lead to equilibrium
states under various feedback structures.
To recap, this article has presented a comprehensive overview of game-theoretic learning
and its potential for tackling the challenges emerging from network applications. The combi-
nation of game-theoretic modeling and related learning theories constitutes a powerful tool
for designing future data-driven network systems with distributed intelligent entities, which
serve as the bedrock and a key enabler for resilient and agile control of large-scale artificial
intelligence systems in the near future.
References
[1] M. O. Jackson, Social and Economic Networks. Princeton, NJ: Princeton University
Press, 2010.
[3] Q. Zhu, Z. Han, and T. Başar, “A differential game approach to distributed demand
side management in smart grid,” in 2012 IEEE International Conference on Commu-
nications (ICC), pp. 3345–3350, 2012.
[5] Z. Han, D. Niyato, W. Saad, T. Başar, and A. Hjørungnes, Game theory in Wire-
less and Communication Networks: Theory, Models, and Applications. Cambridge
University Press, 2012.
[6] Q. Zhu, Z. Yuan, J. B. Song, Z. Han, and T. Başar, “Interference aware routing
game for cognitive radio multi-hop networks,” IEEE Journal on Selected Areas in
Communications, vol. 30, no. 10, pp. 2006–2015, 2012.
[7] Z. Han, D. Niyato, W. Saad, and T. Başar, Game Theory for Next Generation Wireless
and Communication Networks: Modeling, Analysis, and Design. Cambridge University
Press, 2019.
[8] M. H. Manshaei, Q. Zhu, T. Alpcan, T. Başar, and J.-P. Hubaux, “Game theory meets
network security and privacy,” ACM Computing Surveys (CSUR), vol. 45, no. 3, pp. 1–
39, 2013.
[9] Q. Zhu and T. Başar, “Game-theoretic methods for robustness, security, and resilience
of cyberphysical control systems: games-in-games principle for optimal cross-layer re-
silient control systems,” IEEE Control Systems Magazine, vol. 35, no. 1, pp. 46–65,
2015.
[10] P. Stone and M. Veloso, “Multiagent systems: a survey from a machine learning per-
spective,” Autonomous Robots, vol. 8, no. 3, pp. 345–383, 2000.
[11] T. Başar and G. J. Olsder, Dynamic Noncooperative Game Theory, 2nd Edition. So-
ciety for Industrial and Applied Mathematics, 1998.
[12] D. Fudenberg and J. Tirole, Game Theory. Cambridge, MA: MIT Press, 1991.
[13] M. Maschler, E. Solan, and S. Zamir, Game Theory. Cambridge University Press,
2013.
[14] M. O. Jackson and Y. Zenou, “Chapter 3 Games on Networks,” Handbook of Game
Theory with Economic Applications, vol. 4, pp. 95–163, 2015.
[15] S. Shalev-Shwartz, “Online learning and online convex optimization,” Foundations and
Trends® in Machine Learning, vol. 4, no. 2, pp. 107–194, 2011.
[17] J. Lei and U. V. Shanbhag, “Stochastic Nash equilibrium problems: Models, analysis,
and algorithms,” submitted as part of CSM special issue, 2020.
[18] J. B. Rosen, “Existence and uniqueness of equilibrium points for concave N-person
games,” Econometrica, vol. 33, no. 3, pp. 520–534, 1965.
[22] R. Selten, “Reexamination of the perfectness concept for equilibrium points in exten-
sive games,” International Journal of Game Theory, vol. 4, no. 1, pp. 25–55, 1975.
[24] D. Fudenberg, The Theory of Learning in Games. Cambridge, MA: MIT Press, 1998.
[26] P. D. Taylor and L. B. Jonker, “Evolutionary stable strategies and game dynamics,”
Mathematical Biosciences, vol. 40, no. 1-2, pp. 145–156, 1978.
[27] S. Hart and A. Mas-Colell, “Uncoupled dynamics do not lead to Nash equilibrium,”
The American Economic Review, vol. 93, no. 5, pp. 1830–1836, 2003.
[28] J. R. Marden and J. S. Shamma, “Chapter 16 Game Theory and Distributed Control,”
Handbook of Game Theory with Economic Applications, vol. 4, pp. 861–899, 2015.
[29] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48. Springer, 2008.
[33] C. Harris, “On the Rate of Convergence of Continuous-Time Fictitious Play,” Games
and Economic Behavior, vol. 22, no. 2, pp. 238–259, 1998.
[34] J. Hofbauer and K. Sigmund, “Evolutionary game dynamics,” Bulletin of the American
Mathematical Society, vol. 40, no. 4, pp. 479–519, 2003.
[36] V. Krishna and T. Sjöström, “On the convergence of fictitious play,” Mathematics of
Operations Research, vol. 23, no. 2, pp. 479–511, 1998.
[37] P. Mertikopoulos and Z. Zhou, “Learning in games with continuous action sets and
unknown payoff functions,” Mathematical Programming, vol. 173, no. 1-2, pp. 465–507,
2018.
[39] R. D. McKelvey and T. R. Palfrey, “Quantal response equilibria for normal form
games,” Games and Economic Behavior, vol. 10, no. 1, pp. 6–38, 1995.
[43] J. M. Smith and G. R. Price, “The logic of animal conflict,” Nature, vol. 246, no. 5427,
pp. 15–18, 1973.
[44] W. H. Sandholm, Population Games and Evolutionary Dynamics. Cambridge, MA:
MIT Press, 2010.
[45] R. Cressman and Y. Tao, “The replicator equation and other game dynamics,” Proceed-
ings of the National Academy of Sciences, vol. 111, no. Supplement 3, pp. 10810–10817,
2014.
[46] D. S. Leslie and E. Collins, “Generalised weakened fictitious play,” Games and Eco-
nomic Behavior, vol. 56, no. 2, pp. 285–298, 2006.
[47] Q. Zhu, H. Tembine, and T. Başar, “Hybrid learning in stochastic games and its
application in network security,” in Reinforcement Learning and Approximate Dynamic
Programming for Feedback Control, pp. 303–329, John Wiley & Sons, Ltd, 2012.
[50] D. S. Leslie and E. J. Collins, “Individual Q-learning in normal form games,” SIAM
Journal on Control and Optimization, vol. 44, no. 2, pp. 495–514, 2005.
[51] J. Hofbauer, S. Sorin, and Y. Viossat, “Time average replicator and best-reply dynam-
ics,” Mathematics of Operations Research, vol. 34, no. 2, pp. 263–269, 2009.
[55] T. Li and Q. Zhu, “On convergence rate of adaptive multiscale value function ap-
proximation for reinforcement learning,” 2019 IEEE 29th International Workshop on
Machine Learning for Signal Processing (MLSP), pp. 1–6, 2019.
[57] J. C. Spall, “A one-measurement form of simultaneous perturbation stochastic approx-
imation,” Automatica, vol. 33, no. 1, pp. 109–112, 1997.
[59] L. Xiao, “Dual averaging methods for regularized stochastic learning and online opti-
mization,” J. Mach. Learn. Res., vol. 11, p. 2543–2596, Dec. 2010.
[62] J. Hofbauer and S. Sorin, “Best response dynamics for continuous zero-sum games,”
Discrete & Continuous Dynamical Systems - B, vol. 6, no. 1, pp. 215–224, 2006.
[63] J. Hofbauer and W. H. Sandholm, “On the global convergence of stochastic fictitious
play,” Econometrica, vol. 70, no. 6, pp. 2265–2294, 2002.
[65] A. Heliou, J. Cohen, and P. Mertikopoulos, “Learning with bandit feedback in potential
games,” in Advances in Neural Information Processing Systems 30, pp. 6369–6378,
Curran Associates, Inc., 2017.
[66] B. Gao and L. Pavel, “On passivity, reinforcement learning, and higher order learning
in multiagent finite games,” IEEE Transactions on Automatic Control, vol. 66, no. 1,
pp. 121–136, 2019.
[67] E. N. Barron, R. Goebel, and R. R. Jensen, “Best response dynamics for continuous
games,” Proceedings of the American Mathematical Society, vol. 138, no. 03, pp. 1069–
1069, 2010.
[68] B. Swenson, R. Murray, and S. Kar, “On best-response dynamics in potential games,”
SIAM Journal on Control and Optimization, vol. 56, no. 4, pp. 2734–2767, 2018.
[69] B. Swenson, R. Murray, and S. Kar, “Regular potential games,” Games and Economic
Behavior, vol. 124, pp. 432–453, 2020.
[71] J. C. Harsanyi, “Games with randomly disturbed payoffs: A new rationale for mixed-
strategy equilibrium points,” International Journal of Game Theory, vol. 2, no. 1,
pp. 1–23, 1973.
[72] J. Hofbauer and E. Hopkins, “Learning in perturbed asymmetric games,” Games and
Economic Behavior, vol. 52, no. 1, pp. 133–152, 2005.
[73] H. P. Young, “The evolution of conventions,” Econometrica, vol. 61, no. 1, pp. 57–84,
1993.
[74] H. P. Young, “Learning by trial and error,” Games and Economic Behavior, vol. 65,
no. 2, pp. 626–643, 2009.
[75] J. Gaveau, C. J. Le Martret, and M. Assaad, “Performance analysis of trial and error
algorithms,” IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 6,
pp. 1343–1356, 2020.
[80] Z. Hu, M. Zhu, P. Chen, and P. Liu, “On convergence rates of game theoretic rein-
forcement learning algorithms,” Automatica, vol. 104, pp. 90–101, 2019.
[82] Y. Zhou, M. Kantarcioglu, and B. Xi, “A survey of game theoretic approach for adver-
sarial machine learning,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, vol. 9, no. 3, 2019.
[83] K. Zhang, B. Hu, and T. Başar, “On the stability and convergence of robust adversarial
reinforcement learning: A case study on linear quadratic systems,” in Advances in
Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. F.
Balcan, and H. Lin, eds.), vol. 33, pp. 22056–22068, 2020.
[85] A. Wibisono, A. C. Wilson, and M. I. Jordan, “A variational perspective on accelerated
methods in optimization,” Proceedings of the National Academy of Sciences, vol. 113,
no. 47, pp. E7351–E7358, 2016.
[86] E. V. Mazumdar, M. I. Jordan, and S. S. Sastry, “On finding local Nash equilibria
(and only local Nash equilibria) in zero-sum games,” arXiv, 2019.
[87] C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan, “On nonconvex op-
timization for machine learning: Gradients, stochasticity, and saddle points,” arXiv,
2019.
[89] W. Su, S. Boyd, and E. J. Candès, “A differential equation for modeling Nesterov’s
accelerated gradient method: theory and insights,” Journal of Machine Learning Re-
search, vol. 17, no. 153, pp. 1–43, 2016.
[90] D. Gadjov and L. Pavel, “A passivity-based approach to Nash equilibrium seeking over
networks,” IEEE Transactions on Automatic Control, vol. 64, no. 3, pp. 1077–1092,
2017.
[91] T. Başar, “Relaxation techniques and asynchronous algorithms for on-line computation
of non-cooperative equilibria,” Journal of Economic Dynamics and Control, vol. 11,
no. 4, pp. 531–549, 1987.
[92] G. Hu, Y. Pang, C. Sun, and Y. Hong, “Distributed Nash equilibrium seeking:
continuous-time control-theoretic approaches,” submitted as part of CSM special is-
sue, 2020.
[97] M. O. Sayin, F. Parise, and A. Ozdaglar, “Fictitious play in zero-sum stochastic
games,” arXiv, 2020.
[98] R. Bellman, “The theory of dynamic programming,” Bulletin of the American Math-
ematical Society, vol. 60, no. 6, pp. 503–515, 1954.
[99] J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal
of Machine Learning Research, vol. 4, no. Nov, pp. 1039–1069, 2003.
[101] J. Bu, L. J. Ratliff, and M. Mesbahi, “Global convergence of policy gradient for se-
quential zero-sum linear quadratic dynamic games,” arXiv, 2019.
[102] K. Zhang, X. Zhang, B. Hu, and T. Başar, “Derivative-free policy optimization for risk-
sensitive and robust control design: implicit regularization and sample complexity,”
arXiv, 2021.
[103] P. Dayan and C. J. Watkins, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[105] H.-T. Wai, Z. Yang, Z. Wang, and M. Hong, “Multi-agent reinforcement learning via
double averaging primal-dual optimization,” arXiv, 2018.
[106] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar, “Fully decentralized multi-agent
reinforcement learning with networked agents,” in Proceedings of the 35th International
Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research,
(Stockholmsmässan, Stockholm Sweden), pp. 5872–5881, PMLR, 2018.
[108] T. Li, G. Peng, and Q. Zhu, “Blackwell online learning for Markov decision processes,”
arXiv preprint arXiv:2012.14043, 2020.
[111] V. Hakami and M. Dehghan, “Learning stationary correlated equilibria in constrained
general-sum stochastic games,” IEEE Transactions on Cybernetics, vol. 46, no. 7,
pp. 1640–1654, 2016.
[112] T. Alpcan, T. Başar, R. Srikant, and E. Altman, “CDMA uplink power control as a
noncooperative game,” Wireless Networks, vol. 8, no. 6, pp. 659–670, 2002.
[113] Q. Zhu, J. B. Song, and T. Başar, “Dynamic secure routing game in distributed
cognitive radio networks,” in 2011 IEEE Global Telecommunications Conference-
GLOBECOM 2011, pp. 1–6, IEEE, 2011.
[115] M. J. Farooq and Q. Zhu, “On the secure and reconfigurable multi-layer network design
for critical information dissemination in the internet of battlefield things (IoBT),”
IEEE Transactions on Wireless Communications, vol. 17, no. 4, pp. 2618–2632, 2018.
[116] M. J. Farooq and Q. Zhu, “Modeling, analysis, and mitigation of dynamic botnet
formation in wireless IoT networks,” IEEE Transactions on Information Forensics
and Security, vol. 14, no. 9, pp. 2412–2426, 2019.
[117] J. Chen and Q. Zhu, “A game-theoretic framework for resilient and distributed gener-
ation control of renewable energies in microgrids,” IEEE Transactions on Smart Grid,
vol. 8, no. 1, pp. 285–295, 2016.
[118] S. Maharjan, Q. Zhu, Y. Zhang, S. Gjessing, and T. Başar, “Demand response man-
agement in the smart grid in a large population regime,” IEEE Transactions on Smart
Grid, vol. 7, no. 1, pp. 189–199, 2015.
[119] J. Chen, C. Touati, and Q. Zhu, “A dynamic game approach to strategic design of se-
cure and resilient infrastructure network,” IEEE Transactions on Information Foren-
sics and Security, vol. 15, pp. 462–474, 2019.
[120] L. Huang, J. Chen, and Q. Zhu, “A large-scale Markov game approach to dynamic
protection of interdependent infrastructure networks,” in International Conference on
Decision and Game Theory for Security, pp. 357–376, Springer, 2017.
[121] J. Chen and Q. Zhu, “Interdependent network formation games with an application to
critical infrastructures,” in 2016 American Control Conference (ACC), pp. 2870–2875,
IEEE, 2016.
[122] J. Chen, C. Touati, and Q. Zhu, “Heterogeneous multi-layer adversarial network de-
sign for the IoT-enabled infrastructures,” in GLOBECOM 2017-2017 IEEE Global
Communications Conference, pp. 1–6, IEEE, 2017.
[123] Z. Xu and Q. Zhu, “A game-theoretic approach to secure control of communication-
based train control systems under jamming attacks,” in Proceedings of the 1st Interna-
tional Workshop on Safe Control of Connected and Autonomous Vehicles, pp. 27–34,
2017.
[124] Q. Zhu, W. Saad, Z. Han, H. V. Poor, and T. Başar, “Eavesdropping and jamming
in next-generation wireless networks: A game-theoretic approach,” in 2011-MILCOM
2011 Military Communications Conference, pp. 119–124, IEEE, 2011.
[126] L. Huang and Q. Zhu, “A dynamic games approach to proactive defense strategies
against advanced persistent threats in cyber-physical systems,” Computers & Security,
vol. 89, p. 101660, 2020.
[127] Q. Zhu and S. Rass, “On multi-phase and multi-stage game-theoretic modeling of
advanced persistent threats,” IEEE Access, vol. 6, pp. 13958–13971, 2018.
[128] N. Al-Falahy and O. Y. Alani, “Technologies for 5G networks: challenges and oppor-
tunities,” IT Professional, vol. 19, no. 1, pp. 12–20, 2017.
[129] J. B. Song and Q. Zhu, “Performance of dynamic secure routing game,” in Game
Theory for Networking Applications, pp. 37–56, Springer, 2019.
[131] S. Liu, T. Li, and Q. Zhu, “Communication-efficient distributed machine learning over
strategic networks: A two-layer game approach,” arXiv preprint arXiv:2011.01455,
2020.
[132] S. Liu, T. Li, and Q. Zhu, “Game-theoretic distributed empirical risk minimization with
strategic network design,” IEEE Transactions on Signal and Information Processing
over Networks, vol. 9, pp. 542–556, 2023.
[133] J. Chen and Q. Zhu, “A game- and decision-theoretic approach to resilient inter-
dependent network analysis and design,” SpringerBriefs in Electrical and Computer
Engineering, pp. 75–102, 2019.
[134] Q. Zhu, “Multilayer cyber-physical security and resilience for smart grid,” in Smart
Grid Control, pp. 225–239, Springer, 2019.
[135] M. J. Farooq and Q. Zhu, “On the secure and reconfigurable multi-layer network design
for critical information dissemination in the internet of battlefield things (IoBT),”
IEEE Transactions on Wireless Communications, vol. 17, no. 4, pp. 2618–2632, 2018.
[136] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson, “Learning to communi-
cate with deep multi-agent reinforcement learning,” Advances in Neural Information
Processing Systems, vol. 29, pp. 2137–2145, 2016.
[139] N. Li and J. R. Marden, “Designing games for distributed optimization,” IEEE Journal
of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 230–242, 2013.
[140] H. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and
Applications, vol. 35. Springer Science & Business Media, 2003.
A Fictitious Play
Consider the repeated play between two players, with each player knowing his own utility
function. Further, each player is able to observe the actions of the other player and choose
an optimal action based on the empirical frequency of these actions.
In fictitious play, from player 1’s viewpoint, player 2’s strategy at time k can be estimated as
$$\pi_2^k(a) = \frac{1}{k}\sum_{s=1}^{k} \mathbb{1}\{a_2^s = a\}, \quad a \in \mathcal{A}_2,$$
which is the empirical frequency of the actions player 2 has played.
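As a minimal numerical illustration (with an arbitrarily chosen matching-pennies game, which is an assumption for demonstration rather than part of the discussion above), the sketch below runs discrete-time fictitious play: each player best-responds to the empirical frequency of the opponent’s past actions.

```python
import numpy as np

# illustrative payoff matrices: entry [a1, a2] is the payoff when player 1
# plays a1 and player 2 plays a2 (matching pennies, zero-sum)
U1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
U2 = -U1

counts1 = np.ones(2)   # action counts, initialized to 1 to avoid empty histories
counts2 = np.ones(2)

for k in range(10000):
    pi1 = counts1 / counts1.sum()    # empirical frequency of player 1's actions
    pi2 = counts2 / counts2.sum()    # empirical frequency of player 2's actions
    a1 = int(np.argmax(U1 @ pi2))    # player 1 best-responds to pi2
    a2 = int(np.argmax(U2.T @ pi1))  # player 2 best-responds to pi1
    counts1[a1] += 1
    counts2[a2] += 1

print("empirical frequencies:", counts1 / counts1.sum(), counts2 / counts2.sum())
# for this zero-sum example the empirical frequencies approach the mixed
# Nash equilibrium (1/2, 1/2) for both players
```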
B Replicator Dynamics
Recall that the continuous-time learning dynamics under dual averaging are
$$\frac{d\hat{u}_i(t)}{dt} = u_i(\pi_{-i}(t)), \qquad \pi_i(t) = Q_R^{\epsilon}(\hat{u}_i(t)). \tag{DA-c}$$
We now consider the entropy regularizer $h(x) = \sum_{i} x_i \log x_i$ and let $\epsilon = 1$ for simplicity, in which case the regularized choice map reduces to the logit (softmax) map $\pi_{i,a}(t) = e^{\hat{u}_{i,a}(t)} / \sum_{a'} e^{\hat{u}_{i,a'}(t)}$. Differentiating the strategy $\pi_i(t)$ with respect to the time variable in (DA-c), we arrive at
$$\begin{aligned}
\frac{d\pi_{i,a}(t)}{dt} &= \frac{1}{\big(\sum_{a'} e^{\hat{u}_{i,a'}(t)}\big)^2}\left( \frac{d\hat{u}_{i,a}(t)}{dt}\, e^{\hat{u}_{i,a}(t)} \sum_{a'} e^{\hat{u}_{i,a'}(t)} - e^{\hat{u}_{i,a}(t)} \sum_{a'} e^{\hat{u}_{i,a'}(t)}\, \frac{d\hat{u}_{i,a'}(t)}{dt} \right)\\
&= \pi_{i,a}(t)\left( \frac{d\hat{u}_{i,a}(t)}{dt} - \sum_{a'} \pi_{i,a'}(t)\, \frac{d\hat{u}_{i,a'}(t)}{dt} \right)\\
&= \pi_{i,a}(t)\big[u_i(a, \pi_{-i}(t)) - u_i(\pi_i(t), \pi_{-i}(t))\big].
\end{aligned} \tag{RD}$$
From the equation above, we can see that for a certain action a, if its outcome ui (a, π−i (t))
is above the average ui (πi (t), π−i (t)), then it will be “reinforced” in the sense that the prob-
ability of choosing a gets higher as time evolves. The above equation (RD) is referred to
as replicator dynamics, and has been widely used in evolutionary game theory to under-
stand natural selection and population biology. We consider a two-population system and
we reinterpret the elements in the two-player game using population biology language. For
population 1, there are |A1 | types and each type is specified by an element a ∈ A1 . We let
π1,a (t) be the percentage of type a in population 1 at time t, and assume here that π1 (t) is
differentiable with respect to time t, as the population, which is infinitely large, interacts
with the other population in a continuous-time manner.
For population 2, we have similar notions. If individuals from the two populations meet
randomly, then they engage in a competition or a game with payoff dependent on their types.
For example, if type a1 from population 1 competes with type a2 from population 2, then the
payoffs for the two types are given by u1 (a1 , a2 ) and u2 (a1 , a2 ), respectively. For population
i, if we assume that the per capita rate of growth is given by the difference between the
payoff for type a and the average payoff in the population, a rule studied in [43], then the
percentage of different types within a population is precisely described by
$$\frac{1}{\pi_{i,a}(t)}\,\frac{d\pi_{i,a}(t)}{dt} = u_i(a, \pi_{-i}(t)) - u_i(\pi_i(t), \pi_{-i}(t)),$$
which is exactly the replicator dynamics (RD). In addition, as shown in [38], different
regularizers lead to different learning dynamics, whose distinct asymptotic behaviors account for evolutionary processes under different circumstances.
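For a numerical illustration of (RD), the sketch below integrates the replicator dynamics with a forward-Euler scheme on a two-population coordination game; the payoff matrices, initial shares, and step size are assumptions chosen only for demonstration.

```python
import numpy as np

# illustrative two-population coordination game: both sides earn 2 when
# they coordinate on action 0 and 1 when they coordinate on action 1
U1 = np.array([[2.0, 0.0], [0.0, 1.0]])
U2 = np.array([[2.0, 0.0], [0.0, 1.0]])

pi1 = np.array([0.6, 0.4])   # initial type shares in population 1
pi2 = np.array([0.6, 0.4])   # initial type shares in population 2
dt = 0.01

for _ in range(5000):
    payoff1 = U1 @ pi2       # payoff of each type in population 1
    payoff2 = U2.T @ pi1     # payoff of each type in population 2
    # forward-Euler step of the replicator dynamics (RD)
    pi1 = pi1 + dt * pi1 * (payoff1 - pi1 @ payoff1)
    pi2 = pi2 + dt * pi2 * (payoff2 - pi2 @ payoff2)

print(np.round(pi1, 3), np.round(pi2, 3))
# with these initial shares, both populations converge to the pure
# equilibrium in which everyone plays the first action
```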
With replicator dynamics and other related evolutionary dynamics, biologists can predict
the evolutionary outcome of the multi-population system by examining the Nash equilib-
rium of the underlying game, which brings strategic reasoning into population biology and
has a profound influence in evolutionary game theory [44, 45]. Moreover, the Nash equi-
librium in this population game, characterized by the limiting behavior of the dynamics
under proper conditions [45], represents an evolutionarily stable state of the population,
which is an important refinement of Nash equilibrium. When this stable state is reached,
natural selection alone is sufficient to prevent the population from being influenced by mu-
tation [34, 44]. For more details on this refinement and its application in biology, we refer
the reader to [11, 34, 44, 45].
$$f_i(\pi_i^k, \hat{u}_i^{k+1}) = \mathbb{E}\big[F_i(\pi_i^k, \hat{u}_i^{k+1}, U_i^{k+1}, a_i^{k+1}) \,\big|\, \mathcal{F}^{k-1}\big],$$
$$g_i(\pi_i^k, \hat{u}_i^{k}) = \mathbb{E}\big[G_i(\pi_i^k, \hat{u}_i^{k}, U_i^{k+1}, a_i^{k+1}) \,\big|\, \mathcal{F}^{k-1}\big].$$
With the mean-field parts defined as above, $M_i^{k+1} = F_i(\pi_i^k, \hat{u}_i^{k+1}, U_i^{k+1}, a_i^{k+1}) - f_i(\pi_i^k, \hat{u}_i^{k+1})$, and $\Gamma_i^{k+1}$ takes a similar form. Here $\bar{\lambda}_i^k, \bar{\mu}_i^k$ are time-scaling factors dependent on the learning rates $\lambda_i^k, \mu_i^k$, which account for the adjustment of the original step sizes in asynchronous schemes [29, 64]; in synchronous cases, the time-scaling factors coincide with the original step sizes. Similar to our discussion in the main text (see (18) and (19)), we consider the dynamical system of the joint strategy profile $\pi^k$ and utility vector $\hat{u}^k$:
$$\pi^{k+1} - \pi^k = \bar{\lambda}^k f(\pi^k, \hat{u}^{k+1}) + M^{k+1}, \qquad \hat{u}^{k+1} - \hat{u}^k = \bar{\mu}^k g(\pi^k, \hat{u}^k) + \Gamma^{k+1}, \tag{DSA}$$
where $f$ and $g$ are concatenations of $\{f_i\}_{i\in\mathcal{N}}$ and $\{g_i\}_{i\in\mathcal{N}}$, respectively, and $\bar{\lambda}^k, \bar{\mu}^k$ and $M^k, \Gamma^k$ take similar concatenated forms.
As we have discussed in “Convergence of Learning in Games”, in order to obtain an ap-
proximately accurate score function, the two coupled dynamical systems in (DSA) should op-
erate on different timescales: the score function ûk should be updated sufficiently many times
until near-convergence before updating the strategy. This two-timescale iteration can be
achieved by adjusting the time-scaling factors: λ̄k and µ̄k are chosen so that limk→∞ λ̄k /µ̄k =
0. To understand this timescale system, it is instructive to consider a coupled continuous-
time dynamical system, as suggested in [29]:
$$\frac{d\pi(t)}{dt} = f(\pi(t), \hat{u}(t)), \qquad \frac{d\hat{u}(t)}{dt} = \frac{1}{\varepsilon}\, g(\pi(t), \hat{u}(t)), \tag{C2}$$
where $\varepsilon$ tends to zero in the limit. Hence, $\hat{u}(t)$ is the fast transient while $\pi(t)$ is the slow component. Then,
we can analyze the long-run behavior of the above coupled system as if the fast process is
always fully calibrated to the current value of the slow process. This suggests investigating
the ODE
$$\frac{d\hat{u}(t)}{dt} = g(\pi, \hat{u}(t)), \tag{C3}$$
where π is held fixed as a constant parameter. Suppose (C3) has a globally asymptotically
stable equilibrium Λ(π), where the mapping Λ(·) satisfies regularity conditions specified
in [30,64]. Then, it is reasonable to expect û(t) given by (C3) to closely track Λ(π). In turn,
this suggests that the investigation into the coupled system (C2) is equivalent to the study
of the single-timescale one
$$\frac{d\pi(t)}{dt} = f(\pi(t), \Lambda(\pi(t))), \tag{C4}$$
which would capture the long-run behavior of π(t) in (C2) to a good approximation [29].
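The two-timescale mechanism can be illustrated on a toy scalar problem, shown in the sketch below: the fast variable is driven toward a point Λ(x) that depends on the slow variable, while the slow variable is updated with step sizes satisfying λ̄^k/μ̄^k → 0. The maps f and g, the step-size schedules, and the noise level are illustrative assumptions, not the game dynamics themselves.

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = 2.0, 0.0          # slow (strategy-like) and fast (score-like) variables

def f(x, y):             # slow-timescale mean field; toy choice f(x, y) = -y
    return -y

def g(x, y):             # fast-timescale mean field; toy choice g(x, y) = x - y,
    return x - y         # whose equilibrium for fixed x is Lambda(x) = x

for k in range(1, 200001):
    lam = 1.0 / k        # slow step size (lam / mu -> 0)
    mu = 1.0 / k ** 0.6  # fast step size
    noise_x = 0.1 * rng.standard_normal()
    noise_y = 0.1 * rng.standard_normal()
    x = x + lam * (f(x, y) + noise_x)   # slow update, cf. the first line of (DSA)
    y = y + mu * (g(x, y) + noise_y)    # fast update, cf. the second line of (DSA)

print(round(x, 3), round(y, 3))
# the fast variable tracks Lambda(x) = x, and the slow variable follows the
# averaged ODE dx/dt = f(x, Lambda(x)) = -x, so both drift toward zero
```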
Informally speaking, to study the convergence of (DSA), we can relate its discrete-time
trajectory to that of (C2), which is further equivalent to (π(t), Λ(π(t))) specified by (C4).
Therefore, we can apply Lyapunov stability theory to (C4), in order to derive the conver-
gence results of the original discrete-time algorithm. We begin with the linear interpolation
process of the discrete-time trajectory, which connects the discrete-time system (DSA) and
its continuous-time counterpart (C2), (C4). Under some regularity conditions [30], for $\{\pi^k\}$, the sequence generated by (DSA), we can construct the following continuous-time process $\bar{\pi}(t): \mathbb{R}_+ \to \Delta(\mathcal{A})$, based on the linear interpolation of $\{\pi^k\}$. Letting $\tau^0 = 0$ and $\tau^k = \sum_{s=1}^{k} \bar{\lambda}^s$, we define
$$\bar{\pi}(t) := \pi^k + (t - \tau^k)\,\frac{\pi^{k+1} - \pi^k}{\tau^{k+1} - \tau^k}, \qquad t \in [\tau^k, \tau^{k+1}).$$
Similarly, we can define a continuous-time process ū(t) corresponding to {ûk }.
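The interpolated process itself is straightforward to compute; as a small illustration under assumed scalar iterates and step sizes, the sketch below builds π̄(t) on the timeline τ^k and evaluates it between two discrete iterates.

```python
import numpy as np

K = 50
lam = 1.0 / np.arange(1, K)                 # illustrative step sizes lambda^1..lambda^{K-1}
pi_iter = np.cumsum(np.r_[0.0, 0.1 * lam])  # illustrative scalar iterates pi^0..pi^{K-1}
tau = np.r_[0.0, np.cumsum(lam)]            # tau^0 = 0, tau^k = sum of the first k step sizes

def pi_bar(t):
    """Piecewise-linear interpolation of the iterates on the tau timeline."""
    return float(np.interp(t, tau, pi_iter))

print(pi_bar(1.7))   # value of the interpolated process between two discrete iterates
```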
As shown in [30, 64], such a linearly interpolated process (π̄(t), ū(t)) is closely related to
the flow of the following differential equations:
$$\frac{d\pi(t)}{dt} = f(\pi(t), \hat{u}(t)), \qquad \frac{d\hat{u}(t)}{dt} = g(\pi(t), \hat{u}(t)). \tag{C5}$$
We note that (C5) is defined for ease of presentation, and the actual differential inclusion
system involves a rearrangement of several terms; we refer the reader to [64] for more details. Further, we denote the flow of (C5) by
$$\Phi_t(\pi^0, u^0) := \big\{(\pi(t), \hat{u}(t)) \,\big|\, (\pi(t), \hat{u}(t)) \text{ is a solution to (C5) with } \pi(0) = \pi^0,\ \hat{u}(0) = u^0\big\}.$$
The key to stochastic approximation theory lies in the fact that, in the presence of a global attractor for (C5), the continuous-time process $(\bar{\pi}(t), \bar{u}(t))$ asymptotically tracks the flow with arbitrary accuracy over windows of arbitrary length [30]:
$$\lim_{t\to\infty} \sup_{s\in[0,T]} \operatorname{dist}\big\{(\bar{\pi}(t+s), \bar{u}(t+s)),\ \Phi_s(\bar{\pi}(t), \bar{u}(t))\big\} = 0,$$
where $\operatorname{dist}\{\cdot, \cdot\}$ denotes a distance measure on $\Delta(\mathcal{A}) \times \mathbb{R}^{\mathcal{A}}$. We refer to $(\bar{\pi}(t), \bar{u}(t))$ as an asymptotic pseudo-trajectory (APT) of the dynamics (C5). In other words, in order to study the convergence of (DSA), we can resort to the convergence analysis of (C5), which can be addressed by Lyapunov stability theory as shown in [30, 64]; the key conclusion is that if there is a global attractor $\mathcal{A}$ for (C4), then the interpolated process $(\bar{\pi}(t), \bar{u}(t))$, or simply $(\pi^k, \hat{u}^k)$, converges almost surely to $(\mathcal{A}, \Lambda(\mathcal{A}))$.