The Confluence of Networks, Games and Learning: A Game-Theoretic Framework For Multi-Agent Decision Making Over Networks
Abstract
Recent years have witnessed significant advances in technologies and services in
modern network applications, including smart grid management, wireless communi-
cation, cybersecurity as well as multi-agent autonomous systems. Considering the
heterogeneous nature of networked entities, emerging network applications call for
game-theoretic models and learning-based approaches in order to create distributed
network intelligence that responds to uncertainties and disruptions in a dynamic or an
adversarial environment. This paper articulates the confluence of networks, games and
learning, which establishes a theoretical underpinning for understanding multi-agent
decision-making over networks. We provide a selective overview of game-theoretic
learning algorithms within the framework of stochastic approximation theory, and as-
sociated applications in some representative contexts of modern network systems, such
as the next generation wireless communication networks, the smart grid and distributed
machine learning. In addition to existing research works on game-theoretic learning
over networks, we highlight several new angles and research endeavors on learning in
games that are related to recent developments in artificial intelligence. Some of the
new angles extrapolate from our own research interests. The overall objective of the
paper is to provide the reader a clear picture of the strengths and challenges of adopting
game-theoretic learning methods within the context of network systems, and further
to identify fruitful future research directions on both theoretical and applied studies.
1 Introduction
Multi-agent decision making over networks has recently attracted an exponentially growing
number of researchers from the systems and control community. The area has gained in-
creasing momentum in various fields including engineering, social sciences, economics, urban
∗Prepared for the IEEE Control Systems Magazine, as part of the special issue “Distributed Nash Equilibrium Seeking over Networks”.
†Corresponding author.
‡Department of Electrical and Computer Engineering, New York University, NY, USA; Email: {tl2636, gp1363, qz494}@nyu.edu.
§Department of Electrical and Computer Engineering & Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, IL, USA; Email: {basar1}@illinois.edu.
science, and artificial intelligence, as it serves as a prevalent framework for studying large
and complex systems, and has been widely applied in tackling many problems arising in
these fields, such as social networks analysis [1], smart grid management [2, 3], traffic con-
trol [4], wireless and communication networks [5–7], cybersecurity [8,9], as well as multi-agent
autonomous systems [10].
Due to the proliferation of advanced technologies and services in modern network appli-
cations, solving the decision-making problems in multi-agent networks calls for novel models
and approaches that can capture the following characteristics of emerging network systems
and the design of autonomous controls:
1. the heterogeneous nature of the underlying network, where multiple entities, repre-
sented by the set of nodes, aim to pursue their own goals with independent decision-
making capabilities;
2. the need for distributed or decentralized operation of the system, when the underlying
network is of a complex topological structure and is too large to be managed in a
centralized approach;
3. the need for creating network intelligence that is responsive to changes in the network
and the environment, as the system oftentimes operates in a dynamic or an adversarial
environment.
Game theory provides a natural set of tools and frameworks addressing these challenges,
and bridging networks to decision making. It entails development of mathematical models
that both qualitatively and quantitatively depict how the interactions of self-interested agents
with different information and rationalities can attain a global objective or lead to emergent
behaviors at a system level. Moreover, with the underlying network, game-theoretic models
capture the impact of the topology on the process of distributed decision making, where
agents plan their moves independently according to their goals and local information available
to them, such as their observations of their neighbors.
In addition to game-theoretic models over networks, learning theory is indispensable when
designing decentralized management mechanisms for network systems, in order to equip
networks with distributed intelligence. Through the combination of game-theoretic models
and associated learning schemes, such network intelligence allows heterogeneous agents to
interact strategically with each other and learn to respond to uncertainties, anomalies, and
disruptions, leading to desired collective behavior patterns over the network or an optimal
system-level performance. The key feature of such network intelligence is that even though
each agent’s own decision-making process is influenced by the others’ decisions, the agents
reach an equilibrium state, that is, a Nash equilibrium as we elucidate later, in an online and
decentralized manner. To equip networks with distributed intelligence, networked agents
should adapt themselves to the dynamic environment with limited and local observations
over a large network that may be unknown to them. Computationally, decentralized learning
scales efficiently to large and complex networks, and requires no global information regarding
the entire network, which is more practical compared with centralized control laws.
This paper articulates the confluence of networks, games and learning, which establishes
a theoretical underpinning for understanding multi-agent decision-making over networks.
Figure 1: The confluence of networks, games and learning. The combination of game-
theoretic modelling and learning theories leads to resilient and agile network controls for
various networked systems.
The main objectives of this paper are as follows:
1. to introduce noncooperative game models and the associated solution concepts, in particular the Nash equilibrium and its variants, for multi-agent decision making over networks;
2. to present the key analytical tool based on stochastic approximation and Lyapunov theory for studying learning processes in games, and pinpoint some extensively studied learning dynamics;
3. to introduce various multi-agent systems and network applications that can be addressed through game-theoretic learning.
We aim to provide the reader a clear picture of the strengths and challenges of adopting
novel game-theoretic learning methods within the context of network systems. Besides the
highlighted contents, we also provide the reader with references for further reading. In this
paper, complete-information games are the basis of the subject, for which we give a brief
introduction to both static and dynamic games. More comprehensive treatments on this
Table 1: Summary of frequently used notation.

N -- The set of players
i, j ∈ N -- Subscript indices denoting players
N(i) -- The set of neighbors of player i
A_i -- The set of actions available to player i
∆(A_i) -- The set of Borel probability measures over A_i (the probability simplex in R^{|A_i|} for a finite action set A_i)
s ∈ S -- State variable
u_i : ∏_{j∈N} A_j → R -- Player i's utility function
a_i ∈ A_i -- Action of player i
a_{-i} ∈ ∏_{j∈N, j≠i} A_j -- Joint actions of players other than i
a ∈ ∏_{i∈N} A_i -- Joint actions of all players
π_i ∈ ∆(A_i) -- Strategy of player i
π_{-i} ∈ ∏_{j∈N, j≠i} ∆(A_j) -- Joint strategy of players other than i
u_i(π_{-i}) or u_i ∈ R^{|A_i|} -- Player i's utility vector in finite games
D_i(a) -- Individual payoff gradient of player i
D(a) -- The concatenation of {D_i(a)}_{i∈N}
I_i^k -- Feedback of player i at time k
U_i^k ∈ R -- The payoff feedback received by player i at time k
û_i^k ∈ R^{|A_i|} -- Estimated utility vector at time k
Û_i^k ∈ R^{|A_i|} -- Estimator of u_i(π_{-i}^k) at time k
BR_i -- Best response mapping for player i
QR_ϵ -- Regularized best response or quantal response
topic as well as other game models, such as incomplete information games, can be found
in [11–13]. As most of the network topologies can be characterized by the structure of the
utility function of the game [1, 14], we do not articulate the influence of network topologies
on the game itself. Instead, we focus on its influence on the learning process in games,
where players’ information feedback depends on the network structures, and we present
representative network applications to showcase this influence. We refer the reader to [1, 14]
for further reading on games over various networks.
We structure our discussions as follows. In Section 2, we introduce non-cooperative games
and associated solution concepts, including Nash equilibrium and its variants, which capture
the strategic interactions of self-interested players. Then, in Section 3, we move to the main
focus of this paper: learning dynamics in games that converge to Nash equilibrium. Within
the stochastic approximation framework, a unified description of various dynamics is pro-
vided, and the analytical properties can be studied by ordinary differential equation (ODE)
methods. In Section 4, we discuss applications of these learning algorithms in networks,
leading to distributed and learning-based controls for network systems. Finally, Section 5
concludes the paper. For the reader’s convenience, we summarize the notations that are
frequently used in Table 1.
2 Noncooperative Game Theory
Game theory constitutes a mathematical framework with two main branches: noncooper-
ative game theory and cooperative game theory. Noncooperative game theory focuses on
the strategic decision-making process of independent entities or players that aim to optimize
their distinct objective functions, without any external enforcement of cooperative behav-
iors. The term noncooperative does not necessarily mean that players are not engaged in
cooperative behaviors. As a matter of fact, induced cooperative or coordinated behaviors
do arise in noncooperative circumstances, within the context of Nash equilibrium, a solution
concept of noncooperative games. However, such coordination is self-enforcing and arises
from decentralized decision-making processes of self-interested players, and will be further
discussed in Section 4, where we introduce game-theoretic methods for distributed machine
learning.
As briefly discussed above, noncooperative game theory naturally characterizes the decision-
making process of heterogeneous entities acting independently over networks, which is the
main focus of this paper. In the following, we introduce various game models and related
solution concepts, including Nash equilibrium and its variants. Generally speaking, a game
involves the following elements: decision makers (players); choices available to each player
(actions); knowledge that a player acquires for making decisions (information) and each
player’s preference ordering among its actions, affected also by others’ actions (utilities or
costs). Below we provide a short list of these concepts that will be further discussed and
explained in this section.
1. Players are participants in a game, where they compete for their own good. A player
can be an individual or encapsulation of a set of individuals.
2. Actions of a player, in the terminology of control theory, are the implementations of
the player’s control.
3. Information in games refers to the structure regarding the knowledge players acquire
about the game and its history when they decide on their moves. The information
structure can vary considerably. For some games, the information is static and does
not change during the play, while for other games, new information will be revealed
after players’ moves, as the “state” of the game, a concept to be elucidated later, is
determined by players’ actions during the play. In the latter case, the information is
dynamic. We shall address both types of games in this paper.
4. A strategy is a mapping that associates a player’s move with the information available
to him at the time when he decides which move to choose.
5. A utility or payoff is oftentimes a real-valued function capturing a player’s preference
ordering among possible outcomes of the game. Using the terminology in control
theory, this can also be viewed as a cost function for the player’s controls.
The above list refers to elements of games in relatively imprecise common language terms,
and more formal definitions are presented below. To facilitate this discussion, we categorize
noncooperative games into two main classes: static and dynamic games, based on the nature
of the information structure.
2.1 Static Games
Static games are one-shot, where players make decisions simultaneously based on the prior
information on the games, such as sets of players’ actions, and their payoffs. In such games,
each player’s knowledge about the game is static and does not evolve during the play. Math-
ematically speaking, a static noncooperative game is defined as follows.
Definition 1 (Static Games) A static game is defined by a triple G := ⟨N, (A_i)_{i∈N}, (u_i)_{i∈N}⟩, where
1. N denotes the set of players;
2. A_i, with some specified topology, denotes the set of actions available to player i ∈ N;
3. u_i : ∏_{j∈N} A_j → R defines player i's utility, and u_i(a_i, a_{-i}) gives the payoff of player i when taking action a_i, given other players' actions a_{-i} := (a_j)_{j∈N, j≠i}.
In static games, each player develops its strategy, a probability distribution over its action set, with the objective of maximizing the expected value of its own utility. If players have finite action sets, then such a static game is called a finite game. In this case, a strategy is a finite-dimensional vector in the probability simplex over the action set, that is, π_i ∈ ∆(A_i) := {π ∈ R^{|A_i|} : π(a) ≥ 0 for all a ∈ A_i, Σ_{a∈A_i} π(a) = 1}. If π_i is a unit vector e_a, a ∈ A_i, with the a-th entry being 1 and all others 0, then it is referred to as a pure strategy, selecting action a with probability 1; otherwise, it is a mixed strategy, choosing actions randomly
under the selected probability distribution. Similarly, for infinite action sets, the strategy is
defined as a Borel probability measure over the action set, with Dirac measure being the pure
strategy. By a possible abuse of notation, we denote the set of Borel probability measures
over Ai by ∆(Ai ). Unless specified otherwise, static games considered in this paper are all
assumed to be finite, where the player set and the action sets are all finite.
As a special case of games with infinite actions, the mixed extension of finite games is
introduced in the sequel. Consider a two-player finite game G = ⟨N , (Ai )i∈N , (ui )i∈N ⟩, where
N = {1, 2}, and the action sets are finite |Ai | < ∞, i ∈ N . Given the mixed strategies of
players, πi ∈ ∆(Ai ), the expected utility of player i is Ea1 ∼π1 ,a2 ∼π2 [ui (a1 , a2 )]. With a slight
abuse of notation, we denote this expected utility by ui (π1 , π2 ) := Ea1 ∼π1 ,a2 ∼π2 [ui (a1 , a2 )].
Then, studying the players’ strategic interactions is equivalent to considering the following
infinite game G∞ = ⟨N , (∆(Ai ))i∈N , (ui )i∈N ⟩, where ui denotes the expected utility. In G∞ ,
an action is a vector from the corresponding probability simplex, a convex and compact set
with a continuum of elements. Similar to the notations used in the definition, for the mixed
extension G∞ , we denote the joint action of players other than i by π−i := (πj )j∈N ,j̸=i .
Furthermore, we let ui (π−i ) ∈ R|Ai | be the utility vector of player i, given other players’
strategy profiles, π−i , whose a-th entry is defined as ui (π−i )(a) := ui (ea , π−i ). Due to the
definition of expectation, ui (πi , π−i ) can be expressed as an inner product ⟨πi , ui (π−i )⟩, which
will be frequently used later when discussing learning algorithms in finite games. This mixed
extension allows us to give a geometric characterization to Nash equilibria of finite games,
based on variational inequalities, as discussed in Section 2.3. Meanwhile, this inner product
expression connects learning theory in finite games with online linear optimization [15], where
the generic player’s decision variable is πi and the loss function specified by ⟨·, ui (π−i )⟩ is
linear in πi .
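To make the inner product expression concrete, the following minimal sketch (with hypothetical payoff numbers, not taken from the paper) computes a player's expected utility in a two-player finite game both as the bilinear form and as the inner product ⟨π_1, u_1(π_2)⟩:

```python
# A minimal sketch (hypothetical payoff numbers, not from the paper) of the inner-product
# form u_1(pi_1, pi_2) = <pi_1, u_1(pi_2)> in the mixed extension of a two-player finite game.
import numpy as np

U1 = np.array([[3.0, 0.0],    # player 1's payoff matrix: rows index player 1's actions,
               [5.0, 1.0]])   # columns index player 2's actions
pi1 = np.array([0.5, 0.5])    # mixed strategy of player 1
pi2 = np.array([0.2, 0.8])    # mixed strategy of player 2

u1_vec = U1 @ pi2             # utility vector u_1(pi_2): expected payoff of each pure action
expected_u1 = pi1 @ u1_vec    # inner product <pi_1, u_1(pi_2)>

# The same quantity written directly as the expectation over (a1, a2) ~ (pi1, pi2):
assert np.isclose(expected_u1, pi1 @ U1 @ pi2)
print(expected_u1)            # 1.2 for these numbers
```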
Even though widely applied in modeling behaviors of self-interested players, the static
game model is far from being sufficient to cover multi-agent decision making problems arising
in different fields. For instance, when playing poker games, new information will be revealed
during the game play, such as cards played at each round, based on which players can adjust
their moves. There are many games where players’ information about the game changes over
time during the play, which cannot be suitably described by static games. Therefore, we
must resort to another model for capturing the underlying dynamics.
3. a set Ai with some specified topology, defined for each i ∈ N , corresponding to the set
of actions or controls available to player i;
4. a set S with some specified topology, denoting the state space of the game, where s^k ∈ S, k ∈ N+, represents the state of the game at time k;
5. a transition kernel T : S × ∏_{i∈N} A_i → ∆(S), according to which the next state is sampled, that is, s^{k+1} ∼ T(s^k, a^k), where a^k = (a_1^k, . . . , a_N^k) is the N-tuple of actions at time k ∈ N+, and s^1 ∈ S has a given distribution;
6. an instantaneous payoff u_i : S × ∏_{i∈N} A_i → R, defined for each i ∈ N and k ∈ N+, determining the payoff u_i(s^k, a^k) received by player i at time k;
7. a discounting factor γ. Given {s^1, . . . , s^k, . . . ; a^1, . . . , a^k, . . .}, the discounted cumulative payoff for player i is Σ_{k=1}^∞ γ^k u_i(s^k, a^k).
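As a concrete illustration of the transition kernel, instantaneous payoff, and discounting factor above, the following sketch (with an assumed toy transition kernel and payoff table, not from the paper) rolls out a two-player finite Markov game and accumulates player 1's discounted payoff:

```python
# A minimal sketch (toy random numbers, not from the paper) of rolling out a finite Markov
# game and accumulating player 1's discounted payoff sum_{k>=1} gamma^k u_1(s^k, a^k).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, horizon = 3, 2, 0.9, 200

# Hypothetical transition kernel: T[s, a1, a2] is a distribution over next states.
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
# Hypothetical instantaneous payoff of player 1: u1[s, a1, a2].
u1 = rng.standard_normal((n_states, n_actions, n_actions))

s = 0                                  # initial state s^1 (a fixed distribution would also do)
discounted_payoff = 0.0
for k in range(1, horizon + 1):
    a1, a2 = rng.integers(n_actions), rng.integers(n_actions)   # e.g. uniform strategies
    discounted_payoff += gamma**k * u1[s, a1, a2]
    s = rng.choice(n_states, p=T[s, a1, a2])                    # s^{k+1} ~ T(s^k, a^k)

print(discounted_payoff)
```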
The above definition only characterizes one special case of dynamic games. Based on this
definition, we can derive many other game models. For example, we can make state transi-
tions independent of players’ actions as well as the current state, yielding a special case of
stochastic games, which will be further discussed in another paper in this special issue [17].
We can also consider continuous-time dynamic games where the transition is described by
a differential equation, leading to a differential game model. For an extensive coverage of
dynamic game models, we refer the reader to [11].
With the full observation of states, we can consider the stationary strategy πi : S →
∆(Ai ), by which players plan their moves only based on the current state s ∈ S. In this case,
we say the state variable s characterizes players’ knowledge of the game, since the actions,
utilities and next possible states are all determined by the current state. For dynamic games
under partial observation and/or non-Markovian transition, we refer the reader to [11], since
these topics are beyond the scope of this paper.
Given player 2's strategy π_2, a strategy π_1 ∈ BR_1(π_2) := arg max_{π∈∆(A_1)} {⟨π, u_1(π_2)⟩} is referred to as a best response of player 1 to player 2's strategy π_2, and BR_1(·) is called the best response set of player 1. Similarly, given player 1's strategy π_1, a best response
of player 2 is π2 ∈ BR2 (π1 ) := arg maxπ∈∆(A2 ) {⟨π, u2 (π1 )⟩}. Therefore, we can define a point-
to-set mapping BR : ∆(A_1) × ∆(A_2) → 2^{∆(A_1)×∆(A_2)}, which is the concatenation of BR_1 and BR_2. Given a joint strategy profile π = (π_1, π_2), BR(π) is defined as

BR(π) := BR_1(π_2) × BR_2(π_1).
If we can find π ∗ = (π1∗ , π2∗ ), a fixed point of this best-response mapping, that is, π ∗ ∈ BR(π ∗ ),
then when both players adopt the corresponding strategy in this profile, they could do no
better by unilaterally deviating from the current strategy. In other words, this fixed point
corresponds to an equilibrium outcome of the game, which further leads to the definition of
Nash equilibrium, which we introduce below for the general N -player game.
Definition 3 (Nash Equilibrium) For a static game ⟨N, (A_i)_{i∈N}, (u_i)_{i∈N}⟩, Nash equilibrium is a strategy profile π∗ = (π_i∗, π_{-i}∗) with the property that for all i ∈ N,

u_i(π_i∗, π_{-i}∗) ≥ u_i(π_i, π_{-i}∗),    (3)

where π_i is an arbitrary strategy of player i and π_{-i}∗ = (π_j∗)_{j∈N, j≠i} denotes the joint strategy
profile of the other players. If the inequality holds strictly for all πi ̸= πi∗ , then it is referred
to as a strict Nash equilibrium.
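As a quick illustration of Definition 3, the following sketch (using a hypothetical 2×2 coordination game, not an example from the paper) checks condition (3) at a pure strategy profile and confirms that it is a strict Nash equilibrium:

```python
# A minimal sketch (hypothetical 2x2 coordination game, not from the paper) verifying the
# Nash equilibrium condition (3) at the pure profile (e_0, e_0); the inequalities are strict,
# so this is a strict Nash equilibrium. For finite games it suffices to check pure deviations,
# since any mixed deviation is a convex combination of pure ones.
import numpy as np

U1 = np.array([[2.0, 0.0], [0.0, 1.0]])   # player 1's payoffs
U2 = U1.T                                  # player 2's payoffs (identical interests)
e0, e1 = np.eye(2)

u1_eq = e0 @ U1 @ e0                       # equilibrium payoff of player 1 (= 2)
u2_eq = e0 @ U2 @ e0                       # equilibrium payoff of player 2 (= 2)
print(e1 @ U1 @ e0 < u1_eq)                # True: player 1 strictly loses by deviating
print(e0 @ U2 @ e1 < u2_eq)                # True: player 2 strictly loses by deviating
```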
Note that the preceding definition naturally carries over to games with infinite action sets,
and we refer the reader to [11, Chapter 4] for more details. Furthermore, for infinite games,
if we impose some topological structures on the action sets and regularity conditions on the
utility functions, we can come up with a geometric interpretation of Nash equilibrium derived
from the inequality in (3). Toward that end, we consider a (static) game with compact and
convex action sets (Ai )i∈N and smooth concave utilities:
u_i(a_i, a_{-i}) is concave in a_i for all a_{-i} ∈ ∏_{j∈N, j≠i} A_j,  i ∈ N.
In such a game, each player's action set is a continuum, and the utility function is continuous; such games are referred to as continuous-kernel games or continuous games. In this case, a pure-strategy Nash equilibrium a∗ = (a_i∗, a_{-i}∗) ∈ ∏_{i∈N} A_i is defined by the following inequality:

u_i(a_i∗, a_{-i}∗) ≥ u_i(a_i, a_{-i}∗), for all a_i ∈ A_i, i ∈ N.    (4)

Further assuming that u_i(a_i, a_{-i}) is continuously differentiable in a_i ∈ A_i, for all a_{-i}, by the first-order condition, the Nash equilibrium in (4) can be characterized by

⟨D_i(a∗), a_i − a_i∗⟩ ≤ 0, for all a_i ∈ A_i, i ∈ N,
where Di (a) := ∇ai ui (ai , a−i ) denotes the individual payoff gradient of player i, and ∇ai ui (ai , a−i )
denotes differentiation with respect to the variable ai . By rewriting the inequality above in a
more compact form, we obtain the following variational characterization of Nash equilibrium
⟨D(a∗), a − a∗⟩ ≤ 0, for all a ∈ ∏_{i∈N} A_i,    (5)
where D(a) is the concatenation of {Di (a)}i∈N , that is, D(a) = (D1 (a), . . . , DN (a)). Geo-
metrically speaking, (5) states that for concave games, a∗ is a Nash equilibrium if and only if D(a∗) lies within the polar cone of the set ∏_{i∈N} A_i − a∗ := {a − a∗ | a ∈ ∏_{i∈N} A_i}, as shown in Figure 2.

Figure 2: Geometric illustration of the variational inequality (5): at a Nash equilibrium a∗, the payoff gradient D(a∗) lies in the polar cone PC(a∗) of the shifted action set ∏_{i∈N} A_i − a∗; TC(a∗) denotes the corresponding tangent cone.
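The variational characterization (5) can also be checked numerically. The sketch below uses a hypothetical two-player quadratic (Cournot-style) game, which is not an example from the paper, and verifies the Stampacchia-type inequality at the game's interior equilibrium:

```python
# A minimal sketch (hypothetical quadratic game, not from the paper) checking the
# variational characterization (5) at a Nash equilibrium of a two-player Cournot-style
# game with A_i = [0, 1] and u_i(a) = a_i * (1 - a_1 - a_2), which is concave in a_i.
import numpy as np

def D(a):
    """Individual payoff gradients D_i(a) = d u_i / d a_i, concatenated."""
    a1, a2 = a
    return np.array([1.0 - 2.0 * a1 - a2, 1.0 - a1 - 2.0 * a2])

a_star = np.array([1.0 / 3.0, 1.0 / 3.0])   # interior Nash equilibrium of this game

# Sample feasible joint actions and verify <D(a*), a - a*> <= 0 for all of them.
grid = np.linspace(0.0, 1.0, 21)
violations = [float(D(a_star) @ (np.array([x, y]) - a_star)) for x in grid for y in grid]
print(max(violations) <= 1e-12)   # True: a* satisfies the Stampacchia-type inequality (5)
```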
In addition to concave games, such variational inequality characterization has been stud-
ied in much broader contexts, such as monotone games [18], which bridges the gap between
the theory of monotone operators and Nash equilibrium seeking. For a detailed discussion,
we refer the reader to another paper in this special issue [19]. The variational inequality (5)
is referred to as the Stampacchia-type inequality in the literature [20], and a similar varia-
tional inequality of this type can also be derived in the context of the mixed extension. As a
special case of continuous games, the mixed extension of finite games also satisfies the regu-
larity conditions: the action spaces are probability simplex regions, which are compact and
convex, and the utility function, due to its linearity with respect to any player’s mixed strat-
egy, is naturally smooth and concave. Therefore, the mixed strategy Nash equilibrium can
be characterized by variational inequality as well. Thanks to the inner product expression
of the utility in the mixed extension, the individual payoff gradient is simply ui (π−i ), and
we denote the concatenation of {u_i(π_{-i})}_{i∈N} by u(π) := [u_1(π_{-1}), u_2(π_{-2}), . . . , u_N(π_{-N})],
which we also refer to as the joint utility vector under the strategy profile π. In the same
spirit of (5), a strategy profile π ∗ is Nash equilibrium of the underlying finite game if and
only if the following Stampacchia-type inequality holds
⟨u(π∗), π − π∗⟩ ≤ 0, for all π ∈ ∏_{i∈N} ∆(A_i).    (SVI)
As we will later see in Section 3.4.2, this variational characterization of Nash equilibrium
bridges the equilibrium concept of games and the equilibrium concept of dynamical systems
induced by learning algorithms.
In the same spirit of (3), Nash equilibrium in dynamic games can also be defined ac-
cordingly. For Markov games, given players’ stationary strategy profile π, the cumulative
expected utility of player i, starting from the initial state s1 = s, is
V_i^π(s) := E_{s^{k+1}∼T, a^k∼π}[ Σ_{k=1}^∞ γ^k u_i(s^k, a^k) | s^1 = s ],    (6)
which is referred to as state-value function in Markov decision process [21]. If we view Viπ as a
function of the strategy profile, following (3), we can define Nash equilibrium for the Markov
game, where the inequality holds for every state. In other words, regardless of previous play,
as long as players follow π ∗ from the current state s, they achieve the best outcome for the
rest of the game, and no player has any incentive to deviate from the strategy dictated by
π ∗ . Hence, this kind of Nash equilibrium is referred to as subgame perfect Nash equilibrium
(SPNE), which is widely used in the study of dynamic games [22, 23].
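For concreteness, one standard way to write the resulting equilibrium condition in terms of the state-value functions (a sketch in the notation above, not quoted verbatim from the paper) is:

```latex
% Sketch: (subgame perfect) Nash equilibrium condition for a Markov game with stationary
% strategies, stated via the state-value functions V_i^pi defined in (6).
\begin{equation*}
  V_i^{(\pi_i^{*},\,\pi_{-i}^{*})}(s) \;\geq\; V_i^{(\pi_i,\,\pi_{-i}^{*})}(s),
  \qquad \forall\, \pi_i,\ \forall\, s \in S,\ \forall\, i \in \mathcal{N}.
\end{equation*}
```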
The Nash equilibrium serves as a building block for noncooperative games. One of its
major advantages is that it characterizes a stable state of a noncooperative game, in which
no rational player has the incentive to move unilaterally. This stability idea will be further
discussed when we focus on learning in games, which relates stability theory of differential
equations to the convergence of learning algorithms in Nash equilibrium seeking.
3 Learning in Games
Learning in games refers to a long-run non-equilibrium process of learning, adaptation,
and/or imitation that leads to some equilibrium [24]. Different from pure equilibrium anal-
ysis based on the definition, learning in games accounts for how players behave adaptively
during repeated game play under uncertainties and partial observations. From a computational standpoint, computing a Nash equilibrium directly from equilibrium analysis is challenging due to its computational complexity [25], and such analysis hardly accounts for the decision-making process in practice, where players have limited computational power and information. Hence, learning models are needed to describe how less than fully rational players behave in order to reach an equilibrium. In short, equilibrium seeking and computation motivate learning in games [23].
If we view the learning process as a dynamical system, then the learning model can
predict how each player adjusts its behavior in response to other players over time to search
for strategies that will lead to higher payoffs. From this perspective, a Nash equilibrium can
also be interpreted as the steady state of the learning process, which serves as a prediction of
the limiting behavior of the dynamical system induced by the learning model. This viewpoint
has been widely adopted in the study of population biology and evolutionary game theory,
as we shall see more clearly when we later discuss reinforcement learning and replicator dynamics [26].
In this section, various learning dynamics are presented in the context of infinitely re-
peated games for Nash equilibrium seeking. We consider a number of players repeatedly
playing the game ⟨N , (Ai )i∈N , (ui )i∈N ⟩ infinitely many times. At time k, players determine
their moves based on their observations up to time k − 1. Then, they receive feedback
from the environment, which provides information on the past actions. For example, in
finite games, based on the information available to it, player i constructs a mixed strategy
πik ∈ ∆(Ai ), from which it samples an action aki and implements it. Then it will receive
a payoff feedback related to ui (aki , ak−i ), which evaluates the performance of aki and helps
the player shape its strategy for future plays. In such a repeated game, the amount of in-
formation that players acquire in repeated plays directly determines how players plan their
moves at each round and further influences the resulting learning dynamics. Besides being of
theoretical importance, the information feedback in the learning process, such as players’ ob-
servations of their opponents’ moves, is also of vital importance in designing learning-based
methods for solving network problems. As we shall see more clearly in Section 4, in many
network applications, networked agents only observe their surroundings, without any access
to the global information regarding the whole network. Therefore, due to its significance
in learning processes, we first present existing feedback structures that are of wide use in
learning, before moving to the details of learning algorithms.
indicating the completeness of the feedback in both the temporal and the spatial sense. Furthermore, we can also consider the noisy feedback of payoffs, U_i^k, defined as

U_i^k = u_i(a_i^k, a_{-i}^k) + ξ_i^k,

where ξ_i^k is a zero-mean martingale noise process with finite second moment, that is, E[ξ_i^k | F^{k−1}] = 0 and E[(ξ_i^k)^2 | F^{k−1}] is bounded by a constant, where the expectation is taken with respect to the σ-field F^{k−1} generated by the history of play up to time k − 1. Simply put, the noisy feedback
Uik is a conditionally unbiased estimator of uki with respect to the history, which is a standing
assumption when dealing with the convergence of learning dynamics in games. For noisy
feedback in general, or equivalently ξik being a generic random variable, the discussion will be
carried out in a different context. In that case, a system state should be introduced, which
accounts for the uncertainty in the environment, and the learning problem becomes Nash
equilibrium seeking in stochastic games (see Definition 2). For more detailed discussions, we
refer the reader to another paper in this special issue [17].
The perfect global feedback is of limited use in practice when designing learning algo-
rithms, as the global information is difficult or even impossible to acquire for individuals
in large-scale network systems. For example, in distributed or decentralized learning over
heterogeneous networks, players may have no access to others’ utilities due to physical limita-
tions. Therefore, we are interested in the scenario where players only have direct or indirect
access to their own utilities as well as their neighbors’, and hence players’ feedback can be
dependent on the topological structure of the underlying network that connects them.
Consider a repeated game over a graph G := (N , E), where N = {1, 2, . . . , N } is the
set of nodes, representing the players in the game, who are connected via the edges in
E = {(i, j)|i, j are connected}. To simplify the exposition, we assume that the graph is
undirected. Note that the direction of the edges does not affect our discussion as long as the
neighborhood is properly defined. For example, in a directed graph, when in-neighbors or
out-neighbors specify to which player(s) the player in question can pass information, then
the following characterizations of feedback structures still apply. For a more comprehensive
treatment of games over networks, we refer the reader to [14].
Each player is allowed to exchange payoff feedback with its neighbors through the edges
and observe their actions during the repeated play, whereas the information regarding the
rest is hidden from him. In this case, the feedback structure for player i is

I_i^k = {{u_j^{1:k}}_{j∈{i}∪N(i)}, {a_j^{1:k}}_{j∈{i}∪N(i)}}.

Note that the player's feedback regarding the payoffs and actions may not be consistent. For example, in a multi-agent robotic system where only the sensor network is operational, each agent can only observe its neighbors' movements through sensors. In this case, without any information of others' utilities, the information feedback of agent i reduces to I_i^k = {{u_i^{1:k}}, {a_j^{1:k}}_{j∈{i}∪N(i)}}. To sum up, if the players can only receive feedback from their
neighbors, then players’ feedback structures are related to the underlying topology, leading
to what is referred to as the local feedback. In accordance with this, the extreme case of
local feedback is one where the player is isolated in the network, and no information other
than its own payoff feedback and actions is available to it. We refer to this extreme case as
individual feedback, which is a typical information feedback considered in fully decentralized
learning and will be further elaborated on when discussing specific learning dynamics later
in this section.
In addition to the refinements from the spatial side, we can also consider feedback with
various temporal structures. If the player has perfect recall of previous plays, the resulting
feedback is said to be perfect, and those we have introduced above all fall within this class.
Otherwise, players have access to imperfect feedback, and we discuss two common cases of
imperfect information feedback in the following, namely windowed and delayed feedback.
For the sake of simplicity, we use perfect feedback I_i^k = {u_i^{1:k}, a_i^{1:k}} as a baseline to illustrate that different missing parts of I_i^k lead to different kinds of imperfect feedback. If the head of u_i^{1:k} and/or a_i^{1:k} is not available to the player, that is, there exists a window 0 < m < k such that the player only recalls u_i^{(k−m):k}, a_i^{(k−m):k}, then the corresponding feedback I_i^{(k−m):k} = {u_i^{(k−m):k}, a_i^{(k−m):k}} is referred to as the windowed feedback with a window size m. Similarly, if the tail of u_i^{1:k} and/or a_i^{1:k} is not available, that is, the player only recalls u_i^{1:(k−m)}, a_i^{1:(k−m)}, then the imperfect information feedback is I_i^{1:(k−m)} = {u_i^{1:(k−m)}, a_i^{1:(k−m)}}, which is called m-step delayed feedback.
For learning in games, each player learns to select actions by updating the strategy based
on the available feedback at each round. To describe this in mathematical terms, let F_i^k denote the strategy learning policy of player i. The learning policy produces a new strategy π_i^{k+1} for the next play according to

π_i^{k+1} = (1 − λ_i^k) π_i^k + λ_i^k F_i^k(I_i^k),    (7)
where λki is the learning rate, indicating the player’s capabilities of information retrieval.
Different feedback structures lead to different learning dynamics in repeated games. Under
the global or the local feedback structure, each player’s feedback is influenced by its oppo-
nents’ actions and/or payoffs, which makes the players’ learning processes coupled, as shown
in Figure 3.
Figure 3: Player’s strategy learning with the corresponding feedback. Under the global or
the local feedback structure, players’ learning processes are coupled, as their feedback is
influenced by their opponents’ moves. By contrast, players learn to play the game indepen-
dently under the individual feedback.
In the case of fully decentralized learning under individual information feedback, players
learn to play the game independently, and such a learning process is said to be uncoupled.
Uncoupled learning processes are of great significance in both theoretical studies [27] and
practical applications. Theoretically, learning with such limited information feedback is
much more transferable in the sense that learning algorithms under this feedback also apply
to online optimization problems, where the online decision-making process is viewed as a
repeated game played between a player and the nature [15].
Considering its theoretical importance, we focus on learning with individual feedback
in the sequel, and we refer the reader to [28] for a survey on learning methods under other
kinds of feedback. We first present reinforcement learning for finite games, where the learning
algorithms are characterized into two main classes, due to their distinct nature in exploration.
Then, we proceed to gradient play for infinite games, and elaborate on its connection to
reinforcement learning. The convergence results of presented algorithms are discussed in
Section 3.4 based on stochastic approximation [29, 30] and Lyapunov stability theory.
it can construct an estimator ûki ∈ R|Ai | based on Iik to evaluate actions a ∈ Ai . By using
this estimator, the player can compare its actions and choose the one that can achieve higher
payoffs in the next round. In mathematical terms, the estimator (score function) is given by
the following discrete-time dynamical system
û_i^{k+1} = (1 − µ_i^k) û_i^k + µ_i^k G_i^k(π_i^k, û_i^k, U_i^k, a_i^k),    (8)
where Gki : ∆(Ai ) × R|Ai | × R × Ai → R|Ai | is the learning policy for utility learning, πik is
the policy employed at time k, and µki is the learning rate. Based on the score function, the
player can modify its strategy accordingly in the sense that better actions shall be played
more frequently in the future. With slight abuse of notations, the strategy update is
π_i^{k+1} = (1 − λ_i^k) π_i^k + λ_i^k F_i^k(π_i^k, û_i^{k+1}, U_i^k, a_i^k),    (9)
where Fik : ∆(Ai ) × R|Ai | × R × Ai → ∆(Ai ) is the learning policy for strategy learning,
yielding a new policy for the next play. Compared with (7), the above discrete-time systems
(8) (9) explicitly show how the feedback shapes the player’s future play. According to (8),
the player recursively updates its estimate of the utility function based on the feedback it
receives after playing πik , and then the player determines its move in the next round, following
(9). Intuitively, we can view (π_i^k, û_i^{k+1}) as the information extracted from I_i^k for updating
the player’s strategy.
In reinforcement learning, the choice mapping plays an important role in achieving the
balance between exploitation and exploration. On one hand, the player would like to choose
the best action that is supposed to incur the highest payoff based on the score function.
However, this pure exploitation oftentimes leads to myopic behaviors, as the score function
may return a poor estimate of the utility function at the beginning of the learning process.
Hence, to gather more information for a better estimator, the player also needs some ex-
perimental moves for exploration, where suboptimal actions are implemented. To sum up,
the trade-off between exploitation and exploration is of vital importance to the success of
reinforcement learning, and it depends on the construction of the choice mapping. Different
choice mappings result in different reinforcement learning algorithms. Based on their distinct
natures in exploration, the algorithms can be categorized into two main classes: exploitative
reinforcement learning and exploratory reinforcement learning.
Recall that in the strategy learning (9), the next strategy produced by the corresponding
choice mapping is
π_i^{k+1} = (1 − λ_i^k) π_i^k + λ_i^k F_i^k(π_i^k, û_i^{k+1}, U_i^k, a_i^k),
where (1−λki )πik is referred to as the cognitive inertia or simply inertia, describing the player’s
tendency to repeat previous choices independently of the outcome. When determining its
next move πik+1 , the player takes into account both its previous strategy πik and the increment
update using the strategy learning policy Fik . Therefore, players’ exploration at (k + 1)-th
round stems either from this inertia or the strategy learning policy Fik . The former is called
passive exploration, as it relies on the player’s tendency to repeat previous choices, while the
latter one is referred to as active exploration, as the player deliberately tries actions based
on what he has learned from previous plays.
As the new strategy is a convex combination of the inertia term πik and the learned
incremental update F_i^k(π_i^k, û_i^{k+1}, U_i^k, a_i^k), there is no clear-cut boundary between passive and
active exploration. In fact, reinforcement learning is a continuum of learning algorithms.
In the following, we illustrate such a continuum by three prominent learning schemes. The
first one is the best response dynamics, located on the left endpoint, which is an example
of exploitative reinforcement learning. Solely relying on the inertia for passive exploration,
the best response dynamics adopts a purely exploitative learning policy: the best response
mapping in (1). In contrast to this exploitative scheme, we present dual averaging as
an example of exploratory reinforcement learning, which only leverages the learning policy
for exploring suboptimal actions without any cognitive inertia. In between, there lies the
smoothed best response dynamics, where both the inertia and the strategy learning policy
come into play for achieving the balance between exploration and exploitation.
In general, the best response mapping is a point-to-set mapping, and to analyze the as-
sociated learning dynamics, differential inclusion theory [30] is needed, which makes the convergence analysis more involved, as discussed in Section 3.4.2.
Under the noisy feedback I_i^k = {U_i^{1:k}, a_i^{1:k}}, the score function of player i is the estimated utility û_i^k, which is updated according to the following moving average scheme [32]:

û_i^{k+1}(a) = (1 − µ_i^k) û_i^k(a) + µ_i^k (1{a = a_i^k} / π_i^k(a)) U_i^k,   a ∈ A_i,    (11)

where 1{·} is an indicator function. Note that in (11), the importance sampling technique, which is common in bandit algorithms [15], is utilized to construct an unbiased estimator of u_i(π_{-i}^k). To see this, define a vector Û_i^k ∈ R^{|A_i|}, whose a-th entry is Û_i^k(a) := 1{a = a_i^k} U_i^k / π_i^k(a); we then obtain E[Û_i^k(a) | F^{k−1}] = u_i(a, π_{-i}^k). Hence, (11) can be rewritten as

û_i^{k+1} = (1 − µ_i^k) û_i^k + µ_i^k Û_i^k,    (12)

and û_i^{k+1}(a) gives the averaged payoff incurred by action a in the first k rounds. This importance
sampling technique can be viewed as compensating for the fact that actions played with
a low probability do not receive frequent updates of the corresponding estimates, so that
when they are played, any estimation error Uik − ûki (aki ) must have greater influence on the
estimated value than if frequent updates occur. We refer the reader to [15, 24] for more
details on importance sampling and its use in learning processes.
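The following minimal sketch (with hypothetical payoff values, not from the paper) implements the importance-sampled estimator and the moving-average update (11)-(12) for a single player facing a fixed opponent strategy:

```python
# A minimal sketch (hypothetical payoffs, not from the paper) of the importance-sampled
# estimator and the moving-average update (11)-(12): only the played action's entry is
# nonzero in the estimator, rescaled by 1/pi_i^k(a) to keep it conditionally unbiased.
import numpy as np

rng = np.random.default_rng(1)
n_actions = 2
u_hat = np.zeros(n_actions)            # score function \hat{u}_i^k
pi = np.array([0.3, 0.7])              # current mixed strategy pi_i^k (held fixed for illustration)
true_u = np.array([1.0, 0.5])          # hypothetical expected payoffs u_i(a, pi_{-i}^k)

for k in range(1, 5001):
    mu = 1.0 / k                                   # learning rate mu_i^k
    a = rng.choice(n_actions, p=pi)                # sampled action a_i^k
    U = true_u[a] + 0.1 * rng.standard_normal()    # noisy payoff feedback U_i^k
    U_hat = np.zeros(n_actions)
    U_hat[a] = U / pi[a]                           # estimator \hat{U}_i^k of u_i(., pi_{-i}^k)
    u_hat = (1.0 - mu) * u_hat + mu * U_hat        # moving-average update (12)

print(u_hat)   # close to true_u = [1.0, 0.5] on average
```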
With a slight abuse of the notation of the best response mapping in (2), we define the corresponding best response under the noisy feedback as BR_i(û_i^k) := arg max_{π∈∆(A_i)} {⟨π, û_i^k⟩}.
The resulting dynamical system under the noisy feedback is a coupled system, as shown below:

û_i^{k+1} = (1 − µ_i^k) û_i^k + µ_i^k Û_i^k,
π_i^{k+1} ∈ (1 − λ_i^k) π_i^k + λ_i^k BR_i(û_i^k).    (BR-d)
Originally proposed as a computational method for Nash equilibrium seeking [32, 33], the
best response dynamics (BR-d) is directly built upon the best response idea and has been
widely applied to evolutionary game problems [34]. One prominent example of best response
dynamics is fictitious play [35], where a player’s empirical play follows (BR-d); and more
details are included in Appendix A. As shown above, best response dynamics adopts passive
exploration, and the best response mapping BRi (·) encourages greedy actions that might be
myopic. As a result, exploitative reinforcement learning may fail to converge [24, 36].
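To illustrate the structure of (BR-d), the sketch below runs a best-response-style update on a hypothetical 2×2 coordination game; for clarity it uses exact utility vectors in place of the noisy score functions, so it should be read as a schematic rather than the paper's exact algorithm:

```python
# A schematic sketch (hypothetical 2x2 coordination game, not the paper's algorithm) of the
# best-response-style strategy update in (BR-d); exact utility vectors replace the noisy
# score functions to keep the example short.
import numpy as np

U1 = np.array([[2.0, 0.0], [0.0, 1.0]])   # player 1's payoffs
U2 = U1.T                                  # player 2's payoffs (identical interests)
pi1, pi2 = np.array([0.9, 0.1]), np.array([0.4, 0.6])

def best_response(u_vec):
    br = np.zeros_like(u_vec)
    br[int(np.argmax(u_vec))] = 1.0       # pure strategy on a payoff-maximizing action
    return br

for k in range(1, 201):
    lam = 1.0 / (k + 1)                   # decreasing learning rate lambda_i^k
    pi1 = (1 - lam) * pi1 + lam * best_response(U1 @ pi2)   # inertia + best response
    pi2 = (1 - lam) * pi2 + lam * best_response(U2 @ pi1)

print(pi1, pi2)   # both concentrate on action 0, the payoff-dominant equilibrium
```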
To promote active exploration, the best response can be regularized, leading to the quantal response mapping

QR_ϵ(u_i) := arg max_{π∈∆(A_i)} {⟨π, u_i⟩ − ϵ h(π)},    (15)

where h(·) is a penalty function or regularizer and ϵ is the regularization parameter. According to [38], a proper regularizer h(·) defined on the probability simplex should be continuous over the simplex and smooth on the relative interior of every face of the simplex. Besides, h should be a strongly convex function, and these assumptions ensure that QR_ϵ(·) always returns a unique maximizer. The mapping QR_ϵ is referred to as a quantal response mapping [39], which allows players to choose suboptimal actions with positive probability. To see how this regularization contributes to active exploration, consider the entropy regularizer h(x) = Σ_i x_i log x_i. In this case, QR_ϵ is

QR_ϵ(u_i)(a) = exp(u_i(a)/ϵ) / Σ_{a′∈A_i} exp(u_i(a′)/ϵ),   a ∈ A_i,    (16)

which is also known as the Boltzmann-Gibbs strategy mapping [40] or the soft-max function parameterized by ϵ > 0. On the one hand, the Boltzmann-Gibbs mapping produces a
strategy that assigns more weight to the actions leading to higher payoffs, that is, the larger
ui (a) = ui (a, π−i ) is, the larger QRϵ (ui )(a) becomes. On the other hand, it always retains
positive probabilities for every action, when ϵ > 0. Note that QRϵ can induce different levels
of exploration by adjusting the parameter ϵ. When ϵ tends to 0, the strategy (16) simply
returns the action that yields the highest payoff, implying that QRϵ reduces to the best
response mapping BRi (·) in (2). As ϵ gets larger, 1/ϵ tends to 0, and the strategy does not
distinguish among actions, leading to equal weights for all actions.
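The following sketch (with a hypothetical utility vector) shows the Boltzmann-Gibbs quantal response in (16) and how the exploration parameter ϵ interpolates between the best response and the uniform strategy:

```python
# A minimal sketch (hypothetical utility vector) of the Boltzmann-Gibbs quantal response in
# (16): small eps approximates the best response, large eps approaches the uniform strategy.
import numpy as np

def quantal_response(u, eps):
    z = u / eps
    z = z - z.max()                # numerical stabilization; does not change the result
    p = np.exp(z)
    return p / p.sum()

u = np.array([1.0, 0.5, 0.1])      # hypothetical utility vector u_i
for eps in (1e-3, 0.5, 100.0):
    print(eps, quantal_response(u, eps))
```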
Similar to the previous argument, with the noisy feedback, we replace u_i by the estimator û_i^k, and the definition of the quantal response mapping is modified accordingly as

QR_ϵ(û_i^k) := arg max_{π∈∆(A_i)} {⟨π, û_i^k⟩ − ϵ h(π)}.
Due to the active exploration brought up by QRϵ , we can consider an inertia-free reinforce-
ment learning scheme, where the choice map is simply the strategy learning policy QRϵ . The
corresponding strategy learning scheme is then π_i^{k+1} = QR_ϵ(û_i^{k+1}), where the score function û_i^k is updated according to the following [41]:

û_i^{k+1} = û_i^k + µ_i^k Û_i^k.    (17)
To recap, the learning algorithm operates in the following fashion: at each time k, an
unbiased estimator Ûki is constructed as introduced in (11), using importance sampling, and
the score function is updated according to (17). Then, the next strategy is produced by the
mapping QR_ϵ, acting on the score function û_i^{k+1}, as shown below:

û_i^{k+1} = û_i^k + µ_i^k Û_i^k,
π_i^{k+1} = QR_ϵ(û_i^{k+1}).    (DA-d)
(DA-d) is also referred to as dual averaging, pioneered by Nesterov [41], which was originally
proposed as a variant of gradient methods for solving convex programming problems. We
elucidate the term “dual averaging” later when we discuss the relation between dual averaging
and gradient play, where we demonstrate that (DA-d) can be viewed as a gradient-based
algorithm in finite games with ûki being the gradient. Finally, as a remark, we note that in
(DA-d), the score function is updated in a manner different than in best response dynamics
(BR-d). However, this is merely a matter of presentation, and by selecting a proper ϵ, the
moving average scheme (12) is essentially the same as the discounted accumulation (17),
for which we refer the reader to [41, 42]. By adopting the discounted accumulation (17), we
later can draw a connection between dual averaging and gradient play.
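As an illustration of (DA-d), the sketch below runs dual averaging with the entropy regularizer on matching pennies (a hypothetical example, not from the paper); for clarity, exact utility vectors replace the importance-sampled estimators Û_i^k, and the time-averaged strategies are reported:

```python
# A minimal sketch (matching pennies, not an example from the paper) of dual averaging
# (DA-d) with the entropy regularizer: scores accumulate as in (17) and are mapped to
# strategies by the soft-max quantal response (16). Exact utility vectors replace the
# importance-sampled estimators to keep the example short.
import numpy as np

U1 = np.array([[1.0, -1.0], [-1.0, 1.0]])   # player 1's payoffs
U2 = -U1.T                                   # player 2's payoffs (zero-sum)
eps = 1.0

def qr(score):                               # Boltzmann-Gibbs quantal response
    z = score / eps
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

score1, score2 = np.array([0.5, 0.0]), np.array([0.0, 0.3])   # asymmetric initial scores
avg1, avg2 = np.zeros(2), np.zeros(2)
K = 5000
for k in range(1, K + 1):
    mu = 1.0 / np.sqrt(k)                    # learning rate mu^k
    pi1, pi2 = qr(score1), qr(score2)
    score1 = score1 + mu * (U1 @ pi2)        # score update (17)
    score2 = score2 + mu * (U2 @ pi1)
    avg1 += pi1
    avg2 += pi2

print(avg1 / K, avg2 / K)   # time-averaged strategies close to the mixed equilibrium (1/2, 1/2)
```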
At first glance, the discrete-time system (DA-d) does not depict how π_i evolves in ∆(A_i), and it is not straightforward to tell how the good actions yielding higher payoffs are “reinforced” in the sense that the probabilities of choosing them increase as the learning process proceeds. In Appendix B, we show that, when choosing the entropy regularization,
(DA-d) is equivalent to the replicator dynamics, one of the well-known evolutionary dynamics
[43–45], which explicitly displays a gradual adjustment of strategies based on the quality of
each action. Meanwhile, with an example of population games, we show that this connection
brings learning in games to the broader context of evolutionary game theory [34, 44].
As we have mentioned, reinforcement learning is a continuum of learning algorithms, and
the best response dynamics (BR-d) and dual averaging (DA-d) are the two endpoints of the
continuum. Naturally, we can consider reinforcement learning methods with a blend of both
passive and active exploration, where the exploration stems from both the inertia term and
the strategy learning policy, as we present in the following.
Instead of choosing actions greedily, we replace the best response BR_i(·) in (14) by QR_ϵ(·), the quantal response for active exploration, and then we obtain the following strategy learning scheme [24]:

π_i^{k+1} = (1 − λ_i^k) π_i^k + λ_i^k QR_ϵ(û_i^k).
Similar to the best response dynamics in (BR-d), if utility learning follows the moving average
scheme in (11), the resulting reinforcement learning has the following discrete-time learning
dynamics
û_i^{k+1} = (1 − µ_i^k) û_i^k + µ_i^k Û_i^k,
π_i^{k+1} = (1 − λ_i^k) π_i^k + λ_i^k QR_ϵ(û_i^k).    (SBR-d)
Figure 4: Relationships of reinforcement learning algorithms. For 0 < λ < 1 and ϵ > 0, we
obtain the exploratory reinforcement learning: smoothed best response dynamics (SBR-d),
where exploration arises from both the inertia and the learning policy. If the active explo-
ration vanishes as ϵ goes to zero, smoothed best response reduces to best response dynamics
(BR-d), an example of exploitative reinforcement learning. By contrast, we obtain dual averaging (DA-d) if λ tends to 1. Finally, if ϵ goes to zero while λ tends to 1, players always
choose their actions greedily according to follow-the-leader policy.
learning (BR-d)(SBR-d) and dual averaging (DA-d) is through the corresponding continuous-
time learning dynamics in Section 3.4.1.
Even though (DA-d) is not an actor-critic learning, its trajectory is closely related to
that of (BR-d)(SBR-d). Intuitively speaking, dual averaging only differs from the smoothed best response in that (DA-d) does not include an inertia term, as the weight on the previous strategy is zero.
Hence, πik in (SBR-d) can be seen as the moving average of QRϵ (ûki ) in (DA-d). Therefore,
it is reasonable to expect that the time average of the trajectory produced by (DA-d) is
related to the one produced by the smoothed best response. This intuition has been verified
in [38, 51], where it has been shown that the time averaged trajectory of (DA-d) follows
(SBR-d) with a time-dependent perturbation ϵ(t).
Apart from the difference in the learning rates, learning algorithms also display distinct
asymptotic behavior due to the difference in the exploration parameter. The exploration
parameter ϵ has less drastic consequence under (DA-d) than under the actor-critic learning
(BR-d)(SBR-d). As observed in [38], adding a positive ϵ is equivalent to rescaling the
regularizer, that is, replacing h(·) with ϵh(·). As long as ϵ > 0, the regularization ϵh(·) is
still proper (see (15) and the following discussion). This implies that even though the choice
of ϵ affects the speed at which (DA-d) evolves, the qualitative results remain the same. We
refer the reader to [38, 52] for a detailed discussion. When ϵ = 0, there is neither exploration nor inertia for dual averaging, and in this case, players always choose their actions greedily according to the best response mapping

π_i^{k+1} ∈ arg max_{π∈∆(A_i)} ⟨π, û_i^k⟩,    (FTL)
where ûki is the score function of player i, based on its history of play up to round k, and it
can be updated following (11) or (17). In the online learning literature [15], (FTL) is known
as follow-the-leader (FTL) policy, which can also be obtained by eliminating the inertia term
in the best response dynamics (BR-d). Due to lack of exploration, (FTL) is too aggressive
and can be exploited by the adversary, resulting in a positive, non-diminishing regret [15].
The regret is a measure of the performance gap between the cumulative payoffs of current
policy (FTL) and that of the best policy in hindsight.
The exploration parameter plays a more important role in the actor-critic learning which
balances exploration and exploitation [31]. The smoothed best response (SBR-d), which is a
perturbed version of the best response, can only use the regularization ϵh(·) for encouraging
active exploration. Thanks to the positive exploration parameter, the smoothed best re-
sponse (SBR-d) enjoys an ϵ-no-regret property, a weak form of external consistency studied
in [51,53], which is desired in an adversarial environment [15]. In contrast, the best response
dynamics (BR-d), due to the myopic nature of the best response mapping (2), does not
possess similar properties.
a_i^{k+1} = proj_{A_i}[a_i^k + µ_i^k D_i^k] := arg min_{a∈A_i} ∥a_i^k + µ_i^k D_i^k − a∥_2^2,    (GD)
where projAi (·) is the Euclidean projection operator, and (GD) is called online gradient
descent or projected gradient descent [42]. One extensively studied variant of (GD) [42, 59]
Figure 5: Illustration of the difference between (GD) and (LGD). $a_i^k(\mathrm{GD})$ and $a_i^k(\mathrm{LGD})$ denote, respectively, the iterates generated by (GD) and (LGD). (LGD) first aggregates the gradient steps, and then projects the aggregate onto the primal space to generate a new iterate.
One extensively studied variant of (GD) [42, 59] is
$$Y_i^{k+1} = Y_i^k + \mu_i^k D_i^k, \qquad a_i^{k+1} = \operatorname{proj}_{A_i}\!\big(Y_i^{k+1}\big), \qquad \text{(LGD)}$$
where Yik is an auxiliary variable that aggregates the gradient steps. Such an algorithm
is referred to as the lazy gradient descent (LGD) [41], since the algorithm aggregates the
gradient steps “lazily”, without transporting them to the action space as (GD) does. We
illustrate the difference between the two algorithms in Figure 5. We note that, as both are based on the gradient descent idea, (LGD) and (GD) share the same asymptotic behavior [15], and the two coincide when $A_i$ is an affine subspace of $\mathbb{R}^n$.
Different from a purely primal-based algorithm, such as (GD), where the trajectory of
the algorithm only evolves in the primal space (the action space), (LGD) is a primal-dual
scheme, and the interplay between the primal variables (the actions $a_i^k$) and the dual variables (the gradients $D_i(a^k)$) is of great significance. The main idea of (LGD) is as follows. At the k-th round, each player computes the gradient $D_i(a^k)$ based on the knowledge of its utility function and observations of the opponents' moves. Subsequently, the player takes a step along this gradient in the dual space (where gradients live) and "mirrors" the output back to the primal space (the action space) using the Euclidean projection.
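The following sketch (again a toy Python illustration of ours, not code from the references) contrasts the two updates when $A_i$ is taken to be the probability simplex; `proj_simplex` is a standard Euclidean projection routine, and the payoff gradient is a placeholder supplied by the caller. The only difference is where the running state lives: (GD) keeps it in the action space, while (LGD) keeps it in the dual space.

```python
import numpy as np

def proj_simplex(y):
    """Euclidean projection of y onto the probability simplex."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, y.size + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

def gd_step(a_i, D_i, mu_k):
    """(GD): step in the primal space, then project."""
    return proj_simplex(a_i + mu_k * D_i)

def lgd_step(Y_i, D_i, mu_k):
    """(LGD): aggregate the gradient steps 'lazily' in the dual variable Y_i,
    and only then mirror the aggregate back to the action space."""
    Y_next = Y_i + mu_k * D_i
    return Y_next, proj_simplex(Y_next)
```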
Gradient based learning algorithms are further investigated in another paper of this
special issue in the context of generalized Nash equilibrium seeking [19]. In the following,
we present a generalization of (LGD): mirror descent [41, 59]. Starting with some arbitrary
initialization $Y_i^1$, the mirror descent scheme can be described via the recursion
$$Y_i^{k+1} = Y_i^k + \mu_i^k D_i^k, \qquad a_i^{k+1} = \mathrm{QR}_\epsilon\!\big(Y_i^{k+1}\big), \qquad \text{(MD)}$$
where QRϵ is the quantal response mapping in the context of the continuous game, defined
as
$$\mathrm{QR}_\epsilon(Y) = \arg\max_{a\in A_i}\big\{\langle Y, a\rangle - \epsilon\, h(a)\big\}.$$
When we choose the Euclidean regularizer, that is, $h(x) = \tfrac{1}{2}\|x\|_2^2$ and $\epsilon = 1$, $\mathrm{QR}_\epsilon$ reduces to the projection operator $\operatorname{proj}_{A_i}$. Geometrically, the gradient search step is
performed in the dual space, and then the primal update is produced by the mapping QRϵ .
Since QRϵ “mirrors” the gradient update in the dual space back to the primal space, it is
also referred to as the mirror map in the online optimization literature [15].
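As a concrete illustration (a sketch under two specific choices of regularizer, not code from [41, 42]), the two most common instances of $\mathrm{QR}_\epsilon$ can be written as follows: the entropic regularizer on the simplex yields the Boltzmann-Gibbs (softmax) map, while the Euclidean regularizer with $\epsilon = 1$ recovers the projection.

```python
import numpy as np

def qr_entropic(Y, eps):
    """QR_eps with the entropic regularizer h(a) = sum_j a_j*log(a_j) over the simplex:
    arg max_a { <Y, a> - eps*h(a) } = softmax(Y / eps)."""
    z = Y / eps
    z = z - z.max()                    # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def qr_euclidean(Y, proj):
    """QR_eps with h(a) = 0.5*||a||_2^2 and eps = 1 reduces to the Euclidean projection."""
    return proj(Y)
```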
Interestingly, the scheme (MD) is also connected to the principle of optimality in hindsight. Consider the action selection rule
$$a_i^{k+1} = \arg\max_{a\in A_i}\Big\{\sum_{\tau=1}^{k} u_i(a, a_{-i}^\tau) - \epsilon\, h(a)\Big\}, \qquad \text{(FTRL)}$$
where ϵh(·) is the regularization introduced in (15), encouraging exploration in the learning
process. Based on the optimality in hindsight, this action selection (FTRL) is known as
follow-the-regularized-leader (FTRL) [60]. Moreover, if $u_i$ is well-behaved in the sense that it can be approximated by its first-order Taylor expansion, that is, $u_i(a, a_{-i}^\tau) \approx u_i(a_i^\tau, a_{-i}^\tau) + \langle D_i(a^\tau), a - a_i^\tau\rangle$, then (FTRL) is equivalent to
$$a_i^{k+1} = \arg\max_{a\in A_i}\Big\{\sum_{\tau=1}^{k} \big\langle D_i(a^\tau), a\big\rangle - \epsilon\, h(a)\Big\} = \arg\max_{a\in A_i}\Big\{\Big\langle \sum_{\tau=1}^{k} D_i(a^\tau),\, a\Big\rangle - \epsilon\, h(a)\Big\} = \mathrm{QR}_\epsilon\Big(\sum_{\tau=1}^{k} D_i(a^\tau)\Big),$$
which is exactly the mirror descent scheme in (MD), except that (MD) aggregates the gradients, weighted by the learning rates $\mu_i^k$, in the auxiliary variable $Y_i^k$. In other words, by the first-order expansion, the sum of gradients living in the dual space serves as a linear functional for evaluating the quality of the actions. Hence, the sum, or equivalently $Y_i^k$, can be treated as a "score function", based on which the mirror map outputs a better action in hindsight, yielding a reinforcement procedure.
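The "score function" viewpoint translates into a few lines of code. The sketch below (illustrative only, with the entropic mirror map and a user-supplied gradient oracle as assumptions) accumulates payoff gradients into a score Y and lets the mirror map pick the next action.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    w = np.exp(z)
    return w / w.sum()

def ftrl_play(gradient_oracle, n_actions, rounds, eps=0.1):
    """Linearized FTRL: Y^k accumulates past payoff gradients; a^{k+1} = QR_eps(Y^k)."""
    Y = np.zeros(n_actions)
    a = np.ones(n_actions) / n_actions
    for k in range(rounds):
        D = gradient_oracle(a)       # D_i(a^k): payoff gradient observed at round k
        Y += D                       # update the score (dual variable)
        a = softmax(Y / eps)         # mirror the score back to the action space
    return a
```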
[Figure: illustration of the choice map $\mathrm{QR}_\epsilon$, which maps score vectors $\hat u_i^k \in \mathbb{R}^{|A_i|}$ in the dual space to mixed strategies $\pi_i^k \in \Delta(A_i)$ in the primal space.]
pointing the reader to [37,58,61–63] for the treatment in continuous games. The discussion in
this subsection is primarily based on stochastic approximation theory and Lyapunov stabil-
ity theory [30,64], and a generic procedure of applying such analytical tools consists of three
steps: 1) develop the mean-field continuous-time dynamics using stochastic approximation
theory; 2) study the continuous-time learning dynamics using ODE methods, relating its Lya-
punov stability to Nash equilibria of the underlying game; 3) derive the convergence results
of discrete-time algorithms using asymptotic convergence of corresponding continuous-time
dynamics. Since the third step is a direct corollary of the results of the first and second steps, we articulate the first two steps in analyzing the asymptotic behaviors of reinforcement learning in the sequel. We refer the reader to Appendix C and references therein for details on
the relation between discrete-time trajectory and its continuous counterpart.
Recall the generic reinforcement learning scheme (8)-(9):
$$\hat u_i^{k+1} = (1-\mu_i^k)\,\hat u_i^k + \mu_i^k\, G_i^k\big(\pi_i^k, \hat u_i^k, U_i^k, a_i^k\big), \qquad \pi_i^{k+1} = (1-\lambda_i^k)\,\pi_i^k + \lambda_i^k\, F_i^k\big(\pi_i^k, \hat u_i^{k+1}, U_i^k, a_i^k\big).$$
In the following, the continuous-time dynamics associated with (8) and (9) is obtained via
stochastic approximation, which paves the way for the ODE-based convergence analysis.
We begin with a generic description of the learning dynamics under reinforcement learning,
and then we specify the learning dynamics corresponding to (BR-d), (DA-d), and (SBR-d). For
more details regarding stochastic approximation, we refer the reader to Appendix C and the
references therein.
For the sake of simplicity in exposition, we assume that learning policies in (8) and (9)
are time-invariant, denoted by Fi and Gi , respectively. When the learning policies are time-
variant, stochastic approximation theory still applies, and we refer the reader to [47] for
more details. Let the mean-field components of (8) and (9) be denoted by $f_i(\pi_i^k, \hat u_i^{k+1}) = \mathbb{E}\big[F_i(\pi_i^k, \hat u_i^{k+1}, U_i^k, a_i^k)\,\big|\,\mathcal{F}^{k-1}\big]$ and $g_i(\pi_i^k, \hat u_i^k) = \mathbb{E}\big[G_i(\pi_i^k, \hat u_i^k, U_i^k, a_i^k)\,\big|\,\mathcal{F}^{k-1}\big]$, respectively. We can then write down the following coupled differential equations
$$\frac{d\hat u_i(t)}{dt} = g_i\big(\pi_i(t), \hat u_i(t)\big), \qquad \frac{d\pi_i(t)}{dt} = f_i\big(\pi_i(t), \hat u_i(t)\big),$$
which are closely related to (8) and (9). By stochastic approximation theory (see Ap-
pendix C), the linear interpolations of the sequences $\{\pi_i^k\}$ and $\{\hat u_i^k\}$ are perturbed solutions to the differential equations above, which become arbitrarily close to the true solutions as
time goes to infinity. In other words, the convergence results of (8) and (9) can be obtained
by studying the limiting behavior of the associated differential equations.
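A minimal numerical illustration of this ODE viewpoint (our own toy example, assuming a single player whose opponents hold their strategies fixed so that the mean payoff is constant) compares the noisy utility-estimate recursion with an Euler discretization of its mean-field limit dû/dt = u − û:

```python
import numpy as np

rng = np.random.default_rng(0)
u_mean = 1.0                     # fixed expected payoff u_i(pi_{-i}) (toy assumption)
uhat_sa, uhat_ode = 0.0, 0.0
for k in range(1, 5001):
    mu = 1.0 / k                 # vanishing learning rate
    U = u_mean + rng.normal(scale=0.5)        # noisy payoff observation U_i^k
    uhat_sa  += mu * (U - uhat_sa)            # stochastic recursion, cf. (8)
    uhat_ode += mu * (u_mean - uhat_ode)      # Euler step of the mean-field ODE
print(uhat_sa, uhat_ode)         # both approach u_mean; their gap vanishes
```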
Following the same argument, the learning dynamics of the best response (BR-d) can be
written as
$$\frac{d\hat u_i(t)}{dt} = u_i\big(\pi_{-i}(t)\big) - \hat u_i(t), \qquad \frac{d\pi_i(t)}{dt} \in \mathrm{BR}_i\big(\hat u_i(t)\big) - \pi_i(t). \qquad \text{(BR-c)}$$
If the best response dynamics is adopted by every player, we can consider the continuous-time
dynamics of the strategy profile of all players π(t) = [π1 (t), π2 (t), . . . , πN (t)] under best re-
sponse. Denote the joint utility vector by u(π(t)) := [u1 (π−1 (t)), u2 (π−2 (t)), . . . , uN (π−N (t))],
and similarly, the joint estimated utility vector by û(t) := [û1 (t), û2 (t), . . . , ûN (t)]. Then,
for the strategy profile π(t), the continuous-time learning dynamics under the best response
algorithm is
$$\frac{d\hat u(t)}{dt} = u\big(\pi(t)\big) - \hat u(t), \qquad (18)$$
$$\frac{d\pi(t)}{dt} \in \mathrm{BR}\big(\hat u(t)\big) - \pi(t). \qquad (19)$$
From its associated learning dynamics, we can see that the best response algorithm (BR-d)
or equivalently its continuous-time mean-field dynamics (BR-c) is in fact an actor-critic learning scheme [31], where the estimate û(t) given by (18) plays the role of the critic, evaluating the performance of the current strategy profile, while the strategy update (19) plays the role of the actor that improves the strategy.
As observed in the literature [31], the performance of actor-critic learning relies on the quality of the evaluation provided by the critic. One approach to obtain a satisfying critic in learning
is to leverage the two-timescale idea [29], according to which (18) should operate at a faster
timescale than (19). Intuitively speaking, in order to obtain a û(t) that can approximately
evaluate the current strategy profile π(t), the player must wait until û(t) nearly converges
before it updates the strategy using (19). To analyze the convergence of the two-timescale
dynamics, one can study its equivalent single-timescale dynamics. Since the actor (18) runs
at a faster timescale, the system (18) and (19) can be “decoupled” in the following way: by
fixing π(t) = π, the faster timescale update (18) converges to u(π), where π is viewed as a parameter. Then, after the convergence of the fast dynamics to an equilibrium u(π), the slow dynamics (19) is set into motion, where û(t) is replaced by its equilibrium point u(π(t)), and the resulting learning dynamics is
$$\frac{d\pi(t)}{dt} \in \mathrm{BR}\big(\pi(t)\big) - \pi(t). \qquad (20)$$
As we illustrate in Appendix C, the coupled dynamics (18)-(19) and the single-timescale dynamics (20) share similar asymptotic behaviors. Hence, we can focus on the much simpler dynamics
(20) for the derivation of the convergence results. For more details about the two-timescale
learning and the derivation of the equivalent dynamics, we refer the reader to Appendix C
and references therein.
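The decoupling argument can be mimicked numerically. The sketch below (a toy discretization we add for illustration, not the algorithm of any cited reference) runs the payoff estimate on a faster step size than the strategy update in a 2 × 2 zero-sum game, where best response dynamics is known to converge to the mixed equilibrium:

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])    # matching pennies: player 2's payoff is -A

def br(uhat):
    """Best response to an estimated action-payoff vector (a vertex of the simplex)."""
    e = np.zeros_like(uhat)
    e[np.argmax(uhat)] = 1.0
    return e

pi1, pi2 = np.array([0.9, 0.1]), np.array([0.2, 0.8])
u1, u2 = A @ pi2, -(A.T @ pi1)               # initial payoff estimates
for k in range(1, 50001):
    mu, lam = k ** -0.6, 1.0 / k             # fast estimate, cf. (18); slow strategy, cf. (19)
    u1 += mu * (A @ pi2 - u1)                # track player 1's expected action payoffs
    u2 += mu * (-(A.T @ pi1) - u2)           # track player 2's expected action payoffs
    pi1 += lam * (br(u1) - pi1)              # move toward the best response
    pi2 += lam * (br(u2) - pi2)
print(pi1, pi2)                              # both drift toward the mixed equilibrium (0.5, 0.5)
```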
Applying the same argument to the smoothed best response (SBR-d), we obtain
$$\frac{d\hat u_i(t)}{dt} = u_i\big(\pi_{-i}(t)\big) - \hat u_i(t), \qquad \frac{d\pi_i(t)}{dt} = \mathrm{QR}_\epsilon\big(\hat u_i(t)\big) - \pi_i(t), \qquad \text{(SBR-c)}$$
and its equivalent dynamics regarding the joint strategy profile is
$$\frac{d\pi(t)}{dt} = \mathrm{QR}_\epsilon\big(u(\pi(t))\big) - \pi(t). \qquad (21)$$
Different from the best response (BR-d) and the smoothed best response (SBR-d), dual
averaging (DA-d) does not belong to the class of actor-critic methods. To see this, let us
write down its continuous-time dynamics
$$\frac{d\hat u_i(t)}{dt} = u_i\big(\pi_{-i}(t)\big), \qquad \pi_i(t) = \mathrm{QR}_\epsilon\big(\hat u_i(t)\big). \qquad \text{(DA-c)}$$
Similar to the previous argument, the learning dynamics for the strategy profile is
$$\frac{d\hat u(t)}{dt} = u\big(\pi(t)\big), \qquad \pi(t) = \mathrm{QR}_\epsilon\big(\hat u(t)\big), \qquad \text{(DA)}$$
where the dynamics of û(t) does not produce an approximation of u(π(t)). Instead, it gives the cumulative payoff: $\hat u(t) = \int_0^t u(\pi(\tau))\, d\tau + \hat u(0)$. It is straightforward to see
that as there is only one differential equation in (DA), the resulting autonomous dynamical
system is only related to û(t). Hence, there is no additional dynamics regarding the strategy
update, which makes (DA) fundamentally different from (BR-c) and (SBR-c).
Dual Averaging Consider the learning dynamics of the joint strategy profile and the
estimated utility vector under dual averaging
$$\frac{d\hat u(t)}{dt} = u\big(\pi(t)\big), \qquad \pi(t) = \mathrm{QR}_\epsilon\big(\hat u(t)\big). \qquad \text{(DA)}$$
This compact form implies that (DA) is an autonomous system evolving in the dual space.
Here, similar to the discussion in Section 3.3, we adopt the terminology in [41, 42], where
the gradient u(π(t)) is the dual variable and the corresponding space is termed the dual
space. As shown in [38], (DA) is a well-posed dynamical system in the dual space in that it
admits a unique global solution for every initial û(0). Furthermore, it can be shown that the
dynamics of π(t) on the game’s strategy space induced by (DA) under steep regularizers is
also well-posed [38, 52]. However, the well-posedness of the induced dynamics under generic
regularizers remains unclear [38]. The reason lies in the fact that under steep regularizers,
such as the entropy regularizer, the projected dynamics regarding π(t) evolves within the
interior of the simplex, and the resulting ODE is also well posed in the primal space, which
need not hold for nonsteep regularizers. For more generic choices of QR and related stability
analysis, we refer the reader to [38].
Even though studying the stability of the induced dynamics in the primal space may not
be viable due to the well-posedness issue, the asymptotic behavior of π(t) can be character-
ized by investigating its dual û(t). Toward that end, we call π(t) = QRϵ (û(t)) the induced
orbit of (DA), or simply the orbit, and we introduce the following notions regarding the stability and stationarity of π(t), which are adapted from [38].
Definition 4 Denote by $\mathrm{im}(\mathrm{QR}_\epsilon)$ the image of $\mathrm{QR}_\epsilon$. For an orbit $\pi(t) = \mathrm{QR}_\epsilon(\hat u(t))$ of (DA), we say that a fixed $\pi^* \in \prod_{i\in\mathcal{N}} \Delta(A_i)$ is
4. globally attracting, if $\pi^*$ is attracting with the attracting basin being the entire image $\mathrm{im}(\mathrm{QR}_\epsilon)$;
Similar to the Folk Theorem of evolutionary game theory [34], there is an equivalence
between the stationary points of (DA) and the Nash equilibria [34, 38]: any stationary point
is a Nash equilibrium and conversely, every Nash equilibrium that is within the image of the
mirror map (15) is a stationary point. In addition to the relation between Nash equilibrium
and the stationary point, another important question is the following:
Are Nash equilibria of the underlying game (globally) asymptotically stable under (DA)?
To answer this question, we shall revisit the variational characterization of Nash equilibrium,
which bridges the equilibrium concepts associated with two different mathematical models:
games and dynamical systems. Recall that the Nash equilibrium is equivalent to the solution
of the variational inequality
$$\langle u(\pi^*), \pi - \pi^*\rangle \le 0, \qquad \text{for all } \pi \in \prod_{i\in\mathcal{N}}\Delta(A_i). \qquad \text{(SVI)}$$
Since the utility function $u_i(\pi_i, \pi_{-i})$ is linear in $\pi_i$, the Stampacchia-type variational inequality (SVI) is equivalent to the following Minty-type variational inequality
$$\langle u(\pi), \pi - \pi^*\rangle \le 0, \qquad \text{for all } \pi \in \prod_{i\in\mathcal{N}}\Delta(A_i), \qquad \text{(MVI)}$$
which implies that the Nash equilibrium π ∗ is the solution to (MVI) [20]. Then, to answer
the question of interest, it suffices to investigate whether the solution to (MVI) is attracting
under (DA). As discussed in [61], the answer is negative: not every Nash equilibrium of an N-player general-sum game is attracting. To ensure the convergence of (DA), an additional
condition has to be imposed on (MVI).
Definition 5 (Variational Stability [38]) $\pi^*$ is said to be variationally stable if there exists a neighborhood $U$ of $\pi^*$ such that
$$\langle u(\pi), \pi - \pi^*\rangle \le 0 \quad \text{for all } \pi \in U, \text{ with equality if and only if } \pi = \pi^*. \qquad \text{(VS)}$$
What has been presented above provides a generic criterion for examining the convergence
of gradient-based dynamics (DA), and in the following, based on the notion of variational
stability, we discuss some concrete cases, where the learning dynamics converges either locally
or globally to Nash equilibria. As shown in [37], for any finite game, every strict Nash equilibrium satisfies (VS) and hence is a LASNE. Therefore, every strict Nash equilibrium in finite games is locally attracting. On the other hand, to ensure global convergence, the underlying Nash equilibrium has to be a GASNE or, equivalently, satisfy global variational stability. For finite games, the existence of a potential implies monotonicity, which further implies the existence of globally variationally stable Nash equilibria [37]. Hence, for potential games [38, 65] and monotone games [37, 66], regardless of the initial points, the orbit of (DA) always converges to the set of Nash equilibria. We summarize our discussions in the following,
where 1) and 2) are direct extensions of the folk theorem of evolutionary dynamics [34], while
3)-5) are corollaries of variational characterization of Nash equilibria in [38] and [61].
For every finite game, we have the following characterization of Nash equilibria using the language of Lyapunov stability [38, 52]. For a fixed $\pi^* \in \prod_{i\in\mathcal{N}}\Delta(A_i)$,
3. if $\pi^*$ is a Nash equilibrium and it falls within the image of the mirror map, then it is stationary;
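As a sanity check of these statements (our own toy computation, not taken from [37, 38]), one can verify numerically that a strict equilibrium of a 2 × 2 coordination game, a potential game, satisfies the variational stability inequality in a neighborhood, and that an Euler discretization of (DA) with the entropic mirror map converges to a pure equilibrium of the same game:

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0]])       # identical-interest coordination game
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

def v(pi1, pi2):                              # concatenated payoff vectors u(pi)
    return np.concatenate([A @ pi2, A.T @ pi1])

pi_star = np.concatenate([[1.0, 0.0], [1.0, 0.0]])   # a strict NE (both play action 1)

# (VS) check: <u(pi), pi - pi*> <= 0 for pi sampled in a neighborhood of pi*
rng = np.random.default_rng(1)
for _ in range(1000):
    x, y = rng.uniform(0.6, 1.0, size=2)
    pi = np.concatenate([[x, 1 - x], [y, 1 - y]])
    assert v(pi[:2], pi[2:]) @ (pi - pi_star) <= 1e-12

# (DA) orbit: d uhat/dt = u(pi(t)), pi(t) = QR_eps(uhat(t)), Euler-discretized
eps, dt = 0.1, 0.05
Y1, Y2 = np.array([0.0, 0.2]), np.array([0.0, 0.1])  # initial scores favor action 2
for _ in range(4000):
    p1, p2 = softmax(Y1 / eps), softmax(Y2 / eps)
    Y1 += dt * (A @ p2)
    Y2 += dt * (A.T @ p1)
print(softmax(Y1 / eps), softmax(Y2 / eps))   # the induced orbit settles at a pure NE
```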
Best response dynamics The analysis of the best response dynamics (20) is more in-
volved than that of dual averaging (DA). The theoretical challenge is mainly due to the
discontinuous, set-valued nature of the best response mapping (2). In general, as a differ-
ential inclusion, (20) typically admits non-unique solutions through every initial point [30].
Early works have established the convergence results on (20) for games with special struc-
tures: best response dynamics converges to Nash equilibrium in zero-sum games, where the
Nash equilibrium is essentially a saddle point [33, 62, 67], in two-player strictly supermodu-
lar games [44], and in finite potential games [30, 33]. However, we note that these works, even though most of them still rely on a Lyapunov argument [30, 33, 62, 67], do not directly reveal any generic relation between Lyapunov stability and Nash equilibria in general multi-player nonzero-sum games, and they are more or less ad hoc.
Recent endeavors on the study of best response dynamics have shed some light on its asymptotic behavior by relating the best response vector field BR(π) − π to the gradient field u(π), which renders the best response dynamics in some potential games [68, 69] an approximation of a gradient-based dynamical system [68].
For the finite potential games considered in [68], additional regularity conditions are imposed,
which are closely related to the notion of variational stability introduced above. Therefore,
the variational characterization of Nash equilibria and variational stability becomes relevant under the best response dynamics. Following this line of reasoning, it is shown in [68], in
regular potential games, that the best response dynamics is well-posed for almost every
initial condition, and converges to the set of Nash equilibria.
Smoothed Best Response As we can see from the explicit expression, smoothed best
response dynamics (21) only differs from the best response dynamics (20) in the operator
QRϵ (·), which serves as a perturbed best response [70], and the perturbation is determined
by ϵ [51]. Hence, if ϵ tends to zero, it is straightforward to see that the smoothed best
response (21) will enjoy the same asymptotic property as the best response (20), which
implies that identical results should also be achievable for smoothed best response with
vanishing exploration. This intuition has been verified in [46, 63], where smoothed best
response (21) is shown to converge in zero-sum games, potential games and supermodular
games.
On the other hand, with a constant ϵ, it is not realistic to expect the smoothed best response, essentially a fixed-point iteration, to always converge to an exact Nash equilibrium.
Hence, a new equilibrium concept has been introduced in the literature, which is termed
perturbed Nash equilibrium in [71, 72] or Nash distribution in [32, 50]. The new equilibrium
is defined as the fixed point of the smoothed best response. We do not carry out a detailed discussion of it in this paper, since the convergence analysis still rests on the standard Lyapunov argument, and the epistemic justification of such an equilibrium [24, 33] is beyond
the scope of this paper. We refer the reader to [24, 50, 63, 72] for a rigorous treatment of the
smoothed best response.
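A small Euler discretization of (21) illustrates this point (a toy example of ours with a fixed ϵ and the Boltzmann-Gibbs choice map, not an implementation from [46, 63]): in a 2 × 2 coordination game, the smoothed dynamics settles at a perturbed equilibrium close to, but not exactly at, the pure Nash equilibrium.

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1.0]])       # identical-interest coordination game
eps, dt = 0.25, 0.05
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()

pi1, pi2 = np.array([0.8, 0.2]), np.array([0.7, 0.3])
for _ in range(5000):
    # d pi_i/dt = QR_eps(u_i(pi_{-i})) - pi_i, Euler-discretized
    pi1 = pi1 + dt * (softmax((A @ pi2) / eps) - pi1)
    pi2 = pi2 + dt * (softmax((A.T @ pi1) / eps) - pi2)
print(pi1, pi2)   # a Nash distribution: near, but not exactly, the pure equilibrium (1, 0)
```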
This mood-based trial and error is different from the reinforcement learning introduced in the previous subsection, in that the exploration is not determined explicitly by a score function and a choice mapping. Hence, LTE does not fit the stochastic approximation framework
introduced above, and instead, the associated convergence proof relies on perturbed Markov
Chain theory [73, 76]. It is shown in [74] that in a two-player finite game, if there exists at least one pure Nash equilibrium, then LTE guarantees that a pure Nash equilibrium is played at least a fraction 1 − ϵ of the time, where ϵ is the probability of exploring new strategies. For an N-player finite game, if the game is interdependent [74] and there exists at least one pure Nash equilibrium, the same guarantee as in the two-player case holds. It is not
surprising that LTE does not achieve convergence in conventional ways, that is, almost sure
convergence and convergence in the mean, since players will always explore new strategies
with positive probability at least ϵ. The proposed learning method and its variants have
also been applied to learning efficient equilibrium [77] (Pareto dominant, maximizing social
welfare), learning efficient correlated equilibria [78], achieving the Pareto optimality [79] and
other related works in engineering applications, especially in cognitive radio problems [28].
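To give a flavor of the mechanism (a deliberately simplified caricature in the spirit of the simplified trial-and-error scheme analyzed in [80]; the full LTE dynamics in [74] additionally keeps track of moods), each player holds on to a benchmark action and payoff, experiments with probability ϵ, and adopts the experiment only if it improves upon the benchmark:

```python
import numpy as np

def trial_and_error(payoff, n_actions, rounds, eps=0.05, seed=0):
    """Simplified trial-and-error learning for one player; `payoff(a)` is an assumed
    callable returning the realized payoff of action a in the current environment."""
    rng = np.random.default_rng(seed)
    bench_a = int(rng.integers(n_actions))     # benchmark action
    bench_u = payoff(bench_a)                  # benchmark payoff
    history = []
    for _ in range(rounds):
        if rng.random() < eps:                 # experiment with probability eps
            a = int(rng.integers(n_actions))
            u = payoff(a)
            if u > bench_u:                    # adopt the experiment only if it improves
                bench_a, bench_u = a, u
        else:                                  # otherwise play the benchmark action
            a = bench_a
        history.append(a)
    return bench_a, history
```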
The idea of trial and error in LTE leads to many important variants, such as sample
experimentation dynamics in [76] and optimal dynamical learning [75,79], which also rely on
perturbed Markov processes for equilibrium seeking. Even though the convergence results
of these algorithms all rest on Markov chain (MC) theory [73], the analysis of their performance remains unclear, due to the computational complexity of the inherent MC generated by these algorithms. To circumvent the dimensionality issue regarding the number of states
in the original MC, an approximation-based dimension reduction method is proposed in [75],
which allows numerical convergence analysis for LTE and its variants based on Monte Carlo
simulations. We also note that a much simplified trial-and-error algorithm has been theoretically analyzed in [80], where the optimal exploration rate is identified and the associated convergence rate is discussed. It is not unrealistic to expect that a similar argument may
apply to LTE and its variants, but the technical challenges regarding the dimensionality
should not be downplayed.
modeling using deep learning methods, involving automatically discovering and learning the
patterns of input data in such a way that newly generated examples output by the generative
model (generator) cannot be distinguished from the input. In game-theoretic language, the
training process of GAN is essentially a learning process in a zero-sum game between the
generator and the discriminator, where the generator tries to generate new samples that
plausibly could have been drawn from the original dataset, while the discriminator tries to pick out the fake samples produced by the generator. We do not intend to provide a comprehensive survey of these machine learning applications; instead, we refer the reader to [81, 82].
Despite different contexts under which the learning theory is studied, recent research
efforts mainly revolve around the following three aspects:
The first research direction is a natural follow-up to the study of evolutionary dynamics
[34, 44], which aims to bring learning in games to a broad range of ML applications, since in
ML, the game structure is specified by the underlying data and may not enjoy any desired
properties. Recall that the convergence results and asymptotic behaviors of the three dynamics (BR-c), (SBR-c), and (DA-c) are discussed under the assumption that the underlying game possesses special structures, such as potential games, supermodular games, and zero-sum
games. However, for games with fewer assumptions on the utility function, there is still
a lack of understanding of the dynamics and the limiting behavior of learning algorithms.
One of the central questions in this direction concerns the relations between Nash equilibria and the stationary points, as well as the attracting sets, of the learning dynamics. Recent
attempts try to answer this question from a variational perspective [85], and provide various
characterizations of Nash equilibria with desired properties under gradient-based dynamics
[52,61,86]. Furthermore, considering its applications in ML problems, learning algorithms in
stochastic settings are of great significance in recent studies, and we refer the reader to [61,87]
for more details as well as to [17] for an introduction to stochastic Nash equilibrium seeking.
The second research direction, which attracts attention from the ML community, the
optimization community as well as the control community, is directly related to the design
of ML algorithms. The goal is to develop acceleration techniques that improve the per-
formance of learning algorithms. Based on the understanding of first-order gradient-based dynamics in games, such as (GD) and (LGD), recent research efforts have focused on higher-order gradient methods, which can be traced back to Nesterov's momentum idea [42], and researchers endeavor to propose a general framework that generalizes momentum for generating accelerated gradient-based algorithms [85]. On account of the close relationships among Nash equilibria, variational problems, and dynamical systems [20], one approach to developing acceleration is to generalize the concept of momentum by formulating equilibrium seeking as a variational (optimization) problem [20, 88], and then to investigate acceleration methods within the optimization context using, for example, variational analysis [85], extragradient methods [88], and differential equations [89]. In addition to these research works,
we refer the reader to [19] for a review on the optimization-based approach. On the other
hand, as depicted in Figure 3, a learning process in general is a feedback system, and it is not
surprising that control theory can play a part in designing the acceleration. For example,
recent studies on reinforcement learning demonstrate that passivity-based control theory can
be leveraged in designing high-order learning algorithms [66, 90], where the learning rule is
treated as the control law to be designed. Another paper [91] promotes the use of memory in
best response maps to accelerate convergence in Nash seeking, and demonstrates substantial
improvements in doing so. In addition to the mentioned references, we further refer the
reader to [92] for a review of control-theoretic approaches to distributed Nash equilibrium seeking, and to [93] for the use of extremum seeking in the learning process.
The recent advance on the third research direction is in part driven by multi-agent re-
inforcement learning and its applications such as multi-agent robotic control [10, 94, 95].
Different from the first two directions where the learning dynamics is primarily studied in
the context of repeated games, the third research direction focuses on games with dynamic
information (see Section 2.2). In this context, the appropriate learning objective, out of
practical consideration [16], is to obtain stationary strategies that are subgame perfect [96]
(see Section 2.2 for the definition of subgame perfectness). Different from the first two directions, where the change in payoffs caused by a certain action comes entirely from the opponents' moves, in dynamic games the feedback each player receives depends not only on the other players' moves but also on the dynamic environment. Moreover, when making decisions at each state, players have to trade off the current stage payoff against estimated future payoffs while forming predictions of the opponents' strategies. This dynamic trade-off makes the analysis of learning in stochastic games challenging [97].
Earlier works on seeking such Markov perfect Nash equilibria are largely based on dynamic programming [98, 99], which requires global information feedback, a restrictive assumption in practice. Recent efforts focus on various approaches to relax this requirement.
Currently, there are mainly three lines of research regarding learning in dynamic games. The
first approach is to extend learning dynamics in repeated games to dynamic games. Built
upon similar ideas in best response dynamics (BR-d), two-timescale best response dynamics
for zero-sum Markov games have been considered in [97, 100], while gradient play has been investigated in linear-quadratic dynamic zero-sum games [61, 101, 102]. The key challenge in this approach, particularly in the case of Markov games, is to properly construct the score function, which balances current stage payoffs and future payoffs, and
we refer the reader to the mentioned references for more details and to [81] for an overview.
The second approach is to extend learning methods in single-agent Markov decision pro-
cess to Markov games. However, the direct extension of methods such as Q-learning [103], policy gradient [31], and actor-critic [49] often fails to deliver the desired results due to the non-stationarity issue [104]. One natural way to overcome the non-stationarity issue is to allow players to exchange information with their neighbors [105, 106], which enables them to jointly identify the non-stationarity created by the dynamic environment. For more details regarding this approach, we refer the reader to recent reviews [81, 104]. Finally, the third approach
is about a unilateral viewpoint of dynamic games. Different from the first two approaches
where learning processes are still investigated in a competitive environment, the third one
interprets learning in Markov games as an online optimization problem [107,108], where play-
ers independently make decisions based on the received feedback. This approach accounts for fully decentralized learning, where, from each player's perspective, the other players are
considered as part of the environment. The key idea of this approach is to leverage the
regret minimization technique [15], which has led to many successes in solving extensive
form games of incomplete information [109]. Despite recent advances regarding the first two
approaches [61, 81, 97, 100, 110] and positive results for the last one [107, 108, 111], we still lack a unified framework and a thorough understanding of the learning process in general Markov games, which remains an open area for researchers from diverse communities.
Figure 7: The next generation of communication network: macrocells (bands < 3 GHz);
small cells (millimeter-wave); femtocells and Wi-Fi (millimeter-wave); massive multiple-
input, multiple-output with beamforming; and device-to-device (D2D) and machine-to-
machine (M2M) communications. Solid arrows indicate wireless links, whereas the dashed
arrows indicate backhaul links.
• increased indoor and small cell/hotspot traffic, which will make up the majority of
mobile traffic volume, leading to complex network structures;
• improved energy consumption or efficient power control for reducing carbon footprint.
From a system science perspective, these requirements impose a large-scale, time-variant,
and heterogeneous network topology on modern wireless communication systems, as shown
in Figure 7. Hence, it is impractical to manage/secure the wireless communications network
centrally. Game-theoretic learning provides a scalable distributed solution with adaptive
attributes to deal with this challenge. In the following, we take the dynamic secure routing
mechanism as an example to illustrate how game-theoretic learning contributes to a resilient
and agile communication system.
Security of routing in a distributed cognitive radio (CR) network is a prime issue, as the
routing may be compromised by unknown attacks, malicious behaviors, and unintentional
misconfigurations, which makes it inherently fragile. Even with appropriate cryptographic
techniques, routing in CR networks is still vulnerable to attacks in the physical layer, which
can critically compromise performance and reliability. Most of the existing work focuses on
the resource allocation perspective, which fails to capture the user’s lack of knowledge of
the attacker due to the distributed mechanism. To address these issues, [113] provides a
learning-based secure scheme, which allows the network to defend against unknown attacks
with a minimum level of deterioration in performance.
Consider Gw := (Nw , Ew ), which is a topology graph for a multi-hop CR network, where
Nw = {n1 , n2 , ..., nN } is a set of secondary users, and Ew is a set of links connecting these
users. The system state s indicates whether the primary users occupy nodes. The objective
of the secondary user is to find an optimal path to its destination. In multi-hop routing, a
secondary user ni starts with exploring neighboring nodes that are not occupied and then
chooses a node among them to which the user routes data. The selected node initializes
another exploration process for discovering the next node, and the same process is repeated
until the destination is reached.
Let Pi (0, Li ) := {(ni , li ), li ∈ {0, 1, 2, ..., Li }} be the multi-hop path from the node ni to
its destination, where Li is the total number of explorations until it reaches its destination.
Suppose there are J jammers in the network, the set of which is given by J := {1, 2, ..., J}.
Let Rj , j ∈ J , be the set of nodes under the influence of jammer j. Denote the joint action
of the jammers by r = [rj ]j∈J , where rj ∈ Rj . A zero-sum game formulation is proposed
in [113], where the secondary users aim to find an optimal routing path by selecting Pi (0, Li ),
while the jammers aim to compromise the data transmission by choosing r. The expected
utility function is
"L #
X i
(n ,l ) (n ,l )
Es [ui (s, Pi (0, Li ), r)] = −Es ln q(nii,lii−1) + λτ(nii,lii−1) ,
li =1
(n ,l )
where q(nii,lii−1) is the probability of successful transmissions from node (ni , li − 1) to node
(n ,l )
(ni , li ), and λ(nii ,lii −1) is the transmission delay between these two nodes. Here, the expectation
Es [·] is taken over all the possible system states.
Due to a lack of complete knowledge of adversaries and payoff structures, Boltzmann-
Gibbs reinforcement learning (SBR-d) is utilized to find the optimal path because of its
capability of estimating the expected utility. The resulting secure routing algorithm can
spatially circumvent jammers along the routing path and learn to defend against malicious
attackers as the state changes. As shown in Figure 8, the routing path generated from
the proposed routing algorithm in [113, 129] can avoid the nodes compromised by the jam-
mers. Thus, the routing algorithm stemming from the proposed game-theoretic formulation
provides more resilience, security, and agility than the ad-hoc on-demand distance vector
(AODV) algorithm, as AODV fails to dynamically adjust the routing path in the case of
a malicious attack. Moreover, the proposed routing algorithm can reduce the delay time
incurred by the attack due to its adaptive and dynamic feature, and hence, is more efficient
than AODV.
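For illustration, the next-hop selection rule can be sketched as follows (a hypothetical sketch of ours, not the exact algorithm of [113]): each node keeps a score û for every available neighbor, picks the next hop through the Boltzmann-Gibbs map, and reinforces the score with the realized hop payoff.

```python
import numpy as np

def boltzmann_gibbs(scores, eps):
    """Boltzmann-Gibbs (smoothed best response) distribution over candidate next hops."""
    z = np.asarray(scores, dtype=float) / eps
    z -= z.max()
    w = np.exp(z)
    return w / w.sum()

def select_next_hop(uhat, eps, rng):
    """Sample a next hop from the smoothed best response over estimated utilities."""
    return rng.choice(len(uhat), p=boltzmann_gibbs(uhat, eps))

def update_score(uhat, hop, realized_payoff, mu):
    """Reinforce the chosen hop's score with the realized payoff, in the spirit of (SBR-d)."""
    uhat[hop] += mu * (realized_payoff - uhat[hop])
    return uhat
```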
Figure 8: Illustration of a random network topology for 500 secondary users with a source
(S) and a destination (D), and routes of AODV and the proposed secure routing algorithm
in a 2 km by 2 km area. The PU footprint denotes the set of nodes unavailable to secondary
users. Without an attacker, AODV establishes the route path (a), described by the solid
line, while the route path (b), the blue dashed line, is generated by the Boltzmann-Gibbs
learning method. Even though the AODV path is the shortest path between the source
and the destination, it is disrupted by malicious attacks. By contrast, the learning method
can develop a new route path (c) that circumvents jammers, leading to a resilient routing
mechanism.
the integration of microgrids can enhance the stability, resiliency, and reliability of the power
system, as they can operate autonomously and independently of the main power grid. Such
integration, together with smart meters and appliances, produces the so-called smart grid,
a modern infrastructure for the reliable delivery of electricity.
The future smart grid is envisioned as a large-scale cyber-physical system comprising
advanced power, communications, control, and computing technologies. To accommodate
these technologies employed by different parties in the grid and to ensure an efficient and ro-
bust operation of such heterogeneous and large-scale cyber-physical systems, game-theoretic
methods have been widely employed in smart grid management problems. In the grid, mi-
crogrids are modeled as self-interested players who can operate, communicate, and interact
autonomously to deliver power and electricity to their consumers efficiently. Here, we discuss
a microgrid management mechanism developed in [117], which is built on game-theoretic learning and enables autonomous management of renewable resources.
The system model considered in [117] includes the generators, microgrids, and communi-
cations. As shown in Figure 10, generators in the upper layer determine the amount of power
to be generated, along with the electricity price, and send them to the bottom layer. A microgrid can generate renewable energy and make decisions by responding to the strategies of the generators and other microgrids to optimize its payoff, as specified in the following
Figure 9: The integration of microgrids. A microgrid consists of a controller, consumers,
generators, and energy storage. In the grid, microgrids can either be connected to the main
grid or other microgrids, and these networked microgrids can operate, communicate, and
interact autonomously to deliver power and electricity to their consumers efficiently.
game-theoretic model.
Let $\mathcal{N}_d = \{r, 1, 2, \ldots, N_d\}$ be the set of $N_d + 1$ buses in a power grid, where r denotes the slack bus. Assume that the smart grid is composed of load buses and generator buses, and let $p_i^g$, $p_i^l$, and $\theta_i$ be, respectively, the power generation, power load, and voltage angle at the i-th bus. Note that the active power injection at the i-th bus satisfies
$$p_i = p_i^g - p_i^l, \qquad \forall\, i \in \mathcal{N}_d,$$
while the power balance of the grid gives $\sum_{i\in\mathcal{N}_d} p_i^g = \sum_{i\in\mathcal{N}_d} p_i^l$. Let $\mathcal{N} := \{1, 2, \ldots, N\} \subseteq \mathcal{N}_d$ be the
set of N buses that can generate renewable energies, such as wind power, solar power, etc.
In the game considered in [117], the utility function of the i-th bus measures not only
economic factors related to power generation but also the efficiency of the microgrids. Before
giving the mathematical definition of the utility function, we first introduce the following
notations. Let ci be the unit cost of generated power for the i-th player, and c the unit
price of renewable energy for sale defined by the power market; $c_i$ and $c$ are quantities relevant to the profit gained by the bus. For the efficiency part, denote by $r_i$ a weighting parameter that measures the importance of regulating the voltage angle at the i-th bus. Further, let $[s_{ij}]_{i,j\in\mathcal{N}_d} = -[b_{ij}]^{-1}_{i,j\in\mathcal{N}_d}$, where $b_{ij}$ is the imaginary part of the element $(i, j)$ of the admittance matrix of the power grid. Moreover, each microgrid has a maximum generation, denoted by
$\bar p_i^g$. Finally, we note that, as a physical constraint, $[s_{ij}]$ and $[p_i]$ satisfy (23) due to the power flow equation [117]
$$\sum_{j\in\mathcal{N}_d\setminus\mathcal{N}} s_{ij}\, p_j + \sum_{j\in\mathcal{N},\, j\neq i} s_{ij}\, p_j = \theta_i - s_{ii}\, p_i, \qquad \forall\, i \in \mathcal{N}, \qquad (23)$$
where θi is the voltage angle of the i-th bus. With all the notations above, the utility function
of the i-th bus is defined as
$$u_i(p_i^g, p_{-i}^g) := -\,c_i\, p_i^g - c\left(p_i^l - p_i^g\right) - \frac{1}{2}\, r_i^2 \left(\sum_{j\in\mathcal{N}_d} s_{ij}\, p_j\right), \qquad 0 \le p_i^g \le \bar p_i^g,\ \ i \in \mathcal{N}.$$
Figure 10: Smart grid hierarchy model. The upper layer containing conventional generators
forms a generator network, and the distributed renewable energy generators in the bottom
layer constitute the microgrid network; the information exchange between the two layers, such as the electricity market price and the amount of power generation, takes place through the communication network layer in the middle.
Three learning methods are proposed in the paper to seek the Nash equilibrium, all
based on best response dynamics (10). The first two algorithms are parallel-update algo-
rithm (PUA) and the random-update algorithm (RUA) studied in [112]. PUA is essentially the best response algorithm presented in (10), with the learning rate $\lambda_i^k$ set to zero for all i and with all players updating their strategies in parallel. As its name suggests, RUA incorporates
randomness into the best response algorithm, resulting in an ϵ-greedy best response algo-
rithm: players update their strategies according to (10) with probability 1 − ϵ, where ϵ ∈ (0, 1), and retain their previous strategies otherwise. When ϵ = 0, players constantly update their
strategies in every round; in this case, RUA reduces to PUA.
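In pseudocode, one RUA round can be sketched as follows (an illustrative snippet of ours; `best_response_i` stands for the best response computation in (10), which for PUA/RUA requires the global grid information discussed next):

```python
import numpy as np

def rua_round(p_g, best_response_i, eps, rng):
    """One random-update (RUA) round: every microgrid updates to its best response
    with probability 1 - eps and keeps its previous generation level otherwise.
    Setting eps = 0 recovers the parallel-update algorithm (PUA)."""
    p_new = p_g.copy()
    for i in range(len(p_g)):
        if rng.random() < 1.0 - eps:
            p_new[i] = best_response_i(i, p_g)   # best response to the current profile
    return p_new
```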
However, as special cases of (10), PUA and RUA require global information regarding the
grid, including the specific generated power of generators and other players’ active power in-
jections, which are assumed to be private in practice. Hence, to implement these algorithms,
communication networks are needed to broadcast information to players, which is costly and
not confidential. As a possible remedy, we can consider incorporating utility estimation and
using smoothed best response dynamics (SBR-d) as in the wireless setting. Another more
straightforward approach, as shown in the paper, is to modify the best response algorithm
by using the power flow equations in the smart grid. Based on a phasor measurement unit
(PMU), the third algorithm, termed PMU-enabled distributed algorithm (PDA), enables
each player to compute the aggregation of others’ actions, and the only information needed
is the player’s voltage angle θi . Therefore, by taking into account the power flow equation
(23), a player does not need other players’ private information of active power injection when
using PDA, as shown in Figure 11. Compared with the other two, PDA requires much less
information and is more self-dependent as players only need their current voltage angles θi ,
and the common knowledge of the electricity price.
Figure 11: The framework to implement the PMU-enabled distributed algorithm. PMU
measures the voltage angle at the bus, and the controller generates a command regarding
the amount of microgrid renewable energy injection from the local storage to the grid based
on the received voltage angle.
As indicated in [117], the effectiveness and resiliency of the algorithm have been vali-
dated via case studies based on the IEEE 14-bus system: the game-theory-based distributed
algorithm not only can converge to the unique Nash equilibrium but also provides strong
resilience against fault models (generator breakdown, microgrid turn-off, and open-circuit
of the transmission line, etc.) and attack models (data injection attacks, unavailability of
PMU data and jamming attacks, etc.). The strong resilience enables the microgrids to oper-
ate appropriately in unanticipated situations. Moreover, the distributed algorithm enables
autonomous management of renewable resources and the plug-and-play feature of the smart
grid. The proposed learning algorithm only requires the players to have common knowledge
without revealing their private information, which increases security and privacy and reduces
communication overhead.
Distributed machine learning over networks (DMLON) aims to develop efficient and scalable algorithms with reasonable requirements on memory and computation resources, by allocating the learning processes among several networked computing units with distributed data sets.
The key feature of DMLON is that data sets are stored and processed locally on these
computing units, which enables distributed and parallel computing schemes in large-scale ma-
chine learning systems. Compared with centralized approaches, distributed machine learning
avoids maintaining and mining a central data set and preserves data privacy, as these net-
worked units exchange knowledge about the learned models without exchanging raw private
data.
Based on the idea of “local learning and global integration,” DMLON can utilize different
learning processes to train several models from distributed data sets and then produce an
integration of learning models that can increase the possibility of achieving higher accuracy,
especially on a large-size domain. For example, in federated learning [130], the global integration is created by a third-party coordinator rather than by the computing units themselves, which lets networked computing units collaboratively train a machine learning model while keeping their data secure. On the other hand, as indicated in [131], such a global integration can also
stem from the collective patterns of local learning without external enforcement. The key idea behind this bottom-up integration is that each computing unit is modeled as a self-interested player that learns a model based on its local data set and the feedback from its
neighbors. It has been shown in the paper that by modeling DMLON as a noncooperative
game, game-theoretic learning methods lead to a communication-efficient distributed ma-
chine learning, where the global outcome is characterized by the Nash equilibrium, resulting
from players’ self-adaptive behaviors.
Specifically, the networked system of computing units is described by a graph with the
set of nodes Nm := {1, 2, . . . , N } representing these units. Each node i ∈ Nm possesses
local data that cannot be transferred to other nodes. In the game model considered in [131,
132], instead of fixing the network topology, nodes can determine the network’s connectivity
based on their attributes when they perform learning tasks, resulting in a network formation
game. In mathematical terms, the action of node i consists of two components: the learning
parameter θi ∈ Rd , and the network formation parameter ei ∈ RN −1 . The first component
θi corresponds to the weights or parameters of the machine learning model, which captures
the local learning process at node i, and the corresponding empirical loss, given the local
data, is denoted by Li (θi ). In addition to this learning parameter θi , the network formation
parameter ei plays an important role in bringing up the global integration. The parameter
ei := (eji )j̸=i,j∈N ∈ [0, 1]N −1 denotes concatenation of weights on the directed edges from
node i to other nodes, where eji can be interpreted as the attention node i pays to the
local learning at node j, and this further influence the communication between the nodes.
Each node can communicate with its neighbors during the distributed learning process to
exchange learning parameters if their objectives are aligned. Otherwise, the corresponding
edge weight eji is set to zero. For node i, the communication cost is Ci (θi , θ−i , ei ). In the
game considered in [131], each node aims to maximize a utility function composed of two terms: the local empirical loss $L_i(\theta_i)$ and the communication cost $C_i(\theta_i, \theta_{-i}, e_i)$.
Figure 12: A schematic representation of two-layer learning. The directed red lines stand for
the communication between nodes. In the network formation layer, the nodes learn to elim-
inate/establish links with other nodes to achieve efficient communication. In the distributed
machine learning layer, the nodes communicate their parameters with their neighbors and
perform their learning tasks.
Here, the first term $L_i(\theta_i)$ captures the local learning process at node i, whereas the second term $C_i(\theta_i, \theta_{-i}, e_i)$ depicts the interactions among nodes. The objective of each
node is to improve the performance of learning while reducing the communication overhead.
A two-layer learning approach is proposed in [131] to find the Nash equilibrium of the
game, and a schematic representation is provided in Figure 12. The outer layer corresponds
to network formation learning, where each node decides its network formation parameter ei
with the learning parameter fixed, and the joint parameters of all nodes $e = (e_i)_{i\in\mathcal{N}_m}$ give rise to a new network topology, leading to efficient communication. In network formation learning, each node decides its optimal parameter $e_i$ by gradient play (GD), and computing the individual payoff gradient $\nabla_{e_i} u_i(\theta_i, \theta_{-i}, e_i, e_{-i})$ relies on the stabilized learning parameters
θi , θ−i given by the inner layer: distributed learning layer. In this inner learning, the network
formation parameter is fixed, and each node implements online mirror descent (MD) for
seeking the Nash equilibrium with the local feedback under the current network topology,
as the networked nodes can exchange information with their neighbors.
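Schematically, the two-layer procedure can be written as the following nested loop (a pseudocode-level sketch of ours; the oracles `md_step` and `grad_e_utility` are placeholders for the mirror-descent update and the payoff-gradient computation described in [131]):

```python
def two_layer_learning(theta, e, md_step, grad_e_utility,
                       outer_rounds, inner_rounds, step_outer=0.01):
    """Outer layer: network formation by gradient play on e_i with theta fixed.
    Inner layer: distributed learning of theta by online mirror descent under the
    current topology e, using only neighbor feedback."""
    for _ in range(outer_rounds):
        # inner layer: run mirror descent on the learning parameters until they stabilize
        for _ in range(inner_rounds):
            theta = md_step(theta, e)
        # outer layer: one gradient-play step on the network formation parameters,
        # keeping every edge weight inside [0, 1]
        for i in range(len(e)):
            grads = grad_e_utility(i, theta, e)
            e[i] = [min(1.0, max(0.0, w + step_outer * g))
                    for w, g in zip(e[i], grads)]
    return theta, e
```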
Compared with existing works on distributed machine learning, the game-theoretic method
studied in [131] enables distributed machine learning over strategic networks. On the one
hand, the global outcome characterized by the Nash equilibrium is self-enforcing, resulting from the coordinated behaviors of independent computing units, in contrast with the externally enforced integration in federated learning. This bottom-up approach scales efficiently when additional
computing units are introduced into the system. On the other hand, the strategic interac-
tions over the network, described by the network formation decision of each node, create
a network intelligence that allows each computing unit to adaptively adjust the underlying
topology, resulting in a desired distributed learning pattern that minimizes communication
costs during the learning process.
4.4 Emerging Network Applications
From the examples above, game-theoretic learning provides a natural scalable design frame-
work to create network intelligence for autonomous control, management, and coordination
of large-scale complex network systems with heterogeneous parties. In the following, we
offer some thoughts regarding various applications of game-theoretic learning in a broader
context, showing that such a design framework is pervasive for diverse network problems.
Interdependent infrastructure networks, including wireless communication networks and
the smart grid, play a significant role in modern society, where Internet-of-Things (IoT)
devices are massively deployed and interconnected. These devices are connected to cellu-
lar/cloud networks, creating multi-layer networks, referred to as networks-of-networks [133].
The smart grid is one prominent example, where wireless sensors collect the data of buses
and power transmission lines, forming a sensor network built on the power networks for
grid monitoring and decision planning purposes [134]. Besides, the networks-of-networks
model has also been extensively studied in other infrastructure networks. For instance, in
an intelligent transportation network, apart from vehicle-to-vehicle (V2V) communications,
vehicles can also communicate with roadside infrastructures or units belonging to one or
several service providers to exchange various types of data related to different applications,
such as GPS navigation. In this case, the vehicles form one network while the infrastructure
nodes form another network, and the interconnections between the two networks lead to the
intelligent management and operation of modern transportation networks.
Due to interdependent networks’ heterogeneous and multi-tier features, the required man-
agement mechanisms or controls can vary for different networks. For example, the connec-
tivity of sensor networks in smart grids or V2V communication networks requires higher
security levels than the infrastructure networks, as cyberspace is more likely to be targeted
by adversaries [135]. Therefore, to manage and secure interdependent infrastructure net-
works, game-theoretic learning methods, especially heterogeneous learning [40, 47], can be
used to design decentralized and resilient mechanisms that are responsive to attacks and
adaptive to the dynamic environment, as different parties in interdependent infrastructure
networks may acquire different information. For further readings on this topic, we refer the
reader to [47, 133] and references therein.
Similar to distributed optimization and machine learning based on game-theoretic learn-
ing, the control of autonomous mobile robots can also be cast as a Nash equilibrium seeking
problem over networks, where the equilibrium is viewed as the desired coordination of all
robots [94,95]. For applications of this kind, where the nature of robot movements determines
the network topologies, dynamic games over networks are considered, and corresponding
learning algorithms are employed. Based on their observations of the surroundings, robots rely on game-theoretic learning, for example reinforcement learning, to develop autonomous policies, addressing the need for decentralized and scalable control laws for multi-agent robotic systems. Moreover, reinforcement learning has proven effective for real-world multi-agent
robotic control when combined with powerful function approximators, such as deep neural
networks. This area of research, termed deep multi-agent reinforcement learning [81, 136], is
growing rapidly and attracting the attention of researchers from machine learning, robotics
as well as control communities.
In addition to these prescriptive mechanisms in engineering practices, game-theoretic
learning also provides a descriptive model for studying human decision-making and strategic
interactions in epidemiology and social sciences, where the Nash equilibrium represents a
stable state of the underlying noncooperative game. For example, a differential game model
has been proposed in [137] to study virus or disease spreading over networks, where the authors developed a decentralized mitigation mechanism for controlling the spread. Such
an approach has been further explored in [138], where an optimal quarantining strategy
of suppressing two interdependent epidemics spreading over complex networks has been
proposed and proven robust against random changes in network connections.
5 Summary
This article provides a comprehensive overview of game theory basics and related learning
theories, which serve as building blocks for systematically treating multi-agent decision-
making over networks. We have elaborated on the game-theoretic learning methods for net-
work applications drawn from emerging areas such as the next-generation wireless
networks, the smart grid, and networked machine learning. In each area, we have identified
the main technical challenges and discussed how game theory can be applied to address them
in a bottom-up approach.
From the surveyed works, we conclude that noncooperative game theory is the cornerstone
of decentralized mechanisms for large-scale complex networks with heterogeneous entities,
where each node is modeled as an independent decision-maker. The resulting collective
behaviors of these rational decision-makers over the network can be mathematically depicted
by the solution concept: Nash equilibrium. In addition to various game models, learning
in games is of great significance for creating distributed network intelligence, which enables
each entity in the network to respond to unanticipated situations, such as malicious attacks
from adversaries in cyber-physical systems [134]. Under local or individual feedback, the
introduced learning dynamics lead to a decentralized and self-adaptive procedure, resulting
in desired collective behavior patterns without external enforcement.
Beyond the existing successes of game-theoretic learning, which mainly focuses on learn-
ing in static repeated games, it is also of interest to investigate dynamic game models and
associated learning dynamics, in order to better understand the decision-making process in
dynamic environments. The motivation for studying dynamic models and related learning
theory stems, on the one hand, from the pervasive presence of time-varying network struc-
tures, such as generation and demand in the smart grid [117]. On the other hand, by defining
auxiliary state variables, the problem of decision-making under uncertainties can be modeled
as a dynamic game, where the state of the game includes the hidden information players do
not have access to when making decisions. For example, the state variable can capture the
uncertainty of the environment, as we have discussed in the context of the dynamic routing
problem [113], or it can describe the global status of the entire system, as we have shown in
the example of distributed optimization [139]. The dynamic game models not only simplify
the construction of players’ utilities and actions, providing a clear picture of the strategic
interactions under uncertainties in the dynamic environment, but can also offer a scalable
design framework for prescribing players’ self-adaptive behaviors that lead to equilibrium
states under various feedback structures.
To recap, this article has presented a comprehensive overview of game-theoretic learning
and its potential for tackling the challenges emerging from network applications. The combi-
nation of game-theoretic modeling and related learning theories constitutes a powerful tool
for designing future data-driven network systems with distributed intelligent entities, which
serve as the bedrock and a key enabler for resilient and agile control of large-scale artificial
intelligence systems in the near future.
References
[1] M. O. Jackson, Social and Economic Networks. Princeton, NJ: Princeton University
Press, 2010.
[3] Q. Zhu, Z. Han, and T. Başar, “A differential game approach to distributed demand
side management in smart grid,” in 2012 IEEE International Conference on Commu-
nications (ICC), pp. 3345–3350, 2012.
[5] Z. Han, D. Niyato, W. Saad, T. Başar, and A. Hjørungnes, Game theory in Wire-
less and Communication Networks: Theory, Models, and Applications. Cambridge
University Press, 2012.
[6] Q. Zhu, Z. Yuan, J. B. Song, Z. Han, and T. Başar, “Interference aware routing
game for cognitive radio multi-hop networks,” IEEE Journal on Selected Areas in
Communications, vol. 30, no. 10, pp. 2006–2015, 2012.
[7] Z. Han, D. Niyato, W. Saad, and T. Başar, Game Theory for Next Generation Wireless
and Communication Networks: Modeling, Analysis, and Design. Cambridge University
Press, 2019.
[8] M. H. Manshaei, Q. Zhu, T. Alpcan, T. Başar, and J.-P. Hubaux, “Game theory meets
network security and privacy,” ACM Computing Surveys (CSUR), vol. 45, no. 3, pp. 1–
39, 2013.
[9] Q. Zhu and T. Başar, “Game-theoretic methods for robustness, security, and resilience
of cyberphysical control systems: games-in-games principle for optimal cross-layer re-
silient control systems,” IEEE Control Systems Magazine, vol. 35, no. 1, pp. 46–65,
2015.
[10] P. Stone and M. Veloso, “Multiagent systems: a survey from a machine learning per-
spective,” Autonomous Robots, vol. 8, no. 3, pp. 345–383, 2000.
[11] T. Başar and G. J. Olsder, Dynamic Noncooperative Game Theory, 2nd Edition. So-
ciety for Industrial and Applied Mathematics, 1998.
[12] D. Fudenberg and J. Tirole, Game Theory. Cambridge, MA: MIT Press, 1991.
[13] M. Maschler, E. Solan, and S. Zamir, Game Theory. Cambridge University Press,
2013.
[14] M. O. Jackson and Y. Zenou, “Chapter 3 Games on Networks,” Handbook of Game
Theory with Economic Applications, vol. 4, pp. 95–163, 2015.
[15] S. Shalev-Shwartz, “Online learning and online convex optimization,” Foundations and
Trends® in Machine Learning, vol. 4, no. 2, pp. 107–194, 2011.
[17] J. Lei and U. V. Shanbhag, “Stochastic Nash equilibrium problems: Models, analysis,
and algorithms,” submitted as part of CSM special issue, 2020.
[18] J. B. Rosen, “Existence and uniqueness of equilibrium points for concave N-person
games,” Econometrica, vol. 33, no. 3, pp. 520–534, 1965.
[22] R. Selten, “Reexamination of the perfectness concept for equilibrium points in exten-
sive games,” International Journal of Game Theory, vol. 4, no. 1, pp. 25–55, 1975.
[24] D. Fudenberg, The Theory of Learning in Games. Cambridge, MA: MIT Press, 1998.
[26] P. D. Taylor and L. B. Jonker, “Evolutionary stable strategies and game dynamics,”
Mathematical Biosciences, vol. 40, no. 1-2, pp. 145–156, 1978.
[27] S. Hart and A. Mas-Colell, “Uncoupled dynamics do not lead to Nash equilibrium,”
The American Economic Review, vol. 93, no. 5, pp. 1830–1836, 2003.
[28] J. R. Marden and J. S. Shamma, “Chapter 16 Game Theory and Distributed Control,”
Handbook of Game Theory with Economic Applications, vol. 4, pp. 861–899, 2015.
[29] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48. Springer, 2008.
[33] C. Harris, “On the Rate of Convergence of Continuous-Time Fictitious Play,” Games
and Economic Behavior, vol. 22, no. 2, pp. 238–259, 1998.
[34] J. Hofbauer and K. Sigmund, “Evolutionary game dynamics,” Bulletin of the American
Mathematical Society, vol. 40, no. 4, pp. 479–519, 2003.
[36] V. Krishna and T. Sjöström, “On the convergence of fictitious play,” Mathematics of
Operations Research, vol. 23, no. 2, pp. 479–511, 1998.
[37] P. Mertikopoulos and Z. Zhou, “Learning in games with continuous action sets and
unknown payoff functions,” Mathematical Programming, vol. 173, no. 1-2, pp. 465–507,
2018.
[39] R. D. McKelvey and T. R. Palfrey, “Quantal response equilibria for normal form
games,” Games and Economic Behavior, vol. 10, no. 1, pp. 6–38, 1995.
[43] J. M. Smith and G. R. Price, “The logic of animal conflict,” Nature, vol. 246, no. 5427,
pp. 15–18, 1973.
[44] W. H. Sandholm, Population Games and Evolutionary Dynamics. Cambridge, MA:
MIT Press, 2010.
[45] R. Cressman and Y. Tao, “The replicator equation and other game dynamics,” Proceed-
ings of the National Academy of Sciences, vol. 111, no. Supplement 3, pp. 10810–10817,
2014.
[46] D. S. Leslie and E. Collins, “Generalised weakened fictitious play,” Games and Eco-
nomic Behavior, vol. 56, no. 2, pp. 285–298, 2006.
[47] Q. Zhu, H. Tembine, and T. Başar, “Hybrid learning in stochastic games and its
application in network security,” in Reinforcement Learning and Approximate Dynamic
Programming for Feedback Control, pp. 303–329, John Wiley & Sons, Ltd, 2012.
[50] D. S. Leslie and E. J. Collins, “Individual Q-learning in normal form games,” SIAM
Journal on Control and Optimization, vol. 44, no. 2, pp. 495–514, 2005.
[51] J. Hofbauer, S. Sorin, and Y. Viossat, “Time average replicator and best-reply dynam-
ics,” Mathematics of Operations Research, vol. 34, no. 2, pp. 263–269, 2009.
[55] T. Li and Q. Zhu, “On convergence rate of adaptive multiscale value function ap-
proximation for reinforcement learning,” 2019 IEEE 29th International Workshop on
Machine Learning for Signal Processing (MLSP), pp. 1–6, 2019.
[57] J. C. Spall, “A one-measurement form of simultaneous perturbation stochastic approx-
imation,” Automatica, vol. 33, no. 1, pp. 109–112, 1997.
[59] L. Xiao, “Dual averaging methods for regularized stochastic learning and online opti-
mization,” J. Mach. Learn. Res., vol. 11, p. 2543–2596, Dec. 2010.
[62] J. Hofbauer and S. Sorin, “Best response dynamics for continuous zero-sum games,”
Discrete & Continuous Dynamical Systems - B, vol. 6, no. 1, pp. 215–224, 2006.
[63] J. Hofbauer and W. H. Sandholm, “On the global convergence of stochastic fictitious
play,” Econometrica, vol. 70, no. 6, pp. 2265–2294, 2002.
[65] A. Heliou, J. Cohen, and P. Mertikopoulos, “Learning with bandit feedback in potential
games,” in Advances in Neural Information Processing Systems 30, pp. 6369–6378,
Curran Associates, Inc., 2017.
[66] B. Gao and L. Pavel, “On passivity, reinforcement learning, and higher order learning
in multiagent finite games,” IEEE Transactions on Automatic Control, vol. 66, no. 1,
pp. 121–136, 2019.
[67] E. N. Barron, R. Goebel, and R. R. Jensen, “Best response dynamics for continuous
games,” Proceedings of the American Mathematical Society, vol. 138, no. 03, pp. 1069–
1069, 2010.
[68] B. Swenson, R. Murray, and S. Kar, “On best-response dynamics in potential games,”
SIAM Journal on Control and Optimization, vol. 56, no. 4, pp. 2734–2767, 2018.
[69] B. Swenson, R. Murray, and S. Kar, “Regular potential games,” Games and Economic
Behavior, vol. 124, pp. 432–453, 2020.
[71] J. C. Harsanyi, “Games with randomly disturbed payoffs: A new rationale for mixed-
strategy equilibrium points,” International Journal of Game Theory, vol. 2, no. 1,
pp. 1–23, 1973.
[72] J. Hofbauer and E. Hopkins, “Learning in perturbed asymmetric games,” Games and
Economic Behavior, vol. 52, no. 1, pp. 133–152, 2005.
[73] H. P. Young, “The evolution of conventions,” Econometrica, vol. 61, no. 1, pp. 57–84,
1993.
[74] H. P. Young, “Learning by trial and error,” Games and Economic Behavior, vol. 65,
no. 2, pp. 626–643, 2009.
[75] J. Gaveau, C. J. Le Martret, and M. Assaad, “Performance analysis of trial and error
algorithms,” IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 6,
pp. 1343–1356, 2020.
[80] Z. Hu, M. Zhu, P. Chen, and P. Liu, “On convergence rates of game theoretic rein-
forcement learning algorithms,” Automatica, vol. 104, pp. 90–101, 2019.
[82] Y. Zhou, M. Kantarcioglu, and B. Xi, “A survey of game theoretic approach for adver-
sarial machine learning,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge
Discovery, vol. 9, no. 3, 2019.
[83] K. Zhang, B. Hu, and T. Başar, “On the stability and convergence of robust adversarial
reinforcement learning: A case study on linear quadratic systems,” in Advances in
Neural Information Processing Systems (H. Larochelle, M. Ranzato, R. Hadsell, M. F.
Balcan, and H. Lin, eds.), vol. 33, pp. 22056–22068, 2020.
[85] A. Wibisono, A. C. Wilson, and M. I. Jordan, “A variational perspective on accelerated
methods in optimization,” Proceedings of the National Academy of Sciences, vol. 113,
no. 47, pp. E7351–E7358, 2016.
[86] E. V. Mazumdar, M. I. Jordan, and S. S. Sastry, “On finding local Nash equilibria
(and only local Nash equilibria) in zero-sum games,” arXiv, 2019.
[87] C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan, “On nonconvex op-
timization for machine learning: Gradients, stochasticity, and saddle points,” arXiv,
2019.
[89] W. Su, S. Boyd, and E. J. Candès, “A differential equation for modeling Nesterov’s
accelerated gradient method: theory and insights,” Journal of Machine Learning Re-
search, vol. 17, no. 153, pp. 1–43, 2016.
[90] D. Gadjov and L. Pavel, “A passivity-based approach to Nash equilibrium seeking over
networks,” IEEE Transactions on Automatic Control, vol. 64, no. 3, pp. 1077–1092,
2017.
[91] T. Başar, “Relaxation techniques and asynchronous algorithms for on-line computation
of non-cooperative equilibria,” Journal of Economic Dynamics and Control, vol. 11,
no. 4, pp. 531–549, 1987.
[92] G. Hu, Y. Pang, C. Sun, and Y. Hong, “Distributed Nash equilibrium seeking:
continuous-time control-theoretic approaches,” submitted as part of CSM special is-
sue, 2020.
[97] M. O. Sayin, F. Parise, and A. Ozdaglar, “Fictitious play in zero-sum stochastic
games,” arXiv, 2020.
[98] R. Bellman, “The theory of dynamic programming,” Bulletin of the American Math-
ematical Society, vol. 60, no. 6, pp. 503–515, 1954.
[99] J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal
of Machine Learning Research, vol. 4, no. Nov, pp. 1039–1069, 2003.
[101] J. Bu, L. J. Ratliff, and M. Mesbahi, “Global convergence of policy gradient for se-
quential zero-sum linear quadratic dynamic games,” arXiv, 2019.
[102] K. Zhang, X. Zhang, B. Hu, and T. Başar, “Derivative-free policy optimization for risk-
sensitive and robust control design: implicit regularization and sample complexity,”
arXiv, 2021.
[103] P. Dayan and C. J. Watkins, “Q-learning,” Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[105] H.-T. Wai, Z. Yang, Z. Wang, and M. Hong, “Multi-agent reinforcement learning via
double averaging primal-dual optimization,” arXiv, 2018.
[106] K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Başar, “Fully decentralized multi-agent
reinforcement learning with networked agents,” in Proceedings of the 35th International
Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning Research,
(Stockholmsmässan, Stockholm Sweden), pp. 5872–5881, PMLR, 2018.
[108] T. Li, G. Peng, and Q. Zhu, “Blackwell online learning for Markov decision processes,”
arXiv preprint arXiv:2012.14043, 2020.
[111] V. Hakami and M. Dehghan, “Learning stationary correlated equilibria in constrained
general-sum stochastic games,” IEEE Transactions on Cybernetics, vol. 46, no. 7,
pp. 1640–1654, 2016.
[112] T. Alpcan, T. Başar, R. Srikant, and E. Altman, “CDMA uplink power control as a
noncooperative game,” Wireless Networks, vol. 8, no. 6, pp. 659–670, 2002.
[113] Q. Zhu, J. B. Song, and T. Başar, “Dynamic secure routing game in distributed
cognitive radio networks,” in 2011 IEEE Global Telecommunications Conference-
GLOBECOM 2011, pp. 1–6, IEEE, 2011.
[115] M. J. Farooq and Q. Zhu, “On the secure and reconfigurable multi-layer network design
for critical information dissemination in the internet of battlefield things (IoBT),”
IEEE Transactions on Wireless Communications, vol. 17, no. 4, pp. 2618–2632, 2018.
[116] M. J. Farooq and Q. Zhu, “Modeling, analysis, and mitigation of dynamic botnet
formation in wireless IoT networks,” IEEE Transactions on Information Forensics
and Security, vol. 14, no. 9, pp. 2412–2426, 2019.
[117] J. Chen and Q. Zhu, “A game-theoretic framework for resilient and distributed gener-
ation control of renewable energies in microgrids,” IEEE Transactions on Smart Grid,
vol. 8, no. 1, pp. 285–295, 2016.
[118] S. Maharjan, Q. Zhu, Y. Zhang, S. Gjessing, and T. Başar, “Demand response man-
agement in the smart grid in a large population regime,” IEEE Transactions on Smart
Grid, vol. 7, no. 1, pp. 189–199, 2015.
[119] J. Chen, C. Touati, and Q. Zhu, “A dynamic game approach to strategic design of se-
cure and resilient infrastructure network,” IEEE Transactions on Information Foren-
sics and Security, vol. 15, pp. 462–474, 2019.
[120] L. Huang, J. Chen, and Q. Zhu, “A large-scale Markov game approach to dynamic
protection of interdependent infrastructure networks,” in International Conference on
Decision and Game Theory for Security, pp. 357–376, Springer, 2017.
[121] J. Chen and Q. Zhu, “Interdependent network formation games with an application to
critical infrastructures,” in 2016 American Control Conference (ACC), pp. 2870–2875,
IEEE, 2016.
[122] J. Chen, C. Touati, and Q. Zhu, “Heterogeneous multi-layer adversarial network de-
sign for the IoT-enabled infrastructures,” in GLOBECOM 2017-2017 IEEE Global
Communications Conference, pp. 1–6, IEEE, 2017.
[123] Z. Xu and Q. Zhu, “A game-theoretic approach to secure control of communication-
based train control systems under jamming attacks,” in Proceedings of the 1st Interna-
tional Workshop on Safe Control of Connected and Autonomous Vehicles, pp. 27–34,
2017.
[124] Q. Zhu, W. Saad, Z. Han, H. V. Poor, and T. Başar, “Eavesdropping and jamming
in next-generation wireless networks: A game-theoretic approach,” in 2011-MILCOM
2011 Military Communications Conference, pp. 119–124, IEEE, 2011.
[126] L. Huang and Q. Zhu, “A dynamic games approach to proactive defense strategies
against advanced persistent threats in cyber-physical systems,” Computers & Security,
vol. 89, p. 101660, 2020.
[127] Q. Zhu and S. Rass, “On multi-phase and multi-stage game-theoretic modeling of
advanced persistent threats,” IEEE Access, vol. 6, pp. 13958–13971, 2018.
[128] N. Al-Falahy and O. Y. Alani, “Technologies for 5G networks: challenges and oppor-
tunities,” IT Professional, vol. 19, no. 1, pp. 12–20, 2017.
[129] J. B. Song and Q. Zhu, “Performance of dynamic secure routing game,” in Game
Theory for Networking Applications, pp. 37–56, Springer, 2019.
[131] S. Liu, T. Li, and Q. Zhu, “Communication-efficient distributed machine learning over
strategic networks: A two-layer game approach,” arXiv preprint arXiv:2011.01455,
2020.
[132] S. Liu, T. Li, and Q. Zhu, “Game-theoretic distributed empirical risk minimization with
strategic network design,” IEEE Transactions on Signal and Information Processing
over Networks, vol. 9, pp. 542–556, 2023.
[133] J. Chen and Q. Zhu, “A game- and decision-theoretic approach to resilient inter-
dependent network analysis and design,” SpringerBriefs in Electrical and Computer
Engineering, pp. 75–102, 2019.
[134] Q. Zhu, “Multilayer cyber-physical security and resilience for smart grid,” in Smart
Grid Control, pp. 225–239, Springer, 2019.
[135] M. J. Farooq and Q. Zhu, “On the secure and reconfigurable multi-layer network design
for critical information dissemination in the internet of battlefield things (IoBT),”
IEEE Transactions on Wireless Communications, vol. 17, no. 4, pp. 2618–2632, 2018.
[136] J. Foerster, I. A. Assael, N. De Freitas, and S. Whiteson, “Learning to communi-
cate with deep multi-agent reinforcement learning,” Advances in Neural Information
Processing Systems, vol. 29, pp. 2137–2145, 2016.
[139] N. Li and J. R. Marden, “Designing games for distributed optimization,” IEEE Journal
of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 230–242, 2013.
[140] H. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and
Applications, vol. 35. Springer Science & Business Media, 2003.
A Fictitious Play
Consider the repeated play between two players, with each player knowing his own utility
function. Further, each player is able to observe the actions of the other player and choose
an optimal action based on the empirical frequency of these actions.
In fictitious play, from player 1’s viewpoint, player 2’s strategy at time k can be estimated as
$$\pi_2^k(a) = \frac{1}{k}\sum_{s=1}^{k} \mathbb{1}\{a_2^s = a\}, \quad a \in \mathcal{A}_2,$$
which is the empirical frequency of the actions player 2 has played.
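As a minimal numerical illustration (with an arbitrarily chosen matching-pennies game, which is an assumption for demonstration rather than part of the discussion above), the sketch below runs discrete-time fictitious play: each player best-responds to the empirical frequency of the opponent’s past actions.

```python
import numpy as np

# illustrative payoff matrices: entry [a1, a2] is the payoff when player 1
# plays a1 and player 2 plays a2 (matching pennies, zero-sum)
U1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
U2 = -U1

counts1 = np.ones(2)   # action counts, initialized to 1 to avoid empty histories
counts2 = np.ones(2)

for k in range(10000):
    pi1 = counts1 / counts1.sum()    # empirical frequency of player 1's actions
    pi2 = counts2 / counts2.sum()    # empirical frequency of player 2's actions
    a1 = int(np.argmax(U1 @ pi2))    # player 1 best-responds to pi2
    a2 = int(np.argmax(U2.T @ pi1))  # player 2 best-responds to pi1
    counts1[a1] += 1
    counts2[a2] += 1

print("empirical frequencies:", counts1 / counts1.sum(), counts2 / counts2.sum())
# for this zero-sum example the empirical frequencies approach the mixed
# Nash equilibrium (1/2, 1/2) for both players
```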
B Replicator Dynamics
Recall that the continuous-time learning dynamics under dual averaging are
$$\frac{d\hat{u}_i(t)}{dt} = u_i(\pi_{-i}(t)), \qquad \pi_i(t) = Q_R^{\epsilon}(\hat{u}_i(t)). \tag{DA-c}$$
We now consider the entropy regularizer $h(x) = \sum_{i} x_i \log x_i$ and let $\epsilon = 1$ for simplicity, in which case the regularized choice map reduces to the logit (softmax) map $\pi_{i,a}(t) = e^{\hat{u}_{i,a}(t)} / \sum_{a'} e^{\hat{u}_{i,a'}(t)}$. Differentiating the strategy $\pi_i(t)$ with respect to the time variable in (DA-c), we arrive at
$$\begin{aligned}
\frac{d\pi_{i,a}(t)}{dt} &= \frac{1}{\big(\sum_{a'} e^{\hat{u}_{i,a'}(t)}\big)^2}\left( \frac{d\hat{u}_{i,a}(t)}{dt}\, e^{\hat{u}_{i,a}(t)} \sum_{a'} e^{\hat{u}_{i,a'}(t)} - e^{\hat{u}_{i,a}(t)} \sum_{a'} e^{\hat{u}_{i,a'}(t)}\, \frac{d\hat{u}_{i,a'}(t)}{dt} \right)\\
&= \pi_{i,a}(t)\left( \frac{d\hat{u}_{i,a}(t)}{dt} - \sum_{a'} \pi_{i,a'}(t)\, \frac{d\hat{u}_{i,a'}(t)}{dt} \right)\\
&= \pi_{i,a}(t)\big[u_i(a, \pi_{-i}(t)) - u_i(\pi_i(t), \pi_{-i}(t))\big].
\end{aligned} \tag{RD}$$
From the equation above, we can see that for a certain action a, if its outcome ui (a, π−i (t))
is above the average ui (πi (t), π−i (t)), then it will be “reinforced” in the sense that the prob-
ability of choosing a gets higher as time evolves. The above equation (RD) is referred to
as replicator dynamics, and has been widely used in evolutionary game theory to under-
stand natural selection and population biology. We consider a two-population system and
we reinterpret the elements in the two-player game using population biology language. For
population 1, there are |A1 | types and each type is specified by an element a ∈ A1 . We let
π1,a (t) be the percentage of type a in population 1 at time t, and assume here that π1 (t) is
differentiable with respect to time t, as the population, which is infinitely large, interacts
with the other population in a continuous-time manner.
For population 2, we have similar notions. If individuals from the two populations meet
randomly, then they engage in a competition or a game with payoff dependent on their types.
For example, if type a1 from population 1 competes with type a2 from population 2, then the
payoffs for the two types are given by u1 (a1 , a2 ) and u2 (a1 , a2 ), respectively. For population
i, if we assume that the per capita rate of growth is given by the difference between the
payoff for type a and the average payoff in the population, a rule studied in [43], then the
percentage of different types within a population is precisely described by
$$\frac{1}{\pi_{i,a}(t)}\,\frac{d\pi_{i,a}(t)}{dt} = u_i(a, \pi_{-i}(t)) - u_i(\pi_i(t), \pi_{-i}(t)),$$
which is exactly the replicator dynamics (RD). In addition, as shown in [38], different
regularizers lead to different learning dynamics, whose distinct asymptotic behaviors account for evolutionary processes under different circumstances.
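For a numerical illustration of (RD), the sketch below integrates the replicator dynamics with a forward-Euler scheme on a two-population coordination game; the payoff matrices, initial shares, and step size are assumptions chosen only for demonstration.

```python
import numpy as np

# illustrative two-population coordination game: both sides earn 2 when
# they coordinate on action 0 and 1 when they coordinate on action 1
U1 = np.array([[2.0, 0.0], [0.0, 1.0]])
U2 = np.array([[2.0, 0.0], [0.0, 1.0]])

pi1 = np.array([0.6, 0.4])   # initial type shares in population 1
pi2 = np.array([0.6, 0.4])   # initial type shares in population 2
dt = 0.01

for _ in range(5000):
    payoff1 = U1 @ pi2       # payoff of each type in population 1
    payoff2 = U2.T @ pi1     # payoff of each type in population 2
    # forward-Euler step of the replicator dynamics (RD)
    pi1 = pi1 + dt * pi1 * (payoff1 - pi1 @ payoff1)
    pi2 = pi2 + dt * pi2 * (payoff2 - pi2 @ payoff2)

print(np.round(pi1, 3), np.round(pi2, 3))
# with these initial shares, both populations converge to the pure
# equilibrium in which everyone plays the first action
```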
With replicator dynamics and other related evolutionary dynamics, biologists can predict
the evolutionary outcome of the multi-population system by examining the Nash equilib-
rium of the underlying game, which brings strategic reasoning into population biology and
has a profound influence in evolutionary game theory [44, 45]. Moreover, the Nash equi-
librium in this population game, characterized by the limiting behavior of the dynamics
under proper conditions [45], represents an evolutionarily stable state of the population,
which is an important refinement of Nash equilibrium. When this stable state is reached,
natural selection alone is sufficient to prevent the population from being influenced by mu-
tation [34, 44]. For more details on this refinement and its application in biology, we refer
the reader to [11, 34, 44, 45].
$$f_i(\pi_i^k, \hat{u}_i^{k+1}) = \mathbb{E}\big[F_i(\pi_i^k, \hat{u}_i^{k+1}, U_i^{k+1}, a_i^{k+1}) \,\big|\, \mathcal{F}^{k-1}\big],$$
$$g_i(\pi_i^k, \hat{u}_i^{k}) = \mathbb{E}\big[G_i(\pi_i^k, \hat{u}_i^{k}, U_i^{k+1}, a_i^{k+1}) \,\big|\, \mathcal{F}^{k-1}\big].$$
With the mean-field parts defined as above, $M_i^{k+1} = F_i(\pi_i^k, \hat{u}_i^{k+1}, U_i^{k+1}, a_i^{k+1}) - f_i(\pi_i^k, \hat{u}_i^{k+1})$, and $\Gamma_i^{k+1}$ takes a similar form. Here $\bar{\lambda}_i^k, \bar{\mu}_i^k$ are time-scaling factors dependent on the learning rates $\lambda_i^k, \mu_i^k$, which account for the adjustment of the original step sizes in asynchronous schemes [29, 64]; in synchronous cases, the time-scaling factors coincide with the original step sizes. Similar to our discussion in the main text (see (18) and (19)), we consider the dynamical system of the joint strategy profile $\pi^k$ and utility vector $\hat{u}^k$:
$$\pi^{k+1} - \pi^k = \bar{\lambda}^k f(\pi^k, \hat{u}^{k+1}) + M^{k+1}, \qquad \hat{u}^{k+1} - \hat{u}^k = \bar{\mu}^k g(\pi^k, \hat{u}^k) + \Gamma^{k+1}, \tag{DSA}$$
where $f$ and $g$ are concatenations of $\{f_i\}_{i\in\mathcal{N}}$ and $\{g_i\}_{i\in\mathcal{N}}$, respectively, and $\bar{\lambda}^k, \bar{\mu}^k$ and $M^k, \Gamma^k$ take similar concatenated forms.
As we have discussed in “Convergence of Learning in Games”, in order to obtain an ap-
proximately accurate score function, the two coupled dynamical systems in (DSA) should op-
erate on different timescales: the score function ûk should be updated sufficiently many times
until near-convergence before updating the strategy. This two-timescale iteration can be
achieved by adjusting the time-scaling factors: λ̄k and µ̄k are chosen so that limk→∞ λ̄k /µ̄k =
0. To understand this timescale system, it is instructive to consider a coupled continuous-
time dynamical system, as suggested in [29]:
$$\frac{d\pi(t)}{dt} = f(\pi(t), \hat{u}(t)), \qquad \frac{d\hat{u}(t)}{dt} = \frac{1}{\varepsilon}\, g(\pi(t), \hat{u}(t)), \tag{C2}$$
where $\varepsilon$ tends to zero in the limit. Hence, $\hat{u}(t)$ is the fast transient while $\pi(t)$ is the slow component. Then,
we can analyze the long-run behavior of the above coupled system as if the fast process is
always fully calibrated to the current value of the slow process. This suggests investigating
the ODE
$$\frac{d\hat{u}(t)}{dt} = g(\pi, \hat{u}(t)), \tag{C3}$$
where π is held fixed as a constant parameter. Suppose (C3) has a globally asymptotically
stable equilibrium Λ(π), where the mapping Λ(·) satisfies regularity conditions specified
in [30,64]. Then, it is reasonable to expect û(t) given by (C3) to closely track Λ(π). In turn,
this suggests that the investigation into the coupled system (C2) is equivalent to the study
of the single-timescale one
$$\frac{d\pi(t)}{dt} = f(\pi(t), \Lambda(\pi(t))), \tag{C4}$$
which would capture the long-run behavior of π(t) in (C2) to a good approximation [29].
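The two-timescale mechanism can be illustrated on a toy scalar problem, shown in the sketch below: the fast variable is driven toward a point Λ(x) that depends on the slow variable, while the slow variable is updated with step sizes satisfying λ̄^k/μ̄^k → 0. The maps f and g, the step-size schedules, and the noise level are illustrative assumptions, not the game dynamics themselves.

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = 2.0, 0.0          # slow (strategy-like) and fast (score-like) variables

def f(x, y):             # slow-timescale mean field; toy choice f(x, y) = -y
    return -y

def g(x, y):             # fast-timescale mean field; toy choice g(x, y) = x - y,
    return x - y         # whose equilibrium for fixed x is Lambda(x) = x

for k in range(1, 200001):
    lam = 1.0 / k        # slow step size (lam / mu -> 0)
    mu = 1.0 / k ** 0.6  # fast step size
    noise_x = 0.1 * rng.standard_normal()
    noise_y = 0.1 * rng.standard_normal()
    x = x + lam * (f(x, y) + noise_x)   # slow update, cf. the first line of (DSA)
    y = y + mu * (g(x, y) + noise_y)    # fast update, cf. the second line of (DSA)

print(round(x, 3), round(y, 3))
# the fast variable tracks Lambda(x) = x, and the slow variable follows the
# averaged ODE dx/dt = f(x, Lambda(x)) = -x, so both drift toward zero
```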
Informally speaking, to study the convergence of (DSA), we can relate its discrete-time
trajectory to that of (C2), which is further equivalent to (π(t), Λ(π(t))) specified by (C4).
Therefore, we can apply Lyapunov stability theory to (C4), in order to derive the conver-
gence results of the original discrete-time algorithm. We begin with the linear interpolation
process of the discrete-time trajectory, which connects the discrete-time system (DSA) and
its continuous-time counterpart (C2), (C4). Under some regularity conditions [30], for $\{\pi^k\}$, the sequence generated by (DSA), we can construct the following continuous-time process $\bar{\pi}(t): \mathbb{R}_+ \to \Delta(\mathcal{A})$, based on the linear interpolation of $\{\pi^k\}$. Letting $\tau^0 = 0$ and $\tau^k = \sum_{s=1}^{k} \bar{\lambda}^s$, we define
$$\bar{\pi}(t) := \pi^k + (t - \tau^k)\,\frac{\pi^{k+1} - \pi^k}{\tau^{k+1} - \tau^k}, \qquad t \in [\tau^k, \tau^{k+1}).$$
Similarly, we can define a continuous-time process ū(t) corresponding to {ûk }.
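The interpolated process itself is straightforward to compute; as a small illustration under assumed scalar iterates and step sizes, the sketch below builds π̄(t) on the timeline τ^k and evaluates it between two discrete iterates.

```python
import numpy as np

K = 50
lam = 1.0 / np.arange(1, K)                 # illustrative step sizes lambda^1..lambda^{K-1}
pi_iter = np.cumsum(np.r_[0.0, 0.1 * lam])  # illustrative scalar iterates pi^0..pi^{K-1}
tau = np.r_[0.0, np.cumsum(lam)]            # tau^0 = 0, tau^k = sum of the first k step sizes

def pi_bar(t):
    """Piecewise-linear interpolation of the iterates on the tau timeline."""
    return float(np.interp(t, tau, pi_iter))

print(pi_bar(1.7))   # value of the interpolated process between two discrete iterates
```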
As shown in [30, 64], such a linearly interpolated process (π̄(t), ū(t)) is closely related to
the flow of the following differential equations:
$$\frac{d\pi(t)}{dt} = f(\pi(t), \hat{u}(t)), \qquad \frac{d\hat{u}(t)}{dt} = g(\pi(t), \hat{u}(t)). \tag{C5}$$
We note that (C5) is defined for ease of presentation, and the actual differential inclusion
system involves a rearrangement of several terms; we refer the reader to [64] for more details. Further, we denote the flow of (C5) by
$$\Phi_t(\pi^0, u^0) := \big\{(\pi(t), \hat{u}(t)) \,\big|\, (\pi(t), \hat{u}(t)) \text{ is a solution to (C5) with } \pi(0) = \pi^0,\ \hat{u}(0) = u^0\big\}.$$
The key to stochastic approximation theory lies in the fact that, in the presence of a global attractor for (C5), the continuous-time process $(\bar{\pi}(t), \bar{u}(t))$ asymptotically tracks the flow with arbitrary accuracy over windows of arbitrary length [30]:
$$\lim_{t\to\infty} \sup_{s\in[0,T]} \operatorname{dist}\big\{(\bar{\pi}(t+s), \bar{u}(t+s)),\ \Phi_s(\bar{\pi}(t), \bar{u}(t))\big\} = 0,$$
where $\operatorname{dist}\{\cdot, \cdot\}$ denotes a distance measure on $\Delta(\mathcal{A}) \times \mathbb{R}^{\mathcal{A}}$. We refer to $(\bar{\pi}(t), \bar{u}(t))$ as an asymptotic pseudo-trajectory (APT) of the dynamics (C5). In other words, in order to study the convergence of (DSA), we can resort to the convergence analysis of (C5), which can be addressed by Lyapunov stability theory as shown in [30, 64]; the key conclusion is that if there is a global attractor $\mathcal{A}$ for (C4), then the interpolated process $(\bar{\pi}(t), \bar{u}(t))$, or simply $(\pi^k, \hat{u}^k)$, converges almost surely to $(\mathcal{A}, \Lambda(\mathcal{A}))$.