
Game-Theoretic Learning in Distributed

Control

Jason R. Marden and Jeff S. Shamma

Abstract
In distributed architecture control problems, there is a collection of interconnected decision-making components that seek to realize desirable collective behaviors through local interactions and by processing local information. Applications range from autonomous vehicles to energy to transportation. One approach to control of such distributed architectures is to view the components as players in a game. In this approach, two design considerations are the components' incentives and the rules that dictate how components react to the decisions of other components. In game-theoretic language, the incentives are defined through utility functions, and the reaction rules are online learning dynamics. This chapter presents an overview of this approach, covering basic concepts in game theory, special game classes, measures of distributed efficiency, utility design, and online learning rules, all with the interpretation of using game theory as a prescriptive paradigm for distributed control design.

Keywords
Learning in games • Evolutionary games • Multiagent systems • Distributed
decision systems

This work was supported by ONR Grant #N00014-17-1-2060 and NSF Grant #ECCS-1638214
and by funding from King Abdullah University of Science and Technology (KAUST).
J.R. Marden
Department of Electrical and Computer Engineering, University of California, Santa Barbara,
CA, USA
e-mail: [email protected]
J.S. Shamma
Computer, Electrical and Mathematical Science and Engineering Division (CEMSE), King
Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
e-mail: [email protected]

© Springer International Publishing AG 2017


T. Başar, G. Zaccour (eds.), Handbook of Dynamic Game Theory,
DOI 10.1007/978-3-319-27335-8_9-1

Contents
1 Introduction
2 Game-Theoretic Distributed Resource Utilization
2.1 Setup
2.2 Prescriptive Paradigm
3 Solution Concepts, Game Structures, and Efficiency
3.1 Solution Concepts
3.2 Measures of Efficiency
3.3 Smoothness
3.4 Game Structures
3.5 Illustrative Examples
3.6 A Brief Review of Game Design Methodologies
4 Distributed Learning Rules
4.1 Model-Based Learning
4.2 Robust Distributed Learning
4.3 Equilibrium Selection in Potential Games
4.4 Universal Learning
5 Conclusion
References

1 Introduction

There is growing interest in distributed architecture or networked control systems, with emergent applications ranging from smart grid to autonomous vehicle networks to mobile sensor platforms. As opposed to a traditional control system architecture, there is no single decision-making entity with full information and full authority that acts as an overall system controller. Rather, decisions are made by a collective of interacting entities with local information and limited communication capabilities. The challenge is to derive distributed controllers to induce desirable collective behaviors.
One approach to distributed architecture systems is to view the decision-making
components as individual players in a game and to formulate the distributed control
problem in terms of game theory. The basic elements of what constitutes a game
are (i) a set of players or agents; (ii) for each player, a set of choices; and (iii) for
each player, preferences over the collective choices of agents, typically expressed in
the form of a utility function. In traditional game theory (e.g., Fudenberg and Tirole
1991), these elements are a model of a collection of decision-makers, typically in
a societal context (e.g., competing firms, voters, bidders in an auction, etc.). In the
context of distributed control, these elements are design considerations: one has the freedom to choose how to decompose a distributed control problem and how to design the preferences/utility functions to properly incentivize agents. Stated
differently, game theory in this context is being used as a prescriptive paradigm,
rather than a descriptive paradigm (Marden and Shamma 2015; Shoham et al. 2007).
Formulating a distributed control problem in game-theoretic terms implicitly
suggests that the outcome – or, more appropriately, the solution concept – of the
resulting game is a desirable collective configuration. The most well-known solution
concept is Nash equilibrium, in which each player’s choice is optimal with respect
to the choices of other agents. Other solution concepts, which are generalizations of
Nash equilibrium, are correlated and coarse correlated equilibrium (Young 2004).
Typically, a solution concept does not uniquely specify the outcome of a game
(e.g., a game can have multiple Nash equilibria), and so there is the issue that some
outcomes are better than others.
A remaining concern is how a solution concept emerges at all. Given the
complete description of a game, an outside party can proceed to compute (modulo computational complexity considerations, Daskalakis et al. 2009) a proposed
solution concept realization. In actuality, the data of a game (e.g., specific utility
functions) is distributed among the players and not necessarily shared or commu-
nicated. Rather, over time players might make observations of the choices of the
other players and eventually the collective play converges to some limiting structure.
This latter scenario is the topic of game-theoretic learning, for which there are
multiple survey articles and monographs (e.g., Fudenberg and Levine 1998; Hart
2005; Shamma 2014; Young 2004). Under the descriptive paradigm, the learning in
games discussion provides a plausibility argument of how players may arrive at a
specified solution concept realization. Under the prescriptive paradigm, the learning
in games discussion suggests an online algorithm that can lead agents to a desirable
solution concept realization.
This article will provide an overview of approaching distributed control from the
perspective of game theory. The presentation will touch on each of the aforemen-
tioned aspects of problem formulation, game design, and game-theoretic learning.

2 Game-Theoretic Distributed Resource Utilization

2.1 Setup

Various problems of interest take the form of allocating a collection of assets to utilize a set of resources to a desired effect. In sensor coverage problems (e.g., Cortes
et al. 2002), the “assets” are mobile sensors, and the “resources” are the regions to be
covered by the sensors. For any given allocation, there is an overall score reflecting
the quality of the coverage. In traffic routing problems (e.g., Roughgarden 2005), the
“assets” are vehicles (or packets in a communication setting), and the “resources”
are roads (or channels). The objective is to route traffic from origins to destinations
in order to minimize a global cost such as congestion.
It will be instructive to interpret the forthcoming discussion on distributed
control in the framework of such distributed resource utilization problems. As
previously mentioned, the framework captures a variety of applications of interest.
Furthermore, focusing on this specific setting will enhance the clarity of exposition.
More formally, the problem is to allocate a collection of assets N = {1, 2, ..., n} over a collection of resources R = {1, 2, ..., m} in order to optimize a given system-level objective. The set A_i ⊆ 2^R is the set of allowable resource selections by asset i. In terms of the previous examples, an allowable resource selection is an area covered by a sensor or a set of roads used by a vehicle. The system-level objective is a mapping W : A → ℝ, where A = A_1 × ⋯ × A_n denotes the set of joint resource selections. We denote a collective configuration by the tuple a = (a_1, a_2, ..., a_n), where a_i ∈ A_i is the choice, or action, of agent i.
Moving toward a game-theoretic model, we will identify the set of assets as the
set of agents or players. Likewise, we will identify Ai as the choice set of agent i .
We defer for now specifying a utility function for agent i .
Looking forward to the application of game-theoretic learning, we will consider agents selecting actions iteratively over an infinite time horizon t ∈ {1, 2, ...}. Depending on the update rules of the agents, the outcome is a sequence of joint actions a(1), a(2), a(3), .... The action of agent i at time t is chosen according to some update policy π_i(·), i.e.,

a_i(t) = π_i(information available to agent i at time t).    (1)

The update policy π_i(·) specifies how agent i processes available information to formulate a decision. We will be more explicit about the argument of the π_i(·)'s in the forthcoming discussion. For now, the information available to an agent can include both knowledge regarding previous action choices of other agents and certain system-level information that is propagated throughout the system.
The main goal is to design both the agents' utility functions and the agents' local policies {π_i}_{i∈N} to ensure that the emergent collective behavior optimizes the global objective W in terms of the asymptotic properties of W(a(t)) as t → ∞.

2.2 Prescriptive Paradigm

Once the players and their choices have been set, the remaining elements in the prescriptive paradigm that are yet to be designed are (i) the agent utility functions and (ii) the update policies {π_i}_{i∈N}. One can view this specification in terms of the following two-step design procedure:

Step #1: Game Design. The first step of the design involves defining the underlying interaction structure in a game-theoretic environment. In particular, this choice involves defining a utility function for each agent i ∈ N of the form U_i : A → ℝ. The utility of agent i for an action profile a = (a_1, a_2, ..., a_n) is expressed as U_i(a), or alternatively U_i(a_i, a_{-i}), where a_{-i} denotes the actions of the players other than i in the joint action a, i.e., a_{-i} = (a_1, ..., a_{i-1}, a_{i+1}, ..., a_n). A key feature of this design choice is the coupling of the agents' utility functions, where the utility, or payoff, of one agent is affected by the actions of other agents.

Step #2: Learning Design. The second step involves defining the decision-making
rules for the agents. That is, how does each agent process available information to
formulate a decision. A typical assumption in the framework of learning in games
is that each agent uses historical information from previous actions of itself and
other players. Accordingly, at each time t the decision of each agent i ∈ N is made independently through a learning rule of the form

a_i(t) = π_i({a(τ)}_{τ=1,...,t−1}; U_i(·)).    (2)

There are two important considerations in the above formulation. First, we stated for simplicity that agents can observe the actions of all other agents. In games with a graphical structure (Kearns et al. 2001), one only requires historical information from a subset of other players. Other reductions are also possible, such as aggregate information of other players or even just measurements of one's own utility (Fudenberg and Levine 1998; Hart 2005; Shamma 2014; Young 2004).¹
Second, implicit in the above construction is that the learning rule is defined
independently of the utility function, and an agent’s utility function then enters as a
parameter of a specified learning rule.
This second consideration offers a distinction between conventional distributed control and game-theoretic distributed control in the role of the utility functions of the individual agents {U_i}_{i∈N}. An agent may be using a specific learning rule, but the realized behavior depends on the specified utility function. In a more conventional approach, there need not be such a decomposition. Rather, one might directly specify the agents' control policies {π_i}_{i∈N} and perform an analysis regarding the emergent properties of the given design, e.g., as is done in models of flocking or bio-inspired controls (Olfati-Saber 2006). An advantage of the decomposition is that one can analyze learning rules for classes of games and separately examine whether or not specified utility functions conform to such an assumed game class.
The following example demonstrates how a given distributed control policy {π_i}_{i∈N} can be reinterpreted as a game-theoretic control approach with appropriately defined agent utility functions {U_i}_{i∈N}.

Example 1 (Consensus). Consider the well-studied consensus/rendezvous problem (Blondel et al. 2005b; Jadbabaie et al. 2003; Olfati-Saber and Murray 2003; Touri and Nedic 2011) where the goal is to drive the agents to agreement on a state x* ∈ ℝ when each agent has limited information regarding the state of other agents in the system. Specifically, we will say that the set of admissible states (or actions) of each agent i ∈ N is A_i = ℝ and that agent i at stage t can observe the previous state choices at stage t − 1 of a set of neighboring agents denoted by N_i(t) ⊆ N \ {i}. Consider the following localized averaging dynamics where the decision of an agent i ∈ N at time t is of the form

a_i(t) = (1/|N_i(t)|) Σ_{j ∈ N_i(t)} a_j(t − 1).    (3)

¹ Alternative agent control policies where the policy of agent i also depends on previous actions of agent i or auxiliary "side information" could also be replicated by introducing an underlying state in the game-theoretic environment. The framework of state-based games, introduced in Marden (2012), represents one such framework that could accomplish this goal.
Given an initial state profile a(0), the dynamics in (3) produce a sequence of state profiles a(1), a(2), .... Whether or not the state profiles converge to consensus under the above dynamics (or variants thereof) has been extensively studied in the existing literature (Blondel et al. 2005a; Olfati-Saber and Murray 2003; Tsitsiklis et al. 1986).
Now we will present a game-theoretic design that leads to the same collective behavior. More formally, consider a game-theoretic model where each agent i ∈ N is assigned an action set A_i = ℝ and a utility function of the form

U_i(a_i, a_{-i}) = −(1/(2|N_i(t)|)) Σ_{j ∈ N_i(t)} (a_i − a_j)²,    (4)

where |N_i(t)| denotes the cardinality of the set N_i(t). Now, suppose each agent follows the well-known best-response learning rule of the form

a_i(t) ∈ B_i(a_{-i}(t − 1)) = arg max_{a_i ∈ A_i} U_i(a_i, a_{-i}(t − 1)),

where B_i(a_{-i}(t − 1)) is referred to as the best-response set of agent i to the action profile a_{-i}(t − 1). Given an initial state profile a(0), it is straightforward to show that the ensuing action or state profiles a(1), a(2), ..., will be equivalent for both design choices.
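
To make the equivalence concrete, the following minimal Python sketch (the four-agent graph, initial states, and synchronous update schedule are illustrative assumptions, not taken from the chapter) implements best-response dynamics under the quadratic utility (4) and shows that each update is exactly the local averaging step (3), driving the states toward consensus.

```python
import numpy as np

# Assumed undirected interaction graph on four agents (neighbor sets N_i).
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def utility(i, a_i, a):
    """Quadratic consensus utility (4): -(1/(2|N_i|)) * sum_{j in N_i} (a_i - a_j)^2."""
    nbrs = neighbors[i]
    return -sum((a_i - a[j]) ** 2 for j in nbrs) / (2 * len(nbrs))

def best_response(i, a):
    """Maximizing the concave utility over a_i in R gives the neighborhood average, i.e., (3)."""
    return float(np.mean([a[j] for j in neighbors[i]]))

a = np.array([0.0, 1.0, 4.0, 9.0])                       # initial state profile a(0)
# Sanity check: the averaging step is indeed a best response to the quadratic utility.
assert utility(0, best_response(0, a), a) >= utility(0, best_response(0, a) + 0.1, a)

for t in range(50):                                      # synchronous best-response updates
    a = np.array([best_response(i, a) for i in range(len(a))])
print("states after 50 rounds:", np.round(a, 3))         # (approximately) a common value
```
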

The above example illustrates the separation between the learning rule and the
utility function. The learning rule is best-response dynamics. When the utility
function is the above quadratic form, then the combination leads to the usual
distributed averaging algorithm. If the utility function is changed (e.g., weighted,
non-quadratic, etc.), then the realization of best-response learning is altered, as well
as the structure of the game defined by the collection of the utility functions, but the
learning rule remains best-response dynamics.
An important property of best-response dynamics and other learning rules of
interest is that the actions of agent i can depend explicitly on the utility function of
agent i but not (explicitly) on the utility functions of other agents. This property
of learning rules in the learning in games literature is called being uncoupled
(Babichenko 2012; Hart and Mansour 2010; Hart and Mas-Colell 2003; Young
2004). Of course, the action stream of agent i, i.e., a_i(0), a_i(1), ..., does depend
on the actions of other agents, but not the utility functions behind those actions.
It turns out that there are many instances in which control policies not derived
from a game-theoretic perspective can be reinterpreted as the realization of an
uncoupled learning rule from a game-theoretic perspective. These include control
policies that have been widely studied in the cooperative control literature with
application domains such as consensus and flocking (Olfati-Saber et al. 2007;
Tsitsiklis 1987), sensor coverage (Martinez et al. 2007; Murphey 1999), and routing
information over networks (Roughgarden 2005), among many others.

While the design of such control policies can be approached in either a traditional
perspective or a game-theoretic perspective, there are potential advantages associ-
ated with viewing control design from a game-theoretic perspective. In particular,
a game-theoretic perspective allows for a modularized design architecture, i.e.,
the separation of game design and learning design, that can be exploited in a
plug-and-play fashion to provide control algorithms with automatic performance
guarantees:

Game Design Methodologies. There are several established methodologies for the
design of agent objective functions, e.g., Shapley value and marginal contribution
(Marden and Wierman 2013). The methodologies, which will be briefly reviewed
in Sect. 3.6, are systematic procedures for deriving the agent objective functions {U_i}_{i∈N} from a given system-level objective function W. These methodologies
often provide structural guarantees on the resulting game, e.g., existence of a pure
Nash equilibrium or a potential game structure, that can be exploited in distributed
learning.

Learning Design Methodologies. The field of learning in games has sought to establish decision-making rules that lead to Nash equilibrium or other solution concepts in strategic form games. In general, it has been shown (see Hart and Mas-Colell 2003) that there are no "natural" dynamics that converge to Nash equilibria in all games, where natural refers to dynamics that do not rely on some form of centralized coordination, e.g., exhaustive search of the joint action profiles. For example, there are no rules of the form (2) that guarantee convergence to a Nash equilibrium in every game. However, the same limitations do not hold when we
transition from “all games” to “all games of a given structure.” In particular, there
are several positive results in the context of learning in games for special classes
of games (e.g., potential games and variants thereof). These results, which will be
discussed in Sect. 4, identify learning dynamics that yield desirable performance
guarantees when applied to the realm of potential games.

Performance Guarantees. Merging a game design methodology with an appropriate learning design methodology can often result in agent control policies with automatic performance guarantees. For example, employing a game design where agent utility functions constitute a potential game, coupled with a learning algorithm that ensures convergence to a pure Nash equilibrium in potential games, provides agent control policies that converge to a pure Nash equilibrium of the derived game.
Furthermore, additional structure on the agents’ utility functions can often be
exploited to provide efficiency bounds on the Nash equilibria, cf., price of anarchy
(Nisan et al. 2007), as well as approximations for the underlying convergence rates
(Borowski et al. 2013; Montanari and Saberi 2009; Shah and Shin 2010).

Human-Agent Collaborative Systems. Game theory constitutes a design choice


for control policies in distributed systems comprised purely of engineering compo-
nents. However, when a networked system consists of both engineering and human
decision-making entities, e.g., the smart grid, game theory transitions from a design
choice to a necessity. The involvement of human decision-making entities in a
system requires that the system operator utilizes game theory for the purpose of
modeling and influencing the human decision-making entities to optimize system
performance.

3 Solution Concepts, Game Structures, and Efficiency

Recall that an important metric in the game-theoretic approach to distributed control is the asymptotic behavior of a system-level objective function, i.e., W(a(t)) as t → ∞. These asymptotic properties depend on both aspects of the prescriptive
paradigm, i.e., the utility functions and learning rule. The specification of utility
functions in itself defines an underlying game that is repeatedly played over stages.
In this section, we review properties related to this underlying game in terms of
solution concepts, game structures, and measures of efficiency.
In this section we will temporarily distance ourselves from the design objectives
set forth in this manuscript with the purpose of identifying properties of games that
are relevant to our mission. To that end, we will consider a finite strategic form game G with agent set N = {1, 2, ..., n} where each agent i ∈ N has an action set A_i and a utility function U_i : A → ℝ. Further, there exists a system-level objective W : A → ℝ that a system designer is interested in maximizing. We will often denote such a game by the tuple G = {N, {A_i}, {U_i}, W}, where we use the shorthand notation {·} instead of {·}_{i∈N} to denote the agents' action sets or utility functions.

3.1 Solution Concepts

The most widely known solution concept in game theory is a pure Nash equilibrium,
defined as follows.

Definition 1. An action profile a^{ne} ∈ A is a pure Nash equilibrium if for any agent i ∈ N

U_i(a_i^{ne}, a_{-i}^{ne}) ≥ U_i(a_i, a_{-i}^{ne}),  ∀ a_i ∈ A_i.    (5)

A pure Nash equilibrium represents an action profile where no agent has a unilateral incentive to alter its action provided that the behavior of the remaining agents is unchanged. A pure Nash equilibrium need not exist in a given game G.
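
Since condition (5) only involves unilateral deviations, it can be checked by enumeration in a small finite game. The helper below is a hypothetical sketch (names and the example game are assumptions for illustration, not from the chapter) that tests the condition and lists all pure Nash equilibria.

```python
from itertools import product

def is_pure_nash(action_sets, utilities, profile):
    """Check condition (5): no agent can gain by a unilateral deviation.

    action_sets : list of lists; action_sets[i] are agent i's actions
    utilities   : list of callables; utilities[i](profile) -> float
    profile     : tuple of chosen actions, one per agent
    """
    for i, acts in enumerate(action_sets):
        current = utilities[i](profile)
        for alt in acts:
            deviation = profile[:i] + (alt,) + profile[i + 1:]
            if utilities[i](deviation) > current:
                return False
    return True

def pure_nash_equilibria(action_sets, utilities):
    """Enumerate all pure Nash equilibria of a finite game."""
    return [p for p in product(*action_sets) if is_pure_nash(action_sets, utilities, p)]

# Example: a 2x2 coordination game with actions {"x", "y"} for both players.
acts = [["x", "y"], ["x", "y"]]
payoff = {("x", "x"): 1.0, ("y", "y"): 2.0, ("x", "y"): 0.0, ("y", "x"): 0.0}
utils = [lambda a: payoff[a], lambda a: payoff[a]]   # identical-interest utilities
print(pure_nash_equilibria(acts, utils))             # [('x', 'x'), ('y', 'y')]
```
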
The definition of Nash equilibrium also extends to scenarios where the agents can probabilistically choose their actions. Define a strategy of agent i as p_i ∈ Δ(A_i), where Δ(A_i) denotes the simplex over the finite action set A_i. We will express a strategy p_i by the tuple {p_i^{a_i}}_{a_i ∈ A_i}, where p_i^{a_i} ≥ 0 for any a_i ∈ A_i and Σ_{a_i ∈ A_i} p_i^{a_i} = 1. We will evaluate the utility of an agent i ∈ N for a strategy profile p = (p_1, ..., p_n) as

U_i(p_i, p_{-i}) = Σ_{a ∈ A} U_i(a) · p_1^{a_1} ⋯ p_n^{a_n},    (6)

which has the usual interpretation of the expected utility under independent randomized actions.
We can now state the definition of Nash equilibrium when extended to mixed (or probabilistic) strategies.

Definition 2. A strategy profile p^{ne} ∈ Δ(A_1) × ⋯ × Δ(A_n) is a mixed Nash equilibrium if for any agent i ∈ N

U_i(p_i^{ne}, p_{-i}^{ne}) ≥ U_i(p_i, p_{-i}^{ne}),  ∀ p_i ∈ Δ(A_i).    (7)

Unlike pure Nash equilibria, a mixed Nash equilibrium is guaranteed to exist in any² game G.
A common critique regarding the viability of pure or mixed Nash equilibria as a
characterization of achievable behavior in multiagent systems is that the complexity
associated with computing such equilibria is often prohibitive (Daskalakis et al.
2009). We now introduce a weaker solution concept, which is defined relative to a
joint distribution z ∈ Δ(A), that does not suffer from such issues.

Definition 3. A joint distribution z = {z^a}_{a∈A} ∈ Δ(A) is a coarse correlated equilibrium if for any agent i ∈ N

Σ_{a∈A} U_i(a_i, a_{-i}) z^{(a_i, a_{-i})} ≥ Σ_{a∈A} U_i(a_i', a_{-i}) z^{(a_i, a_{-i})},  ∀ a_i' ∈ A_i.    (8)

A coarse correlated equilibrium is a joint distribution z such that each agent's expected utility under that distribution is at least as high as the agent's expected utility for committing to any fixed action a_i' ∈ A_i while all the other agents play according to the marginal distribution of z. It is straightforward to verify that any mixed Nash equilibrium is a coarse correlated equilibrium; hence, the set of coarse correlated equilibria is nonempty for any game G. Furthermore, as we will see in Sect. 4.4, there are simple learning algorithms that ensure that the empirical frequency of play will approach the set of coarse correlated equilibria in a reasonable period of time. We will discuss techniques for characterizing the efficiency of this type of collective behavior in Sect. 3.3.³
Figure 1 highlights the relationship between the three solution concepts discussed above.

² Recall that we are assuming a finite set of players, each with a finite set of actions.
³ Another common equilibrium set, termed correlated equilibrium, is similar to coarse correlated equilibrium where the difference lies in the consideration of conditional deviations as opposed to the unconditional deviations considered in (8). A formal definition of correlated equilibrium can be found in Young (2004).
[Fig. 1 The relationship between the three solution concepts, shown as nested sets: pure Nash equilibria ⊆ mixed Nash equilibria ⊆ coarse correlated equilibria]

3.2 Measures of Efficiency

It is important to highlight that the above equilibrium definitions have no dependence on the system-level objective function. The goal here is to understand how the efficiency associated with such equilibria compares to the optimal behavior with respect to a system-level objective function. Here, we investigate two common worst-case measures, termed price of anarchy and price of stability (Nisan et al. 2007), for characterizing the inefficiency associated with equilibria in games.
The first measure that we consider is the price of anarchy, which is defined as the worst-case ratio between the performance of the worst equilibrium and the optimal system behavior. We use the terminology worst equilibrium as the price of anarchy could be defined by restricting attention to any of the aforementioned equilibrium sets. Focusing on pure Nash equilibria for simplicity, the price of anarchy associated with a game G is defined as

PoA(G) = min_{a^{ne} ∈ PNE(G)} { W(a^{ne}) / W(a^{opt}) } ≤ 1,    (9)

where a^{opt} ∈ arg max_{a∈A} W(a) and PNE(G) denotes the set of pure Nash equilibria in the game G. Note that the price of anarchy given in (9) provides a lower bound on the performance associated with any pure Nash equilibrium in the game G.
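
Assuming the same finite-game representation as in the earlier sketch (a hypothetical helper, not part of the chapter), the ratios in (9) and the forthcoming (10) can be computed by brute force whenever at least one pure Nash equilibrium exists and the optimal welfare is positive.

```python
from itertools import product

def poa_pos(action_sets, utilities, W):
    """Return (PoA, PoS) over pure Nash equilibria, per (9) and (10)."""
    profiles = list(product(*action_sets))
    w_opt = max(W(a) for a in profiles)

    def is_pne(a):
        return all(
            utilities[i](a) >= utilities[i](a[:i] + (alt,) + a[i + 1:])
            for i in range(len(action_sets)) for alt in action_sets[i]
        )

    ratios = [W(a) / w_opt for a in profiles if is_pne(a)]
    return min(ratios), max(ratios)       # assumes at least one pure NE exists

# Example: 2 agents each choose one of two resources; W counts distinct resources covered.
acts = [["r1", "r2"], ["r1", "r2"]]
W = lambda a: float(len(set(a)))                                        # welfare: coverage
utils = [lambda a: 1.0 / a.count(a[0]), lambda a: 1.0 / a.count(a[1])]  # equal-share utility
print(poa_pos(acts, utils, W))    # (1.0, 1.0): both pure NE of this toy game are efficient
```
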
The second measure that we consider is the price of stability, which is defined as the best-case ratio between the performance of the best equilibrium and the optimal system behavior. Focusing on pure Nash equilibria for simplicity, the price of stability associated with a game G is defined as

PoS(G) = max_{a^{ne} ∈ PNE(G)} { W(a^{ne}) / W(a^{opt}) } ≤ 1.    (10)

By definition, PoS(G) ≥ PoA(G). The price of stability is a more optimistic measure of the efficiency loss associated with pure Nash equilibria. When

analyzing dynamics that converge to specific types of equilibrium, e.g., the best
Nash equilibrium, the price of stability may be a more reasonable characterization
of the efficiency associated with the limiting behavior.
The above definitions of price of anarchy and price of stability also extend to situations where there is uncertainty regarding the structure of the specific game. To that end, let 𝒢 denote a family of possible games. The price of anarchy and price of stability associated with the family of games are then defined as the worst-case performance over all games within that family, i.e.,

PoA(𝒢) = min_{G ∈ 𝒢} { PoA(G) },    (11)

PoS(𝒢) = min_{G ∈ 𝒢} { PoS(G) }.    (12)

Clearly, 1 ≥ PoS(𝒢) ≥ PoA(𝒢). For clarity, PoA(𝒢) = 0.5 implies that regardless of the underlying game G ∈ 𝒢, any pure Nash equilibrium is at least 50% efficient when compared to the performance of the optimal allocation for that game.
The definitions of price of anarchy and price of stability given in (9) and (10) can be extended to broader classes of equilibria, i.e., mixed Nash equilibria or coarse correlated equilibria, in the logical manner. To perform the above analysis for broader equilibrium sets, we extend the definition of the welfare function to a distribution z ∈ Δ(A) as W(z) = Σ_{a∈A} W(a) z^a. Note that for a given family of games 𝒢, the price of anarchy associated with pure Nash equilibria would be better (closer to 1) than the price of anarchy associated with coarse correlated equilibria. Since coarse correlated equilibria contain Nash equilibria, one would naturally expect that the efficiency associated with coarse correlated equilibria could be far worse than the efficiency associated with Nash equilibria. Surprisingly, it often turns out that this is not the case, as we will see below.

3.3 Smoothness

Characterizing the inefficiency of equilibria is often challenging and typically involves a nontrivial domain-specific analysis. One attempt at providing a universal approach to characterizing efficiency loss in distributed systems, termed smoothness (Roughgarden 2015), is given in the following theorem.

Theorem 1. Consider any game G where the agents' utility functions satisfy Σ_{i∈N} U_i(a) ≤ W(a) for any a ∈ A. If there exist parameters λ > 0 and μ > −1 such that for any two action profiles a, a* ∈ A

Σ_{i∈N} U_i(a_i*, a_{-i}) ≥ λ · W(a*) − μ · W(a),    (13)

then the efficiency associated with any coarse correlated equilibrium z^{cce} ∈ Δ(A) of G must satisfy

W(z^{cce}) / W(a^{opt}) ≥ λ / (1 + μ).    (14)

We will refer to a game G as (λ, μ)-smooth if the game satisfies (13).

Theorem 1 demonstrates that the problem of evaluating the price of anarchy in a given game can effectively be recast as a problem of solving for the appropriate coefficients (λ, μ) that satisfy (13) and maximize λ/(1 + μ). This analysis naturally extends to guarantees over a family of games 𝒢,

PoA(𝒢) ≥ sup_{λ>0, μ>−1} { λ/(1 + μ) : G is (λ, μ)-smooth for all G ∈ 𝒢 },    (15)

where the above expression is referred to as the robust price of anarchy (Roughgarden 2015). In line with the forthcoming discussion (cf., Sect. 4.4), implementing a learning rule that leads to the set of coarse correlated equilibria provides performance guarantees that conform to this robust price of anarchy.
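
For a small finite game, one can certify a smoothness-based bound numerically by testing (13) over all pairs of profiles for candidate parameters. The sketch below is illustrative only: the helper names and the grid of candidate (λ, μ) pairs are assumptions, and the routine returns only bounds it can verify.

```python
from itertools import product

def is_smooth(profiles, utilities, W, lam, mu):
    """Check (13): sum_i U_i(a*_i, a_-i) >= lam*W(a*) - mu*W(a) for all a, a*."""
    n = len(utilities)
    for a, a_star in product(profiles, repeat=2):
        lhs = sum(utilities[i](a[:i] + (a_star[i],) + a[i + 1:]) for i in range(n))
        if lhs < lam * W(a_star) - mu * W(a) - 1e-12:
            return False
    return True

def best_smoothness_bound(action_sets, utilities, W, lams, mus):
    """Search a candidate grid and return the largest certified lam/(1+mu)."""
    profiles = list(product(*action_sets))
    best = 0.0
    for lam in lams:
        for mu in mus:
            if mu > -1 and is_smooth(profiles, utilities, W, lam, mu):
                best = max(best, lam / (1.0 + mu))
    return best   # a lower bound on the efficiency of every coarse correlated equilibrium

# Reusing the two-agent coverage game from the previous sketch:
acts = [["r1", "r2"], ["r1", "r2"]]
W = lambda a: float(len(set(a)))
utils = [lambda a: 1.0 / a.count(a[0]), lambda a: 1.0 / a.count(a[1])]
print(best_smoothness_bound(acts, utils, W, lams=[0.5, 1.0], mus=[0.0, 0.5, 1.0]))
# prints about 0.667, certified by (lam, mu) = (1, 0.5) for this toy game
```
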
One example of an entire class of games with known price of anarchy bounds is congestion games with affine congestion functions (Roughgarden 2005) (see also Example 2). Another class is valid utility games, introduced in Vetta (2002), which is very relevant to distributed resource utilization problems. A critical property of valid utility games is a system-level objective that is submodular. Submodularity corresponds to a notion of decreasing marginal returns that is a common feature of many objective functions in engineering systems. A set-based function f : 2^N → ℝ is submodular if for any S ⊆ T ⊆ N \ {i}, we have

f(S ∪ {i}) − f(S) ≥ f(T ∪ {i}) − f(T).    (16)

In each of these settings, Roughgarden (2015) has derived the appropriate smoothness parameters, hence providing price of anarchy guarantees. Accordingly, the resulting price of anarchy bounds hold for coarse correlated equilibria as well as Nash equilibria.

Theorem 2 (Roughgarden 2015; Vetta 2002). Consider any game G = (N, {A_i}, {U_i}, W) that satisfies the following three properties:

(i) The objective function W is submodular;
(ii) For any agent i ∈ N and any action profile a ∈ A,

U_i(a) ≥ W(a) − W(a_i = ∅, a_{-i}),

where a_i = ∅ indicates that player i is removed from the game;
(iii) For any action profile a ∈ A, the sum of the agents' utilities satisfies

Σ_{i∈N} U_i(a) ≤ W(a).

We will refer to such a game as a valid utility game. Any valid utility game G is smooth with parameters λ = 1 and μ = 1; hence, the robust price of anarchy is 1/2 for the class of valid utility games. Accordingly, the efficiency guarantee associated with any coarse correlated equilibrium z^{cce} ∈ Δ(A) in a valid utility game satisfies

W(z^{cce}) ≥ (1/2) · W(a^{opt}).

One example of a valid utility game is the vehicle-target assignment problem, which will be presented in Example 4. Here, the system-level objective function is
submodular and Condition (ii) in Theorem 2 is satisfied by the given design. Further,
it is straightforward to verify that Condition (iii) is also satisfied. Accordingly, all
coarse correlated equilibria in the designed game for the vehicle-target assignment
problem are at least 50% efficient. Consequently, the application of learning rules
that lead to coarse correlated equilibria (cf., Sect. 4.4) will lead to a collective
behavior in line with these efficiency guarantees.

3.4 Game Structures

The two components associated with a game-theoretic design are the agent utility
functions, which define an underlying game, and the learning rule. Both components
impact various performance objectives associated with the distributed control
design. The specification of the agent utility functions directly impacts the price
of anarchy, which can be viewed as the efficiency associated with the asymptotic
collective behavior. On the other hand, the specification of the learning algorithm
dictates the transient behavior in its attempt to drive the collective behavior to the
solution concept of interest.
At first glance it appears that the objectives associated with these two components
are unrelated to one another. For example, one could employ a design where
(i) the agents’ utility functions are chosen to optimize the price of anarchy of
pure Nash equilibria and (ii) a learning algorithm is employed that drives the
collective behavior to a pure Nash equilibrium. Unfortunately, such decoupling
is not necessarily possible due to limitations associated with (ii). As previously
discussed, there are no "natural dynamics" of the form

a_i(t) = π_i(a(0), a(1), ..., a(t − 1); U_i)    (17)

that lead to a (pure or mixed) Nash equilibrium in every game (Hart and Mas-Colell 2003), where "natural" refers to uncoupled dynamics (i.e., agents are uninformed of the utility functions of other agents) and rules out behaviors such as exhaustive search or centralized coordination.
Given such impossibility results, it is imperative that the game design component
addresses objectives beyond just price of anarchy. In particular, it is of paramount
importance that the resulting game has properties that can be exploited in distributed
learning. In this section we will review such game structures. Each of these game structures provides a degree of alignment between the agents' utility functions {U_i} and a system-level potential function φ : A → ℝ.
The first class of games we introduce, termed potential games (Monderer and Shapley 1996), exhibits perfect alignment between the agents' utility functions and the potential function φ.

Definition 4 (Potential Game). A game G is an (exact) potential game if there exists a potential function φ : A → ℝ such that for any action profile a ∈ A, agent i ∈ N, and action choice a_i' ∈ A_i, we have

U_i(a_i', a_{-i}) − U_i(a_i, a_{-i}) = φ(a_i', a_{-i}) − φ(a_i, a_{-i}).    (18)

Note that any maximizing action profile a* ∈ arg max_{a∈A} φ(a) is a pure Nash equilibrium; hence, a pure Nash equilibrium is guaranteed to exist in any potential game. Further, as we will see in the forthcoming Sect. 4, the structure inherent to potential games can be exploited to bypass the impossibility result highlighted above. In other words, there are natural dynamics that lead to a Nash equilibrium in any potential game. We will survey some of these dynamics in Sect. 4.
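
Condition (18) is also easy to verify exhaustively for small finite games. The sketch below is a hypothetical helper (the two-agent congestion example and its Rosenthal-style potential are assumptions used only for illustration) that checks every unilateral deviation.

```python
from itertools import product

def is_exact_potential(action_sets, utilities, phi):
    """Verify (18): for every unilateral deviation, the change in U_i equals the change in phi."""
    for a in product(*action_sets):
        for i, acts in enumerate(action_sets):
            for alt in acts:
                b = a[:i] + (alt,) + a[i + 1:]
                if abs((utilities[i](b) - utilities[i](a)) - (phi(b) - phi(a))) > 1e-9:
                    return False
    return True

# Example: two agents each pick one of two resources with congestion cost c(k) = k.
# Costs are negated so that agents are maximizers, matching the potential-game definition.
acts = [["r1", "r2"], ["r1", "r2"]]
utils = [lambda a: -float(a.count(a[0])), lambda a: -float(a.count(a[1]))]
phi = lambda a: -sum(k for r in set(a) for k in range(1, a.count(r) + 1))  # negated Rosenthal sum
print(is_exact_potential(acts, utils, phi))   # True
```
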
There are several variants of potential games that seek to relax the equality given
in (18) while preserving the exploitability of the game structure for distributed
learning. One of the properties that is commonly exploited in distributed learning is
the monotonicity of the potential function along a better reply path, which is defined
as follows:

Definition 5 (Better Reply Path). A better reply path is a sequence of joint actions a¹, a², ..., a^m such that for each k ∈ {1, ..., m − 1}, (i) a^{k+1} = (a_i', a_{-i}^k) for some agent i ∈ N with action a_i' ∈ A_i, a_i' ≠ a_i^k, and (ii) U_i(a^{k+1}) > U_i(a^k).

Informally, a better reply path is a sequence of joint actions where each subsequent joint action is the result of an advantageous unilateral deviation. In a potential game, the potential function is monotonically increasing along a better reply path. Since the joint action set A is finite, any sequence of better replies will lead to a pure Nash equilibrium in a finite number of iterations. This property is known as the finite improvement property (Monderer and Shapley 1996).⁴

⁴ Commonly studied variants of exact potential games, e.g., ordinal or weighted potential games, also possess the finite improvement property.

We now introduce the class of weakly acyclic games which relaxes the finite
improvement property condition.

Definition 6 (Weakly Acyclic Game). A game G is weakly acyclic under better replies if for any joint action a ∈ A there exists a better reply path from a to a pure Nash equilibrium of G.

As with potential games, a pure Nash equilibrium is guaranteed to exist in any weakly acyclic game. One advantage of considering broader game classes as a mediating layer for game-theoretic control designs is the expansion of available game design methodologies for designing agent utility functions within that class.

3.5 Illustrative Examples

At first glance it may appear that the framework of potential games (or weakly acyclic games) is overly restrictive as a framework for the design of networked control systems. Here, we provide three examples of potential games, which illustrate the breadth of the problem domains that can be modeled and analyzed within this framework.
The first example focuses on distributed routing and highlights how a reasonable
model of user behavior, i.e., users seeking to minimize their experienced congestion,
constitutes a potential game.

Example 2 (Distributed Routing). A routing problem consists of a collection of self-interested agents that need to utilize a common network to satisfy their individual demands. The network is characterized by a collection of edges E = {e_1, ..., e_m}, where each edge e ∈ E is associated with an anonymous congestion function c_e : {1, 2, ...} → ℝ that defines the congestion associated with that edge as a function of the number of agents using that edge. That is, c_e(k) is the congestion on edge e when there are k ≥ 1 agents using that edge. Each agent i ∈ N is associated with an action set A_i ⊆ 2^E, which satisfies the agent's underlying demands, as well as a local cost function J_i : A → ℝ of the form

J_i(a_i, a_{-i}) = Σ_{e ∈ a_i} c_e(|a|_e),

where |a|_e = |{i ∈ N : e ∈ a_i}| denotes the number of agents using edge e in the allocation a.⁵ In general, a system designer would like to allocate the agents over the network to minimize the aggregate congestion given by

C(a) = Σ_{e ∈ E} |a|_e · c_e(|a|_e).

⁵ Here, we use cost functions J_i(·) instead of utility functions U_i(·) in situations where the agents are minimizers instead of maximizers.

It is well known that any routing game of the above form, which is commonly referred to as an anonymous congestion game, is a potential game with a potential function φ : A → ℝ of the form

φ(a) = Σ_{e ∈ E} Σ_{k=1}^{|a|_e} c_e(k).

This implies that a pure Nash equilibrium is guaranteed to exist in any anonymous congestion game, namely, any action profile that minimizes φ(a). Furthermore, it is often the case that this is the unique pure Nash equilibrium with regard to aggregate behavior, i.e., a^{ne} ∈ arg min_{a∈A} φ(a). The fact that the potential function and the system cost are not equivalent, i.e., φ(·) ≠ C(·), can lead to inefficiencies of the resulting Nash equilibria.
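
As an illustration of how the potential structure can be exercised computationally, the sketch below builds a toy anonymous congestion game (the network, demands, and congestion functions are assumed for illustration only) and runs round-robin better replies, which must terminate at a pure Nash equilibrium by the finite improvement property.

```python
from collections import Counter

# Illustrative network: 3 agents each pick one of two parallel routes (edge subsets).
action_sets = {1: [("e1",), ("e2",)], 2: [("e1",), ("e2",)], 3: [("e1",), ("e2",)]}
c = {"e1": lambda k: k, "e2": lambda k: 2 * k}      # affine congestion functions

def cost(i, a):
    """J_i(a): sum over edges used by agent i of c_e(|a|_e)."""
    load = Counter(e for route in a.values() for e in route)
    return sum(c[e](load[e]) for e in a[i])

def potential(a):
    """Rosenthal potential: sum_e sum_{k=1}^{|a|_e} c_e(k)."""
    load = Counter(e for route in a.values() for e in route)
    return sum(c[e](k) for e in load for k in range(1, load[e] + 1))

a = {1: ("e1",), 2: ("e1",), 3: ("e1",)}            # initial allocation: everyone on e1
improved = True
while improved:                                      # round-robin better replies
    improved = False
    for i, acts in action_sets.items():
        best = min(acts, key=lambda r: cost(i, {**a, i: r}))
        if cost(i, {**a, i: best}) < cost(i, a):
            a[i] = best
            improved = True
print("pure Nash equilibrium:", a, "potential:", potential(a))
```
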

The second example focuses on coordination games over graphs. A coordination game is typically posed between two agents where each agent's utility function favors agreement on an action choice over disagreement. However, the agents may
favors agreement on an action choice over disagreement. However, the agents may
have different preferences over which action is agreed upon. Graphical coordination
games, or coordination games over graphs, extend such two agent scenarios to n
agent scenarios where the underlying graph depicts the population that each agent
is seeking to coordinate with.

Example 3 (Graphical Coordination Games). Graphical coordination games characterize a class of strategic interactions where the agents' utility functions are derived from local interactions with neighboring agents. In a graphical coordination game, each agent i ∈ N is associated with a common action set A_i = Ā, a neighbor set N_i ⊆ N, and a utility function of the form

U_i(a) = Σ_{j ∈ N_i} U(a_i, a_j),    (19)

where U : Ā × Ā → ℝ captures the (symmetric) utility associated with a pairwise interaction. As an example, U(a_i, a_j) designates the payoff for agent i selecting action a_i that results from the interaction with agent j selecting action a_j. Throughout, we adopt the convention that the payoff U(a_i, a_j) is associated with the player i whose action a_i is the first in the tuple (a_i, a_j).
In the case where the common action set has two actions, i.e., Ā = {x, y}, and the interaction graph is undirected, i.e., j ∈ N_i ⇔ i ∈ N_j, it is straightforward to show that this utility structure gives rise to a potential game with a potential function of the form

φ(a) = (1/2) Σ_{(i,j) ∈ E} pw(a_i, a_j),    (20)

where pw : Ā × Ā → ℝ is a local potential function. One choice for this local potential function is the following:

pw(x, x) = 0,
pw(y, x) = U(y, x) − U(x, x),
pw(x, y) = U(y, x) − U(x, x),
pw(y, y) = (U(y, y) − U(x, y)) + (U(y, x) − U(x, x)).

Observe that any function pw' = pw + α, where α ∈ ℝ, also serves as a local potential function for the given graphical coordination game.
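
The following sketch instantiates this construction for an assumed pairwise payoff U and a three-agent line graph, and numerically confirms that the resulting φ satisfies the exact potential condition (18); all names and values are illustrative.

```python
from itertools import product

A_bar = ["x", "y"]
U = {("x", "x"): 2.0, ("x", "y"): 0.0, ("y", "x"): 0.0, ("y", "y"): 1.0}  # assumed pairwise payoffs
edges = [(0, 1), (1, 2)]                               # assumed undirected line graph
nbrs = {0: [1], 1: [0, 2], 2: [1]}

# Local potential pw from the construction in the text.
pw = {
    ("x", "x"): 0.0,
    ("y", "x"): U[("y", "x")] - U[("x", "x")],
    ("x", "y"): U[("y", "x")] - U[("x", "x")],
    ("y", "y"): (U[("y", "y")] - U[("x", "y")]) + (U[("y", "x")] - U[("x", "x")]),
}

def utility(i, a):
    return sum(U[(a[i], a[j])] for j in nbrs[i])        # Eq. (19)

def phi(a):
    # Eq. (20): pw is symmetric, so summing each undirected edge once equals the 1/2-weighted
    # sum over both orderings of each edge.
    return sum(pw[(a[i], a[j])] for (i, j) in edges)

ok = all(
    abs((utility(i, a[:i] + (alt,) + a[i+1:]) - utility(i, a))
        - (phi(a[:i] + (alt,) + a[i+1:]) - phi(a))) < 1e-9
    for a in product(A_bar, repeat=3) for i in range(3) for alt in A_bar
)
print("phi is an exact potential:", ok)   # True
```
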

The first two examples show how potential games could naturally emerge in
two different types of strategic scenarios. The last example we present focuses
on an engineering-inspired resource allocation problem, termed the vehicle-target
assignment problem (Murphey 1999), where the vehicles’ utility functions are
engineered so that the resulting game is a potential game.

Example 4 (Vehicle-Target Assignment Problem). In the well-studied vehicle-target assignment problem, there is a finite set of targets T, and each target t ∈ T has a relative value of importance v_t ≥ 0. Further, there is a set of vehicles N = {1, 2, ..., n}, where each vehicle i ∈ N has an invariant success/destroy probability satisfying 0 ≤ p_i ≤ 1 and a set of possible assignments A_i ⊆ 2^T. The goal of the vehicle-target assignment problem is to find an allocation of vehicles to targets a ∈ A to optimize a global objective W : A → ℝ of the form

W(a) = Σ_{t ∈ T(a)} v_t · (1 − Π_{j : t ∈ a_j} (1 − p_j)),

where T(a) ⊆ T denotes the collection of targets that are assigned to at least one agent, i.e., T(a) = ∪_{i ∈ N} a_i.
Note that in this engineering-based application, there is no appropriate model of utility functions for the engineered vehicles. Rather, vehicle utility functions are designed with the goal of engineering desirable system-wide behavior. Consider one such design where the utility functions of the vehicles are set as the marginal contribution of the vehicles to the system-level objective, i.e., for each vehicle i ∈ N and allocation a ∈ A we have

U_i(a) = Σ_{t ∈ a_i} [ v_t · (1 − Π_{j : t ∈ a_j} (1 − p_j)) − v_t · (1 − Π_{j ≠ i : t ∈ a_j} (1 − p_j)) ]
       = Σ_{t ∈ a_i} v_t · ( p_i Π_{j ≠ i : t ∈ a_j} (1 − p_j) ).

Given this design of utility functions, it is straightforward to verify that the resulting game is a potential game with potential function φ(a) = W(a). This immediately implies that any optimal allocation, a^{opt} ∈ arg max_{a∈A} W(a), is a pure Nash equilibrium. However, other inefficient Nash equilibria may also exist due to the lack of uniqueness of Nash equilibria in such scenarios.
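
A minimal numerical sketch of this utility design is given below (target values, success probabilities, and the allocation are assumed for illustration); it evaluates W and checks that the marginal-contribution utility U_i(a) = W(a) − W(a with agent i removed) matches the closed-form expression above.

```python
from math import prod

values = {"t1": 1.0, "t2": 2.0}                 # assumed target values v_t
p = {1: 0.7, 2: 0.5, 3: 0.6}                    # assumed success probabilities p_i
assignment = {1: {"t1"}, 2: {"t1", "t2"}, 3: {"t2"}}   # an allocation a

def W(a):
    """W(a) = sum over engaged targets of v_t * (1 - prod_{j: t in a_j} (1 - p_j))."""
    covered = set().union(*a.values()) if a else set()
    return sum(values[t] * (1 - prod(1 - p[j] for j in a if t in a[j])) for t in covered)

def marginal_utility(i, a):
    """U_i(a) = W(a) - W(a with agent i removed): the marginal contribution of agent i."""
    others = {j: s for j, s in a.items() if j != i}
    return W(a) - W(others)

for i in assignment:
    closed_form = sum(
        values[t] * p[i] * prod(1 - p[j] for j in assignment if j != i and t in assignment[j])
        for t in assignment[i]
    )
    print(i, round(marginal_utility(i, assignment), 6), round(closed_form, 6))  # identical values
```
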

3.6 A Brief Review of Game Design Methodologies

The examples in the previous section illustrate various settings that happen to
fall under the special category of potential games. Given that utility function
specification is a design degree of freedom in the prescriptive paradigm, it is possible
to exploit this degree of freedom to design utility functions to induce desirable
structural properties.
There are several objectives that a system designer needs to consider when
designing the game that defines the interaction framework of the agents in a
multiagent system (Marden and Wierman 2013). These goals could include (i)
ensuring the existence of a pure Nash equilibrium, (ii) ensuring that the agents’
utility functions fit into the realm of potential games, or (iii) ensuring that the agents’
utility functions optimize the price of anarchy/price of stability over an admissible
class of agent utility functions, e.g., local utility functions. While recent research has
identified the full space of methodologies that guarantee (i) and (ii) (Gopalakrishnan
et al. 2014), the existing research has yet to provide mechanisms for optimizing the
price of anarchy.
The following theorem provides one methodology for the design of agent utility functions with guarantees on the resulting game structure (Marden and Wierman 2013; Wolpert and Tumer 1999).

Theorem 3. Consider the class of resource utilization problems defined in Sect. 2.1 with agent set N, action sets {A_i}, and a global objective W : A → ℝ. Define the marginal contribution utility function for each agent i ∈ N and allocation a ∈ A as

U_i(a) = φ(a) − φ(a_i^b, a_{-i}),    (21)

where φ : A → ℝ is any system-level function and a_i^b ∈ A_i is a fixed baseline action for agent i. Then the resulting game G = {N, {A_i}, {U_i}, W} is an exact potential game where the potential function is φ.

A few notes are in order regarding Theorem 3. First, the assignment of the agents' utility functions is a byproduct of the chosen system-level design function φ, the transformation of φ into the agents' utility functions given by (21), and the choice of the baseline action a_i^b for each agent i ∈ N. Observe that the utility design presented in Example 4 is precisely the design detailed in Theorem 3 where φ = W and a_i^b = ∅ for each agent i ∈ N. While a system designer could clearly set φ = W, judging whether this design choice is effective centers on a detailed analysis regarding the properties of the resulting game, e.g., price of anarchy. In fact, recent research has demonstrated that setting φ = W does not optimize the price of anarchy for a large class of objective functions W. Furthermore, there are also alternative mechanisms for transforming the system-level function φ to agent utility functions {U_i}, as opposed to (21), that provide similar guarantees on the structure of the resulting game, e.g., Shapley and weighted Shapley values (Gopalakrishnan et al. 2014). It remains an open question as to what combination, i.e., the transformation and the system-level design function that the transformation operates on, gives rise to the optimal utility design.
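
As a usage note, the construction in (21) can be wrapped generically: given any system-level design function φ and a baseline action per agent, the hypothetical helper below returns the corresponding utility functions; the coverage example is an assumption for illustration.

```python
def marginal_contribution_utilities(phi, baseline):
    """Build U_i(a) = phi(a) - phi(a_i^b, a_-i) for every agent, per (21).

    phi      : callable on full action profiles (tuples)
    baseline : tuple of baseline actions a_i^b, one per agent
    """
    def make(i):
        def U_i(a):
            a_base = a[:i] + (baseline[i],) + a[i + 1:]
            return phi(a) - phi(a_base)
        return U_i
    return [make(i) for i in range(len(baseline))]

# Example: phi = W = number of distinct resources covered; baseline = "stay out" (None).
phi = lambda a: float(len({r for r in a if r is not None}))
utils = marginal_contribution_utilities(phi, baseline=(None, None))
print(utils[0](("r1", "r1")), utils[1](("r1", "r1")))   # 0.0 0.0  (redundant coverage)
print(utils[0](("r1", "r2")), utils[1](("r1", "r2")))   # 1.0 1.0  (each adds a new resource)
```
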

4 Distributed Learning Rules

We now turn our attention toward distributed learning rules. We can categorize the
learning algorithms into the following four areas:

Model-Based Learning. In model-based learning, each agent observes the past behavior of the other agents and uses this information to develop a model for the action choices of the other agents at the ensuing period. Equipped with this model, each agent can then optimally select its actions based on its expected utility at the ensuing time step. As the play evolves, so do the models of other agents.

Robust Learning. A learning algorithm of the form (2) defines a systematic rule
for how individual agents process available information to formulate a decision.
Many of the learning algorithms in the existing literature provide guarantees on the
asymptotic collective behavior provided that the agents follow these rules precisely.
Here, we explore the robustness of such learning algorithms, i.e., whether the asymptotic guarantees on the collective behavior are preserved when agents follow variations of the prescribed learning rules stemming from delays in information or asynchronous clock rates.

Equilibrium Selection. The price of anarchy and price of stability are two
measures characterizing the inefficiency associated with Nash equilibria. The
differences between these two measures follow from the fact that Nash equilibria are
often not unique. This lack of uniqueness prompts the question of whether distributed learning rules that favor certain types of Nash equilibria can be derived. Focusing on the framework of potential games, we will review one
such algorithm that guarantees the collective behavior will lead to the specific Nash
equilibria that optimize the potential function. Note that when utility functions are
engineered, as in Example 4, a system designer can often ensure that the resulting
game is a potential game where the action profiles that optimize the potential
function coincide with the action profiles that optimize the system-level objective.
(We reviewed one such methodology in Sect. 3.6.)

Universal Learning. All of the above learning algorithms provide asymptotic guarantees when attention is restricted to specific game structures, e.g., potential games or weakly acyclic games. Here, we focus on the derivation of learning algorithms that provide desirable asymptotic guarantees irrespective of the underlying game structure. Recognizing the previously discussed impossibility of natural and universal dynamics leading to Nash equilibria (Hart and Mas-Colell 2003), we shift our emphasis from convergence to Nash equilibria to convergence to the set of coarse correlated equilibria. We introduce one such algorithm, termed regret matching, that guarantees convergence to the set of coarse correlated equilibria irrespective of the underlying game structure. Lastly, we discuss the implications of such learning algorithms on the efficiency of the resulting collective behavior.
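
As a preview of the universal learning discussion in Sect. 4.4, the sketch below (an assumed two-player zero-sum example; not the chapter's exact presentation) implements a regret-matching style rule driven by unconditional (external) regrets: each agent plays actions with probability proportional to their positive cumulative regret, which drives average regret to zero and steers the empirical joint play toward the set of coarse correlated equilibria.

```python
import random

# Assumed two-player zero-sum game (matching pennies): player 0 wants the actions to match.
actions = ["H", "T"]
def U(i, a):
    match = 1.0 if a[0] == a[1] else -1.0
    return match if i == 0 else -match

regret = [{x: 0.0 for x in actions} for _ in range(2)]   # cumulative unconditional regrets
play_count = {}                                          # empirical joint play
a = [random.choice(actions), random.choice(actions)]
T = 5000
for t in range(T):
    key = tuple(a)
    play_count[key] = play_count.get(key, 0) + 1
    for i in range(2):                                   # regret of having always played x instead
        for x in actions:
            alt = list(a)
            alt[i] = x
            regret[i][x] += U(i, tuple(alt)) - U(i, tuple(a))
    new_a = []
    for i in range(2):                                   # play proportionally to positive regret
        pos = [max(regret[i][x], 0.0) for x in actions]
        if sum(pos) > 0:
            new_a.append(random.choices(actions, weights=pos)[0])
        else:
            new_a.append(random.choice(actions))
    a = new_a

# Empirical joint frequencies; as regrets vanish, this approaches the coarse correlated set.
print({k: v / T for k, v in sorted(play_count.items())})
```
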
We will primarily gauge the quality of a learning algorithm by characterizing the collective behavior as time t → ∞. When merging a particular distributed learning
algorithm with an underlying game, the efficiency analysis techniques presented in
Sect. 3.2 can then be employed to characterize the quality of the emergent collective
behavior with regard to a given system-level objective.

4.1 Model-Based Learning

The central challenge in distributed learning is dealing with the fact that each agent’s
environment is inherently nonstationary in that the environment from the perspective
of any agent consists of the behaviors of other agents, which are evolving. A
common approach in distributed learning is to have agents make decisions in a
myopic fashion, thereby neglecting the ramifications of an agent’s current decision
on the future behavior of the other agents. In this section we review two learning
algorithms of this form that we categorize as model-based learning algorithms. In
model-based learning, each agent observes the past behavior of the other agents
and utilizes this information to develop a behavioral model of the other agents.
Equipped with this behavioral model, each agent then performs a myopic best
response seeking to optimize its expected utility. It is important to stress here that the
goal is not to accurately model the behavior of the other agents in the ensuing period.
Rather, the goal is to derive systematic agent responses that will guide the collective
behavior to a desired equilibrium.

4.1.1 Fictitious Play


One of the most well-studied algorithms of this form is fictitious play (Monderer
and Shapley 1996). Here, each agent uses the empirical frequency of past play as
a model for the behavior of the other agents at the ensuing time step. To that end,
define the empirical frequency of play for each player i ∈ N at time t ∈ {1, 2, ...} as q_i(t) = {q_i^{a_i}(t)}_{a_i ∈ A_i} ∈ Δ(A_i), where

q_i^{a_i}(t) = (1/t) Σ_{τ=0}^{t−1} I{a_i(τ) = a_i},    (22)

and I{·} is the usual indicator function. At time t, each agent seeks to myopically maximize its expected utility given the belief that each agent j ≠ i will select its action independently according to the strategy q_j(t). This update rule takes the form

a_i(t) ∈ arg max_{a_i ∈ A_i} Σ_{a_{-i} ∈ A_{-i}} U_i(a_i, a_{-i}) Π_{j ≠ i} q_j^{a_j}(t).    (23)

The following theorem, provided in Monderer and Shapley (1996), characterizes the long-run behavior associated with fictitious play in potential games.

Theorem 4. Consider any exact potential game G. If all players follow the fictitious play learning rule, then the players' empirical frequencies of play q_1(t), ..., q_n(t) will converge to a Nash equilibrium of the game G.
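
A minimal simulation sketch of fictitious play per (22) and (23) is given below; the two-player coordination game, tie-breaking, and horizon are illustrative assumptions, and with more players the inner expectation simply ranges over the joint actions of all other agents.

```python
import numpy as np
from itertools import product

# Assumed 2-player coordination game: both agents prefer to match, and matching on "y" pays more.
actions = [["x", "y"], ["x", "y"]]
def U(i, a):
    return (2.0 if a[0] == "y" else 1.0) if a[0] == a[1] else 0.0

counts = [{ai: 0 for ai in acts} for acts in actions]   # action counts behind q_i in (22)
a = ["x", "y"]                                          # arbitrary initial joint play
T = 200
for t in range(1, T + 1):
    for i in range(len(actions)):
        counts[i][a[i]] += 1                            # update empirical frequencies q_i(t)
    new_a = []
    for i, acts in enumerate(actions):
        others = [j for j in range(len(actions)) if j != i]

        def expected(ai):
            # Expected utility of ai against independent empirical frequencies, per (23).
            total = 0.0
            for a_minus in product(*(actions[j] for j in others)):
                prob = np.prod([counts[j][aj] / t for j, aj in zip(others, a_minus)])
                prof = list(a_minus)
                prof.insert(i, ai)
                total += prob * U(i, tuple(prof))
            return total

        new_a.append(max(acts, key=expected))
    a = new_a

# Frequencies concentrate on the efficient equilibrium ("y", "y") in this example.
print("empirical frequencies:", [{x: c[x] / T for x in c} for c in counts])
```
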

The fictitious play learning rule provides a mechanism to guide individual agent
behavior in distributed control systems when the agents (i) can observe the previous
action choices of the other agents in the system and (ii) have access to the structural
form of their utility function. Further, fictitious play provides provable guarantees
on the emergent collective behavior provided that the system can be modeled by an
exact potential game. For example, consider the distributed routing problem given
in Example 2 which can be modeled as a potential game irrespective of the number
of agents, the number of edges, the topology of the network, or the edge-specific
latency functions. Regardless of the structure of the routing problem, the fictitious
play algorithm can be employed to drive the collective system behavior to a Nash
equilibrium.
While the asymptotic guarantees associated with fictitious play in distributed
routing problems is appealing, the implementation of fictitious play in such settings
is problematic. First, each agent must be able to observe the specific behavior of all
other agents in the network each period. Second, the choice of each agent at any
time given in (23) requires (i) knowledge of the structural form of the agent’s utility
function and (ii) computing an expectation of its utility function, which involves
evaluating a weighted summation over |A_{-i}| terms. In large-scale systems, such as
distributed routing, each of these requirements could be prohibitive. Accordingly,
research has attempted to alter the fictitious play algorithm to minimize such
requirements while preserving the desirable asymptotic guarantees.

4.1.2 Variants of Fictitious Play


One of the first attempts to relax the implementation requirements associated
with fictitious play centered on the computation of a best response given in (23).
In Lambert et al. (2005), the authors proposed a sample-based approach for
computing this best response, where each agent randomly drew samples of the
other agents’ behavior using their empirical frequencies of play and evaluated
the average performance of each possible routing decision against the drawn

samples. The choice with the best average performance was then substituted for
the choice that maximized the agent’s expected utility in (23), and the process was
repeated. While simulations demonstrated reasonable performance even with limited samples, preserving the theoretical asymptotic guarantees associated with fictitious play unfortunately required that the number of samples drawn each period grow prohibitively large.
A second variant of fictitious play focused on the underlying asymptotic
guarantees given in Theorem 4, which state that the empirical frequency of play
converges to a Nash equilibrium. It is important to highlight that this does not imply that the day-to-day behavior of the agents converges to a Nash equilibrium; e.g., the agents' day-to-day behavior could oscillate, yielding a frequency of play consistent with a Nash equilibrium. Furthermore, the cumulative payoff may be less than the
payoff associated with the limiting empirical frequencies. With this issue in mind,
Fudenberg and Levine (1995) introduced a variant of fictitious play that assures
a specific payoff consistency property against arbitrary environments, i.e., not just
when other agents employ fictitious play.

4.1.3 Joint Strategy Fictitious Play with Inertia


The focus in model-based learning is not whether such models accurately reflect
the behavior of the other agents. Rather, the focus is on whether systematic
responses to potentially inaccurate models can guide the collective behavior to a
desired equilibrium. The behavioral model used in fictitious play, i.e., assuming each agent will play a strategy independently according to that agent's empirical frequency of play, provides nice asymptotic guarantees but is prohibitive from an implementation perspective. Here, we consider a variant of fictitious play, termed
joint strategy fictitious play (JSFP), which provides similar asymptotic guarantees
while alleviating many of the computational and observational challenges associated
with fictitious play (Marden et al. 2009). The main difference between fictitious
play and joint strategy fictitious play resides in the behavioral model of the other
agents. In joint strategy fictitious play, each agent presumes that the other players will select an action collectively in accordance with the empirical frequency of their past joint play. In two-player games, fictitious play and joint strategy fictitious
play are equivalent. However, the learning algorithms yield fundamentally different
behavior beyond two-player games.
We begin by defining the average hypothetical utility of agent $i \in N$ for each action $a_i \in A_i$ as

$$\bar{U}_i^{a_i}(t) = \frac{1}{t}\sum_{\tau=0}^{t-1} U_i(a_i, a_{-i}(\tau)) = \frac{t-1}{t}\,\bar{U}_i^{a_i}(t-1) + \frac{1}{t}\,U_i(a_i, a_{-i}(t-1)). \qquad (24)$$

Note that this average hypothetical utility is computed under the belief that
the action choices of the other agents remain unchanged. Now, consider the
decision-making rule where each agent $i \in N$ independently selects its action probabilistically according to the rule


$$a_i(t) = \begin{cases} \arg\max_{a_i \in A_i} \bar{U}_i^{a_i}(t) & \text{with probability } 1-\epsilon, \\ a_i(t-1) & \text{with probability } \epsilon, \end{cases} \qquad (25)$$

where $\epsilon > 0$ is referred to as the agent's inertia, or probabilistic reluctance to change actions. Hence, with high probability, i.e., probability $1-\epsilon$, each agent selects the action that maximizes its average hypothetical utility.
The following theorem from Marden et al. (2009) characterizes the long run
behavior of joint strategy fictitious play in potential games.

Theorem 5. Consider any exact potential game $G$. If all players follow the learning algorithm joint strategy fictitious play with inertia defined above, then the joint action profile will converge almost surely to a pure Nash equilibrium of the game $G$.

Hence, JSFP with inertia provides similar asymptotic guarantees to fictitious play while minimizing the computational and observational burden on the agents. The name “joint strategy fictitious play” is derived from the fact that maximizing the average hypothetical utility in (24) is equivalent to maximizing an expected utility under the belief that the other agents will play collectively according to the empirical frequency of their past joint play.
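For concreteness, the following Python sketch implements a JSFP-with-inertia agent based on (24) and (25); the recursive update of the average hypothetical utility and the class interface are illustrative assumptions, not the exact implementation of Marden et al. (2009).

```python
import numpy as np

class JSFPAgent:
    """Joint strategy fictitious play with inertia (Eqs. (24)-(25)), as a sketch.
    The agent only needs to evaluate its own utility for each of its actions
    against the observed joint action of the other agents."""

    def __init__(self, actions, utility, epsilon, rng=None):
        self.actions = list(actions)          # A_i
        self.utility = utility                # U_i(a_i, a_minus_i) -> float
        self.epsilon = epsilon                # inertia
        self.rng = rng or np.random.default_rng()
        self.avg_hyp_utility = {a: 0.0 for a in self.actions}   # running \bar{U}_i^{a_i}
        self.t = 0
        self.current = self.rng.choice(self.actions)

    def update(self, a_minus_i_prev):
        """Observe the others' previous joint action and choose the next action."""
        self.t += 1
        for a in self.actions:
            # Recursive form of the average hypothetical utility in (24).
            self.avg_hyp_utility[a] = ((self.t - 1) / self.t) * self.avg_hyp_utility[a] \
                + (1.0 / self.t) * self.utility(a, a_minus_i_prev)
        if self.rng.random() > self.epsilon:
            self.current = max(self.actions, key=lambda a: self.avg_hyp_utility[a])
        # else: keep the previous action (inertia)
        return self.current
```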

4.2 Robust Distributed Learning

Both fictitious play and joint strategy fictitious play are intricate decision-making
rules that provide guarantees regarding the emergent collective behavior. A natural
question that emerges when considering the practicality of such rules for control of
networked systems is the robustness of these guarantees to common implementation
issues including asynchronous clocks, noisy payoffs, and delays in information,
among others. This section highlights that the framework of potential games, or
more generally weakly acyclic games, is inherently robust to such issues.
We review the result in Young (2004) that deals with this exact issue. In particular, Young (2004) demonstrates the robustness of weakly acyclic games by identifying a broad family of learning rules, termed finite memory better response processes, with the property that any rule within this family will provably guide the collective behavior to a pure Nash equilibrium in any weakly acyclic game.
A finite memory better reply process with inertia is any learning algorithm of the
following form: at each time t, each agent selects its action independently according
to the rule
$$a_i(t) = \begin{cases} B_i^m(h^m(t)) & \text{with probability } 1-\epsilon, \\ a_i(t-1) & \text{with probability } \epsilon, \end{cases} \qquad (26)$$

where $m \geq 1$ is the size of the agent's memory, $\epsilon > 0$ is the agent's inertia, $h^m(t) = \{a(t-1), a(t-2), \ldots, a(t-m)\}$ denotes the previous $m$ action profiles, and $B_i^m : A^m \to \Delta(A_i)$ is the finite memory better reply process.$^6$ A finite memory better reply process $B_i^m(\cdot)$ can be any process that satisfies the following properties:

• If the history is saturated, i.e., $h^m(t) = \{\bar{a}, \bar{a}, \ldots, \bar{a}\}$ for some action profile $\bar{a} \in A$, then the strategy $p_i = B_i^m(h^m(t))$ must satisfy:
  – If $\bar{a}_i \in \arg\max_{a_i \in A_i} U_i(a_i, \bar{a}_{-i})$, then $p_i^{\bar{a}_i} = 1$ and $p_i^{a_i} = 0$ for all $a_i \neq \bar{a}_i$.
  – Otherwise, if $\bar{a}_i \notin \arg\max_{a_i \in A_i} U_i(a_i, \bar{a}_{-i})$, then $p_i^{a_i} > 0$ if and only if $U_i(a_i, \bar{a}_{-i}) > U_i(\bar{a}_i, \bar{a}_{-i})$.
• If the history is not saturated, then the strategy $p_i = B_i^m(h^m(t))$ can be any probability distribution in $\Delta(A_i)$.$^7$

In summary, the only constraint imposed on a finite memory better reply process is that a better reply to the saturated memory $\{a, \ldots, a\}$ is consistent with a better reply to the single action profile $a$.
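As a rough illustration, the following Python sketch gives one plausible memory-one better reply process with inertia consistent with the conditions above; the strict-improvement rule and the function signature are assumptions made for concreteness rather than the exact process analyzed in Young (2004).

```python
import numpy as np

def better_reply_with_inertia(i, actions_i, utility_i, last_profile, epsilon, rng):
    """One step of a memory-one (m = 1) better reply process with inertia, in the
    spirit of (26). utility_i(a_i, a_minus_i) returns agent i's payoff and
    last_profile maps every agent to its most recent action."""
    a_minus_i = {j: aj for j, aj in last_profile.items() if j != i}
    a_i = last_profile[i]
    if rng.random() < epsilon:
        return a_i                              # inertia: repeat the previous action
    baseline = utility_i(a_i, a_minus_i)
    better = [a for a in actions_i if utility_i(a, a_minus_i) > baseline]
    if not better:
        return a_i                              # a_i is already a best response to a_minus_i
    return better[rng.integers(len(better))]    # any better reply receives positive probability

# Example usage, assuming some utility function U1 and two agents "p1", "p2":
# rng = np.random.default_rng()
# a_next = better_reply_with_inertia("p1", ["x", "y"], U1, {"p1": "x", "p2": "y"}, 0.1, rng)
```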
The following theorem from Young (2004) (Theorem 6.2) demonstrates the
inherent robustness of weakly acyclic games.

Theorem 6. Consider any weakly acyclic game $G$. If all agents follow a finite memory better reply process with inertia as defined above, then the joint action profile will converge almost surely to a pure Nash equilibrium of the game $G$.

One can view this result from two perspectives. The first perspective is that
the system designer has extreme flexibility in designing learning rules for weakly
acyclic games that guarantee the agents’ collective behavior will converge to a pure
Nash equilibrium. The second perspective is that perturbations of a nominal learning
rule, e.g., agents updating asynchronously or responding to delayed or inaccurate
histories, will also satisfy the conditions above and ultimately lead behavior to a
Nash equilibrium as well. These perspectives provide the basis for our claim of
robust distributed learning.

4.3 Equilibrium Selection in Potential Games

The preceding discussion focused largely on algorithms that ensured the emergent
collective behavior constitutes a (pure) Nash equilibrium. In the case where there
are multiple Nash equilibria, these algorithms provide no guarantees on which equilibrium is likely to emerge. Accordingly, characterizing the efficiency associated
with the emergent collective behavior is equivalent to characterizing the efficiency
associated with the worst performing Nash equilibrium, i.e., the price of anarchy.

$^6$ We write $a_i(t) = B_i^m(h^m(t))$ with the understanding that the action $a_i(t)$ is chosen randomly according to the probability distribution specified by $B_i^m(h^m(t))$.
$^7$ The actual definition of a finite memory better reply process considered in Young (2004) puts a further condition on the structure of $B_i^m$ in the case where the memory is not saturated, i.e., the strategy assigns positive probability to any action with strictly positive regret. However, an identical proof holds for any $B_i^m$ that satisfies the weaker conditions set forth in this chapter.

In this section we explore the notion of equilibrium selection in distributed


learning. That is, are there classes of distributed learning algorithms that converge
to specific classes of equilibria? One motivation for pursuing such developments is
the marginal cost utility, given in Theorem 3 with the design choice  D W, which
ensures that the optimal allocation is a Nash equilibrium, i.e., the price of stability is
1. Accordingly, the focus of this section will be on learning dynamics that converge
to the most efficient action profile in potential games, i.e., the action profile that
maximizes the potential function.

4.3.1 Log-Linear Learning


We begin this subsection by describing a simple asynchronous best-reply process, where each agent chooses a best reply when given the opportunity to revise its strategy. Let $a(t)$ represent the action profile at time $t$. The action profile at time $t+1$ is chosen as follows:

(i) An agent $i \in N$ is randomly picked to update its action according to a uniform distribution.
(ii) Agent $i$ selects an action that is a best response to the action profile played by the other agents in the previous period, i.e.,

$$a_i(t+1) \in \arg\max_{a_i \in A_i} U_i(a_i, a_{-i}(t)). \qquad (27)$$

(iii) All other agents $j \neq i$ repeat their previous actions, i.e., $a_{-i}(t+1) = a_{-i}(t)$.
(iv) The process is then repeated.

It is straightforward to see that the above process will converge almost surely to a pure Nash equilibrium in any potential game by observing that $\phi(a(t+1)) \geq \phi(a(t))$ for all times $t$. Accordingly, the efficiency guarantees associated with the application of this algorithm to a potential game are in line with the price of anarchy of the game.
Here, a slight modification, or perturbation, of the above best-reply dynamics is introduced that ensures that the resulting behavior leads to the pure Nash equilibrium that optimizes the potential function, i.e., $a^{\mathrm{opt}} \in \arg\max_{a \in A} \phi(a)$. The algorithm, known as log-linear learning or the logit response dynamics (Alos-Ferrer and Netzer 2010; Blume 1993, 1997; Marden and Shamma 2012; Young 1998), follows the best-reply process highlighted above where step (ii) is replaced by a noisy best response. More formally, step (ii) is now of the form:

(ii) Agent $i$ selects an action $a_i(t+1)$ according to a probability distribution $p_i(t) = \{p_i^{a_i}(t)\}_{a_i \in A_i} \in \Delta(A_i)$ that is of the form

$$p_i^{a_i}(t) = \frac{e^{(1/T)\,U_i(a_i, a_{-i}(t))}}{\sum_{\tilde{a}_i \in A_i} e^{(1/T)\,U_i(\tilde{a}_i, a_{-i}(t))}}, \qquad (28)$$

where the parameter $T > 0$ is referred to as the temperature.



A few remarks are in order regarding the update protocol specified in (28). First, when $T \to \infty$, the agent's strategy is effectively a uniform distribution over the agent's action set. Second, when $T \to 0^+$, the agent's strategy is effectively the best response strategy given in (27). Lastly, we present this algorithm (and the forthcoming binary log-linear learning) with regard to a fixed temperature parameter that is common to all agents. However, there are variations of this algorithm which allow for annealing of this temperature parameter while preserving the resulting asymptotic guarantees, e.g., Zhu and Martínez (2013).
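For concreteness, a minimal Python sketch of one log-linear learning iteration is given below; the dictionary-based data structures and the max-shift used for numerical stability are illustrative implementation choices rather than part of the algorithm's definition.

```python
import numpy as np

def log_linear_step(profile, actions, utilities, T, rng):
    """One iteration of log-linear learning (Eq. (28)): a uniformly chosen agent
    plays a noisy (logit) best response to the others' current actions.

    profile:   dict agent -> current action
    actions:   dict agent -> list of actions A_i
    utilities: dict agent -> function U_i(a_i, a_minus_i) -> float
    T:         temperature > 0
    """
    agents = list(profile.keys())
    i = agents[rng.integers(len(agents))]            # step (i): uniform agent selection
    a_minus_i = {j: aj for j, aj in profile.items() if j != i}

    payoffs = np.array([utilities[i](ai, a_minus_i) for ai in actions[i]])
    weights = np.exp((payoffs - payoffs.max()) / T)  # shift for numerical stability
    probs = weights / weights.sum()

    new_profile = dict(profile)
    new_profile[i] = actions[i][rng.choice(len(probs), p=probs)]
    return new_profile
```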
The following theorem establishes the asymptotic guarantees associated with
the learning algorithm log-linear learning in potential games (Blume 1993, 1997;
Young 1998).

Theorem 7. Consider any potential game $G$ with potential function $\phi$. If all players follow the learning algorithm log-linear learning with temperature $T > 0$, then the resulting process has a unique stationary distribution $\pi = \{\pi^a\}_{a \in A} \in \Delta(A)$ of the form

$$\pi^a = \frac{e^{(1/T)\,\phi(a)}}{\sum_{\tilde{a} \in A} e^{(1/T)\,\phi(\tilde{a})}}. \qquad (29)$$

The stationary distribution of the process given in (29) follows the same intuition as presented for the update protocol in (28). That is, when $T \to \infty$ the stationary distribution is effectively a uniform distribution over the joint action set $A$. However, when $T \to 0^+$, all of the weight of the stationary distribution is concentrated on the action profiles that maximize the potential function $\phi$. The above stationary distribution provides an accurate assessment of the resulting asymptotic behavior due to the fact that the log-linear learning process is both irreducible and aperiodic; hence (29) is the unique stationary distribution.
Merging log-linear learning with the marginal contribution utility design given
in Theorem 3 leads to the following corollary.

Corollary 1. Consider the class of resource allocation problems defined in Sect. 2.1 with agent set $N$, action sets $\{A_i\}$, and a global objective $W : A \to \mathbb{R}$. Consider the following game-theoretic control design:

(i) Assign each agent a utility function that captures the agent's marginal contribution to the global objective, i.e.,

$$U_i(a) = W(a) - W(a_i^b, a_{-i}), \qquad (30)$$

where $a_i^b \in A_i$ is any fixed baseline action for agent $i$.


(ii) Each agent follows the log-linear learning rule with temperature parameter
T > 0.

Then the resulting process has a unique stationary distribution $\pi(T) = \{\pi^a(T)\}_{a \in A} \in \Delta(A)$ of the form

$$\pi^a(T) = \frac{e^{(1/T)\,W(a)}}{\sum_{\tilde{a} \in A} e^{(1/T)\,W(\tilde{a})}}. \qquad (31)$$

Observe that this design rule ensures that the resulting asymptotic behavior will
be concentrated around the allocations that maximize the global objective W . This
fact has made this design methodology an attractive option for several domains
including wind farms, sensor networks, and coordination of unmanned vehicles,
among others.
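As a rough illustration of Corollary 1, the following Python sketch pairs the marginal contribution utility (30), with a fixed baseline action, with the `log_linear_step` routine sketched earlier on a toy resource-selection problem; the global objective $W$ (number of distinct resources covered), the agent and resource names, and the parameter values are purely illustrative.

```python
import numpy as np

# Toy resource-selection problem: the global objective counts distinct covered resources.
resources = ["r1", "r2", "r3"]
agents = ["p1", "p2", "p3"]
actions = {i: resources for i in agents}

def W(profile):
    return len(set(profile.values()))                # number of distinct resources covered

def marginal_utility(i, baseline="r1"):
    # Marginal contribution utility (30) with fixed baseline action a_i^b = baseline.
    def U(ai, a_minus_i):
        return W(dict(a_minus_i, **{i: ai})) - W(dict(a_minus_i, **{i: baseline}))
    return U

utilities = {i: marginal_utility(i) for i in agents}
rng = np.random.default_rng(0)
profile = {i: rng.choice(resources) for i in agents}
for _ in range(2000):
    profile = log_linear_step(profile, actions, utilities, T=0.05, rng=rng)
print(profile, W(profile))                            # typically an allocation maximizing W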

4.3.2 Binary Log-Linear Learning


The framework of log-linear learning imposes a fairly rigid structure on the update process of the agents. This structure mandates that (i) only one agent updates its action choice at any iteration, (ii) agents are able to select any action in their action set, and (iii) agents are able to assess their utility for any alternative action choice given the observed behavior of the other agents. In general, Alos-Ferrer and Netzer (2010) demonstrate that relaxing these structures arbitrarily can significantly alter the resulting asymptotic guarantees associated with log-linear learning. However, in each of these scenarios, variations of log-linear learning can preserve the asymptotic guarantees while making the structure more amenable to engineering systems (Marden and Shamma 2012).
Here, we present a variation of log-linear learning that preserves the asymptotic guarantees associated with log-linear learning while accommodating restrictions in the agents' action sets. By restrictions in action sets, we mean that the set of actions available to a given agent is dependent on the agent's current action choice, and we express this dependence by the function $R_i : A_i \to 2^{A_i}$, where $a_i \in R_i(a_i)$ for all $a_i$. That is, if the choice of agent $i$ at time $t$ is $a_i(t)$, then the ensuing choice of the agent $a_i(t+1)$ must be contained in the set $R_i(a_i(t))$. Throughout this section, we consider restricted action sets that satisfy two properties:

(i) Reversibility: Let $a_i, a_i'$ be any two action choices in $A_i$. If $a_i' \in R_i(a_i)$, then $a_i \in R_i(a_i')$.
(ii) Completeness: Let $a_i, a_i'$ be any two action choices in $A_i$. There exists a sequence of actions $a_i = a_i^0, a_i^1, \ldots, a_i^m = a_i'$ with the property that $a_i^{k+1} \in R_i(a_i^k)$ for all $k \in \{0, \ldots, m-1\}$.

One motivation for considering restricted action sets of the above form is when the
individual agents have mobility limitations, e.g., mobile sensor networks.
Note that the log-linear learning update rule given in (28) has full support on the agent's action set $A_i$, thereby disqualifying this algorithm for use in the case where there are restrictions in action sets. Here, we seek to address the question of how to alter the algorithm so as to preserve the asymptotic guarantees, i.e., convergence in

the stationary distribution to the action profile that maximizes the potential function.
One natural variation would be to replace (28) with a strategy of the form: for any $a_i \in R_i(a_i(t))$,

$$p_i^{a_i}(t) = \frac{e^{(1/T)\,U_i(a_i, a_{-i}(t))}}{\sum_{\tilde{a}_i \in R_i(a_i(t))} e^{(1/T)\,U_i(\tilde{a}_i, a_{-i}(t))}}, \qquad (32)$$

and $p_i^{a_i}(t) = 0$ for any $a_i \notin R_i(a_i(t))$. However, such modifications can have drastic consequences on the resulting asymptotic guarantees. In fact, such a rule is not even able to guarantee that the potential function maximizer is in the support of the limiting distribution as $T \to 0^+$ (Marden and Shamma 2012).
Here, we introduce a variation of log-linear learning, termed binary log-linear
learning with restricted action sets (Marden and Shamma 2012), that preserves these
asymptotic guarantees. Binary log-linear learning follows the same setup as log-
linear learning where step (ii) is now of the form:

(ii) Agent $i$ selects a trial action $a_i^t \in R_i(a_i(t))$ according to any distribution with full support on the set $R_i(a_i(t))$. Conditioned on the selection of this trial action, the agent selects the action $a_i(t+1)$ according to a probability distribution $p_i(t) = \{p_i^{a_i}(t)\}_{a_i \in A_i} \in \Delta(A_i)$ of the form

$$a_i(t+1) = \begin{cases} a_i(t) & \text{with probability } \dfrac{e^{(1/T)\,U_i(a(t))}}{e^{(1/T)\,U_i(a(t))} + e^{(1/T)\,U_i(a_i^t, a_{-i}(t))}}, \\[3mm] a_i^t & \text{with probability } \dfrac{e^{(1/T)\,U_i(a_i^t, a_{-i}(t))}}{e^{(1/T)\,U_i(a(t))} + e^{(1/T)\,U_i(a_i^t, a_{-i}(t))}}, \end{cases} \qquad (33)$$

where $p_i^{a_i}(t) = 0$ for any $a_i \notin \{a_i(t), a_i^t\}$.
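As a rough illustration, the following Python sketch implements one iteration of binary log-linear learning under restricted action sets; the uniform draw of the trial action and the dictionary-based data structures (`profile`, `restricted`, `utilities`) are illustrative assumptions rather than the exact formulation in Marden and Shamma (2012).

```python
import numpy as np

def binary_log_linear_step(profile, restricted, utilities, T, rng):
    """One iteration of binary log-linear learning (Eq. (33)) -- a sketch.

    restricted: dict agent -> function R_i(a_i) returning the actions reachable from a_i
                (assumed to contain a_i itself and to satisfy reversibility/completeness)
    """
    agents = list(profile.keys())
    i = agents[rng.integers(len(agents))]              # uniformly chosen updating agent
    a_minus_i = {j: aj for j, aj in profile.items() if j != i}

    # Trial action drawn (here uniformly) from the restricted set R_i(a_i(t)).
    trial_set = list(restricted[i](profile[i]))
    a_trial = trial_set[rng.integers(len(trial_set))]

    u_stay = utilities[i](profile[i], a_minus_i)
    u_trial = utilities[i](a_trial, a_minus_i)
    # Binary logit choice between the current action and the trial action.
    m = max(u_stay, u_trial)                           # shift for numerical stability
    w_stay, w_trial = np.exp((u_stay - m) / T), np.exp((u_trial - m) / T)
    p_trial = w_trial / (w_stay + w_trial)

    new_profile = dict(profile)
    if rng.random() < p_trial:
        new_profile[i] = a_trial
    return new_profile
```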

Much like log-linear learning, for any temperature $T > 0$ binary log-linear learning can be modeled by an irreducible and aperiodic Markov chain over the state space $A$; hence, there is a unique stationary distribution, which we denote by $\pi(T) = \{\pi^a(T)\}_{a \in A}$. While log-linear learning provides the explicit form of the stationary distribution $\pi(T)$, the value of log-linear learning centers on the fact that the support of the limiting distribution is precisely the set of potential function maximizers, i.e.,

$$\lim_{T \to 0^+} \pi^a(T) > 0 \;\Leftrightarrow\; a \in \arg\max_{a \in A} \phi(a).$$

The action profiles contained in the support of the limiting distribution are termed
the stochastically stable states. Accordingly, log-linear learning ensures that an
action profile is stochastically stable if and only if it is a potential function
maximizer.
The following theorem from Marden and Shamma (2012) characterizes the long
run behavior of binary log-linear learning.

Theorem 8. Consider any potential game $G$ with potential function $\phi$. If all players follow the learning algorithm binary log-linear learning with restricted action sets and temperature $T > 0$, then an action profile is stochastically stable if and only if it is a potential function maximizer.

This theorem demonstrates that a system designer can effectively deal with restrictions in action sets by appropriately modifying the learning rule. However, a consequence of this is that we are no longer able to provide a precise characterization of the stationary distribution as a function of the temperature parameter $T$. Unlike log-linear learning, binary log-linear learning applied to such a game does not satisfy reversibility unless there are additional constraints imposed on the agents' restricted action sets, i.e., $|R_i(a_i)| = |R_i(a_i')|$ for all $i \in N$ and $a_i, a_i' \in A_i$. Hence, in this theorem we forgo a precise analysis of the stationary distribution in favor of a coarse analysis of the stationary distribution that demonstrates roughly the same asymptotic guarantees.

4.3.3 Beyond Asymptotic Guarantees


In potential games, both log-linear learning and binary log-linear learning ensure that the resulting collective behavior can be characterized by the action profiles that maximize the potential function as the temperature $T \to 0^+$. Here, we focus on the question of characterizing the convergence rates of this process. That is, how long does it take for the collective behavior to reach these desired equilibrium points? Several negative results have emerged regarding the convergence rates of such algorithms (Daskalakis et al. 2009; Hart and Mansour 2010; Shah and Shin 2010). In particular, Hart and Mansour (2010) and Shah and Shin (2010) demonstrate that, in general, the amount of time that it may take to reach such an equilibrium could be exponential in both the number of agents and the cardinality of their action sets.
Accordingly, research has shifted to identifying whether there are classes of games
and variants of the above dynamics that exhibit more desirable guarantees on the
convergence rates.
The following briefly highlights three domains where such positive results exist.

Symmetric Parallel Congestion Games. Consider the class of congestion games


introduced in Example 2. A symmetric parallel congestion game is a congestion game where each agent $i \in N$ has an action set $A_i = R$; that is, any agent can choose any single edge from the set of available roads $R$. In Shah and Shin (2010),
the authors demonstrate that the mixing times associated with log-linear learning
could grow exponentially with regard to the number of players n even in such limited
scenarios. However, the authors introduce a variant of log-linear learning, which
effectively replaces Step (i) of the algorithm (pick an updating player uniformly)
with a new procedure which biases the selection rate of certain agents based on the
current action profile a. This modification of log-linear learning provides similar
asymptotic guarantees with far superior transient guarantees. In particular, this

variant of log-linear learning provides a mixing time that is nearly linear in the
number of agents for this class of congestion games.

Semi-Anonymous Potential Games. In symmetric parallel congestion games, all


of the agents are anonymous (or identical) with regard to their impact on the
potential function and their available action choices. More formally, we will call
two agents $i, j \in N$ anonymous in a potential game if (i) $A_i = A_j$ and (ii) $\phi(a) = \phi(a')$ for any action profiles $a, a'$ where $a_i' = a_j$, $a_j' = a_i$, and $a_k' = a_k$ for all $k \neq i, j$. Accordingly, let $C_1, \ldots, C_m$ represent a minimal partition of $N$ such that each set of agents $C_k$, $k \in \{1, \ldots, m\}$, is anonymous with respect to one another, i.e., any agents $i, j \in C_k$ are anonymous with respect to each other. The authors in Borowski et al. (2013) derive a variant of the log-linear learning algorithm, similar to the algorithm for symmetric parallel congestion games in Shah and Shin (2010) highlighted above, that provides mixing times that are nearly linear in the number of agents $n$ but exponential in the number of indistinguishable groups of agents, $m$.

Graphical Coordination Games. Consider the family of graphical coordination


games introduced in Example 3. In Montanari and Saberi (2009), the authors study the mixing times associated with log-linear learning in a special class of graphical coordination games where the underlying pairwise utility function constitutes a $2 \times 2$ symmetric utility function. In particular, the authors demonstrate that the structure of the network, in particular the min-cut of the graph, is intimately related to the underlying speed of convergence. A consequence of this characterization is that the mixing times associated with log-linear learning are effectively linear in the number of agents when the underlying graph is sparse.

4.4 Universal Learning

The preceding sections presented algorithms that guarantee convergence to Nash


equilibria (or potential function maximizers) for specific game structures, e.g.,
potential games or weakly acyclic games. Here, we focus on the question of
whether there are universal algorithms that provide convergence to an equilibrium
irrespective of the underlying game structure. With regard to Nash equilibrium,
it turns out that such an objective is impossible as demonstrated by Hart and
Mas-Colell (2003) which establishes that no natural dynamics converge to a Nash
equilibrium in all games. Here, the phrase natural seeks to disqualify dynamics
that can be thought of as an exhaustive search or utilizing a central coordinator.
Nonetheless, by relaxing our equilibrium requirements from Nash equilibria to coarse correlated equilibria, such universal algorithms do exist. In the following,
we survey the most well-known algorithm that achieves this objective and discuss
its implications on the efficiency of this broader class of equilibria.
In this section we present an algorithm proposed in Hart and Mas-Colell (2000),
referred to as regret matching, that guarantees convergence to the set of coarse

correlated equilibria. The informational demands and computations associated with the decision-making rule regret matching are very similar to those presented for the algorithm joint strategy fictitious play with inertia highlighted above. The main driver for each agent's strategy selection is the regret associated with each of its actions. For any time $t \in \{1, 2, \ldots\}$, the regret of agent $i \in N$ for action $a_i \in A_i$ is defined as

$$R_i^{a_i}(t) = \bar{U}_i^{a_i}(t) - \bar{U}_i(t), \qquad (34)$$

where $\bar{U}_i(t) = \frac{1}{t}\sum_{\tau=0}^{t-1} U_i(a(\tau))$ is the average utility received by agent $i$ up to time $t$ and $\bar{U}_i^{a_i}(t) = \frac{1}{t}\sum_{\tau=0}^{t-1} U_i(a_i, a_{-i}(\tau))$ is the average utility that agent $i$ would have received up to time $t$ had the agent committed to action $a_i$ at all time steps while the behavior of the other agents remained unchanged. Observe that $R_i^{a_i}(t) > 0$ implies that agent $i$ could have received a higher average utility if the agent had committed to the action $a_i$ for all previous time steps and the action choices of the other agents were unchanged.
The regret matching algorithm proceeds as follows: at each time $t \in \{1, 2, \ldots\}$, each agent $i \in N$ independently selects its action according to the strategy $p_i(t) \in \Delta(A_i)$ of the form

$$p_i^{a_i}(t) = \frac{\left[R_i^{a_i}(t)\right]_+}{\sum_{\tilde{a}_i \in A_i} \left[R_i^{\tilde{a}_i}(t)\right]_+}, \qquad (35)$$

where $[\cdot]_+$ denotes the projection to the positive orthant, i.e., $[x]_+ = \max\{x, 0\}$.
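A minimal Python sketch of a regret matching agent based on (34) and (35) follows; the class interface and the convention of playing uniformly when no action has positive regret, a case that (35) leaves unspecified, are illustrative assumptions.

```python
import numpy as np

class RegretMatchingAgent:
    """Regret matching (Eqs. (34)-(35)) -- a sketch. The agent tracks its average
    received utility and the average hypothetical utility of each action, and then
    randomizes proportionally to positive regret."""

    def __init__(self, actions, utility, rng=None):
        self.actions = list(actions)                 # A_i
        self.utility = utility                       # U_i(a_i, a_minus_i) -> float
        self.rng = rng or np.random.default_rng()
        self.avg_utility = 0.0                       # running \bar{U}_i(t)
        self.avg_hyp = {a: 0.0 for a in self.actions}   # running \bar{U}_i^{a_i}(t)
        self.t = 0

    def observe(self, own_action, a_minus_i):
        """Update the running averages after a period of play."""
        self.t += 1
        received = self.utility(own_action, a_minus_i)
        self.avg_utility += (received - self.avg_utility) / self.t
        for a in self.actions:
            self.avg_hyp[a] += (self.utility(a, a_minus_i) - self.avg_hyp[a]) / self.t

    def act(self):
        """Select an action with probability proportional to positive regret, as in (35)."""
        regrets = np.array([max(self.avg_hyp[a] - self.avg_utility, 0.0)
                            for a in self.actions])
        total = regrets.sum()
        if total <= 0.0:                             # no positive regret: play uniformly
            return self.rng.choice(self.actions)
        return self.rng.choice(self.actions, p=regrets / total)
```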
The following theorem characterizes the long run behavior of regret matching in
any game.

Theorem 9. Consider any finite game $G$. If all players follow the learning algorithm regret matching defined above, then the positive regret for any agent $i \in N$ and action $a_i \in A_i$ asymptotically vanishes, i.e.,

$$\lim_{t \to \infty} \left[R_i^{a_i}(t)\right]_+ = 0. \qquad (36)$$

Alternatively, the empirical frequency of play converges to the set of coarse correlated equilibria.

The connection between the condition (36) and the definition of coarse correlated equilibria stems from the fact that an agent's regret and average utility can also be computed using the empirical frequency of play $z(t) = \{z^a(t)\}_{a \in A}$, where

$$z^a(t) = \frac{1}{t}\sum_{\tau=0}^{t-1} I\{a(\tau) = a\}. \qquad (37)$$

In particular, at any time $t \in \{1, 2, \ldots\}$ we have that

$$U_i(z(t)) = \sum_{a \in A} U_i(a)\, z^a(t) = \bar{U}_i(t). \qquad (38)$$

Further, defining the marginal distribution of the empirical frequency of play of all agents $j \neq i$ as $z_{-i}^{a_{-i}}(t) = \sum_{a_i \in A_i} z^{(a_i, a_{-i})}(t)$, we have

$$U_i(a_i, z_{-i}(t)) = \sum_{a_{-i} \in A_{-i}} U_i(a_i, a_{-i})\, z_{-i}^{a_{-i}}(t) = \bar{U}_i^{a_i}(t). \qquad (39)$$

Accordingly, if a sequence of play $a(0), a(1), \ldots, a(t-1)$ satisfies (36), then we know that the empirical frequency of play $z(t)$ satisfies

$$\lim_{t \to \infty} \left\{ U_i(z(t)) - U_i(a_i, z_{-i}(t)) \right\} \geq 0, \quad \forall i \in N,\ a_i \in A_i. \qquad (40)$$

Hence, the limiting empirical frequency of play $z(t)$ is contained in the set of coarse correlated equilibria. Note that the convergence highlighted above does not state that the empirical frequency of play will converge to any specific coarse correlated equilibrium; rather, it merely states that the empirical frequency of play will approach the set of coarse correlated equilibria.
Lastly, we presented a version of regret matching that provides convergence to the set of coarse correlated equilibria. Variants of the presented regret matching could also ensure convergence to the set of correlated equilibria, which is a more restrictive solution concept than the coarse correlated equilibrium presented in Definition 3. We direct the readers to Hart and Mas-Colell (2000) and Young (2004) for the details associated with this variation.

4.4.1 Equilibrium Selection of Correlated Equilibrium


The set of correlated equilibria is much larger than the set of Nash equilibria and
can potentially be exploited to provide systems with better performance guarantees.
One example of such a system is the Shapley game, which is a two-player game
with utility functions of the form
                                Agent 2
                           A        B        C
                     A   0, 0     0, 1     1, 0
        Agent 1      B   1, 0     0, 0     0, 1
                     C   0, 1     1, 0     0, 0
                            Payoff Matrix
There are no pure Nash equilibria in this game, and the unique (mixed) Nash equilibrium is when each agent $i$ employs the strategy $p_i = (1/3, 1/3, 1/3)$, which yields an expected payoff of $1/3$ to each agent. However, there is also a coarse correlated equilibrium where the distribution $z$ has a value of $1/6$ on each of the six joint actions where some agent receives a nonzero payoff; $z$ has a value of $0$ for

the other three joint actions. This coarse correlated equilibrium yields an expected utility of $1/2$ to each agent and is clearly more desirable. One could easily imagine other scenarios, e.g., team versus team games, where specific coarse correlated equilibria could provide significant performance improvements over any Nash equilibrium.
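A short Python check of these numbers is given below; the payoff dictionaries simply encode the matrix above, and the final loop verifies that no fixed unilateral deviation improves agent 1's expected utility under the correlated distribution, i.e., the coarse correlated equilibrium condition.

```python
import itertools

# The payoff dictionaries encode the matrix, the uniform mixed Nash equilibrium places
# 1/9 on every joint action, and the coarse correlated equilibrium places 1/6 on each
# off-diagonal joint action.
A = ["A", "B", "C"]
U1 = {("A", "A"): 0, ("A", "B"): 0, ("A", "C"): 1,
      ("B", "A"): 1, ("B", "B"): 0, ("B", "C"): 0,
      ("C", "A"): 0, ("C", "B"): 1, ("C", "C"): 0}
U2 = {a: U1[(a[1], a[0])] for a in U1}              # the two roles are symmetric

joint = list(itertools.product(A, A))
mixed_ne = {a: 1.0 / 9 for a in joint}              # each agent mixes (1/3, 1/3, 1/3)
cce = {a: (1.0 / 6 if a[0] != a[1] else 0.0) for a in joint}

for name, z in [("mixed Nash equilibrium", mixed_ne), ("coarse correlated equilibrium", cce)]:
    e1 = sum(z[a] * U1[a] for a in joint)
    e2 = sum(z[a] * U2[a] for a in joint)
    print(f"{name}: agent 1 = {e1:.3f}, agent 2 = {e2:.3f}")   # 1/3 each vs 1/2 each

# No fixed unilateral deviation improves agent 1's expected utility under cce.
expected_1 = sum(cce[a] * U1[a] for a in joint)
for a1_dev in A:
    deviation = sum(cce[(a1, a2)] * U1[(a1_dev, a2)] for (a1, a2) in joint)
    assert deviation <= expected_1 + 1e-12
```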
The problem with regret matching for exploiting this potential opportunity is that
behavior is not guaranteed to converge to any specific coarse correlated equilibrium.
Accordingly, the efficiency guarantees associated with coarse correlated equilibria
cannot be better than the efficiency bounds associated with pure Nash equilibria and
can often be quite worse. With this issue in mind, recent work in Marden (2015)
and Borowski et al. (2014) has sought to develop learning algorithms that converge
to the efficient coarse correlated equilibrium, where efficiency is measured by the
sum of the agents’ expected utilities. Here, the algorithm introduced in Marden
(2015) ensures that the empirical frequency of play will converge to the most
efficient coarse correlated equilibrium, while Borowski et al. (2014) provides an
algorithm that guarantees that the day-to-day behavior of the agents will converge to
the most efficient correlated equilibrium. Both of these algorithms view convergence
in a stochastic stability sense.
The motivation for these developments centers on the fact that joint randomization, which can potentially be characterized by correlated equilibria, can be key to
providing desirable system-level behavior. One example of such a system is a peer-
to-peer file sharing system where users engage in interactions with other users to
transfer files of interest and satisfy demands (Wang et al. 2009). Here, Wang et al.
(2009) demonstrates that the optimal system performance is actually characterized
by the most efficient correlated equilibrium as defined above. Another example
of such a system is the problem of access control for wireless communications,
where there are a collection of mobile terminals that compete over access to a
common channel (Altman et al. 2006). Optimizing system throughput requires a
level of correlation between the transmission strategies of the mobiles so as to
minimize the chance of simultaneous transmissions and failures. The authors in
Altman et al. (2006) study the efficiency of correlated equilibria in this context.
Identifying the role of correlated equilibrium (and learning strategies for attaining
specific correlated equilibrium) warrants further research attention.

5 Conclusion

The goal of this chapter has been to highlight a potential role of game-theoretic
learning in the design of networked control systems. We reviewed several classes
of learning algorithms accentuating their performance guarantees and reliance on
game structures.
It is important to reemphasize that game-theoretic learning represents just a
single dimension of a game-theoretic control design. The other dimension centers on
the assignment of objective functions to the individual agents. The structure of these
agent objective functions not only dictates convergence guarantees associated with

various game-theoretic learning algorithms but can also be exploited to characterize


the efficiency of the resulting behavior. To that end, consider the assignment of
agent objective functions that yields a potential game and has a given price of
anarchy. Marrying this design with a learning algorithm that guarantees convergence
to a pure Nash equilibrium in potential games yields a game-theoretic control
design that ensures that the collective behavior will converge to a specific allocation
(in particular a Nash equilibrium associated with the designed agent objective
functions) and the efficiency of this allocation will be in line with the given price of
anarchy.
Taking full advantage of this game-theoretic approach requires assigning agent
objective functions that yield a potential game and optimize the price of anarchy
over all such objective functions. Unfortunately, the existing literature provides no
mechanism for accomplishing this goal as utility design for distributed engineering
systems is currently not well understood. A reason for this gap is that agent
objective functions are traditionally modeled to reflect agent preferences in a given
social system, e.g., a reasonable objective for drivers on a transportation network
is minimizing experienced congestion. Hence, efficiency measures in games, such
as the price of anarchy, are traditionally viewed from an analysis perspective
with virtually no design component. Reversing this trend and deriving systematic
methodologies for utility design in multiagent systems represents a significant
opportunity for game-theoretic control moving forward.

References
Alos-Ferrer C, Netzer N (2010) The logit-response dynamics. Games Econ Behav 68:413–427
Altman E, Bonneau N, Debbah M (2006) Correlated equilibrium in access control for wireless
communications. In: 5th International Conference on Networking
Babichenko Y (2012) Completely uncoupled dynamics and Nash equilibria. Games Econ Behav
76:1–14
Blondel VD, Hendrickx JM, Olshevsky A, Tsitsiklis JN (2005a) Convergence in multiagent
coordination, consensus, and flocking. In: IEEE Conference on Decision and Control,
pp 2996–3000
Blondel VD, Hendrickx JM, Olshevsky A, Tsitsiklis JN (2005b) Convergence in multiagent
coordination, consensus, and flocking. In: Proceedings of the Joint 44th IEEE Conference on
Decision and Control and European Control Conference (CDC-ECC’05), Seville
Blume L (1993) The statistical mechanics of strategic interaction. Games Econ Behav 5:387–424
Blume L (1997) Population games. In: Arthur B, Durlauf S, and Lane D (eds) The economy as an
evolving complex system II. Addison-Wesley, Reading, pp 425–460
Borowski H, Marden JR, Frew EW (2013) Fast convergence in semi-anonymous potential games.
In: Proceedings of the IEEE Conference on Decision and Control, pp 2418–2423
Borowski HP, Marden JR, Shamma JS (2014) Learning efficient correlated equilibria. In: Proceed-
ings of the IEEE Conference on Decision and Control, pp 6836–6841
Cortes J, Martinez S, Karatas T, Bullo F (2002) Coverage control for mobile sensing networks.
In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA’02),
Washington, DC, pp 1327–1332
Daskalakis C, Goldberg PW, Papadimitriou CH (2009) The complexity of computing a Nash
equilibrium. SIAM J Comput 39(1):195–259

Fudenberg D, Levine DK (1995) Consistency and cautious fictitious play. Games Econ Behav
19:1065–1089
Fudenberg D, Levine DK (1998) The theory of learning in games. MIT Press, Cambridge
Fudenberg D, Tirole J (1991) Game theory. MIT Press, Cambridge
Gopalakrishnan R, Marden JR, Wierman A (2014) Potential games are necessary to ensure pure
Nash equilibria in cost sharing games. Math Oper Res 39(4):1252–1296
Hart S (2005) Adaptive heuristics. Econometrica 73(5):1401–1430
Hart S, Mansour Y (2010) How long to equilibrium? The communication complexity of uncoupled
equilibrium procedures. Games Econ Behav 69(1):107–126
Hart S, Mas-Colell A (2000) A simple adaptive procedure leading to correlated equilibrium.
Econometrica 68(5):1127–1150
Hart S, Mas-Colell A (2003) Uncoupled dynamics do not lead to Nash equilibrium. Am Econ Rev
93(5):1830–1836
Jadbabaie A, Lin J, Morse AS (2003) Coordination of groups of mobile autonomous agents using
nearest neighbor rules. IEEE Trans Autom Control 48(6):988–1001
Kearns MJ, Littman ML, Singh SP (2001) Graphical models for game theory. In: Proceedings of
the 17th Conference in Uncertainty in Artificial Intelligence, pp 253–260
Lambert III TJ, Epelman MA, Smith RL (2005) A fictitious play approach to large-scale
optimization. Oper Res 53(3):477–489
Marden JR (2012) State based potential games. Automatica 48:3075–3088
Marden JR (2015) Selecting efficient correlated equilibria through distributed learning. In:
American Control Conference, pp 4048–4053
Marden JR, Shamma JS (2012) Revisiting log-linear learning: asynchrony, completeness and a
payoff-based implementation. Games Econ Behav 75(2):788–808
Marden JR, Shamma JS (2015) Game theory and distributed control. In: Young HP, Zamir S (eds)
Handbook of game theory with economic applications, vol 4. Elsevier Science, pp 861–899
Marden JR, Wierman A (2013) Distributed welfare games. Oper Res 61:155–168
Marden JR, Arslan G, Shamma JS (2009) Joint strategy fictitious play with inertia for potential
games. IEEE Trans Autom Control 54:208–220
Martinez S, Cortes J, Bullo F (2007) Motion coordination with distributed information. Control
Syst Mag 27(4):75–88
Monderer D, Shapley LS (1996) Fictitious play property for games with identical interests. J Econ
Theory 68:258–265
Montanari A, Saberi A (2009) Convergence to equilibrium in local interaction games. In: 50th
Annual IEEE Symposium on Foundations of Computer Science
Murphey RA (1999) Target-based weapon target assignment problems. In: Pardalos PM, Pitsoulis
LS (eds) Nonlinear assignment problems: algorithms and applications. Kluwer Academic,
Alexandra
Nisan N, Roughgarden T, Tardos E, Vazirani VV (2007) Algorithmic game theory. Cambridge
University Press, New York
Olfati-Saber R (2006) Flocking for multi-agent dynamic systems: algorithms and theory. IEEE
Trans Autom Control 51:401–420
Olfati-Saber R, Murray RM (2003) Consensus problems in networks of agents with switching
topology and time-delays. IEEE Trans Autom Control 49(9):1520–1533
Olfati-Saber R, Fax JA, Murray RM (2007) Consensus and cooperation in networked multi-agent
systems. Proc IEEE 95(1):215–233
Roughgarden T (2005) Selfish routing and the price of anarchy. MIT Press, Cambridge
Roughgarden T (2015) Intrinsic robustness of the price of anarchy. J ACM 62(5):32:1–32:42
Shah D, Shin J (2010) Dynamics in congestion games. In: ACM SIGMETRICS, pp 107–118
Shamma JS (2014) Learning in games. In: Baillieul J, Samad T (eds) Encyclopedia of systems and
control. Springer, London
Shoham Y, Powers R, Grenager T (2007) If multi-agent learning is the answer, what is the question?
Artif Intell 171(7):365–377

Touri B, Nedic A (2011) On ergodicity, infinite flow, and consensus in random models. IEEE Trans
Autom Control 56(7):1593–1605
Tsitsiklis JN (1987) Decentralized detection by a large number of sensors. Technical report. MIT,
LIDS
Tsitsiklis JN, Bertsekas DP, Athans M (1986) Distributed asynchronous deterministic and stochas-
tic gradient optimization algorithms. IEEE Trans Autom Control 35(9):803–812
Vetta A (2002) Nash equilibria in competitive societies, with applications to facility location, traffic routing and auctions. In: Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, pp 416–425
Wang B, Han Z, Liu KJR (2009) Peer-to-peer file sharing game using correlated equilibrium. In:
43rd Annual Conference on Information Sciences and Systems, CISS 2009, pp 729–734
Wolpert D, Tumer K (1999) An overview of collective intelligence. In: Bradshaw JM (ed)
Handbook of agent technology. AAAI Press/MIT Press, Cambridge, USA
Young HP (1998) Individual strategy and social structure. Princeton University Press, Princeton
Young HP (2004) Strategic learning and its limits. Oxford University Press, New York
Zhu M, Martínez S (2013) Distributed coverage games for energy-aware mobile sensor networks.
SIAM J Control Optim 51(1):1–27
