This repository contains sample code to compare variants of the Double Oracle[^1] algorithm for computing Nash equilibria in 2-player zero-sum games.
2-player zero-sum games model situations where players compete for a shared resource and the gain of one player results in an equal loss for the other (hence the name 'zero-sum': summing the gains and losses of both players yields zero for every possible game outcome).
The code is a Python script and requires the numpy, scipy, and matplotlib packages to run. Simply invoke `python main.py` to solve a game matrix with 100 rows and columns filled with uniformly distributed random values.
The script optionally takes as arguments the matrix dimensions and the random seed. Note that for larger matrices, the hyperparameters of the regret minimization algorithm (Section 6) may require tuning; see the discussion below for more details.
```
usage: main.py [-h] [-r ROWS] [-c COLS] [-s SEED]

optional arguments:
  -h, --help            show this help message and exit
  -r ROWS, --rows ROWS  The rows of the random game matrix
  -c COLS, --cols COLS  The columns of the random game matrix
  -s SEED, --seed SEED  The RNG seed
```
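For example, to solve a 200×200 random game with a fixed seed (the specific values here are arbitrary):

```
python main.py --rows 200 --cols 200 --seed 42
```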
We consider 2-player zero-sum games in normal form. Each player $i \in \{1, 2\}$ has a finite set of pure strategies $\Pi_i$. A strategy $\pi_i$ is a probability distribution over $\Pi_i$: a pure strategy plays a single action with probability one, while a mixed strategy randomizes over several actions. Each player has an associated utility function mapping a strategy profile to the player's expected payoff. The utility function of player $i$ is denoted $v_i(\pi_i, \pi_{-i})$, where $\pi_{-i}$ is the strategy of the other player; in a zero-sum game, $v_{-i} = -v_i$.
In normal form, the payoffs of a 2-player game are conveniently represented with matrices storing the utility values of the two players, sometimes compacted into a bi-matrix with two scalar values per cell representing the payoffs of the row and column player, respectively.
In the case of 2-player zero-sum games, the two payoff matrices are each the negation of the other, so the game can be modeled more compactly by a single matrix $A$ holding the payoffs of the row player.
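A standard example is Matching Pennies, where the row player wins 1 when the two coins match and loses 1 otherwise: the game is fully described by the row player's payoff matrix

$$ A = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, $$

and the column player's payoffs are simply $-A$.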
Given a strategy $\pi_{-i}$ of the adversary, a best response for player $i$ is a strategy that maximizes the player's expected utility:
$$ \mathbb{BR}_i(\pi_{-i}) = \underset{\pi_i}{\operatorname{arg\,max}}\; v_i(\pi_i, \pi_{-i}). $$
Note that a best response may not be unique, and for a mixed-strategy best response all the pure strategies that play any action in its support are themselves best responses to $\pi_{-i}$.
A Nash equilibrium is a strategy profile where each player's strategy is a best response to the strategy of the adversary.
Although the primary goal of the algorithms presented later is to avoid computing a Nash equilibrium by solving a full linear program, they still require solving smaller games exactly. For completeness, we quickly recall how to compute an exact Nash equilibrium by exploiting the minimax theorem. A fundamental result for 2-player zero-sum games is that the solution concepts of Nash equilibrium, minimax (a player seeking to minimize the maximum utility of the other) and maximin (a player seeking to maximize their minimum utility) are equivalent[^2].
Throughout the rest of this document we assume that the row player is the maximin player and the column player is the minimax player, both modeled with the same payoff matrix $A = (a_{ij}) \in \mathbb{R}^{m \times n}$.
To compute the equilibrium strategy $x$ of the row player, we solve the following linear program:
$\phantom{s.t.}\; \max v$
$s.t.\; \sum_i a_{ij} x_i \geq v \qquad j = 1, \dots, n$
$\phantom{s.t.}\; \sum_i x_i = 1$
$\phantom{s.t.}\; x \geq 0$
Similarly, to compute the equilibrium strategy $y$ of the column player, we solve:
$\phantom{s.t.}\; \min w$
$s.t.\; \sum_j a_{ij} y_j \leq w \qquad i = 1, \dots, m$
$\phantom{s.t.}\; \sum_j y_j = 1$
$\phantom{s.t.}\; y \geq 0$
In the Python script these linear programs are solved by the functions `solve_maxmin` and `solve_minmax`, which both take the game matrix $A$ as argument.
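As an illustration of how the maximin LP can be set up with `scipy.optimize.linprog`, here is a minimal sketch; the function name `solve_maxmin_lp` is ours and the actual code in `main.py` may differ:

```python
import numpy as np
from scipy.optimize import linprog

def solve_maxmin_lp(A):
    """Maximin LP for the row player of the matrix game A (m x n).

    Decision variables are (x_1, ..., x_m, v); linprog minimizes, so we
    minimize -v subject to the constraints of the LP above.
    Illustrative sketch: not necessarily the exact code of solve_maxmin.
    """
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                  # maximize v
    # For every column j:  v - sum_i a_ij x_i <= 0
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]     # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                   # equilibrium strategy x and game value v
```

The minimax LP for the column player is obtained symmetrically, with decision variables $(y_1, \dots, y_n, w)$ and the constraints $\sum_j a_{ij} y_j \leq w$.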
The Double Oracle algorithm[^1] is an iterative scheme for computing Nash equilibria (NE) in 2-player strictly competitive games.
The idea behind this algorithm is to avoid solving the full game, instead solving progressively larger sub-games defined by restricted subsets of pure strategies available to each player. This avoids building the full game matrix, which in many cases is impractical to compute and store. Moreover, since many strategies are often clearly sub-optimal and contribute little to the search for the equilibrium, the Double Oracle algorithm has the potential to converge in a reasonably low number of iterations, possibly offering a speedup compared to solving the full linear programming model.
The Double Oracle algorithm works as follows:
Double Oracle Algorithm
Let $\Pi^0 = \{\Pi^0_1, \Pi^0_2\}$ be the initial populations of strategies for each player $i = 1, 2$

Repeat at each iteration $t$:

- Compute the NE $\pi^t$ of the game restricted to the strategies in $\Pi^t$
- For each player $i$, find a best response $\beta_i \leftarrow \mathbb{BR}_i(\pi^t_{-i})$ using an oracle that searches in the full strategy space (unrestricted)
- Add the best responses to each player's population: $\Pi^{t+1}_i \leftarrow \Pi^t_i \cup \{\beta_i\}$

Until no novel best response exists for either player

Return $\pi^t$ as the NE of the full unrestricted game
It is easy to see that the algorithm is complete: eventually execution terminates with all the strategies added to each player population, producing a restricted game at the last iteration that is equivalent to the full unrestricted game.
However, the algorithm can terminate much sooner if the best response oracle is unable to find, for either player, a best response that does not already belong to that player's strategy population. In this case, the algorithm stops and outputs the last restricted Nash equilibrium as the solution for the full game. To see that this is correct, it is sufficient to recall the definition of a Nash equilibrium as a strategy profile in which each player's strategy is a best response to the other's. In the case of early termination, the best response oracle is unable to find a best response for either player outside the restricted populations, i.e. no player can attain a better expected value by switching to a strategy that is not among the ones already considered. Therefore, the equilibrium cannot be broken to anyone's advantage by unilaterally changing strategy, which is exactly the definition of a Nash equilibrium.
The Double Oracle algorithm is implemented in the function `double_oracle()`, which takes as input the game matrix $A$.
Having the full game matrix at our disposal, computing the best response to the restricted strategy is straightforward and involves a simple linear scan of the utility vector generated by the adversary's strategy. For the row player, we extend the restricted strategy of the column player to a full-length vector $y$ (with zero probability on the strategies outside the population), compute the utility vector $Ay$, and pick the row with the highest expected utility; the column player's best response is obtained symmetrically from $x^\top A$ by picking the column with the lowest value.
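A rough sketch of these two steps is shown below; `solve_restricted_ne` is an assumed helper returning a NE of a sub-matrix game (for instance built from the two LPs above), and the names are illustrative rather than the actual API of `main.py`:

```python
import numpy as np

def best_response_row(A, y):
    """Pure best response of the maximizing row player to a full-length column strategy y."""
    return int(np.argmax(A @ y))

def best_response_col(A, x):
    """Pure best response of the minimizing column player to a full-length row strategy x."""
    return int(np.argmin(x @ A))

def double_oracle_sketch(A, solve_restricted_ne):
    """Illustrative Double Oracle loop.

    solve_restricted_ne(sub) is assumed to return a NE (x_r, y_r) of the
    restricted matrix game `sub`.
    """
    m, n = A.shape
    rows, cols = [0], [0]                              # arbitrary initial populations
    while True:
        x_r, y_r = solve_restricted_ne(A[np.ix_(rows, cols)])
        # Lift the restricted strategies back to full-length vectors.
        x = np.zeros(m)
        x[rows] = x_r
        y = np.zeros(n)
        y[cols] = y_r
        br_r, br_c = best_response_row(A, y), best_response_col(A, x)
        if br_r in rows and br_c in cols:              # no novel best response for either player
            return x, y
        if br_r not in rows:
            rows.append(br_r)
        if br_c not in cols:
            cols.append(br_c)
```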
Note that, for large and complex games, having an exact best response oracle may not be possible.
An undesirable property of the Double Oracle algorithm is that it is completely oblivious to the strategies that do not belong to the adversary's population.
This means that the NE computed at iteration $t$ may still be highly exploitable by strategies that lie outside the restricted populations.
To evaluate this, we can introduce the definition of exploitability of a strategy profile $\pi$.

First, we introduce the exploitability of a strategy from a particular player's perspective: the exploitability $e_i(\pi)$ for player $i$ is

$$ e_i(\pi) = v_i(\mathbb{BR}_i(\pi_{-i}), \pi_{-i}) - v_i(\pi). $$
That is, the exploitability of a strategy from a player's perspective measures how much that player can expect to gain by playing a best response to the opponent's strategy instead of their own strategy in the profile $\pi$.
The exploitability of a strategy profile is then simply the sum of the individual exploitability values:

$$ e(\pi) = e_1(\pi) + e_2(\pi). $$

If $\pi$ is a NE, no player can gain by unilaterally deviating, so its exploitability is zero.
In this document we use the same payoff matrix for both players, with the row player aiming to maximize utility while the column player seeks to minimize it. In this case, the definition of exploitability reduces to

$$ e(x, y) = \max_i (Ay)_i - \min_j (x^\top A)_j, $$

where $x$ and $y$ are the row and column strategies. That is, the exploitability is the gap between the two players' best-response values, which is consistent with a NE having zero gap (i.e. the same value for both players) in zero-sum games.
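In code, the exploitability of a profile $(x, y)$ can be computed directly from the payoff matrix; a minimal sketch (not necessarily the exact implementation in `main.py`):

```python
import numpy as np

def exploitability(A, x, y):
    """Exploitability of the profile (x, y) in the zero-sum game with payoff matrix A,
    where the row player maximizes x^T A y and the column player minimizes it."""
    best_row_value = np.max(A @ y)    # value of a row-player best response to y
    best_col_value = np.min(x @ A)    # value of a column-player best response to x
    return best_row_value - best_col_value
```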
If we plot the exploitability of the restricted NEs as the Double Oracle algorithm progresses on a random game, this is what we get:
As expected, exploitability tends to decrease as more strategies are added to the restricted populations; yet, the decrease is not monotone and exploitability can increase from one iteration to the next. This is undesirable because it means that the quality of the iterative solutions generated by the algorithm is not proportional to the number of iterations taken to produce them: if we need to bound the number of iterations (e.g. because the strategy must be computed as part of the response of a time-constrained system), then letting the algorithm run longer does not guarantee a better output.
In the following, we examine refinements to the Double Oracle algorithm that deal with this exact issue.
The first refinement of the Double Oracle algorithm we examine is Range of Skill[^3]. This algorithm exploits the idea of a generalized best response.
Given a 2-player strictly competitive game $G = (\Pi_1, \Pi_2, v)$, a generalized best response to a restricted set of strategies $\Pi^\prime_i \subseteq \Pi_i$ is a strategy $\pi_{-i}$ such that $(\pi^\prime_i, \pi_{-i})$ is a NE of the restricted game $G^\prime = (\Pi^\prime_i, \Pi_{-i}, v)$.
In practice, a generalized best response is the unrestricted player's part of a NE of a game in which only the adversary is restricted. This yields a NE in which the restricted player's strategy is the least exploitable among their population of available strategies: if a player implements such a strategy, they are guaranteed that the adversary cannot exploit it further in subsequent iterations. In fact, this ensures that the overall exploitability can never increase from one iteration to the next, since the restricted player only limits their own strategy while the full strategy space of the unrestricted player is taken into account.
The Range of Skill algorithm is as follows (where, compared to the original definition, we avoid computing an approximate generalized best response and instead define an exact algorithm)
Range of Skill Algorithm
Let $\Pi^0 = \{\Pi^0_1, \Pi^0_2\}$ be the initial populations of strategies for each player $i = 1, 2$

Repeat at each iteration $t$:

- For each player $i$, compute the NE $(\pi^\prime_i, \pi_{-i})$ of the game $G^\prime_i$ where the strategies of $i$ are restricted and the strategies of $-i$ are unrestricted
- For each player $i$, add the support of $\pi_{-i}$ to the adversary's strategy population: $\Pi^{t+1}_{-i} \leftarrow \Pi^t_{-i} \cup S(\pi_{-i})$

Until no novel strategy is added to either population

Return $(\pi^\prime_1, \pi^\prime_2)$ as the NE of the full unrestricted game
At each iteration, Range of Skill solves two restricted games in which one player is restricted and the adversary is unrestricted, and adds the entire support of the unrestricted player's equilibrium strategy to that player's population of strategies.
Considerations similar to those made for the Double Oracle algorithm apply to the completeness and correctness of this algorithm.
If we now plot the exploitability of Range of Skill during execution, this is what we get:
We can see that the convergence rate of the algorithm is much faster, and exploitability is monotonically decreasing. Of course, we are paying quite a steep price for this: at each iteration, we need to solve two restricted games, each with one set of strategies unrestricted; moreover, we are adding the entire support of a NE strategy to a player's population at every iteration.
The generalized best responses can be computed by solving two Linear Programs, although a more practical solution is to use the Double Oracle algorithm as a subroutine for Range of Skill as suggested by the original paper authors.
Range of Skill is implemented in the `solve_ros()` function and solves the restricted games with Linear Programming.
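As a sketch of what one Range of Skill iteration does, assuming a helper `solve_zero_sum(M)` that returns a NE $(x, y)$ of the matrix game $M$ (for instance built from the two LPs shown earlier); the names are illustrative and may differ from `solve_ros()`:

```python
import numpy as np

def ros_iteration(A, row_pop, col_pop, solve_zero_sum):
    """One Range of Skill iteration (sketch); row_pop and col_pop are sets of indices.

    solve_zero_sum(M) is an assumed helper returning a NE (x, y) of the matrix game M.
    """
    rows, cols = sorted(row_pop), sorted(col_pop)
    # Rows restricted, columns unrestricted: the column part of the NE is a
    # generalized best response to the row population.
    _, y_gbr = solve_zero_sum(A[rows, :])
    # Columns restricted, rows unrestricted: the row part of the NE is a
    # generalized best response to the column population.
    x_gbr, _ = solve_zero_sum(A[:, cols])
    # Add the entire support of each generalized best response to that player's population.
    col_pop.update(int(j) for j in np.flatnonzero(y_gbr > 1e-12))
    row_pop.update(int(i) for i in np.flatnonzero(x_gbr > 1e-12))
```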
Anytime Double Oracle (ADO)[^4] is a simple modification of Range of Skill: instead of adding the entire support of the restricted NE for each player, ADO simply adds a best response to the generalized best response of the adversary.
Anytime Double Oracle Algorithm
Let $\Pi^0 = \{\Pi^0_1, \Pi^0_2\}$ be the initial populations of strategies for each player $i = 1, 2$

Repeat at each iteration $t$:

- Compute the NEs $(\pi^\prime_i, \pi_{-i})$ of the two restricted games $G^\prime_i$ where the strategies of $i$ are restricted and the strategies of $-i$ are unrestricted
- For each player $i$, find a best response $\beta_i \leftarrow \mathbb{BR}_i(\pi_{-i})$ using an oracle that searches in the full strategy space (unrestricted)
- Add the best responses to each player's population: $\Pi^{t+1}_i \leftarrow \Pi^t_i \cup \{\beta_i\}$

Until no novel strategy is added to either population

Return $(\pi^\prime_1, \pi^\prime_2)$ as the NE of the full unrestricted game
With this change, the algorithm still achieves non-increasing exploitability across iterations, although convergence becomes slower:
The crucial difference compared to Range of Skill is that the algorithm now adds to a player's population a single best response to the unrestricted adversary, rather than the full support of the unrestricted player's NE strategy. This allows the algorithm to scale to large games (where computing NEs with one player unrestricted is impractical) by computing approximate NEs instead of exact solutions, while ensuring that the exploitability of the iterates is no larger than the approximation error of the algorithm.
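For contrast, one ADO iteration adds only a single best response per player instead of the whole support (same assumed `solve_zero_sum` helper as in the Range of Skill sketch; names are illustrative):

```python
import numpy as np

def ado_iteration(A, row_pop, col_pop, solve_zero_sum):
    """One Anytime Double Oracle iteration (sketch); row_pop and col_pop are sets of indices."""
    rows, cols = sorted(row_pop), sorted(col_pop)
    _, y_unres = solve_zero_sum(A[rows, :])   # unrestricted column strategy (rows restricted)
    x_unres, _ = solve_zero_sum(A[:, cols])   # unrestricted row strategy (columns restricted)
    # Each player adds one best response to the adversary's unrestricted strategy.
    row_pop.add(int(np.argmax(A @ y_unres)))
    col_pop.add(int(np.argmin(x_unres @ A)))
```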
This algorithm is a variant of Anytime Double Oracle in which the restricted games are not solved exactly; instead, Regret Minimization against a Best Response (RM-BR) is used to find two approximate NEs.
RM-BR Double Oracle Algorithm
Let $\Pi^0 = \{\Pi^0_1, \Pi^0_2\}$ be the initial populations of strategies for each player $i = 1, 2$

Repeat at each iteration $t$:

- Compute a restricted $\epsilon$-NE $(\pi_1, \pi_2)$ with RM-BR
- For each player $i$, find a best response $\beta_i \leftarrow \mathbb{BR}_i(\pi_{-i})$ using an oracle that searches in the full strategy space (unrestricted)
- Add the best responses to each player's population: $\Pi^{t+1}_i \leftarrow \Pi^t_i \cup \{\beta_i\}$

Until no novel strategy is added to either population

Return $(\pi_1, \pi_2)$ as the $\epsilon$-NE of the full unrestricted game
With this change, the algorithm now outputs an $\epsilon$-approximate NE, where $\epsilon$ accounts for the approximation error introduced by the regret minimization subroutine.

To compute an $\epsilon$-NE of the restricted games, the RM-BR subroutine works as follows:
RM-BR (for player $i$)

- Initialize $\pi^0_i$
- For $t = 1, \dots, n$:
  - Find a best response $\beta_{-i} \leftarrow \mathbb{BR}_{-i}(\pi^t_i)$
  - Update $\pi^{t+1}_i$ with a regret minimization algorithm from $\pi^t_i$ against $\beta_{-i}$
- Return $\pi^n_i$
In this setting, any RM algorithm would work.
The RM-BR subroutine is implemented by the `solve_rm_br()` function and uses the Exp3[^5] update rule, which works as follows:
Regret Minimization with Exp3 updates
- Initialize the rewards vector $r^0 = 0$
- Let $\gamma \in [0, 1]$ and $\eta > 0$ be the learning parameters
- For $t = 1, \dots, n$:
  - Compute the probability $p_j = (1 - \gamma) \frac{\exp(\eta\, r_j)}{\sum_k \exp(\eta\, r_k)} + \frac{\gamma}{K}$ of selecting the $j$-th strategy, for $j = 1, \dots, K$
  - Sample a strategy $j$ with probability $p_j$ and observe the reward $r(t)$
  - Update the rewards: $r^{t+1}_j \leftarrow r^t_j + r(t)$ and $r^{t+1}_i \leftarrow r^t_i \;\; \forall i \neq j$
- Return $\frac{\exp(\eta\, r^n)}{\sum_k \exp(\eta\, r^n_k)}$
Each probability $p_j$ mixes a softmax over the cumulative rewards (controlled by the learning rate $\eta$) with a uniform exploration term weighted by $\gamma$, where $K$ is the number of strategies in the restricted population.
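Below is a minimal sketch of the RM-BR subroutine for the restricted row player using the Exp3 update above; the function signature and the hyperparameter values are illustrative and may differ from `solve_rm_br()` (see the note on hyperparameter tuning in the introduction):

```python
import numpy as np

def rm_br_exp3(A, row_population, n_iters=2000, gamma=0.1, eta=0.05, rng=None):
    """RM-BR with Exp3 updates for the restricted row player (sketch).

    At every step the (unrestricted) column player best-responds to the current
    restricted row strategy, and Exp3 updates the reward of one sampled row.
    """
    rng = np.random.default_rng() if rng is None else rng
    rows = sorted(row_population)
    K = len(rows)
    r = np.zeros(K)                                    # cumulative rewards
    for _ in range(n_iters):
        w = np.exp(eta * (r - r.max()))                # softmax weights (shifted for stability)
        p = (1 - gamma) * w / w.sum() + gamma / K      # Exp3 sampling probabilities
        # Best response of the minimizing column player to the current row strategy.
        x = np.zeros(A.shape[0])
        x[rows] = p
        br_col = int(np.argmin(x @ A))
        # Sample a row and credit it with its payoff against that best response.
        j = rng.choice(K, p=p)
        r[j] += A[rows[j], br_col]
    w = np.exp(eta * (r - r.max()))
    return w / w.sum()                                 # final strategy over the restricted rows
```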
Plotting the exploitability of the RM-BR Double Oracle iterations against DO and ADO, this is what we get:
We see that the algorithm behaves similarly to ADO, although it fails to converge to the exact NE due to the approximation error introduced by the regret minimization algorithm.
This repository demonstrates adaptations of the Double Oracle algorithm that produce least-exploitable strategies before converging to a Nash equilibrium, which is useful when it is necessary to resort to sub-optimal strategies in time-sensitive applications, or to deal with complex games where enumerating all possible action profiles is infeasible.
One possible limitation of the presented approaches is the need for an exact best-response oracle, which may itself be an intractable problem. To avoid computing exact best responses, it is possible to substitute the exact best response oracle with a Reinforcement Learning algorithm that produces approximate best responses (Policy Space Response Oracles, PSRO)[^6][^4].
Footnotes

[^1]: McMahan et al. Planning in the Presence of Cost Functions Controlled by an Adversary. 2003.
[^2]: Minimax theorem (John von Neumann, 1928). https://en.wikipedia.org/wiki/Minimax_theorem, https://www.cs.ubc.ca/~kevinlb/teaching/cs532l%20-%202011-12/lectures/lect5a.pdf
[^3]: Zinkevich et al. A New Algorithm for Generating Equilibria in Massive Zero-Sum Games. 2007.
[^4]: McAleer et al. Anytime PSRO for Two-Player Zero-Sum Games. 2022.
[^5]: Auer et al. Gambling in a rigged casino: The adversarial multi-armed bandit problem. 1995.
[^6]: Lanctot et al. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. 2017.