
ML | Monte Carlo Tree Search (MCTS)

Last Updated : 01 Aug, 2025

Monte Carlo Tree Search (MCTS) is an algorithm designed for problems with extremely large decision spaces, such as the game Go with its roughly 10^{170} possible board states. Instead of exploring all moves, MCTS incrementally builds a search tree, using random simulations (rollouts) to guide its decisions. It balances exploration of new possibilities against exploitation of known promising paths, focusing computational effort where it matters most and making it highly effective for complex decision-making tasks.

Consider a chess player deciding their next move: they can either pursue a line of play they already know is good (exploitation) or investigate a new line that might prove superior (exploration). MCTS formalizes this trade-off through statistical sampling and incremental tree construction.

The Four-Phase Algorithm

MCTS consists of four distinct phases that repeat iteratively until a computational budget is exhausted:

[Figure: the four phases of MCTS (selection, expansion, simulation, backpropagation)]

Selection Phase: Starting from the root node, the algorithm traverses down the tree using a selection policy. The most common approach employs the Upper Confidence Bounds applied to Trees (UCT) formula, which balances exploration and exploitation by selecting child nodes based on both their average reward and uncertainty.

Expansion Phase: When the selection phase reaches a leaf node that isn't terminal, the algorithm expands the tree by adding one or more child nodes representing possible actions from that state.

Simulation Phase: From the newly added node, a random playout is performed until reaching a terminal state. During this phase, moves are chosen randomly or using simple heuristics, making the simulation computationally inexpensive.

Backpropagation Phase: The result of the simulation is propagated back up the tree to the root, updating statistics (visit counts and win rates) for all nodes visited during the selection phase.
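
In code, one iteration of this loop has the following shape. This is a high-level sketch with hypothetical select, expand, simulate and backpropagate helpers; concrete versions for Tic-Tac-Toe are implemented later in this article:

Python
def mcts_iteration(root):
    # Hypothetical helpers; see the Tic-Tac-Toe implementation below
    node = select(root)          # 1. Selection: descend the tree via UCT
    if not node.is_terminal():
        node = expand(node)      # 2. Expansion: add an unexplored child
    result = simulate(node)      # 3. Simulation: random playout to a terminal state
    backpropagate(node, result)  # 4. Backpropagation: update statistics up to the root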

Mathematical Foundation: UCB1 Formula

The selection phase relies on the UCB1 (Upper Confidence Bound 1) formula to determine which child node to visit next:

\text{UCB1}(i) = \bar{X}_i + c \sqrt{\frac{\ln(N)}{n_i}}

Where:

  • \bar{X}_i is the average reward of node i
  • c is the exploration parameter (typically √2)
  • N is the total number of visits to the parent node
  • n_i is the number of visits to node i

The first term encourages exploitation of nodes with high average rewards, while the second term promotes exploration of less-visited nodes. Because the logarithm grows slowly while n_i grows with every visit, the exploration bonus shrinks for frequently visited nodes, yet every child is still guaranteed to be revisited eventually.
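
To make the formula concrete, here is a small standalone sketch (separate from the game code below) showing how the exploration term can let a rarely visited child outscore a well-visited one with a higher average reward:

Python
import math

def ucb1(avg_reward, parent_visits, child_visits, c=math.sqrt(2)):
    # Exploitation term plus exploration term, as in the formula above
    return avg_reward + c * math.sqrt(math.log(parent_visits) / child_visits)

print(ucb1(0.7, parent_visits=100, child_visits=50))  # ~1.13, well explored
print(ucb1(0.4, parent_visits=100, child_visits=5))   # ~1.76, barely explored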

Python Implementation

Here's a step-by-step implementation of MCTS for the game of Tic-Tac-Toe:

1. Importing Libraries

We will start by importing required libraries:

  • math : to perform mathematical operations like logarithms and square roots for UCB1 calculations.
  • random : to randomly pick moves during simulations (rollouts).
Python
import math
import random

2. MCTS Node Class

We create an MCTSNode class to represent each node (game state) in the search tree. This class contains methods for:

  • __init__(): Initializes board state, parent node, move taken, children, visits, wins and untried moves.
  • get_actions(): Returns a list of all empty cells as possible moves.
  • is_terminal(): Checks if the game is over (winner or no moves left).
  • is_fully_expanded(): Checks if all possible moves have been explored.
  • check_winner(): Determines if any player has won the game.
Python
class MCTSNode:
    def __init__(self, state, parent=None, action=None):
        self.state = state                    # Current board
        self.parent = parent                  # Parent node
        self.action = action                  # Move leading to this node
        self.children = []                    # List of children
        self.visits = 0                       # Visit count
        self.wins = 0                         # Win count
        self.untried_actions = self.get_actions()  # Available moves

    def get_actions(self):
        """Return all empty cells."""
        return [(i, j) for i in range(3) for j in range(3) if self.state[i][j] == 0]

    def is_terminal(self):
        """Check if the game has ended."""
        return self.check_winner() is not None or not self.get_actions()

    def is_fully_expanded(self):
        return len(self.untried_actions) == 0

    def check_winner(self):
        """Find winner (1 or 2) or None."""
        for i in range(3):
            if self.state[i][0] == self.state[i][1] == self.state[i][2] != 0:
                return self.state[i][0]
            if self.state[0][i] == self.state[1][i] == self.state[2][i] != 0:
                return self.state[0][i]
        if self.state[0][0] == self.state[1][1] == self.state[2][2] != 0:
            return self.state[0][0]
        if self.state[0][2] == self.state[1][1] == self.state[2][0] != 0:
            return self.state[0][2]
        return None
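
Before adding the search logic, a quick sanity check of the class so far (illustrative; it uses only the methods defined above):

Python
empty_board = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
node = MCTSNode(empty_board)
print(len(node.get_actions()))  # 9 empty cells
print(node.is_terminal())       # False: no winner, moves remain
print(node.check_winner())      # None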

3. Expansion, Selection, Rollout and Backpropagation

We now add the methods that drive the core MCTS operations. These continue inside the MCTSNode class, hence the indentation:

  • expand() : Adds a new child node for an untried move.
  • get_current_player() : Infers whose turn it is by counting each player's marks on the board.
  • best_child() : Selects the most promising child using the UCB1 formula, balancing exploration and exploitation.
  • rollout() : Plays random moves from the current state until the game ends, simulating the outcome.
  • check_winner_for_state() : Applies the same winner check to a rollout state.
  • backpropagate() : Updates the node's statistics (wins and visits) and propagates them back up to the root.
Python
    def expand(self):
        """Add one of the remaining actions as a child."""
        action = self.untried_actions.pop()
        new_state = [row[:] for row in self.state]
        player = self.get_current_player()
        new_state[action[0]][action[1]] = player
        child = MCTSNode(new_state, parent=self, action=action)
        self.children.append(child)
        return child

    def get_current_player(self):
        """Find whose turn it is."""
        x_count = sum(row.count(1) for row in self.state)
        o_count = sum(row.count(2) for row in self.state)
        return 1 if x_count == o_count else 2

    def best_child(self, c=1.4):
        """Select the child with the best UCB1 score for the player to move."""
        player = self.get_current_player()
        def ucb1(child):
            # Stored wins count player 1's results, so flip them for player 2,
            # which makes the opponent's nodes properly adversarial
            rate = child.wins / child.visits
            if player == 2:
                rate = 1 - rate
            return rate + c * math.sqrt(math.log(self.visits) / child.visits)
        return max(self.children, key=ucb1)

    def rollout(self):
        """Play random moves until the game ends."""
        state = [row[:] for row in self.state]
        player = self.get_current_player()

        while True:
            winner = self.check_winner_for_state(state)
            if winner:
                return 1 if winner == 1 else 0  # 1 = player 1 wins, 0 = player 2 wins

            actions = [(i, j) for i in range(3) for j in range(3) if state[i][j] == 0]
            if not actions:
                return 0.5  # Draw

            move = random.choice(actions)
            state[move[0]][move[1]] = player
            player = 1 if player == 2 else 2

    def check_winner_for_state(self, state):
        """Same winner check for rollout."""
        return MCTSNode(state).check_winner()

    def backpropagate(self, result):
        """Update stats up the tree."""
        self.visits += 1
        self.wins += result
        if self.parent:
            self.parent.backpropagate(result)
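
With the class complete, one expand/rollout/backpropagate cycle can be exercised by hand (illustrative; the rollout result varies because playouts are random):

Python
board = [[1, 2, 0], [0, 0, 0], [0, 0, 0]]
node = MCTSNode(board)
child = node.expand()             # add one child for an untried move
result = child.rollout()          # random playout: 1, 0 or 0.5
child.backpropagate(result)       # update the child and its ancestors
print(child.visits, node.visits)  # 1 1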

4. The MCTS Search Function

Now we implement the mcts_search() function, which performs:

  • Selection : choose a promising node.
  • Expansion : add new nodes for unexplored moves.
  • Simulation (Rollout) : play random games.
  • Backpropagation : update nodes with results.
Python
def mcts_search(root_state, iterations=500):
    root = MCTSNode(root_state)

    for _ in range(iterations):
        node = root

        # Selection
        while not node.is_terminal() and node.is_fully_expanded():
            node = node.best_child()

        # Expansion
        if not node.is_terminal():
            node = node.expand()

        # Simulation
        result = node.rollout()

        # Backpropagation
        node.backpropagate(result)

    return root.best_child(c=0).action  # c=0: pure exploitation, return the best move
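
As a quick illustrative check, on a board where player 1 can win immediately, the search should return the winning cell:

Python
# X (1) to move can win at (0, 2); O (2) threatens (1, 2)
board = [[1, 1, 0],
         [2, 2, 0],
         [0, 0, 0]]
print(mcts_search(board, iterations=500))  # expected: (0, 2)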

5. Play the Tic-Tac-Toe Game

We define the play_game() function, where:

  • Player 1 (MCTS) chooses the best move using MCTS.
  • Player 2 plays randomly for demonstration purposes.
Python
def play_game():
    board = [[0]*3 for _ in range(3)]
    current_player = 1

    print("MCTS Tic-Tac-Toe Demo")
    print("0 = empty, 1 = X, 2 = O\n")

    for turn in range(9):
        for row in board:
            print(row)
        print()

        if current_player == 1:
            move = mcts_search(board, iterations=500)
            print(f"MCTS plays: {move}")
        else:
            empty = [(i, j) for i in range(3) for j in range(3) if board[i][j] == 0]
            move = random.choice(empty)
            print(f"Random plays: {move}")

        board[move[0]][move[1]] = current_player

        if MCTSNode(board).check_winner():
            for row in board:
                print(row)
            print(f"Player {current_player} wins!")
            return

        current_player = 1 if current_player == 2 else 2

    print("Draw!")

6. Run the Game

Python
play_game()

Output:

[Image: sample run output of the Tic-Tac-Toe game]

Expected Performance

When running the above implementation, MCTS demonstrates strong play in Tic-Tac-Toe, even against an optimal opponent. With 500 iterations per move, as used above, the algorithm reliably identifies winning opportunities and avoids losing positions, and the quality of play improves further as the iteration count grows.

AlphaGo, which combines MCTS with deep neural networks that guide and evaluate its simulations, achieved superhuman performance in Go. MCTS's strength lies in its ability to focus computational resources on the most promising areas of the search space.

Practical Applications Beyond Games

MCTS has found applications in numerous domains outside of game playing:

1. Planning and Scheduling: The algorithm can optimize resource allocation and task scheduling in complex systems where traditional optimization methods struggle.

2. Neural Architecture Search: MCTS guides the exploration of neural network architectures, helping to discover optimal designs for specific tasks.

3. Portfolio Management: Financial applications use MCTS for portfolio optimization under uncertainty, where the algorithm balances risk and return through simulated market scenarios.

Limitations and Edge Cases

1. Sample Efficiency: The algorithm requires many simulations to produce reliable estimates, particularly in complex domains, which makes it computationally expensive when quick decisions are needed.

2. High Variance: Random simulations can produce inconsistent results, especially in games with high variance outcomes. Techniques like progressive widening and RAVE (Rapid Action Value Estimation) help mitigate this issue.

3. Tactical Blindness: MCTS may miss short-term tactical opportunities due to its reliance on random playouts. In chess, for example, the algorithm might overlook a forced checkmate sequence if the simulations fail to explore the variations.

4. Exploration-Exploitation Balance: The UCB1 formula requires careful tuning of the exploration constant. Too much exploration leads to inefficient search, while too little can cause the algorithm to get trapped in local optima.
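
To see why this tuning matters, a standalone sketch (illustrative numbers only) shows how the choice of c flips which child UCB1 prefers:

Python
import math

def ucb1(avg, parent_n, child_n, c):
    return avg + c * math.sqrt(math.log(parent_n) / child_n)

# Small c favors the well-tested child; large c favors the rarely visited one
for c in (0.3, 2.0):
    tested = ucb1(0.6, 1000, 400, c)  # 400 visits, 60% win rate
    novel = ucb1(0.3, 1000, 10, c)    # 10 visits, 30% win rate
    print(f"c={c}: tested={tested:.2f}, novel={novel:.2f}")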

