UNDERSTANDING ALPHA GO
How Deep Learning Made the Impossible Possible
ABOUT MYSELF
 M.Sc. in Computer Science, HUJI
 Research interests: Deep Learning in Computer Vision, NLP, and Reinforcement Learning.
 Also DL theory and other ML topics.
 Works at a DL start-up (Imubit)
 Contact: mangate@gmail.com
CREDITS
 A lot of slides were taken from the following publicly
available slideshows:
 https://www.slideshare.net/ShaneSeungwhanMoon/how-alphago-works
 https://www.slideshare.net/ckmarkohchang/alphago-in-depth
 https://www.slideshare.net/KarelHa1/alphago-mastering-the-game-of-go-with-deep-neural-networks-and-tree-search
 Original AlphaGo article:
Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
Available here: http://web.iitd.ac.in/~sumeet/Silver16.pdf
DEEP LEARNING IS CHANGING OUR LIVES
 Search engines (also for images and audio)
 Spam filters
 Recommender systems (Netflix, YouTube)
 Self-driving cars
 Cyber security (and physical security via computer vision)
 Machine translation.
 Speech to text, audio recognition.
 Image recognition, smart shopping
 And more and more and more…
AI VERSUS HUMAN
 In 1997, a supercomputer called Deep Blue (IBM) defeated Garry Kasparov.
 This was the first defeat of a reigning world chess champion
by a computer under tournament conditions.
AI VERSUS HUMAN
 In 2011, Watson, another supercomputer by IBM, "crushed" the two best players in Jeopardy!, a popular question-answering TV show.
GO
 An ancient Chinese game
(2,500 years old!)
 Despite its relatively simple
rules, Go is very complex,
even more so than chess.
 Winning at Go requires a great deal of intuition, so mastering it was considered out of reach for computers for at least another 30 years.
AI VERSUS HUMAN
 In 2016, AlphaGo, a computer program by DeepMind (part of Google), played a five-game Go match against Lee Sedol.
 Lee Sedol:
 A professional 9-dan (the highest ranking in Go), considered among the top 3 players in the world.
 2nd in international titles.
 Won 97 out of 100 games against European Go champion Fan Hui.
AI VERSUS HUMAN
 “I’m confident that I can win, at least this time” – Lee Sedol
 AlphaGo won 4-1
 “I kind of felt powerless… misjudged the capabilities of
AlphaGo” – Lee Sedol
 How is it possible? Deep Learning.
AI IN GAME PLAYING
 Almost every game can be “simulated” with a tree search.
 A move is chosen if it has the best chance of ending in a victory.
AI IN GAMES
 More formally: an optimal value function V*(s)
determines the outcome of the game:
 From every board position (state=s)
 Under perfect play by all players.
 This is done by searching the tree of possible move sequences, which contains roughly b^d positions, where:
 b is the game's breadth (number of legal moves in each position)
 d is the game's depth (game length in moves)
 Tic-Tac-Toe: b ≈ 4, d ≈ 4
 Chess: b ≈ 35, d ≈ 80
TREE SEARCH IN GO
 However, in Go: b ≈ 250, d ≈ 150.
 The resulting search space is more than the number of atoms in the entire universe!
 Go is about a googol (10^100) times more complex than chess!
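As a rough back-of-the-envelope check (my own arithmetic, not from the slides), the tree sizes b^d work out to roughly:

```latex
% Order-of-magnitude estimates of the tree sizes b^d
35^{80}  = 10^{80 \log_{10} 35}  \approx 10^{123} \quad \text{(chess)}
\qquad
250^{150} = 10^{150 \log_{10} 250} \approx 10^{360} \quad \text{(Go)}
```

Both numbers dwarf the roughly 10^80 atoms in the observable universe, and the Go tree is astronomically larger than the chess tree.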
KEY: REDUCE THE SEARCH SPACE
 Reducing b (the space of possible actions)
KEY: REDUCE THE SEARCH SPACE
 Reducing d – evaluate positions ahead of time, instead of simulating all the way to the end.
Both reductions are done with Deep Learning.
SOME CONCEPTS
 Supervised Learning (classification)
 Given data, predict a class (i.e., choose one option out of a known number of options)
SOME CONCEPTS
 Supervised Learning (regression)
 Given data, predict some real number
SOME CONCEPTS
 Reinforcement Learning
 Given a state (observation), perform an action that leads toward the goal (e.g., winning a game)
SOME CONCEPTS
 CNNs (Convolutional Neural Networks) are able to learn abstract features of a given image
REDUCING ACTION CANDIDATES
 Done by learning to “imitate” expert moves
 Data: online games by Go experts; 160K games, ~30M positions.
 This is supervised classification (given a position, predict the expert's move out of all legal ones), as sketched below.
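To make the classification setup concrete, here is a minimal sketch in PyTorch of a policy network trained to imitate expert moves. It is an assumption-laden illustration, not the paper's 13-layer architecture: the 48 input feature planes follow the paper, but the class name, channel count, and placeholder tensors are mine.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Outputs one logit per board point (19 x 19 = 361 possible moves)."""
    def __init__(self, in_planes=48, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1),      # one logit per board point
        )

    def forward(self, board_planes):                    # (batch, 48, 19, 19)
        return self.conv(board_planes).flatten(1)       # (batch, 361)

# One supervised step: cross-entropy against the index of the expert's move.
policy = PolicyNet()
opt = torch.optim.SGD(policy.parameters(), lr=0.01)
boards = torch.randn(32, 48, 19, 19)                    # placeholder encoded positions
expert_moves = torch.randint(0, 361, (32,))             # placeholder expert move labels
loss = nn.CrossEntropyLoss()(policy(boards), expert_moves)
opt.zero_grad()
loss.backward()
opt.step()
```

Training simply maximizes the probability the network assigns to the move the expert actually played.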
REDUCING ACTION CANDIDATES
 This deep CNN achieved 55% test accuracy on predicting
expert moves.
 Imitators with no Deep Learning reached only 22% accuracy.
 Small improvements in accuracy led to big improvements in playing ability.
ROLLOUT NETWORK
 An additional, smaller imitation network (P_π) is trained.
 This network achieves only 24.2% accuracy.
 However, it runs roughly 1,000 times faster (2 μs compared to 3 ms per move).
 This network is used for rollouts (explained later).
IMPROVING THE NETWORK
 Improve the imitator network through self-play (Reinforcement Learning).
 An entire game is played, and the parameters are updated according to the result, as sketched below.
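A rough sketch of what one such update could look like, assuming a plain REINFORCE-style policy gradient applied once per finished game; the paper's opponent pool, batching, and baseline are omitted, and `reinforce_update` with its arguments is a hypothetical helper, not the authors' code.

```python
import torch

def reinforce_update(policy, optimizer, game_states, game_moves, z):
    """Update the policy after one finished self-play game.

    game_states: encoded positions where the current player moved, shape (T, 48, 19, 19)
    game_moves:  indices of the moves actually played, shape (T,)
    z:           +1.0 if that player won the game, -1.0 otherwise
    """
    log_probs = torch.log_softmax(policy(game_states), dim=1)        # (T, 361)
    chosen = log_probs.gather(1, game_moves.unsqueeze(1)).squeeze(1)
    loss = -(z * chosen).mean()   # reinforce moves from won games, discourage lost ones
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```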
IMPROVING THE NETWORK
 Keep generating better models by self-playing newer models against older ones.
 The final network also won 85% of games against the best Go software (the model without self-play won only 11%).
 However, this model was eventually not used directly during the matches; it was used to generate data for the value function.
REDUCING SEARCH DEPTH - DATASET
 Self-play with the imitator model for a random number of moves (0 to 450).
 Then make one random move; the resulting position is the starting position 's'.
 Self-play until the end of the game with the RL network (latest model).
 If black won, z = 1; otherwise z = 0.
 Save (s, z) to the dataset.
 30M (s, z) pairs were generated from 30M games, roughly as sketched below.
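A sketch of this data-generation recipe under stated assumptions: `new_game`, `sample_move`, `random_move`, and `play_out` are hypothetical helpers standing in for the game engine, sampling a move from the SL (imitator) policy, playing one uniformly random legal move, and finishing the game with the RL policy.

```python
import random

def generate_value_example(sl_policy, rl_policy, new_game, sample_move, random_move, play_out):
    """Produce one (position, outcome) pair for training the value network."""
    board = new_game()
    # 1. Play a random number of opening moves with the imitator (SL) policy.
    for _ in range(random.randint(0, 450)):
        board = sample_move(sl_policy, board)
    # 2. Make one random legal move; the resulting position s is the training input.
    s = random_move(board)
    # 3. Finish the game with the strongest RL policy playing both sides.
    winner = play_out(rl_policy, s)
    # 4. Label: z = 1 if black won, else 0.
    z = 1 if winner == "black" else 0
    return s, z
```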
REDUCING SEARCH DEPTH – VALUE FUNCTION
 A regression task: for a given position s, output a number between 0 and 1 (sketched below).
 Now, for each possible position we can have an evaluation of
how “good” it is for the black player.
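A minimal sketch of such a value network: the same convolutional style as before, ending in a sigmoid so the output lands in [0, 1], trained on the (s, z) pairs. Layer sizes and placeholder data are again illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Maps an encoded position to a predicted win probability for black."""
    def __init__(self, in_planes=48, channels=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * 19 * 19, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, board_planes):                              # (batch, 48, 19, 19)
        return self.head(self.features(board_planes)).squeeze(1)  # (batch,)

# One regression step on a batch of (s, z) pairs.
value_net = ValueNet()
opt = torch.optim.SGD(value_net.parameters(), lr=0.01)
s = torch.randn(32, 48, 19, 19)                  # placeholder encoded positions
z = torch.randint(0, 2, (32,)).float()           # placeholder game outcomes
loss = nn.MSELoss()(value_net(s), z)
opt.zero_grad()
loss.backward()
opt.step()
```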
REDUCING SEARCH SPACE
PUTTING IT ALL TOGETHER - MCTS
 At game time, a method called Monte Carlo Tree Search (MCTS) is applied.
 This method has 4 steps:
 Selection
 Expansion
 Evaluation
 Backup (update)
 For each move in the game, this process is repeated about 10K times (see the sketch below).
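A high-level sketch of that loop; the four step functions are placeholders that the following slides flesh out, and the `mcts` signature and node fields are my own, not the paper's implementation.

```python
def mcts(root, select, expand, evaluate, backup, num_simulations=10_000):
    """Decide one move: repeat select -> expand -> evaluate -> backup,
    then play the most-visited action from the root."""
    for _ in range(num_simulations):
        path = select(root)        # walk down the tree maximizing Q + u(P)
        leaf = expand(path[-1])    # add children using the imitator (policy) network
        value = evaluate(leaf)     # value network + fast rollout
        backup(path, value)        # update N(s, a) and Q(s, a) along the path
    # Final decision: the action with the highest visit count from the root.
    return max(root.children, key=lambda a: root.children[a].visits)
```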
MCTS - SELECTION
 At each simulation step we start from the current board position (the root of the search tree).
 An action is selected using a combination of the imitator network's prior P(s,a) and an action value Q(s,a), which is set to 0 at the start:
a = argmax_a [ Q(s,a) + u(s,a) ],   where   u(s,a) ∝ P(s,a) / (1 + N(s,a))
 Dividing by N(s,a), the number of times a state/action pair has been visited, encourages diversity (exploration), as sketched below.
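A sketch of this selection rule under those assumptions; the node fields `prior`, `visits`, and `q_value`, and the exploration constant `c_puct`, are illustrative rather than the paper's exact parameterization.

```python
def select_action(node, c_puct=5.0):
    """Pick the child maximizing Q(s, a) + u(s, a), where u(s, a) ∝ P(s, a) / (1 + N(s, a))."""
    best_action, best_score = None, float("-inf")
    for action, child in node.children.items():
        u = c_puct * child.prior / (1 + child.visits)   # prior P(s, a) from the imitator network
        score = child.q_value + u                       # Q(s, a) starts at 0 for fresh edges
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```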
MCTS - EXPANSION
 When building the tree, a position can be expanded once (creating new leaves in the tree) using the imitator network.
 This way we have the priors P(s,a), and hence u(s,a), for the next searches.
MCTS - EVALUATION
 After simulating 3-4 steps with the imitator network, we evaluate the resulting board position.
 This is done in two ways:
 The value network prediction.
 Using the smaller imitator
network to self-play to the end
(rollout), and save the result
(1 for black win 0 for white)
 Both evaluations are combined to give this board position a number between 0 and 1 (see the sketch below).
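A sketch of the leaf evaluation, assuming the two signals are simply mixed with a weight λ (the paper reports an equal mix, λ = 0.5); `value_net`, `rollout_policy`, and `play_rollout` are placeholder names.

```python
def evaluate_leaf(position, value_net, rollout_policy, play_rollout, lam=0.5):
    """Combine the value-network prediction with a fast rollout result."""
    v = value_net(position)                      # predicted win probability in [0, 1]
    z = play_rollout(rollout_policy, position)   # 1 if black wins the rollout, else 0
    return (1 - lam) * v + lam * z               # mixed leaf value, still in [0, 1]
```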
MCTS – BACKUP (UPDATE)
 After the simulation we update the tree (see the sketch below).
 Update Q(s,a) (which started at 0) using the value computed from the value network and the rollouts.
 Update N(s,a): increase it by one for each state/action pair visited.
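A sketch of the backup step, keeping Q as the running mean of the leaf values seen through each visited node (same illustrative node fields as in the selection sketch).

```python
def backup(path, leaf_value):
    """Update visit counts and mean action values along the traversed path."""
    for node in path:
        node.visits += 1
        # Incremental mean: Q <- Q + (value - Q) / N
        node.q_value += (leaf_value - node.q_value) / node.visits
```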
CHOOSING AN ACTION
 For each move during the game, MCTS is run about 10K times.
 In the end, the action that was visited the most times from the root position (the current board) is taken.
 Notes:
 Since this process is slow, the smaller network had to be used for rollouts to keep it feasible (otherwise each move would have taken the computer several days to compute).
 The imitator (SL) network worked better than the RL network for choosing the candidate actions, probably because humans play a more diverse set of moves.
ALPHA GO WEAKNESSES
 In the 4th game, Lee Sedol steered the board into a position that was not in AlphaGo's search tree, causing the program to choose worse moves and eventually lose the game.
 Most assumptions made for AlphaGo do not hold in real-life RL problems. See:
https://medium.com/@karpathy/alphago-in-context-
c47718cb95a5
RETIREMENT
 In May 2017, AlphaGo defeated Ke Jie, the world's top-ranked player, 3-0.
 Google's DeepMind unit announced that this would be the last competitive match the AI plays.
SUMMARY
 To this day, AlphaGo is considered one of the greatest AI
achievements in recent history.
 This achievement was made by combining Deep Learning with standard methods (like MCTS) to "simplify" the very complex game of Go.
 4 Deep Neural Networks were used:
 3 almost identical Convolutional Neural Networks:
 Imitating network for action space reduction.
 RL network created through self-play, for generating the dataset
for the value network.
 Value network for search depth reduction.
 1 small network for rollouts.
 Deep Learning keeps achieving amazing new results every day, and is one of the fastest-growing fields in both academia and industry.
QUESTIONS?
Thank you!