The history of
Artificial Intelligence
From Leonardo da Vinci to Chat-GPT
Maciej Świechowski
**Copyright Information**
Imprint: Independently published at Amazon KDP
The history of Artificial Intelligence: From Leonardo da Vinci to Chat-GPT
by Maciej Świechowski
Copyright ©2024 by Maciej Świechowski
All rights reserved.
No part of this book may be reproduced, distributed, or transmitted in any form or
by any means, including photocopying, recording, or other electronic or mechanical
methods, without the prior permission of the publisher, except in the case of brief
quotations embodied in critical reviews and certain other noncommercial uses
permitted by copyright law. For permission requests, write to the publisher at the
address:
[email protected]
Corresponding paperback ISBN: 9798345671870
Corresponding hardcover ISBN: 9798345852767
First Edition: November, 2024
Cover design by Maciej Świechowski
Edited by Maciej Świechowski
Dedication
The inspiration for this book came to me while sitting by Lake Como in Italy,
admiring the view – a moment where AI and serene nature, though seemingly
worlds apart, connected in my mind. I am grateful to everyone who supported me
throughout the writing process and remained by my side. I dedicate this book to my
parents: Elżbieta and Zbigniew.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1 Book Structure and Conventions 11
1.2 To Whom is the Book Addressed 13
2 Before Computers - Until 1950s . . . . . . . . . . . . . . . . . . . . . 14
2.1 Leonardo da Vinci’s Automata (1478 ∼ 1519) 14
2.2 Bayes’ Theorem (1763) 15
2.3 The Mechanical Turk (1770) 17
2.4 Principal Component Analysis (1901) 19
2.4.1 The PCA Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 The Min-Max Algorithm and Game Theory (1928/1944) 22
2.5.1 Combinatorial Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.6 IRIS Dataset (1936) 26
2.7 The First Mathematical Model of a Neuron (1943) 27
2.8 Logistic Regression (1944 ∼ 1958) 28
2.9 A Mathematical Theory of Communication (1948) 31
2.9.1 The impact on Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.10 Hebbian Learning (1949) 32
3 Years 1950 - 1964 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1 Alan Turing’s Work (1950+) 34
3.2 K-Nearest Neighbors (1951) 37
3.3 Development of the First Clustering Methods (1950 ∼ 1960) 38
3.3.1 K-means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.2 Lloyd’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4 The AI Term is Coined - AI Recognized as a Research Field (1956) 40
3.5 A Markovian Decision Process (1957) 42
3.5.1 Bellman Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.2 Dynamic Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.3 Markov Decision Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Perceptron (1958) 44
3.6.1 The Perceptron as a Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6.2 The Perceptron as a Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.7 Samuel’s Checkers (1959) 47
3.7.1 Heuristic Evaluation Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.7.2 Learning in Samuel’s Checkers Program . . . . . . . . . . . . . . . . . . . . . . 48
3.8 Programs with Common Sense (1959) 49
3.9 General Problem Solver (1961) 51
4 Years 1965 - 1974 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.1 Fuzzy Sets (1965) 54
4.2 Deep Learning Foundations (∼ 1965) 57
4.3 Expert Systems (1965) 59
4.4 ELIZA (1966) 61
4.5 A* Algorithm (1969) 64
4.5.1 Elementary Definitions from Graph Theory . . . . . . . . . . . . . . . . . . . . 64
4.5.2 The A* Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6 AI Philosophy Seminal Paper (1969) 68
4.7 The Precursor of Simulated Annealing (1970) 69
4.8 STRIPS Planning (1971) 72
4.9 Prolog (1972-1973) 74
4.9.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.9.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.10 The First AI Winter (1974) 77
5 Years 1975 - 1985 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1 The Recognition of Genetic Algorithms (culmination in 1975) 79
5.1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.2 The Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.2 Hierarchical Partial-Order Planning (1975) 84
5.3 Alpha-Beta Pruning (1975) 86
5.4 Harpy Speech Recognition System (1976) 89
5.5 The EM Algorithm and Mixture of Gaussians (1977) 90
5.5.1 Gaussian Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.6 Stanford Cart - Predecessor of Autonomous Vehicles (1979 breakthrough) 92
5.7 LISP Machines (1979 ∼ 1980) 93
5.8 The First AAAI and ICML Conferences (1980) 95
5.9 Backpropagation in Multi-Layer Neural Networks (1982) 96
5.10 Hopfield Network (1982) 99
5.11 Boltzmann Machines (1985) 101
5.11.1 Restricted Boltzmann Machines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6 Years 1986 - 1991 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.1 Decision Trees (1986) 104
6.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.1.2 The ID3 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.2 Artificial Immune Systems (1986) 108
6.3 Learning Representations By Back-Propagating Errors (1986) 110
6.4 The First Autonomous Car Project (1987-1995) 110
6.5 The Second AI Winter (1987) 112
6.6 Probabilistic Reasoning and Bayesian Networks (1988) 113
6.7 TD-Learning (1988) 116
6.8 Convolutional Neural Networks (1989-1998) 117
6.9 Backpropagation through Time (1990) 121
6.10 Ant Colony Optimization (1991) 122
6.11 Autoencoders (1991) 124
6.12 Python (1991) 127
7 Years 1992 - 1997 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.1 TD-Gammon (1992) 129
7.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.1.2 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.2 R Language (1993) 132
7.3 Association Rules Mining (1993) 133
7.4 Particle Swarm Optimization (1995) 135
7.5 Support Vector Machines (1995) 137
7.6 Artificial Intelligence: A Modern Approach (1995) 139
7.7 Covariance Matrix Adaptation Evolution Strategy (1996) 140
7.8 Computing with Words (1996) 142
7.9 Long Short-Term Memory (1997) 146
7.10 AdaBoost (1997) 148
7.11 Deep Blue (1997 breakthrough) 150
8 Years 1998 - 2006 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.1 Reinforcement Learning: An Introduction (1998) 152
8.2 Random Forests (2001) 154
8.3 Neuro-Evolution of Augmenting Topologies (2002) 156
8.4 Behavior Trees (2003) 159
8.5 The Growing Field of Robotics (2002-2005) 162
8.6 DARPA Grand Challenge (2004) 163
8.7 Graph Neural Networks (2005) 165
8.8 General Game Playing Competition (2005) 168
8.9 Monte Carlo Tree Search (2006) 170
9 Years 2007 - 2013 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
9.1 Checkers is Solved! (2007) 174
9.2 GPGPU Programming (2007) 176
9.3 Stockfish (2008) 177
9.4 Netflix Prize Claimed (2009) 179
9.5 Kaggle Platform (2010) 180
9.6 IBM Watson (2011 breakthrough) 182
9.6.1 The Jeopardy! Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
9.6.2 The Post-Jeopardy! Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
9.7 Deep Learning Revolution (2012) 187
9.8 Deep Q-Learning (2013) 189
9.9 Word2Vec (2013) 192
10 Years 2014 - 2016 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
10.1 Generative Adversarial Networks (2014) 196
10.2 Further Advancements in Neural Network Models (2014-2015) 199
10.2.1 Adam Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
10.2.2 Sequence to Sequence Learning . . . . . . . . . . . . . . . . . . . . . . . . . . 199
10.2.3 Residual Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
10.3 The Rise of Explainable AI (2015+) 203
10.4 Heads-up Limit Hold’em Poker is Solved (2015) 207
10.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
10.4.2 The Process of Solving The Game . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
10.4.3 Note on Poker-Playing Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
10.5 Waymo Autonomous Car (2015 breakthrough) 211
10.6 Extreme Gradient Boosting (2015) 213
10.7 Google’s Neural Machine Translation (2016) 216
10.8 AlphaGo (2016) 219
10.8.1 AlphaGo vs. Lee Sedol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
10.8.2 AlphaGo’s Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
10.8.3 AlphaGo’s Successors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
11 Years 2017 - 2024 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
11.1 Transformers and Attention (2017) 224
11.2 The First Large Language Models (2018) 227
11.2.1 GPT-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
11.2.2 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 229
11.3 Soft Actor-Critic (2018) 230
11.3.1 Method Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
11.3.2 Mathematical Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
11.4 OpenAI Five (2018) 234
11.5 AlphaStar (2018) 239
11.6 Emergent Tool Use From Multi-Agent Interaction (2019) 243
11.7 AlphaFold (2020 breakthrough) 247
11.7.1 AlphaFold Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
11.8 Chat-GPT and the Large Language Models Revolution (2022+) 251
11.8.1 Text to Image Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
11.9 Nobel Prizes (2024) 257
1. Introduction
At the time of writing this book, AI is one of the hottest topics of discourse – both
in everyday conversations and on high-level political agendas. It is often said that
AI already is, or will soon be, the next major revolution after the Internet and the
industrial revolutions [167]. It is no longer a domain restricted to mathematicians
and computer scientists. AI techniques have become so effective in many areas that
countries and organizations are beginning to regulate them. In this book, I will guide
you through how it all unfolded, from its very beginnings to the latest advancements
in the field.
There is no single definition of artificial intelligence (AI). Moreover, philosophers
and psychologists debate how the more fundamental concept of natural intelligence
should be defined [220]. As will be shown later in this book, the term “artificial
intelligence” was coined in 1955 and was described then as “the science of making
machines do things that would require intelligence if done by men”. The term
AI has different practical meanings depending on the context in which it is used.
One context is socio-philosophical, where we compare AI to human intelligence, asking whether AI systems can behave and perform intelligence-based tasks at least as effectively as we can. Progress in this endeavor is commonly described in the following stages:
• Narrow AI (“Weak AI”) - constructed for only one or a few specific tasks. Historically, all applied AI has been narrow AI.
• General AI (“Artificial General Intelligence” (AGI)) - extends to all tasks that
require human-like intelligence. Various tests have been proposed to verify
whether AGI has been achieved such as the Turing Test (which will be later
described) or the Coffee Test [292].
• Superintelligence (“Superhuman AI”) - occurs when AGI surpasses humans
in all tasks. The emergence of superintelligence would lead to the so-called
singularity, in which all scientific progress would be carried out by machines.
In general, this is a very complex issue, as it is unclear whether our empirical perception of intelligence alone is sufficient to consider machines truly intelligent. Opponents of such a functional definition argue that natural intelligence requires the so-called qualia – subjective, conscious experience – as well as a body and an internal motivation to exist and formulate objectives.
The second usage of the term AI is as a subfield of computer science. Here, it
concerns making computers solve problems without being explicitly programmed to
do so. It is said that these problems are difficult to algorithmize, i.e., there exists
no step-by-step recipe to solve them, and a more sophisticated, indirect approach is
required. This approach is goal-driven, meaning that one must define a goal with a
performance metric to evaluate how well it is met. Regardless of a particular goal,
the same AI method can be employed, perhaps with a dedicated parameterization
and small changes in the setup, thanks to its generality. Most AI techniques are
meta-methods capable of achieving various goals rather than procedures tailored for
specific ones.
There are three main pillars (ideas for such meta-methods) for coming up with
a solution to a given problem. An AI system may utilize one or a combination of
them:
1. Statistical learning - based on which machine learning has emerged. It
requires past data or a procedure to gather the data dynamically during the
training process. There is a distinct training phase during which the system
learns to better capture the relationships between the data, allowing it to make
inferences about future (yet unseen) data.
2. Intelligent search - encompasses a family of methods that iteratively search
an abstract space. This could be the space of possible solutions to the given
problem or some kind of space of intermediate representations that can be
translated into solutions. In this approach, we feed the AI system with the
mechanics (nature) of the given problem rather than past data.
3. Expert knowledge - focuses on representing the rules and patterns provided
by human experts, allowing the AI system to solve a problem based on
continuous execution of these rules, applying their effects, and analyzing
how they interplay with each other. Expert knowledge can also be embedded
into machine learning and search-based systems in the form of heuristics or
constraints that need to be preserved.
1.1 Book Structure and Conventions
Although there’s a precise date when the term “AI” was proposed, it is not easy
to pinpoint a single event that started the field. From a modern perspective, it
is essential to include the mathematical foundations that helped shape artificial
intelligence as we know it today. Thus, I have decided to include these mathematical
foundations and some of the inspirations that preceded the actual development of AI.
These are mainly concentrated in Chapter 2. However, I have chosen not to include
any fictional works, despite the fact that science fiction can also serve as a source of
inspiration. For example, the movie “2001: A Space Odyssey” directed by Stanley
Kubrick, featured an AI system named HAL 9000, which had a significant cultural
impact on the perception of AI. It continues to inspire discussions about ethical and
philosophical implications. Isaac Asimov is another example of an inspiring science
fiction author. He introduced the “Three Laws of Robotics”. The first law states that
a robot may not injure a human being, or, through inaction, allow a human being to
come to harm. The second law says that a robot must obey the orders given to it by
human beings, except where such orders would conflict with the First Law. Finally,
a robot must protect its own existence as long as such protection does not conflict
with the First or Second Law.
The book is organized into ten regular chapters (not counting the Introduction
and Conclusions), which appear in chronological order. I wanted to emphasize the
timeline of AI advancement, so one downside of this organization is that there will
be frequent jumps between topics. I hope that the diverse topics will engage the
reader rather than cause a loss of focus.
Each chapter is divided into sections, with a total of 99 sections. They encompass
various important milestones in the history of AI, which have been subjectively
selected by the author. Most belong to one or more of the following categories:
1. Inspirations - e.g., a very inspiring book that’s important to the community
was published.
2. Theoretical results - e.g., a new AI technique or method was proposed.
3. Milestone events - e.g., a computer program defeated the top human player in
chess.
4. Transformative applications - e.g., contributions to autonomous vehicles.
Additional writing conventions in this book:
• Some terms will be written in italics when introduced for the first time; subsequent uses are written normally.
• Opinions vary on whether to write “artificial intelligence” or “Artificial
Intelligence”. Originally, it was typically not capitalized, but later, more
experts started to capitalize these words to emphasize the field’s significance.
Some distinguish between when it refers to a common noun versus a proper
noun. In this book, artificial intelligence will be written in lower case like we
write “biology” or “geography”.
• All illustrations presented in the book are either credited to their original authors with license attributions or were created by me (such as all the diagrams).
• Bibliographic references are provided as numbers in square brackets following
the text, e.g., [1].
• Some sections will be more detailed, while others are intended to be overviews.
This depends on the scope of the material, the availability of sources, and the
author’s subjective choice for presenting a particular topic.
1.2 To Whom is the Book Addressed
The level of technical detail will vary depending on the focus of each section. Some
sections delve into technical contributions, such as novel methods or foundational
mathematical equations. To fully understand these, readers are recommended to
have a basic understanding of first-year college mathematics and the ability to read
pseudo-code. Other sections, like seminal book overviews, are less technically
demanding. Some parts of the book explore significant historical events, such as
important conferences or human-versus-machine competitions in well-known games.
Additionally, there are sections on curiosities that have inspired AI researchers.
These sections can be read as though they were in a general history book.
Although this book is intended for a general audience, the ideal readers include
students in technical fields, early-career researchers, junior AI/ML engineers, and
technical directors seeking a broader perspective on the field. Students might find
this book useful for inspiration in their projects or when choosing topics for master’s
or bachelor’s theses. This book can also be valuable to teachers, AI evangelists,
and science popularizers. A common misconception is to equate AI with large
language models like ChatGPT, but AI encompasses much more, as this book will
demonstrate.
2. Before Computers - Until 1950s
2.1 Leonardo da Vinci’s Automata (1478 ∼ 1519)
Leonardo da Vinci (1452 - 1519) is one of the greatest talents to ever grace our planet.
He is the epitome of genius, famous for his paintings, sculptures, architectural designs,
drawings, mechanical devices, philosophical thoughts and theories. He inspired generations of people of various professions, from artists to engineers and scientists.
Leonardo is reputed to have invented the world’s first robots [198]: a self-
propelled cart, the Mechanical Knight and the Mechanical Lion. The cart was a
design included in Codex Atlanticus - a famous 12-volume set containing over 1000
drawings and texts of various subjects dating from 1478 to 1519.
The Mechanical Knight, designed around 1495, was a humanoid construct
wearing armor in the German-Italian medieval style (see Fig. 2.1). It could perform
a few basic tasks such as standing up and sitting down. It could move its arms (in four degrees of freedom), legs (in three degrees of freedom) and head, and allegedly had
an anatomically correct jaw. The whole movement was achieved with the use of
cranks and cables connected to selected areas in the body. The Mechanical Knight
was later faithfully reconstructed as reported in [250] and it appeared to be functional.
It is unclear whether the automaton was built in Leonardo’s times.
The Mechanical Lion is believed to have been constructed for the French king Francois I in the early 1500s [198]. The purpose of the lion was to entertain the king and his prominent guests at royal spectacles and important events such as the wedding of Maria de Medici. The lion could allegedly walk and move its head without human supervision. It may have used a mechanism that required winding up beforehand.
Figure 2.1: Mechanical Knight in Da Vinci Museum, Florence, Italy. Credit to Cory Doctorow. Shared publicly under the Attribution-ShareAlike (CC BY-SA 2.0) license (https://creativecommons.org/licenses/by-sa/2.0/).

Despite the fact that the automata invented by Leonardo are relatively simple compared to modern robots and lack autonomous decision-making capabilities, they are proof that people – even a long time ago – dreamt about artificial creations that can mimic (at least partially) the behavior of humans or animals. I consider these inventions to symbolize the first spark of inspiration for automatization and robotics.
Ideas always come first. This is why I decided to start our journey from the ideas
that preceded the formal inception of artificial intelligence by almost five centuries.
For more early inspirations, the readers are recommended to explore a book called
“The Sciences in Enlightened Europe” [44].
2.2 Bayes’ Theorem (1763)
Thomas Bayes (who lived circa 1701 - 1761) studied logic and theology at the University of Edinburgh. Initially, he worked as a Presbyterian minister but later became one of the most famous English mathematicians, despite not publishing many works [15].
He is most famous for the contribution now known as Bayes’ theorem [109] (see Eq. 2.1). It was introduced in “An Essay towards solving a Problem in the Doctrine of Chances” [14], which was published two years after Bayes’ death.
$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)} \qquad (2.1)$$
where $P(A \mid B)$ denotes the conditional probability of $A$ occurring given that $B$ occurred, whereas $P(B \mid A)$ is the conditional probability of $B$ given $A$. $P(A)$ and $P(B)$ are the unconditional probabilities of the single events (also referred to as the marginal probabilities).
In statistics, the theorem is often used to calculate the probabilities of hypotheses
under evidence. For example, it is used in this way in drug testing [69], spam
filtering [187], fault diagnosis [33], A/B testing [11] and for various other applications.
Herein, the theorem would often be written using 𝐻 for the hypothesis and 𝐸 for the
evidence as follows:
$$P(H \mid E) = \frac{P(E \mid H) \times P(H)}{P(E)} \qquad (2.2)$$
Bayes’ theorem has had an enduring impact, not only on applied statistics but also on machine learning (ML) - a rapidly advancing field within artificial intelligence. ML concerns creating statistical models that are trained on data instead of being explicitly programmed. They are then applied (inference) to new data to provide estimations, predictions or decisions, or to find patterns, depending on the type of task and the type of learning applied.
Let’s assume that an ML model was trained to classify data, i.e., to assign specific labels (classes) to data. Interchangeably, it can be said that this problem is the prediction of a class. The idea of classification is shown in Figure 2.2. For such models, Bayes’ theorem is rewritten as shown in Equation 2.3, which is one of the cornerstone equations in AI/ML.
Figure 2.2: Single-label classification problem with three classes.
$$\text{Posterior probability} = \frac{\text{Likelihood} \times \text{Class prior probability}}{\text{Evidence}} \qquad (2.3)$$
Below, please find a practical description of the particular elements of the
formula:
• Posterior - the probability of a hypothesis given the evidence. In the context of
machine learning: the probability of the class returned by the model (output)
given the provided data (input).
• Likelihood - the probability of observing the evidence given the hypothesis.
In machine learning: it represents the probability of a specific set of data
occurring given the assumed model. In practice, this is often computed as the
product of individual probabilities of features in data given model parameters.
Additionally, it is common to work with the natural logarithm of the likelihood,
referred to as log-likelihood, for the ease of computation.
• Prior - the probability of a hypothesis from our general knowledge of the
problem. In machine learning: this is the probability of a given class if we did
not have any additional information. In practice, it is often calculated as the
number of particular classes divided by all instances of the problem. Prior
does not take current data into account, hence the name - this is the probability
prior to observing the particular data.
• Evidence - the total probability of the evidence, for all possible hypotheses 𝐻.
In machine learning: also called the predictor prior probability. It is equal
to the probability of the data occurring regardless of the output of the model.
This is the normalization factor when considering all possible outputs. If
the equation is used only to find such 𝐻 that maximizes 𝑃(𝐻|𝐸) and not the
actual probability, then this factor can be omitted.
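To make these four elements concrete, here is a minimal Python sketch that applies Bayes’ theorem to a hypothetical diagnostic-test scenario; all numbers are invented for illustration only.

```python
# A minimal illustration of Bayes' theorem with hypothetical numbers:
# a diagnostic test with 99% sensitivity, a 5% false-positive rate,
# and a disease prevalence (prior) of 1%.
p_h = 0.01                      # prior P(H): probability of the hypothesis (disease)
p_e_given_h = 0.99              # likelihood P(E|H): positive test given disease
p_e_given_not_h = 0.05          # P(E|not H): positive test given no disease

# Evidence P(E): total probability of observing a positive test
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior P(H|E): probability of disease given a positive test
p_h_given_e = p_e_given_h * p_h / p_e
print(f"P(H|E) = {p_h_given_e:.3f}")   # approximately 0.167
```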
Bayes’ theorem, apart from being at the core of ML tasks, is particularly fundamental to so-called Bayesian models [303] such as Maximum a Posteriori Estimation (MAP), Markov Chain Monte Carlo (MCMC) or Gaussian processes.
There is a family of very popular and relatively simple models called Naive Bayes classifiers [243]. Herein, the goal is to find the correct class $C$ for the input feature vector $X = (x_1, x_2, \dots, x_n)$ that describes the evidence. The term naive corresponds to the assumption that the features are all conditionally independent. For Naive Bayes, the theorem (omitting the evidence term) can be rewritten as follows:
$$P(C \mid X) = P(x_1 \mid C) \times P(x_2 \mid C) \times \dots \times P(x_n \mid C) \times P(C) \qquad (2.4)$$
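As an illustration of Equation 2.4, below is a minimal, hand-rolled Naive Bayes sketch in Python for binary features. The tiny dataset, the helper name naive_bayes_predict, and the Laplace smoothing constant are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

# Hand-rolled Naive Bayes for binary features, mirroring Equation 2.4:
# the (unnormalized) posterior is the product of per-feature likelihoods
# P(x_i | C) times the class prior P(C). The data below is invented.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0])          # class labels for the five samples

def naive_bayes_predict(x_new, X, y, alpha=1.0):
    scores = {}
    for c in np.unique(y):
        Xc = X[y == c]
        prior = len(Xc) / len(X)                       # P(C)
        # Laplace-smoothed P(x_i = 1 | C) for every feature
        p1 = (Xc.sum(axis=0) + alpha) / (len(Xc) + 2 * alpha)
        likelihood = np.prod(np.where(x_new == 1, p1, 1 - p1))
        scores[c] = likelihood * prior                 # unnormalized posterior
    return max(scores, key=scores.get), scores

label, scores = naive_bayes_predict(np.array([1, 0, 0]), X, y)
print(label, scores)
```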
In summary, over 250 years later, Bayes’ theorem stands as one of the most
crucial mathematical foundations for artificial intelligence techniques.
2.3 The Mechanical Turk (1770)
In this section, I will present the history of The Mechanical Turk [281, 44], also referred to as The Turk or the Automaton Chess Player. At the time of its construction, it was considered the first automated chess-playing machine. It later turned out to be a hoax - at least in the chess-playing department - although it was nonetheless a spectacular achievement of art and engineering.
The Turk was designed to give the appearance and dimensions of a cabinet at which a mannequin automaton resembling the upper body of a person was seated. The figure was modeled after an oriental male, hence the name The Turk, and positioned as if it were playing chess at the cabinet. The Turk could nod its head and move its arms, and the arms were able to move the chess pieces on the board, which strengthened the illusion of autonomy. Its depiction is shown in Fig. 2.3.
Figure 2.3: A drawing of The Turk. Creator: Library Company of Philadelphia. Credit: Library Company of Philadelphia via Picryl.com. Copyright: public domain.
The Turk was created by Wolfgang von Kempelen, an Austro-Hungarian inventor,
philosopher and mathematician born in today’s Bratislava. It was first unveiled at
Schönbrunn Palace, Vienna, in 1770. Kempelen was working for the Habsburg
Empire at the time and he wanted to specifically make an impression on the empress
Maria Theresa but it was her son - Emperor Joseph II - who convinced the inventor
to make it public.
Before each presentation, the inventor would open the drawers one after another to convince the audience that nobody was inside the cabinet controlling The Turk and that the machine was indeed autonomous. This would become a standard component of the show’s programme, executed by Kempelen and by the later owners of the automaton.
The Turk was an excellent chess player that rarely lost. Initially, it would defeat all of the opponents it faced, leading many people to attribute a magical aura to it and believe that it was unbeatable. That changed when it lost for the first time, but people still could not solve the mystery of how a machine could play so strongly. It toured Europe and later the United States of America. Some of its notable opponents included Napoleon Bonaparte, Grand Duke Paul of Russia (who later became Tsar Paul I of Russia), and Benjamin Franklin. It is believed that Napoleon deliberately made illegal moves three times to test how his opponent would react. After the third illegal move, The Turk swept all the pieces off the board.
The Turk attracted a lot of attention, including from the American writer and poet Edgar Allan Poe, who published a notable essay titled “Maelzel’s Chess-Player” [223]. Poe’s work contributed to our knowledge about the machine. On a side note, Maelzel was the owner of The Turk after Kempelen’s death. Poe suspected that it was a hoax
with a human operator concealed inside the machine. Some people shared this belief,
however, there were three issues with that hypothesis. Firstly, the presenter would
open the drawers, allowing the audience to inspect the interior. Secondly, there
appeared to be insufficient space inside for a fully grown human (some suspected war
veterans who had lost their legs). Lastly, the chess skills displayed were exceptionally
high.
People who doubted the legitimacy of The Turk were right. Inside the cabinet, there was indeed a human operator, whose presence was kept a closely guarded secret to maintain the illusion. The operator could move the pieces thanks to strong magnets and strings attached to them. They were also responsible for the movement of the arms and hands, achieved through the use of levers. Various people served this role, but all of them had two things in common - they were masters at chess and relatively small in stature. It is unclear how exactly opening the drawers before the show did not reveal the secret. It is possible that the person inside moved during the presentation, as one drawer was opened at a time. Moreover, the contraption was very complex, with intricate woodworking and a plethora of elements such as cogs, gears, and levers. The idea was to overwhelm and distract spectators and make them believe that they were being presented with something exceptional.
Similarly to da Vinci’s robots, The Mechanical Turk was a manifestation of a
dream about autonomous machines capable of playing games or more generally
displaying intelligence [270]. It was also a symptom of what was to come centuries
later when games became an integral part of artificial intelligence research. Efficacy
in playing games has become a litmus test for assessing the level of intelligence,
serving as a benchmark to measure problem-solving skills.
Games have been serving as testing environments for AI, due to the fact that they
are:
• deterministically controllable
• cost-effective
• repeatable
• formally defined as closed sets of rules
• capable of testing planning and reasoning skills
• fun to use
Chess, in particular, has become arguably the most important game in the history of artificial intelligence, rivaled perhaps only by Go. It has been referred to as “the Drosophila of AI” [176] - an analogy to the fruit fly that has been extensively used in biological research. Chess is a model problem for AI, much as Drosophila is a model organism studied to understand broader biological principles.
2.4 Principal Component Analysis (1901)
This section will present another mathematical contribution that, like Bayes’ theorem,
predates the advent of artificial intelligence as a field, but it has been relevant to it.
In 1901, Karl Pearson (1857 – 1936), an English mathematician and biostatistician,
introduced the concept of Principal Component Analysis (PCA) [132]. In principle,
it is a data dimensionality reduction method with a vast array of applications from
exploratory data analysis through machine learning to neuroscience [40], quantitative
finance [192] and applied physics [6]. Karl Pearson is also known for his other contributions, such as Pearson’s correlation coefficient, the introduction of moments to statistics, and the Chi-squared test.
As shown earlier in this chapter, the input data to machine learning models usually take the form of vectors:
$$X = (x_1, x_2, x_3, \dots, x_n)$$
Each component $x_i$ is a feature of the model, also referred to as an explanatory variable. A feature may be a directly observable variable or the result of a transformation of observable variables and/or other transformations. Think of transformations as functions that preprocess raw data so that it is more convenient to use in machine learning or data mining.
We can represent input data in the form of a matrix $[x_{ij}]$, in which each column $j$ represents a feature and each row $i$ represents a data sample (datapoint):
$$\begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2n} \\ x_{31} & x_{32} & x_{33} & \dots & x_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & x_{m3} & \dots & x_{mn} \end{bmatrix}$$
Even originally non-tabular data can be represented as such if required, e.g.,
an image of dimensions 𝑤𝑖𝑑𝑡ℎ and ℎ𝑒𝑖𝑔ℎ𝑡 can be represented as 𝑤𝑖𝑑𝑡ℎ × ℎ𝑒𝑖𝑔ℎ𝑡
features, where each feature encodes the color value of its corresponding pixel. A
sound file can be, for example, transformed into a frequency-domain representation,
where each feature corresponds to a specific frequency, rows may correspond to
uniform time intervals and values can denote frequency content over given time.
The input data representation forms an $n$-dimensional space, where each axis corresponds to the $j$-th feature. In real-world applications, $n$ is often very large, leading to the so-called curse of dimensionality. There are multiple problems when dealing with high-dimensional data. The most crucial is that the more dimensions there are, the more data samples are required to accurately represent the process the data comes from. The data becomes sparse compared to the whole possible domain. Moreover, processing high-dimensional data is slower and takes a lot of resources. Machine learning models must reflect the complexity of the data, so they become large and slow to train. Often, overly complex data prevent the model from converging at all. Therefore, it is standard practice to perform data preprocessing and feature selection first.
Principal Component Analysis helps to reduce the dimensionality of the data by transforming the original $n$-dimensional space onto an arbitrary number of dimensions. It forms a new coordinate system of orthogonal axes. Each new axis - called here a principal component - is a linear combination of the original axes. The idea is to choose the principal components one after another in such a way that each captures as much of the remaining variance of the data as possible. The larger the variance, the more information a particular component captures. As shown in Fig. 2.4, which depicts the idea of PCA, the method can also be used when the number of original and new dimensions is the same.
Figure 2.4: The original (𝑋1 , 𝑋2 ) space is transformed by PCA onto (𝑃𝐶1 , 𝑃𝐶2 )
space.
2.4.1 The PCA Algorithm
The basic version of the algorithm is as follows:
1. Perform data preprocessing - the algorithm is more effective when the data is standardized across features. Typically, for each feature, the mean is first subtracted from its values and the result is then divided by the standard deviation of the feature’s values.
2. Create the covariance matrix.
3. Perform eigendecomposition of the covariance matrix, obtaining eigenvalues and eigenvectors. Thanks to the relation between eigenvalues and singular values, this can also be achieved by applying singular value decomposition to the data matrix and then calculating variances.
4. Sort the eigenvalues from largest to smallest. The corresponding eigenvectors define the new axes (principal components), from the most to the least important ones.
5. Take the top $k$ principal components, where $k$ is the desired number of features (the new space dimension). A minimal code sketch of these steps follows below.
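As a concrete illustration, here is a minimal NumPy sketch of the steps above applied to randomly generated data; the data and the choice of $k = 2$ components are arbitrary.

```python
import numpy as np

# A minimal PCA sketch following the five steps above, on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 samples, 5 features
k = 2                                    # desired number of principal components

# 1. Standardize each feature (zero mean, unit standard deviation).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data.
cov = np.cov(X_std, rowvar=False)

# 3. Eigendecomposition of the (symmetric) covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort eigenvalues (and their eigenvectors) from largest to smallest.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Project the data onto the top-k principal components.
X_reduced = X_std @ eigenvectors[:, :k]
explained = eigenvalues[:k] / eigenvalues.sum()
print(X_reduced.shape, explained)
```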
The general usages of the PCA method are:
• To select a subset of only the most important features, which will be consider-
ably smaller than the full set of features.
• To just reduce the volume of data.
• To reduce the number of dimensions to 2 or 3, so it can be visualized in 2D or
3D, respectively.
• To just analyze which features are the most important with respect to a given
problem.
The main limitations and aspects to consider are:
• It is a linear transformation technique. Therefore, non-linear patterns can be
lost or even misleading for the analysis of data after PCA is applied.
• The results can be difficult to interpret because they do not operate on the original data. The data is first standardized, and the new space uses axes that are linear combinations of the original ones - so they will usually have no corresponding real-world phenomenon.
• When all principal components have similar variance, then choosing the 𝑘 top
ones is not as effective. The best-case scenario is when there exists a relatively
small subset of features that together account for the majority of variance in
the data.
• Outliers can disproportionately impact the results.
• There are known negative side-effects of data scaling when dealing with
extreme numbers such as in astronomy [226].
• Unless data is highly correlated or trivial along some features, there will
always be some information loss when reducing the number of dimensions.
Despite the limitations and potential problems, PCA is an all-around powerful
technique. It serves as a valuable tool in the data scientist’s toolkit.
2.5 The Min-Max Algorithm and Game Theory (1928/1944)
John Von Neumann (1903–1957) was a Hungarian-American researcher known
for his notable contributions to computer science, mathematics and physics. He
designed a computer architecture, now known as von Neumann architecture, which
was not only extensively used in the 1950s but continues to be the basis for more
advanced architectures. He was a member of the Manhattan Project [236] team at
the Los Alamos National Laboratory.
It was in 1928 that von Neumann published (in German) a paper whose title translates to “On the Theory of Games of Strategy”. Later, in 1944, the book “Theory of Games and Economic Behavior” [316] by John von Neumann and Oskar Morgenstern was published. The authors introduced a cornerstone theorem for game theory called minimax (also known as min-max or minmax).
A formal version of the minimax theorem can be written as follows:
$$\max_{x \in X} \min_{y \in Y} f(x, y) = \min_{y \in Y} \max_{x \in X} f(x, y) \qquad (2.5)$$
where $X \subset \mathbb{R}^n$ and $Y \subset \mathbb{R}^m$ are compact convex sets and $f$ is a continuous concave-convex function.
The minimax theorem gave rise to the minimax algorithm, which is one of the
most crucial algorithms in the history of game playing agents. A pseudocode of the
algorithm is presented in Algorithm 1. The original variant is defined for two-player,
zero-sum games. However, extensions of the algorithm have been proposed for other
classes of games [174]. As already signaled in Section 2.3, games have been a major
testbed for artificial intelligence.
Algorithm 1: MINIMAX(state)
  if state is terminal then
      return payoff(state)
  if currentPlayer is MAX then
      V ← −∞
      for node ∈ children(state) do
          V ← max(V, MINIMAX(node))
      return V
  if currentPlayer is MIN then
      V ← +∞
      for node ∈ children(state) do
          V ← min(V, MINIMAX(node))
      return V
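For readers who prefer code to pseudocode, here is a minimal Python sketch of the same procedure. It assumes a generic game-state interface (is_terminal, payoff, children) that is not defined in the book; those names are illustrative only.

```python
# A minimal Python sketch of the minimax procedure from Algorithm 1.
# The state object is assumed to expose is_terminal(), payoff() and children();
# payoff() returns the outcome from the MAX player's perspective.
def minimax(state, maximizing):
    if state.is_terminal():
        return state.payoff()
    if maximizing:                       # MAX player's turn
        return max(minimax(child, False) for child in state.children())
    else:                                # MIN player's turn
        return min(minimax(child, True) for child in state.children())
```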
2.5.1 Combinatorial Games
To provide more context, let the remainder of this section introduce several concepts
related to combinatorial games. Formally, a game is a model of interaction between
rational agents (players). The number of players is a property of a game. The
term rational refers to agents that pursue their individual goals and do their best
to achieve them, i.e., maximize their own payoff. The payoff is the numerical out-
come achieved by a player. The higher the value, the better the player did in the game.
Game tree - there are many ways to represent games such as extensive form or
normal form. A game tree is the most popular way to represent combinatorial games.
It is a structure that encodes all possible playouts of the game. Formally, despite
the name, it is a directed acyclic graph (DAG). The nodes represent states and the
root node is the starting state. The edges denote actions: an action $a$ represented by a directed edge $e(s_1, s_2)$ means that action $a$ is available in state $s_1$ and playing it transforms the state into $s_2$. The leaf nodes represent terminal states, in which the payoffs for the players are defined. Even for such a simple game as Tic-Tac-Toe, it is infeasible to show the full game tree in this book, as there are 255,168 unique ways to play the game. In Fig. 2.5, an excerpt is shown. The game tree representation also makes it convenient to show how the minimax algorithm traverses the tree to compute the minimax values. The idea is shown in Fig. 2.6.
Figure 2.5: A subset of the game tree for Tic-Tac-Toe. The leaf nodes include the payoffs, which here both happen to be equal to 1.0, i.e., wins for the X player.
Constant-sum and zero-sum properties - both describe a game in which one
player’s gain, in terms of the payoff, is exactly balanced by the remaining players’
losses. However, in constant-sum games the payoffs sum to a constant, whereas
in zero-sum games to zero. These types of games are referred to as strictly competitive.
(Im)perfect information - if a game is of perfect information (also known as com-
plete information), each player has access to all the available information including
the current state, the available strategies, and the payoffs associated with different
outcomes. Chess, checkers and tic-tac-toe are examples of perfect information games.
In imperfect (incomplete) information games, this property is not satisfied, i.e., at
least one player has access to different information than another one. Examples of
imperfect information games are: poker, bridge and Stratego.
Finiteness - a finite game is always guaranteed to end after a finite number of actions.
This property also implies a finite number of players and a finite number of available
actions in each state. For example, chess is a finite game due to the seventy-five-move
rule, which states that if after 75 moves no captures or pawn movements have been
made, the game ends in a draw. Since pawns cannot move back and forth and repeat
their positions and the number of pieces is a non-increasing function, then infinite
sequences of actions are not possible. Tic-Tac-Toe is also a finite game that ends
after a maximum of 9 actions.
Figure 2.6: An abstract game tree with minimax values computed. The circles denote states in which the MAX player takes an action, whereas squares represent states in which the MIN player is to move. The nodes with thicker borders contain values that were chosen by the algorithm to back-propagate. They hold either the max or the min values among the considered nodes.

Synchronous vs. asynchronous games - in synchronous games, players take turns, i.e., take actions in an alternating sequence. In asynchronous games, players may take actions simultaneously. One way to model simultaneous games using a game tree is to make edges denote vectors of actions taken by the players instead of single actions.
Branching factor - the average number of actions available to a player in the game.
It is equivalent to the average number of children of each node in the game tree.
Deterministic vs. non-deterministic games - in the latter, which are also referred
to as the games of chance, there is an element of randomness such as rolling a die or
shuffling a deck of cards. Chess and checkers are examples of deterministic games,
whereas poker, backgammon and Monopoly are examples of non-deterministic
games. There are a few approaches to model a game of chance using a game tree.
The most common one is to introduce a non-rational player, typically called a nature
player with available actions corresponding to all possible outcomes of random
events.
2.6 IRIS Dataset (1936)
There is no doubt that data are the fuel for machine learning and data science. Open datasets serve as particularly useful assets both for educational purposes and for research communities. For instance, students who do not yet have access to business data at work can benefit from public datasets when training machine learning models for the first time. Curated datasets are also very useful for getting a grasp of various data engineering and visualization tasks as well as clustering techniques. However, not only students use public datasets; researchers do so as well. Many state-of-the-art methods were initially evaluated on common datasets. IRIS was the first such public dataset, and it has been serving researchers and data processing enthusiasts to this day.
In 1936, Ronald Fisher, a British statistician and biologist, published a paper titled “The use of multiple measurements in taxonomic problems” [75], which featured the IRIS dataset. It contains a total of 150 data samples that describe iris flowers using five features (see Table 2.1). The samples were taken from three iris species, which constitute the last feature in the dataset. The species are shown in Fig. 2.7.
Figure 2.7: a) Iris setosa b) Iris versicolor c) Iris virginica. Flowers featured in the IRIS dataset. The images are publicly available under Creative Commons terms.
As of 2024, IRIS is considered a beginner-friendly dataset, often used as a rudimentary example in the “Hello, World” sense. For example, it is easy to learn clustering methods with the aim of discovering clusters that correspond to the three species of iris flowers, as in the sketch below.
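A minimal sketch of such a “Hello, World” experiment is shown here; it assumes the scikit-learn library, whose bundled copy of IRIS is used for convenience (k-means itself is covered in Section 3.3).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Cluster the 150 iris flowers into three groups and compare the resulting
# clusters with the true species labels.
iris = load_iris()
X, species = iris.data, iris.target           # 150 samples, 4 numeric features

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for c in range(3):
    counts = np.bincount(species[clusters == c], minlength=3)
    print(f"cluster {c}: samples per species = {counts}")
```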
Table 2.1: The first five rows and the last row of the IRIS dataset.

ID    sepal length   sepal width   petal length   petal width   species
1     5.1            3.5           1.4            0.2           Iris-setosa
2     4.9            3.0           1.4            0.2           Iris-setosa
3     4.7            3.2           1.3            0.2           Iris-setosa
4     4.6            3.1           1.5            0.2           Iris-setosa
5     5.0            3.6           1.4            0.2           Iris-setosa
...   ...            ...           ...            ...           ...
150   5.9            3.0           5.1            1.8           Iris-virginica

Some of the other widely used datasets in AI-related research are the MNIST [155] and Boston Housing [106] datasets. They were released in 1994 and 1978, respectively. MNIST contains images of hand-drawn numerical digits and has become a benchmark in the machine learning community, especially for tasks related to image classification and deep learning. The Boston Housing Prices dataset is more commonly used in regression tasks and predictive modeling. All of these datasets help in practicing and testing machine learning algorithms.
2.7 The First Mathematical Model of a Neuron (1943)
As of 2024, artificial neural networks (ANNs) stand as a fundamental component of
machine learning, playing a pivotal role in numerous groundbreaking achievements.
The history of ANNs dates back to a 1943 paper titled “A Logical Calculus of Ideas
Immanent in Nervous Activity” [181]. It was authored by Warren McCulloch (1898
- 1969) and Walter Pitts (1923 – 1969).
The paper begins with a theoretical introduction on how the brain might function and what rules govern the flow of impulses between neurons. The authors discuss the excitation of neurons - when it occurs, how long it lasts, and what its effect is. The key hypothesis is that, for a neuron to become excited, impulses must arrive from a sufficient number of neighboring neurons. They state that neural events can be modelled, albeit in a
simplified way, using propositional logic. McCulloch and Pitts list the physical
assumptions for the calculus and simplifications made such as that the inputs and
outputs of neurons can only be 0 or 1. Quoting the original paper: “The all-or-none
law of these activities, and the conformity of their relations to those of the logic of
propositions, insure that the relations of psychons are those of the two-valued logic
of propositions”.
The model of the McCulloch-Pitts neuron is presented in Figure 2.8. Each
neuron realizes a logical operation and its response is a proposition. In the paper, a
system built using such neurons is referred to as a complex net, which is equivalent
to a logic of propositions. Many properties of such nets were analyzed in the paper.
The authors explicitly assumed that the nets do not change with time. There was also no indication of how such networks might learn; such learning algorithms were proposed significantly later.
Figure 2.8: The McCulloch-Pitts model of a neuron. Both inputs and the output are binary. $t$ is an arbitrarily fixed threshold.

One of the many novel elements in the model is the threshold logic. The excitement of a neuron (a positive response) occurs if the summed excitement coming from the input neurons exceeds a certain fixed threshold. Because the excitement is binary, it is equivalent here to the number of excited input neurons. A single neuron can perform various logical operations, as shown in Table 2.2. Unfortunately, it is not possible to represent XOR with a single neuron of this kind, because its inputs are not linearly separable with respect to the output; NAND is also out of reach for the purely excitatory unit described above, since it would require inhibitory (negative) contributions.
Table 2.2: Inputs $X_1$ and $X_2$ to basic logical operations and their results. (!) denotes the cases for which a single McCulloch-Pitts neuron cannot provide the correct output.
𝑋1 input 𝑋2 input AND OR NAND XOR
0 0 0 0 1 0
1 0 0 1 1 1
0 1 0 1 1 1
1 1 1 1 0 (!) 0 (!)
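A McCulloch-Pitts style unit of this kind is simple enough to sketch in a few lines of Python; the function name and the loop below are purely illustrative.

```python
# A minimal sketch of a McCulloch-Pitts style threshold unit with binary inputs:
# the output is 1 if the number of excited inputs reaches the threshold t.
def mp_neuron(inputs, t):
    return 1 if sum(inputs) >= t else 0

# With two inputs, t = 2 realizes AND and t = 1 realizes OR (cf. Table 2.2).
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", mp_neuron((x1, x2), t=2), "OR:", mp_neuron((x1, x2), t=1))
```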
In summary, McCulloch and Pitts’ work had a profound impact on artificial intelligence. It was truly a cornerstone for artificial neural networks. Modern models continue to build upon and expand these foundational ideas.
2.8 Logistic Regression (1944 ∼ 1958)
Although the concept of logistic regression was popularized by David Cox in 1958 [49], its foundations were already introduced in 1944 by Joseph Berkson in the paper “Application of the Logistic Function to Bio-Assay” [18].
In statistics, regression is a method used to model and estimate, in numerical terms, the relationships in data between the independent variables and the response (also called the target, label, dependent variable, or explained variable).
Before proceeding to logistic regression, let me show an example of linear regression in Fig. 2.9. It consists in fitting a linear function to the observed data, as expressed in Equation 2.6. The obtained relationship helps in analyzing and visualizing the data. More importantly, however, the equation underpinning the fitted line can then be used to predict the response for any data - not only the observed datapoints. The best fit can be defined arbitrarily, but the most commonly used approach is to minimize the mean squared error (MSE), defined as the mean of the squared distances between the curve and the observed points. This is known as least squares optimization. In statistics and ML literature, you will also often find the root mean squared error (RMSE), which is equal to $\sqrt{\mathrm{MSE}}$. MSE emphasizes the impact of outliers or larger errors on the overall metric.
Figure 2.9: The idea of linear regression is to fit a curve to the observed data. The further away a particular datapoint is from the curve, the bigger the error associated with it.
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon \qquad (2.6)$$
where $\beta_i$ are the coefficients to be found in the regression process and $\varepsilon$ represents the error. In two dimensions, the equation can be rewritten in the more familiar form $Y = b + aX + \varepsilon$, where $a$ is the slope, $b$ is the intercept and $\varepsilon$ is the error.
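As a small worked example, the following NumPy sketch fits the two-dimensional form of Equation 2.6 to synthetic noisy data by least squares; the data and the true coefficients are invented for illustration.

```python
import numpy as np

# Fit Y = b + a*X to noisy synthetic data and report the fitted slope,
# intercept and the MSE / RMSE of the fit.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
Y = 2.0 + 1.5 * X + rng.normal(scale=1.0, size=X.shape)   # true a = 1.5, b = 2.0

a, b = np.polyfit(X, Y, deg=1)            # least-squares fit of a degree-1 polynomial
Y_pred = a * X + b
mse = np.mean((Y - Y_pred) ** 2)
print(f"slope={a:.2f}, intercept={b:.2f}, MSE={mse:.3f}, RMSE={np.sqrt(mse):.3f}")
```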
The logistic regression model uses the logistic function, which is a type of sigmoid function. Its formula is presented in Equation 2.7.
$$P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n)}} \qquad (2.7)$$
Sigmoids are “S”-shaped functions bounded by a pair of horizontal asymptotes.
Examples are:
$$f(x) = \frac{1}{1 + e^{-x}}; \qquad f(x) = \tanh x = \frac{e^x - e^{-x}}{e^x + e^{-x}}; \qquad f(x) = \arctan x$$
The result of logistic regression is a probability: $P(Y = 1)$ denotes the estimated probability that the outcome variable has a value equal to 1. In the base variant, $Y$ is a binary variable, whereas the independent variables $X_i$ can be binary or continuous. Generalizations to non-binary dependent variables were proposed later [77].
Figure 2.10: The idea of logistic regression. Colors denote assignments to two classes: blue and red, respectively. The logistic function represents the probability that a particular element belongs to a given class.
The most common technique for calculating the $\beta_i$ coefficients is called Maximum Likelihood Estimation (MLE) [201]. The model assumes that the observed data are samples from some unknown joint probability distribution, which can be expressed by means of the $\beta_i$ coefficients. MLE finds the set of coefficients that maximizes the joint probability of the observed data occurring. In general, this cannot be done analytically, so numerical methods are employed instead.
The obtained probability of a given outcome can be used to build classifiers or
predictors. For classifiers, a threshold value 𝑡 is introduced. Now, for example, if
𝑃(𝑌 = 1) < 𝑡 then the model assigns the first class, whereas otherwise it assigns the
second class. The idea is shown in Fig. 2.10.
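The following minimal sketch illustrates this workflow with scikit-learn's LogisticRegression on synthetic, invented data; the threshold t = 0.5 is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One synthetic feature; class 1 becomes more likely as x grows.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-2 * x[:, 0]))).astype(int)

model = LogisticRegression().fit(x, y)     # fits the beta coefficients (regularized maximum likelihood by default)
probs = model.predict_proba([[-2.0], [0.0], [2.0]])[:, 1]   # P(Y = 1) for new inputs
labels = (probs >= 0.5).astype(int)        # thresholding at t = 0.5
print(probs, labels)
```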
Logistic regression is typically taught as one of the first topics in neural network courses. In a sense, it can be viewed as a simple one-layer neural network. The logistic and other sigmoid functions are often employed as activation functions of neurons. The introduction of logistic regression was a major milestone influencing machine learning and neural networks.
2.9 A Mathematical Theory of Communication (1948)
The year 1948 saw the publication of a fundamental paper in computer science
titled “A Mathematical Theory of Communication” [271] by Claude E. Shannon.
The author was an American mathematician, electrical engineer, computer scientist
and cryptographer. Although Shannon’s work was primarily a contribution to information theory and computing, it is also relevant to machine learning models viewed as information-processing entities. Shannon introduced several concepts of information theory. We will go through some of them and conclude with the connection to AI.
The information content, also referred to as surprisal, of an event $X$ is defined as follows:
$$I(X) = -\log_2 p(X) \qquad (2.8)$$
where $p(X)$ is the probability of $X$.
One of the key notions is the entropy also referred to as Shannon’s entropy in
honor of the author. Entropy is a quantity representing uncertainty often interpreted
as the degree of chaos or randomness in the system, e.g., information system
consisting of a group of observations. The entropy of a discrete random variable 𝑋
with $n$ distinct outcomes (states it can take) is defined as follows:
$$H(X) = \mathbb{E}[I(X)] = -\sum_{i=1}^{n} p(x_i) \cdot \log_2 p(x_i) \qquad (2.9)$$
where $H(X)$ is the entropy and $p(x_i)$ is the probability of the outcome $x_i$.
The maximum compression limit is defined by the entropy of the source:
𝐿 ≥ 𝐻 (𝑋) (2.10)
- where L is the average code length per symbol.
There also exists a notion of conditional entropy 𝐻 (𝑋 |𝑌 ) defined as:
H(X \mid Y) = -\sum_{y} \sum_{x} p(x, y) \cdot \log_2 \frac{p(x, y)}{p(y)}    (2.11)
The cross-entropy is a measure defined for two probability distributions 𝑝 and 𝑞
over the same set of outcomes. It can measure how dissimilar they are. The formula
is given by:
H(p, q) = -\sum_{i=1}^{n} p(x_i) \cdot \log_2(q(x_i))    (2.12)
The mutual information is defined as:
𝐼 (𝑋;𝑌 ) = 𝐻 (𝑋) − 𝐻 (𝑋 |𝑌 ) (2.13)
The maximum channel capacity 𝐶 is given by:
C = \max_{p(x)} I(X; Y)    (2.14)
The information gain is defined as a decrease in information entropy 𝐻 from a
prior state to a new state that treats some information as given:
𝐼𝐺 = 𝐻 (𝑇) − 𝐻 (𝑇 |𝑎) (2.15)
- where 𝐻 (𝑇 |𝑎) is the conditional entropy of T given 𝑎.
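The quantities above are straightforward to compute for discrete distributions. Below is a minimal Python sketch (using NumPy) of entropy, cross-entropy, conditional entropy, and mutual information; the example distributions and function names are invented for illustration.

import numpy as np

def entropy(p):
    """Shannon entropy H(X) in bits (Eq. 2.9); p is a vector of outcome probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                   # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits (Eq. 2.12)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

def conditional_entropy(joint):
    """H(X|Y) from a joint probability table with rows indexed by x, columns by y (Eq. 2.11)."""
    joint = np.asarray(joint, dtype=float)
    p_y = joint.sum(axis=0)
    h = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                h -= joint[i, j] * np.log2(joint[i, j] / p_y[j])
    return h

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y) (Eq. 2.13)."""
    p_x = np.asarray(joint, dtype=float).sum(axis=1)
    return entropy(p_x) - conditional_entropy(joint)

# A fair coin carries 1 bit of entropy; a biased one carries less.
print(entropy([0.5, 0.5]), entropy([0.9, 0.1]))
# Joint distribution of two correlated binary variables.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(mutual_information(joint))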
2.9.1 The impact on Artificial Intelligence
Here are examples of connections between Shannon’s information theory and
machine learning:
1. Cross-entropy is commonly used as a loss function when training neural
networks.
2. Cross-entropy is also used with the softmax activation function in the last layer
of neural networks. The softmax transforms the output values into a probability
distribution, which makes cross-entropy a natural choice for comparing the
obtained probabilities to the true distribution.
3. Information gain (or a similar measure) is often used to determine optimal
splits in decision trees (see Section 6.1). In the context of tree-based ML
models, the information gain is typically expressed as:
IG = H(\mathrm{Parent}) - \sum_{i=1}^{k} \frac{N_i}{N} \cdot H(\mathrm{Child}_i)    (2.16)
- where 𝑘 is the number of child nodes, 𝑁𝑖 is the number of observations in the
𝑖-th child node, and 𝑁 is the number of observations in the parent node.
4. Mutual information is often used for feature selection and to evaluate the
similarity between clusters obtained by the clustering algorithms.
5. In principle, the goal of using ML models is to fit them to data and train them to
generalize to new data. Some models are explicitly trained to compress data
(compression by generalization), e.g., auto-encoders (which will be presented
in Section 6.11) or Gaussian mixtures (Section 5.5). The concepts of channel
capacity and compression limits help to analyze models and understand their
limits and potential.
2.10 Hebbian Learning (1949)
In 1949, Donald O. Hebb, an award-winning Canadian psychologist, published a
book - “The Organization of Behavior: A Neuropsychological Theory” [115]. The
book is heavily focused on understanding how real neurons work in a human brain
rather than attempting to simulate them. However, the introduced formalisation
and ideas contributed to the development of artificial neural networks. Hebb was
particularly interested in the formalisation of the learning process that occurs in
neurons. This was the next milestone after the McCulloch and Pitts work, presented
in Section 2.7.
One of the postulates in the book was that there is a strength of a synaptic
connection between two neurons, which becomes greater when the neurons fire
together. The more repeated and persistent this co-activation is, the faster the neurons
learn. Let 𝑤𝑖𝑗 denote the weight determining the strength of a connection between
neurons 𝑖 and 𝑗. Then, the learning occurs by updating this weight according to the
following rule:
\Delta w_{ij} = \gamma \cdot x_i \cdot x_j    (2.17)
- where 𝑥𝑖 is the pre-synaptic activity (the activity of the input neuron), 𝑥𝑗 is the
post-synaptic activity, and 𝛾 is the learning rate.
The post-synaptic activity is computed as:
x_j = \sum_{i} w_{ij} \cdot x_i    (2.18)
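A minimal Python sketch of the Hebbian update rule is given below; the initial weights, input pattern, and number of iterations are illustrative. Note that the plain rule of Eq. 2.17 lets the weights grow without bound, which is why later variants add some form of normalization.

import numpy as np

gamma = 0.1                               # learning rate
w = np.full(3, 0.1)                       # weights w_ij from 3 input neurons to one output neuron

# Repeatedly present a pattern; weights grow for inputs that co-activate with the output.
pattern = np.array([1.0, 0.0, 1.0])
for _ in range(10):
    x_pre = pattern                       # pre-synaptic activities x_i
    x_post = w @ x_pre                    # post-synaptic activity x_j (Eq. 2.18)
    w += gamma * x_pre * x_post           # Hebbian update (Eq. 2.17)
print(w)                                  # weights of co-active inputs have grown, the inactive one has not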
Hebb wrote that groups of neurons, called cell assemblies, work together to
represent information. He also introduced the term “engram”, which denotes the
physical trace of a memory in the brain. A significant portion of the book is devoted
to the brain’s plasticity and the reorganization of its structure with experience.
3. Years 1950 - 1964
3.1 Alan Turing’s Work (1950+)
Figure 3.1: Passport photo of Alan Turing by unknown photographer. From
Wikimedia Commons Public Domain.
Alan Turing (Fig. 3.1) is regarded as one of the fathers of computer science
and one of the most influential scientists of all time [55]. He was born in 1912 in
Maida Vale, London, England, and died in 1954 in Wilmslow, Cheshire, England.
His research interests spanned many fields, including theoretical computer science,
complexity theory, cryptography, philosophy, and biology. He had an impact on
artificial intelligence - both direct (through various publications) and indirect, through
developing computing models that would lead to the construction of general-purpose
computers. For his merits during the Second World War, he was appointed an Officer
of the British Empire (OBE). He was one of the people responsible for cracking the
Enigma [233] code. For his scientific contributions, he was elected to the Fellowship
of the Royal Society (FRS). Since 1966, arguably the most prestigious award in
computer science has borne his name: the Turing Award. It is considered the “Nobel
Prize of computer science” [105].
In 1950, Turing published an article titled “Computing Machinery and Intelli-
gence” [306]. In this paper, he posed the question “Can machines think?”. During
his career, he would often ask this question, contributing to the debate on the nature
of intelligence. Turing proposed an experiment called The Imitation Game, later
also named The Turing Test. The experiment involves three types of participants: 𝐴,
𝐵 and 𝐶. In the first variant of the problem, 𝐴 was a man, 𝐵 was a woman and 𝐶
was an interrogator whose goal was to identify which of the other participants was the
woman and which was the man. The interrogator was placed in a separate room and
allowed to use only such means of communication that would not help identify
the participants. For example, the tone of voice was such an indicator and was
therefore not allowed. Turing initially suggested a teleprinter communicating between
two rooms. In The Turing Test, which is an evolution of the previous setup,
the interrogator is still there, but the 𝐴 and 𝐵 participants are now a human and a
machine. The idea is shown in Fig. 3.2.
Figure 3.2: A scheme of The Turing Test. The participants: 𝐴, 𝐵, and 𝐶 do not see
each other. 𝐴 and 𝐵 are trying their best to convince the interrogator 𝐶 that they are
humans.
The point of the original paper was that the general question “Can machines think?”
is too meaningless until it is formalized what a machine is and what thinking is.
He argued that instead of “thinking”, researchers should seek specific activities that
require intelligence. Communication in a natural language and the ability to convince
the interrogator that they are dealing with a human is an example of such an activity.
Turing’s discussion has become part of the philosophy of artificial intelligence.
The Turing Test has been reformulated and adapted many times in the history
of AI. In 1991, the Loebner Prize [175] competition started, which was a practical
implementation of the Turing Test. The difference was that there were multiple
interrogators (at least four), called the judges. This competition was held annually in
this form until 2019. The Total Turing Test [224], proposed in 1998, introduces two
additional aspects to communication through a chat interface: perceptual abilities
and proficiency in manipulating objects. Notably, the widely used CAPTCHA serves
as a de facto reversed Turing Test, where a human must persuade a machine that
they are indeed a human. There have also been Imitation Game competitions [293]
proposed for video games, i.e., in which the interrogator’s goal is to tell apart human
players from AI bots. For example, in the 2K BotPrize competition, the goal was to
create convincing human-like bots in Unreal Tournament 2004 [83]. The Turing
Test may serve as one of the ways to assess Artificial General Intelligence (AGI).
However, other methods for assessing AGI have been proposed, such as The
Robot College Student Test [91].
Alan Turing was also interested in chess. In 1948, he designed his first chess
playing program called Turochamp [188]. However, the program was reconstructed
and implemented much later, because in Turing’s time, computers were not advanced
enough to run it. The author would execute the program manually. Chess has been
an important part of the debate on machine intelligence. Turing considered that
reasoning in complex rule-based models such as chess can be a good indicator of
intelligence. Moreover, one particular advantage of chess is that it comes with a
built-in objective assessment, i.e., the rules for winning, drawing, and losing.
In the area of computation models, Turing’s work was even more remarkable than
in the area of AI. He is particularly famous for developing the Turing Machine [311]
and its variants. Although the Turing Machine is only an abstract model, it played
a pivotal role in the creation of the first general-purpose computers. Modern
computers, as of 2024, are still equivalent, in terms of the underlying model’s
computational power, to the Turing Machine. Any system, i.e., a computer or a
programming language, is Turing-complete (also referred to as computationally
universal) if it can simulate the Turing Machine. The Turing Machine is widely used
in the computational complexity theory. Many theorems and proofs are constructed
using this model. For example, the famous P = NP problem [92], which is one of
the Millennium Prize problems, is the question of whether problems solvable by a non-
deterministic Turing Machine in polynomial time are equivalent to (can be reduced
to) problems solvable using a polynomial-time deterministic Turing Machine. In this
book, there will be examples of problems which are NP-hard. A problem is NP-hard
if every problem in NP can be transformed to it by a polynomial-time many-one
reduction (a Karp reduction). Colloquially, an NP-hard problem is at least as hard
as the most difficult problems in NP.
3.2 K-Nearest Neighbors (1951)
K-nearest neighbors (k-NN) is a simple yet very useful algorithm that can be applied
both to classification and regression tasks. Additionally, it is often included as a step
in more complex methods. It was proposed in 1951 by Evelyn Fix and Joseph L.
Hodges, Jr [76].
Let us assume there are 𝑛 observations in data. We can refer to these observations
as examples, in line with the terminology used in k-NN. Each observation can be
either a single value or a vector of values of a fixed size. In general, the data can be
denoted as in Eq. 3.1:
𝑋 = (𝑥 1 , 𝑥 2 , 𝑥 3 , ..., 𝑥 𝑛 ) ; 𝑥𝑖 ∈ R 𝑀 (3.1)
In order to be able to apply the k-NN method, a distance metric must be defined
for the 𝑥𝑖 vectors. Due to its general formula (c.f. Eq. 3.2), the Minkowski metric often
serves as a default in the algorithm’s implementations.
d(x_i, x_j) = \left( \sum_{k=1}^{M} |x_{ik} - x_{jk}|^{p} \right)^{1/p}    (3.2)
- where 𝑥𝑖𝑘 is the 𝑘-th component of vector 𝑥𝑖 .
Please note that the Minkowski metric reduces to:
• Manhattan metric if 𝑝 = 1
• Euclidean metric if 𝑝 = 2
• Chebyshev metric if 𝑝 → ∞
Both Manhattan and Euclidean metrics are widely used with k-NN. For discrete
variables, e.g. vectors encoding sets, other metrics are preferred such as the Hamming
distance or Cosine distance.
Whenever there is a new observation, the method determines the 𝑘 nearest
neighbors with respect to the provided distance function. What happens next
depends on the particular task. For example:
• For classification: the label for the new observation can be assigned through
majority voting among labels of the neighbors. The voting can be weighted
using the inverted distances.
• For regression: the target value can be set to the (weighted) average of the
neighbors values.
The algorithm, despite being simple, can be computationally intensive. Computing
the pairwise distances for 𝑛 vectors has 𝑂(𝑛²) computational and memory complexity.
Typically, a full distance matrix is computed once in the initial step of
the algorithm and stored for further usage. The distances can also be calculated on
the fly when a fast space partitioning method is available [317]. In such a case, due
to specialized structures, only a subset of the space needs to be searched to find 𝑘
nearest neighbors. Determining the optimal value for 𝑘 in the k-NN algorithm is a
crucial aspect. A particular number may be derived by expert domain knowledge.
Automatic methods include k-fold cross validation or grid search over a range of
possible values.
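Below is a minimal Python sketch of k-NN classification with the Minkowski distance of Eq. 3.2 and majority voting among the neighbors; the toy data and function names are illustrative.

import numpy as np
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance between two vectors (Eq. 3.2); p = 2 gives the Euclidean metric."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def knn_classify(X_train, y_train, x_new, k=3, p=2):
    """Assign the majority label among the k nearest training examples."""
    distances = [minkowski(x, x_new, p) for x in X_train]
    nearest = np.argsort(distances)[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Toy 2D data: class 0 near the origin, class 1 near (5, 5).
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [4.8, 5.2], [5.3, 4.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.5, 0.5])))   # -> 0
print(knn_classify(X_train, y_train, np.array([4.5, 5.0])))   # -> 1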
3.3 Development of the First Clustering Methods (1950 ∼ 1960)
3.3.1 K-means
K-means is one of the most commonly applied examples of unsupervised learning
techniques. Unsupervised learning describes a class of methods that learn without
ground truth labels. In other words, there are no correct labels available for learning.
Instead, unsupervised learning involves techniques that characterize the input
data, such as identifying patterns within them.
K-means is a clustering method invented between 1950 and 1960. There is
no consensus on who the original author was. It is often attributed to James
MacQueen [166], but another common thesis suggests that it was discovered by
multiple independent researchers before him. It is speculated, e.g. according to
Wikipedia (as of January 12, 2024), that k-means was invented by Stuart Lloyd in
1957 for his work but not published as a research article. However, this information
is unverified and should not be taken for granted.
Let 𝑛 be the number of observations (samples) in data. As with other clustering
tasks, the goal is to create 𝑘 clusters and assign each observation to one of them.
The number of clusters - 𝑘 - is the starting parameter of the algorithm. On a side
note, there have been methods proposed to determine this number experimentally.
Figure 3.3 shows an example of clustering with 𝑘 = 4. K-means is also referred to as
a centroid-based clustering algorithm.
Figure 3.3: The result of clustering observations into four clusters.
The underlying principle is to maximize:
1. The cohesion (similarity) within the clusters
2. Separation (dissimilarity) between the clusters
Let (𝑥 1 , 𝑥 2 , 𝑥 3 , ..., 𝑥 𝑛 ) be the 𝑛 observations in data to be partitioned into 𝑘 clusters.
Each observation is described by a vector of numerical, real-valued features. The
method aims to minimize the sum of squared distances between the centroid of a
cluster and examples as follows:
S = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \lVert x_{ij} - \mu_i \rVert^2    (3.3)
- where 𝑆 is known as inertia, 𝑛𝑖 is the number of observations assigned to the 𝑖-th
cluster, and (𝜇1, 𝜇2, ..., 𝜇𝑘) is the vector of centroids (the mean values in the clusters).
This is equivalent to minimizing the total variance within the clusters. It is an
NP-hard problem to perform the minimization optimally. There are many practical
algorithms for this problem. We will now present Lloyd’s Algorithm, which is one
of them.
3.3.2 Lloyd’s Algorithm
Stuart P. Lloyd (1923 - 2007) was a computer scientist working at the Bell Telephone
Laboratories. He developed an iterative algorithm for performing the k-means
clustering (supposedly in 1957 but published in 1982 [163]). Despite the fact that
the optimal solution is not guaranteed, it is a widely used method that has become
the de facto standard implementation of k-means clustering. Certain steps, such as
the choice of the initial centroids, were later optimized by other researchers [9].
The algorithm is iterative and defined as follows:
1. Set the maximum number of iterations 𝑀 𝐴𝑋
2. Set the tolerance 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑
3. Set the number of clusters 𝑘
4. Initialize the current vector of centroids randomly
5. Set the iteration (time) 𝑡 = 1
6. Compute the initial inertia 𝑆 1
7. For 𝑖 = 1...𝑛
(a) Assign 𝑥𝑖 to the cluster whose centroid 𝜇𝑗 minimizes the distance between 𝑥𝑖 and 𝜇𝑗
8. Recalculate the centroid vectors 𝜇𝑗
9. 𝑡 ← 𝑡 + 1
10. Recalculate the current inertia 𝑆𝑡
11. Repeat from step (7.) if 𝑡 ≤ 𝑀 𝐴𝑋 and ||𝑆 𝑡 − 𝑆 𝑡−1 || > 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑
Lloyd’s Algorithm can be used to generate Voronoi diagrams such as the one shown
in Fig. 3.4.
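A minimal Python sketch of Lloyd's Algorithm following the steps above is given below; the random initialization, stopping parameters, and toy data are illustrative (practical implementations typically add refinements such as k-means++ initialization).

import numpy as np

def lloyd_kmeans(X, k, max_iter=100, threshold=1e-6, seed=0):
    """Cluster X (n x m) into k clusters with Lloyd's iterative refinement."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 4: random initial centroids
    prev_inertia = np.inf
    for _ in range(max_iter):
        # step 7: assign each observation to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # step 8: recalculate centroids as cluster means (keep old centroid if a cluster is empty)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
        # steps 10-11: recompute the inertia (Eq. 3.3) and test for convergence
        inertia = np.sum((X - centroids[labels]) ** 2)
        if abs(prev_inertia - inertia) <= threshold:
            break
        prev_inertia = inertia
    return labels, centroids

# Two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels, centroids = lloyd_kmeans(X, k=2)
print(centroids)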
Figure 3.4: A Voronoi diagram. Each point in a region is closer to that region’s
generating site, denoted by a black dot, than to any other site.
3.4 The AI Term is Coined - AI Recognized as a Research Field
(1956)
Figure 3.5: Dartmouth Summer Research Project on Artificial Intelligence partici-
pants. From left to right: O. Selfridge, N. Rochester, R. Solomonoff, M. Minsky,
Unidentified Person, J. McCarthy, C.E. Shannon. This photo has appeared on
various websites and in articles. Reprinted from the article “The Turbulent Past and
Uncertain Future of Artificial Intelligence” that appeared in IEEE Spectrum, October
2021 [285]. Credit to the Minsky Family.
“The Dartmouth Summer Research Project on Artificial Intelligence” [180]
(also simply known as “The Dartmouth Workshop”) was held in 1956 in Hanover,
New Hampshire. The name of this workshop marks the first time when the term
artificial intelligence was publicly used to describe an emerging field. The term was
coined a year earlier on September 2, 1955, when the proposal for the workshop was
submitted. It was signed by four researchers: J. McCarthy (Dartmouth College), M.L.
Minsky (Harvard University), N. Rochester (I.B.M. Corporation), and C.E. Shannon
(Bell Telephone Laboratories). The organizers, along with two other participants,
are in Fig. 3.5. The provided affiliations appeared in the proposal, however, all of
the scientists later worked with numerous other institutions such as Massachusetts
Institute of Technology and Stanford University. The atmosphere was described as
having a collaborative spirit. There were a lot of interesting and inspiring discussions. People
were full of optimism towards the emerging field. The proposal included discussion
of topics such as:
1. Automatic computers
2. How can a computer use a language
3. “Neuron nets”
4. Theory of computation
5. Self-improvement
6. Abstraction
7. Randomness
8. Creativity
The official list of attendees has been lost but according to notes and witnesses,
there were around 20 participants [146]. Apart from the original four organiz-
ers, the following names were included in the documentation: R. Solomonoff, T.
More, O. Selfridge, J. Bigelow, W. R. Ashby, W.S. McCulloch, A. Robinson, T. Et-
ter, J. Nash, D. Sayre, A. Samuel, K. R. Shoulders, A. Bernstein, H. Simon, A. Newell.
The Dartmouth Workshop marked the beginning of the era of symbolic artificial
intelligence. In this paradigm, humans would usually encode knowledge in the
form of symbols, which would then be manipulated by machines to derive new
insights, information, and results, also encoded using symbols. An important way of
implementing this was through abstract reasoning (see Section 3.8). Symbolic AI
is now often referred to as Good Old Fashioned AI (GOFAI), although the term is
used more for differentiation than in a derogatory sense.
3.5 A Markovian Decision Process (1957)
In 1957, Richard Bellman (1920 - 1984), an American mathematician, published
a paper entitled “A Markovian Decision Process” [16]. It is very important in the
history of artificial intelligence due to many foundational ideas:
1. Bellman equation.
2. Dynamic programming.
3. Specification of a Markovian Decision Process (MDP), later known as a
Markov Decision Process, which is widely utilized e.g., in games, robotics
and reinforcement learning.
3.5.1 Bellman Equation
Bellman analyzed sequences of functions 𝑓0 (𝑖), 𝑓1 (𝑖), 𝑓2 (𝑖), ..., 𝑓 𝑁 (𝑖) defined as:
f_N(i) = \max_{q} \left[ b_i(q) + \sum_{j=1}^{M} a_{ij}(q) \, f_{N-1}(j) \right]    (3.4)
- where 𝑏𝑖 and 𝑎𝑖𝑗 are functions of finite-dimensional vectors 𝑞. The components of
the vectors 𝑞 can take a finite set of values, which in general depend upon 𝑖 and 𝑗. In the
paper, the author discusses various properties of such functions and then shows how
Equation 3.4 can be applied in practice for solving problems optimally.
These are the foundations for optimal control in control theory. Acting optimally
is described as minimizing the cost or maximizing the value. A function that
returns the assessment of the process, which should be maximized, is called the
value function and is typically denoted by 𝑉.
Equations 3.5 and 3.6 show a modern way of writing the Bellman equation for
deterministic and non-deterministic environments, respectively:
V(s) = \max_{a \in A} \left( R(s, a) + \gamma V(s') \right)    (3.5)
V(s) = \max_{a \in A} \left( R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \right)    (3.6)
- where 𝑉(𝑠) is the value function for state 𝑠, which is the maximum expected
cumulative reward achievable from state 𝑠, 𝑎 is an action, 𝐴 denotes the available actions
in state 𝑠, 𝑅(𝑠, 𝑎) is the immediate reward obtained by taking action 𝑎 in state 𝑠,
𝑠′ is the successor state, 𝑃(𝑠′|𝑠, 𝑎) is the probability of transitioning to state 𝑠′ from
state 𝑠 after taking action 𝑎, and 𝛾 is the discount factor.
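As an illustration of how the Bellman equation is used in practice, below is a minimal Python sketch of value iteration, which repeatedly applies the update of Eq. 3.6; the tiny stochastic environment, its transition probabilities, and its rewards are invented purely for illustration.

import numpy as np

# A toy 3-state chain: action 0 = "stay", action 1 = "move right".
# P[a][s][s'] is the transition probability, R[s][a] the immediate reward.
P = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],   # stay
    [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],   # move right (sometimes slips)
])
R = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [1.0, 1.0]])                                  # reward for being in the last state
gamma = 0.9

V = np.zeros(3)
for _ in range(100):
    # Bellman update (Eq. 3.6): V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V = Q.max(axis=1)
print(V)                                                    # states closer to the goal have higher value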
3.5.2 Dynamic Programming
Dynamic programming is a powerful technique for approaching problems that can
be represented using Bellman equations to express the optimal value. The problem
is then recursively decomposed into smaller problems until the smallest sub-problem
is found that can be solved directly.
In contrast to a similar technique called divide-and-conquer (Latin: divide et
impera), the sub-problems in dynamic programming are overlapping, i.e., results
for the sub-problems are stored and used more than once. In divide-and-conquer,
the sub-problems are independent and their solutions need to be used only once.
For example, the problem of computing the maximum value in a list of size 𝑁
can be divided into independent sub-problems of finding the maximum value in 𝑘
non-overlapping lists of size 𝑁/𝑘, so it is not suitable for dynamic programming. On
the other hand, finding Fibonacci numbers is suitable for dynamic programming.
The results for the sub-problems are used in a systematic way to compute the
solution for larger problems and eventually the original one. For this, there have been
both top-down and bottom-up approaches proposed. Bellman equations are utilized
to express the relation between the optimality of sub-problems to the optimality of
the original problem.
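A minimal Python sketch contrasting the top-down (memoized) and bottom-up formulations mentioned above, using Fibonacci numbers as the overlapping sub-problem example:

from functools import lru_cache

@lru_cache(maxsize=None)
def fib_top_down(n):
    """Top-down dynamic programming: overlapping sub-problems are cached and reused."""
    if n < 2:
        return n
    return fib_top_down(n - 1) + fib_top_down(n - 2)

def fib_bottom_up(n):
    """Bottom-up dynamic programming: solve the smallest sub-problems first."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib_top_down(30), fib_bottom_up(30))   # both print 832040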
3.5.3 Markov Decision Process
Markov Decision Process (MDP) is represented by a tuple: (𝑆, 𝐴𝑆 , 𝑃𝑎 , 𝑅𝑎 ). The
meaning of the components is as follows:
• 𝑆 - is the state space. It is a set of states that the environment can be in.
• 𝐴𝑠 - is the action space. The set of actions can be defined per state 𝑠 or, if all
actions are always available, then the subscript 𝑠 can be omitted.
• 𝑃𝑎(𝑠, 𝑠′) - is the transition function modelled by the probability that action 𝑎
performed in state 𝑠 will lead to state 𝑠′. For deterministic environments,
𝑃𝑎(𝑠, 𝑠′) ∈ {0, 1}, whereas for non-deterministic environments 𝑃𝑎(𝑠, 𝑠′) ∈ [0, 1].
• 𝑅𝑎 (𝑠) - is the immediate reward for reaching state 𝑠 due to action 𝑎.
In addition, there are initial conditions, which are implicit and typically not
included in the tuple. At minimum, the initial conditions are represented by the
initial state 𝑠0 ∈ 𝑆.
The base variant of MDP is suitable for deterministic or stochastic environments
but not for situations in which certain actors observe only a portion of the state. For
such scenarios, an extension of MDP has been proposed, named Partially Observable
Markov Decision Process (POMDP). Since asymmetric access to information is
more typical in multi-agent systems or adversarial games, I will discuss POMDPs in
sections devoted to them.
Markov Decision Processes are related to dynamic programming. Bellman
introduced the Principle of Optimality, which states that an optimal policy has
the property that, whatever the initial state and initial decision are, the remaining
decisions must constitute an optimal policy with regard to the state resulting from
the first decision. The optimal policy (way of acting) in any given state does not
depend on the past states. For instance, it is irrelevant how the current state was
achieved; for all different paths to achieve it, the optimal action remains the same.
The concepts presented in this section serve as the foundation for various areas
and algorithms such as control theory, reinforcement learning (c.f. Section 8.1) and
games.
3.6 Perceptron (1958)
Frank Rosenblatt (1928 - 1971) was an American psychologist and one of the first AI
practitioners. He made his mark in the history of AI by constructing the Perceptron.
The Perceptron was a piece of electronic hardware constructed by Rosenblatt at Cornell
Aeronautical Laboratory in 1957-1958 under the project PARA (“Perceiving and
Recognition Automata”) [248]. However, it also gave its name to the model of the neural
network he implemented [249]. At present, the second usage is more common.
However, it is worth remembering that the word Perceptron has two meanings,
depending on the context.
3.6.1 The Perceptron as a Model
The Perceptron is a model of a feed-forward neural network, as shown in Figure 3.6,
that contains:
1. One input layer
2. One hidden layer
3. One output layer
Figure 3.6: The Perceptron neural network. In general, the network does not need to
be fully connected. In this particular example, connections exist between all neurons
in the input and hidden layers, whereas not all connections are active between the
hidden and output layers.
The feed-forward property describes the unidirectional nature of the network
- the signal flows from the input layer in one direction through the hidden layer to
the output layer. In other words: there are no loops. The flow of a signal from
Figure 3.7: The model of a Perceptron neuron.
the input layer (when a NN is used) is called the forward pass or the inference. In
feed-forward networks, each neuron is visited at most once per one forward pass.
The Perceptron’s neuron model is presented in Figure 3.7. It is based on the
McCulloch-Pitts model.
As an activation function, Rosenblatt’s Perceptron used a step function:
f(x) = \begin{cases} 1 & \text{if } x \geq \text{threshold} \\ 0 & \text{otherwise} \end{cases}    (3.7)
However, subsequent perceptron-model neural networks have used a variety of
activation functions such as sigmoid functions or ReLU (Eq. 3.8).
𝑓 (𝑥) = max(0, 𝑥) (3.8)
A neural network is 𝑁-layered if it contains 𝑁 hidden layers. Therefore, the
Perceptron is a single-layer network. Due to having only one hidden layer, it is a
linear classifier. It fails to properly separate non-linear patterns such as the XOR
problem discussed in Section 2.7, where the McCulloch-Pitts neuron was presented.
However, it is possible to add more hidden layers. It then becomes the Multi-Layered
Perceptron (MLP).
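Below is a minimal Python sketch of a single neuron with the step activation of Eq. 3.7, trained with the classic perceptron learning rule on a linearly separable problem (logical AND). The code, toy data, and hyperparameters are illustrative and simplified relative to Rosenblatt's original training procedure.

import numpy as np

def step(x, threshold=0.0):
    """Step activation (Eq. 3.7)."""
    return 1 if x >= threshold else 0

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Perceptron learning rule: nudge weights by the prediction error on each example."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            error = yi - step(w @ xi + b)
            w += lr * error * xi
            b += lr * error
    return w, b

# Logical AND is linearly separable, so a single neuron can learn it (unlike XOR).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([step(w @ xi + b) for xi in X])   # -> [0, 0, 0, 1]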
Figure 3.8: The Perceptron device with an operator. Credit to: The National Museum
of the U.S. Navy - 330-PSA-80-60 (USN 710739). Public Domain photograph.
3.6.2 The Perceptron as a Device
The Perceptron, presented as an electromechanical machine in Figure 3.8, was
designed to process optical (visual) patterns as input. The overarching idea behind
the machine was to emulate human perception. Rosenblatt introduced a specialized
term, Photoperceptron, to distinguish it from other theoretically possible types of
inputs, such as tonal patterns (Phonoperceptron).
The device comprised three main components:
1. S-System - the sensory system or “input retina”.
2. A-System - the association system.
3. R-System - the response system.
The S-System formed the input layer. It was built using a 20 × 20 grid of photocells. The
A-System was a hidden layer consisting of 512 neurons organized into sets. It was
responsible for performing the algebraic sum of input impulses and activating units
in the R-System. The R-System was the output layer consisting of R-units. Each
R-unit was binary, i.e., had only ON and OFF states. Rosenblatt wrote that this can
be constructed using type-bars or signal lights. The machine had a set of 8 R-units.
The activity of R-units within a set was mutually exclusive, i.e., the response fed
back impulses which cut off the activity of all mutually exclusive R-units.
3.7 Samuel’s Checkers (1959)
Checkers (draughts or British draughts in some regions) has been a popular board
game that was also interesting for AI research. In 1959 Arthur Samuel, who was an
American scientist and one of the attendees of The Dartmouth Workshop, published
a paper titled “Some studies in machine learning using the game of checkers” [258].
It was one of the first papers that contained the term machine learning in its title.
The thesis of the paper was that it is possible to write a program that will learn to
play better than the program’s creator.
In the introduction, the author provides the motivation behind using checkers (and
games, in general) as a testbed. This topic has already appeared in Section 2.5, but
let’s see Samuel’s take on this:
• There exists no deterministic algorithm that will guarantee a win, in a practical
sense. Since there cannot exist an exact algorithm, the problem is interesting.
Samuel argues that, assuming 3 moves per millimicrosecond, solving checkers
would take 10^21 centuries to consider. Many years later, this assumption
turned out to be too strict (See Section 9.1).
• There exists a formal definite goal - winning the game. In addition, intermediate
goals can be utilized, such as the number of pieces of each color on the board.
The intermediate goal helps to measure progress towards the definite goal.
• The rules of the game are definite, known and formally defined.
• There is a background of knowledge - e.g., obtained from the history of games
between human players.
• The game is familiar to a substantial body of people. An AI program created
for a game would then be understandable by players of the game.
The main contributions of the paper are:
1. Effective utilization of heuristics and pruning methods during game tree
search.
2. Self-learning of simple parameters for the heuristic.
3.7.1 Heuristic Evaluation Function
One disadvantage of the minimax algorithm, that was already presented in Section 2.5,
is that in order to compute the optimal move, the game tree has to be searched entirely.
Samuel introduced a novel approach to utilize a heuristic evaluation function that
takes a state in checkers and returns its quality from the perspective of a given player:
ℎ(𝑠𝑡𝑎𝑡𝑒) → R (3.9)
If ℎ is constructed properly, it will yield more accurate values the closer 𝑠𝑡𝑎𝑡𝑒
is to a terminal state. In addition, for terminal states, the function should return the
actual game result. Such a function can be used to stop the search at a certain level
(depth) in the game tree and treat heuristic values as if they were actual scores. This
is often referred to as look-ahead. A ply is one action taken by one of the players in
a game. N-ply look-ahead describes constructing a game tree up to depth = 𝑁.
In the paper, the heuristic evaluation function was a linear polynomial, i.e.,
ℎ = 𝑎1𝑥1 + 𝑎2𝑥2 + 𝑎3𝑥3 + ... + 𝑎𝑛𝑥𝑛. The variables encoded board positions (only
the 32 playable squares) and six more parameters, including an aggregated piece
advantage (or disadvantage, when negative). In calculating the piece advantage,
regular pieces were worth 2, whereas kings were worth 3.
The proposed checkers program searched the game tree to a depth that was
context-dependent:
• 3-ply: if none of the following occurs: (1) the next move is a jump; (2) the
last move was a jump; (3) a player may offer an exchange.
• 4-ply: if (3) occurs and (1) and (2) do not.
• 5-ply or greater: the search stops when the next level does not contain any
jump.
• 11-ply or greater: the search stops when one player has piece advantage
greater than 2 kings.
• 20-ply: always stops - this is the maximum level.
The concept of pruning less promising branches in a game tree search algorithm
was later refined (as described in Section 5.3).
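Below is a minimal Python sketch of depth-limited game tree search with a heuristic evaluation at the cutoff, in the spirit of Samuel's N-ply look-ahead. The toy game, class names, and scores are invented for illustration and have nothing to do with checkers; the terminal score is reported from the maximizing player's perspective.

def minimax(state, depth, maximizing, game):
    """Depth-limited minimax: below the depth limit, trust the heuristic evaluation."""
    if game.is_terminal(state) or depth == 0:
        return game.heuristic(state, maximizing)
    values = [minimax(game.apply(state, a), depth - 1, not maximizing, game)
              for a in game.actions(state)]
    return max(values) if maximizing else min(values)

class CountdownGame:
    """Toy game: players alternately take 1 or 2 points; whoever takes the last point wins."""
    def is_terminal(self, state):
        return state == 0
    def actions(self, state):
        return [a for a in (1, 2) if a <= state]
    def apply(self, state, a):
        return state - a
    def heuristic(self, state, maximizing):
        # Terminal: the player to move has lost, so score from the maximizer's perspective.
        if state == 0:
            return -1.0 if maximizing else 1.0
        return 0.0  # cutoff before the end of the game: neutral estimate

game = CountdownGame()
print(minimax(7, depth=6, maximizing=True, game=game))   # -> 1.0 (the first player can force a win)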
3.7.2 Learning in Samuel’s Checkers Program
The method described in the paper is referred to as rote learning, and it comprises
two main ideas.
The first idea was to catalogue board positions that were encountered by the
program during its games. These stored boards are referred to as records of the
catalogue. This foundational idea underlies well-known concepts in game-playing
AI, such as opening books and transposition tables. Samuel developed an effective
cataloging system taking into account symmetries, deleting redundancies and
excluding positions “believed to be not of much value”. To further reduce the storage
and optimize search time, only a limited number of records were stored. When
adding more records, the system prioritized a) more frequently used ones, b) newer
ones, and c) those closer to the current game position. At the time of publication,
Samuel’s catalog contained over 53,000 records.
The second idea was generalization. Groups of boards were generalized into
feature vectors, and only these vectors were stored. Samuel selected 16 parameters
to represent boards and assigned modifiable weights to them. Additionally, he
incorporated the piece advantage feature with a fixed weight, forming the evaluation
function for the game tree search. Samuel developed a procedure to adjust the
weights during both games against humans and self-play, where the program played
against a slightly altered copy of itself. Samuel called the two variants Alpha and Beta,
which inspired subsequent development of the Alpha-Beta pruning algorithm (c.f.
Section 5.3). During self-play, one variant could adjust weights after each move and
exchange some of the 16 parameters with the reserve list (of the remaining ones),
while the second player had a fixed evaluation function.
The ideas proposed by Samuel have had a significant influence on many subse-
quent game-playing programs, making his work truly inspirational.
3.8 Programs with Common Sense (1959)
John McCarthy (1927 - 2011), an American computer scientist, was one of the
organizers of The Dartmouth Workshop. In 1959, he published a milestone paper in
AI titled “Programs with Common Sense” [178]. McCarthy starts with a motivation
that simple “verbal reasoning processes” should be possible to simulate using
programmable computers available in 1959.
The paper consists of:
• Motivation behind solving problems using reasoning and natural language.
These ideas also align with the context of Natural Language Processing (NLP),
so McCarthy can be considered one of the pioneers or inspirations for NLP.
• Formulating the primary long-term objective of enabling computer programs
to learn from experience; this ambitious goal was ahead of its time.
• Description of the principles that he wanted to initiate a project on, together
with Marvin Minsky. Although they co-founded the MIT AI Lab, they did not
pursue this project jointly.
• Discourse on Knowledge Representation. It is one of the first papers to delve
into this topic in the context of AI.
• Pros and cons analysis of the imperative vs. declarative approach.
• Proposal of an abstract language to communicate with the machine. McCarthy
started the development of key ideas of this language in 1956. The language
is known as LISP [177]. Although it is a general-purpose language, it can be
considered the first programming language designed specifically for artificial
intelligence.
The “common sense” included in the title was described as the ability to deduce
new information based on what the program already knows and what it is told by
the users. Therefore, both knowledge representation and reasoning were important
elements of the approach. The program must be able to not only represent information
provided by users but also the results of its own reasoning, i.e., the output it produces.
McCarthy shows in the paper a way to define premises. They are represented
by predicates. Predicates consist of a name such as at and arguments such as I and
desk, in the first of the examples below:
at(I, desk)
at(desk, home)
at(car, home)
at(home, county)
at(airport, county)
These premises should be interpreted as: [the symbol representing the first
argument] is at the [symbol representing the second argument]. A symbol, in turn,
may represent a person, a place, an object etc.
McCarthy also shows how to define transitivity using a consequence. The system
uses consequences to derive new information:
at(𝑥, 𝑦), at(𝑦, 𝑧) → at(𝑥, 𝑧)
A fact describing an action taken can be:
did(go(𝑥, 𝑦, 𝑧))
A plausible consequence for the user might be:
did(go(𝑥, 𝑦, 𝑧)) → at(I, 𝑦)
A particular problem or a goal can be represented as:
want(at(I, airport))
Other rules shown are:
walkable(𝑥), at(𝑦, 𝑥), at(𝑧, 𝑥), at(I, 𝑦) → can(go(𝑦, 𝑧, walking))
drivable(𝑥), at(𝑦, 𝑥), at(𝑧, 𝑥), at(𝑐𝑎𝑟, 𝑦), at(I, 𝑐𝑎𝑟) → can(go(𝑦, 𝑧, driving))
with a pair of facts:
walkable(home)
drivable(county)
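A minimal Python sketch of this style of reasoning is shown below: starting from the at facts of the example, the transitivity consequence is applied repeatedly until no new facts can be derived. The representation is illustrative and much simpler than LISP.

# Facts from the example: at(first, second) pairs.
facts = {("I", "desk"), ("desk", "home"), ("car", "home"),
         ("home", "county"), ("airport", "county")}

# Consequence: at(x, y), at(y, z) -> at(x, z). Apply it until no new facts appear.
changed = True
while changed:
    changed = False
    for (x, y1) in list(facts):
        for (y2, z) in list(facts):
            if y1 == y2 and (x, z) not in facts:
                facts.add((x, z))
                changed = True

print(("I", "county") in facts)   # True: the program deduced at(I, county) on its own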
There are more examples of defining programs in such a way in the paper. Such an
approach was a paradigm shift in the way of creating computer programs:
• Previously, every machine instruction had to be explicitly written. Now,
the program can perform reasoning on its own using proper premises and
consequences.
• As McCarthy points out, using Samuel’s Checkers program as an example,
computer programs typically had built-in heuristics provided by the program-
mers. Here, the idea is to enable users to define heuristics in the same language
as the program itself. McCarthy refers to this as the “embodiment” of the
heuristics.
• Finally, LISP provides a way of interacting with machines that is closer to
natural language than the previous methods.
In 1959, however, this was a specification for further implementation rather than
a working system. LISP would undergo various refinements and improvements
including adding new features.
McCarthy’s work had a profound impact on artificial intelligence. He received the
Turing Award in 1971. Logical reasoning and symbolic methods became mainstream
until the so-called “AI Winter” (see Section 4.10). However, these ideas experienced
a resurgence. In 2024, there is the neurosymbolic area of artificial intelligence that
involves combining neural networks with symbolic approaches.
3.9 General Problem Solver (1961)
Allen Newell (1927 - 1992) and Herbert A. Simon (1916 - 2001) were American
scientists. Both were recipients of the Turing Award, in 1975. They authored
General Problem Solver (GPS), which is considered a landmark achievement in
artificial intelligence. In 1961, they published the first article on this topic in Science
magazine [205], whereas in 1963, they wrote an extended chapter for the book titled
“Computers and Thought” [204]. The novelty is well captured in the name of the
approach. Unlike programs tailored for specific tasks, GPS could be used for various
problems. The problem specification was a run-time parameter.
The underpinning idea behind GPS was to simulate human-like processes in
achieving goals. The following three types of goals were possible to define:
1. Transform object A into object B.
2. Reduce difference D between object A and object B.
3. Apply operator Q to object A.
The system was built on symbolic logic. Objects were defined as logical terms
similarly to the ideas developed by McCarthy in LISP (See Sec. 3.8). Operations
were instances of the twelve system rules. Using the original authors’ symbols
{·, ∨, ⊃, ¬, →, ↔}, the rules are:
Rule 1:
𝐴 · 𝐵 → 𝐵 · 𝐴;  𝐴 ∨ 𝐵 → 𝐵 ∨ 𝐴    (3.10)
Rule 2:
𝐴 ⊃ 𝐵 → ¬𝐵 ⊃ ¬𝐴    (3.11)
Rule 3:
𝐴 · 𝐴 ↔ 𝐴;  𝐴 ∨ 𝐴 ↔ 𝐴    (3.12)
Rule 4:
𝐴 · (𝐵 · 𝐶) ↔ (𝐴 · 𝐵) · 𝐶;  𝐴 ∨ (𝐵 ∨ 𝐶) ↔ (𝐴 ∨ 𝐵) ∨ 𝐶    (3.13)
Rule 5:
𝐴 ∨ 𝐵 ↔ ¬(¬𝐴 · ¬𝐵)    (3.14)
Rule 6:
𝐴 ⊃ 𝐵 ↔ ¬𝐴 ∨ 𝐵    (3.15)
Rule 7:
𝐴 · (𝐵 ∨ 𝐶) ↔ (𝐴 · 𝐵) ∨ (𝐴 · 𝐶);  𝐴 ∨ (𝐵 · 𝐶) ↔ (𝐴 ∨ 𝐵) · (𝐴 ∨ 𝐶)    (3.16)
Rule 8, which applies to the main expression only:
𝐴 · 𝐵 → 𝐴;  𝐴 · 𝐵 → 𝐵    (3.17)
Rule 9, which applies to the main expression only:
𝐴 → 𝐴 ∨ 𝑋    (3.18)
Rule 10, where 𝐴 and 𝐵 are the main expressions:
𝐴, 𝐵 → 𝐴 · 𝐵    (3.19)
Rule 11, where 𝐴 and 𝐴 ⊃ 𝐵 are the main expressions:
𝐴, 𝐴 ⊃ 𝐵 → 𝐵    (3.20)
Rule 12, where 𝐴 ⊃ 𝐵 and 𝐵 ⊃ 𝐶 are the main expressions:
𝐴 ⊃ 𝐵, 𝐵 ⊃ 𝐶 → 𝐴 ⊃ 𝐶    (3.21)
Newell and Simon presented an example, in the paper, of solving the problem shown
in Table 3.1.
Table 3.1: The reasoning process presented in [205] for the problem of transforming
the left-hand side of row 1. into the right-hand side of row 1.
1. (𝑅 ⊃ ¬𝑃) · (¬𝑅 ⊃ 𝑄) ¬(¬𝑄 · 𝑃)
2. (¬𝑅 ∨ ¬𝑃) · (𝑅 ∨ 𝑄) Rule 6 applied to left and right of 1.
3. (¬𝑅 ∨ ¬𝑃) · (¬𝑅 ⊃ 𝑄) Rule 6 applied to left of 1.
4. 𝑅 ⊃ ¬𝑃 Rule 8 applied to 1.
5. ¬𝑅 ∨ ¬𝑃 Rule 6 applied to 4.
6. ¬𝑅 ⊃ 𝑄 Rule 8 applied to 1.
7. 𝑅 ∨𝑄 Rule 6 applied to 6.
8. (¬𝑅 ∨ ¬𝑃) · (¬𝑅 ∨ 𝑄) Rule 10 applied to 5. and 7.
9. 𝑃 ⊃ ¬𝑅 Rule 2 applied to 4.
10. ¬𝑄 ⊃ 𝑅 Rule 2 applied to 6.
11. 𝑃⊃𝑄 Rule 12 applied to 6. and 9.
12. ¬𝑃 ∨ 𝑄 Rule 6 applied to 11.
13. ¬(𝑃 · ¬𝑄) Rule 5 applied to 12.
14. ¬(¬𝑄 · 𝑃) Rule 1 applied to 13.
GPS operated by setting sub-goals whose satisfaction has a chance to lead closer
to the satisfaction of the initial goal. The means-ends analysis, which enabled the
computation of differences between states, was a key concept in GPS. Most importantly,
it was possible to compute the difference between the current state and a goal state.
The difference was based on a number of changes such as: add a term, delete a term,
change a connective, change sign, change lower sign, change grouping, and change
position.
The authors proposed a whole set of heuristics, e.g., which rules or which
object matching should be checked first. It produced a recursive system similar to a
game tree (c.f. Section 2.5). However, instead of actions and game states, there were
applied rules and problem states, respectively.
GPS was implemented as a working program on the IBM 704 computer. However,
it was more than just a contribution to artificial intelligence. The discussion presented
in the articles [205, 204] added valuable insights to the discourse on human
cognition.
4. Years 1965 - 1974
4.1 Fuzzy Sets (1965)
In 1965, Lotfi Zadeh published a seminal paper entitled “Fuzzy Sets” [341]. As
of February 2024, with over 120,000 citations, it stands as one of the most widely
referenced articles in the fields of computer science and mathematics.
The motivation underlying fuzzy sets was to provide a tool for modeling the
vagueness, uncertainty, and imprecision inherent in information. In the foundational
paper, several examples are presented, such as categorizing various entities into
the animal class: (dogs, horses, and birds) vs. (starfish and bacteria) vs. (rocks,
fluids, and plants). The key idea is that there should be a quantifiable way to
express this membership rather than a definitive binary answer of yes or no. This
is more prevalent when considering examples of linguistic uncertainty: “the class
of beautiful women” or “the class of tall men” (both are from the original paper).
It matters when specific actions are taken based on uncertain information, or when
specific numerical values must be assigned, e.g., to the minimum height that counts
as tall - various individuals may perceive it differently. Consider The Bald Man
Paradox formulated by Eubulides: “A man with a full head of hair is obviously not
bald. Now the removal of a single hair will not turn a non-bald man into a bald one.
And yet it is obvious that a continuation of that process must eventually result in
baldness”. Fuzzy sets are a solution to modelling such phenomena.
Let 𝑋 be an ordinary set. A fuzzy set – 𝐴 – adds a membership (characteristic)
function 𝑓𝐴:
A = \{ (x, f_A(x)) : x \in X, \; f_A : X \to [0, 1] \}    (4.1)
The membership function 𝑓𝐴 denotes the degree to which 𝑥 belongs to 𝐴. If it
takes only the values 0 (when 𝑥 ∉ 𝐴) and 1 (when 𝑥 ∈ 𝐴), then a fuzzy set is a special
case of an ordinary (crisp) set. Fuzzy logic is an infinite-valued logic in contrast to,
e.g., two-valued Boolean logic or three-valued logic [231].
The support of a fuzzy set is defined as:
S_A = {x ∈ X : f_A(x) > 0}
An 𝛼-cut of a fuzzy set is defined as:
A_{>α} = {x ∈ X : f_A(x) > α}
The core of a fuzzy set is defined as:
C_A = {x ∈ X : f_A(x) = 1}
A fuzzy set is normalized if and only if its height is equal to 1. The height is
defined as:
H_A = sup_{x ∈ X} f_A(x)
- where, for a finite case, the supremum is equal to the maximum. The supremum is
guaranteed to always exist, because 𝑓𝐴(𝑥) has the upper bound of 1.
A fuzzy set is empty if and only if:
∀𝑥 ∈ 𝑋 𝑓 𝐴 (𝑥) = 0
The complement of a fuzzy set 𝐴 = (𝑋, 𝑓 𝐴 (𝑋)) is defined as:
𝐴′ = (𝑋, 𝑓 𝐴′ (𝑥) = 1 − 𝑓 𝐴 (𝑥))
Two fuzzy sets 𝐴 = (𝑋, 𝑓 𝐴 (𝑋)) and 𝐵 = (𝑌 , 𝑓 𝐵 (𝑌 )) are equal iff:
𝑋 = 𝑌 and ∀𝑥∈𝑋 𝑓 𝐴 (𝑥) = 𝑓 𝐵 (𝑥)
Table 4.1 lists some fundamental properties of set operations with distinction of
which of them hold for crisp and fuzzy sets, respectively.
A t-norm (triangular norm) is an operation that generalizes the intersection of
fuzzy sets and conjunction in fuzzy logic:
𝑇 : [0, 1] × [0, 1] → [0, 1]
T-norms satisfy the following conditions:
T(1, x) = x
T(0, x) = 0
T(x, y) = T(y, x)
T(T(x, y), z) = T(x, T(y, z))
x ≤ w ∧ y ≤ z =⇒ T(x, y) ≤ T(w, z)
An s-norm (also called t-conorm) is a dual operation to the t-norm. It generalizes
union of fuzzy sets and disjunction in fuzzy logic.
𝑆 : [0, 1] × [0, 1] → [0, 1]
Operation | Notation | Crisp Sets | Fuzzy Sets
Involution | (A′)′ = A | Yes | Yes
Commutativity | A ∪ B = B ∪ A;  A ∩ B = B ∩ A | Yes | Yes
Associativity | (A ∪ B) ∪ C = A ∪ (B ∪ C);  (A ∩ B) ∩ C = A ∩ (B ∩ C) | Yes | Yes
Distributivity | A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C);  A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C) | Yes | Yes
Idempotence | A ∪ A = A;  A ∩ A = A | Yes | Yes
Absorption | A ∪ (A ∩ B) = A;  A ∩ (A ∪ B) = A | Yes | Yes
Absorption by X | A ∪ X = X | Yes | Yes
Absorption by ∅ | A ∩ ∅ = ∅ | Yes | Yes
Identity | A ∪ ∅ = A;  A ∩ X = A | Yes | Yes
Law of contradiction | A ∩ A′ = ∅ | Yes | No (!)
Law of excluded middle | A ∪ A′ = X | Yes | No (!)
De Morgan’s laws | (A ∪ B)′ = A′ ∩ B′;  (A ∩ B)′ = A′ ∪ B′ | Yes | Yes
Table 4.1: Fundamental properties of crisp and fuzzy sets, respectively. The 𝑋
symbol represents all possible objects (the complete set); A′ denotes the complement of A.
S-norms satisfy the following conditions:
S(1, x) = 1
S(0, x) = x
S(x, y) = S(y, x)
S(S(x, y), z) = S(x, S(y, z))
x ≤ w ∧ y ≤ z =⇒ S(x, y) ≤ S(w, z)
A list of widely used t-norms and s-norms is presented in Table 4.2.
In the years following the introduction of fuzzy sets, various extensions have
been proposed. One such extension is Type-2 fuzzy sets, where each element of the
set is associated with its own fuzzy membership function. As a result, not only is
the original concept fuzzified, but the degree of uncertainty is also fuzzified:
\tilde{A} = \{ (x, f_{\tilde{A}}(x)) : x \in X, \; f_{\tilde{A}} : X \to \mathcal{F}([0, 1]) \}    (4.2)
More examples of the use of fuzzy sets will be presented in Section 7.8 of this book.
Name of the operator | Formula
Minimum t-norm | T(A, B) = min(A, B)
Product t-norm | T(A, B) = A · B
Drastic minimum t-norm | T(A, B) = min(A, B) if max(A, B) = 1, otherwise 0
Łukasiewicz t-norm | T(A, B) = max(A + B − 1, 0)
Hamacher t-norm | T(A, B) = (A · B) / (A + B − A · B)
Einstein t-norm | T(A, B) = (A · B) / (2 − (A + B − A · B))
Yager t-norm (parameterized) | T(A, B) = max(0, 1 − ((1 − A)^p + (1 − B)^p)^(1/p))
Maximum s-norm | S(A, B) = max(A, B)
Probabilistic sum s-norm | S(A, B) = A + B − A · B
Drastic maximum s-norm | S(A, B) = max(A, B) if min(A, B) = 0, otherwise 1
Łukasiewicz s-norm | S(A, B) = min(A + B, 1)
Hamacher s-norm | S(A, B) = (A + B − 2 · A · B) / (1 − A · B)
Einstein s-norm | S(A, B) = (A + B) / (1 + A · B)
Table 4.2: A list of popular t-norms and s-norms in fuzzy logic.
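Below is a minimal Python sketch of fuzzy membership degrees combined with a few of the operators from Table 4.2; the “tall” membership function and the example heights are invented for illustration.

def tall(height_cm):
    """An illustrative membership function for the fuzzy set of 'tall men'."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30.0            # linear ramp between 160 cm and 190 cm

def t_min(a, b):      return min(a, b)          # minimum t-norm (intersection / AND)
def t_product(a, b):  return a * b              # product t-norm
def s_max(a, b):      return max(a, b)          # maximum s-norm (union / OR)
def s_prob(a, b):     return a + b - a * b      # probabilistic sum s-norm
def complement(a):    return 1.0 - a            # fuzzy complement

a = tall(178)        # 0.6: "somewhat tall"
b = tall(185)        # about 0.83
print(t_min(a, b), t_product(a, b))    # degree of "both are tall" under two t-norms
print(s_max(a, b), s_prob(a, b))       # degree of "at least one is tall" under two s-norms
print(complement(a))                   # degree of "not tall"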
Fuzzy sets serve both mathematicians and computer scientists. They laid the
foundation for fuzzy logic - a formal extension of the Boolean logic. Fuzzy logic
has found diverse applications in fields such as control systems [66], industrial
automation [182], forecasting [287], medicine [56], and many others. Fuzzy sets,
fuzzy logic and fuzzy modeling were among the most frequently used keywords in
AI research before the era of deep learning. Researchers are attracted to fuzzy logic
for its intuitive and clean mathematical apparatus, adaptability to complex systems,
versatility across disciplines and robustness in handling uncertainty.
4.2 Deep Learning Foundations (∼ 1965)
Although deep learning became successful and truly took off in the 2010s, it is
important to note various milestones in the history of AI that have made its success
possible.
Some of the foundations have already been covered in this book such as the
introduction of the first model of a neuron (c.f. Sec. 2.7), logistic regression (c.f.
Sec. 2.8), and the Perceptron (c.f. Sec. 3.6). In this section, another foundation
will be presented called “Group Method of Data Handling” (GMDH). It was first
introduced by Alexey Ivakhnenko (a Soviet and Ukrainian mathematician) [130].
He also published a working algorithm for training multi-layered Perceptrons [129].
Let us consider a polynomial approximation of the connection between the input
vector 𝑋 = (𝑥1, 𝑥2, 𝑥3, ..., 𝑥𝑛) and the output 𝑌:
Y(X) = a_0 + \sum_{i=1}^{n} a_i x_i + \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} x_i x_j + \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} a_{ijk} x_i x_j x_k + \ldots    (4.3)
- where 𝑎𝑖 are the coefficients, e.g., the weights of a neural network.
Ivakhnenko proved that a second order polynomial can be used for each pair
(𝑥𝑖 , 𝑥 𝑗 ) of input variables (as shown in Eq. 4.4) instead of the one described by
Eq. 4.3 through an iterative induction of a multi-layer perceptron structure. The idea
is depicted in Fig. 4.1.
y = a_0 + a_1 x_i + a_2 x_j + a_3 x_i x_j + a_4 x_i^2 + a_5 x_j^2    (4.4)
Figure 4.1: Example of a multi-layer perceptron for Group Method of Data Handling.
Two subsets of input features are active: (𝑥1, 𝑥4) and (𝑥3, 𝑥4).
A general version of the GMDH procedure is as follows:
1. Collect a set of observations that seems relevant to the output
2. Organize the data into two groups: A and B. Samples from group A will be used
for training, whereas samples from group B will be used for validation. This
procedure separates out the information embedded in A.
3. Generate subsets of group A. Each group represents a subset of features.
4. For each subset, iteratively create polynomial models of increasing complexity.
5. Choose the best-fitting model for each group.
6. Combine models from all groups into a final model and fit to the whole data
in A.
7. Use group B data to evaluate the final model. This approach enhances
generalization ability. Gödel’s second incompleteness theorem [279] states
that a system cannot demonstrate its own consistency. Evaluation using data
from group B serves as an external criterion with respect to the objective used
in training.
The coefficients of the polynomials can be determined using least squares
regression:
error = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \frac{\partial\, error}{\partial a_j} = 0    (4.5)
However, the GMDH theory was developed independently of regression analysis.
The convergence of these areas can be seen in the broader context of the evolution
of machine learning and data analysis methods [7].
It is worth noticing that the whole procedure contains many of the concepts
used in neural networks such as feature selection (choosing subsets from group
A), NN-like computations (polynomials representing weighted signal) as well as
the train-test dataset split (group B). It provides better insight into how multi-layer
neural networks can be unfolded into algebraic computations.
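Below is a minimal Python sketch of a single GMDH layer: for every pair of inputs, the degree-2 polynomial of Eq. 4.4 is fitted by least squares on a training split, and the candidate models are ranked by their error on a validation split (the external criterion). The data split, selection count, and function names are illustrative.

import itertools
import numpy as np

def pair_features(x_i, x_j):
    """Design matrix for the degree-2 polynomial of Eq. 4.4."""
    return np.column_stack([np.ones_like(x_i), x_i, x_j, x_i * x_j, x_i ** 2, x_j ** 2])

def gmdh_layer(X_train, y_train, X_val, y_val, keep=3):
    """Fit one candidate model per input pair and rank them by validation error."""
    candidates = []
    for i, j in itertools.combinations(range(X_train.shape[1]), 2):
        A = pair_features(X_train[:, i], X_train[:, j])
        coeffs, *_ = np.linalg.lstsq(A, y_train, rcond=None)        # least-squares fit (Eq. 4.5)
        val_pred = pair_features(X_val[:, i], X_val[:, j]) @ coeffs
        val_error = np.mean((y_val - val_pred) ** 2)
        candidates.append((val_error, (i, j), coeffs))
    return sorted(candidates, key=lambda c: c[0])[:keep]             # best-fitting pairwise models

# Toy data: y depends on features x0 and x2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 1.0 + 2.0 * X[:, 0] * X[:, 2] + 0.1 * rng.normal(size=200)
best = gmdh_layer(X[:150], y[:150], X[150:], y[150:])
for err, pair, _ in best:
    print(pair, round(err, 4))                                       # the pair (0, 2) ranks first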
Over time, several GMDH algorithms have been proposed in the literature.
These algorithms vary in aspects such as the selection of the base function (e.g.,
degree-2 polynomials or harmonic functions), the nature of input data (e.g., discrete
or continuous), and the external criterion used for validation.
4.3 Expert Systems (1965)
Expert systems are a type of artificial intelligence system that incorporates a
significant amount of information about a particular subject and provides an interface
for asking questions regarding this subject. These questions can range from simple
queries to complex problems, necessitating reasoning over the knowledge base.
The knowledge base (KB) is the typical name given to the component responsible
for storing the information. Its structure is typically modeled using facts and rules,
which bears resemblance to logic-based systems such as General Problem Solver that
was shown in Section 3.9. Typically, KB is not populated with facts and rules in a
single step. Instead, it functions as a dynamic and evolving subsystem. Consequently,
it requires proper knowledge acquisition and update procedures. The knowledge
acquisition may involve a dialogue with domain experts and the development of
dedicated graphical user interfaces to facilitate the process. More advanced expert
systems also incorporate various validation mechanisms, such as ensuring that the
knowledge is consistent, non-contradictory, and up-to-date. Typically, expert systems
implement the reasoning mechanisms through the use of IF-THEN production rules:
IF (condition1) THEN (effect1)
IF (condition2) THEN (effect2)
...
IF (conditionN) THEN (effectN)
Please note that if a particular condition is not satisfied it does not automatically
mean that it is false. It might become true along the process as a result of another
rule unless a particular expert system is explicitly constructed in such a way that
the rules do not interfere with each other. Similarly, if a condition is not satisfied,
it does not mean that its corresponding effect is false as it might appear more than
once and another rule might make it true.
By using rule-based reasoning, expert systems differ significantly, in principle,
from ML-based systems. However, these two approaches, i.e., expert IF-THEN rules
vs. statistical learning from data, can be combined [138].
In 1965, the first expert system was created by the Americans: E. Feigenbaum
(b. 1936) and J. Lederberg (1925 - 2008). The system was named DENDRAL [31,
160]. Its field of expertise was organic chemistry. The original name of the project
was: “DENDRitic ALgorithm, a procedure for exhaustively and non-redundantly
enumerating all the topologically distinct arrangements of any given set of atoms,
consistent with the rules of chemical valence”.
The system employed a plan-generate-test approach:
• Planning. Its purpose was to generate constraints to be passed to the generator
(the next phase). It utilized planning IF-THEN rules formulated by an expert.
The results of planning rules, i.e., dynamically generated constraints, were
combined with static user-defined constraints for superatoms. The term
superatom referred to a connected graph representing a unit of a chemical
structure. For instance, a specific chemical structure might comprise 3 distinct
units. The input to the planning rules were data obtained from the mass
spectrometry process.
• Generating. Its purpose was to generate the most plausible molecular structures
while adhering to the constraints obtained in the planning stage. It was
implemented in a brute-force manner, i.e., considering all possible structures.
• Testing. Its purpose was to validate and return only structures that are
consistent with the chemical knowledge. It utilized IF-THEN rules about
chemical reactions, molecular structure and mass spectrometry.
An example rule was:
IF (m/z peak is observed at 44 and 60)
THEN (superatom likely contains an oxygen (O) atom)
- where 𝑚/𝑧 is the mass-to-charge ratio. The system focused on the following six
types of atoms including their isotopes: carbon (C), hydrogen (H), nitrogen (N),
oxygen (O), phosphorus (P), and sulfur (S). The user interface was relatively difficult
to use because it required adherence to a formal protocol. An example command is
provided below:
#GENERATE
SUPERATOM: Z
RANGE OF OCCURRENCES; AT LEAST 1
SUPERATOM: V
RANGE OF OCCURRENCES: AT LEAST 2
SUPERATOM:
’COLLAPSED’ FORMULA IS C 3 Z 1 V 2 H 9
CONSTRAINT; LOOP Z NONE
[This constraint prevents Z from bonding with itself.]
CONSTRAINT: SUBSTRUCTURE CH3 NONE
This would generate the possible structures respecting all the constraints.
The first implementation of DENDRAL utilized a teletype. According to [160],
the commonly reported telephone bills were $2000 per month, which, in terms of
purchasing power, is roughly equivalent to $20000 in 2024. Nevertheless, DENDRAL
was a huge success both in terms of a computer system and the value it brought to the
field of organic chemistry. It demonstrated that expert systems could have practical
applications and contribute to solving complex problems in specific domains. Until
the era of deep learning, expert systems were one of the most recognized examples
of applied AI. The ideas and techniques pioneered in DENDRAL influenced the
design and development of subsequent expert systems [116, 152, 332].
4.4 ELIZA (1966)
As of 2024, Chat-GPT is the hottest thing in AI and beyond. In this section, we will
talk about ELIZA, which is widely considered the first chatbot ever created [344].
It was developed between 1964 and 1966 by Joseph Weizenbaum (1923 - 2008) of
the Artificial Intelligence Laboratory at MIT. He was a German-American computer
scientist who published the article “ELIZA – A Computer Program For the Study of
Natural Language Communication Between Man and Machine” [323] in 1966. The name Eliza was inspired by Eliza Doolittle
- a character from Pygmalion, which is a play written by George Bernard Shaw.
In the story, Eliza undergoes a transformation from an uneducated poor girl into
an eloquent lady, serving as the inspiration (or the motivation) for the ELIZA program.
The ELIZA program was implemented on the MAC time-sharing system available
at MIT. In the 1966 paper, the following conversation with ELIZA is given as an
example. The lines written in lowercase are user inputs, whereas the lines in
uppercase are ELIZA’s responses. As you will see, many of the machine responses
are questions to the user. However, the use of the question mark character was
forbidden because it was a special command on the MAC time-sharing system.
Men are all alike.
IN WHAT WAY
They’re always bugging us about something or other.
CAN YOU THINK OF A SPECIFIC EXAMPLE
Well, my boyfriend made me come here
YOUR BOYFRIEND MADE YOU COME HERE
He says I’m depressed much of the time.
I AM SORRY TO HEAR YOU ARE DEPRESSED
It’s true. I am unhappy.
DO YOU THINK COMING HERE WILL HELP YOU NOT TO BE UNHAPPY
I need some help, that much seems certain.
WHAT WOULD IT MEAN TO YOU IF YOU GOT SOME HELP
Perhaps I could learn to get along with my mother.
TELL ME MORE ABOUT YOUR FAMILY
My mother takes care of me.
WHO ELSE IN YOUR FAMILY TAKES CARE OF YOU
My father.
YOUR FATHER
You are like my father in some ways.
WHAT RESEMBLANCE DO YOU SEE
You are not very aggressive but I think
you don’t want me to notice that.
WHAT MAKES YOU THINK I AM NOT VERY AGGRESSIVE
You don’t argue with me.
WHY DO YOU THINK I DON’T ARGUE WITH YOU
You are afraid of me.
DOES IT PLEASE YOU TO BELIEVE THAT I AM AFRAID OF YOU
My father is afraid of everybody.
WHAT ELSE COMES TO MIND WHEN YOU THINK OF YOUR FATHER
Bullies.
DOES THAT HAVE ANYTHING TO DO WITH THE FACT
THAT YOUR BOYFRIEND MADE YOU COME HERE
ELIZA was a rule-based system. The basic idea is to split the text typed by the
user into fragments and associate exactly one keyword with each fragment.
A fragment can be a sentence, or a phrase if it is contained within delimiters such
as commas. There was a set of rules determining if something is a keyword, along
with additional heuristics that prioritized one keyword over another if multiple were
found.
For every keyword, there were decomposition and reassembly rules defined in
the following fashion:

(𝐾  ((𝐷_1) (𝑅_{1,1}) (𝑅_{1,2}) ... (𝑅_{1,𝑚_1}))
     ((𝐷_2) (𝑅_{2,1}) (𝑅_{2,2}) ... (𝑅_{2,𝑚_2}))
     ...
     ((𝐷_𝑛) (𝑅_{𝑛,1}) (𝑅_{𝑛,2}) ... (𝑅_{𝑛,𝑚_𝑛})))
- where 𝐾 is a particular keyword, 𝐷 𝑖 is the 𝑖-th decomposition rule and 𝑅𝑖 𝑗 is the
𝑗-th reassembly rule defined for the 𝑖-th decomposition rule. The decomposition
rules transformed sentences into patterns represented by formal data structures rather
than free-form text, e.g.
(0 YOU 1 ME)
- where 0 denoted “an indefinite number of words”, whereas 1 denoted that exactly
one word was expected in its place. Because many decomposition rules were defined,
the program would iterate over and try to match them.
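As a rough, modern illustration of this mechanism (certainly not Weizenbaum’s original
code), the Python sketch below converts a decomposition pattern such as (0 YOU 1 ME)
into a regular expression and fills a reassembly template with the captured fragments;
the keyword pattern, the template and the input sentence are assumptions made for the
example.

import re

# ELIZA-like notation: 0 = any number of words, a digit n = exactly n words.
def pattern_to_regex(pattern):
    parts = []
    for token in pattern.split():
        if token == "0":
            parts.append(r"(.*?)")                            # any run of words
        elif token.isdigit():
            n = int(token)
            parts.append(r"(\w+(?:\s+\w+){%d})" % (n - 1))    # exactly n words
        else:
            parts.append(re.escape(token))                    # literal keyword
    return re.compile(r"\s*".join(parts), re.IGNORECASE)

decomposition = pattern_to_regex("0 YOU 1 ME")                # illustrative rule
reassembly = "WHAT MAKES YOU THINK I {1} YOU"

m = decomposition.fullmatch("Well, you hate me")
if m:
    print(reassembly.format(*(g.upper() for g in m.groups())))
# -> WHAT MAKES YOU THINK I HATE YOU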
At the time of publication, the largest dictionary of keywords tested contained
only 50 entries. The idea was to treat this dictionary as a specific topic for the
conversation. ELIZA could even be used for languages other than English. However,
for each application, a dedicated dictionary had to be provided, so the claim about
language independence should be taken with a grain of salt.
The author points out numerous limitations of the ELIZA approach. For instance,
the identification of the one keyword per fragment was limiting and not always
successful. Overall, it was a simple approach to human-machine conversation
in a natural language without any form of understanding of the meaning by the
machine. ELIZA was not created to fulfil the dream of human-like behavior as
Turing imagined when proposing the Imitation Game (c.f. Section 3.1). Although
ELIZA did not pass the Turing Test, it is widely regarded as a major milestone in the
history of AI. When released, it was one of a kind.
The next famous chatbot, after ELIZA, was named Parry. It was authored by
an American psychiatrist - K.M. Colby in 1972 [45]. Parry was meant to simulate
a patient with psychological disorders. In their article [98], Güven Güzeldere and
Stefano Franchi present the following transcript from the conversation between
Parry and ELIZA:
Parry: People get on my nerves sometimes.
ELIZA: I am not sure I understand you fully.
Parry: You should pay more attention.
ELIZA: Suppose you should pay more attention?
Parry: You’re entitled to your opinion.
ELIZA: What makes you think I’m entitled to my opinion?
Parry: It bothers me just to be around people.
ELIZA: Please go on.
[Parry continues with some paranoid ramblings
about horseracing, bookies, and the Mafia]
Parry: I know the mob controls the big rackets.
ELIZA: Who are you really talking about?
Parry: Have you ever really thought about the Mafia?
ELIZA: Suppose you ever really thought about the Mafia?
Parry: Let’s talk about something else.
ELIZA: I am not sure I understand you fully.
Parry: You keep asking me the same things,
why not change the topic?
ELIZA: In what way?
4.5 A* Algorithm (1969)
In a 1969 paper titled “A Formal Basis for the Heuristic Determination of Minimum
Cost Paths” [107], Peter E. Hart (b. 1941), Nils J. Nilsson (1933 - 2019) and Bertram
Raphael (b. 1936) introduced the A* algorithm to the world. It is used to find the
shortest path in a graph. Because the configuration space of many problems can be
formally represented as a graph, A* is suitable and widely applied for a variety of
problems. One particularly common application is navigation in a 2D or a 3D space.
4.5.1 Elementary Definitions from Graph Theory
A* can be applied to essentially any type of graph: simple, directed, weighted,
multigraphs, and combinations of those, e.g., weighted directed graphs.
A graph without any additional properties is defined as a pair:
𝐺 = (𝑉, 𝐸)
- where 𝑉 is the set of vertices and 𝐸 ⊆ 𝑉 × 𝑉 is the set of edges. The
edge 𝑒_{𝑖𝑗} = {𝑣_𝑖, 𝑣_𝑗} connects the 𝑖-th and 𝑗-th vertices.
A directed graph adds a direction to each edge. Therefore, an edge is now
represented as a tuple rather than a set: 𝑒𝑖 𝑗 = (𝑣 𝑖 , 𝑣 𝑗 ) ≠ (𝑣 𝑗 , 𝑣 𝑖 ). Path-finding
algorithms can only traverse the graph along the edges’ directions. Compare the
shortest paths between vertices 𝐴 and 𝐵 shown in Fig. 4.2 for undirected and directed
graphs, respectively.
A simple graph is an undirected graph without loops.
A weighted graph introduces weights 𝑤 𝑖 𝑗 to each edge 𝑒𝑖 𝑗 .
A multigraph allows for multiple edges between the same pair of vertices.
The problem of finding the shortest path in a graph can be formally defined as
finding a sequence of vertices:
𝑝𝑎𝑡ℎ = (𝑣 1 , 𝑣 2 , ..., 𝑣 𝑘−1 , 𝑣 𝑘 )
from the starting point to the target:
𝑣 1 = 𝑠𝑡𝑎𝑟𝑡
𝑣 𝑘 = 𝑡𝑎𝑟𝑔𝑒𝑡
such that they form a consistent path:
(𝑣_𝑖, 𝑣_{𝑖+1}) ∈ 𝐸, ∀𝑖 ∈ {1, ..., 𝑘 − 1}
that minimizes the weighted distance:

min ∑_{𝑖=1}^{𝑘−1} 𝑤_𝑖 ,   𝑤_𝑖 ∈ 𝑊𝐸𝐼𝐺𝐻𝑇𝑆
Figure 4.2: The shortest paths from vertices 𝐴 to 𝐵 in two types of graphs. The
lengths of the paths are equal to 6 and 4, respectively, for the directed and undirected
graphs.
In an unweighted graph, we can consider all weights equal to 1. An undirected
graph is equivalent to a directed graph that has all edges defined in both directions,
i.e., 𝑒𝑖 𝑗 ⇔ 𝑒 𝑗𝑖 or, using the vertex notation: (𝑣 𝑖 , 𝑣 𝑗 ) ⇔ (𝑣 𝑗 , 𝑣 𝑖 ). Therefore, the terms
“finding the shortest path” and “finding the least-cost path” are used interchangeably
in the context of path-finding algorithms.
4.5.2 The A* Procedure
The main idea of A* is to utilize a heuristic that will guide the search and drastically
improve its running time efficiency. The algorithm approximates the length (cost) of
a path using the following equation:
𝑓 (𝑥) = 𝑔(𝑥) + ℎ(𝑥) (4.6)
- where 𝑥 is the current vertex the algorithm is at; 𝑔(𝑥) is the shortest (cheapest) path
from 𝑠𝑡𝑎𝑟𝑡 to 𝑥 computed explicitly as the algorithm has already traversed this path;
ℎ(𝑥) is the heuristic estimation of the distance left: from 𝑥 to 𝑡𝑎𝑟𝑔𝑒𝑡.
At any given node, the algorithm will choose the next open (yet untested)
possibility of traversal that minimizes 𝑓 (𝑥) from Eq. 4.6. The 𝑓 (𝑥) function can be
Algorithm 2: A* Algorithm. Finds the shortest path from start to target.
OPEN = {start} ;                      // Nodes to test.
CLOSED = ∅ ;                          // Best known visited predecessors.
G, H, F ← [] ;                        // Initialization of maps. See Eq. 4.6. Missing entries of G are treated as ∞.
G[start] ← 0 ;
H[start] ← heuristic(start, target) ;
F[start] ← G[start] + H[start] ;
while OPEN is not empty do
    current ← get node from OPEN where F(node) is minimal ;
    OPEN.remove(current) ;
    if current = target then
        return backpropagate(CLOSED, current) ;
    for next in getAdjacentNodes(current) do
        next_G ← G[current] + distance(current, next) ;
        if next_G < G[next] then
            CLOSED[next] ← current ;
            G[next] ← next_G ;
            H[next] ← heuristic(next, target) ;
            F[next] ← G[next] + H[next] ;
            if next is not in OPEN then
                OPEN.add(next) ;
return NULL ;                          // No path found.
Figure 4.3: The comparison of visited nodes during the search by Dijkstra’s and
A* algorithms (using the Manhattan distance heuristic), respectively. In the lower
graph, the yellow nodes represent those added to the OPEN set but not traversed by
the algorithm.
regarded as an estimation of the cost if the optimal path went through 𝑥. For the
pseudocode, please refer to the Algorithm 2 listing.
If the heuristic is admissible, i.e., it satisfies the following property:
ℎ(𝑥) ≤ actual-cost-to-goal-from-𝑥 (4.7)
then the algorithm will find the optimal solution if one exists. If, additionally, the
heuristic is consistent, i.e., for every pair of adjacent vertices 𝑥 and 𝑦:
ℎ(𝑥) ≤ 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒(𝑥, 𝑦) + ℎ(𝑦) (4.8)
then the algorithm avoids revisiting nodes that have already been visited.
In general, when the heuristic is not admissible, there is no guarantee that the A*
algorithm will find the optimal solution. If this is a requirement, then exact methods
such as Dijkstra’s Algorithm should be used. However, A* typically visits far
fewer nodes during the search than exact methods (c.f. Figure 4.3).
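For readers who prefer runnable code over pseudocode, below is a compact Python
sketch of A* on a small 2D grid with the Manhattan-distance heuristic. The grid, start,
and target are invented for the example, and a priority queue replaces the explicit
search for the minimum-F node from Algorithm 2.

import heapq

def manhattan(a, b):
    # Admissible heuristic on 4-connected grids: never overestimates the true cost.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(grid, start, target):
    rows, cols = len(grid), len(grid[0])
    open_heap = [(manhattan(start, target), 0, start)]   # entries: (F, G, node)
    came_from = {}                                       # best known predecessors
    best_g = {start: 0}
    while open_heap:
        f, g, current = heapq.heappop(open_heap)
        if current == target:                            # reconstruct the path backwards
            path = [current]
            while current in came_from:
                current = came_from[current]
                path.append(current)
            return path[::-1]
        if g > best_g.get(current, float("inf")):
            continue                                     # stale entry left in the heap
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (current[0] + dr, current[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                next_g = g + 1
                if next_g < best_g.get(nxt, float("inf")):
                    best_g[nxt] = next_g
                    came_from[nxt] = current
                    heapq.heappush(open_heap, (next_g + manhattan(nxt, target), next_g, nxt))
    return None                                          # no path found

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]                                    # 1 denotes an obstacle
print(a_star(grid, (0, 0), (2, 0)))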
4.6 AI Philosophy Seminal Paper (1969)
“Some Philosophical Problems from the Standpoint of Artificial Intelligence” [179]
is a seminal paper that was published by John McCarthy and Patrick J. Hayes in
1969. It is a relatively long article spanning 51 pages, organized into the following
sections:
1. Introduction. The organization of the article is outlined. The authors also
highlight the importance of knowledge representation that reflects the real
world in order to discuss intelligent behavior of computer programs.
2. Philosophical Questions.
This section starts with a short review of how progress is made in AI. On the one
hand, researchers pick specific problems such as the efficacy in playing chess
and find out what kind of intellectual mechanisms are required to solve them.
On the other hand, it is also possible to start with intellectual mechanisms and
design problems in such a way as to exercise those mechanisms. The examples
given are: memory and decision-making. The authors discuss the nature of
intelligence - how it can possibly be defined. Two alternative views are shown
- one is evidential that “a machine is intelligent if it solves problems that
require intelligence in humans” or “survives in an intellectually demanding
environment”. The second definition is more introspective and based upon having
an accurate model of the world. The remainder of their article is written in this
light. The authors pose exemplar questions that an intelligent machine
should be able to answer. McCarthy and Hayes discuss the importance of
common-sense knowledge and reasoning. Next, they focus on the world
representation. Examples are shown using finite automata and logic as tools.
They also discuss some problems from modal logic such as the notion of
“can”.
3. Formalism. The authors attempt to provide a formalism that could address the
questions posed in the previous sections. This work builds upon the ideas
presented in Section 3.8 (“Programs with Common Sense”). The following
aspects of the world model are introduced: situations, fluents, representing
causality, actions, strategies, knowledge and ability. A situation is a subset of the
state of the universe. A fluent is a function, e.g., at( 𝑝, 𝑥)(𝑠) ∧ raining(𝑥)(𝑠),
with situations as arguments. The causality discussion is focused mostly on
modeling implications. Actions are specially distinguished fluents. Strategies
are combinations of actions performed by some subject. Actions may have
a result associated with them. The ability can be expressed by the use of
the previously defined language constructs and quantifiers. Examples are
given, however, various problems that arise (such as complex and unintuitive
notation) are shown.
4. Remarks and Open Problems. The so-called frame problem is introduced.
The frame is essentially a state vector. The frame problem concerns the procedure
of updating this state vector efficiently. This is related to the closed-world
assumption (CWA) vs. open-world assumption. In CWA, the state of the world
is explicitly defined by statements (e.g., fluents in a logic system) which are
true. If something is not known to hold true, either explicitly or as a derived
information from what is true, then it is considered false. In other words,
there is no unknown or inconsistent true-and-false state. Formal methods
such as first-order non-monotonic logic are applied to tackle the frame update
problem in scenarios provided by the authors as examples. There is also
a short subsection about introducing probabilities to the formalism so that not all
information must be assumed to be certain.
5. Discussion of Literature. McCarthy and Hayes reflect upon related litera-
ture. They dedicate a portion of this section to a critique of some previous
approaches, e.g., for not being epistemologically adequate. The topics covered
include the definition of intelligence, knowledge representation in computer
systems, representation of behaviors and beliefs, modal logic, temporal logic,
and other related subjects.
6. References. The paper contains 57 bibliographical references.
4.7 The Precursor of Simulated Annealing (1970)
In 1970, Martin Pincus published a paper titled “A Monte Carlo method for the
approximate solution of certain types of constrained optimization problems” [221].
The ideas presented in this paper initiated research on a method that has been known
as “simulated annealing” (SA) since 1983 [145]. The name was inspired by the
annealing process in metallurgy. It refers to the heat treatment of a material followed
by a slow cooling process to remove defects and optimize the material’s properties,
such as hardness and strength. For an illustration, see Fig. 4.4.
Simulated annealing belongs to the family of meta-heuristic methods. Since this
term has not appeared in this book yet, let us define it:
Meta-heuristics are a special class of methods for optimization and search
problems. One of the key properties of meta-heuristics is that they are multi-purpose
rather than problem-specific. They can incorporate heuristics for local optimization
or search, but on top of them there is a high-level procedure that defines a particular
meta-heuristic algorithm. Meta-heuristics are frequently used for complex problems
for which no efficient exact or problem-specific heuristic algorithm exists. They can
often provide a near-optimal solution.
Meta-heuristic methods typically share many of the following properties:
• They involve searching through the problem space with some level of random-
ness.
Figure 4.4: Materials in a liquid state can undergo a process called crystallization
when the temperature goes below the melting point in a sufficiently gentle fashion,
such as shown in the temperature (time) plot.
• Some kind of a global search is combined with a local search (refinement
procedure).
• Many meta-heuristics are population-based, i.e., they maintain a population of
candidate solutions.
• They are iterative.
• They involve mechanisms to escape local optima, e.g. restarts.
The pseudocode of the simulated annealing method is presented in Alg. 3. It
uses a function called energy, which evaluates the quality of a solution in such a way
that its value is smaller the closer the solution is to the optimum. Therefore, the algorithm
will try to minimize it. It is trivial, however, to reformulate the algorithm for a
maximization objective.
Algorithm 3: Simulated Annealing
s ← initial solution ;
e ← energy(s) ;
e_min ← good enough energy (stop condition) ;
s_best ← s ;
e_best ← e ;
t ← initial temperature ;
t_min ← final temperature ;
while t > t_min and e > e_min do
    s* ← generate-neighbor(s) ;
    e* ← energy(s*) ;
    Δe ← e − e* ;
    if Δe > 0 or exp(Δe/t) > rand(0, 1) then
        s ← s* ;
        e ← e* ;
        if e < e_best then
            s_best ← s ;
            e_best ← e ;
    t ← cooling(t) ;
return s_best
Algorithm 4: Cooling(t) Variant 1
𝑡 ←− input temperature ;
𝑟𝑎𝑡𝑒 ←− cooling rate ∈ (0, 1) ;
return 𝑡 ∗ 𝑟𝑎𝑡𝑒
Algorithm 5: Cooling(t) Variant 2
𝑡 ←− input temperature ;
𝑡0 ←− initial temperature ;
return 𝑡 − 𝑡0 /𝑚𝑎𝑥𝐼𝑡𝑒𝑟
The method, in its basic variant, is not population-based, i.e., one solution is
considered at a time. However, it can be modified to run multiple SA instances in
different regions of the problem space. In such a case, in each iteration, there is a
vector of current solutions, and the global best one is chosen after all iterations are
complete. The algorithm jumps to a neighboring solution provided that it is of
sufficient quality. The method generate-neighbor can be implemented as a random
solution sampled from a small proximity. The process “cools down”: the lower the
temperature, the less likely the algorithm is to accept a solution that is worse than the
current one. The exploration is therefore greater at the beginning of the algorithm. The two
most common methods of cooling are presented in Algorithms 4 and 5, respectively.
SA is best used for the classic mathematical optimization problems such as
function minimization. Unlike some of the other meta-heuristics, it is suitable for
functions defined for continuous domains, i.e., 𝑓 : R𝑛 → R. For example, SA can
be applied for hyperparameter tuning and optimization of weights in deep neural
networks [238]. To this day, it remains one of the simplest yet most widely utilized
meta-heuristic methods.
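Below is a minimal Python sketch of the procedure from Algorithm 3, applied to
minimizing a simple one-dimensional function. The energy function, the neighborhood
(a small Gaussian move), and the cooling rate are assumptions chosen purely for
illustration.

import math
import random

def energy(x):
    # Illustrative objective: a wavy function with many local minima.
    return x * x + 10 * math.sin(3 * x)

def simulated_annealing(t=10.0, t_min=1e-3, cooling_rate=0.95, steps_per_temp=50):
    s = random.uniform(-10, 10)             # initial solution
    e = energy(s)
    s_best, e_best = s, e
    while t > t_min:
        for _ in range(steps_per_temp):
            s_new = s + random.gauss(0, 1)  # generate-neighbor: small random move
            e_new = energy(s_new)
            delta = e - e_new               # positive if the neighbor is better
            if delta > 0 or math.exp(delta / t) > random.random():
                s, e = s_new, e_new         # occasionally accept worse solutions
                if e < e_best:
                    s_best, e_best = s, e
        t *= cooling_rate                   # geometric cooling (Variant 1)
    return s_best, e_best

print(simulated_annealing())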
4.8 STRIPS Planning (1971)
In 1971, at the IJCAI Conference, Nils J. Nilsson and Richard E. Fikes presented
their article “STRIPS: A New Approach to the Application of Theorem Proving to
Problem Solving” [73]. It was indeed a new approach that started a new research
trend in the area of AI known as planning and scheduling. Originally, STRIPS stands
for Stanford Research Institute Problem Solver. Figure 4.5 presents the core idea of
the STRIPS planning approach.
The planner uses a world model that consists of:
1. State structure definition - how a state in the problem space is represented.
2. Initial state - a distinguished state the method starts from.
3. Goal state - a distinguished state that is the goal. It can be represented as a
constant or as a “template” with the use of variables, so that any state that
matches the template is considered a valid goal state.
4. Operator structure definition - how operators, also referred to as actions,
are represented.
5. List of available operators - which a STRIPS planner can use.
The primary objective of STRIPS, or the planning problem in general, is to
find a sequence of operators that can transform the initial state into a goal state,
minimizing the length of this sequence. This is under the assumption that such a
sequence exists.
For modeling the world, first-order predicate logic is used in STRIPS. The state
consists of a set of predicates that hold true, e.g.:
at(robot, 10, 20)
holding(robot, nothing)
- which states that the robot is at coordinates (10, 20) and there is nothing in its hand.
As shown in Fig. 4.5, the model of actions includes four elements:
1. Name with parameters. For instance, 𝑔𝑜𝑡𝑜(𝑋,𝑌 ) is an action with the name
of “goto” and two parameters. The 𝑋,𝑌 parameters are variables, which
means that they can be substituted with specific symbols. Therefore, multiple
𝑔𝑜𝑡𝑜 actions can be instantiated - for various combinations of 𝑋 and 𝑌 . The
number of instantiations is constrained by the requirement that these variables
must also appear in the precondition definition. Consider another example:
𝑔𝑜𝑡𝑜(𝑋, 30). Here, only the first of the parameters is a variable and the second
one is a constant. The notation used here distinguishes variables (written in
upper-case) from constants (written in lower-case).
2. Preconditions. A logical formula that defines when a particular action
is available. The preconditions are tested against the given state. For the
𝑔𝑜𝑡𝑜(𝑋,𝑌 ) example, they may be defined as follows:
empty(X,Y)
at(robot, OX, OY)
existPathForRobot(OX, OY, X, Y)
The number of the 𝑔𝑜𝑡𝑜(𝑋,𝑌 ) actions will be limited to only such combina-
tions of 𝑋,𝑌 that satisfy the preconditions. Preconditions may use logical
connectives, e.g., 𝑝1 ∧ ( 𝑝2 ∨ 𝑝3) ∧ ¬𝑝4.
3. Effects (ADD). An action’s effects are resolved after applying the action. They
define which predicates should be added to the set of predicates describing the
current state: 𝑆 := 𝑆 ∪ 𝐴𝐷𝐷. They cannot contain variables (all variables
are instantiated).
4. Effects (DEL). Similar to the above, but this time: 𝑆 := 𝑆 \ 𝐷𝐸𝐿.
Figure 4.5: The objective is to find a sequence of actions from the initial to a
goal state. Each action is an instance of available action definition. The definition
includes how actions can be instantiated (arguments), what conditions must hold in
order for an action to be available in a given state, and how it alters the current state.
Please note that the state and actions model addresses the “frame problem”
introduced in Section 4.6. In their paper, Nilsson and Fikes refer to the frame problem and also
to the General Problem Solver [204] (c.f. Section 3.9). The original STRIPS approach
was implemented in LISP on a PDP-10 mainframe. To solve the actual problem it
computed differences between the states in a similar fashion to the General Problem
Solver. The authors called this a “GPS-Like” strategy. The planner would proceed
to world states that are closer to the goal state.
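A tiny Python sketch of this state-and-action representation is given below: states are
sets of ground predicates, and applying an action first checks its preconditions and then
removes the DEL list and adds the ADD list. The robot mini-domain is invented for the
example and is not taken from the original STRIPS implementation.

# States are sets of ground predicates (strings here, for brevity).
# The goto action below is an illustrative, already-instantiated operator.
state = {"at(robot,10,20)", "empty(30,40)", "holding(robot,nothing)"}

goto_30_40 = {
    "pre": {"at(robot,10,20)", "empty(30,40)"},
    "add": {"at(robot,30,40)", "empty(10,20)"},
    "del": {"at(robot,10,20)", "empty(30,40)"},
}

def applicable(state, action):
    return action["pre"] <= state            # all preconditions hold in the state

def apply_action(state, action):
    # S := (S \ DEL) ∪ ADD
    return (state - action["del"]) | action["add"]

if applicable(state, goto_30_40):
    state = apply_action(state, goto_30_40)
print(sorted(state))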
The planning problem can be viewed as a path-finding problem in an abstract
space that, generally, is not Euclidean. Therefore, researchers quickly found out
that path-finding algorithms can be used. One that is particularly frequently applied
to planning is the A* algorithm (introduced in this chapter in Sec. 4.5) and its
modifications. The formula of computing state difference can be utilized as the A*
heuristic function.
The STRIPS approach has been applied to many real-world planning prob-
lems [207]. It also served as an inspiration for the so-called Goal Oriented Action
Planning (GOAP), which has seen a lot of success in video games, e.g. the critically
acclaimed F.E.A.R. [216].
4.9 Prolog (1972-1973)
4.9.1 Background
Prolog is one of the earliest logic programming languages ever created and is
undoubtedly one of the oldest still in use. Its history is considered to have begun
in 1972 [251], however, the “Prolog I” version was released in 1973 [149]. This
development was a collaborative effort, with key contributors being the Frenchmen
Alain Colmerauer and Philippe Roussel, working at the Aix-Marseille University,
in collaboration with Robert Kowalski, an American-British computer scientist
affiliated with the University of Edinburgh.
The name Prolog was coined by Roussel’s wife as an abbreviation of Programmation
en Logique. This abbreviation works for the English translation as well: Programming
Logic. The name has been confusing because of its natural connotation as a general
paradigm rather than a specific language.
In 1983, David H. D. Warren designed a very efficient way of running Prolog programs
using a dedicated memory layout and execution model. It was named the Warren Abstract
Machine (WAM) [321, 255]. He is also the author of the first compiler (not an
interpreter) for this language. The history of Prolog has been very dynamic, full of many
versions of the language, even developed by independent groups of researchers (e.g.,
Marseilles Prolog, Edinburgh Prolog) and companies (e.g. Borland Prolog). Since
1995, there has been a standardized version called ISO Prolog. For more information, we
advise the reader to check the article “Fifty Years of Prolog and Beyond” [149].
4.9.2 Examples
Prolog is based on first-order logic. It is a type of formal logic system that extends
propositional logic by incorporating variables, quantifiers, and predicates. However,
quantifiers are implicitly present in Prolog in the form of variables and universally
quantified rules.
The structure of a Prolog program shares similarities with the characteristics of
an expert system shown in Section 4.3. A Prolog program is often referred to as
a knowledge base. The users can define facts, rules, run queries, and use special
built-in operators (which are not part of the logic system).
Let us see an example of a fact definition. It states that Socrates is a man:
man(socrates)
Facts contain a name and any number of arguments. The arguments can be
nested facts or simple terms: variables or constants. By the standard convention:
• Variables must start with an uppercase letter or the “_” sign.
• Constants do not start with an uppercase letter or the “_” sign.
Here is an example of a fact that mixes nested facts, variables and constants:
car(engine(diesel, 2000), DRIVER, red, sedan)
Now let us see an example of a rule that states that every man is mortal:
mortal(X) :- man(X)
The :- symbol denotes an implication from the right to the left. The right-hand
side describes the conditions (the body of a rule) for the rule to hold. On the left-hand
side, there is the consequence (the head of a rule), represented by a predicate.
Multiple conditions are separated by commas. The following rule:
r :- c1, c2, c3, c4
in mathematical syntax would be:
𝑐1 ∧ 𝑐2 ∧ 𝑐3 ∧ 𝑐4 =⇒ 𝑟
Queries in Prolog resemble predicates used in the definitions of rules. We can
now ask a query:
?- mortal(socrates)
which will result in true because a Prolog program will substitute X = socrates
and apply the rule mortal(X) :- man(X).
If we asked a query:
?- mortal(foo)
our program would return false.
If we asked a query with a variable:
?- mortal(X)
it would return all substitutions for X that are true, which in our case is only one:
[socrates].
If we define a rule this way:
mortal(X) :- man(Y)
then both variables are independent. They act as existential and universal quantifiers
in the following sense: if there is any fact of the name man then everything is mortal
(since X in the head is not constrained by the Y in the body).
Let us now consider a new Prolog program with the following facts:
next(0, 1)
next(1, 2)
next(2, 3)
...
next(999, 1000)
We can define a pair of recursive rules to determine whether one number is
greater than another:
greater(X, Y) :- next(Y, X)
greater(X, Y) :- X \= Y, next(Z, X), greater(Z, Y)
where X \= Y enforces the two variables to not be equal.
Please notice the results of the following queries:
?- greater(70, 10) true
?- greater(10, 70) false
?- not(greater(70, 10)) false
?- not(greater(10, 70)) true
Prolog utilizes the negation as failure approach. It means that if a program fails
to prove 𝑝 (according to the knowledge base, 𝑝 is false) then 𝑛𝑜𝑡 ( 𝑝) is assumed to
be true.
Now consider the following rule:
output(A, B, C) :- next(X, A), next(Y, B), next(Z, C)
If a query contains multiple variables, then the output will contain as many
results as there are combinations of variables that satisfy the query. Please note how
easy it is to make a query that will return a large number of results:
?- output(X, Y, Z)
----1 000 000 000 results:------
output(1, 1, 1)
output(2, 1, 1)
output(3, 1, 1)
...
output(1000, 1000, 1000)
Using the same variable name within a given scope (e.g. a rule) has a special
meaning. In technical terms: they receive the same substitutions by constants. In
general terms: they represent equivalent sets of concepts.
Observe the impact of a difference in variable naming on the result:
?- output(X, Y, X)
----1 000 000 results:------
output(1, 1, 1)
output(2, 1, 2)
output(3, 1, 3)
...
output(1000, 1000, 1000)
This mechanism is called unification, and it is similar to the JOIN operation
in databases. Multiple conditions defined within a rule’s body can be joined thanks
to the use of the same variable name.
Overall, Prolog is a powerful and expressive language, and while I have provided
only elementary examples, the language’s capabilities extend far beyond the scope
of this book. Notably, one of Prolog’s strengths lies in its declarative programming
style. It allows users to describe the relationships and rules in a problem domain
rather than specifying the control flow explicitly. This makes it more intuitive for
certain types of problem-solving. It is suitable for various tasks such as rule-based
logical reasoning, pattern matching, planning and theorem proving. In principle, it
is designed for artificial intelligence and information processing. In addition, Prolog
has served as a basis for various other languages such as Datalog [37].
Despite its advantages, it is considered inferior for conventional programming
of IT systems. It is also relatively slow when compared to high-performance
general-purpose languages like C++.
4.10 The First AI Winter (1974)
There have been two periods of time in the history of artificial intelligence that are
referred to as “AI Winters”. The first one is considered to have lasted from 1974 to
1980.
This period was marked by a cooling of enthusiasm and, to some degree, a loss
of the initial belief in AI capabilities. This sentiment was solidified by reported
disappointments from agencies such as the Defense Advanced Research Projects
Agency (DARPA) and by the failure to deliver automatic text translation. Earlier, there was
a widespread belief that AI might attain human-like cognitive capabilities, leading to
a significant amount of optimism in early research projects like the General Problem
Solver, perceptrons, or the utilization of LISP.
The second aspect of the “AI Winter” was the sudden decrease in funding for AI
projects from research agencies as well as private investors.
A disclaimer: the author of this book personally believes that, despite the negative
sentiment, many interesting contributions emerged during this period. They will be
presented in the next chapter.
Figure 4.6: AI winter. An AI-generated image by the model: DALL-E.
5. Years 1975 - 1985
5.1 The Recognition of Genetic Algorithms (culmination in
1975)
5.1.1 Background
On 24 November 1859, Charles Darwin, a renowned English naturalist, geologist,
and biologist, published his seminal book titled “On the Origin of Species" [52].
The full title was: “On the Origin of Species by Means of Natural Selection, or
the Preservation of Favoured Races in the Struggle for Life”. This groundbreaking
work had an enormous and lasting impact on the fields of biology, paleontology, and
even broader areas of science, philosophy, sociology, and society. It also influenced
artificial intelligence.
Genetic algorithms (GAs) are inspired by the biological evolution of species.
Naturally, GAs serve as a mathematical model aimed at solving optimization problems
rather than being an accurate simulation of evolution. They borrow key concepts
from the theory of evolution (original Darwin’s theory with modern updates) and
adapt them as parts of the algorithm. The key concepts include:
1. Survival of the Fittest. Individuals compete in an environment, where
better-suited ones survive and have the opportunity to produce offspring. The
fitness function, which will soon be introduced, was inspired by the survival of
the fittest.
2. Population and Variation. A large and diverse pool of genes is beneficial for
evolution. GAs maintain a population of potential solutions. Their diversity is
very important, especially in the initial phases of the algorithm.
3. Adaptation. Whole species adapt to be better suited to the environment.
This can be the fundamental motivation for genetic algorithms: adaptation
represents optimization. The population will adapt to better solve the optimized
problem.
4. Mutation. Similar to species in nature, individuals in GA undergo mutation.
5. Reproduction. Better-fitted individuals have a higher chance of producing
offspring. The offspring inherits traits of its parents.
6. Gene Encoding. The discovery of DNA occurred after Darwin’s death.
However, the encoding of individuals in genetic algorithms, which is indirectly
reflected in the various traits they possess, is inspired by DNA strings.
I debated where to place genetic algorithms in this book. Concepts for computational
approaches inspired by natural evolution had surfaced before 1975. However, genetic
algorithms as a cohesive and effective technique crystallized around 1975. In
particular, Holland’s book [120] structured information and significantly increased
the popularity of genetic algorithms. This realization prompted my decision to
dedicate a longer section to them, providing a summary from multiple contributions.
Here is a brief summary of notable works published up to 1975:
• 1962: “Outline for a Logical Theory of Adaptive Systems” [122] by John
Holland.
• 1965: “Cybernetic Solution Path of an Experimental Problem” [235] by Ingo
Rechenberg.
• 1965: “Kybernetische Evolution als Strategie der Experimentellen Forschung
in der Stromungstechnik” [265] by Hans-Paul Schwefel.
• 1966: “Artificial Intelligence through Simulated Evolution” [78] by Lawrence
Fogel, Alvin Owens, and Michael Walsh.
• 1970: “Computer Models in Genetics” [81] by Alex Fraser and Donald
Burnell.
• 1973: “Genetic Algorithms and the Optimal Allocation of Trials” [121] by
John Holland.
• 1973: “Computer Simulation in Genetics” [50] by Jack Crosby.
• 1975: “Adaptation in Natural and Artificial Systems” [120] by John Holland.
5.1.2 The Method
The algorithm maintains a population of 𝑁 individuals. The population size is an
important hyper-parameter of the method. Typically, 𝑁 is constant but it is not a
formal requirement as there have been implementations with dynamic population
sizes. Each individual encodes a candidate solution to the problem. At the start
of the algorithm, the initial population is prepared, which is often done by random
sampling of the solution space.
Some important notions related to individuals are:
• Genotype: represents the encoding of a candidate solution to the problem. In
genetic algorithms, the encoding is traditionally represented by a binary string.
For an example, please refer to Figure 5.2. Evolutionary algorithms (EA) are
a broader class of approaches inspired by natural evolution which incorporate
several methods. They also do not require the encoding to be binary. The
genotype is used with genetic operators.
Figure 5.1: The scheme of a genetic algorithm.
• Phenotype: represents the observable traits. It is the manifestation of an
individual in the actual environment. There exist mappings:
𝑒𝑛𝑐𝑜𝑑𝑒(phenotype) = 𝑔𝑒𝑛𝑜𝑡𝑦 𝑝𝑒
𝑑𝑒𝑐𝑜𝑑𝑒(genotype) = 𝑝ℎ𝑒𝑛𝑜𝑡𝑦 𝑝𝑒
• Fitness function returns a numerical assessment of the quality of a solution
(which is an analogy to how well an organism is suited to the environment).
Typically, it is a function from a phenotype to a real number:
𝑓𝑖𝑡𝑛𝑒𝑠𝑠 : phenotype → ℝ (5.1)
However, in multi-objective GAs, the function is sometimes defined to return
vectors of numbers. The fitness function can be computed explicitly based
on the structure of the encoded solution or defined in some relative way, e.g.,
through trials in an environment.
A scheme of a genetic algorithm procedure is presented in Figure 5.1. The procedure
is iterative. An iteration in GA is called an epoch. Each one consists of: computing
the fitness value of each individual, checking if the defined stop condition occurs,
mutation, recombination (crossover), and selection of individuals to advance to the
new generation.
The stop condition is usually defined as one or more of the following: (1)
the algorithm reaches a maximum number of epochs e.g. 100; (2) a solution of
acceptable quality has been found; or (3) convergence: where most individuals in
the population have become the same.
Mutation is the first of the genetic operators. It is responsible for the exploration aspect of
GAs. Typically with a low probability, each individual may undergo a small change
(alteration) in its encoding. Usually, it is required that after the change, the solution
remains legal. However, there have been approaches proposed that either allow
for temporarily illegal solutions or include repair procedures. Figure 5.2 presents
examples of mutation operators.
Recombination (crossover) is the second of genetic operators. Unlike mutation,
it is a binary operator. It takes two individuals (parents) as input and typically
produces two new individuals (children). If a particular operation produces only one
child, then usually, there are two times more recombinations performed to maintain
the population size. The idea is to exploit the best solutions by combining their
parts. Therefore, parents with higher fitness have a greater chance of being selected
for recombination. Figure 5.3 shows three examples of basic crossover operations.
In the literature, a rich variety of other operations has been proposed, and
it is possible to handcraft a custom one for a particular problem. For example,
genetic and evolutionary algorithms have been often applied to logistics tasks such
as variants of the vehicle routing problem (VRP), which generalize the travelling
salesman problem (TSP). In those problems, a genotype typically encodes a path or
a set of paths, i.e., ordered lists of vertices. Herein, the idea is to design crossover
operations in such a way as to make use of the sub-paths that the parents encode. For
instance, the Order-One Crossover, also known as the Davis’ Order Crossover (OX1),
preserves the ordering of vertices from one parent within a fragment of the path of
the second parent. This is repeated with the parents’ roles switched to produce the
second child. Another example is the tree-based crossover for problems represented
as decision trees. Herein, branches under a selected node, which is shared among
parents, are exchanged.
Figure 5.2: Three examples of mutation operators. 1. and 2. randomly change one
and multiple elements, respectively. 3. performs a shuffle by half of the length of
the encoding. In general, such a shuffle could be by an arbitrary length.
Figure 5.3: Three examples of crossover operators.
Selection - the goal of this phase is to form a new population for the next
iteration (new generation of individuals). There are many selection methods, albeit
in all of them, the probability of an individual being selected is correlated with its
fitness value. Examples of selection mechanisms are: fitness-proportional sampling,
pseudo-roulette selection, ranked selection (top 𝐾 individuals), ranked-proportional
sampling, and tournament selection.
Some implementations additionally use the so-called elitism, wherein some
portion of the best solutions (e.g. 5%) advance to the next generation directly and
unchanged, i.e., without being changed by the mutation and crossover operators. The
idea is to make sure that the top-performing individuals are not lost in the process.
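Putting the pieces together, here is a compact Python sketch of a genetic algorithm with
binary encoding, bit-flip mutation, one-point crossover, tournament selection, and a small
elite. The toy fitness function (counting ones) and all hyper-parameter values are
assumptions made only for illustration.

import random

GENES, POP, EPOCHS = 20, 30, 50

def fitness(genotype):
    # Toy objective ("one-max"): the more ones, the fitter the individual.
    return sum(genotype)

def mutate(genotype, rate=0.02):
    return [1 - g if random.random() < rate else g for g in genotype]

def crossover(a, b):
    cut = random.randint(1, GENES - 1)       # one-point crossover
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def tournament(population, k=3):
    return max(random.sample(population, k), key=fitness)

population = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for epoch in range(EPOCHS):
    population.sort(key=fitness, reverse=True)
    new_population = population[:2]          # elitism: keep the two best unchanged
    while len(new_population) < POP:
        child1, child2 = crossover(tournament(population), tournament(population))
        new_population += [mutate(child1), mutate(child2)]
    population = new_population[:POP]

best = max(population, key=fitness)
print(fitness(best), best)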
5.1.3 Summary
Genetic algorithms are an example of meta-heuristic approaches. Their success lies
in their versatility and effectiveness in solving a wide range of optimization and
search problems across various domains. They are best suited for discrete global
optimization tasks. The discreteness stems from the discrete nature of the genetic
encoding as well as recombination operators. GAs feature built-in mechanisms to
escape local optima, such as mutation and the fact that a diverse population of
individuals (until it converges), including potentially sub-optimal ones, is maintained.
The approach has become so prominent that dedicated events, such as The Genetic
and Evolutionary Computation Conference (GECCO), have started. Finally, it is
worth noticing that, unlike some popular methods such as neural networks, genetic
algorithms do not require the optimized function to be differentiable.
5.2 Hierarchical Partial-Order Planning (1975)
Planning as a sub-field of AI has been introduced in Section 4.8. Arguably, the two
most important extensions proposed over a basic planning approach were:
1. Partial Ordering - allows for the representation of relationships between
actions, determining the order in which they must occur. For example, certain
groups of actions might be executed in parallel, while some actions must be
performed after others.
2. Hierarchization - actions can be expressed at various levels of generality. The
planner might first create a coarse-grained plan consisting of general actions
and then plan a specific execution of those general actions.
Both of the above extensions were introduced in years 1975-1976. In 1975, Earl
Sacerdoti [253] of Department of Computer Science, Stanford University, published
a technical report titled “A Structure for Plans and Behavior”. In 1976, Austin Tate of
Department of Artificial Intelligence, University of Edinburgh, published a technical
report titled “Project Planning Using a Hierarchic Non-Linear Planner” [298].
A novel idea of a partially-ordered network of actions (in short, referred to as
procedural nets) was proposed. Austin Tate demonstrated it using a house-building
project as an example. Below is a list of the available actions in this planning
environment along with their network constraints. The network constraints represent
temporal relationships rather than just preconditions for the action being legal (which
can also be defined). An example of a network constraint for the action “build a roof”
might be “build walls” first. An example of a local precondition might be “X square
meters of roof material is available”.
The list follows the format: action description, [immediate predecessors], [time in
days]; in the predecessor lists, the letters a–v refer to items 1–22, respectively.
Figure 5.4 shows a partially-ordered network for this project:
1. Excavate, Pour Footers, [], [4]
2. Pour Concrete Foundations, [a], [2]
3. Erect Frame and Roof, [b], [4]
4. Lay Brickwork, [c], [6]
5. Install Drains, [b], [1]
6. Pour Basement Floor, [e], [2]
7. Install Rough Plumbing, [e], [3]
8. Install Rough Wiring, [c], [2]
9. Install Air Conditioning, [c,f], [4]
10. Fasten Plaster and Plaster Board, [g,h,i], [10]
11. Lay Finished Flooring, [j], [3]
12. Install Kitchen Equipment, [k], [1]
13. Install Finished Plumbing, [k], [2]
14. Finish Carpentry, [k], [3]
15. Finish Roofing and Flashing, [d], [2]
16. Fasten Gutters and Downspouts, [o], [1]
17. Lay Storm Drains, [b], [1]
18. Sand and Varnish Floors, [n,s], [2]
19. Paint, [l,m], [3]
20. Finish Electrical Work, [s], [1]
21. Finish Grading, [p,q], [2]
22. Pour Walks, and Landscape, [u], [5]
Figure 5.4: Partially-ordered network for the house construction project shown in
this section as an example.
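As a rough illustration of what such a partially-ordered network encodes, the Python
sketch below represents a fragment of the house-building tasks as predecessor sets and
derives one valid total order with a topological sort. The chosen subset and the task
names are simplifications for the example; a real planner does far more than linearize
the network.

from graphlib import TopologicalSorter    # available in Python 3.9+

# A fragment of the house-building network: task -> set of immediate predecessors.
network = {
    "excavate":        set(),
    "foundations":     {"excavate"},
    "frame_and_roof":  {"foundations"},
    "install_drains":  {"foundations"},
    "rough_wiring":    {"frame_and_roof"},
    "basement_floor":  {"install_drains"},
}

# Any order returned by the sorter respects every predecessor constraint;
# tasks that are not ordered relative to each other could run in parallel.
print(list(TopologicalSorter(network).static_order()))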
In [298], a relatively detailed formalism for hierarchical plans was specified. The
main idea was to allow certain nodes in the network to be dynamically expanded
during planning. Each node was assigned a LEVEL, which was a sort of priority
number specifying which nodes should be expanded first. Before expansion, a node
already represents a certain task with a goal it achieves. The expansion results in a
more detailed way of achieving this task (see Fig. 5.5). The planning algorithm was
implemented using a heuristic graph search. It incorporated a number of pattern
matching heuristics.
Figure 5.5: Hierarchical planning.
5.3 Alpha-Beta Pruning (1975)
Alpha-Beta Pruning is an extension of the Minimax algorithm (see Section 2.5)
that reduces the number of nodes visited during the search. Although the
fundamental ideas behind Alpha-Beta had already appeared earlier, notably in Arthur
Samuel’s Checkers program (c.f. Section 3.7), the Alpha-Beta pruning algorithm as
it is known today was proposed by Donald Knuth in 1975 [147].
Let MAX and MIN describe the score maximizing and minimizing players,
respectively. The idea is to keep track of two values: alpha and beta, during the
search. Alpha denotes the minimum value MAX is already guaranteed to get in the
currently considered sub-tree. Therefore, if a lower value is found under a
MIN node in the same sub-tree, the whole sub-tree under this MIN node can be
discarded, as there is no possibility of improving the already known MAX score.
Analogously, beta denotes the maximum score the MIN player is guaranteed to get.
Like vanilla Minimax, Alpha-Beta pruning can be used to search the tree entirely
without any heuristics. However, the algorithm is typically used with a heuristic
evaluation function (c.f. Eq. 3.9) and a fixed depth that serves as the search horizon.
Let payoff(𝑠𝑡𝑎𝑡𝑒) denote a function that returns the score from the perspective of the
MAX player: be it an originally game-defined score or a result of heuristic evaluation.
The pseudo-code of the Alpha-Beta search is presented in Algorithm 6. With a fixed
depth search horizon equal to 𝑑𝑒 𝑝𝑡ℎ𝑚𝑎𝑥 , the presented function is called as:
𝐴𝑙 𝑝ℎ𝑎𝐵𝑒𝑡𝑎(𝑟𝑜𝑜𝑡, −∞, ∞, 𝑑𝑒 𝑝𝑡ℎ𝑚𝑎𝑥 , 𝑀 𝐴𝑋)
Without the depth limit, it can be called as:
𝐴𝑙 𝑝ℎ𝑎𝐵𝑒𝑡𝑎(𝑟𝑜𝑜𝑡, −∞, ∞, ∞, 𝑀 𝐴𝑋)
Algorithm 6: AlphaBeta(node, α, β, depth, player)
if node is terminal or depth = 0 then
    return payoff(node)
if player is MAX then
    V ← −∞ ;
    for child ∈ children(node) do
        V ← max(V, AlphaBeta(child, α, β, depth − 1, MIN)) ;
        if V > β then
            break ;
        α ← max(V, α) ;
    return V
if player is MIN then
    V ← ∞ ;
    for child ∈ children(node) do
        V ← min(V, AlphaBeta(child, α, β, depth − 1, MAX)) ;
        if V < α then
            break ;
        β ← min(V, β) ;
    return V
It is worth noticing that Alpha-Beta pruning is not an approximate algorithm.
It always computes the exact minimax score for the root node (the current state in
the game). However, due to optimizations, certain scores computed deeper in the
tree may differ between the Minimax algorithm and Alpha-Beta pruning, albeit they
do not affect the optimal strategies of the players. As an example, please compare
game trees searched by Minimax (Fig. 5.6) and the Alpha-Beta pruning (Fig. 5.7).
In the first figure, there is a node with −5 score, whereas on the second one, the
same node’s score is 0. This is because the branch that returned −5 has not been
considered.
Minimax has an exponential computational complexity, equal to 𝑂(𝑏^𝑑), where
𝑏 is the branching factor of the game and 𝑑 is the average depth of the tree (the length
of the game). In a pessimistic case, Alpha-Beta pruning has the same complexity
due to the fact that the scores might be distributed in such a way that no cut-off
is possible. However, in an optimistic case, its computational complexity is equal
to 𝑂(√(𝑏^𝑑)) = 𝑂(𝑏^(𝑑/2)). Comparing the algorithms using the game tree presented
in Fig. 5.6 and Fig. 5.7 shows that the number of visited nodes decreased from 29 to 20
in favor of Alpha-Beta pruning. In larger trees, discarding whole branches may
Figure 5.6: Game tree search example with minimax scores. Nodes with values
back-propagated are distinguished by a colored background and border.
save a lot of calculations. In such a small tree as shown in the example, the most
expensive operations could be the heuristic function calls that happen in the leaf
nodes. Alpha-Beta pruning needed to visit only 9 out of 15 leaves.
The Alpha-Beta pruning algorithm saw a lot of usage and success as a clean
improvement over the vanilla Minimax method. It also received various subsequent
heuristic improvements aimed at increasing its expected effectiveness in a typical
case. One such example is the MTD(f) [310] algorithm that narrows the search
window even further.
Figure 5.7: This is the same game tree as shown in Fig. 5.6 but this time with
Alpha-Beta pruning applied. Empty nodes denote the ones that have never been
visited. Three branches were successfully pruned: 𝛼1 , 𝛼2 , 𝛼3 .
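For completeness, here is a small runnable Python version of the same procedure,
evaluated on a hand-made game tree; the tree and its payoffs are arbitrary and chosen
only so that at least one cut-off occurs.

import math

def alphabeta(node, alpha, beta, maximizing):
    # Leaves carry payoffs from MAX's perspective; internal nodes are lists of children.
    if not isinstance(node, list):
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            if value >= beta:
                break                     # beta cut-off: MIN will never allow this branch
            alpha = max(alpha, value)
    else:
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            if value <= alpha:
                break                     # alpha cut-off: MAX already has a better option
            beta = min(beta, value)
    return value

tree = [[3, 5], [6, 9], [1, 2]]           # MAX chooses among three MIN nodes
print(alphabeta(tree, -math.inf, math.inf, True))   # -> 6, with a cut-off in the last branch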
5.4 Harpy Speech Recognition System (1976)
Harpy [164] was a pioneering speech recognition system developed at the Artificial
Intelligence Research Lab of the Department of Computer Science at Carnegie Mellon
University in 1976. The project was initiated in 1971 and received considerable
funding support from DARPA. Despite not being the first, it is widely regarded as the first
successful system of its kind. Preceding Harpy, two systems named The Hearsay-I
(1974) and The Dragon (1975) were developed at the same university. Harpy not only
combined the best features of these earlier approaches but also introduced several
enhanced features of its own. They included a novel search algorithm and knowledge
representation. The team leader of the Harpy project was Dabbala Rajagopal
“Raj” Reddy, an Indian-American computer scientist. In 1994, he was awarded the
prestigious Turing Award for his significant contributions, among which Harpy was
noteworthy. IBM and Stanford University helped to construct the device. In terms
of real-time performance, the goal was not to exceed 300 million computer
instructions per second (MIPSS). Ultimately, Harpy used only 28 MIPSS. This is not
to be confused with FLOPS, which specifically denote floating-point operations per second.
In 1977, Harpy could understand 1011 words with recognition accuracy ranging
from 92.2% to 97.4% on test datasets [334]. The performance was measured in a
laboratory setting. It was not ready for a real-world application with issues such as
background noise, multiple people speaking at the same time, people speaking very
fast etc. It could understand full sentences (containing a subset of these 1011 words)
with accuracy ranging from 90% to 95%. It also featured a dynamic lexicon, allowing
it to adapt and learn from user interactions over time. However, it recognized only
the words and not their meaning or context. For instance, it could not distinguish
homonyms or homophones. Harpy’s performance was influenced by the speaker’s
voice and style. Achieving speaker independence, where the system can accurately
recognize speech from various speakers, was a challenging aspect that Harpy did
not fully address. While it may not have been perfect or as advanced as modern
speech recognition systems, Harpy laid the groundwork for subsequent research and
development in the field.
Harpy made use of preprocessing. The input audio signal was first digitized into
the so-called acoustic units of varying length depending on the acoustic properties.
For each unit, it extracted features using various heuristics. Then it performed a
beam search to find sequences of features that match available patterns - specific
words, sounds, silence and phrases. The knowledge was represented by:
1. Word networks - that stored various sequences of the available words.
2. Pronunciation networks - that represented various ways in which words can
be pronounced. Such networks could be dynamically “embedded” instead of
words when specific speech recognition task was running.
3. Word boundary rules - heuristics helping in cases where pronunciation differs
because of the way how words are connected. For instance, “tell me about” is
often pronounced as “tell me bout” or “let me tell you” as “leme tell ya”.
The beam search is a heuristic search algorithm that explores a graph or search space,
focusing on the most promising paths while discarding less likely ones. The key is
that it maintains a number of top candidates; otherwise, it would reduce to a greedy
best-first search. The number of examined subsequent nodes is not constant and
depends on their heuristic scores. Paths with scores below a certain threshold are
discarded, as well as those with scores too far from the best candidate. The algorithm
is then able to examine near-miss alternatives. The heuristics used in Harpy were
based on the numerical accuracy of the matches between the input signal and the
stored ones.
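The following Python sketch shows the generic skeleton of a beam search over partial
paths. It keeps a fixed-width beam for simplicity, whereas Harpy pruned relative to the
best candidate; the successor function and the scoring below are toy assumptions with
no relation to Harpy’s actual acoustic models.

def beam_search(start, successors, score, beam_width=3, depth=4):
    # Each beam entry is a partial path; only the best `beam_width` survive each step.
    beam = [[start]]
    for _ in range(depth):
        candidates = []
        for path in beam:
            for nxt in successors(path[-1]):
                candidates.append(path + [nxt])
        # keep the most promising partial paths, discard the rest
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)

# Toy example: "words" are letters, every letter may follow any other,
# and the score simply counts vowels along the path.
letters = "abcde"
successors = lambda last: list(letters)
score = lambda path: sum(ch in "ae" for ch in path)
print(beam_search("a", successors, score))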
5.5 The EM Algorithm and Mixture of Gaussians (1977)
In 1977, Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin published “Max-
imum Likelihood from Incomplete Data via the EM Algorithm” [58]. The EM
algorithm has become an extremely versatile tool in statistics and machine learning.
For the remainder of this section, let 𝒳 denote observed data and 𝒴 denote latent
data from the same process. In general, the process can be described by an arbitrary
data source. Symbols 𝑥, 𝑦 will be used for samples from 𝒳 and 𝒴, respectively.
The likelihood (see the analogy to Bayes’ Theorem in Section 2.2) is the
probability of a data sample 𝑥 given that the parameters of a model are 𝜃. It is
expressed as:
L (𝜃|𝑥) or L (𝜃; 𝑥)
The EM algorithm is used to find such parameters of a model – 𝜃̂_opt – that maximize
the likelihood of the observed data:

𝜃̂_opt = arg max_𝜃 L(𝜃 | 𝒳) (5.2)
This objective is called Maximum Likelihood Estimation (MLE), and the EM
algorithm is a paradigmatic example of this class of approaches.
In terms of probabilities, the goal is to find 𝜃 that maximizes:

𝑝(𝒳 | 𝜃) = ∫ 𝑝(𝒳, 𝑦 | 𝜃) 𝑑𝑦 (5.3)
The EM algorithm is iterative. Typically, the stop condition is defined as
convergence with some 𝜖. The algorithm has been proven to converge given enough
iterations. Each iteration consists of two steps:
1. Expectation Step (E-Step). In principle, the goal is to compute the expected
value of the log-likelihood given all data (observed and latent) as well as the
model parameters:

𝑄(𝜃 | 𝜃_𝑡) = ∑_𝑦 𝑃(𝑦 | 𝒳, 𝜃_𝑡) log [ 𝑃(𝒳 | 𝑦, 𝜃) 𝑃(𝑦 | 𝜃) ]
2. Maximization Step (M-Step). The new estimation of the parameters is
obtained by maximizing the computed expected value:

𝜃_{𝑡+1} = arg max_𝜃 𝑄(𝜃 | 𝜃_𝑡)
The EM algorithm can be used for various tasks such as clustering, missing data
augmentation, and probability density estimation, and it can be utilized as a
step in training certain machine learning models such as Generative Adversarial
Networks [161] (c.f. Section 10.1).
This method can also be used for data compression:
1. With Gaussian mixture models. They can be employed to represent the
probability distribution of the data as a mixture of several Gaussian components.
The EM algorithm helps in estimating the parameters of these components,
which can then be used for compression.
2. Sparse Coding and Dictionary Learning. EM can be used in sparse coding
and dictionary learning, where the goal is to represent data as a sparse linear
combination of basis elements (atoms) from a learned dictionary. This can be
applied in image compression and other signal processing tasks.
3. With Hidden Markov Models. Hidden Markov Models, which involve latent
states, can be trained using the EM algorithm. HMMs are used in various
applications, including speech and audio compression. The EM algorithm
helps estimate transition probabilities and emission probabilities, contributing
to effective compression algorithms.
The EM method has seen applications in many areas such as anomaly detection
and medical imaging. In the latter area, examples are PET (Positron Emission
Tomography) and MRI (Magnetic Resonance Imaging) image reconstruction. The
EM algorithm helps generative models handle situations where there are unobserved
or latent variables, contributing to the training of more sophisticated models that
capture complex data distributions.
5.5.1 Gaussian Mixture Models
The Central Limit Theorem (CLT) states that the mean of a sufficiently long sequence of independent random variables, each with expected value 𝜇 and finite variance 𝜎², converges to the normal (Gaussian) distribution with mean 𝜇 and variance 𝜎²/𝑛, where 𝑛 is the sequence length.
The CLT was a mathematical inspiration for researchers to try to approximate any data distribution by a sum (or a mean) of a finite number of Gaussian distributions, even if the CLT assumptions were not met.
A Gaussian mixture model assumes that all data are generated by 𝑘 Gaussian distributions, denoted by the symbol 𝑁. Let 𝑥𝑖 denote the 𝑖-th data sample (it can be a vector). Its probability is expressed as:

p(x_i) = P(x_i \mid \theta) = \sum_{j=1}^{k} w_j \, N(x_i \mid \mu_j, \Sigma_j) \quad (5.4)
- where Σ𝑗 is the covariance matrix of the 𝑗-th component and the remaining symbols have already been defined. For a one-dimensional case, Σ𝑗 reduces to the scalar variance 𝜎².
The EM algorithm can be utilized to estimate the parameters 𝜃 of the model. They include the parameters from the right-hand side of the equation: the weights, means, and covariances.
As models, Gaussian mixtures can be utilized for tasks such as probability
density estimation, clustering, augmentation (with generated data from plausible
distributions) and forecasting. They are particularly useful for multi-modal data as
each modality can be modeled using a Gaussian distribution.
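The following Python sketch puts Equations 5.2-5.4 together for a one-dimensional mixture of two Gaussians. It is a minimal illustration, not a production implementation: the initialization scheme, the stopping rule based on the log-likelihood change, and the toy data are assumptions made for the example.

import numpy as np

def em_gmm_1d(x, k=2, n_iter=100, tol=1e-6, seed=0):
    # Fit a 1-D Gaussian mixture with k components using EM (illustrative sketch).
    rng = np.random.default_rng(seed)
    n = len(x)
    mu = rng.choice(x, size=k, replace=False)     # random means from the data
    var = np.full(k, x.var())                     # start with the global variance
    w = np.full(k, 1.0 / k)                       # uniform mixture weights
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(component j | x_i, current parameters).
        dens = (w / np.sqrt(2 * np.pi * var)) * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        # Stop when the log-likelihood (under the parameters used in this E-step) barely improves.
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return w, mu, var

# Toy data drawn from two Gaussians; the fitted means should land near -2 and 3.
data = np.concatenate([np.random.normal(-2, 1, 500), np.random.normal(3, 0.5, 500)])
print(em_gmm_1d(data))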
5.6 Stanford Cart - Predecessor of Autonomous Vehicles (1979
breakthr.)
The Stanford Cart was a long-running project aimed at developing navigation mechanisms for a cart - a machine that resembled a lunar rover [199]. There have been many leaders of this project, and the robot itself has undergone multiple modifications and upgrades. The project originated in 1960 with James L. Adams, a graduate student at Stanford who had been working on a NASA contract. The motivation was to be able to send such a vehicle to the Moon or possibly to other celestial objects. The cart had small bicycle wheels and an electric powertrain. It was originally barely functional and highly inaccurate. To get from point A to point B, it had to maintain a low speed of 0.3 kph (about 0.2 mph) due to certain technical deficiencies. A photograph of the cart is available at the Artificial
Intelligence and Robotics exhibition at the Computer History Museum [240].
From 1960 until 1979, the cart was remotely controlled over ultra high frequency (UHF) radio. This was far from ideal given the motivation of using the robot to explore outer space. Even for a body as close as the Moon, it takes about 2.6 seconds for light to travel to Earth and back.
In 1979, a breakthrough was achieved by the person then responsible for the cart – Hans Moravec [199] – during his PhD research. The robot successfully traversed a room full of obstacles without human intervention. To sense the surroundings, the Stanford Cart utilized a TV camera. The captured image was converted to a digital one at a 256 × 256 resolution. Series of several consecutive frames were captured. A computer program called the “camera solver” determined the 3D positions of regions that were matched together across multiple frames. This step involved numerous image correction techniques, algebraic equations, and feature extraction (e.g., using the Fourier transform).
Then, a program called the “Navigator” planned the movement to the destination while avoiding obstacles. The volumes of obstacles were represented as bounding spheres. Naturally, this was just an approximation that could result in open space being incorrectly classified as an obstacle. The path was assumed to be piece-wise linear, i.e., containing segments of straight lines. First, the simplest case was checked: whether it was possible to drive in a straight line while avoiding obstacles. If this was not possible,
Figure 5.8: Consider a path-finding problem with the shortest path passing through
segments 𝑣1, 𝑣2, approximated by spheres. The lines represent tangent paths between
the segments.
then tangential lines (to each obstacle represented as a sphere) were considered, as shown in Figure 5.8. The possibilities were arranged into a graph of positions and connections between them. The path-finding algorithm ran in 𝑂(𝑛²) time.
As of 2024, autonomous vehicles and autonomous driving are among the cutting-edge areas of AI research, and the problem is still far from solved. Because the Stanford Cart navigated autonomously, it can be considered one of the first autonomous vehicles, though it should be noted that this achievement had many limitations. For instance:
• Stanford Cart required calibration.
• The movement was very slow due to two reasons. Firstly, faster movement led to over-steering and loss of control. Secondly, the cart had to stop and take pictures. The pictures had to be taken at precise intervals during the movement (every 6.5 cm). At each stop, the camera would slide and take several pictures.
• The environment was very simple: a room with a flat floor in a research
laboratory.
• The responses of the robot were not precise. Therefore, an internal world
model and the reality could become out of sync.
• Edges of certain angles were incorrectly handled by the image processing
program.
• It failed to identify very small or smooth objects as obstacles.
5.7 LISP Machines (1979 ∼ 1980)
In 1980, the most popular general purpose programming languages were C, For-
tran, COBOL, Assembly, BASIC, and Ada. Less popular were the two programming languages designed specifically for AI - Prolog and LISP. Both of them were primarily used for research and lacked high-performance hardware support.
The years 1979 and 1980 were particularly game-changing for LISP due to the introduction of LISP machines. These were specialized computers for creating and running LISP programs. First and foremost, the hardware architecture was optimized to
Figure 5.9: Symbolics 3620 Lisp machine and LMI Lambda Lisp ma-
chine. Shared publicly under the Attribution-ShareAlike (CC BY-2.0) license.
(https://creativecommons.org/licenses/by-sa/2.0/).
carry out symbolic manipulations (standard computer hardware was optimized for numeric computations). Examples of such operations are pattern matching, variable unification, and symbolic search. Memory management and garbage collection were also optimized for large symbol-based operations on the LISP machine hardware. The machines allowed for dynamic typing, where the type of a variable is not known upfront.
The machines came with dedicated LISP compilers, an integrated development
environment (IDE) and debugging tools for this language. Many of them also
featured a graphical user interface (GUI).
LISP machines were developed by a few companies. To name a few:
• Lisp Machines, Inc. Founded by Richard Greenblatt of the MIT AI Lab. The
first model was called LMI-CADR. The next model was called LMI-LAMBDA.
• Symbolics. Most of the initial employees of the company were hackers from
the MIT AI Lab. The first model was called LM-2. Since 1983 until 1986, the
company released models: 3600, 3670, 3640, 3675, 3645, 3610, 3620, 3650.
• Xerox. One of the oldest and most renowned companies in computer-related technology. Xerox’s LISP machines were built to run InterLisp, a dialect that was later largely superseded by Common Lisp (a standardized version of the language). InterLisp featured block compilation, allowing functions to be compiled as a single block.
LISP machines received a lot of attention between 1980 and 1987. After 1987,
their popularity declined. While they are no longer prevalent today, their historical
significance lies in their role as specialized platforms that supported the development
of AI applications and symbolic processing during a period when LISP was a
dominant language in those domains.
5.8 The First AAAI and ICML Conferences (1980)
The first AAAI conference was held on August 18–21, 1980, at Stanford University,
Stanford, California. AAAI stands for the Association for the Advancement of
Artificial Intelligence. Since its inception, the conference has been held annually
in different cities in the United States of America or Canada. Over the years,
the AAAI conference has grown into one of the largest and most esteemed ones
drawing leading researchers, students, and influential companies in the field of
AI. Numerous groundbreaking innovations have been unveiled at this conference,
marking significant milestones in AI. The AAAI conference is a general forum for
AI topics rather than focusing on a specific area.
The first ICML conference was held in 1980 at Carnegie Mellon University,
Pittsburgh, Pennsylvania. ICML stands for the International Conference on Machine
Learning. It is organized by the International Machine Learning Society (IMLS).
It has been held annually at various locations in North America, Europe, Asia and
Australia. In contrast to AAAI, ICML has a narrower focus, specifically on machine
learning research. However, the field of machine learning has evolved into a vast
domain with a diverse range of approaches.
The emergence and growth of AI-focused conferences significantly contribute
to the development and maturity of this field. Apart from AAAI and ICML, some
other top scientific conferences in AI (listed by founding year) are:
• IJCAI (1969): International Joint Conference on Artificial Intelligence, held
every 2 years.
• CVPR (1983): Conference on Computer Vision and Pattern Recognition.
• AISTATS (1985): Conference on Artificial Intelligence and Statistics.
• NeurIPS (1987): The Annual Conference on Neural Information Processing
Systems.
• ICCV (1987): IEEE International Conference on Computer Vision.
• ACM SIGKDD (or just KDD) (1995): Association for Computing Machinery’s
Special Interest Group on Knowledge Discovery and Data Mining.
• ICLR (2013): International Conference on Learning Representations.
5.9 Backpropagation in Multi-Layer Neural Networks (1982)
Backpropagation is one of the most important algorithms in machine learning and
AI, in general. It allows for training neural networks through back-propagating errors.
It is difficult to attribute the invention of backpropagation to just one researcher. The
ideas were emerging independently, alternative implementation attempts existed and
the area of neural networks was constantly evolving. I decided to distinguish the
contribution of Paul Werbos in the article “Applications of Advances in Nonlinear
Sensitivity Analysis” [324] from 1982. It was the first implementation of backpropa-
gation in the way it is applied today.
Other notable inventions related to the backpropagation algorithm were:
• 1676: A German mathematician Gottfried Wilhelm Leibniz invented the chain
rule, which makes it possible to compute the derivative of a composite function from the derivatives of its simpler parts. Let ℎ be a composite function of 𝑓 and 𝑔:
ℎ = 𝑓 ◦ 𝑔 or equivalently ℎ(𝑥) ≡ 𝑓 (𝑔(𝑥))
The derivative of ℎ can be computed in the following way:
ℎ′ (𝑥) = ( 𝑓 ◦ 𝑔) ′ = ( 𝑓 ′ ◦ 𝑔) ∗ 𝑔′ (5.5)
Using the second notation, the same expression is written as:
ℎ′ (𝑥) = 𝑓 ′ (𝑔(𝑥)) ∗ 𝑔′ (𝑥) (5.6)
The chain rule serves as the foundation for backpropagation.
• 1847: The idea of optimizing differentiable functions by traversing along
the direction of negative gradient was presented. It is commonly credited
to Augustin-Louis Cauchy, a French mathematician. Similar ideas were
published later [102]. This approach is known as the Gradient Descent.
• 1960: Henry Kelley publishes an article titled “Gradient Theory of Optimal
Flight Paths” [139]. He optimized systems of equations governing flight paths.
The proposed technique was similar to backpropagation in artificial neural
networks despite not being used for this task.
• 1989: David Rumelhart, Geoffrey Hinton and Ronald Williams publish an
article titled “Learning representations by back-propagating errors” in the journal Nature [252] that popularizes the backpropagation algorithm. It has become one of the most highly cited papers in computer science.
First and foremost, let us start by stating the objective of the optimization. The goal of training neural networks through backpropagation is to find such a set of internal parameters Ω that minimizes the loss function. The internal parameters are the weights of the network, as any other parameters (such as activation functions) are typically fixed for backpropagation training purposes.
The loss function is a type of error function used for neural networks. It expresses the cost (loss) as a function of the expected (correct) output 𝑦ˆ and the current neural network output 𝑦. In other words, it measures how costly the difference between 𝑦ˆ and 𝑦 is. The loss function is often computed for a batch (a set) of observations. In such a case, there is an additional subscript: 𝑦ˆ𝑖 and 𝑦𝑖 , where 𝑖 denotes the 𝑖-th observation.
The choice of a particular loss function depends on the type of machine learning
task (classification vs. regression) and other, more problem-specific, factors. Below are two examples each of loss functions for classification (1. and 2.) and regression (3. and 4.) problems:
1. Cross-Entropy (Log) Loss:

L = -\frac{1}{n} \sum_{i=1}^{n} \hat{y}_i \log(y_i) \quad (5.7)

2. Hinge Loss:

L = \max(0, 1 - y_i \cdot \hat{y}_i) \quad (5.8)

3. MSE (L2) Loss:

L = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \quad (5.9)

4. Huber Loss:

L_{\delta}(y_i, \hat{y}_i) = \begin{cases} \frac{1}{2}(y_i - \hat{y}_i)^2 & \text{for } |y_i - \hat{y}_i| \le \delta \\ \delta \left( |y_i - \hat{y}_i| - \frac{1}{2}\delta \right) & \text{otherwise} \end{cases} \quad (5.10)

- where the 𝛿 parameter controls the trade-off between the quadratic (smooth) region for small errors and the linear (steep) region for large errors.
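A minimal sketch of these four loss functions in Python (using NumPy) is shown below. To avoid ambiguity with the symbols above, the arguments are named explicitly: y_true is the expected (correct) output (the book's y-hat) and y_pred is the network output (the book's y). The hinge loss additionally assumes labels in {-1, +1}, and all losses here are averaged over a batch.

import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Equation 5.7: mean negative log-likelihood of the correct outputs.
    return -np.mean(y_true * np.log(y_pred + eps))

def hinge(y_true, y_pred):
    # Equation 5.8, averaged over a batch; labels are expected in {-1, +1}.
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))

def mse(y_true, y_pred):
    # Equation 5.9: mean squared error.
    return np.mean((y_pred - y_true) ** 2)

def huber(y_true, y_pred, delta=1.0):
    # Equation 5.10: quadratic for small residuals, linear for large ones.
    err = np.abs(y_true - y_pred)
    return np.mean(np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta)))

y_true = np.array([1.0, 0.0, 1.0])     # ground-truth targets
y_pred = np.array([0.9, 0.2, 0.7])     # network outputs
print(cross_entropy(y_true, y_pred), mse(y_true, y_pred), huber(y_true, y_pred))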
Now the idea is as follows:
1. Consider the neural network as a function 𝑦 = 𝑓 (𝑥) expressed using parameters 𝑊ˆ (they are modifiable by the learning process but constant from the function's perspective).
2. Compute the gradient of the loss function with respect to the weights (to
calculate the contribution of each one).
3. Update each weight with respect to its partial derivative:

\Delta w = -\alpha \cdot \frac{\partial L}{\partial w} \quad (5.11)

- where 𝛼 is the learning rate. It determines how much a particular weight is modified in one training iteration.
In theory, one could write an equation for the neural network in terms of the input and the weights and calculate the gradient analytically. This is infeasible in practice due to complexity (for any non-trivial network), as well as the fact that it would need to be repeated for each new network topology. Here the chain rule comes in: it allows us to express the derivative of the function realized by a particular unit in a neural network in terms of the derivatives of the functions that calculated the input to this particular unit. Let me illustrate this idea using the example shown in Figure 5.10. The goal is to calculate the gradient of the loss function with respect to the weights. The required calculations will be shown for weights 𝑤7 and 𝑤1 as relatively distinctive examples, so the goal is to find 𝜕𝐿/𝜕𝑤7 and 𝜕𝐿/𝜕𝑤1.
Figure 5.10: Simple neural network used as an example. 𝑓2 , 𝑓4 , 𝑓6 are activation
functions that take the outputs of 𝑓1 , 𝑓3 and 𝑓5 , respectively. The output from 𝑓6 is
the final output of the network.
The output of the network directly depends on the function 𝑓6 (since it is the last computation in the network). Now, 𝑓6 is a composite function of all the functions appearing earlier in the network and depends on 𝑤7:

\frac{\partial L}{\partial w_7} = \frac{\partial L}{\partial f_6} \cdot \frac{\partial f_6}{\partial w_7} \quad (5.12)

Now, 𝜕𝑓6/𝜕𝑤7 must be rewritten, but since we know it is a sum, we can apply the chain rule only along the 𝑤7 edge. We utilize the property that:

\frac{\partial \left[ \sum_{i=1}^{k} w_i \, f(x_i) \right]}{\partial w_j} = f(x_j) \quad (5.13)

Finally, we get:

\frac{\partial L}{\partial w_7} = \frac{\partial L}{\partial f_6} \cdot \frac{\partial f_6}{\partial f_5} \cdot \frac{\partial f_5}{\partial w_7} = \frac{\partial L}{\partial f_6} \cdot \frac{\partial f_6}{\partial f_5} \cdot x_7 \quad (5.14)

The chain for 𝑤1 looks as follows:

\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial f_6} \cdot \frac{\partial f_6}{\partial f_5} \cdot \frac{\partial f_5}{\partial f_2} \cdot \frac{\partial f_2}{\partial f_1} \cdot \frac{\partial f_1}{\partial w_1} \quad (5.15)
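To make the mechanics concrete, here is a tiny hand-computed example in Python. It does not reproduce the network from Figure 5.10; it assumes a minimal two-weight network with sigmoid activations and a squared-error loss, and it checks the chained analytical gradient against a numerical finite-difference estimate.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w):
    # Tiny two-weight network: hidden unit h = sigmoid(w1 * x), output o = sigmoid(w2 * h).
    h = sigmoid(w[0] * x)
    o = sigmoid(w[1] * h)
    return h, o

x, y_target = 0.5, 1.0
w = np.array([0.8, -0.3])
h, o = forward(x, w)

# Loss L = 0.5 * (o - y_target)^2; apply the chain rule backwards through the network.
dL_do = o - y_target                          # dL/do
do_dz2 = o * (1 - o)                          # derivative of the output sigmoid
dL_dw2 = dL_do * do_dz2 * h                   # dL/dw2
dh_dz1 = h * (1 - h)                          # derivative of the hidden sigmoid
dL_dw1 = dL_do * do_dz2 * w[1] * dh_dz1 * x   # chain extended one layer deeper

# Numerical check of dL/dw1 via central finite differences.
eps = 1e-6
loss = lambda w_: 0.5 * (forward(x, w_)[1] - y_target) ** 2
numeric = (loss(w + np.array([eps, 0.0])) - loss(w - np.array([eps, 0.0]))) / (2 * eps)
print(dL_dw1, numeric)   # the analytical and numerical gradients should agree closely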
Some aspects to keep in mind:
1. The training signal 𝑦ˆ is required. The neural network will be trained to reproduce it. The higher its quality (preferably perfect ground truth), the better.
2. The loss function must be differentiable.
3. The activation functions must be differentiable. However, if an activation function is continuous and only piece-wise differentiable (with a finite number of non-differentiable points), then it can also be used in practice. For instance, the ReLU function is not differentiable at a single point.
4. The learning rate must be carefully chosen.
The chain rule allows for effective computer implementations but this topic is
beyond the scope of this section.
5.10 Hopfield Network (1982)
In 1982, John Hopfield introduced a type of artificial neural network that became
known as a Hopfield network [123]. The title of the publication was “Neural networks
and physical systems with emergent collective computational abilities”. Hopfield
networks had a notable impact on the field of neural networks and computational
neuroscience due to incorporating many novel ideas. They sparked discussions
about how the human brain might memorize information. Additionally, they later
inspired the introduction of the attention mechanism in neural networks. Hopfield
networks were the first artificial neural networks designed for pattern retrieval and
demonstrated success in such tasks [136]. They were also utilized for other tasks,
e.g., the so-called N-Queens problem [169], which is NP-hard.
Recurrence - a Hopfield NN is a network with feedback in contrast to feed-
forward neural networks. It is fully connected - each neuron is connected to all other
neurons but not to itself (to avoid self reinforcement). Hopfield cleverly utilized
recurrence to simulate memory. Figure 5.11 is a sketch of a simple Hopfield network.
Associative Memory and Representing Patterns - the network is designed in
such a way that it effectively realizes an associative memory model. It uses a form
of Hebbian learning (c.f. Section 2.10), in which synaptic efficacy increases due
to persistent and repeated stimulation of connected neurons. Hopfield saw that a
recurrent neural network is capable of making the most use of this idea. The network
essentially memorizes information in the form of stable states, i.e., the states it wants
to converge to. In the context of Hopfield networks, a stable state is often referred to
as a memory pattern or retrieval pattern.
Energy of a Network - is a notion used in the model. The idea is to introduce
a scalar value that represents the stability of the network. It is called the energy
because the training objective is to minimize its value - the lower the energy, the
more stable the network is. This is a parallel to energy minimization in nature. The
Figure 5.11: Example of a simple Hopfield network with 3 neurons.
energy – 𝐸 – is computed as follows:

E = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} W_{ij} \, s_i \, s_j + \sum_{i=1}^{N} \theta_i \, s_i \quad (5.16)

- where 𝑠𝑖 is the state (output) of the 𝑖-th neuron, 𝑊𝑖𝑗 are the weights, and 𝜃𝑖 is the bias of the 𝑖-th neuron.
Symmetric Weights - the network uses symmetric weights, which is a direct
consequence of applying Hebb’s rule (c.f. Equation 2.17) as multiplication is
commutative. Moreover, the symmetric weights contribute to a simpler and more
symmetric energy function.
Neuron models in Hopfield networks typically use binary activation values: +1 if the input exceeds the set threshold and −1 otherwise. These are called the non-inverting and inverting outputs, respectively. The network dynamics evolve over time as the neurons update their states in an asynchronous (one at a time) or synchronous manner (all neurons at the same time). However, variants with continuous values have also been proposed. In particular, a continuous variant was investigated in a 2020 paper titled “Hopfield Networks is All You Need” [234], which plays on the title of the widely cited paper “Attention is All You Need” [313]. The authors
have shown that their proposed network, which is essentially a modernized version
of the original Hopfield’s network, is a generalization of the attention mechanism.
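A compact illustration of these ideas in Python is given below: Hebbian storage of two patterns, the energy of Equation 5.16 (with zero biases), and asynchronous binary updates that pull a corrupted pattern back toward a stored one. The patterns, the number of update steps, and the zero thresholds are assumptions made for the sketch.

import numpy as np

def train_hopfield(patterns):
    # Hebbian rule: W_ij accumulates s_i * s_j over the stored patterns; no self-connections.
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0.0)
    return W / len(patterns)

def energy(W, s):
    # Equation 5.16 with zero biases.
    return -0.5 * s @ W @ s

def recall(W, s, steps=100, seed=0):
    # Asynchronous updates: one randomly chosen neuron at a time is set toward lower energy.
    rng = np.random.default_rng(seed)
    s = s.copy()
    for _ in range(steps):
        i = rng.integers(len(s))
        s[i] = 1 if W[i] @ s >= 0 else -1
    return s

stored = np.array([[1, -1, 1, -1, 1, -1], [1, 1, 1, -1, -1, -1]])
W = train_hopfield(stored)
noisy = np.array([1, -1, 1, -1, -1, -1])      # corrupted copy of the first stored pattern
recalled = recall(W, noisy)
print(recalled, energy(W, recalled))           # the stored pattern should be retrieved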
5.11 Boltzmann Machines (1985)
A Boltzmann machine, like a Hopfield network, is a type of non-standard (meaning not a variant of the MLP) recurrent neural network. There have been several major advancements concerning the adoption of Boltzmann machines in the history of AI, so it was relatively difficult to choose one date to place them in this book.
• 1985 - Introduction to Boltzmann machines in the paper: “A Learning
Algorithm for Boltzmann Machines” by David H. Ackley, Geoffrey E. Hinton,
and Terrence J. Sejnowski [1].
• 1986 - Paul Smolensky introduces a model based on harmony theory, which with minimal changes evolved into the Restricted Boltzmann Machine (RBM). The initial exposition of this idea is found in the work titled “Information processing in dynamical systems: Foundations of harmony theory” [278]. RBMs have become much more widely utilized than general Boltzmann machines. However, in 1986 there was no efficient training algorithm for them.
• 1998 - Radford M. Neal and Geoffrey E. Hinton present a much more efficient procedure for training Boltzmann machines (and not only them) [203]. It led to a renewed interest in these models. However, there still remained challenges regarding their practical application on a broader scale.
• 2002 - G. Hinton publishes “Training Products of Experts by Minimizing
Contrastive Divergence” [118]. This algorithm significantly improves the
efficiency of learning and contributes to the widespread adoption of RBMs.
Boltzmann machines were named after Ludwig Boltzmann, an Austrian physicist
and thinker. They bear many similarities to the Hopfield networks:
1. Boltzmann machines and Hopfield networks are energy-based models. The
objective is to minimize (or maximize) the energy of the network.
2. Both networks are recurrent.
3. Both models follow the Hebbian learning principle.
4. Both typically use binary neurons, i.e., taking only 0 or 1 as their states.
5. Both models are particularly suitable for pattern recognition tasks.
The connections in Boltzmann machines are symmetric (which is typical in Hebbian learning). They are a type of stochastic network, i.e., the neurons are activated based on a probability rather than in a deterministic fashion. Let 𝑧𝑖 denote the total signal reaching the 𝑖-th neuron:

z_i = b_i + \sum_{j=1}^{N} s_j \cdot w_{ij} \quad (5.17)
- where 𝑏𝑖 denotes the bias, 𝑤𝑖𝑗 is the weight of the 𝑖-𝑗 connection, and 𝑠𝑗 ∈ {0, 1} is the state of the 𝑗-th unit. The probability of activation is most often defined using the logistic function:

prob(s_i = 1) = \frac{1}{1 + e^{-z_i}} \quad (5.18)
The total energy in a Boltzmann machine is defined as:

E = -\sum_{i}^{N} \theta_i \, s_i - \sum_{i<j}^{N} w_{ij} \, s_i \, s_j \quad (5.19)
A Boltzmann machine consists of two types of units: visible and hidden neurons.
The latter represent latent variables. In general, the network can be fully connected.
The model learns a probability distribution of a process based on the training
samples from the process provided as input to the network. Convergence to the
distribution is equivalent to minimizing the energy of the global network’s state.
The authors of [1] showed that the relationship between the probability of the global
state (𝑃𝛼 ) and the individual connection strengths can be simply written as:
\frac{\partial \ln P_{\alpha}}{\partial w_{ij}} = \frac{1}{T} \left[ s_i^{\alpha} s_j^{\alpha} - p'_{ij} \right] \quad (5.20)

- where 𝛼 denotes the subsequent iteration index of the global state; 𝑇 (from temperature) is effectively the learning rate; and 𝑝′𝑖𝑗 is the probability of the given two units being on at the same time when the system is at equilibrium.
Then, the authors formulate the training objective, which is to minimize a function 𝐺 that denotes the discrepancy between the internal network distribution and the real data distribution:

\frac{\partial G}{\partial w_{ij}} = -\frac{1}{T} \left[ p_{ij} - p'_{ij} \right] \quad (5.21)
- where:
• 𝑝𝑖 𝑗 is the probability of the given neurons (𝑖 and 𝑗) being on at the same time
in the so-called positive phase, in which the states of visible units are clamped
to a particular binary state vector sampled from the training data.
• 𝑝′𝑖𝑗 is the probability of the given neurons both being on in the so-called negative phase. It is defined as in Equation 5.18.
The optimization presented in Eq. 5.21 is typically performed using gradient
descent.
5.11.1 Restricted Boltzmann Machines
The restricted Boltzmann machine (RBM) is a variant in which there are no internal connections within the visible and hidden layers, respectively. Only the connections between visible and hidden units are allowed. This makes the network a bipartite graph. As a result, the model assumes that the hidden variables are conditionally independent given the state of the visible units, and the visible variables are conditionally independent given the state of the hidden units.
The most common method of training RBMs is called contrastive divergence (CD). The details of its derivation can be found in the seminal article [118]. The procedure can be summarized in the following steps. Let 𝑄0 denote the distribution over the training data and 𝑄∞ the equilibrium distribution.
1. Randomly sample from 𝑄 0 obtaining a vector 𝑑. This can also be a batch of
samples.
2. Run the model giving it 𝑑 and obtain the sample of the hidden units ℎ. This
represents 𝑃(ℎ = 1|𝑑).
3. Perform Gibbs (Boltzmann) sampling giving ℎ as the input to obtain a reconstruction of the visible units. Let us denote it by 𝑣′. Then resample the hidden units as ℎ′.
4. Compute the log-likelihood gradients of 𝑃(𝑑) with respect to the model
parameters 𝜃 (weights and biases).
5. Update the weights and biases along the gradient multiplied by a learning rate
parameter.
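The steps above can be condensed into a short CD-1 sketch in Python. The tiny RBM size, the learning rate, and the single repeated training vector are illustrative assumptions; a real application would train on batches of data and monitor the reconstruction error.

import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, b_v, b_h, lr=0.1):
    # Positive phase: hidden activations and a binary sample driven by the data vector.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step reconstructs the visible units and re-samples the hidden ones.
    p_v1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_h)
    # CD-1 update: correlations under the data minus correlations under the reconstruction.
    W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    b_v += lr * (v0 - v1)
    b_h += lr * (p_h0 - p_h1)

n_visible, n_hidden = 6, 3
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
training_vector = np.array([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
for _ in range(200):
    cd1_step(training_vector, W, b_v, b_h)
print(sigmoid(training_vector @ W + b_h))   # hidden activations after training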
The more efficient learning procedure made restricted Boltzmann machines
popular models. One of the most notable applications is for collaborative filter-
ing. Collaborative filtering involves creating recommendation systems capable of
predicting a given user’s preferences (e.g., the next movie to watch) by leveraging
the preferences of other users. In 2006, Netflix announced a collaborative filtering
competition aimed at improving the accuracy of their internal recommendation sys-
tem [17] (c.f. Section 9.4). The top solutions to this problem use RBMs [256]. Some
other problems, in which RBMs are widely used, include dimensionality reduction,
data imputation and denoising as well as generative modelling (as a method that
creates a distribution that can be sampled from to generate new samples).
6. Years 1986 - 1991
6.1 Decision Trees (1986)
6.1.1 Introduction
The year 1986 marked the introduction of decision tree induction. The algorithm was
developed by John Ross Quinlan, an Australian computer scientist. The seminal
article, titled “Induction of Decision Trees” [228], was published in the Machine
Learning journal. A decision tree is one of the most popular machine learning
models, praised for its simplicity and interpretability. It possesses high educational
value, serving as an introduction to more advanced tree-based or rule-based models.
State-of-the-art approaches to ML problems sometimes incorporate decision trees as part of the overall solution.
Decision trees can be constructed by experts or induced from data. The induction
is more of a transformation than training. Training is typically iterative and involves
updating the entire structure in some way, such as modifying weights. In addition,
training is often non-deterministic. A decision tree can be created iteratively, but
each iteration adds nodes corresponding to a certain variable rather than updating the
entire structure. Once a decision tree is created, the process is complete. Induction
can be viewed as an algorithm to transform data into rules and constraints represented
in a decision tree. Another term used interchangeably in this context is “synthesis”.
Initially, decision trees were developed for classification tasks. Here, the possible class labels are stored in leaves. Later, a generalized approach was proposed
that included regression tasks. The general approach is called Classification And
Regression Trees (CART).
A decision tree, as the name implies, is a tree graph structure. It consists of two
types of nodes:
1. Internal nodes, which represent conditions. A condition has the form “IF condition THEN GO TO i-th child node”.
2. Leaf nodes, which represent targets. These are classes in classification problems
and numbers or distributions of numbers (to be sampled) in regression
problems. Problems that involve decision making, such as games, can be
typically represented as classification problems, in which the objective is to
assign the decision to the state described by a number of attributes.
Inference with a decision tree consists in traversing it from the root node to a leaf node and returning the associated value (e.g., a decision) in that leaf node. Whenever an internal node is visited, the attribute that corresponds to this node is tested against a particular condition. Figure 6.1 presents a simple decision tree and an example of inference with it. This tree was induced from the tabular data listed in Table 6.1.1.
Outlook Temperature Humidity Wind Decision
Sunny 85 85 Weak No
Sunny 80 90 Strong No
Overcast 83 78 Weak Yes
Rain 70 96 Weak Yes
Rain 68 80 Weak Yes
Rain 65 70 Strong No
Overcast 64 65 Strong Yes
Sunny 72 95 Weak No
Sunny 69 70 Weak Yes
Rain 75 80 Weak Yes
Sunny 75 70 Strong Yes
Overcast 72 90 Strong Yes
Overcast 81 75 Weak Yes
Rain 71 80 Strong No
Table 6.1: A dataset for the decision tree shown in Fig. 6.1. There are four attributes regarding the weather and a corresponding decision about going out (playing tennis).
This exact dataset or a very similar one commonly appears in introductory courses to
decision trees. A similar dataset also served as an example in the original Quinlan’s
paper [228]. In this book, it was reproduced from: https://sefiks.com/2018/
05/13/a-step-by-step-c4-5-decision-tree-example/
A decision tree is not only a tree-based model but also a rule-based model,
because each distinct path from the root node to a leaf node is equivalent to a rule.
Figure 6.1: A decision tree induced from data listed in Table 6.1.1. The separation
metric applied was information gain. The thicker path shows the traversal caused by
the inference provided as an example.
For example, the tree shown in Fig. 6.1 is equivalent to the following set of rules:
IF (outlook=sunny) AND (humidity > 80) THEN (decision=NO)
IF (outlook=sunny) AND (humidity <= 80) THEN (decision=YES)
IF (outlook=overcast) THEN (decision=YES)
IF (outlook=rain) AND (wind = strong) THEN (decision=NO)
IF (outlook=rain) AND (wind = weak) THEN (decision=YES)
6.1.2 The ID3 Algorithm
In 1986, Quinlan presented a decision tree induction algorithm named ID3, which
expands to Iterative Dichotomiser 3. This method was subsequently improved and
succeeded by the C4.5 algorithm in 1993 [227]. The C4.5 algorithm introduces the
inclusion of continuous numeric values, the handling of missing data, the ability
to incorporate weights for attributes, pruning redundant branches, and a few minor
enhancements.
Let 𝐷 denote the dataset. 𝐷 contains rows (samples, observations) and columns
(attributes). Let 𝐴 denote the initial set of attributes in the given problem. In the
example given in Fig. 6.1 and Tab. 6.1.1, the attributes are {outlook, temperature, humidity, wind}. The available classes for the decision are {yes, no}.
The algorithm iteratively finds the best attribute to split the dataset on. The idea is to find a condition (using one attribute from 𝐴) that carries the most information about the decision – in other words, simplifies the decision the most or separates the classes the most. For this task, various metrics have been proposed. The most popular ones are:
• arg max information gain
• arg min entropy
• arg min Gini index
The formula for information gain is defined as:
𝐼𝐺 (𝐷, 𝑎) = 𝐻 (𝐷) − 𝐻 (𝐷|𝑎) (6.1)
- where 𝐷 is the current dataset, 𝑎 ∈ 𝐴 is an attribute, 𝐻 (𝐷) is the entropy of the
whole dataset (c.f. Section 2.9), and 𝐻 (𝐷|𝑎) is the conditional entropy of the dataset
given the specific value of attribute 𝑎.
The formula from Eq. 6.1 can be rewritten as:

IG(D, a) = H(D) - \sum_{v \in \text{values}(a)} \frac{|D_{a=v}|}{|D|} \, H(D_{a=v}) \quad (6.2)
- where 𝐷 𝑎=𝑣 is the subset of the dataset, in which the attribute 𝑎 has the value of 𝑣.
Or more generally, 𝐷 𝑎=𝑣 denotes the subset of data filtered by a particular condition
(not necessarily using the equality operator).
The Gini index for a dataset is defined as follows:

Gini(D) = 1 - \sum_{c \in \text{classes}} P(c)^2 \quad (6.3)

- where 𝑐 denotes a unique decision class, whereas 𝑃(𝑐) is the proportion of instances of the given class among all instances.
When finding the best split, the formula is analogous to Eq. 6.2:

Gini(D, a) = \sum_{v \in \text{values}(a)} \frac{|D_{a=v}|}{|D|} \, Gini(D_{a=v}) \quad (6.4)
The split defines a condition on the attribute’s values. Examples of conditions
are presented in Figure 6.2. For each condition (e.g. value = 1 and value ≠ 1), a new
dataset is created containing those rows of the parent dataset that satisfy the given
condition. Such a subset of data serves as the initial dataset for the corresponding
sub-tree. In other words, a particular edge in the decision tree filters the dataset
according to its associated condition.
After the best split is determined, the attribute used for it is removed from the
current set of attributes:
𝐴 ← 𝐴\𝑎
The algorithm recursively continues for each sub-tree and finds the next best
attribute to split its new dataset on.
The current branch terminates when either of the following occurs:
1. All observations (rows) in the dataset contain the same decision (class). In
such a case, a leaf node is constructed with the value equal to this decision.
2. There are no more attributes, i.e., | 𝐴| = 0. In such a case, a leaf node with the
most frequent decision in the remaining dataset is created.
Alternatively, it is possible to construct an approximating decision tree that is
truncated at a specific level. In this scenario, the method for determining the leaf
value can be the same as in 2.
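As a small illustration of the split-selection step, the Python snippet below computes entropy and the information gain of Equation 6.2 on a handful of rows taken from Table 6.1 (using only the outlook and wind attributes). Building the full recursive tree is omitted; this only shows how candidate attributes would be ranked for a split.

import math
from collections import Counter

def entropy(labels):
    # H(D) = -sum over classes of p(c) * log2 p(c).
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    # Equation 6.2: dataset entropy minus the weighted entropy of the subsets after the split.
    total = len(labels)
    gain = entropy(labels)
    for value in set(r[attr_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# A few rows from Table 6.1: (Outlook, Wind) attributes and the corresponding decision.
rows = [("Sunny", "Weak"), ("Sunny", "Strong"), ("Overcast", "Weak"),
        ("Rain", "Weak"), ("Rain", "Strong"), ("Overcast", "Strong")]
labels = ["No", "No", "Yes", "Yes", "No", "Yes"]
print(information_gain(rows, 0, labels), information_gain(rows, 1, labels))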
Figure 6.2: Examples of conditions used in decision trees.
1. is a binary split (one vs. all), which can be categorical or numeric.
2. is a nominal split that represents a switch operation common in programming
languages.
3. is a typical numeric-based split.
4. is a custom split, where continuous domain was converted into discrete intervals.
1., 2., and 3. can be implemented as one case (and treated in the same way).
In the introductory paper [228], the ID3 algorithm could only create splits of type 1.
and 2.
6.2 Artificial Immune Systems (1986)
Artificial Immune Systems (AIS) belong to the nature-inspired, population-based
meta-heuristic approaches. Their history dates back to 1986 with the publication
of a paper titled “The Immune System, Adaptation, and Machine Learning” [71],
authored by J. Doyne Farmer, Norman H. Packard, and Alan S. Perelson. The
approach is inspired by the natural immune systems of living organisms, in particular
the human one. It is a complex and highly adaptive biological system that protects
the organisms against harmful activity. The primary function of the immune system
is to eliminate any foreign material.
Unlike, e.g., genetic algorithms or ant colony optimization, AIS represents a
broad class of approaches inspired by the principles found in natural immune systems.
Different implementations of these ideas may result in distinct algorithms. A common
characteristic among all AIS algorithms is the maintenance of a population of cells,
each serving specific roles. The following types of cells are commonly utilized:
• B Cells (antibodies). In natural immune systems, large quantities of antibodies
are specialized to recognize antigens, i.e., harmful foreign material (See
Fig. 6.3). In AIS, they are responsible for the recognition of patterns.
Figure 6.3: A simplified depiction of a B Cell (antibody).
• Antigens (Pathogens). The harmful foreign material such as certain types of
bacteria. In AIS, they are used to represent patterns or some other types of
data that the system needs to recognize.
• Cloned Cells. Natural immune systems clone antibodies with a higher
likelihood of effectively eliminating the antigens present in the organism. In
AIS, the cloned cells are constructed from B cells with higher fitness values.
These cells undergo mutation, serving as a local search aimed at finding the
best match to the pattern.
• Memory Cells. In natural systems, cells retain a memory of past encounters
with antigens. The greater the number of these cells in an organism, the
more immune or tolerant it becomes to a particular type of antigens. This
increased presence allows for a faster and more efficient response upon
subsequent exposure. In AIS, memory cells may store information about the
most successful solutions or patterns found so far, enhancing the learning and
adaptation capabilities of the algorithm.
• Helper Cells. In natural systems, they provide an additional help in coordinat-
ing the immunologic process. In AIS, they can coordinate various components
of the algorithm.
• Suppressor Cells. They can be introduced to control the population enforcing
specific constraints, e.g., to prevent it from over-fitting to a particular pattern.
AIS bear many similarities to genetic algorithms. They also utilize concepts such as the fitness function, a population (albeit of pattern detectors rather than explicit solutions), clonal selection, and mutation operations. While AIS have not been as
widely utilized as some other AI approaches, such as artificial neural networks,
decision trees, or genetic algorithms, they are still considered important algorithms in
the AI toolset and can be valuable in specific contexts. Example applications include
pattern recognition in wireless multimedia sensor networks [319], hardware fault tolerance [24], intrusion detection in cybersecurity [333], industrial tool breakage detection [53], and multimodal function optimization [57].
6.3 Learning Representations By Back-Propagating Errors (1986)
“Learning representations by back-propagating errors” [252] is a seminal paper
authored by David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams,
published in Nature, in 1986. It is relatively short, spanning only 4 pages; therefore,
it focuses solely on the principles of back-propagation. In 2018, Hinton received the
Turing Award for his contributions to deep learning, including his contributions to the back-propagation algorithm. In 2024, he received the Nobel Prize in Physics. This paper is sometimes cited as the original source that introduced the back-propagation concept, but this is a misconception. As already presented in this book, the algorithm was introduced earlier (c.f. Section 5.9).
The importance of the paper is threefold. Firstly, it provides a novel perspective
on back-propagation in many-layered neural networks. The authors delve into
the dynamics of the network structure, stating that the units learn to represent
features. Starting from the input layer and progressing towards deeper, hidden-layers,
increasingly complex features can be represented. The more complex ones build
upon the features represented by previous layers. Secondly, the paper introduces
a method of changing the weights layer by layer and in a parallel fashion. Finally,
the paper greatly popularized the back-propagation algorithm. Its popularity can be
attributed to various factors. Being published in Nature, a high-impact journal with
a broad readership, certainly contributed. The article streamlined the presentation by
including only the essential equations required for computing the error and updating
the weights (two variants) using gradient descent. The authors also provided
specific instructions, proposing, for instance, to update each weight by an amount
proportional to the gradient of the error with respect to the weight accumulated over
all input-output training examples. Additionally, they optimized this procedure by
using the gradient to modify the velocity of the point in weight space rather than its
position.
6.4 The First Autonomous Car Project (1987-1995)
The first autonomous car is an example of an invention-based milestone. In 1986,
Ernst Dickmanns, then a professor at Bundeswehr University Munich, together with his team ran a successful attempt to make a car drive autonomously, controlled only by a computer program [62]. The experiment took place at the university campus, at a driving speed of 36 km/h (≈ 22 mph) and without any traffic or obstacles (the car drove in circles). However, in 1987, it was repeated on a closed, soon to be opened, section of a highway for a longer time and at speeds close to 90 km/h (≈ 56 mph). After this groundbreaking event, Dickmanns was offered a partnership
from Daimler-Benz. Together, they joined the Eureka PROMETHEUS Project
(PROgraMme for a European Traffic of Highest Efficiency and Unprecedented
Safety). It was an umbrella project comprising multiple initiatives, and autonomous driving was one of them. The project spanned from 1987 until 1995 and had a total
budget of 749 million euros. A big demonstration took place in 1994 on a French
highway under real traffic conditions. Three autonomous cars covered a distance of
more than 1000 kilometers (620 miles). For safety reasons, there were human drivers
in all cars ready to take the steering wheel if an emergency arose. For instance,
the solution did not feature any detailed mapping of the surroundings apart from
lanes and other cars. The drivers intervened occasionally, e.g., near construction
sites. The cars accelerated up to the allowed speed. Driving on a highway appears
easier than driving inside a city, however, the car still had to keep lanes, switch them
in both directions - left and right, maintain distance to the car in front. A similar
experiment was repeated once more on a highway from Bavaria to Denmark in 1995.
This marked the last progress in this project.
On a technical side, the invention combined new computer vision algorithms
with vehicle control mechanisms. It is worth noting that the car was controlled based
on camera views without the assistance of GPS. During the Prometheus project,
the car had four cameras mounted - two at the front and two at the back. When
Dickmanns started, and even during the later demonstrations in 1994 and 1995, the
state-of-the-art image processing methods were too slow for such a complex and
real-time application as autonomous driving. This resulted in one of the largest computer vision projects to date. A novel approach named dynamic vision
was developed. The team proposed a novel 4D representation of the surroundings
including both the depth (the 3-rd dimension) and the time (the 4-th dimension).
The motion of objects, in this 4D representation, was predicted using complex
models. Another module was responsible for the detection of the objects. The
work greatly utilized pioneering work by Rudolf E. Kalman: “A new approach to
linear filtering and prediction problems” [135]. Apart from efficient implementation,
Dickmanns’ team used a number of high-performance transputer computers to
significantly reduce the computing time. For control, the car used the approach
then named closed-loop perception and action. It involved state estimation, road
curvature estimation (both horizontal and vertical), lateral anticipatory feed-forward
and compensatory feedback control, longitudinal control and speed control. Apart
from the cameras, the car was equipped with inertial sensors like angular rate and
linear acceleration detectors. Everything was happening in real-time. The computer
would send commands to a specialized software to control the steering wheel, brakes
and throttle. The road recognition was model-based. The model was adjusted to
highways specifically. The motion was described using a set of differential equations.
Currently, autonomous driving stands out as one of the forefront applications of
AI. Dickmanns and Zapp foresaw this development. They concluded their paper [62]
by stating: “In order to fully utilize the new potential, (artificial) intelligence-
methods will have to be developed which allow these systems to acquire a certain
understanding of processes in their environment and to react properly. This is a
challenging new step in engineering” (verbatim quote).
6.5 The Second AI Winter (1987)
Figure 6.4: AI winter. An AI-generated image by the model: DALL-E.
The period between 1987 and the early 2000s, or even until the mid-2000s, is
considered the second AI winter. The reasoning behind calling it so was similar
to the case of the first AI winter. However, this time it was related more to the
disappointment and lack of belief that true artificial intelligence can be achieved.
It was debated whether the term “artificial intelligence” can be used for what AI
as a field has become - a set of tools and algorithms - rather than a simulation of
a human-like intelligence. In 2005, John Markoff summarized this period writing:
“At its low point, some computer scientists and software engineers avoided the term
artificial intelligence for fear of being viewed as wild-eyed dreamers”(verbatim
quote) [172].
After the winter comes the summer. At the time of writing this book, AI is at
the record highest levels of popularity. However, similarly to the case of the first AI
winter (c.f. Section 4.10), the author of this book personally believes that a lot of
interesting research has been done during these harder times. This includes both
the development of new techniques for solving problems efficiently and spectacular
achievements such as AI defeating Garry Kasparov at chess (c.f. Section 7.11).
6.6 Probabilistic Reasoning and Bayesian Networks (1988)
In 1988, Judea Pearl, an Israeli-American computer scientist and philosopher,
published an impactful and highly cited book titled “Probabilistic Reasoning in
Intelligent Systems: Networks of Plausible Inference” [217]. According to Google
Scholar, as of 2024, the book has been cited more than 30 thousand times. Not only does it serve as a review of probabilistic reasoning methods and as a key reference for researchers, practitioners, and students, but it also introduced and formalized novel concepts such as inferring and updating probabilistic graphical models.
The book (revised second printing, which is the most commonly available version)
is divided into ten chapters. Each section in a chapter is classified to one of three
levels of technical complexity.
1. Uncertainty in AI Systems: An Overview. This is a background and
motivation chapter. In classical reasoning, such as the one presented in
Sections 3.8, 4.9 and 3.9, there is no room for the inclusion of uncertainty.
Implications are satisfied if and only if the premises are satisfied. Production
rules are applied if and only if their conditions are satisfied. Probabilistic
reasoning allows for the information to be uncertain. Such an uncertain
information can still be used for reasoning. Both the input information and
the intermediate results of the reasoning process can be uncertain. Several approaches to modeling this problem are presented in the book.
2. Bayesian Inference. A hypothesis with an associated probability (or a degree
of certainty) is considered a partial belief. The subject of Bayesian Inference is
how to update the probability of a belief under changing evidence (which can
also be represented as sets of belief). In a rule-based reasoning system, a rule
can be seen as a relation between the joint probability of its conditions and the
probability of its effect (consequence). Calculating this probability becomes a
challenging task when there is a lack of information about the dependence of
the conditions. This chapter refers to the mathematical foundations including
the Bayes theorem (c.f. Section 2.2). It explores various related theorems,
properties and examples.
3. Markov and Bayesian Networks: Two Graphical Representations of
Probabilistic Knowledge. This chapter provides an introduction to two
prominent models — Markov Networks and Bayesian Networks (BN). I will
focus on the latter due to their greater popularity. A Bayesian Network, also
known as a belief network or a Bayesian net, is a directed acyclic graph, in
which vertices denote random variables. When thinking of BNs in machine
learning terms, a variable may represent a feature. Edges represent causal
relationships between the variables. For example, if an edge from 𝐴 to 𝐵
exists, then 𝐵 depends on 𝐴 and the value of 𝐴 influences the probability of
𝐵. Missing arcs are equivalent to conditional independence. Each node in
the network is additionally described by a conditional probability table. This
table represents the probability distribution of its node’s variable given all
Figure 6.5: A simple Bayesian Network for binary variables. The conditional
probability tables (CPT) are shown next to the nodes. In the binary case, the second
row could be derived from the first row due to the fact that 𝑝(𝑥 = 1) = 1 − 𝑝(𝑥 = 0).
However, a more general representation that also works for non-binary variables is
presented.
possible combinations of outcomes of the predecessor nodes’ variables. An
example Bayesian Network is shown in Figure 6.5.
The key advantage and the main strength of Bayesian Networks is that they can
be used to compute any marginal, joint or conditional probability involving
the variables present in the network. For example, they can be used to answer
queries such as P(lung cancer=yes |smoking=no, positive X-ray=yes) that
translates to “what is the probability of a lung cancer given that the person
is not a smoker and got a positive X-ray result?”. The reasoning is achieved
via inference with the model for a given situation. If the outcomes of some
events are known and plugged into the random variables, the model can be
used to estimate the probability of causes for the events or possible further
outcomes. The main disadvantage of Bayesian networks is the difficulty of
creating them. While an expert may specify the network topology, it is a really
challenging task to provide all the conditional probabilities. Moreover, even the causal relationships are often not fully known, or are incorrectly identified by humans (e.g., with the direction of causality reversed).
The second model described in this chapter is a Markov Network. Its graphical
structure is an undirected graph. Instead of causality, the edges represent
symmetric probability relationship.
4. Belief Updating by Network Propagation. This chapter focuses on methods
for updating Markov and Bayesian Networks when new evidence becomes
available.
5. Distributed Revision of Composite Beliefs. This chapter features advanced
topics of beliefs’ update and revision. It also contains a section demonstrating
a practical application in medical diagnosis. It concludes with explanations of
beliefs generated by the probabilistic models.
6. Decision and Control. Modeling uncertainty is one aspect, but making
decisions under uncertainty is a different one. This chapter addresses the
problem of decision analysis, covering example topics such as utility theory
with utility functions, planning, decision trees, as well as the assessment of
information sources.
Figure 6.6: Example of a taxonomic hierarchy.
7. Taxonomic Hierarchies, Continuous Variables, and Uncertain Probabil-
ities. Richer descriptions of situations are discussed such as hierarchies of
probabilistic relationships (taxonomic vs. causal). An example of a taxonomic
hierarchy is presented in Fig. 6.6. This section also addresses the probabilities
of continuous variables and illustrates the propagation schemes for distribu-
tions and ranges of such variables.
8. Learning Structure from Data. Most sections of this chapter are classified
as highly technically complex, focusing on the creation of Bayesian Networks
from data. In earlier chapters, it was assumed that the structure of the networks
and the conditional probabilities are provided by external experts. The learning
approach is inductive, i.e., involving the creation of a structure from empirical
data, as seen in the case of decision trees (c.f. Section 6.1).
9. Non-Bayesian Formalisms for Managing Uncertainty. In this chapter, three
formalisms are described. The first one is Dempster-Shafer theory (DST),
also referred to as the evidence theory. It employs the notion of belief, which
describes uncertain information in contrast to knowledge, which encompasses
certain, irrefutable information. DST introduces the so-called belief functions
representing the relationship between the degrees and outcomes of beliefs.
The remaining two formalisms are truth maintenance systems and probabilistic
logic. Probabilistic logic extends classical Boolean logic. A probabilistic
logic system is often non-monotonic. In this context, the non-monotonicity
means that previously made conclusions can be retracted based on updated evi-
dence. In classical logic, once something is known to be true or false, it stays so.
10. Logic and Probability: The Strange Connection. This section covers
topics such as non-monotonic reasoning, default reasoning and commonsense
reasoning. It also shows probabilistic semantics for these concepts.
6.7 TD-Learning (1988)
In 1988, a new learning paradigm was introduced, known as temporal-difference
learning (TD learning). The inaugural paper, authored by Richard S. Sutton, a
Canadian computer scientist, was titled "Learning to Predict by the Methods of
Temporal Differences” [290]. Traditionally¹, machine learning divides into three
classes of approaches: supervised learning (based on labelled data), unsupervised
learning (unlabelled data, finding patterns and clusters), and reinforcement learning
(RL). TD learning belongs to the latter, wherein an agent learns by interacting in an
environment representing a Markov Decision Process (MDP). It updates its policy
in order to maximize the cumulative reward obtained from the environment. For
MDP and other preliminary definitions, please revisit Section 3.5.
The main idea underpinning TD learning is that the value of a given state, corresponding to time step 𝑡, can be approximated by the expected discounted sum of rewards at future steps:

V(s) = E\left\{ \sum_{t=0}^{\infty} \gamma^{t} \, R_{t+1} \right\} \quad (6.5)
- where the agent starts from state 𝑠 and follows its action policy in consecutive time steps; 𝑅𝑡 is the reward obtained after the 𝑡-th step and 𝛾 ∈ [0, 1] is the discount factor.
¹ There are three main classes of ML tasks: supervised, unsupervised, and reinforcement learning. However, lately, more detailed distinctions have been proposed, including tasks such as semi-supervised learning and self-supervised learning.
Naturally, if the value was to be computed at the end of the process (without
the need to account for future rewards), it would simply be the sum of the obtained
rewards.
The learning process is iterative. The learned function is updated in each step by
the learning rate, e.g.:
𝑉 (𝑠𝑡 ) ← 𝑉 (𝑠𝑡 ) + 𝛼 ∗ Δ𝑉 (𝑠𝑡 ) (6.6)
Sutton introduced a family of TD learning methods parameterized by 𝜆, naming
them TD(𝜆). They differ in the way they modify the learned function during
an update. Below, the increments used by the TD(0) and TD(1) methods are presented in
Equations 6.7 and 6.8, respectively.

\Delta V(s_t) = R_{t+1} + \gamma \cdot V(s_{t+1}) - V(s_t)    (6.7)

\Delta V(s_t) = R_{t+1} + \gamma \cdot R_{t+2} + \gamma^2 \cdot R_{t+3} + ... + \gamma^n \cdot V(s_{t+n}) - V(s_t)    (6.8)
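The TD(0) rule above translates into a short value-estimation loop. Below is a minimal sketch of tabular TD(0) policy evaluation in Python; the environment interface (reset() returning a state index, step() returning a next state, a reward, and a termination flag) is an assumption made purely for illustration:

import numpy as np

def td0_evaluate(env, policy, n_states, episodes=1000, alpha=0.1, gamma=0.9):
    # Tabular TD(0): move V(s) towards the bootstrapped target r + gamma * V(s')
    # (cf. Equations 6.6 and 6.7).
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                    # follow the agent's current action policy
            s_next, r, done = env.step(a)
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V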
TD learning has been utilized for various problems, e.g., psychiatry [80] and
neuroscience [209]. However, arguably the most common application of the method
has been in the combinatorial games domain. One of the most famous such
applications is TD-Gammon, which will be presented in Section 7.1.
6.8 Convolutional Neural Networks (1989-1998)
Convolutional neural networks (CNNs) have become a fundamental and widely
used neural network architecture in the field of computer vision and beyond. Deep
convolutional neural networks (DCNNs), in particular, have played a significant
role in popularizing deep learning. The success of CNNs in computer vision tasks
inspired researchers to explore deep learning techniques in other domains. Moreover,
even seemingly strictly computational problems were sometimes represented as
images to apply image-based recognition using DCNNs.
The introduction of CNNs deserves to be noted as an AI milestone. The
groundwork was laid in 1989 by Y. LeCun, B. Boser, J. S. Denker, D. Henderson,
R.E. Howard, W. Hubbard, and L.D. Jackel in their paper titled “Handwritten Digit
Recognition with a Back-Propagation Network” [154]. The first author – Yann
LeCun – has become one of the most prominent figures in the field of AI. In 2019,
he was a recipient of the prestigious Turing Award. The next stepping stone was a
seminal article authored in 1998 by LeCun et al. titled “Gradient-based learning
applied to document recognition” [155]. It is one of the most highly cited articles in
the field. The authors proposed a refined version of the CNN architecture called
LeNet-5.
The motivation behind the 1989 paper was to demonstrate the possibility of
performing image-recognition tasks without extensive prior knowledge and feature
engineering. The authors introduced convolutions to neural networks and showed
Figure 6.7: The convolution operation is analogous to a 2D filter. The difference
lies in the fact that, in traditional filters, the values are usually fixed, whereas in
convolutional neural networks, they are represented by learnable weights.
that it is efficient for the network to learn local features first and then more advanced
features based on the local ones.
In the context of signal processing, convolution involves multiplying the signal
by a set of values (the kernel) element-wise to produce a new set of values. In
convolutional neural networks, this operation is similar to the filter operations
commonly used in 2D graphics. The kernel is a 𝑘 × 𝑙 sliding window over the signal.
The signal is the input to the convolutional layer, which in the case of the first layer,
is the image. Most commonly, 2D convolutions are used. If the data has more
channels, e.g., color, depth, temperature, alpha, etc., then each 2D channel is
typically processed as a separate plane. The convolution operation performs the sum
of multiplied values and returns it. The resulting sum corresponds to the middle
element of the window. Figure 6.7 presents this idea. The same technique is used,
for instance, for edge detection, sharpening, blurring, and other filters (differing by
the values in the kernel).
Such a window slides over the input map, similarly to a scan-line algorithm, and
produces a new map as a result. Depending on the step size and how the edges are
handled, the resulting map may have the same dimensions as the original, be smaller
by the size of one window edge, or be downsampled (see Figure 6.8).
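As a concrete illustration of the sliding-window operation described above, below is a minimal NumPy sketch of a “valid” 2D convolution (strictly speaking, a cross-correlation) with a configurable step size; the example kernel values are illustrative:

import numpy as np

def conv2d(image, kernel, stride=1):
    # Slide the kernel over the image and sum the element-wise products.
    kh, kw = kernel.shape
    ih, iw = image.shape
    out_h = (ih - kh) // stride + 1
    out_w = (iw - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# Example: an edge-detection-like 3x3 kernel applied to a random 8x8 "image".
img = np.random.rand(8, 8)
edge_kernel = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])
print(conv2d(img, edge_kernel, stride=1).shape)  # (6, 6)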
Each convolution operation has its matrix of trainable weights. However, an
important idea of CNNs is weight sharing: each convolution in a given layer
uses the same weights regardless of the position it is currently applied to. This has
many beneficial effects, such as invariance to the positions of detected features and
fewer free parameters (a higher potential convergence rate in training).
CNNs are multi-layer networks, primarily designed to have only feed-forward
connections (although there exist variants with recurrent connections as well). They
usually contain multiple convolutional layers. Each subsequent layer takes a map as
input that was the output of the previous layer. However, not all layers have to be
convolutional. Figure 6.9 presents the LeNet-5 architecture introduced in [155].
Figure 6.8: Convolutions of various step sizes and boundary-handling strategies. In
the first example at the top, an artificial boundary is introduced, and its values are
copied from the neighboring elements. The subsequent examples do not incorporate
an extra border. The middle example corresponds to a step size of 1, while the last
example employs a step size of 2.
In general, the most commonly utilized layers, serving as building blocks in
CNNs are:
• Convolutional layer (Conv) - as described before. They are responsible for
pattern/feature detection.
• Pooling layer (Pool, Subsampling) - performs aggregation operations such as
max (in this case called max pooling). They are responsible for dimensionality
reduction by data selection/summarization.
• Fully connected layer (Dense) - adds regular NN layer behavior. It typically
serves as a classifier or regression network placed after the part responsible for
feature extraction.
• Dropout layer - a dedicated layer to prevent overfitting. It removes (drops)
certain neurons from training.
• Activation - they apply activation functions such as ReLU to introduce non-
linearity. In contrast to regular NNs, the activation function is applied directly
to the output of the previous layer (element-wise) instead of the weighted
output.
• Concatenation layer (Concat) - combines multiple features from previous
layers.
• Flatten layer - converts multi-dimensional output into a 1D vector. They are
often used just before fully connected layers to prepare the input for them.
Figure 6.9: LeNet-5 architecture for digits recognition task. LeNet-5 is a convolu-
tional neural network proposed in [155].
• Output layer - often utilizes the soft-max function to convert the output
values into class probabilities (if the ML task is some kind of classification or
segmentation).
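Putting these building blocks together, a LeNet-5-style network can be sketched, for instance, with the Keras API in Python; the layer sizes below follow the architecture from [155] only approximately and serve as an illustration:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),          # a grayscale digit image
    layers.Conv2D(6, kernel_size=5, activation="tanh"),
    layers.AveragePooling2D(pool_size=2),       # pooling / subsampling layer
    layers.Conv2D(16, kernel_size=5, activation="tanh"),
    layers.AveragePooling2D(pool_size=2),
    layers.Flatten(),                           # prepare the input for dense layers
    layers.Dense(120, activation="tanh"),
    layers.Dense(84, activation="tanh"),
    layers.Dense(10, activation="softmax"),     # output layer with class probabilities
])
model.summary()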
6.9 Backpropagation through Time (1990)
In this brief section, I revisit a paper: “Backpropagation Through Time: What It
Does and How to Do It” [325], authored by Paul Werbos in 1990. In this paper, Werbos
demonstrated a pragmatic method of applying back-propagation in recurrent neural
networks (RNNs) that incorporate the concept of time. Werbos begins the abstract
with the following sentence: “Backpropagation is now the most widely used tool in
the field of artificial neural networks”. The significance of this contribution lies in
the pivotal role that neural networks now play in the field of artificial intelligence.
This work has significantly bolstered the popularity of RNNs. Werbos adopted a
tutorial-like style in the paper, using relatively simple formalism. He included code
samples written in Fortran. Readers interested in exploring these code snippets are
encouraged to refer to the cited paper for further details.
Figure 6.10: Network design presented as an example in [325].
In the first example, Werbos uses a network as shown in Fig. 6.10. The values
from the previous time steps are represented as “virtual” connections to past network
states. A problem that arises for 𝑡 = 0 and 𝑡 = 1 is discussed in the paper. In short,
the recommended approach is to assume zeroes or arbitrary values for the signal.
The chain rule, which is the core mathematical apparatus behind back-propagation,
is expanded to incorporate the virtually connected networks from previous time
steps. Practical implementations for various tasks are shown. Werbos summarizes
that the proposed approach generalizes beyond neural networks. It can be applied,
for instance, to feed-forward systems of equations or systems with instantaneous
feedback between variables (as in ordinary differential equations).
6.10 Ant Colony Optimization (1991)
Ant Colony Optimization (ACO) is a nature-inspired (like genetic algorithms, neural
networks, artificial immune systems and many more) meta-heuristic approach,
applicable to various optimization problems. It was the topic of Marco Dorigo’s
PhD thesis [63] titled “Optimization, Learning and Natural Algorithms”, which he
successfully defended in 1992 at the Polytechnic University of Milan. While some
sources attribute the commencement of ACO to this PhD thesis, it is worth noting
that Dorigo and his colleagues had already published papers on this approach in
1991 [64, 46].
The approach is inspired by the way ants discover paths between a food source
and the nest and vice versa. As ants move, they deposit pheromones on the ground,
establishing a communication channel. The amount of pheromone accumulated at
a location increases with the frequency of ant visits. Initially, the ants’ behavior
appears random as they explore their surroundings. Over time, more ants discover
the right path, and due to the accumulation of pheromones, the behavior becomes
less random. Ants begin to follow a path, gradually converging towards an optimal
route. This inspiration sets the tone for the types of problems that might be suitable
for an ACO algorithm - particularly those related to finding the shortest paths.
In the 1991 paper [46] describing the approach, the authors selected the Traveling
Salesman Problem (TSP) as an example to demonstrate its effectiveness. They
described three variants of the Ant System named:
• ANT-quantity - an ant leaves a constant quantity Q_1 of pheromone whenever
it visits a location.
• ANT-density - an ant leaves a quantity Q_2/d of pheromone, where d is the
distance traversed since the last pheromone deposit.
• ANT-cycle - an ant leaves pheromone only after it has completed a whole tour
(TSP cycle). The quantity is inversely proportional to the tour length: Q_3/L.
A block scheme of the algorithm is presented in Figure 6.11. Firstly, the
parameters of the algorithm are initialized, including the number of ants in the
population, the initial pheromone, and other pheromone-related values that are part
of the Equations 6.9, 6.11, and 6.12. The stop condition is typically defined as a fixed
number of iterations to perform. However, any problem-specific stop conditions can
be applied such as obtaining a solution of a minimum required quality.
The algorithm simulates ant iterations until the stop condition is reached. In
each iteration, every ant constructs a single solution from scratch, step by step. For
example, in a path-finding problem, the ant starts in the starting node of a graph
(representing the current location) and chooses the next edge to traverse. ACO is a
stochastic method. The next edge (𝑣 𝑖 , 𝑣 𝑗 ) is chosen with the probability:
p(ij) = \frac{\tau_{ij}^{\alpha} \cdot \eta_{ij}^{\beta}}{\sum_{ij} \left( \tau_{ij}^{\alpha} \cdot \eta_{ij}^{\beta} \right)}    (6.9)
- where 𝜏𝑖 𝑗 is the pheromone quantity deposited on the edge (𝑣 𝑖 , 𝑣 𝑗 ) (or more generally
on the transformation from the 𝑖-th state into the 𝑗-th state); 𝜂𝑖 𝑗 is the general quality
Figure 6.11: A scheme of the Ant Colony Optimization algorithm.
of the edge being part of the solution (e.g. the inverse distance for the shortest path
problem); whereas 𝛼 and 𝛽 are the control constants for the 𝜏 and 𝜂 parameters,
respectively.
The quality of the solution constructed by ant 𝑎 is determined. For the path-
finding problems, it can be calculated as follows:
Q_a = \frac{1}{\sum_{i=1}^{k-1} dist(v_i, v_{i+1})}    (6.10)
- where 𝑘 is the number of vertices on the path.
A general rule for calculating the pheromone increment for the (𝑣 𝑖 , 𝑣 𝑗 ) edge (or
more generally, the 𝑖 𝑗-th transition) is as follows:
\Delta\tau(t)_{ij} = \sum_{a \in Ants} \left( \delta_{ij} \cdot Q_a \cdot b_{ij} \right)    (6.11)
- where \delta_{ij} takes one of two values:
• 0: if the edge was not part of the solution constructed by ant 𝑎.
• 1: if the edge was part of the solution constructed by ant 𝑎.
- and the 𝑏𝑖 𝑗 parameter can be set as a bonus multiplier > 1 if the edge was part of
the globally best solution. This is the elitism mechanism implementation for ACO.
If the elitism is not used or the edge was not part of the best solution, then: 𝑏𝑖 𝑗 = 1.
Finally, the value deposited on each edge (transition) is updated using the
computed increment and taking into account the evaporation:
𝜏(𝑡 + 1)𝑖 𝑗 = max(𝜏𝑚𝑖𝑛 , (1 − 𝜌) ∗ 𝜏(𝑡)𝑖 𝑗 + Δ𝜏(𝑡)𝑖 𝑗 ) (6.12)
- where 𝜏(𝑡) is the amount of pheromone in iteration 𝑡; 𝜌 is the evaporation coefficient;
and 𝜏𝑚𝑖𝑛 is the minimum value of pheromone that is allowed (often: 𝜏𝑚𝑖𝑛 = 0).
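The above update scheme can be condensed into a short simulation. Below is a minimal, ant-cycle-style sketch in Python for the TSP; the parameter values, the symmetric distance-matrix interface and the simplified pheromone bookkeeping are illustrative assumptions rather than the exact 1991 formulation:

import numpy as np

def aco_tsp(dist, n_ants=20, n_iters=100, alpha=1.0, beta=2.0, rho=0.5, Q=1.0, seed=0):
    # dist: symmetric matrix of city-to-city distances.
    rng = np.random.default_rng(seed)
    n = len(dist)
    eta = 1.0 / (dist + np.eye(n))               # heuristic quality (inverse distance)
    tau = np.ones((n, n))                        # initial pheromone levels
    best_tour, best_len = None, np.inf
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            tour = [rng.integers(n)]
            while len(tour) < n:
                i = tour[-1]
                mask = np.ones(n, dtype=bool)
                mask[tour] = False               # exclude already visited cities
                weights = (tau[i, mask] ** alpha) * (eta[i, mask] ** beta)
                probs = weights / weights.sum()  # selection probabilities (cf. Eq. 6.9)
                tour.append(rng.choice(np.flatnonzero(mask), p=probs))
            length = sum(dist[tour[k], tour[(k + 1) % n]] for k in range(n))
            tours.append((tour, length))
            if length < best_len:
                best_tour, best_len = tour, length
        tau *= (1.0 - rho)                       # evaporation (cf. Eq. 6.12)
        for tour, length in tours:
            for k in range(n):
                i, j = tour[k], tour[(k + 1) % n]
                tau[i, j] += Q / length          # ant-cycle deposit (cf. Eq. 6.11)
                tau[j, i] += Q / length
    return best_tour, best_len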
The Ant Colony Optimization algorithm was first used for the TSP. However,
it has since been applied to problems beyond path-finding, such as load balancing in cloud
environments [208], financial crisis prediction [308], and model search [170]. Its core idea
generalizes well: maintain a population of searchers, exchange information between
them, and keep a balance between exploration and exploitation during the search.
6.11 Autoencoders (1991)
1991 marked the birth of a novel and distinctive neural network architecture called
the autoencoder. Since then, autoencoders have evolved into a fundamental component of the
machine learning ecosystem. They have proven to be versatile tools, influencing
diverse applications as well as subsequent state-of-the-art architectures such as
variational autoencoders.
Autoencoders originated as an attempt to create a non-linear and neural-network-
based variant of the PCA method. The original paper, written by Mark Kramer of
MIT, was titled: “Nonlinear Principal Component Analysis Using Autoassociative
Neural Networks” [150].
As you recall from Section 2.4, PCA is a space dimensionality reduction method
(that can be used for related tasks such as feature engineering). The primary goal of
an autoencoder is similar - to encode the input data into a lower-dimensional form
and then decode it back to reconstruct the original input as accurately as possible.
A general depiction of the autoencoder architecture is presented in Figure 6.12. It
consists of two parts: the encoder and the decoder. The input layer, which is part
of the encoder, and the output layer, which is part of the decoder, have equal sizes.
In the middle, there is a bottleneck through which the data must traverse. This
bottleneck layer is often referred to as the code. In the standard encoder architecture,
the size of the code must be smaller than that of the input and output layers.
Figure 6.12: A scheme of the autoencoder model. Traditionally, the model is
designed with fully connected layers, but in general, it does not necessarily have to
be. The code is the bottleneck, meaning it comprises fewer neurons than the input
and output layers, respectively.
Let 𝐸 denote the encoder component (encoding function) with parameters 𝜙. Its
output is referred to as the latent representation (or the latent vector):
𝐸𝜙 : X → F
Let 𝐷 denote the decoder component with parameters 𝜓. The output of the decoder
is referred to as the decoded message:
𝐷𝜓 : F → X
The neural network, as a whole, performs the following non-linear mapping:
NN(x) = D_\psi(E_\phi(x))
and its sub-networks perform the 𝐸 and 𝐷 functions, respectively.
The loss function for such a network is typically defined as some form of a
distance function of the input and the output. In contrast to the standard supervised
learning, the input data serve the role of ground truth. This approach is referred to
as the reconstruction loss:
L(\phi, \psi) = d(X, D_\psi(E_\phi(X)))
For example, the L2 loss is defined as follows, where 𝑛 is the number of
observations in the training batch:
L(\phi, \psi) = \frac{1}{n} \sum_{i=1}^{n} \left[ x_i - D_\psi(E_\phi(x_i)) \right]^2
When training autoencoders, there is a risk of making them learn the identity function.
To prevent this from happening, a few good practices have been proposed:
1. Making the sizes of hidden layers significantly smaller than the size of the
input layer. This way, the network cannot effectively learn to memorize
the input data but must develop meaningful representations instead (encode
abstract features) and find a way to reduce the dimensionality if possible.
However, it is crucial to recognize that this strategy involves a tradeoff between
generalization capabilities and accuracy. If the hidden layers are excessively
small, auto-encoders may become inaccurate.
2. Applying regularization. Regularization refers to the incorporation of tech-
niques that prevent the model from fitting the training data too closely, with
the goal of improving its generalization to new, unseen data. It applies to all
neural networks, not only to autoencoders. In particular, a common practice is
to inject noise to the input data during training.
3. Adjusting the activation functions in such a way that the activations become
sparse.
4. Adversarial training - which will be described in Section 10.1.
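As an illustration, a minimal fully connected autoencoder can be written in a few lines, e.g., with the Keras API in Python; the layer sizes below are illustrative assumptions rather than a canonical configuration:

import tensorflow as tf
from tensorflow.keras import layers

input_dim, code_dim = 784, 32                    # e.g., flattened 28x28 images

encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(input_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(code_dim, activation="relu"),   # the bottleneck ("code")
])
decoder = tf.keras.Sequential([
    tf.keras.Input(shape=(code_dim,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(input_dim, activation="sigmoid"),
])

autoencoder = tf.keras.Sequential([encoder, decoder])
# Reconstruction loss: the input itself serves as the training target.
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_train, X_train, epochs=10, batch_size=64)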
Autoencoders shine as an example of how a relatively simple idea has spawned
numerous valuable applications. Apart from dimensionality reduction, autoencoders
are utilized for denoising [314], representation learning [348], generative AI [21],
anomaly detection [43] and feature extraction [185]. Their applications include:
medicine [13], speech recognition [60], generative art [54], cybersecurity [338],
sales prediction [110], and interpretable AI [269].
6.12 Python (1991)
Python [312] is a high-level, dynamically typed programming language that made its
debut on 20th of February, 1991. It began as a hobby project of Guido van Rossum,
a Dutch programmer. The name “Python” is a tribute to the British group Monty
Python, of which Van Rossum was a fan. As of the time of writing this book, Python
is maintained and developed by the Python Software Foundation.
The language received acclaim for its simplicity and the speed at which programs
could be written in it. However, it is worth noting that the language itself exhibits
a relatively slower run-time code execution speed, particularly when compared to
natively-compiled languages like C++.
The first applications of Python were as a scripting language for various automa-
tion tasks. It also saw use for web development due to frameworks such as Zope
(introduced in 1995) and later Django (2005).
In February 2024, according to TIOBE, Python was the most popular language
in the world. It surpassed C in 2023 and Java in 2021. Even before this global
surge in popularity, it had unquestionably been the most widely used programming
language for machine learning by a significant margin. Python has increased the
productivity of AI researchers and engineers and allowed for a faster transition
from research to applications. Therefore, its introduction is a milestone in the history
of AI. The lower run-time performance of Python was mitigated by two facts:
1. C/C++ libraries. Most of the machine learning libraries available in Python
are backed by C or C++ implementations. The algorithms are written in C/C++, whereas
only the input and output interfaces to them are written in Python. This offers
a seamless workflow, as users can program exclusively in Python.
2. GPU computing. Some of the machine learning tasks are suitable for General-
Purpose computing on Graphics Processing Units (GPUs). A high-end GPU
is a powerful and massively parallel unit. The most prominent example of
tasks typically performed on GPUs is training deep neural networks. In such a
case, the performance of Python does not affect the training time.
The most popular Python libraries that serve various data processing and ML/AI
tasks include:
• 1995/2005: NumPy - for data handling and efficient processing. In 1995, the
core package, named Numeric, was released; in 2005 it was ported to what became
NumPy.
• 1999: OpenCV - dedicated to computer vision. It also supports non-ML
tasks such as classic image and video processing.
• 2001: NLTK (Natural Language ToolKit) - dedicated to natural language
processing (NLP).
• 2001: SciPy - a library for scientific computing that now depends on NumPy.
• 2003: Matplotlib - for creating various scientific plots, e.g., data distributions
or the performance of AI/ML models.
• 2007: Scikit-learn - arguably the most popular library for training and using
machine learning models.
• 2007: Theano - a deep learning library developed at the University of Montreal.
• 2008: Pandas - typically used for its DataFrame structure, which serves as a more
advanced data container with a plethora of operations.
• 2015: TensorFlow - a high-performance framework for graph operations
dedicated to training deep neural networks. It allows for efficient training
of models using GPUs. It was developed by Google. The introduction of
TensorFlow can be considered an important milestone in the history of AI.
• 2015: Keras - a high-level interface for neural network training backends. It
allows configuring models from building blocks as well as monitoring the
training process.
• 2016: PyTorch - a deep learning library, originally developed by Meta AI.
Below, please find an example of training and testing a simple machine learning
model in Python using scikit-learn:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X contains observed data (features); y contains labels
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=123)

model = RandomForestClassifier(criterion='gini', n_estimators=80)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy RF: %.2f%%" % (accuracy * 100.0))
7. Years 1992 - 1997
7.1 TD-Gammon (1992)
7.1.1 Introduction
Backgammon is a popular board game whose modern form is believed to have emerged in England
in the 17th century. However, it has most likely been inspired by much older games such
as Tabula [10], which dates back to the Roman and Byzantine Empires. The board
consists of 24 triangle-shaped positions called points. They are divided into four
quadrants: two outer board quadrants and two players’ homes. In contrast to chess or
checkers, backgammon is a game of chance through the use of dice rolls for various
aspects of the movement - by how many points the pieces (checkers, tablemen, men)
move or which positions in the opponent’s home they are allowed to enter (there are
several rules regarding this). Figure 7.1 presents the initial setup of the game.
Backgammon is a zero-sum game, wherein the victory of one player is equivalent
to a symmetric defeat of the other player. The victory is achieved by transferring all
pieces into the own home and then bearing them off (escaping with them). There
are three types of wins:
• +1. A regular (single) win. The winning player has borne off all pieces,
whereas the losing player has borne off at least one piece.
• +2. Win by gammon. Occurs when a player successfully bears off all their
pieces from the board before their opponent has borne off any.
• +3. Win by backgammon. Occurs when the losing player has not borne off
any pieces and still has at least one piece in the winner’s home board.
Therefore, there are six possible outcomes in a game of backgammon from the
perspective of a player: [+3, +2, +1, −1, −2, −3]. In principle, the game cannot end
in a draw.
Figure 7.1: Backgammon’s board in its initial state. Public domain image under the
CC BY-SA 4.0 license.
The “Backgammon is Hard” [327] article proves that determining if a player can
win is EXPTIME-Hard, in general. The state-space size is approximately 10^20 and the branching
factor is equal to 250, which is relatively high compared to other popular board
games (e.g., 35 in chess and just 2.8 in checkers).
7.1.2 The Approach
TD-Gammon [301] is a program developed in 1992 to play backgammon by Gerald
Tesauro, an American researcher at the IBM Watson Research Center. It employs a
combination of reinforcement learning and an artificial neural network to model the
non-linear value function of the game. Please note that the value function, denoted
as 𝑉, estimates the payoff of a player in a given state. As backgammon is a zero-sum
game, the network was trained to estimate only the value of the White player (the
first to move). The value of the second player can be derived by negating it.
The chosen neural network model was a fully-connected MLP with one hidden
layer (c.f. Figure 3.6). TD-Gammon had versions with 40 and 80 units in the hidden
layer, respectively. In earlier experiments reported in [300], Tesauro demonstrated
scaling of the performance from 10 to 80 hidden-layer units, so it was likely that the
program would be stronger with even more units.
The number of input units also varied between versions. TD-Gammon is a direct
successor of Tesauro’s previous program called Neurogammon [299]. It was also
NN-based but did not employ the TD-learning approach. Neurogammon (and the
first version of TD-Gammon) had 289 input neurons. However, the most widely
known version of TD-Gammon utilized 198 input units [291]. The encoding is
shown in Figure 7.2.
Figure 7.2: The input encoding that consists of board representation (8 ∗ 24 = 192
units) and 6 additional features. Features are either binary (on/off) or scaled integers.
For example, (𝑛 − 3)/2 means that 3 is subtracted from the number of pieces and
next it is divided by 2.
The output of the network consisted of 4 units that estimated the probabilities of
[+2, +1, -1, -2] outcomes, respectively. The [+3, -3] outcomes (“backgammons”)
were omitted due to their very low frequency of occurring. The neural network (NN)
was trained using TD-learning, and more precisely, TD(𝜆) method. Each game is a
sequence of input vectors 𝑥 [𝑡], where 𝑡 denotes the consecutive time step:
𝑥 [1], 𝑥 [2], 𝑥 [3], ..., 𝑥 [ 𝑓 ]
the corresponding network output is:
𝑌 [1],𝑌 [2],𝑌 [3], ...,𝑌 [ 𝑓 ]
the weight update formula is defined as follows:
w_{t+1} - w_t = \alpha \cdot (Y_{t+1} - Y_t) \cdot \sum_{k=1}^{t} \lambda^{t-k} \cdot \nabla_w Y_k    (7.1)
- where 𝛼 is the learning rate parameter, ▽𝑤𝑌𝑘 is the gradient of 𝑌𝑘 with respect to
weights, 𝜆 ∈ [0, 1] is the temporal credit assignment parameter. The last 𝑌𝑡 is set to
the actual game result rather than estimation from the network.
TD-Gammon utilized a purely knowledge-free and learning-based approach
to intelligent game playing [168]. It trained by playing against a copy of itself.
Whenever an action was to be made, the algorithm calculated its expected outcome,
i.e., the probability of each outcome times its value estimation from the network.
The action that maximized this expected outcome was chosen. There have been
various versions of TD-Gammon created. The initial one reached a relatively
strong intermediate level of play (similar to the Neurogammon) after 200 000
games. Versions 1.0, 2.0 and 2.1 were trained for 300K, 800K, and 1500K games,
respectively.
TD-Gammon is an important milestone in the AI research history. Firstly, it was
the first program to achieve high playing efficacy in the game. Previous programs,
which relied on heuristic rules and shallow search, were easily beaten by humans.
TD-Gammon was at the level of strong human players and only marginally below the
top human players. According to [244], TD-Gammon “would have a decent chance
of winning local and regional Open-division tournaments”. It played 51 games
against two world-class players, one of whom was 11-th in the world at the time. It
achieved the net result of −13. Please note that the range was [−153, +153], where
0 denotes parity. TD-Gammon 2.1 played against William Gerard (Bill) Robertie, a
two times world champion. In 40 games, it achieved the net score of −1.
Secondly, it has been one of the most remarkable practical demonstrations of the
temporal difference learning approach. In the paper [301], Tesauro motivates that:
Unfortunately, despite the considerable attention that has
been devoted to reinforcement learning over many years,
so far there have been few practical successes in terms of
solving large-scale, complex real-world problems.
TD-Gammon advanced the knowledge about the game by finding strategies that
people had previously overlooked or erroneously ruled out. In a sense, it taught people
how to play in certain positions.
7.2 R Language (1993)
In 1993, the first version of the R programming language was developed by Robert
Gentleman and Ross Ihaka of the University of Auckland, New Zealand. The language
implements a dialect of the earlier S language; its distinct name, R, both alludes to S
and plays on the first names of its two authors.
R stands as one of the most important and widely used languages for statistical
computing and data science tasks including data mining and data visualization [128].
It also features a variety of machine learning libraries. According to TIOBE, as of 2024
it is the 23rd most popular language globally (regardless of purpose and application).
Since 1997, R has been maintained as a GNU project and is available under the Free
Software Foundation’s GPL license.
Below, please find an example of implementing k-means in R:
data("dataset") # load data
dataframe <- scale(dataset) # scaling and centering
set.seed(123) # random state (seed)
#nstart: the number of random starting partitions
km.res <- kmeans(dataframe, centers=10, nstart=25)
print(km.res)
7.3 Association Rules Mining (1993)
Association rule mining is an area of data analysis and knowledge discovery in
databases (KDD). It concerns finding connections between items that can appear
together in some kind of activity. It can be applied, e.g., in market basket analysis,
which concerns associations between products customers buy. For example, if a
customer buys bread and ham, then there is a high chance (affinity) that they will
also buy butter. Another example: if a customer buys chocolate and wine, then they
will likely buy flowers too. This is typically written as:
{bread, ham} => {butter}
{chocolate, wine} => {flowers}
Association rule mining is sometimes used interchangeably with affinity analysis,
as they encompass similar concepts. However, affinity analysis specifically focuses
on discovering relationships based on correlation, whereas association rule mining
can be applied to a broader class of relationships. These areas emerged in 1993 after
the work “Mining association rules between sets of items in large databases” [3] by
Rakesh Agrawal, Tomasz Imieliński, and Arun Swami.
Formally, association rule mining is defined as follows. Let 𝐼 = {𝑖_1, 𝑖_2, ..., 𝑖_𝑛}
denote the set of possible items, which can also be referred to as attributes. Let 𝑡
denote a transaction that involves a group of items, i.e., 𝑡 ⊆ 𝐼 (equivalently, 𝑡 ∈ 2^𝐼). A particular instance
of a problem is defined by a database (repository) of transactions. For instance, the
items can be products in a shop and the database can contain historical transactions
made by its customers.
In association rule mining, the goal is to find rules of form: 𝑋 =⇒ 𝑌 , where
𝑋,𝑌 ⊂ 𝐼 and 𝑋 ∩𝑌 = ∅. In many applications, 𝑌 is required to contain only a single
item, i.e., |𝑌 | = 1.
Below, please find some of the basic notions used in association rule mining:
• Support is an occurrence frequency measure. It can be defined as the ratio of the
number of transactions that contain the set 𝑋 to the total number of transactions:

supp(X) = \frac{|\{ t \in T : X \subseteq t \}|}{|T|}    (7.2)
• Confidence estimates the degree of “accuracy” for a rule. It represents the
conditional probability of the conclusion given conditions (premise):
conf(X \Rightarrow Y) = \frac{supp(X \cap Y)}{supp(X)}    (7.3)
• Conviction estimates the strength of implication between the conditions
(premise) and conclusion of a rule. It is defined as follows:
conv(X \Rightarrow Y) = \frac{1 - supp(Y)}{1 - conf(X \Rightarrow Y)}    (7.4)
• Lift calculates how often the conditions and consequences of the rule occur to
the multiplied occurrences of the respective sets (as if they were independent).
lift(X \Rightarrow Y) = \frac{supp(X \cap Y)}{supp(X) \cdot supp(Y)}    (7.5)
• Jaccard index calculates the similarity between two sets by comparing the
size of their intersection to the size of their union:
J(X, Y) = \frac{supp(X \cap Y)}{supp(X \cup Y)}    (7.6)
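These measures are straightforward to compute directly from a transaction database. Below is a small illustrative sketch in Python; the toy transactions are hypothetical, and the support of a combined itemset plays the role of supp(X ∩ Y) in the formulas above:

# Hypothetical toy transaction database.
transactions = [
    {"bread", "ham", "butter"},
    {"bread", "ham"},
    {"chocolate", "wine", "flowers"},
    {"bread", "butter"},
    {"chocolate", "wine"},
]

def supp(itemset):
    # Fraction of transactions containing every item of the itemset (cf. Eq. 7.2).
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(X, Y):
    # Conditional "accuracy" of the rule X => Y (cf. Eq. 7.3).
    return supp(X | Y) / supp(X)

def lift(X, Y):
    # Co-occurrence of X and Y compared to independence (cf. Eq. 7.5).
    return supp(X | Y) / (supp(X) * supp(Y))

X, Y = {"bread", "ham"}, {"butter"}
print(supp(X | Y), conf(X, Y), lift(X, Y))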
The issue of association rule mining can be divided into two subproblems. Firstly,
large itemsets are identified, i.e., sets of items that possess a minimum required
support. The minimum support is a parameter for the problem, typically established
by experts. Secondly, among each large itemset 𝐿, minimal rules in the form of
𝑋 ⇒ (𝐿 \ 𝑋) are generated if the ratio of supp(𝐿) to supp(𝑋) is high enough.
In 1994, Agrawal and Srikant proposed an algorithm named the Apriori
algorithm [4] for the considered problem. It has become the gold standard in
the area of association rule mining. The pseudocode is shown in Algorithm 7.
It references the apriori-gen function. This function takes the set of all large
itemsets of size 𝑘 − 1 and joins them into supersets of size 𝑘. For example,
{{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}} becomes {{1, 2, 3, 4}, {1, 3, 4, 5}} af-
ter the join operation. Next, it deletes itemsets, which contain 𝑘 − 1 subsets not
present in the original set of sets. All 𝑘 − 1 subsets of {1, 2, 3, 4} are present in the
original one, so it is retained. However, the {1, 4, 5} subset of {1, 3, 4, 5} is not, so it
is deleted. The function returns the remaining undeleted supersets.
Algorithm 7: The Basic Form of the Apriori Algorithm
L_1 = {};
for each item i ∈ I do
    if supp({i}) ≥ minsupport then
        L_1 = L_1 ∪ {i};
for (k = 2; L_{k−1} ≠ ∅; k++) do
    C_k = apriori-gen(L_{k−1});
    for each transaction t ∈ T do
        C_t = {c ∈ C_k : c ⊆ t};
        for each candidate c ∈ C_t do
            count[c]++;
    L_k = {c ∈ C_k : count[c] ≥ minsupport};
return ⋃_k L_k
In the paper [4], Agrawal and Srikant also introduced a second variant of the
Apriori algorithm called AprioriTid, along with a hybrid version that combines
Apriori and AprioriTid. However, the fundamental idea remains consistent.
Association rule mining, particularly in applications like market basket analysis,
has played a pivotal role in applied AI. Its success can be attributed to factors such
as ease of use and high interpretability. It has proven effective in diverse fields,
including retail, supply chain management, recommendation systems (e.g., movie
streaming platforms), fraud detection, and various others.
7.4 Particle Swarm Optimization (1995)
Particle Swarm Optimization (PSO) is a global optimization technique that belongs
to population-based metaheuristics as well as to Swarm Intelligence approaches. It
was first introduced by James Kennedy and Russell Eberhart in 1995 [140]. The
Swarm Intelligence approaches are inspired by the collective social behavior of large
groups of simple entities, such as insects. For example, the Ant Colony
Optimization presented in Section 6.10 is also an example of Swarm Intelligence. In
these approaches, the behavior of an individual is typically very simple. However,
from the collective behavior of groups of individuals, complex behavior of the swarm
emerges (emergent intelligent properties can be observed).
In PSO, the population of particles is maintained, where each particle represents
a candidate solution in the search space of the given problem. The initial population
may be generated in a random fashion or according to some problem-specific formula
provided that it is able to generate diverse solutions. Particles move in this space
trying to optimize the quality of their position based on their current information and
the information exchanged with their neighbors. Let 𝑥𝑖 denote the components of
the position vector, and 𝑣 𝑖 denote the components of the velocity vector. In each
update step, the new position is calculated as:
𝑥𝑖′ = 𝑥𝑖 + 𝑣 𝑖 (7.7)
The formula of updating velocity is shown in Eq. 7.8.
𝑣 ′𝑖 = 𝑎 ∗ 𝑣 𝑖 + 𝜑1 × ( 𝑝 − 𝑥𝑖 ) + 𝜑2 × ( 𝑝 𝑔 − 𝑥𝑖 ) (7.8)
- where 𝑎 is an inertia coefficient; 𝜑1 , 𝜑2 are local and global attraction coefficients,
respectively. They are represented as vectors, in a general case, so each component
of the position may have a different attraction coefficient defined; 𝑝 is the personal
best position found so far, whereas 𝑝 𝑔 is the global best position found within the
neighbourhood of the particle for which the velocity is updated. In summary, there
are three attractors and the final position is a weighted average of them as shown in
Figure 7.3.
The assignment of the 𝜑 coefficients should involve randomness at each step in
order for the method to explore the state space in a diverse manner. In the original
paper [140], the 𝜑 coefficients were drawn from U [0, 1], i.e., the uniform probability
distribution from the [0, 1] interval. This is the most typical approach, however, they
are sometimes additionally multiplied by a problem-specific scaling factor.
Figure 7.3: An illustration of the particle’s position update procedure.
The quality of the positions is estimated using a fitness function (similarly to
genetic algorithms): 𝑓 𝑖𝑡𝑛𝑒𝑠𝑠( 𝑝) → R. The algorithm terminates upon reaching
a defined stop condition such as the maximum number of iterations or finding a
solution of the desired quality.
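The update rules above translate into a compact loop. Below is a minimal global-best PSO sketch in Python, assuming that every particle uses the whole swarm as its neighborhood and that the 𝜑 coefficients are redrawn from U[0, 1] at every step; the search bounds and parameter values are illustrative:

import numpy as np

def pso_minimize(f, dim, n_particles=30, n_iters=200, a=0.7, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, size=(n_particles, dim))   # positions (candidate solutions)
    v = np.zeros((n_particles, dim))                  # velocities
    p_best = x.copy()                                 # personal best positions
    p_best_val = np.array([f(xi) for xi in x])
    g_best = p_best[p_best_val.argmin()].copy()       # global best position
    for _ in range(n_iters):
        phi1 = rng.uniform(0, 1, size=(n_particles, dim))
        phi2 = rng.uniform(0, 1, size=(n_particles, dim))
        v = a * v + phi1 * (p_best - x) + phi2 * (g_best - x)   # cf. Eq. 7.8
        x = x + v                                               # cf. Eq. 7.7
        vals = np.array([f(xi) for xi in x])
        improved = vals < p_best_val
        p_best[improved], p_best_val[improved] = x[improved], vals[improved]
        g_best = p_best[p_best_val.argmin()].copy()
    return g_best, p_best_val.min()

# Example: minimize the sphere function in 5 dimensions.
print(pso_minimize(lambda z: float(np.sum(z ** 2)), dim=5))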
The size and the shape of the neighborhood used for the calculation of 𝑝 𝑔 (c.f.
Eq 7.8) plays an important role in the configuration of the PSO algorithm. Smaller
and simpler neighborhoods tend to lead to better results for complex multi-modal
problems, whereas larger neighborhoods work better for simpler problems [184]. Figure 7.4
presents common examples of neighborhood topologies.
Figure 7.4: Examples of neighborhood topologies.
There have been many variants of the PSO algorithm proposed including the
Inertia Weighted PSO [206], Fully Informed PSO [184], Discrete PSO [141] and
many others. The canonical method as well as its variants have been applied in
many domains, making them a staple in the AI / soft-computing / computational
intelligence toolsets. Examples of applications are:
• Optimization of neural networks [68, 36].
• Logistics, e.g., Capacitated Vehicle Routing Problem [142, 103].
• Motion planning in robotics [143, 345].
• Resource allocation [93, 173].
• Image processing [39, 193].
7.5 Support Vector Machines (1995)
Support Vector Machines (SVMs) are memory-efficient machine learning models
that perform well in high-dimensional spaces. They often deliver strong performance
for various problems with minimal hyperparameter tuning. For example, they have
been applied to Natural Language Processing [225] tasks. For a comprehensive
review of applications, please refer to the books [165] and [320]. There are strong
mathematical foundations behind SVMs that date back to 1960s. However, as a
machine learning model, they emerged in 1995. The seminal paper titled “Support-
vector Networks” [47] was authored by Corinna Cortes and Vladimir Vapnik, and it
was published in the Machine Learning journal.
The concept of the SVM model involves mapping the data into a high-dimensional
feature space using a predetermined non-linear mapping. Next, the idea is to separate
the features based on their classes (labels) using the best separating hyperplane.
Consequently, the separation is linear. The hyperplane serves as a multi-dimensional
decision surface. This concept is illustrated in Figure 7.5 for a simple two-dimensional
scenario.
Figure 7.5: SVM for a simple 2-dimensional case. The objective is to find the
largest margin that separates the classes. The support vectors are marked with the
𝑆𝑉 symbol. They define the margin of the largest separation.
SVMs were initially designed for binary classification problems. However, they
can be generalized to address both regression (e.g., through the introduction of
objective and loss functions) as well as multi-label classification (e.g., by utilizing
the one-vs-all classification strategy). Below, we will focus on the original - binary
classification version.
Let 𝑥 denote the input data (which can be of arbitrarily large dimensions); 𝑥𝑖
denote the 𝑖-th input vector, 𝑦𝑖 ∈ {−1, 1} denote the class encoding of the respective
𝑥𝑖 feature in the training dataset; and 𝑦 denote the entire 𝑦𝑖 vector. The optimal
hyperplane can be expressed as:
𝑤𝑜 · 𝑥 + 𝑏𝑜 = 0 (7.9)
Points that satisfy the Equation 7.9 are lying on this hyperplane. The weights 𝑤 𝑜
can be written as a linear combination of the support vectors (SV):
w_o = \sum_{i \in SV} \alpha_i \cdot x_i    (7.10)
and as a linear combination of the training vectors:
w_o = \sum_{i=1}^{|x|} y_i \cdot \alpha_i^{o} \cdot x_i    (7.11)
where 𝛼𝑖𝑜 ≥ 0 denotes a vector of parameters for the optimal hyperplane. The goal is
then to find them.
Cortes and Vapnik derive the optimization goal for two cases:
1. Hard margin hyperplane - when the training data can be linearly separated
without an error.
2. Soft margin hyperplane - when it is not possible to linearly separate data
without an error.
In the first case, the authors of SVMs show that to obtain the optimal hyperplane,
the task is to minimize 𝑤 · 𝑤 under the constraint 𝑦𝑖 (𝑤 · 𝑥𝑖 + 𝑏) ≥ 1 for 𝑖 = 1, ..., |𝑥|.
It can be done by solving the quadratic programming problem of maximizing:

W(\Lambda) = \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda    (7.12)

w.r.t. \Lambda^T = (\alpha_1, ..., \alpha_{|X|}) and subject to the constraints: \Lambda \geq 0, \Lambda^T y^T = 0. The \mathbf{1} symbol
denotes a vector of ones of the matching length (|X|).
In the second case, the aim is to construct an optimal hyperplane by minimizing
errors (since they are unavoidable). This can be achieved by minimizing:
\frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{N} \xi_i    (7.13)
subject to constraints:
𝑦𝑖 (𝑤 · 𝑥𝑖 + 𝑏) ≥ 1 − 𝜉𝑖 , 𝜉𝑖 ≥ 0 (7.14)
where 𝐶 is a constant that denotes the trade-off between the margin size and
the tolerance to misclassification; 𝜉 correspond to the deviations or errors in the
classification of training instances.
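In practice, soft-margin SVMs are readily available in machine learning libraries. The following short sketch uses scikit-learn in Python on synthetic data; the dataset and hyperparameter values are purely illustrative, and C plays the role of the trade-off constant from Equation 7.13:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# A soft-margin SVM; C trades off margin size against tolerance to misclassification.
svm = SVC(kernel="rbf", C=1.0)
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
print("Support vectors per class:", svm.n_support_)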
7.6 Artificial Intelligence: A Modern Approach (1995)
“Artificial Intelligence: A Modern Approach” is a book authored by Stuart Russell
and Peter Norvig, first published in 1995 [286]. As of February 2024, the fourth edition
is the most recent one. It is one of the most popular textbooks in
artificial intelligence. It has influenced the way AI is taught and learned. It is a
highly valuable educational resource, featuring not only theory but also examples,
exercises, and code samples. Given the profound impact this book has had on
students, researchers, and practitioners, it should be regarded as a milestone in the
history of AI, even though it does not introduce novel AI methods.
The table of contents for the 4th edition of the book is as follows:
1. Artificial Intelligence - introduction to the topic.
2. Problem-solving - solving problems by searching, search in complex environ-
ments, adversarial search and games, constraint satisfaction problems.
3. Knowledge, reasoning, and planning - the use of logic formalisms, knowledge
representation and automated planning.
Figure 7.6: Book cover of the first edition of the book.
4. Uncertain knowledge and reasoning - quantifying uncertainty, probabilistic
reasoning and multi-agent decision making under uncertainty.
5. Machine Learning - supervised learning, Bayesian learning, deep learning,
reinforcement learning.
6. Communicating, perceiving, and acting - natural language processing,
computer vision, and robotics.
7. Conclusions - ethics, safety, and the future of AI.
7.7 Covariance Matrix Adaptation Evolution Strategy (1996)
In 1996, a new method called Covariance Matrix Adaptation Evolution Strategy
(CMA-ES) was introduced by Nikolaus Hansen and Andreas Ostermeier [104]. This
method has become the state-of-the-art approach in evolutionary strategies (ES) and
one of the most successful approaches in evolutionary computing, in general. The
ES methods are employed for function optimization tasks. In particular, they are
suitable for black-box optimization and hyperparameter tuning as they can handle
noisy, high-dimensional, complex and non-differentiable functions. Examples of
applications are:
• Optimizing solar irradiation availability [137].
• Optimizing robot design [330].
• Training parameters of models for network security forecasting [125].
The novelty of the CMA-ES method lies in the fact that it considers evolution
paths. Rather than treating each mutation step separately, the algorithm keeps track
of evolving solutions through multiple generations. It aims to adapt the algorithm
in such a way as to reduce the correlation between subsequent steps. Hansen and
Ostermeier postulated that they intend to minimize the difference between the distri-
butions of the taken evolution path and the one that would be taken under random
selection (as truly random selection would be uncorrelated). In the paper [104], the
authors focus on the adaptation of arbitrary normal mutation distributions and do
not use any recombination (crossover) operators. The CMA-ES method was initially
demonstrated for the (1, 𝜆)-ES variant of evolutionary strategies, but its application to
other variants is relatively straightforward. In (1, 𝜆), there are 𝜆 candidate solutions,
which are the results of mutations (as no crossover is utilized) and 1 of them advances
to the next generation and serves as the basis for the subsequent mutations.
Due to the CMA-ES algorithm’s effectiveness, the prototypical method has
received various modifications tailored for specific optimization tasks. In general,
the algorithm consists of the following steps:
1. Initialization of:
(a) The parameters of the distribution used during the mutation - the mean
vector: 𝑚 and the covariance matrix: 𝐶
(b) The control parameters of the ES algorithm - 𝜆, step size and stop
condition.
(c) Any other variables such as evolution paths.
2. Sampling:
For 𝑘 = 0...𝜆, candidate solutions 𝑥 𝑘 are generated from the multivariate
normal distribution with mean 𝑚 and covariance 𝐶:
x_k^{(g+1)} \sim m^{(g)} + \sigma^{(g)} \mathcal{N}(0, C^{(g)})    (7.15)
- where 𝑔 denotes the generation number and 𝜎 denotes the overall standard
deviation, which also serves as the step size in the algorithm.
3. Fitness evaluation of each candidate solution. The solutions are sorted from
best to worst according to their fitness: (𝑥1 , 𝑥 2 , 𝑥 3 , ..., 𝑥𝜆 ).
4. Mean update:
m^{(g+1)} = \sum_{i=1}^{\mu} w_i \cdot x_i^{(g+1)}    (7.16)
- where 𝜇 ≤ 𝜆 is the number of solutions to advance to the next generation.
For (1, 𝜆)-ES, 𝜇 = 1. The weights 𝑤 𝑖 sum up to 1 and satisfy the following
constraints:
𝑤 1 ≥ 𝑤 2 ≥ 𝑤 3 ≥ ... ≥ 𝑤 𝜇 > 0 (7.17)
5. Evolution paths update. There are typically two evolution paths: 𝑝_𝑐 and
𝑝_𝜎 (the conjugate evolution path), used for the covariance matrix and the overall
step-size adaptation, respectively.
p_c^{(g+1)} = (1 - c_c) \cdot p_c^{(g)} + \sqrt{c_c \cdot (2 - c_c) \cdot \mu_c} \cdot \frac{w}{\lVert w \rVert_2} \cdot (m^{(g+1)} - m^{(g)})

p_\sigma^{(g+1)} = (1 - c_\sigma) \cdot p_\sigma^{(g)} + \sqrt{c_\sigma \cdot (2 - c_\sigma) \cdot \mu_\sigma} \cdot \frac{(C^{(g)})^{-1/2} \cdot (m^{(g+1)} - m^{(g)})}{\sigma^{(g)}}
- where c_c, c_\sigma \in [0, 1] and \mu_c, \mu_\sigma denote constants used for normalization; 𝑤 is
the weighted mean direction of successful solutions; and (C^{(g)})^{-1/2} is obtained from the
eigendecomposition of C^{(g)} into B^{(g)} (D^{(g)})^2 (B^{(g)})^T, where B is an orthonormal
basis of eigenvectors, and D is a diagonal matrix with the square roots of the
corresponding positive eigenvalues.
6. Covariance matrix update. The covariance matrix 𝐶 is adapted to capture
the relationships between variables in the search space. This step is crucial
for aligning the search distribution with the local structure of the objective
function:
C^{(g+1)} = update\left( C, \; p_c^{(g+1)}, \; \frac{x_1 - m^{(g+1)}}{\sigma^{(g)}}, \; \frac{x_2 - m^{(g+1)}}{\sigma^{(g)}}, \; ..., \; \frac{x_\lambda - m^{(g+1)}}{\sigma^{(g)}} \right)    (7.18)
7. Step size adaptation.
The covariance matrix adaptation alters the scale only in a single direction
per step. The step size, in the CMA-ES algorithm, allows adapting the overall
(direction-independent) variance. The common formula for updating the step
size is as follows:
\sigma^{(g+1)} = \sigma^{(g)} \exp\left( \frac{c_\sigma}{d_\sigma} \left( \frac{\lVert p_\sigma^{(g+1)} \rVert}{E \lVert \mathcal{N}(0, I) \rVert} - 1 \right) \right)    (7.19)
- where 𝑑 𝜎 is the damping parameter.
8. Check for termination. If the chosen stop criteria are not satisfied, then the
process continues from step 2. The criteria are for example: (1) reaching
𝑔 ≥ max-generations, (2) observing convergence of the optimization or (3)
achieving a satisfactory fitness value.
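To give a feel for these steps, below is a heavily simplified, CMA-ES-flavored sketch in Python: it keeps only the sampling step, the weighted mean update and a rank-based covariance update, and omits the evolution paths and step-size adaptation entirely; all parameter values (including the 0.8/0.2 mixing rate) are illustrative assumptions:

import numpy as np

def simplified_cmaes(f, x0, sigma=0.5, lam=20, mu=5, n_gens=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x0)
    m = np.array(x0, dtype=float)               # distribution mean
    C = np.eye(n)                               # covariance matrix
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                # decreasing positive weights summing to 1
    for _ in range(n_gens):
        X = rng.multivariate_normal(m, (sigma ** 2) * C, size=lam)   # sampling (cf. Eq. 7.15)
        order = np.argsort([f(x) for x in X])
        elite = X[order[:mu]]                   # the mu best candidate solutions
        m_new = w @ elite                       # weighted mean update (cf. Eq. 7.16)
        y = (elite - m) / sigma
        # Mix the old covariance with a rank-based estimate from the selected steps.
        C = 0.8 * C + 0.2 * sum(wi * np.outer(yi, yi) for wi, yi in zip(w, y))
        m = m_new
    return m, f(m)

# Example: minimize the sphere function in 4 dimensions.
print(simplified_cmaes(lambda z: float(np.sum(z ** 2)), x0=[3.0, -2.0, 1.0, 4.0]))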
7.8 Computing with Words (1996)
1996 saw further development of fuzzy logic and fuzzy modeling. In this year, Lotfi
Zadeh published an article titled “Fuzzy Logic = Computing with Words” [340].
He began the paper with the statement that “fuzzy logic has come of age”, which
could also be a suitable title for this section. At this point, the foundational fuzzy
set theory that underpins fuzzy logic had already been in existence for 31 years
(c.f. Section 4.1). “Computing with Words” (CWW) is a paradigm advocating the
use of linguistic terms for modelling various quantities instead of – or in addition
to – numbers in AI reasoning. It is also one of the practical applications of the
so-called soft-computing idea.
The overall conceptual scheme of the CWW approach is presented in Figure 7.7.
It is a highly human-centered approach, therefore, the input and output are repre-
sented in a natural language. The input is then transformed into the fuzzy logic
formalism, a task that can be performed by an expert, using some form of automation
or a combination of both (in a hybrid fashion). Fuzzy variables and fuzzy rules, which
express constraints and consequences, serve as the building blocks for the knowledge
base in the system. Subsequently, computation takes place, operating on the created
knowledge base. New information is inferred in this step. Finally, the fuzzy terms are
converted back into natural language, making them easily interpretable. A specific
implementation of the CWW idea might differ in details. For instance, a fuzzy expert
system created in the spirit of CWW might adopt a human-in-the-loop approach
with an infinite loop, enabling both incremental improvement of the system and
querying at any time.
Figure 7.7: The concept of the Computing with Words approach.
Some of the key notions used in the CWW paradigm are:
• Linguistic variable is a variable having both the name and the value expressed
with linguistic terms in a natural (or artificial) language. For example, today’s
temperature can be a linguistic variable with possible values: very cold, cold,
moderate, warm, hot.
Linguistic variables are one of the building blocks in the CWW model. They
are less precise than specific numbers, e.g., a temperature equal to 30 degrees
Celsius, but they are convenient to work with for a couple of reasons:
1. They inherently allow the representation of subjective, uncertain or im-
precise data. For instance, in traditional purely numerical computing, the
assignment 𝑡𝑒𝑚 𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒 = 30 does not allow any space for subjectivity
or imprecision. Including those concepts would require equipping each
variable with additional constructs such as confidence intervals. In CWW,
the assignment of a value to a linguistic variable does not require any
additional formalisms, however, the value itself carries the subjectivity /
imprecision / uncertainty.
2. The domains of linguistic variables are usually much more constrained
than the domains of numeric variables. For example, a linguistic
temperature might be used with as few as 5 values (ranging from very
cold to hot). This is to say that CWW is suitable for computing on a
higher level of abstraction. Linguistic variables can represent certain
linguistic summaries of data.
3. Finally, linguistic variables often naturally appear in conversations or
human-written texts. For example, they can be directly taken from an
article written in a natural language.
Figure 7.8: A fuzzy variable Temperature with three values. Each value is represented
as a fuzzy set over a base variable (numeric temperature).
• Information granule - defined as a fuzzy set of points put together under
the same label because of their similarity. In CWW, a word 𝑤 is a label of
a granule 𝑔, whereas 𝑔 is the denotation of 𝑤. For example, very cold is a label
of a granule that groups together various very low temperatures. In [340],
Zadeh postulates that granulation plays a key role in human cognition. The
term granular computing, that concerns processing information using data
abstractions (information granules), emerged later but has been inspired by
CWW.
• Fuzzy constraint - plays a pivotal role in CWW. It serves as a functional
representation of a fuzzy variable. Formally, fuzzy constraints are fuzzy sets
defined over base variables (numeric ones) that “glue together” the linguistic
and numerical representations. In the following sentence:
Bob lives near Alice
there is a fuzzy constraint (“near”) on the distance variable between Bob and
Alice.
In the sentence:
It is very cold in here
there is a constraint on the temperature linguistic variable. Figure 7.8 illustrates
the concept of fuzzy linguistic variables - three words describing constraints
on the temperature (the base variable) are used with their interpretation as
fuzzy sets (a short code sketch of such membership functions follows after this list).
• Fuzzy IF-THEN rule is similar to a classic IF-THEN rule and has the form of:
IF conditions THEN consequence
However, the conditions can be expressed in fuzzy variables terms, e.g.:
IF temperature is very cold THEN risk is high
• Fuzzy quantifiers and generalized constraints. In [340], Zadeh proposed to
incorporate into CWW many types of operations, which are available in other
reasoning systems. In order to be embedded into CWW, each operation needs
to be represented both in linguistic terms (which can be regarded as a keyword)
and within the fuzzy logic formalism. Such keywords enrich the structure
of fuzzy rules and increase the expressivity of the inference mechanism that
uses them. Zadeh proposed to use the term isr for a generalized constraint.
For example, disjunction and conjunction of fuzzy linguistic variables are
constraints. Other examples can be:
usually(X is R)
most_of(X is R)
almost_all(X is R)
The ideas to employ fuzzy logic for representing not only uncertain/imprecise
variables but also operations have been picked up by many researchers.
Examples are the works “Fuzzy Quantifiers: a Computational Theory” [90]
and “Fuzzy Ordered Weighted Averaging” [186].
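Returning to the temperature example from Figure 7.8, the fuzzy sets behind linguistic values can be sketched as simple membership functions in Python; the triangular shapes and the breakpoints below are illustrative assumptions only:

import numpy as np

def triangular(x, a, b, c):
    # Triangular membership function with support [a, c] and peak at b.
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

temperature = 8.0  # a measured (numeric) base-variable value, in degrees Celsius
print("cold:", triangular(temperature, -20, -5, 12))
print("moderate:", triangular(temperature, 5, 17, 28))
print("hot:", triangular(temperature, 22, 35, 45))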
In summary, “Computing with Words” has been considered an inspiring ap-
proach in AI due to its human-centric nature, its ability to handle uncertainty, its
expressiveness and interpretability, its versatility across domains, and its facilitation
of effective human-machine interaction. The paradigm has contributed to the devel-
opment of more inclusive and adaptable AI systems that can effectively navigate the
complexities of real-world information. It has been applied in expert systems based
on fuzzy rules and fuzzy reasoning and also in hybrid projects combining fuzzy
logic with machine learning [127, 126, 272].
7.9 Long Short-Term Memory (1997)
This section is dedicated to the introduction of a novel artificial neural network
model called Long Short-Term Memory (LSTM). It was proposed in 1997 by Sepp
Hochreiter and Jürgen Schmidhuber in an article published in Neural Computation
journal [119].
Similarly to the autoencoders, which were described in Section 6.11 and CNNs,
which were described in Section 6.8, LSTMs are among the most important ANN
models in the history of machine learning. Their effectiveness and popularity are
attributed to the following main reasons:
1. Improving Vanilla Recurrent Neural Networks (RNNs). Long Short-Term
Memory is a recurrent neural network model (for introduction to them, please
refer to Section 6.9). One of the common problems of recurrent neural networks
is that of vanishing or exploding gradients, which describes a situation in
which the propagated gradient in the network converges to zero or infinity,
respectively. This is caused by many repeated multiplications by the same
weights, and the effect depends on the particular weight values, i.e., whether
𝑤 𝑖 > 1 or 𝑤 𝑖 < 1 dominate. Typically, the vanishing gradient is more prevalent.
LSTMs are designed in such a way that the effect of vanishing (or exploding)
gradient is mitigated. This was one of the main motivations behind inventing
them.
2. Practical Applications. LSTMs have been applied in various domains includ-
ing natural language processing [158], speech recognition [96], time series
prediction [274], video analysis [307], and many others. They played a crucial
role in breakthroughs in game AI in Dota 2 and Starcraft II, both of which will
be discussed in this book in Section 11.4 and Section 11.5, respectively. They
are used for automatic text translations by various commercial applications.
3. Handling Long-Term Dependencies. LSTMs have a unique architecture
that allows them to capture and remember information over long sequences,
making them well-suited for tasks requiring memory of past events.
4. Adaptability to Variable Input Sequences. The processing of LSTM is
adjusted for the input of arbitrary length. This is partially achieved thanks to
reusing the same weights and biases.
5. Serving as Foundations for Other Architectures. The LSTM models have
been used directly and also have been incorporated into other state-of-the-art
architectures such as stacked LSTMs and transformers.
As previously mentioned, the official introduction of the LSTM architecture
dates back to 1997 [119]. However, owing to its popularity and effectiveness, the
model has undergone refinements and extensions. One notable modification was the
introduction of the so-called forget gates [89], in 1999. A rather popular variant will
be briefly presented below.
RNNs essentially consist of sub-networks with outputs connected to inputs,
allowing the signal to pass through time. Besides the problems of vanishing or
exploding gradients, regular RNNs encounter difficulties in learning long-term
dependencies between data due to the absence of explicit memory. In each pass,
weight updates may occur in such a way that older patterns are not effectively
retained. The LSTM model, on the other hand, introduces two types of connections:
short-term and long-term ones. Certain neural network models are based on their
model-specific building blocks referred to as units (a general term) or cells (in the
RNNs terminology). A unit is a basic processing component that can be viewed
as a generalized neuron. While a neuron is a unit (the simplest one), a unit does
not necessarily have to be just one neuron. Figure 7.9 presents an LSTM unit. The
entire LSTM network is built using these units.
Figure 7.9: A building block of an LSTM network, called a cell. These cells can be
chained together. All the operations are vector operations. Each layer has a dedicated
set of weights and biases. The long-term connection is marked by the light blue
color.
• x_t ∈ R^k is the input to the network, where k is its size (the number of input features).
• h_{t-1} is the output from the previous time step, which represents the hidden state.
• f_t is the output from the forget gate. The input data concatenated with the data coming from the previous time step is multiplied by the weights W_f of the forget gate and squashed by a sigmoid activation function. The result defines how much of the memory C_{t-1} to retain.
• i_t is the transformed input vector, whereas \tilde{C}_t is a corresponding filtered candidate vector for updating the memory cell state. These two vectors have distinct activation functions, weights, and biases associated with them.
• ΔC_t = i_t ⊙ \tilde{C}_t. It is added to the filtered previous memory cell state C_{t-1} ⊙ f_t.
• The output o_t is first computed by a sigmoid function in a standard way, using the concatenated h_{t-1}, x_t and a separate set of weights W_o. Next, it is multiplied element-wise by tanh(C_t). Such an output forms the new h_t vector. Independently of it, there is also the C_t output connection.
For training LSTM networks, please refer to papers [89, 337].
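To complement Figure 7.9, the following is a minimal NumPy sketch of one forward step of an LSTM cell with a forget gate, following the gating description above; the weight shapes, the random initialization, and the toy sequence are assumptions for illustration only and do not cover training.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # One forward step of an LSTM cell (illustrative sketch with assumed shapes).
    # x_t: input (k,), h_prev: previous hidden state (n,), c_prev: previous cell state (n,)
    # W: dict of weight matrices of shape (n, n + k), b: dict of bias vectors of shape (n,)
    z = np.concatenate([h_prev, x_t])      # concatenated [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])     # forget gate: how much of c_prev to retain
    i_t = sigmoid(W["i"] @ z + b["i"])     # input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])   # candidate memory update
    c_t = f_t * c_prev + i_t * c_hat       # new long-term (cell) state
    o_t = sigmoid(W["o"] @ z + b["o"])     # output gate
    h_t = o_t * np.tanh(c_t)               # new short-term (hidden) state
    return h_t, c_t

# Toy usage with random weights (assumed sizes: k = 3 input features, n = 4 hidden units).
k, n = 3, 4
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((n, n + k)) * 0.1 for g in "fico"}
b = {g: np.zeros(n) for g in "fico"}
h, c = np.zeros(n), np.zeros(n)
for x in rng.standard_normal((5, k)):      # process a sequence of 5 inputs
    h, c = lstm_step(x, h, c, W, b)
print(h)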
7.10 AdaBoost (1997)
Boosting is one of the two widely used ensemble techniques in machine learning. It
involves combining the outputs from multiple weak learners to create a single strong
learner model. In this context, a weak learner is an ML model that performs only
slightly better than random guessing. In practice, it is typically a simple model, e.g.,
one operating on a single “rule of thumb”, which will sometimes be accurate but, in
general – over all problem instances – not much better than a random guess. The
central concept behind boosting is to iteratively train weak learners by focusing on
the instances where they make the most mistakes. Consequently, each iteration aims
to improve the model’s performance in its weakest spots.
In 1997, the AdaBoost (from “Adaptive Boosting”) algorithm was proposed
in a paper titled “A decision-theoretic generalization of on-line learning and an
application to boosting” [82] by Yoav Freund and Robert E. Schapire. AdaBoost
is one of the most important boosting techniques. In the following paragraphs, the
algorithm will be presented in the same way and using the original equations as in
the paper [82].
The first introduction of the AdaBoost algorithm was presented for a binary
classification problem. Let there be a sequence of 𝑁 training observations:
(𝑥 1 , 𝑦 1 ), ..., (𝑥 𝑁 , 𝑦 𝑁 ), each 𝑦𝑖 ∈ {0, 1} coming from some distribution P. The
goal was to find a hypothesis ℎ 𝑓 which is consistent (with low error) with most of
the samples. The algorithm utilizes a WeakLearn method that performs learning for
each of the weak learners. Firstly, the weights vector is initialized for each training
example:
w_i^1 = \frac{1}{N}    (7.20)
- where the superscript 1 denotes the first iteration number.
Then, for each iteration 𝑡, repeat steps (1)-(5):
1. Construct the distribution 𝑝 𝑡 based on the current weights:
p^t = \frac{w^t}{\sum_{i=1}^{N} w_i^t}    (7.21)
2. Run the WeakLearn procedure providing it with distribution 𝑝 𝑡 . It returns the
hypothesis: ℎ𝑡 : 𝑋 → [0, 1].
3. Calculate the error of ℎ𝑡 :
\varepsilon_t = \sum_{i=1}^{N} p_i^t \, |h_t(x_i) - y_i|    (7.22)
4. Set the weight update normalization parameter:
\beta_t = \frac{\varepsilon_t}{1 - \varepsilon_t}    (7.23)
5. Update the weights:
w_i^{t+1} = w_i^t \cdot \beta_t^{\,1 - |h_t(x_i) - y_i|}    (7.24)
Finally, output the hypothesis:
h_f(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \log(\frac{1}{\beta_t}) \, h_t(x) \geq \frac{1}{2} \sum_{t=1}^{T} \log(\frac{1}{\beta_t}) \\ 0 & \text{otherwise} \end{cases}    (7.25)
In their paper [82], Freund and Schapire have proven that the combination rule
generated by AdaBoost is optimal in the Bayesian sense.
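To connect Equations 7.20-7.25 with running code, below is a small NumPy sketch of the binary AdaBoost loop; the decision-stump weak learner and the toy dataset are assumptions introduced only for illustration and are not part of the original paper.

import numpy as np

def stump_learn(X, y, p):
    # Assumed WeakLearn: a threshold on a single feature minimizing the weighted error.
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = (sign * (X[:, j] - thr) > 0).astype(float)
                err = np.sum(p * np.abs(pred - y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda X: (sign * (X[:, j] - thr) > 0).astype(float)

def adaboost(X, y, T=10):
    N = len(y)
    w = np.full(N, 1.0 / N)                      # Eq. 7.20: initial weights
    hyps, betas = [], []
    for _ in range(T):
        p = w / w.sum()                          # Eq. 7.21: distribution over examples
        h = stump_learn(X, y, p)                 # WeakLearn given distribution p
        eps = np.sum(p * np.abs(h(X) - y))       # Eq. 7.22: weighted error
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # numerical safeguard (assumption)
        beta = eps / (1.0 - eps)                 # Eq. 7.23
        w = w * beta ** (1.0 - np.abs(h(X) - y)) # Eq. 7.24: weight update
        hyps.append(h); betas.append(beta)
    def h_f(Xq):                                 # Eq. 7.25: weighted-majority combination
        alphas = np.log(1.0 / np.array(betas))
        votes = sum(a * h(Xq) for a, h in zip(alphas, hyps))
        return (votes >= 0.5 * alphas.sum()).astype(int)
    return h_f

# Toy usage on assumed data: two Gaussian blobs labelled 0 and 1.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]
print((adaboost(X, y)(X) == y).mean())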
For the multi-label classification, Freund and Schapire made the following
changes to the previously shown procedure:
1. They replaced |h_t(x_i) - y_i| with the indicator [[h_t(x_i) ≠ y_i]].
2. In step 3, they added: “if \varepsilon_t > \frac{1}{2} then set T = t - 1 and abort loop”.
3. The final hypothesis is:
h_f(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \log(\frac{1}{\beta_t}) \, [[h_t(x) = y]]    (7.26)
The authors of AdaBoost stated that the proposed framework is relatively general
and can be applied to various problems. For example, AdaBoost can be utilized to
minimize the loss function in a gradient-descent fashion, i.e.:
f_t(x) = f_{t-1}(x) + \alpha_t \, h_t(x)    (7.27)
- where f_t(x) is the current prediction; \alpha_t is the step size; h_t(x) is the weak learner's
prediction.
AdaBoost has seen numerous variants, such as LogitBoost, which minimizes the
logistic loss \sum_i \log(1 + e^{-y_i f(x_i)}), or GentleBoost, which assigns confidence values
to each weak learner’s predictions, allowing for a more gradual adjustment of the
weights. This confidence-rated approach helps reduce the influence of outliers and
can improve the algorithm’s robustness.
The algorithm can be applied to any domain. Some common applications
outlined in [335] include handwritten character recognition, face detection, filtering of
images, text, and multimedia information, spam e-mail filtering, human-action
recognition, and computational biology.
7.11 Deep Blue (1997 breakthrough)
Deep Blue [124, 35] was a chess-playing AI that operated on a computer manufactured
by IBM, designed specifically for this purpose. The name “Deep Blue” is commonly
used to refer to both the AI and the computer. At its peak, Deep Blue secured a
place on the TOP500 list, which features the 500 most powerful supercomputers
globally, ranking at the 259th position.
In 1997, one of the most famous pre-2000 breakthroughs in AI took place.
Deep Blue defeated Garry Kasparov in a six-game match of chess with the result
of 3½ − 2½. Deep Blue won two games, drew three, and lost one. The match was
held at the Equitable Center in New York, with the first game played on May 3 and
the last on May 11. The prize money was $1.1M. Kasparov received $700K for
participation. The duel received significant media coverage. Unlike many of the
achievements covered in this book, it was known to people outside the mathematical
and computer science fields. The matches were reported in newspapers, magazines,
and television news, contributing to a broader discussion about the capabilities
of artificial intelligence and its potential impact on various fields. In the second
game, Kasparov accused IBM of cheating, claiming that one of Deep Blue's moves
stood out and had been made manually by a human grandmaster. It is interesting to note
that, nowadays, cheating in chess involves the use of machine assistance. Kasparov
requested a rematch, but Deep Blue was disassembled and has not participated in
any further matches.
To put this achievement into perspective - this marked the first occurrence when
an artificial player defeated a human world champion in such a complex game as
chess. On a side note, Kasparov had previously faced Deep Blue in 1996 and emerged
victorious: 4 - 2. However, for the subsequent match, Deep Blue was upgraded and
its program was improved. Garry Kasparov was an undisputed champion from 1985
to 2000. He held the highest FIDE (International Chess Federation) chess rating of
2851 until 2013. As of February 2024, his peak rating remains the second-highest
in history. Garry Kasparov is widely regarded as one of the greatest chess
players of all time and is often considered the strongest by many.
Deep Blue was a group effort of IBM Research. The lead creators were Feng-hsiung
Hsu, Murray Campbell, Joseph Hoane, and C.J. Tan. Let me list the main
components responsible for the success of this project:
1. Sheer computational power supported by the supercomputer consisting of
30 PowerPC 604e-based nodes and 480 VLSI chips specialized for chess. It
was capable of analyzing 200 million chess positions per second, a substantial
increase from the 5 million positions it could analyze in 1995, marking a
40-fold improvement. Depending on the current position, Deep Blue was
capable of searching from 6 up to even 20 moves deep in the tree.
2. Opening book consisting of about 4000 states. It was created from more than
300,000 grandmaster games. Deep Blue played by the book until it contained
the current position on the board. Then, it switched to using a tree search
algorithm.
3. Search algorithm of choice was zero-window alpha-beta pruning (c.f.
Section 5.3), implemented in hardware on the specialized chips (a generic software
sketch follows this list). Apart from that, the algorithm was rather standard. The prototypical
alpha-beta pruning had been known for over two decades. The algorithm was parallelized
to work efficiently with the supercomputer.
4. Heuristic evaluation function that consisted of 8000 features [34]. It included
elements related to the material strength, king safety, pins, X-rays, square
control, pawn structure, passed pawns, outposts, pawn majority, trapped pieces,
development, mobility and other chess concepts. Some of the features were
specific board positions with assigned evaluation. There was also a simplified,
faster variant of the evaluation function that utilized only a subset of features.
The fast variant was prioritized if Deep Blue was confident enough about the
approximation (e.g., winning by a substantial margin in material).
5. Endgame database that contained every endgame with up to 5 pieces on the
board.
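As a software-level illustration of the zero-window idea mentioned in point 3 (and not of Deep Blue's parallel hardware implementation), here is a generic fail-soft alpha-beta sketch in Python together with a null-window test; the toy tree and its leaf values are assumptions.

import math

def alphabeta(node, depth, alpha, beta):
    # Generic fail-soft alpha-beta in negamax form (illustration only).
    children = node.get("children", [])
    if depth == 0 or not children:
        return node["value"]
    best = -math.inf
    for child in children:
        score = -alphabeta(child, depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:        # cutoff: the opponent will avoid this line
            break
    return best

def null_window_test(node, depth, guess):
    # Zero-window search: the window (guess - 1, guess) only answers whether the
    # true value reaches "guess"; such cheap tests can be repeated to narrow the value.
    return alphabeta(node, depth, guess - 1, guess) >= guess

# Toy tree with assumed static leaf values, given from the root player's perspective.
def leaf(v):
    return {"value": v}

tree = {"value": 0, "children": [
    {"value": 0, "children": [leaf(3), leaf(5)]},
    {"value": 0, "children": [leaf(2), leaf(9)]},
]}
print(alphabeta(tree, 2, -math.inf, math.inf), null_window_test(tree, 2, 4))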
In summary, Deep Blue made a significant impact on AI, computer science,
and the chess community. It heralded an era in which AI programs became more
efficient players than humans in an increasing number of games. Much of Deep
Blue’s success can be attributed to its computational capabilities. While it was
undeniably a remarkable engineering achievement, it should also be noted that it
was a practical demonstration of various research ideas such as refined alpha-beta
pruning. Game tree search has always been recognized as a part of AI as a research
field. Deep Blue marked a historic moment in the man vs. machine competitions,
often referred to as “Carbon vs. Silicon.” Since then, the development of more
powerful and sophisticated chess engines, coupled with advances in hardware and
algorithms, has further solidified the superiority of machines in chess. Today,
top-level human players often use chess engines for analysis and improvement.
However, in head-to-head matches, even the best human players consistently face
defeat against the strongest computer programs.
8. Years 1998 - 2006
8.1 Reinforcement Learning: An Introduction (1998)
“Reinforcement Learning: An Introduction” is a book written by Richard S. Sutton
and Andrew G. Barto [290]. It was published in 1998 by The MIT Press. It has
made it to the list of important events in the history of AI due to its significance
as the go-to source in the subfield of reinforcement learning (RL). In 1998, there
was already a growing interest in RL within academia. However, it still lacked a
critical mass of successful applications. One of the most prominent exceptions was
the success of TD-Gammon, as described in Section 7.1. Given this context, there
was a need for a comprehensive source of knowledge that could bring together the
concepts of RL for a wider audience, including researchers, students, and engineers.
Subsequent advancements in the field led to the development of Deep Reinforce-
ment Learning, which has had a significant impact on robotics and AI. Breakthroughs
such as AlphaGo and OpenAI Five, discussed in Sections 10.8 and 11.4, respec-
tively, exemplify the progress made in this area. Reinforcement learning remains
a promising paradigm due to its similarity to the natural way humans learn (c.f.
Figure 8.1). Some researchers believe that a combination of evolutionary algorithms
(which simulate the “learning” of the species) and RL (responsible for an individual’s
learning throughout their life) is a potential path to artificial general intelligence
(AGI) [65]. The “Reinforcement Learning: An Introduction” book still remains
a definitive and go-to resource in the field of RL. In summary, the contents of its
second edition (from 2020) are:
1. Introduction - of the basic idea and key elements of reinforcement learning.
An example of applying RL in Tic-Tac-Toe is presented. This chapter introduces
the notion of the value function.
2. Multi-armed Bandits - focuses on the multi-armed bandit problem, which
is also referred to as the k-armed bandit problem (it will be presented in
Section 8.9 of this book). The concept of upper-confidence bounds is explained.
Figure 8.1: A basic scheme of the reinforcement learning loop. The process is
iterative. The chosen action and the environment determine the next state. The agent
observes the next state and the obtained reward to update its policy of choosing
subsequent actions. The goal is to maximize the cumulative reward over the training
horizon (or over an infinite horizon). In general, the immediate rewards obtained can be positive
(positive reinforcement), negative (penalty), or zero (no reward in the current state). A minimal
sketch of this interaction loop is given after the chapter overview below.
3. Finite Markov Decision Processes - presents the formal model – Markov
Decision Processes (MDP) – for applying RL. Here, many essential components
of RL are discussed including the policy function.
4. Dynamic Programming - the chapter dedicated to using dynamic program-
ming for solving MDP. RL and dynamic programming share mathematical
foundations by means of Bellman equations (c.f. Section 3.5).
5. Monte Carlo Methods - are a statistical approach to estimating the value
function and value-action pairs. A specific Monte Carlo method, called
Monte Carlo Tree-Search, aimed for planning in combinatorial spaces, will be
presented in Section 8.9.
6. Temporal-Difference Learning - discusses TD learning methods (c.f. Sec-
tion 6.7).
7. n-step Bootstrapping - introduces an approach that synthesizes concepts
from Monte Carlo and temporal-difference (TD) learning methods.
8. Planning and Learning with Tabular Methods - here the concepts of the
model-based and model-free methods are explained. The main difference
is that the model-based methods rely on planning, whereas the model-free
methods rely on learning. Various tabular methods are introduced in this
chapter.
9. On-policy Prediction with Approximation - on-policy methods update
the policy the agent is currently using based on the value function. The
approximation means that the value function is represented in a non-exact
way - e.g., by a neural network. This is in contrast to the tabular methods.
10. On-policy Control with Approximation - focuses on the control problem
(designing a controller to influence the behavior of a system, e.g., a robot, over
time to achieve desired objectives) with on-policy approaches. The methods
such as Sarsa are presented.
11. Off-policy Methods with Approximation - this type of methods allow the
agent to update its policy based on experiences generated by a different policy.
12. Eligibility Traces - are short-term memory vectors that can make the learning
more efficient. This concept in the context of various RL methods is the topic
of this chapter.
13. Policy Gradient Methods - introduces the Policy Gradient theorem and
method that learn a parameterized policy without the explicit need of using the
value function to do so. For example, the REINFORCE algorithm is presented
here.
14. Psychology - explores the connection between reinforcement learning and
human psychology.
15. Neuroscience - discusses how the natural, biology-driven learning can be
understood through the lens of reinforcement learning.
16. Applications and Case Studies - they include TD-Gammon (c.f. Sec. 7.1),
Samuel’s Checkers (c.f. Sec. 3.7), and Alpha-Go (c.f. Sec. 10.8).
17. Frontiers - covers the then recent research trends and discusses the future of
AI.
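As a concrete illustration of the interaction loop from Figure 8.1, below is a minimal tabular Q-learning sketch on a toy corridor environment; the environment, the hyperparameter values, and the epsilon-greedy policy are assumptions chosen only for demonstration and are not taken from the book.

import random

# Toy corridor of 6 cells; reaching the last cell gives reward 1 (all details assumed).
N_STATES, ACTIONS = 6, (-1, +1)
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

def step(state, action):
    # Environment: move within the corridor, reward only when the goal cell is reached.
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

random.seed(0)
for episode in range(200):
    s = 0
    for _ in range(100):                       # finite horizon per episode
        # epsilon-greedy policy with random tie-breaking: mostly exploit, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: (Q[(s, act)], random.random()))
        s2, r, done = step(s, a)
        # Q-learning update: move the estimate toward reward + discounted best future value.
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2
        if done:
            break
print({s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES)})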
8.2 Random Forests (2001)
In 2001, Leo Breiman published an article titled “Random Forests” [26] in the
Machine Learning journal. This technique has since become one of the most popular
and widely used methods in machine learning both for classification and regression.
Random forest is based on the idea of bagging (bootstrap aggregating). Bagging is
the second of the two widely used ensemble approaches in machine learning. The
first one was boosting that has already been covered in Section 7.10.
A random forest is an ensemble of many decision trees, or tree-based classifiers,
more generally. The idea is illustrated in Figure 8.2. We already know how
an individual decision tree is constructed (for the description, please refer to
Section 6.1). Each decision tree serves a similar role to the weak learner from the
boosting approach. The decision trees are simpler classifiers (or estimators) that
together with the aggregation method constitute the ensemble classifier. There are
two principles during the construction of the model. Typically, both are applied
simultaneously:
1. Bootstrapped sampling. Each tree in a random forest is built on a random
subset of the training data. This process is known as bootstrapped sampling,
where a random sample with replacement is taken from the original dataset to
build each tree.
2. Feature randomization. When splitting nodes in the decision trees, only a
random subset of features is considered at each split. It increases the chance
of lowering the correlation between the decision trees.
Breiman empirically demonstrates that, ideally, the individual decision trees
in the random forest should not be strongly correlated. They should capture dif-
ferent aspects of the underlying patterns in the data. This diversity is beneficial
Figure 8.2: A random forest model. This simple example illustrates four decision
tree classifiers, each operating on a two-element subset of five decision variables.
The thicker lines represent the inference path for the current observation (input data),
while the colors indicate distinct decisions.
because it helps to reduce the chance of over-fitting to the training data, leading
to a more robust and generalizable model. However, he also shows that apart
from low correlation, it is beneficial if each tree maintains a reasonable strength.
The number of trees is one of the hyperparameters of the model. In practical
applications, the value of this parameter is usually set between 40 and 300.
Depending on a specific implementation, other hyperparameters may include the
maximum number of features to use per tree, as well as all the hyperparameters
required for decision tree induction, such as the split method and the maximum depth.
After a random forest model is created, it can be queried for new data. There are
a few strategies of generating output from the ensemble of trees. For example:
• Majority voting for classification problems:
y_{final} = \arg\max_{c \in Classes} \sum_{t \in Trees} I(y_t = c)    (8.1)
– where I is the indicator function.
• Soft voting for classification:
y_{final} = \arg\max_{c \in Classes} \frac{1}{N} \sum_{t \in Trees} P(y_t = c)    (8.2)
– where P(y_t = c) denotes the probability assigned by tree t to class c.
• Mean for regression problems:
y_{final} = \frac{1}{N} \sum_{t \in Trees} y_t    (8.3)
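The two construction principles and the majority-voting rule of Equation 8.1 can be sketched in a few lines of Python; the use of scikit-learn's DecisionTreeClassifier for the individual trees, the hyperparameter values, and the toy dataset are assumptions for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, n_trees=100, max_features="sqrt", rng=None):
    # Each tree sees a bootstrap sample of the rows; a random subset of features
    # is considered at every split (delegated here to max_features).
    rng = rng or np.random.default_rng(0)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))    # bootstrap: sample rows with replacement
        tree = DecisionTreeClassifier(max_features=max_features,
                                      random_state=int(rng.integers(1 << 30)))
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def predict_majority(forest, X):
    # Majority voting over the trees (Eq. 8.1).
    votes = np.stack([tree.predict(X) for tree in forest])    # shape: (n_trees, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

# Toy usage on assumed data: two classes separated along the first feature.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
forest = fit_random_forest(X, y, n_trees=25)
print((predict_majority(forest, X) == y).mean())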
The random forest model has demonstrated significantly greater strength com-
pared to the decision tree model. Moreover, it is much less prone to over-fitting
and effectively generalizes to unseen data. Breiman shows that it achieves error
rates similar to AdaBoost but boasts increased robustness in dealing with noisy data,
while being much more straightforward to implement. Random forests provide an
inherent method for determining feature importance as the mean decrease in impurity
(i.e., information gain) across all trees. This enhances the model’s interpretability.
However, due to its ensemble nature, which involves multiple decision trees, it
sacrifices the “diagram-like” interpretability offered by a single tree. The complexity
of comprehending all rules from every tree and understanding the interactions
leading to the final output typically surpasses human cognitive capabilities. Random
forests can be applied to essentially any ML problem. In particular, they have been
frequently applied to tabular data [268, 273]. In recent years, however, they have lost
some popularity in favor of the XGBoost (eXtreme Gradient Boosting) method that
will be introduced in Section 10.6. It is also worth noting that random forests are
often chosen as a baseline method to compare against.
8.3 Neuro-Evolution of Augmenting Topologies (2002)
Neuro-Evolution of Augmenting Topologies (NEAT) is an approach that connects
two important AI concepts: neural networks and evolutionary computation. It was
proposed in a 2002 paper authored by Kenneth O. Stanley and Risto Miikkulainen
titled “Evolving Neural Networks through Augmenting Topologies” [282]. NEAT
represents a step further from the standard neuroevolution (NE) that emerged in
the early 1990s. The underlying principle of NE was to search for the parameters of a
neural network with a fixed topology that performs well in a given task. A typical
application of NE was as an alternative to reinforcement learning, particularly in
dynamic environments providing some form of feedback, and less frequently in
supervised learning scenarios. Despite being computationally more intensive than
backpropagation, NE can be applied to problems with non-differentiable objec-
tive/loss functions. In theory, as a global optimization method, it is also less prone
to getting stuck in a local minimum during training.
The NEAT approach combines the following ideas:
1. Evolving topology: In contrast to vanilla NE, NEAT does not assume a fixed
network topology. Not only the weights are updated through evolution but also
the set of nodes and connections between them. As a result, the optimization
is holistic.
2. Incremental complexification: NEAT starts with a minimal-size neural
network. The topology is incrementally complexified as the problem requires. This
is possible due to a variable-length genome rather than a fixed-length encoding.
It allows the dimensionality to be kept small, which has a positive impact on
training and inference.
3. Protecting innovation through speciation: The authors define innovation as
adding new structure to the network (nodes or connections). Such an operation
introduces non-linearity and usually decreases the performance of the network
for some time, because the new structure needs time to optimize its associated
weights. It is very rare that a new structure performs well as soon as it is
introduced. NEAT groups similar individuals into niches. For a given number
of iterations the individuals compete within their own niches instead of the
entire population. In paper [282], the similarity measure of two individuals
was defined as a parameterized function shown in Equation 8.4:
\delta = \frac{c_1 E}{N} + \frac{c_2 D}{N} + c_3 W    (8.4)
- where 𝐸 and 𝐷 denote the number of excess and disjoint genes, respectively;
𝑊 denotes the average weight differences of matching genes; whereas the
remaining parameters are tunable constants. To protect innovation, NEAT
uses an explicit fitness sharing mechanism. In reproduction, organisms in the
same niche share the same fitness. This also controls the size of the niche as it
is unlikely that the larger niche will keep increasing its fitness.
Stanley and Miikkulainen discuss a few different direct and indirect encodings
to represent evolving neural networks [282]. For NEAT, they proposed to represent
genome as a variable-length list of connection genes. Each connection gene
contains:
1. The index of the in-node. The existence of the index implies the existence of
the corresponding node.
2. The index of the out-node. The existence of the index implies the existence of
the corresponding node.
3. Weight of the connection
4. Whether the gene is expressed (1: enabled, 0: disabled).
5. Innovation number - which represents a discrete timestamp of the structural
modification that created this gene. It is shared among genes of different
individuals, which have the same origin. The innovation number serves as the
ID of the given gene. The ID is persistent if the gene is reused (cloned), e.g.,
advances with the individual to the next population.
Mutation can change any weight, add a connection, or enable or disable
any connection. Both the add-connection and add-node mutations increase the size of
the genome. The first one connects two previously unconnected nodes and assigns
a random weight to the new connection. The add-node mutation adds a node that
replaces a connection, as shown in Figure 8.3. Let A → B denote the replaced
connection. Instead, two new connections are constructed: A → C and C → B,
where C is the new node.
Crossover is performed with a given probability. The idea is to compose new
individuals by taking random genes from the parents. However, the random genes
can only be taken from the set of matching genes between the parents or from the
parent with a higher fitness value. The only exception is when the fitness values of
two parents are equal. In such a case, even non-matching genes can be inherited from
both parents (which is presented in Figure 8.4). The matching genes are defined as
those with the same innovation number.
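Below is a minimal Python sketch of the connection-gene encoding and the add-node mutation described above; the data layout, the global innovation counter, and the weight conventions for the two new connections are simplifications assumed for illustration rather than the exact structures from [282].

import random
from dataclasses import dataclass, field

_innovation_counter = 0
def next_innovation():
    # Global counter handing out innovation numbers (discrete timestamps of structural changes).
    global _innovation_counter
    _innovation_counter += 1
    return _innovation_counter

@dataclass
class ConnectionGene:
    in_node: int            # index of the source node (its existence implies the node exists)
    out_node: int           # index of the target node
    weight: float
    enabled: bool = True    # whether the gene is expressed
    innovation: int = field(default_factory=next_innovation)

def add_node_mutation(genome, rng=random):
    # Replace a randomly chosen enabled connection A -> B with A -> C and C -> B,
    # where C is a new node; the old connection is disabled rather than removed.
    enabled = [g for g in genome if g.enabled]
    if not enabled:
        return genome
    old = rng.choice(enabled)
    old.enabled = False
    new_node = max(max(g.in_node, g.out_node) for g in genome) + 1
    genome.append(ConnectionGene(old.in_node, new_node, 1.0))           # A -> C (weight 1, assumed)
    genome.append(ConnectionGene(new_node, old.out_node, old.weight))   # C -> B keeps the old weight
    return genome

# Toy genome: two inputs (0, 1) connected directly to one output (2), as in NEAT's initial population.
genome = [ConnectionGene(0, 2, 0.5), ConnectionGene(1, 2, -0.3)]
add_node_mutation(genome)
print([(g.in_node, g.out_node, g.enabled, g.innovation) for g in genome])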
Figure 8.3: One of the two possible mutation operations. This one adds a new node in
place of a randomly chosen connection.
In a classical NE approach, the initial population comprises 𝑁 randomly gener-
ated network topologies. In contrast, in the NEAT approach, the initial population is
uniform and consists of 𝑘-sized (𝑘 being small) networks without any hidden nodes.
In other words, the input nodes are directly connected to the output nodes.
NEAT is a powerful technique, which resembles the natural evolution of intelligent
species. It combines many ideas within one AI framework. In an ablation study
outlined in the paper [282], the authors demonstrated that all these ideas are indeed
required for the method to work. The NEAT approach has inspired researchers
to delve deeper into combining evolutionary algorithms with neural networks,
establishing it as a foundational technique in the field of neuroevolution. This has led
to the introduction of variants such as HyperNEAT [157] and CoDeepNEAT [189].
The concept of coevolution stands out as a significant advancement. It denotes a
process where multiple populations evolve simultaneously, mutually influencing
each other’s progression. It enables the discovery of intricate and adaptive solutions,
which is especially effective in adversarial multi-agent scenarios.
NEAT has already demonstrated success in various domains, including robotics,
autonomous vehicles, game playing, and optimization. It holds significant potential
to be one of the leading methods in AI. However, so far, NEAT has been constrained
by its high computational complexity, primarily due to its conduct of a two-level
optimization: firstly, optimizing the topologies of networks, and secondly, optimizing
the weights. With advancements in hardware and the potential emergence of new
computing architectures, the NEAT approach might gain prevalence.
Figure 8.4: Example of a crossover. The weights of the matching genes are chosen
randomly from the parents. Disjoint and excess genes are taken only from the better
fit parent unless their fitness is equal. In this case, which is assumed here, these
genes are inherited from both parents.
8.4 Behavior Trees (2003)
Behavior trees are a computational model for controlling the behavior of intelligent
agents, primarily used in robotics and video games. Unlike many of the methods
discussed in this book, behavior trees are hand-crafted by experts. They do not
involve mechanisms such as planning, search, learning or induction from data.
Their creation is attributed to Geoff Dromey. While he developed key ideas around
2000-2001, the introductory publication on this topic appeared in 2003 and it was
titled “From Requirements to Design: Formalizing the Key Steps” [67]. Behavior
trees serve as a practical tool for specifying the desired behavior, and they can
also be viewed as a formal language for modelling behavior.
Behavior trees are constructed from the following three main types of nodes:
• Sequence nodes, which cannot be leaves.
• Selector nodes, which cannot be leaves.
• Behavior / Action nodes, which must be leaves.
Similarly to decision trees, the control flow starts from the root and aims to reach
a leaf node with the selected behavior. However, in contrast to decision trees, the
control flow stays in the last visited node until it is explicitly returned to the
parent node. Each time a node is visited, its execution function is called, which is
typically named Tick and accepts the Δ𝑡 parameter that denotes the time since the
last behavior tree execution (not the last time this particular node was visited), or
Δ𝑡 = 0 if the tree was executed for the first time. The execution of a behavior tree
can be expressed as:
s_{i+1}(t_{i+1}) = f(s_i(t_i)), \quad t_{i+1} = t_i + \Delta t    (8.5)
- where 𝑓 is the function represented by the tree; 𝑖 is the discrete time step and 𝑠𝑖 is
the state of the modelled system in 𝑖-th step.
Each node of a behavior tree visited by the control flow has an execution status that
is either SUCCESS, FAILURE, or RUNNING. When a status other than RUNNING
is returned, the flow goes back to the parent node (unless it is the root node).
Now with the execution flow and statuses defined, let’s get back to describing
the available nodes:
• Sequence nodes - iteratively, according to the given order, call the execution
function of each child node. Whenever any child node returns a status different
than SUCCESS, the sequence immediately terminates and returns this
child’s status (which can only be FAILURE or RUNNING). The sequence returns
SUCCESS only if all of its child nodes returned SUCCESS. A sequence
node represents steps that must be taken in order, and all of them are required.
• Selector nodes - iteratively, according to the given order, call the execution
function of each child node. Whenever any child node returns a status different
than FAILURE, the selector immediately terminates and returns this
child’s status (which can only be SUCCESS or RUNNING). The selector
returns FAILURE only if all of its child nodes returned FAILURE. A
selector node represents alternative actions, sometimes ordered by their
utility, that can be performed.
• Behavior / Action nodes - execute the behavior according to the amount of
time that has passed. For example, a continuous movement behavior can
translate the current agent’s position by velocity ∗ Δt. If, however, a behavior
models an atomic change of state (e.g., loss of health or setting a new goal
to pursue), then the Δt parameter can be ignored.
Figure 8.5: A behavior tree for an agent that aims to get to the enemy castle.
A behavior tree example with sequence and selector nodes is presented in
Figure 8.5. The basic formalism is sometimes extended with parallel nodes that,
in contrast to regular sequences and selectors, tick all children simultaneously.
The returned status is then typically derived based on the majority vote. Some
implementations also use the term decorator, which expresses additional conditions
imposed on the node it is attached to. If the conditions are not satisfied in the
current state, then the node may return FAILURE immediately, without the need
to traverse to an action node and check the conditions within its execution
function. Behavior trees often utilize a blackboard architecture [112] to implement
data storage (e.g., parameters of behaviors and all variables that need to be checked)
as well as transfer of information between agents.
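A small, stateless Python sketch of the three node types and the tick-based control flow is given below; the example tree and the action bodies are assumptions, and the "stay in the last running node" behavior described earlier would additionally require remembering the running child, which is omitted here for brevity.

SUCCESS, FAILURE, RUNNING = "SUCCESS", "FAILURE", "RUNNING"

class Action:
    # Leaf node: wraps a function (state, dt) -> status.
    def __init__(self, fn):
        self.fn = fn
    def tick(self, state, dt):
        return self.fn(state, dt)

class Sequence:
    # Succeeds only if all children succeed, in order; stops on FAILURE or RUNNING.
    def __init__(self, *children):
        self.children = children
    def tick(self, state, dt):
        for child in self.children:
            status = child.tick(state, dt)
            if status != SUCCESS:
                return status
        return SUCCESS

class Selector:
    # Tries children in order; stops on the first SUCCESS or RUNNING, fails if all fail.
    def __init__(self, *children):
        self.children = children
    def tick(self, state, dt):
        for child in self.children:
            status = child.tick(state, dt)
            if status != FAILURE:
                return status
        return FAILURE

# Toy usage (all assumptions): move toward a target, attacking if an enemy is close.
def enemy_in_range(state, dt):
    return SUCCESS if state["enemy_distance"] < 2.0 else FAILURE
def attack(state, dt):
    return RUNNING
def move_forward(state, dt):
    state["position"] += state["speed"] * dt   # continuous behavior uses the elapsed time dt
    return RUNNING

tree = Selector(Sequence(Action(enemy_in_range), Action(attack)), Action(move_forward))
state = {"enemy_distance": 5.0, "position": 0.0, "speed": 1.5}
print(tree.tick(state, dt=0.1), state["position"])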
Behavior trees are popular for robot control, but they have also arguably been
the most popular technique for implementing the behavior of agents in video games.
Here, agents include non-player characters (NPCs), both enemies and allies,
single units, or whole armies. There are popular implementations for the two most
widely used game engines: Unity3D and Unreal Engine. The main factors behind their
popularity in this domain are:
1. Graphical representation: Apart from the actions’ execution functions,
behavior trees can be fully visualized. The actions, however, can be represented
in a simplified fashion by user-friendly names. This makes it very convenient to
work with behavior trees using a graphical user interface (GUI). Programmers
can prepare a portfolio of actions, and any team member can use a graphical
editor to construct trees from building blocks. Many GUI-based tools exist for
constructing, debugging, and testing behavior trees.
2. Full control: Stakeholders in the game development industry are hesitant
to include methods for controlling AI agents that can be unpredictable. For
example, machine learning models are often less favored. In contrast, behavior
trees allow for full control over the agents’ behaviors, making it easier to
thoroughly test them on a case-by-case basis. With behavior trees, the agents
are less likely to act unexpectedly.
3. Ability to represent hierarchies: The behavior nodes can be implemented as
nested behavior trees, thus creating hierarchies. This is possible because the
input and output interfaces are exactly the same. Designers are only required
to create the nested behavior trees as a model for complex behavior.
4. Simple, intuitive design: it is easy to understand the components of the model.
Intuitive design makes behavior trees accessible not only to programmers but
also to designers, narrative creators, etc.
5. Reusability: Once implemented, tested, and accepted, behavior trees can be
reused among many agents. However, this requires proper abstraction of the
agent and environment representations. They should not be too tightly coupled
with a specific agent. A blackboard architecture is helpful, as it allows, for
instance, the creation of an entry like “MyVariable” that can be initialized to
unique values depending on the agent with which the particular (blackboard,
behavior tree) pair is associated.
8.5 The Growing Field of Robotics (2002-2005)
The first years of the 21st century marked a period of growing recognition of robotics
as a field. Robots have always been considered a fascinating topic, but until then,
they had been mostly limited to industrial settings, space exploration, and one-off
research projects. This section will refer to three trends exemplified by specific
contributions that occurred between 2002 and 2005.
1. Designing Sociable Robots is a book written by Cynthia Breazeal that was
published by MIT Press [25] in 2002. The book is dedicated to a topic of robots
in a human environment, interacting with us, and following social rules. It
was a completely novel approach in research literature, a topic previously only
attributed to science fiction works. Breazeal notes practical advantages of
creating sociable robots. Firstly, the human brain has evolved to overcome social
challenges and understand social settings. This is a natural way of interaction
for us. Secondly, robots may further support our social relationships. Thirdly,
it would be easier to develop responsible and safe AI if the social norms were
imprinted in its core design. Fourthly, we, as humans, learn and grow by
constructing socially intelligent robots. Finally, such robots could replace us in
various tests and pilot studies before new methods are tested on humans. The
initial sections of the book provide even more motivation and background than
presented here. Then, the author goes over aspects of sociable robots - what
makes an entity social (e.g. “being there” and “being understood”). Breazeal
not only describes the theory, but she also introduces a robot named Kismet,
which is a pioneering work in terms of social aspects included in its design.
A detailed architecture of Kismet is described including sensors, visual and
auditory components, visual attention system, motivation system, recognition
of affective behavior displayed by humans, emotive responses, motor systems,
facial expressions, vocalization and lip synchronization systems and evaluation
criteria. The book concludes with results, grand challenges and prospects for
the future in the area of designing sociable robots.
2. Probabilistic Robotics is a book written by Sebastian Thrun, Wolfram Bur-
gard, and Dieter Fox that was published by MIT Press [304] in 2005. The
book focuses on handling two types of uncertainty. The first type is concerned
with perception and sensor limitations. The second type concerns control,
i.e., acting in an environment, which is not fully observable and, therefore,
the decision-control must take into account the current and future uncertainty.
After the introduction, the authors formulate the main notions related to world
representation and robot-world interaction. For example, the concept of beliefs
and belief distributions is presented as well as the Bayes Filter algorithm
that can generate the belief distribution from measurement data. The next
chapters are devoted to filters including the widely used Kalman Filters. This
is followed by a large portion of the book dedicated to problems and tech-
niques related to robot motion, localization, and mapping. The subsequent
chapters present mathematical derivations of various algorithms including the
Extended Information Form algorithm. Readers will also find comprehensive
chapters about Markov Decision Processes (MDP) and Partially Observable
Markov Decision Processes (POMDP), which are the models used in robotics
and reinforcement learning. “Probabilistic Robotics” is recognized for its
clarity and accessibility, making complex concepts understandable for a broad
audience.
3. Roomba [79] - is an autonomous vacuum cleaner robot developed by the
company iRobot. The first model was introduced in 2002. Roomba can be
viewed as a sign of democratization in robotics. Since then, robots are no
longer confined to industrial, corporate, or research settings but have become
an integral part of daily life. Roomba incorporates a relatively basic form of
AI for sensor-based navigation, dirt detection algorithms, and various cleaning
algorithms chosen dynamically, for example, based on the current type of
floor. While they undergo initial calibration that optimizes them for specific
households, they do not possess learning capabilities beyond this initial phase.
However, they are periodically refined and equipped with updated technology,
making the use of more advanced AI techniques likely in the future. Roombas
have gained popularity for their convenience and time-saving capabilities.
They also signal a future where robots will be more prevalent, assisting in
everyday tasks.
8.6 DARPA Grand Challenge (2004)
The automotive industry is associated with great achievements of engineering made
by some of the most innovative companies. It has always been one of the descriptors
of progress of human civilization. Autonomous vehicles have always been among
the major goals and dreams of the applied AI. It was expected that they would
require a combination of cutting-edge solutions. If you wish to recall some earlier
breakthroughs, please refer back to Sections 5.6 and 6.4.
In 2004, the Defense Advanced Research Projects Agency (DARPA) announced
the DARPA Grand Challenge [266]. It was a competition for the teams developing
autonomous driving systems. As a federal organization, DARPA required that
each team contain at least one citizen of the United States. To qualify for the prize,
the vehicle had to autonomously complete the entire track within the designated
time limit. No human intervention was possible once the vehicle started moving.
In the 2004-2005 challenges, which took place in desert environments, the time
limit was set at 10 hours, whereas in the 2007 competition held in an urban-like
setting, it was reduced to 6 hours. The grand prize was awarded to the first robotic
vehicle to cross the finish line in time. The participants were informed about
the details of the track, including the geographic coordinates of the checkpoints
(called “gates”) and corridor widths, only 2 hours before the start. This was aimed at
preventing tailoring the robots for the specific track (e.g., preprogramming the paths).
2004 2005 2007
Top prize (USD) 1 million 2 million 2 million
Top vehicle - Stanley Boss
Winning team - Stanford Racing Team Tartan Racing
Affiliation - Stanford University Carnegie Mellon University
Total time - 6h 53m 8s 4h 10m 20s
Avg speed - 30.7 km/h (19.1 mph) 22.53 km/h (14 mph)
Base car - Volkswagen Touareg Chevrolet Tahoe
Track terrain Mojave Des. California/Nevada offroad Urban (Air Force Base)
Track length 240 km 212 km (132 mi) 96 km (60 mi)
Table 8.1: Overview of the DARPA Grand Challenges from 2004 to 2007. No
competition was organized in 2006. In 2004, no contestant successfully completed
the race, and consequently, the prize was not awarded.
Table 8.1 presents a summary of the three installments of the DARPA Grand Challenge.
The grand prize, totaling 2 million US dollars, was first awarded
in 2005.
Teams participating in the Grand Challenge had to develop AI algorithms and
robotic systems capable of interpreting sensor data, making real-time decisions, and
safely navigating complex environments without human intervention. The team
behind Stanley, the first winner [305], consisted of four groups:
1. The vehicle group - responsible for the core vehicle, which was a diesel-
powered Volkswagen Touareg R5 with four-wheel drive.
2. The software group - responsible for all the software including navigation and
monitoring.
3. The testing group.
4. The communication group - responsible for the media, contact with DARPA
and fundraising activities.
Most of the sensors were installed on the roof of the car as it allowed for the best
view of the surroundings. They included five laser range finders (measuring up to 25
meters), a color camera positioned forward and pointed slightly down (on the road)
as well as two 24 GHz radars. The aforementioned units comprised the “environment
sensor group”. Stanley also utilized a GPS unit, a GPS compass and GPS-based
system allowing DARPA to remotely stop the car in case of emergency. In the trunk,
the computing hardware was placed including six Pentium M computers.
Below, please find a summary of the key facts related to the software and
algorithms Stanley used. Many more details are presented in the paper [305].
• A three-layer architecture [86].
• The system comprised 30 modules, each running asynchronously in parallel.
The communication between the modules was implemented using the
publish-subscribe pattern.
• The main groups of modules were of the following categories: (1) perception,
(2) planning and control, and (3) user interface.
• The perception module transformed sensor data into internal models. It was
also responsible for creating 2D maps (images) of the environment based on
the data from the sensors.
• The main vehicle state was represented using 15 variables: 3 coordinates each
for position, velocity, orientation, accelerometer, and gyroscope.
Their predictions were made using an unscented Kalman filter (UKF).
• A probabilistic test based on a Markov model was used for classification
of terrain obstacles. The parameters of the model were obtained through
a discriminative supervised learning algorithm (based on data labelled by
experts).
• For the analysis of the images, Stanley utilized a mixture of Gaussians model
for the segmentation of pixels (c.f. Section 5.5).
• The checkpoints provided by DARPA were dense enough, so they made it
possible to avoid using a route-planning algorithm. Instead, Stanley only used
a local algorithm for staying within the road and avoiding collisions with
obstacles or other cars.
The significance of the DARPA Grand Challenge extends beyond the specific
advancements in autonomous driving technology. It showcased the potential of AI
and robotics to tackle complex real-world problems and demonstrated the feasibility
of autonomous systems in various applications beyond military use, including
transportation, logistics, and search and rescue operations. Furthermore, the DARPA
Grand Challenge helped to raise public awareness about the capabilities and potential
impact of autonomous vehicles, contributing to the broader discussion on the ethical,
legal, and societal implications of AI and robotics technology.
8.7 Graph Neural Networks (2005)
Graph Neural Networks (GNNs) are a class of artificial neural network models
designed to process graphs of varying size as input data. It is important to note that
GNNs are not about the neural network itself being a graph; all artificial neural
networks are graphs.
For instance, GNNs can handle various types of data, including:
• Social networks, where edges represent relationships such as knowing someone
or collaborating with someone.
• Chemical structures, with edges denoting connections between atoms or
molecules etc.
• Maps, where edges indicate geometrical or topological relationships between
locations.
• Customer preferences, with nodes representing users and items, and edges
representing actions like purchases or movie views.
• Graph-based structures like buildings, biological skeletons or 3D meshes.
• Plans, where nodes represent events (states), and edges represent actions
(transitions).
• Ontologies and graph databases.
• Images, where pixels or image fragments can be represented as nodes, and
edges capture spatial relationships.
• Semantic webs.
• Other neural networks
• And many other types of data with inherent graph structures.
The introduction of Graph Neural Networks (GNNs) dates back to around 2005,
but their popularity began to gradually rise around 2009. Below are listed seminal
papers concerning this model:
• 2005: “A New Model for Learning in Graph Domains” [95] by M. Gori, G.
Monfardini, and F. Scarselli.
• 2005: “Graph Neural Networks for Ranking Web Pages” [259] by F. Scarselli,
L.Y. Sweah, M. Gori, M. Hagenbuchner, A.C. Tsoi, and M. Maggini.
• 2008: “The Graph Neural Network Model” [260] by F. Scarselli, M. Gori,
A.C. Tsoi, M. Hagenbuchner, and G. Monfardini.
There is a large number of GNNs variants [346] including Graph Convolutional
Networks, GraphSAGE, Gated Attention Networks, and many others. Therefore, we
will present only the basic idea that is common to them.
The input to a GNN is a graph encoding. The input graphs can be directed
or undirected, single or multi-graphs (with more than one type of edge possibly
connecting a pair of nodes). Each node is associated with a vector representation,
which serves as the initial representation of the concept. It can be a direct encoding
or an embedding, with the latter being much more common in GNNs. An embedding
is an indirect representation of some concept, typically in a lower-dimensional space
than the original space. Usually, embeddings operate on abstract features (common
to objects of a certain type), and similar objects have similar representations with
regard to these features. The idea is shown in Figure 8.6.
Figure 8.6: The idea of an embedding. In this example, it was assumed that the
elephant has 0.9 assigned for being large, because higher values are reserved for
whales. The African elephant would have 1.00 for living in Africa, but the convention
assumed here assigns slightly lower values, if the occurrence is non-exclusive.
Graph neural networks are iteratively updated (trained) by applying operations on
neighboring nodes. These operations are called neural messages, and their application
is called message passing. They modify the representation vectors. The training
stops after a fixed number of iterations, after the allotted time elapses, or when the
model converges to a stable state (which is guaranteed by Banach’s fixed-point theorem).
For example, in Graph Convolutional Networks (GCN), the message passing
update is performed as follows:
h_i^{(l+1)} = \sigma\left( \sum_{j \in N(i)} \frac{1}{c_{ij}} W^{(l)} h_j^{(l)} \right)    (8.6)
- where h_i^{(l)} is the representation (embedding) of the i-th node in the l-th layer of the
GCN; N(i) is the set of neighboring nodes of node i; W^{(l)} is the weight matrix for
the l-th layer; c_{ij} is a normalization constant that typically depends on the number
of neighbours a node has; \sigma is an activation function.
In general, the update is implemented as a differentiable function f of the current
state of the node h_i^{(l)}, the states of the neighbors h_j^{(l)}, and the states of the edges e_{ij}^{(l)}:
h_i^{(l+1)} = f\left( h_i^{(l)}, \bigoplus_{j \in N(i)} \psi(h_i^{(l)}, h_j^{(l)}, e_{ij}^{(l)}) \right)    (8.7)
The operation presented in Eq. 8.6 performs a weighted aggregation of the
neighboring nodes’ values. In GNNs, such an aggregation should be permutation-
invariant. This property ensures that the GNN can learn to recognize patterns and
relationships within the graph structure regardless of the node order.
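Equation 8.6 can be illustrated with a short NumPy sketch of one message-passing layer; the addition of self-loops, the symmetric degree normalization used for c_ij, the ReLU activation, and the toy graph are assumptions commonly made in GCN implementations rather than requirements of the general formulation.

import numpy as np

def gcn_layer(A, H, W):
    # One message-passing update in the spirit of Eq. 8.6 (illustrative assumptions noted above).
    # A: (n, n) adjacency matrix, H: (n, d) node embeddings, W: (d, d_out) shared weights.
    A_hat = A + np.eye(A.shape[0])                   # self-loops so a node also keeps its own state
    deg = A_hat.sum(axis=1)
    norm = 1.0 / np.sqrt(np.outer(deg, deg))         # c_ij: symmetric degree normalization
    return np.maximum(0.0, (A_hat * norm) @ H @ W)   # aggregate neighbours, transform, ReLU

# Toy graph (assumed data): a path 0 - 1 - 2 - 3 with 2-dimensional random embeddings.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 2))
W = rng.normal(size=(2, 2))
print(gcn_layer(A, H, W))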
GNNs as deep learning models can be composed of various types of layers. For
example:
• Different variants of aggregation layers. For instance, GNNs can be combined
with the attention mechanisms (c.f. Section 11.1). In such layers, the attention
coefficients are calculated based on the similarity between nodes. This allows
the network to focus on more relevant nodes during the aggregation process.
• Pooling layers - aggregation with down-sampling. This type of layer is also
popular in convolutional neural networks (c.f. Section 6.8).
• Information retrieval layers - that extract features from the vector representa-
tions stored in nodes.
Graph neural networks can be used in various ways - as classifiers, prediction
models, or data transformers that produce transformed graphs for further analysis.
Many applications of GNNs have been listed earlier in this section. One particular
breakthrough attributed to GNNs is the discovery of the Halicin antibiotic. For
those interested in learning more about this discovery, please refer to the paper
titled “A deep learning approach to antibiotic discovery” [283], published in the
journal Cell.
8.8 General Game Playing Competition (2005)
General Game Playing (GGP) [296] as well as multi-game playing is a research
trend and an area of AI that focuses on creating intelligent agents (computer players)
capable of playing a variety of games rather than being programmed for a specific
one. There are multiple reasons why such an approach is beneficial for advancing
artificial intelligence. Let us outline some of the motivations:
1. Human-inspired. Humans are born without the knowledge of any specific
game. They gradually learn how to play - typically from others or from
rule-books. The more games we can play, the greater the chance that we will be able
to transfer similar concepts shared between games. Although humans also
specialize, we are naturally multi-game players and transfer learners.
2. More room for intelligent methods. Jacques Pitrat [222] said that “AI
methods are dumb, because AI researchers are too clever”. A game-specific
approach might be an obscure set of rules and heuristics derived by human
experts. The intelligence required to create them is often displayed by humans,
whereas the algorithms are relatively straightforward. The GGP idea aims
to transfer the responsibility of intelligent analysis from humans onto the
algorithms.
3. Integration of techniques. Because GGP concerns an essentially unlimited
number of possible games, the solutions are expected to integrate various
techniques of action-selection, knowledge-representation and reasoning.
4. More potential for transfer learning. First, new general-purpose
algorithms may emerge from GGP research. Second, there is a much higher
chance that they will be transferable to new domains (beyond games) than
algorithms crafted for single games.
5. Specific programs become very strong. Finally, GGP offers a refreshing
restart in terms of the playing strength of computer programs compared to
approaches such as Deep Blue (c.f. Section 7.11), Chinook (c.f. Section 9.1),
or TD-Gammon (c.f. Section 7.1). There is a lot of room to create general-purpose
programs that can at least equal game-specific programs in their
respective games.
The General Game Playing Competition was an annual challenge proposed by the
Logic Group at Stanford University in 2005 [88]. It was regarded as the official
world championship for GGP programs until 2016. The competition was hosted
online during the AAAI conferences (except for one year, when it was held under ECAI);
however, there was also an associated on-site event at the conference. The participating
programs had to be accessible via TCP/IP by the so-called Gamemaster provided
by the organizers.
The Gamemaster governed the matches. It was responsible for:
• the entire communication. It received the actions chosen by the players and
broadcasted them. Players did not communicate with themselves directly.
• running the game process. It always had the correct state of the game. The
players needed to be synchronized with it.
• providing the players with rules of the game to be played.
• managing the game timers: start and play clocks, respectively. The start clock
defined the time for the initial preparation between receiving the rules and
starting the game. It was typically set to 40-120 seconds. The play clock
defined the time for each move and was typically set to 5-20 seconds.
• legality validation. For instance, if a player did not respond in time or chose
an illegal move, then the Gamemaster chose a random available move instead
and the player received a warning. The replacement move was broadcast to
each player (including the one the move was chosen for). Two warnings resulted
in disqualification.
GGP programs were designed to play even previously unknown games, with the
game description being a run-time parameter of the program. The question arises
“How to represent games in such a system?”. A dedicated first-order and rule-based
logic language, named Game Description Language (GDL), was introduced. Any
game that is finite and deterministic can be represented in GDL. GDL and GGP
primarily focused on discrete combinatorial games, predominantly two-player games and
one-player puzzles. However, there were occasional instances of games involving
a larger number of players. These games could be competitive (not necessarily of
constant-sum), cooperative, or a combination of both. GDL was based on Datalog,
which is syntactically a subset of Prolog (see Section 4.9). There were 9 reserved
keywords with a predefined meaning for the elements required to play a game. For
example, init(<predicate>) indicated that what comes after it is part of the initial state;
legal(<role>, <predicate>) named the rules determining the actions available to the
player role; whereas terminal named the rules determining whether the game
has ended. Each rule was an implication, and the closed-world assumption
(CWA) applied. Below is an example of a rule written in GDL notation. It
states that the content of a cell remains the same if the player made a move to a
different cell:
(<= (next (cell ?x ?y ?piece))
    (true (cell ?x ?y ?piece))
    (does ?player (move ?a ?b))
    (or (distinct ?x ?a) (distinct ?y ?b)))
During the competition, the descriptions were obfuscated, i.e., all the names
apart from the keywords were changed in order to prevent any language-based feature
engineering. This operation did not affect the syntax of the logic description.
The GGP competition was officially organized 12 times. There have been 7
unique winners from 5 countries: USA, Germany, Iceland, France, and the UK.
CadiaPlayer [74], developed by Hilmar Finnsson and Yngvi Björnsson of Reykjavik
University, holds the record for being a three-time champion (2007, 2008, 2012). In
2005 and 2006, the dominant approach was to dynamically construct an evaluation
function. Year 2007 brought a breakthrough in the form of the Monte Carlo Tree
Search approach (which will be described in the next section). Some programs
focused more on performing efficient simulations (e.g., using propositional networks),
whereas others focused more on dynamic knowledge extraction. There was also a program named
nnrg.Hazel [237], which used the NEAT approach.
After Deep Blue defeated Garry Kasparov in 1997 (cf. Section 7.11), there
was some disappointment that achieving such a feat did not bring us closer to
truly intelligent programs. Despite winning against the human world champion in
chess, Deep Blue was unable to play even a simple game of Tic-Tac-Toe. The GGP
competition introduced a paradigm shift in designing game-playing programs, greatly
popularizing a simulation-based approach and also inspiring many subsequent game
AI competitions [293]. For example, at the 2023 IEEE Conference on Games, 11
different competitions took place. In 2014, the General Video Game AI
(GVGAI) Competition [219] was also proposed. Similar to GGP, it focused on multi-game
playing but, in contrast to GGP, it was dedicated to fast-paced real-time games.
8.9 Monte Carlo Tree Search (2006)
Monte Carlo Tree Search (MCTS) [30, 295] is a radically new approach to searching
game trees compared to the min-max (c.f. Sec. 2.5) and alpha-beta pruning (c.f.
Sec. 5.3) methods. It was initially designed for the game of Go and soon after
adopted for other games and multi-game playing domains. As pointed out in
the previous section, all winners of the GGP Competition since 2007 were based
on the MCTS algorithm, which is a testimony to its efficacy. One of its main
advantages over the aforementioned methods is that it does not require an evaluation
function to be applied. Min-max approaches, including alpha-beta pruning, either
necessitate searching the entire game tree to propagate values or must be equipped
with a heuristic evaluation function. Designing such functions can be a tedious and
complex task (as noted in Section 7.11 about Deep Blue).
In Go, it was exceptionally challenging to come up with a state evaluation
function that would enable strong computer players. As of 2005, the top programs
were at the level of average amateurs (14 kyu). It is worth noting that Go became
the next big interest in game AI after chess. It was called the “grand challenge of
AI” [87]. Min-max approaches performed poorly, partly due to the high branching
factor of 250 and the intricacies of Go positions, which often remain open and make
it difficult to predict the winner until the very end. Unlike chess, Go lacks a concept
of material that significantly influences players’ chances.
The MCTS algorithm was introduced to Go in 2006 thanks to the works by Rémi
Coulom [48] as well as Levente Kocsis and Csaba Szepesvári [148]. Subsequently,
in the same year, Gelly et al. developed a program called MoGo, which facilitated the
leap from 14 kyu to 5 dan (an advanced player level). Ten years later, in 2016, MCTS
was combined with machine learning to contribute to arguably the most significant
breakthrough in Go and game AI in general, through AlphaGo, as discussed in
Section 10.8.
Figure 8.7: The MCTS algorithm. It is run for a given number of iterations N or for
an allotted amount of time. Each iteration consists of four phases.
The algorithm is iterative. It constructs a portion of the game-tree that is stored
in memory. The first iteration starts from a tree that contains only the root node
(representing the current state of the game). Each iteration consists of four phases,
as depicted in Figure 8.7, executed in the following order:
1. Selection: always starts from the root node and traverses the tree by selecting
actions according to the so-called selection policy. The most commonly
implemented selection policy is Upper Confidence Bounds applied to Trees
(UCT), shown in Equation 8.8, which was proposed by L. Kocsis and C.
Szepesvári [148].
a^* = \arg\max_{a \in A(s)} \left( Q(s, a) + C \sqrt{\frac{\ln N(s)}{N(s, a)}} \right)    (8.8)
where A(s) denotes the set of actions available in state s; Q(s, a) is the average
result of playing action a in state s, computed empirically over the iterations
performed so far; N(s) is the number of times state s has been visited in previous
iterations; N(s, a) is the number of times action a has been chosen in state s by the
Selection phase so far; and C is the exploration ratio parameter.
This mechanism balances the exploitation of the best actions discovered so far
while also exploring the space to gather more evidence and confidence about
the actions. It originates from methods designed for playing one-armed bandits in
casinos, and such approaches are therefore referred to as “bandit methods”. This
exploration vs. exploitation mechanism is also related to the minimization of regret
(the loss incurred by choosing suboptimal actions while exploring). The selection
policy is crucial for the convergence of the algorithm.
The selection phase ends either when (1) the currently chosen action points to
a state that has not yet been added to the stored game tree (the search would
leave the tree) or (2) when a terminal state has been reached (the taken path has
been entirely added to the tree). In the first case, the algorithm proceeds to Expansion,
whereas in the latter it skips directly to Backpropagation.
2. Expansion: here, the tree grows. Typically, only one node is added to the
tree corresponding to the state after taking the last selected action. Adding
only one node per iteration is an established trade-off between memory usage
and the size of the tree. The deeper a node is, the less likely it is to be visited
again.
3. (Monte Carlo) Simulation: starting from the expanded state, the algorithm
simulates the game until the end using the so-called default policy for selecting
actions for the players. In the original variant of the algorithm, the default
policy consists in making actions according to a uniform random distribution.
Later, various policies were proposed. They were either previously known
from reinforcement learning such as 𝜖-greedy or devised specifically for MCTS
such as heavy playouts [294] (simulations according to specific strategies).
However, the simulation procedure should be fast to allow for running many
of them within the allotted time. The greater the number of simulations, the
higher the number of iterations, resulting in more confident statistical analysis.
4. Backpropagation: in this phase, the numerical results (payoffs) obtained
by all players in the current iteration are propagated to all nodes visited in
Selection as well as in Expansion. For each such node, the statistics are
updated: the total score (to be able to calculate the average score), the average
score, and the number of visits to a node. If a different selection policy than
UCT is used, then it may require some additional statistics.
When it is time to make a move in the actual game, there are two approaches,
which are equivalent in most cases. One is to choose, among the currently
legal actions (children of the root node), the one with the highest average score obtained
in the performed iterations:
a^* = \arg\max_{a \in A(s)} Q(s, a)    (8.9)
Alternatively, the most visited action can be chosen. The rationale is that its
evaluation is the most confident and since actions with higher 𝑄 value are chosen
more frequently, it is often the strongest action:
a^* = \arg\max_{a \in A(s)} N(s, a)    (8.10)
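To make the four phases and the selection rule more concrete, below is a minimal, illustrative Python sketch of vanilla MCTS with UCT on a toy single-heap Nim game. The game, the exploration constant, and all names are chosen only for this example and are not taken from any GGP program or from the cited papers:

import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}     # action -> Node
        self.visits = 0        # number of times the node was on the selected path
        self.total = 0.0       # accumulated payoff for the player who moved into this node

# Toy game: single-heap Nim. A state is (stones, player_to_move); players remove
# 1-3 stones in turn, and whoever takes the last stone wins.
def legal_actions(state):
    return [a for a in (1, 2, 3) if a <= state[0]]

def apply_action(state, a):
    return (state[0] - a, 1 - state[1])

def is_terminal(state):
    return state[0] == 0

def uct_child(node, c=1.4):
    # Eq. 8.8: Q(s, a) + C * sqrt(ln N(s) / N(s, a))
    return max(node.children.values(),
               key=lambda ch: ch.total / ch.visits
                              + c * math.sqrt(math.log(node.visits) / ch.visits))

def mcts(root_state, iterations=5000):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend while the node is fully expanded and non-terminal.
        while not is_terminal(node.state) and len(node.children) == len(legal_actions(node.state)):
            node = uct_child(node)
        # 2. Expansion: add a single child node for one untried action.
        if not is_terminal(node.state):
            a = random.choice([x for x in legal_actions(node.state) if x not in node.children])
            node.children[a] = Node(apply_action(node.state, a), parent=node)
            node = node.children[a]
        # 3. Simulation: play to the end with a uniformly random default policy.
        state = node.state
        while not is_terminal(state):
            state = apply_action(state, random.choice(legal_actions(state)))
        winner = 1 - state[1]     # the player who took the last stone
        # 4. Backpropagation: update statistics along the selected/expanded path.
        while node is not None:
            node.visits += 1
            node.total += 1.0 if winner == 1 - node.state[1] else 0.0
            node = node.parent
    # Final move choice: the most visited root action (Eq. 8.10).
    return max(root.children, key=lambda a: root.children[a].visits)

print(mcts((10, 0)))   # should typically print 2, leaving a multiple of 4 stones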
Compared to min-max-based approaches, Monte Carlo Tree Search possesses
several advantages. It is:
• Aheuristic - which means that the heuristic evaluation function is not required.
It relies on the actual outcomes from the game.
• Asymmetric - the tree is constructed in an asymmetric manner, focusing on the
most promising areas that are likely to be played in the actual game. Different
branches have varying depth. This approach scales better, especially if the
branching factor of the game is high.
• Anytime - the algorithm can be stopped at any time and return the currently best
recommended action. The confidence of the best-action estimate is roughly
proportional to the number of simulations performed. In contrast,
most min-max approaches have to expand an entire new level of the tree in
order to obtain new information about the estimates (the required effort grows
exponentially). Some variants of the min-max approaches, such as quiescence search
applied in Stockfish (c.f. Section 9.3) aim to mitigate this problem by exploring
some branches deeper, however, they have to be manually implemented for a
specific game and they are not applicable to every game.
• Easily parallelizable - there are a few ideas for parallelizing the MCTS algorithm
for multiple CPU cores and/or clusters of computers [41]. Most of them are
straightforward and relatively easy to implement. The three core ideas are:
1. Performing many parallel simulations instead of one and returning their
average result. This way, the confidence of the Monte Carlo estimation
is higher.
2. Performing simulations in parallel without waiting for them, i.e., not
blocking the main iteration. Until a simulation is finished, it is assumed
to return the lowest possible scores for all players. Consequently, the
selection policy will not have 100% accurate statistics, as the currently
running simulations will update them. However, this issue diminishes
over time, as the number of currently running simulations in parallel
decreases relative to the total number of performed simulations.
3. Growing multiple game trees by running the entire MCTS algorithm several
times in parallel. When the move is to be made, the statistics from all the
trees are aggregated, as in the sketch below.
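A minimal sketch of this third idea (often called root parallelism) follows. The search_visit_counts function is only a stand-in for any single-threaded MCTS run that reports how often each root action was visited; here it produces made-up counts so that the snippet is self-contained:

from collections import Counter
from concurrent.futures import ProcessPoolExecutor
import random

def search_visit_counts(seed, iterations=1000):
    # Stand-in for one independent MCTS run; a real implementation would return the
    # visit counts of the root's children. Here the counts are simply invented.
    rng = random.Random(seed)
    return Counter({a: rng.randint(0, iterations) for a in ("a1", "a2", "a3")})

def root_parallel_move(n_trees=4):
    with ProcessPoolExecutor(max_workers=n_trees) as pool:
        total = sum(pool.map(search_visit_counts, range(n_trees)), Counter())
    return total.most_common(1)[0][0]   # the action with the highest aggregated visit count

if __name__ == "__main__":
    print(root_parallel_move())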
Formally, MCTS can be applied to problems that can be represented as Markov
Decision Processes (see Section 3.5). Its appealing features, such as ease of
implementation, robustness, and lack of requirement for expert knowledge (although
it can be utilized to increase the convergence or efficacy of the algorithm), have
led to its widespread adoption even beyond games. It has been applied in various
domains including planning and scheduling problems, logistics, combinatorial
optimization, sequential decision-making, and others. One of the notable examples
of the application of MCTS to a non-game domain is chemical synthesis. It is
described in an article titled “Planning chemical syntheses with deep neural networks
and symbolic AI” authored by Segler et al [267]. In this research, a neural network
was combined with MCTS to guide the search of possible transformations from one
structure to another. The network was trained using 17134 formulas of reactions
from organic chemistry.
In this book, I focused on the application to combinatorial games and on the original
(vanilla) version of the algorithm. For the recent advances in MCTS, its various
applications and modifications, please refer to the survey [295].
9. Years 2007 - 2013
9.1 Checkers is Solved! (2007)
2007 marked an important milestone in game-playing AI that resonated throughout
the entire field of computer science. The article published in Science announced
that “Checkers is Solved!” [261]. This achievement was accomplished by the team
at the University of Alberta, Edmonton, Canada, led by Jonathan Schaeffer.
Checkers was the most complex game solved to date. It has had a long
research history in AI, dating back to contributions such as Arthur Samuel's
(c.f. Section 3.7). Its search-space complexity (the number of possible positions)
is approximately equal to 5 × 10^20. Despite a relatively low branching factor, it is
believed to possess high decision complexity, making it challenging to find the
optimal move. Table 9.1 provides an overview of the combinatorial complexity of
various games.
Checkers was weakly solved, which means finding a perfect strategy that
guarantees the game-theoretic maximum obtainable value for the given player from
the initial position. The team discovered that perfect play from both sides leads to a
draw - an explicit strategy that never loses was found for both players. While this
outcome was anticipated in the checkers community, its formal proof in 2007 was
significant.
Solving checkers was an important step towards computer programs that are
not only capable of winning games against human champions, but are also provably
unbeatable. However, due to the nature of the task, solving a game could not rely
on sophisticated AI methods, which are inherently approximate. Instead, it had to
employ a brute-force search approach. It combined end-game databases (all positions
with the total number of 10 pieces or less), backward search and forward search
with optimizations to eliminate some redundant calculations. The forward search
comprised two components: (1) a proof-tree manager that chose positions, which
need to be evaluated next and (2) a proof-solver examining individual positions.
Game                 State-space    Game-tree     Avg. length   Branching
                     complexity     complexity                  factor
Tic-Tac-Toe          10^3           10^5          9             4
Kalah                10^13          10^18         30            50
Connect Four         10^13          10^21         36            4
Domineering 8 × 8    10^15          10^27         30            8
Awari                10^12          10^32         60            3.5
Checkers             10^20          10^40         70            2.8
Othello (Reversi)    10^28          10^58         58            10
Hex                  10^57          10^98         50            96
Chess                10^44          10^123        70            35
Backgammon           10^20          10^144        55            250
Abalone              10^25          10^154        87            60
Havannah             10^127         10^157        66            240
Carcassonne 2p       ≥ 10^40        10^195        71            55
Shogi                10^71          10^226        115           92
Go                   10^170         10^360        150           250
Arimaa               10^43          10^402        92            17281
Stratego             10^115         10^535        381           21.7
StarCraft II         > 10^1000      > 10^1000     > 3000        ?
Table 9.1: The combinatorial complexity of selected games. For the average length
in StarCraft II, a lower bound of 2 moves per second and an average game length of
25 minutes were assumed.
Sequences of moves with their best responses were stored, and a custom
compression algorithm for board positions was proposed. The end-game database,
containing 39 trillion positions, occupied only 237GB of space after compression.
The project spanned 18 years, from 1989 to 2007, involving over 200 processors at
its peak—highlighting the immense scale of this endeavor.
9.2 GPGPU Programming (2007)
This short section underscores the importance of advancements in computer hardware
that have had a significant impact on artificial intelligence. One of the long-lasting
and game-changing developments was the introduction of general-purpose
programming of graphics processing units (GPGPU) in 2007. Prior to this, GPUs were
almost exclusively used for rendering and visualization, with rare exceptions where
they were employed for other types of calculations. However, these applications
were relatively limited and often required representation as graphical operations
(e.g., encoding data into textures). In 2007, the GeForce 8800 [159], the first GPU with
general-purpose compute capabilities, was released by NVIDIA. NVIDIA's computing
platform, called CUDA (Compute Unified Device Architecture), has been continuously
developed since then and remains exclusively available for NVIDIA hardware. NVIDIA's main
competitor, now AMD and previously ATI, utilizes the OpenCL computing
platform, which is maintained by the open, non-profit Khronos Group. OpenCL is
available on both AMD and NVIDIA graphics cards.
Graphics processing units excel at matrix multiplications, vector operations
(e.g., translations, products), interpolation, and highly parallel calculations of the
same type (e.g., many matrix multiplications running in parallel). These properties
make them ideal for both training neural networks and running inference with them. This is particularly
notable in deep learning, where the ability to train neural networks on the GPU was
essentially an enabler for their success. Algorithms for training deep neural networks
were specifically optimized for GPUs, mainly consisting of tensor multiplications.
A tensor, as a mathematical object, can be viewed as a generalized 𝑛-dimensional
matrix.
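As a simple illustration of what this looks like in practice, the snippet below (a minimal sketch, assuming PyTorch is installed and, ideally, a CUDA-capable GPU is present) allocates two large matrices directly on the GPU and multiplies them there; changing only the device string makes the same code run on the CPU:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(4096, 4096, device=device)   # large matrices allocated directly on the chosen device
b = torch.randn(4096, 4096, device=device)
c = a @ b                                     # on a GPU, the multiplication runs as a massively parallel kernel
print(c.shape, "computed on", device)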
Computational power has become a limiting factor in numerous deep learning
projects and applications. The utilization of GPUs, rather than CPUs, has enabled
the training of larger and more intricate models. Google has pushed the boundaries
further by introducing Tensor Processing Units (TPUs) [133], designed specifically
for neural networks. However, these units are not openly available for purchase. They
are primarily utilized internally within the company or through exclusive agreements
with selected companies.
Many popular deep learning frameworks, such as TensorFlow and PyTorch,
have integrated support for GPU acceleration, most notably through CUDA. For
a list of such frameworks, you can refer back to Section 6.12 (about the Python language). This
integration has made it easier for developers and researchers to leverage the power
of GPUs without needing extensive low-level programming knowledge. As a result,
the widespread adoption of GPU programming frameworks has contributed to the
democratization of deep learning.
9.3 Stockfish (2008)
Figure 9.1: Stockfish chess engine logo (https://stockfishchess.org/). It is available
under the GPLv3 license: https://www.gnu.org/licenses/gpl-3.0.html
Stockfish is the next chess-related milestone, after Deep Blue, described in this
book. In the time between, there have been a few notable chess engines, but they did
not have a big enough impact on AI as a whole to be included in this publication.
Those include Deep Fritz, Deep Junior, Hydra, and Rybka. Stockfish is arguably
the strongest chess program created to date. It was developed
as an open-source project1 initiated by Tord Romstad, Marco Costalba, and Joona
Kiiski in 2008 [247, 196]. It is written in C++ and available under the GNU GPL
license. It does not come with any GUI-based interface to play against or visualize
the chessboard. However, other projects have been developed independently of
Stockfish to serve this purpose. In 2024, Stockfish is estimated to possess an Elo rating of
3664. By comparison, the highest rating ever achieved by a human player was 2882, by Magnus
Carlsen in 2014. As a side note, under the standard Elo model the stronger player's expected
score is 1 / (1 + 10^(-D/400)), where D is the rating difference: a difference of 200 Elo
points translates to the stronger player scoring about 76% against the weaker one on average,
whereas a difference of almost 800, as in this case, gives the computer an expected score of
roughly 99% against the strongest human player. On the one hand, Stockfish can be regarded only
as a computer program. On the other hand, it marks a new era in combinatorial
games characterized by the following facts:
1. People, no matter how skilled, are no longer able to win against the strongest
computer programs. For many games like chess, this era has already arrived,
while for others it is anticipated. However, there are still games in which human
players hold the upper hand. Examples include No Limit Texas Hold'em
poker (but no longer in the Limit variant), Arimaa, and many real-time video
games.
2. Access to computer engines is considered cheating in human tournaments, as
they provide such a significant advantage due to their strength.
3. Computer programs can serve as oracles that provide the optimal move. They
can be used, for instance, during training sessions or by chess commentators.
1 https://github.com/official-stockfish/Stockfish (Last accessed: 14 Feb 2024)
Stockfish is employed to verify the strategies players use and to identify
blunders.
Since 2010, the Top Chess Engine Championship (TCEC) [254] has been organized,
serving as the most prestigious tournament for AI chess players. Stockfish first
demonstrated its strength in 2013, finishing in second place. The tournament has
been held from once a year (2010-2011, 2015-2017, 2021) to four times a year (2018, 2020).
Since 2014, Stockfish has won 15 times and has been the runner-up 5 times. It has
never achieved a lower place than second. As of 2024, it is also the reigning champion.
Stockfish uses a highly optimized alpha-beta pruning (c.f. Section 5.3) game tree
search algorithm with null windows. Apart from the standard pruning, it also utilizes
heuristic pruning of unpromising areas. The search is equipped with the iterative
deepening (iterative deepening depth-first search) strategy, which combines properties
of both breadth-first and depth-first search. Starting from a
small depth d, it first completes searching the sub-tree up to depth d before progressing
further to d + 1. This algorithm finds a relatively good initial estimate (and
candidate move) much more quickly than plain depth-first search, while maintaining its
memory efficiency (in contrast to breadth-first search). This can be crucial in low
time settings or in other scenarios when the responsiveness must be high. It also
tends to improve the ratio of cuts in alpha-beta pruning.
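To make the idea concrete, here is a minimal sketch of iterative deepening wrapped around a depth-limited negamax search with alpha-beta pruning. This is not Stockfish code: legal_moves, apply_move, and evaluate are placeholder functions to be supplied by a concrete game, and evaluate is assumed to score a position from the perspective of the side to move:

import time

def negamax(state, depth, alpha, beta, legal_moves, apply_move, evaluate):
    # Depth-limited negamax with alpha-beta pruning.
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return evaluate(state), None
    best_move = None
    for m in moves:
        score, _ = negamax(apply_move(state, m), depth - 1, -beta, -alpha,
                           legal_moves, apply_move, evaluate)
        score = -score
        if score > alpha:
            alpha, best_move = score, m
        if alpha >= beta:
            break                                  # beta cut-off
    return alpha, best_move

def iterative_deepening(state, time_budget, legal_moves, apply_move, evaluate):
    best_move, deadline, depth = None, time.monotonic() + time_budget, 1
    while time.monotonic() < deadline:             # a real engine would also abort mid-depth
        _, move = negamax(state, depth, float("-inf"), float("inf"),
                          legal_moves, apply_move, evaluate)
        best_move, depth = move, depth + 1         # a complete depth-d recommendation is always kept
    return best_move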
Stockfish also utilizes quiescence search, a technique developed to improve
the accuracy of move evaluation during the search process. It dynamically adjusts
the search depth depending on the stability (or “noise”, in quiescence search
terminology) of the position. It makes sure to examine very dynamic positions down to
a depth where the position is “quiet”. This minimizes the risk of performing the
heuristic evaluation too early, before important changes take place that the evaluation
might not capture properly.
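A minimal sketch of the quiescence idea is shown below (again with placeholder functions, not Stockfish internals): at the nominal depth limit, the search keeps extending only “noisy” moves, here approximated by captures, and otherwise falls back to the static “stand-pat” evaluation, which is assumed to score the position from the perspective of the side to move:

def quiescence(state, alpha, beta, evaluate, capture_moves, apply_move):
    # Called at the nominal depth limit: keep searching tactical moves until the
    # position is quiet, then rely on the static evaluation.
    stand_pat = evaluate(state)            # score if the side to move plays nothing tactical
    if stand_pat >= beta:
        return beta
    alpha = max(alpha, stand_pat)
    for move in capture_moves(state):      # only dynamic moves are extended
        score = -quiescence(apply_move(state, move), -beta, -alpha,
                            evaluate, capture_moves, apply_move)
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha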
Like Deep Blue, Stockfish combines:
1. Opening books
2. Complex heuristic evaluation functions
3. End-game database
There are different evaluation functions for different phases of the game. They
are linearly combined based on the categorization of the current position. The initial
evaluation functions were similar in spirit to the one Deep Blue utilized, updated
with modern chess knowledge. They took the form of weighted linear combinations of
features. Notably, the number of features was smaller than in Deep Blue's function
because the search depth could be larger. However, in 2020, Stockfish
started using an approach called the efficiently updatable neural network (NNUE) [202]
instead of the hand-crafted heuristic. It is capable of evaluating raw board
positions. The network has been trained on a large number of games, and it has led to
an increase of roughly 90 Elo.
Unlike Deep Blue, Stockfish is a software framework that can be executed
on various hardware platforms, ranging from smartphones and standard personal
computers to clusters (with a dedicated MPI-based version). It can utilize up to 1024
physical processor threads and support up to 32 terabytes of memory for storing
positions. As of 2024, Stockfish has played over 7 billion chess games across all
supported devices.
9.4 Netflix Prize Claimed (2009)
Netflix is one of the leading companies offering subscription-based movies and series
through a streaming application available on various devices. It was founded in
1997, initially offering DVD rentals before transitioning to the streaming model.
The Netflix Prize was a famous competition proposed in October 2006. It
was open to individuals or teams from around the world, except for 6 countries.
The objective was to submit the first solution that improved upon the existing
recommendation algorithm, named Cinematch, used by the company at the time,
by at least 10% in terms of the Root Mean Squared Error (RMSE) score. The
RMSE score of Cinematch on the test dataset was 0.9514, so the target was to
reach 0.8572 or below. The recommendation task was defined as assigning a rating to a
<user, movie> pair, representing the predicted score ∈ [1, 5] that the particular
user would give to the particular movie.
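For reference, the metric itself is straightforward; in the sketch below, the toy ratings are invented and only the definition of RMSE reflects the competition:

import math

def rmse(predicted, actual):
    # Root Mean Squared Error between predicted and actual ratings.
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Toy ratings only; the real test set contained millions of <user, movie> pairs.
print(rmse([3.8, 2.1, 4.9], [4, 2, 5]))   # about 0.14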
The prize money for the winner was 1 million US dollars. In addition, the
team whose solution demonstrated the biggest improvement each year over the best
solutions from the previous year was awarded 50 thousand USD. There was also
a 30-day window after the first submission eligible for the grand prize,
during which the closest competitors could still submit their solutions.
The contestants had access to a training dataset that consisted of over 100
million ratings assigned by over 480 thousand randomly chosen customers to over 18
thousand movies. The test dataset consisted of over 2.8 million entries with the rating
labels hidden from the participants. The performance on the test dataset was also
partially disclosed. The RMSE scores obtained only on a randomly selected portion
(about half) were shown publicly on the website. However, the organizers had access
to the scores calculated for the whole test dataset and informed all participants once a
potentially winning solution had been submitted.
The Netflix Prize had a significant impact on the AI for a few reasons:
1. It demonstrated that a private company may benefit from outsourcing AI-related
problems to researchers and enthusiasts, and that this is a viable business
model.
2. Due to the large prize money, it led to a wider understanding of how important
a role recommendation algorithms play.
3. It popularized the branch of data science known as collaborative filtering.
4. The competition received significant media attention.
5. The released datasets were important for the data science community. Some
teams participated solely to gain access to the data.
A total of more than 40 thousand teams from over 186 countries registered
for the competition. In June 2009, the first team, named BellKor’s Pragmatic
Chaos, submitted a solution that achieved a 10.05% improvement over Cinematch.
This team was a collaboration of previously separate teams: BellKor (formerly
KorBell, from AT&T Research), Big Chaos, and Pragmatic Theory. One other team, named The
Ensemble, managed to reach the required improvement within the 30-day window. The
finale was full of drama and excitement, as both teams submitted solutions
with tied scores. Ultimately, BellKor's Pragmatic Chaos was awarded the prize
for submitting their solution 20 minutes earlier than the opponents. The winning
solution was an ensemble of predictors. The predictors were primarily based on an
improved singular value decomposition (SVD++) with temporal dynamics.
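To give a flavor of this family of methods, here is a heavily simplified latent-factor sketch: plain matrix factorization trained with stochastic gradient descent, where each user and movie gets a small latent vector and a rating is predicted as their dot product. The winning SVD++ solution additionally modeled biases, implicit feedback, and temporal effects, none of which appear here, and all data below is invented:

import numpy as np

# (user, movie, rating) toy data; the real dataset had about 100 million such triples.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0)]
n_users, n_movies, k = 2, 3, 4
rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((n_users, k))      # latent user factors
Q = 0.1 * rng.standard_normal((n_movies, k))     # latent movie factors

lr, reg = 0.05, 0.02
for _ in range(500):                              # stochastic gradient descent on squared error
    for u, m, r in ratings:
        err = r - P[u] @ Q[m]
        pu = P[u].copy()
        P[u] += lr * (err * Q[m] - reg * P[u])
        Q[m] += lr * (err * pu - reg * Q[m])

print(round(float(P[0] @ Q[0]), 2))               # should be close to the observed rating of 5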
There was a plan for a second Netflix Prize; however, it was canceled following
a lawsuit concerning the privacy of its users. Researchers from the University of
Texas demonstrated that it was possible to partially de-anonymize the publicly
released datasets by matching them with the Internet Movie Database.
9.5 Kaggle Platform (2010)
In April 2010, the Kaggle platform was founded. It is a service for hosting data
science and machine learning competitions2 . The scale of Kaggle, with its global
reach boasting over 15 million registered users from 194 countries (as of October
2023), combined with its user-friendly experience, has contributed to its leading
position. Participants in Kaggle competitions can compete individually or in teams.
One of the key features of Kaggle is its collaborative environment. Participants
can share code, discuss strategies, and learn from each other through forums,
notebooks, and discussion threads. Undoubtedly, Kaggle has been a transformative
game-changer for the data science and machine learning community.
The platform is addressed to various types of users:
1. Competition organizers, including:
• Companies seeking to present their business problems to the community
for a prize.
• Universities interested in sharing compelling research challenges.
2. Competition participants, such as:
• Professionals primarily focused on winning the prize.
• Individuals learning data science/machine learning to enhance their skills
by solving practical problems.
2 Available at: https://www.kaggle.com/ (Last accessed: March 10, 2024)
There are four main types of competitions hosted at Kaggle:
1. Featured - these are full-scale machine learning challenges related to unsolved
commercial problems. They are typically the most demanding and offer the
highest prizes. Examples of featured competitions hosted at Kaggle:
• Passenger Screening Algorithm Challenge was organized by the Depart-
ment of Homeland Security in 2017. The objective was to estimate the
probability of a threat during airport screening. The top prize was $1.5
million and the total prize pool exceeded 2 million USD.
• Zillow’s Home Value Prediction (Zestimate) was organized in 2017 by
Zillow. The goal was to beat the Zillow benchmark, called “Zestimates”,
for estimating the values of properties. The competition had two rounds.
In the first round, the top prize was $50,000, whereas in the second
round, the top prize was 1 million USD.
• The Deepfake Detection Challenge was organized in 2019 with support
from Amazon, Facebook, Microsoft, the Partnership on AI’s Media
Integrity Steering Committee, and academics. As the name implies, the
goal was to detect artificially created (manipulated) videos. The first
prize amounted to $500,000, with a total prize pool exceeding 1 million
USD.
2. Research - these revolve around problems of a more experimental nature.
While they may offer prizes, this is not a strict requirement. Examples of
research competitions hosted at Kaggle:
• American Sign Language Fingerspelling Recognition organized by
Google in 2023. The goal was to detect and translate American
Sign Language fingerspelling into text. The first prize was $100,000.
• Earthquake Prediction organized by the Los Alamos National Laboratory
in 2019. It was based on real-time seismic data. The first prize was
$20,000.
• Understanding Clouds from Satellite Images organized by the Max Planck
Institute for Meteorology in 2019. The objective was to classify different
types of clouds based on images. The first prize was $5,000.
3. Playground - these are relatively simple competitions with prizes ranging
from “kudos” to small cash prizes or donations to charity. Examples of
playground competitions hosted at Kaggle:
• Binary Classification with a Bank Churn Dataset organized by Kaggle
in 2024. The goal was to predict whether a customer continues with
their account or closes it.
• Don’t Overfit! organized by Kaggle in 2019. It featured a low number
of samples and its aim was to learn how not to overfit the models under
such circumstances.
• New York City Taxi Trip Duration organized by Kaggle in 2017. It was
one of the few playground challenges with cash prizes with the total
prize pool equal to $30,000. The objective was to predict the total time
of a taxi trip in New York city.
4. Getting started - these are the simplest of all competitions, tailored to
introducing users to the Kaggle platform. They can also be regarded as
training challenges. Not a single one featured cash prizes.
Competitions typically range from 2 months to a year in duration. Even past
competitions serve as valuable sources of knowledge for Kaggle users. Not only can
they inspire personal projects, but notably, the platform also hosts many datasets
from past competitions for public use. The most common formats of the datasets
are comma-separated values (CSVs) and JavaScript Object Notation (JSON) files as
well as SQLite databases and Google's BigQuery format.
The users may also view machine learning models that are trending on the
platform for given tasks (e.g., image classification, text generation), data types (e.g.,
tabular, multimodal) and other filters.
What is typically required to organize a competition:
1. A clearly defined and well-described problem that can be solved using data
science/machine learning techniques.
2. One or more training datasets that can be publicly shared with the participants.
3. A double-check that the data is free from leaks.
4. Evaluation datasets, typically divided into two partitions: one used for
the public leaderboard and one exclusively for the final leaderboard. This
division prevents competitors from overfitting or biasing their solutions towards
observed scores.
5. Adequate evaluation metrics for the problem.
6. A baseline solution that competitors might need to improve upon.
7. A plan for the competition timeline, including its duration, whether it consists
of two stages, and whether new data might be released after some time from
the start.
9.6 IBM Watson (2011 breakthrough)
IBM Watson [72], named after the first CEO of IBM, is another famous AI system
developed by IBM after Deep Blue (see Section 7.11). Watson underwent two stages
of evolution. First, inspired by Deep Blue's clash against the top human player,
Watson was designed to fulfill a similar role in the game of Jeopardy!. Deep Blue,
despite being a huge achievement, was tailored for an abstract game, and therefore,
had limited use outside its domain.
In the second stage of evolution, after Watson proved highly successful on
Jeopardy!, it evolved into an AI-driven question-answering system for medical,
industrial, and commercial applications.
9.6.1 The Jeopardy! Phase
Jeopardy! is an American television quiz show that has been licensed to many other
countries, often under different names. In my home country, Poland, it was known under
the name “Va Banque”. Three contestants are presented with a board containing
Figure 9.2: Jeopardy! game board that has been featured in one of
the shows. Credit to: Kinu Panda. Released under: CC BY-SA 4.0
https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons.
six categories of knowledge, each with five entries of different value, as shown in
Figure 9.2. The score of each player is their current money earned throughout
the game. The game consists of two standard rounds followed by a final question
where participants can wager any portion of their total earnings. Each player who
provides the correct answer receives the money wagered. In the standard rounds,
the participants have to press a button (“buzz in”) when they are ready to respond,
and only the first participant to do so is allowed to respond. What makes Jeopardy!
unique is that clues are presented to participants in the form of answers, to which
they must respond with a question. For example, if the provided answer is “The
legendary sword of King Arthur,” the participant should respond with “What is
Excalibur?”
On February 14th, 2011, Watson competed in an official series of Jeopardy!
matches for the first time and emerged victorious with earnings totaling $35,734.
The human opponents achieved $10,400 and $4,800, respectively. Watson played
another match the following day and won again, this time earning $70,147 compared
to the human opponents’ earnings of $24,000 and $21,600. Watson’s scores were
the result of complex wagering strategies in the so-called Daily Doubles and final
round. Human players tend to prefer round numbers. In the grand final against
former human champions, which Watson also won, the prize was 1 million USD.
IBM donated the whole amount to charity. Ken Jennings, who took second place,
beaten only by Watson, reportedly said: “I, for one, welcome our new computer
overlords”, a reference to the popular show The Simpsons.
IBM Watson’s appearance on the quiz show Jeopardy! in 2011 was a groundbreak-
ing demonstration of its capabilities in natural language processing and knowledge
retrieval. For details, please refer to paper [302]. Not only was it the most proficient
query-answering system of its kind at the time, but as a participant it also had to:
• Understand the wording - in most query-answering systems, queries must
be formulated in a compatible format. Watson had to comprehend the diverse
ways in which clues were presented in Jeopardy! It utilized state-of-the-art
natural language processing algorithms.
• Select squares to play - Watson employed heuristics tuned using simulations
to determine its choices on the game board.
• Buzz in - for the sake of fairness, Watson was required to physically press the
signaling device, just like the other participants. Despite this requirement,
it demonstrated remarkable speed in doing so. Additionally, Watson had to
make strategic decisions about whether to attempt an answer or leave it to the
opponents. It is worth noting that providing a wrong answer would result in a
deduction from earnings equal to the value of the question. To navigate this
challenge, Watson employed a sophisticated strategy, considering factors such
as: (1) all players’ scores (money), (2) its confidence scores regarding the
question, and (3) a Monte Carlo simulation model that predicted the likelihood
of opponents knowing the answer.
• Come up with the solution - Watson did not have access to the Internet.
However, it was equipped with an extensive offline database that included the
full 2011 version of Wikipedia. Watson utilized an efficient implementation
of multiple search algorithms running in parallel, alongside logistic regression
for classification. Initially, it generated potential responses, then gathered
additional evidence and assigned confidence levels to each response. The
system could generate millions of different confidence scores, which were then
consolidated into a single probability per response. The top three responses
with the highest probability were displayed for viewers of the show.
• Construct questions - the output had to be a properly constructed question in
English.
• Understand others’ incorrect answers - when a player provides an incorrect
response, other players have an opportunity to buzz in. This aspect proved to
be difficult for Watson. For example, it occasionally attempted to provide the
same answer that had previously been confirmed as incorrect by the quiz host.
• Handle Daily Double - in the first round, one random cell on the gameboard
contains the so-called Daily Double, while in the second round, two are
present. This information remains hidden until a participant selects this
specific cell. The Daily Double allows the player to wager any amount of
money they currently possess. Analysis suggests that the distribution of Daily
Doubles is not uniform in the historical data of the quiz. Watson leveraged this
information to its advantage when selecting them. The wagering strategy was
based on evaluating the current game state and trained using reinforcement
learning. It considered various variables, such as the number of clues Watson
had answered incorrectly from the Daily Double’s category. In general, it
adopted an aggressive betting approach.
• Wager in the finals - Watson used simulations in conjunction with confidence
estimation. It analyzed the situation based on its ranking in terms of prizes
and the ratios of money between the players. These ratios were categorized
into buckets, such as when the second player’s money is less than two-thirds of
the first player’s money, between two-thirds and three-quarters, or more than
three-quarters, and whether the second player has at least double the money of
the third player. For each bucket, Watson employed a distinct logic based on
statistics from historical games.
Watson’s win in Jeopardy! received significant attention because it not only
performed well, but also defeated very strong human players in a game that required
natural language understanding rather than being purely abstract. The debate about the
potential applications of AI was sparked again.
9.6.2 The Post-Jeopardy! Phase
IBM Watson became a commercial product that has been undergoing constant
changes and updates with the advancements in AI. Initially, it was primarily a
knowledge-based query-answering system. Later, it became advertised more as a
cognitive system [117].
The underlying engine of Watson is named DeepQA. The main factors distin-
guishing it from other query-answering systems at the time were:
1. A natural language, human-friendly user interface.
2. The fact that DeepQA was not merely a pattern-matching database search; it
involved cognitive-process-inspired reasoning, including formulating hypothe-
ses, understanding context, matching facts, and estimating uncertainty.
3. DeepQA could handle both structured and unstructured data sources.
4. The accuracy of the answers, as demonstrated in Jeopardy!
Watson first breaks down the query into a structure composed of features and the
connections between them that make up the context of each feature. Then it
generates various hypotheses that have some similarity or correlation with them.
It utilizes a large portfolio of reasoning algorithms (e.g., for spatial or contextual
analysis). Each candidate match is assigned a weighted confidence score aggregated
for different dimensions of evidence (e.g. semantic relatedness). Next, a machine
learning model estimates the final confidence scores. It takes into account how well
the particular reasoning algorithm performed in the past on similar data.
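As a purely schematic illustration of that last step, and not IBM's actual implementation, the sketch below merges a handful of per-dimension evidence scores for candidate answers into a single confidence using logistic regression; the feature names, values, and labels are invented:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row holds invented evidence scores for one candidate answer (e.g., semantic
# relatedness, source reliability, answer-type match); the label says whether the
# candidate turned out to be correct.
X = np.array([[0.9, 0.8, 0.7],
              [0.2, 0.4, 0.1],
              [0.7, 0.9, 0.6],
              [0.1, 0.3, 0.2]])
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
# Final confidence for a new candidate answer, merged into a single probability.
print(model.predict_proba([[0.8, 0.7, 0.9]])[0, 1])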
The first applications of Watson were in the medical sector [156]. Between 2012
and 2014, it was employed by the following IBM partners:
• Memorial Sloan Kettering Cancer Center as a clinical-decision support tool
for cancer.
• Cleveland Clinic as a training tool for medical students.
• MD Anderson Cancer Center as a clinical-decision support tool for cancer.
• New York Genome Center as a genomic-analysis tool for brain cancer.
• GenieMD as a consumer app for personalized medical advice.
• Mayo Clinic as a clinical-trial matching tool.
IBM Watson, while initially hailed as a potential game-changer in healthcare
with its promise to revolutionize medical diagnosis and treatment, has faced criticism
and challenges in delivering on these expectations. In an article titled “IBM
Watson, heal thyself: How IBM overpromised and underdelivered on AI health
care” [284] published in IEEE Spectrum, the author outlines the challenges Watson faced
in real-world applications. They include too much reliance on statistics rather than
on individual patients, and the inability to pick up the subtle clues and nuances that medical
doctors can. It turned out that it also could not extract meaningful information
from patients’ health records. Its accuracy in medical tasks was unpredictable,
varying from around 90% to 60% depending on the task.
of statistical reasoning, machine learning, and algorithms, proved unsuitable for
existing medical procedures. Ultimately, many partners discontinued their use of
Watson.
Nevertheless, even critics, including the author of [284], agree on one point:
Watson possesses significant untapped potential. Yet, translating this potential into
practical implementation proves challenging when dealing with intricate technical
systems.
9.7 Deep Learning Revolution (2012)
Several times in this book, I have referenced a particular approach that, apart from
being state-of-the-art at the time, was considered a precursor to deep learning, now
recognized as the forefront of artificial intelligence as of 2024. Finally, we arrive at
the year 2012, a pivotal moment in the evolution of deep learning. Many researchers,
including the author of this book, regard 2012 as the start of the deep learning
revolution.
The turning point occurred when a deep learning model named AlexNet won the
ImageNet Large Scale Visual Recognition Challenge (ILSVRC). This groundbreaking
approach was introduced in the 2012 paper titled “ImageNet Classification with
Deep Convolutional Neural Networks” [151] by Alex Krizhevsky, Ilya Sutskever,
and Geoffrey E. Hinton. It has since become one of the most cited works in artificial
intelligence.
ImageNet is the common name of ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) [59]. It has been held annually since 2010 and serves as one
of the most influential benchmarks and competitive environments for computer
vision algorithms. Each year, an accompanying workshop is organized at one of
the top computer vision conferences. The competition relies on the ImageNet
dataset, introduced prior to the challenge, which has played a crucial role in its
success. In 2012, the year when AlexNet won, the dataset contained over 14 million
labeled images across approximately 22,000 categories. Many of the images are additionally
annotated with bounding boxes of the featured objects. At that time, it was the largest labeled
image-based dataset. The scale was made possible by employing crowd-sourcing
through Amazon’s Mechanical Turk service (on a side note: this name has no
connection to the Mechanical Turk chess automaton described in Section 2.3).
AlexNet not only won the 2012 ImageNet challenge but also significantly
outperformed both its competitors and the past winners. It achieved an error rate
of 15.3%, whereas the second-best entry reported an error rate of 26.2%. The
paper [151] presents detailed results of the model for different parameterizations
and versions of the ImageNet dataset from different years, all of which improve upon
the second-best result by a significant margin.
The architecture of AlexNet was a deep convolutional neural network, illustrated
in Figure 9.3. Its structure is nearly bipartite, consisting of two halves connected
as depicted by lines and arrows in the figure. The network comprised eight layers
with learnable parameters (weights). In total, it contained 650,000 neurons and 60
million parameters. Such large networks had not been successfully trained before.
Among the eight layers, five were convolutional with the following kernels:
• 96 kernels of size 11 × 11 × 3 (with 4 pixels stride)
• 256 kernels of size 5 × 5 × 48
• 384 kernels of size 3 × 3 × 256
• 384 kernels of size 3 × 3 × 192
• 256 kernels of size 3 × 3 × 192
Figure 9.3: AlexNet convolutional neural network model. The network’s input is
150,528-dimensional, and the number of neurons in the network’s remaining layers is
given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000. This image
and the description of dimensions are taken from the original paper [151] without
any changes.
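For orientation, the following is a compact PyTorch sketch of a single-path variant of the convolutional stack described above; it ignores the two-GPU split, the local response normalization, and the exact padding choices of the original model, so layer input depths differ slightly from the grouped kernels listed above:

import torch
import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),          # one output per ImageNet class
)
print(alexnet_like(torch.randn(1, 3, 227, 227)).shape)   # torch.Size([1, 1000])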
The main novel features were:
1. Effective training on the GPU - although GPGPU had already been introduced
for some time (see Section 9.2), it had limitations (both in terms of technology
and programming interface), and ML techniques were predominantly CPU-
trained. For instance, the high-end consumer GPU had a maximum of 3GB
of memory, which limited the number of training samples that could be
transferred to the card. It also limited the number of parameters that could
be stored on the GPU. The almost bipartite architecture of AlexNet made it
possible to train the model using two GPUs working in parallel - they had
access to distinct sets of parameters and training samples. The communication
between them is denoted by lines and arrows in Figure 9.3.
2. Large scale - although CNNs had already been known by the AI/ML community
(see Section 6.8), they had been considered yet another computer vision
method with similar performance to already existing approaches. AlexNet
demonstrated that CNNs become very powerful at scale - when both the
network size is large and there are appropriately many training examples.
3. The use of ReLU with Local Response Normalization - instead of using
traditional activation functions like sigmoid or hyperbolic tangent, AlexNet
started a trend of using ReLU activation functions. This activation function
introduces non-linearity into the network by replacing negative values with
zeros while leaving positive values unchanged. In addition, it turned out that
the rate of convergence of the training process benefits from using ReLU
functions. Although ReLUs do not require input normalization, the authors
introduced local response normalization and showed that it reduces the error
of the network by around 2 percentage points. The normalization scheme is
presented in Eq. 9.1, where a^i_{x,y} and b^i_{x,y} denote the regular and normalized
activity of a neuron obtained by applying the i-th kernel at position (x, y), N is the
total number of kernels in the layer, and k, α, β, and n are hyperparameters. In the
challenge, AlexNet used k = 2, α = 0.0001, β = 0.75, and n = 5. A minimal NumPy
sketch of this normalization is given after the list below.

b^i_{x,y} = a^i_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^j_{x,y} \right)^2 \right)^{\beta}    (9.1)
4. Overlapping Pooling - the aim of the pooling operation, which was described
in Sec. 6.8, is to downsize the image. Prior to AlexNet, it was typically
performed using non-overlapping sliding windows (neighborhoods of pixels).
The authors have shown that using overlapping windows further reduces the
error rates of the network.
5. Effective prevention of overfitting - which was a serious challenge for very large
models, consisting of millions of parameters. To tackle it, AlexNet utilizes
a number of techniques. Firstly, it employs two types of data augmentation,
performed on the fly during training without storing the augmented images
on disk. Additionally, the authors utilized a relatively new
dropout technique. Without it, the network was reported to drastically overfit
to data. Using dropout not only reduced this problem but also decreased the
training time, as the “dropped out” neurons are not used in a given forward
pass and backpropagation. Due to the success of AlexNet, dropout has been
validated and popularized.
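As mentioned above, here is a minimal NumPy sketch of the normalization from Eq. 9.1, applied across kernel maps at a single spatial position; the array a holds the activities a^i_{x,y} for all N kernels, and the default hyperparameters follow the values quoted from the paper:

import numpy as np

def local_response_norm(a, k=2.0, alpha=1e-4, beta=0.75, n=5):
    # Eq. 9.1: each activity is divided by a term summarizing n neighboring kernel maps.
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2)) ** beta
        b[i] = a[i] / denom
    return b

print(local_response_norm(np.array([1.0, 2.0, 3.0, 4.0])))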
AlexNet had a significant impact on artificial intelligence. It was essentially a
proof-of-concept for deep learning. Previously, deep learning was mostly confined to
research projects and lacked scalability. AlexNet demonstrated that it could not only
be applied in practical, competitive environments but also achieve unprecedented
results. This achievement was made possible through the combination of numerous
components outlined in the previous paragraph, all of which synergized effectively.
The authors greatly popularized CNNs and inspired a plethora of different CNN-
based architectures. One example is U-Net, originally developed for biomedical
image segmentation.
AlexNet also began a trend of using the same model for various tasks across
domains. For instance, some problems were reformulated as image segmentation
or object detection tasks in order to apply AlexNet-inspired deep learning
architectures to them.
9.8 Deep Q-Learning (2013)
This section continues the deep learning revolution topic, presenting the work of
DeepMind, one of the major innovators in this field. DeepMind was a startup
founded in 2010 in London, United Kingdom, by Demis Hassabis, Shane Legg, and
Mustafa Suleyman as an interdisciplinary AI research lab. Demis Hassabis became
the CEO. In 2014, DeepMind was acquired by Google, reportedly for a price of at
least $400 million USD; McKinsey Analytics reports “approximately $500 million
USD” [8]. Since then, it has appeared as Google DeepMind. In 2023, DeepMind was
merged with Google Brain, Google's other AI research division, into a single team.
One of the earliest widely influential contributions made by DeepMind is related
to Deep Q-Learning (DQN). While Q-Learning originated in the late 1980s, Deep
Q-Learning represented a pioneering effort by DeepMind. It was widely commented on,
both in the scientific literature and in popular media, due to the compelling presentation
of the results achieved in Atari 2600 games. The paper titled “Playing Atari with
Deep Reinforcement Learning” [195], authored by Volodymyr Mnih et al., was first
released on arXiv in 2013. The experiments included 7 Atari games: Beam Rider,
Breakout, Enduro, Pong, Q*bert, Seaquest, and Space Invaders. Two of the games
are presented in Figure 9.4. In 2016, a paper titled “Deep Reinforcement Learning
with Double Q-Learning” [309] authored by Hado van Hasselt , Arthur Guez, and
David Silver of Google DeepMind was published at the AAAI conference. This
paper utilized an enhanced version of Q-learning and expanded the experimentation
to include 57 Atari games, which encompassed classics such as Video Pinball,
Asteroids, Ms. Pacman, and Montezuma’s Revenge. In this section, we will refer to
the first paper published in 2013.
Figure 9.4: Two Atari games: Breakout (on the left) and Space Invaders (on the
right). Among other games, they have been used for DeepMind’s work and later
picked up by many other researchers.
The main novel features of this research can be summarized as:
1. Deep Q-learning - the authors introduced a novel method for Q-learning
using a deep neural network.
2. “One model to rule them all” - the same neural network model was utilized to
play all the games rather than being tuned for a specific game. This represents
a step towards a more human-like way of learning how to play games, as
we use the same cognitive processes for various games and other intellectual
activities.
3. Processing raw pixel input - the method was designed to only take screen
captures of raw pixel values as the state representation. The state was a
210 × 160 map of RGB values (vectors in R^3) passed at a frequency of 60 Hz. No
handcrafted game-state features were employed, which was a radically new
approach to AI in games.
4. Popularizing the use of Atari games for reinforcement learning - the Atari
2600 benchmark is currently widely used in the field of AI and
reinforcement learning research.
The authors used an emulator of the Atari 2600 computer. They did not rely
on any game-state features other than the screen output. However, this specialized
emulator, designed for research purposes, also provided access to the current reward.
This reward mechanism forms the backbone of reinforcement learning, where the
objective is to maximize rewards within a given time horizon. Let 𝑇 denote the game
termination, in which the ultimate scoring is performed, and 𝑡 denote the current
time. The maximized reward is a sum of immediate rewards at consecutive steps 𝑡 ∗
discounted in each step by a factor of 𝛾 ∈ [0, 1].
R_t = \sum_{t^* = t}^{T} \gamma^{(t^* - t)} \, r_{t^*}    (9.2)
The optimal action-value function is expressed as:
Q^*(s, a) = \mathbb{E}_{s' \sim \mathcal{E}} \left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]    (9.3)
which reads as choosing action 𝑎′ among the available actions that lead to a state
𝑠′ maximizing the expected discounted reward at the next step. This is an iterative
procedure of updating the 𝑄 estimation.
It is, however, infeasible for most scenarios to represent the 𝑄 function exactly.
Therefore, a neural network approximation is used, denoted by 𝑄(𝑠, 𝑎; 𝜃 𝑖 ), where 𝜃 𝑖
are all the modifiable parameters of the network at 𝑖-th time step. The loss in such a
network can be expressed as:
L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)} \left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right]    (9.4)
where 𝑦𝑖 is the target for 𝑖-th iteration (the estimated 𝑄 value for the next step using
the previous step network parameters) and 𝜌(.) is a probability distribution serving
as a policy for selecting actions. If this was a supervised learning problem, then 𝑦𝑖
would be just examples from the training dataset.
The authors refer to TD-Gammon (c.f. Section 7.1) as an inspiration for their
approach, but their chosen approach to learning is different. They utilize so-called
experience replay, where a pool of tuples (𝑠𝑡 , 𝑎𝑡 , 𝑟𝑡 , 𝑠𝑡+1), called experiences, is stored.
Part of the learning algorithm is to randomly sample transitions from the currently
stored experiences and use them to update the network, while actions in the emulator
are selected according to the 𝜖-greedy policy. This policy advises following the
best-estimated action with probability 1 − 𝜖 and a random one otherwise. The value
of 𝜖 was linearly annealed from 1.0 to 0.1 over the first million training frames and
stayed fixed at 0.1 afterwards. The stored experience pool also held the one million
most recent frames.
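To make the 𝜖-greedy policy and the experience pool more concrete, the following minimal Python sketch illustrates the linear annealing of 𝜖 and sampling from a bounded replay buffer. All names, the batch-size handling, and the action-selection helper are illustrative assumptions rather than code from [195].

```python
import random
from collections import deque

REPLAY_CAPACITY = 1_000_000                 # pool of the most recent transitions
replay = deque(maxlen=REPLAY_CAPACITY)

def epsilon(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from `start` to `end` over `anneal_frames`."""
    if frame >= anneal_frames:
        return end
    return start + (end - start) * frame / anneal_frames

def select_action(q_values, frame):
    """Epsilon-greedy: a random action with probability epsilon, otherwise argmax Q."""
    if random.random() < epsilon(frame):
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def store_and_sample(transition, batch_size=32):
    """Store (s_t, a_t, r_t, s_{t+1}) and sample a training minibatch, if possible."""
    replay.append(transition)
    if len(replay) < batch_size:
        return None
    return random.sample(replay, batch_size)
```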
The 𝑄 function from Eq. 9.3 was represented as a 5-layer neural network. The
graphical representation of the game is cropped to the bounding box where the
action occurs. Following this operation, it undergoes conversion to grayscale
and additional downsampling, resulting in a size of 84 × 84 × 4, where the last 4
denotes the last four frames considered in the estimation. Layers 2 and 3 are
convolutional layers with ReLU functions applied to the results of the convolution
filters. The fourth is a fully-connected layer. The output layer consists of neurons
representing each possible action; therefore, each output approximates the Q-value
of the corresponding action. The network was initialized with random weights. It
was trained using root mean square propagation (RMSprop) with batches of size
32 sampled from the pool of experiences. For a more detailed description of the
algorithm, please refer to [195].
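The following PyTorch sketch illustrates a network of this shape: an 84 × 84 × 4 input, two convolutional layers with ReLU, a fully-connected layer, and one output per action. The specific filter counts and sizes are illustrative assumptions and are not quoted from [195].

```python
import torch
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """A DQN-style network: stacked frames in, one Q-value per action out."""
    def __init__(self, num_actions: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # layer 2: convolution + ReLU
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # layer 3: convolution + ReLU
            nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # layer 4: fully connected
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # output: one Q-value per action
        )

    def forward(self, x):
        # x: a batch of 84x84x4 grayscale frame stacks, shape (N, 4, 84, 84)
        return self.head(self.features(x))

# q = AtariQNetwork(num_actions=4)(torch.zeros(32, 4, 84, 84))   # -> shape (32, 4)
```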
The deep Q-learning approach implemented by DeepMind achieved better results
than the other AI/ML methods tested in [195] across all 7 games. Additionally, in
Breakout, Enduro, and Pong, it outperformed expert human players, and in Beam Rider
it achieved a score similar to that of an expert human player. In Q*bert, Seaquest, and
Space Invaders, the human players remained significantly stronger.
9.9 Word2Vec (2013)
The next milestone is related to the Natural Language Processing (NLP) field, which
has long been recognized as one of the pivotal aspects of AI. Firstly, natural language
is deeply ingrained in our daily tasks and activities. High proficiency in it would
allow AI systems to support us in various ways, such as translation and writing
assistance, without relying on complex and (to us) artificial interfaces. Secondly,
there are many sources of information written in natural language that can be
utilized for training AI systems. Thirdly, even in Turing’s era (see Section 3.1),
it was recognized that language proficiency is a crucial component of intelligence.
There have been many advancements in NLP, some of them already presented
in this book. This section introduces one of the newer ones, named Word2Vec.
Word2Vec is a paradigm that encompasses a few models under one umbrella. The
key idea is to represent words from a natural language as vectors in a continuous
numerical space. Put simply, it aims to convert words into word embeddings
such that similar words are assigned similar numerical representations.
This is an improvement over approaches that directly use one-hot encoding
or index mapping, as presented in Figure 9.5. Those representations lack the
ability to capture similarity, ambiguity, and context. Additionally, they can become
extremely sparse, particularly for large corpora of words. Finally, the arbitrary
number assignment makes it challenging to transfer them between domains.
The Word2Vec technique emerged in 2013 through the work of T. Mikolov,
I. Sutskever, K. Chen, G. Corrado, and J. Dean. The two seminal articles that
introduce this concept are: “Efficient Estimation of Word Representations in Vector
Space” [191] and “Distributed Representations of Words and Phrases and Their
Compositionality” [190]. The authors proposed two novel models for practical
realization of the Word2Vec idea. These are Continuous Bag-of-Words (CBOW)
and Continuous Skip-gram, respectively. Both of them are based on neural networks
that learn the vector representation of words.
Figure 9.5: Simple encodings of words for NLP that do not utilize the concept of
embeddings.
CBOW consists of three layers: input, projection, and output. The input layer
accepts words from a given dictionary. Let 𝑉 denote the size of this dictionary,
which also determines the size of the input. Like most language models, CBOW
operates with a specific context. The way the context is defined is usually the
main aspect that differentiates Word2Vec approaches. In CBOW, the context is
defined as the 𝑁/2 preceding and 𝑁/2 succeeding words. In paper [191], the authors
chose 𝑁 = 8. Those 𝑁 words are one-hot-encoded and passed as the input, so, at any
given time, there are 𝑁 input neurons with a value of 1, and the remaining ones
are zeros. It is important to note that the order of the words is lost; in other
words, the method is invariant to the order in which the words are passed (hence
the name “bag-of-words”). The second layer performs a projection operation defined
by multiplying the input vectors by a transformation matrix, whose values are
learnable parameters shared for all words. The output units correspond to the unique
words of the vocabulary, encoded as a Huffman binary tree. On this tree, a
hierarchical softmax function produces the normalized probability of each word
being the predicted (middle) word for the given context.
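The following numpy sketch illustrates the projection step only: because the inputs are one-hot, multiplying them by the shared projection matrix reduces to averaging rows of that matrix. The vocabulary size, embedding length, and word indices are illustrative assumptions, and the matrix is random rather than trained.

```python
import numpy as np

V, D, N = 10_000, 300, 8                    # vocabulary size, embedding length, context size
W = np.random.randn(V, D) * 0.01            # learnable projection (embedding) matrix, shared by all words

def cbow_projection(context_word_ids):
    """Average the projected one-hot vectors of the N context words."""
    # Multiplying a one-hot vector by W simply selects a row of W,
    # so the projection reduces to an average of embedding rows.
    return W[context_word_ids].mean(axis=0)   # shape (D,)

# 4 words before and 4 words after the predicted word (indices are arbitrary examples).
hidden = cbow_projection([12, 7, 953, 4, 88, 3021, 55, 9])
```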
The training complexity of the CBOW approach, as presented in [191], is
proportional to:
𝑄 = 𝑁 × 𝐷 + 𝐷 × log2 (𝑉) (9.5)
- where 𝐷 is the length of vector representation equal to the second dimension of the
projection matrix (the first dimension is equal to 𝑁). In the introductory paper, the
authors chose 𝐶 = 10 for the experiments.
The Skip-gram model is similar to CBOW. However, in the original variant, only
the current word is presented as the input, and the model estimates the probabilities
of words within a certain range. These are the most related words to the given
one. An illustrative comparison of both approaches is presented in Figure 9.6. The
training complexity is proportional to:
𝑄 = 𝐶 × (𝐷 + 𝐷 × log2 (𝑉)) (9.6)
- where 𝐶 is the maximum distance between words.
Figure 9.6: CBOW and Skip-gram models. The image has been reproduced
from [191].
The introduction of continuous numerical embeddings of words was a breakthrough.
They allowed a significant improvement in the accuracy of various language-related
tasks. For instance, the authors reported over 50.0% accuracy of answers in
the Semantic-Syntactic Word Relationship test. Selected examples of relationships
from [191] are presented in Table 9.2. The best other methods the authors compared
their approaches to were only able to achieve 13.3% and 36.5% accuracy
in the semantic and syntactic tasks, respectively. In the Microsoft Research Sentence
Completion Challenge, where the objective was to fill one missing word into a
sentence, the authors improved the state-of-the-art performance by around 3.5%.
Nowadays, word embeddings play a crucial role in modern AI and NLP by enabling
efficient representation of words, capturing semantic relationships, improving model
performance, facilitating transfer learning, and finding applications across various
domains.
Relationship pattern | Example 1: +word | Example 1: answer | Example 2: +word | Example 2: answer
France - Paris | Italy | Rome | Florida | Tallahassee
big - bigger | small | larger | quick | quicker
Miami - Florida | Baltimore | Maryland | Kona | Hawaii
Einstein - scientist | Messi | midfielder | Picasso | painter
Sarkozy - France | Berlusconi | Italy | Koizumi | Japan
copper - Cu | zinc | Zn | uranium | plutonium
Berlusconi - Silvio | Sarkozy | Nicolas | Obama | Barack
Microsoft - Windows | Google | Android | Apple | iPhone
Microsoft - Ballmer | Google | Yahoo | Apple | Jobs
Japan - sushi | Germany | bratwurst | USA | pizza
Table 9.2: Examples of relationships presented in the paper [191]. The models were
provided with context in the form of relationship patterns and the first word of the
relationship to complete. A discerning observer will notice that not all answers were
the most accurate.
10. Years 2014 - 2016
10.1 Generative Adversarial Networks (2014)
Generative AI is one of the fastest-growing areas in the landscape of artificial intelli-
gence. It involves the creation of various types of content, including images, videos,
3D models, speech, and music, that feels original and meaningful, despite being
created artificially. Typically, human involvement is only required for specifying
frames, such as directions or constraints, for the generation process. These frames
can take the form of prompts written in natural language or a set of examples that the
output is expected to resemble. Generative AI represents a unique avenue compared
to traditional machine learning approaches, which primarily focus on classification
and regression models. Instead, it prioritizes the production of new data rather than
the evaluation of existing data.
Generative Adversarial Networks (GANs) are a prime example of generative AI.
They are also the first machine learning approach to tackle this problem. Previously,
generative AI was mostly implemented through evolutionary approaches, cellular
automata, and various specific algorithms for particular types of content. For example,
fractal generation was used for some types of plants or clouds, and Brownian motion
was employed for terrain 3D model generation.
GANs were first introduced in a paper titled “Generative Adversarial Nets” [94],
authored by I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S.
Ozair, A. Courville, and Y. Bengio. This work was presented in 2014 at the NIPS
conference.
In the GAN approach, two neural network models are trained:
1. 𝐺 - a generative model (generator).
2. 𝐷 - a discriminative model (discriminator).
Algorithm 8: GAN training algorithm from [94].
while iteration < trainingIterations do
    for k steps do
        Sample a minibatch of m noise samples z^(1), ..., z^(m) from the noise prior p_g(z);
        Sample a minibatch of m examples x^(1), ..., x^(m) from the data-generating distribution p_data(x);
        Update the discriminator by ascending its stochastic gradient:
            \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log(1 - D(G(z^{(i)}))) \right]
    Sample a minibatch of m noise samples z^(1), ..., z^(m) from the noise prior p_g(z);
    Update the generator by descending its stochastic gradient:
            \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z^{(i)})))
Figure 10.1: GANs during training. The highlighted column contains the nearest
training examples of the neighboring samples. The figure is taken from the paper that
introduced GANs [94].
The authors of [94] compare the generator to a team of counterfeiters. Its
purpose is to generate artificial (fake) objects, e.g., images, voices, texts, etc. This is
achieved by learning to represent a distribution of these objects, from which new
instances can be sampled. Since neural networks propagate numerical data through
to their topologies, a fixed numerical representation of the generated objects must be
assumed. For example, in the case of images, this involves mapping pixels to input
and output neurons, as well as determining how to represent colors (e.g., as triplets
of floating-point values ∈ [0, 1]).
The purpose of the discriminator is to estimate the probability that a given
sample is a “fake” one, generated by 𝐺, rather than originating from real data. It acts as a
Figure 10.2: Various applications (use-cases) of the StyleGAN models, which are a
type of GANs. This image was taken from paper [183].
“critic” that evaluates the authenticity of the generated samples.
The training process resembles an adversarial two-player game between these
two models. The objective of 𝐺 is to maximize its score as evaluated by 𝐷, whereas
the goal of 𝐷 is to correctly distinguish between samples from the domain (real)
and samples generated by 𝐺 (“fake”), assigning them values of 1 and 0, respectively.
The game can be expressed in min-max terms as:
\min_G \max_D V(D, G) \quad (10.1)

V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad (10.2)
- where x denotes real data samples drawn from the real data distribution p_{\text{data}}(x),
z denotes noise samples drawn from a prior distribution p_z(z), and D(x) and G(z)
represent the outputs of the discriminator (a probability) and the generator (a sample)
networks, respectively.
Both the generator and discriminator networks are trained simultaneously. Typically,
around half of the samples passed to the discriminator come from the real (in-distribution)
data, and the other half come from the generator.
The procedure of the training algorithm is presented in Listing 8. The gradient-based
updates can use any standard gradient-based learning rule. In [94], Goodfellow
et al. used momentum in their experiments.
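A minimal PyTorch sketch of this adversarial loop is shown below. The tiny fully-connected generator and discriminator, the data dimensionality, and all hyperparameters are illustrative assumptions and do not reproduce the experiments from [94].

```python
import torch
import torch.nn as nn

data_dim, noise_dim, m, k = 784, 100, 64, 1   # toy sizes, not values from the paper

G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.SGD(G.parameters(), lr=1e-3, momentum=0.9)   # momentum, as in [94]
opt_d = torch.optim.SGD(D.parameters(), lr=1e-3, momentum=0.9)

def sample_real(batch):
    # Stand-in for sampling from the real data distribution p_data(x).
    return torch.randn(batch, data_dim)

for iteration in range(1000):
    for _ in range(k):                          # k discriminator steps, as in Algorithm 8
        x, z = sample_real(m), torch.randn(m, noise_dim)
        # Ascend log D(x) + log(1 - D(G(z))) by minimizing its negative.
        loss_d = -(torch.log(D(x) + 1e-8).mean()
                   + torch.log(1 - D(G(z).detach()) + 1e-8).mean())
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    z = torch.randn(m, noise_dim)
    # One generator step: descend log(1 - D(G(z))).
    loss_g = torch.log(1 - D(G(z)) + 1e-8).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```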
Figure 10.1 presents example images generated by GANs from the original 2014
paper by Goodfellow et al. [94]. Additionally, in Fig. 10.2, a newer showcase (as of
2024) of GAN capabilities is presented. The original source can be found in the
paper titled “Face Generation and Editing with StyleGAN: A Survey” by Melnik et
al. [183]. Readers inclined to explore this topic further are encouraged to consult
the article entitled “A Review on Generative Adversarial Networks: Algorithms,
Theory, and Applications” [97].
10.2 Further Advancements in Neural Network Models (2014-
2015)
10.2.1 Adam Optimizer
Adam stands for Adaptive Moment Estimation. It is a stochastic gradient-based
optimization technique, which was introduced by Diederik P. Kingma and Jimmy Lei
Ba in 2014 in a paper titled “Adam: A Method for Stochastic Optimization” [144].
Although it can be applied to any objective function, it is particularly suitable
for training neural networks. Adam has become one of the most commonly used
algorithms to train large neural networks. In many machine learning libraries, it
serves as the default optimizer. It is a very generic contribution to the field, with
potential applications not limited to any specific tasks.
The idea is to estimate the first and second moments of the gradients of the
objective function with respect to the parameters. Based on these estimates, the
method introduces adaptive learning rates. The method’s pseudocode is presented
in Algorithm Listing 9.
Adam combines the strong aspects of two previously introduced methods:
• AdaGrad, which is suitable for scenarios where gradients are sparse.
• RMSProp, which is good for online and non-stationary problems.
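To make the adaptive update more concrete, the following numpy sketch, which assumes a toy quadratic objective, illustrates a single Adam step in the spirit of the pseudocode presented in Algorithm Listing 9. All variable names and hyperparameter values are illustrative.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, adaptive step."""
    m = beta1 * m + (1 - beta1) * grad            # biased first moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad     # biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                  # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([5.0, -3.0])
m = v = np.zeros_like(theta)
for t in range(1, 1001):                          # minimize f(theta) = ||theta||^2
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```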
10.2.2 Sequence to Sequence Learning
“Sequence to Sequence Learning with Neural Networks” is a seminal article authored
by three Google researchers - Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, presented
at the NIPS conference in 2014 [289]. The authors proposed a method for training
neural networks capable of processing sequences of arbitrary length as input and
producing sequences of varying lengths as output, a task depicted in Figure 10.3.
This was groundbreaking research considering that conventional neural networks
Algorithm 9: Pseudocode of Adam as listed in [144].
Require: α (Stepsize);
Require: β1, β2 ∈ [0, 1) (Exp. decay rates for the moment estimates);
Require: f(θ) (Stochastic objective function with parameters θ);
Require: θ0 (Initial parameter vector);
m0 ← 0 (Initialize 1st moment vector);
v0 ← 0 (Initialize 2nd moment vector);
t ← 0 (Initialize timestep);
while θt not converged do
    t ← t + 1;
    gt ← ∇θ ft(θt−1) (Get gradients w.r.t. stochastic objective at timestep t);
    mt ← β1 · mt−1 + (1 − β1) · gt (Update biased first moment estimate);
    vt ← β2 · vt−1 + (1 − β2) · gt ⊙ gt (Update biased second raw moment estimate);
    m̂t ← mt / (1 − β1^t) (Compute bias-corrected first moment estimate);
    v̂t ← vt / (1 − β2^t) (Compute bias-corrected second raw moment estimate);
    θt ← θt−1 − α · m̂t / (√v̂t + ε) (Update parameters);
return θt (Resulting parameters);
struggled with such tasks, often being confined to non-sequential data or sequences
of fixed sizes, or requiring extensive manual feature engineering. Please notice that
we are already in the deep learning era, where the underpinning idea is to avoid
extensive feature engineering and let the models learn internal representations on
their own.
The core idea of Sequence to Sequence learning can be summarized as follows:
• The input sequence is processed by a large encoding LSTM network in reverse
order.
• To differentiate between sequences in the training dataset, a special delimiter
<EOS> was proposed.
• Each element of the input sequence is processed sequentially, with each
consecutive word representing the next time step.
• The output of the encoding LSTM is a fixed-size vector representation known
as the context vector, denoted as 𝑉.
• The vector 𝑉 is passed to the second large decoding LSTM, which produces
the output.
Figure 10.3: One sequence of any length (e.g., a sentence) is converted to another
sequence that may be of a different length.
In their experiments, the authors chose the problem of machine translation
between English and French. They trained a family of models with various
parameters on a text corpus comprising 348 million French words and 304 million
English words. The authors did not reveal how many unique words were among
them, but the sizes of dictionaries were 160,000 and 80,000 for English and French,
respectively. Both LSTM sub-networks had 4 layers with 1000 cells per layer.
The size of the word embeddings was equal to 1000. The resulting model had
380 million parameters in total. Impressively, the Seq2Seq model (as it is
called) outperformed the Statistical Machine Translation (SMT) system, which was
considered the standard approach at the time.
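The following PyTorch sketch illustrates the encoder-decoder idea on a toy scale: one LSTM encodes the reversed input into a fixed-size context, and a second LSTM decodes it. The vocabulary sizes, embedding and hidden dimensions are illustrative assumptions, far smaller than the 4-layer, 1000-cell networks of [289].

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """A miniature encoder-decoder in the spirit of Seq2Seq, for illustration only."""
    def __init__(self, src_vocab, tgt_vocab, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Reverse the source sequence, as proposed in the paper.
        reversed_src = torch.flip(src_ids, dims=[1])
        _, context = self.encoder(self.src_emb(reversed_src))   # fixed-size context (h, c)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_out)                                 # scores over target words

# model = TinySeq2Seq(src_vocab=1000, tgt_vocab=1200)
# logits = model(torch.randint(0, 1000, (8, 15)), torch.randint(0, 1200, (8, 12)))
```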
Subsequent research demonstrated the effectiveness of the Seq2Seq model across
a spectrum of tasks [339], including but not limited to machine translation, text
summarization, automated writing, and speech recognition. Its enduring impact
on natural language processing and sequence modeling underscores its status as a
seminal contribution in the field of AI.
10.2.3 Residual Networks
ResNet stands for Residual Network, which is an architecture introduced in a seminal
paper titled “Deep Residual Learning for Image Recognition" [114] authored by K.
He, X. Zhang, S. Ren, and J. Sun. Published first in 2015 as a preprint and in 2016 in
the proceedings of the Conference on Computer Vision and Pattern Recognition, it
is one of the most highly cited papers in the entire field of artificial intelligence.
Since the advent of the deep learning revolution sparked by models like AlexNet,
increasing the depth of neural networks appeared to enhance their accuracy, provided
there was sufficient training data available. However, as the number of layers reached
very high values (e.g., 30-50), researchers began to observe a degradation in
accuracy and an increase in errors. The effect resembled overfitting but was not
caused by it; unlike overfitting, which yields a low training error but a high test
error, in this case both errors were rising. This phenomenon has been
documented in numerous studies including [113, 280, 114] and confirmed in [114].
It was an intrinsic behavior observed during training, irrespective of the volume of
available training data.
The residual learning framework was proposed to tackle this problem of performance
degradation when training large models. The fundamental concept involves the
introduction of residual blocks and residual connections. Residual connections are
also called skip connections or shortcuts.
An illustration of residual blocks is presented in Figure 10.4. Let H (𝑥) denote
a function realized by a certain number of stacked layers in a network, where 𝑥
is the input. It can be the entire network or a subset of it. Residual blocks are
designed to learn F (𝑥) = H (𝑥) − 𝑥 instead, and the original input is directly added
to the output through a residual connection, effectively creating an identity mapping.
These shortcuts and residual blocks do not have to be of the same fixed length. The
entire network may contain many different ones.
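A minimal PyTorch sketch of such a block is given below: two weight layers compute F(x), and the input x is added back through the shortcut so that the block realizes H(x) = F(x) + x. The channel counts and the exact layer composition are illustrative assumptions rather than the blocks used in [114].

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: two weight layers plus an identity shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = x                      # shortcut (identity mapping)
        out = self.relu(self.conv1(x))    # first weight layer + ReLU
        out = self.conv2(out)             # second weight layer computes F(x)
        return self.relu(out + residual)  # add the input back, then activate

# y = ResidualBlock(64)(torch.zeros(1, 64, 32, 32))   # output has the same shape as the input
```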
Figure 10.4: A standard residual block. In general, the number of layers in the
block can be larger. The input signal is both transformed by layers with weights and
propagated through a shortcut.
Residual connections offer benefits by preserving the input signal, enabling
the blocks to learn differences, and addressing the so-called shattered gradients problem.
The authors demonstrated, for instance on the CIFAR-10 dataset, that they facilitate
effective training of neural networks comprising 100 or even 1000 layers.
In summary, residual networks (ResNets) represented a groundbreaking advance-
ment in image recognition following AlexNet. They emerged as dominant forces
in the ImageNet competition of 2015, displacing architectures akin to AlexNet.
Nonetheless, the implications of residual learning extend beyond computer vision,
as they have been adapted for various deep learning applications employing models
with large numbers of layers.
10.3 The Rise of Explainable AI (2015+)
In the 21st century, artificial intelligence has become more prevalent in various
sectors such as business, industry and entertainment, profoundly influencing our
daily lives. To name just a few examples: personal assistants, automatic translation,
personalized recommendations, image and text generation, video games, credit
assessment, detection, sales prediction, automated large-scale planning, robotics,
drug discovery, and medical imaging. The outputs produced by AI-driven products
have become more consequential to society. AI is no longer limited to research
projects or one-time deployments by large companies. Its growing popularity and an
increasing number of applications in critical domains such as medicine, autonomous
vehicles or finance, necessitated the emergence of the explainable AI (XAI) and
interpretable AI paradigms. Previously, the primary emphasis was on the efficacy of
AI methodologies.
A shift in how XAI is viewed started to occur around 2015. It has become
an active and formally grounded research area. Notably, in 2016, prominent machine
learning conferences hosted dedicated workshops on interpretable AI, reflecting the
growing interest in this field. For instance, NIPS 2016 featured the “Interpretable
Machine Learning for Complex Systems Workshop”, while ICML 2016 included the
“Human Interpretability in Machine Learning” workshop, among others. Nowadays,
as of 2024, with regulatory initiatives such as the EU AI Act being proposed, there
appears to be even more recognition of the importance of comprehending and
managing AI systems to establish trust.
While XAI can be applied to various techniques, it predominantly focuses on
machine learning models, which often feature complex internal structures. Since
these models are trained on data rather than being handcrafted by experts, their
connection to familiar concepts may be obscured by an additional layer of abstraction.
The debate around XAI gained momentum with the rise of deep neural networks
(DNNs), which became increasingly popular and widespread in their applications.
DNNs pose particular challenges for interpretability due to their abstract nature
consisting of nodes, weights, and connections as well as the sheer size - sometimes
even going into billions of internal parameters. In contrast, there exist AI approaches,
which are relatively interpretable by design such as decision trees, fuzzy systems or
logical reasoning. However, these approaches have seen less popularity in recent
years compared to deep learning models.
The question is: what constitutes explainable or interpretable AI? There are
numerous facets to this problem:
1. Understanding how models operate.
This is the primary objective of XAI. In particular, it is critical to gain insights
into how black-box models operate. This understanding allows the users to
trust the model, especially in unknown scenarios, i.e., with the presence of data
outside the training set. In addition, deep understanding of a model enables us
to assess its relevance on a human level beyond just accuracy measures. It
also helps to find weak spots and improve the model when needed.
2. Interpreting their outputs.
Even if the inner workings of a model cannot be analyzed, it is critical to be
able to correctly interpret its outputs. It must be clear what exactly the values
returned by the model stand for and whether they need to be processed further
to obtain meaningful concepts. Interpretation of the output, in conjunction with
understanding how the model operates, also allows us to justify the decisions
made by it. For instance, when a person is denied a loan based on a credit
assessment score produced by an AI, the bank’s employee may additionally
provide a justification. This is expected in a human-centric environment.
3. Understanding the uncertainty.
Another important aspect is understanding the uncertainty associated with
a specific output. For example, consider a model for cancer detection in a
medical image. It is crucial to know the probability of an error of such
detection. It is often possible to estimate a general error rate of a model during
training. However, there are three important issues to be aware of.
(a) The general error rate is not the same as an error estimation for a particular
prediction (instance).
(b) The error rate might be different on the training data and when the model
is actually deployed in the real-world environment.
(c) With third-party models or models embedded into bigger systems, we
might not have access to anything more than just the outputs (such as
errors and a summary of the performance).
4. Diagnostics.
Different aspects of diagnostics are also a part of XAI research.
The monitoring of the learning process ensures that the chosen evaluation
methodology is correct. There are many potential pitfalls, one notable being
data leaks. Data leaks result in a very good performance during the creation of
the model due to (typically indirect and not easily spotted by humans) access
to data it should not have access to. But when such a model is deployed in a
production environment, it performs poorly as it cannot rely on the data leak
anymore.
Another component of diagnostics is monitoring whether the model remains
relevant. The two main reasons causing a model to become less relevant
are data drift and concept drift. In the former, the statistical
properties of the data used to train the model have changed compared to the
situation in which the model is currently used. In concept drift, the nature
of the problem changes, e.g. new patterns, optimal strategies, objectives or
trends emerge, which were not captured in the training process. Understanding
models’ inefficiency is critical in developing trust in them.
5. Robustness and reliability.
These include aspects such as how ML models handle edge cases or how
resistant they are to adversarial attacks.
Edge cases might require deep understanding of the models and possibly
external solutions dedicated to them. It is not always possible to have edge
cases in the training data, therefore, additional analysis of the models is
required.
With the rise in popularity of ML models, there has also been an emergence
of adversarial attacks on them. These can take different forms. For instance,
computer vision models can be tricked by placing specially designed stickers
that influence their predictions. Another type of attack involves interference
with the training process if a system learns from real-time data (as in the
life-long learning paradigm). In that case, users may feed the model with
synthetic data to make it biased.
Machine learning models and XAI techniques can be categorized into two main
groups:
1. White-box models - where the structure and the procedure are accessible,
transparent, and can be analyzed by humans.
2. Black-box models - which lack transparency, with only the input and output
accessible. While they can be analyzed based on the output they produce,
delving into their internal structure is not possible.
XAI techniques can further be classified as local or global, depending on whether
they provide explanations at an individual instance level (in a particular application)
or across the entire model (how the model works).
Finally, XAI methods can be model-specific or model-agnostic. Examples of
model-agnostic methods are:
• LIME (Local Interpretable Model-agnostic Explanations). Designed in
2016 [239] for explaining local instances. For example, LIME for image
detection might show the subset of the image responsible for the prediction.
• SHAP (SHapley Additive exPlanations) [197] - a method originating from
cooperative game theory. It is one of the state-of-the-art methods to calculate
feature importance without the use of the internal structure of the model.
However, it is a computationally intensive method.
• Model-agnostic counterfactual analysis - a “what-if” simulation of what
result the model will output in certain scenarios [2].
• Cross-validation - a simple technique for model evaluation during training
to reduce the risk of overfitting. It concerns training and evaluating the model
on various subsets (splits) of the data.
• Surrogate models - when the original model is a black-box model or there is
no access to its structure, a simpler but interpretable model can be trained to
reconstruct the behavior of the original one. In such a case, the output of the
explained model acts as labels for the surrogate model (see the sketch after
this list).
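The following scikit-learn sketch illustrates the surrogate-model idea: a shallow decision tree is trained on the predictions of a black-box model and can then be printed as human-readable rules. The random forest, the synthetic data, and the tree depth are illustrative assumptions; any opaque classifier could play the black-box role.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data and a random forest standing in for "some black-box model".
X = np.random.rand(1000, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
black_box = RandomForestClassifier(n_estimators=100).fit(X, y)

# The black-box predictions become the labels for the interpretable surrogate.
surrogate = DecisionTreeClassifier(max_depth=3)
surrogate.fit(X, black_box.predict(X))

print(export_text(surrogate))   # human-readable rules approximating the black box
```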
Examples of model-specific methods are:
• Visualization of random forests - random forests are composed of many
decision trees, which are inherently interpretable models. A decision tree
can be visualized to look like a rule-based diagram, possibly with additional
statistics about how confident particular splits are, how many samples were
present during training, etc. Visualization of a random forest is trickier. The
most common method is to create a so-called consensus tree that aggregates
(with some approximation loss) the rules from the individual decision trees.
• Model-specific feature importance - many models, such as random forest,
XGBoost, or linear models, have inherent mechanisms for calculating feature
importance.
• Attention mechanisms in RNNs - attention weights can be visualized to
understand which parts of the input sequences are most relevant.
• Integrated gradients [288] - an advanced method to attribute the output of a
neural network to its input features.
• Rule extraction from SVMs - rules can be extracted from the decision
boundaries learned by an SVM during training, providing interpretable
explanations for the model’s predictions.
10.4 Heads-up Limit Hold’em Poker is Solved (2015)
10.4.1 Introduction
Throughout the constant progress in game AI research, there are three special
milestones to reach in each game. Initially, the goal is to develop strong (compe-
tent) computer players. Next, the long-term objective is to achieve super-human
performance, i.e., creating programs capable of defeating top human players. Finally,
the pursuit is for solving the game, which can often be beyond the reach of current
computer technology. As of 2024, we have already solved checkers (cf. Section 9.1)
and achieved super-human performance in chess (cf. Section 9.3).
Figure 10.5: Royal flush - the strongest 5-card hand a player can have in Texas Hold’em
and many other variants of poker. There are four royal flushes possible (one per suit),
which are tied in terms of strength. This image is free to copy, publish, and distribute
under the OGL v1.0 license https://www.nationalarchives.gov.uk/doc/
open-government-licence/version/1/open-government-licence.htm.
This section is dedicated to poker (see Fig. 10.5), which is a quintessential game
with hidden information also known as imperfect information. Another example
is Bridge. Most card games are imperfect information ones. In such games, some
aspect of the game state is hidden to the players when they take actions. The most
popular variant of poker is Texas Hold’em, in which each player is dealt 2 private cards
(the hand). There are up to four betting phases: pre-flop, flop, turn, and river. The flop
reveals 3 community cards, which everybody can see. The turn and river
stages each reveal one additional card. Players may fold, check, call, or raise by a
specified amount. Two players are forced to start with the initial bets of the small blind
(SB) and big blind (BB), respectively. The value of the big blind often serves as a
reference to how much money players have, e.g., 50BB (50 big blinds). A player
wins either by having the best five-card hand or when all other players fold. The
best hand is determined using both the given player’s private cards and community
cards. The hidden information in poker consists of cards in other players’ hands as
well as 5, 2, and 1 community cards concealed in the preflop, flop and turn phases,
respectively.
Imperfect information games are very interesting to work with because in real life,
we are also limited in terms of information we possess about problems. In addition,
they are typically more difficult to implement solutions for than perfect information
games due to a few reasons:
• Although there is always an underlying single state of the game, the players do
not have access to it and instead take actions considering a whole family of
possible states, which increases the effective computational complexity. Such
states are referred to as information sets, which group states indistinguishable
with respect to the available information.
• Optimal strategies may involve actions which turn out to be non-optimal in
hindsight (when new evidence is present). For example, in a given information
set, a particular action can be winning 90% of the time and losing 10% of the
time. It is always correct to play it, although sometimes it will turn out to be
non-optimal. This is not the case for perfect information games.
• Many of those games, including poker, involve elements of bluffing and
deception. At high levels, it is very important to properly include it in one’s
own game as well as to respond to it.
• Optimal strategies might be randomized in order to reduce the chance of
them being exploited.
• Imperfect information games are usually also stochastic, containing random
events such as card shuffling or a dice roll. Each probabilistic outcome is a
separate path in the game tree.
• Often, there is a need for a separate module dedicated to reasoning under
uncertainty and inferring missing information.
10.4.2 The Process of Solving The Game
In 2015, researchers affiliated with the University of Alberta weakly solved Heads-
Up Limit Texas Hold’em (HULHE) poker variant. Heads-Up is a synonym for a
two-player variant of the game, whereas Limit refers to the amount of money players
may bet. In No-Limit, they can bet the whole amount they currently have. In Limit,
they are restricted to fixed amounts, e.g., two big blinds.
This achievement was reported in two articles (first in Science and then reprinted in
Communications of the ACM) titled “Heads-Up Limit Hold’em Poker Is Solved” [23,
22] by Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. This
was the first time an imperfect-information game of such complexity was solved. The complete
game tree has 3.19 ∗ 10^14 nodes when symmetric situations are counted separately;
exploiting symmetries reduces this number by about one order of magnitude. Figure 10.6
presents the most complex hidden information games solved up to 2014.
Figure 10.6: A timeline of the most complex imperfect-information games solved up
to Heads-Up Limit Hold’em poker. The 𝑌 axis represents the state-space complexity.
The 𝑋 axis represents the main technique used to solve the game. The figure has
been reproduced from the “Heads-up Limit Hold’em Poker is Solved” article [23].
What made this achievement influential was also the fact that poker is a hugely
popular game, in which money is an inherent part, and thus any AI-related ad-
vancements in it are carefully monitored. It has also been important from the AI
research perspective as John von Neumann stated that it was one of the motivations
for developing game theory [316].
For solving the HULHE variant of poker, the team utilized 200 machines with
24 CPU cores each, totaling 4800 cores. The game tree was partitioned, starting
from the root (pre-flop phase), into 110,565 subgames. Those subgames were
assigned to particular machines. The computations took 68.5 days, which was
equivalent to 900 core-years. The researchers developed a dedicated compression
algorithm that allowed for using only 11 terabytes of hard disk storage for this task.
A straightforward non-compressed implementation would require 262 terabytes of
storage.
There are a few techniques for solving imperfect information games. The earliest
ones are based on linear programming. They are exact but require an extensive
amount of computation (typically infeasibly large). Bowling et al. employed a
variant of a method that is also a backbone of the strongest poker playing programs
- Counterfactual Regret Minimization+ (CFR+). Regret denotes the cost paid for
not choosing the optimal strategy (known in hindsight). It has been proven that
iterative minimization of regret converges to the optimal play from a game theoretic
perspective [108]. Despite the title, which reads “Heads-Up Limit Hold’em Poker
Is Solved”, the authors of [23] clarify in the paper that the game has been
“essentially weakly solved”. Here, “essentially” means that the approximation is so
good that a human lifetime of play is not enough to distinguish it from the exact
solution with statistical confidence.
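To give a flavor of how regret-based methods work, the following numpy sketch implements plain regret matching, the elementary rule that CFR-style algorithms (including the CFR+ variant used in [23]) apply at every information set: actions are played with probability proportional to their accumulated positive regret. It is not the full CFR+ procedure, and the toy three-action payoffs are illustrative assumptions.

```python
import numpy as np

num_actions = 3
cumulative_regret = np.zeros(num_actions)

def current_strategy():
    """Play each action in proportion to its positive cumulative regret."""
    positive = np.maximum(cumulative_regret, 0.0)
    if positive.sum() == 0.0:
        return np.full(num_actions, 1.0 / num_actions)   # uniform if no positive regret
    return positive / positive.sum()

for _ in range(10_000):
    strategy = current_strategy()
    action_values = np.random.randn(num_actions)          # toy payoffs for this iteration
    expected = strategy @ action_values
    # Regret of each action = what it would have earned minus what we earned.
    cumulative_regret += action_values - expected

print(current_strategy())
```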
10.4.3 Note on Poker-Playing Programs
The advancements in poker AI have continued. In 2006, the Annual Computer Poker
Competition was first organized. It was held annually until 2018. Since 2010, it has
included the No-Limit variant of Texas Hold’em with unrestricted bets.
Many of the strongest poker-playing computer programs are based on the
CFR+ algorithm. Examples include Supremus [342] (introduced in 2020), which
is an upscaled implementation of a previously groundbreaking program named
DeepStack [200] (introduced in 2017), and Pluribus [28] (2019), which is an
improved version of Libratus [27] (2017).
Let’s focus on the Supremus-DeepStack approach. They employ a deep neural
network called Deep Counterfactual Value Network (DCVN) to learn the counter-
factual values for their respective game states. To speed up the calculation process,
an optimization technique called bucketing is used. Buckets group together hands
that are functionally similar. The counterfactual values are effectively defined by
a multiplication of the expected values for players in a given state (bucket) by the
probabilities of the states (buckets). This is combined with a depth-limited search
to traverse the tree and propagate values from leaf nodes to the root as well as
a technique called continual resolving for updating the policy during the search.
Both Supremus and DeepStack use separate DCVNs for different phases of Texas
Hold’em. Apart from the pre-flop network, all of them consist of seven layers. The
Supremus’ river, turn, flop, and preflop networks were trained using 50 million, 20
million, 5 million, and 10 million scenarios, respectively.
Pluribus, developed at Carnegie Mellon University, learns the core strategy
(referred to as the blueprint strategy) via a large number of games played against a
copy of itself. Then it improves the blueprint strategy in real-time during games
against actual opponents based on the current situation. It utilizes various abstraction
techniques to reduce the dimensionality of the problem, e.g., grouping similar states
(from the perspective of the players) together. Like Supremus and DeepStack, it
is based on CFR but instead of a DCVN, it utilizes Monte Carlo simulations to
approximate counterfactual values. Pluribus performs a depth-limited search with
a large number of modifications tailored for the game. It also utilizes continual
resolving.
A program named ReBeL [29], presented in 2020, serves as an example of a
different approach. It is based on a general reinforcement learning framework for
two-player zero-sum games. ReBeL introduces a novel representation for poker
based on belief states. Similarly to the CFR approach, it can be used to compute
Nash equilibrium and, subsequently, solve the game (although it has not happened
yet due to the computational complexity of the task for the general No-Limit variant).
The ReBeL approach has been inspired by the AlphaZero approach, which will be
presented in Section 10.8.
Poker AI programs have been tested against humans and were able to defeat elite
human players. Pluribus participated in various experiments in six-player games of
No-Limit Texas Hold’em. The results have been reported in 2019 in the Science
magazine [20]. For instance, in a 5-Human plus 1-AI experiment, it came first with
an average win of 48 milli big blinds per game. In a 1-Human plus 5-AI (five copies
of Pluribus) experiment, it outperformed the human opponent by an average of 32
milli big blinds per game. However, in a game with hidden information and bluffing,
it is difficult to say whether top human players could not eventually adapt to the bots
and win back. Nevertheless, poker-playing programs made substantial progress around
the year 2017. It is also worth noting that the use of bots is strictly prohibited in
all poker rooms. Especially in a playing format named sit-and-go, where players
all poker-rooms. Especially in a playing format named sit-and-go, where players
join tables, play for a while for a cash prize and then switch tables to play against
different opponents, computer bots prove to be massively profitable. It is interesting
to analyze the reasons behind this phenomenon. One reason might be that it is not
possible to exploit the strategy utilized by bots if they often switch tables. Another
reason might be related to human psychology. Computer players are not vulnerable
to the tilt effect in poker, which refers to a state of emotional or mental frustration,
often resulting from a series of bad beats, losses, or other negative experiences at the
poker table. When a player is on tilt, they may experience heightened emotions such
as anger, frustration, or disappointment, which can impair their decision-making
abilities.
10.5 Waymo Autonomous Car (2015 breakthrough)
The Google Self-Driving Car Project began in 2009. However, it was 2015 that
marked a significant leap in self-driving technology with the first fully autonomous
drive. It was conducted on the streets of Austin, Texas and the passenger was Steve
Mahan, former CEO of the Santa Clara Valley Blind Center. The vehicle closely
resembled the one depicted in Figure 10.7. The experiment concluded with complete
success.
After reaching this milestone, the Google Self-Driving Project transitioned into
an independent company called Waymo LLC. In 2017, the company initiated the
“Early Rider Program” in Arizona’s Metro Phoenix area, enabling public trials of
their technology. By 2019, Waymo extended regular rides to select customers, and
by 2020, these services were available to all riders in the Metro Phoenix area. From
2021 to 2023, Waymo provided limited services, restricted to specific individuals
and routes, in San Francisco and Los Angeles, respectively. The services have been
generally well received. In 2023, the company shared extensive data regarding
incidents, crashes, and general safety. For instance, data presented in [153] indi-
cates that the crash rate of Waymo vehicles is 6.8 times lower than that for human
drivers for incidents involving any reported injury, whereas it is 2.3 times lower for
all police-reported crashes. These figures represent aggregated data for all three
locations: Phoenix, San Francisco, and Los Angeles. Interestingly, crash rates were
lower in Phoenix and San Francisco compared to Los Angeles.
Given Waymo’s status as a private commercial company, many details of their
technology are not publicly disclosed. From [277], we can learn that it utilizes various
sensors such as multiple LIDAR units (mid-range and long-range), radar, cameras,
GPS and audio sensors (e.g., to detect sirens). The same article also mentions
Figure 10.7: Waymo autonomous vehicle under the previous Google name.
This picture has been publicly released under CC BY-SA 4.0 license:
https://creativecommons.org/licenses/by-sa/4.0/.
three major software modules: Perception, Behavior Prediction and Planner. The
behavior prediction includes avoiding collisions, completing the trip and respecting
the rules of the road as we can read in the “Waymo’s safety methodologies and safety
readiness determinations” [322]. Mapping algorithms, including a deep learning
model based on graph neural networks (see Sec. 8.7), called VectorNet, generate
detailed maps of the environment, including the information about road geometry,
lane markings, traffic signs, and landmarks. These maps serve as a reference for
the vehicle’s navigation system, enabling it to plan safe and efficient routes. The
performance of hardware and software components has been tested both through
extensive simulations and more than 40 million miles driven on public roads as of
2024.
Waymo issued a statement that it is interested only in self-driving technology
and not in manufacturing its own cars. In pursuit of this objective, the company has
established partnerships with several automakers, including Fiat-Chrysler, Toyota,
Audi, and Jaguar [51]. Despite Waymo’s breakthroughs in bringing their vehicles
onto public roads, competitive pressures have intensified. On one front, entities such
as Uber, Microsoft, and Apple have launched their research programs. On another
front, automotive manufacturers like Mercedes-Benz, Tesla, and Volvo have also
entered the autonomous vehicle race.
10.6 Extreme Gradient Boosting (2015)
XGBoost, which is short for eXtreme Gradient Boosting, represents a powerful
machine learning algorithm that can often rival and even outperform deep learning
models. It has been particularly suitable for tabular data and has served as the
backbone for many competition-winning approaches, such as those on the Kaggle
platform (see Section 9.5). In 2015, 17 of the 29 Kaggle competition winners used
XGBoost [42]. This method was also employed by all of the top 10 teams in the 2015
edition of the KDDCup. Since then, it has continued to be a frequent winner and a
popular first method to try on an unknown problem. It can be used
both for classification and regression problems.
The XGBoost method was introduced in 2015. The seminal article authored
by Tianqi Chen and Carlos Guestrin, titled "XGBoost: A Scalable Tree Boosting
System" [42], was published in the Proceedings of the SIGKDD conference.
In summary, the method combines four elements:
1. Gradient boosting.
2. Tree representations (for the learners in the ensemble).
3. Regularization.
4. A lot of tricks and optimizations.
The concept of boosting was introduced in Section 7.10. It is an ensemble
method that combines multiple weak learners into a strong learner. Boosting utilizes
the entire dataset and iteratively improves the solution by focusing on the instances
which led to the highest errors in previous iterations.
Figure 10.8: A simple example of ensemble of two classification and regression
trees. The model’s prediction is the sum of predictions from each tree.
In XGBoost, each weak learner is a Classification And Regression Tree (CART)
such as the ones shown in Figure 10.8. In such trees, each leaf contains a numeric score,
which is the prediction (for classification or regression) associated with the decisions
taken on the path from the root to that leaf. Let’s use regression as the example. The
final model is obtained by summing the outputs of 𝐾 trees:
\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i) \quad (10.3)

- where (x_i, y_i), x_i \in \mathbb{R}^m, y_i \in \mathbb{R} are the n training data points, and \phi is the parameterized function representing the model.
The objective can be expressed as minimization of the loss with an additional
regularization term:

\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k) \quad (10.4)
Here, 𝑙 ( 𝑦ˆ𝑖 , 𝑦𝑖 ) represents loss on a single instance, i.e., a function that returns
the cost of choosing 𝑦ˆ𝑖 instead of the ground truth 𝑦𝑖 . The regularization term
is an important element of XGBoost. It penalizes complex models and prevents
overfitting. It is typically set to a sum of 𝐿1 (lasso) and 𝐿2 (ridge) regularization
terms. In the original paper [42], they were defined as shown in Eq. 10.5:
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2 \quad (10.5)
- where 𝑇 is the number of leaves in the tree, whereas 𝑤 is the weight vector
containing all the values in leaves. Sometimes, the 𝐿1 regularization term is defined
as 𝛾 multiplied by the sum of absolute values of the weights.
The method is iterative. It starts with one weak learner (a simple CART) and
each iteration creates a new one. Continuing with the regression example, this initial
model can be constructed to return the average value of the predicted feature in
the dataset. Let 𝑡 denote the 𝑡-th iteration and 𝑓𝑡 denote the 𝑡-th learner. Since the
final model’s output consists of adding the results of all learners, the additive learning
objective can be formulated as:

\mathcal{L}^{(t)}(\phi) = \sum_i l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) \quad (10.6)
As a result, the subsequently created trees are fit to the data so as to predict the errors
made by the previous models. The more closely they predict these errors, the
closer the l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) term gets to zero.
In each iteration the gradient of the loss function is computed. In addition to
the gradient, the method computes differences between the ground truth 𝑦𝑖 and the
predictions 𝑦ˆ𝑖 for each data sample. These differences are referred to as residuals.
For complex loss functions, a second-order approximation (e.g., using the Taylor
expansion) can be used. In the original paper [42], the first- and second-order
gradients are denoted by g and h, respectively:

g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}), \qquad h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}) \quad (10.7)
Let I_j = \{i \mid q(x_i) = j\} denote the instance set of the j-th leaf. This is the set of
training observations that fall into this leaf when propagated from the root according
to the decisions in the subsequent nodes on the path. The optimal value, referred to
as the structure score or similarity score, for the tree is computed by:

obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left( \sum_{i \in I_j} g_i \right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T \quad (10.8)
Now, either an exact-greedy or an approximate algorithm investigates various split
candidates and uses the formula from Equation 10.8 in a similar way to how information
gain (or another impurity score) is used in decision trees. The split that minimizes
the score is applied for the tree constructed in the current iteration. In a practical
implementation, the 𝛾 ∗𝑇 constant as well as the leading negative sign can be omitted
for the sake of this calculation (the minimization is then changed to maximization).
A large number of tricks and optimizations built into XGBoost include, but are not
limited to:
1. Shrinkage technique that scales newly added weights by a learning factor 𝜂
after each step. It reduces the effects of sudden big influences.
2. Optional feature sub-sampling similar to the method used in random forests
(see Section 8.2).
3. Cross-validation, which is inherently included in the training process.
4. Efficient, sparsity-aware handling of missing values. Users of the XGBoost
algorithm do not need to devise additional methods for this problem.
5. Weighted quantile sketch - a novel approximate algorithm based on the non-
weighted variant to determine split points in the trees. This feature becomes
particularly crucial for very large datasets, whereas for smaller ones, an exact
algorithm can be utilized. This is combined with an efficient pruning technique
aimed at simplifying the created trees.
6. Cache-aware access optimized for finding optimal splits for the tree nodes.
This is a strictly computational optimization.
7. Memory optimizations consisting of techniques known as block compression
(for storing large data sets) and block sharding (when data needs to be split
onto multiple hard disks). As a result, datasets containing billions of values can
be used.
8. Optimizations aimed for GPU implementations that take advantage of the
modern GPU hardware features.
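As a usage illustration, the following sketch trains a boosted tree ensemble with the xgboost library on synthetic data, exposing several of the elements discussed above (shrinkage, regularization, and feature sub-sampling). The hyperparameter values and the data are illustrative assumptions, not recommendations from [42].

```python
import numpy as np
from xgboost import XGBRegressor

# Synthetic regression data for illustration only.
X = np.random.rand(500, 10)
y = X[:, 0] * 3.0 + np.sin(X[:, 1]) + np.random.randn(500) * 0.1

model = XGBRegressor(
    n_estimators=200,      # number of trees K added iteratively
    max_depth=4,           # complexity of each weak learner (CART)
    learning_rate=0.1,     # shrinkage factor eta
    subsample=0.8,         # row sub-sampling per tree
    colsample_bytree=0.8,  # feature sub-sampling, as in random forests
    reg_lambda=1.0,        # L2 (ridge) regularization term
    reg_alpha=0.0,         # L1 (lasso) regularization term
)
model.fit(X, y)
print(model.predict(X[:3]))
```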
10.7 Google’s Neural Machine Translation (2016)
Google Translate is an automatic translation service widely used by millions of
people on a daily basis. For many years prior to 2016, Google Translate primarily
relied on statistical machine translation (SMT) techniques. Although functional, this
approach worked better for certain language pairs, such as English and French, but
was essentially unusable for others, such as Japanese and Hindi.
Around 2014, Google began investigating neural network-based approaches to
address the challenges of automatic translation. The breakthrough came in 2016
with the maturation and official deployment of Google’s Neural Machine Translation
(GNMT) system. The details of the GNMT
approach were documented in a paper titled “Google’s Neural Machine Translation
System: Bridging the Gap between Human and Machine Translation” [329] authored
by Yonghui Wu et al.
In this paper, the authors outline the problems associated with standard NMT
approaches, which the GNMT approach aimed to address:
1. NMT requires a significant amount of data and computational resources to
train.
2. Inference in NMT systems is slow.
3. NMT systems struggle with translating rare words.
4. They sometimes produce translations that omit certain parts of the original
text.
GNMT is an end-to-end approach in the sense that the entire translation process
is learned jointly by a single neural network model. The architecture of this model
is presented in Figure 10.9. It follows an autoencoder structure (for autoencoders,
please refer to Section 6.11) with the bottleneck layer in the form of an attention
module. The attention mechanism will be described in Section 11.1.
Both the encoder and decoder consist of 8 LSTM layers. The authors introduced
residual connections (see Section 10.2.3) to the LSTM layers. They stated that
without them, the decoder and encoder networks scaled only up to 4 LSTM layers.
Both networks were parallelized for GPU units, so each layer is essentially handled
by a separate GPU.
The encoder component is responsible for processing a sequence of words into a
list of fixed-size vectors (one for each input symbol). The decoder decodes the list of
vectors and emits one symbol at a time. It calculates the probability of the candidate
output symbols using a softmax output layer.
One of the novel elements of the approach was the wordpiece model, which was
initially proposed to solve the Japanese/Korean segmentation problem for Google’s
speech recognition system. In this approach, text sequences are partitioned into
wordpieces. The training sentences are converted to wordpieces, the model is trained
using wordpieces, and then the output is converted back to original words. The
authors showed an example of the sentence: “Jet makers feud over seat width with
big orders at stake,” which partitions into the following wordpieces: “-J et -makers
-fe ud -over -seat -width -with -big -orders -at -stake”. Additional symbols were
included to ensure that the words are decoded without ambiguity. One of the special
Figure 10.9: The architecture of the GNMT neural network model. This figure was
drawn by the author based on Figure 1 from [329]. Two encoder layers are processed
by the same GPU.
symbols represents unknown characters. Many wordpieces, such as entity names or
numbers, are copied without translation (they are shared among languages). This
addresses the problem of handling rare names in translations.
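To make the idea more tangible, the following Python sketch performs a greedy longest-match segmentation against a tiny, hand-made wordpiece vocabulary. Both the vocabulary and the “-” boundary marker are assumptions made only for this illustration; the real wordpiece model learns its vocabulary from data and is not limited to greedy matching.

```python
# A toy greedy longest-match segmentation into wordpieces.
# The tiny vocabulary below is hypothetical; real systems learn
# tens of thousands of wordpieces from the training corpus.
VOCAB = {"-j", "et", "-makers", "-fe", "ud", "-over", "-seat"}

def to_wordpieces(sentence):
    pieces = []
    for word in sentence.lower().split():
        token = "-" + word          # mark the word boundary
        start = 0
        while start < len(token):
            # take the longest vocabulary entry matching at this position
            end = len(token)
            while end > start and token[start:end] not in VOCAB:
                end -= 1
            if end == start:        # unknown fragment -> special symbol
                pieces.append("<unk>")
                start += 1
            else:
                pieces.append(token[start:end])
                start = end
    return pieces

print(to_wordpieces("Jet makers feud over seat"))
# ['-j', 'et', '-makers', '-fe', 'ud', '-over', '-seat']
```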
The wordpiece model is used in combination with the mixed word/character
model. Words are tokenized and converted into sequences of constituent characters.
The characters are equipped with special meta-data prefixes such as <B>, <M>, and
<E>, which denote the beginning, middle, or end of the word, respectively, in
which they appear. This helps train the model to predict individual characters rather
than entire words. The authors also use beam search in the decoding phase to
generate several (8-12) hypotheses about the output sequences (i.e., longer sequences
consisting of many symbols). The goal is to output a sequence 𝑌 that maximizes the
following scoring function 𝑠:
$$s(Y, X) = \frac{\log(P(Y|X))}{lp(Y)} + cp(X; Y) \quad (10.9)$$

where

$$lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}} \quad (10.10)$$

and

$$cp(X; Y) = \beta \sum_{i=1}^{|X|} \log\left(\min\left(\sum_{j=1}^{|Y|} p_{i,j}, \, 1.0\right)\right) \quad (10.11)$$

where $p_{i,j}$ is the attention probability corresponding to the $j$-th translated word and
the $i$-th source word; $\alpha$ and $\beta$ balance the length normalization and the coverage penalty.
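As an illustration, the scoring function of Eqs. 10.9–10.11 can be computed directly from a candidate's log-probability and its attention matrix. The Python sketch below uses placeholder values of α and β; in practice these are tuned on a development set.

```python
import numpy as np

def gnmt_score(log_prob_y_given_x, attention, alpha=0.2, beta=0.2):
    """Length-normalized score with coverage penalty (Eqs. 10.9-10.11).

    log_prob_y_given_x : float, log P(Y|X) of the candidate translation
    attention          : array of shape (|X|, |Y|); p[i, j] is the attention
                         probability of source word i for translated word j
    """
    y_len = attention.shape[1]
    lp = (5.0 + y_len) ** alpha / (5.0 + 1.0) ** alpha                   # Eq. 10.10
    cp = beta * np.sum(np.log(np.minimum(attention.sum(axis=1), 1.0)))   # Eq. 10.11
    return log_prob_y_given_x / lp + cp                                  # Eq. 10.9

# Toy example: 3 source words, 4 target words, roughly uniform attention.
att = np.full((3, 4), 0.25)
print(gnmt_score(-4.0, att))
```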
For the GNMT approach, the authors used a refined maximum-likelihood training
objective and additionally stabilized it by reinforcement learning:

$$\mathcal{O}_{Mixed}(\theta) = \alpha \cdot \mathcal{O}_{ML}(\theta) + \mathcal{O}_{RL}(\theta) \quad (10.12)$$

where $\alpha$ is the weight of the maximum-likelihood component, and:

$$\mathcal{O}_{ML}(\theta) = \sum_{i=1}^{N} \log P_{\theta}(Y^{*(i)} \mid X^{(i)}) \quad (10.13)$$

where $(X^{(i)}, Y^{*(i)})$ are the $N$ sequence pairs in the training dataset and $P$ denotes the
probability. The second component in Eq. 10.12 is defined as:

$$\mathcal{O}_{RL}(\theta) = \sum_{i=1}^{N} \sum_{Y \in \mathcal{Y}} P_{\theta}(Y \mid X^{(i)}) \, r(Y, Y^{*(i)}) \quad (10.14)$$

where $r$ is the per-sentence score, which measures how well the translation produced
by the GNMT system aligns with human-generated translations or references on a
per-sentence basis.
The approach introduces various novel modifications and tricks to a machine
translation system. One of them was the use of low-precision arithmetic at the
inference stage, which allowed for a drastic increase in inference speed without a
significant loss of efficacy.
GNMT has been extensively tested both using prepared benchmarks and in a
production environment. Regarding the first type of tests, it led to an increase in
the BLEU (Bilingual Evaluation Understudy) scores for many language pairs. For
instance, on the WMT’14 English-to-French benchmark, it improved from 31.45 to
38.95 BLEU, while on English-to-German, it improved from 20.77 to 24.17 BLEU.
Regarding the subjective human evaluation of the quality of translations, the results
were even more impressive. It reduced errors by 60% compared to the previous
state-of-the-art machine translation system.
Overall, the approach represented a breakthrough in automatic translation. For
example, when it was deployed for the Japanese language as a Google Translate
service, users immediately noticed its improvements even though there was initially
no public announcement of the deployment.
10.8 AlphaGo (2016)
This section is devoted to AlphaGo and its successors such as AlphaZero. AlphaGo
was developed by DeepMind in 2016. It is an AI system that combines deep neural
networks, learning by self-play, and search to play Go.
If this book contained only 10 milestones in the history of AI, this particular
one would definitely be included. It was a remarkable breakthrough in terms of
technology that also had a wide cultural impact [19]. Many researchers regarded
AlphaGo as the most important AI contribution of 2016, or even to date. A
2017 documentary movie titled AlphaGo, directed by Greg Kohs and distributed by
Netflix, further marks the significance of this invention.
10.8.1 AlphaGo vs. Lee Sedol
In 2015, during the final development stages of AlphaGo, DeepMind invited the
2012-2014 European Go Champion, Fan Hui, to their headquarters to play against
AlphaGo. The match ended with a 5-0 win by the computer program.
Motivated by this success, DeepMind then arranged a 5-game match against Lee
Sedol (pictured in Figure 10.10). Lee Sedol, a professional Go player from South
Korea, is regarded as one of the greatest in the history of the game. He holds a
9-dan (the highest possible rank) and has won 18 international titles, which is the
second-highest result in history.
Figure 10.10: AlphaGo vs. Lee Sedol. The image is released under the CC
BY-NC-ND 2.0 DEED license: https://creativecommons.org/licenses/
by-nc-nd/2.0/.
The match was hosted at the Four Seasons Hotel in Seoul, South Korea, between
March 9th and 15th, 2016. Lee Sedol entered the match with confidence that he
would win. After all, the best-known computer programs in Go at the time were at
the level of strong amateurs. This historical match ended with the following results:
• Game 1: AlphaGo won (playing White)
• Game 2: AlphaGo won (playing Black)
• Game 3: AlphaGo won (playing White)
• Game 4: Lee Sedol won (playing White)
• Game 5: AlphaGo won (playing White)
The match resembled the 1997 clash between Deep Blue and Garry Kasparov
(see Section 7.11), but it received even more media attention, especially in China
and South Korea, where interest in Go is enormous.
The result was unexpected in general. AI had achieved a new level of advancement.
Although this outcome was generally positively received, some comments were full
of fear regarding how strong self-learning machines had become.
The hardware AlphaGo used consisted of 40 search threads, 1,202 CPUs, and
176 GPUs. It is worth noting that, although AlphaGo had access to considerable
computing power, it was not capable of searching as many states of the game as
Deep Blue ([276] reports three orders of magnitude less). It relied more on learning
and abstraction.
10.8.2 AlphaGo’s Technology
The approach has been described in a paper titled “Mastering the Game of Go with
Deep Neural Networks and Tree Search” [276] authored by David Silver et al., which
was published in Nature, in 2016.
It combines:
1. Supervised Learning (SL)
2. Reinforcement Learning (RL)
3. Monte Carlo Tree Search (MCTS)
Figure 10.11: AlphaGo utilizes three policies and a value network. The training
workflow combines supervised learning and reinforcement learning. The value
network and the rollout policies are then used by the MCTS algorithm.
As shown in Figure 10.11, three policies are utilized. Two of them, the rollout
policy 𝑝 𝜋 and the SL policy 𝑝 𝜎 , are trained in the supervised-learning fashion
from human expert games, which contained around 30 million positions (game
states) from the public KGS Go Server. The training objective was to maximize the
log-likelihood, which was maximized by stochastic gradient ascent in the direction:

$$\Delta\sigma \propto \frac{\partial \log p_{\sigma}(a \mid s)}{\partial \sigma} \quad (10.15)$$
for state-action pairs (𝑠, 𝑎) present in the training dataset. Therefore, the goal of
those two policies was to mimic human behavior. This task alone would not lead to
a strong player, because the achieved accuracy was 57.0% for 𝑝 𝜎 , and 24.2% for
𝑝 𝜋 . The SL policy (𝑝 𝜎 ) was a 13-layer convolutional neural network, whereas the
rollout policy (𝑝 𝜋 ) was just a softmax layer on the input pattern features specific
to Go. The list of those features is available in paper [276]. The inference time of
the rollout policy was very small - under 2𝜇𝑠, compared to 3𝑚𝑠 for the SL policy.
This was a crucial enabler for the rollout policy to be used as the default MCTS policy.
Because the SL policy did not result in satisfactory performance, it was subse-
quently trained in the reinforcement learning paradigm, resulting in the third policy
(the RL policy):
1. First, the RL policy (𝑝 𝜌 ) was initialized to be the same as the SL policy.
2. Then it iteratively played games against a copy of itself randomly sampled from
previous iterations. This randomization procedure was a method to prevent overfitting
to the current policy.
3. After each game, the reward was either +1 for winning or −1 for losing.
4. The weights of the current policy were updated using stochastic gradient
ascent in the direction that maximizes the reward.
The value network, similarly to the SL and RL policies, was a convolutional neural
network, shown in Fig. 10.12. However, unlike the policy networks, which return
probability distributions over actions, the goal of the value network was to estimate the
quality (value) of a Go position in terms of the expected outcome for a given player.
This network was trained on data obtained from a large corpus of games self-played
by players following the RL policy. From each game, only one sample (a position
together with the game's outcome) was taken. The objective of the regression task was
to minimize the mean squared error (MSE) between the value predicted by the network
and the outcome recorded in the dataset.
The final piece of the puzzle is the MCTS algorithm (which was described in
Section 8.9). The authors of AlphaGo used a few modifications of the base method.
Most notably, instead of just relying on scores obtained from random games played
using the default policy, they combined the score estimated by the value function ($v_{\theta}$)
with the result of a game played out by the rollout policy:

$$V(s) = (1 - \lambda) \, v_{\theta}(s) + \lambda \cdot result(p_{\pi}) \quad (10.16)$$
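A minimal sketch of this mixed leaf evaluation is shown below. The value network and the rollout are replaced by simple stubs, so the snippet only illustrates Eq. 10.16 rather than the actual AlphaGo implementation.

```python
import random

def leaf_value(state, value_net, rollout_policy, lam=0.5):
    """Mixed leaf evaluation V(s) = (1 - lam) * v_theta(s) + lam * rollout result
    (Eq. 10.16). `value_net` and `rollout_policy` stand in for the trained
    networks; here they are arbitrary stubs just to make the sketch runnable."""
    v = value_net(state)          # value-network estimate in [-1, 1]
    z = rollout_policy(state)     # outcome of a fast rollout: +1 or -1
    return (1.0 - lam) * v + lam * z

# Stubs standing in for the real networks / simulator.
value_net = lambda s: 0.1
rollout_policy = lambda s: random.choice([+1, -1])
print(leaf_value("some Go position", value_net, rollout_policy))
```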
Figure 10.12: Policy and value deep convolutional networks used in AlphaGo. The
figure is reproduced from the original paper about AlphaGo [276].
10.8.3 AlphaGo’s Successors
AlphaGo is considered a breakthrough from several perspectives. Firstly, it repre-
sented a quantum leap in the strength of Go-playing programs. Never before had
a computer program defeated a professional Go player, let alone a multiple-time world
champion such as Lee Sedol. Secondly, it stood as a remarkable technological
achievement that further underscored the deep learning revolution initiated by
AlexNet and other approaches. Thirdly, it demonstrated the hope and promise of
AI as a field by addressing a long-standing problem that had challenged human
players for centuries. Moreover, AlphaGo inspired a plethora of approaches not
only for games but also beyond, such as AlphaFold, which will be discussed in
Section 11.7. At that point in time, many companies were exploring the integration
of deep learning and tree search for solving decision problems. Returning to the
games domain, selected successors of AlphaGo include:
• AlphaGo Zero (2017): is a direct successor of AlphaGo also developed by
DeepMind. This approach was described in the paper titled “Mastering the
game of Go without human knowledge” authored by D. Silver et al. The major
change was that it did not utilize human expert knowledge in any form. Firstly,
the entire training relied on self-play. Secondly, there were no Go-specific
input features prepared by experts. The state was encoded as a plain board
configuration, and the method itself performed feature engineering. The fully
trained AlphaGo Zero was able to win 100% of games against AlphaGo.
• Alpha Zero (2017): described in the paper titled “A general reinforcement
learning algorithm that masters chess, shogi, and Go through self-play” [275]
by D. Silver et al, published in Science. This represents the next evolution
of the approach, encompassing a single methodology that generalizes across
three games.
• Leela Chess Zero (2018): also known as LCZero, is an open-source project
aimed at recreating the AlphaZero approach for chess. It is considered slightly
weaker than Stockfish (c.f. Section 9.3).
• MuZero (2019): is a direct successor of AlphaZero, developed by DeepMind,
which, in addition to chess, shogi, and Go, includes 57 Atari games. It is
important to note that DeepMind previously conducted research on Atari
games, which introduced the Deep Q-learning method (c.f. Sec 9.8). MuZero
represents a merger of these two approaches. Details are described in the
paper “Mastering atari, go, chess, and shogi by planning with a learned model”
by Schrittwieser et al. [262].
• MoHex3HNN and DeepEzo (2019): are two AlphaGo-inspired approaches
developed for the game Hex. More information can be found about them in
articles [84] and [297], respectively.
• AlphaStar (2019): Inspired by the success of AlphaGo, DeepMind tackled a
real-time strategy video game as the next challenge for advancing AI. AlphaStar
is an approach to StarCraft II. It will be described in Section 11.5 of this book.
11. Years 2017 - 2024
11.1 Transformers and Attention (2017)
“Attention is All You Need” is the title of a groundbreaking paper, published in
2017 as part of the proceedings of the 31st NIPS conference. It was authored by
researchers from Google and the University of Toronto [313]. This paper introduces
the transformer model architecture and also demonstrates the power of the attention
mechanism. As of 2024, transformers continue to revolutionize the field of artificial
intelligence.
One of the advantages of the transformer model with attention is its ability to
process complex sequences without the need for recurrence. The authors demon-
strated that it can achieve comparable or even superior performance to recurrent
neural networks (RNNs) and convolutional neural networks (CNNs) in tasks where
these models were previously considered state-of-the-art. Due to fewer sequential
dependencies compared to RNNs, the transformer model with the attention mecha-
nism can be parallelized more effectively, leading to faster training.
Now, let me present the structure of transformers. Like autoencoders, they
consist of two parts: the encoder and decoder, as depicted in Figure 11.1. Although
the architecture was originally proposed for NLP tasks, it can be generalized to other
problems as well. Both the encoder and decoder are built with 𝑁 identical stacked
layers, highlighted with rounded rectangles in Figure 11.1. In the paper [313], the
authors used 𝑁 = 6 layers for both components.
The model is designed to process sequences (e.g., words). The encoder transforms
a sequence (𝑥 1 , ..., 𝑥 𝑛 ) into a continuous numeric encoding 𝑧 = (𝑧1 , ..., 𝑧 𝑛 ), while the
decoder generates an output sequence (𝑦 1 , ..., 𝑦 𝑚 ) by producing one element 𝑦𝑖 at a
time. Additionally, the output of each encoder layer is connected to the multi-head
attention layer of the decoder, enabling the model to focus on individual encodings.
Figure 11.1: The transformer architecture. The left-hand size represents the encoder,
whereas the right-hand side represents the decoder. Both are constructed with 𝑁
repeated components as shown above.
The input to the model is first embedded to a vector space (for the introduction
of embeddings, please refer to Section 9.9). However, since the transformer model
is neither recurrent nor “unrolled” to match the maximum size of the input sequence,
the order of the elements would be lost. In order to preserve it, the authors proposed
a novel way of including the order in the encoding. It consists of adding sinusoids
of different frequencies, based on $pos$ (the position) and $i$ (the dimension):
$$encoding(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \quad (11.1)$$

$$encoding(pos, 2i + 1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \quad (11.2)$$
where 𝑑𝑚𝑜𝑑𝑒𝑙 is the dimensionality of the model representation.
These positional encodings are simply added to the embeddings, creating
“position-aware” embeddings.
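The following NumPy sketch computes the sinusoidal encodings of Eqs. 11.1 and 11.2 and adds them to a set of randomly generated stand-in token embeddings; the sequence length and model dimension are arbitrary example values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from Eqs. 11.1 and 11.2."""
    pos = np.arange(seq_len)[:, None]            # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]         # dimension index i
    angles = pos / np.power(10000.0, 2 * i / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)                # even dimensions: sine
    enc[:, 1::2] = np.cos(angles)                # odd dimensions: cosine
    return enc

seq_len, d_model = 10, 16
embeddings = np.random.randn(seq_len, d_model)   # stand-in token embeddings
position_aware = embeddings + positional_encoding(seq_len, d_model)
print(position_aware.shape)                       # (10, 16)
```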
Both the encoder and the decoder layers start with attention layers. The key idea
of attention is to selectively and contextually focus on relevant parts of the input
(either to the model or to a given layer). Formally, attention is defined as a function
that takes a triplet of query (Q), key (K), and value (V). All of these components are
represented as numeric-valued vectors, and in a practical implementation, they are
grouped together (as batches of multiple instances) into matrices. The meaning of $Q$,
$K$, and $V$ depends on the application. The underpinning intuition behind $Q$, $K$, and $V$
is as follows:
• Query represents the positional encoded input element of the sequence.
Queries are compared against keys.
• Keys are vectors that represent all the remaining positions in the input sequence.
They are used to compute the similarity between the query and each position
in the sequence.
• Values are the encoder embedding vectors that are transformed by the attention
weights - calculated as a normalized multiplication of keys and queries as
shown in Eq. 11.3.
Mathematically, $Q$, $K$, and $V$ represent transformations of the input so that it can be
“attended to”, i.e., focused on the relevant parts.
Figure 11.2: On the left-hand side, the scaled dot-product attention is shown.
The right-hand side depicts the multi-head attention, which utilizes many scaled
dot-product attention operations.
The encoder uses the so-called scaled dot-product attention, depicted in Fig-
ure 11.2 on the left-hand side. In paper [313], the scaled dot-product attention
function scales values in the following way:
$$Attention(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \quad (11.3)$$
where 𝑑 𝑘 denotes the dimension of the keys.
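A compact NumPy sketch of Eq. 11.3 for a single attention head might look as follows; the matrices are random stand-ins for projected queries, keys, and values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Eq. 11.3)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # similarity of each query to each key
    weights = softmax(scores, axis=-1)
    return weights @ V                 # weighted sum of the values

# Toy example: a sequence of 4 positions with d_k = d_v = 8.
Q = np.random.randn(4, 8); K = np.random.randn(4, 8); V = np.random.randn(4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```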
The decoder utilizes multi-head attention depicted in Figure 11.2 on the right-
hand side. It performs multiple, say ℎ, attention operations on 𝑄, 𝐾, and 𝑉,
transformed by their respective weight matrices 𝑊 that are learnt during training.
The results of these parallel operations are then concatenated into a single matrix:
$$MultiHead = Concat(head_1, \ldots, head_h) \, W^O, \qquad head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) \quad (11.4)$$
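A sketch of Eq. 11.4 is given below. The projection matrices are randomly initialized stand-ins for the learned weights, and the scaled dot-product function is redefined compactly so that the snippet is self-contained.

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in Eq. 11.3."""
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(Q, K, V, h=2, d_model=8):
    """MultiHead = Concat(head_1, ..., head_h) W^O  (Eq. 11.4)."""
    d_k = d_model // h
    heads = []
    for _ in range(h):
        # Per-head projections W_i^Q, W_i^K, W_i^V; random stand-ins for learned weights.
        Wq, Wk, Wv = [np.random.randn(d_model, d_k) for _ in range(3)]
        heads.append(attention(Q @ Wq, K @ Wk, V @ Wv))
    W_O = np.random.randn(h * d_k, d_model)     # learned output projection (random here)
    return np.concatenate(heads, axis=-1) @ W_O

X = np.random.randn(4, 8)                        # toy input sequence of length 4
print(multi_head_attention(X, X, X).shape)       # (4, 8)
```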
The masked multi-head attention, which is performed at the beginning of the
decoding part, is used only during training. Its purpose is to prevent data leakage,
i.e., access to the elements of the ground truth sequence before their corresponding
elements have been predicted. The values to which the decoder should not have
access are set to −∞. This preserves the auto-regressive property of the model.
In Figure 11.1, residual connections can be seen going directly to the “add &
norm” blocks. The authors apply dropout to those connections before the input is
added to the output of the feed-forward layers. The signal is normalized after each
attention and feed-forward layer. The latter is a fully-connected layer that carries out
the main processing task of the network.
The authors demonstrated in their paper [313] that the Transformer model
improves upon the previous best results in machine translation for English-German
and English-French language pairs. However, the paper had a profound impact
on the entire field of deep learning. It became the foundation for Large Language
Models (LLMs), which, as of 2024, represent cutting-edge advancements in AI.
11.2 The First Large Language Models (2018)
The transformer model, introduced in the previous section, has led to the inception
of Large Language Models (LLMs). They encompass a class of models trained using
massive amounts of data for general human-like natural language processing tasks.
The applications include contextual language understanding (parsing questions,
interacting in a natural language interface), language translation, text generation and
more.
The first large-scale LLMs were GPT-1 [229], developed in June 2018 by
OpenAI, and BERT [61], developed in October 2018 by Google.
11.2.1 GPT-1
GPT-1 was described in a 2018 paper titled “Improving Language Understanding by
Generative Pre-Training” by A. Radford et al. GPT stands for Generative Pre-trained
Transformer. As the name implies, this approach was based on the transformer
model introduced in 2017.
GPT-1 uses a variant of the transformer architecture, which, apart from the
input and output layers, consists of only 𝑁 = 12 stacked decoder layers, as shown
in Figure 11.3. The dimensions of the inner states were 768 for attention layers
and 3072 for feed-forward layers, respectively. Adam (cf. Section 10.2.1) was the
Figure 11.3: The GPT-1 architecture. The image was reproduced from [229].
chosen optimizer. The training was conducted for 100 epochs with minibatches of
64 sequences of tokens. The total number of modifiable parameters (of any kind) in
the network was 117 million.
The training process consists of two stages:
1. Pre-training - here, the idea is to train the model using vast amounts of text
data. Such a model is then serialized and available for further use. For this
task, the BookCorpus dataset was used, which contains more than 7,000 unique
books. The total number of words in these books was approximately
800 million. The pre-training was performed in the semi-supervised learning
fashion. The objective was to train a model capable of predicting the next
token $u_i$ given the context of the $k$ preceding tokens, where $P$ is the conditional probability
approximated by the network with learnable parameters $\theta$:

$$L_1(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \theta) \quad (11.5)$$

In addition to its main idea of creating a model that can be quickly adapted for
various tasks, pre-training also acts as a regularization technique. (A toy sketch
of this next-token objective is given after this list.)
2. Fine-tuning - a copy of the pre-trained model is then trained (in the supervised
learning fashion) using labelled data for a specific task. Here, there is a
sequence of tokens (𝑥 1 , ..., 𝑥 𝑚 ) and the ground-truth label 𝑦. In pre-training,
unordered sets of tokens (e.g., taken from the same paragraph) were also used.
The same architecture was used as in the pre-training phase, with
the exception of adding a dropout layer with a rate of 0.1. The authors stated
that, for most tasks, only 3 epochs were enough to fine-tune the model.
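As mentioned above, here is a toy numeric sketch of the next-token objective from Eq. 11.5. The “model” is a fixed probability table rather than a transformer, so the snippet only shows how the log-likelihood is accumulated.

```python
import numpy as np

def pretraining_log_likelihood(token_ids, probs, k=2):
    """L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1})  (Eq. 11.5).

    token_ids : the token sequence u_1, ..., u_n (integer ids)
    probs     : probs[i] is the model's distribution over the vocabulary
                for position i, given the k preceding tokens
    """
    total = 0.0
    for i in range(k, len(token_ids)):
        total += np.log(probs[i][token_ids[i]])   # log-probability of the true next token
    return total

# Toy example: vocabulary of 4 tokens, a sequence of 5 tokens, and a fake
# "model" that assigns fixed probabilities (stand-in for a transformer decoder).
tokens = [0, 2, 1, 3, 2]
fake_probs = [np.full(4, 0.25) for _ in tokens]   # uniform predictions
print(pretraining_log_likelihood(tokens, fake_probs))   # 3 * log(0.25)
```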
Figure 11.4: Various tasks, which the GPT-1 model was fine-tuned for. The image
was reproduced from [229].
The specific tasks included in paper [229] are presented in Fig. 11.4:
1. Text classification: (A) whether a sentence is correct in terms of grammar; and
also (B) a binary sentiment analysis classification.
2. Textual entailment, i.e., determining whether one piece of text logically follows
from another, or if they are logically equivalent.
3. Semantic text similarity.
4. Multiple choice question answering.
GPT-1 was one of the largest neural networks created up to that date. With minimal
supervision, it achieved new best results for 9 out of 12 datasets used in the
experiments by the authors. It solidified its position in history as the first model in
the GPT series developed by OpenAI.
11.2.2 BERT
BERT was described in a 2018 paper titled “BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding” by J. Devlin et al. BERT stands for
Bidirectional Encoder Representations from Transformers. Like GPT, this approach
was also based on the transformer model introduced in 2017. While GPT-1 employed
stacked transformer decoder layers, BERT utilizes stacked transformer encoder
layers.
The training process of BERT was almost exactly the same as for GPT-1. It also
consisted of a computationally expensive pre-training and fast fine-tuning for specific
NLP tasks. For the pre-training phase, its creators also used the BooksCorpus
in addition to English Wikipedia (2.5 billion words from it). For the fine-tuning
phase, eleven NLP problems were selected such as: the General Language Under-
standing Evaluation benchmark (GLUE) [318], the Stanford Question Answering
Dataset [232], and the Situations With Adversarial Generations [343] (SWAG). In
all 11 problems, it improved the previous best results.
BERT was created in two variants differing by the model size:
1. BERTBASE was intentionally designed, for comparison purposes, to have the
same size as GPT-1. It contained 12 layers and 110 million total parameters.
2. BERTLARGE had 24 layers and 340 million total parameters. The authors re-
ported that this larger model achieves significantly better results, demonstrating
that this approach scales well.
The major difference between BERT and GPT-1 lies in the implementation
of self-attention. In GPT-1, each token (element of the input) could only attend
to preceding tokens, whereas BERT introduced an enhanced variant. Instead of
masking the remainder of a sequence for a given token, BERT masked some of the
input tokens, allowing the model to consider the entire sequence.
11.3 Soft Actor-Critic (2018)
11.3.1 Method Overview
Let us start with the regular Actor-Critic method. It is a type of reinforcement
learning (RL) approach that involves creating and training two models: the actor
and the critic. These models are often implemented as neural networks, though they
can also be represented using other types of models.
• The actor is responsible for action selection. It iteratively learns a policy
π(a|s), which is a probability distribution over actions given concrete states.
A policy describes the probability of taking certain actions in particular states.
A policy can be stochastic, providing a distribution of probabilities for different
actions, or deterministic, where all but one action in a given state have a
probability of 0, with one action having a probability of 1.
• The critic evaluates the actions chosen by the actor. It learns the value function
𝑉 (𝑠) and/or the action-value function 𝑄(𝑠, 𝑎), which estimate the expected
return (the cumulative future reward) starting from specific states
or state-action pairs, respectively. The critic provides feedback to the actor, estimating the
quality of the chosen actions. This optimization process, with two components
working in tandem, resembles the GANs approach (c.f. Section 10.1).
In 2018, an improved version of the method named Soft Actor-Critic (SAC) was
published in the article titled “Soft Actor-Critic: Off-Policy Maximum Entropy Deep
Reinforcement Learning with a Stochastic Actor” [100] authored by T. Haarnoja,
A. Zhou, P. Abbeel, and S. Levine. The off-policy learning distinguishes the
action-selection mechanism from the ultimate policy that is being learnt, whereas
the on-policy learning uses the same policy for action-selection and continuous
improvement in the process.
Model-free deep RL approaches are frequently characterized by their sample
inefficiency, i.e., they are slow to train and require large numbers of independent
samples. Depending on a particular task’s complexity, millions of samples or even
more may be needed. However, through the utilization of off-policy learning and the reuse
of past experiences in the form of the so-called replay buffer, the SAC algorithm
presents notable enhancements in this aspect. Overall, it stands out as one of the most
effective RL methods ever devised, combining strong aspects of various model-free
RL approaches regarding:
1. Sample complexity
2. Sensitivity to hyperparameters
3. Stability and robustness
A standard RL objective is to maximize the expected sum of rewards over time 𝑡:
$$\sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \left[ r(s_t, a_t) \right] \quad (11.6)$$

where $\rho_{\pi}$ denotes the trajectory distribution resulting from following a policy $\pi$, $s_t \in S$ is the
state at time $t$, $a_t \in A$ is the action taken in this state, and $r : S \times A \to \mathbb{R}$ is a reward.
Soft actor-critic introduces an additional component responsible for entropy
regularization:

$$\sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \left[ r(s_t, a_t) + \alpha \cdot H(\pi) \right] \quad (11.7)$$

where $\alpha > 0$ is a temperature parameter that represents the trade-off between maximizing
the reward and the entropy $H$.
The entropy is defined as:

$$H(\pi) = \mathbb{E}_{a_t \sim \pi} \left[ -\log(\pi(a_t|s_t)) \right] \quad (11.8)$$
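For a discrete toy policy, the entropy of Eq. 11.8 and the entropy-augmented per-step reward of Eq. 11.7 can be computed as in the sketch below; the action probabilities and the temperature α are arbitrary example values.

```python
import numpy as np

def policy_entropy(action_probs):
    """H(pi) = E[-log pi(a|s)] for a discrete action distribution (Eq. 11.8)."""
    p = np.asarray(action_probs)
    return float(-(p * np.log(p)).sum())

def soft_reward(reward, action_probs, alpha=0.2):
    """Per-step entropy-augmented reward r + alpha * H(pi), as in Eq. 11.7."""
    return reward + alpha * policy_entropy(action_probs)

near_deterministic = [0.97, 0.01, 0.01, 0.01]   # low entropy, little exploration
uniform = [0.25, 0.25, 0.25, 0.25]              # maximum entropy for 4 actions
print(policy_entropy(near_deterministic), policy_entropy(uniform))
print(soft_reward(1.0, uniform, alpha=0.2))
```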
Adding the entropy term is crucial for a few reasons:
• Exploration: the entropy is a measure of uncertainty. High entropy means
more exploration and less focus on exploiting the already known action
sub-spaces.
• Handling Continuous Action Spaces: the entropy-based approaches have
been proven to be particularly well-suited exploration strategies for spaces with
an essentially infinite number of possible actions (see Figure 11.5). Compared
to methods such as Q-Learning (c.f. Section 9.8), SAC is a particularly suitable
approach for continuous spaces.
• Handling noisy environments: the entropy reduces the determinism of the
agent. Highly deterministic agents often do not learn well in highly non-
deterministic environments. The inclusion of entropy makes it less likely for
the model to overfit to noise. Noisy signals from the environment may result
in erroneously learning that some actions are good in a particular state. The
more deterministic the training process, the harder it might be to avoid getting
stuck in erroneous local optima.
• Adaptability: introducing entropy makes the trained policies more general,
meaning they are less “committed” to specific sets of actions, as they have
essentially been trained on a more diverse range of situations. This can
be beneficial for dynamic environments or for deployments of SAC-based
approaches from training phases to production environments.
Figure 11.5: A comparison between continuous and discrete action spaces. The
colored area depicts available positions to go to using one action.
The authors of SAC performed experiments on a diverse range of problems. They
compared the method to prior off-policy and on-policy RL approaches, including
deep deterministic policy gradient (DDPG), proximal policy optimization (PPO), soft
Q-learning (SQL), twin delayed deep deterministic policy gradient (TD3), and trust region path
consistency learning (Trust-PCL). In general, SAC performed significantly better on
the harder problems, while achieving similar performance to the other methods on
easier ones. Its blend of stability, efficiency, and entropy-driven exploration solidifies
its position as one of the most significant RL techniques available.
11.3.2 Mathematical Details
The soft actor-critic method intertwines the policy evaluation (using a value function)
and policy improvement steps. The improvement is typically defined as minimization
of Kullback-Leibler (KL) divergence. This is a very computationally expensive
process and SAC was designed to address it.
Haarnoja and co-authors introduce three functions:
1. A soft value function $V_{\psi}(s_t)$ that returns the payoff associated with state $s_t$ and
is approximated by a model (preferably a deep neural network) parameterized
by $\psi$. This model is trained (i.e., the parameters $\psi$ are found) using stochastic
gradient descent (SGD) by minimizing the following objective function
$J_V(\psi)$:

$$J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}} \left[ \frac{1}{2} \left( V_{\psi}(s_t) - \mathbb{E}_{a_t \sim \pi_{\phi}} \left[ Q_{\theta}(s_t, a_t) - \log \pi_{\phi}(a_t|s_t) \right] \right)^2 \right] \quad (11.9)$$

where $\mathcal{D}$ is called a replay buffer - a collection of sampled states and actions.
Instead of calculating the exact gradient, it can be approximated using the
following formula that uses the current value function (before the update), the
Q function, as well as the current policy:

$$\hat{\nabla}_{\psi} J_V(\psi) \approx \nabla_{\psi} V_{\psi}(s_t) \left( V_{\psi}(s_t) - Q_{\theta}(s_t, a_t) + \log \pi_{\phi}(a_t|s_t) \right) \quad (11.10)$$
2. The Q function approximated by the Q network with parameters 𝜃. It is
trained by minimizing the following (Bellman residual) error:
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}} \left[ \frac{1}{2} \left( Q_{\theta}(s_t, a_t) - \hat{Q}(s_t, a_t) \right)^2 \right] \quad (11.11)$$

with:

$$\hat{Q}(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p} \left[ V_{\bar{\psi}}(s_{t+1}) \right] \quad (11.12)$$
where 𝜓¯ represents the parameters of the target value network. The authors
suggest that these parameters can be computed as an exponentially moving
average of the weights of the value network. Alternatively, they can be updated
to match the same values as 𝜓 after every few iterations. It is important to note
that both 𝑉 and 𝑄 networks depend on each other, which makes the training
process less stable. The introduction of additional parameters 𝜓¯ updated as a
moving average or with a delay helps stabilize the process.
Again, the training is performed using SGD, wherein the gradient can be
estimated as follows:
$$\hat{\nabla}_{\theta} J_Q(\theta) \approx \nabla_{\theta} Q_{\theta}(a_t, s_t) \left( Q_{\theta}(s_t, a_t) - r(s_t, a_t) - \gamma V_{\bar{\psi}}(s_{t+1}) \right) \quad (11.13)$$
3. The policy function represented by the policy network with parameters 𝜙.
In [100], the authors show how to reparameterize the objective of minimizing
the KL divergence to minimizing the following error:
$$J_{\pi}(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}, \, \epsilon_t \sim \mathcal{N}} \left[ \log \pi_{\phi}(f_{\phi}(\epsilon_t; s_t) \mid s_t) - Q_{\theta}(s_t, f_{\phi}(\epsilon_t; s_t)) \right] \quad (11.14)$$
where 𝑓𝜙 is a neural network transformation that acts as a reparameterization
of the policy and 𝜖𝑡 is a noise vector sampled from some fixed distribution.
The authors suggested a spherical Gaussian distribution.
The gradient can be estimated as follows:
$$\hat{\nabla}_{\phi} J_{\pi}(\phi) \approx \nabla_{\phi} \log \pi_{\phi}(a_t|s_t) + \left( \nabla_{a_t} \log \pi_{\phi}(a_t|s_t) - \nabla_{a_t} Q(s_t, a_t) \right) \nabla_{\phi} f_{\phi}(\epsilon_t; s_t) \quad (11.15)$$
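A minimal sketch of the reparameterization used in Eq. 11.14, with a diagonal Gaussian policy and stub networks standing in for π_φ and Q_θ, is given below. The tanh squashing used in practical SAC implementations is omitted for brevity, so this is an illustration of the idea rather than a faithful re-implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stub "networks": the policy outputs a mean and log-std per action dimension,
# and the critic returns a scalar Q value. Both are placeholders.
def policy_net(state):
    return np.zeros(2), np.zeros(2)      # mean, log_std for 2 action dimensions

def q_net(state, action):
    return -np.sum(action ** 2)           # toy critic

def policy_loss(state):
    """Single-sample Monte Carlo estimate of Eq. 11.14 using the
    reparameterization a = f_phi(eps; s) = mean + std * eps."""
    mean, log_std = policy_net(state)
    std = np.exp(log_std)
    eps = rng.standard_normal(mean.shape)  # eps ~ N(0, I), the fixed noise distribution
    action = mean + std * eps              # differentiable w.r.t. the policy parameters
    # log pi_phi(a|s) for a diagonal Gaussian
    log_prob = -0.5 * np.sum(((action - mean) / std) ** 2
                             + 2 * log_std + np.log(2 * np.pi))
    return log_prob - q_net(state, action)

print(policy_loss(state=np.array([0.5, -0.3])))
```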
In summary, SAC is a computationally-friendly approximation of soft policy
iteration, which repeatedly applies a modified Bellman backup operator. For details,
please consult [100]. In a later paper [99], Haarnoja et al. presented a refined version
that only makes use of the 𝑄 network and the policy network (dropping the value
network). However, for historical reasons, this chapter focused on the original SAC
formulation.
11.4 OpenAI Five (2018)
About the Game
Dota 2 is a real-time video game developed and published by Valve that belongs
to the multiplayer online battle arena (MOBA) genre. It evolved from a fan-made
mod for another popular game called Warcraft III, named Defense of the Ancients,
which gave it its name. Warcraft III is a real-time strategy game (RTS), and while
Dota 2 uses similar controls and visual style, it is more of a tactical game focused on
individual units (called heroes) rather than massive armies (as in RTS games). The
heroes are selected in a draft process before each match. As of the time of writing
this book, there are 124 distinct heroes. They belong to two main categories: core
and support. The former are the main damage dealers, but they start weak. Supports,
whose impact is highest early on, are focused on utility rather than damage.
Figure 11.6: Dota 2 game. Credit to Vedmakzxc, CC BY-SA 4.0, via Wikimedia
Commons.
https://creativecommons.org/licenses/by-sa/4.0.
In Dota 2, two teams of five players, with each player controlling one hero, battle
against each other on a map. They have to defend their respective bases, located in
the opposite corners, and destroy the central structure of the opposite base called
an ancient. Additionally, there are many activities on the map that allow players to
earn resources, which can be used to improve the heroes and the base. High-level
play requires a lot of micromanaging of the heroes, proper positioning, using their
abilities, and deciding whether to push or retreat, among other tactics.
OpenAI Five vs. Humans
OpenAI Five is the name of a bot created by OpenAI to play Dota 2. The research by
OpenAI on this topic started around late 2016. In 2017, the developers were ready
with a one-on-one version, i.e., that could control one hero against one enemy hero,
which is not the main way to play the game. In this mode, certain game mechanics
were disabled. The first player to score two kills is declared the winner.
A showcase one-on-one match between the OpenAI bot and professional human
player Danil “Dendi” Ishutin took place during the 2017 Dota 2 Championships.
The match consisted of two games, in which both contestants played the same hero:
Shadow Fiend. Dendi attempted two different strategies in these games but lost on
both occasions. In an interview immediately following the match, Greg Brockman
of OpenAI, stated that the version of the bot at the time had been trained for only
two weeks in real-time. Throughout this event, the OpenAI bot faced several other,
less experienced players and remained undefeated.
In 2018, the computer program gained capabilities to play the regular, five-on-five
matches, and thus became known as OpenAI Five. In 2019, its most famous best-of-
three match took place against OG, the winning team of the 2018 Dota 2 Championship.
This event was held live in San Francisco as well as streamed on the Twitch platform.
OpenAI Five convincingly won 2-0, which was an unprecedented achievement at the
time.
It was the first time that a non-human player defeated top human professionals
at a complex e-sports game during a live event. When considering non-live events
as well, it is difficult to determine whether OpenAI Five or DeepMind’s AlphaStar
(introduced in Section 11.5) was the first, as research on them was initially not
publicly reported.
Following the 2019 win against OG, OpenAI Five was made available online
for players worldwide to challenge. After more than 40,000 games played, the bot
boasted a winning percentage of 99.4% [326].
OpenAI Five stands as a remarkable achievement in the annals of AI history.
A self-learning system managed to defeat top human players in a highly complex
and competitive e-sports game with large monetary prizes. However, there has been
skepticism regarding the true impact of this achievement. Firstly, it was tailored
specifically for one game. Despite lacking a human game database to learn from,
significant effort was invested in engineering features for this solution. Some argued
that a purely image-based approach, focusing solely on visual game representations,
would offer greater transferability across different domains.
Furthermore, the bot could only play 17 out of the 124 available heroes in the
game. This limitation suggests that certain mechanics of the other heroes might have
proven too complex for the bot to master. Additionally, all the showcase matches
it participated in were relatively short. Both Dendi and OG were caught off guard,
raising the possibility that such high-level players could adapt to the bot’s strategies,
whereas the reverse might not hold true. Finally, the 99.4% win rate against online
opponents suggests that the bot may not be truly superhuman yet, as it could still be
defeated (at least 240 times).
Nevertheless, OpenAI Five made history, much like Deep Blue and AlphaGo
in their respective games.
A New Scale of the Challenge
Historically, games have served as a testbed for AI. Some notable examples include
checkers, chess, backgammon, and Go. However, given that chess is more complex
than checkers, and Go is more complex than chess from an AI perspective, real-time
strategy/tactical (RTS) video games represent the next step in the progression of
complexity. The following factors make such games particularly complex:
1. Action-space size. In classical combinatorial games, there is typically a board
containing a limited number of discrete positions where actions can be played.
In RTS or MOBA games, numerous units act simultaneously, resulting in a
multiplication of the branching factor by the number of actions per unit. Units
may possess various abilities with numerous valid targets to choose from.
Some actions, such as movement, involve essentially infinite parameterization
due to the continuous nature of space. Enumerating all actions and selecting
one becomes infeasible. Instead, actions must be generated according to
known rules or an upfront space-limiting discretization must be applied. For
example, the OpenAI Five solution to Dota 2 limits the number of possible
actions to 80,000. Paper [214] reports that the bot actually chooses from
between 8,000 and 80,000 actions, depending on the hero and the state the
game is currently in. This is still a significantly larger space compared to Go
(≈ 250) and chess (≈ 35).
2. Action similarity landscape. Most of the time, in classical combinatorial
games, each action matters (unless the game is already over). In video games,
many actions are often nearly identical, e.g. movement to a position (𝑋,𝑌 )
vs. (𝑋 + 𝜖,𝑌 ) or attacking with 50 units vs. with 51 units, etc. This makes it
difficult for a search-based algorithm to converge, as the search budget
will be split among many similar actions. It is hard to isolate the optimal one.
3. Action frequency. In combinatorial games, depending on the time controls, a
player may have from a few seconds to even an hour to deliberate on a single
move. This makes search a feasible strategy. Real-time video games are much
faster paced. For instance, the OpenAI Five player of Dota 2 selects actions
every fourth frame, where one frame lasts 1/30 of a second. Such a scenario
often demands both short-term action selection and some form of long-term
planning.
4. Time horizon. A typical chess game lasts 80 moves, whereas it is 150 moves
for Go [5]. OpenAI Five takes an average of 20,000 moves per game. This
leads to significantly delayed rewards over time.
5. State-space size. Similarly to the action space, the state-search space is
enormous in RTS and MOBA video games. It includes the combined statistics
of all units, structures, and resources, including their positions. In addition,
there are players’ statistics, map topology, information about the visibility,
game time, current timed effects (e.g. buffs) and more.
6. Partially observable state and need for reconnaissance. Unlike chess
and Go, RTS and MOBA games are imperfect information games. They
incorporate the concept of the “fog of war,” where players have limited
visibility of the game map. In contrast to certain hidden information games
of chance, success in RTS and MOBA games relies heavily on effective
reconnaissance and strategic reasoning about the missing information, such as
determining the location and composition of enemy forces.
7. Coordination. The games in question necessitate the coordination of multiple
units. While in RTS games, a single computer program can typically handle
the coordination, most MOBA games, including Dota 2, require each hero to
be controlled by a separate player. Consequently, an AI player must be capable
of coordinating actions with a duplicate of itself, which controls its teammate.
Summary of the Technology
The key elements are outlined below. For a more detailed description, please refer
to paper [214]:
• Training Algorithm. The model was trained using reinforcement learning
solely based on self-play. The RL technique of choice was Proximal Policy
Optimization (PPO) [264]. The key idea behind PPO is to constrain the policy
update such that it remains close to the policy used to collect the data. This
is done by introducing a surrogate objective function that approximates the
policy improvement. By optimizing this surrogate objective function, PPO
ensures that the policy updates are conservative and do not deviate too far from
the original policy (a minimal sketch of such a clipped surrogate objective is
given at the end of this list). Initially, the agent started by performing random actions.
The policy was trained entirely based on self-play. The “rollout workers”, as
the authors call them, are responsible for performing game simulations during
the training phase. These games were run at twice the normal speed due to
optimizations that are possible for games played exclusively by bots.
The back-propagation through time is performed with samples of 16 timesteps.
The optimizer was Adam (see Section 10.2.1). After every 32 update steps, the
current version of the model was serialized into a database. The latest policy
played 80% of games against a copy of itself and 20% of games versus older
versions sampled from the database. This was made to prevent overfitting to
the last opponent and other unwanted convergence effects.
• Architecture: OpenAI Five uses a neural network model to represent both
the value function and the policy. It is presented in Figure 11.7. The input is
first processed to generate embedding vectors that represent the summarized
states of the hero units and the remaining state. This combined input is then
fed into the neural network. The network architecture is conceptually simple,
consisting of a single layer of 4096 LSTM units (for a description of the LSTM
model, please see Section 7.9). The network is replicated five times, with
identical parameters (though slightly different inputs) to control the five heroes.
The outputs of the LSTMs are further processed using linear projections
to generate the probability distribution over a specific hero’s actions. It is
important to note, however, that the developers also included certain scripted,
rule-based actions that would be played under given conditions regardless of
the output from the neural network. Additionally, the state of each LSTM
undergoes processing using a distinct linear projection to produce the value as
a second output head that represents the winning chance estimation.
Figure 11.7: A simplified architecture of the OpenAI Five model.
• State Representation. The state representation has been manually engineered
by the OpenAI team in an iterative research process. The observed state is
processed and concatenated into a single large vector consisting of 1,200
categorical values and 14,534 continuous/boolean values. These encompass:
1. Global observations of 22 numeric features, such as time since the game
started.
2. Spatial data resulting from processing the game minimap through a
2-layer convolutional network. It consists of 900 continuous values.
3. Spatial data of a nearby map (14 × 14 grid around the particular hero
for which actions are calculated). It is also processed through a 2-layer
CNN. It consists of 512 values.
4. Statistics of pickups on the map: 15 features per 6 available pickups.
5. Unit observations: a large number of features per each of 5 allied heroes,
5 enemy heroes, 82 allied non-heroes, 82 enemy non-heroes, and 15
neutral units.
• Hardware and Training Time. One of the reasons behind the bot’s success
is the use of computational hardware and time. According to the OpenAI Five
official website [213], the bot plays the equivalent of 180 years worth of games
every day. It utilizes between 80,000 and 172,800 CPUs and, at peak, also
employs 1,536 GPUs. Prior to the match against Team OG, the bot underwent
continuous training for 10 months.
• Continual Transfer. The OpenAI Five underwent a long evolution process
driven by several factors. Firstly, its development occurred gradually, with
new capabilities such as the use of certain abilities and heroes being added
incrementally. Secondly, researchers iterated with various ideas, including
adjustments to the reward structure. Thirdly, the game itself underwent
iterative updates by the developer. Training the model from scratch for
each update would have been both time and cost inefficient. To address this
challenge, the OpenAI team devised a method referred to as “surgery” for
transferring relevant parameters from the old model to the new one, enabling
training to continue from a relatively competent player. Prior to publication,
they conducted over 20 surgeries [214].
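As promised in the description of the training algorithm, below is a minimal sketch of the clipped surrogate objective commonly associated with PPO. The probability ratios, advantages, and clipping parameter are illustrative stand-ins and are not taken from OpenAI Five.

```python
import numpy as np

def ppo_clipped_objective(new_probs, old_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (to be maximized).

    new_probs  : pi_new(a_t|s_t) for the actions actually taken
    old_probs  : pi_old(a_t|s_t) under the data-collecting policy
    advantages : advantage estimates for those state-action pairs
    """
    ratio = new_probs / old_probs                      # how far the policy moved
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Taking the minimum removes the incentive to move the ratio outside the clip range.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

new_p = np.array([0.30, 0.10, 0.65])
old_p = np.array([0.25, 0.20, 0.50])
adv   = np.array([1.0, -0.5, 2.0])
print(ppo_clipped_objective(new_p, old_p, adv))
```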
11.5 AlphaStar (2018)
Starcraft
StarCraft II, as well as its prequel, StarCraft, are real-time strategy (RTS) games
developed and published by Blizzard Entertainment. They have been among the
best-selling and highly-acclaimed RTS games of all time. StarCraft was one of the
pioneering games in the realm of e-sports, with international tournaments offering
prizes ranging up to millions of US dollars. Figure 11.8 illustrates a scene from one
of these tournaments.
Figure 11.8: Picture from a Starcraft tournament held in South
Korea. Taken by Maximusnukeage for English Wikipedia:
https://en.wikipedia.org/wiki/StarCraft_in_esports
CC BY 3.0 https://creativecommons.org/licenses/by/3.0, via Wikimedia Commons
StarCraft is similar to Dota 2 in terms of its high-paced real-time gameplay and
the requirement for players to make tactical decisions. However, as an RTS game, it
possesses several distinctions. Rather than controlling a single hero, as in Dota 2,
players command entire armies comprising up to 200 units. Naturally, these units are
simpler than the heroes in Dota 2, having fewer abilities and being more replaceable.
RTS games place more emphasis on base building, including decisions regarding
when and where to place structures, production orders (which units to produce), and
resource gathering. StarCraft features two types of resources: minerals and gas, both
of which are collected by workers, which are also used to construct structures.
In StarCraft, players can choose from three factions (races): Terrans, Zergs,
and Protoss. Each race offers unique characteristics and gameplay experiences.
Professional human players typically specialize in one race. By far the most popular
game mode is 1v1, e.g., one Zerg player against one Protoss player or one Terran
player against another Terran player. However, other game modes, up to 4v4, are also
available. Each race starts with a main building and a few worker units.
StarCraft has been identified by the academic community as a promising research
framework [212]. In general, the challenges related to creating AI for RTS games have
already been discussed in the previous section (11.4). The state-space complexity of
StarCraft has been estimated to be at least $10^{1685}$, whereas the branching factor is at
least $10^{50}$ [211]. It inspired the creation of the Open Real-Time Strategy (ORTS)
framework [32], which served as the environment for the ORTS Competition, first
organized in 2006. In 2009, the Brood War API (BWAPI) was released, which
allowed for programmatically interacting with StarCraft I (Brood War was the
name of its first expansion). This was a groundbreaking event as it enabled the
use of StarCraft directly as a research environment. In 2010, the first StarCraft
AI competition [38] was organized as part of the AAAI Artificial Intelligence and
Interactive Digital Entertainment (AIIDE) conference. StarCraft II has served as a
research environment since 2016. Researchers have developed many bots to play
the game, based on techniques such as MCTS and other tree-search algorithms,
evolution, and dynamic scripting, but until AlphaStar, no bot had ever come close to
defeating a human professional.
AlphaStar vs. Humans
AlphaStar is the name of the bot developed by Google DeepMind to play Star-
Craft II. In a sense, it represents the natural progression following the success of
AlphaGo, which was discussed in Section 10.8. DeepMind initiated this project
in secrecy, and in December 2018, invited the first player outside of the company
to evaluate its strength. The company decided to showcase the bot’s capabilities
only in Protoss-vs-Protoss matches, suggesting it may have been strongest with this particular race.
The first match between AlphaStar and a professional Starcraft player, Dario
“TLO” Wünsch, took place in December 2018 at DeepMind’s headquarters in
London. TLO was a member of the e-sports team Team Liquid. AlphaStar managed
to defeat the human opponent in five matches with a score of 5 − 0. Throughout these
games, AlphaStar employed various strategies, surprising TLO, who mentioned
in an interview that he attempted to adapt to previous strategies, but each game
presented a different challenge. The DeepMind team later revealed that they selected
five of the best-performing AlphaStar clones from a population of trained players.
However, the order in which TLO played against these bots was randomized. At the
time, the company had limited knowledge of AlphaStar’s strength, having only tested
it against other bots, and thus, its convincing victory was not anticipated. However,
it is worth noting that TLO was not specialized in playing as the Protoss race, which
he had to play as during these matches.
Next, on December 19, 2018, DeepMind invited one of the best Protoss players,
Grzegorz “MaNa” Komincz, who was also a member of Team Liquid. The match
was scheduled for five games, played under competitive settings. Once again,
AlphaStar emerged victorious, winning all five games. In an interview, the creators
of AlphaStar stated that during the week between the two showcase matches, the
bot was continuously in training mode. They admitted that the version that played
against TLO would have likely lost against MaNa. Throughout these games, it was
noticeable that the bot significantly improved its micromanagement of units. Its
positioning and timing of attacks were superior. The Guardian named AlphaStar a
“landmark achievement” for the field of AI [257].
This sparked a discussion about whether AlphaStar possesses an unfair advantage
over humans in terms of access to information, reaction time, and actions per minute
(APM). At least the last two of these accusations were shown to be rather unfounded.
In games against TLO and MaNa, AlphaStar maintained an average APM of 280
(peaking at 678), whereas MaNa had an average of 390 APM. It can be argued
that although the bot maintained a lower average, it had spikes of APMs precisely
when it mattered. The reaction time of AlphaStar averaged 350ms, which was
naturally influenced by the input processing and inference times of the AI model.
Such a reaction time is comparable to or slower than that of human players. Regarding access to
information, AlphaStar could perceive the same information available to a human
player but without the need to scroll the map. Human players are limited to viewing
only a portion of the map at a time and must scroll (or click on the minimap)
to observe different regions. In contrast, AlphaStar could view the entire map
simultaneously.
Summary of the Technology
The key elements are outlined below. For a more detailed description, please refer
to paper [315]:
• Training Algorithm. Unlike OpenAI Five, AlphaStar was kick-started by a
supervised learning algorithm trained using a publicly available anonymized
database of human games. These games were played by relatively high-skilled
players (top 22%), although not necessarily professional e-sport ones. The
objective was to predict the actions made by humans. This supervised learning
phase enabled the initialization of weights for the AlphaStar model.
In the next stage, the agent was trained using multi-agent reinforcement
learning (RL). The authors used a combination of various achievements in
RL. The base method was policy gradients, closely resembling advantage
actor-critic combined with TD(𝜆), clipped importance sampling (V-trace) [70],
and upgoing policy update [210] for off-policy corrections. Readers interested
in details concerning how all these techniques were combined are advised to
check the original paper [315]. The total loss used in training was defined as
a weighted sum of the value and policy functions losses. Like OpenAI Five,
AlphaStar also uses the Adam optimizer.
In [315], the authors introduced a novel approach called the AlphaStar League.
The league is a competitive multi-agent environment, in which many versions
of the bot play against each other with the goal of training as the strongest and
least-exploitable player in the end. The league algorithm is responsible for
selecting the matchups, i.e., choosing two particular agents that will play the
next game. There are three main types of participating agents:
1. Main agents - from which the final solution is to be chosen. They
are rewarded for defeating a wide variety of opponents. The authors
introduced a so-called prioritized fictitious self-play (PFSP) mechanism,
which is responsible for choosing the opponents and playing games
against them. It is based on the win rates of each opponent against this
particular agent (a toy sketch of such opponent sampling is given after this list).
2. Main exploiters - which only play against the current iteration of main
agents. Their purpose is to find weaknesses in the current best strategies.
3. League exploiters - similar to the main agents, but they do not play the
main exploiters and are not influenced by them.
Figure 11.9: The AlphaStar model overview.
• Architecture. The neural network model is presented in Figure 11.9. The
value network is utilized only during training (serving the critic role), whereas
during actual live matches it can be used for informational purposes to see how
AlphaStar evaluates the chance of winning. It does not affect the actions of the
agents, though. Therefore, during training, the value network is provided with
complete information (even those outside the field of view). The action-based
outputs are responsible for constructing the action to play during inference.
First, the type of the action is chosen. Next, it is decided when the action can
be played. Finally, the context-specific parameters of the action are determined.
The central element is the Deep LSTM component, similar to the one used in
OpenAI Five. However, there is no exact information in paper [315] about how
many LSTM units it contains. The entire model has 139 million parameters.
During inference in the actual game, only 55 million of these parameters are
used. The LSTM network processes a temporal sequence of the most recently observed
state representations.
• State Representation. To train the policy, AlphaStar uses a combination of:
1. Scalar features that summarize the state of the game. Examples of scalar
features include the current game time, current camera position, player’s
current resources, supply, army supply, etc.
2. Detailed information about the entities (up to 512). These entities consist
of units and buildings. Each entity is characterized by various features
such as type, owner, current health, energy, shields, position, number of
assigned workers, cooldowns for attacks and abilities, etc.
3. Visual spatial information. It is provided for the entire map (think
of it as a rescaled minimap). This information consists of 7 layers:
height, visibility, creep (the Zerg-specific terrain covering), entity owners, alerts, pathable,
and buildable. Information is stored per pixel, if available. Otherwise,
the visibility map will mark a particular pixel as unavailable, and the
corresponding pixels of some of the remaining layers can be ignored.
• Hardware and Training Time. Google DeepMind prepared a sophisticated
and highly scalable architecture for running the AlphaStar league. The matches
were played using a computing system equivalent to 150 processors, each with
28 physical cores. Machine learning tasks were processed using a powerful
Tensor Processing Unit v3 (TPU v3) machine with eight TPU cores. Requests
to the TPU-based machine were grouped into batches of size 512. The learning
algorithm processed 50,000 in-game steps per second, and the parameters
of the agents were updated every 10 seconds. In an interview following the
matches against TLO and MaNa, the creators of AlphaStar stated that the bot
had played an equivalent of 200 years’ worth of StarCraft games.
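To illustrate the prioritized fictitious self-play mechanism from the Training Algorithm item above, here is a minimal sketch of win-rate-based opponent sampling. The weighting function, its exponent, and the win-rate bookkeeping are simplified assumptions rather than the exact scheme from [315]; the point is only that opponents the agent struggles against are sampled more often.
```python
import random

def pfsp_weight(win_rate: float, p: float = 2.0) -> float:
    """Weight an opponent by how hard it is: a low win rate gives a high weight.
    The exponent p controls how strongly hard opponents are prioritized."""
    return (1.0 - win_rate) ** p

def sample_opponent(win_rates: dict) -> str:
    """Pick an opponent for the main agent.

    win_rates maps opponent name -> estimated probability that the main agent
    beats that opponent (a simplified stand-in for the statistics kept by the
    league manager)."""
    names = list(win_rates)
    weights = [pfsp_weight(win_rates[n]) for n in names]
    if sum(weights) == 0:              # the agent beats everyone: fall back to uniform
        return random.choice(names)
    return random.choices(names, weights=weights, k=1)[0]

# Example: the agent rarely beats "league_exploiter_3", so it is sampled most often.
opponents = {"main_agent_v1": 0.9, "main_exploiter_2": 0.55, "league_exploiter_3": 0.2}
print(sample_opponent(opponents))
```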
11.6 Emergent Tool Use From Multi-Agent Interaction (2019)
This section will focus on a particular reinforcement learning project carried out
by OpenAI titled “Emergent tool use from multi-agent interaction”. The associated
research paper is titled “Emergent Tool Use From Multi-Agent Autocurricula” [12].
It was widely commented on in the media and gained a lot of attention. It has been
inspiring to researchers, partially because of its core contributions and partially due
to the attractive presentation of the results.
In summary, OpenAI demonstrated that groups of agents placed into an envi-
ronment with simple rules may learn to display very complex emergent behavior.
The environment was both cooperative and adversarial, resembling a hide-and-seek
scenario. It can be found here1 . One group of agents was marked as seekers, whereas
the other group was marked as hiders. The agents cooperated within their group
1 https://github.com/openai/multi-agent-emergence-environments (Last accessed: 3 April 2024)
and were adversarial towards the other group. The goal of the seekers was to tag the
hiders by touching them. The experiments were run on various maps, differing by
the number of agents, available map objects, and their positions. At the beginning of
each scenario, the seekers were frozen for a certain amount of time, which prevented
the hiders from being easily captured.
The map objects (all of which can be seen in Figure 11.12) included:
• Boxes - that could be grabbed and moved by the agents into free space.
However, the agents could lock the boxes, provided that they were right next
to them, making them immovable. Once locked, boxes could be unlocked only
by agents of the same team that had locked them. The boxes came in two
shapes and sizes:
1. Cubes
2. Long wall-like rectangles
• Walls - which blocked any movement.
• Ramps - if placed next to boxes or walls, they allowed for jumping over them.
Ramps could also be locked by the hiders.
The agents were trained using a variant of actor-critic RL. Both value and
policy were represented by neural networks. The policy network was optimized by
Proximal policy optimization (PPO) [264] and Generalized Advantage Estimation
(GAE) [263].
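To give a sense of these two ingredients, below is a minimal NumPy sketch of Generalized Advantage Estimation and the clipped PPO surrogate loss. The episode layout, hyperparameter values, and plain-array implementation are illustrative assumptions; the actual project used full neural-network policies and large-scale distributed training.
```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finished episode.
    values has one extra entry: the bootstrap value of the final state."""
    advantages = np.zeros(len(rewards))
    last = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        advantages[t] = last
    return advantages

def ppo_clip_loss(new_logp, old_logp, advantages, eps=0.2):
    """Clipped PPO surrogate objective (returned as a loss to be minimized)."""
    ratio = np.exp(new_logp - old_logp)          # probability ratio of new vs old policy
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))
```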
It is interesting to report what kind of behavior could be observed after a particular
number of training episodes:
• Episode 0: the agents start as a tabula rasa. They act completely randomly.
• Episodes 0-2.69M: the seeking agents learn that they have to tag the hiders
and manage to do so regularly. The seekers have the advantage at this point.
Figure 11.10: The hiders learn to block the doors to the room. This figure depicts
one frame from the video that can be played on the OpenAI website, available at:
https://openai.com/research/emergent-tool-use (last accessed: 31 March 2024).
• Episodes 2.69M - 8.62M: the hiding agents learn to run towards the boxes at
the beginning of the game, grab them, and block the doors to the room while
staying inside. The hiders were safe. The final state is depicted in Figure 11.10.
Figure 11.11: The hiders learn to grab the ramp and lock it inside the room with
themselves. This figure depicts one frame from the video that can be played on the
OpenAI website, available at: https://openai.com/research/emergent-tool-use (last
accessed: 31 March 2024).
It is important to mention that the hiders had to cooperate and split up to bring
one box each. Otherwise, there would not be enough time to lock themselves
inside before the seekers could tag them.
• Episodes 8.62M - 14.5M: the seekers start to utilize the ramp available on the
map. One of them learned to grab it, move it towards the southernmost box,
blocking the door, and jump over it (as shown in Figure 11.11). The seekers
started to regain their advantage.
• Episodes 14.5M - 43.4M: the hiding agents learned to grab the ramp and
bring it to the room before blocking the entrances. Only one hider agent was
close enough to do it in time. The whole process required very efficient
movement and coordination. The seekers lost their advantage.
In another map featuring a larger space and more objects, the advantage was regularly
exchanged between the hiders and seekers. When one group learned a new strategy,
after a while, the other group would learn to counter it. For instance, the hiders
learned to make a shelter, as shown in Figure 11.12, between episodes 22M - 88M.
However, the seekers learned to pursue a completely unpredictable strategy
between episodes 388M and 458M. One of them would bring a box near a ramp and
use the ramp to jump on the box. It turns out that the agent’s movement, while staying
on top of the box, caused the box to move. This resulted in essentially "surfing"
on the box. The seeker then surfed near the shelter and jumped into it (depicted
in Fig. 11.13). This is very significant as it shows that reinforcement learning is
capable of exploiting the environment, and therefore testing it. The agents were
only provided with rewards and the actions they could perform. Such an approach can be
used to test virtual environments, e.g., for VR training (fighters, medics, soldiers,
etc.), for manufacturing, digital twins, games, and many others. This represents an
unprecedented level of quality assurance. Ultimately, the hiders learned to lock all
Figure 11.12: The hiders learn to build a shelter and stay inside. This figure depicts
one frame from the video that can be played on the OpenAI website, available at:
https://openai.com/research/emergent-tool-use (last accessed: 31 March 2024).
objects to prevent the seekers from surfing on them.
Figure 11.13: The seekers learn to “surf” on the box and ambush the hiders. This
figure depicts one frame from the video that can be played on the OpenAI website,
available at: https://openai.com/research/emergent-tool-use (last accessed: 31 March
2024).
11.7 AlphaFold (2020 breakthrough)
Proteins are complex molecules that consist of chains of amino acids. There are
many types of those amino acids, which determine the functions and chemical
properties of the molecules. Proteins play a key role in many biological processes
and are essential to life [134]. They are sometimes referred to as the “workhorses of
biology” [245]. For example, various types of proteins may catalyze biochemical
reactions, transport other molecules such as nutrients, be responsible for muscle
contraction, serve as hormones, regulate gene expression by binding to specific
DNA sequences, etc. A visualization of a protein structure from a scientific computer
program is shown in Figure 11.14.
Figure 11.14: Visualization of a protein structure fragment in an RNA methyltrans-
ferase protein. Created by Opabinia regalis and shared under the CC BY-SA 3.0
http://creativecommons.org/licenses/by-sa/3.0.
Prediction of the 3D structure of protein sequences is one of the fundamental
scientific problems. The input to such a prediction is a sequence of particular amino
acids based on which the protein will fold. This process, when performed manually
in a lab, is tedious and takes many intermediate steps. Historically, it took the
equivalent of one PhD to predict a single structure. It is known that there are billions
of unique protein sequences [194]. However, the structures of fewer than 0.01% of them
had been determined prior to AlphaFold2.
AlphaFold is an AI system developed by Google DeepMind for the long-standing
protein structure prediction problem. Although AlphaFold was first released in
2018, we will focus on the second version – AlphaFold2 – released in 2020, as it
was significantly more accurate and has become a breakthrough achievement. Its
technical details have been described in a 2021 article titled “Highly accurate protein
structure prediction with AlphaFold” [134]. It was authored by J. Jumper et al. and
published in Nature. AlphaFold2 was the first autonomous system whose accuracy on this
task was high enough to have a practical impact on biology and medicine.
Not only was it accurate in prediction, but it also offered additional advantages.
Firstly, it handled much more difficult cases in which no similar (homologous)
structures were available to guide the prediction process. Secondly, it required
significantly less time than manual prediction. The system was evaluated in the 14th
Critical Assessment of Protein Structure Prediction (CASP14) [218] competition.
Across all predictions, it achieved a median backbone accuracy of 0.96 Å r.m.s.d.95
(almost three times better than the next best solution) and a global distance test
(GDT) score of 92.4, which was the main measure used in the competition. This is a
similar quality of prediction to a laborious and time-consuming manual experiment
using X-ray crystallography. It was a significant improvement over the results from
CASP13 held 2 years earlier, when the top GDT score was 68.5 (also achieved
by AlphaFold but in its previous version). The CASP14 results were described as
"groundbreaking", "astounding", and "stunning". Following the success, DeepMind
partnered with the EMBL’s European Bioinformatics Institute to create a database of
accurate predictions made by AlphaFold. As of 2024, this database contains over
200 million entries.
11.7.1 AlphaFold Technology
This section will only summarize key aspects of the AlphaFold approach as the
system is very complex and going into details would require a dedicated article. A
high-level architecture of the model is presented in Figure 11.15. Readers interested
in delving deeper into the topic are advised to check [134].
AlphaFold combines existing knowledge databases with a sophisticated training
algorithm based on deep neural networks and a 3D reconstruction algorithm based
on inductive bias. The inductive bias refers to a set of assumptions or constraints
embedded within a model that favor solutions conforming to specific geometric
properties or structures. In other words, it reflects the prior knowledge or assumptions
about the geometric relationships or properties present in the data.
• Input data. There are four types of initial input data, which are subsequently
combined into two representations as shown in Fig. 11.15.
1. Naturally, the sequence of amino acids based on which the task is to
predict the protein structure is part of the input. It is expressed in a
specialized format and can be referred to as the raw input.
2. The next representation that extends the input sequence is the so-called
multiple sequence alignment (MSA), which is widely used in biology.
From its structure, sequence homology can be inferred. The MSA
representation is created by querying multiple genetic databases with the
sequence and embedding the obtained results. It is important to note that
unlike approaches such as AlphaZero or OpenAI Five, AlphaFold utilizes
existing knowledge and is trained in the supervised learning paradigm.
The MSA representation is 256 × N_seq dimensional.
3. The third type of input to the neural network part of the model is the
so-called pair representation, which is organized in the form of a matrix
and denotes the relationships between the residues (the individual units of
the structure). The pair representation serves as the initial representation,
which is iteratively refined to ultimately form the final structure. It is
Figure 11.15: A summary of the AlphaFold model. This figure was drawn inspired
by Fig.3 from [134].
128 × N_res dimensional.
4. The final type of input is structural templates, which are also retrieved
from numerous biological databases. The most similar templates of the
given MSA input (both in training and in inference) are searched, and
they serve as additional context for the network.
• EvoFormer is a novel neural network architecture proposed by DeepMind
based on a transformer (c.f. Section 11.1). The idea underpinning its name
is that the structure of the protein is iteratively constructed (formed) and
evolves over time. EvoFormer is a two-tower model, meaning that there are
two separate data processing stacks, which communicate with each other
only through a specific communication mechanism. The two stacks here are
referred to as MSA and Pair stacks based on the type of data they process.
The EvoFormer model is a deep network composed of N_block = 48 blocks.
Each block contains several layers, which are distinct per stack. Both stacks
utilize the attention mechanism. The MSA stack uses a row attention followed
by a column attention layer, which allow attention within each aligned sequence
and information exchange across the sequences, respectively.
Both types of attention utilize N_heads = 8 attention heads, and 32 keys, values,
and queries each. The pair block utilizes the so-called triangular self-attention,
which updates the pair representation based on the triangular topology of the
current protein structure. Each block also contains 1 and 4 dropout layers for
the MSA and pair stacks, respectively.
The MSA stack contains a 2-layer MLP called the transition layer, which is
its final layer. Its output is transformed by an outer product mean, and after
this operation, it is added to the pair representation that is passed as an input
to the pair stack. The final layer of the pair stack is also a transition layer.
After processing both stacks in a block, the outputs are combined into a single
representation using a linear transformation.
• Recycling is an important feature of the AlphaFold approach that significantly
improves its accuracy. In general, it consists of reusing the representation that
is produced by the final EvoFormer block as the input to the entire model in
the next iteration. Such a process is repeated a small number N_cycle of times
(e.g., three); a minimal sketch of this loop is given after this list.
• Structure model. The goal of the structure model is to transform the
representation used by EvoFormer into an actual protein structure in 3D space,
i.e., predict the 3D coordinates of each atom. It is an 8-layer neural network
with additional functions that preprocess data and preserve constraints. The
network predicts the vectors for translation and quaternions for rotation, which
are then fed to proper equations that put each atom into the correct place
relative to the structure.
• Training. The entire AlphaFold model was trained in a single supervised
learning process. 25% of the training samples came from the Protein Data
Bank (PDB) that contained structures discovered prior to AlphaFold. The
remaining 75% of the samples came from a self-distillation set, which was
generated using a relatively sophisticated process that involved using the model
trained only on the PDB samples on unlabelled instances filtered according to
various criteria. After the filtering, the self-distillation set contained 355,993
training instances. The authors define the so-called Frame Aligned Point
Error (FAPE) loss, which measures the deviation between the predicted structure and
the one given as a training example. The final loss used in training is a
linear combination of the FAPE loss and a few others such as the torsion angle
loss or the average cross-entropy for certain features of the structure. The
inclusion of specific losses in the linear combination depended on whether it
was the main training process of the model or fine-tuning.
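To make the recycling loop described above more tangible, here is a minimal sketch of AlphaFold-style inference with recycling. The `evoformer` and `structure_module` callables are hypothetical placeholders for the trained components described in [134], and the real system recycles several internal representations across its 48 EvoFormer blocks.
```python
def predict_structure(msa_repr, pair_repr, evoformer, structure_module, n_cycle=3):
    """Minimal sketch of AlphaFold-style recycling: the outputs of one pass are
    fed back as additional inputs to the next pass through the whole model.

    evoformer and structure_module are hypothetical callables standing in for
    the trained network components."""
    prev_msa, prev_pair, prev_coords = None, None, None
    for _ in range(n_cycle):
        msa_out, pair_out = evoformer(
            msa_repr, pair_repr,
            recycled=(prev_msa, prev_pair, prev_coords))
        coords = structure_module(msa_out, pair_out)
        prev_msa, prev_pair, prev_coords = msa_out, pair_out, coords
    return coords  # predicted 3D coordinates of each atom
```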
11.8 Chat-GPT and the Large Language Models Revolution
(2022+)
As of writing this book, large language models are the first thing that comes to mind
when thinking about the state of the art of artificial intelligence. They have been at the
forefront of this field. In particular, Chat-GPT [328], introduced in 2022, has been
a hot topic in both technical and casual discourse. Chat-GPT is a product offered by
OpenAI as a service, which can be used with various foundational models developed
by this company.
Let’s delve into more context on how Chat-GPT was created. Historically,
automatic language translation was one of the first use cases for pioneering LLMs
such as GPT-1 and BERT, described previously in this chapter, in Section 11.2. They
were direct successors of models tailored for machine translation but extended their
capabilities to other language-related tasks (as shown in Fig. 11.4). While LLMs
have been trained to process natural language, they were not designed as chatbots.
This started to change with the introduction of the GPT-2 model in 2019 [230].
Figure 11.16: A user interface of Chat-GPT for writing prompts. Last accessed:
April 2, 2024.
A paradigm shift that is important to notice in the history of AI concerns making
LLMs work with a natural language user interface as in the case of Chat-GPT.
Since its inception, users have been able to engage in a conversation with the model to make it
generate a desired response. They write the so-called prompts, which are tokenized,
embedded, positionally encoded, and passed as the input to the models. An example
of a user interface for entering prompts is shown in Figure 11.16. Based on the
response, the user might write a new prompt, possibly referring to the previous ones.
This resembles a natural conversation as could happen between two humans. The
way it works is that the history of prompts serves as a context that helps the model
provide better responses. In addition to this, most modern LLMs allow defining a
global context in the form of telling the model what persona (assistant) it should act
as, e.g., an English language expert or a financial advisor. However, each model has
a specific limit on the number of tokens it can use as a context. Therefore, it is not
always possible to retain details provided much earlier in the conversation or to process
very large texts (e.g., entire books) in one go. Let us now compare how responses
generated by Chat-GPT differ from the responses of ELIZA, the first chatbot, which
was introduced in Section 4.4. The user prompted ELIZA with “Men are all alike”,
to which the bot responded with “IN WHAT WAY”. Figure 11.17 shows the response
given by Chat-GPT to the same prompt. It is much more sophisticated and it appears
that Chat-GPT understands what the topic is about.
Figure 11.17: Chat-GPT response to a prompt “Men are all alike”.
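Returning to the context-window limitation discussed above, the sketch below shows how a chat client might keep only as much recent conversation history as fits into the model's token budget. The crude word-based `count_tokens` helper and the 4,096-token limit are assumptions made purely for illustration; real services rely on model-specific tokenizers and limits.
```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly one token per word.
    return len(text.split())

def build_context(history, new_prompt, max_tokens=4096):
    """Return the most recent messages (plus the new prompt) that fit the budget."""
    budget = max_tokens - count_tokens(new_prompt)
    kept = []
    for message in reversed(history):      # walk back from the newest message
        cost = count_tokens(message)
        if cost > budget:
            break                          # older messages are dropped
        kept.append(message)
        budget -= cost
    return list(reversed(kept)) + [new_prompt]
```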
Chat-GPT has become a phenomenon. It paved the way for LLMs to become an
integral part of our lives and has had a profound impact on society in many areas.
One of its first applications was in education. On the one hand, students and pupils
at lower education levels have been using it to help with their homework, e.g.,
writing essays. It can also explain complex topics and facilitate an interactive learning
experience. On the other hand, it has affected teachers in various ways. It forces them
to check the work submitted by students more thoroughly to verify that it was actually done
by the students themselves. However, teachers can also use Chat-GPT as an educational tool.
Programmers are a group that benefits from LLMs to a great extent. LLMs can help
write complex expressions or act as an extension of the documentation. However,
they make mistakes from time to time and have an inclination to “hallucinate”.
Therefore, it is important to have a programming background and use LLMs such as
Chat-GPT consciously as an aid rather than relying on them fully. They can also help
with learning new languages in an interactive fashion - both programming languages
and natural foreign languages.
The list of applications of Chat-GPT is endless. Some common tasks it can be used
for, which have not yet been mentioned, are:
• Creative writing.
• Spell checking.
• Making summaries, as long as the summarized text, when converted to tokens,
does not exceed the context length of the model.
• Ideas for presentations or businesses.
• Customer support.
• Assistance in making websites.
• Serving as a personal assistant.
• Question answering, fetching encyclopedic data.
• Performing market research.
Many companies and communities have set out on a mission to create their own
large language models. Examples of such models are listed in Table 11.1. However,
OpenAI remains one of the major players in this field. Their latest model,
GPT-4 [215], was introduced in 2023. It is the latest major successor of GPT-1, following
GPT-2, GPT-3, and GPT-3.5. For the most part, these models are rescaled and refined
versions of the architectures presented in Section 11.2. They have more trainable
parameters than their predecessors and can process more tokens at the same time.
Often, these multi-modal models involve sophisticated tuning algorithms for various
tasks. One of the recent trends in creating LLMs is the utilization of the so-called
mixture of experts (MOE) approach [242]. The basic idea behind it is to divide
the model into multiple “experts”, each of which specializes in a specific aspect or
subset of the data. These experts work together to collectively make predictions,
with their outputs combined in some way to produce the final result. Another trend
is called “Retrieval-Augmented Generation” [85]. It is a type of model architecture
designed to enhance the generation capabilities of language models by incorporating
a retrieval mechanism.
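To illustrate the mixture-of-experts idea mentioned above, the following toy sketch combines the outputs of several small "experts" using weights produced by a gating network. Real MoE language models route each token to only a few experts for efficiency; this dense, linear-expert version is a simplifying assumption made for brevity.
```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class TinyMoE:
    """Toy mixture of experts: each expert is a linear map, the gate mixes them."""
    def __init__(self, d_in, d_out, n_experts=4):
        self.experts = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]
        self.gate = rng.normal(size=(d_in, n_experts))

    def __call__(self, x):
        weights = softmax(x @ self.gate)                    # how much to trust each expert
        outputs = np.stack([x @ w for w in self.experts])   # (n_experts, d_out)
        return np.tensordot(weights, outputs, axes=1)       # weighted combination

moe = TinyMoE(d_in=8, d_out=3)
print(moe(rng.normal(size=8)))
```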
Model Name | Owner | Open-Source? | # of Parameters
GPT-4 | OpenAI | No | around 1.76T
BARD | Google | No | 1.6T
BERT | Google | Yes | around 340M
LLaMA | Meta AI | Yes | 65B
Claude 3 Opus | Anthropic | No | at least 175B
Mixtral 8X7B | Mistral AI | Yes | 45B (12B for inference)
Gemini | Google | No | ?
LaMDA | Google | No | 137B
Falcon 40B | Falcon LLM | Yes | 40B
Orca | Microsoft | Yes | 13B
Stable LM Zephyr | Stability AI | Yes | 3B
Table 11.1: Large Language Models maintained by various companies. These
models have been trained using extensive computational resources. The numbers of
parameters are approximations based on the available information.
The widespread adoption of LLMs for various tasks is arguably the first suc-
cessful implementation of the transfer learning idea that had been around in AI
research for a while [347]. In short, transfer learning concerns training a model for
one task and adapting it for a different task with little to no fine-tuning (orders of
magnitude faster than learning from scratch). LLMs are trained on extremely large
corpora of texts, where the objective is to essentially learn how to use a language -
what words should come after what words. Without focusing on any specific task,
LLMs can be fine-tuned for a lot of them.
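A minimal sketch of transfer learning in this spirit is shown below: a pretrained representation is kept frozen and only a small task-specific head is trained. The `pretrained_features` function is a hypothetical stand-in for a frozen pretrained encoder such as an LLM; in practice the features would come from the trained model rather than random numbers.
```python
import numpy as np

rng = np.random.default_rng(0)

def pretrained_features(texts):
    """Hypothetical stand-in for a frozen pretrained encoder (e.g., an LLM)."""
    return rng.normal(size=(len(texts), 16))

def train_head(features, labels, lr=0.1, epochs=200):
    """Train a tiny logistic-regression head on top of the frozen features."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        logits = features @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - labels                       # gradient of the cross-entropy loss
        w -= lr * features.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

texts = ["great movie", "terrible plot", "loved it", "boring"]
labels = np.array([1, 0, 1, 0])
w, b = train_head(pretrained_features(texts), labels)
```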
One must be aware that Chat-GPT and similar tools based on LLMs are not
infallible. Firstly, while various benchmarks and experiments have proven that they
can be very effective, they are still not perfect and there is room for improvement.
Secondly, they sometimes have a tendency to “hallucinate”, i.e., make up facts
and present them as though they were true. A disclaimer stating the possibility
of hallucination is typically included on the websites of the companies offering
LLM-based services. Thirdly, there exist certain types of tasks which have
proven difficult for LLMs, such as performing mathematical calculations, logical
reasoning, commonsense reasoning, and visual and long-term temporal reasoning.
11.8.1 Text to Image Models
GPT has proven to be such a powerful architecture that it has also been adopted
for solutions that generate images based on given prompts. The more detailed
the prompts are, the higher the chance that the output image will be original and
accurately follow the expectations. However, while the prompts should be detailed,
they should also be formulated clearly in simple language. A well-prepared prompt
will include the main subject of the image, the setting (background), the emotional
tone, as well as the artistic style (e.g., photorealistic or French impressionism). Some
of the most notable text-to-image models are:
• DALL-E [171] created by OpenAI in 2021. The name is a mix of the surname
of Salvador Dalí and the name of WALL-E, an animated robot created by Pixar.
Its successors, DALL-E 2 and DALL-E 3, were released in 2022 and 2023,
respectively. The first version was based on the GPT-3 model. It was trained on
millions of pairs of textual image descriptions and images using a combination
of supervised and self-supervised learning. The descriptions were processed
through the same pipeline as the pure-text sources (positionally encoded
embeddings). The images were cut into 4 × 4 patches and subsequently converted
into tokens by a variational autoencoder. Since the second version (DALL-E
2), the model also utilizes a technique called Contrastive Language-Image
Pre-training (CLIP) [336]. It was designed to efficiently represent images and
text in a unified embedding space, allowing it to train and perform various
tasks involving both modalities. DALL-E is capable of generating high-quality
images conditioned on the text prompts. It can be guided to apply a specific
style to an image including “photorealistic”, “anime”, “neon punk”, “pixel-
art”, and many others. It is also capable of manipulating objects in existing
images, generating images similar to a given one, or filling in empty spaces when
instructed to do so. Figure 11.19 shows an image generated by the DALL-E model.
• Stable Diffusion [246], which originally started as a research project titled
“Latent Diffusion” at Ludwig Maximilian University in Munich and Heidelberg
University. It was later developed by two companies: Runway and Stability AI.
The model was publicly released in August 2022. However, diffusion models
encompass a variety of models that utilize the concept of diffusion, not just
one proprietary model. An exemplar architecture of such models is presented
in Fig. 11.18. Training diffusion models involves a few steps (a simplified sketch follows this list):
1. First, a variational auto-encoder compresses images to a latent space.
2. Then, noise is iteratively added to the image.
3. The generative model is trained to reverse this diffusion process. It
learns to generate images by starting from a noisy image and iteratively
denoising it. The model is trained on a large dataset of images using
maximum likelihood estimation or other training objectives. Stable
Diffusion introduces stability regularization techniques to improve the
training stability and the quality of generated images.
Figure 11.18: An overview of the stable diffusion model. The “QKV” block denotes
cross-attention.
• Midjourney [101] created by Midjourney, Inc. The alpha version was released
in November 2022. Initially, users would enter their prompts by
communicating with a Discord bot hosted by the company. Midjourney offers
a large number of visual styles to choose from. The program has been praised
for generating very high quality images, often regarded as more stylish than
those of DALL-E. Figure 11.20 shows an image generated by the Midjourney
model.
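To make the denoising objective from the Stable Diffusion item above concrete, the sketch below adds Gaussian noise to a clean latent at a random timestep and scores a model on predicting that noise, which is the core training objective behind diffusion models. The linear noise schedule, the `noise_model` placeholder, and the single training step are simplifying assumptions rather than the actual Stable Diffusion implementation.
```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000                                   # number of diffusion timesteps
betas = np.linspace(1e-4, 0.02, T)         # simple linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)       # cumulative signal-retention factors

def noisy_sample(x0, t):
    """Forward diffusion: blend the clean latent x0 with Gaussian noise at step t."""
    noise = rng.normal(size=x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise
    return xt, noise

def training_step(x0, noise_model):
    """One denoising-objective step: predict the injected noise and measure the error.
    noise_model is a hypothetical callable standing in for the trained denoising network."""
    t = rng.integers(T)
    xt, noise = noisy_sample(x0, t)
    predicted = noise_model(xt, t)
    return np.mean((predicted - noise) ** 2)   # simple mean-squared-error objective
```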
While generative AI models have been universally praised for their technical
prowess, they also raise many concerns. For instance, it is unclear how to prevent people
from using these models for unethical and malicious purposes such as impersonation,
generating fake information, or creating sexually explicit, violent, or otherwise
disturbing content. These models have also been controversial with respect to
usage rights [111]. There have been cases of legal action against the companies
that own image-generating models [241]. For instance, a lawsuit against
Midjourney, Stability AI, DeviantArt and Runway AI was filed in November 2023 for
the unauthorized use of the works of 4,700 artists. It is not yet fully clear whether and under
Figure 11.19: An image generated by the DALL-E model given the prompt: “Cars
having a race on a road in a jungle (photorealistic)”.
what restrictions these models can be trained on copyrighted images. Moreover, it
is arguable how the output of these models should be copyrighted. At the time of
writing this book, the user that generates an image is free to use it commercially.
There are many aspects to consider, including that different entities use these models,
different entities own these models, and different entities own rights to the data that
contributed to training these models. Furthermore, what about a scenario when
two people generate two very similar images? This is far more likely to happen with short
prompts than it is for two artists to accidentally paint two similar images.
Another important topic is how generative AI models will impact the work of
artists, graphic designers, and other creative jobs [131]. On the one hand, there
are fears that the continuous development of more and more spectacular generative
AI models will make many graphics-related jobs replaceable or at least reduce the
demand for their services. On the other hand, they can empower artists by providing
them with new tools.
Figure 11.20: An image generated by the Midjourney model given the prompt:
“Open door to a magical castle (crayon art)”.
11.9 Nobel Prizes (2024)
In 2024, the Royal Swedish Academy of Sciences awarded two Nobel Prizes
to researchers who have made significant contributions to the field of artificial
intelligence.
As there is no specific category for AI or computer science, the first prize
was awarded in physics. The recipients were John Hopfield for his work on
physics-inspired Hopfield networks (see Section 5.10) and Geoffrey Hinton for
his foundational work on Boltzmann machines and deep learning contributions, in
general (see Sections 6.3 and 9.7).
The second Nobel Prize was awarded in chemistry to David Baker for computational
protein design, and to Demis Hassabis and John Jumper for their advancements in protein
structure prediction, exemplified by AlphaFold (see Section 11.7).
By honoring the pioneering contributions of AI researchers and engineers,
the 2024 Nobel Prizes lend further credibility and momentum to the ongoing AI
revolution. This recognition highlights the profound advancements made in the field
of AI and its increasingly central role in driving scientific progress and technological
innovation.
Conclusions
The history of AI encompasses various events and inventions. In this book, I have
carefully selected ninety-nine of them. These include mathematical theorems that
serve as the backbone of AI methods, specific algorithms and models, important
milestones, famous competitions and their winners, as well as seminal books and
research papers. Advancements in autonomous game-playing are particularly in-
teresting to me, as games have served as a testbed for AI. The sections have been
organized in chronological order, and each section of this book covers a specific
topic. We embarked on a journey through various periods. The first chapter provided
a historical perspective on automata and mathematical foundations, predating the
formal emergence of the AI term and the field. We traversed periods of early
enthusiasm, the creation of the first classifiers and clustering methods, and traditional
AI approaches based on symbols and logic. This period was followed by the
so-called AI winters. The subsequent era saw increased interest in computational
intelligence and machine learning methods, such as neural networks. The book
presented the evolution of neural networks with various architectures and models
proposed. However, it also included other machine learning models and non-ML
approaches. Whenever possible, technical details were included, making the book
useful as a handbook and a first source of reference. A rich bibliography helps
readers find related works to delve further into the topics.
Artificial Intelligence is a fascinating field of study that is hard to define, as it can
be challenging to draw a definitive line between a complex hand-crafted algorithm
and an AI algorithm. The term "AI" was coined in 1956, as presented in Section 3.4.
However, its foundations can be traced back to important mathematical contributions
such as Bayes’ theorem, statistical regression, and the chain rule for computing
derivatives of complex functions. AI is also associated with autonomous machines
such as intelligent robots. These first appeared in science fiction, but the earliest
automated mechanical contraptions predate the sci-fi genre.
AI is constantly growing and maturing as a field. For many years, it was mainly
limited to academia with little impact on our everyday lives. In the last few years,
progress has accelerated. While there is much more exposure to AI through news
articles and AI-powered products, the general level of understanding of its principles,
implications, and potential pitfalls still needs improvement. It is important to raise
awareness about both the limitations of current AI technologies and the implications
of using them. Responsibility is key. It is crucial to build reliable, safe, ethical,
and trustworthy AI systems. One of the many factors influencing whether a system
meets these criteria is our understanding of it – how it works, when it makes errors,
and how confident and reliable its output is. Trends such as eXplainable Artificial
Intelligence (XAI) and interpretable AI will likely become even more significant in
the future. There will be an urgent need to understand, monitor, and diagnose AI
systems not only during their creation but also after they have been deployed.
Additionally, it will become increasingly important to refine legislation to keep
pace with the rapid advancements in the field. This includes both the creation and
use of AI systems as well as issues broadly related to data. Some of these issues
are algorithmic biases, data protection and rights, fairness, privacy, accountability,
responsibility, misinformation, safety, alignment, diversity, impact on the job market
and unemployment, and implications for critical applications such as medicine
and the military. The proper growth of young humans and other societal impacts
are also concerns. In recent years, we have observed an unprecedented number
of various laws and guidelines related to AI. For example, the European Union
released the EU AI Act, which contains a detailed regulatory and legal framework
for AI in member countries. On March 13, 2024, it was accepted by the European
Parliament. Many countries outside of Europe, including Australia, Canada, China,
and the United States, have either prepared regulations that include AI or plan to do so.
Legal and social aspects will certainly define the near future of AI. On the
technological side, some predictions can also be made. Many researchers consider
quantum computing and quantum AI to be the next big breakthrough. Quantum
computers go beyond the classical computing paradigm based on binary states
and silicon chips. They take advantage of quantum mechanics effects such as
superposition and entanglement, allowing us to represent information in the form
of qubits, which exist in a superposition of two states. Certain problems, which
are suitable for a quantum computational framework, can be solved exponentially
faster than on traditional computers. If quantum computers become practically usable
(accessible, cheaper, with enough qubits, stable, and efficient in reading results), it is
very likely that new AI algorithms will emerge, both ones suited to this new type
of computer and ones that help optimize its hardware.
In the near future, however, it is safe to say that progress will continue in the
areas of multi-modal Large Language Models (LLMs). Researchers are constantly
looking for more efficient neural network architectures. Some recent ones include
Mamba [331] and KAN: Kolmogorov-Arnold Networks [162]. Another trend will
be developing new efficient ways of using LLMs and combining them with other
systems and techniques. It may be promising to combine large neural networks with
logical reasoning and rule-based systems that operate on a higher level of abstraction.
In my prediction, until the next big breakthrough, big companies and governments
will be in a race to create and train bigger and bigger LLMs. The focus will be
on multimodality (operating on various types of data such as text and videos)
and transfer learning (using trained models for various tasks with minimal or no
retraining).
Bibliography
[1] David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. “A Learning
Algorithm for Boltzmann Machines”. In: Cognitive Science 9.1 (1985),
pp. 147–169. doi: 10.1016/S0364-0213(85)80012-4.
[2] Amina Adadi and Mohammed Berrada. “Peeking Inside the Black-Box: A
Survey on Explainable Artificial Intelligence (XAI)”. In: IEEE Access 6
(2018), pp. 52138–52160. doi: 10.1109/ACCESS.2018.2870052.
[3] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. “Mining Association
Rules Between Sets of Items in Large Databases”. In: SIGMOD Record 22.2
(1993), 207–216. issn: 0163-5808. doi: 10.1145/170036.170072.
[4] Rakesh Agrawal and Ramakrishnan Srikant. “Fast Algorithms for Mining
Association Rules in Large Databases”. In: Proceedings of the 20th Inter-
national Conference on Very Large Data Bases. Ed. by Jorge B. Bocca,
Matthias Jarke, and Carlo Zaniolo. VLDB ’94. San Francisco, CA, USA:
Morgan Kaufmann Publishers Inc., 1994, 487–499. isbn: 1558601538.
[5] L. V. Allis. “Searching for Solutions in Games and Artificial Intelligence”.
PhD thesis. University of Limburg, 1994.
[6] Alexandros Altis et al. “Dihedral angle principal component analysis of
molecular dynamics simulations”. In: The Journal of Chemical Physics
126.24 (2007). doi: 10.1063/1.2746330.
[7] Mehdi Amiri and Seyfollah Soleimani. “ML-based group method of data
handling: An improvement on the conventional GMDH”. In: Complex &
Intelligent Systems 7 (2021), pp. 2949–2960. doi: 10.1007/s40747-021-
00480-0.
[8] McKinsey Analytics. “The age of analytics: competing in a data-driven
world”. In: McKinsey Global Institute Research (2016).
[9] David Arthur and Sergei Vassilvitskii. “K-means++ the advantages of careful
seeding”. In: Proceedings of the eighteenth annual ACM-SIAM symposium
on Discrete algorithms. Ed. by Harold Gabow. Society for Industrial and
Applied Mathematics, 2007, pp. 1027–1035.
[10] Roland G Austin. “Roman Board Games. II”. In: Greece & Rome 4.11
(1935), pp. 76–82.
[11] Eduardo M Azevedo et al. “Empirical Bayes Estimation of Treatment Effects
with Many A/B Tests: An Overview”. In: AEA Papers and Proceedings. Ed.
by William Johnson, Lars Vilhuber, and Tina Lippert. Vol. 109. American
Economic Association 2014 Broadway, Suite 305, Nashville, TN 37203.
2019, pp. 43–47. doi: 10.1257/pandp.20191003.
[12] Bowen Baker et al. “Emergent Tool Use From Multi-Agent Autocurricula”.
In: 8th International Conference on Learning Representations. 2020.
[13] Christoph Baur et al. “Autoencoders for Unsupervised Anomaly Segmen-
tation in Brain MR Images: A Comparative Study”. In: Medical Image
Analysis 69 (2021). doi: 10.1016/j.media.2020.101952.
[14] Thomas Bayes. “An Essay Towards Solving a Problem in the Doctrine of
Chances”. In: Biometrika 45.3-4 (1958). (reprint), pp. 296–315.
[15] David R Bellhouse. “The Reverend Thomas Bayes, FRS: A Biography to
Celebrate the Tercentenary of His Birth”. In: Statistical Science 19.1 (2004),
pp. 3–43.
[16] Richard Bellman. “A Markovian Decision Process”. In: Journal of Mathe-
matics and Mechanics 6.5 (1957), pp. 679–684.
[17] James Bennett and Stan Lanning. “The Netflix Prize”. In: Proceedings of
KDD Cup and Workshop. Ed. by Bing Liu et al. Vol. 2007. New York. 2007,
p. 35.
[18] Joseph Berkson. “Application of the Logistic Function to Bio-Assay”. In:
Journal of the American Statistical Association 39.227 (1944), pp. 357–365.
doi: 10.2307/2280041.
[19] Werner Binder. “AlphaGo’s Deep Play: Technological Breakthrough as
Social Drama”. In: The Cultural Life of Machine Learning (2021). Ed. by
Jonathan Roberge and Michael Castelle, pp. 167–195. doi: 10.1007/978-
3-030-56286-1\_6.
[20] Alan Blair and Abdallah Saffidine. “AI surpasses humans at six-player
poker”. In: Science 365.6456 (2019), pp. 864–865. doi: 10.1126/science.
aay777.
[21] Thomas Blaschke et al. “Application of Generative Autoencoder in De
Novo Molecular Design”. In: Molecular Informatics 37.1-2 (2018). doi:
10.1002/minf.201700123.
[22] Michael Bowling et al. “Heads-up limit hold’em poker is solved”. In:
Communications of the ACM 60.11 (2017), pp. 81–88. doi: 10.1145/
3131284.
[23] Michael Bowling et al. “Heads-up limit hold’em poker is solved”. In: Science
347 (6218 2015), pp. 145–149. doi: 10.1126/science.1259433.
[24] D. W. Bradley and A. M. Tyrrell. “Immunotronics : Hardware Fault Tolerance
Inspired by the Immune System”. In: Evolvable Systems: From Biology to
Hardware. Ed. by Julian Miller et al. Berlin, Heidelberg: Springer Berlin
Heidelberg, 2000, pp. 11–20. isbn: 978-3-540-46406-8.
[25] C. Breazeal. Designing Sociable Robots. Intelligent Robotics and Au-
tonomous Agents series. MIT Press, 2002. isbn: 9780262025102. url:
https://books.google.pl/books?id=uNeMEAAAQBAJ.
[26] Leo Breiman. “Random Forests”. In: Machine learning 45 (2001), pp. 5–32.
doi: 10.1023/A:1010933404324.
[27] Noam Brown and Tuomas Sandholm. “Libratus: The Superhuman AI for
No-Limit Poker”. In: Proceedings of the Twenty-Sixth International Joint
Conference on Artificial Intelligence. Ed. by Carles Sierra. 2017, pp. 5226–
5228. isbn: 978-0-9992411-0-3. doi: 10.24963/ijcai.2017/772.
[28] Noam Brown and Tuomas Sandholm. “Superhuman AI for multiplayer
poker”. In: Science 365 (6456 July 2019), pp. 885–890. doi: 10.1126/
science.aay2400.
[29] Noam Brown et al. “Combining Deep Reinforcement Learning and Search
for Imperfect-Information Games”. In: Proceedings of the 34th International
Conference on Neural Information Processing Systems. Ed. by H. Larochelle
et al. NIPS ’20. Curran Associates Inc., 2020. isbn: 9781713829546.
[30] Cameron B Browne et al. “A Survey of Monte Carlo Tree Search Methods”.
In: IEEE Transactions on Computational Intelligence and AI in Games 4.1
(2012), pp. 1–43. doi: 10.1109/TCIAIG.2012.2186810.
[31] Bruce Buchanan, Georgia Sutherland, and Edward A Feigenbaum. “Heuris-
tic DENDRAL: a Program for Generating Explanatory Hypotheses”. In:
Machine Learning and Heuristic Programming. Ed. by D. Mitchie and B.
Melter. Edinburgh University Press, 1969, pp. 209–254.
[32] Michael Buro and David Churchill. “Real-Time Strategy Game Competi-
tions”. In: AI Magazine 33.3 (2012), pp. 106–108. doi: 10.1609/aimag.
v33i3.2419.
[33] Baoping Cai, Lei Huang, and Min Xie. “Bayesian Networks in Fault Di-
agnosis”. In: IEEE Transactions on Industrial Informatics 13.5 (2017),
pp. 2227–2240. doi: 10.1109/TII.2017.2695583.
[34] Murray Campbell, A.Joseph Hoane, and Feng hsiung Hsu. “Deep Blue”.
In: Artificial Intelligence 134.1 (2002), pp. 57–83. issn: 0004-3702. doi:
10.1016/S0004-3702(01)00129-1.
[35] Murray Campbell, A Joseph Hoane Jr, and Feng-hsiung Hsu. “Deep Blue”.
In: Artificial Intelligence 134.1-2 (2002), pp. 57–83. doi: 10.1016/S0004-
3702(01)00129-1.
[36] Marcio Carvalho and Teresa B. Ludermir. “Particle Swarm Optimization
of Neural Network Architectures and Weights”. In: Proceedings of the 7th
International Conference on Hybrid Intelligent Systems. HIS ’07. IEEE
Computer Society, 2007, 336–339. isbn: 0769529461.
[37] Stefano Ceri, Georg Gottlob, Letizia Tanca, et al. “What you always wanted
to know about Datalog (and never dared to ask)”. In: IEEE Transactions on
Knowledge and Data Engineering 1.1 (1989), pp. 146–166. doi: 10.1109/
69.43410.
[38] Michal Čertickỳ and David Churchill. “The Current State of StarCraft
AI Competitions and Bots”. In: Proceedings of the AAAI Conference on
Artificial Intelligence and Interactive Digital Entertainment. Vol. 13. 2. 2017,
pp. 2–7. doi: 10.1609/aiide.v13i2.12961.
[39] Akhilesh Chander, Amitava Chatterjee, and Patrick Siarry. “A new social and
momentum component adaptive PSO algorithm for image segmentation”.
In: Expert Systems with Applications 38.5 (2011), pp. 4998–5004. doi:
10.1016/j.eswa.2010.09.151.
[40] John K Chapin and Miguel AL Nicolelis. “Principal component analysis
of neuronal ensemble activity reveals multidimensional somatosensory
representations”. In: Journal of Neuroscience Methods 94.1 (1999), pp. 121–
140. doi: 10.1016/s0165-0270(99)00130-2.
[41] Guillaume M. J. B. Chaslot, Mark H. M. Winands, and H. Jaap van den
Herik. “Parallel Monte-carlo Tree Search”. In: Computers and Games. Ed. by
H. Jaap van den Herik et al. Berlin, Heidelberg: Springer Berlin Heidelberg,
2008, pp. 60–71. isbn: 978-3-540-87608-3.
[42] Tianqi Chen and Carlos Guestrin. “XGBoost: A Scalable Tree Boosting Sys-
tem”. In: Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining. Ed. by Balaji Krishnapuram
et al. 2016, pp. 785–794. isbn: 9781450342322. doi: 10.1145/2939672.
2939785.
[43] Zhaomin Chen et al. “Autoencoder-based network anomaly detection”. In:
2018 Wireless Telecommunications Symposium (WTS). IEEE. 2018, pp. 1–5.
doi: 10.1109/WTS.2018.8363930.
[44] William Clark, Jan Golinski, and Simon Schaffer. The Sciences in Enlight-
ened Europe. University of Chicago Press, 1999.
[45] Kenneth Mark Colby. “Ten criticisms of parry”. In: ACM SIGART Bulletin
48 (1974), pp. 5–9. doi: 10.1145/1045200.1045202.
[46] Alberto Colorni, Marco Dorigo, Vittorio Maniezzo, et al. “Distributed
Optimization by Ant Colonies”. In: Proceedings of the First European
Conference on Artificial Life. Ed. by F Varela and P Bourgine. Vol. 142.
Paris, France. Elsevier Publishing, 1991, pp. 134–142.
[47] Corinna Cortes and Vladimir Vapnik. “Support-Vector Networks”. In: Ma-
chine Learning 20 (1995), pp. 273–297. doi: 10.1007/BF00994018.
[48] Rémi Coulom. “Efficient Selectivity and Backup Operators in Monte Carlo
Tree Search”. In: International Conference on Computers and Games. Ed. by
H. Jaap van den Herik, Paolo Ciancarini, and H. H. L. M. (Jeroen) Donkers.
Springer. Springer Berlin Heidelberg, 2006, pp. 72–83. doi: 10.1007/978-
3-540-75538-8\_7.
[49] David R Cox. “Two Further Applications of a Model for Binary Regression”.
In: Biometrika 45.3/4 (1958), pp. 562–565. doi: 10.2307/2333203.
[50] Jack L Crosby. Computer Simulation in Genetics. John Wiley Sons, 1973.
isbn: 978-0-471-18880-3.
[51] Michael A. Cusumano. “Self-Driving Vehicle Technology: Progress and
Promises”. In: Communications of the ACM 63.10 (2020), 20–22. issn:
0001-0782. doi: 10.1145/3417074.
[52] Charles Darwin. On the Origin of Species. London: John Murray, Albemarle
Street., 1859.
[53] Dipankar Dasgupta and Stephanie Forrest. “Artificial immune systems in
industrial applications”. In: Proceedings of the 2nd International Conference
on Intelligent Processing and Manufacturing of Materials, IPMM 1999. Ed.
by Marcello M. Veiga et al. Proceedings of the 2nd International Conference
on Intelligent Processing and Manufacturing of Materials, IPMM 1999.
Institute of Electrical and Electronics Engineers Inc., 1999, pp. 257–267.
doi: 10.1109/IPMM.1999.792486.
[54] Omid E David and Nathan S Netanyahu. “DeepPainter: Painter Classification
Using Deep Convolutional Autoencoders”. In: Artificial Neural Networks
and Machine Learning–ICANN 2016: 25th International Conference on
Artificial Neural Networks, Proceedings. Ed. by Alessandro E.P. Villa, Paolo
Masulli, and Antonio Javier Pons Rivero. Springer. 2016, pp. 20–28. isbn:
978-3-319-44781-0. doi: 10.1007/978-3-319-44781-0\_3.
[55] Edgar G Daylight. “Towards a Historical Notion of ‘Turing—the Father
of Computer Science’”. In: History and Philosophy of Logic 36.3 (2015),
pp. 205–228. doi: 10.1080/01445340.2015.1082050.
[56] Supriya Kumar De, Ranjit Biswas, and Akhil Ranjan Roy. “An Application
of Intuitionistic Fuzzy Sets in Medical Diagnosis”. In: Fuzzy Sets and Systems
117.2 (2001), pp. 209–213. doi: 10.1016/S0165-0114(98)00235-8.
[57] Leandro Nunes De Castro and Jon Timmis. “An Artificial Immune Net-
work for Multimodal Function Optimization”. In: Proceedings of the 2002
Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600).
Ed. by Russ Eberhart. Vol. 1. IEEE. 2002, pp. 699–704. doi: 10.1109/CEC.
2002.1007011.
[58] Arthur P Dempster, Nan M Laird, and Donald B Rubin. “Maximum Likeli-
hood from Incomplete Data via the EM Algorithm”. In: Journal of the Royal
Statistical Society: Series B (methodological) 39.1 (1977), pp. 1–22.
[59] Jia Deng et al. “ImageNet: A Large-Scale Hierarchical Image Database”. In:
2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE.
2009, pp. 248–255. doi: 10.1109/CVPR.2009.5206848.
[60] Jun Deng et al. “Semisupervised Autoencoders for Speech Emotion Recog-
nition”. In: IEEE/ACM Transactions on Audio, Speech, and Language
Processing 26.1 (2017), pp. 31–43. doi: 10.1109/TASLP.2017.2759338.
[61] Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers
for Language Understanding”. In: Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN,
USA, June 2-7, 2019, Volume 1 (Long and Short Papers). Ed. by Jill
Burstein, Christy Doran, and Thamar Solorio. Association for Computational
Linguistics, 2019, pp. 4171–4186. doi: 10.18653/V1/N19-1423.
[62] Ernst D Dickmanns and Alfred Zapp. “Autonomous High Speed Road Vehicle
Guidance by Computer Vision”. In: IFAC Proceedings Volumes 20.5 (1987),
pp. 221–226. issn: 1474-6670. doi: 10.1016/S1474-6670(17)55320-3.
[63] Marco Dorigo. “Optimization, Learning and Natural Algorithms”. PhD
thesis. Politecnico di Milano, 1992.
[64] Marco Dorigo. “Positive Feedback as a Search Strategy”. In: Technical
report (1991), pp. 91–16.
[65] Kenji Doya and Tadahiro Taniguchi. “Toward Evolutionary and Develop-
mental Intelligence”. In: Current Opinion in Behavioral Sciences 29 (2019),
pp. 91–96. doi: 10.1016/j.cobeha.2019.04.006.
[66] Dimiter Driankov, Hans Hellendoorn, and Michael Reinfrank. An Introduc-
tion to Fuzzy Control. Springer Berlin, Heidelberg, 2013. isbn: 978-3-540-
60691-8. doi: 10.1007/978-3-662-03284-8.
[67] R Geoff Dromey. “From Requirements to Design: Formalizing the Key
Steps”. In: Proceedings of the First International Conference on Software
Engineering and Formal Methods. IEEE. 2003, pp. 2–11. doi: 10.1109/
SEFM.2003.1236202.
[68] Russell C Eberhart, Yuhui Shi, and James Kennedy. Swarm Intelligence.
Elsevier, 2001. isbn: 978-1-55860-595-4. doi: 10.1016/B978-1-55860-
595-4.X5000-1.
[69] Sean Ekins et al. “Bayesian Models Leveraging Bioactivity and Cytotoxicity
Information for Drug Discovery”. In: Chemistry & Biology 20.3 (2013),
pp. 370–378.
[70] Lasse Espeholt et al. “IMPALA: Scalable Distributed Deep-RL with Impor-
tance Weighted Actor-Learner Architectures”. In: International conference
on machine learning. Ed. by Jennifer Dy and Andreas Krause. PMLR. 2018,
pp. 1407–1416.
[71] J Doyne Farmer, Norman H Packard, and Alan S Perelson. “The Immune
System, Adaptation, and Machine Learning”. In: Physica D: Nonlinear
Phenomena 22.1-3 (1986), pp. 187–204.
[72] David A Ferrucci. “Introduction to “This is Watson””. In: IBM Journal of
Research and Development 56.3.4 (2012), pp. 1–15. doi: 10.1147/JRD.
2012.2184356.
[73] Richard E. Fikes and Nils J. Nilsson. “Strips: A new approach to the
application of theorem proving to problem solving”. In: Artificial Intelligence
2.3 (1971), pp. 189–208. issn: 0004-3702. doi: 10.1016/0004-3702(71)
90010-5.
[74] Hilmar Finnsson and Yngvi Björnsson. “CadiaPlayer: A Simulation-Based
General Game Player”. In: IEEE Transactions on Computational Intelligence
and AI in Games 1.1 (2009), pp. 4–15. doi: 10 . 1109 / TCIAIG . 2009 .
2018702.
[75] Ronald A Fisher. “The Use of Multiple Measurements in Taxonomic Prob-
lems”. In: Annals of Eugenics 7.2 (1936), pp. 179–188. doi: 10.1111/j.
1469-1809.1936.tb02137.x.
[76] Evelyn Fix and Joseph Lawson Hodges. Discriminatory Analysis. Non-
parametric Discrimination: Consistency Properties. 11. USAF School of
Aviation Medicine, 1951.
[77] Bernhard Flury. Common principal components & related multivariate
models. John Wiley & Sons, Inc., 1988. isbn: 978-0-471-63427-0.
[78] Lawrence J. Fogel, Alvin J. Owens, and Michael J. Walsh. Artificial Intelli-
gence Through Simulated Evolution. John Wiley & Sons Inc, 1966. isbn:
0471265160.
[79] Jodi Forlizzi and Carl DiSalvo. “Service Robots in the Domestic Environment:
A Study of the Roomba Vacuum in the Home”. In: Proceedings of the 1st
ACM SIGCHI/SIGART Conference on Human-Robot Interaction. Ed. by
Michael A. Goodrich, Alan C. Schultz, and David J. Bruemmer. 2006,
pp. 258–265. doi: 10.1145/1121241.1121286.
[80] Guido KW Frank et al. “Altered Temporal Difference Learning in Bulimia
Nervosa”. In: Biological Psychiatry 70.8 (2011), pp. 728–735. doi: 10.
1016/j.biopsych.2011.05.011.
[81] Alex Fraser and Donald Burnell. Computer Models in Genetics. Mcgraw-Hill
Book Co., New York., 1970. isbn: 0070219044.
[82] Yoav Freund and Robert E Schapire. “A Decision-Theoretic Generalization of
On-Line Learning and an Application to Boosting”. In: Journal of Computer
and System Sciences 55.1 (1997), pp. 119–139. doi: 10.1006/jcss.1997.
1504.
[83] David Gamez, Zafeirios Fountas, and Andreas K Fidjeland. “A Neurally
Controlled Computer Game Avatar With Humanlike Behavior”. In: IEEE
Transactions on Computational Intelligence and AI in Games 5.1 (2012),
pp. 1–14. doi: 10.1109/TCIAIG.2012.2228483.
[84] Chao Gao, Kei Takada, and Ryan Hayward. “Hex 2018: MoHex3HNN over
DeepEzo”. In: Journal of International Computer Games Association 41.1
(2019), pp. 39–42. doi: 10.3233/ICG-190094.
[85] Yunfan Gao et al. “Retrieval-Augmented Generation for Large Language
Models: A Survey”. In: arXiv preprint arXiv:2312.10997 (2023).
[86] Erann Gat. “Three-layer architectures”. In: Artificial Intelligence and Mobile
Robots: Case Studies of Successful Robot Systems. Cambridge, MA, USA:
MIT Press, 1998, 195–210. isbn: 0262611376.
[87] Sylvain Gelly et al. “The Grand Challenge of Computer Go: Monte Carlo
Tree Search and extensions”. In: Communications ACM 55.3 (Mar. 2012),
pp. 106–113. issn: 0001-0782. doi: 10.1145/2093548.2093574.
[88] Michael Genesereth, Nathaniel Love, and Barney Pell. “General game
playing: Overview of the AAAI competition”. In: AI magazine 26.2 (2005),
pp. 62–62. doi: 10.1609/aimag.v26i2.1813.
[89] FA Gers, J Schmidhuber, and F Cummins. “Learning to Forget: Continual
Prediction with LSTM”. In: Neural Computation 12.10 (2000), pp. 2451–
2471. doi: 10.1162/089976600300015015.
[90] Ingo Glöckner. Fuzzy Quantifiers: A Computational Theory. Vol. 193.
Springer Berlin, Heidelberg, 2008. isbn: 978-3-540-29634-8. doi: 10.1007/3-540-32503-4.
[91] Ben Goertzel. “Artificial General Intelligence: Concept, State of the Art, and
Future Prospects”. In: Journal of Artificial General Intelligence 5.1 (2014),
pp. 1–48. doi: 10.2478/jagi-2014-0001.
[92] Oded Goldreich. P, NP, and NP-Completeness: The Basics of Computational
Complexity. Cambridge University Press, 2010. isbn: 9781139490092.
[93] Yue-Jiao Gong et al. “An Efficient Resource Allocation Scheme Using
Particle Swarm Optimization”. In: IEEE Transactions on Evolutionary Com-
putation 16.6 (2012), pp. 801–816. doi: 10.1109/TEVC.2012.2185052.
[94] Ian Goodfellow et al. “Generative adversarial nets”. In: Advances in neural
information processing systems 27 (2014).
[95] M. Gori, G. Monfardini, and F. Scarselli. “A New Model for Learning in
Graph Domains”. In: Proceedings. 2005 IEEE International Joint Conference
on Neural Networks, 2005. Vol. 2. 2005, pp. 729–734. doi: 10.1109/IJCNN.
2005.1555942.
[96] Alex Graves and Jürgen Schmidhuber. “Framewise Phoneme Classification
with Bidirectional LSTM and Other Neural Network Architectures”. In:
Neural Networks 18.5-6 (2005), pp. 602–610. doi: 10.1016/j.neunet.
2005.06.042.
[97] Jie Gui et al. “A Review on Generative Adversarial Networks: Algorithms,
Theory, and Applications”. In: IEEE Transactions on Knowledge and Data
Engineering 35.4 (2023), pp. 3313–3332. doi: 10.1109/TKDE.2021.3130191.
[98] Güven Güzeldere and Stefano Franchi. “Dialogues with colorful “person-
alities” of early AI”. In: Stanford Humanities Review 4.2 (1995), pp. 161–
169.
[99] Tuomas Haarnoja et al. “Soft Actor-Critic Algorithms and Applications”. In:
arXiv preprint arXiv:1812.05905 (2018).
[100] Tuomas Haarnoja et al. “Soft Actor-Critic: Off-Policy Maximum Entropy
Deep Reinforcement Learning with a Stochastic Actor”. In: International
Conference on Machine Learning. Ed. by Jennifer G. Dy and Andreas Krause.
PMLR. 2018, pp. 1856–1865.
[101] Mark Hachman. Midjourney’s enthralling AI art generator goes live for ev-
eryone. https://www.pcworld.com/article/820518/midjourneys-
ai-art-goes-live-for-everyone.html. Accessed: 2024-04-04.
[102] Jacques Hadamard. Mémoire sur le problème d’analyse relatif à l’équilibre
des plaques élastiques encastrées. Vol. 33. Imprimerie Nationale, 1908,
pp. 515–629.
[103] MA Hannan et al. “Capacitated vehicle-routing problem model for scheduled
solid waste collection and route optimization using PSO algorithm”. In:
Waste Management 71 (2018), pp. 31–41. issn: 0956-053X. doi: 10.1016/
j.wasman.2017.10.019.
[104] Nikolaus Hansen and Andreas Ostermeier. “Adapting Arbitrary Normal
Mutation Distributions in Evolution Strategies: The Covariance Matrix Adap-
tation”. In: Proceedings of IEEE International Conference on Evolutionary
Computation. Ed. by Toshio Fukuda and Takeshi Furuhashi. IEEE. 1996,
pp. 312–317. doi: 10.1109/ICEC.1996.542381.
[105] Vicki L Hanson. Celebrating 50 Years of the Turing Award. 2017.
[106] David Harrison Jr and Daniel L Rubinfeld. “Hedonic Housing Prices and
the Demand for Clean Air”. In: Journal of Environmental Economics and
Management 5.1 (1978), pp. 81–102. doi: 10.1016/0095-0696(78)90006-2.
[107] Peter E Hart, Nils J Nilsson, and Bertram Raphael. “A Formal Basis for the
Heuristic Determination of Minimum Cost Paths”. In: IEEE Transactions on
Systems Science and Cybernetics 4.2 (1968), pp. 100–107. doi: 10.1109/
TSSC.1968.300136.
[108] Sergiu Hart and Andreu Mas-Colell. “A Simple Adaptive Procedure Leading
to Correlated Equilibrium”. In: Econometrica 68.5 (2000), pp. 1127–1150.
doi: 10.1111/1468-0262.00153.
[109] John A Hartigan. Bayes Theory. Springer Science & Business Media, 2012.
[110] Abdirahman Hashi and Yakup Genc. “Approximate Temporal Conditioned
Autoencoder and Regressor for Sales Prediction”. In: 2020 7th International
Conference on Electrical and Electronics Engineering (ICEEE). IEEE. 2020,
pp. 347–350. isbn: 978-1-7281-6787-9. doi: 10.1109/ICEEE49618.2020.
9102518.
[111] Carol M Hayes. “Law and Ethics of Generative Artificial Intelligence and
Copyright”. In: Future of Information and Communication Conference. Ed.
by Kohei Arai. Springer. Springer Nature Switzerland, 2024, pp. 576–591.
doi: 10.1007/978-3-031-53963-3_40.
[112] Barbara Hayes-Roth. “A Blackboard Architecture for Control”. In: Artificial
Intelligence 26.3 (1985), pp. 251–321. doi: 10.1016/B978-0-934613-
63-7.50052-8.
[113] Kaiming He and Jian Sun. “Convolutional Neural Networks at Constrained
Time Cost”. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. IEEE, 2015, pp. 5353–5360. doi: 10.1109/CVPR.
2015.7299173.
[114] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In:
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.
[115] Donald Olding Hebb. The Organization of Behavior: A Neuropsychological
Theory. Psychology Press, 2005. isbn: 978-0805843002.
[116] James C Hendee. “An Expert System For Marine Environmental Monitoring
In The Florida Keys National Marine Sanctuary And Florida Bay”. In: WIT
Transactions on Ecology and the Environment 18 (1998), pp. 57–66. doi:
10.2495/CENV980061.
[117] Rob High. “The Era of Cognitive Systems: An Inside Look at IBM Watson
and How it Works”. In: IBM Corporation, Redbooks 1 (2012), p. 16.
[118] Geoffrey E Hinton. “Training Products of Experts by Minimizing Contrastive
Divergence”. In: Neural Computation 14.8 (2002), pp. 1771–1800. doi:
10.1162/089976602760128018.
[119] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-term Memory”. In:
Neural Computation 9 (Dec. 1997), pp. 1735–80. doi: 10.1162/neco.
1997.9.8.1735.
[120] John H Holland. Adaptation in Natural and Artificial Systems: An In-
troductory Analysis with Applications to Biology, Control, and Artificial
Intelligence. MIT press, 1992 (1975 reprint). isbn: 9780262581110.
[121] John H Holland. “Genetic Algorithms and the Optimal Allocation of Trials”.
In: SIAM Journal on Computing 2.2 (1973), pp. 88–105. doi: 10.1137/
0202009.
[122] John H Holland. “Outline for a Logical Theory of Adaptive Systems”. In:
Journal of the ACM (JACM) 9.3 (1962), pp. 297–314. doi: 10 . 1145 /
321127.321128.
[123] John J Hopfield. “Neural networks and physical systems with emergent
collective computational abilities”. In: Proceedings of the National Academy
of Sciences 79.8 (1982), pp. 2554–2558. doi: 10.1073/pnas.79.8.2554.
[124] Feng-hsiung Hsu, Murray S Campbell, and A Joseph Hoane Jr. “Deep Blue
System Overview”. In: Proceedings of the 9th International Conference
on Supercomputing. Ed. by Mateo Valero. ACM, 1995, pp. 240–244. doi:
10.1145/224538.224567.
[125] Guan-Yu Hu et al. “A method for predicting the network security situation
based on hidden BRB model and revised CMA-ES algorithm”. In: Applied
Soft Computing 48 (2016), pp. 404–418. doi: 10.1016/j.asoc.2016.05.
046.
[126] Eyke Hüllermeier. “Does Machine Learning Need Fuzzy Logic?” In: Fuzzy
Sets and Systems 281 (2015), pp. 292–299. doi: 10.1016/j.fss.2015.09.001.
[127] Eyke Hüllermeier. “Fuzzy Methods in Machine Learning and Data Mining:
Status and Prospects”. In: Fuzzy Sets and Systems 156.3 (2005). 40th
Anniversary of Fuzzy Sets, pp. 387–406. issn: 0165-0114. doi: 10.1016/j.fss.2005.05.036.
[128] Ross Ihaka and Robert Gentleman. “R: A Language for Data Analysis and
Graphics”. In: Journal of Computational and Graphical Statistics 5.3 (1996),
pp. 299–314. doi: 10.2307/1390807.
[129] Aleksey Grigorevich Ivakhnenko and Valentin Grigorevich Lapa. Cybernetic
Predicting Devices. Joint Publications Research Service, 1965.
[130] Alexey Grigorevich Ivakhnenko. “The Group Method of Data Handling - A Rival of the Method of Stochastic Approximation”. In: Soviet Automatic Control 1.3 (1968), pp. 43–55.
[131] Harry H Jiang et al. “AI Art and its Impact on Artists”. In: Proceedings of
the 2023 AAAI/ACM Conference on AI, Ethics, and Society. Ed. by Francesca
Rossi et al. Association for Computing Machinery, 2023, pp. 363–374. isbn:
9798400702310. doi: 10.1145/3600211.3604681.
[132] Ian T Jolliffe. Principal Component Analysis for Special Types of Data.
Springer, 2002. isbn: 978-0-387-95442-4. doi: 10.1007/0-387-22440-8_13.
[133] Norman Jouppi et al. “Motivation for and Evaluation of the First Tensor
Processing Unit”. In: IEEE Micro 38.3 (2018), pp. 10–19. doi: 10.1109/
MM.2018.032271057.
[134] John Jumper et al. “Highly accurate protein structure prediction with Al-
phaFold”. In: Nature 596.7873 (2021), pp. 583–589. doi: 10.1038/s41586-
021-03819-2.
[135] R. E. Kalman. “A New Approach to Linear Filtering and Prediction Prob-
lems”. In: Journal of Basic Engineering 82.1 (1960), pp. 35–45. issn:
0021-9223. doi: 10.1115/1.3662552.
[136] Behzad Kamgar-Parsi and Behrooz Kamgar-Parsi. “On problem solving
with Hopfield neural networks”. In: Biological Cybernetics 62.5 (1990),
pp. 415–423. doi: 10.1007/BF00197648.
[137] Jérôme Henri Kämpf and Darren Robinson. “A hybrid CMA-ES and HDE
optimisation algorithm with application to solar energy potential”. In: Applied
Soft Computing 9.2 (2009), pp. 738–745. doi: 10.1016/j.asoc.2008.09.
009.
[138] Murat Karabatak and M Cevdet Ince. “An expert system for detection of
breast cancer based on association rules and neural network”. In: Expert
Systems with Applications 36.2 (2009), pp. 3465–3469.
[139] Henry J Kelley. “Gradient Theory of Optimal Flight Paths”. In: ARS Journal 30.10 (1960), pp. 947–954. doi: 10.2514/8.5282.
[140] James Kennedy and Russell Eberhart. “Particle Swarm Optimization”. In:
Proceedings of International Conference on Neural Networks (ICNN’95).
Vol. 4. 1. 1995, pp. 1942–1948. doi: 10.1109/ICNN.1995.488968.
[141] James Kennedy and Russell C Eberhart. “A Discrete Binary Version of the
Particle Swarm Algorithm”. In: 1997 IEEE International Conference on
Systems, Man, and Cybernetics. Computational Cybernetics and Simulation.
Vol. 5. IEEE. 1997, pp. 4104–4108. doi: 10.1109/ICSMC.1997.637339.
[142] Mostepha R Khouadjia et al. “A comparative study between dynamic
adapted PSO and VNS for the vehicle routing problem with dynamic
requests”. In: Applied Soft Computing 12 (4 2012), pp. 1426–1439. doi:
10.1016/j.asoc.2011.10.023.
[143] Jeong-Jung Kim and Ju-Jang Lee. “Trajectory Optimization With Par-
ticle Swarm Optimization for Manipulator Motion Planning”. In: IEEE
transactions on Industrial Informatics 11.3 (2015), pp. 620–631. doi:
10.1109/TII.2015.2416435.
[144] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic
Optimization”. In: International Conference on Learning Representations (ICLR 2015). Ed. by Yoshua Bengio and Yann LeCun. 2015.
[145] Scott Kirkpatrick, C Daniel Gelatt Jr, and Mario P Vecchi. “Optimization
by Simulated Annealing”. In: Science 220.4598 (1983), pp. 671–680. doi:
10.1126/science.220.4598.671.
[146] Ronald Kline. “Cybernetics, Automata Studies, and the Dartmouth Confer-
ence on Artificial Intelligence”. In: IEEE Annals of the History of Computing
33.4 (2010), pp. 5–16. doi: 10.1109/MAHC.2010.44.
[147] Donald E Knuth and Ronald W Moore. “An analysis of alpha-beta pruning”.
In: Artificial intelligence 6.4 (1975), pp. 293–326. issn: 0004-3702. doi:
10.1016/0004-3702(75)90019-3.
[148] Levente Kocsis and Csaba Szepesvári. “Bandit Based Monte-Carlo Plan-
ning”. In: Machine Learning: ECML 2006. Ed. by Johannes Fürnkranz,
Tobias Scheffer, and Myra Spiliopoulou. Berlin, Heidelberg: Springer Berlin
Heidelberg, 2006, pp. 282–293. isbn: 978-3-540-46056-5.
[149] Philipp Körner et al. “Fifty Years of Prolog and Beyond”. In: Theory and
Practice of Logic Programming 22.6 (2022), pp. 776–858. doi: 10.1017/
S1471068422000102.
[150] Mark A Kramer. “Nonlinear Principal Component Analysis Using Autoas-
sociative Neural Networks”. In: AIChE journal 37.2 (1991), pp. 233–243.
doi: 10.1002/aic.690370209.
[151] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classi-
fication with Deep Convolutional Neural Networks”. In: Advances in Neural
Information Processing Systems. Ed. by F. Pereira et al. Vol. 25. Curran
Associates, Inc., 2012.
[152] Casimir A Kulikowski. “Artificial intelligence methods and systems for
medical consultation”. In: IEEE Transactions on Pattern Analysis and
Machine Intelligence PAMI-2.5 (1980), pp. 464–476. doi: 10.1109/TPAMI.
1980.6592368.
[153] Kristofer D Kusano et al. “Comparison of Waymo Rider-Only Crash
Data to Human Benchmarks at 7.1 Million Miles”. In: arXiv preprint
arXiv:2312.12675 (2023).
[154] Yann LeCun et al. “Backpropagation Applied to Handwritten Zip Code
Recognition”. In: Neural Computation 1.4 (1989), pp. 541–551. doi: 10.
1162/neco.1989.1.4.541.
[155] Yann LeCun et al. “Gradient-Based Learning Applied to Document Recog-
nition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324. doi:
10.1109/5.726791.
[156] Kang Yoon Lee and Junhewk Kim. “Artificial Intelligence Technology
Trends and IBM Watson References in the Medical Field”. In: Korean
Medical Education Review 18.2 (2016), pp. 51–57. doi: 10.17496/kmer.
2016.18.2.51.
[157] Suchan Lee et al. “Evolving Gaits for Physical Robots with the HyperNEAT
Generative Encoding: The Benefits of Simulation”. In: Applications of Evo-
lutionary Computation. Ed. by Anna I. Esparcia-Alcázar. Berlin, Heidelberg:
Springer Berlin Heidelberg, 2013, pp. 540–549. isbn: 978-3-642-37192-9.
[158] Chenbin Li, Guohua Zhan, and Zhihua Li. “News Text Classification Based
on Improved Bi-LSTM-CNN”. In: 2018 9th International Conference on
Information Technology in Medicine and Education (ITME). Ed. by Qun
Jin, Ning Jin, and Enrico Haemmerle. IEEE. 2018, pp. 890–893. doi:
10.1109/ITME.2018.00199.
[159] Erik Lindholm and Stuart Oberman. “The NVIDIA GeForce 8800 GPU”.
In: 2007 IEEE Hot Chips 19 Symposium (HCS). IEEE. 2007, pp. 1–17. doi:
10.1109/HOTCHIPS.2007.7482490.
[160] Robert K Lindsay et al. “DENDRAL: a case study of the first expert system
for scientific hypothesis formation”. In: Artificial intelligence 61.2 (1993),
pp. 209–261. doi: 10.1016/0004-3702(93)90068-M.
[161] Jinduo Liu et al. “EC-GAN: Inferring Brain Effective Connectivity via
Generative Adversarial Networks”. In: The Thirty-Fourth AAAI Conference
on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Appli-
cations of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI
Symposium on Educational Advances in Artificial Intelligence, EAAI 2020,
New York, NY, USA, February 7-12, 2020. AAAI Press, 2020, pp. 4852–4859.
doi: 10.1609/AAAI.V34I04.5921.
[162] Ziming Liu et al. “KAN: Kolmogorov-Arnold Networks”. In: arXiv preprint
arXiv:2404.19756 (2024).
[163] Stuart Lloyd. “Least squares quantization in PCM”. In: IEEE Transactions
on Information Theory 28.2 (March 1982), pp. 129–137. doi: 10.1109/
TIT.1982.1056489.
[164] Bruce Lowerre and R Reddy. “The Harpy Speech Recognition System:
performance with large vocabularies”. In: The Journal of the Acoustical
Society of America 60.S1 (1976), s10–s11. doi: 10.1121/1.2003089.
[165] Yunqian Ma and Guodong Guo. Support Vector Machines Applications.
Vol. 649. Springer Cham, 2014. isbn: 978-3-319-02299-4. doi: 10.1007/
978-3-319-02300-7.
[166] J. B. MacQueen. “Some Methods for Classification and Analysis of MultiVari-
ate Observations”. In: Proc. of the fifth Berkeley Symposium on Mathematical
Statistics and Probability. Ed. by L. M. Le Cam and J. Neyman. Vol. 1.
University of California Press, 1967, pp. 281–297.
[167] Spyros Makridakis. “The Forthcoming Artificial Intelligence (AI) Revolution:
Its Impact on Society and Firms”. In: Futures 90 (2017), pp. 46–60. issn:
0016-3287. doi: 10.1016/j.futures.2017.03.006.
[168] Jacek Mandziuk. Knowledge-Free and Learning-Based Methods in Intelligent
Game Playing. Vol. 276. Studies in Computational Intelligence. Springer,
2010. isbn: 978-3-642-11677-3. doi: 10.1007/978-3-642-11678-0.
[169] Jacek Mańdziuk. “Solving the N-Queens problem with a binary Hopfield-type
network: Synchronous and asynchronous model”. In: Biological Cybernetics
72 (5 1995), pp. 439–445. doi: 10.1007/BF00201419.
[170] George A Marcoulides and Zvi Drezner. “Model Specification Searches Us-
ing Ant Colony Optimization Algorithms”. In: Structural Equation Modeling
10.1 (2003), pp. 154–164. doi: 10.1207/S15328007SEM1001_8.
[171] Gary Marcus, Ernest Davis, and Scott Aaronson. “A very preliminary
analysis of DALL-E 2”. In: arXiv preprint arXiv:2204.13807 (2022).
[172] John Markoff. “Behind Artificial Intelligence, a Squadron of Bright Real
People”. In: The New York Times 14 (2005).
[173] Marcella SR Martins et al. “PSO with path relinking for resource allocation
using simulation optimization”. In: Computers & Industrial Engineering
65.2 (2013), pp. 322–330. doi: 10.1016/j.cie.2013.02.004.
[174] Michael Maschler, Shmuel Zamir, and Eilon Solan. Game Theory. Cam-
bridge University Press, 2013. isbn: 9781107005488. doi: 10 . 1017 /
CBO9780511794216.
[175] Michael L Mauldin. “ChatterBots, TinyMuds, and the Turing Test: Entering
the Loebner Prize Competition”. In: Proceedings of the Twelfth AAAI
National Conference on Artificial Intelligence. Ed. by Barbara Hayes-Roth
and Richard E. Korf. Vol. 12. AAAI Press, 1994, pp. 16–21. doi: 10.5555/
2891730.2891733.
[176] John McCarthy. “Chess as the Drosophila of AI”. In: Computers, Chess, and
Cognition. Ed. by T. Anthony Marsland and Jonathan Schaeffer. Springer,
1990, pp. 227–237. isbn: 978-1-4613-9080-0.
[177] John McCarthy. “History of LISP”. In: History of Programming Lan-
guages. Association for Computing Machinery, 1978, pp. 173–185. isbn:
0127450408. doi: 10.1145/800025.1198360.
[178] John McCarthy. “Programs with Common Sense”. In: Proceedings of the
Teddington Conference on the Mechanization of Thought Processes. Her
Majesty’s Stationary Office, 1959, pp. 75–91.
[179] John McCarthy and Patrick J. Hayes. “Some Philosophical Problems from
the Standpoint of Artificial Intelligence”. In: Machine Intelligence. Ed. by
B. Meltzer and D. Michie. Vol. 4. Edinburgh University Press, 1969, pp. 463–
502.
[180] John McCarthy et al. “A Proposal for the Dartmouth Summer Research
Project on Artificial Intelligence, August 31, 1955”. In: AI magazine 27.4
(2006), pp. 12–12. doi: 10.1609/aimag.v27i4.1904.
[181] Warren S McCulloch and Walter Pitts. “A Logical Calculus of the Ideas
Immanent in Nervous Activity”. In: The Bulletin of Mathematical Biophysics
5 (1943), pp. 115–133. doi: 10.1007/BF02478259.
[182] Kamyar Mehran. “Takagi-Sugeno Fuzzy Modeling for Process Control”.
In: Industrial Automation, Robotics and Artificial Intelligence 262 (2008),
pp. 1–31.
[183] Andrew Melnik et al. “Face Generation and Editing with StyleGAN: A
Survey”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence
46 (2024), pp. 3557–3576. doi: 10.1109/TPAMI.2024.3350004.
[184] Rui Mendes, James Kennedy, and José Neves. “The Fully Informed Particle
Swarm: Simpler, Maybe Better”. In: IEEE transactions on evolutionary
computation 8.3 (2004), pp. 204–210. doi: 10.1109/TEVC.2004.826074.
[185] Qinxue Meng et al. “Relational Autoencoder for Feature Extraction”. In:
2017 International Joint Conference on Neural Networks (IJCNN). IEEE.
2017, pp. 364–371. doi: 10.1109/IJCNN.2017.7965877.
[186] Jose M Merigo and Montserrat Casanovas. “The Fuzzy Generalized OWA
Operator and Its Application in Strategic Decision Making”. In: Cybernetics
and Systems 41.5 (2010), pp. 359–370. doi: 10.1080/01969722.2010.
486223.
[187] Vangelis Metsis, Ion Androutsopoulos, and Georgios Paliouras. “Spam
Filtering with Naive Bayes - Which Naive Bayes?” In: CEAS - The Third
Conference on Email and Anti-Spam. Vol. 17. Mountain View, CA. 2006,
pp. 28–69.
[188] Donald Michie. “Game-Playing and Game-Learning Automata”. In: Ad-
vances in Programming and Non-Numerical Computation. Ed. by L. Fox. El-
sevier, 1966, pp. 183–200. doi: 10.1016/B978-0-08-011356-2.50011-
2.
[189] Risto Miikkulainen et al. “Evolving Deep Neural Networks”. In: Artificial
Intelligence in the Age of Neural Networks and Brain Computing. Elsevier,
2024, pp. 269–287. doi: 10.1016/B978-0-12-815480-9.00015-3.
[190] Tomas Mikolov et al. “Distributed Representations of Words and Phrases
and their Compositionality”. In: Advances in Neural Information Processing
Systems 26 (2013).
[191] Tomas Mikolov et al. “Efficient Estimation of Word Representations in
Vector Space”. In: arXiv preprint arXiv:1301.3781 (2013).
[192] Michael B Miller. Mathematics and Statistics for Financial Risk Manage-
ment. John Wiley & Sons, 2013. isbn: 9781118750292. doi: 10.1002/
9781118819616.
[193] Debashis Mishra et al. “Medical Image Thresholding Using Particle Swarm
Optimization”. In: Intelligent Computing, Communication and Devices.
Ed. by Lakhmi C. Jain, Srikanta Patnaik, and Nikhil Ichalkaranje. New Delhi:
Springer India, 2015, pp. 379–383. isbn: 978-81-322-2012-1.
[194] Alex L Mitchell et al. “MGnify: the microbiome analysis resource in
2020”. In: Nucleic Acids Research 48.D1 (2020), pp. D570–D578. doi:
10.1093/nar/gkz1035.
[195] Volodymyr Mnih et al. “Playing Atari with Deep Reinforcement Learning”.
In: arXiv preprint arXiv:1312.5602 (2013).
[196] Ethan Modi and Kymberly Acuna. “The Effects of Computer and AI Engines
on Competitive Chess”. In: Journal of Student Research 12.3 (2023). issn:
2167-1907.
[197] Tima Pakfetrat et al., eds. Interpreting Financial Time Series with SHAP
Values. 2019, pp. 166–172. doi: 10.5555/3370272.3370290.
[198] Michael E Moran. “The da Vinci Robot”. In: Journal of Endourology 20.12
(2006), pp. 986–990.
[199] Hans P Moravec. “The Stanford Cart and the CMU Rover”. In: Proceedings
of the IEEE 71.7 (1983), pp. 872–884. doi: 10.1109/PROC.1983.12684.
[200] Matej Moravčík et al. “DeepStack: Expert-level Artificial Intelligence in
heads-up no-limit Poker”. In: Science 356.6337 (2017), 508–513. issn:
1095-9203. doi: 10.1126/science.aam6960.
[201] In Jae Myung. “Tutorial on maximum likelihood estimation”. In: Journal of
Mathematical Psychology 47.1 (2003), pp. 90–100. doi: 10.1016/S0022-
2496(02)00028-7.
[202] Yu Nasu. “Efficiently Updatable Neural-Network-based Evaluation Functions
for Computer Shogi”. In: The 28th World Computer Shogi Championship
Appeal Document 185 (2018).
[203] Radford M Neal and Geoffrey E Hinton. “A View of the Em Algorithm
that Justifies Incremental, Sparse, and other Variants”. In: Learning in
Graphical Models. Ed. by Michael Jordan. Springer, 1998, pp. 355–368.
isbn: 978-94-010-6104-9. doi: 10.1007/978-94-011-5014-9_12.
[204] Allen Newell and Herbert A. Simon. “GPS, a program that simulates human
thought”. In: Computers and Thought. Ed. by E. A. Feigenbaum and J.
Feldman. McGraw-Hill, 1963, pp. 279–293. isbn: 0-262-62101-0.
[205] Allen Newell and Herbert Alexander Simon. “Computer Simulation of
Human Thinking”. In: Science 134.3495 (1961), pp. 2011–2017.
[206] Ahmad Nickabadi, Mohammad Mehdi Ebadzadeh, and Reza Safabakhsh.
“Particle Swarm Optimization with Adaptive Inertia Weight”. In: Applied
Soft Computing 11.4 (2011), pp. 3658–3670. doi: 10.1016/j.asoc.2011.
01.037.
[207] Ronen Nir, Alexander Shleyfman, and Erez Karpas. “Automated Synthesis
of Social Laws in STRIPS”. In: Proceedings of the AAAI Conference on
Artificial Intelligence. Vol. 34. 06. 2020, pp. 9941–9948. doi: 10.1609/
aaai.v34i06.6549.
[208] Kumar Nishant et al. “Load Balancing of Nodes in Cloud Using Ant Colony
Optimization”. In: 2012 UKSim 14th International Conference on Computer
Modelling and Simulation. Ed. by David Al-Dabass. IEEE. 2012, pp. 3–8.
doi: 10.1109/UKSim.2012.11.
[209] Yael Niv, Michael O Duff, and Peter Dayan. “Dopamine, uncertainty and
TD learning”. In: Behavioral and Brain Functions 1.6 (2005), pp. 1–9. doi:
10.1186/1744-9081-1-6.
[210] Junhyuk Oh et al. “Self-Imitation Learning”. In: International Conference
on Machine Learning. Ed. by Jennifer Dy and Andreas Krause. PMLR. 2018,
pp. 3878–3887.
[211] Santiago Ontanón. “The Combinatorial Multi-Armed Bandit Problem and
Its Application to Real-Time Strategy Games”. In: Proceedings of the AAAI
Conference on Artificial Intelligence and Interactive Digital Entertainment.
Ed. by Gita Sukthankar and Ian Horswill. Vol. 9. 1. 2013, pp. 58–64. doi:
10.1609/aiide.v9i1.12681.
[212] Santiago Ontanón et al. “A Survey of Real-Time Strategy Game AI Research
and Competition in StarCraft”. In: IEEE Transactions on Computational
Intelligence and AI in games 5.4 (2013), pp. 293–311. doi: 10 . 1109 /
TCIAIG.2013.2286295.
[213] OpenAI. OpenAI-Five Web Page. Accessed on March 31, 2024. 2018. url:
https://openai.com/research/openai-five.
[214] OpenAI et al. Dota 2 with Large Scale Deep Reinforcement Learning. 2019.
arXiv: 1912.06680. url: https://arxiv.org/abs/1912.06680.
[215] OpenAI et al. GPT-4 Technical Report. 2024. arXiv: 2303.08774 [cs.CL].
[216] Jeff Orkin. “Agent Architecture Considerations for Real-Time Planning in
Games”. In: Proceedings of the AAAI Conference on Artificial Intelligence
and Interactive Digital Entertainment. Ed. by Michael Young and John Laird.
Vol. 1. 1. 2005, pp. 105–110. isbn: 978-1-57735-235-1.
[217] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. Morgan kaufmann, 1988.
[218] Joana Pereira et al. “High-accuracy protein structure prediction in CASP14”.
In: Proteins: Structure, Function, and Bioinformatics 89.12 (2021), pp. 1687–
1699. doi: 10.1002/prot.26171.
[219] Diego Perez-Liebana et al. “The 2014 General Video Game Playing Com-
petition”. In: IEEE Transactions on Computational Intelligence and AI in
Games 8.3 (2015), pp. 229–243. doi: 10.1109/TCIAIG.2015.2402393.
[220] Rolf Pfeifer and Christian Scheier. Understanding Intelligence. MIT press,
2001. isbn: 9780262661256.
[221] Martin Pincus. “A Monte Carlo Method for the Approximate Solution
of Certain Types of Constrained Optimization Problems”. In: Operations
Research 18.6 (1970), pp. 1225–1228. doi: 10.1287/opre.18.6.1225.
[222] Jacques Pitrat. “A Program to Learn to Play Chess”. In: Pattern Recognition
and Artificial Intelligence (1976), pp. 399–419.
[223] Edgar Allan Poe. Complete Essays, Literary Criticism, Cryptography, Au-
tography, Translations & Letters: The Philosophy of Composition, The
Rationale of Verse, The Poetic Principle, Old English Poetry, Maelzel’s
Chess Player, Eureka, The Literati of New York, Fifty Suggestions, Exordium,
Marginalia. . . Musaicum Books, 2017.
[224] David MW Powers. “The Total Turing Test and the Loebner Prize”. In:
Proceedings of the Joint Conferences on New Methods in Language Pro-
cessing and Computational Natural Language Learning. Sydney, Aus-
tralia: Association for Computational Linguistics, 1998, pp. 279–280. doi:
10.5555/1603899.1603947.
[225] Sameer Pradhan et al. “Shallow Semantic Parsing using Support Vector
Machines”. In: Proceedings of the Human Language Technology Confer-
ence of the North American Chapter of the Association for Computational
Linguistics: HLT-NAACL 2004. Ed. by Julia B. Hirschberg et al. 2004,
pp. 233–240.
[226] Laurent Pueyo. “Detection and Characterization of Exoplanets using Pro-
jections on Karhunen-Loeve Eigenimages: Forward Modeling”. In: The
Astrophysical Journal 824.2 (2016), p. 29. doi: 10.3847/0004- 637X/
824/2/117.
[227] J Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann,
1993. isbn: 978-1-55860-240-3.
[228] J. Ross Quinlan. “Induction of Decision Trees”. In: Machine learning 1
(1986), pp. 81–106. doi: 10.1007/BF00116251.
[229] Alec Radford et al. Improving Language Understanding by Generative Pre-
Training. 2018. url: https://api.semanticscholar.org/CorpusID:
49313245.
[230] Alec Radford et al. “Language Models are Unsupervised Multitask Learners”.
In: OpenAI blog 1.8 (2019), p. 9.
[231] Anna Radzikowska. “A three-valued approach to default logic”. In: Journal
of Applied Non-Classical Logics 6.2 (1996), pp. 149–190. doi: 10.1080/
11663081.1996.10510876.
[232] Pranav Rajpurkar et al. “SQuAD: 100,000+ Questions for Machine Com-
prehension of Text”. In: Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing. Ed. by Jian Su, Kevin Duh, and
Xavier Carreras. Austin, Texas: Association for Computational Linguistics,
Nov. 2016, pp. 2383–2392. doi: 10.18653/v1/D16-1264.
[233] Elisabeth Rakus-Andersson. “The Brains behind the Enigma Code Breaking
before the Second World War”. In: Mathematics and War. Ed. by Elisabeth
Rakus-Andersson. Springer Basel AG, 2003, pp. 83–102. isbn: 978-3-7643-
1634-1. doi: 10.1007/978-3-0348-8093-0_3.
[234] Hubert Ramsauer et al. “Hopfield Networks is All You Need”. In: arXiv
preprint arXiv:2008.02217 (2020).
[235] Ingo Rechenberg. “Cybernetic Solution Path of an Experimental Problem”. In: Royal Aircraft Establishment Library Translation 1122 (1965).
[236] Bruce Cameron Reed. The History and Science of the Manhattan Project.
Springer, 2014. isbn: 978-3-662-58174-2. doi: 10.1007/978- 3- 662-
58175-9.
[237] Joseph Reisinger et al. “Coevolving Strategies for General Game Playing”.
In: 2007 IEEE Symposium on Computational Intelligence and Games. IEEE.
IEEE Computer Society, 2007, pp. 320–327. doi: 10.1109/CIG.2007.
368115.
[238] L.M. Rasdi Rere, Mohamad Ivan Fanany, and Aniati Murni Arymurthy.
“Simulated Annealing Algorithm for Deep Learning”. In: Procedia Com-
puter Science 72 (2015). Proceedings of the Third Information Systems
International Conference 2015, pp. 137–144. issn: 1877-0509. doi: 10.
1016/j.procs.2015.12.114.
[239] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “"Why Should I
Trust You?": Explaining the Predictions of Any Classifier”. In: Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining. Association for Computing Machinery, 2016, pp. 1135–
1144. isbn: 9781450342322. doi: 10.1145/2939672.2939778.
[240] Mark Richards. The Stanford Cart Photograph. https://www.computerhistory.
org/revolution/artificial- intelligence- robotics/13/293/
1277. Accessed: 2024-03-03.
[241] Gary Rinkerman. “Artificial Intelligence and evolving issues under US
copyright and patent law”. In: Interactive Entertainment Law Review 6.2
(2023), pp. 48–65. doi: 10.4337/ielr.2023.0002.
[242] Carlos Riquelme et al. “Scaling Vision with Sparse Mixture of Experts”.
In: Advances in Neural Information Processing Systems 34 (2021). Ed. by
M. Ranzato et al., pp. 8583–8595.
[243] Irina Rish. “An empirical study of the naive Bayes classifier”. In: IJCAI
2001 Workshop on Empirical Methods in Artificial Intelligence. Vol. 3. 22.
2001, pp. 41–46.
[244] B Robertie. “Carbon versus silicon: Matching wits with TD-Gammon”. In:
Inside Backgammon 2.2 (1992), pp. 14–22.
[245] Richard J Roberts. “How restriction enzymes became the workhorses of
molecular biology”. In: Proceedings of the National Academy of Sciences
102.17 (2005), pp. 5905–5908. doi: 10.1073/pnas.0500923102.
[246] Robin Rombach et al. “High-Resolution Image Synthesis with Latent Diffu-
sion Models”. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. 2022, pp. 10684–10695. doi: 10.1109/
CVPR52688.2022.01042.
[247] Tord Romstad, Marco Costalba, and Joona Kiiski. Stockfish. https://
stockfishchess.org/ (Last accessed: 5 May 2024). 2008.
[248] Frank Rosenblatt. Perceptron, a Perceiving and Recognizing Automaton
(Project Para). Report: Cornell Aeronautical Laboratory. Cornell Aeronauti-
cal Laboratory, 1957.
[249] Frank Rosenblatt. “The perceptron: a probabilistic model for information
storage and organization in the brain”. In: Psychological Review 65.6 (1958),
p. 386. doi: 10.1037/h0042519.
[250] Mark Rosheim. Leonardo’s Lost Robots. Springer Science & Business Media,
2006.
[251] Philippe Roussel. “Definition et traitement de l’egalite formelle en demon-
stration automatique”. PhD thesis. Université d’Aix, 1972.
[252] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. “Learning
representations by back-propagating errors”. In: Nature 323.6088 (1986),
pp. 533–536. doi: 10.1038/323533a0.
[253] Earl D Sacerdoti. “A Structure for Plans and Behavior”. Technical Report.
PhD thesis. Department of Computer Science, Stanford University, 1975.
[254] GM Matthew Sadler. “The TCEC20 Computer Chess Superfinal: A perspec-
tive”. In: Journal of International Computer Games Association 43.1 (2021),
pp. 74–87. doi: 10.3233/ICG-210184.
[255] Konstantinos Sagonas, Terrance Swift, and David Scott Warren. “An Abstract
Machine for Computing the Well-Founded Semantics”. In: Proceedings of the
1996 Joint International Conference and Symposium on Logic Programming.
Ed. by Michael Maher. MIT Press, 1996, pp. 274–288. isbn: 9780262291309.
doi: 10.7551/mitpress/4302.003.0027.
[256] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. “Restricted Boltz-
mann Machines for Collaborative Filtering”. In: Proceedings of the 24th
International Conference on Machine Learning. Ed. by Zoubin Ghahra-
mani. Association for Computing Machinery, 2007, pp. 791–798. isbn:
9781595937933. doi: 10.1145/1273496.1273596.
[257] Ian Sample. “AI Becomes Grandmaster in “Fiendishly Complex” StarCraft
II”. In: The Guardian (2019).
[258] Arthur L Samuel. “Some Studies in Machine Learning Using the Game
of Checkers”. In: IBM Journal of Research and Development 3.3 (1959),
pp. 210–229. doi: 10.1147/rd.33.0210.
[259] F. Scarselli et al. “Graph Neural Networks for Ranking Web Pages”. In:
The 2005 IEEE/WIC/ACM International Conference on Web Intelligence
(WI’05). Ed. by Andrzej Skowron et al. 2005, pp. 666–672. doi: 10.1109/
WI.2005.67.
[260] Franco Scarselli et al. “The Graph Neural Network Model”. In: IEEE
Transactions on Neural Networks 20.1 (2008), pp. 61–80. doi: 10.1109/
TNN.2008.2005605.
[261] Jonathan Schaeffer et al. “Checkers is Solved”. In: Science 317.5844 (2007),
pp. 1518–1522. doi: 10.1126/science.1144079.
[262] Julian Schrittwieser et al. “Mastering Atari, Go, chess and shogi by planning
with a learned model”. In: Nature 588.7839 (2020), pp. 604–609. doi:
10.1038/s41586-020-03051-4.
[263] John Schulman et al. “High-Dimensional Continuous Control Using General-
ized Advantage Estimation”. In: Proceedings of the International Conference
on Learning Representations (ICLR). Ed. by Yoshua Bengio and Yann LeCun.
2016.
[264] John Schulman et al. “Proximal Policy Optimization Algorithms”. In: arXiv
preprint arXiv:1707.06347 (2017).
[265] Hans-Paul Schwefel. “Kybernetische Evolution als Strategie der exper-
imentellen Forschung in der Stromungstechnik”. PhD thesis. Technical
University of Berlin, 1965.
[266] Guna Seetharaman, Arun Lakhotia, and Erik Philip Blasch. “Unmanned
Vehicles Come of Age: The DARPA Grand Challenge”. In: Computer 39.12
(2006), pp. 26–29. doi: 10.1109/MC.2006.447.
[267] Marwin HS Segler, Mike Preuss, and Mark P Waller. “Planning chemical
syntheses with deep neural networks and symbolic AI”. In: Nature 555.7698
(2018), pp. 604–610. doi: 10.1038/nature25978.
[268] Anjaneyulu Babu Shaik and Sujatha Srinivasan. “A Brief Survey on Random
Forest Ensembles in Classification Model”. In: International Conference
on Innovative Computing and Communications. Ed. by Siddhartha Bhat-
tacharyya et al. Singapore: Springer Singapore, 2019, pp. 253–260. isbn:
978-981-13-2354-6.
[269] Sharath M Shankaranarayana and Davor Runje. “ALIME: Autoencoder
Based Approach for Local Interpretability”. In: Intelligent Data Engineering
and Automated Learning–IDEAL 2019. Ed. by Hujun Yin et al. Springer.
2019, pp. 454–463. isbn: 978-3-030-33607-3.
[270] Claude E Shannon. “A Chess-Playing Machine”. In: Scientific American 182.2 (1950), pp. 48–51.
[271] Claude Elwood Shannon. “A Mathematical Theory of Communication”.
In: The Bell System Technical Journal 27.3 (1948), pp. 379–423. doi:
10.1002/j.1538-7305.1948.tb01338.x.
[272] KV Shihabudheen and Gopinatha N Pillai. “Recent Advances in Neuro-Fuzzy
System: A Survey”. In: Knowledge-Based Systems 152 (2018), pp. 136–162.
doi: 10.1016/j.knosys.2018.04.014.
[273] Ravid Shwartz-Ziv and Amitai Armon. “Tabular Data: Deep Learning is
Not All You Need”. In: Information Fusion 81 (2022), pp. 84–90. doi:
10.1016/j.inffus.2021.11.011.
[274] Sima Siami-Namini, Neda Tavakoli, and Akbar Siami Namin. “The Perfor-
mance of LSTM and BiLSTM in Forecasting Time Series”. In: 2019 IEEE
International Conference on Big Data. Ed. by Chaitanya Baru. IEEE. 2019,
pp. 3285–3292. doi: 10.1109/BigData47090.2019.9005997.
[275] David Silver et al. “A general reinforcement learning algorithm that masters
chess, shogi, and Go through self-play”. In: Science 362.6419 (2018),
pp. 1140–1144. doi: 10.1126/science.aar6404.
[276] David Silver et al. “Mastering the Game of Go with Deep Neural Networks
and Tree Search”. In: Nature 529.7587 (2016), pp. 484–489. doi: 10.1038/
nature16961.
[277] Autonomous Cars: Recent Developments, Challenges, and Possible Solutions.
Vol. 1022. 1. IOP Publishing. 2021, p. 012028. doi: 10.1088/1757-899X/1022/1/012028.
[278] Paul Smolensky. “Information Processing in Dynamical Systems: Founda-
tions of Harmony Theory”. In: Parallel Distributed Processing: Exploration
in the Microstructure of Cognition. Ed. by Rumelhart D.E. et al. MIT Press,
1986.
[279] Raymond Smullyan. Gödel’s incompleteness theorems. Oxford University
Press, 1992.
[280] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. “Highway
Networks”. In: arXiv preprint arXiv:1505.00387 (2015).
[281] Tom Standage. The Turk: The Life and Times of the Famous Eighteenth-
Century Chess-Playing Machine. Berkley Trade, 2003. isbn: 0425190390.
[282] Kenneth O Stanley and Risto Miikkulainen. “Evolving Neural Networks
through Augmenting Topologies”. In: Evolutionary Computation 10.2 (2002),
pp. 99–127. doi: 10.1162/106365602320169811.
[283] Jonathan M Stokes et al. “A Deep Learning Approach to Antibiotic Discov-
ery”. In: Cell 180.4 (2020), pp. 688–702. doi: 10.1016/j.cell.2020.01.
021.
[284] Eliza Strickland. “IBM Watson, Heal thyself: How IBM Overpromised
and Underdelivered on AI Health Care”. In: IEEE Spectrum 56.4 (2019),
pp. 24–31. doi: 10.1109/MSPEC.2019.8678513.
[285] Eliza Strickland. “The Turbulent Past and Uncertain Future of Artificial
Intelligence”. In: IEEE Spectrum 30 (2021).
[286] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 1995.
[287] Joe Sullivan and William H Woodall. “A comparison of fuzzy forecasting
and Markov modeling”. In: Fuzzy Sets and Systems 64.3 (1994), pp. 279–293.
doi: 10.1016/0165-0114(94)90152-X.
[288] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. “Axiomatic Attribution
for Deep Network”. In: Proceedings of the 34th International Conference on
Machine Learning. Ed. by Doina Precup and Yee Whye Teh. Vol. 70. PMLR.
2017, pp. 3319–3328. doi: 10.5555/3305890.3306024.
[289] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to Sequence
Learning with Neural Networks”. In: Advances in Neural Information
Processing Systems. Ed. by Z. Ghahramani et al. Vol. 2. MIT Press, 2014,
pp. 3104–3112.
[290] Richard S Sutton. “Learning to predict by the methods of temporal dif-
ferences”. In: Machine Learning 3 (1988), pp. 9–44. doi: 10 . 1007 /
BF00115009.
[291] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998. isbn: 978-0-262-19398-6.
[292] Maciej Świechowski. “Deep Learning and Artificial General Intelligence:
Still a Long Way to Go”. In: Proceedings of the 3rd Polish Conference on
Artificial Intelligence PP-RAI’2022. 2022. isbn: 978-83-7421-401-8.
[293] Maciej Świechowski. “Game AI Competitions: Motivation for the Imitation
Game-Playing Competition”. In: 2020 15th Conference on Computer Science
and Information Systems (FedCSIS). Ed. by Maria Ganzha, Leszek Maciaszek,
and Marcin Paprzycki. IEEE. IEEE, 2020, pp. 155–160. doi: 10.15439/2020F126.
[294] Maciej Świechowski and Jacek Mańdziuk. “Self-Adaptation of Playing
Strategies in General Game Playing”. In: IEEE Transactions on Com-
putational Intelligence and AI in Games 6.4 (2014), pp. 367–381. doi:
10.1109/TCIAIG.2013.2275163.
[295] Maciej Świechowski et al. “Monte Carlo Tree Search: A Review of Recent
Modifications and Applications”. In: Artificial Intelligence Review 56 (2023),
2497–2562. doi: 10.1007/s10462-022-10228-y.
[296] Maciej Świechowski et al. “Recent Advances in General Game Playing”. In:
The Scientific World Journal 2015 (2015). doi: 10.1155/2015/986262.
[297] Kei Takada, Hiroyuki Iizuka, and Masahito Yamamoto. “Reinforcement
Learning to Create Value and Policy Functions Using Minimax Tree Search
in Hex”. In: IEEE Transactions on Games 12.1 (2019), pp. 63–73. doi:
10.1109/TG.2019.2893343.
[298] Austin Tate. Project Planning Using a Hierarchic Non-Linear Planner. Tech-
nical Report. Department of Artificial Intelligence, University of Edinburgh,
1976.
[299] Gerald Tesauro. “Neurogammon: A Neural-Network Backgammon Program”.
In: 1990 IJCNN International Joint Conference on Neural Networks. Ed. by
Maureen Caudill. IEEE. 1990, pp. 33–39. doi: 10.1109/IJCNN.1990.
137821.
[300] Gerald Tesauro. “Practical Issues in Temporal Difference Learning”. In:
Advances in Neural Information Processing Systems 4 (1991). Ed. by J.
Moody, S. Hanson, and R.P. Lippmann. doi: 10.1007/BF00992697.
[301] Gerald Tesauro. “Temporal Difference Learning and TD-Gammon”. In:
Communications of the ACM 38.3 (1995), pp. 58–68. doi: 10.1145/203330.203343.
[302] Gerry Tesauro et al. “Analysis of Watson’s Strategies for Playing Jeopardy!”
In: Journal of Artificial Intelligence Research 47 (2013), pp. 205–251. doi:
10.1613/jair.3834.
[303] Sergios Theodoridis. Machine Learning: a Bayesian and Optimization
Perspective. Academic press, 2015. isbn: 0128015225.
[304] Sebastian Thrun. Probabilistic Robotics. MIT Press, 2005. isbn: 9780262201629.
[305] Sebastian Thrun et al. “Stanley: The Robot that Won the DARPA Grand
Challenge”. In: Journal of Field Robotics 23.9 (2006), pp. 661–692. doi:
10.1002/rob.20147.
[306] Alan M Turing. “Computing Machinery and Intelligence”. In: Mind 59.236 (1950), pp. 433–460. doi: 10.1093/mind/LIX.236.433.
[307] Amin Ullah et al. “Action Recognition in Video Sequences using Deep
Bi-Directional LSTM With CNN Features”. In: IEEE Access 6 (2017),
pp. 1155–1166. doi: 10.1109/ACCESS.2017.2778011.
[308] J Uthayakumar et al. “Financial crisis prediction model using ant colony
optimization”. In: International Journal of Information Management 50
(2020), pp. 538–556. doi: 10.1016/j.ijinfomgt.2018.12.001.
[309] Hado Van Hasselt, Arthur Guez, and David Silver. “Deep Reinforce-
ment Learning with Double Q-learning”. In: AAAI’16: Proceedings of
the AAAI Conference on Artificial Intelligence. Vol. 30. 1. AAAI Press, 2016,
2094–2100.
[310] Jan-Jaap Van Horssen. “Move Selection in MTD(f)”. In: ICGA Journal 41.1
(2019), pp. 15–23. doi: 10.3233/ICG-190096.
[311] Jan Van Leeuwen and Jiří Wiedermann. “The Turing Machine Paradigm in
Contemporary Computing”. In: Mathematics Unlimited—2001 and Beyond
(2001). Ed. by B. Engquist and W. Schmid, pp. 1139–1155. doi: 10.1007/978-3-642-56478-9_59.
[312] Guido Van Rossum and Jelke De Boer. “Interactively testing remote servers
using the Python programming language”. In: CWI quarterly 4.4 (1991),
pp. 283–303.
[313] Ashish Vaswani et al. “Attention is all you need”. In: Proceedings of the
31st International Conference on Neural Information Processing Systems.
Ed. by Ulrike von Luxburg et al. NIPS’17. Curran Associates Inc., 2017,
6000–6010. isbn: 9781510860964.
[314] Pascal Vincent et al. “Stacked Denoising Autoencoders: Learning Useful
Representations in a Deep Network with a Local Denoising Criterion”.
In: Journal of Machine Learning Research 11 (2010), 3371–3408. issn:
1532-4435. doi: 10.5555/1756006.1953039.
[315] Oriol Vinyals et al. “Grandmaster level in StarCraft II using multi-agent
reinforcement learning”. In: Nature 575.7782 (2019), pp. 350–354. doi:
10.1038/s41586-019-1724-z.
[316] John Von Neumann and Oskar Morgenstern. Theory of Games and Economic
Behavior, 2nd rev. Princeton University Press, 1947.
[317] Arjen P de Vries et al. “Efficient k-NN Search on Vertically Decomposed
Data”. In: Proceedings of the 2002 ACM SIGMOD International Conference
on Management of Data. Association for Computing Machinery, 2002,
pp. 322–333. doi: 10.1145/564691.564729.
[318] Alex Wang et al. “GLUE: A Multi-Task Benchmark and Analysis Platform
for Natural Language Understanding”. In: Proceedings of the 2018 EMNLP
Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for
NLP. Ed. by Tal Linzen, Grzegorz Chrupała, and Afra Alishahi. Association
for Computational Linguistics, 2018, pp. 353–355. doi: 10.18653/v1/W18-
5446.
[319] Honggang Wang et al. “Artificial Immune System based image pattern
recognition in energy efficient wireless multimedia sensor networks”. In:
MILCOM 2008-2008 IEEE Military Communications Conference. IEEE.
2008, pp. 1–7. doi: 10.1109/MILCOM.2008.4753651.
[320] Lipo Wang. Support Vector Machines: Theory and Applications. Ed. by
Lipo Wang. Vol. 177. Studies in Fuzziness and Soft Computing. Springer
Berlin, Heidelberg, 2005. isbn: 978-3-540-24388-5. doi: 10.1007/b95439.
[321] David HD Warren. “An Abstract Prolog Instruction Set”. In: Technical
Report. SRI Project 4776. (1983).
[322] Nick Webb et al. “Waymo’s Safety Methodologies and Safety Readiness
Determinations”. In: arXiv preprint arXiv:2011.00054 (2020).
[323] Joseph Weizenbaum. “ELIZA—a computer program for the study of natural
language communication between man and machine”. In: Communications
of the ACM 9.1 (1966), pp. 36–45. doi: 10.1145/365153.365168.
[324] Paul J Werbos. “Applications of Advances in Nonlinear Sensitivity Analysis”.
In: System Modeling and Optimization: Proceedings of the 10th IFIP
Conference New York City, USA, August 31–September 4, 1981. Ed. by
R F Drenick and Kozin F. Springer. 1982, pp. 762–770.
[325] Paul J Werbos. “Backpropagation Through Time: What It Does and How to
Do It”. In: Proceedings of the IEEE 78.10 (1990), pp. 1550–1560.
[326] Kyle Wiggers. “OpenAI’s Dota 2 bot defeated 99.4% of players in public
matches”. In: The Machine 22 (2022).
[327] R Teal Witter. “Backgammon Is Hard”. In: 15th International Conference
on Combinatorial Optimization and Applications. Ed. by Ding-Zhu Du et al.
Springer. Springer-Verlag, 2021, pp. 484–496. isbn: 978-3-030-92680-9.
doi: 10.1007/978-3-030-92681-6_38.
[328] Tianyu Wu et al. “A Brief Overview of ChatGPT: The History, Status Quo
and Potential Future Development”. In: IEEE/CAA Journal of Automatica
Sinica 10.5 (2023), pp. 1122–1136. doi: 10.1109/JAS.2023.123618.
[329] Yonghui Wu et al. “Google’s Neural Machine Translation System: Bridging
the Gap between Human and Machine Translation”. In: arXiv preprint
arXiv:1609.08144 (2016).
[330] Jie Xu et al. “Multi-Objective Graph Heuristic Search for Terrestrial Robot
Design”. In: 2021 IEEE International Conference on Robotics and Automa-
tion (ICRA). IEEE. IEEE Press, 2021, pp. 9863–9869. doi: 10 . 1109 /
ICRA48506.2021.9561818.
[331] Rui Xu et al. “A Survey on Vision Mamba: Models, Applications and
Challenges”. In: arXiv preprint arXiv:2404.18861 (2024).
[332] Juri Yanase and Evangelos Triantaphyllou. “A systematic survey of computer-
aided diagnosis in medicine: Past and present developments”. In: Expert
Systems with Applications 138 (2019). doi: 10 . 1016 / j . eswa . 2019 .
112821.
[333] Hua Yang et al. “A Survey of Artificial Immune System Based Intrusion
Detection”. In: The Scientific World Journal 2014 (2014). doi: 10.1155/
2014/156790.
[334] B Yegnanarayana and D Raj Reddy. “Performance of Harpy speech recog-
nition system for speech input with quantization noise”. In: The Jour-
nal of the Acoustical Society of America 62.S1 (1977), S27–S27. doi:
10.1121/1.2016106.
[335] Cao Ying et al. “Advance and Prospects of AdaBoost Algorithm”. In:
Acta Automatica Sinica 39.6 (2013), pp. 745–758. doi: 10.1016/S1874-
1029(13)60052-X.
[336] Haoxuan You et al. “Learning Visual Representation from Modality-Shared
Contrastive Language-Image Pre-training”. In: European Conference on
Computer Vision. Ed. by Shai Avidan et al. Springer. 2022, pp. 69–87. doi:
10.1007/978-3-031-19812-0_5.
[337] Yang You et al. “Large-Batch Training for LSTM and Beyond”. In: Pro-
ceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis. ACM, 2019, pp. 1–16. doi: 10.1145/
3295500.3356137.
[338] Mahmood Yousefi-Azar et al. “Autoencoder-based feature learning for cyber
security applications”. In: 2017 International Joint Conference on Neural
Networks (IJCNN), IEEE. 2017, pp. 3854–3861. doi: 10.1109/IJCNN.
2017.7966342.
[339] Hana Yousuf et al. “A Systematic Review on Sequence-to-Sequence Learning
with Neural Network and its Models”. In: International Journal of Electrical
& Computer Engineering 11.3 (2021), pp. 2315–2326. doi: 10.11591/
ijece.v11i3.pp2315-2326.
[340] L.A. Zadeh. “Fuzzy Logic = Computing with Words”. In: IEEE Transactions
on Fuzzy Systems 4.2 (1996), pp. 103–111. doi: 10.1109/91.493904.
[341] Lotfi A Zadeh. “Fuzzy Sets”. In: Information and Control 8.3 (1965),
pp. 338–353. doi: 10.1016/S0019-9958(65)90241-X.
[342] Ryan Zarick et al. Unlocking the Potential of Deep Counterfactual Value
Networks. 2020. arXiv: 2007.10442 [cs.AI].
[343] Rowan Zellers et al. “SWAG: A Large-Scale Adversarial Dataset for
Grounded Commonsense Inference”. In: Proceedings of the 2018 Con-
ference on Empirical Methods in Natural Language Processing. Ed. by Ellen
Riloff et al. Association for Computational Linguistics, 2018, pp. 93–104.
doi: 10.18653/v1/D18-1009.
[344] Tomáš Zemčík. “A Brief History of Chatbots”. In: DEStech Transactions
on Computer Science and Engineering 10 (2019). doi: 10.12783/dtcse/
aicae2019/31439.
[345] Yong Zhang, Dun-Wei Gong, and Jian-Hua Zhang. “Robot Path Planning in
Uncertain Environment using Multi-Objective Particle Swarm Optimization”.
In: Neurocomputing 103 (2013), 172–185. issn: 0925-2312. doi: 10.1016/
j.neucom.2012.09.019.
[346] Jie Zhou et al. “Graph Neural Networks: A Review of Methods and Applica-
tions”. In: AI Open 1 (2020), pp. 57–81. doi: 10.1016/j.aiopen.2021.
01.001.
[347] Fuzhen Zhuang et al. “A Comprehensive Survey on Transfer Learning”. In:
Proceedings of the IEEE 109.1 (2020), pp. 43–76. doi: 10.1109/JPROC.
2020.3004555.
[348] Fuzhen Zhuang et al. “Supervised Representation Learning: Transfer Learn-
ing with Deep Autoencoders”. In: Proceedings of the 24th International Con-
ference on Artificial Intelligence. Ed. by Qiang Yang and Michael Wooldridge.
IJCAI’15. Buenos Aires, Argentina: AAAI Press, 2015, 4119–4125. isbn:
9781577357384.
Author Bio
Maciej Świechowski is an experienced computer scientist, researcher, and programmer who has always been interested in many aspects of computer science. This is the main reason he decided to write this book: to draw on these wide interests in all shapes and techniques of AI. “One approach is not enough”. His early interests involved programming computer games, including the mathematics behind 3D graphics. Later, his focus switched to Artificial Intelligence, without abandoning classical programming and systems development. His main programming languages of choice are C#, C++, Python, Rust, and Haskell.
He was awarded a Ph.D. with distinction in Artificial Intelligence in 2015 from the Systems Research Institute of the Polish Academy of Sciences. He received his M.Sc.
and B.Sc. in Computer Science from Warsaw University of Technology. In 2013-
2014, he was a visiting researcher at the University of New South Wales, Australia.
In 2012-2014, Mr. Świechowski participated in the International General Game
Playing Competition held by Stanford University, reaching the quarterfinals twice.
Currently, he serves as the Head of QED Labs and the Chief Technology Officer of QED Games. In these roles, he has gained hands-on experience in many
R&D projects. In 2022, he co-founded the Information Technologies for Psychiatry
Foundation, where he serves as the vice president. The foundation is a non-profit
organization with the aim of promoting modern technologies for the monitoring,
diagnosis, and treatment of mental disorders. Occasionally, he also teaches one course per semester at either the University of Warsaw or the Warsaw University of Technology.
Mr. Świechowski has co-authored over 35 papers in these fields, often as the first author. He has been a speaker at various conferences, from scientific ones such as AAAI and IEEE SSCI to industry events like Digital Dragons and Code Europe. He is a member of various communities, including IEEE and the Polish Information Processing Society (PTI). He also serves as a track chair at the FedCSIS Conference and reviews papers for a large number of conferences and journals.