Learning In Networks:
From Spiking Neural Nets To Graphs
Victor Miagkikh
Machine Learning Specialist,
Cisco Systems Inc.
Ironport business unit
Roadmap
• Introduction to Hebbian learning
• Bee navigation problem
• Introduction to Spiking Neural Networks (SNN)
• rHebb algorithm: Hebbian learning augmented with the "reward controls plasticity" learning principle
• Reinforcement learning applications for movie and stock purchase recommendations
• Methodology for introducing reinforcement learning in networks
• Q&A
Networks
• There are many different kinds of networks around us
Learning in Networks
• Suppose we would like to add a connection between San Francisco and Orlando. Is it a good thing to do?
Neural Networks
• There is no external analyst in the brain who changes connections between neurons. What principles make learning possible?
Hebbian Learning
[Photo: Donald Olding Hebb, 1904-1985]
• "Let us assume that the persistence or repetition of a reverberatory activity (or 'trace') tends to induce lasting cellular changes that add to its stability. … When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." (Hebb, The Organization of Behavior, 1949)
Hebbian Learning (Cont.)
• In other words: if two neurons fire "close in time", then the strength of the synaptic connection between them increases.

∆w_ij(t) = η · v_i · v_j · g(t_{v_i}, t_{v_j})

[Figure: spike trains of neurons v1, v2, v3 on a time axis; the firing times t_{v_i} and t_{v_j} are "close"]

• Weights reflect correlation between firing events.
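As a concrete illustration, here is a minimal Python sketch of this update rule (not from the slides); the exponential closeness kernel g and the parameter values are assumptions, chosen to match the kernel used later in the rHebb slides.

```python
import numpy as np

def hebbian_update(w, v_i, v_j, t_i, t_j, eta=0.01, k=1.0):
    """Hebbian update: the closer in time the two firings, the larger the increase."""
    g = np.exp(-k * abs(t_i - t_j))      # closeness-in-time kernel g(t_i, t_j)
    return w + eta * v_i * v_j * g

# Two strong, nearly simultaneous spikes strengthen the synapse noticeably;
# widely separated spikes leave it almost unchanged.
w_close = hebbian_update(w=0.5, v_i=1.0, v_j=1.0, t_i=10.0, t_j=10.2)
w_far   = hebbian_update(w=0.5, v_i=1.0, v_j=1.0, t_i=10.0, t_j=18.0)
```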
Bee Navigation Problem
• A bee has to correlate signs in order to find the
feeder (S. W. Zhang et al., Honeybee Memory:
Navigation by Associative Grouping and Recall
of Visual Stimuli, 1998).
Spiking Neural Networks
• Short term memory is needed in order to solve
this problem.
• Spiking Neural Networks (SNN) have “built in”
short term memory.
Perceptron
[Figure: inputs x1, x2, x3 with weights w1, w2, w3 and a bias weight w0 feed a summation unit ∑ that produces the output y]

y = Activation_Function(∑ w_i · x_i)

Spiking Neuron
[Figure: inputs x1, x2, x3 with weights w1, w2, w3 feed a leaky integrator ∫ that produces the output y]

E(t) = E(t−1) · decay + ∫_{t−1}^{t} ∑ w_i · x_i

If E(t) > threshold -> emit spike; E(t+1) = 0
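A minimal Python sketch of this leaky integrate-and-fire behaviour (not from the slides; the decay and threshold values are arbitrary assumptions):

```python
def spiking_neuron_step(E_prev, inputs, weights, decay=0.9, threshold=1.0):
    """One time step: leaky accumulation of weighted input; spike and reset on threshold."""
    E = E_prev * decay + sum(w * x for w, x in zip(weights, inputs))
    if E > threshold:
        return 0.0, 1          # E(t+1) = 0, emit spike
    return E, 0                # keep accumulating, no spike

# The accumulated energy E(t) acts as the "built in" short-term memory:
# three weak inputs, none sufficient alone, eventually trigger a spike.
E, spikes = 0.0, []
for x in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    E, y = spiking_neuron_step(E, x, weights=[0.4, 0.4, 0.4])
    spikes.append(y)           # -> [0, 0, 1]
```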
SNN Solution
[Figure: two-stage spiking network for the bee navigation task. Inputs i1-i6 (with connections i1,v* ... i6,v*) feed spiking neurons N1_1, N1_2, N2_1, N2_2, N3_1, N3_2, each shown as an integrator ∫ with output v1_1 ... v3_2; the outputs drive the L/R (left/right) decision. Stage 1 and Stage 2 show the network at the two decision points.]
Evolution of Weights
[Figure: the same SNN as on the previous slide, shown at Iteration 25]

Weights after iteration 25:

        N1_1     N1_2     N2_1     N2_2     N3_1     N3_2
N1_1    N/A      0.0872   0.5188   0.1881   1.0356   0.2256
N1_2    0.0872   N/A      0.17953  0.4338   0.0831   0.4935
N2_1    0.5188   0.17953  N/A      0.0931   2.4867   0.2107
N2_2    0.1881   0.4338   0.0931   N/A      0.1325   0.31944
N3_1    1.0356   0.0831   2.4867   0.1325   N/A      0.1296
N3_2    0.2256   0.4935   0.2107   0.31944  0.1296   N/A
Evolution of Weights (Cont.)
[Figure: the same SNN, shown at Iteration 50]

Weights after iteration 50:

        N1_1     N1_2     N2_1     N2_2     N3_1     N3_2
N1_1    N/A      0.0743   0.9007   0.1819   2.6617   0.2256
N1_2    0.0743   N/A      0.0973   1.7223   -0.6328  0.6212
N2_1    0.9007   0.0973   N/A      0.0652   8.579    0.2107
N2_2    0.2107   1.7223   0.0652   N/A      0.0453   0.4476
N3_1    0.6212   -0.6328  8.579    0.0453   N/A      0.0978
N3_2    0.2256   0.6212   0.2107   0.4476   0.0978   N/A
rHebb Algorithm
Assumption #1 - Hebbian learning:
The strength of the connections between neurons that fire "close in time" increases (subject to assumption #2).

∆w_ij = η · f(v_i, v_j, ∆t_ij) = η · v_i · v_j · g(∆t_ij) = η · v_i · v_j · e^(−k_1 · ∆t_ij)

where v_i, v_j are the outputs of neurons i and j, and ∆t_ij is the difference in time between the firings of neurons i and j.

Assumption #2 - Reward controls plasticity (the ability to change weights):
The changes in weights defined by Hebbian learning do not occur all the time; they are controlled in sign and intensity by the reward.

∆w_ij = η · reward · v_i · v_j · g(∆t_ij) = η · reward · v_i · v_j · e^(−k_1 · ∆t_ij)
rHebb Algorithm (Cont.)
Assumption #3 - Temporal credit assignment:
[Figure: timeline showing neuron i firing with output v_i, neuron j firing with output v_j, and the reward r arriving later]

∆w_ij = η · reward · v_i · v_j · g(∆t_ij) · h(∆t_ir, ∆t_jr)
      = η · reward · v_i · v_j · e^(−k_1 · ∆t_ij) · e^(−k_2 · (∆t_ir + ∆t_jr))
      = η · reward · v_i · v_j · e^(−k_1 · ∆t_ij − k_2 · ∆t_ir − k_2 · ∆t_jr)
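Putting the three assumptions together, here is a minimal Python sketch of a single rHebb weight update (not from the slides; the learning rate and the decay constants k_1, k_2 are illustrative assumptions):

```python
import numpy as np

def rhebb_update(w, v_i, v_j, dt_ij, dt_ir, dt_jr, reward, eta=0.01, k1=1.0, k2=0.5):
    """rHebb: Hebbian coincidence (k1 term), reward-controlled plasticity (reward factor),
    and temporal credit assignment (k2 terms for the delays until the reward)."""
    return w + eta * reward * v_i * v_j * np.exp(-k1 * dt_ij - k2 * dt_ir - k2 * dt_jr)

# A spike pair followed quickly by a positive reward strengthens the synapse;
# the same pair followed by a negative reward weakens it.
w_pos = rhebb_update(w=0.2, v_i=1.0, v_j=1.0, dt_ij=0.1, dt_ir=0.5, dt_jr=0.4, reward=+1.0)
w_neg = rhebb_update(w=0.2, v_i=1.0, v_j=1.0, dt_ij=0.1, dt_ir=0.5, dt_jr=0.4, reward=-1.0)
```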
rHebb Algorithm (Cont.)
• Weights in rHebb reflect not correlations, but
utilities.
• Related work: C.M. Pennartz, Reinforcement Learning by Hebbian Synapses with Adaptive Thresholds, Neuroscience, 1997 (Caltech).
• rHebb is biologically plausible.
• Could the "reward controls plasticity" learning axiom be used for training not only spiking neural networks, but other kinds of networks as well?
Movie Recommendation Problem
Problem: given a list of favorite movies, a wish list, and demographic information (age, gender, zip code), try to come up with a movie recommendation list.

[Figure: movie graph with nodes Matrix, Matrix Reloaded, 13th Floor, 12 Monkeys, Johnny Mnemonic, 5th Element, Constantine]
- Seen or in-queue movies: set HQ (history + queue).
- Possible recommendations: set A.
Reinforcement Learning in Movie Network
• Let's assume that for some movies in the H set we have personal ratings. For the combined HQ set we can introduce a reward function G(HQi) that returns 0.5 for unseen and neutral movies, +1 for the best movie ever made, and 0 for the worst movie according to personal taste.
• For example: H = <13th Floor, 5th Element>, Q = <Matrix>. G(HQi) = <0.7, 0.5, 0.5>.

[Figure: movie graph with nodes Matrix, Matrix Reloaded, 13th Floor, 12 Monkeys, Johnny Mnemonic, 5th Element, Constantine]

• Let's introduce the set R composed of the movies that our system recommended. For the example above, let R = Q = <Matrix>.
• At some point the user watches a movie in R or some other movie not in R. If the user watches a movie in R, the system gets a reward G(HQi); if not, the system gets reward = -small_penalty. Thus, we get a sequence of <m, reward> pairs for each user. Let's call this set T.
Reinforcement Learning in Movie Network (Cont.)
• The set T could be used to adjust the connections C_ij in reinforcement learning mode. Consider a <m, reward> pair. Given m, we know the average cumulative flow through each edge and vertex to all vertices in the H set. Let's denote them fn(i, m, H) and fe(i, j, m, H).
• Then we can update edges as follows:

∆C_ij = η_e · reward · fe(i, j, m, H)

• And vertices:

∆C_ii = η_n · reward · fn(i, m, H)

• Thus, we get an updated C_ij, and we can do this for the entire population, a segment, or an individual customer.
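A minimal sketch of these two update rules in Python (not from the slides): flow_e and flow_n are hypothetical placeholders for the fe(i,j,m,H) and fn(i,m,H) quantities, since how the flow is computed is not specified here.

```python
def update_movie_network(C, m, reward, H, flow_e, flow_n, eta_e=0.1, eta_n=0.1):
    """Apply the edge and vertex updates above to the connection matrix C (dict of dicts).
    flow_e(i, j, m, H) / flow_n(i, m, H) return the average cumulative flow from movie m
    to the H set through edge (i, j) / vertex i."""
    for i in C:
        for j in C[i]:
            if i == j:
                C[i][i] += eta_n * reward * flow_n(i, m, H)      # vertex update ∆C_ii
            else:
                C[i][j] += eta_e * reward * flow_e(i, j, m, H)   # edge update ∆C_ij
    return C
```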
Reinforcement Learning in Networks
• Generally, flow in a network and Hebbian learning are both examples of an eligibility trace e(t): the degree of participation of each functional unit in the decision making that caused the action.
• The e(t) * reward pattern is an example of structural credit assignment in a learner. It's very intuitive: distribute reward according to participation.
• Structural credit assignment is different in each kind of learner, but the pattern e(t) * reward is a good rule of thumb for introducing RL.
• In general, RL has its place when the only feedback we get from the environment is a reinforcement signal, which doesn't allow the use of supervised learning.
An Approach for Solving the NetFlix Problem
• Bipartite graph representation (users u1-u3 connected to movies m1-m4), known ratings:

        m1      m2      m3      m4
u1      0.3     0       0       0
u2      -0.1    0       0       0.9
u3      0.1     0.4     -0.2    0

• After finding network equilibrium:

        m1      m2      m3      m4
u1      0.3     0.1     -0.1    0.2
u2      -0.1    0.1     0.15    0.9
u3      0.1     0.4     -0.2    0.05
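The slide shows the ratings before and after equilibrium but not the procedure for finding it. Purely as an illustration, here is one hypothetical way to reach such a fixed point in Python: repeatedly propagate the known ratings through a user-user similarity derived from the current rating matrix, keeping observed entries fixed.

```python
import numpy as np

R = np.array([[ 0.3, 0.0,  0.0, 0.0 ],     # users u1..u3 x movies m1..m4
              [-0.1, 0.0,  0.0, 0.9 ],
              [ 0.1, 0.4, -0.2, 0.0 ]])
known = R != 0.0                            # observed ratings stay fixed

for _ in range(100):
    sim = R @ R.T                           # user-user similarity from current ratings
    np.fill_diagonal(sim, 0.0)
    prop = sim @ R / (np.abs(sim).sum(axis=1, keepdims=True) + 1e-9)
    R_new = np.where(known, R, prop)        # fill only the unknown entries
    if np.allclose(R_new, R, atol=1e-6):    # stop at the fixed point ("equilibrium")
        break
    R = R_new
```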
Example: Influence Networks for Economy
• Consider the following network, where nodes represent companies and connections represent mutual influences. Stock price could be considered as the output of a company node, similar to the output of a neuron.
• We can apply Hebbian learning: the strength of the connection between two neurons that fire at the same time increases.

[Figure: influence network with company nodes Intel, IBM, Xerox, AMD]

Input: stock price s(t) for any company c_i
Output: strength of the influence connections w(c_i, c_j)

Learning rule (Hebbian learning):
∆w_{c_i c_j}(∆t) = η · (∆s_i(∆t) · ∆s_j(∆t))
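A minimal Python sketch of this learning rule (not from the slides): the prices and the learning rate are made-up values, and the price changes over the interval ∆t play the role of the neurons' outputs.

```python
import numpy as np

def influence_update(W, prices_prev, prices_now, eta=0.01):
    """One Hebbian step: co-movement of the stock prices of companies i and j
    over the interval strengthens the influence weight w(c_i, c_j)."""
    ds = prices_now - prices_prev            # ∆s_i(∆t) for every company
    return W + eta * np.outer(ds, ds)        # ∆w_{c_i c_j} = η · ∆s_i · ∆s_j

# Four companies, e.g. Intel, IBM, Xerox, AMD (prices are illustrative only).
W = np.zeros((4, 4))
W = influence_update(W, np.array([30.0, 120.0, 10.0, 25.0]),
                        np.array([31.0, 121.5,  9.8, 26.0]))
```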
Example: Influence Networks for Economy (Cont.)
• We can easily extend simple Hebbian learning to "Fuzzy" Hebbian learning:

∆w = η · ∫_0^∆t f_{c_i}(t) · f_{c_j}(t) dt / ∆t

[Figure: the two fuzzy activity functions are multiplied (×) and the product is defuzzified (=) by the center-of-gravity method]

∆w = η · ∫ f̂(t) dt / ∆t

• We can train the network for various ∆t. The meaning of a weight becomes the timed correlation of stock prices.
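As an illustration, here is a discrete Python approximation of the windowed rule above (not from the slides); the fuzzification/defuzzification step is omitted, and the activity signals are assumed to be already-normalized price changes.

```python
import numpy as np

def windowed_hebbian(f_i, f_j, eta=0.01):
    """Approximate ∆w = η · ∫ f_ci(t) · f_cj(t) dt / ∆t with a discrete average
    of the product of the two activity signals over the window."""
    f_i, f_j = np.asarray(f_i), np.asarray(f_j)
    return eta * np.sum(f_i * f_j) / len(f_i)

# Example: normalized daily price changes of two companies over a 5-day window.
dw = windowed_hebbian([0.2, 0.5, 0.1, -0.3, 0.4],
                      [0.1, 0.4, 0.2, -0.2, 0.5])
```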
Portfolio Analysis
[Figure: influence network with nodes Intel, HP, IBM, Apple, Xerox, AMD, P&G; filled nodes - some stock in portfolio, empty nodes - no stock in portfolio]

In general: a portfolio is represented by weights on vertices.

• Characteristics we can compute: diversity, coverage, competition within the portfolio, company influence.
• Maximum flow between any two vertices could be used to calculate cumulative influence.
• Key idea for stock purchase recommendations: if a fluctuation in the stock price of one company is observed, the network can tell which companies' stock prices will start to fluctuate.
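A small sketch of the maximum-flow idea, assuming the networkx library: the learned influence weights are treated as edge capacities, and both the graph and its values are hypothetical.

```python
import networkx as nx

# Influence network with learned weights used as capacities (illustrative values).
G = nx.DiGraph()
G.add_edge("Intel", "HP", capacity=0.6)
G.add_edge("Intel", "AMD", capacity=0.9)
G.add_edge("AMD", "HP", capacity=0.3)

# Cumulative influence of Intel on HP along all paths = maximum flow between them.
flow_value, flow_per_edge = nx.maximum_flow(G, "Intel", "HP")
print(flow_value)   # 0.9 for the capacities above
```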
Multi Aspect Influence Model
[Figure: three planes L1, L2, L3; each plane contains its own copies of the company nodes (Intel, IBM, Xerox, AMD) with an event input "evt" and the stock price signal s(t)]

Stock price could be seen as a form of reinforcement signal reflecting the future expected reward of a company, allowing the use of reinforcement learning.

Each "plane" Li models a separate dimension of influences. It can represent the involvement of a company in different sectors of the economy. In general, each plane could be the result of a decomposition into some other, more abstract basis.

Each company node c_j exists on each Li, forming vertical subnetworks that could be used to model transformations within a company.
Reward Controls Plasticity Learning Principle
• The -e(t) * error pattern has been used in supervised learning for a long time.
• e(t) * reward is a generic principle for introducing reinforcement learning.
• e(t) could be any causality-inducing principle, not necessarily Hebbian learning.
• There is a huge area of applicability. Depending on the application, the reward could be defined as the number of users clicking on an online ad or the number of messages caught by an anti-spam system.
• Q&A