The Hopfield net as a distribution
The Helmholtz Free Energy of a System
- At any time, the probability of finding the system in state $s$ at temperature $T$ is $P_T(s)$
- In each state $s$ the system has a potential energy $E_s$
- The internal energy of the system, representing its capacity to do work, is the average: $U_T = \sum_{s} P_T(s) E_s$
- The capacity to do work is counteracted by the internal disorder of the system, i.e. its entropy: $H_T = -\sum_{s} P_T(s) \log P_T(s)$
- The Helmholtz free energy of the system measures the useful work derivable from it and combines the two terms: $F_T = U_T - kT H_T = \sum_{s} P_T(s) E_s + kT \sum_{s} P_T(s) \log P_T(s)$
- The probability distribution of the states at steady state is known as the Boltzmann distribution
- Minimizing $F_T$ w.r.t. $P_T(s)$, subject to $\sum_s P_T(s) = 1$, we get
- $P_T(s) = \frac{1}{Z} \exp\left(\frac{-E_s}{kT}\right)$
- $Z$ is a normalizing constant (the partition function): $Z = \sum_{s} \exp\left(\frac{-E_s}{kT}\right)$
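The minimization step can be made explicit with a Lagrange multiplier for the normalization constraint; a short sketch of the derivation being skipped:

```latex
% Minimize F_T = \sum_s P_T(s) E_s + kT \sum_s P_T(s) \log P_T(s)
% subject to \sum_s P_T(s) = 1, with multiplier \lambda:
\begin{align*}
\frac{\partial}{\partial P_T(s)}\Big[F_T + \lambda\Big(\sum_{s'} P_T(s') - 1\Big)\Big]
  &= E_s + kT\big(\log P_T(s) + 1\big) + \lambda = 0 \\
\Rightarrow\quad P_T(s) &= e^{-1-\lambda/kT}\, e^{-E_s/kT}
  = \frac{1}{Z}\exp\!\left(\frac{-E_s}{kT}\right),
\qquad Z = \sum_{s'} \exp\!\left(\frac{-E_{s'}}{kT}\right)
\end{align*}
```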
-
Hopfield net as a distribution
- $E(S) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i b_i s_i$
- $P(S) = \frac{\exp(-E(S))}{\sum_{S'} \exp(-E(S'))}$
- The stochastic Hopfield network models a probability distribution over states
- It is a generative model: it generates states according to $P(S)$
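For a small network, $P(S)$ can be computed exactly by enumerating all $2^N$ states. A minimal sketch in Python (the helper names `hopfield_energy` and `boltzmann` are illustrative, not from the source):

```python
import itertools
import numpy as np

def hopfield_energy(s, W, b):
    """E(S) = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i (W symmetric, zero diagonal)."""
    return -0.5 * s @ W @ s - b @ s

def boltzmann(W, b):
    """Enumerate all +/-1 states and return them with their probabilities P(S)."""
    n = len(b)
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    E = np.array([hopfield_energy(s, W, b) for s in states])
    p = np.exp(-E)
    return states, p / p.sum()   # denominator = the partition function

# Example: 3 neurons, one strong coupling between neurons 0 and 1
W = np.array([[0., 1., 0.], [1., 0., 0.], [0., 0., 0.]])
b = np.zeros(3)
for s, q in zip(*boltzmann(W, b)):
    print(s, f"{q:.3f}")   # states with s_0 == s_1 are more probable
```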
The field at a single node
- Consider a single node $s_i$
- Let $S$ and $S'$ be two states that are identical except that $s_i = +1$ in $S$ and $s_i = -1$ in $S'$
- $P(S) = P(s_i = 1 \mid s_{j \neq i}) P(s_{j \neq i})$
- $P(S') = P(s_i = -1 \mid s_{j \neq i}) P(s_{j \neq i})$
- $\log P(S) - \log P(S') = \log P(s_i = 1 \mid s_{j \neq i}) - \log P(s_i = -1 \mid s_{j \neq i})$
- $\log P(S) - \log P(S') = \log \frac{P(s_i = 1 \mid s_{j \neq i})}{1 - P(s_i = 1 \mid s_{j \neq i})}$
- $\log P(S) = -E(S) + C$
- $E(S) = -\frac{1}{2}\left(E_{\text{not } i} + \sum_{j \neq i} w_{ij} s_j + b_i\right)$
- $E(S') = -\frac{1}{2}\left(E_{\text{not } i} - \sum_{j \neq i} w_{ij} s_j - b_i\right)$
- $\log P(S) - \log P(S') = E(S') - E(S) = \sum_{j \neq i} w_{ij} s_j + b_i$
- $\log\left(\frac{P(s_i = 1 \mid s_{j \neq i})}{1 - P(s_i = 1 \mid s_{j \neq i})}\right) = \sum_{j \neq i} w_{ij} s_j + b_i$
- $P(s_i = 1 \mid s_{j \neq i}) = \frac{1}{1 + e^{-\left(\sum_{j \neq i} w_{ij} s_j + b_i\right)}}$
- The probability of any node taking value 1, given the other node values, is a logistic
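A quick numeric sanity check of this identity, here with 0/1 units (the convention adopted in the next section, for which the identity holds with the plain local field); all names and values below are illustrative:

```python
import itertools
import numpy as np

# Check P(s_i = 1 | rest) == logistic(local field) on a tiny 0/1 network.
def energy01(s, W, b):
    return -0.5 * s @ W @ s - b @ s   # -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i

W = np.array([[0., 1.5, -1.], [1.5, 0., .5], [-1., .5, 0.]])
b = np.array([0.2, -0.1, 0.0])

states = np.array(list(itertools.product([0, 1], repeat=3)))
p = np.exp([-energy01(s, W, b) for s in states])
p /= p.sum()

# Exact conditional for node i = 0 given s_1 = 1, s_2 = 0:
context = np.array([1, 0])
mask = np.all(states[:, 1:] == context, axis=1)
p_cond = p[mask & (states[:, 0] == 1)].sum() / p[mask].sum()

# Logistic of the local field z_0 = sum_{j != 0} w_0j s_j + b_0:
z = W[0, 1:] @ context + b[0]
print(p_cond, 1 / (1 + np.exp(-z)))   # the two values agree
```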
Redefining the network
- Redefine a regular Hopfield net as a stochastic system
- Each neuron is now a stochastic unit with a binary state $s_i$, which can take value 0 or 1 with a probability that depends on the local field
- $z_i = \sum_j w_{ij} s_j + b_i$
- $P(s_i = 1 \mid s_{j \neq i}) = \frac{1}{1 + e^{-z_i}}$
- Note
- The Hopfield net is a probability distribution over binary sequences (Boltzmann distribution)
- The conditional distribution of individual bits in the sequence is a logistic
- The evolution of the Hopfield net can be made stochastic
- Instead of deterministically responding to the sign of the local field, each neuron responds probabilistically
- The network can still be used to recall patterns (a sketch of the stochastic evolution follows below)
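A sketch of this stochastic evolution as sequential Gibbs sampling over 0/1 units; `gibbs_step` and `evolve` are illustrative names:

```python
import numpy as np

def gibbs_step(s, W, b, rng):
    """Update every neuron once, in random order, by sampling from the
    logistic conditional P(s_i = 1 | s_{j != i})."""
    for i in rng.permutation(len(s)):
        z = W[i] @ s + b[i]               # diag(W) = 0, so s_i drops out
        s[i] = rng.random() < 1 / (1 + np.exp(-z))
    return s

def evolve(s0, W, b, steps=50, seed=0):
    """Run the network; the final state is (approximately) a draw from P(S)."""
    rng = np.random.default_rng(seed)
    s = np.array(s0, dtype=float)
    for _ in range(steps):
        s = gibbs_step(s, W, b, rng)
    return s
```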
The Boltzmann Machine
- The entire model can be viewed as a generative model
- Has a probability of producing any binary vector $\mathbf{y}$
- $E(\mathbf{y}) = -\frac{1}{2} \mathbf{y}^T \mathbf{W} \mathbf{y}$
- $P(\mathbf{y}) = C \exp\left(-\frac{E(\mathbf{y})}{T}\right)$
- Training a Hopfield net: Must learn weights to “remember” target states and “dislike” other states
- Must learn weights to assign a desired probability distribution to states
- Just maximize likelihood
Maximum Likelihood Training
- $\log P(S) = \left(\sum_{i<j} w_{ij} s_i s_j\right) - \log\left(\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)\right)$
- $\mathcal{L} = \frac{1}{N} \sum_{S \in \mathbf{S}} \log P(S) = \frac{1}{N} \sum_{S} \left(\sum_{i<j} w_{ij} s_i s_j\right) - \log\left(\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)\right)$
- Second term derivation:
- $\frac{d}{dw_{ij}} \log\left(\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)\right) = \sum_{S'} \frac{\exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)}{\sum_{S''} \exp\left(\sum_{i<j} w_{ij} s''_i s''_j\right)} s'_i s'_j = \sum_{S'} P(S')\, s'_i s'_j$
- The second term is simply the expected value of $s_i s_j$ over all possible values of the state
- We cannot compute it exhaustively, but we can estimate it by sampling!
Overall gradient ascent rule
- $w_{ij} = w_{ij} + \eta \frac{d\langle \log P(\mathbf{S}) \rangle}{dw_{ij}}$
Overall Training
- Initialize weights
- Let the network run to obtain simulated state samples
- Compute gradient and update weights
- Iterate
Note the similarity to the update rule for the Hopfield network
- The only difference is how we got the samples
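Putting the pieces together, a minimal sketch of this sampling-based training loop for a fully visible 0/1 network, reusing `evolve` from the sketch above (`train_bm` and all hyperparameters are illustrative):

```python
import numpy as np

def train_bm(data, n_model_samples=100, epochs=200, lr=0.05, seed=0):
    """data: (N, n) array of 0/1 training patterns."""
    rng = np.random.default_rng(seed)
    N, n = data.shape
    W = np.zeros((n, n))
    b = np.zeros(n)
    pos = data.T @ data / N                          # <s_i s_j> over training data
    for _ in range(epochs):
        # "Let the network run" to draw samples from the current model.
        # Short chains for brevity; real runs need longer mixing.
        samples = np.array([
            evolve(rng.integers(0, 2, n), W, b, steps=20,
                   seed=int(rng.integers(1 << 30)))
            for _ in range(n_model_samples)])
        neg = samples.T @ samples / n_model_samples  # <s_i s_j> under the model
        grad = pos - neg
        np.fill_diagonal(grad, 0.0)                  # no self-connections
        W += lr * grad                               # gradient *ascent* on log-likelihood
        b += lr * (data.mean(0) - samples.mean(0))
    return W, b
```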
Adding Capacity
Visible neurons
- The neurons that store the actual patterns of interest
Hidden neurons
- The neurons that only serve to increase the capacity, but whose actual values are not important
We could have multiple hidden patterns coupled with any visible pattern
- These would be multiple stored patterns that all give the same visible output
We are interested in the marginal probabilities over the visible bits
- $S = (V, H)$
- $P(S) = \frac{\exp(-E(S))}{\sum_{S'} \exp(-E(S'))}$
- $P(S) = P(V, H)$
- $P(V) = \sum_H P(V, H)$
Train to maximize the probability of the desired patterns of visible bits
- $E(S) = -\sum_{i<j} w_{ij} s_i s_j$
- $P(S) = \frac{\exp\left(\sum_{i<j} w_{ij} s_i s_j\right)}{\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)}$
- $P(V) = \sum_H \frac{\exp\left(\sum_{i<j} w_{ij} s_i s_j\right)}{\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)}$
Maximum Likelihood Training
- $\log P(V) = \log\left(\sum_H \exp\left(\sum_{i<j} w_{ij} s_i s_j\right)\right) - \log\left(\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)\right)$
- $\mathcal{L} = \frac{1}{N} \sum_{V \in \mathbf{V}} \log P(V)$
- $\frac{d\mathcal{L}}{dw_{ij}} = \frac{1}{N} \sum_{V \in \mathbf{V}} \sum_H P(S \mid V)\, s_i s_j - \sum_{S'} P(S')\, s'_i s'_j$
- $\sum_H P(S \mid V)\, s_i s_j \approx \frac{1}{K} \sum_{H \in \mathbf{H}_{simul}} s_i s_j$
- Computed as an average over hidden states sampled with the visible bits fixed
- $\sum_{S'} P(S')\, s'_i s'_j \approx \frac{1}{M} \sum_{S' \in \mathbf{S}_{simul}} s'_i s'_j$
- Computed as an average over states sampled while the network runs "freely"
Training
Step 1
- For each training pattern $V_i$
- Fix the visible units to $V_i$
- Let the hidden neurons evolve from a random initial point to generate $H_i$
- Generate $S_i = [V_i, H_i]$
- Repeat $K$ times to generate the synthetic training set
- $\mathbf{S} = \{S_{1,1}, S_{1,2}, \ldots, S_{1,K}, S_{2,1}, \ldots, S_{N,K}\}$
Step 2
- Now unclamp the visible units and let the entire network evolve several times to generate
- $\mathbf{S}_{simul} = \{S_{simul,1}, S_{simul,2}, \ldots, S_{simul,M}\}$
Gradients
- $\frac{d\langle \log P(\mathbf{S}) \rangle}{dw_{ij}} = \frac{1}{NK} \sum_{S \in \mathbf{S}} s_i s_j - \frac{1}{M} \sum_{S' \in \mathbf{S}_{simul}} s'_i s'_j$
- $w_{ij} = w_{ij} + \eta \frac{d\langle \log P(\mathbf{S}) \rangle}{dw_{ij}}$
- Gradients are computed as before, except that the first term is now computed over the expanded training data
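A compact sketch of this two-phase procedure (clamped Step 1, free-running Step 2), again reusing `evolve` from earlier; `train_with_hidden` and the hyperparameters are illustrative:

```python
import numpy as np

def train_with_hidden(V_data, n_hidden, K=5, M=100, epochs=100, lr=0.05, seed=0):
    """V_data: (N, n_vis) array of 0/1 visible training patterns."""
    rng = np.random.default_rng(seed)
    N, n_vis = V_data.shape
    n = n_vis + n_hidden
    W = np.zeros((n, n))
    b = np.zeros(n)
    for _ in range(epochs):
        # Step 1: clamp each training pattern; sample the hidden units K times
        clamped = []
        for v in V_data:
            for _ in range(K):
                s = np.concatenate([v, rng.integers(0, 2, n_hidden)]).astype(float)
                for _ in range(20):                  # evolve hidden units only
                    for i in range(n_vis, n):
                        z = W[i] @ s + b[i]
                        s[i] = rng.random() < 1 / (1 + np.exp(-z))
                clamped.append(s.copy())
        clamped = np.array(clamped)                  # the N*K synthetic states
        # Step 2: unclamp and let the whole network run freely M times
        free = np.array([evolve(rng.integers(0, 2, n), W, b, steps=20,
                                seed=int(rng.integers(1 << 30)))
                         for _ in range(M)])
        grad = clamped.T @ clamped / len(clamped) - free.T @ free / M
        np.fill_diagonal(grad, 0.0)
        W += lr * grad                               # gradient ascent, as above
        b += lr * (clamped.mean(0) - free.mean(0))
    return W, b
```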
Issues
- Training takes forever
- Doesn't really work for large problems
- Practical only for a small number of training instances over a small number of bits
Restricted Boltzmann Machines
- Partition visible and hidden units
- Visible units ONLY talk to hidden units
- Hidden units ONLY talk to visible units
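Because the connectivity is bipartite, all hidden units are conditionally independent given the visible units (and vice versa), so each layer can be sampled in a single vectorized step. A sketch with illustrative names, assuming `W` has shape `(n_visible, n_hidden)`:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sample_h(v, W, b_h, rng):
    """Sample every hidden unit at once: P(h_j = 1 | v) = sigmoid(v @ W + b_h)."""
    return (rng.random(b_h.shape) < sigmoid(v @ W + b_h)).astype(float)

def sample_v(h, W, b_v, rng):
    """Sample every visible unit at once: P(v_i = 1 | h) = sigmoid(h @ W.T + b_v)."""
    return (rng.random(b_v.shape) < sigmoid(h @ W.T + b_v)).astype(float)
```

With these two samplers, running the chain is just alternating `sample_h` and `sample_v`.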
Training
Step 1
- For each sample
- Anchor visible units
- Sample from hidden units
- No looping!!
Step 2
- Now unclamp the visible units and let the entire network evolve several times to generate
- $\mathbf{S}_{simul} = \{S_{simul,1}, S_{simul,2}, \ldots, S_{simul,M}\}$
- For each sample
- Initialize $V_0$ (visible) to the training instance value
- Iteratively generate hidden and visible units
- Gradient
- $\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$
A Shortcut: Contrastive Divergence
- Recall: Raise the neighborhood of each target memory
- Running just one sampling iteration is sufficient to give a good estimate of the gradient
- $\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1$
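A sketch of one CD-1 parameter update built on the RBM samplers above (`cd1_update` is an illustrative name; using probabilities rather than samples for the final hidden statistics is a common variance-reduction choice):

```python
import numpy as np

def cd1_update(v0, W, b_v, b_h, lr, rng):
    """One CD-1 step: a single up-down-up pass instead of running to equilibrium."""
    h0 = sample_h(v0, W, b_h, rng)        # <v_i h_j>^0  (data phase)
    v1 = sample_v(h0, W, b_v, rng)        # one-step reconstruction
    h1 = sigmoid(v1 @ W + b_h)            # probabilities for the final statistics
    W += lr * (np.outer(v0, h0) - np.outer(v1, h1))
    b_v += lr * (v0 - v1)
    b_h += lr * (h0 - h1)
    return W, b_v, b_h
```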