CMU 11-785 L21: Boltzmann Machines (2)

The Hopfield net as a distribution

The Helmholtz Free Energy of a System

  • At any time, the probability of finding the system in state $s$ at temperature $T$ is $P_T(s)$

  • Each state $s$ has a potential energy $E_s$

  • The internal energy of the system, representing its capacity to do work, is the average energy

    • $U_T = \sum_{s} P_T(s) E_s$
  • The capacity to do work is counteracted by the internal disorder of the system, i.e. its entropy

    • $H_T = -\sum_{s} P_T(s) \log P_T(s)$
  • The Helmholtz free energy of the system measures the useful work derivable from it and combines the two terms

    • $F_T = U_T - k T H_T$
    • $= \sum_{s} P_T(s) E_s + k T \sum_{s} P_T(s) \log P_T(s)$
  • The probability distribution of the states at steady state is known as the Boltzmann distribution

    • Minimizing this w.r.t. $P_T(s)$, subject to $\sum_s P_T(s) = 1$, we get

    • $P_T(s) = \frac{1}{Z} \exp\left(\frac{-E_s}{k T}\right)$

    • $Z$ is a normalizing constant
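
As a quick numerical illustration (not from the lecture), here is a minimal sketch that computes the Boltzmann distribution for a handful of hypothetical energy levels at two temperatures; the energies and temperatures are made-up values, and the normalization plays the role of $Z$.

```python
import numpy as np

def boltzmann(energies, kT):
    """P_T(s) = exp(-E_s / kT) / Z for a discrete set of states."""
    logits = -np.asarray(energies, dtype=float) / kT
    logits -= logits.max()            # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()                # the normalization is the constant Z

# Hypothetical energy levels (illustrative values only).
E = [0.0, 1.0, 2.0, 5.0]
print(boltzmann(E, kT=1.0))           # low temperature: mass concentrates on low-energy states
print(boltzmann(E, kT=10.0))          # high temperature: distribution approaches uniform
```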

Hopfield net as a distribution

  • $E(S) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i b_i s_i$
  • $P(S) = \frac{\exp(-E(S))}{\sum_{S'} \exp(-E(S'))}$
  • The stochastic Hopfield network models a probability distribution over states
  • It is a generative model: it generates states according to $P(S)$

The field at a single node

  • Let's take a single node $i$ as an example

  • Let $S$ and $S'$ be two states that are identical except that node $i$ takes value $+1$ in $S$ and $-1$ in $S'$

    • $P(S) = P(s_i = 1 \mid s_{j \neq i})\, P(s_{j \neq i})$
    • $P(S') = P(s_i = -1 \mid s_{j \neq i})\, P(s_{j \neq i})$
    • $\log P(S) - \log P(S') = \log P(s_i = 1 \mid s_{j \neq i}) - \log P(s_i = -1 \mid s_{j \neq i})$
    • $\log P(S) - \log P(S') = \log \frac{P(s_i = 1 \mid s_{j \neq i})}{1 - P(s_i = 1 \mid s_{j \neq i})}$
  • $\log P(S) = -E(S) + C$

    • $E(S) = -\frac{1}{2}\left(E_{\text{not } i} + \sum_{j \neq i} w_{ij} s_j + b_i\right)$
    • $E(S') = -\frac{1}{2}\left(E_{\text{not } i} - \sum_{j \neq i} w_{ij} s_j - b_i\right)$
  • $\log P(S) - \log P(S') = E(S') - E(S) = \sum_{j \neq i} w_{ij} s_j + b_i$

    • $\log\left(\frac{P(s_i = 1 \mid s_{j \neq i})}{1 - P(s_i = 1 \mid s_{j \neq i})}\right) = \sum_{j \neq i} w_{ij} s_j + b_i$

    • $P(s_i = 1 \mid s_{j \neq i}) = \frac{1}{1 + e^{-\left(\sum_{j \neq i} w_{ij} s_j + b_i\right)}}$

  • The probability of any node taking value 1, given the other node values, is a logistic function of its local field

Redefining the network

  • Redefine a regular Hopfield net as a stochastic system
  • Each neuron is now a stochastic unit with a binary state $s_i$, which can take value 0 or 1 with a probability that depends on the local field
    • $z_i = \sum_j w_{ij} s_j + b_i$
    • $P(s_i = 1 \mid s_{j \neq i}) = \frac{1}{1 + e^{-z_i}}$
  • Note
    • The Hopfield net is a probability distribution over binary sequences (Boltzmann distribution)
    • The conditional distribution of individual bits in the sequence is a logistic
  • The evolution of the Hopfield net can be made stochastic
    • Instead of deterministically responding to the sign of the local field, each neuron responds probabilistically
  • Patterns are recalled by letting the stochastic network evolve from an initial state (a sketch follows below)

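A minimal sketch of the stochastic recall dynamics just described, assuming 0/1 states, a hypothetical symmetric weight matrix `W` with zero diagonal, and a bias vector `b`: each unit is resampled to 1 with probability given by the logistic of its local field.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sweep(s, W, b, rng):
    """One sweep of stochastic updates: unit i is set to 1 with probability
    sigmoid(z_i), where z_i = sum_j w_ij s_j + b_i is its local field."""
    for i in rng.permutation(len(s)):
        z_i = W[i] @ s + b[i]                 # diag(W) is zero, so s_i does not feed itself
        s[i] = 1.0 if rng.random() < sigmoid(z_i) else 0.0
    return s

def recall(s0, W, b, n_sweeps=50, rng=rng):
    """Run the stochastic network from a (possibly corrupted) initial state."""
    s = s0.copy()
    for _ in range(n_sweeps):
        s = gibbs_sweep(s, W, b, rng)
    return s

# Toy 4-unit network with made-up symmetric weights and zero diagonal.
W = np.array([[0, 2, -1, 0],
              [2, 0, 0, -1],
              [-1, 0, 0, 2],
              [0, -1, 2, 0]], dtype=float)
b = np.zeros(4)
print(recall(np.array([1.0, 0.0, 1.0, 0.0]), W, b))
```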

The Boltzmann Machine

  • The entire model can be viewed as a generative model
  • Has a probability of producing any binary vector $\mathbf{y}$
    • $E(\mathbf{y}) = -\frac{1}{2} \mathbf{y}^{T} \mathbf{W} \mathbf{y}$
    • $P(\mathbf{y}) = C \exp\left(-\frac{E(\mathbf{y})}{T}\right)$
  • Training a Hopfield net: Must learn weights to “remember” target states and “dislike” other states
    • Must learn weights to assign a desired probability distribution to states
    • Just maximize likelihood
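
To make "a probability of producing any binary vector" concrete, the sketch below enumerates all $2^N$ states of a tiny network with hypothetical weights and computes their probabilities exactly; for realistic sizes this sum is intractable, which is why sampling is used later.

```python
import itertools
import numpy as np

def energy(y, W):
    """E(y) = -1/2 y^T W y for a state vector y with entries +/-1."""
    return -0.5 * y @ W @ y

def exact_distribution(W, T=1.0):
    """Enumerate all 2^N states and return their Boltzmann probabilities."""
    N = W.shape[0]
    states = [np.array(s, dtype=float) for s in itertools.product([-1, 1], repeat=N)]
    logits = np.array([-energy(y, W) / T for y in states])
    logits -= logits.max()                    # numerical stability
    p = np.exp(logits)
    return states, p / p.sum()                # normalization plays the role of C

# Hypothetical 3-unit network (symmetric weights, zero diagonal).
W = np.array([[0.0, 1.0, -1.0],
              [1.0, 0.0, 0.5],
              [-1.0, 0.5, 0.0]])
for y, p in zip(*exact_distribution(W)):
    print(y, round(float(p), 3))
```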

Maximum Likelihood Training

  • $\log P(S) = \left(\sum_{i<j} w_{ij} s_i s_j\right) - \log\left(\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)\right)$

  • $\mathcal{L} = \frac{1}{N} \sum_{S \in \mathbf{S}} \log P(S) = \frac{1}{N} \sum_{S}\left(\sum_{i<j} w_{ij} s_i s_j\right) - \log\left(\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)\right)$

  • Second term derivation

    • $\frac{d \log\left(\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)\right)}{d w_{ij}} = \sum_{S'} \frac{\exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)}{\sum_{S''} \exp\left(\sum_{i<j} w_{ij} s''_i s''_j\right)} s'_i s'_j$
    • $\frac{d \log\left(\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)\right)}{d w_{ij}} = \sum_{S'} P(S')\, s'_i s'_j$
    • The second term is simply the expected value of $s_i s_j$ over all possible states
    • We cannot compute it exhaustively, but we can compute it by sampling!
  • Overall gradient ascent rule

    • $\frac{d \mathcal{L}}{d w_{ij}} = \frac{1}{N} \sum_{S \in \mathbf{S}} s_i s_j - \sum_{S'} P(S')\, s'_i s'_j$
    • $w_{ij} = w_{ij} + \eta \frac{d\langle \log P(\mathbf{S}) \rangle}{d w_{ij}}$
  • Overall Training

    • Initialize weights
    • Let the network run to obtain simulated state samples
    • Compute gradient and update weights
    • Iterate
  • Note the similarity to the update rule for the Hopfield network

    • The only difference is how we got the samples
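
The sketch below follows this recipe for a fully visible network with hypothetical ±1 training patterns: the positive term is the data average of $s_i s_j$, and the negative term is estimated from states sampled by running the stochastic network with the logistic conditional from earlier. All sizes and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_state(W, b, n_sweeps, rng):
    """Draw one state by running the stochastic net for n_sweeps full sweeps."""
    N = len(b)
    s = rng.choice([-1.0, 1.0], size=N)
    for _ in range(n_sweeps):
        for i in rng.permutation(N):
            z = W[i] @ s + b[i]                        # local field (diag(W) stays zero)
            s[i] = 1.0 if rng.random() < sigmoid(z) else -1.0
    return s

def train(patterns, n_sweeps=10, n_samples=20, n_iters=50, eta=0.05):
    """Maximum-likelihood training of a fully visible Boltzmann machine."""
    N = patterns.shape[1]
    W, b = np.zeros((N, N)), np.zeros(N)
    data_corr = patterns.T @ patterns / len(patterns)  # <s_i s_j> over the training data
    data_mean = patterns.mean(axis=0)
    for _ in range(n_iters):
        samples = np.array([sample_state(W, b, n_sweeps, rng) for _ in range(n_samples)])
        model_corr = samples.T @ samples / n_samples   # <s'_i s'_j> estimated by sampling
        model_mean = samples.mean(axis=0)
        W += eta * (data_corr - model_corr)            # gradient ascent on log-likelihood
        np.fill_diagonal(W, 0.0)                       # no self-connections
        b += eta * (data_mean - model_mean)
    return W, b

# Two hypothetical 6-bit patterns to be "remembered".
patterns = np.array([[1, 1, 1, -1, -1, -1],
                     [-1, -1, 1, 1, 1, -1]], dtype=float)
W, b = train(patterns)
```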

Adding Capacity


  • Visible neurons

    • The neurons that store the actual patterns of interest
  • Hidden neurons

    • The neurons that only serve to increase the capacity but whose actual values are not important
  • We could have multiple hidden patterns coupled with any visible pattern

    • These would be multiple stored patterns that all give the same visible output
  • We are interested in the marginal probabilities over visible bits

    • $S = (V, H)$
    • $P(S) = \frac{\exp(-E(S))}{\sum_{S'} \exp(-E(S'))}$
    • $P(S) = P(V, H)$
    • $P(V) = \sum_{H} P(S)$
  • Train to maximize probability of desired patterns of visible bits

    • $E(S) = -\sum_{i<j} w_{ij} s_i s_j$
    • $P(S) = \frac{\exp\left(\sum_{i<j} w_{ij} s_i s_j\right)}{\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)}$
    • $P(V) = \sum_{H} \frac{\exp\left(\sum_{i<j} w_{ij} s_i s_j\right)}{\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)}$
  • Maximum Likelihood Training

    $\log P(V) = \log\left(\sum_{H} \exp\left(\sum_{i<j} w_{ij} s_i s_j\right)\right) - \log\left(\sum_{S'} \exp\left(\sum_{i<j} w_{ij} s'_i s'_j\right)\right)$

    $\mathcal{L} = \frac{1}{N} \sum_{V \in \mathbf{V}} \log P(V)$
    $\frac{d \mathcal{L}}{d w_{ij}} = \frac{1}{N} \sum_{V \in \mathbf{V}} \sum_{H} P(S \mid V)\, s_i s_j - \sum_{S'} P(S')\, s'_i s'_j$

  • $\sum_{H} P(S \mid V)\, s_i s_j \approx \frac{1}{K} \sum_{H \in \mathbf{H}_{simul}} s_i s_j$

    • Computed as an average over sampled hidden states with the visible bits fixed

  • $\sum_{S'} P(S')\, s'_i s'_j \approx \frac{1}{M} \sum_{S' \in \mathbf{S}_{simul}} s'_i s'_j$

    • Computed as an average over sampled states when the network is running "freely"

Training

Step 1

  • For each training pattern $V_i$
    • Fix the visible units to $V_i$
    • Let the hidden neurons evolve from a random initial point to generate $H_i$
    • Generate $S_i = [V_i, H_i]$
  • Repeat $K$ times to generate synthetic training data

$$\mathbf{S} = \{S_{1,1}, S_{1,2}, \ldots, S_{1,K}, S_{2,1}, \ldots, S_{N,K}\}$$

Step 2

  • Now unclamp the visible units and let the entire network evolve several times to generate

$$\mathbf{S}_{simul} = \{S_{simul,1}, S_{simul,2}, \ldots, S_{simul,M}\}$$

Gradients
$$\frac{d\langle \log P(\mathbf{S}) \rangle}{d w_{ij}} = \frac{1}{N K} \sum_{S \in \mathbf{S}} s_i s_j - \frac{1}{M} \sum_{S' \in \mathbf{S}_{simul}} s'_i s'_j$$

$$w_{ij} = w_{ij} + \eta \frac{d\langle \log P(\mathbf{S}) \rangle}{d w_{ij}}$$

  • Gradients are computed as before, except that the first term is now computed over the expanded training data
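
A sketch of this two-phase procedure, under the same hypothetical conventions as before (±1 states, logistic conditionals, illustrative sizes): in the clamped phase the visible units are fixed to each training pattern while only the hidden units are resampled; in the free phase the whole network runs.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sweep(s, W, b, idx, rng):
    """One sweep of stochastic updates, restricted to the units listed in idx."""
    for i in rng.permutation(idx):
        z = W[i] @ s + b[i]
        s[i] = 1.0 if rng.random() < sigmoid(z) else -1.0
    return s

def train_with_hidden(V, n_hidden, K=10, M=30, n_sweeps=10, n_iters=50, eta=0.05):
    n_vis = V.shape[1]
    N = n_vis + n_hidden
    hidden_idx, all_idx = np.arange(n_vis, N), np.arange(N)
    W, b = np.zeros((N, N)), np.zeros(N)
    for _ in range(n_iters):
        # Step 1 (clamped): fix the visible units to each pattern, sample hidden units K times.
        clamped = []
        for v in V:
            for _ in range(K):
                s = np.concatenate([v, rng.choice([-1.0, 1.0], size=n_hidden)])
                for _ in range(n_sweeps):
                    s = sweep(s, W, b, hidden_idx, rng)
                clamped.append(s.copy())
        clamped = np.array(clamped)
        # Step 2 (free): unclamp everything and let the whole network evolve.
        free = []
        for _ in range(M):
            s = rng.choice([-1.0, 1.0], size=N)
            for _ in range(n_sweeps):
                s = sweep(s, W, b, all_idx, rng)
            free.append(s.copy())
        free = np.array(free)
        # Gradient ascent: clamped correlations minus free-running correlations.
        W += eta * (clamped.T @ clamped / len(clamped) - free.T @ free / len(free))
        np.fill_diagonal(W, 0.0)
        b += eta * (clamped.mean(axis=0) - free.mean(axis=0))
    return W, b

# Hypothetical 4-bit visible patterns, plus 2 hidden units for extra capacity.
V = np.array([[1, 1, -1, -1], [-1, -1, 1, 1]], dtype=float)
W, b = train_with_hidden(V, n_hidden=2)
```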

Issues

  • Training takes forever
  • Doesn’t really work for large problems
    • Only practical for a small number of training instances over a small number of bits

Restricted Boltzmann Machines


  • Partition visible and hidden units
    • Visible units ONLY talk to hidden units
    • Hidden units ONLY talk to visible units

Training

Step 1

  • For each sample
    • Anchor visible units
    • Sample from hidden units
    • No looping needed: the hidden units are conditionally independent given the visible units (see the sketch below)
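
Because of the bipartite structure, sampling all hidden units given a clamped visible vector reduces to one vectorized operation, with no per-unit Gibbs loop. A minimal sketch, assuming 0/1 units, a hypothetical weight matrix `W` of shape (n_visible, n_hidden), and hidden biases `c`:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_hidden(v, W, c, rng):
    """Sample every hidden unit in parallel: p(h_j = 1 | v) = sigmoid(v @ W + c)_j."""
    p_h = sigmoid(v @ W + c)
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

# Hypothetical sizes: 6 visible units, 3 hidden units.
W = rng.normal(scale=0.1, size=(6, 3))   # visible-to-hidden weights
c = np.zeros(3)                          # hidden biases
v = np.array([1, 0, 1, 1, 0, 0], dtype=float)
h, p_h = sample_hidden(v, W, c, rng)
```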

Step 2

  • Now unclamp the visible units and let the entire network evolve several times to generate

$$\mathbf{S}_{simul} = \{S_{simul,1}, S_{simul,2}, \ldots, S_{simul,M}\}$$


  • For each sample
    • Initialize $V_0$ (visible) to the training instance value
    • Iteratively generate hidden and visible units
  • Gradient


$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{\infty}$$

A Shortcut: Contrastive Divergence

  • Recall: Raise the neighborhood of each target memory
  • Running the chain for just one step is sufficient to give a good estimate of the gradient

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{1}$$
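
A minimal sketch of one CD-1 weight update, assuming 0/1 units, a hypothetical weight matrix `W` of shape (n_visible, n_hidden), and visible/hidden biases: $\langle v_i h_j \rangle^{0}$ comes from hidden samples driven by the data, and $\langle v_i h_j \rangle^{1}$ from a single reconstruction step.

```python
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b_v, b_h, eta, rng):
    """One CD-1 update on a batch of 0/1 visible vectors v0 (shape: batch x n_visible)."""
    # Data-driven ("time 0") phase: sample the hidden units from the training data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Single reconstruction ("time 1") step: resample visible, then recompute hidden probabilities.
    p_v1 = sigmoid(h0 @ W.T + b_v)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_h)
    # <v_i h_j>^0 - <v_i h_j>^1, averaged over the batch.
    n = len(v0)
    W += eta * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b_v += eta * (v0 - v1).mean(axis=0)
    b_h += eta * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h

# Toy batch of hypothetical 6-bit binary patterns, 3 hidden units.
v0 = rng.integers(0, 2, size=(8, 6)).astype(float)
W = rng.normal(scale=0.1, size=(6, 3))
b_v, b_h = np.zeros(6), np.zeros(3)
for _ in range(100):
    W, b_v, b_h = cd1_update(v0, W, b_v, b_h, eta=0.1, rng=rng)
```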
