DL(21) - Restricted Boltzmann Machine

Separation

no connections among hⱼ (latent variables)

no connections among vᵢ (visible variables)

P(v) is intractable

P(h|v) is factorial and easy to compute

P(v|h) is factorial and easy to compute

Energy-based Undirected Model

E(v, h) = − bᵀv − cᵀh − vᵀWh

Z = ∑ᵥ∑ₕ exp{ − E (v, h) }

P(v,h) = 1/Z exp( − E (v, h) )
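
as a concrete illustration (a sketch, not part of the original notes): for a tiny binary RBM we can enumerate every configuration and compute E(v, h), Z and P(v, h) directly; the sizes and parameter values below (n_v, n_h, W, b, c) are arbitrary choices for the example

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2                       # tiny sizes so that Z can be enumerated
W = rng.normal(scale=0.1, size=(n_v, n_h))
b = rng.normal(scale=0.1, size=n_v)   # visible biases
c = rng.normal(scale=0.1, size=n_h)   # hidden biases

def energy(v, h):
    # E(v, h) = -b^T v - c^T h - v^T W h
    return -(b @ v) - (c @ h) - (v @ W @ h)

all_v = [np.array(v) for v in itertools.product([0, 1], repeat=n_v)]
all_h = [np.array(h) for h in itertools.product([0, 1], repeat=n_h)]

# Z = sum_v sum_h exp(-E(v, h))
Z = sum(np.exp(-energy(v, h)) for v in all_v for h in all_h)

def joint(v, h):
    # P(v, h) = exp(-E(v, h)) / Z
    return np.exp(-energy(v, h)) / Z

# the joint sums to 1 over all configurations
assert np.isclose(sum(joint(v, h) for v in all_v for h in all_h), 1.0)
```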

Why do we care about efficient computation of P(h|v) and P(v|h)?

computing ∇ᶿ log p̃(x; θ) is not a problem

P(h|v) = ∏ⱼ σ[ (2hⱼ − 1) (cⱼ + (Wᵀv)ⱼ) ],   j = 1, …, nₕ

P(v|h) = ∏ᵢ σ[ (2vᵢ − 1) (bᵢ + (Wh)ᵢ) ],   i = 1, …, nᵥ
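
a sketch of these factorial conditionals for the same kind of toy binary RBM (illustrative sizes and random parameters): each hidden unit depends on c + Wᵀv and each visible unit on b + Wh, so both conditionals are products of independent sigmoids

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W = rng.normal(scale=0.1, size=(n_v, n_h))
b = rng.normal(scale=0.1, size=n_v)
c = rng.normal(scale=0.1, size=n_h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(h, v):
    # P(h|v) = prod_j sigma((2h_j - 1)(c_j + (W^T v)_j)), one factor per hidden unit
    return np.prod(sigmoid((2 * h - 1) * (c + W.T @ v)))

def p_v_given_h(v, h):
    # P(v|h) = prod_i sigma((2v_i - 1)(b_i + (W h)_i)), one factor per visible unit
    return np.prod(sigmoid((2 * v - 1) * (b + W @ h)))

v = np.array([1, 0, 1])
# the hidden conditional sums to 1 over all 2^n_h hidden configurations
total = sum(p_h_given_v(np.array(h), v)
            for h in [(0, 0), (0, 1), (1, 0), (1, 1)])
assert np.isclose(total, 1.0)
```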

however the term ∇ᶿ log Z(θ) is more problematic

these gradients matter because of learning: given training data x we need to maximize p(x; θ)

it is easier to maximize the log-likelihood: treating the training samples as independent, the product of likelihoods becomes a sum of log-likelihoods

however its gradient wrt θ is: ∇ᶿ log p(x; θ) = ∇ᶿ log p̃(x; θ) - ∇ᶿ log Z(θ)

Monte Carlo Method

ŝₙ = 1/n ∑ᵢ₌₁ⁿ f(x⁽ⁱ⁾),   x⁽ⁱ⁾ ~ p

E[ŝₙ] = s

s = ∫ p(x) f(x) dx = Eₚ[ f(x) ]
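
a minimal sketch of the plain Monte Carlo estimator; p, f and the sample size below are illustrative choices for which the exact answer Eₚ[x²] = 1 is known

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2            # E_p[f(x)] = 1 when p = N(0, 1)

n = 100_000
x = rng.standard_normal(n)   # x^(i) ~ p
s_hat = np.mean(f(x))        # s_hat_n = (1/n) * sum_i f(x^(i))

print(s_hat)                 # close to 1; unbiased, variance shrinks like Var[f(x)] / n
```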

under some conditions it can be shown that ∇ᶿ log Z(θ) = E_{x~p(x)}[ ∇ᶿ log p̃(x) ]

which is not possible to compute exactly; however, we can use the Monte Carlo method to compute a good approximation of it
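
on a toy RBM small enough to enumerate we can check this identity exactly, treating x = (v, h) as the full configuration: a numerical gradient of log Z wrt the hidden biases c matches E_P[ ∇_c log p̃(v, h) ] = E_P[h]; the sizes, parameters, and the choice of differentiating wrt c below are illustrative

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W = rng.normal(scale=0.1, size=(n_v, n_h))
b = rng.normal(scale=0.1, size=n_v)
c = rng.normal(scale=0.1, size=n_h)

configs = [(np.array(v), np.array(h))
           for v in itertools.product([0, 1], repeat=n_v)
           for h in itertools.product([0, 1], repeat=n_h)]

def energy(v, h, c_):
    return -(b @ v) - (c_ @ h) - (v @ W @ h)

def log_Z(c_):
    return np.log(sum(np.exp(-energy(v, h, c_)) for v, h in configs))

# left-hand side: numerical gradient of log Z wrt the hidden biases c
eps = 1e-6
lhs = np.array([(log_Z(c + eps * np.eye(n_h)[j]) - log_Z(c - eps * np.eye(n_h)[j])) / (2 * eps)
                for j in range(n_h)])

# right-hand side: E_{(v,h)~P}[ grad_c log p~(v,h) ] = E_P[h], computed by enumeration
Z = np.exp(log_Z(c))
rhs = sum(np.exp(-energy(v, h, c)) / Z * h for v, h in configs)

print(lhs, rhs)   # the two gradients agree
```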

however, sampling from p(x) is not always possible

Importance Sampling

why is this useful?

a good q can reduce the variance

the estimator is still unbiased for every q

it may be feasible to sample from q but not from p

p(x) f(x) = q(x) (p(x) f(x)) / q(x)

q(x) is our new p: the distribution we will draw samples from

(p(x) f(x)) / q(x) is our new f: the ratio we will evaluate at each sample

optimal q

ŝ_q = 1/n ∑ᵢ₌₁ⁿ ( p(x⁽ⁱ⁾) f(x⁽ⁱ⁾) ) / q(x⁽ⁱ⁾),   x⁽ⁱ⁾ ~ q

E[ ŝ_q ] = E[ ŝₚ ] = s

Var[ ŝ_q ] = Var_q[ ( p(x) f(x) ) / q(x) ] / n

the minimum variance occurs when q = q*, with

q*(x) = ( p(x) |f(x)| ) / Z,   where Z normalizes q* to a valid probability density
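
a sketch of importance sampling on a toy problem (the densities and f are illustrative, not from the notes): estimating a small tail probability under a standard normal p, where plain sampling from p almost never hits the event, while a proposal q shifted onto the tail concentrates samples there and the ratio p/q reweights them

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def normal_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def f(x):
    return (x > 3.0).astype(float)          # rare event under p = N(0, 1)

# plain Monte Carlo: sample directly from p
x_p = rng.standard_normal(n)
s_plain = np.mean(f(x_p))

# importance sampling: sample from q = N(3, 1), weight each sample by p(x)/q(x)
x_q = rng.normal(loc=3.0, scale=1.0, size=n)
w = normal_pdf(x_q, 0.0) / normal_pdf(x_q, 3.0)
s_is = np.mean(w * f(x_q))                  # same expectation, much lower variance

exact = 0.5 * math.erfc(3.0 / math.sqrt(2)) # P(x > 3) under N(0, 1), about 1.35e-3
print(s_plain, s_is, exact)
```

with f ≥ 0, choosing q exactly proportional to p(x) f(x) (the q* above; here a normal truncated to x > 3) makes the weighted ratio constant and the variance zero; in practice q* is unavailable because its normalizer is the very quantity we want to estimate, so one settles for a q that puts mass where p(x) |f(x)| is large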