DL(21) - Restricted Boltzmann Machine
Separation
no connections among hⱼ (latent variables)
no connections among vᵢ (visible variables)
P(v) is intractable
P(h|v) is factorial and easy to compute
P(v|h) is factorial and easy to compute
Energy-based Undirected Model
E (v, h) = − bᵀv − cᵀh − vᵀWh
Z = ∑ᵥ∑ₕ exp{ − E (v, h) }
P(v,h) = 1/Z exp( − E (v, h) )
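A minimal sketch of these quantities in NumPy (the sizes nᵥ = 3, nₕ = 2 and the random parameters are assumptions for illustration, not from the notes). The only parameters are the visible-hidden weight matrix W and the biases b, c, reflecting the bipartite structure above; Z is computed by brute force over all 2^(nᵥ+nₕ) configurations, which is exactly what becomes intractable at realistic sizes.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W = rng.normal(scale=0.1, size=(n_v, n_h))  # visible-hidden weights; no v-v or h-h weights
b = np.zeros(n_v)                           # visible biases
c = np.zeros(n_h)                           # hidden biases

def energy(v, h):
    # E(v, h) = -b^T v - c^T h - v^T W h
    return -b @ v - c @ h - v @ W @ h

# Z sums over all 2^(n_v + n_h) joint configurations -- intractable for realistic sizes
Z = sum(np.exp(-energy(np.array(v), np.array(h)))
        for v in itertools.product([0, 1], repeat=n_v)
        for h in itertools.product([0, 1], repeat=n_h))

def joint_prob(v, h):
    # P(v, h) = exp(-E(v, h)) / Z
    return np.exp(-energy(v, h)) / Z
```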
Why do we care about efficient computation of P(h|v) and P(v|h)?
computing ∇_θ log p̃(x; θ) is not a problem
P(h|v) = ∏ⱼ σ( [ (2h − 1) ⊙ (c + Wᵀv) ]ⱼ ),  product over j = 1, …, nₕ
P(v|h) = ∏ᵢ σ( [ (2v − 1) ⊙ (b + Wh) ]ᵢ ),  product over i = 1, …, nᵥ
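A minimal sketch of those conditionals and one block-Gibbs step (the small W, b, c below are illustrative assumptions, as in the previous sketch). Because the conditionals factorize over units, each one is just a vector of sigmoids:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2
W = rng.normal(scale=0.1, size=(n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v):
    # P(h_j = 1 | v) = sigmoid(c_j + (W^T v)_j), independently for each j
    return sigmoid(c + W.T @ v)

def p_v_given_h(h):
    # P(v_i = 1 | h) = sigmoid(b_i + (W h)_i), independently for each i
    return sigmoid(b + W @ h)

def gibbs_step(v):
    # one block-Gibbs step: sample all of h given v, then all of v given the new h
    h = (rng.random(n_h) < p_h_given_v(v)).astype(float)
    v_new = (rng.random(n_v) < p_v_given_h(h)).astype(float)
    return v_new, h
```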
however, the term ∇_θ log Z(θ) is more problematic
why? because of learning:
given training data x, we need to maximize p(x; θ)
it is easier to maximize the log-likelihood, because we assume the samples are independent, so the product over samples becomes a sum
however, its gradient wrt θ is: ∇_θ log p(x; θ) = ∇_θ log p̃(x; θ) − ∇_θ log Z(θ)
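Written out for an i.i.d. training set x⁽¹⁾, …, x⁽ᵐ⁾ (notation assumed here for clarity), the product of likelihoods becomes a sum of log-likelihoods, and each term's gradient splits into the two pieces above:

```latex
\log \prod_{t=1}^{m} p(x^{(t)};\theta)
  = \sum_{t=1}^{m} \log p(x^{(t)};\theta)
  = \sum_{t=1}^{m} \log \tilde{p}(x^{(t)};\theta) \;-\; m \log Z(\theta)
\quad\Longrightarrow\quad
\nabla_{\theta} \log p(x;\theta)
  = \nabla_{\theta} \log \tilde{p}(x;\theta) - \nabla_{\theta} \log Z(\theta)
```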
Monte Carlo Method
s = ∫ p(x) f(x) dx = Eₚ[ f(x) ]
ŝₙ = 1/n ∑ᵢ f(x⁽ⁱ⁾),  x⁽ⁱ⁾ ~ p
E[ŝₙ] = s  (the estimator is unbiased)
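A minimal sketch of the estimator (the choice of p, f, and the sample size are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2               # any function whose expectation under p we want
samples = rng.normal(size=10_000)  # x^(i) ~ p, here p = N(0, 1)
s_hat = np.mean(f(samples))        # Monte Carlo estimate of s = E_p[f(x)] (= 1 here)
```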
under some conditions it can be shown that ∇_θ log Z(θ) = E_{x~p(x)}[ ∇_θ log p̃(x) ]
this expectation is not possible to compute exactly; however, we can use the Monte Carlo method to compute a good approximation of it
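A sketch of the standard derivation behind that identity, for discrete x and assuming the gradient and the sum can be exchanged (this is the "some conditions" part):

```latex
\nabla_{\theta} \log Z(\theta)
  = \frac{\nabla_{\theta} Z(\theta)}{Z(\theta)}
  = \frac{1}{Z(\theta)} \sum_{x} \nabla_{\theta} \tilde{p}(x;\theta)
  = \sum_{x} \frac{\tilde{p}(x;\theta)}{Z(\theta)} \, \nabla_{\theta} \log \tilde{p}(x;\theta)
  = \mathbb{E}_{x \sim p(x)}\!\left[ \nabla_{\theta} \log \tilde{p}(x;\theta) \right]
```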
however, sampling from p(x) is not always possible
Importance Sampling
why is this useful?
a good q can reduce the variance
the estimator is still unbiased for every valid q
it may be feasible to sample from q even when it is not feasible to sample from p
p(x) f(x) = q(x) ∙ (p(x) f(x)) / q(x)
q(x) is our new p: we will draw samples from it
the ratio ( p(x) f(x) ) / q(x) is our new f: we evaluate it at each sample
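A minimal sketch of the reweighting (the densities p, q and the function f are illustrative assumptions): we only ever sample from q and evaluate the ratio p(x) f(x) / q(x) at those samples:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# target p = N(0, 1); proposal q = N(0, 2) -- we only draw samples from q
x = rng.normal(loc=0.0, scale=2.0, size=10_000)
weights = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 2.0)  # the ratio p(x) / q(x)
s_hat_q = np.mean(weights * f(x))                          # unbiased estimate of E_p[f(x)]
```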
Optimal q
ŝ_q = 1/n ∑ᵢ ( p(x⁽ⁱ⁾) f(x⁽ⁱ⁾) ) / q(x⁽ⁱ⁾),  x⁽ⁱ⁾ ~ q
E[ŝ_q] = E[ŝₚ] = s
Var[ ŝ_q ] = Var_{x~q}[ ( p(x) f(x) ) / q(x) ] / n
the minimum variance occurs when q is
q*(x) = ( p(x) |f(x)| ) / Z,  where Z is the constant that normalizes q*
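A sketch that checks this numerically (the target p = N(0, 1), the function f(x) = x², and the sampling trick for q* are assumptions made just for this example). Since f ≥ 0 here, q*(x) ∝ p(x) f(x) = x²·φ(x), which is already normalized (Z = 1) and can be sampled as a signed chi(3) variate; every importance weight p(x) f(x) / q*(x) then equals the same constant, so the estimator's variance collapses to essentially zero, while plain Monte Carlo from p keeps a nonzero variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, repeats = 1_000, 200
f = lambda x: x ** 2

def phi(x):
    # standard normal density
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

naive, optimal = [], []
for _ in range(repeats):
    # plain Monte Carlo: sample directly from p = N(0, 1)
    x_p = rng.normal(size=n)
    naive.append(np.mean(f(x_p)))

    # importance sampling with q*(x) = x^2 * phi(x): |x| ~ chi(3), with a random sign
    x_q = np.sign(rng.normal(size=n)) * np.linalg.norm(rng.normal(size=(n, 3)), axis=1)
    q_star = x_q ** 2 * phi(x_q)
    optimal.append(np.mean(phi(x_q) * f(x_q) / q_star))

print(np.var(naive), np.var(optimal))  # q* gives (numerically) zero variance
```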