Bayesian Machine Learning

quantities of interest (QoI) treated as random variables

conclusions drawn by analysing the posterior distribution over these quantities given the observed data.

posterior probability

conditional probability assigned to a random event after relevant (prior) evidence/background is taken into account.

used in machine learning due to the flexibility Bayesian statistics provides in building structured models of real world phenomena.

when to use Bayesian machine learning

parameter estimation problems

you have derived a statistical model of some domain

model comparisons

use Bayesian statistics to make a prediction

think the parameters of the model are meaningful

fit the parameters in order to learn something useful

which model best fits the data?

multiple models of differing complexities

trade off complexity with the degree of fit

Bayesian model averaging

define a prior over the models themselves and average the predictions with respect to the posterior over models.

Bayesian networks (Bayes nets)

represent Bayesian inference problems as Bayes nets

non-Bayesian techniques

generalisation

how well a machine learning algorithm performs on data it hasn't seen before

techniques

regularisation

method to prevent overfitting

maximum likelihood

criterion for fitting the parameters of a generative model
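
As a minimal worked example (a toy sketch with illustrative values): for a 1-D Gaussian the maximum likelihood estimates are available in closed form.

```python
import numpy as np

# Minimal sketch: maximum likelihood for a 1-D Gaussian.
# The MLE of the mean is the sample mean; the MLE of the variance is
# the (biased) sample variance, normalised by N rather than N - 1.
data = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)

mu_mle = data.mean()
var_mle = ((data - mu_mle) ** 2).mean()   # equivalently data.var(ddof=0)

print(mu_mle, var_mle)
```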

expectation maximisation (EM)

algorithm for fitting generative models where each data point has associated latent (or unobserved) variables

Bayesian inference algorithms

no analytic solution for posterior distribution over a model's parameters (and possibly latent variables) given the observed data

use approximate inference algorithms to estimate the posterior distribution

techniques

MAP estimation

Gibbs sampling

iterative procedure sampling each random variable from its conditional distribution given the remaining ones.

result is (hopefully) an approximate sample from the posterior distribution

This is the method of choice for potfit
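
As a minimal sketch of the general idea (a toy example, not tied to any particular application): a Gibbs sampler for a correlated bivariate Gaussian, where both conditional distributions are available in closed form.

```python
import numpy as np

# Toy Gibbs sampler for a bivariate Gaussian with zero mean, unit
# variances and correlation rho.  Each variable is resampled from its
# conditional given the other; the chain's samples approximate the joint.
rng = np.random.default_rng(0)
rho = 0.8
n_samples = 5000

x1, x2 = 0.0, 0.0
samples = np.empty((n_samples, 2))
for t in range(n_samples):
    # conditional of x1 given x2 is N(rho * x2, 1 - rho^2), and vice versa
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho ** 2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho ** 2))
    samples[t] = x1, x2

print(np.corrcoef(samples.T))   # should be close to rho
```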

approximate the posterior with a point estimate at the optimal parameters

replaces an integration problem with an optimisation problem
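
A minimal sketch of MAP estimation as optimisation, assuming a toy model (an unknown Gaussian mean with a Gaussian prior and known noise variance; all values illustrative):

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of MAP estimation: maximise log prior + log likelihood
# (equivalently minimise their negative) instead of integrating over
# the posterior.  Toy model: unknown mean mu with a N(0, 10^2) prior
# and unit-variance Gaussian observation noise.
data = np.random.default_rng(1).normal(3.0, 1.0, size=50)

def neg_log_posterior(params):
    mu = params[0]
    log_prior = -0.5 * mu ** 2 / 10.0 ** 2       # N(0, 10^2) prior on mu
    log_lik = -0.5 * np.sum((data - mu) ** 2)    # unit-variance likelihood
    return -(log_prior + log_lik)

mu_map = minimize(neg_log_posterior, x0=np.array([0.0])).x[0]
print(mu_map)   # close to the sample mean, shrunk slightly towards 0
```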

Markov chain Monte Carlo (MCMC)

sampling-based algorithms which simulate Markov chains over the model's parameters in order to sample from the posterior distribution

Metropolis-Hastings (M-H): a recipe for constructing valid Markov chains

valid if Markov chain has the posterior distribution as a stationary distribution

choose a proposal distribution and correct for bias by stochastically accepting or rejecting the proposal.
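
A minimal sketch of the random-walk variant, where the symmetric Gaussian proposal means the acceptance ratio reduces to a ratio of unnormalised target densities (toy target, illustrative values):

```python
import numpy as np

# Sketch of random-walk Metropolis-Hastings.  The proposal is a Gaussian
# step around the current state; because it is symmetric, the acceptance
# probability only involves the (unnormalised) target density.
rng = np.random.default_rng(0)

def log_target(x):
    # unnormalised log density: an equal mixture of two Gaussians
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

x = 0.0
step = 1.0
samples = []
for _ in range(20000):
    proposal = x + step * rng.normal()
    # accept with probability min(1, p(proposal) / p(x))
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

print(np.mean(samples), np.std(samples))
```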

variational inference

approximate the intractable posterior distribution with a tractable distribution

choose parameters of tractable approximation which minimise some measure of distance from the true posterior

Models

Advanced models

inference approximations which are easier to implement

Laplace approximation

approximates the posterior distribution with a Gaussian centered around the MAP estimate.
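
A minimal sketch on a toy 1-D problem (a coin-flip posterior with a uniform prior; values illustrative), using a numerical second derivative rather than an analytic Hessian:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Laplace approximation sketch: a coin with a uniform prior and k heads
# out of n flips (the exact posterior is a Beta distribution).  The
# posterior is approximated by a Gaussian centred at the MAP estimate,
# with variance given by the inverse curvature (second derivative of the
# negative log posterior) at that point.
k, n = 14, 20

def neg_log_post(theta):
    return -(k * np.log(theta) + (n - k) * np.log(1 - theta))

theta_map = minimize_scalar(neg_log_post, bounds=(1e-6, 1 - 1e-6),
                            method="bounded").x

# second derivative via central finite differences
h = 1e-5
curvature = (neg_log_post(theta_map + h) - 2 * neg_log_post(theta_map)
             + neg_log_post(theta_map - h)) / h ** 2

mean, var = theta_map, 1.0 / curvature
print(mean, np.sqrt(var))   # Gaussian approximation N(mean, var)
```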

Bayesian information criterion (BIC)

takes the value of the MAP solution and adds a penalty proportional to the number of parameters.
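
In symbols, under one common convention (N data points, k parameters, fitted parameters \hat{\theta}):

```latex
% BIC as an approximation to the log marginal likelihood, with N data
% points, k parameters and fitted parameters \hat{\theta}:
\log p(\mathcal{D}) \;\approx\; \log p(\mathcal{D} \mid \hat{\theta}) - \frac{k}{2} \log N
```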

generative models to which Bayesian techniques are applied

in discriminative models the conditional distribution of the targets given the observations is directly modelled.

Bayesian linear regression is the canonical example of this
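
A minimal sketch, assuming a zero-mean isotropic Gaussian prior on the weights (precision alpha) and known Gaussian noise precision beta, in which case the posterior over the weights has a closed form (all values illustrative):

```python
import numpy as np

# Bayesian linear regression sketch: Gaussian prior on the weights plus
# Gaussian observation noise gives a Gaussian posterior in closed form.
rng = np.random.default_rng(0)

# toy data: y = 1.5 x - 0.5 + noise, with a bias feature appended
x = rng.uniform(-1, 1, size=30)
y = 1.5 * x - 0.5 + 0.1 * rng.normal(size=30)
Phi = np.column_stack([x, np.ones_like(x)])   # design matrix [x, 1]

alpha, beta = 1.0, 100.0   # prior precision, noise precision (assumed known)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)   # posterior covariance
m_N = beta * S_N @ Phi.T @ y                                  # posterior mean

print(m_N)                     # close to [1.5, -0.5]
print(np.sqrt(np.diag(S_N)))   # posterior standard deviations
```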

hidden Markov models

model for time series data where there is a latent discrete state which evolves over time

factor analysis

each data point is approximated as a linear function of a lower dimensional representation

each dimension of the latent space corresponds to a meaningful factor (or dimension of variation) in the data.

mixture of Gaussians

each data point belongs to one of several "clusters", or groups

fitting this model often allows one to infer a meaningful grouping of the data points.

the data points within each cluster are Gaussian distributed
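
A minimal sketch of the generative process (illustrative weights, means and a shared covariance chosen for simplicity):

```python
import numpy as np

# Mixture-of-Gaussians generative process: each point first picks a
# cluster according to the mixing weights, then is drawn from that
# cluster's Gaussian.  Inferring the latent cluster assignments from the
# data is what gives the grouping.
rng = np.random.default_rng(0)

weights = np.array([0.5, 0.3, 0.2])          # mixing proportions
means = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, 0.0]])
cov = np.eye(2)                              # shared covariance, for simplicity

n = 500
z = rng.choice(len(weights), size=n, p=weights)     # latent cluster assignments
X = np.array([rng.multivariate_normal(means[k], cov) for k in z])

print(X.shape, np.bincount(z) / n)
```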

logistic regression

Bayesian networks (Bayes nets)

latent Dirichlet allocation (a "topic model")

linear dynamical systems

sparse coding

each data point is modelled as a linear combination of a small number of elements drawn from a larger dictionary.

applications

natural image patches: when applied to natural image patches, the learned dictionary resembles the receptive fields of neurons in the primary visual cortex

independent component analysis is a closely related model

applied to problems involving a set of documents (e.g. web pages), each of which is assumed to be composed of some number of topics.

related models

nonnegative matrix factorization

probabilistic latent semantic analysis

a discriminative model for predicting binary targets given input features

directed graphs which encode patterns of probabilistic dependencies between different random variables.

typically chosen to represent the causal relationships between variables.

Bayes nets can also be learned in a non-Bayesian way, but Bayesian techniques can be used to learn both the parameters and structure (the sets of edges) of the network.

a time series model where a low-dimensional Gaussian latent state evolves over time. The observations are noisy linear functions of the latent states.

can be thought of as a continuous version of the HMM.

inference in this model can be performed exactly using the Kalman filter and smoother
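
A minimal sketch of the filter's predict/update recursions on a toy constant-velocity tracking model (matrices and noise levels illustrative):

```python
import numpy as np

# Kalman filter sketch for a linear dynamical system
#   z_t = A z_{t-1} + process noise (cov Q)
#   x_t = C z_t     + observation noise (cov R)
# Each step alternates a predict and an update; the filtered posterior
# over the latent state stays Gaussian throughout.
rng = np.random.default_rng(0)

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # constant-velocity dynamics
C = np.array([[1.0, 0.0]])               # we only observe the position
Q = 0.01 * np.eye(2)
R = np.array([[1.0]])

# simulate a trajectory
T = 50
z = np.zeros(2)
obs = []
for _ in range(T):
    z = A @ z + rng.multivariate_normal(np.zeros(2), Q)
    obs.append(C @ z + rng.multivariate_normal(np.zeros(1), R))

# filter
mu, P = np.zeros(2), np.eye(2)           # initial belief
for x in obs:
    # predict
    mu, P = A @ mu, A @ P @ A.T + Q
    # update
    S = C @ P @ C.T + R                  # innovation covariance
    K = P @ C.T @ np.linalg.inv(S)       # Kalman gain
    mu = mu + K @ (x - C @ mu)
    P = (np.eye(2) - K @ C) @ P

print(mu, z)   # filtered mean vs. true final state
```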

Bayesian nonparametrics

sidesteps the problem that the number of parameters typically needs to be chosen in advance, by defining models which are infinitely complex.

for a finite dataset, posterior inference can still be performed while only explicitly representing a finite portion of the model.

building blocks of Bayesian nonparametric models

Gaussian processes

priors over functions such that the function values sampled at any finite set of points are jointly Gaussian. The default approach when a prior over functions is desired.

posterior inference is tractable in many instances.
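
A minimal sketch of GP regression, assuming a fixed RBF kernel and a known noise level (hyperparameters illustrative, not learned):

```python
import numpy as np

# Gaussian-process regression sketch with an RBF (squared-exponential)
# kernel: function values at any finite set of inputs are jointly
# Gaussian, so the posterior at test points is also Gaussian and can be
# written down in closed form.
def rbf(a, b, lengthscale=0.5, variance=1.0):
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=20)
y = np.sin(X) + 0.1 * rng.normal(size=20)
X_star = np.linspace(-3, 3, 100)

noise = 0.1 ** 2
K = rbf(X, X) + noise * np.eye(len(X))
K_s = rbf(X, X_star)
K_ss = rbf(X_star, X_star)

mean = K_s.T @ np.linalg.solve(K, y)                 # posterior mean
cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)         # posterior covariance

print(mean[:5], np.sqrt(np.diag(cov))[:5])
```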

Chinese restaurant process

a prior over partitions of an infinite set of objects

most commonly used in clustering models where the number of components is not specified in advance.

The inference algorithms are fairly simple and well understood, so there is no reason not to use a CRP model in place of a finite clustering model.

the process can equivalently be viewed as a Dirichlet process.
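
A minimal sketch of sampling a partition from a CRP with concentration parameter alpha (illustrative value):

```python
import numpy as np

# Chinese restaurant process sketch: customer n joins an existing table
# with probability proportional to its size, or starts a new table with
# probability proportional to alpha.  The number of tables (clusters) is
# not fixed in advance.
rng = np.random.default_rng(0)
alpha = 1.0
n_customers = 100

table_sizes = []
assignments = []
for n in range(n_customers):
    probs = np.array(table_sizes + [alpha], dtype=float)
    probs /= probs.sum()
    k = rng.choice(len(probs), p=probs)
    if k == len(table_sizes):
        table_sizes.append(1)     # start a new table
    else:
        table_sizes[k] += 1       # join an existing one
    assignments.append(k)

print(len(table_sizes), table_sizes)
```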

hierarchical Dirichlet process

a set of Dirichlet processes which share the same base measure, and the base measure is itself drawn from a Dirichlet process.

Indian buffet process

a prior over infinite binary matrices such that each row of the matrix has only a finite number of 1's.

commonly used in models where each object can have various attributes, i.e. rows of the matrix correspond to objects, columns correspond to attributes, and an entry is 1 if the object has the corresponding attribute.

examples

IBP linear-Gaussian model

the observed data are linear functions of the attributes.

the beta process is to the IBP as the Dirichlet process is to the CRP.

Dirichlet diffusion trees

a hierarchical clustering model where the data points cluster at different levels of granularity.

Pitman-Yor process

like the CRP but has a more heavy tailed distribution (in particular, a power law) over cluster sizes.

i.e. very few large clusters would be expected, and instead there will likely be a large number of smaller clusters.

i.e. there may be a few coarse-grained clusters, but these themselves might decompose into finer-grained clusters.

power law distributions are a better fit to many real-world datasets than the exponential distributions favoured by the CRP.

advanced sampling algorithms

Markov chain Monte Carlo (MCMC) algorithms

collapsed Gibbs sampling

a subset of the variables are marginalised (or collapsed) out analytically, and Gibbs sampling is performed over the remaining variables.

i.e. when fitting a CRP clustering model, the cluster parameters are often marginalised out and Gibbs sampling is performed over the cluster assignments. (This can dramatically improve mixing as the assignments and cluster parameters are highly coupled.)

Hamiltonian Monte Carlo (HMC)

an instance of M-H for continuous spaces

the gradient of the log probability is used to choose promising directions to explore

this is the algorithm which powers Stan
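
A compact sketch of a single HMC transition (leapfrog integration plus a Metropolis correction) on a toy 2-D Gaussian target with a hand-coded gradient; the step size and trajectory length are illustrative, not tuned:

```python
import numpy as np

# Hamiltonian Monte Carlo sketch: resample momentum, simulate the
# dynamics with a leapfrog integrator, then accept/reject based on the
# change in total energy to correct for discretisation error.
rng = np.random.default_rng(0)
cov_inv = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))

def log_p(x):
    return -0.5 * x @ cov_inv @ x

def grad_log_p(x):
    return -cov_inv @ x

def hmc_step(x, step_size=0.1, n_leapfrog=20):
    p = rng.normal(size=x.shape)                 # resample momentum
    x_new, p_new = x.copy(), p.copy()
    p_new = p_new + 0.5 * step_size * grad_log_p(x_new)   # half step
    for _ in range(n_leapfrog - 1):
        x_new = x_new + step_size * p_new
        p_new = p_new + step_size * grad_log_p(x_new)
    x_new = x_new + step_size * p_new
    p_new = p_new + 0.5 * step_size * grad_log_p(x_new)   # final half step
    # accept/reject based on the change in total energy
    log_accept = (log_p(x_new) - 0.5 * p_new @ p_new) - (log_p(x) - 0.5 * p @ p)
    return x_new if np.log(rng.uniform()) < log_accept else x

x = np.zeros(2)
samples = []
for _ in range(2000):
    x = hmc_step(x)
    samples.append(x)
print(np.cov(np.array(samples).T))   # should approach the target covariance
```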

slice sampling

an auxiliary variable method for sampling from one-dimensional distributions.

the main advantage is that the algorithm doesn't require specifying any parameters.

it is often combined with other algorithms such as HMC which would otherwise require specifying step size parameters.

reversible jump MCMC

a way of constructing M-H proposals between spaces of differing dimensionality

The most common case is Bayesian model averaging.

sequential Monte Carlo (SMC) algorithms

a class of techniques based on approximately sampling from a sequence of related distributions.

the particle filter

the most common example: an inference algorithm typically applied to time series models.

accounting for observations one time step at a time, where at each time step the posterior over the latent state is represented with a set of particles.
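
A minimal sketch of a bootstrap particle filter on a toy 1-D linear-Gaussian model (chosen so the code stays short; the same structure applies to nonlinear models):

```python
import numpy as np

# Bootstrap particle filter sketch: propagate particles through the
# transition model, weight them by the observation likelihood, then
# resample to avoid weight degeneracy.
rng = np.random.default_rng(0)

# simulate data from  z_t = 0.9 z_{t-1} + N(0, 0.5^2),  x_t = z_t + N(0, 1)
T = 100
z_true = np.zeros(T)
x_obs = np.zeros(T)
for t in range(1, T):
    z_true[t] = 0.9 * z_true[t - 1] + 0.5 * rng.normal()
    x_obs[t] = z_true[t] + rng.normal()

n_particles = 1000
particles = rng.normal(size=n_particles)
filtered_means = []
for t in range(T):
    # propagate through the transition model
    particles = 0.9 * particles + 0.5 * rng.normal(size=n_particles)
    # weight by the observation likelihood N(x_obs[t] | particle, 1)
    log_w = -0.5 * (x_obs[t] - particles) ** 2
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    filtered_means.append(np.sum(w * particles))
    # resample according to the weights
    particles = rng.choice(particles, size=n_particles, p=w)

print(np.mean((np.array(filtered_means) - z_true) ** 2))
```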

annealed importance sampling (AIS)

an easy initial distribution (such as the prior) gradually "anneals" to an intractable target distribution (such as the posterior) by passing through a sequence of intermediate distributions.

an MCMC transition is performed with respect to each of the intermediate distributions. Since mixing is generally faster near the initial distribution, this is supposed to help the sampler avoid getting stuck in local modes.

the algorithm computes a set of weights which can be used to estimate the marginal likelihood.

if enough intermediate distributions are used, the variance of the weights is small, and therefore they yield an accurate estimate of the marginal likelihood.

variational inference

a class of approximate inference techniques based on optimisation rather than sampling

the intractable posterior distribution is approximated with a tractable approximation. The parameters of the approximate distribution are chosen to minimise some measure of distance (usually KL divergence) between the approximation and the posterior.
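
A toy sketch of the idea: fit a Gaussian q(theta) = N(m, s^2) to a 1-D unnormalised target by minimising a Monte Carlo estimate of the KL divergence (the target and all values are illustrative; real implementations usually derive coordinate or gradient updates analytically):

```python
import numpy as np
from scipy.optimize import minimize

# Variational inference sketch: choose (m, s) to minimise KL(q || p),
# equivalently maximise the ELBO.  The expectation over q is written via
# the reparameterisation theta = m + s * eps with fixed standard-normal
# draws eps (common random numbers), so a generic optimiser can be used.
rng = np.random.default_rng(0)
eps = rng.normal(size=2000)

def log_p_tilde(theta):
    # unnormalised log target: here a Gaussian with mean 1 and std 2
    return -0.5 * ((theta - 1.0) / 2.0) ** 2

def neg_elbo(params):
    m, log_s = params
    s = np.exp(log_s)
    theta = m + s * eps
    # E_q[log q(theta)] = -log s + const, so up to constants:
    return -log_s - np.mean(log_p_tilde(theta))

m, log_s = minimize(neg_elbo, x0=np.array([0.0, 0.0])).x
print(m, np.exp(log_s))   # should be close to (1, 2)
```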

variational inference vs. sampling

variational inference algorithms involve different implementation challenges from sampling algorithms

they are harder, in that they may require lengthy mathematical derivations to determine the update rules.

once implemented, variational Bayes can be easier to test, as the standard checks for optimisation code can be employed (e.g. local optimum tests, gradient checking)

most variational inference algorithms converge to (local) optima, which eliminates the need to check convergence diagnostics.

the output of most variational inference algorithms is a distribution, rather than samples.

the variational distribution can be checked for information on the expectation or variance of a model parameter.

the accuracy of the approximation with variational methods is limited by the expressiveness of the approximating class, and it is not always obvious how the approximation differs from the true posterior.

if a sampling algorithm is run long enough, you will eventually get accurate results.

with sampling methods, a large number of samples often need to be collected, which is expensive.

examples

variational Bayes

the application of variational inference to Bayesian models where the posterior distribution over parameters cannot be represented exactly

variational Bayes EM can be used if the model includes latent variables.

mean field approximation

the approximating distribution has a simple form as all variables are assumed to be independent.

mean fields can be viewed in terms of convex duality, which leads to different generalisations from the usual representation.

expectation propagation

an approximation to loopy belief propagation

approximate messages which represent only the expectations of certain sufficient statistics of the relevant variables are sent.

canonical examples

belief propagation

a family of inference algorithms intended for graphical models such as Bayes nets and Markov random fields (MRFs).

the variables in the model "pass messages" to each other which summarise information about the joint distribution over other variables.

applied to tree-structured graphical models

when applied to these models BP performs exact posterior inference.

sum-product algorithm

computes the marginal distribution of each individual variable (and also over all pairs of neighbouring variables).

max-product algorithm

computes the most likely joint assignment to all of the variables.

applied to non-tree structured graphs

this doesn't give exact results, and in fact lacks even basic guarantees such as convergence to a fixed point, but it often works well in practice

often called loopy belief propagation to distinguish it from the tree-structured version.

loopy BP can be interpreted as a variational inference algorithm.

junction tree algorithm

defines coarse-grained "super-variables" so that exact BP can be applied to non-tree-structured graphs; the graph is then tree-structured with respect to the "super-variables".

forward-backward algorithm

the most common special case of BP on trees for HMMs.

Kalman smoothing is a special case of the forward-backward algorithm
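
A minimal sketch of the scaled forward-backward recursions on a toy two-state HMM with discrete observations (parameters illustrative):

```python
import numpy as np

# Forward-backward sketch for a small discrete HMM: the forward pass
# computes filtered state probabilities (up to scaling), the backward
# pass propagates information from future observations, and their
# product gives the posterior marginals p(z_t | x_{1:T}).
pi = np.array([0.6, 0.4])                    # initial state distribution
A = np.array([[0.7, 0.3],                    # transition matrix
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],                    # emission probabilities
              [0.3, 0.7]])                   # B[state, observation]
obs = [0, 0, 1, 1, 1, 0]

T, S = len(obs), len(pi)
alpha = np.zeros((T, S))
c = np.zeros(T)                              # per-step scaling factors

# forward pass (scaled so each alpha[t] sums to 1)
alpha[0] = pi * B[:, obs[0]]
c[0] = alpha[0].sum()
alpha[0] /= c[0]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    c[t] = alpha[t].sum()
    alpha[t] /= c[t]

# backward pass (using the same scaling factors)
beta = np.ones((T, S))
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]

gamma = alpha * beta                         # posterior marginals p(z_t | x_{1:T})
gamma /= gamma.sum(axis=1, keepdims=True)
print(gamma)
print(np.sum(np.log(c)))                     # log marginal likelihood log p(x_{1:T})
```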

commonly used in computer vision and information theory, where the inference problems tend to have a regular structure.

theoretical issues with Bayesian methods

defining a Bayesian model requires choosing priors over the parameters

if there aren't strong beliefs about the parameters, uninformative priors may be chosen.

Jeffreys prior is a common choice of uninformative prior.

how much data is required to accurately estimate the parameters of the model?

the asymptotics of maximum likelihood provide insight, since for finite models, the posterior distribution has similar asymptotic behaviour to the distribution of maximum likelihood estimates.