Bayesian Statistics

Miscellaneous

Maximum Likelihood Estimation
Find the hypothesis(H) which maximizes the likelihood

Log Likelihood
Let likelihood function in log function
ln(P(D|H))

Bayesian Updating

Prior * Likelihood = Bayes numerator
\(P(H) * P(D|H)\)

Posterior \(P(H|D)\)
Posterior is scaled version of Bayes Numerators

Predictive Probabilities
Find the probability of observing the data
Using Law of Total Probability

Prior Predictive Probability
likelihood of H1 prior of H1 + likelihood of H2 prior of H2 + ...
\(P(D) = P(D|H_1) * P(H_1) + P(D|H_2) * P(H_2) + ...\)

Posterior Predictive Probability
likelihood of H1 posterior of H1 + likelihood of H2 posterior of H2 + ...


\(P(D) = P(D|H_1) * P(H_1 | D_0) + P(D|H_2) * P(H_2 | D_0) + ...\)

Odd
\(O(E) = \frac{P(E)}{P(E^c)}\)

Bayes Factor
\(\frac{P(D|H)}{P(D|H^c)}\)

\(BF > 1\)
Data provides evidence for the hypothesis
\(BF < 1\)
Data provides evidence against the hypothesis
\(BF = 1\)
Data provides no evidence

Posterior odd = Bayes Factor * Prior odd
\(O(H|D) = \frac{P(D|H)}{P(D|H^c)} * O(H)\)

Bayesian Inference

Monte Carlo Estimation


let \(\theta^\ast\) be samples, then the sample mean \(\bar\theta^\ast\) approaches the real mean \(\theta\), and the distribution become closer to \(N(\theta, \sigma^2)\)


due to (LORN, CLT)

Probability and Variance

\(P(a < \theta < b) = \int^b_a 1*f(\theta^\ast)d \theta\)

Metropolis-Hastings
\(p(\theta) \propto g(\theta)\)


  1. Select initial value \(\theta_0\)
  2. For i=1, ..., m, repeat:
    a) Draw \(\theta^\ast \sim q(\theta^\ast | \theta_{i-1})\)
    b) \(\alpha = \frac{g(\theta^\ast)/q(\theta^\ast | \theta_{i-1})}{g(\theta_{i-1})/q( \theta_{i-1}|\theta^\ast)} = \frac{g(\theta^\ast)*q( \theta_{i-1}|\theta^\ast)}{g(\theta_{i-1})*q(\theta^\ast | \theta_{i-1})}\)
    c) --
    \(\alpha \geq 1\) Accept sample, and update initial value
    \(0 < \alpha < 1\) Accept sample, and update initial value
    \(\alpha \leq 0\) Reject sample and reuse initial value

Gibbs Sampling
\(p(\theta, \phi | y) \propto g(\theta, \phi)\)
interested in getting \(\theta\), if \(\phi\) is known, we can draw \(\theta\)


Assume we know \(\phi\)
\(p(\theta, \phi | y) = p(\phi | y) * p(\theta | \phi, y)\)


Therefore, if \(\phi\) is a known constant,
\(p(\theta | \phi, y) \propto p(\theta, \phi | y) \propto g(\theta, \phi)\)
Or, if \(\theta\) is a known constant,
\(p(\phi | \theta, y) \propto p(\theta, \phi | y) \propto g(\theta, \phi)\)

Convergence Checking

Autocorrelation


As move along the MCMC, the correlation should decrease, thus, indicating that the model is converging

Monte Carlo Effective Sample size


Number of independent samples required from the stationary distribution you would have.


or number of steps required to have samples that are no longer correlated to each other

Gelman and Rubin Diagnostics

Using multiple chains to evaluate the convergence


Potential scale reduction factor should be close to 1.0 if each chain is converging to the same distribution. Otherwise, the chain is not yet converged.

Burn-in period

Number of steps to skip before sampling

Regression Model

Fitting

Linear Regression (real input, real output)
\(y_i = \beta_0 + \beta_1 x_{1i} + ... + \beta_k x_{ki} + \epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\)
\(y_i|x_i, \beta_i, \sigma^2 \sim N(\beta_0 + \beta_1 x_{1i} + ... + \beta_k x_{ki},\sigma^2)\)
\(\beta_0 \sim p(\beta_0) ...\)
\(\sigma^2 \sim p(\sigma^2)\)

ANOVA (categorical input, real output)
\(y_i|g_i,\mu_{gi},\sigma^2 \sim N(\mu_{gi}, \sigma^2)\) or different variance for each mean


\(E(y_i) = \mu = \beta_0 + \beta_1 x_1 + ... + \beta_{G-1} x_{G-1}\)
where \(x_i = I(g=1), ..., x_{G-1} = I(g = G-1)\) so x are indicators

Logistic Regression (real input, binary output)
\(y_i | \theta_i \sim Bern(\theta_i)\)


Link Function, logit
\(logit(\theta) = log(\frac{\theta}{1-\theta}) = \beta_0 + \beta_1 x_1 + ...\)
\(E(y_i) = \theta_i = ... = \frac{e^{ \beta_0 + \beta_1 x_1 + ...}}{1 + e^{ \beta_0 + \beta_1 x_1 + ...}} = \frac{1}{1 + e^{-( \beta_0 + \beta_1 x_1 + ...)}}\)

  • This is Sigmoid Function!

Prediction
\(y_i = I(\theta_i > \alpha) = I(\frac{1}{1 + e^{-( \beta_0 + \beta_1 x_1 + ...)}} > \alpha)\) \(\alpha\) is threshold

Poisson Regression (real input, real count output)
\(y_i|\lambda_i \sim Pois(\lambda_i)\) and E(y_i) = Var(y_i)


Link Function, log
\(log(\lambda_i) = \beta_0 + \beta_1 x_1 + ...\)
\(E(y_i) = \lambda_i = e^{\beta_0 + \beta_1 x_1 + ...}\)

Deviance Information Criterion
(DIC)


Penalty: # of estimated parameters


Lower the DIC, better the model

Standard Error
\(\sqrt \frac{VAR(\theta^\ast)}{m} \)

Likelihood
\(P(D|H)\)

Bayesian Decision Theory
minimizing the risk

Concepts

Decision Rules
If action \(\alpha_i\) is taken where the true state of nature is \(\omega_i\), the decision is

  • correct if \(i = j\)
  • incorrect if \(i \neq j\)

Prior only
\(P(error) = min\{ P(w_1), P(w_2) \}\)

State of nature (category) \(\omega_j\)
or hypothesis
Prior: \(P(\omega_i)\)

Evidence \(f(x)\) (pdf)
or collected data
How frequently we will measure a pattern

Features
The input x
\(x \in \mathbb{R}^d\)

Actions
Finite number (\(l\)) of actions to perform in given state
\(\alpha_1, \alpha_2, ..., \alpha_l\)

Loss function
Cost associated with taking action given that in a correct state


\(\lambda(a_i | w_j)\) # #

Risk
Cost of mis-classification

Optimal
Let \(\alpha(x)\) be a function to generate action based on current observation

Minimum risk rule

  1. Choose action with lower risk
    \(min_{a_i}\{R(a_1|x), R(a_2|x), ..., R(a_l|x)\}\)

\((\lambda_{21} - \lambda_{11}) P(\omega_1 | x) > (\lambda_{12} - \lambda_{22}) P(\omega_2 | x)\)

\(\frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{(\lambda_{12} - \lambda_{22}) P(\omega_2)}{(\lambda_{21} - \lambda_{11}) P(\omega_1)}\)

Discriminant (classifier) function
\(g_i(\vec x), i = 1,2,...,c\)


\(i\) with maximum \(g_i\) assign the \(\vec x\) to \(\omega_i\) (est.), hence the reasonable action to take is \(\alpha_i\)


Monotonically increasing function
If \(f(x)\) is monotonically increasing, then \(f(g_i(\vec x)) \) still produces the same result


Good candidate for \(f(x) = ln(x)\)

Bayes Rule
We may use bayes rule to optimize discriminant


\(g_i(x) = \frac{p(x|\omega_i)p(\omega_i)}{p(x)}\)
\(\propto p(x|\omega_i) P(\omega_i)\)
\(\propto ln(p(x|\omega_i)) + ln(P(\omega_i)) \)

Multivariate Gaussian Density Function
\(N(\vec \mu, \Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}} |\Sigma|^{\frac{1}{2}}} e^{-\frac{1}{2}(\vec x - \vec \mu)' \Sigma^{-1} (\vec x - \vec \mu)}\)

# Multivariate Normal
Assuming \(p(x|\omega_w) \sim N(\mu_i, \Sigma_i)\)
\(g_i(x) \propto -\frac{d}{2}ln(2\pi) -\frac{1}{2}ln(|\Sigma_i|) -\frac{1}{2}(x-\mu_i)'\Sigma^{-1}(x-\mu_i) + lnP(\omega_i)\)

Bayesian Classification

Conditional Risk (or Expected Loss)
Given an observation 'x' and take action \(\alpha_i\), the conditional risk is follow:
\(R(\alpha_i|x) = \sum_{j=1}^c \lambda(a_i|\omega_j)P(\omega_j | x)\)

  1. Define the classes (categories)
  • Split the data into known categories
  • Also split data into testing and training sets
  1. Have preliminary statistics from the data
  • From all data together
  • Per category
  • Not required for testing sets
  1. Choose a discriminant
  • Implement one function per class
  • Parameters \(\mu, \sigma\) are chosen based on the data
  • For each class, apply the function to each observation (testing set) -> Compute the posteriors
  1. Determine the efficiency of the classifier
  • Have some metrics

Cost
If cost are not equal for classifications, cost can be include in the risk

Loss Function
The function that measures how close the action is to the correct decision.


\(\lambda_{ij} = \lambda(\alpha_i | \omega_j)\)
Taking action \(\alpha_i\) when the true state of nature is \(\omega_j\)

Symmetrical (zero-one) Loss
\(\lambda(\alpha_i | \omega_j) = \) {0 if \(i = j\), 1 if \(j \neq j\)}
penalizing for making a wrong decision

  • All errors are equally costly

Overall Risk
\(R^* = \int R(\alpha(x) | x) p(x) dx\)

Risk
\(R(\alpha_i|x) = \sum_{j=1}^c \lambda(a_i|\omega_j)P(\omega_j | x)\)
\(= \sum_{j \neq i} P(\omega_j | x)\)
\(= 1 - P(\omega_i | x)\)


Note \(\lambda(a_i|\omega_j) = 0\) when \(i = j\)

Adjusting Penalty
\(\lambda_{ij}\) may be adjusted to output higher loss for making a specific 'mistake'


\(\lambda_{12} > \lambda_{21}\) gives higher penalty for taking \(\alpha_1\) when the state of nature is \(\omega_2\)

Minimax Criterion
Find prior \(P(\omega_1)\) that the Bayes risk is maximum

Neyman-Pearson Criterion
Minimize the overall risk subject to a contastraint


\(\int R(a_i|x) dx < \gamma\)


  • There is a fixed resources accompanying action \(\alpha_i\)
  • Must not misclassify more than the specific frequency

Types

Dichotomizer
Two-category only case


\(g(x) = g_1(x) - g_2(x)\)


Choose \(\omega_1\) if \(g(x) > 0\)
Choose \(\omega_2\) otherwise

General Case
Multi-category case


\(g_i(x) = -R(\alpha_i | x)\)


Simplifies in minimum-error-rate case:
\(g_i(x) = P(\omega_i | x)\)
Higher the \(g_i(x)\), lower the risk #

Covarriance Matrix \(\Sigma\)

  • Symmetric Positive Definite
  • \(Det(\Sigma) > 0\)
  • if \(x_i\) and \(x_j\) are statistically independent, then \(\sigma_{ij} = \sigma{ji} = 0\)
  • if \(\forall_{i \neq j}, \sigma_{ij} = 0\), then multivariate gaussian density becomes multiple of univariate densities.

Squared Mahalanobis Distance (r^2)
from \(x\) to \(\mu\)


\(r^2 = (\vec x - \vec \mu)' \Sigma^{-1} (\vec x - \vec \mu)\)

Linear Combination
If \(y = A^T x\)
Then \(p(y) \sim N(A^T \vec\mu, A^T \Sigma A)\)

Gaussian Classifier

Case 1 \(\Sigma_i = \sigma^2 I\)
Minimum distance classifier

  • Statistically independent
  • Different mean, same variance
  • Equaltion simplfies to \(g_i(x)\)

Case 2 \(\Sigma_i = \Sigma\)

  • Each entry in all Sigma is all same
  • \(Det(\Sigma) = 0\)

Case 3 \(\Sigma_i = ?\)

  • Decision boundaries are non-linear

Likelihood Ratio
\(\frac{p(x|\omega_1)}{p(x|\omega_2)}\)

Threshold
\(\frac{P(\omega_2)}{P(\omega_1)}\)

Taking Average each entry

Taking Maximum each entry

Taking most/least observation

Taking sigma that has most variability