Bayesian Statistics
Miscellaneous
Maximum Likelihood Estimation
Find the hypothesis(H) which maximizes the likelihood
Log Likelihood
Let likelihood function in log function
ln(P(D|H))
Bayesian Updating
Prior * Likelihood = Bayes numerator
\(P(H) * P(D|H)\)
Posterior \(P(H|D)\)
Posterior is scaled version of Bayes Numerators
Predictive Probabilities
Find the probability of observing the data
Using Law of Total Probability
Prior Predictive Probability
likelihood of H1 prior of H1 + likelihood of H2 prior of H2 + ...
\(P(D) = P(D|H_1) * P(H_1) + P(D|H_2) * P(H_2) + ...\)
Posterior Predictive Probability
likelihood of H1 posterior of H1 + likelihood of H2 posterior of H2 + ...
\(P(D) = P(D|H_1) * P(H_1 | D_0) + P(D|H_2) * P(H_2 | D_0) + ...\)
Odd
\(O(E) = \frac{P(E)}{P(E^c)}\)
Bayes Factor
\(\frac{P(D|H)}{P(D|H^c)}\)
\(BF > 1\)
Data provides evidence for the hypothesis
\(BF < 1\)
Data provides evidence against the hypothesis
\(BF = 1\)
Data provides no evidence
Posterior odd = Bayes Factor * Prior odd
\(O(H|D) = \frac{P(D|H)}{P(D|H^c)} * O(H)\)
Bayesian Inference
Monte Carlo Estimation
let \(\theta^\ast\) be samples, then the sample mean \(\bar\theta^\ast\) approaches the real mean \(\theta\), and the distribution become closer to \(N(\theta, \sigma^2)\)
due to (LORN, CLT)
Probability and Variance
\(P(a < \theta < b) = \int^b_a 1*f(\theta^\ast)d \theta\)
Metropolis-Hastings
\(p(\theta) \propto g(\theta)\)
- Select initial value \(\theta_0\)
- For i=1, ..., m, repeat:
a) Draw \(\theta^\ast \sim q(\theta^\ast | \theta_{i-1})\)
b) \(\alpha = \frac{g(\theta^\ast)/q(\theta^\ast | \theta_{i-1})}{g(\theta_{i-1})/q( \theta_{i-1}|\theta^\ast)} = \frac{g(\theta^\ast)*q( \theta_{i-1}|\theta^\ast)}{g(\theta_{i-1})*q(\theta^\ast | \theta_{i-1})}\)
c) --
\(\alpha \geq 1\) Accept sample, and update initial value
\(0 < \alpha < 1\) Accept sample, and update initial value
\(\alpha \leq 0\) Reject sample and reuse initial value
Gibbs Sampling
\(p(\theta, \phi | y) \propto g(\theta, \phi)\)
interested in getting \(\theta\), if \(\phi\) is known, we can draw \(\theta\)
Assume we know \(\phi\)
\(p(\theta, \phi | y) = p(\phi | y) * p(\theta | \phi, y)\)
Therefore, if \(\phi\) is a known constant,
\(p(\theta | \phi, y) \propto p(\theta, \phi | y) \propto g(\theta, \phi)\)
Or, if \(\theta\) is a known constant,
\(p(\phi | \theta, y) \propto p(\theta, \phi | y) \propto g(\theta, \phi)\)
Convergence Checking
Autocorrelation
As move along the MCMC, the correlation should decrease, thus, indicating that the model is converging
Monte Carlo Effective Sample size
Number of independent samples required from the stationary distribution you would have.
or number of steps required to have samples that are no longer correlated to each other
Gelman and Rubin Diagnostics
Using multiple chains to evaluate the convergence
Potential scale reduction factor should be close to 1.0 if each chain is converging to the same distribution. Otherwise, the chain is not yet converged.
Burn-in period
Number of steps to skip before sampling
Regression Model
Fitting
Linear Regression (real input, real output)
\(y_i = \beta_0 + \beta_1 x_{1i} + ... + \beta_k x_{ki} + \epsilon_i\) where \(\epsilon_i \sim N(0, \sigma^2)\)
\(y_i|x_i, \beta_i, \sigma^2 \sim N(\beta_0 + \beta_1 x_{1i} + ... + \beta_k x_{ki},\sigma^2)\)
\(\beta_0 \sim p(\beta_0) ...\)
\(\sigma^2 \sim p(\sigma^2)\)
ANOVA (categorical input, real output)
\(y_i|g_i,\mu_{gi},\sigma^2 \sim N(\mu_{gi}, \sigma^2)\) or different variance for each mean
\(E(y_i) = \mu = \beta_0 + \beta_1 x_1 + ... + \beta_{G-1} x_{G-1}\)
where \(x_i = I(g=1), ..., x_{G-1} = I(g = G-1)\) so x are indicators
Logistic Regression (real input, binary output)
\(y_i | \theta_i \sim Bern(\theta_i)\)
Link Function, logit
\(logit(\theta) = log(\frac{\theta}{1-\theta}) = \beta_0 + \beta_1 x_1 + ...\)
\(E(y_i) = \theta_i = ... = \frac{e^{ \beta_0 + \beta_1 x_1 + ...}}{1 + e^{ \beta_0 + \beta_1 x_1 + ...}} = \frac{1}{1 + e^{-( \beta_0 + \beta_1 x_1 + ...)}}\)
- This is Sigmoid Function!
Prediction
\(y_i = I(\theta_i > \alpha) = I(\frac{1}{1 + e^{-( \beta_0 + \beta_1 x_1 + ...)}} > \alpha)\) \(\alpha\) is threshold
Poisson Regression (real input, real count output)
\(y_i|\lambda_i \sim Pois(\lambda_i)\) and E(y_i) = Var(y_i)
Link Function, log
\(log(\lambda_i) = \beta_0 + \beta_1 x_1 + ...\)
\(E(y_i) = \lambda_i = e^{\beta_0 + \beta_1 x_1 + ...}\)
Deviance Information Criterion
(DIC)
Penalty: # of estimated parameters
Lower the DIC, better the model
Standard Error
\(\sqrt \frac{VAR(\theta^\ast)}{m} \)
Likelihood
\(P(D|H)\)
Bayesian Decision Theory
minimizing the risk
Concepts
Decision Rules
If action \(\alpha_i\) is taken where the true state of nature is \(\omega_i\), the decision is
- correct if \(i = j\)
- incorrect if \(i \neq j\)
Prior only
\(P(error) = min\{ P(w_1), P(w_2) \}\)
State of nature (category) \(\omega_j\)
or hypothesis
Prior: \(P(\omega_i)\)
Evidence \(f(x)\) (pdf)
or collected data
How frequently we will measure a pattern
Features
The input x
\(x \in \mathbb{R}^d\)
Actions
Finite number (\(l\)) of actions to perform in given state
\(\alpha_1, \alpha_2, ..., \alpha_l\)
Risk
Cost of mis-classification
Optimal
Let \(\alpha(x)\) be a function to generate action based on current observation
Minimum risk rule
- Choose action with lower risk
\(min_{a_i}\{R(a_1|x), R(a_2|x), ..., R(a_l|x)\}\)
\((\lambda_{21} - \lambda_{11}) P(\omega_1 | x) > (\lambda_{12} - \lambda_{22}) P(\omega_2 | x)\)
\(\frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{(\lambda_{12} - \lambda_{22}) P(\omega_2)}{(\lambda_{21} - \lambda_{11}) P(\omega_1)}\)
Discriminant (classifier) function
\(g_i(\vec x), i = 1,2,...,c\)
\(i\) with maximum \(g_i\) assign the \(\vec x\) to \(\omega_i\) (est.), hence the reasonable action to take is \(\alpha_i\)
Monotonically increasing function
If \(f(x)\) is monotonically increasing, then \(f(g_i(\vec x)) \) still produces the same result
Good candidate for \(f(x) = ln(x)\)
Bayes Rule
We may use bayes rule to optimize discriminant
\(g_i(x) = \frac{p(x|\omega_i)p(\omega_i)}{p(x)}\)
\(\propto p(x|\omega_i) P(\omega_i)\)
\(\propto ln(p(x|\omega_i)) + ln(P(\omega_i)) \)
Multivariate Gaussian Density Function
\(N(\vec \mu, \Sigma) = \frac{1}{(2\pi)^{\frac{d}{2}} |\Sigma|^{\frac{1}{2}}} e^{-\frac{1}{2}(\vec x - \vec \mu)' \Sigma^{-1} (\vec x - \vec \mu)}\)
# Multivariate Normal
Assuming \(p(x|\omega_w) \sim N(\mu_i, \Sigma_i)\)
\(g_i(x) \propto -\frac{d}{2}ln(2\pi) -\frac{1}{2}ln(|\Sigma_i|) -\frac{1}{2}(x-\mu_i)'\Sigma^{-1}(x-\mu_i) + lnP(\omega_i)\)
Bayesian Classification
Conditional Risk (or Expected Loss)
Given an observation 'x' and take action \(\alpha_i\), the conditional risk is follow:
\(R(\alpha_i|x) = \sum_{j=1}^c \lambda(a_i|\omega_j)P(\omega_j | x)\)
- Define the classes (categories)
- Split the data into known categories
- Also split data into testing and training sets
- Have preliminary statistics from the data
- From all data together
- Per category
- Not required for testing sets
- Choose a discriminant
- Implement one function per class
- Parameters \(\mu, \sigma\) are chosen based on the data
- For each class, apply the function to each observation (testing set) -> Compute the posteriors
- Determine the efficiency of the classifier
- Have some metrics
Cost
If cost are not equal for classifications, cost can be include in the risk
Loss Function
The function that measures how close the action is to the correct decision.
\(\lambda_{ij} = \lambda(\alpha_i | \omega_j)\)
Taking action \(\alpha_i\) when the true state of nature is \(\omega_j\)
Symmetrical (zero-one) Loss
\(\lambda(\alpha_i | \omega_j) = \) {0 if \(i = j\), 1 if \(j \neq j\)}
penalizing for making a wrong decision
- All errors are equally costly
Overall Risk
\(R^* = \int R(\alpha(x) | x) p(x) dx\)
Risk
\(R(\alpha_i|x) = \sum_{j=1}^c \lambda(a_i|\omega_j)P(\omega_j | x)\)
\(= \sum_{j \neq i} P(\omega_j | x)\)
\(= 1 - P(\omega_i | x)\)
Note \(\lambda(a_i|\omega_j) = 0\) when \(i = j\)
Adjusting Penalty
\(\lambda_{ij}\) may be adjusted to output higher loss for making a specific 'mistake'
\(\lambda_{12} > \lambda_{21}\) gives higher penalty for taking \(\alpha_1\) when the state of nature is \(\omega_2\)
Minimax Criterion
Find prior \(P(\omega_1)\) that the Bayes risk is maximum
Neyman-Pearson Criterion
Minimize the overall risk subject to a contastraint
\(\int R(a_i|x) dx < \gamma\)
- There is a fixed resources accompanying action \(\alpha_i\)
- Must not misclassify more than the specific frequency
Types
Dichotomizer
Two-category only case
\(g(x) = g_1(x) - g_2(x)\)
Choose \(\omega_1\) if \(g(x) > 0\)
Choose \(\omega_2\) otherwise
General Case
Multi-category case
\(g_i(x) = -R(\alpha_i | x)\)
Simplifies in minimum-error-rate case:
\(g_i(x) = P(\omega_i | x)\)
Higher the \(g_i(x)\), lower the risk #
Covarriance Matrix \(\Sigma\)
- Symmetric Positive Definite
- \(Det(\Sigma) > 0\)
- if \(x_i\) and \(x_j\) are statistically independent, then \(\sigma_{ij} = \sigma{ji} = 0\)
- if \(\forall_{i \neq j}, \sigma_{ij} = 0\), then multivariate gaussian density becomes multiple of univariate densities.
Squared Mahalanobis Distance (r^2)
from \(x\) to \(\mu\)
\(r^2 = (\vec x - \vec \mu)' \Sigma^{-1} (\vec x - \vec \mu)\)
Linear Combination
If \(y = A^T x\)
Then \(p(y) \sim N(A^T \vec\mu, A^T \Sigma A)\)
Gaussian Classifier
Case 1 \(\Sigma_i = \sigma^2 I\)
Minimum distance classifier
- Statistically independent
- Different mean, same variance
- Equaltion simplfies to \(g_i(x)\)
Case 2 \(\Sigma_i = \Sigma\)
- Each entry in all Sigma is all same
- \(Det(\Sigma) = 0\)
Case 3 \(\Sigma_i = ?\)
- Decision boundaries are non-linear
Likelihood Ratio
\(\frac{p(x|\omega_1)}{p(x|\omega_2)}\)
Threshold
\(\frac{P(\omega_2)}{P(\omega_1)}\)
Taking Average each entry
Taking Maximum each entry
Taking most/least observation
Taking sigma that has most variability