MAST30020
Probability for Inference
Probability & Random variables
Expectations
Convergence
Characteristic functions
Statistical applications
Expectation (General)
Conditional expectation
Basic probability
Probabilities on R
Random variables
About RVs
Random Experiment
Has a mass character, e.g. could be repeated many times, in theory
Outcomes are uncertain (to the best of our prior knowledge)
Has some statistical regularity - the relative frequencies of outcomes stabilize around some values as the # of independent repetitions grows
Events
Events are subsets of the outcome space, which may or may not be in the \(\sigma\)-algebra.
Events must be measurable in order to calculate their probability
Indicator functions
Definition
Operations with indicator functions
\(\begin{align} A \lor B \equiv&& A \cup B \equiv && \max\{I_A, I_B\} && \approx \exists \\ A \land B \equiv&& A \cap B \equiv && I_A I_B && \approx \forall \\ \lnot A \equiv&& A^c \equiv && 1- I_A && \\ A \oplus B \equiv&& (A \cup B) \setminus (A \cap B) \equiv && |I_A - I_B| \end{align}\)
\(I_A(\omega) := \begin{cases} 1, & \omega \in A \\ 0, & \omega \not\in A \end{cases} \)
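The indicator operations above can be checked numerically; a minimal Python sketch on a toy finite outcome space (the sets `A` and `B` are my own illustrative choices, not from the notes):

```python
# Minimal sketch of the indicator-function operations on a toy finite
# outcome space Omega = {0,...,9}; the sets A and B are illustrative choices.
Omega = range(10)
A = {1, 2, 3, 4}
B = {3, 4, 5, 6}

I_A = lambda w: 1 if w in A else 0
I_B = lambda w: 1 if w in B else 0

union        = {w for w in Omega if max(I_A(w), I_B(w)) == 1}   # A ∪ B
intersection = {w for w in Omega if I_A(w) * I_B(w) == 1}       # A ∩ B
complement   = {w for w in Omega if 1 - I_A(w) == 1}            # A^c
sym_diff     = {w for w in Omega if abs(I_A(w) - I_B(w)) == 1}  # A ⊕ B

assert union == A | B
assert intersection == A & B
assert complement == set(Omega) - A
assert sym_diff == (A | B) - (A & B)
```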
Complements, unions, intersections are events
Special cases
Event \(A\): \( A_1, A_2, \dots \) occurred infinitely often (i.o.)
\( \qquad \equiv A = \bigcap_{n=1}^\infty \bigcup_{k=n}^\infty A_k \)
\( \qquad \equiv A \text{ occurred iff } \forall n, \exists k\geq n, s.t. [A_k \text{ occurred}]\)
Event \( A: A_1, A_2, \dots \) occurred finitely often (f.o.):
\( \qquad \equiv A = \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty A_k^c \)
\( \qquad \equiv A \text{ occurred iff } \exists n, \forall k\geq n, [A_k \text{ did not occur} ]\)
Probability spaces
Probability function
Basic properties
⭐ Theorem 1.25:
\( \qquad \textbf{P} \text{ satisfies } P.3 \quad \equiv \quad A_n \uparrow A \Rightarrow P(A_n) \uparrow P(A)\quad \equiv \quad A_n \downarrow A \Rightarrow P(A_n) \downarrow P(A)\quad \equiv \quad A_n \downarrow \emptyset \Rightarrow P(A_n) \downarrow 0 \)
Disjointification:
\(\qquad A_1 \subset A_2 \subset A_3 ....\\ \qquad B_n = A_n \setminus \bigcup_{i=1}^{n-1} A_i \\ \Rightarrow\\ \qquad \bigcup_{i=1}^n A_i = \bigcup_{i=1}^n B_i \\ \qquad B_n \subset A_n \)
Monotonicity:
\( \qquad A \subset B \Rightarrow P(A) \leq P(B) \)
⭐ Theorem 1.24 Boole's Inequality:
\(\qquad \forall A_1, A_2, ... \in \mathcal{F}, P(\bigcup_{i=1}^\infty A_i) \leq \sum_{i=1}^\infty P(A_i) \)
Axioms to satisfy
🚩 P.3) Countable additivity (for disjoint \( A_1, A_2, \dots \))
\( \qquad \textbf{P} (\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} \textbf{P} (A_i) \)
🚩 P.1) Non-negativity:
\( \qquad \textbf{P} (A) \geq 0, A \in \mathcal{F} \)
🚩 P.2) Normalization (total mass one):
\( \qquad \textbf{P} (\Omega) = 1 \)
Defined on measurable subsets of \( \Omega\)
Sigma-algebra
Generated \(\sigma\)-algebra \( \sigma(\mathcal{G}) \): the smallest \(\sigma\)-algebra containing \(\mathcal{G}\), i.e. \(\sigma(\mathcal{G}) \subset \mathcal{F}'\) for any \(\sigma\)-algebra \(\mathcal{F}' \supset \mathcal{G}\)
Examples
Borel \(\sigma\)-algebra: \(\mathcal{F} = \mathcal{B}(\mathbb{R}) = \sigma\{ (a,b] : a,b \in \mathbb{R}, a < b \} \)
Generated by event: \( \mathcal{F} = \{ \emptyset, A, A^c, \Omega \} \)
Powerset: \( \mathcal{F} = \mathscr{P}(\Omega) \)
Trivial: \(\mathcal{F} = \{ \emptyset, \Omega \} \)
Must satisfy
🚩 A.3) Closed under countable union:
\(\qquad A_1, A_2, \dots \in \mathcal{F} \Rightarrow \bigcup_{i=1}^\infty A_i \in \mathcal{F} \)
🚩 A.2) Closed under complementation:
\( \qquad A \in \mathcal{F} \Rightarrow A^c \in \mathcal{F} \)
🚩 A.1) Contains the outcome space itself:
\( \qquad \Omega \in \mathcal{F} \)
A set of subsets of the outcome space / a subset of the powerset of the outcome space
Outcome space
The set of all possible outcomes (may be finite, countable, uncountable). Elements of the outcome space are individual outcomes.
\( (\Omega, \mathcal{F}, \textbf{P} ) \)
\(\Omega\): outcome space
\(\mathcal{F} \subset \mathscr{P}(\Omega)\) : a \(\sigma\)-algebra of \( \Omega\)
\( \textbf{P} : \mathcal{F} \rightarrow [0,1]\): a probability function from measurable sets to [0-1]
PF.3) Finite additivity - the finite version of P.3
⭐ Theorem 1.27, (1st) Borel-Cantelli Lemma
If \( \sum_{n=1}^\infty P(A_n) < \infty, \text{ then } P(A_n, i.o.) = 0 \)
Proof:
\( \qquad P(A_n, i.o.)\overset{def i.o.}{=} P(\bigcap_{n\geq 1} \bigcup_{k\geq n} A_k) \overset{1.25c}{=} \lim_{n\rightarrow \infty} P(\bigcup_{k \geq n} A_k) \overset{Boole}{ \leq} \lim_{n\rightarrow \infty} \sum_{k \geq n}P(A_k) \overset{*}{=} 0 \)
(*) since by assumption \( \sum_{k\geq 1} P(A_k) < \infty \), the tails \( \sum_{k \geq n} P(A_k) \rightarrow 0 \)
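A quick simulation can make the lemma concrete; a sketch (my own, with the illustrative summable choice \( P(A_n) = 1/n^2 \)) checking that each run sees only finitely many of the \(A_n\):

```python
import random

# Borel-Cantelli sketch: independent events A_n with P(A_n) = 1/n^2, so
# sum P(A_n) < infinity.  In each simulated run the index of the last
# occurring event should be finite (and typically tiny), consistent with
# P(A_n i.o.) = 0.
random.seed(0)

def last_occurrence(n_max=10_000):
    last = 0
    for n in range(1, n_max + 1):
        if random.random() < 1.0 / n ** 2:  # event A_n occurred
            last = n
    return last

runs = [last_occurrence() for _ in range(200)]

assert all(r >= 1 for r in runs)             # A_1 has probability 1
assert sum(1 for r in runs if r > 100) < 50  # occurrences die out early
```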
Definition
Probability space is \( ( \mathbb{R}, \mathbb{B}(\mathbb{R}), P) \)
Probability function is defined in a "simpler" way - by the CDF, which determines \(P\) on every \( (a,b] \) interval and hence on all Borel sets
Distribution function
The "distribution function" of \( P \) is an economical way to define it, where \(F(t) = P((-\infty, t]) \)
⭐ Theorem 1.33 Definition of DF,
For any DF of P, \(F(t)\) satisfies:
🚩 1) \(F\) is non-decreasing, so it has one sided limits:
\( \qquad F(t-) = \lim_{s \uparrow t} F(s) \\ \qquad F(t+) = \lim_{s \downarrow t} F(s) \)
Proof: monotonicity of \(P\): for \( s < t \), \( (-\infty, s] \subset (-\infty, t] \Rightarrow P((-\infty, s]) \leq P((-\infty, t]) \), i.e. \( F(s) \leq F(t) \)
🚩 2) \( F \) is right continuous
\( \qquad F(t) = F(t+) \)
Proof: continuity of \( P \): \( A_n = (-\infty, t_n] \downarrow A = (-\infty, t] \) for \( t_n \downarrow t \), so \( P(A_n) \downarrow P(A) \), i.e. \( F(t_n) \downarrow F(t) \)
3) \( \lim_{t\rightarrow -\infty} F(t) = 0, \lim_{t \rightarrow \infty} F(t) = 1 \) 🚩
Proof: \( A_n = (-\infty, t_n] \downarrow \emptyset \text{ as } t_n \rightarrow -\infty, F(t_n) \downarrow P(\emptyset) = 0 \\ A_n = (-\infty, t_n] \uparrow \mathbb{R} \text{ as } t_n \rightarrow \infty, F(t_n) \uparrow P(\mathbb{R}) = 1 \)
⭐ Theorem 1.36
Any function \(F \) which satisfies 1.33 defines a unique \(\textbf{P}\) on \( \mathcal{B}(\mathbb{R}) \) with \( F = F_P \)
This means we can completely define a probability on \( \mathbb{R} \) by the DF.
Types of distributions
Absolutely continuous (AC) probabilities on \( \mathbb{R} \)
Discrete probabilities on \( \mathbb{R} \)
Mixed distributions
Singular distributions
\(P(C) = 1\) for some countable set \( C \subset \mathbb{R} \)
Equivalent characterizations:
- \(P\) is discrete
- \(P = \sum_i p_i \epsilon_{t_i}, \sum_i p_i = 1 \)
- \(F_P(t) = \sum_i p_i I_{t_i \leq t} \)
for some \( \{ t_i \}_{i \geq 1} \subset \mathbb{R} \)
These ones have densities / pdf's!
\( F_P(t) \) is AC iff exists \(f(t)\) s.t. \( F_P(t) = \int_{-\infty}^t f(s) ds \)
Any integrable \(f(t) \geq 0 \) which integrates to 1 is a density, hence defines a distribution
Any convex combination of \(P_1, P_2\) defines a new probability function:
\( P = pP_1 + (1-p)P_2, p \in [0,1] \)
Has a continuous DF, but is not AC, i.e. lacks a density
⭐ Theorem 1.52 Lebesgue's decomposition
Every probability function \(P\) can be represented as a weighted combination of a discrete, an AC, and a singular distribution
Definition
This means the probability \( \textbf{P}(X \in B) \) is defined for any \( B \in \mathbb{B}(\mathbb{R}) \)
Since \( X^{-1}(B) \) preserves all set operations and disjointness, it is sufficient to show that:
\( \qquad \{X \in (-\infty, t] \} \equiv X^{-1}((-\infty, t]) \equiv \{ \omega \in \Omega : X(\omega) \in (-\infty, t] \} \in \mathcal{F} , \forall t \in \mathbb{R}\)
RVs preserve set operations 🚩 Prop. 2.2
Subsets: \( B_\alpha \subset B_\beta \Rightarrow X^{-1}(B_\alpha) \subset X^{-1}(B_\beta) \subset \Omega \)
Intersections: \( \bigcap_{\alpha\in I} X^{-1}(B_\alpha) = X^{-1} (\bigcap_{\alpha \in I} B_\alpha) \)
Unions: \( \bigcup_{\alpha\in I} X^{-1}(B_\alpha) = X^{-1} (\bigcup_{\alpha \in I} B_\alpha) \)
Disjointness: \( B_\alpha \cap B_\beta = \emptyset \Rightarrow X^{-1}(B_\alpha) \cap X^{-1}(B_\beta) = \emptyset \)
Complements: \( X^{-1}(B_\alpha^c) =[X^{-1}(B_\alpha)]^c \)
Types of RVs
Simple RVs: \(X = \sum_{i=1}^n a_i I_{A_i}\)
Random vectors:
Complex valued RVs: \( Z : \Omega \rightarrow \mathbb{C}, Z = X + iY \)
🚩 Prop 2.9, given an RV \( X \), \( \sigma(X) = \{ X^{-1}(B) : B\in \mathbb{B}(\mathbb{R}) \} \) defines a \( \sigma\)-algebra
Transformations of RVs
Functions of RVs
General fact: if \(g(x)\) is a continuous function then \(g(X)\) is a RV
Probability functions of RVs
\(P_X(B) := \textbf{P}(X \in B), B \in \mathbb{B}(\mathbb{R}) \) is called the distribution of \(X\)
This defines a probability function
So the DF of \(X\) is \( F_X(t) = P_X((-\infty, t]) = \textbf{P}(X \in (-\infty, t]) \)
"Survival tail" of \(X\) is \(S_X(t) = 1 - F_X(t)\)
Random vectors
\( X = (X_1, X_2, ... X_d) \in \mathbb{R}^d\)
Equivalently, \(X^{-1}(B) \in \mathcal{F}, B \in \mathbb{B}(\mathbb{R}^d) \)
Distributions of RVecs are:
\( \qquad F_X(t_1,...t_d) = P(X_1 \leq t_1, ... X_d \leq t_d) \)
In terms of joint density:
\( \qquad F_X(t_1, ... t_d) = \int_{-\infty}^{t_1} ... \int_{-\infty}^{t_d} f_X(s_1,...,s_d) ds_d...ds_1 \)
⭐ Prop 2.28 RVecs comprised of discrete RVs are discrete
⭐ Prop 2.29 Marginal density of \(X_j\) in RVec can be found by integrating over all the other variables
Constant multiples, linear combinations, and products of RVs are RVs
⭐ Prop 2.40 if \( g \) is increasing and continuous function then:
\( g(X) \sim F_X(g^{-1}(t)) \)
⭐ Theorem 2.41 if \(X\) is a RV, \(g\) is continuously differentiable on an open set, then the density is:
\( g(X) \sim f_X(g^{-1}(t))|\tfrac{d}{dt}g^{-1}(t)| \)
General transformation is:
\( Y = g(X), F_Y(t) = P(Y \leq t) = P(g(X) \leq t) \) etc
⭐ Theorem 2.43 For RVecs, replace |...| with the Jacobian of the transform (determinant)
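A Monte Carlo sanity check of the DF-transformation rule in Prop 2.40 (my example: \(X \sim U(0,1)\), \(g(x) = e^x\), so \( F_{g(X)}(t) = F_X(\ln t) = \ln t \) for \( t \in (1, e) \)):

```python
import random, math

# Sketch: for increasing continuous g, F_{g(X)}(t) = F_X(g^{-1}(t)).
# Example choice: X ~ U(0,1), g(x) = e^x, so F_Y(t) = ln t on (1, e).
random.seed(1)
n = 100_000
ys = [math.exp(random.random()) for _ in range(n)]

t = 2.0
empirical = sum(1 for y in ys if y <= t) / n  # empirical F_Y(t)
theoretical = math.log(t)                     # F_X(g^{-1}(t)) = ln t
assert abs(empirical - theoretical) < 0.01
```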
🚩 \( X(\omega) : \Omega \rightarrow \mathbb{R} \), s.t. the inverse image of any \( B \in \mathbb{B}(\mathbb{R}) \) must be in \(\mathcal{F}\):
\( \qquad \{X \in B \} \overset{def}{\equiv} X^{-1}(B) \equiv \{ \omega \in \Omega : X(\omega) \in B \} \in \mathcal{F}, \forall B \in \mathbb{B}(\mathbb{R}) \)
Also called "measurable" function.
Independence
Definition
RVs \( X_1,...,X_n \) are independent if the joint probabilities can always be factored into individual probabilities:
\( \qquad \forall B_1, ... B_n \in \mathbb{B}(\mathbb{R}), P(X_1 \in B_1, ..., X_n \in B_n) = P(X_1 \in B_1) ...P(X_n \in B_n) \)
Theorems
Continuous RVs
Discrete RVs
⭐ Theorem 3.3, RVs are independent if DFs factorize:
\( \qquad \forall t_1, ..., t_n \in \mathbb{R}, F(t_1, ... , t_n) = F_{X_1}(t_1) ... F_{X_n}(t_n) \)
⭐ Theorem 3.4, discrete RVs are independent if joint probability factorizes:
\( \qquad \forall t_1, ..., t_n \in \mathbb{R}, \textbf{P}(X_1 = t_1, ... ,X_n = t_n) = \textbf{P}(X_1 = t_1) ... \textbf{P}(X_n = t_n) \)
⭐ Theorem 3.5, AC RVs are independent iff the joint density factorizes:
\( \qquad \forall t_1, ..., t_n \in \mathbb{R}, f(t_1, ... , t_n) = f_{X_1}(t_1) ... f_{X_n}(t_n) \)
General
⭐ Functions of independent RVs are independent
Of events
⭐ Definition 3.19: Events \( A_1, A_2, ... \) are independent iff for every finite subcollection \( J \):
\( \qquad \textbf{P}(\bigcap_{i \in J} A_i) = \prod_{i \in J} \textbf{P}(A_i) \)
Equivalently, if their indicator RVs are independent RVs as per definitions above. Proof:
\( \qquad \textbf{P}(\bigcap_i A_i ) = \textbf{P} (I_{A_1} = 1, I_{A_2} = 1, ...) \overset{ind.}{=} \textbf{P}(I_{A_1} = 1)\textbf{P}(I_{A_2} = 1).... = \prod_i \textbf{P} (A_i) \)
⭐ Corollary 3.21: Events \( A_1, ... A_n \) are independent iff \( A_1^c, ... A_n^c \) are independent
Definition
Discrete case
Definition in terms of summation: (discrete)
\( E(X) = \sum_i t_i P(X = t_i) \)
Continuous case
\( E(X) = \int x f_X(x) dx \)
Interpretation in terms of relative frequency (r.f.):
\( \qquad \overline{X_n} := \frac1n \sum_j X_j = \frac1n \sum_j \underbrace{ \sum_i t_i I(X_j = t_i) }_{=X_j} = \frac1n \sum_i t_i \underbrace{\sum_j I(X_j = t_i) }_{=n_i} = \sum_i t_i \underbrace{\frac{n_i}{n}}_{=r.f.} \approx \sum_i t_i P(X = t_i) \)
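The relative-frequency reading of \(E(X)\) can be simulated; a sketch with a fair die (my example, where \( E(X) = 3.5 \)):

```python
import random

# Sketch of E(X) as a limit of relative frequencies: for a fair die,
# E(X) = sum_i t_i P(X = t_i) = 3.5, and the sample mean built from the
# relative frequencies n_i / n approaches it.
random.seed(2)
n = 200_000
counts = {t: 0 for t in range(1, 7)}
for _ in range(n):
    counts[random.randint(1, 6)] += 1

x_bar = sum(t * counts[t] / n for t in counts)  # sum_i t_i * (n_i / n)
assert abs(x_bar - 3.5) < 0.02
```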
Properties of expectation
Constants: Expectations of constants are the constants themselves, e.g. \(X=c \Rightarrow E(X) = c \)
Linearity: \(E(aX+b) = aE(X) + b\)
Monotonicity: \( X \leq Y \Rightarrow E(X) \leq E(Y) \)
Examples
Simple RV:
\( \qquad E(X) = E(\sum_i a_i I_{A_i}) = \sum_i a_i E(I_{A_i}) = \sum_i a_i P(A_i) \)
Indicators:
\( \qquad E(I_A) = P(I_A = 0) \cdot 0 + P(I_A = 1)\cdot 1 = P(I_A=1) = P(A) \)
Interpretation in terms of indicators:
\( \qquad E(X) := E(\sum_i t_i I(X = t_i)) = \sum_i t_i E(I(X = t_i)) = \sum_i t_i P(X = t_i) \)
Integrability
⚠ Def 4.12 An RV is "integrable" iff \( E|X| < \infty \)
⚠ Notation: \(X\) is integrable \(\equiv X \in L^1 \)
Integrable RVs can have expectation defined as:
\( \qquad E(X) := EX^+ - EX^- \)
where \( X = X^+ - X^-\) (so \( |X| = X^+ + X^- \)). This definition still works if at most one of the terms is infinite.
Expectation over an event
Defined as \( E(X;A) = E(X I_A) \)
\(E(X;A) \leq E(X) \) for \( X \geq 0 \)
\( \sum_i E(X;A_i) = E(X) \) for any partition \( \{A_i\} \) of \(\Omega\)
.... reminiscent of the LTP \( \sum_i P(A \cap B_i) = P(A) \)
⭐ Cor 4.14: if \( X \in L^1 \Rightarrow |EX| \leq E|X| \)
Proof: \( |EX| = |E(X^+) - E(X^-)| \overset{tri.}{\leq} |EX^+| + |EX^-| \overset{pos.}{=} E(X^+) + E(X^-) \overset{lin.}{=} E(X^+ + X^-) = E|X| \)
Random Vectors & Complex numbers
Defined element-wise
Definition in terms of integrals:
\( E(X) = \int_\Omega X(\omega) P(d\omega) = \int_\Omega X(\omega) dP(\omega) = \int_\Omega X(\omega) dP \)
While Riemann integration partitions the domain, Lebesgue integration partitions the range:
\( \int g(x) dF(x) \)
Theorems
⭐ Thm 4.23 For \( X \geq 0\):
\( \qquad E(X) = \int_0^\infty (1 - F_X(x))dx \)
\( \qquad E(X) = \sum_{n\geq1} nP(X=n) = \sum_{n\geq 1}P(X \geq n) \) (for integer-valued \( X \geq 0 \))
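A numerical check of the tail-sum formula (my example: \( X \sim Geometric(p) \) on \( \{1, 2, ...\} \), where \( P(X \geq n) = (1-p)^{n-1} \) and \( E(X) = 1/p \)):

```python
# Check E(X) = sum_{n>=1} P(X >= n) for an integer-valued X >= 0.
# Example choice: X ~ Geometric(p) on {1, 2, ...} with p = 0.3.
p = 0.3
# P(X >= n) = (1 - p)^(n - 1); the tail beyond n = 200 is negligible here.
tail_sum = sum((1 - p) ** (n - 1) for n in range(1, 200))
assert abs(tail_sum - 1 / p) < 1e-10  # E(X) = 1/p for this distribution
```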
Related stuff
Functions of RVs
⭐ Cor 4.30: if \( X_1, X_2 \) independent then \( \qquad E(g(X_1)h(X_2)) = E(g(X_1)) E(h(X_2)) \)
Moments
🚩 Def. k'th moment: \(E(X^k)\)
🚩 Def. k'th central moment: \( E((X - E(X))^k) \)
🚩 Def. k'th absolute central moment: \( E|X - E(X)|^k \)
if \( E(|X|^p) < \infty \) then \(X \in L^p, p > 0 \)
⭐ 4.39 Jensen's inequality:
Let \( X\in L^1 \), \( g \) be convex; then \( g(EX) \leq Eg(X) \)
(special case: \( |EX| \leq E|X| \), with \( g(x) = |x| \))
If concave, then \( g(EX) \geq Eg(X)\)
⭐ Cor 4.37 Lyapunov's inequality. For \( 0 < r \leq s\):
\( \qquad (E|X|^r)^{1/r} \leq (E|X|^s)^{1/s} \)
NOTE: implies that if k'th moment is finite, all smaller ones are too!
⭐ Thm 4.40 Chebyshev's / Markov's inequality
if \(g\) is a non-negative, non-decreasing function, then for any RV \(X\) and \( a \in \mathbb{R} \) with \( g(a) > 0 \):
\( \qquad P(X \geq a) \leq \frac{Eg(X)}{g(a)} \)
Proof:
\( P(X \geq a) = E(I(X\geq a)) \); since \( g \) is non-decreasing, \( I(X \geq a) \leq g(X)/g(a) \\
\Rightarrow E(I(X \geq a)) \leq E(g(X))/g(a) \Rightarrow P(X \geq a) \leq E(g(X)) / g(a) \)
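A worked numeric instance of the bound (my choices: \( X \sim U(0,1) \), \( g(x) = x^2 \), \( a = 0.9 \)):

```python
# Markov/Chebyshev-type bound P(X >= a) <= E g(X) / g(a), with
# g(x) = x^2 (non-negative and non-decreasing on the range of X here).
# Example choice: X ~ U(0,1), a = 0.9.
a = 0.9
exact = 1 - a            # P(X >= 0.9) = 0.1 exactly for U(0,1)
bound = (1 / 3) / a**2   # E(X^2) = 1/3 for U(0,1); bound ~ 0.41
assert exact <= bound    # the inequality holds (and is far from tight here)
```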
Covariance
🚩 Def: For \( X,Y \in L^2\), \( Cov(X,Y) := E((X-EX)(Y-EY)) = E(XY) - E(X)E(Y)\)
🚩 Def: \( Corr = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}\). NOTE: \(|Corr(X,Y)| \leq 1 \)
\(Eg(X) = \sum_i g(t_i) P(X = t_i) \)
\(Eg(X) = \int g(x) f_X(x) dx \)
⭐ Cor 4.39 Cauchy-Bunyakovsky inequality
\( \qquad E|XY| \leq \sqrt{EX^2 EY^2} \)
Notes
Correlation is a measure of linear association
Covariance matrices
Uncorrelated RVs are like orthogonal vectors; the cosine of the angle between \(u, v\) is analogous to the correlation between \(X, Y\) - where 1 means zero angle and perfect correlation etc
\( |Corr(X,Y)| = 1 \Leftrightarrow P(Y = aX+b) = 1 \)
\( M= Cov_X^2 = E((X-EX)^T (X-EX)) \)
\( M_{ij} = Cov(X_i, X_j) \)
⚠ CovM.1 \( Cov_X^2 \) is symmetric
⚠ CovM.2 \( Cov_X^2 \) is positive/non-negative definite: \( x Cov_X^2 x^T \geq 0 \)
Multivariate normal (MVN)
Let \( X \sim N(0,1)^d \), with density \( \frac{1}{(2\pi)^{d/2}} \exp \{ -\tfrac12 xx^T \} \)
Let \( Y = \mu + XA \), then:
\( C_Y^2 = E((Y - EY)^T(Y - EY)) = E((XA)^T(XA)) = E(A^TX^T X A) = A^T I A = A^T A \)
Then \( f_Y(y) = \frac{1}{(2\pi)^{m/2} \sqrt{\det(C_Y^2)}} \exp \{ -\tfrac12 (y-\mu) (C_Y^2)^{-1} (y - \mu)^T \} \)
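A simulation sketch of the construction \( Y = \mu + XA \) (row-vector convention as above; the 2×2 matrix \(A\) is my illustrative choice), checking the sample covariance against \( A^T A \):

```python
import random

# MVN construction sketch: X has i.i.d. N(0,1) coordinates, Y = X A
# (row vector times matrix, mu = 0), so Cov(Y) should approach A^T A.
random.seed(4)
A = [[1.0, 0.5],
     [0.0, 2.0]]
# A^T A = [[1, 0.5], [0.5, 4.25]]

n = 100_000
ys = []
for _ in range(n):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    ys.append([x[0] * A[0][0] + x[1] * A[1][0],   # y_1 = (x A)_1
               x[0] * A[0][1] + x[1] * A[1][1]])  # y_2 = (x A)_2

cov_00 = sum(y[0] * y[0] for y in ys) / n  # sample Var(Y_1), mean is 0
cov_01 = sum(y[0] * y[1] for y in ys) / n  # sample Cov(Y_1, Y_2)
assert abs(cov_00 - 1.0) < 0.05
assert abs(cov_01 - 0.5) < 0.05
```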
Definition
The idea
However, when we do have information about \( X \), we can improve this best guess via conditional expectation
Conditional expectation is not a numerical value; rather it is a function of the observed RVs
Conditional expectation \( E(X|A) \) minimizes the quadratic error of the conditional event:
\( g(a) := E((X-a)^2; A) = E((X-a)^2 I_A) = E(X^2 I_A) - 2aE(X I_A) + a^2 E(I_A) \)
\( \tfrac{d}{da} g(a) = -2E(X I_A) + 2a P(A) \overset{set}{=} 0 \)
\( \operatorname{argmin}_a g(a) = E(X I_A)/P(A) = E(X; A)/P(A) =: E(X | A) \)
The values of \(Y\) don't matter for \(\hat{X} = E(X|Y) \); the partition generated by the values is what matters.
Properties
If \(A_i := \{Y = y_i\}\) is observed then the best guess at \(X\) is:
\( \hat{X} = E(X|Y=y_i) = E(X|A_i) = \frac{E(X; A_i)}{P(A_i)} := x_i \)
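The atom formula can be checked by simulation; a sketch (my example: \(X\) a die roll, conditioning on the atom \( A = \{X \text{ even}\} \), so \( E(X|A) = 4 \)):

```python
import random

# Sketch of E(X | A) = E(X; A) / P(A): X is a fair die roll and
# A = {X even}, so the conditional expectation is the average of
# {2, 4, 6}, i.e. 4.
random.seed(5)
n = 100_000
rolls = [random.randint(1, 6) for _ in range(n)]

num = sum(x for x in rolls if x % 2 == 0)  # ~ n * E(X; A)
den = sum(1 for x in rolls if x % 2 == 0)  # ~ n * P(A)
assert abs(num / den - 4.0) < 0.05         # E(X | A) = 4
```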
Axioms
🚩 CE.2 On the atoms of \( Y \), \(X, \hat{X} \) are indistinguishable, i.e. \( E(X|A) = E(E(X|Y)|A) = E(\hat{X}|A) \)
If \( \psi(x) \) is a one-to-one function then \(E(X|Y) = E(X|\psi(Y)) \). So what matters is not the RV \(Y\) itself but the information it contains
Conditional distributions
Conditional distribution of \( X|Y \): \( P(X \in B | Y) := E(I(X \in B) | Y) \)
For AC distributions:
\( f_{X|Y}(x|y) = \frac{f_{X,Y} (x,y)}{f_Y(y)}, f_Y = \int f_{X,Y}(x,y) dx\)
\(E(X|Y) = \int x f_{X|Y}(x|y) dx = g(Y)\)
🚩 CEP.1 Linearity: \( E(aX + bZ | Y) = aE(X|Y) + bE(Z|Y) \)
🚩 CEP.2 Monotonicity: \( X \leq Z \Rightarrow E(X|Y) \leq E(Z|Y) \)
🚩 CEP.3 functions of Y behave like constants when conditioning on Y:
\(Z = h(Y) \Rightarrow E(ZX|Y) = ZE(X|Y) \)
Expectation is the "best guess": it minimizes the mean squared error, i.e. \( EX = \operatorname{argmin}_a E(X - a)^2 \), in the absence of extra information about \( X\)
🚩 CE.1 The RV \( \hat{X}\) is flat on the atoms of \( Y \), i.e. is \( \sigma(Y) \)-measurable
i.e. \( \hat{X} =E(X | Y_1, ... , Y_n) = h(Y_1, ..., Y_n) \) is a function of \( Y_1, ... Y_n \) and nothing else
🚩 CEP.4 Independence: if \( X,Y\) independent then \( E(X|Y) = E(X) \)
🚩 CEP.5 Double expectation law:
\( E(E(X|Y_1, Y_2) | Y_1) = E(X | Y_1) \)
\( E(E(X|Y)) = E(X) \)
Overview
In probability theory we know the probability space and make conclusions about the samples. In mathematical statistics we know the sample and draw conclusions about the underlying distribution.
In MS we have some RE \( (\Omega, \mathcal{F}, \textbf{P}_\theta) \) where \( \theta\) is an unknown element of \( \Theta\)
We introduce a random sample \(X = X(\omega) \in \mathbb{R}^n\), so the probability space is \((\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), P_\theta) \)
Key ideas
Any function of the sample \(S(X)\) is a statistic
Estimators of \(\theta\) are statistics which should approximate the parameter
A statistical test is a statistic with range \([0,1]\), where 1 indicates accepting some hypothesis at some degree of certainty
Sufficiency
Statistic \(S\) is sufficient for \(\theta\) if \(P_\theta (X \in B | S) \) does not depend on \(\theta\)
if \(S\) is sufficient and \(\psi\) is one to one function then \(\psi(S)\) is sufficient
⚠ Neyman-Fisher Factorisation Theorem
Suppose all \(P_\theta\) are AC (this covers the discrete case) w.r.t. some measure \( \mu \), with density \(f_\theta\). A necessary and sufficient condition for \(S\) to be sufficient for \(\theta\) is that the density factorizes:
\(f_\theta(x) = \psi(S(x), \theta) h(x) \)
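A worked instance may help (my illustration, the standard Bernoulli case): for an i.i.d. \( Bernoulli(\theta) \) sample \( x = (x_1, ..., x_n) \),
\( \qquad f_\theta(x) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \underbrace{\theta^{S(x)} (1-\theta)^{n - S(x)}}_{\psi(S(x), \theta)} \cdot \underbrace{1}_{h(x)}, \qquad S(x) = \sum_{i=1}^n x_i \)
so by the factorisation theorem \( S \) is sufficient for \( \theta \).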
Estimators
Maximum Likelihood Estimators
⚠ Cor (of NF): the MLE is a function of a sufficient statistic:
\( \max_\theta f_\theta(X) = \max_\theta h(X)\psi(S(X),\theta) = h(X) \max_\theta \psi(S(X),\theta) \)
⚠ Def. \(\hat{\theta}= \operatorname{argmax}_\theta f_\theta(x) \)
Unbiased estimators (class \(\mathcal{K}_0\)) are such that \( E_\theta(\theta^*) = \theta, \forall \theta \)
Efficiency: Estimator \(\theta_0^*(X) \) is efficient in class \( \mathcal{K} \) if \( E(\theta_0^* - \theta)^2 \leq E(\theta^* - \theta)^2 \) for every \( \theta^* \in \mathcal{K} \)
I.e. minimum "variance"
Theorems
⚠ Theorem: An efficient estimator in \(\mathcal{K}_b\) is unique
⚠ Rao-Blackwell Theorem: If you take an estimator, and take its expectation conditioned on S (sufficient statistic), then the bias remains the same and the variance will be less or equal:
Let \(\theta^* \in \mathcal{K}_b\), \(S\) a sufficient statistic, and let \(\theta_S^* = E_\theta(\theta^* | S )\).
Then \( \theta_S^* \in \mathcal{K}_b, E(\theta_S^* - \theta)^2 \leq E(\theta^* - \theta)^2, \forall \theta \)
To apply RB theorem:
1) Identify the conditional distribution of the estimator on the statistic
2) Compute the mean and variance of it
3) The conditioned estimator's variance should be no larger than the original estimator's!
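The recipe can be illustrated end to end (my example: a \( Bernoulli(\theta) \) sample, crude unbiased estimator \( \theta^* = X_1 \), sufficient statistic \( S = \sum_i X_i \), for which \( E(X_1|S) = S/n \)):

```python
import random

# Rao-Blackwell sketch: crude unbiased estimator theta* = X_1; conditioning
# on the sufficient statistic S = sum X_i gives E(X_1 | S) = S/n, which
# keeps the mean and shrinks the variance.
random.seed(6)
theta, n, reps = 0.4, 20, 20_000

crude, rb = [], []
for _ in range(reps):
    xs = [1 if random.random() < theta else 0 for _ in range(n)]
    crude.append(xs[0])      # theta* = X_1
    rb.append(sum(xs) / n)   # Rao-Blackwellized: E(X_1 | S) = S / n

def var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

assert var(rb) < var(crude)                                 # variance drops
assert abs(sum(rb) / reps - sum(crude) / reps) < 0.02       # bias unchanged
```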
Types of convergence
🚩 Def 5.1 Convergence almost surely
🚩 Def 5.2 Convergence in probability
🚩 Def 5.5 Convergence in distribution
🚩 Def 5.4 \(L^1\) convergence
"Mean convergence"
🚩 Def 5.3 \(L^2\) convergence
"Mean quadratic convergence"
Recall: \( \lim_{n\rightarrow\infty} x_n = x\\ \Leftrightarrow x_n \rightarrow x \\ \Leftrightarrow \forall \epsilon > 0, \exists n_0 < \infty, s.t. \forall n > n_0, |x_n - x| < \epsilon \)
However \(X(\omega)\) is a function on \(\Omega\), not a sequence!
Theorems
Pointwise convergence of the sequence of functions on a set of probability 1
\( X_n(\omega) \rightarrow X(\omega), \forall \omega \in A, P(A) = 1 \)
The probability of the gap between \(X_n, X\) being greater than any \( \epsilon > 0 \) goes to 0:
\(P(|X_n - X| > \epsilon) \rightarrow 0, \forall \epsilon > 0 \)
Expectation of squared difference goes to 0
\(E(X_n - X)^2 \rightarrow 0, n \rightarrow \infty\)
Expectation of absolute difference goes to 0
\(E|X_n - X| \rightarrow 0, n \rightarrow \infty\)
The distribution function approaches the limit at all continuity points:
\( F_{X_n}(t) \rightarrow F_X(t), n\rightarrow \infty, \forall t \text{ s.t. } F_X(t-) = F_X(t) \)
⚠ Theorem 5.5
\( X_n \overset{d}{\rightarrow} X \Leftrightarrow Ef(X_n) \rightarrow Ef(X) \) for all bounded continuous functions \(f\)
Discrete case: \(P(X_n = k) \rightarrow P(X = k)\)
Relationships
⚠ Theorem 5.9 sums of convergent RVs converge to the sum of the limits (a.s., in probability, and in \(L^2, L^1\))
⚠ Theorem 5.23
Continuous functions of a.s.-, p-, or d-convergent RVs converge (in the same sense) to the function of the respective limit RVs
Limit theorems
⚠ Theorem 5.30 Weak Law of Large Numbers
The average of Bernoulli trials approaches the success probability as the sample size increases
\( \qquad \frac{S_n}{n} \overset{p}{\rightarrow} p, n \rightarrow \infty \)
Proof: \( E(S_n/n - p)^2 = Var(S_n)/n^2 = Var(X_1)/n = pq/n \rightarrow 0 \Rightarrow S_n/n \overset{L^2}{\rightarrow} p \Rightarrow S_n/n \overset{P}{\rightarrow} p \)
⚠ Theorem 5.31 Strong Law of Large Numbers
The average of independent Bernoulli trials converges almost surely to the success probability:
\( \qquad \frac{S_n}{n} \overset{a.s.}{\rightarrow} p, n \rightarrow \infty \)
Actually holds for any sequence of identically distributed, uncorrelated RVs
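Both laws are easy to watch numerically; a sketch with \( Bernoulli(p) \) trials (my choice \( p = 0.3 \)):

```python
import random

# LLN sketch: for Bernoulli(p) trials, S_n / n should settle near p
# as n grows (p = 0.3 chosen for illustration).
random.seed(7)
p, n = 0.3, 200_000
s = sum(1 for _ in range(n) if random.random() < p)  # S_n
assert abs(s / n - p) < 0.01
```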
Definition
🚩 Def 6.1
\( \psi_X(t) = E(e^{itX}) = \int e^{itx} dF(x) = \int e^{itx} f_X(x) dx = E\cos(tX) + iE \sin(tX) \)
- always exists
- always finite
- \(|\psi_X(t)| \leq 1 \)
- \(\psi_X(0) = 1 \)
Examples
\( X=c , \psi_X(t) = e^{itc} \)
\( X \sim B(p), \psi_X(t) = 1 + p(e^{it} - 1 )\)
\( X \sim U(0,1) , \psi_X(t) = \frac{e^{it} - 1}{it} \)
\( X \sim N(0,1), \psi_X(t) = e^{-t^2/2} \)
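The examples can be verified by Monte Carlo; a sketch for the \( N(0,1) \) case, checking the real part against \( e^{-t^2/2} \) and that the imaginary part vanishes (the RV is symmetric, so the ChF is real):

```python
import random, math

# Monte Carlo check of psi_X(t) = E cos(tX) + i E sin(tX) for X ~ N(0,1),
# whose ChF is the real function exp(-t^2 / 2).
random.seed(8)
n, t = 200_000, 1.0
xs = [random.gauss(0, 1) for _ in range(n)]
re = sum(math.cos(t * x) for x in xs) / n  # estimates Re psi_X(t)
im = sum(math.sin(t * x) for x in xs) / n  # estimates Im psi_X(t)
assert abs(re - math.exp(-t * t / 2)) < 0.01
assert abs(im) < 0.01  # symmetric RV => real-valued ChF
```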
Functions of RVs
if \(Y = aX+b\), then \( \psi_Y(t) = e^{ibt} \psi_X(at) \)
\(\overline{\psi_{X}(t)} = \overline{\int e^{itx} dF_X(x)} = \int e^{i(-t)x} dF_X(x) = \psi_{X}(-t) = \psi_{-X}(t) \)
Properties of ChF's
\(X \overset{d}{=} -X \Leftrightarrow \psi_X(t) = \psi_{-X}(t) = \overline{\psi_X(t)} \)
In other words, the ChF of a symmetric RV is real valued.
Any ChF is uniformly continuous
if \(X,Y\) independent then \(\psi_{X+Y}(t) = \psi_X(t)\psi_Y(t) \)
Theorems
⚠ Theorem 6.11 if \(E|X|^k < \infty \), then \(\psi_X(t)\) is k-times differentiable and:
\( \qquad EX^k= (-i)^k \frac{d^k}{dt^k} \psi_X(t) \Big|_{t=0} \)
\( \Rightarrow EX = -i \psi_X'(0)\\
\Rightarrow E(X^2) = -\psi_X''(0) \)
Converse is true for even \(k\) and "almost true" for odd
⚠ Theorem 6.7 (Inversion formula) if \(\int |\psi_X(t)| dt < \infty \) then \(X\) has a continuous density given by:
\( \qquad f_X(x) = \frac{1}{2\pi} \int e^{-itx} \psi_X(t) dt \)
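A numeric sketch of the inversion formula for \( X \sim N(0,1) \) (my discretization choices: step 0.001, truncation at \(\pm 10\)), recovering \( f_X(0) = 1/\sqrt{2\pi} \):

```python
import math

# Inversion-formula check at x = 0 for X ~ N(0,1): psi_X(t) = exp(-t^2/2)
# and e^{-it*0} = 1, so f(0) = (1/2pi) * integral exp(-t^2/2) dt.
dt, T = 0.001, 10.0          # step size and truncation (illustrative choices)
steps = int(2 * T / dt)
integral = sum(math.exp(-(-T + k * dt) ** 2 / 2) * dt for k in range(steps))
f0 = integral / (2 * math.pi)
assert abs(f0 - 1 / math.sqrt(2 * math.pi)) < 1e-4  # f(0) = 1/sqrt(2 pi)
```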
\( X \sim Cauchy, \psi_X(t) = e^{-|t|} \)
One to one correspondence between ChFs and DFs
⚠ Theorem 6.15 Convergence in distribution \( \Leftrightarrow \) convergence of ChFs:
\(\qquad n\rightarrow \infty, X_n \overset{d}{\rightarrow} X \Leftrightarrow \psi_{X_n}(t) \rightarrow \psi_X(t) \)
⚠ Theorem 6.17 if the ChFs converge pointwise to a function continuous at 0, then the RVs converge in distribution to a matching limit:
\(\forall t \in \mathbb{R}, \psi_{X_n}(t) \rightarrow \psi(t), n \rightarrow \infty\), with \(\psi(t)\) continuous at 0 \( \Rightarrow X_n \overset{d}{\rightarrow} X \) where \( \psi_X = \psi \)
WLLN: \( \psi_{S_n/n}(t) = \psi_{S_n}(t/n) = (\psi_X(t/n))^n \)
⚠ Theorem 6.20 Central Limit Theorem (ZOMG!?)
If the \(X_i\) are i.i.d. with finite, non-zero variance, then:
\( \qquad Y_n := \frac{S_n - n\mu}{\sigma \sqrt{n}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \overset{d}{\rightarrow} Z \sim N(0,1), n \rightarrow \infty \)
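A simulation sketch of the CLT (my choice of summands: \( U(0,1) \), so \( \mu = 1/2 \), \( \sigma^2 = 1/12 \)), checking that \( P(Y_n \leq 0) \approx 1/2 \):

```python
import random

# CLT sketch: standardized sums of Uniform(0,1) RVs should be roughly
# standard normal; here we check P(Y_n <= 0) ~ 1/2.
random.seed(9)
n, reps = 50, 20_000
mu, sigma = 0.5, (1 / 12) ** 0.5
ys = []
for _ in range(reps):
    s = sum(random.random() for _ in range(n))       # S_n
    ys.append((s - n * mu) / (sigma * n ** 0.5))     # Y_n

frac = sum(1 for y in ys if y <= 0) / reps
assert abs(frac - 0.5) < 0.02  # N(0,1) has median 0
```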
⚠ Theorem 6.21 Poisson Limit Theorem
If \(X_{n,1}, ..., X_{n,n}\) are i.i.d. \(Bernoulli(p_n)\) and \(np_n \rightarrow \lambda\), then \( S_n \overset{d}{\rightarrow} Y \sim Poisson(\lambda)\)
Extends beyond Bernoulli / identically distributed
Random Vectors
Extends via the dot product: \(\psi_X(t_1,...,t_d) = E(e^{i t \cdot X}) \)
One to one correspondence between ChF and DF still holds
Linear transformations, if \(Y=XA+B\)
\(\qquad \psi_Y(s) = e^{i(s,B)} \psi_X(sA^T) \)
⚠ Multivariate CLT:
\( \qquad \sqrt{n} \Big(\frac{S_n}{n} - \mu\Big) \overset{d}{\rightarrow} Y \sim MVN(0, C_X^2) \)
Empirical distributions
🚩 Def 7.26 EDF defined as:
\( \qquad F_n^*(t) = \frac{1}{n}\sum_{i=1}^n I(X_i \leq t) \equiv \sum_{i=1}^n \frac{1}{n} I(X_{(i)} \leq t) \)
\( \qquad \equiv P_n^* := \sum_{j=1}^n \tfrac{1}{n} \epsilon_{X_j} \)
⚠ Theorem 7.27 Glivenko-Cantelli.
Let \(X_1,...,X_n \overset{i.i.d.}{\sim} F\). Then as \( n\rightarrow \infty\):
\( \qquad D_n := \sup_t |F_n^*(t) - F(t)| \overset{a.s.}{\rightarrow} 0 \)
in other words, the maximal gap between the empirical and true DFs vanishes: the convergence is uniform in \(t\), not merely pointwise.
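A simulation sketch (my example: \( X_i \sim U(0,1) \), so \( F(t) = t \) on \([0,1]\)), computing \( D_n \) at the jump points of the EDF, where the sup is attained:

```python
import random

# Glivenko-Cantelli sketch: for U(0,1) samples, the Kolmogorov distance
# D_n = sup_t |F_n*(t) - t| should shrink as n grows.
random.seed(10)

def D_n(n):
    xs = sorted(random.random() for _ in range(n))
    # The sup over t is attained at the jump points of the EDF: just
    # before and at each order statistic x_(i).
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(xs))

assert D_n(100_000) < 0.01        # large sample: tiny uniform gap
assert D_n(100_000) < D_n(100)    # the gap shrinks with n
```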