MAST30020
Probability for Inference

Probability & Random variables

Expectations

Convergence

Characteristic functions

Statistical applications

Expectation (General)

Conditional expectation

Basic probability

Probabilities on R

Random variables

About RVs

Random Experiment

Has mass character, e.g. could be repeated many times, in theory

Outcomes are uncertain (to the best of our prior knowledge)

Has some statistical regularity - the relative frequencies of outcomes stabilize around some values as the number of independent repetitions grows

Events

Events are subsets of the outcome space; a given subset may or may not be in the \(\sigma\)-algebra.
An event must be measurable (i.e. in the \(\sigma\)-algebra) for its probability to be defined

Indicator functions

Definition

Operations with indicator functions

\(\begin{align} A \lor B &\equiv A \cup B, & I_{A \cup B} &= \max\{I_A, I_B\} &&\approx \exists \\ A \land B &\equiv A \cap B, & I_{A \cap B} &= I_A I_B &&\approx \forall \\ \lnot A &\equiv A^c, & I_{A^c} &= 1- I_A \\ A \oplus B &\equiv (A \cup B) \setminus (A \cap B), & I_{A \oplus B} &= |I_A - I_B| \end{align}\)

\(I_A(\omega) := \begin{cases} 1, & \omega \in A \\ 0, & \omega \not\in A \end{cases} \)
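A quick numerical check of these identities - my own sketch, not from the notes (assumes numpy; the sets \(A = \{\omega < 0.6\}\), \(B = \{\omega > 0.4\}\) on \(\Omega = (0,1)\) are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
omega = rng.uniform(0, 1, size=1000)      # sample points of Omega = (0, 1)
I_A = (omega < 0.6).astype(int)           # indicator of A = {omega < 0.6}
I_B = (omega > 0.4).astype(int)           # indicator of B = {omega > 0.4}

# union <-> max, intersection <-> product, complement <-> 1 - I,
# symmetric difference <-> |I_A - I_B|
assert np.array_equal(np.maximum(I_A, I_B), ((omega < 0.6) | (omega > 0.4)).astype(int))
assert np.array_equal(I_A * I_B,           ((omega < 0.6) & (omega > 0.4)).astype(int))
assert np.array_equal(1 - I_A,             (omega >= 0.6).astype(int))
assert np.array_equal(np.abs(I_A - I_B),   ((omega < 0.6) ^ (omega > 0.4)).astype(int))
```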

Complements, unions, intersections are events

Special cases

Event \(A\): \( A_1, A_2, \dots \) occurred infinitely often (i.o.)
\( \qquad \equiv A = \bigcap_{n=1}^\infty \bigcup_{k=n}^\infty A_k \)
\( \qquad \equiv A \text{ occurred iff } \forall n, \exists k\geq n, \text{ s.t. } [A_k \text{ occurred}]\)

Event \( A: A_1, A_2, \dots \) occurred finitely often (f.o.):
\( \qquad \equiv A = \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty A_k^c \)
\( \qquad \equiv A \text{ occurred iff } \exists n, \forall k\geq n, [A_k \text{ did not occur} ]\)

Probability spaces

Probability function

Basic properties

⭐ Theorem 1.25:
\( \qquad \textbf{P} \text{ satisfies } P.3 \quad \equiv \quad A_n \uparrow A \Rightarrow P(A_n) \uparrow P(A)\quad \equiv \quad A_n \downarrow A \Rightarrow P(A_n) \downarrow P(A)\quad \equiv \quad A_n \downarrow \emptyset \Rightarrow P(A_n) \downarrow 0 \)

Disjointification:


\(\qquad A_1 \subset A_2 \subset A_3 \subset \dots\\ \qquad B_n = A_n \setminus \bigcup_{i=1}^{n-1} A_i \\ \Rightarrow\\ \qquad \bigcup_{i=1}^n A_i = \bigcup_{i=1}^n B_i \\ \qquad B_n \subset A_n \)

Monotonicity:


\( \qquad A \subset B \Rightarrow P(A) \leq P(B) \)


⭐ Theorem 1.24 Boole's Inequality:
\(\qquad \forall A_1, A_2, ... \in \mathcal{F}, P(\bigcup_{j=1}^\infty A_j) \leq \sum_{j=1}^\infty P(A_j) \)

Axioms to satisfy

🚩 P.3) Countable additivity (for disjoint \( A_1, A_2, \dots \))


\( \qquad \textbf{P} (\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} \textbf{P} (A_i) \)

🚩 P.1) Non-negativity:


\( \qquad \textbf{P} (A) \geq 0, A \in \mathcal{F} \)

🚩 P.2) Normalization (total mass one):


\( \qquad \textbf{P} (\Omega) = 1 \)

Defined on measurable subsets of \( \Omega\)

Sigma-algebra

Smallest generated: \(\sigma(\mathcal{G})\) is the smallest \(\sigma\)-algebra containing the generating class \(\mathcal{G}\), i.e. \(\sigma(\mathcal{G}) \subset \mathcal{F}'\) for every \(\sigma\)-algebra \(\mathcal{F}' \supset \mathcal{G}\)

Examples

Borel \(\sigma\)-algebra: \(\mathcal{F} = \mathcal{B}(\mathbb{R}) = \sigma\{ (a,b] : a,b \in \mathbb{R}, a < b \} \)

Generated by event: \( \mathcal{F} = \{ \emptyset, A, A^c, \Omega \} \)

Powerset: \( \mathcal{F} = \mathscr{P}(\Omega) \)

Trivial: \(\mathcal{F} = \{ \emptyset, \Omega \} \)

Must satisfy

🚩 A.3) Closed under countable union:


\(\qquad A_1, A_2, \dots \in \mathcal{F} \Rightarrow \bigcup_{i=1}^\infty A_i \in \mathcal{F} \)

🚩 A.2) Closed under complementation:


\( \qquad A \in \mathcal{F} \Rightarrow A^c \in \mathcal{F} \)

🚩 A.1) Contains the outcome space itself:


\( \qquad \Omega \in \mathcal{F} \)

A set of subsets of the outcome space / a subset of the powerset of the outcome space

Outcome space

The set of all possible outcomes (may be finite, countable, uncountable). Elements of the outcome space are individual outcomes.

\( (\Omega, \mathcal{F}, \textbf{P} ) \)

\(\Omega\): outcome space
\(\mathcal{F} \subset \mathscr{P}(\Omega)\) : a \(\sigma\)-algebra of \( \Omega\)
\( \textbf{P} : \mathcal{F} \rightarrow [0,1]\): a probability function from measurable sets to \([0,1]\)

PF.3) - finite additivity, the finite version of P.3

⭐ Theorem 1.27, (1st) Borel-Cantelli Lemma


If \( \sum_{n=1}^\infty P(A_n) < \infty, \text{ then } P(A_n, i.o.) = 0 \)


Proof:
\( \qquad P(A_n, i.o.)\overset{def i.o.}{=} P(\bigcap_{n\geq 1} \bigcup_{k\geq n} A_k) \overset{1.25c}{=} \lim_{n\rightarrow \infty} P(\bigcup_{k \geq n} A_k) \overset{Boole}{ \leq} \lim_{n\rightarrow \infty} \sum_{k \geq n}P(A_k) \overset{*}{=} 0 \)


(*) since by assumption \( \sum_{k\geq 1} P(A_k) < \infty \), so the tails \( \sum_{k \geq n} P(A_k) \rightarrow 0 \)
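A simulation sketch of the lemma (my own illustration, assuming numpy): with independent \( U_n \sim U(0,1) \) and \( A_n = \{U_n < 1/n^2\} \), we have \( \sum_n P(A_n) = \sum_n 1/n^2 < \infty \), so almost surely only finitely many \(A_n\) occur.

```python
import numpy as np

# A_n = {U_n < 1/n^2}: sum P(A_n) = sum 1/n^2 < infinity, so by Borel-Cantelli
# P(A_n i.o.) = 0 and each run should contain only finitely many occurrences.
rng = np.random.default_rng(1)
N = 100_000
for run in range(5):
    u = rng.uniform(size=N)
    occurred = np.flatnonzero(u < 1.0 / np.arange(1, N + 1) ** 2) + 1
    print(f"run {run}: A_n occurred for n = {occurred}")   # typically only a few small n
```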

Definition

Probability space is \( ( \mathbb{R}, \mathbb{B}(\mathbb{R}), P) \)

Probability function is defined in a "simpler" way - by the cdf, which determines \(P\) on every \( (a,b] \) interval and hence on the Borel sets

Distribution function

The "distribution function" of \( P \) is an economical way to define it where \(F(t) = P(\{-\infty, t]) \), which

⭐ Theorem 1.33 Definition of DF,
For any DF of P, \(F(t)\) satisfies:

🚩 1) \(F\) is non-decreasing, so it has one sided limits:


\( \qquad F(t-) = \lim_{s \uparrow t} F(s) \\ \qquad F(t+) = \lim_{s \downarrow t} F(s) \)


Proof: monotonicity of \(P\): for \( s < t \), \( (-\infty, s] \subset (-\infty, t] \), so \( P((-\infty, s]) \leq P((-\infty, t]) \), i.e. \( F(s) \leq F(t) \)

🚩 2) \( F \) is right continuous


\( \qquad F(t) = F(t+) \)


Proof: continuity of \( P \): \( A_n = (-\infty, t_n] \downarrow A = (-\infty, t] \) for \( t_n \downarrow t \), so \( P(A_n) \downarrow P(A) \), i.e. \( F(t_n) \downarrow F(t) \)

🚩 3) \( \lim_{t\rightarrow -\infty} F(t) = 0, \lim_{t \rightarrow \infty} F(t) = 1 \)


Proof: \( A_n = (-\infty, t_n] \downarrow \emptyset \text{ as } t_n \rightarrow -\infty, F(t_n) \downarrow P(\emptyset) = 0 \\ A_n = (-\infty, t_n] \uparrow \mathbb{R} \text{ as } t_n \rightarrow \infty, F(t_n) \uparrow P(\mathbb{R}) = 1 \)

⭐ Theorem 1.36
Any function \(F \) which satisfies 1.33 defines a unique \(\textbf{P}\) on \( \mathbb{B}(\mathbb{R}) \) with \( F = F_P \)


This means we can completely define a probability on \( \mathbb{R} \) by the DF.

Types of distributions

Discrete probabilities on \( \mathbb{R} \)

\(P(C) = 1\) for some countable set \( C \subset \mathbb{R} \)

Equivalent characterizations:


  • \(P\) is discrete
  • \(P = \sum_i p_i \epsilon_{t_i}, \sum_i p_i = 1 \)
  • \(F_P(t) = \sum_i p_i I_{t_i \leq t} \)

for some \( \{ t_i \}_{i \geq 1} \subset \mathbb{R} \)

Absolutely continuous (AC) probabilities on \( \mathbb{R} \)

These ones have densities / pdf's!

\( F_P(t) \) is AC iff there exists \(f(t)\) s.t. \( F_P(t) = \int_{-\infty}^t f(s) ds \)

Any integrable \(f(t) \geq 0 \) which integrates to 1 is a density, hence defines a distribution

Mixed distributions

Any convex combination of \(P_1, P_2\) defines a new probability function:


\( P = pP_1 + (1-p)P_2, p \in [0,1] \)

Singular distributions

Have a continuous DF, but are not AC! I.e. lack a density (e.g. the Cantor distribution)

⭐ Theorem 1.52 Lebesgue's decomposition
Every probability function \(P\) can be represented as a weighted combination of a discrete, an AC, and a singular distribution
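A minimal sampling sketch of a mixed distribution (my own example, assuming numpy; the atom at 0 and weight \(p = 0.3\) are illustrative choices): the DF of \( P = pP_1 + (1-p)P_2 \) jumps by \(p\) at the atom of the discrete part.

```python
import numpy as np

# P = p*P1 + (1-p)*P2 with P1 = point mass at 0 (discrete part)
# and P2 = N(0,1) (AC part): with probability p, draw from P1.
rng = np.random.default_rng(2)
p, n = 0.3, 100_000
from_P1 = rng.uniform(size=n) < p
x = np.where(from_P1, 0.0, rng.normal(size=n))

# The DF jumps by p at the atom t = 0: F(0) - F(0-) ~ 0.3
print(np.mean(x <= 0.0) - np.mean(x < 0.0))
```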

Definition

This means the probability \( \textbf{P}(X \in B) \) is defined for any \( B \in \mathbb{B}(\mathbb{R}) \)

Since taking preimages \( X^{-1} \) preserves all set operations and disjointness, it is sufficient to show that:


\( \qquad \{X \in (-\infty, t] \} \equiv X^{-1}((-\infty, t]) \equiv \{ \omega \in \Omega : X(\omega) \in (-\infty, t] \} \in \mathcal{F} , \forall t \in \mathbb{R}\)

Preimages under RVs preserve set operations 🚩 Prop. 2.2

Subsets: \( B_\alpha \subset B_\beta \Rightarrow X^{-1}(B_\alpha) \subset X^{-1}(B_\beta) \subset \Omega \)

Intersection: \( \bigcap_{\alpha\in I} X^{-1}(B_\alpha) = X^{-1} (\bigcap_{\alpha \in I} B_\alpha) \)

Unions: \( \bigcup_{\alpha\in I} X^{-1}(B_\alpha) = X^{-1} (\bigcup_{\alpha \in I} B_\alpha) \)

Disjointness: \( B_\alpha \cap B_\beta = \emptyset \Rightarrow X^{-1}(B_\alpha) \cap X^{-1}(B_\beta) = \emptyset \)

Complements: \( X^{-1}(B_\alpha^c) =[X^{-1}(B_\alpha)]^c \)

Types of RVs

Simple RVs: \(X = \sum_{i=1}^n a_i I_{A_i}\)

Random vectors:

Complex valued RVs: \( Z : \Omega \rightarrow \mathbb{C}, Z = X + iY \)

🚩 Prop 2.9, given an RV \( X \), \( \sigma(X) = \{ X^{-1}(B) : B\in \mathbb{B}(\mathbb{R}) \} \) defines a \( \sigma\)-algebra

Transformations of RVs

Functions of RVs

General fact: if \(g(x)\) is a continuous function then \(g(X)\) is an RV

Probability functions of RVs

\(P_X(B) := \textbf{P}(X \in B), B \in \mathbb{B}(\mathbb{R}) \) is called the distribution of \(X\)

This defines a probability function

So the DF of \(X\) is \( F_X(t) = P_X((-\infty, t]) = \textbf{P}(X \in (-\infty, t]) \)

"Survival tail" of \(X\) is \(S_X(t) = 1 - F_X(t)\)

Random vectors

\( X = (X_1, X_2, ... X_d) \in \mathbb{R}^d\)
Equivalently, \(X^{-1}(B) \in \mathcal{F}, B \in \mathbb{B}(\mathbb{R}^d) \)

Distributions of RVecs are:
\( \qquad F_X(t_1,...t_d) = P(X_1 \leq t_1, ... X_d \leq t_d) \)

In terms of joint density:
\( \qquad F_X(t_1, \dots, t_d) = \int_{-\infty}^{t_1} \dots \int_{-\infty}^{t_d} f_X(s_1,\dots,s_d)\, ds_d \dots ds_1 \)

⭐ Prop 2.28 RVecs comprised of discrete RVs are discrete

⭐ Prop 2.29 Marginal density of \(X_j\) in RVec can be found by integrating over all the other variables

Constant multipliers, linear combinations, products of RVs are RVs

⭐ Prop 2.40 if \( g \) is an increasing and continuous function then:


\( F_{g(X)}(t) = F_X(g^{-1}(t)) \)

⭐ Theorem 2.41 if \(X\) is an RV, \(g\) is continuously differentiable on an open set, then the density is:


\( f_{g(X)}(t) = f_X(g^{-1}(t))|\tfrac{d}{dt}g^{-1}(t)| \)

General transformation is:


\( Y = g(X), F_Y(t) = P(Y \leq t) = P(g(X) \leq t) \) etc
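A Monte Carlo check of Theorem 2.41 (my own sketch, assuming numpy): take \( X \sim N(0,1) \) and the increasing map \( g(x) = e^x \), so \( g^{-1}(t) = \ln t \) and the predicted density is \( \varphi(\ln t)/t \) (the lognormal density).

```python
import numpy as np

# X ~ N(0,1), g(x) = e^x (increasing): g^{-1}(t) = ln t, |d/dt g^{-1}(t)| = 1/t,
# so Thm 2.41 predicts f_Y(t) = phi(ln t) / t.
rng = np.random.default_rng(3)
y = np.exp(rng.normal(size=1_000_000))

phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
t, h = 1.5, 0.01
predicted = phi(np.log(t)) / t
empirical = np.mean((t - h/2 < y) & (y <= t + h/2)) / h   # histogram-style estimate
print(predicted, empirical)                               # agree to ~2 decimals
```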

⭐ Theorem 2.43 For RVecs, replace \( |\tfrac{d}{dt}g^{-1}(t)| \) with the absolute value of the Jacobian determinant of the inverse transform

🚩 \( X(\omega) : \Omega \rightarrow \mathbb{R} \), s.t. the inverse image of any \( B \in \mathbb{B}(\mathbb{R}) \) must be in \(\mathcal{F}\):


\( \qquad \{X \in B \} \overset{def}{\equiv} X^{-1}(B) \equiv \{ \omega \in \Omega : X(\omega) \in B \} \in \mathcal{F}, \forall B \in \mathbb{B}(\mathbb{R}) \)


Also called "measurable" function.

Independence

Definition

RVs \( X_1,...,X_n \) are independent if the joint probabilities can always be factored into individual probabilities:


\( \qquad \forall B_1, \dots, B_n \in \mathbb{B}(\mathbb{R}), P(X_1 \in B_1, \dots, X_n \in B_n) = P(X_1 \in B_1) \cdots P(X_n \in B_n) \)

Theorems

Continuous RVs

Discrete RVs

⭐ Theorem 3.3, RVs are independent iff the joint DF factorizes:


\( \qquad \forall (t_1, \dots, t_n) \in \mathbb{R}^n, F(t_1, \dots , t_n) = F_{X_1}(t_1) \cdots F_{X_n}(t_n) \)

⭐ Theorem 3.4, discrete RVs are independent iff the joint probability factorizes:


\( \qquad \forall (t_1, \dots, t_n) \in \mathbb{R}^n, \textbf{P}(X_1 = t_1, \dots ,X_n = t_n) = \textbf{P}(X_1 = t_1) \cdots \textbf{P}(X_n = t_n) \)

⭐ Theorem 3.5, jointly AC RVs are independent iff the joint density factorizes:


\( \qquad \forall (t_1, \dots, t_n) \in \mathbb{R}^n, f(t_1, \dots , t_n) = f_{X_1}(t_1) \cdots f_{X_n}(t_n) \)

General

⭐ Functions of independent RVs are independent

Independence


Of events

⭐ Definition 3.19: Events \( A_1, A_2, ... \) are independent iff, for every finite set of indices \( J \),


\( \qquad \textbf{P}(\bigcap_{i \in J} A_i) = \prod_{i \in J} \textbf{P}(A_i) \)


Equivalently, if their indicator RVs are independent RVs as per definitions above. Proof:


\( \qquad \textbf{P}(\bigcap_i A_i ) = \textbf{P} (I_{A_1} = 1, I_{A_2} = 1, \dots) \overset{ind.}{=} \textbf{P}(I_{A_1} = 1)\textbf{P}(I_{A_2} = 1) \cdots = \prod_i \textbf{P} (A_i) \)

⭐ Corollary 3.21: Events \( A_1, ... A_n \) are independent iff \( A_1^c, ... A_n^c \) are independent

Definition

Discrete case

Definition in terms of summation: (discrete)


\( E(X) = \sum_i t_i P(X = t_i) \)

Continuous case

\( E(X) = \int x f_X(x) dx \)

Interpretation in terms of relative frequency (r.f.):


\( \qquad \overline{X_n} := \frac1n \sum_j X_j = \frac1n \sum_j \underbrace{ \sum_i t_i I(X_j = t_i) }_{=X_j} = \frac1n \sum_i t_i \underbrace{\sum_j I(X_j = t_i) }_{=n_i} = \sum_i t_i \underbrace{\frac{n_i}{n}}_{=r.f.} \approx \sum_i t_i P(X = t_i) \)
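The same computation in code (my own sketch, assuming numpy; the distribution on \(\{1,2,6\}\) is an arbitrary example): the sample mean equals \( \sum_i t_i (n_i/n) \) exactly, and stabilizes around \( \sum_i t_i P(X = t_i) \).

```python
import numpy as np

# Sample mean = sum_i t_i * (n_i / n), the relative-frequency form of E(X).
rng = np.random.default_rng(4)
x = rng.choice([1, 2, 6], p=[0.5, 0.3, 0.2], size=100_000)

values, counts = np.unique(x, return_counts=True)
print(np.mean(x))                            # sample mean
print(np.sum(values * counts / x.size))      # identical by the rearrangement above
print(1 * 0.5 + 2 * 0.3 + 6 * 0.2)           # E(X) = 2.3, the stabilizing value
```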

Properties of expectation

Constants: the expectation of a constant is the constant itself, e.g. \(X=c \Rightarrow E(X) = c \)

Linearity: \(E(aX+b) = aE(X) + b\)

Monotonicity: \( X \leq Y \Rightarrow E(X) \leq E(Y) \)

Examples

Simple RV:


\( \qquad E(X) = E(\sum_i a_i I_{A_i}) = \sum_i a_i E(I_{A_i}) = \sum_i a_i P(A_i) \)

Indicators:


\( \qquad E(I_A) = P(I_A = 0) \cdot 0 + P(I_A = 1)\cdot 1 = P(I_A=1) = P(A) \)

Interpretation in terms of indicators:


\( \qquad E(X) := E(\sum_i t_i I(X = t_i)) = \sum_i t_i E(I(X = t_i)) = \sum_i t_i P(X = t_i) \)

Integrability

⚠ Def 4.12 An RV is "integrable" iff \( E|X| < \infty \)
⚠ Notation: \(X\) is integrable \(\equiv X \in L^1 \)

Integrable RVs can have expectation defined as:


\( \qquad E(X) := EX^+ - EX^- \)


where \( X = X^+ - X^-\) (so \( |X| = X^+ + X^- \)). This definition still works if at most one of the terms is infinite.

Expectation over an event

Defined as \( E(X;A) = E(X I_A) \)

\(E(X;A) \leq E(X) \) (for \( X \geq 0 \))

\( \sum_i E(X;A_i) = E(X) \) for any partition \( \{A_i\} \) of \( \Omega \)
... reminiscent of the LTP \( \sum_i P(A \cap B_i) = P(A) \)

⭐ Cor 4.14: \( X \in L^1 \Rightarrow |EX| \leq E|X| \)


Proof: \( |EX| = |E(X^+ - X^-)| \overset{tri.}{\leq} |EX^+| + |EX^-| \overset{pos.}{=} EX^+ + EX^- \overset{lin.}{=} E(X^+ + X^-) = E|X| \)

Random Vectors & Complex numbers

Defined element-wise

Definition in terms of integrals:


\( E(X) = \int_\Omega X(\omega) P(d\omega) = \int_\Omega X(\omega) dP(\omega) = \int_\Omega X(\omega) dP \)

While Riemann integration partitions the domain, Lebesgue integration partitions the range. In terms of the Lebesgue-Stieltjes integral:


\( Eg(X) = \int g(x) dF_X(x) \)

Theorems

⭐ Thm 4.23 For \( X \geq 0\):
\( \qquad E(X) = \int_0^\infty (1 - F_X(x))dx \)
and for integer-valued \( X \geq 0 \):
\( \qquad E(X) = \sum_{n\geq1} nP(X=n) = \sum_{n\geq 1}P(X \geq n) \)
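A numeric check of the tail-sum formula (my own sketch; assumes only numpy): for \( X \sim Geometric(q) \) on \( \{1,2,\dots\} \), \( P(X \geq n) = (1-q)^{n-1} \) and \( E(X) = 1/q \).

```python
import numpy as np

# X ~ Geometric(q) on {1, 2, ...}: P(X >= n) = (1-q)^(n-1), E(X) = 1/q.
q = 0.25
n = np.arange(1, 200)
print(np.sum((1 - q) ** (n - 1)))   # tail sum ~ 4 (truncation error is tiny)
print(1 / q)                        # E(X) = 4
```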

Related stuff

Functions of RVs

⭐ Cor 4.30: if \( X_1, X_2 \) independent then \( \qquad E(g(X_1)h(X_2)) = E(g(X_1)) E(h(X_2)) \)

Moments

🚩 Def. k'th moment: \(E(X^k)\)

🚩 Def. k'th central moment: \( E((X - E(X))^k) \)

🚩 Def. k'th absolute moment: \( E|X - E(X)|^k \)

if \( E(|X|^p) < \infty \) then \(X \in L^p, p > 0 \)

⭐ 4.39 Jensen's inequality:
Let \( X\in L^1, g\) be convex; then \( g(EX) \leq Eg(X) \)
(special case: \( |EX| \leq E|X| \))
If \( g \) is concave, then \( g(EX) \geq Eg(X)\)

⭐ Cor 4.37 Lyapunov's inequality. For \( 0 < r \leq s\):


\( \qquad (E|X|^r)^{1/r} \leq (E|X|^s)^{1/s} \)


NOTE: implies that if the k'th absolute moment is finite, all lower-order moments are too!

⭐ Thm 4.40 Chebyshev's / Markov's inequality


if \(g\) is a non-negative, non-decreasing function, then for any RV \(X\) and any \(a \in \mathbb{R}\) with \( g(a) > 0 \):


\( \qquad P(X \geq a) \leq \frac{Eg(X)}{g(a)} \)


Proof:
\( P(X \geq a) = E(I(X\geq a)) \), and since \( g \) is non-negative and non-decreasing, \( I(X\geq a) \leq g(X)/g(a) \). Taking expectations (monotonicity): \( P(X \geq a) \leq E(g(X))/g(a) \)
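A numeric check of the bound (my own sketch, assuming numpy; \( X \sim Exp(1) \) and \( g(x) = e^{x/2} \) are illustrative choices, with \( Eg(X) = 2 \) from the MGF of Exp(1)):

```python
import numpy as np

# P(X >= a) <= E g(X)/g(a) with X ~ Exp(1), g(x) = e^{x/2} (non-negative,
# non-decreasing). E g(X) = 1/(1 - 1/2) = 2, so the bound is 2 e^{-a/2};
# the exact tail is e^{-a}.
rng = np.random.default_rng(5)
x = rng.exponential(size=1_000_000)
for a in [1.0, 2.0, 4.0]:
    print(a, np.mean(x >= a), 2 * np.exp(-a / 2))   # empirical tail vs bound
```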

Covariance

🚩 Def: For \( X,Y \in L^2\), \( Cov(X,Y) := E((X-EX)(Y-EY)) = E(XY) - E(X)E(Y)\)

🚩 Def: \( Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)Var(Y)}}\). NOTE: \(|Corr(X,Y)| \leq 1 \)

Expectation of a function of an RV (discrete / AC case):
\(Eg(X) = \sum_i g(t_i) P(X = t_i) \)
\(Eg(X) = \int g(x) f_X(x) dx \)

⭐ Cor 4.39 Cauchy-Bunyakovsky inequality


\( \qquad E|XY| \leq \sqrt{EX^2 EY^2} \)

Notes

Correlation is a measure of linear association

Covariance matrices

Uncorrelated RVs are like orthogonal vectors; the cosine of the angle between \(u, v\) is analogous to the correlation between \(X, Y\) - where 1 means zero angle and perfect correlation, etc.


\( |Corr(X,Y)| = 1 \Leftrightarrow P(Y = aX+b) = 1 \text{ for some } a \neq 0, b \)

\( M = Cov_X^2 = E((X-EX)^T (X-EX)) \)
\( M_{ij} = Cov(X_i, X_j) \)

⚠ CovM.1 \( Cov_X^2 \) is symmetric

⚠ CovM.2 \( Cov_X^2 \) is positive/non-negative definite: \( x Cov_X^2 x^T \geq 0 \)

Multivariate normal (MVN)

Let \( X \sim N(0,1)^d \), with density \( \frac{1}{(2\pi)^{d/2}} \exp \{ -\tfrac12 xx^T \} \)

Let \( Y = \mu + XA \), then:
\( C_Y^2 = E((Y - EY)^T(Y - EY)) = E((XA)^T(XA)) = E(A^TX^T X A) = A^T I A = A^T A \)

Then \( f_Y(y) = \frac{1}{(2\pi)^{m/2} \sqrt{det(C_Y^2)}} \exp \{ -\tfrac12 (y-\mu) (C_Y^2)^{-1} (y - \mu)^T \} \)
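A simulation sketch of this construction (my own, assuming numpy; the particular \(A\) and \(\mu\) are arbitrary): generate rows of i.i.d. \(N(0,1)\) variables, apply \( Y = \mu + XA \), and check the empirical covariance against \( A^T A \).

```python
import numpy as np

# Rows of X are i.i.d. N(0,1)^d row vectors (the notes' convention);
# Y = mu + X A should have covariance matrix A^T A.
rng = np.random.default_rng(6)
d, n = 3, 200_000
A = np.array([[1.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 2.0]])
mu = np.array([1.0, -2.0, 0.5])

Y = mu + rng.normal(size=(n, d)) @ A
print(np.cov(Y, rowvar=False).round(2))   # empirical covariance
print((A.T @ A).round(2))                 # target C_Y^2 = A^T A
```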

Definition

The idea

However, when we do have information about \( X \) (via observed RVs), we can improve this best guess via conditional expectation

Conditional expectation is not a numerical value; rather it is a function of the observed RVs

Conditional expectation \( E(X|A) \) minimizes the quadratic error of the conditional event:


\( g(a) := E((X-a)^2; A) = E((X-a)^2 I_A) = E(X^2 I_A) - 2aE(X I_A) + a^2 E(I_A) \)
\( \tfrac{d}{da} g(a) = -2E(X I_A) + 2a P(A) \overset{set}{=} 0 \)
\( a^* = \arg\min_a g(a) = E(X I_A)/P(A) = E(X; A)/P(A) =: E(X | A) \)
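A Monte Carlo check of this minimization (my own sketch, assuming numpy; \( X \sim N(0,1) \) and \( A = \{X > 0\} \) are illustrative, with \( E(X|A) = \sqrt{2/\pi} \approx 0.798 \)):

```python
import numpy as np

# a* = E(X; A)/P(A) should minimize g(a) = E((X - a)^2 I_A).
rng = np.random.default_rng(7)
x = rng.normal(size=1_000_000)
I_A = x > 0                                     # the event A = {X > 0}

a_star = np.mean(x * I_A) / np.mean(I_A)        # ~ sqrt(2/pi) ~ 0.798
g = lambda a: np.mean((x - a) ** 2 * I_A)
print(a_star)
print(g(a_star), g(a_star - 0.1), g(a_star + 0.1))   # g is smallest at a_star
```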

The values of \(Y\) don't matter for \(\hat{X} = E(X|Y) \); the partition generated by the values is what matters.

Properties

If \(A_i := \{Y = y_i\}\) is observed then the best guess at \(X\) is:


\( \hat{X} = E(X|Y=y_i) = E(X|A_i) = \frac{E(X; A_i)}{P(A_i)} := x_i \)

Axioms

🚩 CE.2 On the atoms of \( Y \), \(X, \hat{X} \) are indistinguishable ie. \( E(X|A) = E(E(X|Y)|A) = E(\hat{X}|A) \)

If \( \psi(x) \) is a one-to-one function then \(E(X|Y) = E(X|\psi(Y)) \). So the RV \(Y\) itself is less important than the information it carries

Conditional distributions

Conditional distribution of \( X|Y \), \( P(X \in B | Y) = P(I(X \in B) | Y) = E(I(X \in B) | Y) \)

For AC distributions:


\( f_{X|Y}(x|y) = \frac{f_{X,Y} (x,y)}{f_Y(y)}, f_Y = \int f_{X,Y}(x,y) dx\)


\(E(X|Y) = \int x f_{X|Y}(x|y) dx = g(Y)\)

🚩 CEP.1 Linearity: \( E(aX + bZ | Y) = aE(X|Y) + bE(Z|Y) \)

🚩 CEP.2 Monotonicity: \( X \leq Z \Rightarrow E(X|Y) \leq E(Z|Y) \)

🚩 CEP.3 functions of Y behave like constants when conditioning on Y:
\(Z = h(Y) \Rightarrow E(ZX|Y) = ZE(X|Y) \)

Expectation is the "best guess": it minimizes the mean squared error, i.e. \( EX = \arg\min_a E(X - a)^2 \), in the absence of extra information about \( X\)

🚩 CE.1 The RV \( \hat{X}\) is flat on the atoms of \( Y \) ie. is a RV of \( \sigma(Y) \)
i.e. \( \hat{X} =E(X | Y_1, ... , Y_n) = h(Y_1, ..., Y_n) \) is a function of \( Y_1, ... Y_n \) and nothing else

🚩 CEP.4 Independence: if \( X,Y\) independent then \( E(X|Y) = E(X) \)

🚩 CEP.5 Double expectation law:


\( E(E(X|Y_1, Y_2) | Y_1) = E(X | Y_1) \)
\( E(E(X|Y)) = E(X) \)

Overview

In probability theory we know the probability space and make conclusions about the samples. In mathematical statistics we know the sample and draw conclusions about the underlying distribution.

In MS we have some RE \( (\Omega, \mathcal{F}, \textbf{P}_\theta) \) where \( \theta\) is an unknown element of \( \Theta\)

We introduce a random sample \(X = X(\omega) \in \mathbb{R}^n\), so the probability space is \((\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), P_\theta) \)

Key ideas

Any function of the sample \(S(X)\) is a statistic

Estimators of \(\theta\) are statistics which shoudl approach/approximate the parameter

A statistical test is a statistic with range \([0,1]\), where 1 indicates accepting some hypothesis at some degree of certainty

Sufficiency

Statistic \(S\) is sufficient for \(\theta\) if \(P_\theta (X \in B | S) \) does not depend on \(\theta\)

if \(S\) is sufficient and \(\psi\) is one to one function then \(\psi(S)\) is sufficient

⚠ Neyman-Fisher Factorisation Theorem:
Suppose all \(P_\theta\) are AC (covers the discrete case) w.r.t. some measure \( \mu \), with density \(f_\theta\). A necessary and sufficient condition for \(S\) to be sufficient for \(\theta\) is that the density factorizes:


\(f_\theta(x) = \psi(S(x), \theta) h(x) \)
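A standard illustration (my own example, not from the notes): for an i.i.d. sample \( X_1, \dots, X_n \sim Bernoulli(\theta) \) with \( S(x) = \sum_{i=1}^n x_i \),


\( \qquad f_\theta(x) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \underbrace{\theta^{S(x)}(1-\theta)^{n-S(x)}}_{\psi(S(x),\theta)} \cdot \underbrace{1}_{h(x)} \)


so \( S \) is sufficient for \( \theta \).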

Estimators

Maximum Likelihood Estimators

⚠ Cor (of NF): the MLE is a function of a sufficient statistic:


\( \max_\theta f_\theta(X) = \max_\theta h(X)\psi(S(X),\theta) = h(X) \max_\theta \psi(S(X), \theta) \)

⚠ Def. \(\hat{\theta} = \arg\max_\theta f_\theta(x) \)

Unbiased estimators \(\mathcal{K}_0\) are those with \( E_\theta(\theta^*) = \theta, \forall \theta \)

Efficiency: Estimator \(\theta_0^*(X) \) is efficient in class \( \mathcal{K} \) if \( E(\theta_0^* - \theta)^2 \leq E(\theta^* - \theta)^2 \) for every \( \theta^* \in \mathcal{K} \) and every \( \theta \)
I.e. minimum "variance"

Theorems

⚠ Theorem: An efficient estimator in \(\mathcal{K}_b\) is unique

⚠ Rao-Blackwell Theorem: If you take an estimator, and take its expectation conditioned on S (sufficient statistic), then the bias remains the same and the variance will be less or equal:


Let \(\theta^* \in \mathcal{K}_b\), \(S\) be a sufficient statistic, and let \(\theta_S^* = E_\theta(\theta^* | S )\).


Then \( \theta_S^* \in \mathcal{K}_b, E(\theta_S^* - \theta)^2 \leq E(\theta^* - \theta)^2, \forall \theta \)

To apply the RB theorem (see the sketch below):
1) Identify the conditional distribution of the estimator given the statistic
2) Compute its mean and variance
3) The variance should be no worse than that of the original estimator!
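A simulation sketch of these steps (my own example, assuming numpy): for \( Bernoulli(p) \) samples, start with the crude unbiased estimator \( \theta^* = X_1 \); conditioning on the sufficient statistic \( S = \sum_i X_i \) gives \( E(X_1|S) = S/n \), which keeps the bias at zero and shrinks the variance from \( p(1-p) \) to \( p(1-p)/n \).

```python
import numpy as np

# Bernoulli(p) samples: crude unbiased estimator X_1, Rao-Blackwellized
# version E(X_1 | S) = S/n. Bias is unchanged; variance drops by a factor n.
rng = np.random.default_rng(8)
p, n, reps = 0.3, 20, 100_000
x = rng.uniform(size=(reps, n)) < p

crude = x[:, 0].astype(float)    # theta* = X_1
rb = x.mean(axis=1)              # theta*_S = E(X_1 | S) = S/n
print(crude.mean(), rb.mean())   # both ~ p = 0.3 (still unbiased)
print(crude.var(), rb.var())     # ~ p(1-p) = 0.21 vs p(1-p)/n = 0.0105
```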

Types of convergence

🚩 Def 5.1 Convergence almost surely

🚩 Def 5.2 Convergence in probability

🚩 Def 5.5 Convergence in distribution

🚩 Def 5.4 \(L^1\) convergence
"Mean convergence"

🚩 Def 5.3 \(L^2\) convergence
"Mean quadratic convergence"

Recall: \( \lim_{n\rightarrow\infty} x_n = x\\ \Leftrightarrow x_n \rightarrow x \\ \Leftrightarrow \forall \epsilon > 0, \exists n_0 < \infty, s.t. \forall n > n_0, |x_n - x| < \epsilon \)


However \(X_n(\omega)\) are functions on \(\Omega\), not just a sequence of numbers!

Theorems

Pointwise convergence of the sequence of functions on a set of probability 1

\( X_n(\omega) \rightarrow X(\omega), \forall \omega \in A, P(A) = 1 \)

The probability of the gap between \(X_n, X\) being greater than any \( \epsilon > 0 \) goes to 0:

\(P(|X_n - X| > \epsilon) \rightarrow 0, \forall \epsilon > 0 \)

Expectation of squared difference goes to 0

\(E(X_n - X)^2 \rightarrow 0, n \rightarrow \infty\)

Expectation of absolute difference goes to 0

\(E|X_n - X| \rightarrow 0, n \rightarrow \infty\)

The distribution function approaches the limit at all continuity points:

\( F_{X_n}(t) \rightarrow F_X(t), n\rightarrow \infty, \forall t \text{ s.t. } F_X(t-) = F_X(t) \)

⚠ Theorem 5.5
\( X_n \overset{d}{\rightarrow} X \Leftrightarrow Ef(X_n) \rightarrow Ef(X) \) for all bounded continuous functions \(f\)

Discrete case: \(P(X_n = k) \rightarrow P(X = k)\)

Relationships

(Diagram: \( a.s. \Rightarrow p \), \( L^2 \Rightarrow L^1 \Rightarrow p \), \( p \Rightarrow d \))

⚠ Theorem 5.9: if \(X_n \rightarrow X\) and \(Y_n \rightarrow Y\) in the same mode (a.s., in probability, or in \(L^2, L^1\)), then \(X_n + Y_n \rightarrow X + Y\) in that mode

⚠ Theorem 5.23
Continuous functions of a.s., p, d-convergent RVs converge to the function of their respective limit RVs

Limit theorems

⚠ Theorem 5.30 Weak Law of Large Numbers
The average of Bernoulli trials approaches the success probability as the sample size increases


\( \qquad \frac{S_n}{n} \overset{p}{\rightarrow} p, n \rightarrow \infty \)


Proof: \( E(S_n/n - p)^2 = Var(S_n)/n^2 = Var(X_1)/n = pq/n \rightarrow 0 \Rightarrow S_n/n \overset{L^2}{\rightarrow} p \Rightarrow S_n/n \overset{P}{\rightarrow} p \)
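A simulation of the WLLN (my own sketch, assuming numpy; \( p = 0.6 \) is an arbitrary choice):

```python
import numpy as np

# Running average of Bernoulli(p) trials settles near p.
rng = np.random.default_rng(9)
p = 0.6
x = (rng.uniform(size=100_000) < p).astype(float)
running = np.cumsum(x) / np.arange(1, x.size + 1)
for n in [10, 100, 1_000, 100_000]:
    print(n, running[n - 1])   # approaches 0.6
```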

⚠ Theorem 5.31 Strong Law of Large Numbers
The average of independent Bernoulli trials converges almost surely to the success probability:


\( \qquad \frac{S_n}{n} \overset{a.s.}{\rightarrow} p, n \rightarrow \infty \)


Actually holds for any sequence of identically distributed, uncorrelated RVs (with finite variance)

Definition

🚩 Def 6.1
\( \psi_X(t) = E(e^{itX}) = \int e^{itx} dF_X(x) = \int e^{itx} f_X(x) dx = E\cos(tX) + iE \sin(tX) \)

  • always exists
  • always finite
  • \(|\psi_X(t)| \leq 1 \)
  • \(\psi_X(0) = 1 \)

Examples

\( X=c , \psi_X(t) = e^{itc} \)

\( X \sim B(p), \psi_X(t) = 1 + p(e^{it} - 1 )\)

\( X \sim U(0,1) , \psi_X(t) = \frac{e^{it} - 1}{it} \)

\( X \sim N(0,1), \psi_X(t) = e^{-t^2/2} \)
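A Monte Carlo check of the last example (my own sketch, assuming numpy): estimate \( \psi_X(t) = E\cos(tX) + iE\sin(tX) \) from samples and compare with \( e^{-t^2/2} \).

```python
import numpy as np

# For X ~ N(0,1), the Monte Carlo estimate of E(e^{itX}) should match e^{-t^2/2}.
rng = np.random.default_rng(10)
x = rng.normal(size=1_000_000)
for t in [0.0, 0.5, 1.0, 2.0]:
    est = np.mean(np.cos(t * x)) + 1j * np.mean(np.sin(t * x))
    print(t, est, np.exp(-t**2 / 2))
```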

Functions of RVs

if \(Y = aX+b\), then \( \psi_Y(t) = e^{ibt} \psi_X(at) \)

\(\overline{\psi_{X}(t)} = \overline{\int e^{itx} dF_X(x)} = \int e^{i(-t)x} dF_X(x) = \psi_{X}(-t) = \psi_{-X}(t) \)

Properties of ChF's

\(X \overset{d}{=} -X \Leftrightarrow \psi_X(t) = \psi_{-X}(t) = \overline{\psi_X(t)} \)
In other words, the ChF of a symmetric RV is real valued.

Any ChF is uniformly continuous

if \(X,Y\) independent then \(\psi_{X+Y}(t) = \psi_X(t)\psi_Y(t) \)

Theorems

⚠ Theorem 6.11 if \(E|X|^k < \infty \), then \(\psi_X(t)\) is k-times differentiable and:
\( \qquad EX^k= (-i)^k \frac{d^k}{dt^k} \psi_X(t) \Big|_{t=0} \)
\( \Rightarrow EX = -i \psi_X'(0)\\ \Rightarrow E(X^2) = -\psi_X''(0) \)
Converse is true for even \(k\) and "almost true" for odd

⚠ Theorem 6.7 (Inversion formula) if \(\int |\psi_X(t)| dt < \infty \) then \(X\) has a continuous density given by:


\( \qquad f_X(x) = \frac{1}{2\pi} \int e^{-itx} \psi_X(t) dt \)

\( X \sim Cauchy, \psi_X(t) = e^{-|t|} \)

One to one correspondence between ChFs and DFs

⚠ Theorem 6.15 Convergence in distribution \( \Leftrightarrow \) convergence of ChFs:


\(\qquad n\rightarrow \infty, X_n \overset{d}{\rightarrow} X \Leftrightarrow \psi_{X_n}(t) \rightarrow \psi_X(t) \)

⚠ Theorem 6.17 if the ChFs converge pointwise and the limit is continuous at 0, then the RVs converge in distribution to something:


If \(\forall t \in \mathbb{R}, \psi_{X_n}(t) \rightarrow \psi(t), n \rightarrow \infty\), and \(\psi(t)\) is continuous at 0, then \(X_n \overset{d}{\rightarrow} X\) for some RV \(X\) with \( \psi(t) = \psi_X(t) \)

WLLN: \( \psi_{S_n/n}(t) = \psi_{S_n}(t/n) = (\psi_X(t/n))^n \)

⚠ Theorem 6.20 Central Limit Theorem (ZOMG!?)
If \( X_1, X_2, \dots \) are i.i.d. with mean \( \mu \) and finite non-zero variance \( \sigma^2 \), then:


\( \qquad Y_n := \frac{S_n - n\mu}{\sigma \sqrt{n}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \overset{d}{\rightarrow} Z \sim N(0,1), n \rightarrow \infty \)
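A simulation sketch of the CLT (my own, assuming numpy; \( U(0,1) \) summands with \( \mu = 1/2, \sigma^2 = 1/12 \) are an arbitrary choice):

```python
import numpy as np

# Standardized sums of i.i.d. U(0,1) variables are approximately N(0,1).
rng = np.random.default_rng(11)
n, reps = 30, 200_000
s = rng.uniform(size=(reps, n)).sum(axis=1)
y = (s - n * 0.5) / np.sqrt(n / 12)     # Y_n = (S_n - n*mu) / (sigma*sqrt(n))
print(np.mean(y <= 0.0))                # ~ Phi(0) = 0.5
print(np.mean(y <= 1.0))                # ~ Phi(1) = 0.8413
```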

⚠ Theorem 6.21 Poisson Limit Theorem
If \(X_{n,1}, \dots, X_{n,n}\) are i.i.d. \(Bernoulli(p_n)\) and \(np_n \rightarrow \lambda\), then \( S_n = \sum_{i=1}^n X_{n,i} \overset{d}{\rightarrow} Y \sim Poisson(\lambda)\)


Extends beyond Bernoulli / identically distributed

Random Vectors

Extends to \( \mathbb{R}^d \) via the dot product: \(\psi_X(t_1,\dots,t_d) = E(e^{i t \cdot X}) \)

One to one correspondence between ChF and DF still holds

Linear transformations: if \(Y=XA+B\), then


\(\qquad \psi_Y(s) = e^{i(s,B)} \psi_X(sA^T) \)

⚠ Multivariate CLT:


\( \qquad \sqrt{n} \Big(\frac{S_n}{n} - \mu\Big) \overset{d}{\rightarrow} Y \sim MVN(0, C_X^2) \)

Empirical distributions

🚩 Def 7.26 EDF defined as:


\( \qquad F_n^*(t) = \frac{1}{n}\sum_{i=1}^n I(X_i \leq t) \equiv \sum_{i=1}^n \frac{1}{n} I(X_{(i)} \leq t) \)


\( \qquad \equiv P_n^* := \sum_{j=1}^n \tfrac{1}{n} \epsilon_{X_j} \)

⚠ Theorem 7.27 Glivenko-Cantelli.
Let \(X_1,...,X_n \overset{i.i.d.}{\sim} F\). Then as \( n\rightarrow \infty\):


\( \qquad D_n := \sup_t |F_n^*(t) - F(t)| \overset{a.s.}{\rightarrow} 0 \)


in other words, the largest gap between the empirical and the true DF converges to zero - uniformly in \(t\), not just pointwise.
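A simulation sketch of \( D_n \rightarrow 0 \) (my own, assuming numpy; taking \( F = \Phi \), the \(N(0,1)\) DF, is an arbitrary choice). For continuous \(F\) the sup is attained at the order statistics \( X_{(i)} \):

```python
import numpy as np
from math import erf

# D_n = sup_t |F_n*(t) - F(t)| for N(0,1) samples, computed at the order
# statistics (where the sup is attained for continuous F).
Phi = np.vectorize(lambda t: 0.5 * (1 + erf(t / np.sqrt(2))))

rng = np.random.default_rng(12)
for n in [100, 1_000, 10_000, 100_000]:
    x = np.sort(rng.normal(size=n))
    i = np.arange(1, n + 1)
    F = Phi(x)
    D_n = max(np.max(i / n - F), np.max(F - (i - 1) / n))
    print(n, D_n)   # decreasing toward 0
```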