Statistics

Random Variable
\(X: W \mapsto \mathbb{R}\)

Discrete Random Variable

Event (E: X = a)
Each sample \(\omega\) is mapped to a real number \(a\), and we call the collection of all samples that share the same number (\(X = a\)) an 'event'

Probability of an event

Expected value

\(E(X)=\sum_{j=1}^{n}p(x_j)x_j\)

Properties

\(E(X+Y)=E(X)+E(Y)\)
\(E(aX+b)=aE(X)+b\)
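A minimal Python sketch of these definitions, assuming a fair six-sided die as a hypothetical discrete random variable:

```python
# Hypothetical example: X = outcome of a fair six-sided die.
xs = [1, 2, 3, 4, 5, 6]
ps = [1/6] * 6

# E(X) = sum of p(x_j) * x_j
ex = sum(p * x for p, x in zip(ps, xs))                # 3.5

# Linearity check: E(2X + 1) = 2*E(X) + 1
ex_lin = sum(p * (2 * x + 1) for p, x in zip(ps, xs))  # 8.0
```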

Variance

\(\sigma^2=E((X-\mu)^2)\)

Standard Deviation

\(\sigma=\sqrt{\sigma^2}\)

Properties

\(Var(X+Y)=Var(X)+Var(Y) \Leftarrow X \perp Y\)
\(Var(aX+b)=a^2Var(X)\)
\(Var(X)=E(X^2)-E(X)^2\)
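The shortcut formula can be verified numerically; a sketch using the same hypothetical die:

```python
# Verify Var(X) = E(X^2) - E(X)^2 for a fair six-sided die.
xs = [1, 2, 3, 4, 5, 6]
ps = [1/6] * 6
mu = sum(p * x for p, x in zip(ps, xs))

var_def = sum(p * (x - mu) ** 2 for p, x in zip(ps, xs))  # E((X - mu)^2)
ex2 = sum(p * x * x for p, x in zip(ps, xs))              # E(X^2)
var_shortcut = ex2 - mu ** 2                              # 35/12 ≈ 2.9167
```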

Continuous Random Variable

Probability of an event

Probability Density Function (PDF): \(f(x)\)

\(f(x)\geq 0\)

  • \(f(x)\) is not probability, so doesn't have to be \(\leq 1\)

\(\int^{\infty}_{-\infty}f(x)dx=1\)

Cumulative Distribution Function (CDF): \(F(b)\)

Calculation

properties

\(F(b)\) is non-decreasing

\(\lim_{x\to \infty}F(x)=1\)
\(\lim_{x\to -\infty}F(x)=0\)

\(P(a \leq X \leq b)=\int_a^b f(x) dx=F(b)-F(a)\)

\(F'(x)=f(x)\)
Fundamental Theorem of Calculus, Part 1

Common Continuous Distributions

Uniform R.V: \(X\sim U(a,b)\)

Params: a, b
Range: [a,b]
Density Fn: \(f(x)=\frac{1}{b-a}\) for \(a \leq x \leq b\)
CDF: \(F(x)=\frac{x - a}{b - a}\) for \(a \leq x \leq b\)

Exponential R.V: \(X\sim Exp(\lambda)\)
\(\lambda\): \(\frac{events}{time}\)
models the waiting time between successive events

Params: \(\lambda\)
Range: \([0, \infty)\)
Density Fn: \(f(x)=\lambda e^{-\lambda x}\)
CDF: \(F(x)=1-e^{-\lambda x}\) for \(x \geq 0\)
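The closed-form CDF agrees with integrating the PDF; a sketch assuming a hypothetical rate \(\lambda = 2\), using a simple trapezoid rule:

```python
import math

# Hypothetical rate: lam = 2 events per unit time.
lam = 2.0
f = lambda x: lam * math.exp(-lam * x)   # PDF
F = lambda x: 1 - math.exp(-lam * x)     # CDF

# F(b) should equal the integral of f from 0 to b (trapezoid rule).
b, n = 1.5, 100_000
h = b / n
integral = sum((f(i * h) + f((i + 1) * h)) * h / 2 for i in range(n))
```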

Normal R.V: \(X\sim N(\mu, \sigma^2)\)

Params: \(\mu, \sigma\)
Range: \((-\infty, \infty)\)
Density Fn: \(f(x)=\frac{1}{\sigma \sqrt{2\pi}} e^{\frac{-(x-\mu)^2}{2\sigma^2}}\)
CDF: no closed form; evaluate via the standard normal table as \(\Phi(\frac{x-\mu}{\sigma})\)

Expected Value

\(E(X)=\int^{b}_{a}xf(x)dx\)

\(E(Y)=E(h(X))=\int^{\infty}_{-\infty}h(x)f_X(x)dx\)

Quantile

Median

\(p^{th}\) quantile

Any \(q_p\) for which \(P(X \leq q_p)=p\)
or \(F(q_p)=p\)

median is \(q_{0.5}\)

Percentile

The \(x^{th}\) percentile is \(q_{x/100}\)

Quartiles

The \(q^{th}\) quartile is \(q_{q/4}\)

Deciles

The \(d^{th}\) decile is \(q_{d/10}\)

Let \(\bar{X}_n=\frac{1}{n}\sum^{n}_{i=1}X_i\)
and \(S_n =\sum^{n}_{i=1}X_i\)
where the \(X_i\) are i.i.d. random variables

The Law of Large Numbers

As n grows, \(\lim_{n \to \infty}P(|\bar{X}_n-\mu | < a)=1\)
where a is an arbitrarily small threshold (say 0.0001)

Central Limit Theorem

As n grows, the distribution of \(\bar{X}_n\) converges to \(N(\mu, \frac{\sigma^2}{n})\)
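The CLT can be seen in simulation; a sketch assuming a hypothetical Uniform(0, 1) population, whose mean is 0.5 and variance is 1/12:

```python
import random
import statistics

random.seed(0)  # reproducible sketch

# Hypothetical population: Uniform(0, 1), so mu = 0.5, sigma^2 = 1/12.
n, reps = 50, 2000
means = [statistics.fmean(random.random() for _ in range(n)) for _ in range(reps)]

m = statistics.fmean(means)      # ≈ mu = 0.5
v = statistics.pvariance(means)  # ≈ sigma^2 / n = 1/600
```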

Standardization

\(Z=\frac{X-\mu}{\sigma}\)

\(Z \sim N(0,1)\)

Histogram

Frequency Histogram

Height = # of \(x_i\)

Density Histogram

  • Area = fraction of all data points that lie in the bin
  • Sum of area must equal 1
  • Closely related to the PDF of a continuous random variable

Properties

\(E(\bar{X}_n)=\mu, Var(\bar{X}_n)=\frac{\sigma^2}{n}\)
\(E(S_n)=n \mu, Var(S_n)=n \sigma^2\)

Discrete Joint

Continuous Joint

Joint PMF of \(X, Y, ...\):
\(p(x_1, y_1, ...), p(x_2, y_1, ...), \dots\)

\(0 \leq p(x_i, y_j) \leq 1\)

\(\sum_{i=1}^n \sum_{j=1}^m p(x_i, y_j) = 1\)

Joint CDF: \(F(x, y)\)

Joint PDF of \(X, Y\): \(f(x, y)\)

\(P(a < X < b, c < Y < d)=\int_a^b \int_c^d f(x,y)\, dy\, dx\)

Joint CDF: \(F(x, y)=P(X \leq x, Y \leq y)\)

\(f(x,y)=\frac{\partial^2 F}{\partial y \partial x}\)

Marginal Distribution

PMF

\(p_X(x_i)=\sum_j p(x_i, y_j)\)

\(p_Y(y_j)=\sum_i p(x_i, y_j)\)

PDF

\(f_X(x)=\int_c^d f(x,y) dy\)

\(f_Y(y)=\int_a^bf(x,y)dx\)

CDF
X and Y take values in \([a, b] \times [c, d]\)

\(F_X(x)=F(x, d)\)

\(F_Y(y)=F(b,y)\)

Independence

PMF

PDF

CDF

\(p(x_i,y_j)=p(x_i)p(y_j)\)

\(f(x, y)=f_X(x)f_Y(y)\)

\(F(x,y)=F_X(x)F_Y(y)\)

Discrete Variables

P of each cell in the table
=
product of the marginal P of its row and column

Covariance and Correlation:
Measure of how two random variables vary together

Covariance
\(Cov(X,Y)=E((X-\mu_X)(Y-\mu_Y))\)

Properties (Bilinear Map)

\(Cov(aX+b, cY+d)=ac\,Cov(X, Y)\)

\(Cov(X_1+X_2,Y)=Cov(X_1,Y)+Cov(X_2,Y)\)

\(Cov(X,X)=Var(X)\)

\(Cov(X,Y)=E(XY)-\mu_X\mu_Y\)

\(Var(X+Y)=Var(X)+Var(Y)+2Cov(X,Y)\)

\(Cov(X,Y)=0 \Leftarrow X \perp Y\)

Discrete

Continuous

\(\sum_i^n \sum_j^m p(x_i,y_j)(x_i-\mu_X)(y_j-\mu_Y)\)


\(\int_a^b \int_c^d (x-\mu_X)(y-\mu_Y)f(x,y) dy dx\)

Correlation \(\rho\)
\(=Cor(X,Y)=\frac{Cov(X,Y)}{\sigma_X \sigma_Y}\)

\(-1 \leq \rho \leq 1\)

\(\rho > 0\): positive corr
\(\rho < 0\): negative corr
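A sketch computing covariance and correlation from paired samples, assuming hypothetical data with an exact linear relation \(y = 2x\) (so the correlation should be exactly 1):

```python
import math

def cov(xs, ys):
    # population covariance: mean of (x - mu_X)(y - mu_Y) over paired samples
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical paired data with y = 2x exactly
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

rho = cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))  # 1.0
```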

Rules/Theorems

Conditional Probability
\(P(A|B)\) = P(A "given" B)

Independence
\(A \perp B\) iff
\(P(A|B)=P(A)\)
or \(P(A \cap B) = P(A)P(B)\)
or \(P(A \cup B) = P(A) + P(B) - P(A)P(B) \)

Bayes' Theorem
\(P(B|A)=\frac{P(A|B)P(B)}{P(A)}\)

Multiplication Rule
\(P(A|B)=\frac{P(A \cap B)}{P(B)}\)
\(P(A \cap B)=P(A|B)*P(B)\)

Law of Total Probability
\(P(A)=P(A \cap B_1)+P(A \cap B_2)+P(A \cap B_3)+...\)
\(P(A)=P(A|B_1)P(B_1)+P(A|B_2)P(B_2)+P(A|B_3)P(B_3)+...\)
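The law combines naturally with Bayes' theorem; a sketch with hypothetical numbers for a three-set partition \(B_1, B_2, B_3\):

```python
# Hypothetical partition B1, B2, B3 of the sample space.
p_b = [0.5, 0.3, 0.2]          # P(B_i), sums to 1
p_a_given_b = [0.1, 0.2, 0.4]  # P(A|B_i)

# Law of Total Probability
p_a = sum(pb * pa for pb, pa in zip(p_b, p_a_given_b))  # 0.19

# Bayes' theorem: P(B_1|A) = P(A|B_1)P(B_1) / P(A)
p_b1_given_a = p_a_given_b[0] * p_b[0] / p_a            # 0.05/0.19
```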

Axiom of Probability

'\(X = a\)' is an event where \(X(w)=a\)

\(E \subseteq W\)

Sample Space (W)
(or universal set)

\(P(E) = \sum_{w\in E}P(w)\)

\(P(c\leq X \leq d)=\int^{d}_{c}f(x)dx\)

Probability Mass Function (PMF): \(p(a)\)

Cumulative Distribution Function (CDF): \(F(a)\)

\(p(a)=P(X=a)\)
\(0 \leq p(a) \leq 1\)

\(F(a)=P(X \leq a)\)

  • sum of all \(p(b)\) for which \(b \leq a\)

\(F(b) = P(X \leq b) = \int^{b}_{-\infty}f(x)dx\)

  • \(f(x)\) is PDF of X

\(0 \leq F(x) \leq 1\)

R.V with functions
If \(h(x)\) is a function and X is a random variable,
then \(h(X)\) is also a random variable

Any x for which \(P(X \leq x) = 0.5\)
or \(F(x) = 0.5\)

Use Table (X\Y\Z\...)

\(f(x,y) \geq 0\)

\(F(x, y)=\sum_{x_i\leq x} \sum_{y_j\leq y} p(x_i, y_j)\)

Volume under surface

\(P(X \leq x, Y \leq y) = \int_c^y \int_a^x f(u, v) dudv\)

\((\int_a^b \int_c^d xyf(x,y)\, dy\, dx) - \mu_X \mu_Y\)

\((\sum_i^n \sum_j^m p(x_i,y_j)x_i y_j) - \mu_X\mu_Y\)

Maximum Likelihood Estimation
Find the hypothesis(H) which maximizes the likelihood

Likelihood
\(P(D|H)\)

Log Likelihood
Take the log of the likelihood function
\(ln(P(D|H))\)

Odds
\(O(E) = \frac{P(E)}{P(E^c)}\)

Bayes Factor
\(\frac{P(D|H)}{P(D|H^c)}\)

Posterior odds = Bayes factor × prior odds
\(O(H|D) = \frac{P(D|H)}{P(D|H^c)} * O(H)\)

\(BF > 1\)
Data provides evidence for the hypothesis
\(BF < 1\)
Data provides evidence against the hypothesis
\(BF = 1\)
Data provides no evidence
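The odds update is plain arithmetic; a sketch assuming hypothetical values \(P(H) = 0.1\), \(P(D|H) = 0.8\), \(P(D|H^c) = 0.2\):

```python
# Hypothetical numbers: P(H) = 0.1, P(D|H) = 0.8, P(D|H^c) = 0.2.
p_h, p_d_h, p_d_hc = 0.1, 0.8, 0.2

prior_odds = p_h / (1 - p_h)             # 1/9
bf = p_d_h / p_d_hc                      # 4 > 1: evidence for H
post_odds = bf * prior_odds              # 4/9
post_prob = post_odds / (1 + post_odds)  # 4/13, matches Bayes' theorem directly
```

As a check, Bayes' theorem gives \(P(H|D) = \frac{0.8 \cdot 0.1}{0.8 \cdot 0.1 + 0.2 \cdot 0.9} = \frac{4}{13}\), the same value.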

Types of Data

Qualitative

Quantitative

Binary
True/False

Ordinal
Day1/Day2/Day3/...
Low/Medium/High/...

Nominal
Red/Green/Blue/...

Discrete
0, 1, 2, ...

Continuous
0.01203 ~ 0.30402 ...

Trimmed Mean

  • Data must be sorted before trimming

Let \(n = 9\) and \( \alpha = 0.25\); then we need to trim \(9 \times 0.25 = 2.25\) elements from each end of the sorted data.


Since 2.25 has a fractional part, we take floor(2.25) = 2 and ceil(2.25) = 3, and compute two trimmed means: one removing 2 elements from each end, and one removing 3.


The two means are then interpolated by the fractional part 0.25; in other words,


\(0.75 \cdot mean(data, trim=2) + 0.25 \cdot mean(data, trim=3)\)
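A sketch of this interpolated trimmed mean; the function name and the interpolation-between-floor-and-ceil convention follow the description above:

```python
import math

def trimmed_mean(data, alpha):
    # Interpolated trimmed mean: when n*alpha is fractional, interpolate
    # between trimming floor(n*alpha) and ceil(n*alpha) elements per end.
    xs = sorted(data)
    n = len(xs)
    k = n * alpha
    lo, hi = math.floor(k), math.ceil(k)

    def tmean(t):
        core = xs[t:n - t] if t else xs
        return sum(core) / len(core)

    if lo == hi:
        return tmean(lo)
    frac = k - lo
    return (1 - frac) * tmean(lo) + frac * tmean(hi)
```

For symmetric data such as 1..9 with \(\alpha = 0.25\), every trimmed mean equals the ordinary mean, so the result is 5.0.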

\(W = \){\(w_1, w_2, ...\)}
where \(P(w)\) is probability of outcome w

\(P(E) \geq 0\) for \(E \subseteq W\)

\(P(W) = 1\)

If \(E_1, E_2, ..., E_k\) are pairwise disjoint (\(E_i \cap E_j = \emptyset \) for \(i \neq j\)),
Then \(P(E_1 \cup E_2 \cup ... \cup E_k) = P(E_1) + P(E_2) + ... + P(E_k)\)

Counting

Permutation
Let n = 10. The number of ways to arrange 3 of them, where order matters, is


\(10 * 9 * 8\)


Generalized: \(\frac{10!}{(10-3)!} = \frac{n!}{(n-k)!}\)

Combination
Combination is the same as permutation except that order doesn't matter. For the permutation example above, the combination count is


\(\frac{10*9*8}{3*2*1}\), because among the \(10*9*8\) ordered arrangements, each selection of 3 items appears in \(3*2*1 = 3!\) different orders, so we divide out those duplicates


Generalized: \(\frac{n!}{(n-k)!k!}\)
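Python's standard library has these counts built in (Python 3.8+); a sketch for the n = 10, k = 3 example:

```python
import math

n, k = 10, 3
perms = math.perm(n, k)  # n!/(n-k)! = 10 * 9 * 8 = 720 ordered arrangements
combs = math.comb(n, k)  # n!/((n-k)! k!) = 720 / 3! = 120 unordered selections
```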

Mutually Exclusive
\(A \cap B = \emptyset\)
\(P(A \cup B) = P(A) + P(B)\)

Confidence/Prediction Interval of Normal Population
\(100(1-\alpha)\%\)

Decision Tree

Population with unknown distribution

Normally distributed population
(QQ plot nearly straight line)

Population \(\sigma\) unknown

Not enough samples


* Use this all the time for STMATH 390

Population \(\sigma\) known

Large enough samples
\(n \gt 30\)


Assume normality due to CLT

Not enough samples

?

Population \(\sigma\) known


\( P(-z_{\frac{\alpha}{2}} \lt \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \lt z_{\frac{\alpha}{2}}) \approx 1 - \alpha \)

Population \(\sigma\) unknown


\( P(-z_{\frac{\alpha}{2}} \lt \frac{\bar{X} - \mu}{s / \sqrt{n}} \lt z_{\frac{\alpha}{2}}) \approx 1 - \alpha \)


CI: \(\bar{X} \pm z_\frac{a}{2} \frac{\sigma}{\sqrt{n}}\)
PI: \(\bar{X} \pm z_\frac{a}{2} \sigma \sqrt{1+\frac{1}{n}}\)


CI: \(\bar{X} \pm z_\frac{a}{2} \frac{s}{\sqrt{n}}\)
PI: \(\bar{X} \pm z_\frac{a}{2} s \sqrt{1+\frac{1}{n}}\)

Poisson R.V: \(X \sim Poisson(\alpha, t)\)
\(\alpha\): \(\frac{events}{time}\), \(\mu = \alpha t \)
probability of the number of events in a fixed time interval

Params: \(\alpha, t\)
Range: \(\{0, 1, 2, ...\}\)
Mass fn: \(p(x) = \frac{\mu^x e^{-\mu}}{x!}\)
CDF: no closed form (sum the mass fn)
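A sketch of the Poisson mass function, assuming a hypothetical rate \(\alpha = 2\) events/hour observed over \(t = 3\) hours, so \(\mu = 6\):

```python
import math

def poisson_pmf(x, mu):
    # p(x) = mu^x e^{-mu} / x!
    return mu**x * math.exp(-mu) / math.factorial(x)

# Hypothetical: alpha = 2 events/hour over t = 3 hours -> mu = 6
mu = 2 * 3
p0 = poisson_pmf(0, mu)                              # e^{-6}
total = sum(poisson_pmf(x, mu) for x in range(100))  # ≈ 1
```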

Use the t-distribution
CI: \(\bar{X} \pm t_{\frac{a}{2}, n-1} \frac{s}{\sqrt{n}}\)
PI: \(\bar{X} \pm t_{\frac{a}{2}, n-1} s \sqrt{1+\frac{1}{n}}\)

Large enough samples
\(n \gt 30\)


The t-distribution with 30 degrees of freedom is almost identical to the standard normal distribution.


CI: \(\bar{X} \pm z_\frac{a}{2} \frac{s}{\sqrt{n}}\)
PI: \(\bar{X} \pm z_\frac{a}{2} s \sqrt{1+\frac{1}{n}}\)

Population Proportion


p = proportion of successes
E(X) = np
\(\sigma_x = \sqrt{np(1-p)}\)

If \(np \geq 10\) and \(n(1-p) \geq 10\), \(X\) is approximately normal

estimator: \(\hat{p} = X/n\)
\(P(-z_{\frac{\alpha}{2}} \lt \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \lt z_{\frac{\alpha}{2}} ) \approx 1 - \alpha \)

CI: \(\hat{p} \pm z_{\frac{\alpha}{2}} \sqrt{\frac{p(1-p)}{n}}\)
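A sketch of the proportion CI using the stdlib's standard-normal quantile, with hypothetical data of 40 successes in 100 trials at the 95% level:

```python
import math
from statistics import NormalDist

# Hypothetical data: x = 40 successes out of n = 100 trials, 95% CI.
x, n, alpha = 40, 100, 0.05
p_hat = x / n

z = NormalDist().inv_cdf(1 - alpha / 2)        # z_{alpha/2} ≈ 1.96
half = z * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - half, p_hat + half)              # ≈ (0.304, 0.496)
```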

Prediction Interval
What is \(X_{n+1}\) given \(X_1, X_2, ... X_n\) (All normally distributed)?


\(\bar{X} \pm t_{\frac{a}{2}, n-1} s \sqrt{1+\frac{1}{n}}\)

Estimator \(\bar{X}\)

Prediction Error
\(\bar{X} - X_{n+1}\)

Variance
\(= Var(\bar{X} - X_{n+1})\)
\(= Var(\bar{X}) + Var(X_{n+1})\)
\(= \frac{\sigma^2}{n} + \frac{\sigma^2}{1}\)
\(= \sigma^2 (\frac{1}{n} + 1)\)

Variance/Standard Deviation
given \(X_1, X_2, ... X_n\) normally distributed
\(\frac{(n-1)S^2}{\sigma^2} = \frac{\sum (X_i - \bar{X})^2 }{\sigma^2} \sim \chi^2_{n-1}\) (chi-square with \(n - 1\) df)

Expected Value
\(= E(\bar{X} - X_{n+1})\)
\(= E(\bar{X}) - E(X_{n+1})\)
\(= \mu - \mu\)
\(=0\)

Since Xs are all normally distributed,
\(\bar{X} - X_{n+1} \sim N(0, \sigma^2 (\frac{1}{n} + 1))\)


Then,
\(P(-z_{\frac{\alpha}{2}} \lt \frac{(\bar{X} - X_{n+1}) - 0}{\sqrt{\sigma^2 (\frac{1}{n} + 1)}} \lt z_{\frac{\alpha}{2}}) = 1 - \alpha\)
Or, since we do not know the population standard deviation,
\(P(-t_{\frac{a}{2}, n-1} \lt \frac{(\bar{X} - X_{n+1}) - 0}{\sqrt{s^2 (\frac{1}{n} + 1)}} \lt t_{\frac{a}{2}, n-1}) = 1 - \alpha\)
*Use this for exam


CI: \(\bar{X} \pm z_\frac{a}{2} \frac{\sigma}{\sqrt{n}}\)
PI: \(\bar{X} \pm z_\frac{a}{2} \sigma \sqrt{1+\frac{1}{n}}\)

\(P(\chi^2_{1-\frac{\alpha}{2}, n-1} < \frac{(n-1)S^2}{\sigma^2} < \chi^2_{\frac{\alpha}{2}, n-1}) = 1 - \alpha\)
*Note that \(\chi^2\) isn't symmetric

\(P(\frac{(n-1)S^2}{\chi^2_{\frac{\alpha}{2}, n-1}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{1-\frac{\alpha}{2}, n-1}}) = 1 - \alpha\)
*Note that critical values are swapped

Common Discrete Distributions

Bernoulli R.V: \(X \sim Bernoulli(p)\)

Binomial R.V: \(X \sim Binomial(n, p)\)

Param: p
Range: \(\{0, 1\}\)
Mass fn: \(P(X=0) = 1 - p\), \(P(X=1) = p\)
CDF: \(F(a) = 0 \) if \( a \lt 0\); \(1 - p\) if \(0 \leq a \lt 1\); \(1\) if \(a \geq 1\)

Geometric R.V: \(X \sim Geometric(p)\)

Negative Binomial R.V: \(X \sim NBinomial(r, p)\)

Pareto Charts

  • Categories are ordered
  • Descending order: Largest number of frequency to the left

Chain Rule
\(P(A_n, A_{n-1},..., A_1) = P(A_n | A_{n-1}, ..., A_1)P(A_{n-1}, ..., A_1)\)
\(P(A_n, A_{n-1},..., A_1) = P(A_n | A_{n-1}, ..., A_1)P(A_{n-1} | A_{n-2}, ..., A_1)P(A_{n-2}, ..., A_1)\)
...

Multiplication Rule (three variables)
\(P(A|B,C)\)
\(= \frac{P(A,B,C)}{P(B,C)} = \frac{\frac{P(A,B,C)}{P(C)}}{\frac{P(B,C)}{P(C)}}\)
\(= \frac{P(A,B|C)}{P(B|C)}\)

Conditional Independence
A and B are conditionally independent if
\(P(A,B|C) = P(A|C)P(B|C)\) where \(P(C) \gt 0\)


It also follows
\(P(A|B,C) = \frac{P(A,B|C)}{P(B|C)} = \frac{P(A|C)P(B|C)}{P(B|C)} = P(A|C)\)

Param: n, p
Range: \(\{0, 1, ..., n\}\)
Mass fn: \(P(X=k) = {n \choose k}p^k(1-p)^{n-k}\)
CDF:
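A sketch of the binomial mass function using `math.comb`; the total over the full range \(\{0, ..., n\}\) must be 1:

```python
import math

def binom_pmf(k, n, p):
    # P(X = k) = C(n, k) p^k (1-p)^(n-k)
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# pmf over the full range {0, ..., n} sums to 1
total = sum(binom_pmf(k, 10, 0.3) for k in range(11))
```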

\(0.8 \leq |r| \leq 1\): Strong correlation
\(0.5 \lt |r| \lt 0.8\): Moderate correlation
\(0 \leq |r| \leq 0.5\): Weak correlation