Statistical Inference

Definitions

Statistic - a number describing a characteristic of a sample

Parameter - a number describing a characteristic of a population

Sampling distribution - distribution of all possible values
taken by a statistic over all possible samples of size n.

Population distribution - distribution of all observations in
the population

Central Limit Theorem - When n is large enough (or the population itself is normal), the sampling distribution of the sample mean will be approximately normal with mean μ and standard deviation σ/√n, where σ is the population sd
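The theorem can be checked with a small simulation. This is only a sketch: the exponential population (mean 1, sd 1), the sample size n = 50 and the helper name sample_means are assumptions chosen for illustration.

```python
import random
import statistics

def sample_means(draw, n, num_samples, seed=0):
    """Draw num_samples samples of size n and return the mean of each one."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n))
            for _ in range(num_samples)]

# Skewed population: exponential with mean 1 and sd 1 (clearly not normal).
means = sample_means(lambda rng: rng.expovariate(1.0), n=50, num_samples=5000)

print(statistics.fmean(means))  # should be close to mu = 1
print(statistics.stdev(means))  # should be close to sigma / sqrt(n) = 1 / sqrt(50)
```

Even though the population is skewed, the sample means cluster around μ with a spread near σ/√n.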

What if we can only gather one sample - how much can we trust its mean and sd?

We use confidence intervals

x̄ ~ Normal(μ, σ/√n)

95% confidence: x̄ ± 2 × σ/√n (more precisely 1.96 × σ/√n)

As in: we are 95% confident the true mean is within this range

Confidence interval: A level C confidence interval comes from a
method that captures the true value of the population mean in a proportion C of all possible samples. C can be any level with 0% < C < 100%

x̄ ± z* × σ/√n
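A minimal sketch of this interval, assuming the population sd is known. The function name and the example numbers are made up for illustration; the critical value comes from the standard library's NormalDist instead of a z table.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, level=0.95):
    """Return (low, high) for sample_mean +- z* x sigma / sqrt(n)."""
    # z* leaves (1 - level) / 2 in each tail of the standard normal.
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z_star * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = z_confidence_interval(sample_mean=50.0, sigma=10.0, n=100)
print(round(low, 2), round(high, 2))  # 48.04 51.96
```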

If sample

less than 25

more than 25

Hypothesis: A theory about the characteristics of a
variable in a population.

Test of Significance


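Used to assess how strongly the data contradict a hypothesis

Can be one-sided or two-sided

A "Z-statistic" measures the strength of the evidence against H0. Being farther away from 0 means more evidence against H0

The P-value is the probability, assuming the Null Hypothesis
(H0) is true, of getting a result at least as extreme as the one observed - higher - data more consistent with H0. It is not the probability that H0 is correct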

Use z table

Must choose a significance level (as a percentage) before the test
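The z test can be sketched as below, assuming a known population sd. The function name and numbers are illustrative. For the one-sided case the sketch takes the tail beyond |z|, which assumes the observed difference points in the direction of the alternative.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, two_sided=True):
    """Return (z, p_value) for H0: population mean = mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    tail = NormalDist().cdf(-abs(z))  # area beyond |z| in one tail
    p_value = 2 * tail if two_sided else tail
    return z, p_value

alpha = 0.05  # significance level, chosen before looking at the data
z, p = z_test(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)
print(z, round(p, 4), p < alpha)  # 2.0 0.0455 True
```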

Central Limit Theorem for Conf Intervals

Z table for hypothesis testing

Population distribution Normal - assume sample big

Not normal

Use t distribution with n-1 degrees of freedom (when the population sd is unknown and is estimated by the sample sd)

Standard Deviations

SD Sample (1)

SD population (2)
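Assuming (1) and (2) refer to the usual formulas (dividing by n-1 for a sample and by n for a population), the two can be compared with the standard library. The data values are made up.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

s = statistics.stdev(data)       # sample sd: divides by n - 1
sigma = statistics.pstdev(data)  # population sd: divides by n

print(round(s, 3), sigma)  # 2.138 2.0
```

The sample version is always slightly larger, and the gap shrinks as n grows.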

Confidence Interval
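A sketch of the t interval, assuming the usual form x̄ ± t* × s/√n with s the sample sd. The critical value t* is passed in by hand, as if read from a t table; 2.365 is the 95% value for 7 degrees of freedom. Names and data are illustrative.

```python
from math import sqrt
import statistics

def t_confidence_interval(data, t_star):
    """Return (low, high) for x_bar +- t* x s / sqrt(n).

    t_star must be looked up for n - 1 degrees of freedom.
    """
    n = len(data)
    x_bar = statistics.fmean(data)
    margin = t_star * statistics.stdev(data) / sqrt(n)
    return x_bar - margin, x_bar + margin

data = [2, 4, 4, 4, 5, 5, 7, 9]  # n = 8, so dof = 7
low, high = t_confidence_interval(data, t_star=2.365)
print(round(low, 2), round(high, 2))  # 3.21 6.79
```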

Hypothesis test (3)

Then look up the t statistic on the t table - will get a range for the P-value
Multiply by 2 if it is a 2 sided test

If P is less than the significance level, we reject H0.
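The statistic to look up can be computed as below. This assumes formula (3) is the usual one-sample form t = (x̄ - μ0) / (s/√n); the function name and data are illustrative, and the P-value lookup is still left to the t table.

```python
from math import sqrt
import statistics

def one_sample_t(data, mu0):
    """Return (t, dof) for H0: population mean = mu0."""
    n = len(data)
    x_bar = statistics.fmean(data)
    s = statistics.stdev(data)  # sample sd (n - 1 in the denominator)
    t = (x_bar - mu0) / (s / sqrt(n))
    return t, n - 1

t, dof = one_sample_t([2, 4, 4, 4, 5, 5, 7, 9], mu0=4.0)
print(round(t, 3), dof)  # 1.323 7
```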

Comparing 2 means: hypothesis test (5) or confidence interval (4), taking dof = (size of the smaller sample) - 1

Can also estimate the probability that one mean is
greater than another mean by x units
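A sketch of the two-sample statistic, assuming the unpooled standard error √(s1²/n1 + s2²/n2) and the conservative dof from the notes. The delta0 parameter (an illustrative name) covers the "greater by x units" case by testing H0: μ1 - μ2 = delta0.

```python
from math import sqrt
from statistics import fmean, variance

def two_sample_t(x1, x2, delta0=0.0):
    """Return (t, dof) for H0: mean1 - mean2 = delta0."""
    n1, n2 = len(x1), len(x2)
    se = sqrt(variance(x1) / n1 + variance(x2) / n2)
    t = (fmean(x1) - fmean(x2) - delta0) / se
    dof = min(n1, n2) - 1  # conservative: smaller sample size minus 1
    return t, dof

t, dof = two_sample_t([2, 4, 4, 4, 5, 5, 7, 9], [1, 2, 3, 4, 5])
print(round(t, 3), dof)  # 1.932 4
```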

We've done continuous, but what about categorical variables?

2 Categories

Multiple Categories

Binomial

Population proportion: The parameter of interest (p)

Sample proportion: the statistic used to estimate population
proportion (p̂)

The sampling distribution of a
sample proportion, p̂, is approximately Normal
when sample size is large

mean = p

SD = √(p(1−p)/n)

If we take millions of samples - on average np of the n observations will be successes

Large if both successes > 10 and failures > 10
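A sketch of the large-sample interval for a proportion, assuming the usual form p̂ ± z* × √(p̂(1−p̂)/n). The function name and counts are illustrative; the size check mirrors the rule above.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, level=0.95):
    """Return (low, high, large_enough) for a population proportion."""
    failures = n - successes
    large_enough = successes > 10 and failures > 10  # rule from the notes
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, large_enough

low, high, ok = proportion_confidence_interval(successes=60, n=100)
print(round(low, 3), round(high, 3), ok)  # 0.504 0.696 True
```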


For hypothesis test

Large if n × p0 ≥ 10 and n × (1 − p0) ≥ 10


Look up -|z| (the negative of z) in the z table to get the P-value
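A sketch of the test for one proportion, assuming the usual statistic z = (p̂ − p0) / √(p0(1−p0)/n). Note the standard error uses p0, not p̂, because the test works under H0. Names and counts are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def proportion_z_test(successes, n, p0, two_sided=True):
    """Return (z, p_value) for H0: population proportion = p0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    tail = NormalDist().cdf(-abs(z))  # table lookup at -|z|
    p_value = 2 * tail if two_sided else tail
    return z, p_value

z, p = proportion_z_test(successes=60, n=100, p0=0.5)
print(round(z, 3), round(p, 4))  # 2.0 0.0455
```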

Only used for continuous variables

Comparing two proportions (large samples)

(p̂1 − p̂2) ± z* × √( p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2 )   (6)
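A sketch of an interval for the difference of two proportions, assuming formula (6) uses the standard unpooled standard error √(p̂1(1−p̂1)/n1 + p̂2(1−p̂2)/n2). Names and counts are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ci(x1, n1, x2, n2, level=0.95):
    """Return (low, high) for p1 - p2 given success counts x1 and x2."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p1 - p2
    return diff - z_star * se, diff + z_star * se

low, high = two_proportion_ci(60, 100, 45, 100)
print(round(low, 3), round(high, 3))  # 0.013 0.287
```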

Test of Significance

If successes and failures are both > 5 in each sample, use the large sample
formula (7)
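A sketch of the two-proportion test, assuming formula (7) is the usual pooled version: under H0: p1 = p2 both samples are combined into one estimate p̂ = (x1 + x2)/(n1 + n2). Names and counts are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two_sided_p_value) for H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

z, p = two_proportion_z_test(60, 100, 45, 100)
print(round(z, 3), round(p, 3))
```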

If sample small - some corrections you can make (e.g. approximating to the hypergeometric distribution).

Two way tables used

Joint distribution: shows percentage of each cell relative
to the total observations

Conditional distribution: shows the percentage of each
cell relative to the column or row
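Both distributions can be computed from a table of counts. A minimal sketch with a made-up 2 x 2 table stored as a list of rows; the function names are illustrative.

```python
def joint_distribution(table):
    """Each cell as a share of all observations."""
    total = sum(sum(row) for row in table)
    return [[cell / total for cell in row] for row in table]

def conditional_on_rows(table):
    """Each cell as a share of its row total."""
    return [[cell / sum(row) for cell in row] for row in table]

def conditional_on_columns(table):
    """Each cell as a share of its column total."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[cell / col_totals[j] for j, cell in enumerate(row)]
            for row in table]

table = [[30, 10],
         [20, 40]]
print(joint_distribution(table))      # [[0.3, 0.1], [0.2, 0.4]]
print(conditional_on_columns(table))  # [[0.6, 0.2], [0.4, 0.8]]
```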

Expected Count = (row total × column total) / n

Chi Sq statistic (8)

Dof = (r-1)(c-1), where r = number of rows and c = number of columns

No need to multiply the P-value by 2

Tells us if 2 categorical variables are associated

Larger χ² -> smaller P-value -> stronger evidence of an association
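The whole chi-square calculation can be sketched as below, assuming formula (8) is the usual sum of (observed − expected)² / expected over all cells. The table and function name are made up; the P-value is still read from a chi-square table using the returned dof.

```python
def chi_square(table):
    """Return (statistic, dof, expected_counts) for a two way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count = (row total x column total) / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum((obs - exp) ** 2 / exp
                    for obs_row, exp_row in zip(table, expected)
                    for obs, exp in zip(obs_row, exp_row))
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected

stat, dof, expected = chi_square([[30, 10],
                                  [20, 40]])
print(expected)             # [[20.0, 20.0], [30.0, 30.0]]
print(round(stat, 2), dof)  # 16.67 1
```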