Statistical Inference
Definitions
Statistic - a number describing a characteristic of a sample
Parameter - a number describing a characteristic of a population
Sampling distribution - distribution of all possible values
taken by a statistic across all possible samples of size n.
Population distribution - distribution of all observations in
the population
Central Limit Theorem - When n is large enough (or the population itself is normal), the sampling distribution of the sample mean will be approximately normal with mean μ and standard deviation σ/√n
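A quick simulation can make the theorem concrete. This is only a sketch: the exponential population (mean 1, SD 1), the sample size of 50 and the helper name sample_means are assumptions chosen for illustration, not part of the notes.

```python
import random
import statistics

def sample_means(draw, n, num_samples, seed=0):
    """Return the means of num_samples independent samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        statistics.fmean(draw(rng) for _ in range(n))
        for _ in range(num_samples)
    ]

# Hypothetical skewed (non-normal) population: exponential, mean 1, SD 1.
means = sample_means(lambda rng: rng.expovariate(1.0), n=50, num_samples=5000)

observed_mean = statistics.fmean(means)  # should be close to mu = 1
observed_sd = statistics.stdev(means)    # should be close to sigma / sqrt(n)
predicted_sd = 1.0 / 50 ** 0.5           # sigma / sqrt(n) from the theorem
```

Even though the population is skewed, the simulated means should cluster around 1 with a spread near σ/√n.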
What if we can only gather one sample? How much can we trust its mean and SD?
We use confidence intervals
x̄ ~ Normal( μ, σ/√n )
95% confidence: x̄ ± 2 × σ/√n
That is, we are about 95% confident the true mean is within this range
Confidence interval: A level C confidence interval is built by a method that has
probability C of containing the true value of the population mean. C can be any level with 0% < C < 100%
x̄ ± z* × σ/√n, where z* is the critical value for level C
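The interval formula can be sketched in a few lines. The function name and the example numbers (mean 50, SD 10, n = 100) are made up for illustration, and the SD is assumed known.

```python
from statistics import NormalDist

def z_confidence_interval(sample_mean, sd, n, level=0.95):
    """Return (low, high) for a level-C interval around the sample mean."""
    # Critical value z*: leaves (1 - level) / 2 in each tail.
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z_star * sd / n ** 0.5
    return sample_mean - margin, sample_mean + margin

# Hypothetical sample: mean 50, SD 10, n = 100.
low, high = z_confidence_interval(sample_mean=50.0, sd=10.0, n=100, level=0.95)
```

For 95% the exact critical value is about 1.96, which is where the "2 x" rule of thumb comes from.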
If sample
less than 25
more than 25
Hypothesis: A theory about the characteristics of a
variable in a population.
Test of Significance
Used to assess how strong the evidence is against the null hypothesis (Ho)
Can be one-sided or two-sided
A “Z-statistic” measures the strength of the evidence against Ho. Being farther away from 0 means more evidence against Ho
The P-value gives the probability, assuming the Null Hypothesis
(Ho) is true, of a result at least as extreme as the one observed - higher - data more consistent with Ho
Use z table
Must choose a significance level (e.g. 5%) before the test
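A minimal sketch of a z test, using the normal CDF in place of a printed z table. The function name, the alternative labels and the example numbers are assumptions for illustration.

```python
from statistics import NormalDist

def z_test(sample_mean, mu0, sd, n, alternative="two-sided"):
    """Return (z, p_value) for Ho: mu = mu0 with a known SD."""
    z = (sample_mean - mu0) / (sd / n ** 0.5)
    cdf = NormalDist().cdf
    if alternative == "greater":   # Ha: mu > mu0
        p = 1 - cdf(z)
    elif alternative == "less":    # Ha: mu < mu0
        p = cdf(z)
    else:                          # two-sided: double the tail area
        p = 2 * (1 - cdf(abs(z)))
    return z, p

# Hypothetical data: sample mean 52 against Ho: mu = 50, SD 10, n = 100.
z, p = z_test(sample_mean=52.0, mu0=50.0, sd=10.0, n=100)
reject = p < 0.05  # significance level chosen before the test
```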
Central Limit Theorem for Conf Intervals
Z table for hypothesis testing
Population distribution Normal - assume sample big
Not normal
Use t distribution with n-1 degrees of freedom (used when the population SD is unknown and is estimated by the sample SD)
Standard Deviations
SD Sample (1)
SD population (2)
Confidence Interval
Hypothesis test (3)
Then find on t table - will get a range
Multiply by 2 for a 2-sided test
If P is less than the significance level, we reject Ho.
Comparing 2 means: HT (5) or CI (4), taking dof as the smaller sample size - 1
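Formulas (1) to (3) are not reproduced in these notes, so the sketch below assumes the standard definitions: sample SD divides by n-1, population SD divides by n, and t = (x̄ - μ0) / (s/√n). The data are made up. The P-value range still has to be read from a t table for the returned degrees of freedom.

```python
import statistics

def one_sample_t(data, mu0):
    """Return (t, dof) for Ho: mu = mu0 using the sample SD."""
    n = len(data)
    s = statistics.stdev(data)  # sample SD: divides by n - 1
    t = (statistics.fmean(data) - mu0) / (s / n ** 0.5)
    return t, n - 1

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical sample, mean 5
sample_sd = statistics.stdev(data)   # divides by n - 1
pop_sd = statistics.pstdev(data)     # divides by n, so it is smaller
t, dof = one_sample_t(data, mu0=4.0)
```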
Can also estimate the probability that one mean is
greater than another mean by x units
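Formulas (4) and (5) are not shown in the notes. The sketch below assumes the usual unpooled two-sample versions, with the conservative dof rule from the notes. The delta argument (a hypothesized difference of x units) and all example numbers are illustrative assumptions. The critical value 2.306 is the 95% t table entry for 8 degrees of freedom.

```python
def two_sample_t(mean1, s1, n1, mean2, s2, n2, delta=0.0):
    """Return (t, dof, se) for Ho: mu1 - mu2 = delta (unpooled variances)."""
    se = (s1 ** 2 / n1 + s2 ** 2 / n2) ** 0.5
    t = (mean1 - mean2 - delta) / se
    dof = min(n1, n2) - 1  # conservative: smaller sample size minus 1
    return t, dof, se

def two_sample_ci(mean1, mean2, se, t_star):
    """Return (low, high) for mu1 - mu2, given a t table critical value."""
    diff = mean1 - mean2
    return diff - t_star * se, diff + t_star * se

# Hypothetical summaries: group 1 (mean 10, SD 3, n 9), group 2 (mean 8, SD 4, n 16).
t, dof, se = two_sample_t(10.0, 3.0, 9, 8.0, 4.0, 16)
low, high = two_sample_ci(10.0, 8.0, se, t_star=2.306)
```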
We've done continuous, but what about categorical variables
2 Categories
Multiple Categories
Binomial
Population proportion: The parameter of interest (p)
Sample proportion: the statistic used to estimate population
proportion (p^)
The sampling distribution of a
sample proportion, p^, is approximately Normal
when sample size is large
mean = p
SD
If we take millions of samples - on average np of the n observations will be successes
Large if both successes > 10, failures > 10
For hypothesis test
large if n(1-po) >= 10 and npo >= 10
Look up -|z| in the z table to get the tail probability P
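A sketch of the sampling distribution of p^ and the resulting interval, assuming the standard result SD = sqrt(p(1-p)/n). Function names and the 60-out-of-100 example are made up.

```python
from statistics import NormalDist

def proportion_sd(p, n):
    """SD of the sampling distribution of a sample proportion."""
    return (p * (1 - p) / n) ** 0.5

def is_large_sample(successes, failures, minimum=10):
    """Large-sample rule from the notes: both counts above the minimum."""
    return successes > minimum and failures > minimum

def proportion_ci(successes, n, level=0.95):
    """Return (low, high) for p, plugging p^ into the SD formula."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z_star * proportion_sd(p_hat, n)
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 60 successes out of 100.
ok = is_large_sample(successes=60, failures=40)
low, high = proportion_ci(successes=60, n=100)
```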
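A sketch of the one-proportion z test, assuming the standard statistic z = (p^ - po) / sqrt(po(1-po)/n). The function name and example counts are illustrative only.

```python
from statistics import NormalDist

def one_proportion_z_test(successes, n, p0):
    """Return (z, tail) for Ho: p = p0; tail is the area beyond |z| on one side."""
    p_hat = successes / n
    z = (p_hat - p0) / (p0 * (1 - p0) / n) ** 0.5  # SD uses p0, not p^
    tail = NormalDist().cdf(-abs(z))               # same as looking up -|z| in a z table
    return z, tail

# Hypothetical sample: 60 successes out of 100, testing Ho: p = 0.5.
z, tail = one_proportion_z_test(60, 100, 0.5)
two_sided_p = 2 * tail  # double the tail for a two-sided test
```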
Only used for continuous variables
Comparing two proportions (large samples)
(p^1 - p^2) +- (6)
Test of Significance
If successes and failures > 5, large sample
formula (7)
If sample small - some corrections you can make (e.g. approximating to the hypergeometric distribution).
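Formulas (6) and (7) are missing from the notes. The sketch below assumes the standard large-sample versions: an unpooled standard error for the interval and a pooled proportion for the test. Names and counts are made up for illustration.

```python
from statistics import NormalDist

def two_proportion_ci(x1, n1, x2, n2, level=0.95):
    """Return (low, high) for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = (p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2) ** 0.5
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    return (p1 - p2) - z_star * se, (p1 - p2) + z_star * se

def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two_sided_p) for Ho: p1 = p2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # best estimate of p if Ho is true
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2) / se
    return z, 2 * NormalDist().cdf(-abs(z))

# Hypothetical counts: 60 of 100 successes versus 40 of 100.
low, high = two_proportion_ci(60, 100, 40, 100)
z, p = two_proportion_z_test(60, 100, 40, 100)
```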
Two way tables used
Joint distribution: shows percentage of each cell relative
to the total observations
Conditional distribution: shows the percentage of each
cell relative to the column or row
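The two kinds of distribution can be computed directly from a table of counts. The 2 x 2 table and function names below are hypothetical.

```python
def joint_distribution(table):
    """Each cell as a fraction of all observations."""
    n = sum(sum(row) for row in table)
    return [[cell / n for cell in row] for row in table]

def conditional_on_rows(table):
    """Each cell as a fraction of its row total."""
    return [[cell / sum(row) for cell in row] for row in table]

def conditional_on_columns(table):
    """Each cell as a fraction of its column total."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[cell / col_totals[j] for j, cell in enumerate(row)] for row in table]

# Hypothetical two-way table of counts (2 rows, 2 columns, 100 observations).
table = [[10, 30], [20, 40]]
joint = joint_distribution(table)
by_row = conditional_on_rows(table)
by_col = conditional_on_columns(table)
```

Multiply the fractions by 100 to get the percentages described above.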
Expected Count = row total x column total / n
Chi Sq statistic (8)
Dof = (r-1)(c-1), where r and c are the numbers of rows and columns
No need to multiply by 2
Tells us if 2 categorical variables are associated
Larger X2 -> smaller p-value -> stronger evidence of an association
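Formula (8) is not shown in the notes. The sketch below assumes the usual Pearson statistic, the sum over all cells of (observed - expected)^2 / expected. The table is made up, and the P-value still comes from a chi-square table for the returned degrees of freedom.

```python
def chi_square(table):
    """Return (statistic, dof) for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were not associated.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical two-way table of counts.
stat, dof = chi_square([[10, 30], [20, 40]])
```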