Inference 6: Construction of a hypothesis test
Statement about probability distribution of population from which observed data is drawn
Usually in form of parametric tests
Statements about population value of parameters
2 complementary hypotheses: null and alternative
If a hypothesis completely specifies the distribution of the data it is a 'simple hypothesis' (e.g. H0: theta=1/2, H1: theta=2/3)
Or more complex 'composite hypotheses' (e.g. theta>2/3)
If 2 simple hypotheses for H0 and H1, no alternative but to accept null when not rejected.
Generally there is asymmetry: H0 is simple and H1 composite: We don't necessarily accept H0 when not rejected - 'absence of evidence is not evidence of absence'
When H0 is not rejected, it is not the only hypothesis compatible with the sample values
Subset of sample space for which H0 will be rejected is rejection or critical region
Hypothesis test is a rule that specifies:
for which sample values H0 is rejected (i.e. H1 accepted as true)
for which sample values H0 is not rejected (i.e. H0 accepted as true)
Error probabilities and the power function
Type I error: false positive
Type II error: false negative
POWER: probability of rejecting the null hypothesis when the alternative is true
power = 1- Prob(Type II error)
i.e. 1-prob of wrongly accepting H0 when H1 is true
Prob(Type II error) = Beta, so power = 1-Beta
Statistical size of test (not sample size) is: Prob(Type I error)
Prob of wrongly rejecting H0 when H0 is true
Example: X = number of successes in 5 independent trials, each with success probability theta
H0: theta = 1/2
H1: theta = 2/3
Test A: Reject H0 if 5 successes, i.e. observe X=5.
Power: Prob(X=5|H1 true) = (2/3)^5=0.132
Size: Prob(X=5|H0 true) = (1/2)^5=0.031
Test B: Reject H0 if observed value of X is 3,4 or 5.
Power: Prob(X=3, 4 or 5|H1 is true) = sum over i=3,4,5 of (5Ci)(2/3)^i(1/3)^(5-i) ≈ 0.79
Size: Prob(X=3, 4 or 5|H0 is true)= 0.5
Comparing tests: more likely to wrongly reject H0 using test B, but also more likely to correctly reject H0 in test B
Generally: as we increase power to detect H1, also increase size
In practice, type 1 error probability is fixed, generally at alpha = 0.05 and then test chosen to make power as high as possible given fixed size.
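The size and power quoted for Tests A and B can be checked numerically. This is a minimal sketch, assuming X ~ Binomial(5, theta), which is what the powers of 5 and the 5Ci term in the calculations imply. The function and variable names are illustrative only.

```python
from math import comb

def binom_pmf(k, n, theta):
    """Prob(X = k) for X ~ Binomial(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def rejection_prob(region, n, theta):
    """Probability that X falls in the rejection region when the parameter is theta."""
    return sum(binom_pmf(k, n, theta) for k in region)

N = 5                       # assumed number of trials
THETA0, THETA1 = 1/2, 2/3   # values under H0 and H1

tests = {"A": [5], "B": [3, 4, 5]}   # rejection regions
results = {}
for name, region in tests.items():
    size = rejection_prob(region, N, THETA0)    # Prob(reject H0 | H0 true)
    power = rejection_prob(region, N, THETA1)   # Prob(reject H0 | H1 true)
    results[name] = (size, power)
    print(f"Test {name}: size = {size:.3f}, power = {power:.3f}")
```

Enlarging the rejection region from {5} to {3, 4, 5} raises both numbers together, which is the trade-off between size and power noted above.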
Choice of a test statistic
Neyman-Pearson lemma tells us that the most powerful test of a given size alpha rejects H0 for small values of the likelihood ratio (LH0/LH1)
Intuitive: Reject H0 if data much more likely under H1 than H0 (larger LH1 compared to LH0)
likelihood ratio value only depends on particular statistic: best test statistic
If likelihood ratio is small then so is log-likelihood ratio (loglikeH0-loglikeH1)
N.B. Not the same as previously mentioned likelihood ratio - this is just ratio of 2 likelihoods - not likelihood at maximum
Definition of 'small': determined by sampling distribution of test-statistic - chosen to give required value of alpha
N.B. likelihood ratio is a random variable (data vary each time we sample, so support for each hypothesis varies each time we sample). So only interested in the part that varies with each sample - the constant part provides no info on relative support the data give to the hypotheses, so ignore it
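To see the lemma at work on the earlier success-count example (H0: theta=1/2 vs H1: theta=2/3), the sketch below tabulates LH0/LH1 for every possible value of X. It assumes X ~ Binomial(5, theta); the names are illustrative.

```python
from math import comb

def likelihood(x, n, theta):
    """Binomial likelihood of theta given x successes in n trials."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

N = 5                       # assumed number of trials
THETA0, THETA1 = 1/2, 2/3

# Likelihood ratio LH0/LH1 for each possible outcome x = 0..5.
# The binomial coefficient is the same in both likelihoods and cancels.
ratios = [likelihood(x, N, THETA0) / likelihood(x, N, THETA1) for x in range(N + 1)]

for x, ratio in enumerate(ratios):
    print(f"x = {x}: LH0/LH1 = {ratio:.3f}")
```

The ratio falls as x rises, so 'small likelihood ratio' is the same as 'large X'. Rejection regions of the form X ≥ c, such as {5} and {3, 4, 5}, are therefore of the Neyman-Pearson type, and X itself is the test statistic.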
E.g. Normal dist. with known variance:
Let H0: µ=5 and
see e.g. 6.3.1. on p6.4 notes
We only need to see quantities that vary with the data, so ignore fixed constants (i.e. remove sigma^2 part)
Large values of Σxi, or any positive constant multiple of this, will make likelihood ratio smaller.
It is convenient to use the constant multiple (1/n)Σxi, the sample mean.
Thus, best test rejects H0 for large values of sample mean.
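A numerical check of this argument, as a sketch only. The alternative value is not given in these notes, so MU1 = 6 below is a hypothetical choice, as are SIGMA = 1 and the two small samples. The code computes the log-likelihood ratio directly and compares it with a closed form that involves the data only through the sample mean.

```python
from math import log, pi
from statistics import fmean

def normal_loglik(xs, mu, sigma):
    """Log-likelihood of mu for i.i.d. Normal(mu, sigma^2) data with known sigma."""
    n = len(xs)
    return -0.5 * n * log(2 * pi * sigma**2) - sum((x - mu)**2 for x in xs) / (2 * sigma**2)

def log_lr(xs, mu0, mu1, sigma):
    """loglikeH0 - loglikeH1, computed from the full data."""
    return normal_loglik(xs, mu0, sigma) - normal_loglik(xs, mu1, sigma)

def log_lr_from_mean(xbar, n, mu0, mu1, sigma):
    """Same quantity after expanding the squares: depends on the data only via xbar."""
    return -n * (mu1 - mu0) * (xbar - (mu0 + mu1) / 2) / sigma**2

MU0, MU1, SIGMA = 5.0, 6.0, 1.0   # MU1 and SIGMA are hypothetical values
low = [4.1, 5.3, 4.8, 5.6]        # hypothetical sample with a smaller mean
high = [6.2, 5.9, 6.8, 5.5]       # hypothetical sample with a larger mean

for xs in (low, high):
    direct = log_lr(xs, MU0, MU1, SIGMA)
    via_mean = log_lr_from_mean(fmean(xs), len(xs), MU0, MU1, SIGMA)
    print(f"mean = {fmean(xs):.2f}: log LR = {direct:.3f} (closed form {via_mean:.3f})")
```

For any MU1 above MU0 the closed form is decreasing in the sample mean, which is why the rejection rule can be written in terms of the sample mean alone.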
Quantifying evidence against H0: the one-sided p-value
Using previous example:
H0: theta=theta0 vs H1: theta>theta0. Large values of T(x) (test statistic) reject H0, so we need to define rejection region for fixed Type I error probability, alpha:
Prob(data within rejection region|H0)=alpha
If we know sampling distribution of T: we can determine the rejection region using threshold,c:
Prob(T≥c|H0)=alpha, i.e. reject H0 if T≥c
Not just reject/not reject - we can use a continuous measure to quantify the evidence.
larger values of t are more extreme with reference to H0
We can use the sampling distribution of T under H0 to calculate the probability of observing data at least as extreme as that observed, the p-value: p=Prob(T≥t|H0), where t is the observed value of T
The smaller the p-value the more evidence provided by the data against the null hypothesis
We can use the p-value to reject or not reject H0:
p<alpha is the same as t>c
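The link between the threshold c and the p-value can be illustrated for a Normal mean with known SD, where T is the sample mean. This is a sketch under assumed values (MU0 = 5, SIGMA = 2, N = 25, ALPHA = 0.05 and the observed means are all hypothetical), using only the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical setting: Normal data, known SD, H0: mu = MU0 vs H1: mu > MU0.
MU0, SIGMA, N, ALPHA = 5.0, 2.0, 25, 0.05

# Sampling distribution of the test statistic T (the sample mean) under H0.
null_dist = NormalDist(mu=MU0, sigma=SIGMA / sqrt(N))

# Threshold c chosen so that Prob(T >= c | H0) = alpha.
c = null_dist.inv_cdf(1 - ALPHA)

def one_sided_p(t_obs):
    """Prob(T >= t_obs | H0): probability of a result at least as extreme as t_obs."""
    return 1 - null_dist.cdf(t_obs)

print(f"threshold c = {c:.3f}")
for t_obs in (5.2, 5.6, 5.7, 6.0):
    p = one_sided_p(t_obs)
    print(f"t = {t_obs}: p = {p:.4f}, p < alpha: {p < ALPHA}, t > c: {t_obs > c}")
```

For each observed value the two comparisons agree, but the p-value also shows how far inside or outside the rejection region the result lies.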
Usually interested in hypotheses such as:
e.g.1) H0: theta = theta0 vs H1: theta>theta0 (one sided alternative hypothesis)
Often there is a single test statistic most powerful for each theta>theta0 (i.e. uniformly most powerful)
e.g. Normal distribution
Previously, with simple hypotheses, large values of sample mean = small values of likelihood ratio, i.e. most powerful test rejects H0 for large values of sample mean, and this holds for any alternative value of μ greater than 5
Any values of sample mean>5 would have resulted in less support for H0, i.e. more support for H1.
Therefore, test that rejects H0 for large values of sample mean is uniformly most powerful test
If no uniformly most powerful test then use scientific knowledge of problem to identify particular theta>theta0 and choose test most powerful for that particular value
e.g.2. H0: theta = theta0 vs H1: theta ≠theta0 (two-sided alternative hypothesis)
No uniformly most powerful test can exist
Testing H1: μ>5 means that most powerful test rejects H0 for large values of sample mean but testing H1: μ<5 most powerful test rejects H0 for SMALL values of sample mean - i.e. most powerful test is different for each context.
Not clear that same test statistic will be most powerful for all alternative values of parameter
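The lack of a uniformly most powerful test against a two-sided alternative can be seen by comparing power functions. The sketch below assumes Normal data with known SD and H0: μ=5. SIGMA = 2, N = 25 and ALPHA = 0.05 are hypothetical values, and the two-sided test shown is the usual equal-tailed one.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()                                # standard Normal
MU0, SIGMA, N, ALPHA = 5.0, 2.0, 25, 0.05       # hypothetical values
SE = SIGMA / sqrt(N)                            # SE of the sample mean

def power_upper(mu):
    """Size-alpha test rejecting H0 for LARGE values of the sample mean."""
    return 1 - Z.cdf(Z.inv_cdf(1 - ALPHA) - (mu - MU0) / SE)

def power_lower(mu):
    """Size-alpha test rejecting H0 for SMALL values of the sample mean."""
    return Z.cdf(-Z.inv_cdf(1 - ALPHA) - (mu - MU0) / SE)

def power_two_sided(mu):
    """Size-alpha test rejecting H0 in either tail (alpha/2 in each)."""
    z = Z.inv_cdf(1 - ALPHA / 2)
    shift = (mu - MU0) / SE
    return (1 - Z.cdf(z - shift)) + Z.cdf(-z - shift)

for mu in (4.0, 5.0, 6.0):
    print(f"mu = {mu}: upper = {power_upper(mu):.3f}, "
          f"lower = {power_lower(mu):.3f}, two-sided = {power_two_sided(mu):.3f}")
```

All three tests have the same size at μ=5, but the upper-tail test has the highest power for μ>5 and the lower-tail test for μ<5, so no single test is most powerful across the whole two-sided alternative.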
Quantifying evidence against H0: the two-sided p-value
fig 6.1 on p6.7 for figure of asymmetrical distribution to highlight issues
H0: theta=theta0 vs H1: theta≠theta0
Formally 2x one-sided tests:
i) H1: theta>theta0
ii) H1: theta<theta0
first approach: double observed one-sided p-value
observe 'tobs'. A result as unfavourable for H0 can occur either by:
i) T≥tobs, where Prob(T≥tobs|H0)=p~
ii) T≤t', where we choose t' such that Prob(T≤t'|H0)=p~
Therefore, two-sided p-value is total probability of observing a result at least as unfavourable to H0 as tobs, p=2p~. In effect we reject for large values of |T| (absolute T value)
N.B. t' not chosen to be equal distance from centre of distribution (as not the case if not symmetrical) but equal in probability terms, i.e. same size unfavourable tail as obtained with tobs.
N.B. in discrete observations: there may not be an opposite tail so defined. Therefore, observed tail is only such unfavourable region and p=p~. Also, there may be problems getting 'exact' p-values (see AT3), nominated significance level may not be possible exactly and need to choose as large as possible without exceeding nominated level.
second approach: (used in AT3), construct one-sided p-value, obtaining p~, then identify in the other tail a probability density equal to that of the observed result and add both tails together.
i.e. find t'' in the other tail such that Probfunction(T=t''|H0)=Probfunction(T=tobs|H0) and add to p~ the tail Prob(T≤t''|H0)
The two approaches give similar results unless the distribution is asymmetric. Generally they give very similar p-values, but the p-value from the first will not be less than that obtained from the second
In practice, we just report twice the one-sided p-value but it is useful to see this has a justification
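The two approaches can be compared on an asymmetric discrete distribution. This is a sketch with hypothetical numbers (T ~ Binomial(10, 0.3) under H0, tobs = 6). Because exact equality of probabilities in the two tails rarely holds for discrete data, the second approach is implemented here by a stand-in convention: sum the probabilities of all outcomes no more probable than the observed one.

```python
from math import comb

def binom_pmf(k, n, theta):
    """Prob(T = k) for T ~ Binomial(n, theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def two_sided_p_values(t_obs, n, theta0):
    """Return (one-sided p~, doubled p-value, equal-probability p-value)."""
    pmf = [binom_pmf(k, n, theta0) for k in range(n + 1)]
    upper = sum(pmf[t_obs:])         # Prob(T >= t_obs | H0)
    lower = sum(pmf[:t_obs + 1])     # Prob(T <= t_obs | H0)
    p_tilde = min(upper, lower)      # one-sided p-value in the observed direction

    # First approach: double the one-sided p-value (capped at 1).
    p_doubled = min(1.0, 2 * p_tilde)

    # Second approach (discrete stand-in): add up every outcome whose probability
    # is no greater than that of t_obs; the small factor guards against rounding.
    cutoff = pmf[t_obs] * (1 + 1e-7)
    p_equal_prob = min(1.0, sum(p for p in pmf if p <= cutoff))
    return p_tilde, p_doubled, p_equal_prob

# Hypothetical example: 6 successes in 10 trials, H0: theta = 0.3.
p_tilde, p_doubled, p_equal_prob = two_sided_p_values(6, 10, 0.3)
print(f"one-sided p~ = {p_tilde:.4f}")
print(f"doubled      = {p_doubled:.4f}")
print(f"equal prob.  = {p_equal_prob:.4f}")
```

In this example the doubled value is the larger of the two, in line with the comparison of the approaches above.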
Type I, type II error and statistical power revisited
Consider simple hypothesis test:
H0: difference (δ) between means of 2 groups is zero (Normal with common variance)
H1: difference is δi
Test statistic=D, difference in sample means for the 2 groups realised by d
fig 6.2 on p6.9 for expected distributions of test statistic under H0 and H1 (2 diff distributions with test statistic in same place)
Fixing Type I error at alpha, power is then affected by
i)size of diff to be detected or
ii)SE of test statistic which affects dispersion around locations (SE affected by population SD and size of sample)
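How the size of the difference and the SE drive power can be sketched for the two-group comparison. The code assumes a known common SD, equal group sizes and a two-sided test, so that SE(D) = sigma*sqrt(2/n). The function name and all numerical values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()   # standard Normal

def power_two_group(delta, sigma, n_per_group, alpha=0.05):
    """Power of a two-sided size-alpha test of H0: difference = 0 when the true
    difference in means is delta (known common SD sigma, equal group sizes)."""
    se = sigma * sqrt(2 / n_per_group)   # SE of D, the difference in sample means
    z = Z.inv_cdf(1 - alpha / 2)
    shift = delta / se                   # true difference in SE units
    return (1 - Z.cdf(z - shift)) + Z.cdf(-z - shift)

# i) larger difference to detect -> higher power (sigma = 10, n = 40 per group)
for delta in (2.0, 5.0, 8.0):
    print(f"delta = {delta}: power = {power_two_group(delta, 10.0, 40):.3f}")

# ii) smaller SE (here via larger samples) -> higher power (delta = 5, sigma = 10)
for n in (20, 40, 80):
    print(f"n per group = {n}: power = {power_two_group(5.0, 10.0, n):.3f}")
```

With delta = 0 the function returns alpha, since the 'power' under H0 is just the size of the test.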
When do we use one-sided p-values?
Rarely - and with care and justification
1) When we know a priori that a result in one direction must be due to chance - no possible scientific explanation for a result in that direction
If justification is incorrect then this would increase chance of Type I error (false positive)
2) when consequences of Type I error not as bad as Type II error, e.g. safety data - increasing sensitivity to detect unsafe drugs (at expense of false alarms).
Always state direction of evidence against the null - in actuality, for a given sample H0 can only be rejected in one direction - two-sided test just gives possibility of rejection in either direction
Construction of a test in a novel situation
First set up null and alternative hypotheses
Define appropriate test statistic, which is a function of the random variable that the data will realise (using Neyman-Pearson lemma to assist)
Obtain sampling distribution for that test statistic under null hypothesis. Sometimes this is easier not using the most powerful test-statistic (i.e. going against Neyman-Pearson lemma)
Define rejection region which gives pre-specified type I error probability (e.g. alpha=0.05)
Calculate value of test statistic for observed sample of data
If observed value of test statistic is in rejection region then conclude data reject H0 in favour of H1. If not in rejection region conclude that data fail to reject H0.
Report p-value quantifying evidence against H0, assessing weight of evidence rather than making a decision to reject or not reject H0.
N.B. if test cannot be defined in terms of a parameter, sampling distribution may be obtained by considering probability distribution of statistic across sample space of observed results (typical in non-parametric tests)
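The steps above, including the non-parametric case in the final note, can be sketched as an exact permutation test. The data, the choice of the difference in sample means as test statistic, and the function name are all hypothetical; the sampling distribution under H0 is obtained by enumerating every way of assigning the pooled observations to the two groups.

```python
from itertools import combinations
from statistics import fmean

def permutation_test(group_a, group_b):
    """Exact permutation test of H0: both groups come from the same distribution.
    Test statistic: difference in sample means (a minus b)."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = fmean(group_a) - fmean(group_b)

    # Sampling distribution under H0: every split of the pooled data is equally likely.
    diffs = []
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diffs.append(fmean(a) - fmean(b))

    tol = 1e-12   # guard against floating-point ties
    p_one_sided = sum(d >= observed - tol for d in diffs) / len(diffs)
    p_two_sided = sum(abs(d) >= abs(observed) - tol for d in diffs) / len(diffs)
    return observed, p_one_sided, p_two_sided

# Hypothetical data for two small groups.
group_a = [12.1, 14.3, 13.8, 15.2]
group_b = [10.4, 11.0, 9.8, 11.9]

observed, p_one_sided, p_two_sided = permutation_test(group_a, group_b)
print(f"observed difference = {observed:.3f}")
print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

Full enumeration is only practical for small samples. With 4 observations per group there are 70 possible splits, so the smallest attainable one-sided p-value is 1/70, which illustrates the earlier point that a nominated significance level may not be achievable exactly with discrete data.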