Intro To Health Statistics
L1 - Intro to Statistics
Scientific Approach
Inductive: observation -> pattern -> hypothesis -> theory
Deductive: Theory -> hypothesis -> observation -> confirmation
Research Questions:
Vital to get right
Usually informed by: Lack of existing evidence - Contradicting existing evidence - Low quality existing evidence
Informs: Research Methodologies - Hypotheses
Acronyms:
FINER: Feasible - Interesting - Novel - Ethical - Relevant
PICOT: Population - Intervention - Comparator - Outcome - Time Frame
SMART: Specific - Measurable - Achievable - Relevant - Timely
Population: Target population - Broad/Narrow - Exclusions - Access
Intervention: Treatment / Test / Procedure - Standard Care / Novel - Alternative: Exposure
Comparator: Is there one? - Ethics
Outcome: Primary - Secondary - Safety - Qualitative / Quantitative - Objective / Subjective
Time: When is the outcome most relevant? - Prevalence - Feasibility considerations
Examples:
“Does adding surgical resection improve 5-year survival in children with neuroblastoma compared with chemotherapy alone?”
“Are patients on antidepressants better able to perform activities of daily living?”
Feasible: e.g. “How anxious are students in public schools in the UK on average?” You cannot measure the whole UK; pick a specific group in a given context from which you can collect a reliable average
Interesting: are you investigating something which contributes in a significant way to the wider body of work?
Novel: Do not just recycle well established topics
Ethical: is your research worth the effect it has on participants? Always get ethical approval
Relevant: will this work be useful within a larger sociocultural context?
Specific: make sure you're measuring one particular aspect of the topic in depth to get useful findings
Measurable: can you feasibly measure this variable in an objective way given the design and measurement tools at hand?
Achievable: Make sure you are researching something which can be altered, is observable and falsifiable
Relevant: will this work be useful in the larger context
Timely: check that the timing of your research is appropriate; longitudinal work must be very well planned and consistent
Types of Variables:
Continuous:
Height (cm) / Weight (kg) - Time to healing (days) - Age (years) - Income (£) - Validated questionnaire scale (points)
Categorical:
Gender (Male, Female) - Marital Status (Married, Single, Divorced) - Education (Primary, Secondary) - Disease Severity (Mild, Moderate, Severe) - Group (Intervention, Control)
Study Designs and Hierarchies of Evidence
Evidence Based Medicine: the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients
The practice of evidence-based medicine means integrating individual clinical expertise with the best available external clinical evidence from systematic research
The first and earliest principle of evidence-based medicine indicated that a hierarchy of evidence exists
In other words, not all evidence is the same
Pyramid Hierarchy of Evidence (from lowest to highest)
Case studies, anecdote, bench studies and personal opinion
Observational studies (cohort and case control)
Other controlled clinical trials
RCTs
Systematic reviews and meta-analyses of RCTs
Continued: RCTs are ranked above observational studies, while expert opinion and anecdotal experience are ranked at the bottom
Systematic review and meta-analysis (secondary research) is placed above RCTs
This is because systematic reviews combine data from multiple RCTs
Remember the higher up the hierarchy that a study comes, the more likely it is that the study can minimise the impact of bias on the results of the study
You can think of this in a slightly different way - the higher up the top of the hierarchy the more certain we are of the results
The hierarchy of evidence is a core principle of Evidence Based Health Care because it ranks study types based on the strength and precision of their research methods
There is some criticism of the HoE, since a well-designed study lower down the pyramid should arguably be ranked higher than a poorly designed RCT
It is important to design good studies
You need to understand different types of study designs so that you can decide:
Which type of study design is best able to answer your question
The level and strength of evidence provided by a particular study design
One of the problems is that not all research questions can be answered through an RCT, because of ethical or practical issues
Let’s not forget that even when evidence is available from RCTs, evidence from other study designs may also be relevant
Systematic Reviews Continued:
Use explicit, systematic methods to collate and synthesise findings of studies that address a clearly formulated question
They are useful for confirming current practices, guiding decision-making and informing future research
Meta-analyses, while often part of systematic reviews, are not interchangeable with them
They use statistical analysis to combine data from the studies found in the systematic review
Studies must be homogenous (similar enough) that the data from them can be pooled together
Since SRs are focused on a clearly formulated question, their conclusions only answer that question and cannot be generalised; importantly, they are only as reliable as the studies they include
Clinical research can be broadly classified into experimental and observational studies
What are the differences between observational studies and experimental studies?
Observational studies: the exposure of a subject occurs naturally and is 'observed' or recorded by the investigator
Experimental studies: in an experimental study (intervention study) the exposure is 'assigned' to the patient by the investigator
The primary study at the top of the pyramid is the RCT
RCTs are considered the ‘gold standard’ single study design
Most rigorous way of exploring cause-effect relationships between the treatment (intervention) and outcome, and for assessing cost effectiveness
If done properly, RCTs are powerful tools but not always possible - Ethical and practical concerns
Some study designs do not involve formal randomisation (quasi-experimental)
Often used to inform public health practice and where randomisation is difficult to achieve
However, they cannot rule out the possibility that the association wascaused by a third factor linked to both the intervention and outcome
L2 - Questionnaire Design
Bias
Avoid: Leading - Ambiguous - In a language which respondents will not understand
Leading:
Sometimes questions may lead by implying that an answer is foolish. – Do you have an unreasonable fear of heights?
A leading question might start with a piece of apparently factual information: – Most people think that medical statisticians are grossly underpaid. Do you agree?
A group of women who had just had a cervical smear were asked: – Do you understand the importance of having a smear test? – Unsurprisingly, 118/120 respondents said “yes”
Ambiguous:
Hedges (1978) reports several examples of the effects of varying the wording of questions.
He asked two groups of people one of the following: – Do you feel you take enough care of your health, or not? – Do you feel you take enough care of your health, or do you think you could take more care of your health?
82% “took enough care” in response to the first question and 68% for the second.
Another set of questions were asked: – Do you think a person of your age can do anything to prevent ill-health in the future or not? – Do you think a person of your age can do anything to prevent ill-health in the future, or is it largely a matter of chance?
Responses were recorded by age (the overall percentage decreased with increasing age), but the second question was, on average, always answered significantly lower
The second question is ambiguous, as it is quite possible to think that health is largely a matter of chance but that there is still something one can do about it. – Only if it is totally a matter of chance is there nothing one can do
Types of Ambiguity
overlapping categories:
The following comes from a questionnaire about health checks in general practice: – When was your check-up? (Tick one answer only): Less than one month ago, 1 to 6 months ago, 6 to 12 months ago
Respondents who had a check-up 6 months ago would find it difficult to answer this question.
Another type of ambiguity occurs when we want to ask about numbers presented as ranges.
multiple questions in one
Sometimes people try to ask lots of things in onequestion - They confuse two (or more) questions in one.
For example, would you prefer your smear to be taken by: – A female doctor – A male doctor – A nurse – I don’t mind
The preference for a female and the preference for a doctor are mixed together.
Perspective:
Sometimes the respondents may interpret the question in a different way depending on who answers the question.
For example, when parents and school children were asked the same question:
– Do you (does your child) usually cough first thing in the morning? - Schoolchildren 3.7%; Parents 2.4%
– Do you (does your child) usually cough at other times in the day or night? - Schoolchildren 24.8%; Parents 4.5%
The symptoms all showed relationships to the child’s smoking and other potentially causal variables, and also to one another.
Understanding:
For example, a sample of secondary school children were asked, in the same questionnaire, whether they agreed with the statements:
– Smoking causes lung cancer 85%
– Smoking is not harmful 41%
The negative statement ‘smoking is not harmful’ may have confused the children, or they may not see cancer as harmful
A second sample of children were asked: – Smoking causes lung cancer 90% – Smoking is bad for your health 91%
A third sample of schoolchildren were asked: – What is meant by the term ‘lung cancer’?
• Understand 13%
• Do not know / don’t understand 32%
They all knew that lung cancer was caused by smoking, however.
Timeframe
Sometimes we want to ask people how often they do things. For example, within a hospital a questionnaire asked: – How often have you visited the audio-visual service?
Frequently - Often - Rarely - Never
If you were completing this question, then would you think frequently was more or less than often?
Self-administered Questionnaires
Self-administered questionnaires can be used either through the post or for individuals who come to the place of research (e.g. clinic visit).
Suitable when the purpose of the study is fairly straightforward and can be easily explained.
Advantages: Cheap - Private - Can be anonymous
Conditional questions of the form: – If ‘yes’ go to question 7, if no then go to question 23
Should be avoided if at all possible – They make the questionnaire difficult for respondents to complete
Self-administered questions should be avoided if:
There is a large amount of information to gather – The study is difficult to explain – There is likely to be a problem of literacy among the respondents
It is good to explore these things in a pilot study
Face-to-face Interviews
L5 - Crosstabulations (Chi-squared) and associated tests
Chi-squared Tests
Definition: Chi-squared tests compare the observed frequencies of a categorical variable across the categories of another categorical variable with the frequencies that would be expected if there were no association
Research Example: Levett et al. (2016)
Design
176 women with low-risk pregnancies
• Attending 2 public hospital-based antenatal clinics in Sydney, Australia
• Intervention: 2-day antenatal education programme plus standard care
• Control: standard care alone
• Primary outcome: epidural use
Objective: • In this study they wanted to find out if the antenatal education programme reduced epidural use compared to the standard care group
21/88 (23.9%) women in the intervention group received an epidural
• 57/83 (68.7%) women in the control group received an epidural
• So, what should we do?
Cross tabulation of two variables
Output: Also called a contingency table or cross classification
• Called a 5 by 2 table or 5 × 2 table; in general, an r × c table
Chi-squared test of association
• We find for each cell the frequency which we would expect if the null hypothesis were true
• Null hypothesis: no association between the two variables
• Alternative hypothesis: an association of some type
• We use the row and column totals to do that
• Proportion who are premature = 99/1443 = 6.86%
• Proportion of full-term deliveries = 1344/1443 = 93.14%
• Proportions should be the same for each category of housing if null hypothesis is true, i.e., no association between variables
Expected Frequencies
We can calculate expected frequencies based on proportions
• Out of 899 owner occupiers, expect 899 × 99/1443 = 61.7 to be premature deliveries if the null hypothesis were true
Hypothesis testing
In general, the expected frequency if the null hypothesis is true = (row total × column total) / grand total
Comparing observed and expected frequencies
The test statistic is the sum over all cells of (observed − expected)² / expected; if the null hypothesis is true, and the samples are large enough, this is an observation from a chi-squared distribution, often written χ2
• For a contingency table, the degrees of freedom are given by:
(number of rows – 1) × (number of columns – 1)
p value = significance of association
Chi-squared testing for Association explained
The chi-squared statistic is not an index of the strength of the association
• If we double the frequencies, this will double chi-squared, but the strength of the association is unchanged
• The test statistic follows chi-squared distribution provided the expected values are large enough
• It is a large sample test so the smaller the expected values become, the more dubious will be the test
The conventional criterion: the chi-squared test is valid if
• at least 80% of the expected frequencies exceed 5
• AND all the expected frequencies exceed 1
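As a sketch of how these calculations could be run in practice, the snippet below uses Python's scipy on a hypothetical tenure-by-delivery table. Only the row and column totals (99 premature, 1344 full term, 899 owner occupiers, n = 1443) echo the example above; the individual cell counts are made up.

```python
# A sketch only: the cell counts are hypothetical, chosen so the totals
# match the example in the notes.
import numpy as np
from scipy.stats import chi2_contingency

# rows = housing tenure, columns = (premature, full term)
observed = np.array([
    [62, 837],   # owner occupier (hypothetical split of the 899)
    [25, 332],   # renter (hypothetical)
    [12, 175],   # other (hypothetical)
])

chi2, p, df, expected = chi2_contingency(observed)
print("expected frequencies:\n", expected.round(1))   # 61.7 for owner occupier / premature
print(f"chi-squared = {chi2:.2f}, df = {df}, p = {p:.3f}")

# Validity criterion from the notes: at least 80% of expected frequencies
# exceed 5 AND all expected frequencies exceed 1.
print("criterion met:", (expected > 5).mean() >= 0.8 and (expected > 1).all())
```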
Principles of significance tests
• Set up the null hypothesis and its alternative
• Do a test/check any assumptions of the test
• Find the value of the test statistic
• Refer the test statistic to a known distribution which it would follow if the null hypothesis were true
• Find the probability of a value of the test statistic arising which is as or more extreme than that observed, if the null hypothesis were true
• Conclude that the data are consistent or inconsistent with the null hypothesis
Assumption Tests
Example: 8 out of 10 expected frequencies ≥ 5 (8/10 = 80%) and all above 1 - assumptions for the chi-squared test are valid
If Assumptions Are Not Met
• The conventional criterion: the chi-squared test is valid if
• at least 80% of the expected frequencies exceed 5
• AND all the expected frequencies exceed 1
• We could combine or delete rows and columns to give bigger expected values.
• Obviously, this cannot be done for 2 x 2 tables.
• If the table does not meet the criterion even after reduction, then we can use a different test.
Fisher's Exact Test
Calculate the probability of every possible table with the given row and column totals.
• We then sum the probabilities for all tables as or less probable than the observed.
• Historically only used for small samples in 2 by 2 tables because of computing limitations, but it can be used with any sample size.
• When the table has more than two rows or columns or large frequencies, this can be a very large number of tables.
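A minimal sketch of Fisher's exact test in scipy, using the 2 by 2 epidural counts from the Levett et al. example earlier in these notes (21/88 intervention vs 57/83 control).

```python
# A sketch only: builds the 2 x 2 table from the counts reported above.
from scipy.stats import fisher_exact

table = [[21, 88 - 21],   # intervention: epidural yes / no
         [57, 83 - 57]]   # control: epidural yes / no

odds_ratio, p = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, exact p = {p:.2g}")
```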
Chi-squared for linear associations
• We would have got the same value of chi-squared whatever the order of the rows
• The test ignores the natural ordering
• We can look for a trend from one end of the table to the other
• SPSS does the Mantel-Haenszel linear-by-linear association chi-squared test, whether you want it or not
We assign numerical values to categories.
E.g.: Considerable improvement =1,
Moderate or slight improvement =2,
No material change = 3,
Moderate or slight deterioration = 4,
Considerable deterioration =5,
Death =6.
• We then say, given these numerical scales, is there a relationship
X2(MH) = (n − 1) × r2 (r = correlation coefficient)
The statistic follows a χ2 distribution with 1 degree of freedom
Observations of association
• Should be valid even when the standard chi-squared test is not, provided we have at least 30 observations
• Can be significant even when the standard chi-squared test is not
• It gives a more powerful test against a more restricted null hypothesis
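A minimal sketch of the linear-by-linear statistic described above: the 1 to 6 outcome scores follow the notes, but the outcome values and the two-group coding are made up for illustration.

```python
# A sketch only: hypothetical data, real formula X2(MH) = (n - 1) * r^2.
import numpy as np
from scipy.stats import pearsonr, chi2

group   = np.array([1, 1, 1, 1, 1, 2, 2, 2, 2, 2] * 4)   # hypothetical group codes
outcome = np.array([1, 2, 2, 3, 4, 2, 3, 4, 5, 6] * 4)   # hypothetical ordinal scores (1-6)

r, _ = pearsonr(group, outcome)
n = len(outcome)
x2_mh = (n - 1) * r ** 2          # Mantel-Haenszel linear-by-linear statistic
p = chi2.sf(x2_mh, df=1)          # upper-tail probability, 1 degree of freedom
print(f"X2(MH) = {x2_mh:.2f}, p = {p:.4f}")
```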
L3 - Frequency Distributions and Summary Statistics
Types of Data
Categorical: N fall into different categories
Nominal: categories are not ordered (physical features, conditions etc.)
Ordinal: Categories have a natural order (e.g. stages of cancer)
Continuous: numerical, counts or scales with units
Integer Values: whole numbers (e.g. number of children)
Values take any numbers within a range: (e.g. body weight)
Variables: all data (qualitative or quantitative) can be organised into variables:
qualities or characteristics which can be attributed to a given sample or condition; they are the primary "currency" of research
Statistic: any number calculated from the data alone
Categorical Variables:
Frequency: Count of individuals having a particular quality
Relative frequency: Proportion of individuals having a particular quality (e.g. student: 10/390 = 0.026 or 2.6%)
Frequency Distribution: Set of frequencies of all the possible categories of a variable (illustrated in bar charts)
Categorical Variables - Ordered:
Cumulative frequency: Number of individuals with values less than or equal to a category
Relative cumulative frequency: Proportion of individuals with values less than or equal to a category
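A short sketch (hypothetical disease-severity data) of frequency, relative frequency and cumulative frequency using pandas.

```python
# A sketch only, on made-up disease-severity data.
import pandas as pd

severity = pd.Series(["Mild", "Mild", "Moderate", "Severe", "Moderate",
                      "Mild", "Moderate", "Severe", "Mild", "Moderate"])
order = ["Mild", "Moderate", "Severe"]

freq = severity.value_counts().reindex(order)    # frequency
rel = freq / freq.sum()                          # relative frequency
cum = freq.cumsum()                              # cumulative frequency
rel_cum = cum / freq.sum()                       # relative cumulative frequency

print(pd.DataFrame({"freq": freq, "rel": rel, "cum": cum, "rel cum": rel_cum}))
```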
Continuous Variables:
Frequency Distribution: Number of times each possible value occurs.
Truly Continuous Data:
As most of the values occur only once, counting the number of occurrences does not help
• Instead: divide the scale into class intervals, e.g. from 3.0 to 3.5, from 3.5 to 4.0 etc.
• Class intervals must not overlap, convention to put the higher boundary point of an interval into the next one, i.e. boundary value of 3.5 is part of the 3.5 – 4.0 interval, not 3.0 – 3.5
Frequency distributions of continuous quantitative variables are most commonly depicted by a histogram
• Intervals are on a horizontal axis with vertical bars representing the count (or proportion) of individuals for that interval
• No spaces between bars unless there are no observations for an interval
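A minimal sketch of such a histogram with 0.5-unit class intervals, using matplotlib on simulated birthweights (purely illustrative data).

```python
# A sketch only: simulated birthweights in kg.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
birthweight = rng.normal(loc=3.4, scale=0.5, size=200)   # hypothetical values

# Class intervals 2.0-2.5, 2.5-3.0, ...; a boundary value such as 3.5 is
# counted in the 3.5-4.0 interval, matching the convention above.
bins = np.arange(2.0, 5.5, 0.5)
plt.hist(birthweight, bins=bins, edgecolor="black")
plt.xlabel("Birthweight (kg)")
plt.ylabel("Frequency")
plt.show()
```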
Summary Statistics:
Averages
Mean
• Mean = sum of observations / number of observations
• Can be affected by individual outliers
The Median
The central value of the distribution, where half the data are below this value and half the data above it
• Middle observation for odd samples and average of two central observations for even samples
• Less affected by outliers than the mean
Mode
The most common observation
• Quick and easy to compute
• Unaffected by extreme scores
• Distributions can have more than one mode (e.g. bimodal)
Interpreting Histograms
For symmetric data, Mean = Median = Mode (symmetrical histogram)
Central tendency for positively skewed data: mean > median > mode (histogram peak towards the left, tail to the right)
Central tendency for negatively skewed data: mode > median > mean (histogram peak right)
Variance and Standard Deviation
Variance: the (arithmetic) average of the squared differences from the mean
Standard Deviation: the square root of the variance; roughly the average difference from the mean
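A minimal sketch (made-up ages) of these summary statistics in Python.

```python
# A sketch only, on made-up ages.
import numpy as np
from statistics import mode

ages = np.array([23, 25, 25, 27, 30, 31, 31, 31, 38, 52])

print("mean     =", ages.mean())
print("median   =", np.median(ages))
print("mode     =", mode(ages))                  # most common observation
print("variance =", ages.var(ddof=1))            # average squared difference from the mean
print("SD       =", ages.std(ddof=1))            # square root of the variance
```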
Percentiles and Quartiles
The pth percentile is the value of the observation such that p% are less than or equal to it, different ways to calculate this
Quartiles divide the data into four equal parts (25% each)
Percentiles can be marked on a histogram or shown on a box plot:
The top and bottom of the box show Q3 and Q1 respectively
The whiskers (outer lines) show the maximum and minimum values
The line inside the box shows the median
Outliers are data points that are more than 1.5 times (extreme: more than 3 times) the height of the box (IQR) away from the edges of the box (Q1 or Q3)
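A minimal sketch (the same made-up ages) of quartiles, the IQR and the 1.5 × IQR outlier fences described above.

```python
# A sketch only, using made-up ages.
import numpy as np

ages = np.array([23, 25, 25, 27, 30, 31, 31, 31, 38, 52])

q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # box-plot outlier fences

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print("outliers:", ages[(ages < lower_fence) | (ages > upper_fence)])
```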
L4 - Significance Testing and Correlation
Principles of significance testing
• Set up the null hypothesis and its alternative
• Do a test/check any assumptions of the test
• Find the value of the test statistic
• Refer the test statistic to a known distribution which it would follow if the null hypothesis were true
• Find the probability of a value of the test statistic arising which is as or more extreme than that observed, if the null hypothesis were true
• Conclude that the data are consistent or inconsistent with the null hypothesis
Statistical Significance
• If the data are not consistent with the null hypothesis, difference is statistically significant
• If the data are consistent with the null hypothesis, difference is not statistically significant
• We can think of the significance test probability as an index of the strength of evidence against the null hypothesis
• The probability of such an extreme value of the test statistic occurring if the null hypothesis were true is called the P value
• It is not the probability that the null hypothesis is true
• The null hypothesis is either true or it is not; it is not random and has no probability
Cont.
• Suppose we take a probability of 0.01 or less as constituting reasonable evidence against the null hypothesis
• If the null hypothesis is true, we shall make a wrong decision one in a hundred times
• The conventional approach is to say that differences are significant if the probability is less than 0.05
• If we decide that the difference is significant, the probability is sometimes referred to as the significance level
Significance - real vs important
• If a difference is statistically significant, then it might be real, but not necessarily important
• For example, we may look at the effect of a drug, given for some other purpose, on blood pressure and find it significantly raises blood pressure by an average of 1 mm Hg
• A rise in blood pressure of 1 mm Hg is not clinically significant, so, although it may be there, it does not matter
• Conversely, if a difference is not statistically significant, it could still be real
• We may simply have too small a sample to show that a difference exists
• Furthermore, the difference may still be important
• ‘Not significant’ does not imply that there is no effect
• It means that we have failed to demonstrate the existence of one
Correlation Coefficient
Correlation coefficients are used to measure the strength of relationships or associations between two continuous variables
• Scatter diagrams are often used to graphically explore this association
• For each subject we plot one variable against the other variable
• Each point in a scatter diagram represents one subject
Expressed in a Scatter Diagram
Correlation enables us to measure the closeness to a linear relationship
The correlation coefficient is based on the products of differences from the mean of the two variables
• For each observation we subtract the mean for that variable
• Multiply the deviations from the mean for the two variables for a subject together
• Add them
• Sometimes this is called the sum of products about the mean
To see how correlation works, we can draw two lines on the scatter diagram:
• A horizontal line through the mean strength and A vertical line through the mean height
• Products in the top right and bottom left quadrants are positive
• Products in the top left and bottom right quadrants are negative
If the sum of products is positive, then the correlation is positive
• If the sum of products is negative (i.e. most of the points are in the negative quadrants), the correlation is negative
Statistical Tests for correlation Coefficients
Correlation coefficient, r
• Min value: -1 and max value: +1
• Also known as:
• Pearson’s correlation coefficient
• Product moment correlation coefficient
Calculated by dividing the sum of products by the square root of the product of the two sums of squares
• Square root of sum of squares:
• For each observation we subtract the mean for that variable
• Square each difference
• Add them up
• Multiply the two summed values together
• Take the square root of this value
Perfect correlation, r=+1 (+ve) or r=-1 (-ve) - points on the scatter graph lie on a straight line
No linear relationship, r=0 - the scatter graph shows no linear pattern
It is possible for r to be equal to 0 when there is a relationship which is not linear - the points may follow a curve (e.g. a bell-shaped curve opening downwards or upwards)
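A minimal sketch (hypothetical height/strength values, echoing the example above) of Pearson's r computed from the sum of products about the mean and checked against scipy.

```python
# A sketch only: height and strength values are hypothetical.
import numpy as np
from scipy.stats import pearsonr

height   = np.array([155, 160, 165, 170, 175, 180, 185])
strength = np.array([ 40,  42,  47,  50,  49,  55,  60])

dx = height - height.mean()
dy = strength - strength.mean()
# sum of products about the mean, divided by the square root of the
# product of the two sums of squares
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_scipy, p = pearsonr(height, strength)
print(f"r (manual) = {r_manual:.3f}, r (scipy) = {r_scipy:.3f}, p = {p:.4f}")
```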
Tests of Assumptions
If the assumptions for a given significance test are not met, the p value is unreliable and an alternative test method is required
Normality - histogram output
Linear - scatter graph output
If Assumptions are not met
Spearman's Rho Test
• We have highlighted that one of the assumptions for the Pearson’s correlation coefficient is Normality of one of the variables
• If this assumption is not met, then we can use an alternative correlation coefficient
• First, we rank the observations then calculate the product moment correlation of the ranks
• Rather than the observations themselves
• The resulting statistic has a distribution which does not depend on the distribution of the original variables
• Usually denoted as ρ or rs
The ranks for the two variables are found, and the formula for the product moment correlation is then applied to these ranks.
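A minimal sketch of Spearman's rho on made-up data: correlate the ranks, or call scipy's spearmanr directly.

```python
# A sketch only, on made-up data.
import numpy as np
from scipy.stats import spearmanr, pearsonr, rankdata

x = np.array([2.1, 3.5, 4.0, 4.2, 8.9, 15.0])
y = np.array([1.0, 2.2, 2.1, 3.9, 4.5,  9.8])

rho_via_ranks, _ = pearsonr(rankdata(x), rankdata(y))   # product moment correlation of the ranks
rho, p = spearmanr(x, y)
print(f"rho (via ranks) = {rho_via_ranks:.3f}, rho (scipy) = {rho:.3f}, p = {p:.3f}")
```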
L6 - Comparing Means (t-tests)
Comparing two means of a continuous variable
Published Research - Costa et al. (2012)
Design
66 patients received total hip arthroplasty
60 patients received resurfacing arthroplasty
Primary endpoint: Hip functioning at 12 months after surgery (Harris Hip Score)
Results
Total hip arthroplasty: Before surgery Mean=50.1 (SD=13.5); After 12 months Mean=82.3 (SD=21.4)
Resurfacing arthroplasty: Before surgery Mean=48.6 (SD=14.2); After 12 months Mean=88.4 (SD=15.8)
Conducting Stats Tests - Behind the Scenes
Quantify the observed relationship between two variables into a test statistic (observed value) from a known statistical distribution
Calculate the probability (p-value) of the observed value being a plausible observation from the sampling distribution that the test statistic would form if H0 were true
Declare the relationship statistically significant if the observed value is greater than the 'critical value' of that distribution at which the p-value becomes less than 0.05 (i.e. the evidence against H0 becomes stronger)
Analysis Write Up
Give a descriptive summary, including the direction and magnitude of any observed relationship between variables
Quote relevant test output and comment on statistical significance
Answer the research question!
Comment on the assumptions of the test that was performed
Application of This Paper
Mean Harris Hip Score at 12 months
Total hip arthroplasty: 82.3
Resurfacing arthroplasty: 88.4
Could be illustrated by adjacent box plots
Hypotheses about hip functioning in the population
H0: Hip functioning is the same for the two treatments
HA: Hip functioning is different between the two treatments
Group Difference
Observed Mean Difference = 6.04 score points
H0: Mean Difference = 0
How can we know the likelihood (p-value) of 6.04 if H0 is true?
Standard Deviation for comparing means
The sampling distribution for mean group differences follows the z-distribution
Normal distribution with mean of 0 and SD of 1
1z = 1 Standard Deviation (of a single sample distribution)
1z = 1 Standard Error (of the distribution of many samples)
Standard Error for Comparing Means
The standard error of the mean (SEM or SE) is an estimate of the standard deviation of the sampling distribution of the mean
Measure of uncertainty around the mean
Affected by the size and variability of the sample (s.e. = SD / sqrt(n) for a single group estimate)
Importance
Area under the curve calculations for sampling population
Confidence Intervals
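A minimal sketch (made-up scores) of the single-group standard error formula above.

```python
# A sketch only, on made-up scores: s.e. = SD / sqrt(n).
import numpy as np

scores = np.array([82, 75, 90, 68, 84, 77, 88, 71])
se = scores.std(ddof=1) / np.sqrt(len(scores))
print(f"mean = {scores.mean():.1f}, SD = {scores.std(ddof=1):.1f}, SE = {se:.2f}")
```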
Using Z-scores to assess group difference
Sampling distribution of mean group differences if the difference was truly zero (H0) - scores shown on a bell-curve line graph (x-axis centre = 0)
0 ± the critical z-score = the values required for a p value of .05 - these determine whether the difference between two mean values is statistically significant
Transformation back to the hip score scale: z-score × standard error
Observed score is within the plausible range of H0 - if the observed difference's z-score does not fall within the rejection regions (2.5% at the low and high extremes), then the difference is not significant
For Smaller Sample Sizes
For smaller sample sizes (n < 100 per group), the Standard Normal z-distribution does not hold
Instead, (Student’s) t-distribution is used for statistical testing
Shape defined by sample size (expressed as degrees of freedom df, calculated as total sample size minus number of groups)
The t-Distribution
Greater spread of distribution towards the tails compared with z-distribution
Therefore larger ‘critical’ cut-off values for different proportions of the population
Difference is greatest for small degrees of freedom
Using the t-distribution to assess group differences
For 118 degrees of freedom (63 + 57 – 2), critical t = 1.98 for 5% significance level
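A minimal sketch of the independent-samples t-test from the reported summary statistics; the group sizes of 63 and 57 are implied by the 118 df, and their assignment to the two arms is assumed. Because the published means and SDs are rounded, the output will only approximately match the reported t(118) = 1.82.

```python
# A sketch only: summary statistics are the reported (rounded) values,
# and the mapping of 63 / 57 to the two treatment arms is assumed.
from scipy.stats import ttest_ind_from_stats, t

res = ttest_ind_from_stats(mean1=88.4, std1=15.8, nobs1=57,   # resurfacing arthroplasty (assumed n)
                           mean2=82.3, std2=21.4, nobs2=63,   # total hip arthroplasty (assumed n)
                           equal_var=True)
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.3f}")

# Critical t for a two-sided 5% test with 118 df (about 1.98, as above)
print("critical t:", round(t.ppf(0.975, df=118), 2))
```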
Confidence Intervals (CIs)
A confidence interval for a statistic (e.g. mean) is a range of values that is likely to contain the value of that statistic for the population
Calculated as the mean ± critical statistic (e.g. critical t) * standard error of the mean
Standard: 95% Confidence Interval
Confidence limits are values that state the boundaries of the confidence interval
If the interval excludes zero (i.e. upper and lower limit are both negative or upper and lower limit are both positive), the mean difference can be assumed to be statistically significant
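A minimal sketch of the CI formula above applied to the hip-score mean difference; the standard error is back-calculated from the reported t value (6.04 / 1.82), so the interval only approximately reproduces the published -0.51 to 12.58.

```python
# A sketch only: CI = statistic +/- critical t * standard error.
from scipy.stats import t

mean_diff = 6.04
se_diff = 6.04 / 1.82                      # implied standard error of the difference
crit = t.ppf(0.975, df=118)                # critical t for a 95% CI

lower, upper = mean_diff - crit * se_diff, mean_diff + crit * se_diff
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```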
Application
Can be applied to group differences or group means
Often graphically displayed as ‘whiskers’ (NOT a box plot)
Overlapping confidence intervals suggest non-significant test results
Reporting t-Tests
“Functional scores improved in both treatment groups from baseline to 12 months after surgery, with a small benefit for resurfacing surgery (mean score of 88.4 versus 82.3).
There was no evidence for statistically significant differences in the Harris hip score between treatment groups at 12 months
(mean difference = 6.04 (95% CI -0.51 to 12.58) score points in favour of resurfacing arthroplasty, t(118)=1.82, p=0.070).
The true group difference is likely to lie between 0.51 points in favour of total arthroplasty and 12.58 points in favour of resurfacing arthroplasty.
In conclusion, the difference in functioning between the two surgery types was not statistically significant.”
t-Test Assumptions
Observations are independent (check study design)
Observations come from Normal distributions (check plots)
Distributions in each group have equal variance (check group statistics and Levene’s Test for equality of variance)
Reporting Assumptions
Independence: “As data come from a survey of different individuals, we can assume that observations are independent.”
Normality: “Based on a histogram of the variable / comparison of mean and median, the variable is approximately normal / positively skewed / negatively skewed.”
Equality of Variances: “Based on the similar / different standard deviations of the variable in the two groups and a non-significant / significant Levene’s test, the variances can / can’t be assumed to be equal.”
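A minimal sketch (made-up group data) of checking equal variances with Levene's test before choosing the form of the t-test.

```python
# A sketch only, on made-up group data.
import numpy as np
from scipy.stats import levene, ttest_ind

group_a = np.array([82, 75, 90, 68, 84, 77, 88, 71])
group_b = np.array([88, 92, 85, 95, 79, 91, 87, 94])

stat, p = levene(group_a, group_b)
print(f"Levene's test: W = {stat:.2f}, p = {p:.3f}")   # non-significant -> equal variances plausible

# Use the equal-variance t-test only if the assumption looks reasonable
print(ttest_ind(group_a, group_b, equal_var=(p > 0.05)))
```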
Independence Not Met:
Scenarios
Measuring a variable before and after an intervention, event or time period
Matched pairs (e.g. parent-child, or individuals matched on a variable)
Implications
Observations are correlated, i.e. no longer independent
No between-subject differences within each comparison, therefore smaller standard error of group differences
t-distribution still applies, but degrees of freedom for only one group
Larger observed t-values (because of the smaller standard error)
The comparison against the t-distribution is then the same as when the independence assumption is met
Assumptions of Paired t-Tests
Pairs of observations are independent - check study design; we know there were 128 independent participants
Paired differences are normally distributed - calculate the difference between paired observations and plot it, for example on a histogram
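A minimal sketch (made-up before/after scores) of a paired t-test, including the paired differences that would be plotted to check normality.

```python
# A sketch only, on made-up before/after scores for the same individuals.
import numpy as np
from scipy.stats import ttest_rel

before = np.array([50, 47, 55, 61, 44, 58, 49, 52])
after  = np.array([62, 55, 60, 70, 51, 66, 54, 63])

diffs = after - before                 # plot these (e.g. histogram) to check normality
res = ttest_rel(after, before)         # df = number of pairs - 1
print(f"mean paired difference = {diffs.mean():.1f}, "
      f"t({len(diffs) - 1}) = {res.statistic:.2f}, p = {res.pvalue:.4f}")
```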
L7 - ANOVA Tests: Comparing multiple means of a continuous variable
Difference From t-Tests: using multiple t-tests to compare each pair of groups would be:
inefficient
unable to answer the research question of whether the categorical variable as a whole is associated with the continuous variable
failing to take multiple testing into account
Thus, conduct a One-way Analysis of Variance (ANOVA)
The ANOVA
Partitions the total variation into
Variability within groups - Deviations between each observation and group (treatment) mean
Variability between groups - Deviations between each group (treatment) mean and the overall mean
Sum of squares (SS)
Sum of squared deviations
Total Variation (SST) = SSB (Between SS) + SSW (Within SS)
Observed Test Statistic
The ratio between SSB and SSW is a representation of the size of any treatment effect
large ratio -> suggests group differences
small ratio / ratio close to 1 -> group means are the same as total mean, i.e. no difference
Interpretation of the relative size of this ratio depends on the number of participants and number of groups (represented by degrees of freedom)
df(between groups) = Number of groups – 1
df(within groups) = Total sample size – number of groups
Statistic of interest
Statistic = ( SSB / dfbetween ) / ( SSW / dfwithin )
Follows the F distribution for df between and df within
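A minimal sketch (three hypothetical treatment groups) of the sums-of-squares partition and the resulting F statistic, checked against scipy's f_oneway.

```python
# A sketch only, on three hypothetical treatment groups.
import numpy as np
from scipy.stats import f_oneway, f

groups = [np.array([27, 30, 25, 29, 26]),
          np.array([33, 35, 31, 34, 32]),
          np.array([30, 28, 32, 29, 31])]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between-groups SS
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)             # within-groups SS
df_between, df_within = len(groups) - 1, len(all_obs) - len(groups)

F = (ssb / df_between) / (ssw / df_within)
p = f.sf(F, df_between, df_within)
print(f"manual: F({df_between},{df_within}) = {F:.2f}, p = {p:.4f}")
print("scipy :", f_oneway(*groups))
```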
The F-Distribution
Family of distributions shown on a line chart - the shape depends on two parameters
DF1 and DF2 set the parameters
The critical F value on the chart is the ANOVA statistic value at which the p-value is .05
Which Groups Differ?
Post-hoc test that controls (‘corrects’) the significance level for multiple testing
There are different follow-up tests and approaches
One of them: ‘Bonferroni’ (divides the 5% significance level by the number of comparisons)
Post-hoc output is under "multiple tests" table
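A minimal sketch of the Bonferroni idea: divide the 5% significance level by the number of pairwise comparisons.

```python
# A sketch only: 3 groups give 3 pairwise comparisons.
from math import comb

n_groups = 3
n_comparisons = comb(n_groups, 2)
print(f"{n_comparisons} comparisons, Bonferroni-corrected alpha = {0.05 / n_comparisons:.4f}")
```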
ANOVA Assumptions
Independence of observations - check study design
Normality - Appropriate plots for normal data, comparison of mean and median
Equal variances across groups - Rule of thumb: the largest group SD should not be more than double the lowest group SD (i.e. ratio < 2)
Reporting ANOVAs: “Oxford Shoulder Scores (OSS) after three months appeared to be better (higher) in the ESP and MUA treatment arms (average of 32.7 and 31.7 score points respectively), compared with the ACR treatment (27.4 score points).
An Analysis of Variance (ANOVA) showed that at least one of the treatments was statistically significantly different from the others (F(2,444)=10.1, p<0.001).
A post-hoc test with Bonferroni correction revealed that ESP treatment was on average 5.3 OSS points better than ACR (95% CI 1.9 to 8.6, p<0.001), and MUA treatment was on average 4.3 OSS points better than ACR (95% CI 1.5 to 7.0, p<0.001); both comparisons were statistically significant.
There was no significant difference between average ESP and MUA scores (95% CI -2.3 to 4.4, p=1.00).
The analysis assumptions of independence of observations and equality of variance were met (the largest ratio of standard deviations was 1.07). OSS scores were somewhat negatively skewed, however the assumption of normality is still assumed to hold.
In conclusion, after three months, ESP and MUA appeared to be superior treatments by approximately 4 to 5 OSS points compared with ACR.”
L8 - Managing Variables
Exploring Categorical Variables:
Example: Are different ethnicities more or less likely to have a food allergy?
Note the sample size for each variable observed
Check large differences in frequencies between categories
Output: clustered bar chart
Consequences of analysing raw data:
Assumptions are likely to be violated:
Expected frequencies for uncommon categories will be very low
In this example: 40% of expected cell counts < 5 and minimum expected cell count = 0.14
Tests may not be meaningful:
There is not enough information in our example to draw any inference about mixed ethnicities, but the statistical test will try and incorporate this
This may distort any real relationships in the remaining data
Solutions: (always note any tests or alterations performed to alter raw data)
Drop/remove categories (with very little sample size)
Example: this would meet the Chi Squared assumption
Treating categories as a continuous variable
Sometimes it may be desirable to treat data collected in categories asa continuous variable
Conditions:
The categories MUST follow an order (e.g. lowest to highest)
Categories must be approximately equally spaced (conceptually)
There should be a minimum of 4 categories
Possible Reasons
Easier analysis / interpretation
Compensate for categories with few participants
If data are collected as Likert scales, this is usually done with the intention to analyse data continuously (but you do not have to)
You should always decide your intended analysis in advance (usually based on your research question), not based on which analysis shows a significant result
Managing Continuous variables
Converting a continuous variable into categories
Generally, we would avoid turning continuous data into categories, as we almost always lose information
However, this is useful if
There are known categories of interest (e.g. BMI groups, cut-offs for clinical diagnosis)
The data are severely skewed
Always justify your choice of categories; this should be independent of the statistical significance of any tests
If continuous variables are skewed
Many tests are reasonably forgiving when it comes to non-normal continuous data (e.g. t-test / ANOVA)
Sometimes, there are alternative ‘non-parametric’ tests that do not rely on normality (e.g. Spearman’s correlation)
We could abandon the continuous nature of a variable entirely and group it into categories (e.g. old vs young)
Or, we could transform a variable!
This does not mean we manipulate the data to achieve a desired test result, but rather we create a version of a variable that will allow for a valid test to be carried out
Interpretation not always straightforward however!
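A minimal sketch (hypothetical BMI and income values) of both options: grouping a continuous variable using known cut-offs, and log-transforming a skewed variable.

```python
# A sketch only: BMI and income values are made up.
import numpy as np
import pandas as pd

bmi = pd.Series([18.2, 22.5, 24.9, 27.3, 31.0, 35.6, 41.2])

# Known clinical cut-offs justify these categories, not the test results.
bmi_group = pd.cut(bmi, bins=[0, 18.5, 25, 30, np.inf],
                   labels=["Underweight", "Healthy", "Overweight", "Obese"])
print(bmi_group.value_counts())

income = pd.Series([12_000, 18_500, 22_000, 27_000, 35_000, 90_000, 250_000])
log_income = np.log(income)   # less skewed; remember to interpret results on the log scale
print(log_income.round(2).tolist())
```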
Managing Complex Variables
Combining multiple variables into a new one
Sometimes we want to derive a new variable of interest from multiple existing variables
E.g. for calculating known metrics, such as BMI from standard height and weight measurements
What variable(s) could you derive from these questions rather than analysing each activity separately?
Qualitative Data:
You need to separate qualitative categories into quantitative variables (e.g. exercise (yes/no) or cooking (yes/no))
Be aware of bias within quantified qualitative variables