1. RELIABILITY
CREATE IT!
ASSESS IT! (Generalizability Theory)
A) Classical Test Theory (CTT)
Obtained total Score =
True Score (consistency factors) +
Measurement Error (ME / inconsistency factors)
ME-Sources
Unsystematic:
Item selection / Test administration / Test scoring (subjectivity)
Systematic: Consistently wrong (validity problem)
Assumptions
Unsystematic MEs = random / Mean ME = 0 / True scores and unsystematic MEs uncorrelated / True score = constant / Distribution of MEs the same for each participant / MEs not correlated with MEs from other tests / Systematic MEs minimal or absent / similar abilities
Difficulty level of item = proportion of examinees who pass it (see sketch below)
❌ Sample-dependent / assumes MEs are identical across trait levels / describes the whole scale range rather than a specific ability level / hard to generalize over many groups / only one reliability and one item set per trait range, does not adapt
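A minimal sketch of the CTT difficulty index described above, using a made-up 0/1 response matrix (the data and variable names are only illustrative):

```python
import numpy as np

# Hypothetical response matrix: 5 examinees x 4 items, 1 = pass, 0 = fail
responses = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
])

# CTT difficulty (p-value) = proportion of examinees who pass each item
p_values = responses.mean(axis=0)
print(p_values)  # 1.0, 0.8, 0.6, 0.2 -> higher p = easier item
```

Because the p-value is just a proportion in this particular sample, it changes with the sample - the sample-dependence listed under ❌ above.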
B) Item-response Theory (IRT)
Item-response Function (IRF) / Item Characteristic Curve (ICC) =
relation between a participant's level of the latent trait & the probability that they will give the designated response to a test item measuring that construct
✅ Easier / fewer assumptions / good for ranking + smaller groups / longer tests are more reliable
Assumptions
Each item has its own IRF/ICC! (Each trait level has its own ME)
Difficulty level =
how much of the trait is needed to answer the item correctly; the item's location b on the trait scale θ
X-axis: Trait level (theta)
Y-axis: Probability of correct answer
Item discrimination parameter =
how well the item differentiates between people around a specific trait level (steeper slope = better discrimination)
True score (θ) = higher θ, higher trait level
Often computer-adaptive: if you pass easy items, the model presents harder ones, and vice versa
Unidimensionality / Local independence / Non-speeded (timeless) / Discrimination parameters the same for all items (in the Rasch model)
✅ Scores more reliable for average-ability people / shorter tests can be more reliable (good match between θ and difficulty) / adaptive / if the IRF is known, you can estimate θ / item-parameter invariance (IRFs do not depend on a particular population characteristic or ability - test scores meaningful in a relative sense)
❌ complex software / the extremes of the ICC carry little information / less reliable for people with very high or very low ability
Test Information Function =
statistical information corresponding to each trait level (e.g. item A is useful for testing people with low θ) / most information in the middle of the ability range (where θ = difficulty b)
Scale Information Function = add the information from the items of the scale together (shows the reliability of the test at different θ-levels)
Models
1-Parameter (Rasch) = probability of respondent with θ correctly responding to item with difficulty b
2-Parameter = adds item discrimination index
3-Parameter = adds a guessing parameter c (see the sketch below)
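A hedged Python sketch of the curves and models above: the 3PL ICC (with the 1PL/Rasch and 2PL as special cases) and the 2PL item information, summed into a test/scale information function. The parameter values are invented for illustration.

```python
import numpy as np

def icc_3pl(theta, a=1.0, b=0.0, c=0.0):
    """3PL item characteristic curve: P(correct | theta).
    a = discrimination (slope), b = difficulty (location), c = guessing floor.
    With a = 1 and c = 0 this is the 1PL/Rasch model; with c = 0 the 2PL."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def item_info_2pl(theta, a=1.0, b=0.0):
    """Item information for a 2PL item: I(theta) = a^2 * P * (1 - P);
    it peaks where theta equals the difficulty b."""
    p = icc_3pl(theta, a, b, 0.0)
    return a**2 * p * (1 - p)

theta = np.linspace(-3, 3, 7)                                  # trait levels on the x-axis
items = [(1.5, -1.0), (1.0, 0.0), (2.0, 1.0)]                  # invented (a, b) pairs
test_info = sum(item_info_2pl(theta, a, b) for a, b in items)  # test/scale information function

print(icc_3pl(theta, a=1.0, b=0.0, c=0.2))                     # probability of a correct answer
print(test_info)                                               # precision differs across theta
```

The information functions show where on θ the test measures most precisely; with a = 1 and c = 0 the same code gives the Rasch-model probabilities.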
...as temporal consistency
...as internal consistency
Test-Retest =
Administer test 2x to same people, compute correlation r
❌ ME-variance-source: changes over time (carryover effect, practice effect) / overestimation of reliability possible / systematic errors have no influence on score variability / motivation fluctuations
✅ reliable if: high r between the two administrations (2nd score predictable from the 1st; see the sketch below) / measuring stable variables (abilities / personality)
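A minimal sketch of the test-retest computation, with invented scores:

```python
import numpy as np

# Hypothetical scores of the same 6 people on two administrations of the same test
time1 = np.array([12, 15, 9, 20, 14, 11])
time2 = np.array([13, 14, 10, 19, 15, 10])

# Test-retest reliability = Pearson correlation r between the two administrations
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(round(r_test_retest, 2))
```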
Alternate Forms = 2 forms of the same test (equivalent difficulty & content) given to the same people, compute correlation r
Immediate
Delayed
❌ ME-variance-source: same as immediate + changes over time
Split-Half = Compute correlation r of two halves from one test
Odd-even system (only if items are ordered according to difficulty level)
✅ reliable if: the 2 halves show a strong correlation r (this r refers to only half the test length, so two full administrations should also show a high r) / step up with the Spearman-Brown formula (only if the two halves have the same SD) / odd-even split takes an increasing difficulty level into account
❌ ME-variance-source: internal consistency / item sampling / nature of the split (see the sketch below)
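A hedged sketch of the split-half procedure with the Spearman-Brown step-up (odd-even split; item scores are invented):

```python
import numpy as np

# Hypothetical item scores: 6 persons x 6 items
items = np.array([
    [3, 4, 2, 3, 4, 2],
    [5, 4, 5, 4, 5, 5],
    [2, 2, 1, 2, 1, 2],
    [4, 5, 4, 4, 4, 5],
    [3, 3, 3, 2, 3, 3],
    [1, 2, 2, 1, 2, 1],
])

odd_half  = items[:, 0::2].sum(axis=1)   # items 1, 3, 5
even_half = items[:, 1::2].sum(axis=1)   # items 2, 4, 6

r_half = np.corrcoef(odd_half, even_half)[0, 1]   # correlation of the two halves
r_full = 2 * r_half / (1 + r_half)                # Spearman-Brown: estimate for the full-length test
print(round(r_half, 2), round(r_full, 2))
```

The Spearman-Brown step-up assumes the two halves have the same SD, as noted above; otherwise the Flanagan-Rulon formula is preferable.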
Coefficient Alpha (Cronbach's Alpha) = Mean of all possible split-half coefficients, corrected by Spearman-Brown-Formula (otherwise: underestimation of r)
Index of the interrelatedness of individual items, internal consistency, low uniqueness (NOT unidimensionality = you can measure 2 diff. factors but still have a high alpha)
Assumptions: unidimensionality / factorial purity / for continuous items (e.g. rating scales)
Kuder-Richardson Formula 20 (KR-20) = similar to Cronbach's Alpha, but for dichotomous items
⚠ large scales have a high alpha (many items) / adequate reliability is decision-dependent (context) / not a validity measure (only says how correlated the halves are) / only useful for factorially simple tests / lower bound of test reliability (only equals reliability if items are tau-equivalent) / says nothing about error sources
❌ ME-variance-source: item sampling / test heterogeneity (e.g. IQ)
✅ good estimate of split-half reliability / less susceptible to randomness / not dependent on the split / more general than KR-20 (see the sketch below)
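A minimal sketch of Cronbach's alpha from a person × item matrix; for 0/1 items the same computation reduces to KR-20 (data are invented):

```python
import numpy as np

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of the total score).
    For dichotomous (0/1) items this reduces to KR-20."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical 6 persons x 4 items
data = [[3, 4, 3, 4], [5, 4, 5, 5], [2, 2, 1, 2],
        [4, 5, 4, 4], [3, 3, 2, 3], [1, 2, 2, 1]]
print(round(cronbach_alpha(data), 2))
```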
Alpha-"Precision" = SE of the inter-item correlation (a large SE is a symptom of multidimensionality, not proof) / also decreases with more items in the test
❗ E.g. Alpha = 0.8 with a mean inter-item correlation of 0.57 (3-item scale) OR 0.28 (10-item scale)
❌ Requires the proportion of people who answered each item correctly, but many items have no correct or wrong options
Types
Standardized Item Alpha (average inter-item correlation, stepped up with the Spearman-Brown formula; see the worked check below) - larger than Cronbach's Alpha
✅ if item standard scores are summed to form scale scores / ❌ if you calculate with simple raw scores; does not take differing SDs into account
Flanagan-Rulon formula = takes differing SDs into account (unlike Spearman-Brown)
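The ❗ example above follows from stepping up the mean inter-item correlation with the Spearman-Brown formula (i.e. the standardized item alpha); a quick check:

```python
def standardized_alpha(mean_r, k):
    """Spearman-Brown step-up of the average inter-item correlation for k items:
    alpha_std = k*r / (1 + (k-1)*r)."""
    return k * mean_r / (1 + (k - 1) * mean_r)

print(round(standardized_alpha(0.57, 3), 2))   # ~0.80 for the 3-item scale
print(round(standardized_alpha(0.28, 10), 2))  # ~0.80 for the 10-item scale
```

The same alpha of 0.8 can thus hide very different degrees of item interrelatedness depending on scale length.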
Interrater (Kappa) =
how much 2+ raters agree on one item; correlate/compare their ratings (see the sketch below)
⚠ supplement (to other reliability estimates)
❌ ME-variance-source: scorer diff. (subjectivity)
Kappa:
<0.6 (bad) / 0.6-0.8 (sufficient) / >0.8 (good)
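A hedged sketch of Cohen's kappa for two raters scoring the same people (ratings are invented):

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    categories = np.union1d(r1, r2)
    p_obs = np.mean(r1 == r2)
    p_chance = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical pass/fail ratings of 10 test-takers by two raters
rater1 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
rater2 = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1]
print(round(cohens_kappa(rater1, rater2), 2))  # compare against the benchmarks above
```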
INTERPRET IT!
Reliability r(XX) > .90 = very accurate for individual differences (but a value around .70 can also prove useful, depending on the purpose)
GENERAL TERMS
Context-dependent
E.g. Criterion-referenced Tests = performance assessed as mastery (pass/fail) / low variability among people / traditional reliability measures inappropriate / classification must be (nearly) perfectly reliable (risk of classification error)
Standard error of measurement (SEM): a larger SEM implies a greater typical ME and lower reliability / helps estimate the range within which a person's true score is likely to lie
Standard Error of Difference: compares 2 abilities in 1 person (e.g. non-verbal and verbal test parts) / difference > 2 × SE_diff = significant (see the sketch below)
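A minimal sketch of the SEM and the standard error of the difference, assuming invented values for the SD and the reliabilities:

```python
import math

# Hypothetical: test SD = 15, reliability r_xx = 0.91
sd, r_xx = 15.0, 0.91

sem = sd * math.sqrt(1 - r_xx)                 # standard error of measurement
lo, hi = 100 - 1.96 * sem, 100 + 1.96 * sem    # ~95% band around an obtained score of 100
print(round(sem, 1), (round(lo, 1), round(hi, 1)))

# Standard error of the difference between two scores (e.g. verbal vs. non-verbal parts)
sem_verbal    = sem
sem_nonverbal = sd * math.sqrt(1 - 0.88)       # assume the other part is slightly less reliable
se_diff = math.sqrt(sem_verbal**2 + sem_nonverbal**2)
print(round(se_diff, 1))                       # a difference > ~2 x se_diff is treated as significant
```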
❗ Special Circumstances
Measuring unstable characteristics (e.g. emotional reactivity)
Speed Tests = reflect speed of performance (split-half would be spuriously high / rather use test-retest or 2 separately timed split-halves)
Power Tests = Assess trait/ability (enough time)
Restriction of Range: the range of values has been narrowed (e.g. due to a split or a selective sample) - the correlation is lower than when considering the whole sample (e.g. giving an intelligence test twice to college students only)
Unidimensionality = measurement of one factor / psychological dimension / trait / construct / attribute / skill / ability
Homogeneity = internal consistency as a necessary (but not sufficient) condition for this
⚠ A test can have only moderately homogeneous items and still measure one underlying construct
Reliability Coefficient r(XX) = ratio of true-score variance to total score variance (true-score + ME variance) / the proportion of variance in the obtained total score that is due to variability in the true score (what you intend to measure)
✅ r(XX) close to 1 (good, variance due to ME close to 0) vs. ❌ r(XX) close to 0 (see the simulation sketch below)
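A small simulation sketch of the reliability coefficient as a variance ratio, under the CTT assumptions listed earlier (true scores and errors are simulated, so the numbers are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate CTT: observed score = true score + random measurement error
n = 10_000
true_score = rng.normal(100, 15, n)        # what we intend to measure
error      = rng.normal(0, 5, n)           # unsystematic ME: mean 0, uncorrelated with true score
observed   = true_score + error

r_xx = true_score.var() / observed.var()   # reliability = true-score variance / total variance
print(round(r_xx, 2))                      # ~0.90 here: ME variance is small relative to the total
```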
Correlation Coefficient r = degree of linear relationship between 2 sets of scores obtained from the same persons / expresses consistency
Can be a reliability coefficient in some contexts (measuring consistency of a test)
r = 1 (perfect positive, e.g. 1 person on two different test occasions) / r = -1 (perfect negative) / r = 0.8 (strong, e.g. height and weight) / r = 0 (no correlation, e.g. weight and reaction time)
Variance = true score variance + ME variance (variability in large sample group)
Domain sampling model = a limited number of items represents a larger concept (domain)
Many tests from the same domain in the same sample => normal distribution and an estimate of the true score
CTT norms (⚠ less accurate than the continuous norming approach, which uses info from all available groups to construct norms)
Difference Score = subtract the "before" score from the "after" score to check for change
❌ reduced reliability (the shared true-score variance is subtracted out while the ME is not; both scores contribute ME variance, so the ME proportion grows; see the sketch below)
❌ ME-variance-source: item sampling / expensive / motivation fluctuations / when the tests are not parallel = underestimation of reliability
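A hedged sketch of the classic formula for the reliability of a difference score (assuming the two measures have equal SDs), which shows why difference scores tend to be unreliable: the correlated true-score variance drops out of the difference while both error variances remain.

```python
def diff_score_reliability(r11, r22, r12):
    """Reliability of D = X2 - X1, assuming equal SDs:
    r_DD = (mean of the two reliabilities - r12) / (1 - r12)."""
    return ((r11 + r22) / 2 - r12) / (1 - r12)

# Two fairly reliable tests that correlate highly with each other
print(round(diff_score_reliability(0.85, 0.85, 0.70), 2))  # ~0.50, much lower than 0.85
```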