1. RELIABILITY

CREATE IT!

ASSESS IT! (Generalizability Theory)

A) Classical Test Theory (CTT)

Obtained total Score =
True Score (consistency factors) +
Measurement Error (ME / inconsistency factors)
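
A minimal Python sketch of this decomposition (all numbers simulated, purely illustrative): observed scores are built as true score plus random error, and the reliability coefficient defined later under GENERAL TERMS (true-score variance / total variance) can be recovered from the simulated data.

# Sketch of the CTT decomposition X = T + E (simulated data, illustrative only)
import numpy as np

rng = np.random.default_rng(0)
n = 1000
true = rng.normal(50, 10, n)          # true scores T (consistency factors)
error = rng.normal(0, 5, n)           # unsystematic measurement error E, mean 0
observed = true + error               # obtained total score X = T + E

# Reliability coefficient r(XX) = true-score variance / total score variance
r_xx = true.var() / observed.var()
print(round(r_xx, 2))                 # approx. 10^2 / (10^2 + 5^2) = 0.80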

ME-Sources

Unsystematic:
Item selection / Test administration / Test scoring (subjectivity)

Systematic: Consistently wrong (validity problem)

Assumptions

Unsystematic MEs = random / Mean ME = 0 / True scores and unsystematic MEs = uncorrelated / True score = constant / Distribution of MEs the same for each participant / MEs not correlated with MEs from other tests / Minimal or absent systematic MEs / similar abilities

Difficulty level of item = proportion of examinees who pass it

❌ Sample dependent / assumes MEs are identical across trait levels / better suited to the whole scale range than to a specific ability level / hard to generalize over many groups / only one reliability and item set per trait range, does not adapt
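
A short sketch of the CTT difficulty index defined above (proportion of examinees who pass the item) and of the "sample dependent" point: the 0/1 response matrix is made up for illustration.

# CTT item difficulty = proportion of examinees who pass the item (made-up data)
import numpy as np

# rows = examinees, columns = items (1 = correct, 0 = wrong)
responses = np.array([[1, 1, 0],
                      [1, 0, 0],
                      [1, 1, 1],
                      [0, 1, 0]])
p = responses.mean(axis=0)   # difficulty index p per item
print(p)                     # [0.75 0.75 0.25]; a higher-ability sample would yield higher p values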

B) Item Response Theory (IRT)

Item Response Function (IRF) / Item Characteristic Curve (ICC) =
relation between a participant's level of the latent trait and the probability that they give the designated response to a test item measuring that construct

✅ Easier / fewer assumptions / good for ranking + smaller groups / longer tests are more reliable

Assumptions

Each Item has own IRF/ICC! (Each trait level has own ME)

Difficulty level (b) =
how much of the trait is needed to answer the item correctly, expressed on the trait scale θ (the θ at which a correct answer becomes likely)

X-axis: Trait level (theta)
Y-axis: Probability of correct answer

Item discrimination parameter (a) =
how well the item differentiates between people around a specific trait level (steeper slope = better discrimination)

True score (θ): higher θ = more of the trait

Often administered by computer (adaptive): if you pass easy items, the model presents harder ones, and vice versa

Unidimensionality / Local independence / Timeless / Discrimination parameters same for all items (in the Rasch model)

✅ Scores more reliable for average-ability people / shorter tests can be more reliable (good match between θ and difficulty) / adaptive / if the IRF is known, you can estimate θ / item-parameter invariance (IRFs do not depend on particular population characteristics or ability levels; test scores are meaningful in a relative sense)

❌ complex software / extremes of the ICC carry little information / less reliable for people with very high or very low ability
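
A sketch of a 2-parameter logistic ICC using the difficulty (b) and discrimination (a) parameters described above; the parameter values are arbitrary examples, not from any real item bank.

# 2PL item characteristic curve: P(correct | theta) for arbitrary example items
import numpy as np

def icc_2pl(theta, a, b):
    # a = discrimination (slope), b = difficulty (theta at which P = .50)
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)               # x-axis: trait level
print(icc_2pl(theta, a=1.0, b=0.0))         # moderate item, centred on theta = 0
print(icc_2pl(theta, a=2.0, b=1.0))         # harder, more discriminating (steeper) item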

Test Information Functions =
statistical information corresponding to each score (e.g. item A is useful for testing people with low θ) / most info in the middle of the ability range (where θ = difficulty b)

Scale Information Functions = add info from diff. scale levels together (shows reliability of test on diff θ -levels)
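
A sketch of item and test (scale) information built on the 2PL curve sketched above: item information is a^2 * P * (1 - P), which peaks where θ equals the difficulty b, and test information is the sum over items. The item parameters are invented.

# Item and test information for a 2PL model (invented item parameters)
import numpy as np

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_info(theta, a, b):
    p = p_2pl(theta, a, b)
    return a**2 * p * (1.0 - p)          # maximal where theta = b

theta = np.linspace(-3, 3, 61)
items = [(1.0, -1.0), (1.5, 0.0), (0.8, 1.5)]     # (a, b) pairs
test_info = sum(item_info(theta, a, b) for a, b in items)
print(theta[np.argmax(test_info)])       # theta level at which the test is most informative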

Models

1-Parameter (Rasch) = probability of respondent with θ correctly responding to item with difficulty b

2-Parameter = adds item discrimination index

3-Parameter = adds guessing
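
A sketch of how the three models nest: the 3PL adds a guessing floor c, the 2PL is the special case c = 0, and the 1PL (Rasch) additionally fixes the discrimination a at a common value. All parameter values are examples.

# 3PL model with 2PL and 1PL (Rasch) as special cases (example parameters)
import numpy as np

def p_3pl(theta, b, a=1.0, c=0.0):
    # c = guessing parameter (lower asymptote), a = discrimination, b = difficulty
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = 0.0
print(p_3pl(theta, b=0.5))                  # 1PL / Rasch: common a, no guessing
print(p_3pl(theta, b=0.5, a=1.7))           # 2PL: item-specific discrimination
print(p_3pl(theta, b=0.5, a=1.7, c=0.25))   # 3PL: adds guessing (e.g. 4-option multiple choice)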

...as temporal consistency

Test-Retest =
Administer test 2x to same people, compute correlation r

❌ ME-variance-source: changes over time (carryover effect, practice effect) / overestimation of reliability possible / systematic errors have no influence on score variability / motivation fluctuations

✅ reliable if: high r between the two administrations (can predict 2nd score from 1st score) / measuring constant variables (abilities / personality)
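
A minimal sketch of the test-retest procedure: correlate the two administrations with Pearson's r (scores invented).

# Test-retest reliability = Pearson r between two administrations (invented scores)
import numpy as np

time1 = np.array([12, 15, 19, 22, 25, 30])
time2 = np.array([14, 14, 20, 23, 27, 29])
r_tt = np.corrcoef(time1, time2)[0, 1]
print(round(r_tt, 2))        # high r: 2nd score predictable from 1st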

Alternate Forms = 2 forms of the same test (matched in difficulty & content) given to the same people, compute correlation r

Immediate

Delayed

❌ ME-variance-source: same as immediate + changes over time

...as internal consistency

Split-Half = Compute correlation r of two halves from one test

Odd-even system (only if items are ordered according to difficulty level)

✅ reliable if: the two halves show a strong correlation r (this r describes a test of only half the length; two administrations should also show high r) / step up with the Spearman-Brown formula (only if the two halves have the same SD) / takes the increasing difficulty level into account

❌ ME-variance-source: internal consistency / item sampling / nature of split
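
A sketch of an odd-even split-half estimate with the Spearman-Brown step-up r_full = 2*r_half / (1 + r_half); the item response matrix is made up.

# Odd-even split-half reliability with Spearman-Brown correction (made-up data)
import numpy as np

rng = np.random.default_rng(1)
ability = rng.normal(0, 1, 200)
items = ability[:, None] + rng.normal(0, 1, (200, 10))   # 10 noisy items per person

odd = items[:, 0::2].sum(axis=1)       # half 1: items 1, 3, 5, ...
even = items[:, 1::2].sum(axis=1)      # half 2: items 2, 4, 6, ...
r_half = np.corrcoef(odd, even)[0, 1]
r_full = 2 * r_half / (1 + r_half)     # Spearman-Brown: reliability of the full-length test
print(round(r_half, 2), round(r_full, 2))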

Coefficient Alpha (Cronbach's Alpha) = Mean of all possible split-half coefficients, corrected by Spearman-Brown-Formula (otherwise: underestimation of r)

Index of interrelatedness of individual items, internal consistency, low uniqueness (NOT unidimensionality = you can measure 2 diff. factors and still have high alpha)

Assumptions: unidimensionality / factorial purity / for continuous items (e.g. rating scale)

Kuder-Richardson Formula 20 (KR-20) = similar to Cronbach's alpha, but for dichotomous items

⚠ large scales have high alpha (many items) / adequate reliability is decision-dependent (context) / not a validity measure (only says how correlated the halves are) / only useful for factorially simple tests / lower bound of test reliability (only equals reliability if items are tau-equivalent) / says nothing about error sources

❌ ME-variance-source: item sampling / test heterogeneity (e.g. IQ)

✅ good estimate of split-half reliability / less susceptible to randomness / not dependent on split / more general than KR20
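
A sketch of Cronbach's alpha, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score); with 0/1 items the item variances reduce to p*(1-p), which gives the KR-20 case. All data are invented.

# Cronbach's alpha (and KR-20 as the dichotomous special case); invented data
import numpy as np

def cronbach_alpha(items):
    # items: 2D array, rows = persons, columns = items
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

rng = np.random.default_rng(2)
trait = rng.normal(0, 1, 300)
likert = trait[:, None] + rng.normal(0, 1, (300, 6))                       # 6 continuous (scale) items
binary = (trait[:, None] + rng.normal(0, 1, (300, 6)) > 0).astype(float)   # 6 dichotomous items

print(round(cronbach_alpha(likert), 2))   # coefficient alpha
print(round(cronbach_alpha(binary), 2))   # ≈ KR-20 for 0/1 items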

Alpha "precision" = SE of the inter-item correlation (large SE: symptom of multidimensionality, not proof) / also decreases with more items in the test

❗ E.g. alpha = 0.8 with mean inter-item corr. = 0.57 (3-item scale) OR 0.28 (10-item scale)

❌ KR-20: need to know the proportion of people who answered each item correctly, but many items do not have correct/wrong options

Types

Standardized Item Alpha (average inter-item correlation, stepped up with the Spearman-Brown formula) - larger than Cronbach's alpha

✅ if item standard scores are summed to form scale scores / ❌ if you calculate with simple raw scores (does not take differing SDs into account)

Flanagan & Rulon formulas = take differing SDs into account (unlike Spearman-Brown)
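
A sketch of the standardized item alpha, i.e. the mean inter-item correlation stepped up with the general Spearman-Brown formula alpha_std = k*r_mean / (1 + (k-1)*r_mean); it reproduces the warning example above (alpha of about .80 from r_mean = .57 with 3 items or r_mean = .28 with 10 items).

# Standardized item alpha = Spearman-Brown step-up of the mean inter-item correlation
def standardized_alpha(mean_r, k):
    return k * mean_r / (1 + (k - 1) * mean_r)

print(round(standardized_alpha(0.57, 3), 2))    # ~0.80 from a 3-item scale
print(round(standardized_alpha(0.28, 10), 2))   # ~0.80 from a 10-item scale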

Interrater (Kappa) =
How much 2+ raters agree on an item; correlate their answers

⚠ supplement

❌ ME-variance-source: scorer diff. (subjectivity)

Kappa:

<0.6 (bad) / 0.6-0.8 (sufficient) / >0.8 (good)
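
A sketch of Cohen's kappa for two raters, kappa = (p_observed - p_expected) / (1 - p_expected); the ratings are invented.

# Cohen's kappa for two raters (invented categorical ratings)
import numpy as np

rater1 = np.array(["yes", "yes", "no", "no", "yes", "no", "yes", "no"])
rater2 = np.array(["yes", "no",  "no", "no", "yes", "no", "yes", "yes"])

p_obs = np.mean(rater1 == rater2)                       # observed agreement
cats = np.union1d(rater1, rater2)
p_exp = sum(np.mean(rater1 == c) * np.mean(rater2 == c) for c in cats)   # chance agreement
kappa = (p_obs - p_exp) / (1 - p_exp)
print(round(kappa, 2))                                  # compare against the <0.6 / 0.6-0.8 / >0.8 bands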

INTERPRET IT!

Reliability r(XX) > .90 = very accurate for individual diff. (but lower values such as .70 have to prove their usefulness too)

GENERAL TERMS

Context-dependent

E.g. Criterion-referenced Tests = performance assessed as mastery (pass/fail) / low variability among people / traditional reliability measures inappropriate / classification must be (nearly) perfectly reliable (classification error)

Standard Error of Measurement (SEM): a larger SEM implies greater typical ME and lower reliability / helps estimate the confidence range within which a score will lie

Standard Error of Difference: compares 2 abilities in 1 person (e.g. non-verbal and verbal test parts) / diff > 2 × SE of the difference = significant
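
A sketch of the standard error of measurement, SEM = SD * sqrt(1 - r(XX)), and of the standard error of the difference between two scores, SE_diff = sqrt(SEM_1^2 + SEM_2^2); the SDs, reliabilities, and scores are example values only.

# Standard error of measurement and standard error of the difference (example values)
import math

def sem(sd, r_xx):
    return sd * math.sqrt(1 - r_xx)            # larger SEM -> lower reliability

verbal_sem = sem(sd=15, r_xx=0.91)             # e.g. verbal test part
nonverbal_sem = sem(sd=15, r_xx=0.88)          # e.g. non-verbal test part
se_diff = math.sqrt(verbal_sem**2 + nonverbal_sem**2)

score, other = 116, 101
print(round(verbal_sem, 1), round(se_diff, 1))
print(abs(score - other) > 2 * se_diff)        # rule of thumb: diff > 2 x SE_diff = significant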

❗ Special Circumstances

Measuring unstable characteristics (e.g. emotional reactivity)

Speed Tests = reflect speed of performance (split-half would be spuriously high; rather use test-retest or 2 separately timed split-halves)

Power Tests = Assess trait/ability (enough time)

Restriction of Range: the range of values has been shortened (e.g. due to a selective subsample) - the correlation is lower than when considering the whole sample (e.g. giving an intelligence test twice to college students)

Unidimensionality = measurement of one factor / psychological dimension / trait / construct / attribute / skill / ability

Homogeneity = internal consistency as a necessary (but not sufficient) condition for unidimensionality

⚠ A test can have only moderately homogeneous items and still measure one underlying construct

Reliability Coefficient r(XX) = ratio of true-score variance to total score variance (true score + ME variance) / proportion of variance in the obtained total score that is due to variability in the true score (what you intend to measure)

✅ r(XX) close to 1 (good, variance due to ME close to 0) vs. ❌ r(XX) close to 0

Correlation Coefficient r = degree of linear relationship between 2 sets of scores obtained from same person / expresses consistency

Can be reliability coefficient in some contexts (measuring consistency in test)

r = 1 (perfect positive, e.g. 1 person on two different test occasions) / r = -1 (perfect negative) / r = 0.8 (strong, e.g. height and weight) / r = 0 (no corr., e.g. weight and reaction time)

Variance = true score variance + ME variance (variability in large sample group)

Domain sampling model = a limited number of items represents a larger concept/domain

Many tests from the same domain given to the same sample => normal distribution of scores and an estimate of the true score

CTT norms (⚠ less accurate than the continuous norming approach, which uses info from all available groups to construct norms)

Difference Score = Subtract "before-score" from "after-score" to check for change

❌ underestimation of reliability (shared true-score variance is subtracted out while the ME variance of both scores remains, so ME makes up a larger share of the difference score)

❌ ME-variance-source: item sampling / expensive / motivation fluctuations / when tests are not parallel = underestimation of reliability
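
A sketch of the standard equal-SD formula for the reliability of a difference score, r(DD) = ((r(XX) + r(YY))/2 - r(XY)) / (1 - r(XY)): the more the before- and after-scores correlate, the less reliable their difference. The reliability and correlation values are examples.

# Reliability of a difference score (equal-SD case; example values)
def diff_reliability(r_xx, r_yy, r_xy):
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)

print(round(diff_reliability(0.90, 0.90, 0.50), 2))   # 0.80: still usable
print(round(diff_reliability(0.90, 0.90, 0.80), 2))   # 0.50: high before-after correlation hurts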