Reliability
What is Reliability?
Reliability refers to the consistency and stability of a test score. If you step on a scale and it reads 70kg, then step off and on again and it reads 85kg, the scale is unreliable. Weight is a perfectly valid thing to measure; the tool measuring it is broken.
In IQ testing, reliability is crucial. If you take an IQ test on Monday and score 130, and then take it again on Friday and score 100, the test is useless.
Methods of Measuring Reliability
Psychometricians use statistical methods to ensure tests are stable:
- Test-Retest Reliability: The gold standard. A group takes the test, waits a few weeks, and takes it again. The correlation between the two scores should be extremely high (above 0.90 for good tests).
- Internal Consistency (Split-Half): If you split the test in half (e.g., odd questions vs. even questions), your score on both halves should be roughly the same. This proves the test is measuring a single, consistent trait.
- Inter-Rater Reliability: If two different psychologists score your test, do they get the same result? For multiple-choice tests, agreement is essentially perfect because scoring is objective. For tests involving verbal definitions (“Define ‘Integrity’”), subjective scoring can lower reliability.
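The split-half approach above can be sketched in a few lines. One detail the bullet glosses over: correlating two half-tests understates the reliability of the full-length test, so psychometricians apply the Spearman-Brown correction. This is a minimal illustration with a hypothetical response matrix, not any test's actual scoring code:

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """Correlate odd-item vs. even-item half-scores, then apply the
    Spearman-Brown correction to estimate full-length reliability."""
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    r_half = pearson_r(odd, even)
    return 2 * r_half / (1 + r_half)  # Spearman-Brown correction

# Hypothetical responses: one row per person, 1 = correct, 0 = incorrect
responses = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
print(round(split_half_reliability(responses), 2))  # → 0.67
```

A short, made-up test like this lands well below the 0.90 bar; real instruments achieve high coefficients partly by having many more items.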
Sources of Error
Even the best tests (like the WAIS-IV) aren’t perfect. Reliability can be affected by “Measurement Error”:
- Internal State: Sleep deprivation, anxiety, or illness can temporarily lower a score.
- Environment: A noisy room or distracting proctor.
- Guessing: Multiple-choice formats introduce a small element of luck.
The Standard Error of Measurement (SEM)
Because no test is 100% reliable, psychologists quantify the expected fluctuation as the Standard Error of Measurement (SEM) and report scores with a Confidence Interval.
- Instead of saying “Your IQ is 120,” a report might say “We are 95% confident your IQ falls between 115 and 125.”
- This range accounts for the slight unreliability inherent in any human measurement.
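The link between reliability and the width of that range is the classical formula SEM = SD × √(1 − reliability), which the text does not state but which is standard in psychometrics. A minimal sketch, assuming the conventional IQ standard deviation of 15:

```python
import math

IQ_SD = 15  # conventional standard deviation of IQ scores

def standard_error(reliability, sd=IQ_SD):
    """Classical formula: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(score, reliability, z=1.96, sd=IQ_SD):
    """Band around an observed score; z = 1.96 gives ~95% coverage."""
    e = z * standard_error(reliability, sd)
    return (score - e, score + e)

lo, hi = confidence_interval(120, reliability=0.98)
print(f"IQ 120 at r = 0.98: 95% CI {lo:.1f}-{hi:.1f}")  # ≈ 115.8-124.2
```

Note how quickly the band widens: at a reliability of 0.70 the same formula gives an SEM of about 8 points, so the 95% interval around 120 spans roughly 104 to 136.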
Reliability vs. Validity
It is possible for a test to be reliable but not valid.
- Example: A stopped clock is perfectly consistent (it shows exactly the same time every time you look), but it is not valid for telling time.
Reliability Coefficients: Reading the Numbers
Reliability is expressed as a correlation coefficient ranging from 0.0 (completely random) to 1.0 (perfectly consistent). Understanding these numbers helps you judge whether a test is worth trusting:
- 0.90 and above: Excellent reliability. This is the standard for high-stakes clinical and legal decisions. The WAIS-IV achieves composite reliabilities of 0.97–0.98, making it one of the most statistically dependable instruments in psychology.
- 0.80–0.89: Good reliability. Acceptable for most educational and research purposes, but introduces a meaningful margin of error for individual decisions.
- 0.70–0.79: Adequate for group-level research. Too imprecise for important individual decisions (college admissions, clinical diagnosis).
- Below 0.70: Poor reliability. The test introduces more error than insight. Many freely available “online IQ tests” fall in this range or lower.
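The bands above translate directly into a simple lookup. This is just a restatement of the list as code, with hypothetical wording for the labels:

```python
def reliability_rating(r):
    """Map a reliability coefficient onto the interpretive bands above."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("a reliability coefficient lies between 0.0 and 1.0")
    if r >= 0.90:
        return "excellent: suitable for high-stakes individual decisions"
    if r >= 0.80:
        return "good: acceptable for most educational and research use"
    if r >= 0.70:
        return "adequate: group-level research only"
    return "poor: more error than insight"

print(reliability_rating(0.97))  # WAIS-IV composite territory
```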
Why IQ Scores Can Fluctuate — and How Much
A common experience is taking an IQ test and then retaking it years later with a different score. Understanding reliability explains why this happens and what it means.
For the WAIS-IV, the Standard Error of Measurement (SEM) for the Full Scale IQ is approximately 2.16 points. This means:
- If your true IQ is 120, there is a 68% chance that any single test administration will produce a score between 117.8 and 122.2.
- There is a 95% chance the score will fall between 115.8 and 124.2.
This is a small margin, reflecting the WAIS-IV’s excellent reliability. By contrast, many internet IQ tests have SEMs of 10–15 points or more, meaning a “score” of 130 might reflect anything from 115 to 145.
The practical implication: no single test score should ever be treated as an exact, permanent label. It is an estimate within a range. Psychologists are trained to interpret scores within their confidence intervals rather than as precise measurements.
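The intervals quoted above follow directly from the SEM: roughly 68% of observed scores fall within one SEM of the true score, and roughly 95% within 1.96 SEMs. A quick check of the WAIS-IV figures from the text, plus a comparison with a hypothetical online test:

```python
def score_band(score, sem, z=1.0):
    """Interval of +/- z standard errors around a score."""
    return (score - z * sem, score + z * sem)

WAIS_SEM = 2.16  # Full Scale IQ figure quoted in the text

lo68, hi68 = score_band(120, WAIS_SEM)          # ~68% band (z = 1)
lo95, hi95 = score_band(120, WAIS_SEM, z=1.96)  # ~95% band
print(f"68%: {lo68:.1f}-{hi68:.1f}")  # → 117.8-122.2
print(f"95%: {lo95:.1f}-{hi95:.1f}")  # → 115.8-124.2

# A hypothetical online test with SEM = 15: one SEM either side of a
# reported 130 already spans 115-145
print(score_band(130, 15))
```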
The “Practice Effect” Problem with Test-Retest
When measuring test-retest reliability, researchers must account for the practice effect — the tendency for scores to rise simply because the test-taker has been exposed to the same material before. This is why reliability studies use a delay of at least two to four weeks between administrations, sometimes longer.
The practice effect also explains why you shouldn’t retake an IQ test soon after your first attempt expecting to get a “real” score: any improvement largely reflects familiarity with the test format rather than a genuine gain in ability. For this reason, many clinical guidelines specify minimum intervals (6–12 months for children) before re-administration.
Cronbach’s Alpha: The Internal Consistency Standard
The most widely used measure of internal consistency is Cronbach’s Alpha (α), which measures how well all the items in a test scale “hang together” — i.e., how consistently they measure the same underlying trait.
- Alpha ranges from 0 to 1.
- For intelligence subtests, alphas of 0.85–0.95 are typical in well-constructed tests.
- Low alpha on a subtest (below 0.75) suggests that the items are measuring different things — a warning sign that the subtest score is unreliable.
On the WAIS-IV, the Verbal Comprehension Index achieves an alpha of approximately 0.96, and the Full Scale IQ composite reaches 0.98 — among the highest internal consistency values of any widely used psychological instrument.
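Cronbach's alpha has a closed-form definition: α = k/(k−1) × (1 − Σ item variances / variance of total scores), where k is the number of items. A minimal sketch with a hypothetical response matrix (rows = people, columns = items), not data from any real instrument:

```python
import statistics

def cronbach_alpha(item_scores):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(totals))."""
    k = len(item_scores[0])
    columns = list(zip(*item_scores))  # transpose to per-item columns
    item_var = sum(statistics.pvariance(c) for c in columns)
    total_var = statistics.pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical 5 people x 6 items (1 = correct, 0 = incorrect)
responses = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
print(round(cronbach_alpha(responses), 2))  # → 0.69
```

Intuitively, alpha is high when people's total-score variance dwarfs the sum of the individual item variances, which happens only when the items rise and fall together, i.e. measure the same trait.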
Conclusion: The Foundation of Trustworthy Testing
Reliability is not a glamorous concept, but it is the bedrock on which all meaningful psychological measurement rests. Without it, a test cannot tell you anything useful about a person — regardless of how impressive the theory behind it may be. Before trusting any IQ score, the first question to ask is not “What does it mean?” but “How reliably was it measured?”