Measuring test measurement error: A general approach

Donald Boyd; Hamilton Lankford; Susanna Loeb; James Wyckoff

Author/s:

Donald Boyd, Hamilton Lankford, Susanna Loeb, James Wyckoff

Year of Publication:

2013

Publication:

Journal of Educational and Behavioral Statistics

Volume/Issue:

38(6)

Pages:

629-663

Test-based accountability, value-added assessments, and much experimental and quasi-experimental research in education rely on achievement tests to measure student skills and knowledge. Yet we know little about fundamental properties of these tests, an important example being the extent of test measurement error and its implications for educational policy and practice. While test vendors provide estimates of split-test reliability, these measures do not account for potentially important day-to-day differences in student performance. In this paper, we demonstrate a credible, low-cost approach for estimating the overall extent of measurement error that can be applied when students take three or more tests in the subject of interest (e.g., state assessments in consecutive grades). Our method generalizes the test-retest framework by allowing for i) growth or decay in knowledge and skills between tests, ii) tests being neither parallel nor vertically scaled, and iii) the degree of measurement error varying across tests. The approach maintains relatively unrestrictive, testable assumptions regarding the structure of student achievement growth. Estimation requires only descriptive statistics (e.g., test-score correlations). With student-level data, the extent and pattern of measurement error heteroskedasticity can also be estimated. In turn, one can compute Bayesian posterior means of achievement and achievement gains given observed scores; these estimators have statistical properties superior to those of the observed score (or score gain). We employ math and ELA test-score data from New York City to demonstrate these methods and estimate that the overall extent of test measurement error is at least twice as large as that reported by the test vendor.
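The abstract does not spell out the estimator, so the following Python sketch is only an illustration of why three test-score correlations can be enough, not the authors' method. It assumes a much more restrictive special case than the paper's: true achievement follows a first-order (Markov) process across the three tests and measurement errors are mutually uncorrelated. Under those assumptions the reliability of the middle test equals r12 * r23 / r13. The shrinkage function assumes normally distributed achievement and error, in which case the posterior mean pulls the observed score toward the population mean in proportion to the reliability. All function names, parameter values, and the simulated data are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def middle_test_reliability(r12, r23, r13):
    """Reliability of test 2 from three correlations.

    Assumption (illustrative, not from the paper): true scores follow a
    first-order process and measurement errors are uncorrelated.
    """
    return r12 * r23 / r13


def posterior_mean(score, pop_mean, reliability):
    """Shrink an observed score toward the population mean.

    Assumption: achievement and measurement error are both normal.
    """
    return pop_mean + reliability * (score - pop_mean)


def simulate_scores(n=50_000, err_sd=0.5, seed=0):
    """Simulate three noisy tests of a slowly evolving true achievement."""
    rng = random.Random(seed)
    y1, y2, y3 = [], [], []
    for _ in range(n):
        t1 = rng.gauss(0.0, 1.0)
        t2 = 0.9 * t1 + rng.gauss(0.0, 0.4)  # achievement carries over, with a shock
        t3 = 0.9 * t2 + rng.gauss(0.0, 0.4)
        y1.append(t1 + rng.gauss(0.0, err_sd))  # observed score = truth + error
        y2.append(t2 + rng.gauss(0.0, err_sd))
        y3.append(t3 + rng.gauss(0.0, err_sd))
    return y1, y2, y3


y1, y2, y3 = simulate_scores()
estimated = middle_test_reliability(pearson(y1, y2), pearson(y2, y3), pearson(y1, y3))
# Reliability implied by the simulation design: var(t2) / (var(t2) + err_sd**2)
true_var_t2 = 0.9 ** 2 * 1.0 + 0.4 ** 2
implied = true_var_t2 / (true_var_t2 + 0.5 ** 2)
print(f"estimated reliability: {estimated:.3f}, implied by design: {implied:.3f}")
```

The simulation only checks that the identity recovers the reliability built into the synthetic data. The paper's approach relaxes these assumptions, for example by letting the degree of measurement error vary across tests and students.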