PISA statistical methods - more detailed comments
In the Radio 4 documentary PISA - Global Education Tables Tested, broadcast on November 25th, a comment is made that the statistical issues are a bit complex to go into. Here is a brief summary of my personal concerns: to get an idea of the feelings about PISA statistical methods, see for example an article in the Times Educational Supplement, and the response by OECD.
The PISA methodology is complex and rather opaque, in spite of the substantial amount of material published in the technical reports. Briefly:
- Individual students only answer a minority of questions.
- Multiple ‘plausible values’ are then generated for all students assuming a particular statistical model, essentially estimating what might have happened if the student had answered all the questions.
- These ‘plausible values’ are then treated as if they are the results of complete surveys, and form the basis of national scores (and their uncertainties) and hence rankings in league tables.
- But the statistical model used to generate the ‘plausible scores’ is demonstrably inadequate – it does not fit the observed data.
- This means the variability in the plausible scores is underestimated, which in turn means the uncertainty in the national scores is underestimated, and hence the rankings are even less reliable than claimed.
Here's a little more detail on these steps.
1. Individual students only answer a minority of questions.
Svend Kreiner has calculated that in 2006, about half did not answer any reading questions at all, while "another 40 per cent of participating students were tested on just 14 of the 28 reading questions used in the assessment. So only approximately 10 per cent of the students who took part in Pisa were tested on all 28 reading questions."
2. Multiple ‘plausible values’ are then generated for all students assuming a particular statistical model
A simple Rasch model (PISA Technical Report , Chapter 9) is assumed, and five values for each student are generated at random from the 'posterior' distribution given the information available on that student. So for the half of students in 2006 who did not answer any reading questions, five 'plausible' reading scores are generated on the basis of their responses on other subjects.
3. These ‘plausible values’ are then treated as if they are the results of surveys with complete data on all students
The Technical Report is not clear about how the final country scores are derived, but the Data Analysis manual makes clear that these are based on the five plausible values generated for each student: they then use standard methods to inflate the sampling error to allow for the use of 'imputed' data.
“Secondly, PISA uses imputation methods, denoted plausible values, for reporting student performance. From a theoretical point of view, any analysis that involves student performance estimates should be analysed five times and results should be aggregated to obtain: (i) the final estimate; and (ii) the imputation error that will be combined with the sampling error in order to reflect the test unreliability on the standard error.
All results published in the OECD initial and thematic reports have been computed accordingly to these methodologies, which means that the reporting of a country mean estimate and its respective standard error requires the computation of 405 means as described in detail in the next sections.”
There does seem to be some confusion in the PISA team about this - in my interview with Andreas Schleicher, I explicitly asked whether the country scores were based on the 'plausible values', and he appeared to deny that this was the case.
4. The statistical model used to generate the ‘plausible scores’ is demonstrably inadequate.
Analysis using imputed ('plausible') data is not inherently unsound, provided (as PISA do) the extra sampling error is taken into account. But the vital issue is that the adjustment for imputation is only valid if the model used to generate the plausible values can be considered 'true', in the sense that the generated values are reasonably 'plausible' assessments of what that student would have scored had they answered the questions.
A simple Rasch model is assumed by PISA, in which questions are assumed to have a common level of difficulty across all countries - questions with clear differences are weeded out as “dodgy”. But in a paper in Psychometrika, Kreiner has shown the existence of substantial Differential Item Functioning” (DIF) - i.e. questions have different difficulty in different countries, and concludes that the “The evidence against the Rasch model is overwhelming.”
The existence of DIF is acknowledged by Adams (who heads the OECD analysis team), who says “The sample sizes in PISA are such that the fit of any scaling model, particularly a simple model like the Rasch model, will be rejected. PISA has taken the view that it is unreasonable to adopt a slavish devotion to tests of statistical significance concerning fit to a scaling model.”. Kreiner disagrees, and argues that the effects are both statistically significant and practically important.
5. This means the variability in the plausible scores is underestimated
The crucial issue, in my view, is that since these 'plausible values' are generated from an over-simplified model, they will not represent plausible values as if the student really had answered all the questions. Kreiner says “The effect of using plausible values generated by a flawed model is unknown”.
[The next para was in the original blog, but I have revised my opinion since - see note below] I would be more confident than this, and would expect that the 'plausible values' will be ‘under-dispersed’, ie not show a reasonable variability. Hence the uncertainty about all the derived statistics, such as mean country scores, will be under-estimated, although the extent of this under-estimation is unknown. It is notable that PISA acknowledge the uncertainty about their rankings (although this is not very prominent in their main communications), but the extra variability due to the use of potentally-inappropriate plausible values will inevitably mean that the rankings would be even less reliable than claimed. That is the reason for my scepticism about PISA's detailed rankings.
Note added 30th November:
I acknowledge that plausible values derived from an incorrect model should, if analysed assuming that model, lead to exactly the same conclusions than if they had not been generated in the first place (and, say, a standard maximum likelihood analysis carried out). Which could make one ask - why generate plausible values in the first place? But in this case it is convenient for PISA to have ‘complete response’ data to apply their complex survey weighting schemes for their final analyses.
But this is the issue: it is unclear what effect generating a substantial amount of imputed data from a simplistic model will have, when those imputed data are then fed through additional analyses. So after more reflection I am not so confident that the PISA methods lead to an under-estimate of the uncertainty associated with the country scores: instead I agree with Svend Kreiner’s view that it is not possible to predict the effect of basing subsequent detailed analysis on plausible values from a flawed model.