PISA statistical methods - more detailed comments

In the Radio 4 documentary PISA - Global Education Tables Tested, broadcast on November 25th, a comment is made that the statistical issues are a bit too complex to go into. Here is a brief summary of my personal concerns. To get an idea of the strength of feeling about PISA's statistical methods, see for example an article in the Times Educational Supplement, and the response by the OECD.

The PISA methodology is complex and rather opaque, in spite of the substantial amount of material published in the technical reports. Briefly:

  1. Individual students only answer a minority of questions.
  2. Multiple ‘plausible values’ are then generated for all students assuming a particular statistical model, essentially estimating what might have happened if the student had answered all the questions.
  3. These ‘plausible values’ are then treated as if they are the results of complete surveys, and form the basis of national scores (and their uncertainties) and hence rankings in league tables.
  4. But the statistical model used to generate the ‘plausible scores’ is demonstrably inadequate – it does not fit the observed data.
  5. This means the variability in the plausible scores is underestimated, which in turn means the uncertainty in the national scores is underestimated, and hence the rankings are even less reliable than claimed.

Here's a little more detail on these steps.

1. Individual students only answer a minority of questions.

Svend Kreiner has calculated that in 2006, about half did not answer any reading questions at all, while "another 40 per cent of participating students were tested on just 14 of the 28 reading questions used in the assessment. So only approximately 10 per cent of the students who took part in Pisa were tested on all 28 reading questions."

2. Multiple ‘plausible values’ are then generated for all students assuming a particular statistical model

A simple Rasch model (PISA Technical Report, Chapter 9) is assumed, and five values for each student are generated at random from the 'posterior' distribution given the information available on that student. So for the half of students in 2006 who did not answer any reading questions, five 'plausible' reading scores are generated on the basis of their responses on other subjects.
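To make this step concrete, here is a minimal sketch of drawing 'plausible values' from the posterior of a simple Rasch model. This is my own illustration, not PISA's actual code: a standard-normal prior stands in for PISA's much richer conditioning model, and the posterior is approximated on a grid.

```python
import math
import random

def rasch_prob(theta, difficulty):
    """Rasch model: probability a student of ability theta answers
    an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def draw_plausible_values(responses, difficulties, n_draws=5, grid=None):
    """Draw plausible ability values from a discretised posterior.

    responses: list of 0/1 answers (may be empty, as for the half of
    students who saw no reading items); difficulties: matching item
    difficulties. A N(0,1) prior stands in for PISA's conditioning model.
    """
    if grid is None:
        grid = [i / 10.0 for i in range(-40, 41)]  # theta from -4 to 4
    weights = []
    for theta in grid:
        w = math.exp(-theta * theta / 2.0)  # unnormalised N(0,1) prior
        for x, b in zip(responses, difficulties):
            p = rasch_prob(theta, b)
            w *= p if x == 1 else (1.0 - p)
        weights.append(w)
    return random.choices(grid, weights=weights, k=n_draws)

# A student with no responses gets five draws from the prior alone --
# exactly the situation for reading scores of non-tested students.
pvs = draw_plausible_values([], [], n_draws=5)
```

Note that for a student with an empty response list the 'plausible values' are simply random draws from the conditioning model, which is why the adequacy of that model matters so much.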

3. These ‘plausible values’ are then treated as if they are the results of surveys with complete data on all students

The Technical Report is not clear about how the final country scores are derived, but the Data Analysis manual makes clear that these are based on the five plausible values generated for each student: they then use standard methods to inflate the sampling error to allow for the use of 'imputed' data.

“Secondly, PISA uses imputation methods, denoted plausible values, for reporting student performance. From a theoretical point of view, any analysis that involves student performance estimates should be analysed five times and results should be aggregated to obtain: (i) the final estimate; and (ii) the imputation error that will be combined with the sampling error in order to reflect the test unreliability on the standard error.

All results published in the OECD initial and thematic reports have been computed accordingly to these methodologies, which means that the reporting of a country mean estimate and its respective standard error requires the computation of 405 means as described in detail in the next sections.”
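The quoted procedure is essentially Rubin's rules for multiply-imputed data: average the five country means (one per set of plausible values), then add the between-set variance to the average sampling variance. A minimal sketch with illustrative numbers (not real PISA figures); the '405 means' quoted above presumably arise because each of the 5 plausible values is combined with the full-sample weight plus 80 replicate weights:

```python
import statistics

def combine_plausible_values(pv_means, sampling_vars):
    """Combine M country-mean estimates via Rubin's rules.

    pv_means: mean score computed separately from each set of
    plausible values; sampling_vars: sampling variance of each mean
    (in PISA, estimated from the replicate weights).
    """
    m = len(pv_means)
    final_estimate = sum(pv_means) / m
    within = sum(sampling_vars) / m              # average sampling variance
    between = statistics.variance(pv_means)      # imputation variance
    total_var = within + (1 + 1 / m) * between   # Rubin's total variance
    return final_estimate, total_var ** 0.5

# Illustrative values only:
est, se = combine_plausible_values(
    [503.1, 502.4, 504.0, 503.6, 502.9],  # one mean per plausible value
    [4.1, 4.0, 4.2, 4.1, 4.0],            # sampling variance of each mean
)
```

The key point for the argument that follows: the `between` term is the only place the imputation uncertainty enters, and it is computed from the spread of the plausible values themselves, so it is only as trustworthy as the model that generated them.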

There does seem to be some confusion in the PISA team about this - in my interview with Andreas Schleicher, I explicitly asked whether the country scores were based on the 'plausible values', and he appeared to deny that this was the case.

4. The statistical model used to generate the ‘plausible scores’ is demonstrably inadequate.

Analysis using imputed ('plausible') data is not inherently unsound, provided (as PISA do) the extra sampling error is taken into account. But the vital issue is that the adjustment for imputation is only valid if the model used to generate the plausible values can be considered 'true', in the sense that the generated values are reasonably 'plausible' assessments of what that student would have scored had they answered the questions.

A simple Rasch model is assumed by PISA, in which each question is assumed to have a common level of difficulty across all countries - questions with clear differences are weeded out as “dodgy”. But in a paper in Psychometrika, Kreiner has shown the existence of substantial ‘Differential Item Functioning’ (DIF) - i.e. questions have different difficulties in different countries - and concludes that “The evidence against the Rasch model is overwhelming.”
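The idea behind DIF can be illustrated with a toy check (my own sketch, far cruder than the formal Rasch analysis in Kreiner's paper): estimate each item's difficulty separately in two countries, centre the difficulties within each country so that overall ability differences cancel, and inspect the remaining per-item gaps. Under a common Rasch scale those gaps should be near zero.

```python
import math

def logit_difficulties(pct_correct):
    """Crude per-item difficulty: minus the logit of the proportion
    correct (ignores the ability distribution, unlike a real Rasch fit)."""
    return [-math.log(p / (1 - p)) for p in pct_correct]

def dif_check(pct_a, pct_b):
    """Per-item gap in relative difficulty between two countries.

    Centring within each country removes the overall ability
    difference; any remaining per-item gap indicates DIF.
    """
    da = logit_difficulties(pct_a)
    db = logit_difficulties(pct_b)
    ca = sum(da) / len(da)
    cb = sum(db) / len(db)
    return [(x - ca) - (y - cb) for x, y in zip(da, db)]

# Illustrative: item 2 is relatively much harder in country B
# (30% correct there vs 60% on country B's other items).
gaps = dif_check([0.8, 0.6, 0.7], [0.7, 0.3, 0.6])
```

A large negative gap for an item means it is relatively harder in the second country; in a DIF-free world all gaps would hover around zero apart from sampling noise.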

The existence of DIF is acknowledged by Adams (who heads the OECD analysis team), who says “The sample sizes in PISA are such that the fit of any scaling model, particularly a simple model like the Rasch model, will be rejected. PISA has taken the view that it is unreasonable to adopt a slavish devotion to tests of statistical significance concerning fit to a scaling model.” Kreiner disagrees, and argues that the effects are both statistically significant and practically important.

5. This means the variability in the plausible scores is underestimated

The crucial issue, in my view, is that since these 'plausible values' are generated from an over-simplified model, they will not represent plausible values as if the student really had answered all the questions. Kreiner says “The effect of using plausible values generated by a flawed model is unknown”.

[The next para was in the original blog, but I have revised my opinion since - see note below] I would be more confident than this, and would expect that the 'plausible values' will be ‘under-dispersed’, i.e. not show a reasonable variability. Hence the uncertainty about all the derived statistics, such as mean country scores, will be under-estimated, although the extent of this under-estimation is unknown. It is notable that PISA acknowledge the uncertainty about their rankings (although this is not very prominent in their main communications), but the extra variability due to the use of potentially-inappropriate plausible values will inevitably mean that the rankings would be even less reliable than claimed. That is the reason for my scepticism about PISA's detailed rankings.

Note added 30th November:

I acknowledge that plausible values derived from an incorrect model should, if analysed assuming that model, lead to exactly the same conclusions as if they had not been generated in the first place (and, say, a standard maximum likelihood analysis carried out). Which could make one ask - why generate plausible values in the first place? But in this case it is convenient for PISA to have ‘complete response’ data to apply their complex survey weighting schemes for their final analyses.

But this is the issue: it is unclear what effect generating a substantial amount of imputed data from a simplistic model will have, when those imputed data are then fed through additional analyses. So after more reflection I am not so confident that the PISA methods lead to an under-estimate of the uncertainty associated with the country scores: instead I agree with Svend Kreiner’s view that it is not possible to predict the effect of basing subsequent detailed analysis on plausible values from a flawed model.

Comments

Comment from Drhmorrison:

Why Michael Gove should follow India’s lead and detach himself from PISA

Just ahead of the publication of the PISA league table on 3rd December 2013, India has withdrawn from the list of countries which will feature in the tables. The Education Secretary, Michael Gove, on the other hand, seems determined to stick with PISA despite recent concerns - published in the Times Educational Supplement in July of this year - about the global league table. Mr Gove’s Department reiterated its support for PISA in a recently-aired Radio 4 programme entitled “PISA – Global Education Tables Tested.”

That programme illustrated the dangers inherent in critiquing PISA in exclusively statistical terms. Statistical modellers have made life too easy for PISA because they simply accept the PISA interpretation of the construct “ability.” It is only when the focus moves to measurement that the profound difficulties inherent in PISA come to the fore with greatest clarity.

Niels Bohr is ranked with Newton and Einstein as one of the greatest physicists of all time. The father of atomic physics taught that “unambiguous communication” is the hallmark of measurement in quantum physics. Importantly, Bohr traced measurement in quantum mechanics and measurement in psychology to a common source, which he referred to as “subject/object holism.” The physicist cannot have direct experience of the atom, just as the teacher cannot have direct experience of the child’s mind. The microworld manifests itself in the measuring instruments of the physicist just as mind is expressed in the child’s responses to test items. Both the physicist and the psychologist are forced to describe what is beyond direct experience using the language of everyday experience. Bohr demonstrated that measurement in quantum physics and in psychology share a common inescapable constraint, namely, that one cannot communicate unambiguously about measurement in either realm without factoring in the measuring instrument.

In Heisenberg’s words: “what we observe is not nature in itself but nature exposed to our form of questioning.” The lesson we learn from Bohr is that in all psychological measurement, the entity measured cannot be divorced from the measuring instrument. When this central tenet of measurement (in quantum physics or in psychology/education) is broken, nonsense always ensues. The so-called Rasch model, which produces the PISA ranks, offends against this central measurement principle, and therefore the ranks it generates are meaningless. According to Bohr, the entity measured and the measuring instrument cannot be meaningfully separated. According to PISA, they are entirely independent. Who are we to believe, Niels Bohr or Andreas Schleicher?

The following simple illustration will help make Bohr’s point. Suppose Einstein and a 16-year-old pupil both produce a perfect score on a GCSE mathematics paper. Surely to claim that the pupil has the same mathematical ability as Einstein is to communicate ambiguously? However, unambiguous communication can be restored if we simply take account of the measuring instrument and say, “Einstein and the pupil have the same mathematical ability relative to this particular GCSE paper.” Mathematical ability, indeed any ability, is not an intrinsic property of the individual; rather, it is a joint property of the individual and the measuring instrument. In short, ability isn’t a property of the person being measured; it’s a property of the interaction of the person with the measuring instrument. One is concerned with the between rather than the within. It is hard to imagine a starker contrast between Bohr’s teachings and the PISA approach to measurement. Critiques of PISA by statistical modellers, however, have missed this profound conceptual error entirely.

My bookshelves are groaning with books concerned with the wide-ranging debates around the notion of intelligence. All of these debates dissolve away when one eschews the twin notions that intelligence is either a property of the person or an ensemble property, in favour of the simple definition that intelligence is a property of the interaction between person and intelligence test. To say “John has an IQ of 104” is to communicate ambiguously. An ocean of ink has been spilt because intelligence researchers have missed the simple truth that intelligence is not something we have.

In closing, it is only when the PISA critique shifts from statistical modelling to measurement that the profound nature of PISA’s error becomes clear. PISA produces nonsense because it misconstrues entirely the nature of ability. I trust this essay will be a comfort to those who had the courage to remove India from PISA, and hope it will prompt a similar decision from Michael Gove.