Data overload?

The government announced last week that it would be greatly expanding the amount of data which it shares with the rest of us. Its white paper spells out the detailed principles of the new approach, and there is much in it to commend. It addresses many of the hideous features of government data at the moment, such as departments' habit of publishing in proprietary formats (usually Excel); the fact that data cannot necessarily be re-used without obtaining explicit permission; and the lack of coherence between different datasets on essentially the same topic.

Proposed changes include the adoption of a five-star scheme for the usefulness of data presentation, open-access provisions, and agreed inter-departmental standards. The thousands of new files will be published on a central website, data.gov.uk. If fully implemented, this newly unleashed mass of information could be enormously powerful, bringing gains that are both economic (£33 billion a year, according to the report) and democratic.

A little knowledge...

With great power, however, comes great responsibility, and there are many dangers. The specific example cited in most newspaper articles is the release of detailed information on the success of individual GPs at treating diseases such as cancer. This ties in with the philosophy of patient choice, which has been taken up both by this government and the last.

The trouble is that there are a lot of GPs, and many of them won't have all that many cancer patients. Let's suppose that 80% of patients with a particular disease have a good outcome, such as survival after 5 years. Even if all the GPs are equally good at treating their patients, and the patients they see are essentially the same, some will be unlucky and less than 80% of their patients will survive; in some cases a lot less. In reality GPs in different areas will have sets of patients which vary wildly in their demographic make-up, so the diversity will be even greater than could be explained by a simple binomial distribution.
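
To get a feel for how large that purely binomial spread can be, here is a minimal simulation sketch in Python; the figures (1,000 GPs, 25 patients each, a true survival rate of 80% for every one of them) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
n_gps, n_patients, p = 1000, 25, 0.80

# Survivors per GP when outcomes are pure binomial chance and
# every GP is identically good
survived = rng.binomial(n_patients, p, size=n_gps)
rates = survived / n_patients

print(f"True survival rate: {p:.0%}")
print(f"Observed rates range from {rates.min():.0%} to {rates.max():.0%}")
print(f"Share of GPs below 70%: {(rates < 0.70).mean():.1%}")
```

With only 25 patients each, roughly one in ten of these identical GPs will record a survival rate below 70%, and a handful may fall well below that, despite there being nothing whatever to choose between them.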

Our brains are designed to find patterns even where none exist, so such detailed data are ripe for over-interpretation. We've seen this before with bowel cancer rates broken down by local authority: some areas superficially seemed to have very low or very high rates of bowel cancer, but after accounting for sampling variation, the differences turned out in many cases not to be significant. The consequences of failing to understand uncertainty might be unnecessary worries about the competence of a local doctor or hospital, the 'discovery' of non-existent cancer or crime hotspots, or possibly much worse.
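
To make that concrete, here is a hedged illustration with entirely made-up numbers: suppose an area of 10,000 people records 12 cases of a disease where the national rate would predict 7. An exact binomial test (here via scipy) asks whether such an excess could plausibly arise from sampling variation alone:

```python
from scipy.stats import binomtest

population = 10_000           # hypothetical area population
national_rate = 7 / 10_000    # hypothetical national incidence
observed_cases = 12

# Two-sided test: how surprising is a deviation this large if the
# area's true rate actually matches the national one?
result = binomtest(observed_cases, population, national_rate)
print(f"p-value: {result.pvalue:.3f}")
```

If the p-value comes out well above the conventional thresholds, the apparent 'hotspot' is indistinguishable from chance, and a league table highlighting it would be highlighting noise.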

A statistical call to arms

I see this as a challenge for the statistically literate: we have to ensure that naive and misleading interpretations of data are not allowed to predominate. Blogs like this one are part of the solution, but may only preach to the choir. Those politicians and journalists who produce attention-grabbing soundbites and headlines at the expense of statistical sense must be corrected, whether by a polite letter or a public dressing-down from the statistics watchdog. And we can all bring the most widely propagated myths to the attention of anyone we happen to be discussing politics with.

The government itself could help: why not produce a funnel plot to accompany the GP data, thus pre-empting some of the most ridiculous statistical fallacies? A brief analysis of how survival rates vary by region and by the age profile of the population would also be a simple addition. The ONS is already careful to point out which data should not be compared over time because of changing definitions.
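
A funnel plot of this sort takes only a few lines of code. The sketch below uses simulated data throughout; in practice the patient counts and observed rates would come from the published GP figures, with control limits drawn from the usual normal approximation to the binomial:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
p = 0.80                                # overall survival rate (simulated)
n = rng.integers(10, 200, size=500)     # patients per GP (simulated)
obs = rng.binomial(n, p) / n            # each GP's observed survival rate

ns = np.arange(10, 201)
se = np.sqrt(p * (1 - p) / ns)          # binomial standard error

plt.scatter(n, obs, s=10, alpha=0.5)
plt.plot(ns, p + 1.96 * se, 'r--', label='95% limits')
plt.plot(ns, p - 1.96 * se, 'r--')
plt.plot(ns, p + 3.09 * se, 'k:', label='99.8% limits')
plt.plot(ns, p - 3.09 * se, 'k:')
plt.axhline(p, color='grey')
plt.xlabel('Number of patients')
plt.ylabel('Observed survival rate')
plt.legend()
plt.show()
```

GPs falling inside the funnel are doing nothing that chance cannot explain; only the points outside the 99.8% limits deserve a second look, and with 500 GPs even one or two of those are expected by chance alone.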

Finally, we can perform some rigorous analyses ourselves. The publication of all these data is an invitation for statisticians to get involved in communicating their content to the public. We are in the highly privileged position of having the training to extract the useful information they contain, discard the noise, and (just as importantly) explain how we do it. If we fail to rise to this challenge, others will move to fill the vacuum.

The author: Robin Evans is a Postdoctoral Research Fellow at the Statistical Laboratory, University of Cambridge. You can read his blog at itsastatlife.blogspot.com, and he tweets as @ItsAStatLife.