Data: can we cope?

Are we all drowning in a deluge of data? Are our data tools and systems managing to keep up with all the numbers we're collecting all the time? A series of articles in the journal Science doesn't give an entirely positive view, at least in terms of what's going on in the scientific research community. But what does that have to do with uncertainty?

In the 11 February 2011 issue of the (extremely prestigious) international journal Science, published by the American Association for the Advancement of Science, there's a series of articles on data. More specifically, they are about the challenges posed to science by the enormous quantities of data being collected, and the new opportunities that arise when scientists disseminate, share and collaborate on the data.

Science has decided to allow free online access to most of these articles, provided you are willing to register, which isn't too tedious a process.

The general tone of many of them is rather gloomy. Yes, almost unimaginable amounts of scientific data are being collected, but data are being lost, and opportunities are being wasted because sharing of data doesn't happen enough. Science surveyed their peer reviewers (the scientists who make recommendations on whether articles by other scientists should be published). They got about 1700 responses, and found, for instance, that over 80% of respondents did not have enough funding for data curation (that is, effective storing and caring for the data they have collected). Almost a quarter (23%) said they did not have the necessary expertise to analyse the data in the way they wanted, even after taking into account the possibility of getting the expertise through collaborations outside their own lab or research group.

But what does all this have to do with understanding uncertainty? Well, if you were to go by what it says in the article collection, you might think the answer was "not a lot". Rather little is said about uncertainty, or risk, or chance. I should make it clear here that I haven't read all the articles yet - they do provide a small data deluge, or at least quite a big puddle, on their own. But I did search them (online) for mentions of words like uncertainty, chance and risk, and found rather few. (If you find I've missed some important mentions, please put up a comment to say so!)

Maybe this lack reflects the view that, if you have enough data, there's so little uncertainty left that you needn't worry about it. I hope not, because in general that's not true. In simple situations, if you have a big enough sample of data, there may indeed be little uncertainty left. But many of the Science articles deal with very complicated situations, where the data are highly structured and correlated. In such cases, even with huge amounts of data there can be considerable uncertainty about what's going on.
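To see why correlation matters, here's a minimal simulation sketch (my own illustration, not from the Science articles): it compares the spread of the sample mean for independent data against data from a hypothetical strongly autocorrelated (AR(1)) process. With independent observations the uncertainty shrinks like 1/√n, but with highly correlated observations it can stay many times larger at the same sample size.

```python
# Sketch: uncertainty in a sample mean, independent vs correlated data.
# The AR(1) parameters (rho=0.99, n=1000) are illustrative choices.
import random
import math

random.seed(42)

def sample_mean_spread(n, rho, trials=500):
    """Standard deviation of the sample mean over many simulated
    datasets, drawn from an AR(1) process with autocorrelation rho
    (rho=0 gives independent standard-normal data)."""
    means = []
    for _ in range(trials):
        x = random.gauss(0, 1)
        total = x
        for _ in range(n - 1):
            # AR(1) step, scaled so each observation has variance 1
            x = rho * x + math.sqrt(1 - rho * rho) * random.gauss(0, 1)
            total += x
        means.append(total / n)
    mu = sum(means) / trials
    return math.sqrt(sum((m - mu) ** 2 for m in means) / trials)

independent = sample_mean_spread(1000, rho=0.0)
correlated = sample_mean_spread(1000, rho=0.99)
print(f"independent data: spread of mean = {independent:.3f}")
print(f"correlated data:  spread of mean = {correlated:.3f}")
```

Both runs use a thousand observations, yet the correlated version leaves far more uncertainty about the underlying mean: in effect, highly correlated data carry much less information than their raw count suggests.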

More importantly, in many of the scientific areas discussed, the main aim is to make predictions about the future. Even with huge amounts of data about the present and the past, predicting the future almost inevitably involves uncertainty. Just think about weather forecasts. Meteorologists collect huge numbers of observations, all over the world, all the time. They process them on immensely powerful computers. Yet everyone knows there is considerable uncertainty in the predictions they make. Tomorrow's forecast isn't always entirely accurate. This isn't because the meteorologists are incompetent or because they are doing the wrong calculations; it's just the way things are. The atmosphere is very complex indeed.

Let's just hope that some of the other scientific areas, covered in the Science articles, haven't forgotten about uncertainty.
