Reproducibility in Climate Science

The idea of ‘reproducibility’ is fundamental to scientific culture. Scientists don’t merely develop theories, construct models, form hypotheses, perform experiments, collect data, and use it to test their theories. They describe their theories, models, hypotheses, experiments, and data in published papers, so that others can criticise their work and improve upon it. Colleagues constantly attempt to out-do each other: to improve a theory, a model, or an experiment. This competitive collaboration drives science forwards, and it depends on scientists publishing their work in enough detail that others can not only understand it but also reproduce it.

So the progress of science depends absolutely on reproducibility. In some sense, if your work is not reproducible, then it is not science, or at least falls short of a scientific ideal. If a method is not documented, any other scientist attempting to reproduce or improve upon the work may not be able to do so. They may well get different results, for mysterious reasons – error in the original work, error in the reproduction, or simply that they are measuring different things due to a lack of documentation. Instead of science providing a clear “signpost to the truth”, it will spin like a broken weathervane, effort will be wasted, and no progress will be made.

Publication is seen as the key metric of science, but in fact publication is a means to an end. Science depends on reproducibility, and publication enables that. But a publication is inadequate for this purpose if it does not describe the study at a reproducible level of detail.

A paper stating “we mixed some salt solution with some of that stuff in the green bottle” is not science, not because it is informal and in the active voice but because it is not reproducible. What salt solution? What stuff? Mixed how? To be a science paper, it should say something like, “A volume of 50 ml 0.35 M NaCl, 0.35 M NaClO4 was titrated with 50 ml of 0.35 M NaCl, 0.1167 M Na2SO4”. The difference between the two is that the latter is reproducible: it has the details to allow another scientist to reasonably attempt a reproduction. Maybe it doesn’t have enough detail – maybe the outcome depended on unstated factors such as temperature, pressure, or the phase of the moon – but it is a fair attempt: it gives the details which the authors believe to be pertinent.

Of course the scientific world has long understood this, and my “green bottle” caricature would not have passed muster in any scientific journal in the last hundred years. Since formal peer review became the norm in science publication, in the latter half of the 20th century, reproducibility has been a key aspect of review.

However, this reproducibility criterion has not been applied with consistency or rigour to the data processing or computation in science. In the last fifty years, science has become increasingly computational: collected data may be processed quite extensively in order to extract information from it. This data processing is as vital to science as the data gathering or experimentation itself. Not just high-profile science (for example, the LHC, the hunt for exo-planets, the human genome project, or climate modelling), but almost all current science would be completely impossible without computation. Every figure, every table, and every result depends on the details of that computation.

And yet scientific publication has not fully caught up with this computational revolution in science: these key aspects of method are not generally published. Most published science does not include, or link to, a computational description which is sufficiently detailed to reproduce the results. Papers often devote only a few words to describing processing techniques – these descriptions are usually incomplete, and sometimes incorrect.

For instance, a caption for a time-series chart which says “Shaded envelopes are 1σ variance about the mean” may not meet this standard of reproducibility. How was that variance computed? Is it the variance of many samples from that particular time, or of a section of the time series? If the latter, is auto-correlation accounted for? Or is it based on a data model of the instrument or data collection system, or on a theoretical model of the system under study, or some combination of these?
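To make the ambiguity concrete, here is a minimal sketch in Python, using made-up data; neither computation is claimed to be the one used in the paper, but both are plausible readings of that caption, and they produce different envelopes:

    import numpy as np

    # Hypothetical time series: 200 time steps, 10 replicate measurements each.
    rng = np.random.default_rng(42)
    series = rng.normal(loc=0.0, scale=1.0, size=(200, 10))
    mean = series.mean(axis=1)

    # Reading 1: sigma is the spread of the replicates at each time step.
    sigma_replicates = series.std(axis=1, ddof=1)

    # Reading 2: sigma comes from a running window along the time axis of the
    # mean series itself (auto-correlation is ignored in this sketch).
    window = 20
    running = np.convolve(mean, np.ones(window) / window, mode="same")
    sigma_window = np.sqrt(
        np.convolve((mean - running) ** 2, np.ones(window) / window, mode="same"))

    # The two envelopes differ, so a reader cannot tell from the caption alone
    # which one was plotted.
    print(sigma_replicates[:5])
    print(sigma_window[:5])

Both are defensible computations of “a 1σ envelope about the mean”, which is exactly why the caption alone does not make the figure reproducible.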

That example is taken from a recent and very interesting letter in Nature: Sexton, P. F. et al., ‘Eocene global warming events driven by ventilation of oceanic dissolved organic carbon’, Nature 471, 349–352 (2011), doi:10.1038/nature09826 [paywall]. I use it as a representative illustration simply because it is at hand on my desk as I write. I don’t mean to single out that paper, which seems to shed some very interesting light on the role of deep-ocean carbon reservoirs in global climate changes during the Eocene. The lack of more detailed information about data processing is entirely normal.

This may seem like nit-picking, but problems in computational reproducibility are affecting an increasing range of sciences. Nature recently published an editorial identifying this problem in genomics. A 2009 paper on microarray studies stated that the findings of 10 out of 18 experiments could not be reproduced. Those findings are quite likely to be valid, but without the computational details, nobody can tell. And if the findings aren’t reproducible, are they science?

It is in this context that we published a detailed description of how we produced a figure for the April issue of Nature Climate Change. This was a somewhat laborious process, partly because we are still developing our own code to draw figures as SVG, but we regard it as necessary to back up the full reproducibility of our results. People are researching and developing systems to automate and ease this process, both in individual fields and across science. See, for instance, this AAAS session, and especially the Donoho/Gavish presentation.
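To give a flavour of what scripted figure generation looks like (a minimal sketch only, with invented data and an invented file name; this is not the code used for the Nature Climate Change figure), one might emit an SVG line chart directly from the data, so that re-running the script regenerates the figure exactly:

    # Minimal sketch: write a simple SVG line chart straight from data, so the
    # figure is regenerated exactly by re-running this script.
    # (Illustrative only; the data values below are made up.)
    data = [(year, 0.01 * (year - 1950)) for year in range(1950, 2011)]

    width, height = 600, 300
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)

    def to_svg(x, y):
        # Map a data point into SVG pixel coordinates (SVG y increases downwards).
        px = (x - x0) / (x1 - x0) * width
        py = height - (y - y0) / (y1 - y0) * height
        return px, py

    points = " ".join("%.1f,%.1f" % to_svg(x, y) for x, y in data)
    svg = ('<svg xmlns="http://www.w3.org/2000/svg" width="%d" height="%d">\n'
           '  <polyline fill="none" stroke="black" points="%s"/>\n'
           '</svg>\n' % (width, height, points))

    with open("figure.svg", "w") as f:
        f.write(svg)

Publishing a script of this kind alongside the data lets a reader check every step from the numbers to the published figure, rather than trusting a prose description of the processing.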

The journal Biostatistics has an unusual policy, described on its Information for Authors page, which is leading the way in this area:

Our reproducible research policy is for papers in the journal to be kite-marked D if the data on which they are based are freely available, C if the authors’ code is freely available, and R if both data and code are available, and our Associate Editor for Reproducibility is able to use these to reproduce the results in the paper. Data and code are published electronically on the journal’s website as Supplementary Materials.

We applaud that policy, and look forward to the appointment of Reproducibility Editors throughout scientific publishing.
