The Climate Code Foundation is a non-profit organisation promoting the public understanding of climate science.

ccc-gistemp summer project update

This guest post is written by Filipe Fernandes, who worked all summer on our ccc-gistemp project, thanks to the excellent Google Summer of Code. The summer is now past, although Filipe continues to work on our code. This is his third post; here are the first and second.

We have finally arrived at the final stage of the Google Summer of Code program. I’m happy I could be part of such an interesting and exciting project. Most of all, I’m glad I ended up working with the Climate Code Foundation (CCF). I won’t read or write code the same way after being mentored by CCF’s David Jones.

In fact, both David Jones and Nick Barnes were a fantastic duo to work with. They live in a “hybrid world” of computing and climate science. That unique experience makes them the perfect mentors for people like me, who have had little formal training in software development. Developing the project under CCF supervision was a great learning opportunity for me.

My project focused on ccc-gistemp (the CCF implementation of NASA’s GISTEMP). The project changed a little during GSoC, but its general ideas survived:

  • Make ccc-gistemp more user-friendly;
  • Improve ccc-gistemp running time using NumPy arrays;
  • Transform ccc-gistemp into an accessible piece of software for end users.

Since the midterm I have implemented a few improvements towards those goals:

  • A Comma-Separated Value (CSV) alternative for the ccc-gistemp outputs, so the GUI can now open the results directly in a spreadsheet program such as Excel.
  • GUI support for a rudimentary “project management system,” which means the user can make multiple runs with different options and compare them later.
  • A SUSE Studio appliance with PyPy + ccc-gistemp + data + the GUI, which can be run as a virtual machine, live CD, or Amazon EC2 instance. [1]
  • A NumPy alternative for step 3.

The last one turned out to be more challenging than expected, and it is still a work in progress that I wish to continue pursuing after GSoC.

There are also several things left to be done:

  • More elaborate graphics and plotting output for the GUI;
  • Project management via an INI-like file;
  • Full NumPy support (steps 1–5).

I would like to thank my mentor David Jones for all the wisdom he shared with me during the GSoC. I also would like to extend my thanks to all CCF/CCP mentors (Nick, Julien, Kevin and Jason) who promptly helped all the students. Finally, I would like to thank my colleagues Daniel and Hannah for their valuable opinions and feedback on my work. I’m going to miss our Monday meetings and Friday code reviews.

[1] http://susegallery.com/a/YfJVDT/ccc-gistemp?#appliance-downloads
[2] http://pypi.python.org/pypi/ccc-gistemp/


Homogenization project progress

This guest post is written by Daniel Rothenberg, one of our Google Summer of Code students, who is working on a library for climate record homogenization. His previous post introduced his project.

At the halfway point in my Google Summer of Code project, I am happy to report that a great deal of progress has been made. A few weeks ago, I set out to re-write the Pairwise Homogenization Algorithm (PHA) [1] used by the United States Historical Climatology Network (USHCN) [2]. While there is a published version of this algorithm available online [3], the code is written in Fortran and complicated to read, understand, and use. My project aimed to de-obfuscate this code, and use it to build a library of similar codes that people could use in the future to explore surface temperature homogenizations and reconstructions.

The first task in this project was to port the PHA from its current form (in Fortran) to something more accessible and maintainable. Thus, I’ve spent the majority of my time slogging through complex, dense Fortran subroutines, working out the array-traversing logic that governs the mechanics of the algorithm. While there is a published, high-level description of this algorithm and its logic [1], nothing quite compares to seeing how the code is actually implemented. By far, the most difficult obstacle in this project so far has been translating existing code—like the semi-hierarchical splitting algorithm used to identify possible undocumented changepoints—into a Pythonic, easy-to-understand form.

So far, I’ve had a lot of success overcoming this sort of obstacle, but it’s never an easy feat. The original code, by Claude Williams and Matt Menne, has useful comments and documentation, but is written in an older style of Fortran (Fortran 77) and uses many conventions which are avoided in modern programming. A good example is the copious use of goto statements. In a nutshell, these tell a program to jump to another point in the code—even out of control structures like loops. While they’re useful for some tasks, they tend to produce what’s often derided as “spaghetti code”—code which goes every which way but loose, and is hard to understand.

There are good ways to untangle spaghetti code, though. For instance, I’ve been developing my code on a test set of data which I repeatedly run through the Fortran routines. By controlling where the original Fortran code stops, I can liberally sprinkle debugging information, like print statements, throughout the code, and re-compile and re-run it in seconds when needed. This lets me track how variables such as loop index counters change over time, and allows me to investigate exactly where looping code breaks out and where it jumps to.

Then, once I understand how it works, I can translate it into Python—but only with a few clever tricks! You see, Python doesn’t have a goto statement. However, it does include ways to leave a loop early: the continue statement, which skips to the next iteration of a loop, and the break statement, which immediately exits the loop. These are useful, but have caveats; for instance, break only exits one loop, so if you’re looping over two indices you can’t escape the outer “master” loop from inside the inner one. Other nifty Python tools let you overcome this, though. For instance, you can use zip to combine two lists of values into a single sequence of pairs, which often lets you condense complex nested for-loops from the Fortran into simpler, easier-to-understand loops in Python, as in the sketch below.
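Purely for illustration—this is not code from the PHA port, and the function and data are made up—here is roughly how a goto-style scan over a pair of series looks once it is expressed with Python’s loop tools:

def first_large_difference(series_a, series_b, threshold):
    """Return the index of the first pair differing by more than threshold."""
    for i, (a, b) in enumerate(zip(series_a, series_b)):
        if a is None or b is None:
            continue      # skip missing values, instead of a goto past the loop body
        if abs(a - b) > threshold:
            return i      # returning replaces a goto out of the loop
    return None

print(first_large_difference([1.0, 2.0, None, 9.0], [1.1, 2.2, 3.0, 2.0], 3.0))

Here zip collapses what would otherwise be parallel index bookkeeping into a single loop, and the early return does the job a goto out of the loop would have done in the Fortran.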

Other obstacles have sometimes involved uncovering bugs in the original PHA code. To date, I’ve found three significant bugs which could potentially change some of the changepoint detections in the algorithm. Two of these relate to a form of linear regression called the Kendall-Theil robust line fit. In this method, you form pairwise estimates of the slope from all the values in your data, and estimate the linear regression using the median of those slopes (a minimal sketch of the idea follows below). One bug I found involved the two-phase regression form of this code (used if you hypothesize that the slope on one half of a segment of data is different from the slope on the other half), which is used in the changepoint detection tests based on the Bayesian Information Criterion. A second bug was the inadvertent overwriting of a variable holding the median values found. I have reported these bugs back to the authors.
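Here is that sketch of the basic median-of-pairwise-slopes idea. It is illustrative only—the PHA’s own Kendall-Theil routine does considerably more than this:

def kendall_theil_slope(x, y):
    """Robust slope estimate: the median of all pairwise slopes."""
    slopes = []
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if x[j] != x[i]:
                slopes.append(float(y[j] - y[i]) / (x[j] - x[i]))
    slopes.sort()
    n = len(slopes)
    if n % 2:
        return slopes[n // 2]
    return 0.5 * (slopes[n // 2 - 1] + slopes[n // 2])

# Example: a 0.02-per-step trend with one wild outlier barely moves the estimate.
xs = list(range(20))
ys = [0.02 * x for x in xs]
ys[10] += 5.0
print(kendall_theil_slope(xs, ys))

Because the estimate is a median, a single corrupted value hardly affects it—which is why a robust fit like this is attractive for noisy station records.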

These bugs bring me full circle back to the original intent of this project, which is to de-obfuscate code and make it as easy to understand and transparent as possible. The only way to have caught these bugs was to dig through the code and keep detailed accounts of what values various loops take on during their execution. These sorts of runtime errors can be very hard to catch—especially in complex code which is hard to understand. There are a few tried-and-true methods to alleviate them, however. First, you can use large sets of simple test cases to catch the corner cases and bugs that can creep into your code. I use this method to validate my auxiliary methods, such as correlation computations. However, sometimes it’s just not tenable to generate test cases—the Kendall-Theil code is complicated enough that, while you could theoretically work out simple cases by hand, you really need some sort of numerical code to perform the method.

It is in cases like this that it is so important to practice good software development and engineering principles: organized code, strong documentation, and iterative development. The truth is that a great deal of the numerical code used in scientific programming is dense and complicated; reducing that complexity, and writing in as clear and transparent a manner as possible, helps both the end user of the code and its author ensure that it is valid and does what it is supposed to do.
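As a hypothetical illustration of the kind of simple test case mentioned above—the function name and values here are made up, not taken from the ccf-homogenization code—a correlation routine can be pinned down with inputs whose answers are known exactly:

import unittest

def pearson_correlation(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = float(len(x))
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

class TestCorrelation(unittest.TestCase):
    def test_perfectly_correlated(self):
        self.assertAlmostEqual(pearson_correlation([1, 2, 3], [2, 4, 6]), 1.0)
    def test_perfectly_anticorrelated(self):
        self.assertAlmostEqual(pearson_correlation([1, 2, 3], [6, 4, 2]), -1.0)
    def test_uncorrelated(self):
        self.assertAlmostEqual(pearson_correlation([1, 2, 1, 2], [1, 1, 2, 2]), 0.0)

if __name__ == '__main__':
    unittest.main()

A handful of cases like these won’t prove a routine correct, but they catch the kind of slip—an off-by-one, a swapped variable—that is otherwise easy to miss in dense numerical code.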

With that said, although I’ve accomplished much in my project so far [4], there is still a great deal to do. Expect a second post on this soon, which will also recap a recent trip I made to the National Climatic Data Center.

[1] ftp://ftp.ncdc.noaa.gov/pub/data/ushcn/v2/monthly/menne-williams2009.pdf
[2] http://www.ncdc.noaa.gov/oa/climate/research/ushcn/#phas
[3] http://www.ncdc.noaa.gov/oa/climate/research/ushcn/#homogeneity
[4] http://code.google.com/p/ccf-homogenization/


Making ccc-gistemp more user-friendly

This guest post is written by Filipe Fernandes, one of our Google Summer of Code students, who is working on our ccc-gistemp project. His previous post introduced his project.

Hello, my name is Filipe Fernandes and I’m a Google Summer of Code (GSoC) student for the Climate Code Foundation (CCF).

I’ve worked mostly on packaging and cross-distribution of ccc-gistemp.

The current ccc-gistemp code is a program for people with at least intermediate computer skills: it must be run from a command-line terminal, and it is difficult to make multiple runs and comparisons.

We want to change that, making ccc-gistemp available to a broader audience. The progress I have made towards that goal is:

  • Added a Command Line Interface (CLI) that unifies all calls to run/vischeck;
  • Packaged ccc-gistemp via a standard Python setup.py;
  • Registered the code at PyPI;
  • Implemented py2exe (Windows) and py2app (Mac) builds for a frozen version of the CLI;
  • Started a Graphical User Interface (GUI).

The PyPI package has had 63 downloads so far (as of 2011-07-10), which is quite impressive, since it has not been advertised. A Linux package was also added to the Open Build Service (OBS), but its download count is not available.

Via the OBS one can create live CDs of the code, or virtual machines that run on VirtualBox or Amazon EC2, making the code even more accessible.

I decided to tackle the GUI early in the project schedule because of its importance and higher difficulty (tackle the largest risk first). I had never used wxPython before, but I’m pleased with the results so far.

The GUI is still under development, but the current version already runs ccc-gistemp much as the CLI does. We are working on ways to visualize the results and compare different runs.

For the second half of the GSoC period I’ll be working on the GUI and implementing an alternative core for ccc-gistemp using NumPy.

My original proposal has changed a little: I’m favoring the GUI over the NumPy implementation. I believe that the foundation of a good user interface is crucial to achieving the Foundation’s goals.


First code for Common Climate Project

This guest post is written by Hannah Aizenman, one of our Google Summer of Code students, who is working on a web-based visualisation tool for reconstructions of late Holocene temperatures, for the Common Climate Project (CCP).

Since my last blog post my project has progressed enough that the code can actually be used to make graphs of some sort, mostly plots of GISTEMP anomalies. The code is separated into two distinct parts: code that handles the data and code that handles the graphs.

The data part is pretty straightforward; give the code the path to the data and it’ll try to unpack the data and pull all the metadata out of the file. This is achieved as follows:

file_path = '../data/fields/gistemp_sat_anom_2.5deg.nc'
data_obj = unpack.CCPDataFromNetCDF(file_path, field = 'field')

Then the data can be pulled out of the file using:

im = data_obj.get_all_data()

Note: there is also a get_data() function for doing online calculations over multiple files.

Once the data is pulled out of the file, it’s up to the user to run it through whatever algorithm he or she chooses and produce data to be graphed. For this post, I just took the standard deviation of the observations over time:

import numpy as np

missing = data_obj.missing_value
mask = (im == missing)            # boolean mask marking the missing values
im_masked = np.ma.masked_array(im, mask)
masked = im_masked.std(0)         # standard deviation along the time axis

Next it’s time to set up the graph, starting off with setting the attributes and creating the object:

graph_attrs = dict(projection = 'moll',
                   title = 'gistemp_sat_anom_2.5deg',
                   cmap = 'gist_heat_r',
                   xlabel = 'longitude',
                   ylabel = 'latitude',
                   cblabel = 'std dev of temp anomalies')
graph_obj = spatial.SpatialGraph(data_obj, **graph_attrs)

And to finish off and create the image:

graph_obj.ccpfig(masked, 'gistemp_demo')

This code is also available for download at: https://code.google.com/p/ccp-viz-toolkit/source/browse/scripts/demo_gis.py

Besides generating a figure, this code also lays out the rough structure of the project at the moment. There are three main, and somewhat independent, parts to the library: handling the data, doing some number crunching, and making pretty pictures. The plan for the web interface, which is my current focus, is to glue this library to some JavaScript or similar, so that anybody can go to the commonclimate website, pick their data, throw in some attributes, and generate their own figures.


Science as a Public Enterprise

The Royal Society is conducting a policy study entitled ‘Science as a Public Enterprise’, focused on public engagement with science. This goes far beyond the traditional notions of ‘engagement’, in which the high priesthood of science may offer occasional public lectures and open days, write pop-science books, or contribute to TV documentaries. There is a growing realisation across science that modern communication media allow much more direct involvement: the public can see, grasp, and take part in scientific research to a much greater extent than has ever been possible before. There is also a sense that there are practical arguments for increased transparency – that it would benefit scientists as well as the public – in addition to the moral case (the public purse funds most research, and the public are often profoundly affected even by private science – for instance medical science, or models of oil dispersal in deep-water blowouts). The Climate Code Foundation, of course, welcomes this study, which relates directly to our goal of improving the public understanding of climate science.

The study group is led by Geoffrey Boulton, an eminent geologist. As part of the study, there was a Town Hall Meeting on Wednesday (2011-06-08), looking specifically at ‘Open Science’, which David Jones and I (Nick Barnes) attended. It was divided into two panel sessions, “Why should science be open?” and “How should science be open?” The meeting was addressed by Paul Nurse, president of the Royal Society, by Mark Walport, director of the Wellcome Trust, and by Philip Campbell, editor-in-chief of Nature. Many more of the great and good of UK science were in attendance, either on the panels or contributing from the floor. The discussion was interesting, and for the most part was both constructive and well-informed.

Mark Walport described the case for open science as “obvious and powerful”, and summarised arguments for and against. He dismissed many of the arguments against as weak and insubstantial, but identified the following as stronger:

  • there are no incentives for greater openness;
  • the global equity question: is free access necessarily fair access?
  • (especially in medical science) what about the confidentiality of the subjects?
  • what about privately-funded science, or science with national-security ramifications?
  • competitiveness: won’t groups or countries practising open science be disadvantaged?

He emphasized that even these last two arguments can’t stand in the way of an urgent and necessary change: negative results of medical trials must be published.

Geoffrey Boulton contrasted his first ever science publication, which had six data points, with a more recent paper of his which has six billion. Many modern papers cannot include all their data, and act instead almost as an advertisement for the dataset, where the real science value lies. I would argue that the same metaphor applies to papers on computational science: the paper cannot include a precise description of the computational methods, and should act as a pointer to the underlying code.

Stephen Emmott, head of computational science at Microsoft Research, said that we need a revolutionary change to maintain reproducibility and falsifiability in a world of model-based science. He emphasized the importance of open code: much research cannot be reproduced without the code. He referred to a genomics study (possibly the same one described in this Nature editorial) in which the findings of most of the studies examined could not be reproduced, due to a lack of openness.

Geoffrey Boulton rounded off the first session by encouraging us to ask of open science, “Is it worth the candle?”, suggesting that the answer is decidedly yes, and pointing out that we will probably have to do it anyway.

The “How?” session was introduced by Philip Campbell, who emphasized three key questions:

  • Credit: How can the systems of acknowledgement, reward, professional advancement, and institutional assessment in science be evolved to properly recognise contributions other than the traditional peer-reviewed paper? Creating and curating datasets, writing and maintaining code, promoting public engagement, all must be recognised and rewarded.
  • Cost: Creating and especially curating datasets is expensive, especially in fields such as particle physics and metagenomics where data volumes are enormous. Who is going to pay? Funding agencies need to step up for this. Opening, curating, and maintaining software resources also costs money (although much less) and funding agencies have failed to provide for it.
  • Community: Each scientific community must decide on the appropriate level of openness. For example, data embargo times might vary from field to field according to the personal and institutional investment made in obtaining data. In many fields, openness is increasing. In genomics, researchers who wanted data embargoes have been persuaded to accept credit instead: open science wins citations.

Timo Hannay, of Digital Science (a division of Macmillan Publishing), is working to provide better software tools to working scientists. He pointed out that almost all scientists have better software tools for managing their music collections or family snapshots than they do for managing their data and other digital resources.

From the floor, Peter Murray-Rust expressed the view that some groups can have valuable vested interests in the status quo, and be opposed to openness regardless of the interests of society or the views of scientists. Sometimes gradual “evolution” is possible, but sometimes a “fracture” is necessary.

The last comment I recorded was from Cameron Neylon, a biophysicist and open research expert who sits on our advisory committee (as does Peter Murray-Rust). He said that funding bodies should demand progress, but can’t move out in front of their scientific communities; communities themselves have to come to see the provision of research outputs as adding value. However, institutions and agencies “should never spend money restricting access” to scientific data or information.

In the coffee break after the meeting, I met Philip Campbell, who invited me to attend a meeting to discuss journal software publication policies. I very much look forward to that. Geoffrey Boulton encouraged us to make a submission to the study group, which we will certainly do. I also spoke briefly to Nick von Behr of the Royal Society, and to Timo Hannay, and hope to be able to meet each of them again in future.

One last point raised, although I can’t recall who said it: access to science ought not to be limited according to perceived interest. Almost any scientific topic is of interest to some proportion of the public, and modern technology – in particular the web – allows those specific people to directly engage in the science, without the wasted effort and limits that traditional ‘broadcasting’ media would impose.

This has a direct bearing on citizen science – another important aspect of ‘Science as a Public Enterprise’, not really touched on by this meeting. There are dozens of amazing citizen science projects, covering astrophysics, climate prediction, malaria control, and historical climatology, among many other topics. Some simply allow the public to donate the spare computational power of their own machines. In others, participants contribute their own intelligence (for instance, to discriminate between different galaxy types, or to read and transcribe old hand-written ships’ logs). In either case, a large amount of excellent science is being done with the help and participation of the public, which would not be possible in any other way.

Overall, a constructive and interesting meeting. I look forward to future activities of the study, and to seeing its conclusions. It is easy to be impatient at the pace of change in large organisations or communities, but this change, however much delayed, is definitely coming.

More information about the Royal Society study here.


Welcome Daniel Rothenberg

This guest post is written by Daniel Rothenberg, one of our Google Summer of Code students.

As a meteorologist and student of climate science, I’m fascinated by the atmosphere and the weather it produces. Throughout my studies, I’ve learned to use equations and simplified models to explain how the atmosphere works and to predict its behavior. However, a key to understanding the atmosphere’s future lies in understanding its past; by studying past weather and climate, we can begin to learn how it has changed over time and might continue to change in the future.

Thankfully, such studies are made easier by recorded observations of the weather and atmosphere, which extend back a century and cover much of the globe. These vast historical records provide a wealth of insight into how weather affects people and places, and also offer clues about the extent of natural climate variability. They also afford us a clear view of how weather and climate have changed over the industrial era.

Unfortunately, many weather observations recorded over the years are riddled with data quality issues. As time progresses, observation sites and observers come and go, and there are significant changes in methodology (migrating from analog to digital thermometers, for instance, or the time of day the observation is taken). Often, equipment can malfunction and drift from its initial calibrations.

These known biases complicate the process of reconstructing past weather, even for variables as innocuous as daily temperatures. To better understand observed climate variability, we try to identify and remove discontinuities and anomalies in the temperature record due to some of these biases. In fact, we automate this process with the help of “homogenization” algorithms, which compare temperature records against each other to identify potential data quality errors – even when we have no prior knowledge or record of them.
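As a toy illustration of that pairwise idea – entirely made-up data, and far simpler than the real algorithms – differencing two neighbouring records cancels the climate signal they share, so an artificial step at one station stands out clearly:

import numpy as np

np.random.seed(0)
years = np.arange(1950, 2000)
climate = 0.01 * (years - 1950)                    # shared warming signal
station_a = climate + np.random.normal(0, 0.1, years.size)
station_b = climate + np.random.normal(0, 0.1, years.size)
station_b[years >= 1980] += 0.8                    # pretend undocumented station move

diff = station_b - station_a                       # the shared signal cancels out
print("mean difference before 1980: %+.2f" % diff[years < 1980].mean())
print("mean difference from 1980 on: %+.2f" % diff[years >= 1980].mean())

Real homogenization algorithms, of course, must locate such a step without knowing in advance when (or whether) it happened, and must distinguish it from genuine local climate differences.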

My project this summer is to translate several of these algorithms from the dense, difficult-to-understand Fortran codes used at the National Climatic Data Center into something simpler and easier to understand, and into something that can easily be used by people interested in studying the observational temperature record in more detail. By doing this, I hope to further the Climate Code Foundation’s goal of improving climate science software practices and opening up scientific software for public consumption and improvement. Ultimately, I hope this project will contribute positively to the public’s knowledge of, and receptiveness to, the science behind climate change, as well as encourage my fellow scientists to adopt similar goals for their own code and software.


Welcome Hannah Aizenman

This guest post is written by Hannah Aizenman, one of our Google Summer of Code students.

If you’ve been interested in climate for a while (or even for longer than 15 minutes), you’ve probably heard of the hockey stick or the East Anglia email controversy, or seen yet another rant on climate change loaded with references to half-understandable topics discussed in papers you’d need a degree in geophysics or earth science to hope to understand, topped off with a big old graph that seems to be nothing more than lines, or a map covered in somewhat circle-like shapes. Even though climate data can seem inscrutable, many people in the climate community want the data, and more importantly the science, to be understandable. They are therefore starting to allow anyone with an internet connection to interact with and visualize their data and tools, because one of the best ways to learn about something is to play with it.

My project for GSOC is to open up the process of creating visualizations to anyone who happens to stumble onto the Common Climate Project (CCP) web page, so that he or she can learn how choices in parameters and restrictions on the data, not magical manipulation and fabrication, yield graphs similar to the ones in all these papers.

The plan is to create a tool for generating graphs using data sets created and studied by members of the CCP, to host it on the CCP website, and to open-source it so that any lab can reuse it to make their own data easily explorable. The tool’s backend will be a wrapper for the brilliant matplotlib library, and it will also be open-sourced, so that anybody can pick it up and make climate data graphs in Python without having to learn matplotlib’s intricacies. It should also make a good demonstration of how to use matplotlib for climate data graphs, so that anybody with interest and a smidgen of programming skill can push the graphs further and maybe even contribute back more visualization functions.
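To give a flavour of the kind of thin wrapper described above – the function name and arguments below are purely illustrative, not the actual CCP tool’s interface – a helper might look something like this:

import matplotlib.pyplot as plt

def anomaly_plot(years, anomalies, title='Temperature anomaly', filename=None):
    """Draw a simple, labelled time series of temperature anomalies."""
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(years, anomalies, color='crimson')
    ax.axhline(0, color='grey', linewidth=0.5)   # reference line at zero anomaly
    ax.set_xlabel('year')
    ax.set_ylabel('anomaly (degrees C)')
    ax.set_title(title)
    if filename:
        fig.savefig(filename)
    return fig

# Example call with made-up numbers:
anomaly_plot(range(1900, 1910),
             [-0.2, -0.1, 0.0, -0.15, 0.1, 0.05, 0.2, 0.1, 0.3, 0.25],
             filename='demo_anomalies.png')

Someone calling a helper like this never needs to know about figures, axes, or backends – and that is the whole point of the wrapper.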


Welcome Filipe Fernandes

This guest post is written by Filipe Fernandes, one of our Google Summer of Code students.

My name is Filipe Fernandes. I am a PhD candidate at the University of Massachusetts Dartmouth School for Marine Science and Technology, or SMAST. SMAST is part of a system-wide graduate school that combines the marine science resources, faculty, and courses of all the campuses of the University of Massachusetts.

I’m very excited to take part in GSoC with the CCF, an opportunity which joins three of my passions: coding, science, and education. The coding part involves the interesting challenge of adapting the current ccc-gistemp code to use NumPy, to make it faster. The scientific part is the opportunity to educate myself in climate science and better understand such a controversial topic. Ultimately, the education part will be to transform ccc-gistemp into “app”-like software that anyone can run on any platform. The challenge here will be to create a solid piece of software that presents the topic in ordinary (non-scientific) language.
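To illustrate the general idea of that kind of NumPy optimization – this is just a generic sketch, not ccc-gistemp code – an explicit Python loop over records can often be replaced by a single vectorised array operation:

import numpy as np

np.random.seed(1)
monthly = np.random.normal(10.0, 5.0, (100, 12))   # 100 years of monthly temperatures

# Pure-Python style: loop over the rows to compute annual means.
annual_loop = [sum(row) / len(row) for row in monthly.tolist()]

# NumPy style: one vectorised call, typically much faster on large arrays.
annual_numpy = monthly.mean(axis=1)

print(np.allclose(annual_loop, annual_numpy))      # True: same answer, less code

The speed-up comes from moving the inner loop out of the Python interpreter and into NumPy’s compiled routines.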

In short, my GSoC project goal is to make ccc-gistemp more user-friendly, via a Graphical User Interface, an automatic installer, multi-platform releases (Windows/Mac/Linux), and faster runs.

My mentor is David Jones, a member of the CCF. I’m looking forward to learning from his experience and, I hope, to delivering friendlier code to all users.


Google Summer of Code projects

Today Google announced the final selection of successful proposals for Google Summer of Code. The Climate Code Foundation is very grateful to Google for sponsoring three of our projects:

  • Hannah Aizenman, who will start development on the new ‘Open Climate Project’ work, mentored by Jason Smerdon at Columbia.
  • Filipe Fernandes, who will work on ccc-gistemp packaging, NumPy integration, and visualisation, mentored by David Jones.
  • Daniel Rothenberg, who will develop a Python library of homogenization algorithms, mentored by Nick Barnes.

We’d like to congratulate these students. We very much look forward to working with them. We will invite each of them to write posts for this blog, describing their projects and their contributions to the goals of the Foundation.

We’d also like to thank all the other students who submitted proposals or otherwise expressed interest in working with us. The competition was fierce, and the final selection was not easy. We hope they will stay in touch with the Foundation, and apply again next year. We want to act as a clearing-house or brokerage for devising, funding, and initiating more open-source software projects to advance the public understanding of climate science.


Reproducibility in Climate Science

The idea of ‘reproducibility’ is fundamental to scientific culture. Scientists don’t merely develop theories, construct models, form hypotheses, perform experiments, collect data, and use it to test their theories. They describe their theories, models, hypotheses, experiments, and data in published papers, so that others can criticise their work and improve upon it. Colleagues constantly attempt to out-do each other: to improve a theory, a model, or an experiment. This competitive collaboration drives science forwards, and it depends on scientists fully publishing their work, in enough detail that others can not only understand it but reproduce it.

So the progress of science depends absolutely on reproducibility. In some sense, if your work is not reproducible, then it is not science, or at least falls short of a scientific ideal. If a method is not documented, any other scientist attempting to reproduce or improve upon the work may not be able to do so. They may well get different results, for mysterious reasons – error in the original work, error in the reproduction, or simply that they are measuring different things due to a lack of documentation. Instead of science providing a clear “signpost to the truth”, it will spin like a broken weathervane, effort will be wasted, and no progress will be made.

Publication is seen as the key metric of science, but in fact publication is a means to an end. Science depends on reproducibility, and publication enables that. But a publication is inadequate for this purpose if it does not describe the study at a reproducible level of detail.

A paper stating “we mixed some salt solution with some of that stuff in the green bottle” is not science, not because it is informal and in the active voice but because it is not reproducible. What salt solution? What stuff? Mixed how? To be a science paper, it should say something like, “A volume of 50 ml 0.35 M NaCl, 0.35 M NaClO4 was titrated with 50 ml of 0.35 M NaCl, 0.1167 M Na2SO4”. The difference between the two is that the latter is reproducible: it has the details to allow another scientist to reasonably attempt a reproduction. Maybe it doesn’t have enough detail – maybe the outcome depended on unstated factors such as temperature, pressure, or the phase of the moon – but it is a fair attempt: it gives the details which the authors believe to be pertinent.

Of course the scientific world has long understood this, and my “green bottle” caricature would not have passed muster in any scientific journal in the last hundred years. Since formal peer review became the norm in science publication, in the latter half of the 20th century, reproducibility has been a key aspect of review.

However this reproducibility criterion has not been applied with consistency or rigour to the data processing or computation in science. In the last fifty years, science has become increasingly computational: collected data may be processed quite extensively in order to extract information from it. This data processing is as vital to science as the data gathering or experimentation itself. Not just high-profile science (for example, the LHC, the hunt for exo-planets, the human genome project, or climate modelling), but almost all current science would be completely impossible without computation. Every figure, every table, and every result depends on the details of that computation.

And yet scientific publication has not fully caught up with this computational revolution in science: these key aspects of method are not generally published. Most published science does not include, or link to, a computational description which is sufficiently detailed to reproduce the results. Papers often only devote a few words to describing processing techniques – these descriptions are usually incomplete, and sometimes incorrect.

For instance, a caption for a time-series chart which says “Shaded envelopes are 1σ variance about the mean” may not meet this standard of reproducibility. How was that variance computed? Is it the variance of many samples from that particular time, or of a section of the time series? If the latter, is auto-correlation accounted for? Or is it based on a data model of the instrument or data collection system, or on a theoretical model of the system under study, or some combination of these?

That example is taken from a recent and very interesting letter in Nature, Sexton, P. F. et al. Eocene global warming events driven by ventilation of oceanic dissolved organic carbon. Nature 471, 349-352 (2011) doi:10.1038/nature09826 [paywall]. I use it as a representative illustration simply because it is at hand on my desk as I write. I don’t mean to single out that paper, which seems to shed some very interesting light on the role of deep-ocean carbon reservoirs in global climate changes during the Eocene. The lack of more detailed information about data processing is entirely normal.
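To make the ambiguity concrete, here is a toy sketch using entirely made-up data (it has nothing to do with that paper): two plausible readings of “1σ about the mean” give noticeably different envelopes.

import numpy as np

np.random.seed(2)
ensemble = np.random.normal(0.0, 1.0, (20, 200))   # 20 records, 200 time steps
ensemble += np.linspace(0.0, 2.0, 200)             # a common underlying trend

mean_series = ensemble.mean(axis=0)

# Reading 1: the spread across the 20 records at each time step.
sigma_across_records = ensemble.std(axis=0)

# Reading 2: the variability of the mean series itself within a moving window.
half = 10
sigma_along_time = np.array([mean_series[max(0, i - half):i + half + 1].std()
                             for i in range(mean_series.size)])

print("typical sigma, reading 1: %.2f" % sigma_across_records.mean())
print("typical sigma, reading 2: %.2f" % sigma_along_time.mean())

Both are defensible computations, and a reader given only the phrase “1σ variance about the mean” has no way to tell which one was plotted – which is exactly the reproducibility gap being described.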

This may seem like nit-picking, but problems with computational reproducibility are affecting an increasing range of sciences. Nature recently published an editorial identifying this problem in genomics. A 2009 paper on microarray studies reported that the findings of 10 out of 18 experiments could not be reproduced. Those findings are quite likely to be valid, but without the computational details, nobody can tell. And if the findings aren’t reproducible, are they science?

It is in this context that we published a detailed description of how we produced a figure for the April issue of Nature Climate Change. This was a somewhat laborious process, partly because we are still developing our own code to draw figures as SVG, but we regard it as necessary to back up the full reproducibility of our results. People are researching and developing systems to automate and ease this process, both in individual fields and across science. See, for instance, this AAAS session, and especially the Donoho/Gavish presentation.

The journal Biostatistics has an unusual policy, described on its Information for Authors page, which is leading the way in this area:

Our reproducible research policy is for papers in the journal to be kite-marked D if the data on which they are based are freely available, C if the authors’ code is freely available, and R if both data and code are available, and our Associate Editor for Reproducibility is able to use these to reproduce the results in the paper. Data and code are published electronically on the journal’s website as Supplementary Materials.

We applaud that policy, and look forward to the appointment of Reproducibility Editors throughout scientific publishing.
