Scientists often ask why they should publish their data-processing codes. “After all,” they say, “the paper describes the algorithm.”
There are many good reasons to publish science source code, including:
- The paper usually does not include a full description of the algorithm: its parameters, a full list of the processing steps, the details of individual steps, and so on. Even if the description is complete, it may not be accurate.
- The code may not perform the intended algorithm. In other words, it may contain bugs. Almost all software does, after all. Distributing your code allows these bugs to be discovered and fixed.
- By making your code available, you allow other scientists to see, criticise, and learn from it, just as you do by describing the rest of your method. Publication, in the broadest sense, is key to scientific progress.
- Quite separately: if you release your code under an open-source license, you allow other scientists to reuse and adapt it for their own work.
The rest of this post uses a small example of climate science code released in 2010 to illustrate these points.
Following the unauthorized release of emails from the Climatic Research Unit (CRU) at the University of East Anglia in November 2009, doubt was cast on the behaviour of scientists at the Unit. In March 2010 the university asked Sir Muir Russell to head up an independent review, examining the honesty, rigour, and openness of CRU scientists. The Independent Climate Code Email Review reported in July 2010, finding that their rigour and honesty as scientists were not in doubt.
Part of the work of the CRU is CRUTEM3, a dataset of global historical terrestrial temperature anomalies. This is produced by an algorithm similar to our ccc-gistemp code. CRUTEM3 was the subject of much of the criticism around the emails, and so “the review conducted a trial analysis to demonstrate what an independent researcher is able to do, using publicly available land station temperature information, should they wish to replicate the CRUTEM analysis.”
Based on this trial, the review concluded that:
- The steps needed to create a global temperature series from the data are straightforward to implement.
- The required computer code is straightforward and easily written by a competent researcher.
- The shape of the temperature trends obtained in all cases is very similar: in other words following the same process with the same data obtained from different sources generates very similar results.
These findings are similar in tone to parts of the evidence which we submitted to the review, and we absolutely agree with them.
The review also found “The Review believes that all data, metadata and codes necessary to allow independent replication of results should be provident concurrent with peer-reviewed publication.” In line with this recommendation, the review’s own code was published in October 2010, and can be downloaded from the review’s website.
The ICCER trial code
The code is a few hundred lines of C++. It is easy to use. It’s pretty straight-forward, not especially clear or obscure.
Several interested parties have downloaded this code, examined it, and run it. The well-known blogger John Graham-Cumming reported his findings in October.
The review report includes lengthy descriptions of the trial analysis: Appendix 7 describing the analysis covers nine pages, including 200 words to describe the algorithm. This is considerably more than most science papers devote to algorithmic description, especially for such a simple piece of code, for which one might expect one or two sentences. Here is the description from the report, numbered for easy reference:
Code was written in C++ to:
- 1. Parse the source files and assemble the information into computer memory.
- 2. Pass over each station and calculate the monthly “normals” defined as the average over a set period. In this case we required good data from at least 10 years in the period 1955-1995. This is larger than was used by CRUTEM in order to obtain more valid normals without having to resort to external records.
- 3. In the case of GHCN, choose the duplicate with the most valid monthly normals.
- 4. Calculate the anomaly for each monthly measurement for each station. The anomaly is defined as the monthly temperature minus the normal for that month.
- 5. Either choose to use all stations, or only those matched to the CRUTEM3 station list published in 2009.
- 6. Pass over the data set and assemble average annual anomalies in each cell of a 5×5 degree grid of latitude and longitude.
- 7. Calculate the overall annual average anomaly by averaging over grid cells.
- 8. Do this for each of GHCN (unadjusted), GHCN (adjusted) and NCAR, and plot the annual average anomaly as a function of time.
- 9. Separately read the published CRUTEM3 product and form the same average over grid cells and plot this on the same figure for comparison.
When John Graham-Cumming reviewed the code, he noted that point 2 of this description is inaccurate. In fact the code uses the date range 1961-1990 to calculate a monthly normal, and requires 16 values (Graham-Cumming states “fifteen”), not 10. This is a minor example of a comon problem: the description in the paper does not completely describe the algorithm.
Furthermore, none of the released code corresponds to points 5, 8, and 9 of the description. The code as it stands can only read GHCN adjusted data, and there is no use of a CRUTEM3 station list or the CRUTEM3 product. The charts in the published report show that some such code must have been written, and the released code is designed to allow it, but does not actually do it.
This is a good example of a broader problem, well known and long-solved in the software industry: when creating a final work-product (whether it’s a software product or something else such as a report, paper, or dataset), it is vital to use a known and reproducible software configuration. Otherwise bugs in the work-product can be impossible to track down and fix. This is the problem which the entire field of Software Configuration Management exists to solve. Over the last forty years the whole software industry has (sometimes reluctantly) developed and adopted many mature and sophisticated tools, skills, processes and procedures to address this. The Climate Code Foundation will help scientists to develop these skills.
When the Foundation’s David Jones briefly reviewed the code in January 2011, he observed another problem: when calculating the global mean anomaly in point 7, the code does not area-weight the gridded anomaly values when calculating their average. This means that contributions from the polar areas (where grid cells have a small area) are too heavily weighted in the global result, and equatorial grid cells are comparatively neglected. This defect is interesting, in that the report does not say how the average is calculated. Plainly area-weighting is better, and a reader might assume that it is done, but it isn’t described. The paper description is simply too vague.
One final point about this code: it performs a further step which isn’t in the above description, computing a “smoothed” global annual anomaly – a simple moving average over several consecutive years. The charts in the report show this smoothed result, described as “a simple 5-year smooth”. On inspecting the code I realised that in fact the smoothing done is a 4 year smooth (the mean of values from years y-2 to y+1, inclusive). It is clear that this is a programming mistake: the programmer intended it to perform a 5-year smooth.
This report included a good algorithmic description, and has been accompanied by source code. We greatly welcome both of these departures from the norm, as setting a good example and following the report’s own recommendation. These facts also allow us to illustrate particular reasons why code release is important, and why science software skills should be improved.
The four separate bugs – in the description, in the code, in the configuration, and in the expectation of the reader – are, in this case, trivial and unimportant – they do not affect the broad results of the report in any way. However, each is characteristic of problems with science software which can be more serious, and which are impossible to discover unless code is released.
- The different base period and criteria illustrates that algorithmic parameters may be omitted or given incorrectly, which is important as they can make crucial differences to results.
- The missing code illustrates that science software is usually not developed using any configuration management tools or techniques, which are vital because without it results may be impossible for either readers or the authors to reproduce;
- The non-area-weighted average illustrates that readers may often assume obvious algorithmic refinements which are not in fact in use, and that discovering this fact requires the code to be released;
- The incorrect smoothing parameter illustrates that all software has bugs, some of which are difficult or impossible to discover without a close review of the source code. This code was doubtless read by several people prior to release by the ICCER, and has certainly been read by several commentators since, but this bug was not found until today. Reading through your own code is a notoriously poor way to find bugs. Showing it to your colleagues is also insufficient. If you release it, so more people read it, it is more likely the bugs are to be found and fixed.
By releasing the code, and opening it to general review and criticism, the report has allowed us to illustrate and discuss general problems in science software development, and to work towards resolving them. In this way, science software, and science in general, benefits from publication.