The Climate Code Foundation is a non-profit organisation to promote the public understanding of climate science.

Welcome Daniel Rothenberg

This guest post is written by Daniel Rothenberg, one of our Google Summer of Code students.

As a meteorologist and student of climate science, I’m fascinated by the atmosphere and the weather it produces. Throughout my studies, I’ve learned to use equations and simplified models to explain how the atmosphere works and to predict its behavior. However, a key to understanding the atmosphere’s future lies in understanding its past; by studying past weather and climate, we can begin to learn how it has changed over time and might continue to change in the future.

Fortunately, such studies are made easier by recorded observations of the weather and atmosphere, which extend back a century and cover much of the globe. These vast historical records provide a wealth of insight into how weather affects people and places, and also offer clues about the extent of natural climate variability. They also afford us a clear view of how weather and climate have changed over the industrial era.

Unfortunately, many weather observations recorded over the years are riddled with data quality issues. As time progresses, observation sites and observers come and go, and there are significant changes in methodology (migrating from analog to digital thermometers, for instance, or the time of day the observation is taken). Often, equipment can malfunction and drift from its initial calibrations.

These known biases complicate the process of reconstructing past weather, even for variables as innocuous as daily temperatures. To better understand observed climate variability, we try to identify and remove discontinuities and anomalies in the temperature record due to some of these biases. In fact, we automate this process with the help of “homogenization” algorithms, which compare temperature records against each other to identify potential data quality errors – even when we have no prior knowledge or record of them.
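As a rough illustration of the idea, here is a minimal Python sketch of a pairwise comparison. It is not one of the NCDC algorithms: the station values, the function name find_breakpoint, and the simple mean-shift score are all assumptions made up for this example.

```python
from statistics import mean


def find_breakpoint(target, neighbour, min_segment=3):
    """Return (index, shift) of the most likely step change in target.

    The difference series target - neighbour removes the climate signal
    shared by both stations; a step in that series suggests a problem
    (for example an instrument change) at one of the stations.
    """
    if len(target) != len(neighbour):
        raise ValueError("series must have the same length")
    diff = [t - n for t, n in zip(target, neighbour)]
    best_index, best_shift = None, 0.0
    # Try every split point that leaves min_segment values on each side.
    for i in range(min_segment, len(diff) - min_segment + 1):
        shift = mean(diff[i:]) - mean(diff[:i])
        if abs(shift) > abs(best_shift):
            best_index, best_shift = i, shift
    return best_index, best_shift


# Hypothetical annual mean temperatures (degrees C) for two nearby stations.
neighbour = [10.1, 10.4, 9.8, 10.0, 10.6, 10.2, 10.5, 10.3, 10.7, 10.4]
# The target tracks its neighbour, then reads 0.8 C warmer from index 5 on.
target = [n + 0.2 for n in neighbour[:5]] + [n + 1.0 for n in neighbour[5:]]

index, shift = find_breakpoint(target, neighbour)
print(index, round(shift, 2))
```

Neither station shows an obvious jump on its own, because both share the same year-to-year variability; the step only stands out once one record is subtracted from the other. Real homogenization codes compare many neighbours and apply statistical significance tests, which this sketch leaves out.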

My project this summer is to translate several of these algorithms from the dense, difficult-to-understand FORTRAN codes used at the National Climatic Data Center into something simpler and easier to understand, and into something that can easily be used by people interested in studying the observational temperature record in more detail. By doing this, I hope to further the Climate Code Foundation’s goal of improving climate science software practices and opening up scientific software for public consumption and improvement. Ultimately, I hope this project will contribute positively to the public’s knowledge and reception of the science behind climate change, as well as encourage my fellow scientists to adopt similar goals for their own codes and software.

Posted in News

Welcome Hannah Aizenman

This guest post is written by Hannah Aizenman, one of our Google Summer of Code students.

If you’ve been interested in climate for a while (or longer than 15 minutes), you’ve probably heard of the hockey stick or the East Anglia email controversy. You may also have seen yet another rant on climate change, loaded with references to half-understandable topics from papers you’d need a degree in geophysics or earth science to follow, and a big old graph that seems to be nothing more than lines, or a map with somewhat circle-like shapes. Climate data can seem inscrutable, but many people in the climate community want the data, and more importantly the science, to be understandable. They are therefore starting to let anyone with an internet connection interact with and visualize their data and tools, because one of the best ways to learn about something is to play with it.

My project for GSoC is to open up the process of creating visualizations to anyone who happens to stumble onto the Common Climate Project (CCP) web page, so that he or she can learn how choices in parameters and restrictions on the data, not magical manipulation and fabrication, yield graphs similar to the ones in all these papers.

The plan is to create a tool for generating graphs using data sets created and studied by members of the CCP, to host it on the CCP website, and to open-source it so that any lab can reuse it to make their own data easily explorable. The tool’s backend will be a wrapper for the brilliant matplotlib library, and it will also be open-sourced so that anybody can pick it up to make climate data graphs in Python without having to learn the intricacies of matplotlib. Hopefully it will also make a good demo of how to use matplotlib for climate data graphs, so that anybody with interest and a smidgen of programming skill can push the graphs further and maybe even contribute back more visualization functions.


Welcome Filipe Fernandes

This guest post is written by Filipe Fernandes, one of our Google Summer of Code students.

My name is Filipe Fernandes. I am a PhD candidate at the University of Massachusetts Dartmouth School for Marine Science and Technology, or SMAST. SMAST is part of a system-wide graduate school that combines the marine science resources, faculty, and courses of all the campuses of the University of Massachusetts.

I’m very excited to take part in the GSoC with the CCF, an opportunity which joins three of my passions: coding, science, and education. The coding part involves the interesting challenge of adapting the current ccc-gistemp code to make use of NumPy optimizations so that it runs faster. The scientific part is the opportunity to educate myself in climate science and better understand such a controversial topic. Finally, the education part will be to transform the ccc-gistemp code into “App”-like software that anyone can run on any platform. The challenge here will be to create a solid piece of software on the topic in ordinary (non-scientific) language.

In short, my GSoC project goal is to make ccc-gistemp more user-friendly via a Graphical User Interface, an automatic installer, multi-platform releases (Windows/Mac/Linux), and faster runs.

My mentor is David Jones, a member of the CCF. I’m looking forward to learning from his experience, and I hope to deliver friendlier code to all users.


Google Summer of Code projects

Today Google announced the final selection of successful proposals for Google Summer of Code. The Climate Code Foundation is very grateful to Google for sponsoring three of our projects:

  • Hannah Aizenman, who will start development on the new ‘Open Climate Project’ work, mentored by Jason Smerdon at Columbia.
  • Filipe Fernandes, who will work on ccc-gistemp packaging, NumPy integration, and visualisation, mentored by David Jones.
  • Daniel Rothenberg, who will develop a Python library of homogenization algorithms, mentored by Nick Barnes.

We’d like to congratulate these students. We very much look forward to working with them. We will invite each of them to write posts for this blog, describing their projects and their contributions to the goals of the Foundation.

We’d also like to thank all the other students who submitted proposals or otherwise expressed interest in working with us. The competition was fierce, and the final selection was not easy. We hope they will stay in touch with the Foundation, and apply again next year. We want to act as a clearing-house or brokerage for devising, funding, and initiating more open-source software projects to advance the public understanding of climate science.


Reproducibility in Climate Science

The idea of ‘reproducibility’ is fundamental to scientific culture. Scientists don’t merely develop theories, construct models, form hypotheses, perform experiments, collect data, and use it to test their theories. They describe their theories, models, hypotheses, experiments, and data, in published papers, so that others can criticise their work and improve upon it. Colleagues constantly attempt to out-do each other: to improve a theory, a model, or an experiment. This competitive collaboration drives science forwards, and depends on scientists fully publishing their work, in enough detail that others can not only understand it but also reproduce it.

So the progress of science depends absolutely on reproducibility. In some sense, if your work is not reproducible, then it is not science, or at least falls short of a scientific ideal. If a method is not documented, any other scientist attempting to reproduce or improve upon the work may not be able to do so. They may well get different results, for mysterious reasons – error in the original work, error in the reproduction, or simply that they are measuring different things due to a lack of documentation. Instead of science providing a clear “signpost to the truth”, it will spin like a broken weathervane, effort will be wasted, and no progress will be made.

Publication is seen as the key metric of science, but in fact publication is a means to an end. Science depends on reproducibility, and publication enables that. But a publication is inadequate for this purpose if it does not describe the study at a reproducible level of detail.

A paper stating “we mixed some salt solution with some of that stuff in the green bottle” is not science, not because it is informal and in the active voice but because it is not reproducible. What salt solution? What stuff? Mixed how? To be a science paper, it should say something like, “A volume of 50 ml 0.35 M NaCl, 0.35 M NaClO4 was titrated with 50 ml of 0.35 M NaCl, 0.1167 M Na2SO4”. The difference between the two is that the latter is reproducible: it has the details to allow another scientist to reasonably attempt a reproduction. Maybe it doesn’t have enough detail – maybe the outcome depended on unstated factors such as temperature, pressure, or the phase of the moon – but it is a fair attempt: it gives the details which the authors believe to be pertinent.

Of course the scientific world has long understood this, and my “green bottle” caricature would not have passed muster in any scientific journal in the last hundred years. Since formal peer review became the norm in science publication, in the latter half of the 20th century, reproducibility has been a key aspect of review.

However this reproducibility criterion has not been applied with consistency or rigour to the data processing or computation in science. In the last fifty years, science has become increasingly computational: collected data may be processed quite extensively in order to extract information from it. This data processing is as vital to science as the data gathering or experimentation itself. Not just high-profile science (for example, the LHC, the hunt for exo-planets, the human genome project, or climate modelling), but almost all current science would be completely impossible without computation. Every figure, every table, and every result depends on the details of that computation.

And yet scientific publication has not fully caught up with this computational revolution in science: these key aspects of method are not generally published. Most published science does not include, or link to, a computational description which is sufficiently detailed to reproduce the results. Papers often only devote a few words to describing processing techniques – these descriptions are usually incomplete, and sometimes incorrect.

For instance, a caption for a time-series chart which says “Shaded envelopes are 1σ variance about the mean” may not meet this standard of reproducibility. How was that variance computed? Is it the variance of many samples from that particular time, or of a section of the time series? If the latter, is auto-correlation accounted for? Or is it based on a data model of the instrument or data collection system, or on a theoretical model of the system under study, or some combination of these?
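To see why such a caption is ambiguous, here is a small Python sketch of two of the readings listed above, applied to the same made-up data. The numbers and function names are assumptions for illustration only; they are not taken from any paper.

```python
from statistics import mean, stdev


def envelope_across_samples(samples):
    """1-sigma envelope from several samples taken at each time step."""
    result = []
    for values in zip(*samples):
        m, s = mean(values), stdev(values)
        result.append((m - s, m + s))
    return result


def envelope_along_series(series, window=3):
    """1-sigma envelope from a centred moving window over one series.

    Points too close to either end for a full window are skipped.
    """
    half = window // 2
    result = []
    for i in range(half, len(series) - half):
        chunk = series[i - half:i + half + 1]
        m, s = mean(chunk), stdev(chunk)
        result.append((m - s, m + s))
    return result


# Three hypothetical replicate measurements at five time steps.
samples = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.2, 2.1, 2.9, 4.2, 5.1],
    [0.8, 1.9, 3.1, 3.8, 4.9],
]
mean_series = [mean(v) for v in zip(*samples)]

across = envelope_across_samples(samples)
along = envelope_along_series(mean_series)

# Compare envelope widths at the middle time step (index 2 of the series,
# which is index 1 of the windowed result).
width_across = across[2][1] - across[2][0]
width_along = along[1][1] - along[1][0]
print(round(width_across, 2), round(width_along, 2))
```

With these numbers the two envelopes differ in width by a factor of ten at the same time step (about 0.2 versus 2.0), because the windowed version also picks up the trend in the series. Both could fairly be captioned as a 1σ envelope about the mean, which is why the caption alone does not pin down the computation.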

That example is taken from a recent and very interesting letter in Nature, Sexton, P. F. et al. Eocene global warming events driven by ventilation of oceanic dissolved organic carbon. Nature 471, 349-352 (2011) doi:10.1038/nature09826 [paywall]. I use it as a representative illustration simply because it is at hand on my desk as I write. I don’t mean to single out that paper, which seems to shed some very interesting light on the role of deep-ocean carbon reservoirs in global climate changes during the Eocene. The lack of more detailed information about data processing is entirely normal.

This may seem like nit-picking, but problems in computational reproducibility are affecting an increasing range of sciences. Nature recently published an editorial identifying this problem in genomics. A 2009 paper on microarray studies stated that the findings of 10 out of 18 experiments could not be reproduced. Those findings are quite likely to be valid, but without the computational details, nobody can tell. And if the findings aren’t reproducible, are they science?

It is in this context that we published a detailed description of how we produced a figure for the April issue of Nature Climate Change. This was a somewhat laborious process, partly because we are still developing our own code to draw figures as SVG, but we regard it as necessary to back up the full reproducibility of our results. People are researching and developing systems to automate and ease this process, both in individual fields and across science. See, for instance, this AAAS session, and especially the Donoho/Gavish presentation.

The journal Biostatistics has an unusual policy, described on its Information for Authors page, which is leading the way in this area:

Our reproducible research policy is for papers in the journal to be kite-marked D if the data on which they are based are freely available, C if the authors’ code is freely available, and R if both data and code are available, and our Associate Editor for Reproducibility is able to use these to reproduce the results in the paper. Data and code are published electronically on the journal’s website as Supplementary Materials.

We applaud that policy, and look forward to the appointment of Reproducibility Editors throughout scientific publishing.


Accepted for Google Summer of Code

We are proud to announce that we have been accepted as a mentoring organization for Google Summer of Code. Since we announced our application, we have been contacted by more than a dozen students from all over the world, keen to work with us on software to improve public understanding of climate science.

If you’re a student, this is an opportunity to help the Foundation, write useful code, learn a lot, network with climate scientists, and earn USD 5000 from Google into the bargain. Take a look at our ideas page, read all about the Summer of Code, look at the “application template” on our GSoC page, and contact us. If we like your ideas then we will help you turn them into a detailed project proposal, for consideration by Google.

If you already contacted us, thank you for your patience. We will be contacting you, setting up a mailing list, and so on, next week.


Why publish code: a case study

Scientists often ask why they should publish their data-processing codes. “After all,” they say, “the paper describes the algorithm.”

There are many good reasons to publish science source code, including:

  • The paper usually does not include a full description of the algorithm: its parameters, a full list of the processing steps, the details of individual steps, and so on. Even if the description is complete, it may not be accurate.
  • The code may not perform the intended algorithm. In other words, it may contain bugs. Almost all software does, after all. Distributing your code allows these bugs to be discovered and fixed.
  • By making your code available, you allow other scientists to see, criticise, and learn from it, just as you do by describing the rest of your method. Publication, in the broadest sense, is key to scientific progress.
  • Quite separately: if you release your code under an open-source license, you allow other scientists to reuse and adapt it for their own work.

The rest of this post uses a small example of climate science code released in 2010 to illustrate these points.


Nature Figure

The April issue of Nature Climate Change features a figure produced by Climate Code Foundation, illustrating this article. This page is the “supplementary information” regarding that figure.

Figure: NASA GISTEMP (blue), Clear Climate Code ccc-gistemp (pink, offset -0.2°C for clarity). Difference ccc-gistemp - GISTEMP (green, right-hand scale, note x20 scale).

Updated to state these errata: in the print edition of Nature Climate Change, the ‘0.0’ label on the y-axis has been misprinted as ‘0.6’, and the caption credits ‘Climate Change Foundation’, not ‘Climate Code Foundation’.

The figure compares the global temperature anomaly analysis using our ccc-gistemp software against the analysis published by NASA using their GISTEMP software. To produce it you need to have run ccc-gistemp (to get a valid result directory), and you need to have Inkscape on your PATH. Then run python tool/multi.py nature201002. This will create nature.pdf.

As the figure caption says, the ccc-gistemp result in the figure was produced using “software revision 700”. The version of tool/multi.py (and related files) used to make nature.pdf was revision 727. Both computation and visualisation were run with Python 2.7 (although the Python version should not affect these results).

The input data for the figure is in this zip archive held at our source code repository. These are just copies of publicly available datasets (GHCN, USHCN, and so on), but because the available copies change (typically every month), we need to keep a copy if we’re going to reproduce the figure exactly.

The numbers for the GISTEMP curve come from the NASA GISTEMP website. The published datafiles change every month, but previous versions are not made publicly available. In order to exactly reproduce the Nature figure, we have to archive a copy of the file we used; this is also held at our Google Code repository.
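One way to confirm that an archived copy really is the file used for a figure is to record a checksum alongside it. The following is a minimal sketch using only the Python standard library; the file name, the sample contents, and the manifest layout are assumptions for illustration and are not part of ccc-gistemp.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file name in directory to its SHA-256 digest."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return {p.name: sha256_of(p) for p in files}


def changed_files(directory, manifest):
    """List files that are missing or differ from the recorded manifest."""
    current = build_manifest(directory)
    return sorted(
        name for name, digest in manifest.items()
        if current.get(name) != digest
    )


# Demonstration with a throwaway directory and a hypothetical input file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "example_input.txt"
    data_file.write_text("1880 -0.25\n1881 -0.18\n")
    manifest = build_manifest(tmp)
    unchanged = changed_files(tmp, manifest)

    # Simulate an upstream monthly update that alters one value.
    data_file.write_text("1880 -0.26\n1881 -0.18\n")
    changed = changed_files(tmp, manifest)

print(unchanged, changed)
```

Storing such a manifest next to the archived inputs lets a reader check, before re-running anything, that they are starting from the same bytes that produced the published figure.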

Why are we publishing this blog post? Across science, reproducibility of computer results is seen as increasingly important. We believe it is vital to public understanding of climate science in particular. Every number and every figure in every paper is the result of processing data with a computer program; releasing the programs allows interested readers to better understand the processing, and also to check the programs for errors. It should be possible for an interested reader to reproduce the figures exactly by re-running the programs.

[Updated to add link to Nature Climate Change article]


Google Summer of Code

Google have announced their Summer of Code, and we intend to be a mentoring organisation. If you’re a student, this is an opportunity to work on our open source code and earn a bit of money doing so (Google give a stipend of USD 5000 to qualifying students, and an honorarium of USD 500 to the mentoring organisation).

We have an ideas page, most of which revolves around our ccc-gistemp project. Ideas range from improving ccc-gistemp in various ways, through novel reconstructions, to clear implementations of other climate codes. If you have ideas of your own, we’d like to hear about those too.

If you are interested in participating as a student, then please get in touch.

We have not been a Summer of Code mentor before, but we bring many years (decades even!) of experience to the table: experience in computer science, software engineering, project management, and so on. We hope to help students make a success of their projects!


Some code published

One of our goals is to see more scientific code published. Nature kindly gave us space to voice this opinion earlier in the year. In the world of software tools (our home planet if you like) we have seen huge strides forward because people published the source code to their software. It’s where the Open Source movement began. We believe that science will similarly be improved by having more scientists publish more of their code.

By publish we don’t necessarily mean polished, documented, formatted, and printed in a glossy peer-reviewed journal. We just mean made available. Whatever you wrote. Just stick it on the web somewhere. A zipfile is fine.

What should this look like? The upcoming issue of Annals of Applied Statistics provides a fine example. McShane and Wyner have published an article in this journal, and there are various discussions of the article in the same issue. One of these is by Davis & Liu, and their code is published as supplementary material. It’s a great example of what we mean when we say you should publish your code. The code (it’s a few dozen lines of R) is clearly more or less as Davis & Liu wrote it. It has a comment at the top, which looks like it might have been added in haste later, telling you how to download the inputs and run it.

As code goes it’s not great code: it’s poorly documented, and full of magic numbers. But it doesn’t have to be great code. It does the job; no-one is going to be building nuclear power stations or recommending the purchase of a cancer-busting drug using this code. The important thing is that it’s the code used to produce the figures in the paper, and it’s published.

(Davis & Liu are not the only ones to make their code available; there is plenty more in the supplementary materials. McShane and Wyner make available their R code. The rest use Matlab: Smerdon’s; Tingley’s; Holmström’s; Kaplan’s.)

Stein’s editorial in the same issue of Annals of Applied Statistics is well worth reading, and he has useful things to say on peer-review, data, statistical testing, uncertainty, and the relationship between code and reproducibility. He notes that (emphasis mine) “There is a movement in various disciplines to make all numerical results reported on in published papers reproducible by providing all of the data and code used to generate the results”, and goes on to say that this reproducibility “should be a requirement for research that has potentially important public policy implications whenever permissible”.

Naturally we agree.
