2011 Student: Daniel Rothenberg

Daniel Rothenberg was a student mentored by the Foundation in the Google Summer of Code 2011. This was his project proposal. There are also blog posts reporting his project progress.

The homogenization algorithms used to correct for discontinuities and errors in collections of empirical temperature records are a “black box” to the general public. This project will translate the algorithm used in the modern United States Historical Climatology Network, along with two legacy ones, into clearly-coded, easy-to-understand Python, both to boost public understanding of climate science and to set an example for future scientists wishing to explore the use of Python in their work.

Proposal

Synthesizing the large body of surface temperature observations made at various times and locations across the world over the past century into a single, continuous, global temperature record is a complex task fraught with difficulty. One such difficulty arises from the fact that many observation records are littered with discontinuities. These discontinuities can be attributed to a change in observation methodology (time of observation, measurement equipment), a change in the observation site or its surroundings (the introduction of an air conditioning unit or a paved road, or a total move of the observation site), or numerical drift due to poor sensor maintenance (measurement drift over time). Over the past few decades, an assortment of numerical algorithms has been developed to detect and “homogenize” these discontinuities [1,2,3]; that is, they use metadata packaged with temperature records and/or data from nearby, correlated observation sites to identify and correct for discontinuities in the observation records.
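As a concrete illustration of the kind of test these algorithms build on, the sketch below compares a candidate station against the mean of its neighbours and scans the resulting difference series for a single step change. This is a deliberately minimal toy – the function and variable names are my own, and real algorithms such as MW2009 handle multiple breaks, missing data, and documented station moves (MW2009 in particular works on pairwise difference series between neighbouring stations rather than a single composite reference) – but it shows the basic idea of neighbour-based discontinuity detection.

    import numpy as np

    def find_step_change(candidate, neighbours):
        """Locate the most likely single step change in a candidate series,
        measured relative to the mean of nearby, correlated neighbour series."""
        diff = np.asarray(candidate) - np.mean(neighbours, axis=0)
        best_index, best_score, best_shift = None, 0.0, 0.0
        for k in range(2, len(diff) - 2):
            shift = diff[k:].mean() - diff[:k].mean()
            # crude statistic: size of the shift scaled by overall variability
            score = abs(shift) / (diff.std() + 1e-12)
            if score > best_score:
                best_index, best_score, best_shift = k, score, shift
        return best_index, best_shift

    # Toy example: a 1 degree jump introduced halfway through a noisy record.
    np.random.seed(0)
    truth = 15.0 + 0.5 * np.random.randn(120)
    candidate = truth + np.where(np.arange(120) >= 60, 1.0, 0.0)
    neighbours = [truth + 0.2 * np.random.randn(120) for _ in range(5)]
    print(find_step_change(candidate, neighbours))   # index near 60, shift near 1.0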

These sorts of homogenization algorithms form an integral part of modern surface temperature reconstructions. Menne and Williams 2009 ([3], hereafter MW2009), for instance, is used in the latest version of the United States Historical Climatology Network (USHCN) [4]. In addition to the mathematical algorithm being published [3], the source code behind an implementation of MW2009 is available for free from the National Climatic Data Center (NCDC) [5]. While making this code available to the general public is very much in the spirit of transparent, reproducible science, the actual product available on the web is not suitable for public consumption: it is finicky to run, written in a complicated fashion in a programming language not widely known outside scientific circles – Fortran 77 – and tailored specifically to a single task within the USHCN product. Applying this implementation of the MW2009 algorithm to a small sample of data for testing purposes is exceedingly difficult because the code expects USHCN data and is not flexible.

This presents a major problem. Deserved or not, there is much public scrutiny of the quality of the USHCN data network [6,7]. This scrutiny extends beyond the quality of the stations to the actual process by which temperature products are made, and has even led to the creation of third-party projects either to translate existing products into clearer, easier-to-understand and easier-to-use code [8], or to create new temperature records altogether in a transparent, open manner [9]. Ultimately, much of this scrutiny rests on the fact that the methods and techniques used to create temperature products are obscured in pay-walled journal articles or in Fortran sources stashed on seemingly random NCDC FTP servers.

Re-coding key temperature analysis products and algorithms in a clear, easy-to-understand and reproducible format might be an effective way to dispel some of the mythology surrounding said products. To the lay public, the process by which volunteer temperature observations are converted into a national or global temperature record is a mysterious black box, and as such it is hard to inspire much public faith in the records’ veracity. Clearly laying out the algorithms in such a manner that an end-user could download the code and run it on a test dataset of their own construction would help inspire trust in these products.

However, there are loftier reasons to undergo such a re-coding. The atmospheric sciences as a field are somewhat backwards in that they are very slow to adopt new and emerging technologies. We still write most of our hardcore numerical code in Fortran 77, and we suffer in our data analysis and general scientific programming because many atmospheric scientists lack formal backgrounds in computer science or software engineering. Recently, there has been an effort to advocate for the adoption of modern programming tools and techniques in the atmospheric sciences, resulting in a short course on Python being offered at the 2011 American Meteorological Society’s Annual Meeting [10]. From personal experience interacting with climate scientists through internships and graduate school visits, there is much interest in exploring the possibility of using new technologies to build better and faster models and other numerical tools, but what is lacking is a clear leader to step forward and guide the proliferation of such technologies.

Broadly adopting new tools such as Python within the atmospheric science community would revolutionize the way we conduct our business. Languages such as Python are mostly standardized (assuming users are running the same version and have the same third-party libraries installed), and writing code in Python – be it simple shell scripts for processing data, visualization and analysis scripts, or full-blown model code – would help boost reproducibility. Furthermore, moving to adopt a new language as a standard scientific tool affords the community the opportunity to re-evaluate guidelines on what constitutes proper coding style, proper documentation, and other issues. In addition, it would forge the relationships we will need to bring atmospheric science into the 21st century. It is not uncommon for computer scientists to work with atmospheric experts to construct new models, but the scientists generally outnumber the programmers by a great deal. Adopting newer and better technologies might be a persuasive first step in attracting more programming talent to aid in the development of next-generation models and tools.

With these goals in mind, this project proposes to accomplish three major tasks:

  1. Re-code the MW2009 algorithm in clear, well-documented, easy-to-understand Python. This part of the project would focus entirely on the numerical aspects of the algorithm, divorced from the temperature data that would inevitably be fed into it; ultimately, the numerical code should be able to be dropped into an existing temperature product to replace that product’s default homogenization routine (a sketch of such a drop-in interface follows this list).
  2. Additionally, choose several (2 or 3) algorithms from [1] or [2] to code in a similar manner. This collection of algorithms could then be packaged together as a library of code which could serve several purposes: it could be used as sample code for atmospheric scientists wishing to learn more about how to adopt Python as their tool of choice for numerical work, thus broadly disseminating strong software engineering principles and proper coding practices; it could be used as part of further work by others to build a completely open-source and transparent temperature record product; and it could be used to supplement the existing literature and code samples from the NCDC pertaining to these particular algorithms, exposing this sort of work to the public in an easier-to-understand fashion.
  3. Time permitting, the code could be used in conjunction with a dataset from either the USHCN or another major temperature product to create a new, simple temperature product. This would lay the foundation for further work in creating the aforementioned open-source, transparent temperature product.
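To make the first goal concrete, the sketch below shows one possible shape for the library’s public interface: every algorithm would share a uniform calling convention, so that an existing temperature product could swap its default homogenization routine for any routine in the library. All of the names here are placeholders of my own invention, not a finalised design.

    def pairwise_homogenize(series, neighbour_series, metadata=None):
        """MW2009-style pairwise homogenization (placeholder signature)."""
        raise NotImplementedError

    def reference_series_homogenize(series, neighbour_series, metadata=None):
        """A reference-series based test, e.g. one of those reviewed in [1, 2]."""
        raise NotImplementedError

    ALGORITHMS = {
        "mw2009": pairwise_homogenize,
        "reference": reference_series_homogenize,
    }

    def homogenize(series, neighbour_series, method="mw2009", **kwargs):
        """Dispatch to the chosen algorithm with a uniform calling convention."""
        return ALGORITHMS[method](series, neighbour_series, **kwargs)

Keeping all of the numerical routines behind a single dispatch function like this is what would allow the code either to be dropped into an existing product or to be run directly on a user’s own data without modification.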

These goals are not without risk. First, it is possible that Python is too slow for the brute-force numerical computations used in these algorithms – particularly when they are scaled up to thousands of stations of data. In this case, it might be possible to wrap the most numerically-intense portions of the code as Fortran routines accessed from Python (for example via f2py), or (preferably) to write NumPy-accelerated versions of the numerically-intense methods. Second – but most importantly – it will likely be difficult at first to translate the MW2009 algorithm into pure Python, even with the sample Fortran routines. This issue would likely be addressed by directly contacting Menne and Williams, as will be discussed later in this proposal.
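As an illustration of the NumPy-accelerated option, the per-changepoint loop in the earlier toy sketch can be replaced with cumulative-sum arithmetic. The function below is again only illustrative – it operates on a precomputed candidate-minus-neighbours difference series and is not part of any published algorithm – but it shows the general pattern by which numerically intense inner loops could remain in Python without sacrificing speed.

    import numpy as np

    def find_step_change_fast(diff):
        """Vectorised version of the earlier toy changepoint scan, operating
        on a precomputed candidate-minus-neighbours difference series."""
        diff = np.asarray(diff, dtype=float)
        n = len(diff)
        csum = np.cumsum(diff)
        k = np.arange(2, n - 2)                          # candidate changepoints
        left_mean = csum[k - 1] / k                      # mean of diff[:k]
        right_mean = (csum[-1] - csum[k - 1]) / (n - k)  # mean of diff[k:]
        scores = np.abs(right_mean - left_mean) / (diff.std() + 1e-12)
        best = int(np.argmax(scores))
        return int(k[best]), float(right_mean[best] - left_mean[best])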

Additionally, connecting these algorithms to existing temperature products will likely be difficult, and tackling this as the last part of the project is risky because there may not be much time left at the end of the project if soft deadlines within its timeframe are not met promptly. Even without successfully completing this particular goal, however, the library of clearly-coded algorithms would still be a valuable addition to the climate science world.

A great deal of support will be pursued to aid in this project. It should be explicitly noted that since I strongly identify myself as an advocate for the adoption of better coding practices and tools in the atmospheric sciences, I am not averse to strong networking and building a corps of mentors for this project in addition to the Climate Code Foundation. For instance, pursuing the aid of Matthew Menne and Claude Williams [3] would likely prove valuable when re-writing the MW2009 algorithm, and Art DeGaetano [1] would likely be able to advise on which additional algorithms would be suitable for inclusion in this library (in full disclosure, Dr. DeGaetano was a professor of mine at Cornell). I plan on working on this project while spending the summer in San Francisco, which makes a partnership or liaison with the team behind the Berkeley Earth Surface Temperature project [9] a feasible possibility. These sorts of partnerships would enable me to achieve a deeper understanding of the core numerical algorithms behind the product, and would provide valuable contacts should I explore the scientific consequences of deploying these algorithms (an algorithm inter-comparison towards the end of this project is a possible replacement for the third project goal I outlined, should that goal prove too difficult).

By the end of this project, an open-source, clearly coded library of homogenization algorithms including MW2009 should be available for general use. The library will be specifically designed so that an end user could either a) use a routine to replace or supplement the algorithm in an existing temperature product, or b) feed raw temperature data with minimal pre-processing (possibly just formatting into a large Python list or some other simple data structure) into any algorithm in the library to obtain a homogenized dataset. In addition, I would like to wrap up this project in such a fashion that its results could be presented at either the Fall meeting of the American Geophysical Union or the 2012 American Meteorological Society Annual Meeting.
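For option b), the “minimal pre-processing” envisaged might be nothing more than turning a plain Python list of monthly values – with None marking missing months – into a masked array that every routine in the library accepts. The exact input convention is an open design question, so the snippet below is only an assumption about how it could look.

    import numpy as np

    # A raw monthly series as an end user might supply it; None marks a missing month.
    raw = [14.2, 14.5, None, 15.3, 15.2, None, 15.4]

    # Convert to a masked array so every algorithm can handle gaps uniformly.
    series = np.ma.masked_invalid([np.nan if v is None else v for v in raw])
    print(series)   # values are preserved, missing months are masked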

Timeline

Prior to May 23:

  • Contact Menne, Williams, DeGaetano, and other scientists working with these algorithms and pursue support either in the form of additional code samples, sample datasets, or a commitment to help answer questions related to the algorithms themselves.

May 23 – June 1:

  • I will be in the process of finishing prior research at Cornell and coordinating logistics both for my move to San Francisco for the summer and for my subsequent move to Cambridge, MA at the end of August.

June 1 – June 15:

  • Interregnum in Louisville, KY, while my move to San Francisco is finalized (please see the “MISCELLANY” section for details on why I am making this move). With help from mentors, decide on the library’s package structure, and finalize the choice of additional algorithms to code besides MW2009. If possible, begin translating the initial steps of the MW2009 algorithm into Python.

June 15 – July 6:

  • Continue re-writing the MW2009 code. Throughout this process, the algorithm should regularly be tested on sample and test datasets. The algorithm should be finished by the end of this period as a major goal for the mid-term evaluation. By July 6th, the working branch of the homogenization library’s source code should be functional enough that an end-user could download a copy and immediately run the algorithm on their own data or deploy it into another temperature product.

July 6 – August 1:

  • Implement the chosen additional algorithms in the library. If necessary, replace a previously-chosen algorithm with another if it proves too difficult to write in Python – that is, if it is too complex or runs too slowly.

August 1 – August 15:

  • Attempt to integrate the library into an existing temperature product. If this proves too difficult, reproduce the intercomparison from DeGaetano 2006 [1] and expand it to include MW2009.

About Me

To cut down on the length of this document, please refer to my CV and the “Additional Info” document posted to my personal webpage.

Bibliography

[1] DeGaetano, A. T., 2006: Attributes of several methods for detecting discontinuities in mean temperature series. J. Climate, 19, 838–853. (http://journals.ametsoc.org/doi/abs/10.1175/JCLI3662.1)

[2] Reeves, J., J. Chen, X. L. Wang, R. Lund, and Q. Lu, 2007: A review and comparison of changepoint detection techniques for climate data. J. Appl. Meteor. Climatol., 46, 900–915. (http://journals.ametsoc.org/doi/abs/10.1175/JAM2493.1)

[3] Menne, Matthew J., Claude N. Williams, 2009: Homogenization of Temperature Series via Pairwise Comparisons. J. Climate, 22, 1700–1717. (http://journals.ametsoc.org/doi/abs/10.1175/2008JCLI2263.1)

[4] (http://cdiac.ornl.gov/epubs/ndp/ushcn/ushcn.html)

[5] (ftp://ftp.ncdc.noaa.gov/pub/data/ushcn/v2/monthly/software/)

[6] (http://surfacestations.org/)

[7] (http://wattsupwiththat.files.wordpress.com/2009/05/surfacestationsreport_spring09.pdf)

[8] (http://clearclimatecode.org/gistemp/)

[9] (http://berkeleyearth.org/)

[10] (http://pyaos.johnny-lin.com/)