Homogenization project progress

This guest post is written by Daniel Rothenberg, one of our Google Summer of Code students, who is working on a library for climate record homogenization. His previous post introduced his project.

At the halfway point in my Google Summer of Code project, I am happy to report that a great deal of progress has been made. A few weeks ago, I set out to re-write the Pairwise Homogenization Algorithm (PHA) [1] used by the United States Historical Climatology Network (USHCN) [2]. While there is a published version of this algorithm available online [3], the code is written in Fortran and complicated to read, understand, and use. My project aimed to de-obfuscate this code, and use it to build a library of similar codes that people could use in the future to explore surface temperature homogenizations and reconstructions.

The first task in this project was to port the PHA from its current form (in Fortran) to something more accessible and maintainable. Thus, I’ve spent the majority of my time slogging through complex, dense Fortran subroutines, working out the array-traversing logic that governs the mechanics of the algorithm. While there is a published, high-level description of this algorithm and its logic [1], nothing quite compares to seeing how the code is actually implemented. By far, the most difficult obstacle in this project so far has been translating existing code—like the semi-hierarchical splitting algorithm used to identify possible undocumented changepoints—into a Pythonic, easy-to-understand form.

So far, I’ve had a lot of success overcoming this sort of obstacle, but it’s never an easy feat. The original code, by Claude Williams and Matt Menne, has useful comments and documentation, but it is written in an older style of Fortran (Fortran 77) and uses many conventions that are avoided in modern programming. A good example is the copious use of goto statements. In a nutshell, these tell a program to jump to a labeled line elsewhere in the code—even out of control structures like loops. While they’re useful for some tasks, they tend to produce what’s often derided as “spaghetti code”: code whose control flow winds every which way and is hard to follow.
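To make the contrast concrete, here is a hypothetical sketch (not taken from the PHA itself) of how a common Fortran pattern—a goto that jumps clean out of a pair of nested loops—is usually expressed in Python: wrap the loops in a function and let return play the role of the jump.

```python
def find_first_cell(matrix, target):
    """Return the (row, col) of the first cell equal to target, else None.

    In Fortran 77 this search would typically end with a
    `goto` out of both loops; in Python, `return` does the job.
    """
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value == target:
                return i, j  # plays the role of `goto 100` out of both loops
    return None
```

The function name and data here are invented for illustration; the point is only the control-flow translation.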

There are good ways to untangle spaghetti code, though. For instance, I’ve been developing my code against a test data set that I repeatedly run through the Fortran routines. By adjusting where the original Fortran program stops executing, I can liberally sprinkle debugging information like print statements throughout the code, and re-compile and run it in seconds when needed. This lets me track how variables like for-loop index counters change over time, and allows me to pinpoint exactly where looping code breaks and where it jumps to.

Then, once I understand how it works, I can translate it into Python—but only with a few clever tricks! You see, Python doesn’t have a goto statement. It does, however, include ways to alter the flow of loops: the continue statement, which skips to the next iteration of a loop, and the break statement, which exits the loop entirely. These are useful, but have caveats; for instance, break only exits one level of loop, so if you’re looping over two indices, a single break can’t escape the outer “master” loop. Other nifty Python tools let you overcome this, though. For instance, zip pairs up two lists of values element by element, which often lets you condense the index bookkeeping of some complex nested for-loops from Fortran into simpler and easier-to-understand loops in Python.
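Both caveats can be shown in a few lines (the values are illustrative, not PHA data):

```python
# 1) `break` only exits the innermost loop, so escaping a nested
#    traversal takes a second break guarded by a sentinel:
hit = None
for i in range(3):
    for j in range(3):
        if i + j == 3:
            hit = (i, j)
            break          # leaves only the inner loop...
    if hit is not None:
        break              # ...so the outer loop needs its own break

# 2) `zip` pairs two lists element-wise, replacing Fortran-style
#    shared-index bookkeeping with a single loop over both at once:
years = [1895, 1896, 1897]
temps = [10.2, 11.0, 10.7]
pairs = [(year, temp) for year, temp in zip(years, temps)]
```

The first pattern stops at `hit == (1, 2)`, the first index pair whose sum reaches 3.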

Other obstacles have sometimes involved uncovering bugs in the original PHA code. To date, I’ve found three significant bugs which could potentially change some of the changepoint detections in the algorithm. Two of these relate to a form of linear regression called the Kendall-Theil robust line fit. In this method, you compute pairwise slope estimates from every pair of values in your data, and take the median of those slopes as the slope of the regression line. One bug I found involved the two-phase regression form of this code (used when you hypothesize that the slope on one half of a data segment differs from the slope on the other half), which appears in the changepoint detection tests based on the Bayesian Information Criterion. A second bug inadvertently overwrote a variable holding the median values that had been found. I have reported these bugs back to the authors.
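The basic (single-phase) form of the fit described above can be sketched in a few lines. This is a minimal illustration of the Kendall-Theil (Theil–Sen) idea—median of pairwise slopes, then median residual as the intercept—not the PHA’s own implementation, and the function name is my own:

```python
from itertools import combinations
from statistics import median

def kendall_theil_fit(x, y):
    """Return (slope, intercept) of the Kendall-Theil robust line.

    slope     = median of all pairwise slopes (y2 - y1) / (x2 - x1)
    intercept = median of the residuals y - slope * x
    """
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(x, y), 2)
              if x2 != x1]
    slope = median(slopes)
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return slope, intercept
```

Because the fit uses medians, a single wild value barely moves it: fitting `x = [0, 1, 2, 3, 4]` against `y = [1, 3, 5, 7, 100]` still recovers a slope of 2 and intercept of 1, despite the outlier at the end.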

These bugs bring me full circle back to the original intent of this project, which is to de-obfuscate code and make it as easy to understand and transparent as possible. The only way to catch these bugs was to dig through the code and keep detailed accounts of what values the various loops take on during execution. These sorts of runtime errors can be very hard to catch—especially in complex code that is hard to understand. There are a few tried-and-true methods to alleviate them, however. First, you can use large sets of simple test cases to catch the various corner cases and bugs that can creep into your code. I use this method to validate my auxiliary methods, such as correlation computations. However, sometimes it’s just not tenable to generate test cases—the Kendall-Theil code is complicated enough that while you could theoretically work out simple cases by hand, you really need some sort of numerical code to perform the method.
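As an illustration of testing an auxiliary routine against hand-checkable cases, here is a sketch with a stand-in Pearson correlation helper (`pearson_r` is hypothetical, not the project’s actual function) validated on inputs whose answers are obvious by inspection:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hand-checkable corner cases:
assert abs(pearson_r([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-12  # perfect positive
assert abs(pearson_r([1, 2, 3], [6, 4, 2]) + 1.0) < 1e-12  # perfect negative
```

A library of such tiny cases is cheap to build for simple routines, which is exactly why it breaks down for something as involved as the two-phase Kendall-Theil fit.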

It is in cases like this that it is so important to practice good software development and engineering principles, like organized code, strong documentation, and iterative development. The truth is that a great deal of the numerical code used in scientific programming is dense and complicated; reducing that complexity and writing in as clear and transparent a manner as possible helps both the end user and the author ensure that the code is valid and does what it is supposed to do.

With that said, although I’ve accomplished much in my project so far [4], there is still a great deal to do. Expect a second post on this soon, which will also recap a recent trip I made to the National Climatic Data Center.

[1] ftp://ftp.ncdc.noaa.gov/pub/data/ushcn/v2/monthly/menne-williams2009.pdf
[2] http://www.ncdc.noaa.gov/oa/climate/research/ushcn/#phas
[3] http://www.ncdc.noaa.gov/oa/climate/research/ushcn/#homogeneity
[4] http://code.google.com/p/ccf-homogenization/
