As you may recall, I spent the past summer working on behalf of the Climate Code Foundation to port and revise the Pairwise Homogenization software used by the National Climatic Data Center to produce the US Historical Climate Network dataset. Since my last update in the middle of July, I have worked through a first pass at the remaining sections of the algorithm and have arrived at a major milestone – a Python program which can take arbitrary networks of stations from the USHCN raw data and homogenize them based on pairwise comparisons.
Figure 1 illustrates the homogenization results for two stations which were passed into the algorithm with a random selection of 50 other stations from across the USHCN. This test shows that the new code does some things very well, but still needs work. For starters, the diagnostic output log from running the code on this test case makes it clear that the code reproduces its Fortran parent’s results nearly exactly up through the final “CONFIRMFILT” stage of analysis. At this stage, the code attempts to condense a large number of suspected breakpoints into a best fit over the data. There are still some discrepancies between my code and the original here, and they tend to suppress the final number of detected changepoints. A perfect example of this is in the NEW ULM plot in Figure 1; the Python code misses the first detected changepoint around the year 2000, while it successfully finds others that the Fortran code spots. By contrast, the Python code sometimes fails to remove extra changepoints – particularly around swaths of ‘deleted’ data (data which cannot be analyzed in this algorithm, usually because there aren’t enough paired neighbors to provide supporting information); this is illustrated well by the COLFAX plot in Figure 1, around 1910.
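For readers unfamiliar with the pairwise idea, here is a toy sketch of what "condensing suspected breakpoints" means. It is not the project's code and not the CONFIRMFILT logic: the function names, the `threshold` and `min_votes` parameters, and the simple mean-shift search are all assumptions made for illustration. Each target-minus-neighbor difference series nominates one suspect breakpoint, and a breakpoint is only confirmed if enough neighbor pairs agree on it.

```python
from collections import Counter
from statistics import mean


def difference_series(target, neighbor):
    """Target minus neighbor, value by value; None marks missing data."""
    return [None if t is None or n is None else t - n
            for t, n in zip(target, neighbor)]


def largest_shift(diff, min_seg=3):
    """Return (index, size) of the biggest change in mean within `diff`.

    A hypothetical stand-in for the real statistical tests: it simply
    tries every split point and compares the means on either side.
    """
    best_idx, best_size = None, 0.0
    for i in range(min_seg, len(diff) - min_seg + 1):
        before = [d for d in diff[:i] if d is not None]
        after = [d for d in diff[i:] if d is not None]
        if len(before) < min_seg or len(after) < min_seg:
            continue  # too little data on one side to compare
        size = abs(mean(after) - mean(before))
        if size > best_size:
            best_idx, best_size = i, size
    return best_idx, best_size


def confirm_changepoints(target, neighbors, threshold=0.5, min_votes=2):
    """Keep only the suspect breakpoints that several pairs agree on."""
    votes = Counter()
    for neighbor in neighbors:
        idx, size = largest_shift(difference_series(target, neighbor))
        if idx is not None and size >= threshold:
            votes[idx] += 1
    return sorted(i for i, v in votes.items() if v >= min_votes)


# Toy data: the target station jumps by 1.0 at index 5; neighbors are flat.
target = [10.0] * 5 + [11.0] * 5
neighbors = [[9.0] * 10, [10.5] * 10, [8.0] * 10]
confirmed = confirm_changepoints(target, neighbors)
```

In this toy case all three pairs nominate the same index, so it is confirmed. The real difficulty, and the source of the discrepancies described above, lies in cases where the pairs disagree or where stretches of data are missing.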
Although this is the major glitch in the code at this point, there are some other issues which need to be ironed out. First, there are some numerical issues associated with calculating the standardized adjustments to apply at each changepoint. From my experience with other parts of the code, this is likely a sign error in the statistical test which calculates the final adjustment at each changepoint, so it should be simple to find and fix in the future. Second, the algorithm needs to be adjusted to accept external sources of documented changepoints – this will greatly improve its ability to find the “best” changepoints in the cloud of suspect ones it finds through the first half of the algorithm. Finally, I am still working on re-engineering the code in its existing form to work more in the fashion of an API so that it can be more easily used on various datasets in the future.
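To show why a sign error in the adjustment step matters, here is a minimal sketch of one way a step adjustment could be estimated and applied. It rests on assumptions that are not taken from the actual code: the adjustment is the median, across neighbor pairs, of the change in the mean of the difference series at the changepoint, and the earlier segment is shifted to match the later one. The names `step_size` and `adjust_before` are hypothetical.

```python
from statistics import mean, median


def step_size(diff, changepoint):
    """Mean of the difference series after the changepoint minus before it."""
    return mean(diff[changepoint:]) - mean(diff[:changepoint])


def adjust_before(target, neighbors, changepoint):
    """Shift the segment before `changepoint` to line up with the later one.

    Assumed convention: adjustment = (mean after) - (mean before), added
    to the earlier values. Returns (adjusted series, adjustment).
    """
    steps = [
        step_size([t - n for t, n in zip(target, neighbor)], changepoint)
        for neighbor in neighbors
    ]
    adjustment = median(steps)  # robust to one badly behaved neighbor
    adjusted = [t + adjustment for t in target[:changepoint]]
    return adjusted + list(target[changepoint:]), adjustment


# Toy data: the target station jumps by 1.0 at index 5; neighbors are flat.
target = [10.0] * 5 + [11.0] * 5
neighbors = [[9.0] * 10, [10.5] * 10, [8.0] * 10]
adjusted, adjustment = adjust_before(target, neighbors, changepoint=5)
```

With the sign as written, the artificial step disappears. Flipping it (subtracting the adjustment instead of adding it) would double the step rather than remove it, which is the kind of symptom a sign error produces.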
This project wouldn’t have been possible without the support and advice of David Jones and Nick Barnes of the Climate Code Foundation, and the help of Claude Williams and Matt Menne at the National Climatic Data Center. Hannah and Filipe – my GSoC compatriots – also provided great feedback and help throughout our code reviews and meetings. I’d like to thank them for all their time and effort over the summer!
Finally, I’m excited to continue working on this project – especially over the next few months leading up to the 2012 Annual Meeting of the American Meteorological Society, where I will hopefully be presenting a talk entitled “Lessons From Deploying the USHCN Pairwise Homogenization Algorithm in Python” as part of the 2nd Symposium on Advances in Modeling and Analysis Using Python. There is much work still to do:
- Refinement of the Python homogenization code, including addressing known bugs in the CONFIRMFILT process.
- Further collaboration with Menne/Williams to improve the code and thoroughly document how it differs from the Fortran homogenization code.
- A possible project with David Jones, looking at applying this algorithm to data from the Canadian climate record.
If you’re interested in helping continue this project, please contact me or the Climate Code Foundation – we’d love to have you on board!