2011 Student: Hannah Aizenman

Hannah Aizenman was a student mentored by the Foundation in the Google Summer of Code 2011. This was her project proposal.  There are also blog posts reporting her project progress.

Climate change has become synonymous with global warming, itself a much-studied phenomenon that has attracted a great deal of controversy, in large part because of a lack of public understanding of how global temperature is studied. This project attempts to increase the public’s knowledge of how global temperature records are constructed and analyzed by providing access to information about the records as well as the raw data and tools to visualize that data in a variety of ways. This initial proposal will focus on working with the CCSM Pseudoproxy Experiments (http://www.ldeo.columbia.edu/~jsmerdon/2011_grl_supplement.html) because they are used in a wide variety of climate applications and are a prime example of the type of data used to build climate reconstructions, which are often the source of data for graphs of global warming trends. The hope is that by making it easy to explore the data and build their own graphs, the public will gain a greater understanding of the work scientists do to understand climate.

My first deliverable is a wrapper for matplotlib (and maybe pyNGL) that generates climate visualizations by taking care of the boilerplate code (setting up the various axes, ticks, plots, colormaps, colorbars, legends, labels, maps, and other pieces of the figure) so that the user does not have to understand the matplotlib object model and API in order to generate understandable figures.
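
As a rough illustration (the function name and arguments here are placeholders, not a settled API), a wrapper call might look something like this:

    import numpy as np
    import matplotlib.pyplot as plt


    def plot_timeseries(time, values, title="", xlabel="", ylabel="", ax=None):
        """Plot a labeled time series, hiding the usual matplotlib setup."""
        if ax is None:
            _, ax = plt.subplots()
        ax.plot(time, values)
        ax.set_title(title)
        ax.set_xlabel(xlabel)
        ax.set_ylabel(ylabel)
        ax.grid(True)
        return ax  # hand back the matplotlib object for further customization


    # Fake data standing in for a temperature record
    years = np.arange(1850, 2011)
    temps = np.random.randn(years.size).cumsum()
    plot_timeseries(years, temps, title="Example temperature anomaly",
                    xlabel="Year", ylabel="Anomaly (°C)")
    plt.show()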

This wrapper will consist of generic/abstract visualization functions that can be used across datasets, plus either a Python dictionary, an XML document, or a configuration file containing parameters for the various datasets the user intends to visualize. The raw data will be passed to a function that extracts data from whichever file type the user supplies; initially the project will support .dat files (using numpy) and likely NetCDF files (using scipy.io.netcdf or pyNIO) or files obtained through openDAP (using pydap). These extraction functions will be written in a plug-and-play style so that support for more formats can be added as needed. I plan to parse the parameters into a “data object” to be passed around to the various visualization functions, because I’ve found that an object model works well for handling lots of parameters in my current project, which focuses on analyzing NCEP/NCAR Reanalysis data (http://csdirs.ccny.cuny.edu/csdirs/projects/non-linear).
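
A minimal sketch of the plug-and-play idea, with hypothetical names (DataBundle, EXTRACTORS, extract) standing in for whatever the real code ends up calling them:

    import os
    import numpy as np


    class DataBundle(object):
        """Carries the raw values plus the dataset-specific parameters."""
        def __init__(self, values, params):
            self.values = values
            self.params = params  # e.g. labels, units, colormap choices


    def load_dat(path, **kwargs):
        # plain-text data via numpy
        return np.loadtxt(path)


    def load_netcdf(path, variable=None, **kwargs):
        # scipy.io.netcdf is one of the readers named in this proposal
        from scipy.io import netcdf
        return netcdf.netcdf_file(path, 'r').variables[variable][:]


    # Plug and play: supporting a new format just means adding an entry here
    EXTRACTORS = {'.dat': load_dat, '.nc': load_netcdf}


    def extract(path, params):
        ext = os.path.splitext(path)[1]
        values = EXTRACTORS[ext](path, **params.get('load_args', {}))
        return DataBundle(values, params)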

I want to keep the dataset-specific parameters separate from the visualization code to make it easy to add new datasets, to let users easily throw in their own customizations, and to make the code reusable for similar projects. For the same reason, I’m sticking to a very modular design, mostly one function per task, to allow easy reuse of pieces of code and the building of new visualization functions as needed. I also plan to have the code return the matplotlib object so that the user can take full advantage of the library to further customize the figure.
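
Continuing the hypothetical sketch above (and reusing its plot_timeseries, years, and temps), returning the matplotlib object would let a user keep going like this:

    # The wrapper hands back the Axes, so the full matplotlib API stays available
    ax = plot_timeseries(years, temps, title="Example temperature anomaly",
                         xlabel="Year", ylabel="Anomaly (°C)")
    ax.axhline(0, color='gray', linestyle='--')  # add a zero-anomaly reference line
    ax.set_xlim(1900, 2010)                      # zoom in on part of the record
    ax.figure.savefig('anomaly.png', dpi=150)    # or hand the figure to a GUI/web page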

I will likely end up writing a command-line interface to these functions for rapid testing. I think it will take about a month to pull together the code for the various visualizations: a few hours for really simple figures, up to a couple of days for really complex (3D, rotating, highly interactive) ones. As I don’t yet know exactly what needs to be visualized, I can’t give a more concrete time frame, but I will assume this part of the project takes up June.
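
A possible shape for that testing interface; the flags are illustrative only, and the plotting here is deliberately trivial rather than a call into the real wrapper:

    import argparse

    import matplotlib
    matplotlib.use('Agg')  # render to files; no display needed for quick tests
    import matplotlib.pyplot as plt
    import numpy as np


    def main():
        parser = argparse.ArgumentParser(
            description="Quickly render a figure from a plain-text data file.")
        parser.add_argument("datafile", help="path to a one-column .dat file")
        parser.add_argument("--title", default="", help="figure title")
        parser.add_argument("--out", default="figure.png", help="output image path")
        args = parser.parse_args()

        values = np.loadtxt(args.datafile)
        fig, ax = plt.subplots()
        ax.plot(values)
        ax.set_title(args.title)
        fig.savefig(args.out)


    if __name__ == "__main__":
        main()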

My second deliverable is a GUI that sits on top of the wrapper and that the user can use to generate visualizations. The GUI will have two main menus, simple and advanced. The simple menu will let the user choose a dataset and some attributes (like location or time) to generate figures, much like the forms on the GISTEMP website (http://data.giss.nasa.gov/gistemp/maps/) and on the NCEP site (http://www.esrl.noaa.gov/psd/data/correlation/). The advanced menu will let users customize the visualization attributes (things like labels, tick marks, and increments). If there is time, the plan is then to put this GUI on the web using a lightweight Python server like bottle (http://bottlepy.org/docs/dev/). I haven’t chosen a library for the GUI yet, though I was thinking of either PyQt, because of its portability, or a web framework that handles HTML forms (like Pylons or Pyjamas). I think the GUI will take about a month, including two weeks just to learn whichever library/framework I end up using, then a few days on each menu (including testing). Putting it on the web reliably will probably take a week or two (accounting for lots of things going wrong). I’m aiming for this phase of the project to take up July, and I expect the web part to leak into August (unless it’s integrated from the start, which it very well may be).
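
As a very rough sketch of how the web version might hang together, a single bottle route could accept form parameters and hand back a rendered PNG; the route, form fields, and placeholder data below are invented for illustration:

    import io

    import matplotlib
    matplotlib.use('Agg')  # a backend that renders to buffers, no display needed
    import matplotlib.pyplot as plt
    import numpy as np
    from bottle import Bottle, request, response

    app = Bottle()


    @app.route('/plot')
    def plot():
        # In the real GUI these values would come from the simple/advanced menus
        title = request.query.get('title', 'Example figure')
        npoints = int(request.query.get('points', 100))

        fig, ax = plt.subplots()
        ax.plot(np.random.randn(npoints).cumsum())  # placeholder data
        ax.set_title(title)

        buf = io.BytesIO()
        fig.savefig(buf, format='png')
        plt.close(fig)
        response.content_type = 'image/png'
        return buf.getvalue()


    if __name__ == '__main__':
        app.run(host='localhost', port=8080)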

The documentation for all the deliverables will be written in reStructuredText (a text markup syntax commonly used for Python documentation) so that it can be generated with Sphinx (http://sphinx.pocoo.org/). I plan to use Sphinx because that’s what my lab uses and because it seems to be pretty standard in the Python community.
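
For example, a docstring written in the reStructuredText field-list style that Sphinx can render (shown here on the hypothetical wrapper function from the earlier sketch):

    def plot_timeseries(time, values, title=""):
        """Plot a labeled time series.

        :param time: sequence of time values (e.g. years)
        :param values: sequence of data values, the same length as *time*
        :param title: figure title
        :returns: the matplotlib ``Axes`` holding the plot
        """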

I think that my project will promote public understanding of climate science because a picture is worth a thousand words: good visualizations help just about anyone get a handle on what a paper or scientist is talking about. I think the GUI will probably have an even stronger impact because it gives anyone with an interest a chance to see how figures are generated and how they change with the parameters fed to the algorithm/visualization. I also think letting anyone create a figure is a good way of getting across the message that published figures aren’t photoshopped or otherwise manipulated, which goes a long way toward strengthening the credibility of any climate organization.

Also, since figures are often taken straight from a paper or website without context, the secondary goal of this project is to simplify the generation of figures that tell a good part of the story in their labels, legends, ticks, and other properties, unlike the figure given as a bad example on the GSoC 2011 climate code mailing list (http://www.pyngl.ucar.edu/Examples/Images/ngl02p.2.png). That figure is so free of context as to be just about useless: neither the title nor the other labels say much about what type of data this is, what is being plotted, or where and when the data occurs. The poor use of hatching and small labels in lieu of a proper colorbar further hinders the understandability of the figure by not providing enough differentiation between the levels (and it is unclear whether the levels are labeled at all), as does the use of bare latitudes and longitudes instead of an overlaid map. This project instead hopes to produce figures that inform about regional temperature variation in a clear way, for example http://story645.imgur.com/0LI8X#5FG1q.

This project will also facilitate data exploration and exchange through its support of openDAP (the Data Access Protocol) and the NetCDF (Network Common Data Form) format. These tools further the goals of the Common Climate Project by facilitating easy access to information about the data, thereby helping to make the data extraction and processing steps easier and more transparent. NetCDF and openDAP do an excellent job of encapsulating metadata, packaging the raw data together with information about it (including where it came from, when it was recorded, what it might be missing, how to unpack it, who gathered it, and so on) into one file, so that a user does not need to dig through dozens of documents to get a handle on the data; in effect, the data tells its own story. By putting everything into one file, the goal is to let the user start playing with the data right away, because it is all there. This easy, open access to data is central to the Common Climate Project’s goal of making climate data public-friendly, so support for these formats is crucial to any analysis or visualization toolkit built for it.
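
As a small illustration of that self-description, here is one way the metadata in a NetCDF file could be listed with scipy.io.netcdf (the filename and attribute names below are placeholders):

    from scipy.io import netcdf

    f = netcdf.netcdf_file('example.nc', 'r')  # 'example.nc' is a placeholder

    # Global attributes (title, history, etc.) travel inside the file itself
    print(getattr(f, 'title', '(no title attribute)'))
    print(getattr(f, 'history', '(no history attribute)'))

    # Each variable carries its own dimensions and, typically, its units
    for name, var in f.variables.items():
        print(name, var.dimensions, getattr(var, 'units', '(no units)'))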

The main risk I can think of is web integration failing spectacularly, which is why I’m thinking of using a web framework from the start and keeping the web in mind at every step of the GUI, and even when writing the visualization code (since some matplotlib backends play well with the web and others don’t).

A secondary risk is that my time may be somewhat limited in June because of travel. I hope to mitigate this by reusing some existing code, and I don’t think the vacation will be much of an issue because I will have my laptop, much of the code does not require an internet connection, and I usually have an internet connection when I travel anyway.