Apologies for the enormous delay in writing this post. There is a massive backlog, and I am intending to write one post every day until it’s cleared.
On 2011-08-26 I attended the First International Workshop on Climate Informatics, in New York City. It was a very interesting event which gave me some new insights into computer use in climate science. This blog post is my trip report.
The workshop was run by Gavin Schmidt and Claire Monteleoni, both of Columbia University. Gavin is a climate scientist, very well known for his modelling work and as founder of and active poster on the RealClimate science blog. Claire is a computer scientist and informatician, whose research focus is on machine learning specifically as it can be applied to climate science. They very kindly invited me along to give an outsider’s perspective on the new field of Climate Informatics.
Informatics is the name now given to the field which was called “machine learning” when I was a computer scientist, back when the world was young. It incorporates both guided and unguided discovery of structure in numeric and relational datasets, from basic statistical techniques such as least-squares regression and interpolation, to more advanced techniques such as cluster analysis, to data mining. There’s Big Money in informatics: finding correlations and patterns in large data volumes is fundamental to several whole industries these days, including advertising (think of PageRank, or store loyalty cards, or automated recommendation systems: “you might also like this”).
Climate science certainly has large datasets. Historical datasets of temperature, precipitation, pressure, and other climatic variables may contain many millions of observations. The CMIP3 model intercomparison dataset (used in IPCC AR4) is 36 terabytes. The NCEP/NCAR reanalysis dataset is over 250 terabytes. When the CMIP5 dataset is finished in 2013, it will include around 2 petabytes. So how can informatics help climate scientists to make sense of all this data? Can it uncover previously unknown connections or correlations? (answer: yes! how many do you want?) How can we identify which of these correlations are meaningful? (answer: with difficulty, and only by applying understanding). Can it help us to forecast future climate? (answer: maybe).
This one-day event was held in the rather fabulous offices of the New York Academy of Sciences, on the 40th floor of 7 World Trade Center, a moderately-tall building overlooking the Ground Zero monument and already dwarfed by the half-built Freedom Tower. The NYAS is obviously a very formal institution – unusually for a workshop, everyone in the printed program was given a title, including a nonexistent doctorate for me.
As the day passed, the public alerts regarding Hurricane Irene became more urgent and more frequent – compulsory evacuation notices were issued for some parts of Manhattan – and by the final panel session a number of attendees had left (which seemed to include most of the hurricane specialists). I didn’t mind overly: having flown in the night before I was very jetlagged and I fear my appearance on that last panel wasn’t exactly sparkling. I recall spouting some incomplete thoughts on the seminal influence of numerical weather prediction and climate science on the development of successive generations of supercomputers, entirely missing the point of a question about the importance of supercomputers in climate informatics.
Earlier in the day we had several interesting presentations, including one which I’m going to describe in some detail because it raised a couple of important points about the structure of collaborative science. This was a talk by David Musicant of Carleton College, a computer scientist who has worked with atmospheric scientists who are attempting to analyze and categorize environmental aerosols: tiny liquid and solid particles suspended in the air. He described a custom-built instrument – an ATOFMS (aerosol time-of-flight mass spectrometer) – which essentially extracts tiny particles from the air and zaps them with a laser. The laser vapourises the particle and ionises the vapour, and then the time-of-flight mass spec does its work and produces a lot of numbers describing the charge/mass ratios of the resulting ions (which may be elemental, such as Na+ or O–, or compound, such as C16H10+). The instrument can analyse hundreds of particles each minute, and produce about 30,000 numbers for each particle.
One aim of this analysis is to identify the size, source, type, and properties of each particle, and thus to characterize the mixture of aerosols in the air. How, exactly, is one to do this, given mass spectrometry data? Well, that’s where the environmental scientists turned to the computer scientists for assistance, and that, from my point of view, is where the story gets interesting.
The most obvious way in which software can help with this problem is cluster analysis. You have thousands of mass spectra: each one is a point in a 30,000-dimensional space. By applying well-known statistical techniques, clusters of such points can be identified: each cluster representing a number of particles with similar composition. Soot from diesel engines might form one cluster, pollen might be another, sulphate aerosols a third, and so on. Cluster analysis works without preconceptions about the number, size, or location of clusters, so can not only determine the relative contribution of different aerosol sources, but can also find unexpected and even unknown components of the aerosol mix.
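To make the idea concrete, here is a toy sketch in Python. It is not the method used in Musicant's work or in any real ATOFMS pipeline: the spectra are synthetic, the peak positions and source names are invented, there are 50 bins rather than 30,000, and the simple "leader" clustering with a cosine-similarity threshold is just one of many possible techniques. What it does show is that the number of clusters need not be specified in advance.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def leader_cluster(spectra, threshold=0.8):
    """Put each spectrum in the most similar existing cluster, or start
    a new cluster if no centroid reaches the similarity threshold.
    The threshold value is an arbitrary choice for this toy example."""
    centroids, counts, labels = [], [], []
    for s in spectra:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(s, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(s))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            counts[best] += 1
            n = counts[best]
            # Update the running mean of the cluster centroid.
            centroids[best] = [c + (x - c) / n
                               for c, x in zip(centroids[best], s)]
            labels.append(best)
    return labels, centroids

def synthetic_spectrum(peaks, n_bins, rng, noise=0.05):
    """Toy spectrum: unit peaks at the given bins plus small noise."""
    return [(1.0 if i in peaks else 0.0) + rng.uniform(0, noise)
            for i in range(n_bins)]

rng = random.Random(0)
N_BINS = 50  # a real instrument would give tens of thousands of values

# Hypothetical particle sources with made-up peak positions.
sources = {
    "soot-like": {5, 12, 24},
    "salt-like": {23, 35, 39},
    "sulphate-like": {40, 45, 48},
}

spectra, truth = [], []
for name, peaks in sources.items():
    for _ in range(20):
        spectra.append(synthetic_spectrum(peaks, N_BINS, rng))
        truth.append(name)

# Shuffle so the algorithm does not see the sources in order.
order = list(range(len(spectra)))
rng.shuffle(order)
spectra = [spectra[i] for i in order]
truth = [truth[i] for i in order]

labels, centroids = leader_cluster(spectra)
print("clusters found:", len(centroids))
for label in sorted(set(labels)):
    members = {t for l, t in zip(labels, truth) if l == label}
    print(label, sorted(members))
```

In this contrived case the clusters line up neatly with the invented sources. Real spectra are far messier, and choosing the similarity measure and threshold is where the expertise comes in.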
This is where Musicant’s experience holds lessons which apply not only to other informatics projects but across all fields of science, whenever computational techniques are applied. He described the following problem:
- A scientist needs some software to be written, so consults a software expert, often a computer scientist, and a joint research project is formed.
- Often the software does not need to solve a new problem in computer science. Rather, it is the application of well-understood techniques – such as cluster analysis – developed many years ago.
- Applying well-understood techniques is not going to advance the career of the computer scientist. It’s not going to be particularly challenging or interesting from a CS point of view. It’s not going to result in publications in the CS literature. It might not even get funded.
- So the computer scientist must devise a way of viewing the research as a novel class of problem, whether or not that is the most productive way of addressing the scientific question at hand.
This is not meant, in any way, as a criticism of David Musicant’s fascinating work. He has indeed succeeded in finding new ways to look at problems such as cluster analysis, motivated by aspects of the aerosol study, and I recommend that anyone interested in aerosols look into his EDAM project. But it highlights an important gap in the funding and career structures of computational science. Very often, a research project needs some software to apply a well-known computational technique, which is outside the knowledge or experience of the researchers. As computation becomes more sophisticated, it is unreasonable to expect researchers in every field of science to be experts on the methods involved. But at the same time, it is unreasonable to expect researchers into methods to devote their time to rehearsing well-trodden ground. And it won’t always be the case that the ground is so well-trodden that it has become a neatly-packaged library for R, or Matlab, or NumPy.
What is the solution? I suspect that funding bodies and institutions need to start seeing some software development as a facilities or even utilities problem, like doing the departmental payroll or keeping the lights on. Yes, it’s highly skilled and sophisticated work, but so is running the university’s networking infrastructure, and that’s (mostly) not done by researchers. You have a firehose of data from your experiment, and think you might need some cluster analysis? Dial 9724 on your office phone, and talk to your institution’s science code monkeys.
The other particular insight I gained at this workshop was in a breakout session discussion led by Gavin Schmidt. We were talking about comparing climate models. In hindcast and reanalysis modes, or when comparing older forecasts with more recent data, different models have different strengths and weaknesses: one model might be especially good at matching observations of tropical precipitation, but less good when looking at overturning ocean circulations. Another might be great at the ocean circulation, but completely miss extreme drought events in Australia. A third might predict the droughts but get mid-latitude temperature oscillations wrong. (I’ve just made those examples up, but you get the idea.) These various strengths and weaknesses are routinely identified and studied by modelling experts (indeed, it’s one of the purposes of the CMIPs). Now, the sixty-four trillion dollar question is: what is the climate going to do in the future? Should we use our acquired expertise in comparing models and identifying relative strengths to say something along the lines of:
- “Well, we know that model A didn’t spot two out of five droughts in Central Asia in the twentieth century, but model B did, so going forward we should trust B more than A in that area”; or
- “Hmm, model C is the closest we’ve got to fitting the arctic sea-ice decline since 1980; it’s only overestimating by 50%, so let’s apply that 50% discount to model C’s forecast”; or
- “The hindcast skill of this inter-model mean is terrible, but if we miss out models P, Q, and R, then it’s much better. Let’s drop those models for our 21st-century forecast.”
Unfortunately, informatics theory tends to answer this question with a big fat “No”. The problem is this: each climate model explores a non-linear manifold in a very high-dimensional space (to the extent to which the model is correct, this approximates the manifold occupied by the actual climate). The model comparison ideas I sketch above are basically comparing different points in the sub-manifold representing past conditions, and making linear or quasi-linear estimates at intervening points. Informatics is really good at this sort of interpolation. However, what we really want to do is to extrapolate into a new domain, and the non-linearity of the manifold hurts us there. A model which has been really good in the past domain might be terrible in the future, and vice versa. In fact, it might be exactly the same aspect of a model preventing it from resolving some past phenomenon and allowing it to correctly forecast a future one.
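A deliberately contrived toy in Python shows the shape of the problem. Nothing here resembles a real climate model: the "true" curve, the two toy models, and the past and future ranges are all invented for illustration. Model A is a straight line fitted by least squares to the past data; model B has the right non-linearity but carries a constant bias.

```python
import math

def truth(x):
    """Invented 'true' behaviour: nearly linear for small x, strongly
    non-linear for large x."""
    return x + 0.2 * x ** 3

def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line y = slope*x + c."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def rmse(model, xs):
    """Root-mean-square error of a model against the toy truth."""
    return math.sqrt(sum((model(x) - truth(x)) ** 2 for x in xs) / len(xs))

# 'Past' domain 0..1, where we have observations; 'future' domain 2..3.
past = [i / 20 for i in range(21)]
future = [2 + i / 20 for i in range(21)]

slope, intercept = fit_line(past, [truth(x) for x in past])

def model_a(x):
    """Tuned to the past: a straight line through the observations."""
    return slope * x + intercept

def model_b(x):
    """Right shape, wrong level: correct non-linearity plus a fixed bias."""
    return truth(x) + 0.3

past_a, past_b = rmse(model_a, past), rmse(model_b, past)
future_a, future_b = rmse(model_a, future), rmse(model_b, future)

print(f"past RMSE:   A={past_a:.3f}  B={past_b:.3f}")
print(f"future RMSE: A={future_a:.3f}  B={future_b:.3f}")
```

Ranked by hindcast skill, A wins comfortably; in the new domain it loses badly. The example is rigged, of course, but it is rigged in the direction the argument warns about: skill inside the sampled region says little about behaviour outside it.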
I hope that explanation is clear. It was a very eye-opening discussion for several people there.
Finally, I should mention that the workshop at last gave me the opportunity to meet two of our GSoC 2011 students: Filipe Fernandes and Hannah Aizenman.