The Climate Code Foundation is a non-profit organisation that promotes the public understanding of climate science.

June, July, August

Last month I added code to vischeck.py so that it could more easily display the seasonal averages computed by ccc-gistemp and GISTEMP. We can now display temperature change for a particular season.

It is customary for meteorologists to define seasons in 4 blocks of 3 months. June, July, and August form the Southern Hemisphere cool season and the Northern Hemisphere warm season.

Here’s the latest comparison with GISTEMP for the June, July, August season:

tool/vischeck.py --download > jja.png --legend ccc-gistemp\|GISS --extract JJA result/mixedGLB.Ts.ERSST.GHCN.CL.PA.txt http://data.giss.nasa.gov/gistemp/tabledata_v3/GLB.Ts+dSST.txt
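
To make that concrete, here is a minimal sketch of forming a JJA seasonal anomaly from monthly anomalies. This is an illustration under my own assumptions, not the actual vischeck.py code (the GISTEMP tables carry precomputed seasonal columns such as JJA, which I assume --extract JJA picks out):

def seasonal_anomaly(monthly, season=(6, 7, 8)):
    """Average the monthly anomalies for the given season.

    `monthly` maps month number (1..12) to an anomaly in degrees C;
    a missing month is simply absent.  Returns None unless every
    month of the season is present.  A sketch only.
    """
    values = [monthly.get(m) for m in season]
    if any(v is None for v in values):
        return None
    return sum(values) / len(values)

# Hypothetical monthly anomalies, not real GISTEMP values:
print(seasonal_anomaly({6: 0.63, 7: 0.54, 8: 0.70}))  # 0.6233...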

June, July, August

For 2014 both ccc-gistemp and GISTEMP report an anomaly of 0.62 °C for the June, July, August season (using a base period of 1951 to 1980). Normally we're an exact match; occasionally the rounding shifts one way or the other and we're 0.01 °C out. For 2013 we're a whopping 0.03 °C different. Does anyone want to investigate?

Posted in ccc-gistemp | Leave a comment

ISTI has more data

When we’re using the ISTI dataset with ccc-gistemp, what advantage does it give us? The northern hemisphere is already well sampled, so it doesn’t give us much there. Does it do any better in the southern hemisphere?

This is a plot after figure 2 of Hansen and Lebedeff 1987. It shows, for each of the 80 boxes used in the analysis, the earliest year that has data:

isti

A word of caution for those comparing this to the actual figure 2 of Hansen and Lebedeff 1987. Their figure shows the date "when continuous coverage began" for each box, which I take to mean the date when continuous reporting began for any station within the box. The plot I give above is of data used in an analysis, and a box will include data from stations outside of the box (as per the 1200 km rule in Hansen and Lebedeff 1987); this is also why my boxes are clamped at 1880.

The top figure in each box is the earliest year of data for the ISTI MADQC dataset; the bottom figure is the earliest year of data for the GHCN-M QCU dataset (a year in brackets means that it is the earliest year of continuous reporting, but there are earlier fragments). The little figure in the bottom right corner of each box is the box number, using the same convention as figure 2 of Hansen and Lebedeff 1987. A box is blue when ISTI has earlier data (and hence more), and pink when GHCN-M has earlier data.
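
For anyone wanting to reproduce the bracketed figures, here is a minimal sketch, under my own assumptions, of how the two dates for a box could be derived from the set of years in which the box has data (the plotted figures may well be computed differently):

def earliest_years(years_with_data, last_year=2014):
    """Given the set of years in which a box has data, return
    (earliest_year, earliest_continuous_year), where "continuous"
    means every year from that point up to last_year has data.
    A sketch only; assumes the box has data for last_year itself.
    """
    years = set(years_with_data)
    earliest = min(years)
    continuous = last_year
    while continuous - 1 in years:
        continuous -= 1
    return earliest, continuous

# A box with fragments in the 1880s and continuous data from 1935:
print(earliest_years({1881, 1882} | set(range(1935, 2015))))  # (1881, 1935)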

ISTI's biggest win is box 65, where it has extended the data period back from 1950 to 1887 and a little before (almost the full period of the analysis). There are only a handful of stations contributing to this box, so it's just about sensible to plot them all on one plot. All the stations start reporting around 1950, except for one:

box65

That station is MP00006199, Plaisance (now renamed Sir Seewoosagur Ramgoolam International Airport; the renaming of airports is a history-in-miniature of colonialism: airports are built by the colonial powers, then renamed as the newly independent ex-colonies stamp their mark on them).

The equivalent plot for the ccc-gistemp analysis done with the GHCN-M QCU dataset has all of the stations (including 12961990000 Plaisance) starting in 1950 or later:

box65ghcn

Meteorological stations don't just spring up out of nowhere and all start reporting in the same year, 1951. The fact that many station records start in 1951 is an artefact of the collection process: there were deliberate attempts to recover and digitise existing records from 1951 to 1980 so that they could be used for normals (Peterson and Vose, 1997). This is one of the key benefits of the ISTI dataset: by bringing together data from diverse sources, we find, and can make use of, longer records.

So it seems likely that some of these other stations contributing to box 65 could have more data coaxed out of them. For this box the station of most interest would be WMO station 61996 (listed as 15761996000 Ile Nouvelle-Amsterdam in GHCN-M v3, and FS000061996 Martin-de-Viviès in ISTI Stage 3) because this station is actually within the bounds of the box, whereas the others are not. Sadly, this is a remote island that didn’t have any settlement at all until 1949, so we’re not going to suddenly find significantly more data for this station.

Box 58 is a case where ISTI has more data, but it’s not in a period that connects to the data starting in 1935. Plotting the contributing stations we see that the period of continuous reporting hasn’t actually changed:

box58

What’s changed is that we have two extra data fragments for a single station: FP000091938, Tahiti. Given the huge gaps between reporting periods, for an analysis like ccc-gistemp we would be better off just discarding those data. I’m sharpening my scalpels.

Perhaps it would be worth doing some data archaeology for Tahiti. Can it really be the case that reliable temperatures were only reported from 1935 onwards? The international airport opened in 1960, so the period of reporting isn't directly related to the airport opening. Perhaps there's a stack of yellowing paper forms wherever the French keep their archives.

Box 40 is the only box in the northern hemisphere to have its period of data extended by ISTI. The by-now-traditional plot of contributing stations shows that this is due to three stations (I've truncated the plot at 1945 to avoid showing the large number of stations that begin in 1950/1951):

box40

The contributions of BPXLT466819, Tulagi, and KRXLT605164, Ocean Island, are most welcome. The record for the third station, NRXLT092567, Nauru, is somewhat obscured, but when I plot it alone we see that it's just the sort of record that I've come to complain about:

nauru

A series of unrelated periods joined into a single record for no particularly good reason. The ISTI dataset is certainly good for studies of inhomogeneity, because it seems to create lots of inhomogeneities to study. I think that if we're going to continue to use ISTI for ccc-gistemp, I'll have to implement something like Rohde's scalpel.
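
To give an idea of what I mean by a scalpel, here is a minimal sketch of a gap-based cut: split a monthly record into separate fragments wherever the gap between consecutive values is too large. This is my own simplification for illustration, not the algorithm of Rohde et al 2013 (which also cuts on station moves and detected breakpoints):

def scalpel(record, max_gap_months=12):
    """Split a monthly record into fragments at large gaps.

    `record` is a dict mapping (year, month) -> value.  Whenever the
    gap between consecutive values exceeds max_gap_months, a new
    fragment is started.  A sketch only.
    """
    def month_index(key):
        year, month = key
        return year * 12 + (month - 1)

    fragments = []
    current = {}
    previous = None
    for key in sorted(record, key=month_index):
        if previous is not None and month_index(key) - month_index(previous) > max_gap_months:
            fragments.append(current)
            current = {}
        current[key] = record[key]
        previous = key
    if current:
        fragments.append(current)
    return fragments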

Posted in News | Leave a comment

What good is MADQC?

I previously blogged about running ccc-gistemp with the ISTI Stage 3 dataset, and I slipped into that blog post the fact that I had to QC (Quality Control) the data. The purpose of QC is to eliminate data that is obviously entirely spurious and not representative of the climate of the station in question.

Let’s look at station record CI000085469, Isla de Pascua (aka Rapa Nui, aka Easter Island), in the ISTI Stage 3 dataset:

IslaDePascua

Did this island in the middle of the Pacific really experience sub-zero monthly means prior to 1942? Or is it simply that somewhere along the arduous journey from paper records to our digital data, some bogus data (from some other station?) crept in? It's QC's job to eliminate data like that.

As mentioned in the earlier article, to do the QC I made a really simple QC routine which I called MADQC.

The pink in the plot above is MADQC at work: the ISTI record is plotted in pink, and the data that survives MADQC is plotted on top in blue, so the visible pink shows where MADQC removed data.

In a similar vein, no one really believes that station FG000081405 (listed as Rochambeau, but now known as Félix Éboué airport) experienced a monthly average well above 100 °C for a month in 1903:

Rochambeau

This is a transcription error, or something similar, and is eliminated by MADQC.

The next case is a little different. Station WAXLT556695, Walfisch Bay:

WalfischBay

The record between 1941 and 1951 is obviously not consistent with the other two fragments, but it does at least look like real temperature data, except for being about 40 degrees warmer than it should be for this location. A little inspiration suggests that this period has been recorded in degrees Fahrenheit (further supported by the fact that the annual variation is larger between 1941 and 1951). This is unfortunate, but it is potentially correctable. MADQC doesn't care about correcting it and just eliminates that period entirely.

This station, WAXLT556695, also illustrates a different problem with these records (also present in GHCN-M in various forms): no one really believes that the fragment of data around 1890 is in any way connected to the fragment from the 1960s and 1970s, separated as they are by many decades of no data. ccc-gistemp doesn't make any attempt to correct for large gaps, and in some sense would prefer to digest such data as two or more separate records (it can make use of records as long as they overlap with stations within 1200 km; they don't have to share the common reference period that CAM requires). This is the inspiration behind Rohde's scalpel (Rohde et al 2013). I haven't investigated, but should the number of large gaps in the ISTI dataset prove problematic, I may have to get my own scalpels out.

The final station I show is a case where MADQC doesn’t help. Station AYM00089034:

BelgranoII

This is listed in ISTI as Belgrano II, but the record is clearly a composite of records from nearby stations, only one of which is Belgrano II. There is an obvious inhomogeneity in both mean temperature and range. The merging algorithm that ISTI uses is entirely automatic, so it is inevitable that some mistakes are made. In terms of how ccc-gistemp could correct for this, maybe we need to consider merge history (it is currently ignored), use one of the alternative merge results, or use some more sophisticated form of homogenisation.

I didn’t select these stations at random, nor did I search through all the stations to find particularly illustrative examples. They were all found because, when I did my first analysis using the ISTI dataset, there were significant discrepancies visible at the hemispheric level. Hunting down those discrepancies led me to these stations, and then to the realisation that I needed to QC the data first. The usual data sources for ccc-gistemp are GHCN-M and SCAR READER, both of which have at least been QC'd at source (and in addition ccc-gistemp normally uses the same stop list as NASA GISTEMP), so before using ISTI data our own QC routine wasn't necessary.

Don’t analyse your data without QC. If you’d like to try your own analysis, give MADQC a try.

Posted in News | 3 Comments

ccc-gistemp and ISTI

This post is also published at the International Surface Temperatures Initiative blog.

ccc-gistemp is Climate Code Foundation‘s rewrite of the NASA GISS Surface Temperature Analysis GISTEMP. It produces exactly the same result, but is written in clear Python.

I’ve recently modified ccc-gistemp so that it can use the dataset recently released by the International Surface Temperature Initiative. Normally ccc-gistemp uses GHCN-M, but the ISTI dataset is much larger. Since ISTI publish the Stage 3 dataset in the same format as GHCN-M v3, the required changes were relatively minor; Climate Code Foundation appreciates the fact that ISTI is published in several formats, including GHCN-M v3.

The ISTI dataset is not quality controlled, so, after re-reading section 3.3 of Lawrimore et al 2011, I implemented an extremely simple quality control scheme, MADQC. In MADQC a data value is rejected if its distance from the median (for its station’s named month) exceeds 5 times the median absolute deviation (MAD, hence MADQC); any series with fewer than 20 values (for each named month) is rejected.
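
In outline, and treating each named month of each station separately, MADQC looks something like the following sketch (the idea rather than the exact code in my tool):

import statistics

def madqc_named_month(values, threshold=5.0, min_values=20):
    """Quality-control the values for one named month of one station
    (for example, all of that station's Januaries).  Returns the
    values that survive, or [] if there are too few values to judge.
    A sketch of the idea, not necessarily the exact MADQC code.
    """
    if len(values) < min_values:
        return []
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # No spread to judge against; keeping everything is my own choice here.
        return list(values)
    return [v for v in values if abs(v - med) <= threshold * mad]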

So far I’ve found MADQC to be reasonable at rejecting the grossest non-climatic errors.

Let’s compare the ccc-gistemp analysis using the ISTI Stage 3 dataset with the analysis using the GHCN-M QCU dataset. Here is the analysis for each hemisphere:

ccc-gistemp analysis, southern hemisphere

ccc-gistemp analysis, northern hemisphere

For both hemispheres the agreement is generally good and certainly within the published error bounds.

Zooming in on the recent period:

ccc-gistemp analysis, recent period, southern hemisphere

ccc-gistemp analysis, recent period, northern hemisphere

Now we can see that the agreement in the northern hemisphere is excellent. In the southern hemisphere the agreement is very good; the trend is slightly higher for the ISTI dataset.

The additional data that ISTI has gathered is most welcome, and this analysis shows that the warming trend in both hemispheres was not due to choosing a particular set of stations for GHCN-M. The much more comprehensive station network of ISTI shows the same trends.

Posted in News | Leave a comment

ZONTEM is simpler, clearly

How simple can a temperature analysis be?

The original inspiration for Climate Code Foundation was ccc-gistemp, our pro-bono rewrite of NASA GISTEMP: software that shows global historical temperature change.

We wanted the average person on the Clapham omnibus to be able to use it, inspect it, reason about it, and change it. ccc-gistemp successfully reproduces the NASA GISTEMP analysis (to within a few millikelvin for any particular monthly value) in a few thousand lines of Python code.

A few thousand lines is still a lot of code. There are still a few corners of ccc-gistemp that I haven’t fully looked into. Can we make something simpler and smaller that does more or less the same job? Obviously we can’t expect to still use exactly the same algorithm as NASA GISTEMP, nor would we want to, because the exact details of which Arctic locations use SST anomalies and which use LSAT anomalies are just not very important (for estimating global temperature change). It can be distracting to get bogged down in detail.

ZONTEM attempts to discard all constraints and make an analysis that is as simple as possible. The input data is monthly temperature records from GHCN-M. The Earth is divided into 20 latitudinal zones. The input records are distributed into the zones (by choosing the zone according to the latitude of the station). The records are then combined in two steps: first combining all the stations in a zone into a zonal record; then combining all zonal records into a single global record. The global record is converted into monthly anomalies, and then averaged into yearly anomalies. The zones are chosen to be equal area.
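
Equal-area latitudinal zones have boundaries that are equally spaced in the sine of the latitude, so allocating a station to a zone is almost a one-liner. A minimal sketch of the idea (the real zontem.py may differ in detail):

import math

def zone_of(latitude, n_zones=20):
    """Return the index (0 = southernmost) of the equal-area
    latitudinal zone containing the given latitude in degrees.
    Equal-area bands are equally spaced in sin(latitude)."""
    s = math.sin(math.radians(latitude))   # in [-1, 1]
    zone = int((s + 1.0) / 2.0 * n_zones)  # scale to [0, n_zones]
    return min(zone, n_zones - 1)          # put latitude +90 into the top zone

# The equator sits on the boundary between zones 9 and 10:
print(zone_of(-0.01), zone_of(+0.01))      # 9 10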

This is simpler in so many ways: only one source of input; no ad hoc fixing or rejection of records; no correction for UHI or other inhomogeneity; no use of Sea Surface Temperatures; only a single global result (no gridded output, no separate hemispherical series).

The result is about 600 lines of Python, split into three roughly equally sized pieces: ghcn.py, series.py, and zontem.py. ghcn.py understands the GHCN-M v3 file format for data and metadata (this is vital, but scientifically uninteresting); series.py is borrowed from the ccc-gistemp project and consists of the detailed routines to combine monthly records and to convert to anomalies; zontem.py is the main driver: it allocates stations to zones and picks a particular order in which to combine station records.

A good chunk of zontem.py is concerned with finding files and parsing command line arguments. The actual interesting bit, the core of the ZONTEM algorithm, is expressed as a very short Python function:

def zontem(input, n_zones):
    zones = split(input, n_zones)
    zonal_average = map(combine_stations, zones)
    global_average = combine_stations(zonal_average)
    global_annual_average = annual_anomaly(global_average.series)
    zonal_annual_average = [
      annual_anomaly(zonal.series) for zonal in zonal_average]
    return global_annual_average, zonal_annual_average

This is a useful 7-line summary of the algorithm, even though it glosses over some essential details (how are stations split into zones? how are station records combined?). The details are of course found in the remaining source code.
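
To give a flavour of the second question, here is a much-simplified sketch of combining two overlapping monthly records: shift the new record so that it agrees, on average, with the existing combination over the months they share, then average the two. The real series.py uses a weighted and more careful variant of this idea, so treat the sketch as an illustration only:

def combine_two(combined, new, min_overlap=12):
    """Merge the monthly record `new` into `combined` (both dicts
    mapping (year, month) -> temperature).  The new record is shifted
    by its mean difference from the combined record over the months
    they share, then the two are averaged month by month.  A
    simplified illustration, not the actual series.py algorithm.
    """
    overlap = [k for k in new if k in combined]
    if len(overlap) < min_overlap:
        return combined            # too little overlap to calibrate against
    offset = sum(combined[k] - new[k] for k in overlap) / len(overlap)
    result = dict(combined)
    for k, value in new.items():
        shifted = value + offset
        result[k] = (result[k] + shifted) / 2 if k in result else shifted
    return result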

I like to think of ZONTEM as a napkin explanation of how a simple way to estimate global historical temperature change works. I can imagine describing the whole thing over a couple of pints in the pub. ZONTEM has probably been simplified to the point where it is no longer useful to Science. But that does not mean it is not useful for Science Communication. In just the same way we might use a simplified sketch of a cell or an atom to explain how a real cell or atom works, ZONTEM is a sketch of a program to explain how a real (“better”) analysis works.

If ZONTEM seems simplistic because it doesn’t increase coverage by using ocean data, well, that’s because GISTEMP and HadCRUT4 do that. If it seems simplistic because it doesn’t produce gridded maps at monthly resolution, well, that’s because Berkeley Earth (and the others) do that. Every way in which ZONTEM has been made simpler is probably a way in which NASA GISTEMP, or a similar analysis that already exists, gets a more accurate result.

zontemglobe

Posted in News | Leave a comment

Catching up

I’ve been spring cleaning.

For too long I have neglected ccc-gistemp (our clear rewrite of GISTEMP). For a while now it has not been possible to run it. The problems were mostly to do with finding the right Sea Surface Temperature (SST) file. The old file was called SBBX.HadR2 (a combination of Hadley ISST and Reynolds' Optimal Interpolation v2). GISS withdrew this file in favour of SBBX.ERSST, which is based on Smith et al's 2008 Extended Reconstructed Sea Surface Temperature (ERSST).

In the final stages (Step 5) of ccc-gistemp, SSTs from the ocean file are combined with temperature anomalies from land-based meteorological stations to produce zonal means that are then averaged into hemispherical and global means. The choice of which dataset to use for SSTs is not completely straightforward: there are different groups with different ways of assimilating all the available observations. Hansen et al's 2010 paper "Global Surface Temperature Change" does a good job of comparing some of the available options.
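
As an aside on the mechanics of that final averaging: a latitude band's share of the Earth's surface is proportional to the difference in the sines of its bounding latitudes, so combining zonal series into hemispherical or global means is an area-weighted average. A minimal sketch of that idea (the real Step 5 code handles missing months and subboxes in considerably more detail):

import math

def band_weight(lat_south, lat_north):
    """Fraction of the Earth's surface between two latitudes (degrees)."""
    return (math.sin(math.radians(lat_north)) -
            math.sin(math.radians(lat_south))) / 2.0

def area_weighted_mean(zonal_anomalies, zone_bounds):
    """Area-weighted mean of zonal anomalies.  `zone_bounds` is a list
    of (lat_south, lat_north) pairs, one per zone; a zone with no data
    is given as None and simply left out of the average.  A sketch
    only, not the actual Step 5 code.
    """
    total = weight = 0.0
    for anomaly, (south, north) in zip(zonal_anomalies, zone_bounds):
        if anomaly is None:
            continue
        w = band_weight(south, north)
        total += anomaly * w
        weight += w
    return total / weight if weight else None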

So now we’ve caught up with GISS and can once again do an analysis of combined land and ocean temperatures:

Global Land-Ocean Temperature Index

I’ve also moved the code to GitHub, which is a much nicer place; you should move there too.

While spring cleaning I noticed that ccc-gistemp had accumulated a few tools and bits of code that had less to do with ccc-gistemp and more to do with the Global Historical Climatology Network. I’ve moved those into their own ghcntool repository.

Posted in News | Tagged , , | Leave a comment

River level data must be open

My home is flooded, for the second time in a month. The mighty Thames is reclaiming its flood-plains, and making humans – especially the UK government’s Environment Agency – look puny and irrelevant. As I wade to and fro, putting sandbags around the doors, carrying valuables upstairs, and adding bricks to the stacks that prop up the heirloom piano, I occasionally check the river level data at the Agency website, and try to estimate how high the water will rise, and when.

There are thousands of river monitoring stations across the UK, recording water levels every few minutes. The Agency publishes the resulting data on its website, in pages like this. For each station it shows a graph of the level over the last 24 hours (actually, the 24 hours up to the last reported data: my local station stopped reporting three days ago, presumably overwhelmed by the water), and has some running text giving the current level in metres above a local datum. There’s a small amount of station metadata, and that’s all. No older data, and no tabular data. I can’t:

  • See the levels over the course of a previous flood;
  • Measure how quickly the river typically rises, or how long it typically takes to go down;
  • Compare today’s flood to that four weeks ago (or those in 2011 or 2003);
  • Easily navigate to the data for neighbouring stations up and down river;
  • Get a chart showing the river level, or river level anomalies, along the length of the Thames;
  • Get a chart comparing that longitudinal view of the flood with the situation at any previous time;
  • Make a maps mash-up showing river level anomalies across the Thames catchment;
  • Make a personalised chart by adding my own observations, or critical values (“electrics cut out”, “front garden floods”, “water comes into house”, …);
  • Make a crowd-sourced flooding community site combining river level data, maps, pictures, observations, and advice (“sandbags are now available at the village hall”);
  • Make a mash-up combining river level data with precipitation records;
  • Make a flood forecasting tool by combining historical river level, ground-water, and precipitation records with precipitation forecasts.

Most of these things (not the last!) would be a small matter of programming, if the data were available. The Thames Valley is teeming with programmers who would be interested in bashing together a quick web app; or taking part in a larger open-source project to deliver more detailed, more accessible, and more useful flood data. But if we want to do any of those things, we have to pay a license fee to access the data, and the license would apparently then require us to get pre-approval from the Environment Agency before releasing any “product”. All this for data which is gathered, curated, and managed by a part of the UK government, nominally for the benefit of all.

Admittedly I couldn’t do any of those things this week anyway – too many boxes to carry, too much furniture to prop up. But surely this is a prime example of the need for open data.

Posted in News | 2 Comments

Ten reasons you must publish your code

Last week I gave a short talk to the SoundSoftware Workshop 2013. SoundSoftware is a group of researchers in the field of music and acoustics, based at Queen Mary University of London, who promote the use of sustainable and reusable software and data. My talk was entitled “Ten reasons you must publish your code”, and SoundSoftware have now published the video. It was intended to stimulate debate, and the time was limited, so it’s long on polemic and short on evidence, although there is plenty of evidence out there to support almost everything I said. The short list of reasons is as follows:

  1. Review: to improve your chances of passing review, publish your code;
  2. Reproducibility: if you want others to be able to reproduce your results, publish your code;
  3. Citations: if you want to boost your citation counts (or altmetrics), publish your code;
  4. Collaboration: to find new collaborators and teams, to cross-fertilise with new areas, publish your code;
  5. Skills: to boost your software skills, publish your code;
  6. Career: to improve your resume and your job prospects, publish your code;
  7. Reputation: to avoid getting egg on your face, or worse, publish your code;
  8. Policies: to get a job, to publish in a particular journal, to secure funding, publish your code;
  9. Preparation: to prepare for the great future of web science, publish your code;
  10. Science! To do science rather than alchemy, publish your code.

Posted in News | Leave a comment

Chasing the Dream

Interspersed with my Picking Winners series, here’s another post about the vision thing, a broad outline of how we might get there from here.

A month ago I sketched out my vision of the future of science. Of course, I didn’t just dream that up by myself: large parts of it are stolen shamelessly from much smarter people like


FX: screeching brakes

Hold it right there! You’ve never read “Reinventing Discovery”?! OK, go and buy it, right now, read it, and come back. I’ll wait.

Back? Let’s go on…


  • Michael Nielsen …;
  • Greg Wilson, whose Software Carpentry work is finally breaking through, bringing fundamental software development skills to the next generation of scientists;
  • Peter Murray-Rust, who has been banging the drum for open access and for content-mining for many years: it’s good to see this finally getting regulatory traction;
  • Victoria Stodden, who is putting her ground-breaking research on replication of computational science into action in the cloud at RunMyCode;
  • Fernando Pérez, whose compelling IPython system is transforming the way people think about interactive scientific computation;
  • and many, many others.

Although everyone in open science is coming at the subject from their own direction, with their own focus and interests, our ideas are definitely cohering into a clear picture of the future of science software:

  • Code written for published research will be:
    • open-source;
    • easy to download;
    • easy to fork;
    • and easy to track.
  • It will be:
    • hosted and curated in the cloud;
    • at any of several independent hosting sites;
    • with version-tracking and defect-tracking web services;
    • and automatically-generated permanent version IDs for citation purposes;
    • all accessible via open web APIs.
  • It will have:
    • automatic authorship tracking;
    • linking many different web-based services;
    • producing online profiles and reputations;
    • which will naturally be used in scientific career records.
  • It will often depend on:
    • several third-party components;
    • automatically giving credit to their authors;
    • with clear records of dependency configuration;
    • and thus remain reproducible even when those components change.
  • The code will be written:
    • in many different programming languages;
    • on remote machines or via browser-based editors;
    • usually by scientists who are trained in software development.
  • It will be runnable:
    • by any interested person;
    • on servers in the cloud;
    • through an easy-to-use web interface;
    • on the original or modified data sets;
    • (possibly on payment of reasonable virtualization fees).
  • The resulting data – files, charts, or other results – will be open:
    • freely available;
    • for any use;
    • and for redistribution;
    • subject at most to attribution and/or share-alike requirements.
  • The data will also be:
    • retained permanently;
    • with a citable identifier;
    • and recorded provenance: versions of source code and data;
    • and automatically recorded credit to the person who ran the code.
  • Scientific publications will be:
    • often written alongside the code which supports them;
    • often generated online, fully automatically, from the inter-woven source and text;
    • usually open;
    • subject to public discussion in open web forums;
    • widely indexed, inter-linked, and woven into an open web of scientific knowledge.

I could go on (I haven’t even raised the subject of crowd-sourced data), but that’s the core of the vision. The question then is: how do we get there from here? As discussed in the earlier post, many components already exist (although some are still in their infancy):

  • GitHub is the world’s current favourite online code hosting service, with version control, defect tracking, and related services;
  • iPython Notebook is a great system for interactive web-based coding, including visualisation and text markup;
  • RunMyCode is a landmark site for cloud-based curation and running of research computation;
  • Zotero is a web-based literature search, research, and bibliographic tool;
  • FigShare is a web hosting service for datasets, figures, and all kinds of research outputs;
  • StarCluster is a fine system for managing computation on Amazon’s EC2 cloud service;
  • Software Carpentry is a training organization to provide basic software development skills to researchers.

It’s worth noting that some server-side parts of this ecosystem (GitHub, FigShare, RunMyCode?, EC2) are not open source, which adds some risk to the vision. A truly open science future won’t depend on closed-source systems: institutions and organisations will be able to deploy their own servers, and services and data will be robust against corporate failure or change. Happily these proprietary components (with the possible exception of RunMyCode) all have open APIs, which allows for the possibility of future open implementations.

It’s also worth noting in passing that several of these components (IPython Notebook, RunMyCode, Zotero, and Software Carpentry) are partly funded by the Alfred P. Sloan Foundation, a shadowy puppet-master (I mean: a non-profit philanthropic institution) which obviously shares this open science software vision. Several also have links with, or funding from, the Mozilla Foundation, which now has a grant from the Sloan Foundation to create a “Webmaking Science Lab”, apparently intending to drive this vision forward. [Full disclosure: I have applied for the post of lab director].

So, a lot of the vision already exists, in disconnected parts. The work remaining to be done includes:

  • Integration. RunMyCode with an IPython Notebook interface. EC2 configuration using a lickable web UI, generating StarCluster scripts. IPython Notebook servers versioning code at GitHub (or SourceForge, or GoogleCode). A GitHub button to generate a DOI. Automatic posting of result datasets to FigShare. And so on.
  • Inter-operation. Simple libraries for popular languages to access the APIs of all these services. An example is rfigshare, FigShare for R; but where’s (say) PyFigShare? Or FortranGitHub?
  • Independence. Web services shouldn’t care what programming languages and systems are used by researchers. They should make it easy to do popular things, but possible to do almost anything. They should be able to manage and interact with any virtual machine or cluster that a researcher can configure.
  • Identity. How does a researcher log into a service? Single sign-on using Persona? How does she claim credit for a piece of work? How does she combine different sources of credit? Can we do something with Open Badges?
  • Import (well, it begins with ‘I’; really I mean Configuration Management). Researchers need to be able to pull in specific versioned platforms or code modules from GitHub or elsewhere, and have those same versions work in perpetuity. This is a tangled web (a form of DLL hell) so we also need a system for recommending and managing known-working combinations. Like EPD or ActivePerl, but fully open.

This list is Incomplete. We also need to build networks and groups of related projects. A lot of the required development will take place anyway, and coordinating different open-source projects, like managing individual developers, can resemble cat-herding. But there is a role for joint development, for workshops, summits, and informal gatherings, for code sprints, hackjams, and summer projects. Even if we are not marching in step, we should all be heading in the same general direction, and avoid duplicated effort. We should build interest and enthusiasm in the open-source community, and work together with major cross-project organisations such as the Google Open Source Programs Office, O’Reilly Media, and (again) the Mozilla Foundation.

Finally we also need buy-in from researchers and institutions. Planck’s second law has it that science progresses one funeral at a time. The experience of Software Carpentry suggests that it can take years for even basic ideas such as training to gain traction. So as the infrastructure takes shape, we should also be talking to institutions, journals, societies, and conferences, to publicise it and encourage its take-up. This vision thing isn’t top-down: it’s driven by scientists’ own perceptions of their needs, what we think will make us more productive, increase the recognition of our work, and improve science. If we’re right, then more scientists will join us. If we’re wrong, we don’t deserve to succeed.

Posted in News | Tagged , , , , , , , , , , , , | 5 Comments

Picking Winners: Rule 3 – Find a Strong Community

This is the third of a series of posts intended to help scientists take part in the open science revolution by pointing them towards effective tools, approaches, and technologies.

The previous advice is to use free software, and to take the advocacy with a pinch of salt.

3. Find a Strong Community

New projects are often very exciting, and might show amazing promise. And using a brand-new free software project is far less dangerous than taking up a beta version of some proprietary tool. A proprietary tool might never make it to a formal release, and if the vendor decides not to follow through on early development then it might even become impossible to legally use it. That can’t happen to free software: the license guarantees that you will always be able to use, adapt, copy, and distribute it. However, you usually want more than that.

Free software is not exempt from a general rule of thumb in the industry: there is a high "infant mortality rate", and a majority of projects never make it to "version 2.0". A shiny new tool is likely to become dull and unloved, and may never get the vital upgrade to a new language version, or the module for compatibility with a new protocol, or the latest operating system. It is paradoxical but true: the useful life expectancy of a software product increases as the product matures.

Growing into maturity, a piece of software acquires a community of developers and users, with mailing lists, IRC channels, blogs, twitter feeds, and meetings – everything from informal online gatherings through pub meets up to international conferences. This community forms a complex ecosystem, and it’s this network of people, systems, companies, and technologies that gives longevity to the software:

  • A diverse base of developers makes a project robust in event of loss of a key person, and helps to guard against unwise development choices.
  • A critical mass of users ensures a rich flow of new requirements, serves as the source of new developers, and may often provide some financial and material support for development (from server hardware to conference sponsorship).
  • An active community builds interfaces, plugins, and modules to connect a good project to a wide diversity of other systems, and these connections in turn allow the user base to expand.

This rich context is the sign of a mature software system, and until it has formed – which may take years from a tool’s first public announcement or release – you are taking a risk in using a project. Sooner or later you will want a new feature, or a bug fix, or compatibility with a new operating system or interface, or simply guidance through some obscure corner of the code, and at those times you will be glad of the community of fellow users. Without it, you will have to find your own way. Choosing free software ensures you are always free to do that: you can write a serial driver controlling your new lab instrument, or a visualisation tool to generate PDF charts directly from your analysis results. But if you are one of hundreds or thousands using the same tool, the chances are that someone will have done that work already. Or at least that you can find fellow-travellers to develop the code with you.

Free software communities are often very strong, and may survive for several decades. But they can atrophy and die. Potentially fatal problems include:

  • Stalled development: if the stream of new releases dries up, users will start to drift away to competing tools which provide the new functionality they need. This sort of project stall can be due to various causes: key people moving on, loss of corporate involvement, developmental over-reach or a defect mire, or even office politics and personality clashes.
  • Incompatible releases: If new releases break existing applications, the user base will become fragmented (as each user continues to use whichever old release works for them). Potential new users see this fragmentation and are either confused or simply put off.
  • Unwise development direction: a core development team focused on their own ideas, to the exclusion of the community’s requirements, will gradually lose users to other projects. This can result from core developers using the development of the system itself for research or experimental purposes. For instance, the developers of a programming language implementation might well be interested in a cool idea about programming languages (e.g. a new concurrency model, or a new syntactic possibility) and use their language to explore that direction. But the users mostly don’t want a Cool New Language; they want the Same Old Language, but with new libraries, a faster compiler, better debugger, etc.

So these are all things to consider when checking for a healthy community: are there frequent new releases of software? Are they backward-compatible, and is that a priority of the developers? Is the process for steering development flexible and responsive to community input?

Open-source software communities are great for productive conversations with colleagues and fellow researchers. Because they bring together users with a shared interest in the software, but who might work in different fields, you may well find grounds, not just for collective software development, but for research directions, and even collaborations. So once you have found a strong and healthy community, be sure to get involved, engage in discussions, and go to meetings. It is often highly rewarding.

Posted in News | Tagged | 1 Comment