The Climate Code Foundation is a non-profit organisation to promote the public understanding of climate science.

ZONTEM is simpler, clearly

How simple can a temperature analysis be?

The original inspiration for the Climate Code Foundation was ccc-gistemp, our pro-bono rewrite of NASA GISTEMP: software that shows global historical temperature change.

We wanted the average person on the Clapham omnibus to be able to use it, inspect it, reason about it, and change it. ccc-gistemp successfully reproduces the NASA GISTEMP analysis (to within a few millikelvin for any particular monthly value) in a few thousand lines of Python code.

A few thousand lines is still a lot of code. There are still a few corners of ccc-gistemp that I haven’t fully looked into. Can we make something simpler and smaller that does more or less the same job? Obviously we can’t still expect to use exactly the same algorithm as NASA GISTEMP, and nor would we want to, because the exact details of which arctic locations use SST anomalies and which use LSAT anomalies are just not very important (for estimating global temperature change). It can be distracting to get bogged down in detail.

ZONTEM attempts to discard all constraints and make an analysis that is as simple as possible. The input data is monthly temperature records from GHCN-M. The Earth is divided into 20 latitudinal zones. The input records are distributed into the zones (by choosing the zone according to the latitude of the station). The records are then combined in two steps: first combining all the stations in a zone into a zonal record; then combining all zonal records into a single global record. The global record is converted into monthly anomalies, and then averaged into yearly anomalies. The zones are chosen to be equal area.
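The equal-area property needs only a little spherical geometry: the area of the Earth’s surface between two latitudes is proportional to the difference of the sines of those latitudes, so zone boundaries evenly spaced in sin(latitude) give equal-area bands. A minimal sketch of the idea in Python (illustrative names, not the actual ZONTEM source):

```python
import math

def zone_of(latitude, n_zones=20):
    """Assign a station latitude (in degrees) to an equal-area zone.

    Bands bounded by latitudes evenly spaced in sin(latitude) have
    equal area, so we index by the sine of the station's latitude."""
    s = math.sin(math.radians(latitude))  # in [-1, 1]
    i = int((s + 1.0) / 2.0 * n_zones)    # in [0, n_zones]
    return min(i, n_zones - 1)            # fold latitude +90 into the top zone

def zone_boundaries(n_zones=20):
    """Boundary latitudes (in degrees) of the equal-area zones."""
    return [math.degrees(math.asin(2.0 * i / n_zones - 1.0))
            for i in range(n_zones + 1)]
```

With 20 zones, the bands next to the equator are only about 5.7° of latitude tall, while the polar bands stretch over 25° of latitude, reflecting how little surface area lies near the poles.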

This is simpler in so many ways: only one source of input; no ad hoc fixing or rejection of records; no correction for UHI or other inhomogeneity; no use of Sea Surface Temperatures; only a single global result (no gridded output, no separate hemispherical series).

The result is about 600 lines of Python, split into three roughly equally sized pieces: one understands the GHCN-M v3 file format for data and metadata (vital, but scientifically uninteresting); one is borrowed from the ccc-gistemp project and consists of the detailed routines to combine monthly records and to convert them to anomalies; and one is the main driver algorithm, which allocates stations to zones and picks a particular order in which to combine station records.

A good chunk of the driver is concerned with finding files and parsing command-line arguments. The actual interesting bit, the core of the ZONTEM algorithm, is expressed as a very short Python function:

def zontem(input, n_zones):
    zones = split(input, n_zones)
    zonal_average = list(map(combine_stations, zones))
    global_average = combine_stations(zonal_average)
    global_annual_average = annual_anomaly(global_average.series)
    zonal_annual_average = [
      annual_anomaly(zonal.series) for zonal in zonal_average]
    return global_annual_average, zonal_annual_average

This is a useful seven-line summary of the algorithm, even though it glosses over some essential details (how are stations split into zones? how are station records combined?). The details are of course found in the remaining source code.
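As a flavour of those details, the anomaly step can be reduced to its essentials: compute a mean for each calendar month across the whole record, subtract it from each monthly value, and average each year’s twelve anomalies. This sketch is illustrative only; the real routines (borrowed from ccc-gistemp) handle missing months and use a fixed base period:

```python
def annual_anomaly(monthly):
    """Convert a monthly series (12 values per year, year-major order)
    into a series of annual anomalies.

    Sketch only: assumes a complete record with no missing months."""
    years = len(monthly) // 12
    # Mean of each calendar month (all Januaries, all Februaries, ...).
    month_mean = [sum(monthly[m::12]) / years for m in range(12)]
    annual = []
    for y in range(years):
        anomalies = [monthly[y * 12 + m] - month_mean[m] for m in range(12)]
        annual.append(sum(anomalies) / 12.0)
    return annual
```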

I like to think of ZONTEM as a napkin explanation of how a simple way to estimate global historical temperature change works. I can imagine describing the whole thing over a couple of pints in the pub. ZONTEM has probably been simplified to the point where it is no longer useful to Science. But that does not mean it is not useful for Science Communication. In just the same way we might use a simplified sketch of a cell or an atom to explain how a real cell or atom works, ZONTEM is a sketch of a program to explain how a real (“better”) analysis works.

If ZONTEM seems simplistic because it doesn’t increase coverage by using ocean data, well, that’s because GISTEMP and HadCRUT4 do that. If it seems simplistic because it doesn’t produce gridded maps at monthly resolution, well, that’s because Berkeley Earth (and the others) do that. Every way in which ZONTEM has been made simpler is probably a way in which NASA GISTEMP, or a similar analysis that already exists, gets a more accurate result.



Catching up

I’ve been spring cleaning.

For too long I have neglected ccc-gistemp (our clear rewrite of GISTEMP). For a while now it has not been possible to run it; the problems were mostly to do with finding the right Sea Surface Temperature (SST) file. The old file was called SBBX.HadR2 (a combination of Hadley ISST and Reynolds Optimum Interpolation v2). GISS withdrew this file in favour of SBBX.ERSST, which is Smith et al.’s 2008 Extended Reconstruction.

In the final stages (Step 5) of ccc-gistemp, SSTs from the ocean file are combined with temperature anomalies from land-based meteorological stations to produce zonal means that are then averaged into hemispherical and global means. The choice of which dataset to use for SSTs is not completely straightforward: there are different groups with different ways to assimilate all the available observations. Hansen et al.’s 2010 paper “Global Surface Temperature Change” does a good job of comparing some of the available options.
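The final averaging is conceptually simple: each zonal mean contributes in proportion to its band’s share of the Earth’s surface, which is proportional to the difference of the sines of its bounding latitudes. A sketch of that idea (illustrative names, not the actual ccc-gistemp Step 5 code):

```python
import math

def band_weight(lat_south, lat_north):
    """Fraction of the sphere's surface between two latitudes."""
    return (math.sin(math.radians(lat_north)) -
            math.sin(math.radians(lat_south))) / 2.0

def area_weighted_mean(zonal_anomalies, bands):
    """Combine zonal anomalies into one mean, weighting each band
    (given as a (lat_south, lat_north) pair) by its area."""
    weights = [band_weight(s, n) for s, n in bands]
    total = sum(w * a for w, a in zip(weights, zonal_anomalies))
    return total / sum(weights)
```

For example, hemispheric anomalies of +0.2 and +0.6 average to +0.4, because each hemisphere covers exactly half the surface.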

So now we’ve caught up with GISS and can once again do an analysis of combined land and ocean temperatures:

Global Land-Ocean Temperature Index

I’ve also moved the code to GitHub, which is a much nicer place, and you should move there too.

While spring cleaning I noticed that ccc-gistemp had accumulated a few tools and bits of code that had less to do with ccc-gistemp and more to do with the Global Historical Climatology Network. I’ve moved those into their own ghcntool repository.


River level data must be open

My home is flooded, for the second time in a month. The mighty Thames is reclaiming its flood-plains, and making humans – especially the UK government’s Environment Agency – look puny and irrelevant. As I wade to and fro, putting sandbags around the doors, carrying valuables upstairs, and adding bricks to the stacks that prop up the heirloom piano, I occasionally check the river level data at the Agency website, and try to estimate how high the water will rise, and when.

There are thousands of river monitoring stations across the UK, recording water levels every few minutes. The Agency publishes the resulting data on its website, in pages like this. For each station it shows a graph of the level over the last 24 hours (actually, the 24 hours up to the last reported data: my local station stopped reporting three days ago, presumably overwhelmed by the water), and has some running text giving the current level in metres above a local datum. There’s a small amount of station metadata, and that’s all. No older data, and no tabular data. I can’t:

  • See the levels over the course of a previous flood;
  • Measure how quickly the river typically rises, or how long it typically takes to go down;
  • Compare today’s flood to that four weeks ago (or those in 2011 or 2003);
  • Easily navigate to the data for neighbouring stations up and down river;
  • Get a chart showing the river level, or river level anomalies, along the length of the Thames;
  • Get a chart comparing that longitudinal view of the flood with the situation at any previous time;
  • Make a maps mash-up showing river level anomalies across the Thames catchment;
  • Make a personalised chart by adding my own observations, or critical values (“electrics cut out”, “front garden floods”, “water comes into house”, …);
  • Make a crowd-sourced flooding community site combining river level data, maps, pictures, observations, and advice (“sandbags are now available at the village hall”);
  • Make a mash-up combining river level data with precipitation records;
  • Make a flood forecasting tool by combining historical river level, ground-water, and precipitation records with precipitation forecasts.

Most of these things (not the last!) would be a small matter of programming, if the data were available. The Thames Valley is teeming with programmers who would be interested in bashing together a quick web app; or taking part in a larger open-source project to deliver more detailed, more accessible, and more useful flood data. But if we want to do any of those things, we have to pay a license fee to access the data, and the license would apparently then require us to get pre-approval from the Environment Agency before releasing any “product”. All this for data which is gathered, curated, and managed by a part of the UK government, nominally for the benefit of all.

Admittedly I couldn’t do any of those things this week anyway – too many boxes to carry, too much furniture to prop up. But surely this is a prime example of the need for open data.


Ten reasons you must publish your code

Last week I gave a short talk to the SoundSoftware Workshop 2013. SoundSoftware is a group of researchers in the field of music and acoustics, based at Queen Mary, University of London, who promote the use of sustainable and reusable software and data. My talk was entitled “Ten reasons you must publish your code”, and SoundSoftware have now published the video. It was intended to stimulate debate, and the time was limited, so it’s long on polemic and short on evidence, although there is plenty of evidence out there to support almost everything I said. The short list of reasons is as follows:

  1. Review: to improve your chances of passing review, publish your code;
  2. Reproducibility: if you want others to be able to reproduce your results, publish your code;
  3. Citations: if you want to boost your citation counts (or altmetrics), publish your code;
  4. Collaboration: to find new collaborators and teams, to cross-fertilise with new areas, publish your code;
  5. Skills: to boost your software skills, publish your code;
  6. Career: to improve your resume and your job prospects, publish your code;
  7. Reputation: to avoid getting egg on your face, or worse, publish your code;
  8. Policies: to get a job, to publish in a particular journal, to secure funding, publish your code;
  9. Preparation: to prepare for the great future of web science, publish your code;
  10. Science! To do science rather than alchemy, publish your code.

Chasing the Dream

Interspersed with my Picking Winners series, here’s another post about the vision thing, a broad outline of how we might get there from here.

A month ago I sketched out my vision of the future of science. Of course, I didn’t just dream that up by myself: large parts of it are stolen shamelessly from much smarter people like

FX: screeching brakes

Hold it right there! You’ve never read “Reinventing Discovery”?! OK, go and buy it, right now, read it, and come back. I’ll wait.

Back? Let’s go on…

  • Michael Nielsen …;
  • Greg Wilson, whose Software Carpentry work is finally breaking through, bringing fundamental software development skills to the next generation of scientists;
  • Peter Murray-Rust, who has been banging the drum for open access and for content-mining for many years: it’s good to see this finally getting regulatory traction;
  • Victoria Stodden, who is putting her ground-breaking research on replication of computational science into action in the cloud at RunMyCode;
  • Fernando Pérez, whose compelling IPython system is transforming the way people think about interactive scientific computation;
  • and many, many others.

Although everyone in open science is coming at the subject from their own direction, with their own focus and interests, our thinking is definitely cohering into some clear ideas about the future of science software:

  • Code written for published research will be:
    • open-source;
    • easy to download;
    • easy to fork;
    • and easy to track.
  • It will be:
    • hosted and curated in the cloud;
    • at any of several independent hosting sites;
    • with version-tracking and defect-tracking web services;
    • and automatically-generated permanent version IDs for citation purposes;
    • all accessible via open web APIs.
  • It will have:
    • automatic authorship tracking;
    • linking many different web-based services;
    • producing online profiles and reputations;
    • which will naturally be used in scientific career records.
  • It will often depend on:
    • several third-party components;
    • automatically giving credit to their authors;
    • with clear records of dependency configuration;
    • and thus remain reproducible even when those components change.
  • The code will be written:
    • in many different programming languages;
    • on remote machines or via browser-based editors;
    • usually by scientists who are trained in software development.
  • It will be runnable:
    • by any interested person;
    • on servers in the cloud;
    • through an easy-to-use web interface;
    • on the original or modified data sets;
    • (possibly on payment of reasonable virtualization fees).
  • The resulting data – files, charts, or other results – will be open:
    • freely available;
    • for any use;
    • and for redistribution;
    • subject at most to attribution and/or share-alike requirements.
  • The data will also be:
    • retained permanently;
    • with a citable identifier;
    • and recorded provenance: versions of source code and data;
    • and automatically recorded credit to the person who ran the code.
  • Scientific publications will be:
    • often written alongside the code which supports them;
    • often generated online, fully automatically, from the inter-woven source and text;
    • usually open;
    • subject to public discussion in open web forums;
    • widely indexed, inter-linked, and woven into an open web of scientific knowledge.

I could go on (I haven’t even raised the subject of crowd-sourced data), but that’s the core of the vision. The question then is: how do we get there from here? As discussed in the earlier post, many components already exist (although some are still in their infancy):

  • GitHub is the world’s current favourite online code hosting service, with version control, defect tracking, and related services;
  • IPython Notebook is a great system for interactive web-based coding, including visualisation and text markup;
  • RunMyCode is a landmark site for cloud-based curation and running of research computation;
  • Zotero is a web-based literature search, research, and bibliographic tool;
  • FigShare is a web hosting service for datasets, figures, and all kinds of research outputs;
  • StarCluster is a fine system for managing computation on Amazon’s EC2 cloud service.
  • Software Carpentry is a training organization to provide basic software development skills to researchers.

It’s worth noting that some server-side parts of this ecosystem (GitHub, FigShare, RunMyCode?, EC2) are not open source, which adds some risk to the vision. A truly open science future won’t depend on closed-source systems: institutions and organisations will be able to deploy their own servers, and services and data will be robust against corporate failure or change. Happily these proprietary components (with the possible exception of RunMyCode) all have open APIs, which allows for the possibility of future open implementations.

It’s also worth noting in passing that several of these components (IPython Notebook, RunMyCode, Zotero, and Software Carpentry) are partly funded by the Alfred P. Sloan Foundation, a shadowy puppet-master (I mean, a non-profit philanthropic institution) which obviously shares this open science software vision. Several also have links with, or funding from, the Mozilla Foundation, which now has a grant from the Sloan Foundation to create a “Webmaking Science Lab”, apparently intending to drive this vision forward. [Full disclosure: I have applied for the post of lab director].

So, a lot of the vision already exists, in disconnected parts. The work remaining to be done includes:

  • Integration. RunMyCode with an IPython Notebook interface. EC2 configuration using a lickable web UI, generating StarCluster scripts. IPython Notebook servers versioning code at GitHub (or SourceForge, or GoogleCode). A GitHub button to generate a DOI. Automatic posting of result datasets to FigShare. And so on.
  • Inter-operation. Simple libraries for popular languages to access the APIs of all these services. An example is rfigshare: FigShare for R. But where’s (say) PyFigShare? Or FortranGitHub?
  • Independence. Web services shouldn’t care what programming languages and systems are used by researchers. They should make it easy to do popular things, but possible to do almost anything. They should be able to manage and interact with any virtual machine or cluster that a researcher can configure.
  • Identity. How does a researcher log into a service? Single sign-on using Persona? How does she claim credit for a piece of work? How does she combine different sources of credit? Can we do something with Open Badges?
  • Import (well, it begins with ‘I’; really I mean Configuration Management). Researchers need to be able to pull in specific versioned platforms or code modules from GitHub or elsewhere, and have those same versions work in perpetuity. This is a tangled web (a form of DLL hell) so we also need a system for recommending and managing known-working combinations. Like EPD or ActivePerl, but fully open.

This list is incomplete. We also need to build networks and groups of related projects. A lot of the required development will take place anyway, and coordinating different open-source projects, like managing individual developers, can resemble cat-herding. But there is a role for joint development, for workshops, summits, and informal gatherings, for code sprints, hackjams, and summer projects. Even if we are not marching in step, we should all be heading in the same general direction, and avoid duplicated effort. We should build interest and enthusiasm in the open-source community, and work together with major cross-project organisations such as the Google Open Source Programs Office, O’Reilly Media, and (again) the Mozilla Foundation.

Finally we also need buy-in from researchers and institutions. Planck’s second law has it that science progresses one funeral at a time. The experience of Software Carpentry suggests that it can take years for even basic ideas such as training to gain traction. So as the infrastructure takes shape, we should also be talking to institutions, journals, societies, and conferences, to publicise it and encourage its take-up. This vision thing isn’t top-down: it’s driven by scientists’ own perceptions of their needs, what we think will make us more productive, increase the recognition of our work, and improve science. If we’re right, then more scientists will join us. If we’re wrong, we don’t deserve to succeed.


Picking Winners: Rule 3 – Find a Strong Community

This is the third of a series of posts intended to help scientists take part in the open science revolution by pointing them towards effective tools, approaches, and technologies.

The previous advice is to use free software, and to take the advocacy with a pinch of salt.

3. Find a Strong Community

New projects are often very exciting, and might show amazing promise. And using a brand-new free software project is far less dangerous than taking up a beta version of some proprietary tool. A proprietary tool might never make it to a formal release, and if the vendor decides not to follow through on early development then it might even become impossible to legally use it. That can’t happen to free software: the license guarantees that you will always be able to use, adapt, copy, and distribute it. However, you usually want more than that.

Free software is not exempt from a general rule of thumb in the industry: there is a high “infant mortality rate”, and a majority of projects never make it to “version 2.0”. A shiny new tool is likely to become dull and unloved, and may never get the vital upgrade to a new language version, or the module for compatibility with a new protocol, or the latest operating system. It is paradoxical but true: the useful life expectancy of a software product increases as the product matures.

Growing into maturity, a piece of software acquires a community of developers and users, with mailing lists, IRC channels, blogs, twitter feeds, and meetings – everything from informal online gatherings through pub meets up to international conferences. This community forms a complex ecosystem, and it’s this network of people, systems, companies, and technologies that gives longevity to the software:

  • A diverse base of developers makes a project robust in the event of the loss of a key person, and helps to guard against unwise development choices.
  • A critical mass of users ensures a rich flow of new requirements, serves as the source of new developers, and may often provide some financial and material support for development (from server hardware to conference sponsorship).
  • An active community builds interfaces, plugins, and modules to connect a good project to a wide diversity of other systems, and these connections in turn allow the user base to expand.

This rich context is the sign of a mature software system, and until it has formed – which may take years from a tool’s first public announcement or release – you are taking a risk in using a project. Sooner or later you will want a new feature, or a bug fix, or compatibility with a new operating system or interface, or simply guidance through some obscure corner of the code, and at those times you will be glad of the community of fellow users. Without it, you will have to find your own way. Choosing free software ensures you are always free to do that: you can write a serial driver controlling your new lab instrument, or a visualisation tool to generate PDF charts directly from your analysis results. But if you are one of hundreds or thousands using the same tool, the chances are that someone will have done that work already. Or at least that you can find fellow-travellers to develop the code with you.

Free software communities are often very strong, and may survive for several decades. But they can atrophy and die. Potentially fatal problems include:

  • Stalled development: if the stream of new releases dries up, users will start to drift away to competing tools which provide the new functionality they need. This sort of project stall can be due to various causes: key people moving on, loss of corporate involvement, developmental over-reach or a defect mire, or even office politics and personality clashes.
  • Incompatible releases: if new releases break existing applications, the user base will become fragmented (as each user continues to use whichever old release works for them). Potential new users see this fragmentation and are either confused or simply put off.
  • Unwise development direction: a core development team focused on their own ideas, to the exclusion of the community’s requirements, will gradually lose users to other projects. This can result from core developers using the development of the system itself for research or experimental purposes. For instance, the developers of a programming language implementation might well be interested in a cool idea about programming languages (e.g. a new concurrency model, or a new syntactic possibility) and use their language to explore that direction. But the users mostly don’t want a Cool New Language; they want the Same Old Language, but with new libraries, a faster compiler, better debugger, etc.

So these are all things to consider when checking for a healthy community: are there frequent new releases of software? Are they backward-compatible, and is that a priority of the developers? Is the process for steering development flexible and responsive to community input?

Open-source software communities are great for productive conversations with colleagues and fellow researchers. Because they bring together users with a shared interest in the software, but who might work in different fields, you may well find grounds, not just for collective software development, but for research directions, and even collaborations. So once you have found a strong and healthy community, be sure to get involved, engage in discussions, and go to meetings. It is often highly rewarding.


Software Carpentry Boot Camp, Edinburgh

In a break from Nick’s recent Picking Winners series, an interlude from David about Software Carpentry—definitely a winner.

I went wearing my Climate Code Foundation badge to Edinburgh to help out at the Software Carpentry Boot Camp hosted by EPCC and PRACE and led by the Software Sustainability Institute’s Mike Jackson and the Space Telescope Science Institute’s Azalee Bostroem.

This post is full of nitty gritty details. I’ll follow up with a higher level view and some distilled feedback.

Being a helper is fun. And also quite hard work; probably not quite as hard as actually giving the course though. The job is to lurk at the edges and then swoop in when someone is in danger of falling behind, and fix their problem regardless of what OS, environment, or editor they may be using. Keeps you on your toes. Co-helper Aleksandra Pawlik’s top tips blog post is spot on.

Here follows a rambling list of observations and thoughts.

If you say boot camp starts at 0930, people turn up between 0830 (me) and 0931 and you actually get down to business at about 0940. Recommendation: tell people to turn up at 0900 for an 0930 start. Entice with biscuits and/or promises of one-on-one installation and editor instruction.

Echoing comments from many other boot camps, without a doubt the thing that required most help was software installation. Windows and Mac OS X were significantly worse than Linux here (CentOS making a special appearance just to make sure that Linux wasn’t entirely plain sailing). Next, eduroam; then probably whitespace (in both shell and Python).

curl is annoying (but it’s used because it’s installed by default on OS X). When trying to download a file from bitbucket, instead of giving some sort of error for a mistyped URL, curl would download the HTML of the error page. I would now recommend that people install wget.

Expecting people to accurately type in a 60-character URL just does not work. Perhaps a URL shortener would help. And more bundling, so we have fewer things to download.

Installing software is still painful, even when everyone is told that they must install it before attending, and even when I provided a script to test what was installed. To be fair to attendees, they had all acted in good faith and made honest efforts. We computer types are the ones to blame. The “highlight” of that for me had to be setting the system time to 2012-01-01 so that we could install Xcode on a Mac. Things like this are always easier to solve in person, sat next to someone (it’s a mystery to me why, but they are). We had tried to fix this on the mailing list a couple of days before the boot camp, with no success, but sat next to the computer in question it was done with a Google search in a few minutes.

Up-arrow and tab completion are really useful. And it is really hard to tell when an instructor is using them, because text just magically appears on the command line. I don’t know how hard this would be, but it might be a good idea to try saying out loud every time you pressed up-arrow or Tab. Also, Control-A.

Quitting things. It is surprisingly easy to get stuck in something and not know how to escape. In shell you can accidentally start a command that takes stdin (such as “cat”). You can accidentally open an editor (such as via “hg commit”) and not know how to quit; there seems to be an equal chance of it being vi or nano (what, no ed?). You can be reading a paged file (“man grep” or “git diff”) and not know how to go back.

The thing about quitting is that there are lots of ways of doing it, and it depends on what you’ve done, which you may not know. Ctrl-C is a good choice, but it doesn’t work from the Python prompt, or a pager (it does something else useful in the case of Python, but it doesn’t quit). Ctrl-D is useful, but doesn’t work in an editor (and you may accidentally quit out of the bash shell that you’re in). At least nano documents that exit is ^X. Which is useful if: a) you read it; and, b) you know that ^X means Ctrl-X. vi does not reveal how to quit itself.

It’s probably useful to have a little “how to quit stuff” card. At least learners can try the different things and see if one works (I had to rescue people out of nano, cat, man, awk, vi, and a shell string). Someone I spoke to had developed the coping strategy of closing the entire Terminal window if they got stuck in vi (good for their inventive skills, but a bit horrific).

On a slightly related note, to a first approximation it’s basically impossible for people to tell the difference between the Python, IPython, and shell prompts. Consequently they would type shell commands into Python, or vice versa, and be mystified as to why it wasn’t working. It’s probably useful to emphasise the prompt, and the difference between the different ones.

People seemed happy using nano even if they hadn’t used it before (though it is worth noting yet again that people still can’t quit nano, even though it says how to at the bottom of the screen).

It was really hard to see the difference between purple and black (on screen).

Instructors should make their font really big. I can’t over-emphasise this. About 20 lines for the entire screen seems about right.

German keyboards have a Steuerung key (for Ctrl), and they might not have a backquote key. We should probably teach $(stuff) instead of `stuff`.

(When programming in Python) left to their own devices, many people will call a variable that holds a list list, one that holds a tuple tuple, and one that holds a dict dict. This is very natural, but a bad idea in Python; Python lets you do it anyway. (This was mentioned in the materials, but by then many people had done it already; fortunately bad things did not happen, because we didn’t teach people that the list function can be used to make lists, or that the dict function can be used to make dicts.) Also, functions to sum lists will inevitably be called sum.
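The pitfall is easy to demonstrate (a contrived example, not from the course materials):

```python
rows = [(1, 'a'), (2, 'b')]

list = [1, 2, 3]          # looks natural, but now shadows the built-in type
try:
    columns = list(zip(*rows))
except TypeError as error:
    print(error)          # 'list' object is not callable

del list                  # deleting the name un-shadows the built-in
columns = list(zip(*rows))
print(columns)            # [(1, 2), ('a', 'b')]
```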

Although matplotlib was advertised as optional, in practice, as soon as it was being demonstrated, everyone wanted to use it, so they began to install it. With varying degrees of success. And of course the whole business of people not being able to follow along if they were installing something. Recommendation: do not have optional things.


A Vision of Web Science

In a break from my Picking Winners series, here’s a long-brewing post about the future of web science.

The culture and practice of science is undergoing a revolution, driven by technological change. While most scientists are excited by the shifts and the opportunities they present, some are uneasy about the pace of change, and unclear about the destination: where is science going, and how will it help their own research? In this post, I will lay out my vision of twenty-first century science: the shape of future scientific practice, and in particular the future of scientific computation and data-processing.

First, a few motivational remarks:

  • These changes are good for scientists. They are eliminating many of the irritating chores that consume too much of a working scientist’s time. They are shortening paths of communication—with other scientists, with institutions, with funding bodies, with the public—and providing new opportunities for gathering data, for developing hypotheses and models, for building on and integrating with other work, for demonstrating an individual’s contributions. Science is a tough profession: with demanding qualifications, poor job security, long hours of challenging intellectual work, and often relatively low pay. Anything we can do to improve the life of scientists can only be good for the profession.
  • These changes are good for science. They accelerate the speed and reliability of the core project of science: discovering facts and constructing true knowledge about the physical world. Information technology—primarily the internet but also many other aspects of cheaper and more reliable communication, processing, and storage—is allowing us to develop understanding, and correct errors, much faster, more cheaply, and more reliably than before. It allows us to collaborate, and to compete, more effectively and fairly than ever. As new systems emerge, both collaboration and competition will improve, and science will benefit.
  • These changes are good for humanity. Science has been an overwhelming force for good for the last three centuries, and that will continue. To tackle the challenges facing the world in the twenty-first century—such as global warming, water management, power production, energy efficiency, resource depletion, population growth, and demographic shifts—we will need new technologies, and creating those technologies will demand new and better science. The open science revolution will provide it.

That summarizes my own motivation for involvement in the open science revolution. I am not a scientist myself—I’m a software consultant—but I see this as the most positive contribution I can make to the common human endeavour.

Now, the vision. Nothing I describe here is technically very challenging: most of the pieces already exist, and “all that remains” is to combine them into an integrated whole. There’s an old software development joke that any project is divided into the first 90% and the remaining 90%, but the systems I describe below are surely no more than five years away. I’ll be pessimistic and stretch this out by a few years. So this vision concerns a birdsong research project in the year 2020, run by a young scientist named Binita*.

The Vision: Binita’s Birdsong

Binita has an interest in the shifting geographic and seasonal patterns of birdsong, which she believes may be connected to climate change. She has some old data from a nineteenth-century global network of ornithologists, who exchanged and collected form letters recording bird sightings: that data was digitized in 2018 by a crowd-sourcing project, similar to Old Weather+, called “Funny Old Bird”*. The Funny Old Bird data is on the Victorian Historian* data hub, and like most science data is entirely open. Binita needs twenty-first century data for comparison.

She gathers this data from citizen scientists, who have downloaded an app to their phone or tablet. These citizens are located all over the world, and all they have to do to take part in the project is to leave the app running on their phone. In fact, they may be recruited automatically because they are already running a citizen science meta-app, such as BirdBrains*, or NaturalSounds*—built in 2019 based on BOINC+.

The app listens for birdsong—naturally, most of the people who run it are wildlife enthusiasts, and spend a lot of time out-of-doors. When the audio pattern recognition in the app detects birdsong, it records it, tags it with GPS data, and prepares it for upload to a cloud server. Some of the citizen scientists are really keen, and set up multiple listening stations in the woods and fields near their home. A listening station is just an ancient cellphone—maybe an iPhone 5—so is ultra-cheap. A number of schools create mesh networks from old phones and laptops, monitoring whole areas near the school. Other data is obtained from ornithological societies, and from naturalists using the audio recording chips built into the GPS markers they use when “ringing” birds. There’s a rich variety of data sources for Binita to incorporate. A citizen scientist can tag a recording with his or her belief about the bird species featured, or with an image or video of the bird or birds in question.

When each recording arrives in the cloud, at Climate Data Hub*, it is logged against the citizen scientist (who can sign in to the data site and listen to recordings). This provenance information flows through the entire analysis, and adds to each citizen’s project score, which is also influenced by the accuracy of their species guesses. Citizens who record the same species of bird (or the same individual) are automatically connected via the project website. The dataset is automatically versioned, so that every update creates a new version and every analysis takes place on an instantaneous and reproducible snapshot, and has a record of the contributors which can be used in a citation. This functionality is all built out of toolkits originally created for the Galaxy Zoo+ or another Zooniverse+ project, and the data citability is descended from DataCite+.

Binita writes code to analyse her birdsong data, on ClimateCodeHub*. This is a notebook hub site, one of many on the web, and recognisable as a descendant of GitHub+, with heritage dating back to sites developed in the 1990s such as SourceForge+. When she logs into the site from her tablet, she can create a new project or copy an existing one (either her own or from another researcher). In a project, she can specify which languages she is coding in, and the URL of the source datasets (at Victorian Historian and Climate Data Hub). She can keep her code closed (for example, during early development, or until a publication embargo is cleared), or make it open right away. On open projects anyone can contribute; for closed projects she can invite specific other researchers. Contributors can view the code and offer comments, defect reports, or code contributions which she can use or ignore, and she can give various permissions and controls to other people too. In this case, it turns out that an enthusiastic bird-watcher in Venezuela is also a professional statistician, and one in New Zealand is an audio-processing expert, and they join Binita in working on the code.

The hub site provides a version control system derived from systems such as Git+ or Subversion+, so contributors can download code to their own machines, or they can code directly on the notebook hub through their browsers. The versioning system tracks everyone’s contributions, and links to them from each person’s home page on the hub site, so every participant can easily claim appropriate credit.

Whether Binita and her colleagues are coding in their browsers, or on their own machines, they are using CodeNotebook*, a descendant of IPython Notebook+, which allows them to interactively develop “notebook” pages, incorporating code, charts, text, equations, audio, and video. They can work on separate computations, or connect together to the same notebook server (on the notebook hub) and code interactively together. Since 2012, people have been using this technology to write websites, academic papers, and even textbooks. The CodeNotebook system is not language-specific: Binita can use her favourite language (which happens to be some species of Python+) and add interfaces to some third-party Fortran+ libraries to do heavy numeric lifting, and a C+++ subsystem for audio processing.

Several times during the research project, starting with those Fortran and C++ modules, Binita realises that she can save effort by re-using other pieces of code. These might be her own modules from previous projects, or sections of other researchers’ projects. These modules are separate projects on that hub, or maybe some other one. She adds these versioned dependencies to the project configuration, and the code is automatically integrated by the version control system, and added to the citations section of her notebook pages. The other researchers whose code she is re-using are notified, and some get in touch to see whether they can contribute (and maybe get additional credit as contributors).

Binita uses a GitLit* plugin to find and track related research reading and to build the project bibliography—with citations of code, data, papers, and notebooks—which is all kept updated automatically. This is related to Mendeley+ and Zotero+, combined with content-mining software originally called AMI2+.

Binita can run her notebook directly on her own machine, or on the notebook hub servers in the cloud (she’s probably going to do the latter, because her datasets are fairly large and the various hub providers work together to make this efficient). This was pioneered by RunMyCode+ in 2011. Each time Binita runs her notebook, it can fetch snapshots from the dataset clouds, complete with versioning and citation metadata. The notebook is automatically annotated with all this configuration and version information, together with the versions of the operating system and citation links to all the systems, libraries and third-party code they are using. When she is content with the results, she can click a “Pre-print” button, and the resulting notebook is automatically given a Digital Object Identifier (DOI) and made available online. This instant citability was an amazing thing when FigShare+ first did it, back in 2011 when Binita was in middle school. The hub has a post-publication peer review system, so Binita’s colleagues and competitors around the world can see this pre-print, comment on it, and score it. It is automatically added to her profile page on the hub, where her resume system will pick it up, and funding bodies will find it there when they come to assess her research excellence productivity metric (or whatever it is called, in 2020).

Binita, however, wants a research paper in a journal, the old-fashioned sort, even printed on paper—something she can show her grandfather, who first took her bird-watching. So she clicks the “Submit” button, and a PDF document is automatically generated from the notebook, formatted according to the house style of her chosen journal, and sent to the journal’s editors. Open-source journal software based on Open Journal Systems+ manages the peer-review process, and after some minor revisions, it’s done.

When Binita revises her notebook, either during peer-review or otherwise, fresh DOIs are minted and annotations are automatically added to old versions, so other researchers can always find the most recent public version. Researchers who are “watching” a notebook are notified automatically. As Binita works, she herself sometimes receives an automated notification that code or data she is using has been updated in this way; she can view the reason for the update and choose whether to switch to the new version or continue to use the old version.

Any reader of Binita’s work can run her code themselves, either on the notebook hub or on their own machine, simply by clicking on buttons in the notebook. If it’s very computationally expensive, they might have to make a micro-payment, as they might be used to doing on Amazon’s EC2+. The notebook hub will automatically offer a virtual machine image, and a management script for a system such as StarCluster+. Alternatively, the notebook’s automated description—with exact version information, and links to installers, for all the components—allows interested readers to put together a running system for themselves.

All the code for running a notebook hub is open-source, of course, so anyone can host one, and in fact an ecosystem has developed. Rich institutions and departments host their own; others are run by publishers, funding bodies, professional societies, non-profit organisations, and commercial companies who fund them by advertising or rental. A Semantic Web protocol has arisen for ensuring that Binita’s logins on all of these various hub systems are connected, creating a single unified researcher identity, and bibliography, that moves with her through her career.

Last, but far from least: where did Binita and all her peers learn how to drive all this marvellous technology? Well, back around 2010 science institutions finally started to understand the huge importance of software skills in twenty-first century science. After years of effort, Software Carpentry+ was finally recognised as indispensable training for the next generation of scientists. By the time Binita clicks “submit” on her project, every university around the world has compulsory courses in basic software development skills, for every science undergraduate. The better high schools are running their own notebook hubs, and smart middle schoolers are creating notebooks about snail trails and raindrop patterns. Science is better, faster, more open, and more fun, than ever before.

* These names are invented for the purpose of this description; any resemblance to the name of any project, product, or person, living or dead, is accidental and not intended to imply endorsement or criticism, so please don’t sue me.

+ These, on the other hand, are real projects that exist right now. The future is here.


Picking Winners: Rule 2 – Don’t Believe the Hype

This is the second of a series of posts intended to help scientists take part in the open science revolution by pointing them towards effective tools, approaches, and technologies.

Having established in the previous post the importance of using open-source software for open science, I continue to the second in my set of five simple rules for scientists. This is one which most of us know and use instinctively in our daily lives, but somehow when it comes to computers and software tools we can lose sight of this basic idea:

2. Don’t Believe the Hype

Many open-source projects have active communities of users and developers. These people are using and developing the tool because it solves their problems. Unfortunately, this success may blind some of them to difficulties which you may face when you use the tool in your research. They may be so enthused about their project that they become evangelical about its use: it becomes the solution to everyone’s problems. In their minds, they have the World’s Greatest Hammer, so your project looks like a nail.

When you are first researching a tool for potential use, reading documentation and opening conversations with existing users, you will come across some of these project evangelists, and be exposed to some of their hype. They may well be very experienced in their own arena, but are likely to have little or no experience in your research domain, or the class of problem you are trying to solve. Particularly with younger project communities, whose tools have not been applied to a wide variety of problems, and which have grown out of a narrow problem domain, the ignorance can be breathtaking, and the hype can be extreme.

Of course, hype for proprietary software is often far more widespread and more extreme than in the free software world. Some commercial marketing departments may have no compunction about deliberately distorting the truth, or outright lying, to secure a sale, and (due to common stringent licensing conditions which can prohibit users from telling war stories), they often have a great deal of control over the “messaging” about their products. But since we already disposed of proprietary software with rule 1, it’s important that we should be honest with ourselves about some problems in open source communities.

Here’s a list of particularly common subjects of hyperbolic rhetoric. For each subject, I’ve suggested some “antidotes” to the Kool-Aid: questions you should ask yourself or the community, to help you judge the true suitability of the tool for your use.

  • Productivity: those jobs which used to take months, or which were especially difficult to get right, are now automatically handled in the background.
    • Are those jobs you are familiar with? Are they likely to feature in your project? Yes, it’s cool that the tool makes it really easy to drive a robotic instrument through the USB port. But is that something you need to do on your project?
    • Do you already have a good way of carrying out that job? Is it wise to discard that expertise, or would it be better to continue with your existing technique?
  • Performance: “Faster than FORTRAN” is a common cry. This is quite a hard target for algorithms which are very numeric-intensive, and which have been optimized for a particular platform by an experienced developer. It’s hard because compiling FORTRAN for high-performance computation is a problem which has been attacked for six decades by many of the finest minds in computer science, and parts of FORTRAN have been refined to allow the expression of algorithms in a performance-sensitive way (and computers have evolved alongside, to enable FORTRAN programs to run faster). Competing with FORTRAN is not such a big deal in general-purpose programming—algorithms without data parallelism—or for more modern languages in which parallelism, dependency, and aliasing can be fully expressed, and which have mature compilers developed with performance in mind.
    • “faster than X” is meaningless. “Faster than X, on this specific computer, solving this problem, with this code and this data” is meaningful, and should be assessed by comparing that problem with your own.
    • How computationally intensive is your problem? Will it take hours of computer time? Of super-computer time? How much of your time is it worth to save this amount of computer time?
    • How much of your computation is going to be performance-critical? It’s often only a very small numeric core, which—once located—can be farmed out to a separate small program written in FORTRAN or some other special-sauce language.
  • Installation: you will be told that installation “is a snap”. This is very common, because long-term users of any software have usually not had to install it from scratch for a long time. They may never have installed it at all (it may have come “out-of-the-box”), and if they did then they may well have forgotten any difficulty they had. Furthermore, many users have only ever installed it on a single “platform”: their own computer, running some particular version of that operating system with particular versions of any required packages. Software installation is often tricky, with complex dependencies and many compatibility headaches.
    • Read through the installation instructions for your operating system. Do they identify the dependencies, including particular versions? Do they have trouble-shooting advice?
    • Are there existing developers and users across a wide diversity of platforms? Is it easy to find users with your particular mix of operating system and other dependencies?
    • Are there active discussion forums, or a wiki, or a knowledge base? Browse them: do newcomers get real assistance, or a brush-off?
  • Compatibility: you will certainly be assured that the system has interfaces to XML, and JSON, and SOAP/RPC, and NetCDF, and SVG, and SQLite, and on and on. But if that functionality is critical to your project, research it further. The interface may well have been developed to support one or two previous projects, and only provide whatever functionality they required. Or it may be antique, not up-to-date with recent changes.
    • Look through the interface documentation. Is it complete? Is it up-to-date?
    • Ask questions specific to your use. Instead of asking “does it work with Matlab files?”, ask “can I read MAT-File Level 5 format files, with sparse arrays and byte-swapping?”
    • Find other users with similar compatibility requirements and ask to see their code.
  • Your problem doesn’t matter: because the tool is so awesome that this other technique is better. The language doesn’t need namespaces—you can always use a prefix. You don’t need a working C interface—you can always reimplement that library in this other way. You don’t really need PDF charts: SVG is so much more 21st century.
    • You are kidding, right?
    • (yes, it’s true about SVG, but many publishers and other tools still want PDF).
  • Future development: It’s not faster than FORTRAN yet, but it will be by Christmas because I just read this cool paper about how to do it. Version 3 is going to have support for PDF defrobnification. The iPhone app will be out Real Soon Now.
    • As with all software, you should assume that any feature not already functional in a released version is very unlikely to be working by the suggested date, and may well never exist.
    • So: keep one eye on project announcements, and bear the tool in mind for future use, but if your project needs a feature today, then either choose a tool which has that feature now, or plan to develop it yourself.
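
The performance antidotes above boil down to “measure your own workload before believing anyone’s benchmark”. A minimal sketch using Python’s standard cProfile module (the function names here are invented stand-ins for your own pipeline, not any real tool’s API) shows how to find out whether a small numeric core really dominates:

```python
# Profile a toy analysis to see where the time actually goes.
# Often only a tiny numeric kernel is performance-critical, and only
# that kernel is worth farming out to FORTRAN or another fast language.
import cProfile
import io
import math
import pstats

def expensive_core(n):
    # Stand-in for the small numeric kernel that dominates run time.
    return sum(math.sin(i) * math.cos(i) for i in range(n))

def analysis():
    # Stand-in for the rest of the pipeline: I/O, bookkeeping, formatting.
    total = expensive_core(200_000)
    return "result: %.3f" % total

profiler = cProfile.Profile()
profiler.enable()
analysis()
profiler.disable()

# Report the five entries with the largest cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

If `expensive_core` accounts for nearly all the cumulative time, then a “faster than FORTRAN” claim only matters for those few lines—and rewriting just that kernel is cheaper than switching your whole toolchain.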

Hype is at its worst at particular phases in a project’s life-cycle: at the beginning (when the future is bright, and everything is possible) and near the end (when the project is dying, and this is obvious to all but the True Faithful, who are sustained by unshakeable belief). Those are times to avoid a project anyway, and will be addressed in future posts in this series.


Picking Winners: Rule 1 – Is it Really Free?

This is the first of a series of posts intended to help scientists take part in the open science revolution by pointing them towards effective tools, approaches, and technologies.

When I make presentations to groups of scientists about the Climate Code Foundation and the Science Code Manifesto, I am often asked to recommend particular systems, or to advise on the advantages and disadvantages of some particular combination of tools. Many scientists are keen to adopt free software systems, open-source tools, and modern development methods, to become part of the growing open science revolution, and are looking for guidance.

I’m happy to discuss such questions individually, but there are far too many possible systems, configurations, and sets of requirements for anyone to give a single comprehensive answer. One tool might be excellent for most uses but fall short in some particular aspect which is critical for your project. Another might be generally weak but have a tremendously strong feature which addresses your needs. Over the next few days I will be giving some specific answers to common questions—identifying the strengths and weaknesses of my own favourite systems—but first I will try to help any scientist to pick winners for themselves, by giving some useful rules of thumb.

So I have formulated five general rules for making good choices in the open-source development world, and this is the first in a series of posts laying out those rules.

1. Is it really free? Is it really open?

The first question to ask about any particular tool is: what restrictions are there on using, modifying, or sharing it? Can my audience use it too?

The open science revolution is built on free software, and I strongly encourage all researchers to use free software whenever possible. However, the term “free software” has a very specific technical meaning, which is often misunderstood by those outside the software world. It does not mean “zero-cost software”, it means “software that does not restrict its users from studying, modifying, and sharing it”.

Because of this confusion, I usually favour the term “open-source software”, which isn’t so easily misunderstood and also has two other advantages. First, it has the ring of jargon, so listeners immediately grasp that it has a specific technical meaning. Secondly, it chimes with other pillars of twenty-first century science: “open data” and “open access”, with which researchers are often already familiar. Open-ness—making one’s results available to others to be criticised, praised, and built upon—is a core scientific value, and this term emphasizes that the same principle is at work for software.

The formal definitions of the two terms overlap almost totally—certainly all the software I recommend is both free and open-source—and many people use them synonymously: I mostly use “free software” for audiences of software professionals, and “open-source software” for others. It’s very regrettable that some people, including software community leaders, see conflict between the two. In my experience most people actually writing free software aren’t very interested in the distinction.

In any case, the important thing to remember is that software which is zero-cost but not truly free is a very risky choice which will not provide many of the great advantages of open-source software. Scientists and researchers working in large institutions in rich countries often have a very wide choice of zero-cost software tools:

  • Some will be available “for free” for some limited trial period. For example, many new computers come pre-installed with suites of software which become unusable after 30 days, unless a fee is paid. The same is often true of software provided at a departmental or group level, and of many web services (for example, search tools for research literature). Remember, the first hit of any drug is always free.
  • Some will have been bought for the institution, the department, or the team. A license may need renewing—often annually—at a price to be determined by the vendor. Next year, or next week, the license may come up for renewal and be cancelled due to budget pressure.
  • Some will not have been bought, but will have been copied illegally, and any licensing mechanism will have been subverted. Such software may be suddenly disabled (and the people responsible punished).
  • Some may have been bought by a single researcher or small group, and made available to a few people. The fact that it’s on your computer doesn’t guarantee that you have the right to use it.
  • Finally, some will be truly free, open-source, software.

Always ask yourself this: can my audience use this tool? Open science is all about sharing your research, including your code, with others, and if they can’t use the tool then your code is much less useful to them—they won’t read it and they won’t improve it—and to you: they won’t cite it.

Your audience might be your research colleagues, or those elsewhere in your department, or at other institutions around the world (including those with less generous budgets), or independent researchers, or the public. One key audience member, often the most important, is your own future self: you can use the tool right now, but your institution might stop licensing it, or you might move to work somewhere without a license, or the software vendor might go broke (so that suddenly nobody has a license). Do you want to be able to use your own research in five years’ time?

Open science uses open-source software. Make sure you do too.
