The Climate Code Foundation is a non-profit organisation founded to promote the public understanding of climate science.

Chasing the Dream

Interspersed with my Picking Winners series, here’s another post about the vision thing, a broad outline of how we might get there from here.

A month ago I sketched out my vision of the future of science. Of course, I didn’t just dream that up by myself: large parts of it are stolen shamelessly from much smarter people like

FX: screeching brakes

Hold it right there! You’ve never read “Reinventing Discovery”?! OK, go and buy it, right now, read it, and come back. I’ll wait.

Back? Let’s go on…

  • Michael Nielsen …;
  • Greg Wilson, whose Software Carpentry work is finally breaking through, bringing fundamental software development skills to the next generation of scientists;
  • Peter Murray-Rust, who has been banging the drum for open access and for content-mining for many years: it’s good to see this finally getting regulatory traction;
  • Victoria Stodden, who is putting her ground-breaking research on replication of computational science into action in the cloud at RunMyCode;
  • Fernando Pérez, whose compelling IPython system is transforming the way people think about interactive scientific computation;
  • and many, many others.

Although everyone in open science is coming at the subject from their own direction, with their own focus and interests, our thinking is definitely cohering into some clear ideas about the future of science software:

  • Code written for published research will be:
    • open-source;
    • easy to download;
    • easy to fork;
    • and easy to track.
  • It will be:
    • hosted and curated in the cloud;
    • at any of several independent hosting sites;
    • with version-tracking and defect-tracking web services;
    • and automatically-generated permanent version IDs for citation purposes;
    • all accessible via open web APIs.
  • It will have:
    • automatic authorship tracking;
    • linking many different web-based services;
    • producing online profiles and reputations;
    • which will naturally be used in scientific career records.
  • It will often depend on:
    • several third-party components;
    • automatically giving credit to their authors;
    • with clear records of dependency configuration;
    • and thus remain reproducible even when those components change.
  • The code will be written:
    • in many different programming languages;
    • on remote machines or via browser-based editors;
    • usually by scientists who are trained in software development.
  • It will be runnable:
    • by any interested person;
    • on servers in the cloud;
    • through an easy-to-use web interface;
    • on the original or modified data sets;
    • (possibly on payment of reasonable virtualization fees).
  • The resulting data – files, charts, or other results – will be open:
    • freely available;
    • for any use;
    • and for redistribution;
    • subject at most to attribution and/or share-alike requirements.
  • The data will also be:
    • retained permanently;
    • with a citable identifier;
    • and recorded provenance: versions of source code and data;
    • and automatically recorded credit to the person who ran the code.
  • Scientific publications will be:
    • often written alongside the code which supports them;
    • often generated online, fully automatically, from the inter-woven source and text;
    • usually open;
    • subject to public discussion in open web forums;
    • widely indexed, inter-linked, and woven into an open web of scientific knowledge.
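To make the "permanent version IDs" and provenance bullets a little more concrete, here is a purely hypothetical sketch of the kind of metadata record such a hosting service might attach to a published code version (every field name and identifier below is invented for illustration):

```
{
  "doi":        "10.9999/example.code.1234",
  "repository": "https://hub.example.org/researcher/project",
  "version":    "1.3.0",
  "commit":     "a1b2c3d",
  "authors":    ["contributor records, linked to online profiles"],
  "dependencies": [
    {"name": "some-library", "version": "2.1.4", "doi": "10.9999/example.lib.42"}
  ],
  "license":  "GPL-3.0",
  "archived": "2020-06-01T12:00:00Z"
}
```

The point is not any particular format, but that version, authorship, dependency configuration, and a citable identifier all travel together, machine-readably, with the code.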

I could go on (I haven’t even raised the subject of crowd-sourced data), but that’s the core of the vision. The question then is: how do we get there from here? As discussed in the earlier post, many components already exist (although some are still in their infancy):

  • GitHub is the world’s current favourite online code hosting service, with version control, defect tracking, and related services;
  • IPython Notebook is a great system for interactive web-based coding, including visualisation and text markup;
  • RunMyCode is a landmark site for cloud-based curation and running of research computation;
  • Zotero is a web-based literature search, research, and bibliographic tool;
  • FigShare is a web hosting service for datasets, figures, and all kinds of research outputs;
  • StarCluster is a fine system for managing computation on Amazon’s EC2 cloud service;
  • Software Carpentry is a training organization to provide basic software development skills to researchers.

It’s worth noting that some server-side parts of this ecosystem (GitHub, FigShare, RunMyCode?, EC2) are not open source, which adds some risk to the vision. A truly open science future won’t depend on closed-source systems: institutions and organisations will be able to deploy their own servers, and services and data will be robust against corporate failure or change. Happily, these proprietary components (with the possible exception of RunMyCode) all have open APIs, which allows for the possibility of future open implementations.

It’s also worth noting in passing that several of these components (IPython Notebook, RunMyCode, Zotero, and Software Carpentry) are partly funded by the Alfred P. Sloan Foundation, a shadowy puppet-master (sorry: a non-profit philanthropic institution) which obviously shares this open science software vision. Several also have links with, or funding from, the Mozilla Foundation, which now has a grant from the Sloan Foundation to create a “Webmaking Science Lab”, apparently intending to drive this vision forward. [Full disclosure: I have applied for the post of lab director.]

So, a lot of the vision already exists, in disconnected parts. The work remaining to be done includes:

  • Integration. RunMyCode with an IPython Notebook interface. EC2 configuration using a lickable web UI, generating StarCluster scripts. IPython Notebook servers versioning code at GitHub (or SourceForge, or Google Code). A GitHub button to generate a DOI. Automatic posting of result datasets to FigShare. And so on.
  • Inter-operation. Simple libraries for popular languages to access the APIs of all these services. An example is rfigshare, FigShare for R. But where’s (say) PyFigShare? Or FortranGitHub?
  • Independence. Web services shouldn’t care what programming languages and systems are used by researchers. They should make it easy to do popular things, but possible to do almost anything. They should be able to manage and interact with any virtual machine or cluster that a researcher can configure.
  • Identity. How does a researcher log into a service? Single sign-on using Persona? How does she claim credit for a piece of work? How does she combine different sources of credit? Can we do something with Open Badges?
  • Import (well, it begins with ‘I’; really I mean Configuration Management). Researchers need to be able to pull in specific versioned platforms or code modules from GitHub or elsewhere, and have those same versions work in perpetuity. This is a tangled web (a form of DLL hell) so we also need a system for recommending and managing known-working combinations. Like EPD or ActivePerl, but fully open.
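As a sketch of what the missing “PyFigShare” might look like: the wrapper below is entirely hypothetical (its name, function names, endpoint path, and base URL are invented); the only assumption is the general shape of a token-authenticated JSON web API, built here with nothing but the Python standard library.

```python
# Hypothetical sketch of a "PyFigShare" API wrapper, analogous to rfigshare.
# The base URL, endpoint path, and payload fields are illustrative only,
# not a real FigShare API specification.
import json
import urllib.request

API_BASE = "https://api.figshare.example.org/v2"  # placeholder base URL

def build_create_article_request(title, description, token):
    """Build (but do not send) an authenticated request to create an article."""
    payload = json.dumps({"title": title, "description": description}).encode()
    return urllib.request.Request(
        API_BASE + "/account/articles",
        data=payload,
        headers={
            "Authorization": "token " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_create_article_request("Birdsong dataset", "2020 recordings", "SECRET")
print(req.get_method(), req.full_url)
```

A real library would of course send the request and handle errors, pagination, and uploads; the point is that a usable first cut is a few hundred lines per language, not a research project.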

This list is incomplete. We also need to build networks and groups of related projects. A lot of the required development will take place anyway, and coordinating different open-source projects, like managing individual developers, can resemble cat-herding. But there is a role for joint development, for workshops, summits, and informal gatherings, for code sprints, hackjams, and summer projects. Even if we are not marching in step, we should all be heading in the same general direction and avoiding duplicated effort. We should build interest and enthusiasm in the open-source community, and work together with major cross-project organisations such as the Google Open Source Programs Office, O’Reilly Media, and (again) the Mozilla Foundation.

Finally we also need buy-in from researchers and institutions. Planck’s second law has it that science progresses one funeral at a time. The experience of Software Carpentry suggests that it can take years for even basic ideas such as training to gain traction. So as the infrastructure takes shape, we should also be talking to institutions, journals, societies, and conferences, to publicise it and encourage its take-up. This vision thing isn’t top-down: it’s driven by scientists’ own perceptions of their needs, what we think will make us more productive, increase the recognition of our work, and improve science. If we’re right, then more scientists will join us. If we’re wrong, we don’t deserve to succeed.

Posted in News | 5 Comments

Picking Winners: Rule 3 – Find a Strong Community

This is the third of a series of posts intended to help scientists take part in the open science revolution by pointing them towards effective tools, approaches, and technologies.

The previous advice is to use free software, and to take the advocacy with a pinch of salt.

3. Find a Strong Community

New projects are often very exciting, and might show amazing promise. And using a brand-new free software project is far less dangerous than taking up a beta version of some proprietary tool. A proprietary tool might never make it to a formal release, and if the vendor decides not to follow through on early development then it might even become impossible to legally use it. That can’t happen to free software: the license guarantees that you will always be able to use, adapt, copy, and distribute it. However, you usually want more than that.

Free software is not exempt from a general rule of thumb in the industry: there is a high “infant mortality rate”, and a majority of projects never make it to “version 2.0”. A shiny new tool is likely to become dull and unloved, and may never get the vital upgrade to a new language version, or the module for compatibility with a new protocol, or the latest operating system. It is paradoxical but true: the useful life expectancy of a software product increases as the product matures.

Growing into maturity, a piece of software acquires a community of developers and users, with mailing lists, IRC channels, blogs, twitter feeds, and meetings – everything from informal online gatherings through pub meets up to international conferences. This community forms a complex ecosystem, and it’s this network of people, systems, companies, and technologies that gives longevity to the software:

  • A diverse base of developers makes a project robust in the event of the loss of a key person, and helps to guard against unwise development choices.
  • A critical mass of users ensures a rich flow of new requirements, serves as the source of new developers, and may often provide some financial and material support for development (from server hardware to conference sponsorship).
  • An active community builds interfaces, plugins, and modules to connect a good project to a wide diversity of other systems, and these connections in turn allow the user base to expand.

This rich context is the sign of a mature software system, and until it has formed – which may take years from a tool’s first public announcement or release – you are taking a risk in using a project. Sooner or later you will want a new feature, or a bug fix, or compatibility with a new operating system or interface, or simply guidance through some obscure corner of the code, and at those times you will be glad of the community of fellow users. Without it, you will have to find your own way. Choosing free software ensures you are always free to do that: you can write a serial driver controlling your new lab instrument, or a visualisation tool to generate PDF charts directly from your analysis results. But if you are one of hundreds or thousands using the same tool, the chances are that someone will have done that work already. Or at least that you can find fellow-travellers to develop the code with you.

Free software communities are often very strong, and may survive for several decades. But they can atrophy and die. Potentially fatal problems include:

  • Stalled development: if the stream of new releases dries up, users will start to drift away to competing tools which provide the new functionality they need. This sort of project stall can be due to various causes: key people moving on, loss of corporate involvement, developmental over-reach or a defect mire, or even office politics and personality clashes.
  • Incompatible releases: if new releases break existing applications, the user base will become fragmented (as each user continues to use whichever old release works for them). Potential new users see this fragmentation and are either confused or simply put off.
  • Unwise development direction: a core development team focused on their own ideas, to the exclusion of the community’s requirements, will gradually lose users to other projects. This can result from core developers using the development of the system itself for research or experimental purposes. For instance, the developers of a programming language implementation might well be interested in a cool idea about programming languages (e.g. a new concurrency model, or a new syntactic possibility) and use their language to explore that direction. But the users mostly don’t want a Cool New Language; they want the Same Old Language, but with new libraries, a faster compiler, better debugger, etc.

So these are all things to consider when checking for a healthy community: are there frequent new releases of software? Are they backward-compatible, and is that a priority of the developers? Is the process for steering development flexible and responsive to community input?

Open-source software communities are great for productive conversations with colleagues and fellow researchers. Because they bring together users with a shared interest in the software, but who might work in different fields, you may well find grounds, not just for collective software development, but for research directions, and even collaborations. So once you have found a strong and healthy community, be sure to get involved, engage in discussions, and go to meetings. It is often highly rewarding.

Posted in News | 1 Comment

Software Carpentry Boot Camp, Edinburgh

In a break from Nick’s recent Picking Winners series, an interlude from David about Software Carpentry—definitely a winner.

I went wearing my Climate Code Foundation badge to Edinburgh to help out at the Software Carpentry Boot Camp hosted by EPCC and PRACE and led by the Software Sustainability Institute’s Mike Jackson and the Space Telescope Science Institute’s Azalee Bostroem.

This post is full of nitty-gritty details. I’ll follow up with a higher-level view and some distilled feedback.

Being a helper is fun. And also quite hard work; probably not quite as hard as actually giving the course though. The job is to lurk at the edges and then swoop in when someone is in danger of falling behind, and fix their problem regardless of what OS, environment, or editor they may be using. Keeps you on your toes. Co-helper Aleksandra Pawlik’s top tips blog post is spot on.

Here follows a rambling list of observations and thoughts.

If you say boot camp starts at 0930, people turn up between 0830 (me) and 0931 and you actually get down to business at about 0940. Recommendation: tell people to turn up at 0900 for an 0930 start. Entice with biscuits and/or promises of one-on-one installation and editor instruction.

Echoing many comments from other boot camps, without a doubt the thing that required most help was software installation. Windows and Mac OS X were significantly worse than Linux here (CentOS making a special appearance just to make sure that Linux wasn’t entirely plain sailing). Next, eduroam; then probably whitespace (in both shell and Python).

curl is annoying (but it’s used because it’s installed by default on OS X). When trying to download a file from Bitbucket, instead of giving some sort of error for a mistyped URL, curl would download the HTML of the error page. I would now recommend that people install wget.
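For what it’s worth, curl’s -f/--fail flag makes it exit with an error on HTTP failures instead of saving the error page, which would at least turn a mistyped URL into a loud failure. A quick sketch (the URL below is made up):

```shell
# Without -f, a 404 error page is happily saved to disk as data.csv:
curl -O https://bitbucket.example.org/user/repo/raw/tip/data.csv

# With -f, curl reports HTTP errors and exits non-zero,
# so a typo in the URL fails visibly instead of producing a bogus file:
curl -f -O https://bitbucket.example.org/user/repo/raw/tip/data.csv || echo "download failed"
```

That’s still one more flag for learners to remember, which is itself an argument for wget’s fail-loudly default.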

Expecting people to accurately type in a 60 character URL just does not work. Perhaps a shortener would help. And more bundling so we have fewer things to download.

Installing software is still painful, even when everyone is told that they must install it before attending, and even when I provided a script to test what was installed. To be fair to attendees, they had all acted in good faith and made honest efforts; we computer types are the ones to blame. The “highlight” for me had to be setting the system time to 2012-01-01 so that we could install Xcode on a Mac. Things like this are always easier to solve in person, sat next to someone (it’s a mystery to me why, but they are). We had tried to fix this one on the mailing list a couple of days before the boot camp, with no success, but sat next to the computer in question it was done with a Google search in a few minutes.

Up-arrow and tab completion are really useful. And it is really hard to tell when an instructor is using them, because text just magically appears on the command line. I don’t know how hard this would be, but it might be a good idea to try saying out loud every time you press up-arrow or Tab. Also, Control-A.

Quitting things. It is surprisingly easy to get stuck in something and not know how to escape. In shell you can accidentally start a command that takes stdin (such as “cat”). You can accidentally open an editor (such as with “hg commit”) and not know how to quit; there seems to be an equal chance of it being vi or nano (what, no ed?). You can be reading a paged file (“man grep” or “git diff”) and not know how to go back.

The thing about quitting is that there are lots of ways of doing it, and it depends on what you’ve done, which you may not know. Ctrl-C is a good choice, but it doesn’t work from the Python prompt, or a pager (it does something else useful in the case of Python, but it doesn’t quit). Ctrl-D is useful, but doesn’t work in an editor (and you may accidentally quit out of the bash shell that you’re in). At least nano documents that exit is ^X. Which is useful if: a) you read it; and, b) you know that ^X means Ctrl-X. vi does not reveal how to quit itself.

It’s probably useful to have a little “how to quit stuff” card. At least learners can try the different things and see if one works (I had to rescue people out of nano, cat, man, awk, vi, and a shell string). Someone I spoke to had developed the coping strategy of closing the entire Terminal window if they got stuck in vi (good for their inventive skills, but a bit horrific).
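Such a card might look something like this (bindings as commonly configured; individual setups vary):

```
Stuck in...                       Try...
a command reading stdin (cat)     Ctrl-D (end of input), or Ctrl-C
a pager (man, less, git diff)     q
nano                              Ctrl-X  (shown as ^X at the bottom)
vi / vim                          Esc, then type :q! and press Enter
the Python prompt                 exit() or Ctrl-D
IPython                           exit or Ctrl-D
an unclosed shell quote           close the quote, or Ctrl-C
a runaway command                 Ctrl-C
```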

On a slightly related note, to a first approximation it’s basically impossible for people to tell the difference between the Python, IPython, and shell prompts. Consequently they would type shell commands into Python, or vice versa, and be mystified as to why it wasn’t working. It’s probably useful to emphasise the prompt, and the difference between the different ones.
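A side-by-side reminder of the prompts might help (exact appearance varies with configuration):

```
$           the shell (bash) prompt
>>>         the standard Python prompt
In [1]:     the IPython prompt
```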

People seemed happy using nano even if they hadn’t used it before (though it is worth noting yet again that people still can’t quit nano, even though it says how to at the bottom of the screen).

It was really hard to see the difference between purple and black (on screen).

Instructors should make their font really big. I can’t over-emphasise this. About 20 lines for the entire screen seems about right.

German keyboards have a Steuerung key (for Ctrl), and they might not have a backquote key. We should probably teach $(stuff) instead of `stuff`.
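The two forms of command substitution are equivalent, but $(...) needs no backquote key and nests without escaping; a quick sketch:

```shell
# Backticks and $(...) both substitute a command's output into the line,
# but $(...) nests cleanly and can be typed on any keyboard layout:
nested=$(echo "outer $(echo inner)")
echo "$nested"   # prints: outer inner
```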

When programming in Python, left to their own devices, many people will call a variable that holds a list list, one that holds a tuple tuple, and one that holds a dict dict. This is very natural, but a bad idea in Python, which lets you do it anyway. (This was mentioned in the materials, but by then many people had done it already; fortunately, bad things did not happen, because we didn’t teach people that the list function can be used to make lists, and the dict function can be used to make dicts.) Also, functions to sum lists will inevitably be called sum.
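A minimal sketch of the trap, in plain Python:

```python
# Python happily lets you shadow built-in names; the breakage only
# shows up later, when you try to use the built-in again.
list = [1, 2, 3]          # 'list' now names our data, not the type

try:
    chars = list("abc")   # fails: our [1, 2, 3] is not callable
except TypeError as err:
    print("whoops:", err)

del list                  # remove the shadowing name...
assert list("abc") == ["a", "b", "c"]  # ...and the built-in is back
```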

Although matplotlib was advertised as optional, in practice, as soon as it was being demonstrated, everyone wanted to use it, so they began to install it, with varying degrees of success. And of course there was the whole business of people not being able to follow along while they were installing something. Recommendation: do not have optional things.

Posted in News | 3 Comments

A Vision of Web Science

In a break from my Picking Winners series, here’s a long-brewing post about the future of web science.

The culture and practice of science is undergoing a revolution, driven by technological change. While most scientists are excited by the shifts and the opportunities they present, some are uneasy about the pace of change, and unclear about the destination: where is science going, and how will it help their own research? In this post, I will lay out my vision of twenty-first century science: the shape of future scientific practice, and in particular the future of scientific computation and data-processing.

First, a few motivational remarks:

  • These changes are good for scientists. They are eliminating many of the irritating chores that consume too much of a working scientist’s time. They are shortening paths of communication—with other scientists, with institutions, with funding bodies, with the public—and providing new opportunities for gathering data, for developing hypotheses and models, for building on and integrating with other work, for demonstrating an individual’s contributions. Science is a tough profession: with demanding qualifications, poor job security, long hours of challenging intellectual work, and often relatively low pay. Anything we can do to improve the life of scientists can only be good for the profession.
  • These changes are good for science. They accelerate the speed and reliability of the core project of science: discovering facts and constructing true knowledge about the physical world. Information technology—primarily the internet but also many other aspects of cheaper and more reliable communication, processing, and storage—is allowing us to develop understanding, and correct errors, much faster, more cheaply, and more reliably than before. It allows us to collaborate, and to compete, more effectively and fairly than ever. As new systems emerge, both collaboration and competition will improve, and science will benefit.
  • These changes are good for humanity. Science has been an overwhelming force for good for the last three centuries, and that will continue. To tackle the challenges facing the world in the twenty-first century—such as global warming, water management, power production, energy efficiency, resource depletion, population growth, and demographic shifts—we will need new technologies, and creating those technologies will demand new and better science. The open science revolution will provide it.

That summarizes my own motivation for involvement in the open science revolution. I am not a scientist myself—I’m a software consultant—but I see this as the most positive contribution I can make to the common human endeavour.

Now, the vision. Nothing I describe here is technically very challenging: most of the pieces already exist, and “all that remains” is to combine them into an integrated whole. There’s an old software development joke that any project is divided into the first 90% and the remaining 90%, but the systems I describe below are surely no more than five years away. I’ll be pessimistic and stretch this out by a few years. So this vision concerns a birdsong research project in the year 2020, run by a young scientist named Binita*.

The Vision: Binita’s Birdsong

Binita has an interest in the shifting geographic and seasonal patterns of birdsong, which she believes may be connected to climate change. She has some old data from a nineteenth-century global network of ornithologists, who exchanged and collected form letters recording bird sightings: that data was digitized in 2018 by a crowd-sourcing project, similar to Old Weather+, called “Funny Old Bird”*. The Funny Old Bird data is on the Victorian Historian* data hub, and like most science data is entirely open. Binita needs twenty-first century data for comparison.

She gathers this data from citizen scientists, who have downloaded an app to their phone or tablet. These citizens are located all over the world, and all they have to do to take part in the project is to leave the app running on their phone. In fact, they may be recruited automatically because they are already running a citizen science meta-app, such as BirdBrains*, or NaturalSounds*—built in 2019 based on BOINC+.

The app listens for birdsong—naturally, most of the people who run it are wildlife enthusiasts, and spend a lot of time out-of-doors. When the audio pattern recognition in the app detects birdsong, it records it, tags it with GPS data, and prepares it for upload to a cloud server. Some of the citizen scientists are really keen, and set up multiple listening stations in the woods and fields near their home. A listening station is just an ancient cellphone—maybe an iPhone 5—so is ultra-cheap. A number of schools create mesh networks from old phones and laptops, monitoring whole areas near the school. Other data is obtained from ornithological societies, and from naturalists using the audio recording chips built into the GPS markers they use when “ringing” birds. There’s a rich variety of data sources for Binita to incorporate. A citizen scientist can tag a recording with his or her belief about the bird species featured, or with an image or video of the bird or birds in question.

When each recording arrives in the cloud, at Climate Data Hub*, it is logged against the citizen scientist (who can sign in to the data site and listen to recordings). This provenance information flows through the entire analysis, and adds to each citizen’s project score, which is also influenced by the accuracy of their species guesses. Citizens who record the same species of bird (or the same individual) are automatically connected via the project website. The dataset is automatically versioned, so that every update creates a new version and every analysis takes place on an instantaneous and reproducible snapshot, and has a record of the contributors which can be used in a citation. This functionality is all built out of toolkits originally created for the Galaxy Zoo+ or another Zooniverse+ project, and the data citability is descended from DataCite+.

Binita writes code to analyse her birdsong data, on ClimateCodeHub*. This is a notebook hub site, one of many on the web, and recognisable as a descendant of GitHub+, with heritage dating back to sites developed in the 1990s such as SourceForge+. When she logs into the site from her tablet, she can create a new project or copy an existing one (either her own or from another researcher). In a project, she can specify which languages she is coding in, and the URL of the source datasets (at Victorian Historian and Climate Data Hub). She can keep her code closed (for example, during early development, or until a publication embargo is cleared), or make it open right away. On open projects anyone can contribute; for closed projects she can invite specific other researchers. Contributors can view the code and offer comments, defect reports, or code contributions which she can use or ignore, and she can give various permissions and controls to other people too. In this case, it turns out that an enthusiastic bird-watcher in Venezuela is also a professional statistician, and one in New Zealand is an audio-processing expert, and they join Binita in working on the code.

The hub site provides a version control system derived from systems such as Git+ or Subversion+, so contributors can download code to their own machines, or they can code directly on the notebook hub through their browsers. The versioning system tracks everyone’s contributions, and links to them from each person’s home page on the hub site, so every participant can easily claim appropriate credit.

Whether Binita and her colleagues are coding in their browsers, or on their own machines, they are using CodeNotebook*, a descendant of IPython Notebook+, which allows them to interactively develop “notebook” pages, incorporating code, charts, text, equations, audio, and video. They can work on separate computations, or connect together to the same notebook server (on the notebook hub) and code interactively together. Since 2012, people have been using this technology to write websites, academic papers, and even textbooks. The CodeNotebook system is not language-specific: Binita can use her favourite language (which happens to be some species of Python+) and adds interfaces to some third-party Fortran+ libraries to do heavy numeric lifting, and a C+++ subsystem for audio processing.

Several times during the research project, starting with those Fortran and C++ modules, Binita realises that she can save effort by re-using other pieces of code. These might be her own modules from previous projects, or sections of other researchers’ projects. These modules are separate projects on that hub, or maybe some other one. She adds these versioned dependencies to the project configuration, and the code is automatically integrated by the version control system, and added to the citations section of her notebook pages. The other researchers whose code she is re-using are notified, and some get in touch to see whether they can contribute (and maybe get additional credit as contributors).

Binita uses a GitLit* plugin to find and track related research reading and to build the project bibliography—with citations of code, data, papers, and notebooks—which is all kept updated automatically. This is related to Mendeley+ and Zotero+, combined with content-mining software originally called AMI2+.

Binita can run her notebook directly on her own machine, or on the notebook hub servers in the cloud (she’s probably going to do the latter, because her datasets are fairly large and the various hub providers work together to make this efficient). This was pioneered by RunMyCode+ in 2011. Each time Binita runs her notebook, it can fetch snapshots from the dataset clouds, complete with versioning and citation metadata. The notebook is automatically annotated with all this configuration and version information, together with the versions of the operating system and citation links to all the systems, libraries and third-party code they are using. When she is content with the results, she can click a “Pre-print” button, and the resulting notebook is automatically given a Digital Object Identifier (DOI) and made available online. This instant citability was an amazing thing when FigShare+ first did it, back in 2011 when Binita was in middle school. The hub has a post-publication peer review system, so Binita’s colleagues and competitors around the world can see this pre-print, comment on it, and score it. It is automatically added to her profile page on the hub, where her resume system will pick it up, and funding bodies will find it there when they come to assess her research excellence productivity metric (or whatever it is called, in 2020).

Binita, however, wants a research paper in a journal, the old-fashioned sort, even printed on paper—something she can show her grandfather, who first took her bird-watching. So she clicks the “Submit” button, and a PDF document is automatically generated from the notebook, formatted according to the house style of her chosen journal, and sent to the journal’s editors. Open-source journal software based on Open Journal Systems+ manages the peer-review process, and after some minor revisions, it’s done.

When Binita revises her notebook, either during peer-review or otherwise, fresh DOIs are minted and annotations are automatically added to old versions, so other researchers can always find the most recent public version. Researchers who are “watching” a notebook are notified automatically. As Binita works, she herself sometimes receives an automated notification that code or data she is using has been updated in this way; she can view the reason for the update and choose whether to switch to the new version or continue to use the old version.

Any reader of Binita’s work can run her code themselves, either on the notebook hub or on their own machine, simply by clicking on buttons in the notebook. If it’s very computationally expensive, they might have to make a micro-payment, as they might be used to doing on Amazon’s EC2+. The notebook hub will automatically offer a virtual machine image, and a management script for a system such as StarCluster+. Alternatively, the notebook’s automated description—with exact version information, and links to installers, for all the components—allows interested readers to put together a running system for themselves.

All the code for running a notebook hub is open-source, of course, so anyone can host one, and in fact an ecosystem has developed. Rich institutions and departments host their own; others are run by publishers, funding bodies, professional societies, non-profit organisations, and commercial companies which fund them through advertising or rental fees. A Semantic Web protocol has arisen for ensuring that Binita’s logins on all of these various hub systems are connected, creating a single unified researcher identity, and bibliography, that moves with her through her career.

Last, but far from least: where did Binita and all her peers learn how to drive all this marvellous technology? Well, back around 2010 science institutions finally started to understand the huge importance of software skills in twenty-first century science. After years of effort, Software Carpentry+ was finally recognised as indispensable training for the next generation of scientists. By the time Binita clicks “submit” on her project, every university around the world has compulsory courses in basic software development skills, for every science undergraduate. The better high schools are running their own notebook hubs, and smart middle schoolers are creating notebooks about snail trails and raindrop patterns. Science is better, faster, more open, and more fun, than ever before.

* These names are invented for the purpose of this description; any resemblance to the name of any project, product, or person, living or dead, is accidental and not intended to imply endorsement or criticism, so please don’t sue me.

+ These, on the other hand, are real projects that exist right now. The future is here.


Picking Winners: Rule 2 – Don’t Believe the Hype

This is the second of a series of posts intended to help scientists take part in the open science revolution by pointing them towards effective tools, approaches, and technologies.

Having established in the previous post the importance of using open-source software for open science, I continue to the second in my set of five simple rules for scientists. This is one which most of us know and use instinctively in our daily lives, but somehow when it comes to computers and software tools we can lose sight of this basic idea:

2. Don’t Believe the Hype

Many open-source projects have active communities of users and developers. These people are using and developing the tool because it solves their problems. Unfortunately, this success may blind some of them to difficulties which you may face when you use the tool in your research. They may be so enthused about their project that they become evangelical about its use: it becomes the solution to everyone’s problems. In their minds, they have the World’s Greatest Hammer, so your project looks like a nail.

When you are first researching a tool for potential use, reading documentation and opening conversations with existing users, you will come across some of these project evangelists, and be exposed to some of their hype. They may well be very experienced in their own arena, but are likely to have little or no experience in your research domain, or the class of problem you are trying to solve. Particularly with younger project communities, whose tools have not been applied to a wide variety of problems, and which have grown out of a narrow problem domain, the ignorance can be breathtaking, and the hype can be extreme.

Of course, hype for proprietary software is often far more widespread and more extreme than in the free software world. Some commercial marketing departments may have no compunction about deliberately distorting the truth, or outright lying, to secure a sale, and (due to common stringent licensing conditions which can prohibit users from telling war stories), they often have a great deal of control over the “messaging” about their products. But since we already disposed of proprietary software with rule 1, it’s important that we should be honest with ourselves about some problems in open source communities.

Here’s a list of particularly common subjects of hyperbolic rhetoric. For each subject, I’ve suggested some “antidotes” to the Kool-Aid: questions you should ask yourself or the community, to help you judge the true suitability of the tool for your use.

  • Productivity: those jobs which used to take months, or which were especially difficult to get right, are now automatically handled in the background.
    • Are those jobs you are familiar with? Are they likely to feature in your project? Yes, it’s cool that the tool makes it really easy to drive a robotic instrument through the USB port. But is that something you need to do on your project?
    • Do you already have a good way of carrying out that job? Is it wise to discard that expertise, or would it be better to continue with your existing technique?
  • Performance: “Faster than FORTRAN” is a common cry. This is quite a hard target for algorithms which are very numeric-intensive, and which have been optimized for a particular platform by an experienced developer. It’s hard because compiling FORTRAN for high-performance computation is a problem which has been attacked for six decades by many of the finest minds in computer science, and parts of FORTRAN have been refined to allow the expression of algorithms in a performance-sensitive way (and computers have evolved alongside, to enable FORTRAN programs to run faster). Competing with FORTRAN is not such a big deal in general-purpose programming—algorithms without data parallelism—or for more modern languages in which parallelism, dependency, and aliasing can be fully expressed, and which have mature compilers developed with performance in mind.
    • “Faster than X” on its own is meaningless. “Faster than X, on this specific computer, solving this problem, with this code and this data” is meaningful, and should be assessed by comparing that problem with your own.
    • How computationally intensive is your problem? Will it take hours of computer time? Of super-computer time? How much of your time is it worth to save this amount of computer time?
    • How much of your computation is going to be performance-critical? It’s often only a very small numeric core, which—once located—can be farmed out to a separate small program written in FORTRAN or some other special-sauce language.
  • Installation: you will be told that installation “is a snap”. This is very common, because long-term users of any software have usually not had to install it from scratch for a long time. They may never have installed it at all (it may have come “out-of-the-box”), and if they did then they may well have forgotten any difficulty they had. Furthermore, many users have only ever installed it on a single “platform”: their own computer, running some particular version of that operating system with particular versions of any required packages. Software installation is often tricky, with complex dependencies and many compatibility headaches.
    • Read through the installation instructions for your operating system. Do they identify the dependencies, including particular versions? Do they have trouble-shooting advice?
    • Are there existing developers and users across a wide diversity of platforms? Is it easy to find users with your particular mix of operating system and other dependencies?
    • Are there active discussion forums, or a wiki, or a knowledge base? Browse them: do newcomers get real assistance, or a brush-off?
  • Compatibility: you will certainly be assured that the system has interfaces to XML, and JSON, and SOAP/RPC, and NetCDF, and SVG, and SQLite, and on and on. But if that functionality is critical to your project, research it further. The interface may well have been developed to support one or two previous projects, and only provide whatever functionality they required. Or it may be antique, not up-to-date with recent changes.
    • Look through the interface documentation. Is it complete? Is it up-to-date?
    • Ask questions specific to your use. Instead of asking “does it work with Matlab files?”, ask “can I read MAT-File Level 5 format files, with sparse arrays and byte-swapping?”
    • Find other users with similar compatibility requirements and ask to see their code.
  • Your problem doesn’t matter: because the tool is so awesome that this other technique is better. The language doesn’t need namespaces—you can always use a prefix. You don’t need a working C interface—you can always reimplement that library in this other way. You don’t really need PDF charts: SVG is so much more 21st century.
    • You are kidding, right?
    • (yes, it’s true about SVG, but many publishers and other tools still want PDF).
  • Future development: It’s not faster than FORTRAN yet, but it will be by Christmas because I just read this cool paper about how to do it. Version 3 is going to have support for PDF defrobnification. The iPhone app will be out Real Soon Now.
    • As with all software, you should assume that any feature not already functional in a released version is very unlikely to be working by the suggested date, and may well never exist.
    • So: keep one eye on project announcements, and bear the tool in mind for future use, but if your project needs a feature today, then either choose a tool which has that feature now, or plan to develop it yourself.
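The performance advice above—find the small numeric core before worrying about which language is fastest—is easy to act on in practice. Here is a minimal sketch using only Python’s standard-library profiler; the function names are invented stand-ins for whatever your analysis actually does:

```python
# Profile an analysis run to find out how much time really sits in the
# numeric core, before deciding whether any rewrite is worth the effort.
# Standard library only; load_data/numeric_core are hypothetical stand-ins.
import cProfile
import pstats
import io
import random

def load_data(n=2000):
    # Stand-in for your data-ingest step.
    return [random.random() for _ in range(n)]

def numeric_core(data):
    # Stand-in for the small numeric kernel that might be worth farming
    # out to FORTRAN or a compiled library.
    total = 0.0
    for x in data:
        for y in data[:50]:
            total += x * y
    return total

def analysis():
    return numeric_core(load_data())

profiler = cProfile.Profile()
profiler.enable()
analysis()
profiler.disable()

# Report the five most expensive calls, by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

If the report shows that almost all the run time is in one small function, that function is the only place where “faster than FORTRAN” claims matter to you at all.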

Hype is at its worst at particular phases in a project’s life-cycle: at the beginning (when the future is bright, and everything is possible) and near the end (when the project is dying, and this is obvious to all but the True Faithful, who are sustained by unshakeable belief). Those are times to avoid a project anyway, and will be addressed in future posts in this series.


Picking Winners: Rule 1 – Is it Really Free?

This is the first of a series of posts intended to help scientists take part in the open science revolution by pointing them towards effective tools, approaches, and technologies.

When I make presentations to groups of scientists about the Climate Code Foundation and the Science Code Manifesto, I am often asked to recommend particular systems, or to advise on the advantages and disadvantages of some particular combination of tools. Many scientists are keen to adopt free software systems, open-source tools, and modern development methods, to become part of the growing open science revolution, and are looking for guidance.

I’m happy to discuss such questions individually, but there are far too many possible systems, configurations, and sets of requirements for anyone to give a single comprehensive answer. One tool might be excellent for most uses but fall short in some particular aspect which is critical for your project. Another might be generally weak but have a tremendously strong feature which addresses your needs. Over the next few days I will be giving some specific answers to common questions—identifying the strengths and weaknesses of my own favourite systems—but first I will try to help any scientist to pick winners for themselves, by giving some useful rules of thumb.

So I have formulated five general rules for making good choices in the open-source development world, and this is the first in a series of posts laying out those rules.

1. Is it really free? Is it really open?

The first question to ask about any particular tool is: what restrictions are there on using, modifying, or sharing it? Can my audience use it too?

The open science revolution is built on free software, and I strongly encourage all researchers to use free software whenever possible. However, the term “free software” has a very specific technical meaning, which is often misunderstood by those outside the software world. It does not mean “zero-cost software”, it means “software that does not restrict its users from studying, modifying, and sharing it”.

Because of this confusion, I usually favour the term “open-source software”, which isn’t so easily misunderstood and also has two other advantages. First, it has the ring of jargon, so listeners immediately grasp that it has a specific technical meaning. Secondly, it chimes with other pillars of twenty-first century science: “open data” and “open access”, with which researchers are often already familiar. Open-ness—making one’s results available to others to be criticised, praised, and built upon—is a core scientific value, and this term emphasizes that the same principle is at work for software.

The formal definitions of the two terms overlap almost totally—certainly all the software I recommend is both free and open-source—and many people use them synonymously: I mostly use “free software” for audiences of software professionals, and “open-source software” for others. It’s very regrettable that some people, including software community leaders, see conflict between the two. In my experience most people actually writing free software aren’t very interested in the distinction.

In any case, the important thing to remember is that software which is zero-cost but not truly free is a very risky choice which will not provide many of the great advantages of open-source software. Scientists and researchers working in large institutions in rich countries often have a very wide choice of zero-cost software tools:

  • Some will be available “for free” for some limited trial period. For example, many new computers come pre-installed with suites of software which become unusable after 30 days, unless a fee is paid. The same is often true of software provided at a departmental or group level, and of many web services (for example, search tools for research literature). Remember, the first hit of any drug is always free.
  • Some will have been bought for the institution, the department, or the team. A license may need renewing—often annually—at a price to be determined by the vendor. Next year, or next week, the license may come up for renewal and be cancelled due to budget pressure.
  • Some will not have been bought, but will have been copied illegally, and any licensing mechanism will have been subverted. Such software may be suddenly disabled (and the people responsible punished).
  • Some may have been bought by a single researcher or small group, and made available to a few people. The fact that it’s on your computer doesn’t guarantee that you have the right to use it.
  • Finally, some will be truly free, open-source, software.

Always ask yourself this: can my audience use this tool? Open science is all about sharing your research, including your code, with others, and if they can’t use the tool then your code is much less useful to them—they won’t read it and they won’t improve it—and to you: they won’t cite it.

Your audience might be your research colleagues, or those elsewhere in your department, or at other institutions around the world (including those with less generous budgets), or independent researchers, or the public. One key audience member, often the most important, is your own future self: you can use the tool right now, but your institution might stop licensing it, or you might move to work somewhere without a license, or the software vendor might go broke (so that suddenly nobody has a license). Do you want to be able to use your own research in five years’ time?

Open science uses open-source software. Make sure you do too.



This guest post is written by Jeremy Wang, who worked all summer on a web visualisation system for GISTEMP results, thanks to the excellent Google Summer of Code. This is his second post; here is the first.

It is the end of the summer and I am wrapping up my Google Summer of Code project. My experience with the Google Summer of Code, the Climate Code Foundation, and my mentors here (Nick Barnes, Nick Levine, and David Jones) has been very rewarding. My final project is largely what I proposed at the beginning of the summer. CCF MapView is a map-based visualization of the GISTEMP climate analysis via ccc-gistemp. The tool shows a Google Maps-like interface over which various climate data sets are overlaid. The map supports zooming and panning. Data sets, including topography, night-light radiance, cities, and weather stations, can be toggled on and off. The gridded temperature values computed by ccc-gistemp can be overlaid on the map by selecting a source—ocean, land, or mixed—and the displayed year can be changed using arrows or a slider.

Weather stations contributing to the GISTEMP analysis are displayed as square dots, with orange indicating stations in urban locations, yellow indicating suburban, and green indicating rural. When the mouse hovers over a station, the station name is shown. Upon clicking a station, detailed information about the station pops up, including geographic information about the station location and two types of chart showing the historical temperature record. Charts show the final temperature difference from the baseline period (1950-1980), along with adjustments made based on partial/missing data and the urbanization effect. The basic unit in the GISTEMP analysis is a grid of 8000 sub-boxes designed to cover equal areas across the globe (although they appear uneven on a map). Each of these grid cells is shown in a different color indicating the temperature delta for the selected year: red indicates warmer than baseline, blue cooler. When the mouse hovers over a cell, the coordinates for that cell are displayed. Upon clicking a cell, detailed information about the cell pops up, including the list of weather stations contributing to that sub-box average, with their contribution weights. Also shown is a chart of the temperature record (again, relative to the baseline) for the chosen region.

I hope for this tool to help better explain and illustrate the GISTEMP climate analysis procedure and results, especially for non-scientists and those interested in climate change. The climate data is displayed on top of a modular map framework so that it should be relatively easy to extend the same type of visualization to other climate data sets or other types of GIS data. The source can be downloaded at


Open BEST Project status

This guest post is written by György Kovács, who worked all summer on a reimplementation of some climate science software using only free software tools, thanks to the excellent Google Summer of Code. This is his second post; here is the first.

In the Google Summer of Code 2012 program my task was to reimplement the Berkeley Earth Surface Temperature (BEST) Matlab software in C. At the beginning of the work I was very optimistic, but I now see that I had greatly underestimated the amount of work required to complete the entire project.

The BEST software is professional Matlab code: all the fine features and structures of Matlab are routinely used, from the dynamic extension of structs with new fields to the use of cell arrays. Translating these into C is definitely not the mechanical job I had expected. Another setback is the lack of a Matlab-to-C compiler in the form we had expected when we scheduled the work. Older versions of Matlab could compile Matlab code to C source, but the current releases create only a header and an encrypted library, which definitely does not fit the goals of the Climate Code Foundation.

Anyway, after two months of coding I have managed to implement the main run path in C. The BEST software has plenty of parameters and three predefined parameter sets. The ‘quick’ parameter set runs a single iteration of the kriging process and generates simple but illustrative results. This run path is the backbone of every parameterization of the software, so the most important part of the development is complete and working.
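For readers unfamiliar with kriging: it is a geostatistical interpolation method which estimates a field at an unobserved point as a weighted sum of nearby samples, with the weights chosen from a covariance model. This toy sketch (plain NumPy, with an exponential covariance model of my own choosing, and taking nothing from the BEST code) shows the idea in one dimension:

```python
# Toy ordinary kriging in 1D: estimate a value at x0 from samples (xs, zs).
# The covariance model and all data here are invented for illustration.
import numpy as np

def covariance(h, rng=3.0):
    # Exponential covariance model; no nugget term, so the interpolator
    # honours the data exactly at the sample points.
    return np.exp(-np.asarray(h) / rng)

def ordinary_krige(xs, zs, x0):
    """Ordinary kriging estimate at x0, with a Lagrange multiplier
    enforcing that the weights sum to one."""
    n = len(xs)
    K = np.ones((n + 1, n + 1))
    K[:n, :n] = covariance(np.abs(xs[:, None] - xs[None, :]))
    K[n, n] = 0.0                       # Lagrange-multiplier corner
    k = np.ones(n + 1)
    k[:n] = covariance(np.abs(xs - x0))
    w = np.linalg.solve(K, k)           # kriging weights (+ multiplier)
    return w[:n] @ zs

xs = np.array([0.0, 1.0, 2.5, 4.0])     # sample locations
zs = np.array([1.0, 2.0, 0.5, 1.5])     # sample values
print(ordinary_krige(xs, zs, 1.0))      # reproduces the sample at x=1.0
print(ordinary_krige(xs, zs, 1.7))      # interpolates between samples
```

BEST, of course, does this on a sphere, over tens of thousands of stations, with fitted covariance models and iterative re-weighting—which is exactly why one iteration of the real thing is already a substantial piece of C.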

Although a large amount of code has been written, the development is far from finished. Several features need to be added to get more accurate results. The main lesson I have drawn is that C is not the easiest target for reimplementing interpreted Matlab code in a compiled imperative language. Although C is simple and extremely fast, it is hard to represent the data structures and operations that are routinely used in Matlab. Either a thorough refactoring is required before coding in C, or higher-level programming tools (classes, overloading, templates) should be used to express the handy but complex features of Matlab. Perhaps a reimplementation of the reimplementation could make the code simpler and easier for the community to use.
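To illustrate the point about higher-level tools: Matlab’s dynamically-extensible structs, which are painful to express in C’s fixed struct declarations, fall out almost for free in a language with dictionaries. A sketch in Python—the class name and the example fields are my own invention, not anything from the BEST code:

```python
# A Matlab-style struct: fields (and nested structs) appear when first
# assigned, as in  s.results.year = 2012  in Matlab.  Illustration only.
class DynamicStruct:
    """Attributes are created on demand; missing ones become nested structs."""
    def __getattr__(self, name):
        # Called only when the attribute is missing: grow a nested struct,
        # so chained assignments like s.a.b = 1 work without declarations.
        value = DynamicStruct()
        object.__setattr__(self, name, value)
        return value

station = DynamicStruct()
station.metadata.name = "Debrecen"        # nested fields spring into being
station.results.mean_temperature = 10.7
print(station.metadata.name)
```

In C, by contrast, each of these fields must be declared in a header before any code can mention it—one concrete reason the translation is anything but mechanical. (Note that this convenience cuts both ways: a typo silently creates a new empty struct instead of raising an error.)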

Special thanks to Nick Barnes and Nick Levine for the great mentoring. GSoC is great, and I really hope I have created something valuable this summer for the Climate Code Foundation, for science, and for mankind. The code is available on GitHub, and some sample charts can be seen in my blog posts.


Welcome György Kovács

This guest post is written by György Kovács, one of our Google Summer of Code students for 2012.

I graduated in Computer Science and Applied Mathematics at the University of Debrecen in Debrecen, Hungary. I am now a BSc student in Physics and a PhD student in Computer Science in the Department of Computer Graphics and Image Processing at the same university. My main research fields are statistical medical image analysis and pattern recognition. Besides these, I have a keen interest in interdisciplinary topics involving computer science, mathematics, and physics.

Nowadays climate change is one of the most intensively researched topics in science: several research groups are analyzing and trying to understand the phenomenon with the aid of computers, well-developed mathematical models, and vast quantities of data. One aim of the Climate Code Foundation is to make already-developed and published approaches and algorithms available to the open-source community, and my contribution to these efforts will be the reimplementation of the Berkeley Earth Surface Temperature (BEST) software using open-source tools.

Thanks to the excellent work of the BEST project’s researchers, a larger temperature dataset has been collected than ever before, and new homogenization and averaging models have been developed. The software was written in Matlab, using cluster-based parallelization to process data from more than 40,000 weather stations. The scientific papers, software, and data are freely available on the project’s website.

My aim is to create a clear reimplementation of the BEST software in C. First, a serial version providing the same functionality as the BEST software will be built. Then a parallelization framework will be developed using Open MPI to provide faster processing. Hopefully, by the end of the summer, one of the state-of-the-art climate analysis packages will be available in C, to the benefit of both the open-source and open-science communities. During the summer my progress can be followed on my project blog.


Welcome Jeremy Wang

This guest post is written by Jeremy Wang, one of our Google Summer of Code students for 2012.

I am a PhD student in the Computer Science Department at the University of North Carolina at Chapel Hill. My primary research focus is in bioinformatics and computational genetics. I have a particular passion for web-based development, supporting interaction with and visualization of large data sets. I have put this to use in my bioinformatics work, developing web-based tools for analysis and visualization of genomic data.

To make climate science more accessible to those who are not scientists or programmers, we need some tools which are very easy to get and very easy to use. Web-based technologies are ideal for exactly this type of application. I aim to create a platform which exposes some of the basics of climate research in a way that is accessible to non-scientists: an online map.

Most internet users have used Google Maps or something equivalent. I propose to use a similar map-based interface to show, instead of roads and directions, climate information superimposed over the world. Spatial phenomena and relationships among climate data can be demonstrated well on a map, and temporal changes can be illustrated through animations or selectable snapshots. By using an interface most people already understand, we can lower the barrier to getting people interested in and looking at climate data.

The major goal of the project is to create a very intuitive user interface to illustrate global climate in the modern era. I plan to create a navigable map wherein users can pan in two dimensions and zoom in, visualizing climate at global and local scale. Climate data will be derived from the ccc-gistemp implementation of the NASA GISTEMP climate analysis, including global gridded temperature measurements, factors affecting normalization, and individual weather station measurements. I hope for this platform to be flexible enough to allow visualization of arbitrary geographic information such that it can be easily extended and remain useful for a variety of analyses in the future.

Follow my progress on the project blog.
