Interspersed with my Picking Winners series, here’s another post about the vision thing, a broad outline of how we might get there from here.
A month ago I sketched out my vision of the future of science. Of course, I didn’t just dream that up by myself: large parts of it are stolen shamelessly from much smarter people like
- Michael Nielsen (whose amazing book, Reinventing Discovery, suddenly crystallised my thinking in 2011);
FX: screeching brakes
Hold it right there! You’ve never read “Reinventing Discovery”?! OK, go and buy it, right now, read it, and come back. I’ll wait.
Back? Let’s go on…
- Michael Nielsen …;
- Greg Wilson, whose Software Carpentry work is finally breaking through, bringing fundamental software development skills to the next generation of scientists;
- Peter Murray-Rust, who has been banging the drum for open access and for content-mining for many years: it’s good to see this finally getting regulatory traction;
- Victoria Stodden, who is putting her ground-breaking research on replication of computational science into action in the cloud at RunMyCode;
- Fernando Pérez, whose compelling IPython system is transforming the way people think about interactive scientific computation;
- and many, many others.
Although everyone in open science is coming at the subject from their own direction, with their own focus and interests, our thinking is definitely cohering into some clear ideas about the future of science software:
- Code written for published research will be:
  - open-source;
  - easy to download;
  - easy to fork;
  - and easy to track.
- It will be:
  - hosted and curated in the cloud;
  - at any of several independent hosting sites;
  - with version-tracking and defect-tracking web services;
  - and automatically-generated permanent version IDs for citation purposes;
  - all accessible via open web APIs.
- It will have:
  - automatic authorship tracking;
  - linking many different web-based services;
  - producing online profiles and reputations;
  - which will naturally be used in scientific career records.
- It will often depend on:
  - several third-party components;
  - automatically giving credit to their authors;
  - with clear records of dependency configuration;
  - and thus remain reproducible even when those components change.
- The code will be written:
  - in many different programming languages;
  - on remote machines or via browser-based editors;
  - usually by scientists who are trained in software development.
- It will be runnable:
  - by any interested person;
  - on servers in the cloud;
  - through an easy-to-use web interface;
  - on the original or modified data sets;
  - (possibly on payment of reasonable virtualization fees).
- The resulting data – files, charts, or other results – will be open:
  - freely available;
  - for any use;
  - and for redistribution;
  - subject at most to attribution and/or share-alike requirements.
- The data will also be:
  - retained permanently;
  - with a citable identifier;
  - and recorded provenance: versions of source code and data;
  - and automatically recorded credit to the person who ran the code.
- Scientific publications will be:
  - often written alongside the code which supports them;
  - often generated online, fully automatically, from the inter-woven source and text;
  - usually open;
  - subject to public discussion in open web forums;
  - widely indexed, inter-linked, and woven into an open web of scientific knowledge.
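To make the "recorded provenance" and "citable identifier" items a little more concrete, here is a minimal sketch of what a provenance record for a single run might contain. Everything in it is an assumption made for illustration: the function names, the field names, and the choice of a SHA-256 content hash as an identifier are not taken from any of the services mentioned in this post, and a content hash is not a substitute for a registered identifier such as a DOI.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(code: bytes, dataset: bytes, code_version: str, run_by: str) -> dict:
    """Build a provenance record for one run (all field names are hypothetical)."""
    record = {
        "code_version": code_version,        # e.g. a tag or commit ID
        "code_sha256": sha256_hex(code),     # fingerprint of the source that ran
        "data_sha256": sha256_hex(dataset),  # fingerprint of the input data
        "run_by": run_by,                    # credit to the person who ran it
    }
    # Derive the identifier from the record's canonical JSON form,
    # so the same inputs always give the same identifier.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    record["result_id"] = sha256_hex(canonical)[:16]
    return record
```

Running the same code on the same data under the same name gives the same `result_id`; changing the code, the data, or the person changes it. That is the property a citable result needs, however the identifier is eventually issued.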
I could go on (I haven’t even raised the subject of crowd-sourced data), but that’s the core of the vision. The question then is: how do we get there from here? As discussed in the earlier post, many components already exist (although some are still in their infancy):
- GitHub is the world’s current favourite online code hosting service, with version control, defect tracking, and related services;
- IPython Notebook is a great system for interactive web-based coding, including visualisation and text markup;
- RunMyCode is a landmark site for cloud-based curation and running of research computation;
- Zotero is a web-based literature search, research, and bibliographic tool;
- FigShare is a web hosting service for datasets, figures, and all kinds of research outputs;
- StarCluster is a fine system for managing computation on Amazon’s EC2 cloud service;
- Software Carpentry is a training organization that provides basic software development skills to researchers.
It’s worth noting that some server-side parts of this ecosystem (GitHub, FigShare, RunMyCode?, EC2) are not open source, which adds some risk to the vision. A truly open science future won’t depend on closed-source systems: institutions and organisations will be able to deploy their own servers, and services and data will be robust against corporate failure or change. Happily, these proprietary components (with the possible exception of RunMyCode) all have open APIs, which allows for the possibility of future open implementations.
It’s also worth noting in passing that several of these components (IPython Notebook, RunMyCode, Zotero, and Software Carpentry) are partly funded by the Alfred P. Sloan Foundation, ~~a shadowy puppet-master~~ a non-profit philanthropic institution which obviously shares this open science software vision. Several also have links with, or funding from, the Mozilla Foundation, which now has a grant from the Sloan Foundation to create a “Webmaking Science Lab”, apparently intending to drive this vision forward. [Full disclosure: I have applied for the post of lab director.]
So, a lot of the vision already exists, in disconnected parts. The work remaining to be done includes:
- Integration. RunMyCode with an IPython Notebook interface. EC2 configuration using a lickable web UI, generating StarCluster scripts. IPython Notebook servers versioning code at GitHub (or SourceForge, or Google Code). A GitHub button to generate a DOI. Automatic posting of result datasets to FigShare. And so on.
- Inter-operation. Simple libraries for popular languages to access the APIs of all these services. An example is rfigshare: FigShare for R. But where’s (say) PyFigShare? Or FortranGitHub?
- Independence. Web services shouldn’t care what programming languages and systems are used by researchers. They should make it easy to do popular things, but possible to do almost anything. They should be able to manage and interact with any virtual machine or cluster that a researcher can configure.
- Identity. How does a researcher log into a service? Single sign-on using Persona? How does she claim credit for a piece of work? How does she combine different sources of credit? Can we do something with Open Badges?
- Import (well, it begins with ‘I’; really I mean Configuration Management). Researchers need to be able to pull in specific versioned platforms or code modules from GitHub or elsewhere, and have those same versions work in perpetuity. This is a tangled web (a form of DLL hell) so we also need a system for recommending and managing known-working combinations. Like EPD or ActivePerl, but fully open.
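As a rough illustration of the Import item above, a "known-working combination" could be as simple as a table of pinned versions that a service checks an environment against. This is only a sketch: the component names and version numbers are invented, and `version_mismatches` is a hypothetical helper, not part of any existing tool.

```python
def version_mismatches(pinned: dict, installed: dict) -> dict:
    """Map each component name to (pinned, installed) where the two differ.

    A component that is missing from `installed` is reported with None.
    """
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pinned.items()
        if installed.get(name) != wanted
    }


# A hypothetical known-working combination; names and versions are invented.
KNOWN_WORKING = {"numlib": "1.6.2", "plotlib": "1.2.0"}

# An environment where one component has drifted from the pinned version.
mismatches = version_mismatches(
    KNOWN_WORKING, {"numlib": "1.6.2", "plotlib": "1.3.0"}
)
```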
This list is Incomplete. We also need to build networks and groups of related projects. A lot of the required development will take place anyway, and coordinating different open-source projects, like managing individual developers, can resemble cat-herding. But there is a role for joint development, for workshops, summits, and informal gatherings, for code sprints, hackjams, and summer projects. Even if we are not marching in step, we should all be heading in the same general direction, and avoid duplicated effort. We should build interest and enthusiasm in the open-source community, and work together with major cross-project organisations such as the Google Open Source Programs Office, O’Reilly Media, and (again) the Mozilla Foundation.
Finally we also need buy-in from researchers and institutions. Planck’s second law has it that science progresses one funeral at a time. The experience of Software Carpentry suggests that it can take years for even basic ideas such as training to gain traction. So as the infrastructure takes shape, we should also be talking to institutions, journals, societies, and conferences, to publicise it and encourage its take-up. This vision thing isn’t top-down: it’s driven by scientists’ own perceptions of their needs, what we think will make us more productive, increase the recognition of our work, and improve science. If we’re right, then more scientists will join us. If we’re wrong, we don’t deserve to succeed.

This guest post is written by 