Chasing the Dream

Interspersed with my Picking Winners series, here’s another post about the vision thing, a broad outline of how we might get there from here.

A month ago I sketched out my vision of the future of science. Of course, I didn’t just dream that up by myself: large parts of it are stolen shamelessly from much smarter people like

FX: screeching brakes

Hold it right there! You’ve never read “Reinventing Discovery”?! OK, go and buy it, right now, read it, and come back. I’ll wait.

Back? Let’s go on…

  • Michael Nielsen …;
  • Greg Wilson, whose Software Carpentry work is finally breaking through, bringing fundamental software development skills to the next generation of scientists;
  • Peter Murray-Rust, who has been banging the drum for open access and for content-mining for many years: it’s good to see this finally getting regulatory traction;
  • Victoria Stodden, who is putting her ground-breaking research on replication of computational science into action in the cloud at RunMyCode;
  • Fernando Pérez, whose compelling IPython system is transforming the way people think about interactive scientific computation;
  • and many, many others.

Although everyone in open science is coming at the subject from their own direction, with their own focus and interests, our thinking is definitely cohering into some clear ideas about the future of science software:

  • Code written for published research will be:
    • open-source;
    • easy to download;
    • easy to fork;
    • and easy to track.
  • It will be:
    • hosted and curated in the cloud;
    • at any of several independent hosting sites;
    • with version-tracking and defect-tracking web services;
    • and automatically-generated permanent version IDs for citation purposes;
    • all accessible via open web APIs.
  • It will have:
    • automatic authorship tracking;
    • linking many different web-based services;
    • producing online profiles and reputations;
    • which will naturally be used in scientific career records.
  • It will often depend on:
    • several third-party components;
    • automatically giving credit to their authors;
    • with clear records of dependency configuration;
    • and thus remain reproducible even when those components change.
  • The code will be written:
    • in many different programming languages;
    • on remote machines or via browser-based editors;
    • usually by scientists who are trained in software development.
  • It will be runnable:
    • by any interested person;
    • on servers in the cloud;
    • through an easy-to-use web interface;
    • on the original or modified data sets;
    • (possibly on payment of reasonable virtualization fees).
  • The resulting data – files, charts, or other results – will be open:
    • freely available;
    • for any use;
    • and for redistribution;
    • subject at most to attribution and/or share-alike requirements.
  • The data will also be:
    • retained permanently;
    • with a citable identifier;
    • and recorded provenance: versions of source code and data;
    • and automatically recorded credit to the person who ran the code.
  • Scientific publications will be:
    • often written alongside the code which supports them;
    • often generated online, fully automatically, from the inter-woven source and text;
    • usually open;
    • subject to public discussion in open web forums;
    • widely indexed, inter-linked, and woven into an open web of scientific knowledge.
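
The provenance bullet above (a citable identifier, the versions of source code and data, and credit to the person who ran the code) can be made concrete with a small record attached to every result. What follows is only a minimal sketch in Python: the field names, the use of SHA-256 digests as identifiers, and the `provenance_record` helper are all assumptions for illustration, not part of any existing service.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(result: bytes, input_data: bytes,
                      code_version: str, run_by: str) -> dict:
    """Build a provenance record for one run.

    All field names here are hypothetical; a real service would
    define its own schema and its own kind of citable identifier.
    """
    return {
        "result_sha256": sha256_of(result),     # identifies the output
        "input_sha256": sha256_of(input_data),  # identifies the data set used
        "code_version": code_version,           # e.g. a revision ID from version control
        "run_by": run_by,                       # credit to the person who ran the code
        "run_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    record = provenance_record(b"1.0,2.0\n", b"raw data\n",
                               "abc1234", "A. Researcher")
    print(json.dumps(record, indent=2, sort_keys=True))
```

Hashing the bytes rather than trusting a file name means that a modified data set automatically gets a different identifier, which is the property the "original or modified data sets" bullet relies on.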

I could go on (I haven’t even raised the subject of crowd-sourced data), but that’s the core of the vision. The question then is: how do we get there from here? As discussed in the earlier post, many components already exist (although some are still in their infancy):

  • GitHub is the world’s current favourite online code hosting service, with version control, defect tracking, and related services;
  • IPython Notebook is a great system for interactive web-based coding, including visualisation and text markup;
  • RunMyCode is a landmark site for cloud-based curation and running of research computation;
  • Zotero is a web-based literature search, research, and bibliographic tool;
  • FigShare is a web hosting service for datasets, figures, and all kinds of research outputs;
  • StarCluster is a fine system for managing computation on Amazon’s EC2 cloud service;
  • Software Carpentry is a training organization to provide basic software development skills to researchers.

It’s worth noting that some server-side parts of this ecosystem (GitHub, FigShare, RunMyCode?, EC2) are not open source, which adds some risk to the vision. A truly open science future won’t depend on closed-source systems: institutions and organisations will be able to deploy their own servers, and services and data will be robust against corporate failure or change. Happily these proprietary components (with the possible exception of RunMyCode) all have open APIs, which allows for the possibility of future open implementations.

It’s also worth noting in passing that several of these components (IPython Notebook, RunMyCode, Zotero, and Software Carpentry) are partly funded by the Alfred P. Sloan Foundation, a shadowy puppet-master, sorry, a non-profit philanthropic institution which obviously shares this open science software vision. Several also have links with, or funding from, the Mozilla Foundation, which now has a grant from the Sloan Foundation to create a “Webmaking Science Lab”, apparently intending to drive this vision forward. [Full disclosure: I have applied for the post of lab director.]

So, a lot of the vision already exists, in disconnected parts. The work remaining to be done includes:

  • Integration. RunMyCode with an IPython Notebook interface. EC2 configuration using a lickable web UI, generating StarCluster scripts. IPython Notebook servers versioning code at GitHub (or SourceForge, or Google Code). A GitHub button to generate a DOI. Automatic posting of result datasets to FigShare. And so on.
  • Inter-operation. Simple libraries for popular languages to access the APIs of all these services. An example is rfigshare: FigShare for R. But where’s (say) PyFigShare? Or FortranGitHub?
  • Independence. Web services shouldn’t care what programming languages and systems are used by researchers. They should make it easy to do popular things, but possible to do almost anything. They should be able to manage and interact with any virtual machine or cluster that a researcher can configure.
  • Identity. How does a researcher log into a service? Single sign-on using Persona? How does she claim credit for a piece of work? How does she combine different sources of credit? Can we do something with Open Badges?
  • Import (well, it begins with ‘I’; really I mean Configuration Management). Researchers need to be able to pull in specific versioned platforms or code modules from GitHub or elsewhere, and have those same versions work in perpetuity. This is a tangled web (a form of DLL hell) so we also need a system for recommending and managing known-working combinations. Like EPD or ActivePerl, but fully open.
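
The configuration-management point above comes down to writing down exactly which versions were in use, so that the same combination can be restored later. As one hedged illustration, for Python packages only, the standard library's `importlib.metadata` can report installed versions. The manifest layout and the `build_manifest` helper below are assumptions made for this sketch, not an existing format, and other languages would need their own equivalent.

```python
import json
import sys
from importlib import metadata


def snapshot_versions(package_names):
    """Map each named package to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap explicitly rather than silently skipping it.
            versions[name] = None
    return versions


def build_manifest(package_names):
    """Return a dependency manifest (hypothetical layout) for the current environment."""
    return {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "packages": snapshot_versions(package_names),
    }


if __name__ == "__main__":
    # The package list is just an example; use whatever the analysis imports.
    print(json.dumps(build_manifest(["numpy", "scipy"]), indent=2, sort_keys=True))
```

A manifest like this only records the combination; recommending known-working combinations, as described above, would still need a curated service on top.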

This list is Incomplete. We also need to build networks and groups of related projects. A lot of the required development will take place anyway, and coordinating different open-source projects, like managing individual developers, can resemble cat-herding. But there is a role for joint development, for workshops, summits, and informal gatherings, for code sprints, hackjams, and summer projects. Even if we are not marching in step, we should all be heading in the same general direction, and avoid duplicated effort. We should build interest and enthusiasm in the open-source community, and work together with major cross-project organisations such as the Google Open Source Programs Office, O’Reilly Media, and (again) the Mozilla Foundation.

Finally we also need buy-in from researchers and institutions. Planck’s second law has it that science progresses one funeral at a time. The experience of Software Carpentry suggests that it can take years for even basic ideas such as training to gain traction. So as the infrastructure takes shape, we should also be talking to institutions, journals, societies, and conferences, to publicise it and encourage its take-up. This vision thing isn’t top-down: it’s driven by scientists’ own perceptions of their needs, what we think will make us more productive, increase the recognition of our work, and improve science. If we’re right, then more scientists will join us. If we’re wrong, we don’t deserve to succeed.


5 Responses to Chasing the Dream

  1. Nick Barnes says:

    Among the many existing systems I didn’t mention in the main post: ImpactStory, Heather Piwowar’s open-source web tool for researchers to build pages showing all the diverse impacts their research has by combining a wide range of alt-metrics. Once again, it’s funded by the Sloan Foundation. And it’s all open-source, yay!

  2. Titus Brown says:

    Having watched Software Carpentry “take hold”, I personally think a lot of it has less to do with the lead time and more to do with the scientific “readiness”. I hesitate to call Greg a visionary just on 1st principles, but he saw the real problems of scientific computing (efficiency, correctness) long before most — yet even with articulate and constant discussion of the issues over a decade, it required a critical mass of *other* people recognizing the problems for Software Carpentry to take off.

  3. Nick Barnes says:

    Titus: right. I have a whole ‘nother post brewing, for the Picking Winners series, about SwC and how/why it took so long to get going. But this open science / open source revolution is gaining pace all the time, and it’s definitely becoming easier to gain traction.
    I was saying in a conversation just now that young scientists, and some of the elite in science institutions, are ready to embrace change, but that in between there is a mass who are so busy on the proposal/paper treadmill that they can’t afford to step back and consider whether there’s a better way. The key (I think) is to change the treadmill: change the rules for grants, for publication, for assessment, so that better ways of doing things are rewarded. That’s why the NSF policy change (and previous changes at RCUK, Wellcome, NIH) is such good news. Once scientists can gain professional credit for open software contributions, they will make the shift.

  4. Nick Barnes says:

    (Re “the elite in science institutions”: my experience is that senior people at places like the Royal Society, or Wellcome, can be quite open to radical change).

  5. That is a really nice vision, but I see a big drawback: It requires scientists to use a common toolbox.

    We do not have that. We have Fortran code from 20 years back lying around, which already takes a few days just to get running on a cluster which has all the needed dependencies – not to speak of a bare cluster (like the one I encountered when I started with my project). We have programs which need 1 week of schooling just to learn to understand the config files – and how to set up all the required files. We have systems which require 1 TiB of data just to run – which needs days to download. And programs which require the full scipy stack including Basemap.

    And different compilers, libraries, versiontracking systems, …

    And we have our own development tools, optimized over the years for our work so we cannot just go and use the one unified interface. We have custom scripts, custom information storage systems. And often about 5 other people interested in exactly the stuff we do.

    So any web-system for all scientists must be adaptable for all workflows. And that is really hard.

    Some of my plotting routines actually take a day just to visualize all the data I might want to investigate from a project run. And the results require Gigabytes of diskspace.

    Still we can do better than now.

    We can use Emacs org-mode to include snippets of code we use directly in the papers we write (that’s what I do).

    We can share more of our code. We can get rid of the “state your intent”-barriers and just give people code. As long as the institution allows it (*shudder*).

    And I think the best place to start that would not be a big web based system. The best place would be a small text file: How_to_get_the_dependencies.txt

    Or simply: Readme.txt

    And maybe a Makefile.

    We don’t need to be able to run the research with a single click. Often it takes months just to read the required theory to understand what we do. But we need to have clear information on how we can get something running in a day. Plus download times.

    And we need clear ways to reproduce published results. Ideally in a tarball with a Makefile. Any vcs can produce a tarball. And the published results are not intended for changing, so there is no need for complex versioning. But the tarball should contain a file which states which revision was used to create the tarball.

    So, the simplest specification of a really reproducible publication format would be:

    – Readme.txt: Read this to find instructions for getting dependencies – the data and tools needed to create the publication from the shipped files.
    – Makefile: Once you have all dependencies, this allows you to simply call `make` to recreate the results.
    – version.txt: This notes the revision of the project used to generate the publication – if no versiontracking system was used, it simply gives the date.
    – … any files used for creating the publication which are not named in the Readme as dependencies.
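
The layout above is simple enough to check mechanically before publishing. The following is only a sketch in Python, assuming the three file names exactly as listed in the specification; the `missing_files` helper is hypothetical and is not part of any existing tool.

```python
import tempfile
from pathlib import Path

# File names taken from the specification above.
REQUIRED_FILES = ("Readme.txt", "Makefile", "version.txt")


def missing_files(bundle_dir):
    """Return the required file names that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Build a throwaway bundle that lacks a Makefile, then check it.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "Readme.txt").write_text("How to get the dependencies\n")
        (Path(tmp) / "version.txt").write_text("2013-01-01\n")
        print(missing_files(tmp))  # prints ['Makefile']
```

The check says nothing about whether `make` actually succeeds; it only confirms that the bundle has the shape described in the list.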
