In a break from my Picking Winners series, here’s a long-brewing post about the future of web science.
The culture and practice of science is undergoing a revolution, driven by technological change. While most scientists are excited by the shifts and the opportunities they present, some are uneasy about the pace of change, and unclear about the destination: where is science going, and how will it help their own research? In this post, I will lay out my vision of twenty-first century science: the shape of future scientific practice, and in particular the future of scientific computation and data-processing.
First, a few motivational remarks:
- These changes are good for scientists. They are eliminating many of the irritating chores that consume too much of a working scientist’s time. They are shortening paths of communication—with other scientists, with institutions, with funding bodies, with the public—and providing new opportunities for gathering data, for developing hypotheses and models, for building on and integrating with other work, for demonstrating an individual’s contributions. Science is a tough profession: with demanding qualifications, poor job security, long hours of challenging intellectual work, and often relatively low pay. Anything we can do to improve the life of scientists can only be good for the profession.
- These changes are good for science. They accelerate the speed and reliability of the core project of science: discovering facts and constructing true knowledge about the physical world. Information technology—primarily the internet but also many other aspects of cheaper and more reliable communication, processing, and storage—is allowing us to develop understanding, and correct errors, much faster, more cheaply, and more reliably than before. It allows us to collaborate, and to compete, more effectively and fairly than ever. As new systems emerge, both collaboration and competition will improve, and science will benefit.
- These changes are good for humanity. Science has been an overwhelming force for good for the last three centuries, and that will continue. To tackle the challenges facing the world in the twenty-first century—such as global warming, water management, power production, energy efficiency, resource depletion, population growth, and demographic shifts—we will need new technologies, and creating those technologies will demand new and better science. The open science revolution will provide it.
That summarizes my own motivation for involvement in the open science revolution. I am not a scientist myself—I’m a software consultant—but I see this as the most positive contribution I can make to the common human endeavour.
Now, the vision. Nothing I describe here is technically very challenging: most of the pieces already exist, and “all that remains” is to combine them into an integrated whole. There’s an old software development joke that any project is divided into the first 90% and the remaining 90%, but the systems I describe below are surely no more than five years away. I’ll be pessimistic and stretch this out by a few years. So this vision concerns a birdsong research project in the year 2020, run by a young scientist named Binita*.
The Vision: Binita’s Birdsong
Binita has an interest in the shifting geographic and seasonal patterns of birdsong, which she believes may be connected to climate change. She has some old data from a nineteenth-century global network of ornithologists, who exchanged and collected form letters recording bird sightings: that data was digitized in 2018 by a crowd-sourcing project, similar to Old Weather+, called “Funny Old Bird”*. The Funny Old Bird data is on the Victorian Historian* data hub, and like most science data is entirely open. Binita needs twenty-first century data for comparison.
She gathers this data from citizen scientists, who have downloaded an app to their phone or tablet. These citizens are located all over the world, and all they have to do to take part in the project is to leave the app running on their phone. In fact, they may be recruited automatically because they are already running a citizen science meta-app, such as BirdBrains*, or NaturalSounds*—built in 2019 based on BOINC+.
The app listens for birdsong—naturally, most of the people who run it are wildlife enthusiasts, and spend a lot of time out-of-doors. When the audio pattern recognition in the app detects birdsong, it records it, tags it with GPS data, and prepares it for upload to a cloud server. Some of the citizen scientists are really keen, and set up multiple listening stations in the woods and fields near their home. A listening station is just an ancient cellphone—maybe an iPhone 5—so is ultra-cheap. A number of schools create mesh networks from old phones and laptops, monitoring whole areas near the school. Other data is obtained from ornithological societies, and from naturalists using the audio recording chips built into the GPS markers they use when “ringing” birds. There’s a rich variety of data sources for Binita to incorporate. A citizen scientist can tag a recording with his or her belief about the bird species featured, or with an image or video of the bird or birds in question.
When each recording arrives in the cloud, at Climate Data Hub*, it is logged against the citizen scientist (who can sign in to the data site and listen to recordings). This provenance information flows through the entire analysis, and adds to each citizen’s project score, which is also influenced by the accuracy of their species guesses. Citizens who record the same species of bird (or the same individual) are automatically connected via the project website. The dataset is automatically versioned, so that every update creates a new version and every analysis takes place on an instantaneous and reproducible snapshot, and has a record of the contributors which can be used in a citation. This functionality is all built out of toolkits originally created for the Galaxy Zoo+ or another Zooniverse+ project, and the data citability is descended from DataCite+.
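To make the versioning idea concrete, here is a minimal sketch of an append-only dataset in which every update creates a new version, any version can be read back as a fixed snapshot, and the contributors to that snapshot can be listed for a citation. The class names, field names, and sample contributors are all invented for illustration; a real data hub would surely store and index its data very differently.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Recording:
    """One uploaded birdsong recording (hypothetical fields)."""
    contributor: str    # citizen scientist who made the recording
    species_guess: str  # the contributor's belief about the species
    lat: float          # GPS latitude attached by the app
    lon: float          # GPS longitude attached by the app


@dataclass
class VersionedDataset:
    """Append-only dataset: version n is exactly the first n recordings."""
    _records: List[Recording] = field(default_factory=list)

    @property
    def version(self) -> int:
        return len(self._records)

    def add(self, rec: Recording) -> int:
        """Append a recording and return the new version number."""
        self._records.append(rec)
        return self.version

    def snapshot(self, version: int) -> Tuple[Recording, ...]:
        """Return the immutable contents of the dataset at a given version."""
        if not 0 <= version <= self.version:
            raise ValueError(f"unknown version {version}")
        return tuple(self._records[:version])

    def contributors(self, version: int) -> List[str]:
        """Sorted, de-duplicated contributor list for a snapshot's citation."""
        return sorted({r.contributor for r in self.snapshot(version)})


ds = VersionedDataset()
v1 = ds.add(Recording("ana", "blackbird", 10.5, -66.9))
v2 = ds.add(Recording("tom", "tui", -41.3, 174.8))
print(v1, v2, ds.contributors(v1), ds.contributors(v2))
```

Because records are only ever appended, an analysis that names a version number always sees the same data, however many recordings arrive afterwards.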
Binita writes code to analyse her birdsong data, on ClimateCodeHub*. This is a notebook hub site, one of many on the web, and recognisable as a descendant of GitHub+, with heritage dating back to sites developed in the 1990s such as SourceForge+. When she logs into the site from her tablet, she can create a new project or copy an existing one (either her own or from another researcher). In a project, she can specify which languages she is coding in, and the URL of the source datasets (at Victorian Historian and Climate Data Hub). She can keep her code closed (for example, during early development, or until a publication embargo is cleared), or make it open right away. On open projects anyone can contribute; for closed projects she can invite specific other researchers. Contributors can view the code and offer comments, defect reports, or code contributions which she can use or ignore, and she can give various permissions and controls to other people too. In this case, it turns out that an enthusiastic bird-watcher in Venezuela is also a professional statistician, and one in New Zealand is an audio-processing expert, and they join Binita in working on the code.
The hub site provides a version control system derived from systems such as Git+ or Subversion+, so contributors can download code to their own machines, or they can code directly on the notebook hub through their browsers. The versioning system tracks everyone’s contributions, and links to them from each person’s home page on the hub site, so every participant can easily claim appropriate credit.
Whether Binita and her colleagues are coding in their browsers, or on their own machines, they are using CodeNotebook*, a descendant of IPython Notebook+, which allows them to interactively develop “notebook” pages, incorporating code, charts, text, equations, audio, and video. They can work on separate computations, or connect together to the same notebook server (on the notebook hub) and code interactively together. Since 2012, people have been using this technology to write websites, academic papers, and even textbooks. The CodeNotebook system is not language-specific: Binita can use her favourite language (which happens to be some species of Python+) and add interfaces to some third-party Fortran+ libraries to do heavy numeric lifting, and a C+++ subsystem for audio processing.
Several times during the research project, starting with those Fortran and C++ modules, Binita realises that she can save effort by re-using other pieces of code. These might be her own modules from previous projects, or sections of other researchers’ projects. These modules are separate projects on that hub, or maybe some other one. She adds these versioned dependencies to the project configuration, and the code is automatically integrated by the version control system, and added to the citations section of her notebook pages. The other researchers whose code she is re-using are notified, and some get in touch to see whether they can contribute (and maybe get additional credit as contributors).
Binita uses a GitLit* plugin to find and track related research reading and to build the project bibliography—with citations of code, data, papers, and notebooks—which is all kept updated automatically. This is related to Mendeley+ and Zotero+, combined with content-mining software originally called AMI2+.
Binita can run her notebook directly on her own machine, or on the notebook hub servers in the cloud (she’s probably going to do the latter, because her datasets are fairly large and the various hub providers work together to make this efficient). This was pioneered by RunMyCode+ in 2011. Each time Binita runs her notebook, it can fetch snapshots from the dataset clouds, complete with versioning and citation metadata. The notebook is automatically annotated with all this configuration and version information, together with the versions of the operating system and citation links to all the systems, libraries and third-party code it is using. When she is content with the results, she can click a “Pre-print” button, and the resulting notebook is automatically given a Digital Object Identifier (DOI) and made available online. This instant citability was an amazing thing when FigShare+ first did it, back in 2011 when Binita was in middle school. The hub has a post-publication peer review system, so Binita’s colleagues and competitors around the world can see this pre-print, comment on it, and score it. It is automatically added to her profile page on the hub, where her resume system will pick it up, and funding bodies will find it there when they come to assess her research excellence productivity metric (or whatever it is called, in 2020).
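The automatic annotation is the least futuristic part of all this. As a rough sketch, using only the Python standard library, a notebook could record the operating system, the interpreter version, and the installed version of each package it depends on. The function name and the shape of the record are my own invention, and a real hub would also want to capture dataset versions and citation links.

```python
import json
import platform
from importlib import metadata


def environment_record(packages):
    """Describe the OS, the Python version, and the listed package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "packages": versions,
    }


# Example: the package names here are only illustrative.
print(json.dumps(environment_record(["numpy", "scipy"]), indent=2))
```

A record like this, stored alongside each run, is what would let a later reader rebuild the same environment.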
Binita, however, wants a research paper in a journal, the old-fashioned sort, even printed on paper: something she can show her grandfather, who first took her bird-watching. So she clicks the “Submit” button, and a PDF document is automatically generated from the notebook, formatted according to the house style of her chosen journal, and sent to the journal’s editors. Open-source journal software based on Open Journal Systems+ manages the peer-review process, and after some minor revisions, it’s done.
When Binita revises her notebook, either during peer-review or otherwise, fresh DOIs are minted and annotations are automatically added to old versions, so other researchers can always find the most recent public version. Researchers who are “watching” a notebook are notified automatically. As Binita works, she herself sometimes receives an automated notification that code or data she is using has been updated in this way; she can view the reason for the update and choose whether to switch to the new version or continue to use the old version.
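Here is one way the update notifications might work underneath, as a small sketch: the notebook pins a version of each dependency, the hub knows which versions have been published and why, and the difference between the two is the list of notifications. Switching is then an explicit choice, one dependency at a time. Every name below (the functions, the dependency identifiers, the reasons) is hypothetical.

```python
from typing import Dict, List, NamedTuple, Tuple


class Update(NamedTuple):
    name: str    # dependency identifier (a code module or a dataset)
    pinned: int  # version the notebook currently uses
    latest: int  # newest published version
    reason: str  # the author's stated reason for the newest version


def pending_updates(
    pinned: Dict[str, int],
    published: Dict[str, List[Tuple[int, str]]],
) -> List[Update]:
    """List dependencies whose newest published version is ahead of the pin."""
    updates = []
    for name, version in sorted(pinned.items()):
        releases = published.get(name, [])
        if not releases:
            continue
        latest, reason = max(releases, key=lambda r: r[0])
        if latest > version:
            updates.append(Update(name, version, latest, reason))
    return updates


def accept(pinned: Dict[str, int], update: Update) -> Dict[str, int]:
    """Return a new pin table with one dependency moved to its latest version."""
    new_pins = dict(pinned)
    new_pins[update.name] = update.latest
    return new_pins


# Invented example data.
pins = {"audio-filters": 3, "funny-old-bird": 7}
hub = {
    "audio-filters": [(3, "initial release"), (4, "fix clipping bug")],
    "funny-old-bird": [(7, "2018 digitization")],
}
for u in pending_updates(pins, hub):
    print(f"{u.name}: {u.pinned} -> {u.latest} ({u.reason})")
```

Since `accept` returns a new table rather than changing the old one, continuing with the old version is simply a matter of not calling it.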
Any reader of Binita’s work can run her code themselves, either on the notebook hub or on their own machine, simply by clicking on buttons in the notebook. If it’s very computationally expensive, they might have to make a micro-payment, as they might be used to doing on Amazon’s EC2+. The notebook hub will automatically offer a virtual machine image, and a management script for a system such as StarCluster+. Alternatively, the notebook’s automated description—with exact version information, and links to installers, for all the components—allows interested readers to put together a running system for themselves.
All the code for running a notebook hub is open-source, of course, so anyone can host one, and in fact an ecosystem has developed. Rich institutions and departments host their own; others are run by publishers, funding bodies, professional societies, non-profit organisations, and commercial companies that fund them by advertising or rental. A Semantic Web protocol has arisen for ensuring that Binita’s logins on all of these various hub systems are connected, creating a single unified researcher identity, and bibliography, that moves with her through her career.
Last, but far from least: where did Binita and all her peers learn how to drive all this marvellous technology? Well, back around 2010 science institutions finally started to understand the huge importance of software skills in twenty-first century science. After years of effort, Software Carpentry+ was finally recognised as indispensable training for the next generation of scientists. By the time Binita clicks “submit” on her project, every university around the world has compulsory courses in basic software development skills, for every science undergraduate. The better high schools are running their own notebook hubs, and smart middle schoolers are creating notebooks about snail trails and raindrop patterns. Science is better, faster, more open, and more fun, than ever before.
* These names are invented for the purpose of this description; any resemblance to the name of any project, product, or person, living or dead, is accidental and not intended to imply endorsement or criticism, so please don’t sue me.
+ These, on the other hand, are real projects that exist right now. The future is here.