This is the second in a series of posts intended to help scientists take part in the open science revolution by pointing them towards effective tools, approaches, and technologies.
Having established in the previous post the importance of using open-source software for open science, I continue to the second in my set of five simple rules for scientists. This is one which most of us know and use instinctively in our daily lives, but somehow when it comes to computers and software tools we can lose sight of this basic idea:
2. Don’t Believe the Hype
Many open-source projects have active communities of users and developers. These people are using and developing the tool because it solves their problems. Unfortunately, this success may blind some of them to difficulties which you may face when you use the tool in your research. They may be so enthused about their project that they become evangelical about its use: it becomes the solution to everyone’s problems. In their minds, they have the World’s Greatest Hammer, so your project looks like a nail.
When you are first researching a tool for potential use, reading documentation and opening conversations with existing users, you will come across some of these project evangelists, and be exposed to some of their hype. They may well be very experienced in their own arena, but are likely to have little or no experience in your research domain, or the class of problem you are trying to solve. Particularly with younger project communities, whose tools have not been applied to a wide variety of problems, and which have grown out of a narrow problem domain, the ignorance can be breathtaking, and the hype can be extreme.
Of course, hype for proprietary software is often far more widespread and more extreme than in the free software world. Some commercial marketing departments may have no compunction about deliberately distorting the truth, or outright lying, to secure a sale. Because stringent licensing conditions commonly prohibit users from telling war stories, those departments often have a great deal of control over the “messaging” about their products. But since we already disposed of proprietary software with rule 1, it’s important that we should be honest with ourselves about some problems in open source communities.
Here’s a list of particularly common subjects of hyperbolic rhetoric. For each subject, I’ve suggested some “antidotes” to the Kool-Aid: questions you should ask yourself or the community, to help you judge the true suitability of the tool for your use.
- Productivity: those jobs which used to take months, or which were especially difficult to get right, are now automatically handled in the background.
  - Are those jobs you are familiar with? Are they likely to feature in your project? Yes, it’s cool that the tool makes it really easy to drive a robotic instrument through the USB port. But is that something you need to do on your project?
  - Do you already have a good way of carrying out that job? Is it wise to discard that expertise, or would it be better to continue with your existing technique?
- Performance: “Faster than FORTRAN” is a common cry. This is a hard target for numerically intensive algorithms which have been optimized for a particular platform by an experienced developer. It’s hard because compiling FORTRAN for high-performance computation is a problem which has been attacked for six decades by many of the finest minds in computer science, and parts of FORTRAN have been refined to allow the expression of algorithms in a performance-sensitive way (and computers have evolved alongside, to enable FORTRAN programs to run faster). Competing with FORTRAN is not such a big deal in general-purpose programming (algorithms without data parallelism), or for more modern languages in which parallelism, dependency, and aliasing can be fully expressed, and which have mature compilers developed with performance in mind.
  - “Faster than X” is meaningless on its own. “Faster than X, on this specific computer, solving this problem, with this code and this data” is meaningful, and should be assessed by comparing that problem with your own.
  - How computationally intensive is your problem? Will it take hours of computer time? Of supercomputer time? How much of your time is it worth to save this amount of computer time?
  - How much of your computation is going to be performance-critical? It’s often only a very small numeric core which, once located, can be farmed out to a separate small program written in FORTRAN or some other special-sauce language.
- Installation: you will be told that installation “is a snap”. This is very common, because long-term users of any software have usually not had to install it from scratch for a long time. They may never have installed it at all (it may have come “out-of-the-box”), and if they did then they may well have forgotten any difficulty they had. Furthermore, many users have only ever installed it on a single “platform”: their own computer, running one particular version of one operating system, with particular versions of any required packages. Software installation is often tricky, with complex dependencies and many compatibility headaches.
  - Read through the installation instructions for your operating system. Do they identify the dependencies, including particular versions? Do they have trouble-shooting advice?
  - Are there existing developers and users across a wide diversity of platforms? Is it easy to find users with your particular mix of operating system and other dependencies?
  - Are there active discussion forums, or a wiki, or a knowledge base? Browse them: do newcomers get real assistance, or a brush-off?
- Compatibility: you will certainly be assured that the system has interfaces to XML, and JSON, and SOAP/RPC, and NetCDF, and SVG, and SQLite, and on and on. But if that functionality is critical to your project, research it further. The interface may well have been developed to support one or two previous projects, and only provide whatever functionality they required. Or it may be antique, not up-to-date with recent changes.
  - Look through the interface documentation. Is it complete? Is it up-to-date?
  - Ask questions specific to your use. Instead of asking “does it work with Matlab files?”, ask “can I read MAT-File Level 5 format files, with sparse arrays and byte-swapping?”
  - Find other users with similar compatibility requirements and ask to see their code.
- Your problem doesn’t matter: the tool is so awesome that you should just do things this other way instead. The language doesn’t need namespaces: you can always use a prefix. You don’t need a working C interface: you can always reimplement that library in this other way. You don’t really need PDF charts: SVG is so much more 21st century.
  - You are kidding, right?
  - (Yes, it’s true about SVG, but many publishers and other tools still want PDF.)
- Future development: It’s not faster than FORTRAN yet, but it will be by Christmas because I just read this cool paper about how to do it. Version 3 is going to have support for PDF defrobnification. The iPhone app will be out Real Soon Now.
  - As with all software, you should assume that any feature not already functional in a released version is very unlikely to be working by the suggested date, and may well never exist.
  - So: keep one eye on project announcements, and bear the tool in mind for future use, but if your project needs a feature today, then either choose a tool which has that feature now, or plan to develop it yourself.
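
To make the performance antidote in the list above a little more concrete: rather than trusting anyone’s “faster than X”, you can time the candidates yourself, on your own machine, with a realistic sample of your own data. What follows is only a minimal Python sketch under stated assumptions. The helper `best_times` and the two toy `sum_of_squares_*` functions are hypothetical placeholders I have made up for illustration; swap in whichever implementations and data you are actually comparing.

```python
import timeit
from typing import Callable, Dict, Sequence


def best_times(
    candidates: Dict[str, Callable[[Sequence[float]], float]],
    data: Sequence[float],
    repeats: int = 5,
    number: int = 10,
) -> Dict[str, float]:
    """Return the best observed seconds-per-call for each candidate on `data`.

    `best_times` is a hypothetical helper, not part of any library.
    """
    results = {}
    for name, func in candidates.items():
        # timeit.repeat returns one total time per repeat; taking the minimum
        # gives the run least disturbed by other activity on the machine.
        totals = timeit.repeat(lambda: func(data), repeat=repeats, number=number)
        results[name] = min(totals) / number
    return results


# Two toy stand-ins for "the tool being hyped" and "what you use today".
def sum_of_squares_loop(values: Sequence[float]) -> float:
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_builtin(values: Sequence[float]) -> float:
    return sum(v * v for v in values)


if __name__ == "__main__":
    # Assumption: replace this with a sample of your own data, at a realistic size.
    sample = [float(i) for i in range(100_000)]
    timings = best_times(
        {"loop": sum_of_squares_loop, "builtin": sum_of_squares_builtin},
        sample,
    )
    for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
        print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

The numbers this prints only mean something for this computer, this code, and this data, which is exactly the point. If the difference turns out to be a few milliseconds on a job you run once a week, the performance hype is irrelevant to your choice.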
Hype is at its worst at particular phases in a project’s life-cycle: at the beginning (when the future is bright, and everything is possible) and near the end (when the project is dying, and this is obvious to all but the True Faithful, who are sustained by unshakeable belief). Those are times to avoid a project anyway, and will be addressed in future posts in this series.