Some code published

One of our goals is to see more scientific code published. Nature kindly gave us space to voice this opinion earlier in the year. In the world of software tools (our home planet if you like) we have seen huge strides forward because people published the source code to their software. It’s where the Open Source movement began. We believe that science will similarly be improved by having more scientists publish more of their code.

By publish we don’t necessarily mean polished, documented, formatted, and printed in a glossy peer-reviewed journal. We just mean made available. Whatever you wrote. Just stick it on the web somewhere. A zipfile is fine.

What should this look like? The upcoming issue of Annals of Applied Statistics provides a fine example. McShane and Wyner have published an article in this journal, and there are various discussions of the article in the same issue. One of these is by Davis & Liu, whose code is published as supplementary material. It’s a great example of what we mean when we say you should publish your code. The code (a few dozen lines of R) is clearly more or less as Davis & Liu wrote it. It has a comment at the top telling you how to download the inputs and run it, which looks like it might have been added in haste later.

As code goes it’s not great code: it’s poorly documented, and full of magic numbers. But it doesn’t have to be great code. It does the job; no-one is going to be building nuclear power stations or recommending the purchase of a cancer-busting drug using this code. The important thing is that it’s the code used to produce the figures in the paper, and it’s published.

(Davis & Liu are not the only ones to make their code available; there is plenty more in the supplementary materials. McShane and Wyner make their R code available. The rest use Matlab: Smerdon, Tingley, Holmström, and Kaplan.)

Stein’s editorial in the same issue of Annals of Applied Statistics is well worth reading, and he has useful things to say on peer-review, data, statistical testing, uncertainty, and the relationship between code and reproducibility. He notes that (emphasis mine) “There is a movement in various disciplines to make all numerical results reported on in published papers reproducible by providing all of the data and code used to generate the results”, and goes on to say that this reproducibility “should be a requirement for research that has potentially important public policy implications whenever permissible”.

Naturally we agree.


16 Responses to Some code published

  1. Ken Mankoff says:

    I agree with publishing the code used for analysis and figures in a publication. I am OK publishing code even if it is in script form, poorly documented for 3rd parties, in proprietary expensive languages, etc., as described above. Even this is better than nothing.

    Hoping to do better than this low bar, I have been exploring Python, both from personal interest and due to comments made by this foundation. I’ve used it a bit for small OS system scripts and utilities. I would like to use it to do scientific research. It seems like the Sage project, built on Python, and with the ability to “publish” notebooks of code, comment, and graphics online, is an excellent way to achieve these goals.

    However, I am increasingly frustrated in the ability of Python to produce graphics as easily as the more expensive languages. Today’s frustration: if there is an easy way to produce a plot with 3 y-axes in Python, I would be happy to know of it. I don’t know MATLAB well at all, but it took one search and 5 minutes to get this file (http://www.mathworks.com/matlabcentral/fileexchange/1017-plotyyy) to produce publication-quality 3-y-axis plots.

    My Python knowledge is very small, but comparable to my MATLAB skills. I’ve spent most of a day trying to replicate this in Python. I have found snippets, but not drop-in modules. I have managed to customize the snippets and produce my own module, but there appears to be a bug with scaling the axes, and as far as I can tell it isn’t in my code but in matplotlib set_ylim.

    I hope we can agree that, as a scientist and not a programmer, it would be better to do science and not release code (worst case, possibly soon-no-longer-allowed by the journals and funding agencies), or do science and release poor code in expensive languages (better), or do science and release elegant code in expensive languages (better still), than to not do science and spend time writing graphic or stat libraries for Python, hoping one day to achieve the best solution of writing elegant code in a popular, clear, free, easy-to-use language.

    • drj says:

      @Ken: I’m in broad agreement, and I sympathise with your frustration at Python’s graphing tools (and, no, I don’t know of a solution to your 3 y-axis problem, but I don’t do much graphing in Python).

      I think it’s pretty clear that we’d like code to be available in whatever expensive proprietary language it was written in. That’s better than nothing. Do I want grad students to be writing graphics libraries in Python? Probably not. But do I want public research funding to go towards paying for expensive Matlab licenses? Yes, if that produces the best science. Is it possible that this funding could be redirected towards creating Open Source alternatives? It’s possible. Some of the arguments here are analogous to the Open Access journal movement: it’s the same bucket of money, it’s a matter of where you choose to spend it.

      In the long term, I think that the science will be better if it’s done using Open Source tools (as well as itself being Open Source).

      • Ken Mankoff says:

        @drj I think the funding allocation logic is reasonable. For the price of one or a few institutions’ MATLAB licenses, an undergrad CS team funded to do part-time or summer work could make good improvements to a syntax-compatible graphics package. Perhaps the CCF or CCCF could look into the Google Summer of Code project or some similar path toward sponsoring such a team…

        • drj says:

          We, Climate Code Foundation, are intending to “do” (sponsor, propose, whatever it is that we do) a Google Summer of Code project or two. Your suggestion would make a good one.

          • Ken Mankoff says:

            FYI, the CS dept. at CU Boulder (like many other CS depts around the world) has senior or capstone projects, where companies submit projects and a team of students works on them for the year.

            If someone has management time, this is another way to get a project moving forward.

    • John Keyes says:

      I don’t have a solution to the graphing problem in Python, but I don’t believe it’s necessary to have the graphing implemented in Python. The outputs can be transformed into a format that can be imported by other tools. Unlike the algorithms, graphing isn’t core, it’s only another output format.
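
A minimal sketch of that idea in Python, assuming the analysis produces two equal-length columns of numbers: write them out as CSV so that any plotting tool can import them. The function name write_series_csv, the column names, and the sample values are placeholders for illustration; none of them come from the code discussed in the article.

```python
import csv
import io


def write_series_csv(stream, xs, ys, header=("x", "y")):
    """Write two equal-length sequences to `stream` as CSV rows.

    Hypothetical helper: the name and the default column labels are
    illustrative assumptions, not part of any existing library.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    # Fix the line terminator so the output is identical on every platform.
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(header)
    for row in zip(xs, ys):
        writer.writerow(row)


# Placeholder values only; a real script would pass its computed results.
buffer = io.StringIO()
write_series_csv(buffer, [1, 2, 3], [0.5, 0.25, 0.125])
csv_text = buffer.getvalue()
```

In practice the stream would be a file opened with open(path, "w", newline=""), and the resulting file could be read by R, MATLAB, a spreadsheet, or whatever graphing tool is preferred.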

      • Nick Barnes says:

        John, take a look at the example code linked in the article. It’s all geared towards producing plots. How to reproduce that in Python? I might ask Greg Wilson for his views.

        • John Keyes says:

          I’m curious what the reason is for reproducing it in Python? R is GPL so why not use it as is?

          There is a Python interface to R called RPy. I have no experience with it, but it looks like a good place to start.

          Also, it may be worthwhile checking out this pretty comprehensive list of plotting libraries/packages for Python.

          • Ken Mankoff says:

            That list is nice but also part of the problem… Learning one of those packages is a significant time investment (and as stated above, even the most mature isn’t mature enough to compete). Evaluating all of them is a waste of time.

            As for R, I’ve considered it. I’ve also considered a dozen other free (as in beer and as in speech) languages and graphics packages. But there is an exponential decay in the user base, and anything after IDL/MATLAB/Python is of little use if my colleagues and random readers are not fluent. And more importantly, for me they all fall short on functionality.

            Python has the user base, elegance, etc. If it can get the functionality it would be wonderful.

          • John Keyes says:

            With regard to plotting in Python, it might be useful to document which features are required for scientific plotting.

            This document could then be passed on to the communities that are actively developing their plotting libraries and they can comment on what is currently supported, and what will be supported in the future and when.

  2. Stefan says:

    The link to the supplementary material does not work (the correct one is http://lib.stat.cmu.edu/aoas/398C/supplementC.txt), and I do not find ‘plenty more’ code in the linked folder; only one zip file is available. Maybe you wanted to link this page: http://www.imstat.org/aoas/supplements/future_issues.html ?

    • drj says:

      They updated their links between when I hit publish and now. What can I say? Thanks for the note, I’ll fix the links in the article. There used to be a handy directory with all the supplementary materials for the “Climate Change Discussion”. Looks like all the links will break when the issue is published anyway (and no longer the “next issue”).

    • drj says:

      @Stefan: I’ve updated the links. Thanks again.

  3. drj says:

    Just to illustrate how science is improved by making the code available, Smerdon makes use of McShane and Wyner’s code in his discussion. Part of Smerdon’s code is a modified version of McShane and Wyner’s. Clearly the scientific discussion was made easier by having the code available.

    (and if it had been stored in a source code control system, I would be able to see what changes Smerdon made much more easily (though in this case they are not large and they’re commented))

    • John Keyes says:

      Although source code control can be used in a very simple manner, it is a hurdle to contributing. Even use of a distributed versioned file system (like Dropbox) would be of use here.

