swcarpentry/good-enough-practices-in-scientific-computing

Please leave feedback here

gvwilson opened this issue · 6 comments

We welcome feedback on this draft - please either:

  1. Add a comment to this issue.
  2. File a separate issue in this repository.
  3. Submit a pull request proposing a change to good-enough-practices-for-scientific-computing.tex (by preference) or index.md.
  4. Email the authors.

I do not have a simple solution for these:

  • I understand why you feel the need for two separate sets of recommendations
    for manuscript-writing tools,
    but it would be nice if you could alter Box 1 somehow
    so that points 6a and 6c do not (appear to) directly contradict each other.
  • Recommendations for both a CITATION file (3d) and a PUBLICATIONS file (6b)
    are somewhat redundant.
  • You recommend plain text files at several places in the manuscript.
    You might add a sentence about the stability of plain-text file formats,
    which is at least as important as their vcs-trackability/diffability.
  • What is the scope of the 'tools needed to compile manuscripts' you refer to in 6d?
    An example might help.
  • Why do you dismiss emailing documents to yourself early on in the manuscript?
    That is a great way to back up static time-stamped versions
    that can be organized together with related correspondence.
  • I do not like recommendation 5g ('Copy the entire project').
    It has an intuitive appeal,
    but I think in practice it creates a need to manually check every single file
    if one thinks there is a problem.
    Files can sometimes be edited by accident after the full copy is made.
    I wrote a manuscript this way with some people, and it was terribly frustrating.
    It also disrupts the automatic version histories created by Dropbox/Google Docs/etc.,
    which are great because they require no thought or self-discipline at all.

WRT citations, you might want to have a look at (and possibly add a reference to) the article Software Citation Principles, although it has not been peer reviewed yet.

Thanks @jshoyer. On your points:

  • Done by adding a big OR to the boxes to make it clear that you choose one set of recommendations or the other. While doing this, I also removed the former points 6b and 6d. I felt that they were too detailed for the user base of this manuscript, and also somewhat unclearly written.
  • CITATION is how to cite this project, PUBLICATIONS would have been like a bibliography (I believe), but I removed the latter recommendation.
  • I honestly didn't see a great place to make this point cleanly, so haven't added anything.
  • This was one of the things that I thought was unclear and unnecessarily complicated in 6d, so I removed it.
  • I think the goal is to keep everything in one place (a project directory) rather than having important information spread out between an active project directory and another system, like email. It's not that it's impossible to manage versions in email, just that it's not very scalable.
  • We had quite a lot of discussion about this - the downside to Dropbox is that the automated backups provide no information on what changed between snapshots. I think that's pretty critical - in a manuscript maybe not such a big deal, because you can visually see what's changed between versions in a single file, but in a larger project with many files and mostly code, it would be a nightmare to trace a regression without some information on what files have changed. The changelog provides that ability for manual versioning. There's definitely the risk of mistakes with any manual system, which is why it's really recommended as a stopgap, and a stepping stone, towards version control software.
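
To make the point about tracing changes between snapshots concrete, here is a minimal sketch, assuming two full copies of a project sit side by side on disk. The function names (`file_digests`, `compare_snapshots`) are made up for illustration and are not from the manuscript. It lists which files were added, removed, or modified between the copies, which is roughly the information a hand-written changelog is meant to record.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_snapshots(old_dir, new_dir):
    """Return the files added, removed, and modified between two project copies."""
    old, new = file_digests(old_dir), file_digests(new_dir)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        # A file counts as modified when it exists in both copies
        # but its contents hash differently.
        "modified": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Calling `compare_snapshots` on two dated copies of the project directory would return the three lists, which could then be pasted into the changelog entry for the newer copy. This only reports which files differ, not what changed inside them.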

On the comment by @aurelg: thanks for the link, very interesting paper. I'm not sure it's quite the same thing as the CITATION file suggestion here, which is about giving text to others who want to cite your software, rather than the broader issues surrounding software citation in general. But if I read this wrong, definitely feel free to come back with text that might work it in better.

Under the Project Organization section, the authors discuss what goes in the src directory, and recommend both files that perform the core analysis of the research and controller or driver scripts. One example of the latter that I use is Makefiles, and my understanding of these (from tutorials and looking at what others are doing) is that they are often included in the project's root directory. What is the justification for putting them in the src directory as opposed to the project's root directory?

@ttimbers, I could see it either way. William Noble (paper we reference) puts his makefile in src, and I do the same in my tutorial. For me, the argument is that we are allowing for a more heterogeneous project (including manuscript files, for example) than a "traditional" software project. Thus to me it makes more sense to keep all the scripts together. Would a makefile in the root directory be expected to compile the manuscript, for example? Or to build the docs? That's my thinking anyway.

Edit: Noble actually has several driver scripts in his tutorial: the makefile for the paper is in doc, the makefile for the code is in src, and his runall scripts are in results. None are in root, though, for the same reason as above - they're tagged to a particular type of output/part of the project.

Sounds like a reasonable explanation/justification for putting them in src. I guess I really see it as not black/white, but a grey area that depends on how complex the project is.