swcarpentry/good-enough-practices-in-scientific-computing

git-lfs for data

Closed this issue · 6 comments

Recording/moving a Twitter conversation here without the 140-character limit.

From Bjørn Fjukstad (GitHub @fjukstad, Twitter @fjukstad) via Twitter:

great read, but why not git-lfs for version control of datasets?

I replied with this link: GitHub’s Large File Storage is no panacea for Open Source — quite the opposite.

Bjørn pointed out that that article is specifically about storing large files on GitHub.com, so it's not necessarily a reason to abandon git-lfs. He recommended this list of alternative implementations and http://www.pachyderm.io.

My 2 cents: if it's not something all of us authors are using routinely, then it's not right for this particular paper. We even dropped many things we do ourselves (version control! tests!) in order to focus on people just entering the on-ramp.

But I like to capture these discussions for ... edification, future articles, whatever. I agree that change tracking for data, at both the pro and amateur levels, is not at all sorted out. Thanks @fjukstad.

Weighing in with minimal context, but I agree completely. It holds other people (usually novices) to an unrealistic standard that we ourselves don't practice. It's good to separate:

- here are things we do
- here are things we would like to do someday, if they work and people are willing to use them

Here is the 140 character-limited starting point: https://twitter.com/lexnederbragt/timelines/771781985798946816

As @fjukstad points out on Twitter, this could be added to "what we left out".

Hey!

First off: great read, lots of good points to take home!

I agree that version control of datasets (especially intermediate data) isn't really mainstream yet, but I believe it's a step towards reproducible research. For example, suppose you're analyzing RNA-seq data through a pipeline with multiple stages. Keeping the intermediate data (and results) under version control would simplify making your results reproducible even if you update a tool in the middle of the pipeline. With new tools such as git-lfs and Pachyderm, I think it's something for readers to be aware of!
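To make the idea concrete, here is a minimal sketch of committing an intermediate pipeline output to git. All file names are hypothetical, and the git-lfs parts are only shown in comments, since git-lfs needs a separate install:

```shell
# Sketch: version an intermediate pipeline output with git.
# File names below are hypothetical.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"  # throwaway identity for the demo commit
git config user.name "Demo"

# With git-lfs installed, you would route large intermediates through LFS
# before committing them:
#   git lfs install
#   git lfs track "counts/*.tsv"   # writes the pattern into .gitattributes
#   git add .gitattributes

mkdir counts
printf 'gene\tsample1\nBRCA1\t120\n' > counts/stage1.tsv  # an intermediate result
git add counts/stage1.tsv
git commit -q -m "Stage 1: gene counts from the alignment step"
```

The point of the sketch is that each pipeline stage's output gets its own commit, so re-running a later stage with an updated tool leaves the earlier intermediates recoverable from history.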

I have RNA-seq data in Git and on GitHub. Not the raw data, of course, but the data once it had entered a differential analysis pipeline. Yes, it's awkward. But the datasets get progressively smaller as they move through the analysis, so luckily the bits that change the most are also the smallest.

I have not looked at Pachyderm too closely, but I listened to the podcast about it, and it has a lot of support for data engineering workflows. Anyway, for machine learning workflows DVC looks more applicable than git-lfs or Pachyderm:

https://github.com/iterative/dvc

Disclaimer: I'm a bit biased, because DVC is written in Python by Russian-speaking developers.
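For context on how DVC differs from git-lfs: `dvc add <file>` moves the data into a local cache and leaves a small pointer file in the repository, which you commit with git while pushing the data itself to a remote with `dvc push`. A pointer file looks roughly like this (the file name is made up, the checksum is just the MD5 of an empty file as a placeholder, and the exact schema varies by DVC version):

```yaml
# counts.csv.dvc: committed to git in place of the large file itself
outs:
- md5: d41d8cd98f00b204e9800998ecf8427e  # checksum of the cached data
  path: counts.csv
```

Because only this tiny file lives in git, the repository stays small no matter how large the dataset grows.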