Comparison with DVC
erdnaavlis opened this issue · 2 comments
Hello!
First of all thank you for your contribution to the community! I’ve just found out about this and it seems to be a nice project that is growing!
You are probably familiar with dvc (https://github.com/iterative/dvc).
I’ve been investigating it in order to include it in my ML pipeline. Can you briefly explain how (or if) Lazydata differs from dvc, and any advantages and disadvantages? I understand that some functionality may not be implemented yet purely due to time constraints or similar. I’m more interested in knowing whether there are any differences in terms of paradigm.
P.S. If you have a different channel for these kinds of questions, please let me know.
Thank you very much!
Hi Andre!
Sure, so at the moment the main differences are in a couple of design decisions:
- In DVC there is an extra `.dvc` file for every data file you have. In `lazydata` all file metadata is stored in a single file, `lazydata.yml`.
- DVC uses the same basic paradigm as `git-lfs`: data files are tightly coupled to the repository, and after doing `git pull` you would normally do `dvc pull`, which pulls all the data files. In `lazydata`, you normally download files "lazily", i.e. calling `track()` on a file that is tracked but missing from the local copy will download it. The idea is to let you simply track all of your data files without worrying about someone else downloading them, because they'll only ever download a file if they need it.
- The main interface of `lazydata` is programmatic (i.e. from within Python), and I'll continue developing it in that direction. The main interface of DVC and git-lfs is command-line.
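To make the "lazy" download idea concrete, here is a minimal sketch of the pattern. It is not lazydata's actual implementation: the function `track_lazily` and the `remote_dir` parameter are hypothetical, and a plain local directory stands in for whatever remote storage is really used.

```python
import shutil
from pathlib import Path


def track_lazily(path, remote_dir):
    """Return `path` as a string, fetching the file first if it is missing locally.

    Hypothetical illustration only: `remote_dir` is a local folder that
    plays the role of remote storage.
    """
    local = Path(path)
    if not local.exists():
        # The file is known but absent from the local copy: fetch it on first use.
        source = Path(remote_dir) / local.name
        if not source.exists():
            raise FileNotFoundError(
                f"{local.name} not found locally or in {remote_dir}"
            )
        local.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, local)
    # If the file was already present, nothing is downloaded.
    return str(local)
```

Because the function returns the path, it can wrap any call that opens a file, e.g. `open(track_lazily("data/table.csv", "/mnt/store"))` (both paths are made up), so a collaborator only pays the download cost for files their code actually touches.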
Of course, DVC and git-lfs have more features at the moment, but I expect the differences in these design decisions will remain the same. Hope this answers the question!
Thank you very much for your reply @rstojnic. It makes sense, and your point 2 is very appealing.
I’ll make sure to give Lazydata a try soon :)