iterative/dvc

import: flag / parameter to skip the computation of the checksums

Honzys opened this issue · 1 comment

Hello,

First of all, thank you very much for the nice tooling you provide to everyone — wait, no: thank you very much for the nice tooling you provide to everyone; it makes the data life cycle much easier!

I'd like to ask if there is a way to skip the computation of checksums for imported files.

Imagine having TBs of data stored on a DVC remote. Imagine also having isolated environments for ML training where only the directory containing the DVC cache is shared (e.g. Docker containers with mounted volumes, Kubernetes pods, etc.), so that the data is used straight from the cache instead of being downloaded from the remote.

Our use case is to train ML models on the data mentioned above. Briefly, our cycle looks like this:

  1. Initialize the isolated environment & start the actual training script.
  2. Import the data (after the first import, the data is not copied from the remote but only reflinked/linked/copied from the shared cache directory, once issue #10255 is fixed).
  3. Run the training & save the output models.
  4. Destroy the environment.

The current issue is in step 2, where the checksums of the data are re-computed on every import, even when nothing is fetched from the remote.

Have you considered a parameter / flag that would disable the checksum computation when the data source in the shared cache directory can be trusted?
I understand that the computation is there to make sure the data didn't get corrupted along the way, but with TBs of data the import step can take a very long time, because all of the data has to be read from disk and hashed :/
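To make the cost concrete: every imported file has to be streamed through a content hash (DVC uses MD5-based checksums), so the time scales linearly with data size regardless of whether the bytes already sit in a trusted cache. Below is a minimal sketch of that cost; the chunked-MD5 helper and the 16 MiB test file are illustrative, not DVC's actual implementation:

```python
import hashlib
import os
import tempfile
import time

def file_md5(path, chunk_size=1024 * 1024):
    """Stream a file through MD5 in fixed-size chunks (DVC-style checksumming)."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Toy demo: hash a 16 MiB file, then extrapolate the cost to 1 TiB.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(16 * 1024 * 1024))
    path = tmp.name

start = time.perf_counter()
digest = file_md5(path)
elapsed = time.perf_counter() - start
os.unlink(path)

scale = (1024 * 1024) // 16  # 16 MiB -> 1 TiB
print(f"16 MiB hashed in {elapsed:.3f}s -> roughly {elapsed * scale / 60:.0f} min per TiB")
```

Even at several hundred MB/s of hashing throughput, a multi-TB import spends hours just re-verifying bytes that were already verified when they entered the cache.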

Thank you very much in advance for your answers or any insights regarding this issue!