siddk/voltron-robotics

Storage constraints for Something-v2 for inference


gunshi commented

Hey @siddk,
Thanks for open-sourcing the framework!
I had a question about the data loading: I want to evaluate/run inference with the pre-trained models on a small subset of the Sth-Sth v2 data, and I have less than 80-100 GB of storage space for it.
The README in the pretrain folder says the data extraction might need hundreds of GBs and that the streaming_dataset might be a solution; could you elaborate? Or am I interpreting it wrong? The dataset website says the data should be about 56 GB after extraction, so maybe the >100 GB of storage is only needed if the data is processed a certain way for Voltron pretraining?
Alternatively, do you know if I could reduce storage needs by extracting the dataset at a lower FPS (I assume that should be fine, since the Voltron models encode single images?) or by only preprocessing a subset of the videos?

PS: also minor correction to the Readme: the command to untar should be
cat 20bn-something-something-v2-?? | tar -xvzf -
instead of
cat 20bn-something-something-?? | tar -xvzf -

siddk commented

Hey @gunshi - thanks for using the framework!

So the >100 GB figure is a conservative estimate that assumes you're dumping versions of the dataset for all the baselines as well (e.g., the index files for R3M training). If you're working with just the 1-2 frame Voltron models, it should be around 90 GB. Totally feel free to dump frames at a lower FPS if you want to further reduce the footprint.
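For reference, dumping frames at a lower FPS can be as simple as the sketch below (not the repo's actual preprocessing code; it assumes you have ffmpeg, the paths and video ID are placeholders, and 5 FPS is an arbitrary choice):

# Extract frames from one Sth-Sth v2 clip at 5 FPS instead of the native rate.
# Placeholder paths/IDs; loop over whichever subset of videos you need.
mkdir -p frames/12345
ffmpeg -i 20bn-something-something-v2/12345.webm -vf fps=5 -q:v 2 frames/12345/%06d.jpg

Frame storage scales roughly linearly with the extraction FPS, so this is the easiest knob to turn.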

I hope this is helpful! Thanks for catching the error in the README as well, I'll go ahead and fix that!

gunshi commented

Thank you for clarifying!
I wasn't sure if I'd missed this, but would it be possible to see the config settings used to pre-train on this dataset for the Voltron results, as a starting point? As far as I can tell, I would currently call preprocess.py and supply the "missing" args (marked as MISSING in the code) on the command line, but I was wondering if there are config files that already specify the recommended values for replicating the setup in the paper.
Thanks!

siddk commented

You actually shouldn't need to override anything! The "default config" for preprocess.py is set here: https://github.com/siddk/voltron-robotics/blob/main/examples/pretrain/preprocess.py#L28.

This links to the full dataset config here: https://github.com/siddk/voltron-robotics/blob/main/voltron/conf/datasets.py#L61.

As long as the fields path and artifact_path are set properly, the script should run and exactly replicate the paper's preprocessing flow.

Mainly, you just need to make sure path points to wherever you've downloaded the Sth-Sth dataset following these instructions: https://github.com/siddk/voltron-robotics/tree/main/examples/pretrain#obtaining-the-raw-dataset.
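Concretely, the invocation can stay minimal; something like the sketch below (this assumes the MISSING fields accept standard Hydra/OmegaConf-style key=value overrides, the paths are placeholders, and the exact key names depend on how the config is nested):

# Sketch only: set the two MISSING fields and let everything else default.
python examples/pretrain/preprocess.py \
    path=/data/20bn-something-something-v2 \
    artifact_path=/data/voltron-artifacts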

Let me know if you have any trouble!