siddk/voltron-robotics

Storage constraints for Something-v2 for inference


gunshi commented

Hey @siddk,
Thanks for open-sourcing the framework!
I had a question about the data loading: I want to evaluate/run inference with the pre-trained models on a small subset of the Sth-Sth v2 data, and I have less than 80-100 GB of storage space for it.
The README in the pretrain folder says the data extraction might need hundreds of GBs and that the streaming_dataset might be a solution; could you elaborate? Or am I interpreting it wrong? The dataset website says the data should be about 56 GB after extraction, so maybe the >100 GB of storage is only needed if the data is processed a certain way for Voltron pretraining?
Alternatively, do you know if I could reduce storage needs by extracting the dataset at a lower FPS (I assume that should be fine, since the Voltron models encode single images?) or by only preprocessing a subset of the videos?

PS: also minor correction to the Readme: the command to untar should be
cat 20bn-something-something-v2-?? | tar -xvzf -
instead of
cat 20bn-something-something-?? | tar -xvzf -

siddk commented

Hey @gunshi - thanks for using the framework!

So the >100 GB figure is a conservative estimate that assumes you're dumping versions of the dataset for all the baselines as well (e.g., the index files for R3M training). If you're working with just the 1-2 frame Voltron models, it should be around 90 GB. Totally feel free to dump frames at a lower FPS if you want to further reduce the footprint.
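For reference, dumping frames at a lower FPS can be as simple as the sketch below (not the repo's actual preprocessing code; it assumes you have ffmpeg, the paths and video ID are placeholders, and 5 FPS is an arbitrary choice):

# Extract frames from one Sth-Sth v2 clip at 5 FPS instead of the native rate.
# Placeholder paths/IDs; loop over whichever subset of videos you need.
mkdir -p frames/12345
ffmpeg -i 20bn-something-something-v2/12345.webm -vf fps=5 -q:v 2 frames/12345/%06d.jpg

Frame storage scales roughly linearly with the extraction FPS, so this is the easiest knob to turn.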

I hope this is helpful! Thanks for catching the error in the README as well, I'll go ahead and fix that!

gunshi commented

Thank you for clarifying!
I wasn't sure if I'd missed this, but would it be possible to see the config settings used to pre-train on this dataset for the Voltron results, as a starting point? As far as I can tell, I would currently call preprocess.py and supply the "missing" args (marked as MISSING in the code) on the command line, but I was wondering if there are config files that already specify the recommended values for replicating the setup in the paper.
Thanks!

siddk commented

You actually shouldn't need to override anything! The "default config" for preprocess.py is set here: https://github.com/siddk/voltron-robotics/blob/main/examples/pretrain/preprocess.py#L28.

This links to the full dataset config here: https://github.com/siddk/voltron-robotics/blob/main/voltron/conf/datasets.py#L61.

As long as the fields path and artifact_path are set properly, the script should run and exactly replicate the paper's preprocessing flow.

Mainly, you just need to make sure path points to wherever you've downloaded the Sth-Sth dataset following these instructions: https://github.com/siddk/voltron-robotics/tree/main/examples/pretrain#obtaining-the-raw-dataset.
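Concretely, the invocation can stay minimal; something like the sketch below (this assumes the MISSING fields accept standard Hydra/OmegaConf-style key=value overrides, the paths are placeholders, and the exact key names depend on how the config is nested):

# Sketch only: set the two MISSING fields and let everything else default.
python examples/pretrain/preprocess.py \
    path=/data/20bn-something-something-v2 \
    artifact_path=/data/voltron-artifacts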

Let me know if you have any trouble!