About create_dataset.py
jeje910 opened this issue · 6 comments
I had the same issue as ElegantLin and also tried to use create_dataset.py to create a dataset of my own.
When I run
python -m modeling.datasets.create_dataset
with args.visual_checkpoint=$TEACH_LOGS/pretrained/fasterrcnn_model.pth
args.data_input=games
args.task_type=game
args.data_output=tatc_dataset
args.vocab_path=None
It creates a tatc_dataset folder of 15 TB, which does not make sense.
I also tried to make a small subset with
args.vocab_path=$My_own_path/vocab/human.vocab
but it shows the error below.
Thanks!
Hi,
Please let me know your lmdb version, Python version, and the OS you are using. We tested the code with lmdb 1.3.0, Python 3.7, and Ubuntu 20.04.3.
It seems that lmdb has issues on macOS and Windows. The map_size field mentioned here is supposed to be the maximum allowed memory for the lmdb file; however, on macOS/Windows it allocates that much disk space right away.
You can try reducing the 700 here to a smaller value (see the sketch at the end of this reply). For me, the disk usage (for each of the commander and driver features) was:
- train: 22GB
- valid_seen: 2.8GB
- valid_unseen: 8.7GB
(Small data creation is not supported yet.)
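In case it helps, here is a minimal sketch of what the map_size change amounts to. The path, map size, key, and feature shape below are illustrative placeholders, not the exact values used in create_dataset.py:

```python
import lmdb
import numpy as np

# map_size is only an upper bound: on Linux the .mdb file grows lazily as a sparse
# file, but on macOS/Windows lmdb tends to allocate the full amount on disk up front.
# 40 GB below is an arbitrary example; the 700 in the code presumably means ~700 GB,
# so pick whatever fits your disk and the split you are processing.
env = lmdb.open("example_feats.lmdb", map_size=40 * (1024 ** 3))

with env.begin(write=True) as txn:
    # hypothetical record: one episode's Faster R-CNN features stored as raw bytes
    feats = np.zeros((10, 512, 7, 7), dtype=np.float32)
    txn.put(b"episode_0000", feats.tobytes())
env.close()
```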
Thank you for your helpful reply!
I'm now testing it on my local machine with lmdb 1.3.0, Python 3.8, and Ubuntu 20.04.3 LTS.
I already tried changing the line you mentioned, but it seems the data is still not generated properly, as shown in the error below.
(I tried running the code with the default map_size, and also with a size of around 130 GB.)
Could I get an already-generated tar.gz file instead, if possible?
Thanks!
Is that the complete error log?
I am trying to figure out a way to share the preprocessed features, but due to the sparse file format of the mdb files, it's not trivial. Running create_dataset at your end will only take a few hours.
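For context, the sparse-file issue is easy to see locally: the apparent size of data.mdb can sit near the map_size ceiling while the blocks actually written are much smaller, and a plain cp/tar without sparse-aware options will materialize the full apparent size. A small check, assuming a POSIX system and a hypothetical output path:

```python
import os

# Hypothetical path to one of the generated lmdb data files.
path = "tatc_dataset/feats.lmdb/data.mdb"

st = os.stat(path)
apparent_gb = st.st_size / 1024 ** 3          # what ls -l reports (near map_size)
on_disk_gb = st.st_blocks * 512 / 1024 ** 3   # blocks actually allocated (POSIX)

print(f"apparent size: {apparent_gb:.1f} GB")
print(f"on-disk size:  {on_disk_gb:.1f} GB")
```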
Hi, did you make any progress on this? Unfortunately, I have never seen this error on my end. Are all these logs from the same run? Maybe you can try downloading and unzipping the data again; it seems like it's failing towards the last few files.
Yes, the code first processes all game JSON files, which is fast. Then it processes all the images per episode in the meta_data, which takes a few hours.
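If you want a rough sense of where the time goes before kicking off a full run, something like the following counts the work in each phase. The directory layout and file patterns here are assumptions about the extracted data, not what create_dataset.py uses verbatim:

```python
from pathlib import Path

games_dir = Path("games")    # same location as args.data_input above
images_dir = Path("images")  # wherever the per-episode image archives were extracted

# Phase 1: one pass over the game json files (text/actions only) -- minutes.
n_games = sum(1 for _ in games_dir.glob("**/*.json"))

# Phase 2: every frame of every episode goes through the visual checkpoint -- hours.
n_frames = sum(1 for _ in images_dir.glob("**/*.jpeg"))

print(f"{n_games} game files, {n_frames} frames to featurize")
```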
It works well on my server after I rebuilt my whole environment! (It still doesn't work on my local machine, but I don't know why.)
But since it works on my server, I don't think it will matter! Thank you for your awesome help!