About create_dataset.py
jeje910 opened this issue · 6 comments
I had the same issue as ElegantLin and also tried to use create_dataset.py to create a dataset of my own.
When I run
python -m modeling.datasets.create_dataset
with args.visual_checkpoint=$TEACH_LOGS/pretrained/fasterrcnn_model.pth
args.data_input=games
args.task_type=game
args.data_output=tatc_dataset
args.vocab_path=None
It creates a tatc_dataset folder of 15 TB, which does not make sense.
I also tried to make a small subset with
args.vocab_path=$My_own_path/vocab/human.vocab
but it shows the error below.
Thanks!
Hi,
Please let me know your lmdb version, Python version, and the OS you are using. We tested the code with lmdb 1.3.0, Python 3.7, and Ubuntu 20.04.3.
It seems that lmdb has issues on macOS and Windows. The map_size field mentioned here is supposed to be the maximum allowed memory for the lmdb file; however, on macOS/Windows it allocates that much disk space right away.
You can try reducing the 700 here to a smaller value (see the sketch at the end of this reply). For me, the disk usage (for each of the commander and driver features) was:
- train: 22GB
- valid_seen: 2.8GB
- valid_unseen: 8.7GB
(Small data creation is not supported yet.)
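In case it helps, here is a minimal sketch of what the map_size change amounts to. The path, map size, key, and feature shape below are illustrative placeholders, not the exact values used in create_dataset.py:

```python
import lmdb
import numpy as np

# map_size is only an upper bound: on Linux the .mdb file grows lazily as a sparse
# file, but on macOS/Windows lmdb tends to allocate the full amount on disk up front.
# 40 GB below is an arbitrary example; the 700 in the code presumably means ~700 GB,
# so pick whatever fits your disk and the split you are processing.
env = lmdb.open("example_feats.lmdb", map_size=40 * (1024 ** 3))

with env.begin(write=True) as txn:
    # hypothetical record: one episode's Faster R-CNN features stored as raw bytes
    feats = np.zeros((10, 512, 7, 7), dtype=np.float32)
    txn.put(b"episode_0000", feats.tobytes())
env.close()
```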
Thank you for your helpful reply!
I'm now testing it on my local machine with lmdb 1.3.0, Python 3.8, and Ubuntu 20.04.3 LTS.
I already tried changing the line you mentioned, but it seems the data is still not generated properly, as shown in the error below.
(I tried running the code with the default map_size, and also with a size of around 130 GB.)
Could I get an already-generated tar.gz file instead, if possible?
Thanks!
Is that the complete error log?
I am trying to figure out a way to share the preprocessed features, but due to the sparse file format of the mdb files, it's not trivial. Running create_dataset at your end will only take a few hours.
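For context, the sparse-file issue is easy to see locally: the apparent size of data.mdb can sit near the map_size ceiling while the blocks actually written are much smaller, and a plain cp/tar without sparse-aware options will materialize the full apparent size. A small check, assuming a POSIX system and a hypothetical output path:

```python
import os

# Hypothetical path to one of the generated lmdb data files.
path = "tatc_dataset/feats.lmdb/data.mdb"

st = os.stat(path)
apparent_gb = st.st_size / 1024 ** 3          # what ls -l reports (near map_size)
on_disk_gb = st.st_blocks * 512 / 1024 ** 3   # blocks actually allocated (POSIX)

print(f"apparent size: {apparent_gb:.1f} GB")
print(f"on-disk size:  {on_disk_gb:.1f} GB")
```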
Hi, did you make any progress on this? Unfortunately, I have never seen this error on my end. Are all these logs from the same run? Maybe you can try downloading and unzipping the data again; it seems like it's failing towards the last few files.
Yes, the code first processes all game JSON files, which is fast. Then it processes all the images per episode in the meta_data, which takes a few hours.
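If you want a rough sense of where the time goes before kicking off a full run, something like the following counts the work in each phase. The directory layout and file patterns here are assumptions about the extracted data, not what create_dataset.py uses verbatim:

```python
from pathlib import Path

games_dir = Path("games")    # same location as args.data_input above
images_dir = Path("images")  # wherever the per-episode image archives were extracted

# Phase 1: one pass over the game json files (text/actions only) -- minutes.
n_games = sum(1 for _ in games_dir.glob("**/*.json"))

# Phase 2: every frame of every episode goes through the visual checkpoint -- hours.
n_frames = sum(1 for _ in images_dir.glob("**/*.jpeg"))

print(f"{n_games} game files, {n_frames} frames to featurize")
```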
It works well on my server after I rebuilt my whole environment! (It still doesn't work on my local machine, but I don't know why.)
But since it works on my server, I don't think it will matter! Thank you for your awesome help!