megvii-model/HINet

Image Deblur - Custom dataset Error

kimtaehyeong opened this issue · 6 comments

Thanks for writing a good paper.
I have my own dataset of [input, target] pairs.
I created the folders in the same layout as below:

./datasets/
./datasets/GoPro/
./datasets/GoPro/train/
./datasets/GoPro/train/input/
./datasets/GoPro/train/target/
./datasets/GoPro/test/
./datasets/GoPro/test/input/
./datasets/GoPro/test/target/

Then I ran the preprocessing script:

python scripts/data_preparation/gopro.py
Through this preprocessing, the blur_crops, blur_crops.lmdb / sharp_crops, and sharp_crops.lmdb datasets were created.

Finally

python -m torch.distributed.launch --nproc_per_node=8 --master_port=4321 basicsr/train.py -opt options/train/GoPro/HINet.yml --launcher pytorch

I tried to train with the above command, but got the following error:

ValueError: Keys in lq_folder and gt_folder are different.
...
...
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/HINet/bin/python', '-u', 'basicsr/train.py', '--local_rank=7', '-opt', 'options/train/GoPro/HINet.yml', '--launcher', 'pytorch']' returned non-zero exit status 1.

How can I fix the error?

Thanks.

Hi, kimtaehyeong,
Thanks for your attention to HINet!

The error points to an inconsistency between the input and target data, as checked in https://github.com/megvii-model/HINet/blob/main/basicsr/data/data_util.py#L151-L153 .
Make sure the number of original images in ./datasets/GoPro/train/input/ and ./datasets/GoPro/train/target/ is the same, and likewise for ./datasets/GoPro/test/input/ and ./datasets/GoPro/test/target/.
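As a quick sanity check, here is a minimal sketch (assuming the folder layout above) that compares the file names in each input/target pair of folders:

import os

# Minimal sketch: compare the file names in an input/target folder pair,
# similar in spirit to the key check in basicsr/data/data_util.py.
def compare_folders(lq_folder, gt_folder):
    lq_names = set(os.listdir(lq_folder))
    gt_names = set(os.listdir(gt_folder))
    print(f'{lq_folder}: {len(lq_names)} files, {gt_folder}: {len(gt_names)} files')
    print('  only in input :', sorted(lq_names - gt_names))
    print('  only in target:', sorted(gt_names - lq_names))

for split in ('train', 'test'):
    compare_folders(f'./datasets/GoPro/{split}/input', f'./datasets/GoPro/{split}/target')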

You can also check the meta_info.txt files in the input and target .lmdb folders to see whether they are identical.
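For example, a minimal sketch that diffs the keys in the two meta_info.txt files (assuming gopro.py created blur_crops.lmdb and sharp_crops.lmdb under ./datasets/GoPro/train/; adjust the paths to wherever your .lmdb folders actually are):

# Minimal sketch: compare the keys listed in the input and target lmdb meta_info.txt.
# Each line is typically '<name>.png (<h>,<w>,<c>) <compress_level>'.
with open('./datasets/GoPro/train/blur_crops.lmdb/meta_info.txt') as f:
    lq_keys = {line.split()[0] for line in f if line.strip()}
with open('./datasets/GoPro/train/sharp_crops.lmdb/meta_info.txt') as f:
    gt_keys = {line.split()[0] for line in f if line.strip()}

print('only in blur_crops :', sorted(lq_keys - gt_keys))
print('only in sharp_crops:', sorted(gt_keys - lq_keys))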

Here are our meta_info.txt files for the training data and the testing data; hope they help:

meta_info.txt for the cropped training data:
https://drive.google.com/file/d/1G2lI_QX9iSKDQ7Ub2-frDxUe9IJaZczA/view?usp=sharing

meta_info.txt for the testing data:
https://drive.google.com/file/d/1Oj1wB8dAhxy-Cawymy2uhPDeSqypW8EI/view?usp=sharing

Thanks.

Thank you so much,
With your help, training now runs normally.
For the 24 GB GPU that I have, what would the ideal training settings be?
Thank you.

Hi, kimtaehyeong,

Glad to help!
It's hard to say what the optimal setting is for your environment, but a common practice would be:

  1. Make sure the GPU usage (not GPU memory) is fully exploited (near 100%) to speed up training.
  2. Choose the batch_size, crop_size, and number of iterations for your model.
  3. I recommend keeping the total number of pixels the model "sees" close to the baseline, which is 8 (gpus) x 8 (batch_size) x 256 x 256 (crop_size) x 400000 (iters); see the sketch after this list.
  4. Use a stable learning rate. You could test the model at, e.g., 1000 or 2000 iterations to see whether it meets your expectations.
  5. I recommend setting the testing crop size to the same value as the training crop size you chose.
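For point 3, a minimal sketch of matching the baseline pixel budget (the num_gpus, batch_size, and crop_size below are illustrative assumptions for a 24 GB card, not tested values):

# Baseline pixel budget: 8 (gpus) x 8 (batch_size) x 256 x 256 (crop_size) x 400000 (iters).
baseline_pixels = 8 * 8 * 256 * 256 * 400000

# Illustrative setting; tune batch_size/crop_size until GPU usage stays near 100%.
num_gpus = 1
batch_size = 16
crop_size = 256

# Iterations needed so the model "sees" roughly the same number of pixels as the baseline.
iters = baseline_pixels // (num_gpus * batch_size * crop_size * crop_size)
print(f'about {iters:,} iterations to match the baseline pixel budget')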

Thanks.

Thank you very much.
With your help, my training succeeded, and now I want to test my own images.

Here is my test command:

python basicsr/demo.py -opt options/demo/demo.yml

Since I trained with 3 GPUs, I set the GPU count to 3 in the .yml file as well.
The error is:

Disable distributed.
Traceback (most recent call last):
  File "basicsr/demo.py", line 46, in <module>
    main()
  File "basicsr/demo.py", line 40, in main
    model = create_model(opt)
  File "/home/ubuntu/project/HINet/basicsr/models/__init__.py", line 44, in create_model
    model = model_cls(opt)
  File "/home/ubuntu/project/HINet/basicsr/models/image_restoration_model.py", line 37, in __init__
    self.opt['path'].get('strict_load_g', True), param_key=self.opt['path'].get('param_key', 'params'))
  File "/home/ubuntu/project/HINet/basicsr/models/base_model.py", line 277, in load_network
    load_path, map_location=lambda storage, loc: storage)
  File "/home/ubuntu/anaconda3/envs/hinet/lib/python3.6/site-packages/torch/serialization.py", line 593, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/ubuntu/anaconda3/envs/hinet/lib/python3.6/site-packages/torch/serialization.py", line 780, in _legacy_load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 1607442 more bytes. The file might be corrupted.

How can I solve this?

Hi, kimtaehyeong,

demo.py is designed to run inference on one image with one GPU, even if you trained with 3 GPUs.
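As a side note, the "unexpected EOF ... The file might be corrupted" message usually means the checkpoint file on disk is truncated (e.g. an interrupted copy, or a run killed while saving). A minimal check, where ckpt_path is a placeholder you should replace with the pretrained model path set in your demo.yml:

import torch

# Placeholder path: replace with the checkpoint path configured in your demo.yml.
ckpt_path = 'experiments/HINet-GoPro/models/net_g_latest.pth'

# If this raises the same "unexpected EOF" error, the file itself is damaged;
# re-copy the checkpoint or re-save it from your training run.
state = torch.load(ckpt_path, map_location='cpu')
print('loaded object with keys:', list(state.keys())[:5])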

Thanks.

Thank you very much.
It worked out well.