lightly-ai/lightly

Unusual loss trend and 'OSError: [WinError 6] The handle is invalid' issue

Nutty233 opened this issue · 5 comments

Thank you for providing this code for single-GPU users.

I'm trying to train VGG16 on my own dataset (28,685 satellite images, 224 × 224 pixels, 4 bands) for semantic segmentation with MoCo v2.

To match the number of bands in my images, I changed the PIL image loading in the relevant '.py' files of the 'lightly' package from ("RGB") to ("RGBA"), and set the dynamically decreasing learning rate to start at 0.01 (I tried 0.06 and 0.03, but the loss wouldn't converge). I then used a VGG16 backbone and changed the parameters of 'MoCoProjectionHead' to (25088, 512, 128) (maybe I got things wrong here) to make the model run.
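Roughly, the setup looks like the sketch below (illustrative, not my exact code; the 4-channel first conv layer is an assumed change so VGG16 accepts the RGBA tensors, and the learning-rate schedule is left out):

```python
import torch
import torchvision
from torch import nn
from lightly.models.modules import MoCoProjectionHead

vgg = torchvision.models.vgg16()
# Assumed change: accept 4-band (RGBA) input instead of 3-band RGB,
# matching the switch from convert("RGB") to convert("RGBA").
vgg.features[0] = nn.Conv2d(4, 64, kernel_size=3, padding=1)

# Use only the convolutional part as the backbone: flattening the 512 x 7 x 7
# feature map of a 224 x 224 input gives the 25088 dimensions mentioned above.
backbone = nn.Sequential(vgg.features, vgg.avgpool, nn.Flatten())
projection_head = MoCoProjectionHead(25088, 512, 128)

params = list(backbone.parameters()) + list(projection_head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)  # starting lr; schedule omitted here
```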

However, the loss decreases slowly and gets stuck at about 7.9 (as shown in the attached figure). When the learning rate decreases, the loss increases, and later an 'OSError: [WinError 6] The handle is invalid' linked to 'queues.py' and 'connection.py' is raised and the training stops there.
[figure: moco_v2_train_loss]

Does anyone have any ideas or experience with this? I would appreciate any help.

Hi! Thanks for using lightly!

I never tried training a MoCo model with 4 channels but I would check the following things:

  • Make sure your image transforms are working as expected, especially image normalization (see the sketch after this list).
  • 25088 input dimensions for the MoCoProjectionHead sounds like too much. VGG16 usually outputs 4096-dimensional features, and the projection head should have the same input size; maybe check that the pooling layer works as expected.
  • 'OSError: [WinError 6] The handle is invalid' sounds like an issue with file handles (too many images opened at the same time). Make sure that images are opened and closed correctly, and increase the open file handle limit if necessary.
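A rough sketch of the first two points, assuming a torchvision VGG16 and 4-band input (the normalization statistics and the projection head hidden dimension below are placeholders, not recommended values):

```python
import torchvision
import torchvision.transforms as T
from torch import nn
from lightly.models.modules import MoCoProjectionHead

# 1) Normalize all 4 bands with statistics computed on your own data
#    (the numbers below are placeholders, not real satellite statistics).
normalize = T.Normalize(mean=[0.5, 0.5, 0.5, 0.5], std=[0.25, 0.25, 0.25, 0.25])

# 2) Keep VGG16 up to its 4096-dimensional fully connected features and match
#    the projection head input to that size.
vgg = torchvision.models.vgg16()
vgg.features[0] = nn.Conv2d(4, 64, kernel_size=3, padding=1)  # accept 4-band input
backbone = nn.Sequential(
    vgg.features,
    vgg.avgpool,
    nn.Flatten(),
    vgg.classifier[:-1],  # drop only the final 1000-class layer -> 4096-d output
)
projection_head = MoCoProjectionHead(4096, 2048, 128)  # hidden dim 2048 is just an example
```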

Hope this helps :)

Thank you for your suggestions! Really helpful.
The 'OSError: [WinError 6] The handle is invalid' problem still exists, but setting num_workers = 0 works around it. Once the model is training well, I will try to adjust the code again to use multiprocessing.
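For reference, the workaround looks roughly like this (with a dummy dataset standing in for my real data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for the 4-band satellite dataset
dataset = TensorDataset(torch.randn(8, 4, 224, 224))

dataloader = DataLoader(
    dataset,
    batch_size=4,
    shuffle=True,
    num_workers=0,  # single-process data loading; avoids the WinError 6 handle error for me
)
```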
I also adjusted the framework so that the input to the MoCo projection head is 4096, and I am now tuning the hyperparameters to get the loss to decrease.
Thanks again for pointing me toward a solution!

Closing this issue for now, please reopen if you encounter more errors.

@guarin Hello, I am using VicReg for medical image self-supervision and I noticed the same behavior in the loss: it decreases from 8 to 7.5 and then plateaus. I tried different settings and the behavior was the same. Did you find the cause of these loss values? Do you have any suggestions about the reasons behind this?
I also noticed that the same happens with DINO.

Hi, the loss curves should look like this:

MoCoV2

[figure: MoCoV2 training loss curve]

VicReg

[figure: VicReg training loss curve]

Dino

[figure: DINO training loss curve]

The curves are from running our benchmarks here: https://github.com/lightly-ai/lightly/tree/master/benchmarks/imagenet/resnet50

Did you follow the benchmark scripts or the examples in the docs when implementing the model? The examples in the docs are a bit simplified to run quickly on a single, small GPU, whereas the benchmark scripts are full reproductions of the original code with all the training tricks.