CEA-LIST/N2D2

Validation throws std::out_of_range

e-dupuis opened this issue · 2 comments

It seems that validation throws a std::out_of_range error when using the ImageNet dataset.
The weird thing is that the error does not appear when using the MNIST dataset.

Tested with both ONNX and standard INI models.
Tested with multiple model topologies.
Tested with the -learn, -test and -export modes.

I have pushed my Docker image online for reproduction purposes.

Steps to reproduce:
docker pull edupuis/n2d2
docker run --rm -it --gpus=all -v "<local_datasets>":"/local/n2d2_data/" -v $(pwd):/workspace -e /imagenet.sh edupuis/n2d2 bash
n2d2 $N2D2_MODELS/MobileNet_v1.ini -learn 10000

[...]
CUBLAS initialized on device #0
Learning #1024         0.16% at  522.77 p./s (  31366 p./min)         
Validationterminate called after throwing an instance of 'std::out_of_range'
  what():  basic_string::substr: __pos (which is 140) > this->size() (which is 0)
Aborted (core dumped)
[...]

Any idea where the error comes from? I can help debug if needed.

Hi,

I cannot reproduce the error with your docker image.
I suspect that there is an issue with your ILSVRC2012 dataset, because that's the only difference. In fact, I do not understand the purpose of your imagenet.sh script, which makes a symbolic link? (This is very dangerous, by the way, as it also tries to delete the train folder on the host!) If you are not using the true ILSVRC2012 dataset, the problem could come from there, because the list of expected ILSVRC2012 folders is loaded first...

In any case, you should not get this kind of error, which is not very descriptive...
What would help is a debug call trace. Could you build N2D2 in debug mode inside the Docker container and run the command with gdb?
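
For reference, the message in the log is the one libstdc++ produces when std::string::substr is called with a start position larger than the string's size (here position 140 on an empty string). The sketch below is only an illustration of that mechanism and of how a guard could give a more descriptive error. The helper names stripPrefixUnchecked and stripPrefixChecked are hypothetical and are not taken from the N2D2 sources, and where exactly N2D2 performs such a call is not known from this thread.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not N2D2 code): drop the first `prefixLen` characters
// of `path`, e.g. to strip a dataset root directory from a file name.
// std::string::substr throws std::out_of_range when prefixLen > path.size(),
// which is what happens for an empty path and prefixLen == 140.
std::string stripPrefixUnchecked(const std::string& path, std::size_t prefixLen)
{
    return path.substr(prefixLen);
}

// Same operation with an explicit check, so that the failure names the
// offending path instead of surfacing as a bare substr error.
std::string stripPrefixChecked(const std::string& path, std::size_t prefixLen)
{
    if (prefixLen > path.size()) {
        throw std::runtime_error(
            "Cannot strip " + std::to_string(prefixLen)
            + " characters from path \"" + path + "\" (length "
            + std::to_string(path.size())
            + "): is the dataset directory complete?");
    }

    return path.substr(prefixLen);
}
```

Calling stripPrefixUnchecked("", 140) throws std::out_of_range, while stripPrefixChecked("", 140) throws a std::runtime_error whose message includes the empty path, which points more directly at a missing or unreadable dataset entry.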

Olivier

Thank you for your answer,

For information, the imagenet.sh script is intended to avoid having several copies of the ImageNet dataset on my server, and I have never had any issue with symbolic links, as long as they are created and used only inside the Docker container.
You are right that rm -r is dangerous. I have fixed this by removing the -r, so that it deletes only the symbolic link and not a folder.

Thanks to your advice, I extracted the ImageNet dataset and now everything works fine. Maybe one of my JPEG files was corrupted, or something like that.

I can safely close the issue.

Etienne