Issue during training a model: "OSError: Unable to create file (file signature not found)."
davidroid opened this issue · 3 comments
Hello all, I tried to run the training of an image classification model available in the stm32ai-modelzoo, but hit the following issue: "OSError: Unable to create file (file signature not found)."
Setup:
- OS: Microsoft Windows 10 Enterprise
- VMware on Windows, running Linux virtual machine: Ubuntu 22.04LTS
- Python virtual environment 1: Python 3.10.6
- Python virtual environment 2: Python 3.9.3
- stm32ai service online
Training:
- Guide: https://github.com/STMicroelectronics/stm32ai-modelzoo/blob/main/image_classification/scripts/training/README.md
- Python virtual environments (both Python 3.10.6 and Python 3.9.3) created with the following dependencies: https://github.com/STMicroelectronics/stm32ai-modelzoo/blob/main/requirements.txt
- Model: MobileNetv1, 0.25, 128x128x3
- Dataset: https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
- Script: https://github.com/STMicroelectronics/stm32ai-modelzoo/blob/main/image_classification/scripts/training/train.py
Output:
- The script runs: it configures the experiment, connects to the stm32ai online service to convert and analyze the model, then hits the following error during the second training epoch: "OSError: Unable to create file (file signature not found)". This happens in both Python virtual environments (Python 3.10.6 and Python 3.9.3).
Attachments:
Hello @davidroid, we were not able to reproduce the issue on our side. However, after a bit of research, it looks like the issue is related to the set of callbacks we use during training. One of these callbacks saves or updates a model checkpoint at every epoch, keeping the model with the best validation accuracy. It looks like this file is locked for some reason. I found a similar problem here, along with the fix. Could you please run the following command from your terminal and see if it fixes the problem?
export HDF5_USE_FILE_LOCKING=FALSE
The details of the solution and what it does can be found here. In the meantime, could you please also tell us which versions of your OS, WSL, and Python you are using, so that we can try to reproduce the issue?
Let us know if the solution works. Thank you!
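If setting the variable in the shell is inconvenient, the same setting can be applied from inside Python. The sketch below is only an illustration: `disable_hdf5_file_locking` is a hypothetical helper name, not part of the model zoo. It is commonly recommended to set the variable before h5py or TensorFlow is imported; exactly when HDF5 reads it has not been verified here.

```python
import os


def disable_hdf5_file_locking(env=os.environ):
    """Set HDF5_USE_FILE_LOCKING=FALSE in the given environment mapping.

    Hypothetical helper for illustration; equivalent to running
    `export HDF5_USE_FILE_LOCKING=FALSE` in the shell before training.
    """
    env["HDF5_USE_FILE_LOCKING"] = "FALSE"
    return env["HDF5_USE_FILE_LOCKING"]


# Call this at the very top of the entry script, before any import
# that may load HDF5 (for example h5py or tensorflow).
disable_hdf5_file_locking()
```

Child processes started afterwards inherit the value, so this should also cover training scripts launched with `subprocess` from the same interpreter.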
Hello @Shahnawax, I ran the training again after exporting the variable you suggested, but unfortunately nothing changed: I got the same error.
I have updated the previous comment with the release of the OS, which is Microsoft Windows 10 Enterprise.
The issue is most likely caused by access-rights (permission) problems with the checkpoint file. We discussed the fix with the reporter and tested on multiple platforms to confirm that the issue does not exist.
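For anyone hitting the same error, a quick way to check the checkpoint location is sketched below. It rests on assumptions: `diagnose_checkpoint` and the path `best_model.h5` are hypothetical names, and the signature check only looks at offset 0, so it does not cover HDF5 files that start with a user block. The error text suggests HDF5 found a file that does not begin with its 8-byte signature, which is why the sketch checks that alongside write permissions.

```python
import os

# First 8 bytes of an HDF5 file that has no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def diagnose_checkpoint(path):
    """Report basic permission and format facts about a checkpoint path."""
    directory = os.path.dirname(os.path.abspath(path))
    report = {
        "dir_exists": os.path.isdir(directory),
        "dir_writable": os.access(directory, os.W_OK),
        "file_exists": os.path.isfile(path),
        "file_writable": None,       # stays None if the file does not exist
        "has_hdf5_signature": None,  # stays None if the file cannot be read
    }
    if report["file_exists"]:
        report["file_writable"] = os.access(path, os.W_OK)
        try:
            with open(path, "rb") as f:
                report["has_hdf5_signature"] = f.read(8) == HDF5_SIGNATURE
        except OSError:
            # Unreadable file: leave the signature check undetermined.
            pass
    return report


# Example usage (hypothetical path):
# print(diagnose_checkpoint("best_model.h5"))
```

If `file_exists` is true but `has_hdf5_signature` is false, deleting or renaming the stale file before restarting training is a reasonable thing to try. If `dir_writable` or `file_writable` is false, the ownership or mode of the output directory is the first thing to look at.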