rwightman/efficientdet-pytorch

[BUG] OSError: [Errno 38] Function not implemented

irinushirka opened this issue · 4 comments

Hi! I'm trying to train tf_efficientdet_d3 with your code on custom coco-like data. I'm training on Google Colab.
!python3 train.py '/content/drive/MyDrive/dataset' --model tf_efficientdet_d3 --num-classes 60 --pretrained -b 1 --save-images --log-interval 100 --epochs 10 --output '/content/drive/MyDrive/Colab Notebooks/PyTorch training/'
After the first epoch, I got this error:

Traceback (most recent call last):
  File "train.py", line 656, in <module>
    main()
  File "train.py", line 435, in main
    best_metric, best_epoch = saver.save_checkpoint(epoch=epoch, metric=eval_metrics[eval_metric])
  File "/usr/local/lib/python3.7/dist-packages/timm-0.4.5-py3.7.egg/timm/utils/checkpoint_saver.py", line 78, in save_checkpoint
    os.link(last_save_path, save_path)
OSError: [Errno 38] Function not implemented: '/content/drive/MyDrive/Colab Notebooks/PyTorch training/train/20210311-161812-tf_efficientdet_d3/last.pth.tar' -> '/content/drive/MyDrive/Colab Notebooks/PyTorch training/train/20210311-161812-tf_efficientdet_d3/checkpoint-0.pth.tar'

Something went wrong while saving the checkpoint. I'd be grateful for a solution or any tips that might help me solve it. Thanks!

@irinushirka The Google Drive mount in Colab isn't a normal filesystem; it's a FUSE filesystem on top of cloud storage and doesn't support hard links, which the saver relies on for robust checkpoint saving (crash recovery). I'm aware of the problem but don't currently have a solution.

Looking out a few weeks to a month from now I plan to support saving into google storage buckets.
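For anyone who wants to patch this locally in the meantime, here is a minimal sketch of a link-with-fallback helper. It is not what timm's `checkpoint_saver.py` does; `link_or_copy` is a hypothetical name, and the set of errno values treated as "hard links unsupported" is an assumption based on the `Errno 38` (`ENOSYS`) in the traceback above.

```python
import errno
import os
import shutil

# Assumption: these errno values mean "this filesystem cannot hard-link here".
# ENOSYS (38 on Linux) is the one reported in the traceback in this issue.
_NO_HARDLINK_ERRNOS = (errno.ENOSYS, errno.EPERM, errno.EXDEV)


def link_or_copy(src, dst):
    """Hard-link src to dst; fall back to a full copy if linking is unsupported.

    Hypothetical helper, not part of timm. Returns "link" or "copy".
    """
    try:
        os.link(src, dst)
        return "link"
    except OSError as exc:
        if exc.errno not in _NO_HARDLINK_ERRNOS:
            raise  # unrelated failure (missing file, dst exists, ...)
        # A copy costs extra time and space, but works on FUSE mounts.
        shutil.copy2(src, dst)
        return "copy"
```

Replacing the `os.link(last_save_path, save_path)` call with something like this should let the save proceed on a Drive mount, at the cost of writing each checkpoint twice.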

Hi! It's not a permanent solution, but to get it working for now so you can look at your results, you can change your output directory to a local Colab path:
output_dir = "/content/output"
The checkpoints will then be saved on the Colab VM's local disk. The downside is that you have to download them manually before the session ends, otherwise you lose them forever. I did that and it has worked fine so far.
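If downloading by hand is a nuisance, the copy to Drive can be scripted from a notebook cell. The following is only a sketch under assumptions: `sync_checkpoints` is a hypothetical helper, the `*.pth.tar` pattern is taken from the file names in the traceback, and the skip-if-unchanged check relies on the destination preserving modification times, which a Drive mount may not do reliably.

```python
import shutil
from pathlib import Path


def sync_checkpoints(local_dir, drive_dir, pattern="*.pth.tar"):
    """Copy checkpoint files from local_dir to drive_dir, keeping subfolders.

    Hypothetical helper. Uses plain copies (no hard links), so it should work
    on filesystems that reject os.link. Returns the list of files written.
    """
    local_dir, drive_dir = Path(local_dir), Path(drive_dir)
    copied = []
    for src in sorted(local_dir.rglob(pattern)):
        dst = drive_dir / src.relative_to(local_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        if dst.exists():
            s, d = src.stat(), dst.stat()
            # Skip files that look unchanged (same size, not older).
            if d.st_size == s.st_size and d.st_mtime >= s.st_mtime:
                continue
        shutil.copy2(src, dst)  # copy2 also copies the modification time
        copied.append(dst)
    return copied
```

In Colab this could be called after training, for example `sync_checkpoints("/content/output", "/content/drive/MyDrive/checkpoints")`, where the Drive folder name is just an example.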

Thanks, man, you saved my day. It actually works.

Absurd as it is, we still need to manually copy the weights to Google Drive for now.