FfDL v0.1.1 model training error
bleachzk opened this issue · 4 comments
I trained the model with the following command:
$CLI_CMD train etc/examples/tf-model/manifest-hostmount.yml etc/examples/tf-model
The hostmount learner pod fails with the following error:
Starting Training training-PYCOsfJmg
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/usr/lib/python2.7/zipfile.py", line 1541, in <module>
main()
File "/usr/lib/python2.7/zipfile.py", line 1512, in main
with ZipFile(args[1], 'r') as zf:
File "/usr/lib/python2.7/zipfile.py", line 756, in __init__
self.fp = open(file, modeDict[mode])
IOError: [Errno 2] No such file or directory: '/mnt/results/results/training-PYCOsfJmg/_submitted_code/model.zip'
Done load-model
Hi @bleachzk, sorry for the confusion. manifest-hostmount.yml is meant for our CI tests and is not prepared for general users. It targets the dev environment, where the job reads everything it needs for training from a hostpath on the worker node, so it assumes all the necessary files are already in the specified hostpath directory.
If you want to run a regular training job with this example, please refer to manifest.yml or gpu-manifest.yml.
Please let me know if you have further questions. Thank you.
I am using hostmount for testing. With v0.1 hostmount works, but with v0.1.1 there seems to be a path configuration error:
IOError: [Errno 2] No such file or directory: '/mnt/results/results/training-PYCOsfJmg/_submitted_code/model.zip'
What changed in hostmount between v0.1 and v0.1.1?
@Tomcli
For v0.1.1, we expect all the model definition files to be packaged in a zip file and stored at the path /mnt/results/results/training-PYCOsfJmg/_submitted_code/model.zip. FfDL does this automatically with cos_mount, but with hostmount you have to place the packaged zip file in the mounted host path yourself (e.g. /cosdata/local-dlaas-ci-trained-results-tf-training-data/_submitted_code/model.zip).
You can refer to this Makefile for more details on how we do it on Travis CI: https://github.com/IBM/FfDL/blob/master/Makefile#L494
Please let me know if you have further questions. Thank you.
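For anyone hitting the same IOError with hostmount, the manual packaging step described above could be sketched roughly as follows. This is only an illustration, not the FfDL implementation: the helper name package_model is hypothetical, and placing the files at the root of the archive is an assumption, so check the linked Makefile for the layout the learner actually expects.

```python
import os
import zipfile


def package_model(model_dir, host_results_dir):
    """Zip model_dir into <host_results_dir>/_submitted_code/model.zip.

    Hypothetical helper: the name and the archive layout are assumptions,
    not part of FfDL.
    """
    target_dir = os.path.join(host_results_dir, "_submitted_code")
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)

    zip_path = os.path.join(target_dir, "model.zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(model_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store each file relative to model_dir so that it sits at
                # the root of the archive (assumed layout).
                zf.write(full_path, os.path.relpath(full_path, model_dir))
    return zip_path
```

With the paths mentioned in this thread, a call might look like package_model("etc/examples/tf-model", "/cosdata/local-dlaas-ci-trained-results-tf-training-data"). How that host directory maps to /mnt/results/results/training-PYCOsfJmg inside the learner pod depends on the manifest, so treat the target directory as a placeholder to adapt.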