Saving files
Nimi42 opened this issue · 2 comments
I tried to use the Horovod recipe. The std output works just fine but I can't
seem to save the model.
What do I have to do to save the files to an output directory on the storage?
The job.json defines an output directory, but it stays empty even after a successful run.
"outputDirectories": [
    {
        "createNew": true,
        "id": "MODEL",
        "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
        "pathSuffix": "Models",
        "type": "custom"
    }
],
I tried saving something to
'./MODEL/example.txt'
explicitly, but that also did not work.
What am I missing?
EDIT:
I checked the VM and the results are definitely there somewhere in /mnt/batch/tasks/workitems...
I think. Should I save them by hand to $AZ_BATCHAI_MOUNT_ROOT or how do I get these
files into the storage?
- Until now I thought I had to start mpirun with the number of processes and servers that I
want to use. How come the mpirun command in job.json does not define any of these?
e.g.
"commandLine": "mpirun -mca btl_tcp_if_exclude docker0,lo --allow-run-as-root --hostfile $AZ_BATCHAI_MPI_HOST_FILE python $AZ_BATCHAI_INPUT_SCRIPTS/tensorflow_mnist.py"
Hi,
The output directory setting creates a unique job output directory in your file share storage, and you can refer to it in the job through the environment variable $AZ_BATCHAI_OUTPUT_<id> (in your case, $AZ_BATCHAI_OUTPUT_MODEL).
Your training script is responsible for saving the model file to the desired destination. In the Horovod recipe, we use the official Horovod sample tensorflow_mnist.py, which saves checkpoints relative to the current working directory:
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
You have to modify the script to something like:
# needs "import os" at the top of the script
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint(os.path.join(os.environ['AZ_BATCHAI_OUTPUT_MODEL'], 'checkpoint-{epoch}.h5')))
Then you should be able to see your model output in your share.
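To keep the script runnable outside Batch AI as well, the lookup can be wrapped in a small helper. This is only a sketch: the function name `output_path` and its `fallback` parameter are made up for illustration and are not part of the recipe. It assumes the variable follows the $AZ_BATCHAI_OUTPUT_MODEL naming described above.

```python
import os


def output_path(filename, output_id="MODEL", fallback="."):
    """Build a path for `filename` inside a Batch AI output directory.

    Hypothetical helper: reads AZ_BATCHAI_OUTPUT_<output_id> and falls
    back to `fallback` when the variable is not set (e.g. a local run).
    """
    base = os.environ.get("AZ_BATCHAI_OUTPUT_" + output_id, fallback)
    # Create the directory if it is missing; harmless if it already exists.
    os.makedirs(base, exist_ok=True)
    return os.path.join(base, filename)
```

With that in place, the checkpoint callback could be created as keras.callbacks.ModelCheckpoint(output_path('checkpoint-{epoch}.h5')).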
We use an MPI host file instead of specifying the number of processes explicitly:
--hostfile $AZ_BATCHAI_MPI_HOST_FILE
The file is auto-generated by Batch AI in the format of:
host1 #proc max_slot
host2 #proc max_slot
...
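To illustrate why no process count is needed on the command line: when -np is omitted, Open MPI's mpirun starts one process per slot listed in the host file. The sketch below adds up those slots. It assumes the generated file uses Open MPI's usual "host slots=N max_slots=M" syntax, which may not match the exact file Batch AI writes; the function name `total_slots` is hypothetical.

```python
def total_slots(hostfile_text):
    """Sum the slots=N values in an Open MPI style host file.

    Assumption: each non-empty line is "<host> slots=N [max_slots=M]".
    A host without a slots= field is counted as 1 in this sketch; real
    Open MPI versions may pick a different default.
    """
    total = 0
    for raw in hostfile_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # '#' starts a comment
        if not line:
            continue
        slots = 1
        for field in line.split()[1:]:  # skip the host name
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        total += slots
    return total
```

For a job on two single-GPU nodes, a file with two lines of slots=1 would give 2, so mpirun would start two processes.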
Hi,
Did you get an answer that applies to the Keras recipe? I mean, how can I get the model file as well as the TensorBoard log files when using the Keras recipe?