microsoftarchive/BatchAI

Saving files

Nimi42 opened this issue · 2 comments

I tried to use the Horovod recipe. The std output works just fine but I can't
seem to save the model.

What do I have to do to save the files to some output dir on the storage?

The job.json defines an output directory, but it stays empty even after a successful run.

"outputDirectories": [
      {
        "createNew": true,
        "id": "MODEL",
        "pathPrefix": "$AZ_BATCHAI_MOUNT_ROOT/external",
        "pathSuffix": "Models",
        "type": "custom"
      }
],

I tried saving something to

'./MODEL/example.txt'

explicitly, but that also did not work.
What am I missing?

EDIT:
I checked the VM and the results are definitely there somewhere in /mnt/batch/tasks/workitems...
I think. Should I save them by hand to $AZ_BATCHAI_MOUNT_ROOT or how do I get these
files into the storage?


  1. Until now I thought I had to start mpirun with the number of processes and servers that I
    want to use. Why does the mpirun command in the job.json not define any of these?

e.g.

"commandLine": "mpirun -mca btl_tcp_if_exclude docker0,lo --allow-run-as-root --hostfile $AZ_BATCHAI_MPI_HOST_FILE python $AZ_BATCHAI_INPUT_SCRIPTS/tensorflow_mnist.py"

Hi,

The output directory setting creates a unique job output directory in your file share storage, and you can use it in the job via the environment variable $AZ_BATCHAI_OUTPUT_<id> (in your case, $AZ_BATCHAI_OUTPUT_MODEL).

Your training script is responsible for saving the model file to a specified destination. In the Horovod recipe, we use the official Horovod sample tensorflow_mnist.py, where the checkpoint is saved to the current working directory:

if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

You have to modify the script to something like:

import os  # at the top of the script, needed for os.path.join and os.environ

if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint(os.path.join(os.environ['AZ_BATCHAI_OUTPUT_MODEL'], 'checkpoint-{epoch}.h5')))

Then you should be able to see your model output in your share.
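If the same script should also run outside Batch AI, where $AZ_BATCHAI_OUTPUT_MODEL is not set, reading the variable directly raises a KeyError. A minimal sketch of one way to handle that follows; the helper name batchai_output_path and the fallback to the current directory are assumptions for illustration, not part of the recipe or of Batch AI.

```python
import os


def batchai_output_path(output_id, filename, fallback_dir="."):
    """Return a path for `filename` inside the output directory `output_id`.

    Reads AZ_BATCHAI_OUTPUT_<output_id>. If the variable is not set (for
    example when running locally), `fallback_dir` is used instead; that
    fallback is an assumption of this sketch, not Batch AI behaviour.
    """
    base = os.environ.get("AZ_BATCHAI_OUTPUT_" + output_id, fallback_dir)
    os.makedirs(base, exist_ok=True)  # make sure the target directory exists
    return os.path.join(base, filename)
```

In the training script, the checkpoint path could then be built with batchai_output_path('MODEL', 'checkpoint-{epoch}.h5') and passed to keras.callbacks.ModelCheckpoint as before.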

We use an MPI host file instead of specifying the number of processes:

--hostfile $AZ_BATCHAI_MPI_HOST_FILE

The file is auto-generated by Batch AI in the format of:

host1 #proc max_slot
host2 #proc max_slot
...
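When no process count is given, Open MPI by default starts one process per slot listed in the host file, which is why the command line needs no explicit count. To check how many processes that would be, the slots can be summed as in the sketch below. It assumes the generated file uses the standard Open MPI syntax (host name followed by slots=N and optionally max_slots=N), which may differ from what Batch AI actually writes; count_slots is a hypothetical helper.

```python
def count_slots(hostfile_text):
    """Sum the `slots=N` values in Open MPI style hostfile text.

    A host line without a `slots=` entry is counted as one slot.
    The `slots=N` syntax is an assumption about the generated file.
    """
    total = 0
    for raw in hostfile_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # skip blank or comment-only lines
        slots = 1
        for token in line.split()[1:]:  # tokens after the host name
            key, _, value = token.partition("=")
            if key == "slots" and value.isdigit():
                slots = int(value)
        total += slots
    return total
```

Inside a job, the text could be read from the path stored in $AZ_BATCHAI_MPI_HOST_FILE and passed to count_slots.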

Hi,
Did you get an answer regarding the Keras recipe? I mean, how can I get the model file as well as the TensorBoard log files when using the Keras recipe?