How to save non-model artifacts from a container (output_data_dir)
diegodebrito opened this issue · 2 comments
What did you find confusing? Please describe.
I am using Docker images with my custom training algorithm. I would like to send information back from the container (for example, saving the coefficients from the fitted model to an S3 bucket). I am confused about the use of the /opt/ml/output folder. I understand that files in /opt/ml/model will be saved to the output bucket, but what happens to files in /opt/ml/output? Can they be saved as well?
Describe how documentation can be improved
There is very little information about the output_data_dir. This directory is defined in the environment.py module in the source code, but it seems unused in the other modules. How does SageMaker save files from the output folder to S3?
There is also very little information about the use of /opt/ml/output besides the failure file.
Hi @diegodebrito, training output files should be written to /opt/ml/output (depending on the purpose of the file). You can also write extra files to /opt/ml/model, and they will be uploaded to S3 after training completes. For more information, please refer to this documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html
I agree, there is little information about /opt/ml/output.
I found that you can use the /opt/ml/output/data directory to store your files (I personally store my actual training logs there), and it will be uploaded as output.tar.gz under the output directory in the bucket.
But I ran into another issue when I tried multi-instance training: I saw that only one machine actually uploads the stored files to the bucket.