This problem uses the ResNet-50 CNN to do image classification.
Download the dataset manually by following the instructions on the ImageNet website. We use the non-resized ImageNet dataset, packed into an MXNet RecordIO database. The raw ImageNet JPEGs are neither resized nor normalized; no preprocessing is applied to them.
For further instructions, see https://github.com/NVIDIA/DeepLearningExamples/blob/master/MxNet/Classification/RN50v1.5/README.md#prepare-dataset .
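Before launching a run, it can help to confirm that the data directory actually holds the packed RecordIO files. The following is a minimal sketch of such a check. The helper name `check_dataset_dir` and the file names `train.rec`, `train.idx`, `val.rec` and `val.idx` are assumptions, not something this README specifies; adjust them to whatever your packing step produced.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that a data directory contains non-empty
# RecordIO files. The four file names below are ASSUMED, not taken from
# this README.
check_dataset_dir() {
    local datadir="$1"
    local missing=0
    local f
    for f in train.rec train.idx val.rec val.idx; do
        # -s is true only if the file exists and has a size greater than zero
        if [ ! -s "${datadir}/${f}" ]; then
            echo "missing or empty: ${datadir}/${f}" >&2
            missing=1
        fi
    done
    return "${missing}"
}
```

A call such as `check_dataset_dir <path/to/data/dir>` returns zero only when all four files are present and non-empty, and lists the missing ones on stderr otherwise.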
Launch configuration and system-specific hyperparameters for the NVIDIA DGX-1 single-node submission are in the config_DGX1.sh script.
Steps required to launch single-node training on NVIDIA DGX-1:
docker build --pull -t mlperf-nvidia:image_classification .
DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX1 ./run.sub
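The launch line above depends on three environment variables, and a typo in any of them tends to surface only after the container has started. The sketch below is one way to check them up front. The function name `preflight` is hypothetical, and it assumes the config script is named `config_<DGXSYSTEM>.sh` and sits in the current directory; it does not start training itself.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for a run. It only inspects DATADIR, LOGDIR
# and DGXSYSTEM; it never launches anything.
preflight() {
    local failed=0
    if [ -z "${DATADIR:-}" ] || [ ! -d "${DATADIR}" ]; then
        echo "DATADIR is unset or not a directory" >&2
        failed=1
    fi
    if [ -z "${LOGDIR:-}" ]; then
        echo "LOGDIR is unset" >&2
        failed=1
    elif ! mkdir -p "${LOGDIR}"; then
        echo "cannot create LOGDIR: ${LOGDIR}" >&2
        failed=1
    fi
    # ASSUMPTION: the config script lives in the current directory and is
    # named config_<DGXSYSTEM>.sh.
    if [ ! -f "config_${DGXSYSTEM:-}.sh" ]; then
        echo "no config script found for DGXSYSTEM='${DGXSYSTEM:-}'" >&2
        failed=1
    fi
    return "${failed}"
}
```

To use it, export the three variables, call `preflight`, and run `./run.sub` only if it returns zero.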
Launch configuration and system-specific hyperparameters for the NVIDIA DGX-2 single-node submission are in the config_DGX2.sh script.
Steps required to launch single-node training on NVIDIA DGX-2:
docker build --pull -t mlperf-nvidia:image_classification .
DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX2 ./run.sub
Launch configuration and system-specific hyperparameters for the NVIDIA DGX-1 multi-node submission are in the config_DGX1_multi.sh script.
Steps required to launch multi-node training on NVIDIA DGX-1:
- Build the Docker container and push it to a Docker registry
docker build --pull -t <docker/registry>/mlperf-nvidia:image_classification .
docker push <docker/registry>/mlperf-nvidia:image_classification
- Launch the training
source config_DGX1_multi.sh && CONT="<docker/registry>/mlperf-nvidia:image_classification" DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX1_multi sbatch -N $DGXNNODES -t $WALLTIME --ntasks-per-node $DGXNGPU run.sub
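The launch command above takes its node count, wall time and tasks per node from variables set by the sourced config script. To see what would be submitted without queuing a job, a dry-run helper along these lines may be useful. The function name `print_sbatch_cmd` is hypothetical; it only echoes the command and never calls `sbatch`.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the sbatch command instead of running it.
# DGXNNODES, WALLTIME and DGXNGPU are expected to be set by the sourced
# config script, as in the launch command above.
print_sbatch_cmd() {
    if [ -z "${DGXNNODES:-}" ] || [ -z "${WALLTIME:-}" ] || [ -z "${DGXNGPU:-}" ]; then
        echo "DGXNNODES, WALLTIME and DGXNGPU must be set; source the config script first" >&2
        return 1
    fi
    echo "sbatch -N ${DGXNNODES} -t ${WALLTIME} --ntasks-per-node ${DGXNGPU} run.sub"
}
```

Running `source config_DGX1_multi.sh && print_sbatch_cmd` shows the resolved values, which makes it easier to spot a config that was not sourced or that requests more nodes than intended.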
Launch configuration and system-specific hyperparameters for the NVIDIA DGX-2 multi-node submission are in the config_DGX2_multi.sh script.
Steps required to launch multi-node training on NVIDIA DGX-2:
- Build the Docker container and push it to a Docker registry
docker build --pull -t <docker/registry>/mlperf-nvidia:image_classification .
docker push <docker/registry>/mlperf-nvidia:image_classification
- Launch the training
source config_DGX2_multi.sh && CONT="<docker/registry>/mlperf-nvidia:image_classification" DATADIR=<path/to/data/dir> LOGDIR=<path/to/output/dir> DGXSYSTEM=DGX2_multi sbatch -N $DGXNNODES -t $WALLTIME --ntasks-per-node $DGXNGPU run.sub