dl4j-dev-tools

Deeplearning4j Benchmarks

Benchmarks popular models and configurations on Deeplearning4j, and outputs performance and versioning statistics.

Core Benchmarks

Available benchmarks:

Performance benchmarks:
- CNN benchmarks: benchmarks for a number of CNN models on random data.
- MLP/RNN benchmarks: benchmarks for some simple MLP and RNN models on random data.
- BenchmarkCustom: benchmarks for CNN models with custom image data.
Memory benchmarks:
- CNN memory benchmarks: used to measure the memory requirements of CNN inference and training.

Running Benchmarks

Multiple versions of DL4J can be benchmarked in this repo using Maven profiles:

0.9.1 (profile name: v091)
1.0.0-alpha (profile name: v100alpha)
1.0.0-beta (profile name: v100beta)
1.0.0-beta3 (profile name: v100beta3)
Master/snapshots (profile name: v100snapshot)

Furthermore, multiple backends can be configured:

Native (profile name: native)
CUDA 8 (profile name: cuda8)
CUDA 9.1 (profile name: cuda91 - can only be used with 1.0.0-alpha/beta and master/snapshots)
CUDA 10.0 (profile name: cuda10 - can only be used with 1.0.0-beta3 and master/snapshots)
CUDA 8 with cuDNN (profile name: cudnn8)
CUDA 9.1 with cuDNN (profile name: cudnn91 - can only be used with 1.0.0-alpha/beta)
CUDA 10.0 with cuDNN (profile name: cudnn10 - can only be used with 1.0.0-beta3 and master/snapshots)

These Maven profiles allow any supported combination of backend and DL4J version to be run. Profiles are specified at build time, so you must build the repository before running benchmarks.

For example, to build the benchmark repo with the ND4J native backend for v0.9.1, use:

mvn package -Pnative,v091 -DskipTests

Similarly, to build for v1.0.0-beta3 with CUDA 10.0 + cuDNN, use:

mvn package -Pcudnn10,v100beta3 -DskipTests
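
The profile rules above can be sketched as a small shell helper that prints the build command for a backend/version pair and rejects pairs the lists above rule out. This is only a sketch: the function name build_cmd is hypothetical, "1.0.0-alpha/beta" is read as the v100alpha and v100beta profiles, and backends with no stated restriction (native, cuda8, cudnn8) are assumed to accept every listed version, which may not hold in practice.

```shell
#!/bin/sh
# Hypothetical helper: print the Maven build command for a profile pair,
# or fail if the README's compatibility notes exclude the combination.
build_cmd() {
    bc_backend=$1
    bc_version=$2

    # Only the version profiles listed in this README are accepted.
    case "$bc_version" in
        v091|v100alpha|v100beta|v100beta3|v100snapshot) ;;
        *) echo "unknown version profile: $bc_version" >&2; return 1 ;;
    esac

    case "$bc_backend" in
        native|cuda8|cudnn8)
            # ASSUMPTION: no restriction is stated, so nothing is rejected.
            bc_ok=yes ;;
        cuda91)
            case "$bc_version" in
                v100alpha|v100beta|v100snapshot) bc_ok=yes ;;
                *) bc_ok=no ;;
            esac ;;
        cudnn91)
            case "$bc_version" in
                v100alpha|v100beta) bc_ok=yes ;;
                *) bc_ok=no ;;
            esac ;;
        cuda10|cudnn10)
            case "$bc_version" in
                v100beta3|v100snapshot) bc_ok=yes ;;
                *) bc_ok=no ;;
            esac ;;
        *) echo "unknown backend profile: $bc_backend" >&2; return 1 ;;
    esac

    if [ "$bc_ok" != yes ]; then
        echo "unsupported combination: $bc_backend with $bc_version" >&2
        return 1
    fi
    echo "mvn package -P$bc_backend,$bc_version -DskipTests"
}

build_cmd native v091
build_cmd cudnn10 v100beta3
```

The helper only prints the command, so it is safe to try out; pipe the output to sh to actually run the build.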

Finally, to run the benchmarks, use the following:

mvn package -Pcudnn10,v100beta3 -DskipTests
cd dl4j-core-benchmark
java -cp dl4j-core-benchmark-v100beta3_cuda10-cudnn.jar org.deeplearning4j.benchmarks.BenchmarkCnn --modelType ALEXNET --batchSize 32

*** NOTE: The JAR file name encodes which profiles (version + backend) were used when building ***
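
To check which profiles a given JAR was built with, the name can be split into its parts. The sketch below assumes names follow the shape dl4j-core-benchmark-<version>_<backend>.jar, which is inferred from the single example above; other profile combinations may produce names this does not handle. The function name jar_profiles is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: split a benchmark JAR name into version and backend.
# ASSUMPTION: the name looks like dl4j-core-benchmark-<version>_<backend>.jar.
jar_profiles() {
    jp_name=${1##*/}                          # drop any leading directory
    jp_stem=${jp_name%.jar}                   # drop the .jar extension
    jp_tail=${jp_stem#dl4j-core-benchmark-}   # drop the artifact prefix

    # Fail if the prefix was missing or there is no version/backend separator.
    if [ "$jp_tail" = "$jp_stem" ]; then
        echo "unrecognized JAR name: $1" >&2
        return 1
    fi
    case "$jp_tail" in
        *_*) ;;
        *) echo "unrecognized JAR name: $1" >&2; return 1 ;;
    esac

    echo "version=${jp_tail%%_*} backend=${jp_tail#*_}"
}

jar_profiles dl4j-core-benchmark-v100beta3_cuda10-cudnn.jar
```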

For the full list of configuration options, see the configuration section below.

*** NOTE: There is also a benchmark script to compare backends: see scripts/benchmark.sh ***

Running Benchmarks in IntelliJ

In the same way as building/running through Maven, running the benchmark repos through IntelliJ requires the selection of two Maven profiles (one for the backend, one for the version). Link: Setting Maven Profiles

Additionally, IntelliJ does not properly handle the version-specific code configured using the Maven build helper plugin. Consequently, you will need to exclude the irrelevant directories.

For example, when running with profile v091 you should exclude the v100alpha and v100snapshot directories. You can do this by finding the directory in the project window -> right click -> Mark Directory as -> Excluded. To switch between versions (after previously marking as excluded), switch the Maven profiles as before, then cancel the exclusion on the source directory, and mark that same directory as a sources root (both using the same right click menu).

Configuring Benchmarks

Benchmarks have a number of configuration options, with defaults for most values.

Performance benchmarks:

modelType:
- ALL
- CNN
- SIMPLECNN
- ALEXNET
- LENET
- GOOGLELENET
- VGG16
- INCEPTIONRESNETV1
- FACENETNN4
- RNN
- MLP_SMALL
- RNN_SMALL
numLabels: output size for the network
totalIterations: Number of iterations to perform
batchSize: Minibatch size (number of examples) for benchmarks
gcWindow: Garbage collection frequency/window
profile: If true, run the ND4J op profiler and report results. Has considerable performance overhead, but provides performance information on a per-operation basis
cacheMode: DL4J CacheMode to use
workspaceMode: DL4J WorkspaceMode to use
updater: Updater to use (for example, NONE, ADAM, NESTEROVS, SGD, etc.)
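
As an illustration of combining these options, the sketch below prints one benchmark command per model and batch size. The JAR name and main class are taken from the run example earlier in this README; the function name sweep_cmds and the particular models and batch sizes are illustrative choices, not a recommended configuration.

```shell
#!/bin/sh
# JAR and main class as used in the run example in this README.
JAR=dl4j-core-benchmark-v100beta3_cuda10-cudnn.jar
MAIN=org.deeplearning4j.benchmarks.BenchmarkCnn

# Hypothetical helper: print one command per (modelType, batchSize) pair.
# $1: space-separated modelType values; $2: space-separated batch sizes.
sweep_cmds() {
    for sw_model in $1; do
        for sw_batch in $2; do
            echo "java -cp $JAR $MAIN --modelType $sw_model --batchSize $sw_batch"
        done
    done
}

sweep_cmds "ALEXNET VGG16" "16 32"
```

Printing the commands first makes it easy to review the sweep before committing to a long run; pipe the output to sh from the directory containing the JAR to execute it.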

Memory benchmarks:

modelType: As per performance benchmarks
memoryTest: Type of test to run: TRAINING or INFERENCE
numLabels: output size for the network
batchSizes: Minibatch sizes (note: multiple are possible) to benchmark. For multiple, use space separated: --batchSizes 8 16 32
gcWindow: Garbage collection frequency/window
cacheMode: DL4J CacheMode to use
workspaceMode: DL4J WorkspaceMode to use
updater: Updater to use (for example, NONE, ADAM, NESTEROVS, SGD, etc.)
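
A memory benchmark argument list with several batch sizes might be assembled as sketched below. Only --modelType and --batchSizes appear as flags in this README; --memoryTest is assumed to follow the same naming, and the function name memory_args is hypothetical. The main class for memory benchmarks is not named here, so the sketch builds the arguments only.

```shell
#!/bin/sh
# Hypothetical helper: build the argument string for a memory benchmark.
# Usage: memory_args <modelType> <TRAINING|INFERENCE> <batchSize>...
# ASSUMPTION: --memoryTest is the flag form of the memoryTest option.
memory_args() {
    ma_model=$1
    ma_test=$2
    shift 2

    case "$ma_test" in
        TRAINING|INFERENCE) ;;
        *) echo "memoryTest must be TRAINING or INFERENCE" >&2; return 1 ;;
    esac

    # This sketch expects explicit batch sizes rather than relying on defaults.
    if [ "$#" -eq 0 ]; then
        echo "give at least one batch size" >&2
        return 1
    fi

    # Remaining arguments are the batch sizes, passed space separated.
    echo "--modelType $ma_model --memoryTest $ma_test --batchSizes $*"
}

memory_args ALEXNET TRAINING 8 16 32
```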

Top Benchmarks

The following benchmarks have been run using the SNAPSHOT version of DL4J 0.9.1. This version uses workspaces and is significantly faster for inference than 0.8.0. The number of labels used for benchmarks was 1000. Note that for full training iteration timings, the number of labels and the batch size impact updater timing. CUDA_VISIBLE_DEVICES has been set to 1.

AlexNet 16x3x224x224

The AlexNet batch 16 benchmark below was developed as a comparison to: https://github.com/jcjohnson/cnn-benchmarks. Note that the linked benchmarks do not provide values for training iterations.

DL4J summary (milliseconds):

Forward	Backward	Total	Training Iteration
2	5.01	7.01	14.33
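
To relate the columns, the sketch below recomputes the Total column from Forward and Backward and derives a training throughput figure. It assumes the Training Iteration value is the time for one minibatch of 16 images; the function name summarize is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: derive summary figures from timings in milliseconds.
# Usage: summarize <batchSize> <forwardMs> <backwardMs> <trainingIterationMs>
summarize() {
    awk -v n="$1" -v fwd="$2" -v bwd="$3" -v iter="$4" 'BEGIN {
        # Total is the sum of the forward and backward pass times.
        printf "forward+backward: %.2f ms\n", fwd + bwd
        # ASSUMPTION: one training iteration processes n images.
        printf "training throughput: %.1f images/sec\n", n / iter * 1000
    }'
}

summarize 16 2 5.01 14.33
```

The gap between the Total and Training Iteration values is consistent with the note above that full training iterations include updater time.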