ucbrise/piranha

Getting min,-nan,avg,-nan,max,-nan during training

Closed this issue · 2 comments

Hi. When running piranha with:

  • 3 parties
  • Command to build: make -j12 PIRANHA_FLAGS="-DFLOAT_PRECISION=64"
  • Model: lenet-norelu model with MNIST
  • Localhost with a single GPU, with every CUDA_VISIBLE_DEVICES entry in localhost_runner.sh set to 0.
  • Localhost config as follows:
{
    "_desc_num_parties": "Number of parties involved in the MPC computation. Should match compiled application protocol.",
    "num_parties": 3,

    "_desc_party_ips": "IP addresses of each party, ordered by party number",
    "party_ips": ["127.0.0.1", "127.0.0.1", "127.0.0.1"],

    "_desc_party_users": "Usernames for each party, ordered by party number. Can be used by scripts to SSH into machines if necessary to start piranha processes",
    "party_users": [],

    "_desc_run_unit_tests": "Run unit tests",
    "run_unit_tests": false,

    "_desc_unit_test_only": "Only run unit tests, do not try to run any full Piranha applications",
    "unit_test_only": false,

    "_desc_debug_print": "Show debug output",
    "debug_print": true,

    "_desc_debug_overflow": "Test for overflow and print related debug output. Breaks security by revealing intermediary values.",
    "debug_overflow": false,

    "_desc_debug_sqrt": "Test sqrt for invalid input and print related debug output. Breaks security by revealing intermediary values.",
    "debug_sqrt": false,

    "_desc_run_name": "Descriptive name for run, used to name log files with something useful",
    "run_name": "piranha_localhost",

    "_desc_network": "Path to NN architecture to use for the run",
    "network": "files/models/lenet-norelu.json",

    "_desc_custom_epochs": "Enable custom number of epochs. Otherwise set by size of learning rate schedule",
    "custom_epochs": true,

    "_desc_custom_epoch_count": "Number of epochs to train for, if custom_epochs is set",
    "custom_epoch_count": 10,

    "_desc_custom_iterations": "Enable custom number of iterations. Otherwise set by training dataset size.",
    "custom_iterations": false,

    "_desc_custom_iteration_count": "Number of iterations to train per epoch, if custom_iterations is set",
    "custom_iteration_count": 0,

    "_desc_custom_batch_size": "Enable custom batch size. Otherwise set by architecture configuration.",
    "custom_batch_size": true,

    "_desc_custom_batch_size_count": "Desired custom batch size",
    "custom_batch_size_count": 256,

    "_desc_nn_seed": "Seed for NN initialization",
    "nn_seed": 343934585,

    "_desc_preload": "Preload weights from a snapshot directory instead of training from scratch",
    "preload": false,

    "_desc_preload_path": "Directory path from which to preload network weights",
    "preload_path": "",

    "_desc_lr_schedule": "Learning rate schedule, in negative powers of 2 (e.g. 3 -> learning rate of 2^-3). Assumes that the number of LR exponents matches the desired number of training epochs",
    "lr_schedule": [3, 3, 3, 4, 4, 5, 6, 7, 8, 9],

    "_desc_test_only": "Only run NN test, skip training (useful if weights have been preloaded)",
    "test_only": false,

    "_desc_inference_only": "Only run inference (forward pass), not backward pass training",
    "inference_only": false,

    "_desc_no_test": "Do not run testing after training epochs",
    "no_test": false,

    "_desc_last_test": "Only run a test pass after the last training epoch",
    "last_test": true,

    "_desc_iteration_snapshots": "Take snapshots at each training iteration",
    "iteration_snapshots": false,

    "_desc_test_iteration_snapshots": "Take snapshots of a '1PC' test network running the same data",
    "test_iteration_snapshots": false,

    "_desc_epoch_snapshots": "Take snapshots after each training epoch",
    "epoch_snapshots": false,

    "_desc_eval_accuracy": "Evaluation: print training/test accuracy",
    "eval_accuracy": true,

    "_desc_eval_inference_stats": "Evaluation: print runtime and communication statistics for each inference forward pass",
    "eval_inference_stats": false,

    "_desc_eval_train_stats": "Evaluation: print runtime and communication statistics for each training forward-backward pass",
    "eval_train_stats": false,

    "_desc_eval_fw_peak_memory": "Evaluation: print peak memory usage during each forward pass",
    "eval_fw_peak_memory": false,

    "_desc_eval_bw_peak_memory": "Evaluation: print peak memory usage during each backward pass",
    "eval_bw_peak_memory": false,
    
    "_desc_eval_epoch_stats": "Evaluation: print cumulative runtime and communication statistics for each training epoch",
    "eval_epoch_stats": true,

    "_desc_print_activations": "Print output activations for each layer every forward pass",
    "print_activations": false,

    "_desc_print_deltas": "Print input gradient to each layer every backward pass",
    "print_deltas": false,
    
    "_desc_debug_all_forward": "Print debug information for all layer forward passes",
    "debug_all_forward": true,

    "_desc_debug_all_backward": "Print debug information for all layer backward passes",
    "debug_all_backward": true
}
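For reference, a quick sketch of how a few of the fields above combine (plain Python; assumes MNIST's standard 60,000 training images, which is not stated in the config itself):

```python
# Illustrative only -- not Piranha's actual code.
MNIST_TRAIN_SIZE = 60000
batch_size = 256                          # custom_batch_size_count
lr_schedule = [3, 3, 3, 4, 4, 5, 6, 7, 8, 9]

# One exponent per epoch: an entry of 3 means a learning rate of 2^-3.
learning_rates = [2.0 ** -e for e in lr_schedule]

# With custom_iterations false, iterations per epoch presumably come from
# the dataset size and batch size: 60000 // 256.
iterations_per_epoch = MNIST_TRAIN_SIZE // batch_size

print(iterations_per_epoch)  # 234
print(learning_rates[0])     # 0.125
```

The 234 here matches the "ITERATIONS = 234" line in the training output below.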

I get the following output:

run unit tests? false
config network: "files/models/lenet-norelu.json"
network filename: files/models/lenet-norelu.json
----------------------------------------------
(1) CNN Layer             28 x 28 x 1
                          5 x 5         (Filter Size)
                          1 , 0         (Stride, padding)
                          256           (Batch Size)
                          24 x 24 x 20  (Output)
----------------------------------------------
(2) Maxpool Layer         24 x 24 x 20
                          2             (Pooling Size)
                          2             (Stride)
                          256           (Batch Size)
----------------------------------------------
(3) ReLU Layer            256 x 2880
----------------------------------------------
(4) CNN Layer             12 x 12 x 20
                          5 x 5         (Filter Size)
                          1 , 0         (Stride, padding)
                          256           (Batch Size)
                          8 x 8 x 50    (Output)
----------------------------------------------
(5) Maxpool Layer         8 x 8 x 50
                          2             (Pooling Size)
                          2             (Stride)
                          256           (Batch Size)
----------------------------------------------
(6) ReLU Layer            256 x 800
----------------------------------------------
(7) FC Layer              800 x 500
                          256            (Batch Size)
----------------------------------------------
(8) ReLU Layer            256 x 500
----------------------------------------------
(9) FC Layer              500 x 10
                          256            (Batch Size)
TRAINING, EPOCHS = 10 ITERATIONS = 234

 == Training (10 epochs) ==

 -- Epoch 0 (234 iterations, log_lr = 3) --
iteration,0
layer 0
cnn,fw activation,min,-nan,avg,-nan,max,-nan
layer 1
maxpool,fw activation,min,-nan,avg,-nan,max,-nan
layer 2
relu,fw activation,min,-nan,avg,-nan,max,-nan
layer 3
cnn,fw activation,min,-nan,avg,-nan,max,-nan
layer 4
maxpool,fw activation,min,-nan,avg,-nan,max,-nan
layer 5
relu,fw activation,min,-nan,avg,-nan,max,-nan
layer 6
fc,fw activation,min,-nan,avg,-nan,max,-nan
layer 7
relu,fw activation,min,-nan,avg,-nan,max,-nan
layer 8
fc,fw activation,min,-nan,avg,-nan,max,-nan
layer 8
fc,bw input delta,min,inf,avg,inf,max,inf
max bw dW value: -nan
max bw db value: -inf
layer 7
relu,bw input delta,min,-nan,avg,-nan,max,-nan
layer 6
fc,bw input delta,min,-nan,avg,-nan,max,-nan
max bw dW value: -nan
max bw db value: -nan
layer 5
relu,bw input delta,min,-nan,avg,-nan,max,-nan
layer 4
maxpool,bw input delta,min,-nan,avg,-nan,max,-nan
layer 3
cnn,bw input delta,min,-nan,avg,-nan,max,-nan
max bw dF value: -nan
layer 2
relu,bw input delta,min,-nan,avg,-nan,max,-nan
layer 1
maxpool,bw input delta,min,-nan,avg,-nan,max,-nan
layer 0
cnn,bw input delta,min,-nan,avg,-nan,max,-nan
max bw dF value: -nan
iteration,1
layer 0
cnn,fw activation,min,-nan,avg,-nan,max,-nan
...

From a quick glance at your settings, you shouldn't be setting FLOAT_PRECISION to 64: with 64-bit values, that assigns all 64 bits to the fractional part of each value and leaves 0 bits for the whole part, so anything with magnitude 0.5 or more overflows. I believe that's why you're getting these errors -- could you try running with something more like -DFLOAT_PRECISION=26?
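To make the wraparound concrete, here is a small Python sketch of generic two's-complement fixed-point encoding (an illustration of the general idea, not Piranha's actual code):

```python
BITS = 64  # word size of each fixed-point value

def encode(x, precision):
    # Scale by 2^precision and wrap into a 64-bit two's-complement word.
    return round(x * (1 << precision)) % (1 << BITS)

def decode(v, precision):
    # Reinterpret the word as a signed integer, then undo the scaling.
    if v >= 1 << (BITS - 1):
        v -= 1 << BITS
    return v / (1 << precision)

# With 26 fractional bits, 38 bits remain for the sign and whole part:
x = 3.14159
assert abs(decode(encode(x, 26), 26) - x) < 2 ** -25

# With all 64 bits fractional, nothing of magnitude >= 0.5 is representable:
print(decode(encode(1.0, 64), 64))   # 0.0   (1.0 wraps to zero)
print(decode(encode(0.75, 64), 64))  # -0.25 (0.75 wraps negative)
```

Once a few intermediate values wrap like this, the NaN/inf values propagate through every subsequent layer, which is consistent with the log above. Rebuilding with something like `make -j12 PIRANHA_FLAGS="-DFLOAT_PRECISION=26"` (after cleaning the previous build, assuming the same invocation as in the question) should avoid it.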

Thank you @jlwatson, it worked perfectly!