Doodleverse/segmentation_gym

Nan train loss with cat, hinge, and kld

CameronBodine opened this issue · 9 comments

Describe the bug
I am exploring differences in model performance with different hyper-parameter settings. I have successfully trained models with dice as the loss function. However, when attempting to train with cat, hinge, or kld, the reported loss during training is nan, despite using a range of learning rate values (1e-1 to 1e-7). See screenshot below for console output.

To Reproduce
Steps to reproduce the behavior:

  1. Install conda environment for gym (package specifics below)
  2. Set hyper-parameters in the shadowpick_0.json file:
{
    "TARGET_SIZE": [512, 512],
    "MODEL": "unet",
    "NCLASSES": 2,
    "BATCH_SIZE": 10,
    "N_DATA_BANDS": 1,
    "DO_TRAIN": true,
    "PATIENCE": 10,
    "MAX_EPOCHS": 10,
    "VALIDATION_SPLIT": 0.6,
    "FILTERS": 2,
    "KERNEL": 7,
    "STRIDE": 2,
    "LOSS": "cat",
    "DROPOUT": 0.1,
    "DROPOUT_CHANGE_PER_LAYER": 0.0,
    "DROPOUT_TYPE": "standard",
    "USE_DROPOUT_ON_UPSAMPLING": false,
    "ROOT_STRING": "shadowpick",
    "FILTER_VALUE": 3,
    "DOPLOT": false,
    "USEMASK": true,
    "RAMPUP_EPOCHS": 10,
    "SUSTAIN_EPOCHS": 0.0,
    "EXP_DECAY": 0.9,
    "START_LR": 0.1,
    "MIN_LR": 0.1,
    "MAX_LR": 0.1,
    "AUG_ROT": 0,
    "AUG_ZOOM": 0.0,
    "AUG_WIDTHSHIFT": 0.05,
    "AUG_HEIGHTSHIFT": 0.05,
    "AUG_HFLIP": true,
    "AUG_VFLIP": false,
    "AUG_LOOPS": 3,
    "AUG_COPIES": 3,
    "TESTTIMEAUG": false,
    "SET_GPU": "0",
    "DO_CRF": false,
    "SET_PCI_BUS_ID": true,
    "WRITE_MODELMETADATA": true,
    "OTSU_THRESHOLD": true
}
  3. A subset of the dataset can be downloaded here.
  4. Train model by running python train_model.py.

Expected behavior
I expected to see a value other than nan while training.

Screenshots
Console output:

Screenshot from 2023-01-27 12-42-45

Desktop (please complete the following information):

  • OS: Ubuntu 22.04
  • Conda Environment:
# packages in environment at /home/cbodine/miniconda3/envs/gym:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
absl-py                   1.4.0              pyhd8ed1ab_0    conda-forge
aiohttp                   3.8.3            py38h0a891b7_1    conda-forge
aiosignal                 1.3.1              pyhd8ed1ab_0    conda-forge
alsa-lib                  1.2.8                h166bdaf_0    conda-forge
aom                       3.5.0                h27087fc_0    conda-forge
appdirs                   1.4.4              pyh9f0ad1d_0    conda-forge
asttokens                 2.2.1              pyhd8ed1ab_0    conda-forge
astunparse                1.6.3              pyhd8ed1ab_0    conda-forge
async-timeout             4.0.2              pyhd8ed1ab_0    conda-forge
attr                      2.5.1                h166bdaf_1    conda-forge
attrs                     22.2.0             pyh71513ae_0    conda-forge
backcall                  0.2.0              pyh9f0ad1d_0    conda-forge
backports                 1.0                pyhd8ed1ab_3    conda-forge
backports.functools_lru_cache 1.6.4              pyhd8ed1ab_0    conda-forge
blinker                   1.5                pyhd8ed1ab_0    conda-forge
blosc                     1.21.3               hafa529b_0    conda-forge
brotli                    1.0.9                h166bdaf_8    conda-forge
brotli-bin                1.0.9                h166bdaf_8    conda-forge
brotlipy                  0.7.0           py38h0a891b7_1005    conda-forge
brunsli                   0.1                  h9c3ff4c_0    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.18.1               h7f98852_0    conda-forge
c-blosc2                  2.6.1                hf91038e_0    conda-forge
ca-certificates           2023.01.10           h06a4308_0  
cached-property           1.5.2                hd8ed1ab_1    conda-forge
cached_property           1.5.2              pyha770c72_1    conda-forge
cachetools                5.3.0              pyhd8ed1ab_0    conda-forge
cairo                     1.16.0            ha61ee94_1014    conda-forge
certifi                   2022.12.7        py38h06a4308_0  
cffi                      1.15.1           py38h4a40e3a_3    conda-forge
cfitsio                   4.2.0                hd9d235c_0    conda-forge
charls                    2.4.1                hcb278e6_0    conda-forge
charset-normalizer        2.1.1              pyhd8ed1ab_0    conda-forge
click                     8.1.3           unix_pyhd8ed1ab_2    conda-forge
cloudpickle               2.2.1              pyhd8ed1ab_0    conda-forge
colorama                  0.4.6              pyhd8ed1ab_0    conda-forge
contourpy                 1.0.7            py38hfbd4bf9_0    conda-forge
cryptography              39.0.0           py38h1724139_0    conda-forge
cudatoolkit               11.8.0              h37601d7_11    conda-forge
cudnn                     8.4.1.50             hed8a83a_0    conda-forge
cycler                    0.11.0             pyhd8ed1ab_0    conda-forge
cython                    0.29.33          py38h8dc9893_0    conda-forge
cytoolz                   0.12.0           py38h0a891b7_1    conda-forge
dask-core                 2023.1.0           pyhd8ed1ab_0    conda-forge
dav1d                     1.0.0                h166bdaf_1    conda-forge
dbus                      1.13.6               h5008d03_3    conda-forge
decorator                 5.1.1              pyhd8ed1ab_0    conda-forge
doodleverse-utils         0.0.18                   pypi_0    pypi
executing                 1.2.0              pyhd8ed1ab_0    conda-forge
expat                     2.5.0                h27087fc_0    conda-forge
fftw                      3.3.10          nompi_hf0379b8_106    conda-forge
flatbuffers               2.0.8                hcb278e6_1    conda-forge
font-ttf-dejavu-sans-mono 2.37                 hab24e00_0    conda-forge
font-ttf-inconsolata      3.000                h77eed37_0    conda-forge
font-ttf-source-code-pro  2.038                h77eed37_0    conda-forge
font-ttf-ubuntu           0.83                 hab24e00_0    conda-forge
fontconfig                2.14.1               hc2a2eb6_0    conda-forge
fonts-conda-ecosystem     1                             0    conda-forge
fonts-conda-forge         1                             0    conda-forge
fonttools                 4.38.0           py38h0a891b7_1    conda-forge
freetype                  2.12.1               hca18f0e_1    conda-forge
frozenlist                1.3.3            py38h0a891b7_0    conda-forge
fsspec                    2023.1.0           pyhd8ed1ab_0    conda-forge
gast                      0.4.0              pyh9f0ad1d_0    conda-forge
gettext                   0.21.1               h27087fc_0    conda-forge
giflib                    5.2.1                h36c2ea0_2    conda-forge
glib                      2.74.1               h6239696_1    conda-forge
glib-tools                2.74.1               h6239696_1    conda-forge
google-auth               2.16.0             pyh1a96a4e_1    conda-forge
google-auth-oauthlib      0.4.6              pyhd8ed1ab_0    conda-forge
google-pasta              0.2.0              pyh8c360ce_0    conda-forge
graphite2                 1.3.13            h58526e2_1001    conda-forge
grpc-cpp                  1.47.1               h05bd8bd_7    conda-forge
grpcio                    1.47.1           py38h7dc2bf5_7    conda-forge
gst-plugins-base          1.21.3               h4243ec0_1    conda-forge
gstreamer                 1.21.3               h25f0c4b_1    conda-forge
gstreamer-orc             0.4.33               h166bdaf_0    conda-forge
h5py                      3.8.0           nompi_py38hd5fa8ee_100    conda-forge
harfbuzz                  6.0.0                h8e241bc_0    conda-forge
hdf5                      1.12.2          nompi_h2386368_101    conda-forge
icu                       70.1                 h27087fc_0    conda-forge
idna                      3.4                pyhd8ed1ab_0    conda-forge
imagecodecs               2023.1.23        py38h3ca0a39_0    conda-forge
imageio                   2.25.0             pyh24c5eb1_0    conda-forge
importlib-metadata        6.0.0              pyha770c72_0    conda-forge
ipython                   8.8.0              pyh41d4057_0    conda-forge
jack                      1.9.21               h583fa2b_2    conda-forge
jedi                      0.18.2             pyhd8ed1ab_0    conda-forge
joblib                    1.2.0              pyhd8ed1ab_0    conda-forge
jpeg                      9e                   h166bdaf_2    conda-forge
jxrlib                    1.1                  h7f98852_2    conda-forge
keras                     2.10.0             pyhd8ed1ab_0    conda-forge
keras-preprocessing       1.1.2              pyhd8ed1ab_0    conda-forge
keyutils                  1.6.1                h166bdaf_0    conda-forge
kiwisolver                1.4.4            py38h43d8883_1    conda-forge
krb5                      1.20.1               hf9c8cef_0    conda-forge
lame                      3.100             h166bdaf_1003    conda-forge
lcms2                     2.14                 hfd0df8a_1    conda-forge
ld_impl_linux-64          2.39                 hcc3a1bd_1    conda-forge
lerc                      4.0.0                h27087fc_0    conda-forge
libabseil                 20220623.0      cxx17_h05df665_6    conda-forge
libaec                    1.0.6                hcb278e6_1    conda-forge
libavif                   0.11.1               h5cdd6b5_0    conda-forge
libblas                   3.9.0           16_linux64_openblas    conda-forge
libbrotlicommon           1.0.9                h166bdaf_8    conda-forge
libbrotlidec              1.0.9                h166bdaf_8    conda-forge
libbrotlienc              1.0.9                h166bdaf_8    conda-forge
libcap                    2.66                 ha37c62d_0    conda-forge
libcblas                  3.9.0           16_linux64_openblas    conda-forge
libclang                  15.0.7          default_had23c3d_0    conda-forge
libclang13                15.0.7          default_h3e3d535_0    conda-forge
libcups                   2.3.3                h36d4200_3    conda-forge
libcurl                   7.87.0               h6312ad2_0    conda-forge
libdb                     6.2.32               h9c3ff4c_0    conda-forge
libdeflate                1.17                 h0b41bf4_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libevent                  2.1.10               h9b69904_4    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libflac                   1.4.2                h27087fc_0    conda-forge
libgcc-ng                 12.2.0              h65d4601_19    conda-forge
libgcrypt                 1.10.1               h166bdaf_0    conda-forge
libgfortran-ng            12.2.0              h69a702a_19    conda-forge
libgfortran5              12.2.0              h337968e_19    conda-forge
libglib                   2.74.1               h606061b_1    conda-forge
libgomp                   12.2.0              h65d4601_19    conda-forge
libgpg-error              1.46                 h620e276_0    conda-forge
libiconv                  1.17                 h166bdaf_0    conda-forge
libjpeg-turbo             2.1.4                h166bdaf_0    conda-forge
liblapack                 3.9.0           16_linux64_openblas    conda-forge
libllvm15                 15.0.7               hadd5161_0    conda-forge
libnghttp2                1.51.0               hdcd2b5c_0    conda-forge
libogg                    1.3.4                h7f98852_1    conda-forge
libopenblas               0.3.21          pthreads_h78a6416_3    conda-forge
libopus                   1.3.1                h7f98852_1    conda-forge
libpng                    1.6.39               h753d276_0    conda-forge
libpq                     15.1                 h2baec63_3    conda-forge
libprotobuf               3.21.12              h3eb15da_0    conda-forge
libsndfile                1.2.0                hb75c966_0    conda-forge
libsqlite                 3.40.0               h753d276_0    conda-forge
libssh2                   1.10.0               haa6b8db_3    conda-forge
libstdcxx-ng              12.2.0              h46fd767_19    conda-forge
libsystemd0               252                  h2a991cd_0    conda-forge
libtiff                   4.5.0                h6adf6a1_2    conda-forge
libtool                   2.4.7                h27087fc_0    conda-forge
libudev1                  252                  h166bdaf_0    conda-forge
libuuid                   2.32.1            h7f98852_1000    conda-forge
libvorbis                 1.3.7                h9c3ff4c_0    conda-forge
libwebp-base              1.2.4                h166bdaf_0    conda-forge
libxcb                    1.13              h7f98852_1004    conda-forge
libxkbcommon              1.0.3                he3ba5ed_0    conda-forge
libxml2                   2.10.3               h7463322_0    conda-forge
libzlib                   1.2.13               h166bdaf_4    conda-forge
libzopfli                 1.0.3                h9c3ff4c_0    conda-forge
locket                    1.0.0              pyhd8ed1ab_0    conda-forge
lz4-c                     1.9.3                h9c3ff4c_1    conda-forge
markdown                  3.4.1              pyhd8ed1ab_0    conda-forge
markupsafe                2.1.2            py38h1de0b5d_0    conda-forge
matplotlib                3.6.3            py38h578d9bd_0    conda-forge
matplotlib-base           3.6.3            py38hd6c3c57_0    conda-forge
matplotlib-inline         0.1.6              pyhd8ed1ab_0    conda-forge
mpg123                    1.31.2               hcb278e6_0    conda-forge
multidict                 6.0.4            py38h1de0b5d_0    conda-forge
munkres                   1.1.4              pyh9f0ad1d_0    conda-forge
mysql-common              8.0.32               h14678bc_0    conda-forge
mysql-libs                8.0.32               h54cf53e_0    conda-forge
natsort                   8.2.0              pyhd8ed1ab_0    conda-forge
nccl                      2.14.3.1             h0800d71_0    conda-forge
ncurses                   6.3                  h27087fc_1    conda-forge
networkx                  3.0                pyhd8ed1ab_0    conda-forge
nspr                      4.35                 h27087fc_0    conda-forge
nss                       3.82                 he02c5a1_0    conda-forge
numpy                     1.23.0           py38h3a7f9d9_0    conda-forge
oauthlib                  3.2.2              pyhd8ed1ab_0    conda-forge
openjpeg                  2.5.0                hfec8fc6_2    conda-forge
openssl                   1.1.1s               h7f8727e_0  
opt_einsum                3.3.0              pyhd8ed1ab_1    conda-forge
packaging                 23.0               pyhd8ed1ab_0    conda-forge
pandas                    1.5.3            py38hdc8b05c_0    conda-forge
parso                     0.8.3              pyhd8ed1ab_0    conda-forge
partd                     1.3.0              pyhd8ed1ab_0    conda-forge
pcre2                     10.40                hc3806b6_0    conda-forge
pexpect                   4.8.0              pyh1a96a4e_2    conda-forge
pickleshare               0.7.5                   py_1003    conda-forge
pillow                    9.4.0            py38hb32c036_0    conda-forge
pip                       22.3.1             pyhd8ed1ab_0    conda-forge
pixman                    0.40.0               h36c2ea0_0    conda-forge
plotly                    5.13.0             pyhd8ed1ab_0    conda-forge
ply                       3.11                       py_1    conda-forge
pooch                     1.6.0              pyhd8ed1ab_0    conda-forge
prompt-toolkit            3.0.36             pyha770c72_0    conda-forge
protobuf                  4.21.12          py38h8dc9893_0    conda-forge
pthread-stubs             0.4               h36c2ea0_1001    conda-forge
ptyprocess                0.7.0              pyhd3deb0d_0    conda-forge
pulseaudio                16.1                 h4ab2085_1    conda-forge
pure_eval                 0.2.2              pyhd8ed1ab_0    conda-forge
pyasn1                    0.4.8                      py_0    conda-forge
pyasn1-modules            0.2.7                      py_0    conda-forge
pycparser                 2.21               pyhd8ed1ab_0    conda-forge
pydensecrf                1.0rc3           py38h8f669ce_4    conda-forge
pygments                  2.14.0             pyhd8ed1ab_0    conda-forge
pyjwt                     2.6.0              pyhd8ed1ab_0    conda-forge
pyopenssl                 23.0.0             pyhd8ed1ab_0    conda-forge
pyparsing                 3.0.9              pyhd8ed1ab_0    conda-forge
pyqt                      5.15.7           py38h7492b6b_2    conda-forge
pyqt5-sip                 12.11.0          py38hfa26641_2    conda-forge
pysocks                   1.7.1              pyha2e5f31_6    conda-forge
python                    3.8.16               h7a1cb2a_2  
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python-flatbuffers        23.1.21            pyhd8ed1ab_0    conda-forge
python_abi                3.8                      2_cp38    conda-forge
pytz                      2022.7.1           pyhd8ed1ab_0    conda-forge
pyu2f                     0.1.5              pyhd8ed1ab_0    conda-forge
pywavelets                1.4.1            py38h7e4f40d_0    conda-forge
pyyaml                    6.0              py38h0a891b7_5    conda-forge
qt-main                   5.15.6               h18908ee_6    conda-forge
re2                       2022.06.01           h27087fc_1    conda-forge
readline                  8.2                  h5eee18b_0  
requests                  2.28.2             pyhd8ed1ab_0    conda-forge
requests-oauthlib         1.3.1              pyhd8ed1ab_0    conda-forge
rsa                       4.9                pyhd8ed1ab_0    conda-forge
scikit-image              0.19.3           py38h8f669ce_2    conda-forge
scipy                     1.10.0           py38h10c12cc_0    conda-forge
setuptools                66.1.1             pyhd8ed1ab_0    conda-forge
sip                       6.7.5            py38hfa26641_0    conda-forge
six                       1.16.0             pyh6c4a22f_0    conda-forge
snappy                    1.1.9                hbd366e4_2    conda-forge
sqlite                    3.40.1               h5082296_0  
stack_data                0.6.2              pyhd8ed1ab_0    conda-forge
tenacity                  8.1.0              pyhd8ed1ab_0    conda-forge
tensorboard               2.10.1             pyhd8ed1ab_0    conda-forge
tensorboard-data-server   0.6.1            py38h2b5fc30_4    conda-forge
tensorboard-plugin-wit    1.8.1              pyhd8ed1ab_0    conda-forge
tensorflow                2.10.0          cuda112py38hded6998_0    conda-forge
tensorflow-base           2.10.0          cuda112py38h6b2b66c_0    conda-forge
tensorflow-estimator      2.10.0          cuda112py38hf5dcc89_0    conda-forge
tensorflow-gpu            2.10.0          cuda112py38h0bbbad9_0    conda-forge
termcolor                 2.2.0              pyhd8ed1ab_0    conda-forge
tifffile                  2023.1.23.1        pyhd8ed1ab_0    conda-forge
tk                        8.6.12               h27826a3_0    conda-forge
toml                      0.10.2             pyhd8ed1ab_0    conda-forge
toolz                     0.12.0             pyhd8ed1ab_0    conda-forge
tornado                   6.2              py38h0a891b7_1    conda-forge
tqdm                      4.64.1             pyhd8ed1ab_0    conda-forge
traitlets                 5.8.1              pyhd8ed1ab_0    conda-forge
typing-extensions         4.4.0                hd8ed1ab_0    conda-forge
typing_extensions         4.4.0              pyha770c72_0    conda-forge
unicodedata2              15.0.0           py38h0a891b7_0    conda-forge
urllib3                   1.26.14            pyhd8ed1ab_0    conda-forge
versioneer                0.28                     pypi_0    pypi
wcwidth                   0.2.6              pyhd8ed1ab_0    conda-forge
werkzeug                  2.2.2              pyhd8ed1ab_0    conda-forge
wheel                     0.38.4             pyhd8ed1ab_0    conda-forge
wrapt                     1.14.1           py38h0a891b7_1    conda-forge
xcb-util                  0.4.0                h516909a_0    conda-forge
xcb-util-image            0.4.0                h166bdaf_0    conda-forge
xcb-util-keysyms          0.4.0                h516909a_0    conda-forge
xcb-util-renderutil       0.3.9                h166bdaf_0    conda-forge
xcb-util-wm               0.4.1                h516909a_0    conda-forge
xorg-kbproto              1.0.7             h7f98852_1002    conda-forge
xorg-libice               1.0.10               h7f98852_0    conda-forge
xorg-libsm                1.2.3             hd9c2040_1000    conda-forge
xorg-libx11               1.7.2                h7f98852_0    conda-forge
xorg-libxau               1.0.9                h7f98852_0    conda-forge
xorg-libxdmcp             1.1.3                h7f98852_0    conda-forge
xorg-libxext              1.3.4                h7f98852_1    conda-forge
xorg-libxrender           0.9.10            h7f98852_1003    conda-forge
xorg-renderproto          0.11.1            h7f98852_1002    conda-forge
xorg-xextproto            7.3.0             h7f98852_1002    conda-forge
xorg-xproto               7.0.31            h7f98852_1007    conda-forge
xz                        5.2.10               h5eee18b_1  
yaml                      0.2.5                h7f98852_2    conda-forge
yarl                      1.8.2            py38h0a891b7_0    conda-forge
zfp                       1.0.0                h27087fc_3    conda-forge
zipp                      3.11.0             pyhd8ed1ab_0    conda-forge
zlib                      1.2.13               h166bdaf_4    conda-forge
zlib-ng                   2.0.6                h166bdaf_0    conda-forge
zstd                      1.5.2                h3eb15da_6    conda-forge

Additional context
As I mentioned, I was able to train multiple models using dice with the following hyper-parameters with the same dataset linked above.

{
    "TARGET_SIZE": [512, 512],
    "MODEL": "unet",
    "NCLASSES": 2,
    "BATCH_SIZE": 10,
    "N_DATA_BANDS": 1,
    "DO_TRAIN": true,
    "PATIENCE": 10,
    "MAX_EPOCHS": 10,
    "VALIDATION_SPLIT": 0.6,
    "FILTERS": 2,
    "KERNEL": 7,
    "STRIDE": 2,
    "LOSS": "dice",
    "DROPOUT": 0.1,
    "DROPOUT_CHANGE_PER_LAYER": 0.0,
    "DROPOUT_TYPE": "standard",
    "USE_DROPOUT_ON_UPSAMPLING": false,
    "ROOT_STRING": "shadowpick",
    "FILTER_VALUE": 3,
    "DOPLOT": false,
    "USEMASK": true,
    "RAMPUP_EPOCHS": 10,
    "SUSTAIN_EPOCHS": 0.0,
    "EXP_DECAY": 0.9,
    "START_LR": 1e-07,
    "MIN_LR": 1e-07,
    "MAX_LR": 0.0001,
    "AUG_ROT": 0,
    "AUG_ZOOM": 0.0,
    "AUG_WIDTHSHIFT": 0.05,
    "AUG_HEIGHTSHIFT": 0.05,
    "AUG_HFLIP": true,
    "AUG_VFLIP": false,
    "AUG_LOOPS": 3,
    "AUG_COPIES": 3,
    "TESTTIMEAUG": false,
    "SET_GPU": "0",
    "DO_CRF": false,
    "SET_PCI_BUS_ID": true,
    "WRITE_MODELMETADATA": true,
    "OTSU_THRESHOLD": true
}

I also tried other versions of Tensorflow-gpu (2.4, 2.6, 2.7, 2.8) with kld, but loss was reported as nan.

Hi @CameronBodine ,

My hunch is that this is caused by mixed precision (which can cause underflow/overflow and therefore nan or inf loss). Can you try to train a model with these lines in train_model.py commented out:

from tensorflow.keras import mixed_precision
try:
    mixed_precision.set_global_policy('mixed_float16')
except:
    mixed_precision.experimental.set_policy('mixed_float16')
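
For anyone wondering how float16 can end up producing nan, here is a minimal illustration of its range limits. It is a pure-Python sketch using the standard library's half-precision struct format, not Gym or TensorFlow code, and the helper name to_float16 is made up for this example. It shows the general mechanism only; which operation actually overflowed or underflowed in the cat/hinge/kld runs was not determined in this thread.

```python
import math
import struct


def to_float16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (float16) value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct raises for values beyond the float16 range (max 65504);
        # float16 arithmetic on a GPU would produce +/-inf instead.
        return math.copysign(math.inf, x)


big = to_float16(70000.0)   # too large for float16 -> inf
tiny = to_float16(1e-8)     # too small for float16 -> 0.0

print(big, tiny)
print(big - big)    # inf - inf is nan
print(tiny * big)   # 0 * inf is nan
```

Once a single inf or nan appears in a loss or gradient, it tends to propagate through every later step, which would be consistent with the loss staying at nan regardless of the learning rate.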

Right on the money @ebgoldstein! Running now with cat loss. Let me know if I can report back any info, or try out anything else.

Good call @ebgoldstein and thanks @CameronBodine for reporting

It would also be useful if you can confirm if you can train using 'kld' and/or 'hinge' without mixed precision, thanks

great news @CameronBodine ..

I have run into this scenario several times.. and have always been able to train with any loss by falling back to full precision..

for now I am going to close this issue. but please reopen if there are any other problems..

@dbuscombe-usgs - feel free to reopen this.. i just saw your comment above...

It would also be useful if you can confirm if you can train using 'kld' and/or 'hinge' without mixed precision, thanks

I can confirm that loss values are reported for both kld and hinge after disabling mixed precision.

I'm adding more info related to using mixed precision, FYI. Not sure if it's helpful, but figured I would document it.

If I don't comment out the lines @ebgoldstein referenced above, I get the following error using LOSS='dice':

$ python train_model.py 
/mnt/md0/SynologyDrive/Modeling/99_ForTesting/Test_ExecScript/datasets
/mnt/md0/SynologyDrive/Modeling/99_ForTesting/Test_ExecScript/config/Test_ExecScript.json
Using GPU
Using single GPU device
2023-02-13 12:46:12.951058: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Version:  2.11.0
Eager mode:  True
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Making new directory for example model outputs: /mnt/md0/SynologyDrive/Modeling/99_ForTesting/Test_ExecScript/modelOut
MODE "all": using all augmented and non-augmented files
2023-02-13 12:46:15.089657: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-13 12:46:15.815354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14606 MB memory:  -> device: 0, name: Quadro RTX 5000, pci bus id: 0000:65:00.0, compute capability: 7.5
3
1
.....................................
Creating and compiling model ...
INITIAL_EPOCH not specified in the config file. Setting to default of 0 ...
.....................................
Training model ...

Epoch 1: LearningRateScheduler setting learning rate to 1e-07.
Epoch 1/5
2023-02-13 12:46:28.331262: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8401
2023-02-13 12:46:29.121728: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-02-13 12:46:52.331351: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x7f6d74003af0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-13 12:46:52.331451: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (0): Quadro RTX 5000, Compute Capability 7.5
2023-02-13 12:46:52.345416: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-02-13 12:46:52.564638: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-02-13 12:46:52.656992: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
3/3 [==============================] - 43s 2s/step - loss: 0.8905 - mean_iou: 0.0391 - dice_coef: 0.1095 - val_loss: 0.8784 - val_mean_iou: 0.0356 - val_dice_coef: 0.1216 - lr: 1.0000e-07

Epoch 2: LearningRateScheduler setting learning rate to 1.0090000000000002e-05.
Epoch 2/5
3/3 [==============================] - 3s 1s/step - loss: 0.8870 - mean_iou: 0.0424 - dice_coef: 0.1130 - val_loss: 0.8772 - val_mean_iou: 0.0329 - val_dice_coef: 0.1228 - lr: 1.0090e-05

Epoch 3: LearningRateScheduler setting learning rate to 2.008e-05.
Epoch 3/5
3/3 [==============================] - 3s 1s/step - loss: 0.8706 - mean_iou: 0.0560 - dice_coef: 0.1294 - val_loss: 0.8745 - val_mean_iou: 0.0332 - val_dice_coef: 0.1255 - lr: 2.0080e-05

Epoch 4: LearningRateScheduler setting learning rate to 3.0070000000000002e-05.
Epoch 4/5
3/3 [==============================] - 3s 1s/step - loss: 0.8517 - mean_iou: 0.0740 - dice_coef: 0.1483 - val_loss: 0.8705 - val_mean_iou: 0.0387 - val_dice_coef: 0.1295 - lr: 3.0070e-05

Epoch 5: LearningRateScheduler setting learning rate to 4.0060000000000006e-05.
Epoch 5/5
3/3 [==============================] - 3s 1s/step - loss: 0.8346 - mean_iou: 0.1016 - dice_coef: 0.1654 - val_loss: 0.8659 - val_mean_iou: 0.0577 - val_dice_coef: 0.1341 - lr: 4.0060e-05
Traceback (most recent call last):
  File "train_model.py", line 920, in <module>
    model.save(weights.replace('.h5','_fullmodel.h5'))
  File "/home/cbodine/miniconda3/envs/gym/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/cbodine/miniconda3/envs/gym/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 775, in variables
    return self._variables
AttributeError: 'LossScaleOptimizerV3' object has no attribute '_variables'

Thanks @CameronBodine

We should modify the code so that, unless Dice is the loss, mixed precision is disabled with a warning.
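
A rough sketch of what that check could look like, assuming the loss name comes from the LOSS entry of the config file. The function names choose_precision_policy and apply_precision_policy are hypothetical and are not part of train_model.py; the only TensorFlow call used is mixed_precision.set_global_policy, which is imported lazily so the selection logic can be tried without TensorFlow installed.

```python
import warnings


def choose_precision_policy(loss_name: str) -> str:
    """Return the Keras dtype policy name to use for a given LOSS value."""
    if loss_name == "dice":
        return "mixed_float16"
    # Any other loss falls back to full precision, with a warning.
    warnings.warn(
        f"LOSS='{loss_name}': mixed precision disabled, using float32 "
        "to avoid nan/inf losses."
    )
    return "float32"


def apply_precision_policy(loss_name: str) -> None:
    """Set the global Keras policy (hypothetical helper, requires TensorFlow)."""
    from tensorflow.keras import mixed_precision  # imported only when called

    mixed_precision.set_global_policy(choose_precision_policy(loss_name))


for name in ("dice", "cat", "hinge", "kld"):
    print(name, "->", choose_precision_policy(name))
```

Note that the traceback above shows model.save failing with mixed precision even for dice on TensorFlow 2.11, so keeping mixed_float16 for dice may still need a separate fix.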

On 'nan' losses with Dice, switching mixed precision off is the quick/easy way to get finite losses. However, I have still had good luck modifying the LR scheduler instead. So far I have managed to get most models to converge this way, but it is obviously a much more time-consuming process involving trial and error.
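
As a starting point for that kind of trial and error, here is a sketch of a ramp-up / sustain / exponential-decay schedule driven by the START_LR, MIN_LR, MAX_LR, RAMPUP_EPOCHS, SUSTAIN_EPOCHS and EXP_DECAY config values. This is an assumption about the schedule's shape, not a copy of the code in train_model.py; it was chosen because it reproduces the learning rates printed in the log above (1e-07, then roughly 1.009e-05, 2.008e-05, 3.007e-05, 4.006e-05) when given the values from the dice config. The function name lr_schedule is made up for this example.

```python
def lr_schedule(epoch, start_lr=1e-7, min_lr=1e-7, max_lr=1e-4,
                rampup_epochs=10, sustain_epochs=0.0, exp_decay=0.9):
    """Learning rate for a 0-indexed epoch: linear ramp-up, hold, then decay."""
    if epoch < rampup_epochs:
        # Linear ramp from start_lr towards max_lr.
        return (max_lr - start_lr) / rampup_epochs * epoch + start_lr
    if epoch < rampup_epochs + sustain_epochs:
        # Hold at the peak learning rate.
        return max_lr
    # Exponential decay from max_lr down towards min_lr.
    decay_steps = epoch - rampup_epochs - sustain_epochs
    return (max_lr - min_lr) * exp_decay ** decay_steps + min_lr


# Keras passes 0-indexed epochs, so "Epoch 1" in the log is epoch=0 here.
for epoch in range(5):
    print(epoch + 1, lr_schedule(epoch))
```

With START_LR, MIN_LR and MAX_LR all set to 0.1, as in the cat config at the top of the issue, this schedule returns a constant 0.1 for every epoch. A function like this can be passed to tf.keras.callbacks.LearningRateScheduler to try a lower peak or a longer ramp-up.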