Nan train loss with cat, hinge, and kld
CameronBodine opened this issue · 9 comments
Describe the bug
I am exploring differences in model performance with different hyper-parameter settings. I have successfully trained models with `dice` as the loss function. However, when attempting to train with `cat`, `hinge`, or `kld`, the reported loss during training is `nan`, despite using a range of learning-rate values (1e-1 to 1e-7). See the screenshot below for console output.
To Reproduce
Steps to reproduce the behavior:
- Install the conda environment for gym (package specifics below)
- Set hyper-parameters in the `shadowpick_0.json` file:
```json
{
"TARGET_SIZE": [512, 512],
"MODEL": "unet",
"NCLASSES": 2,
"BATCH_SIZE": 10,
"N_DATA_BANDS": 1,
"DO_TRAIN": true,
"PATIENCE": 10,
"MAX_EPOCHS": 10,
"VALIDATION_SPLIT": 0.6,
"FILTERS": 2,
"KERNEL": 7,
"STRIDE": 2,
"LOSS": "cat",
"DROPOUT": 0.1,
"DROPOUT_CHANGE_PER_LAYER": 0.0,
"DROPOUT_TYPE": "standard",
"USE_DROPOUT_ON_UPSAMPLING": false,
"ROOT_STRING": "shadowpick",
"FILTER_VALUE": 3,
"DOPLOT": false,
"USEMASK": true,
"RAMPUP_EPOCHS": 10,
"SUSTAIN_EPOCHS": 0.0,
"EXP_DECAY": 0.9,
"START_LR": 0.1,
"MIN_LR": 0.1,
"MAX_LR": 0.1,
"AUG_ROT": 0,
"AUG_ZOOM": 0.0,
"AUG_WIDTHSHIFT": 0.05,
"AUG_HEIGHTSHIFT": 0.05,
"AUG_HFLIP": true,
"AUG_VFLIP": false,
"AUG_LOOPS": 3,
"AUG_COPIES": 3,
"TESTTIMEAUG": false,
"SET_GPU": "0",
"DO_CRF": false,
"SET_PCI_BUS_ID": true,
"WRITE_MODELMETADATA": true,
"OTSU_THRESHOLD": true
}
```
- A subset of the dataset can be downloaded here.
- Train the model by running `python train_model.py`.
Expected behavior
I expected to see a value other than `nan` while training.
Screenshots
Console output (screenshot not reproduced here).
Desktop (please complete the following information):
- OS: Ubuntu 22.04
- Conda Environment:
```
# packages in environment at /home/cbodine/miniconda3/envs/gym:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
absl-py 1.4.0 pyhd8ed1ab_0 conda-forge
aiohttp 3.8.3 py38h0a891b7_1 conda-forge
aiosignal 1.3.1 pyhd8ed1ab_0 conda-forge
alsa-lib 1.2.8 h166bdaf_0 conda-forge
aom 3.5.0 h27087fc_0 conda-forge
appdirs 1.4.4 pyh9f0ad1d_0 conda-forge
asttokens 2.2.1 pyhd8ed1ab_0 conda-forge
astunparse 1.6.3 pyhd8ed1ab_0 conda-forge
async-timeout 4.0.2 pyhd8ed1ab_0 conda-forge
attr 2.5.1 h166bdaf_1 conda-forge
attrs 22.2.0 pyh71513ae_0 conda-forge
backcall 0.2.0 pyh9f0ad1d_0 conda-forge
backports 1.0 pyhd8ed1ab_3 conda-forge
backports.functools_lru_cache 1.6.4 pyhd8ed1ab_0 conda-forge
blinker 1.5 pyhd8ed1ab_0 conda-forge
blosc 1.21.3 hafa529b_0 conda-forge
brotli 1.0.9 h166bdaf_8 conda-forge
brotli-bin 1.0.9 h166bdaf_8 conda-forge
brotlipy 0.7.0 py38h0a891b7_1005 conda-forge
brunsli 0.1 h9c3ff4c_0 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.18.1 h7f98852_0 conda-forge
c-blosc2 2.6.1 hf91038e_0 conda-forge
ca-certificates 2023.01.10 h06a4308_0
cached-property 1.5.2 hd8ed1ab_1 conda-forge
cached_property 1.5.2 pyha770c72_1 conda-forge
cachetools 5.3.0 pyhd8ed1ab_0 conda-forge
cairo 1.16.0 ha61ee94_1014 conda-forge
certifi 2022.12.7 py38h06a4308_0
cffi 1.15.1 py38h4a40e3a_3 conda-forge
cfitsio 4.2.0 hd9d235c_0 conda-forge
charls 2.4.1 hcb278e6_0 conda-forge
charset-normalizer 2.1.1 pyhd8ed1ab_0 conda-forge
click 8.1.3 unix_pyhd8ed1ab_2 conda-forge
cloudpickle 2.2.1 pyhd8ed1ab_0 conda-forge
colorama 0.4.6 pyhd8ed1ab_0 conda-forge
contourpy 1.0.7 py38hfbd4bf9_0 conda-forge
cryptography 39.0.0 py38h1724139_0 conda-forge
cudatoolkit 11.8.0 h37601d7_11 conda-forge
cudnn 8.4.1.50 hed8a83a_0 conda-forge
cycler 0.11.0 pyhd8ed1ab_0 conda-forge
cython 0.29.33 py38h8dc9893_0 conda-forge
cytoolz 0.12.0 py38h0a891b7_1 conda-forge
dask-core 2023.1.0 pyhd8ed1ab_0 conda-forge
dav1d 1.0.0 h166bdaf_1 conda-forge
dbus 1.13.6 h5008d03_3 conda-forge
decorator 5.1.1 pyhd8ed1ab_0 conda-forge
doodleverse-utils 0.0.18 pypi_0 pypi
executing 1.2.0 pyhd8ed1ab_0 conda-forge
expat 2.5.0 h27087fc_0 conda-forge
fftw 3.3.10 nompi_hf0379b8_106 conda-forge
flatbuffers 2.0.8 hcb278e6_1 conda-forge
font-ttf-dejavu-sans-mono 2.37 hab24e00_0 conda-forge
font-ttf-inconsolata 3.000 h77eed37_0 conda-forge
font-ttf-source-code-pro 2.038 h77eed37_0 conda-forge
font-ttf-ubuntu 0.83 hab24e00_0 conda-forge
fontconfig 2.14.1 hc2a2eb6_0 conda-forge
fonts-conda-ecosystem 1 0 conda-forge
fonts-conda-forge 1 0 conda-forge
fonttools 4.38.0 py38h0a891b7_1 conda-forge
freetype 2.12.1 hca18f0e_1 conda-forge
frozenlist 1.3.3 py38h0a891b7_0 conda-forge
fsspec 2023.1.0 pyhd8ed1ab_0 conda-forge
gast 0.4.0 pyh9f0ad1d_0 conda-forge
gettext 0.21.1 h27087fc_0 conda-forge
giflib 5.2.1 h36c2ea0_2 conda-forge
glib 2.74.1 h6239696_1 conda-forge
glib-tools 2.74.1 h6239696_1 conda-forge
google-auth 2.16.0 pyh1a96a4e_1 conda-forge
google-auth-oauthlib 0.4.6 pyhd8ed1ab_0 conda-forge
google-pasta 0.2.0 pyh8c360ce_0 conda-forge
graphite2 1.3.13 h58526e2_1001 conda-forge
grpc-cpp 1.47.1 h05bd8bd_7 conda-forge
grpcio 1.47.1 py38h7dc2bf5_7 conda-forge
gst-plugins-base 1.21.3 h4243ec0_1 conda-forge
gstreamer 1.21.3 h25f0c4b_1 conda-forge
gstreamer-orc 0.4.33 h166bdaf_0 conda-forge
h5py 3.8.0 nompi_py38hd5fa8ee_100 conda-forge
harfbuzz 6.0.0 h8e241bc_0 conda-forge
hdf5 1.12.2 nompi_h2386368_101 conda-forge
icu 70.1 h27087fc_0 conda-forge
idna 3.4 pyhd8ed1ab_0 conda-forge
imagecodecs 2023.1.23 py38h3ca0a39_0 conda-forge
imageio 2.25.0 pyh24c5eb1_0 conda-forge
importlib-metadata 6.0.0 pyha770c72_0 conda-forge
ipython 8.8.0 pyh41d4057_0 conda-forge
jack 1.9.21 h583fa2b_2 conda-forge
jedi 0.18.2 pyhd8ed1ab_0 conda-forge
joblib 1.2.0 pyhd8ed1ab_0 conda-forge
jpeg 9e h166bdaf_2 conda-forge
jxrlib 1.1 h7f98852_2 conda-forge
keras 2.10.0 pyhd8ed1ab_0 conda-forge
keras-preprocessing 1.1.2 pyhd8ed1ab_0 conda-forge
keyutils 1.6.1 h166bdaf_0 conda-forge
kiwisolver 1.4.4 py38h43d8883_1 conda-forge
krb5 1.20.1 hf9c8cef_0 conda-forge
lame 3.100 h166bdaf_1003 conda-forge
lcms2 2.14 hfd0df8a_1 conda-forge
ld_impl_linux-64 2.39 hcc3a1bd_1 conda-forge
lerc 4.0.0 h27087fc_0 conda-forge
libabseil 20220623.0 cxx17_h05df665_6 conda-forge
libaec 1.0.6 hcb278e6_1 conda-forge
libavif 0.11.1 h5cdd6b5_0 conda-forge
libblas 3.9.0 16_linux64_openblas conda-forge
libbrotlicommon 1.0.9 h166bdaf_8 conda-forge
libbrotlidec 1.0.9 h166bdaf_8 conda-forge
libbrotlienc 1.0.9 h166bdaf_8 conda-forge
libcap 2.66 ha37c62d_0 conda-forge
libcblas 3.9.0 16_linux64_openblas conda-forge
libclang 15.0.7 default_had23c3d_0 conda-forge
libclang13 15.0.7 default_h3e3d535_0 conda-forge
libcups 2.3.3 h36d4200_3 conda-forge
libcurl 7.87.0 h6312ad2_0 conda-forge
libdb 6.2.32 h9c3ff4c_0 conda-forge
libdeflate 1.17 h0b41bf4_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 h516909a_1 conda-forge
libevent 2.1.10 h9b69904_4 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libflac 1.4.2 h27087fc_0 conda-forge
libgcc-ng 12.2.0 h65d4601_19 conda-forge
libgcrypt 1.10.1 h166bdaf_0 conda-forge
libgfortran-ng 12.2.0 h69a702a_19 conda-forge
libgfortran5 12.2.0 h337968e_19 conda-forge
libglib 2.74.1 h606061b_1 conda-forge
libgomp 12.2.0 h65d4601_19 conda-forge
libgpg-error 1.46 h620e276_0 conda-forge
libiconv 1.17 h166bdaf_0 conda-forge
libjpeg-turbo 2.1.4 h166bdaf_0 conda-forge
liblapack 3.9.0 16_linux64_openblas conda-forge
libllvm15 15.0.7 hadd5161_0 conda-forge
libnghttp2 1.51.0 hdcd2b5c_0 conda-forge
libogg 1.3.4 h7f98852_1 conda-forge
libopenblas 0.3.21 pthreads_h78a6416_3 conda-forge
libopus 1.3.1 h7f98852_1 conda-forge
libpng 1.6.39 h753d276_0 conda-forge
libpq 15.1 h2baec63_3 conda-forge
libprotobuf 3.21.12 h3eb15da_0 conda-forge
libsndfile 1.2.0 hb75c966_0 conda-forge
libsqlite 3.40.0 h753d276_0 conda-forge
libssh2 1.10.0 haa6b8db_3 conda-forge
libstdcxx-ng 12.2.0 h46fd767_19 conda-forge
libsystemd0 252 h2a991cd_0 conda-forge
libtiff 4.5.0 h6adf6a1_2 conda-forge
libtool 2.4.7 h27087fc_0 conda-forge
libudev1 252 h166bdaf_0 conda-forge
libuuid 2.32.1 h7f98852_1000 conda-forge
libvorbis 1.3.7 h9c3ff4c_0 conda-forge
libwebp-base 1.2.4 h166bdaf_0 conda-forge
libxcb 1.13 h7f98852_1004 conda-forge
libxkbcommon 1.0.3 he3ba5ed_0 conda-forge
libxml2 2.10.3 h7463322_0 conda-forge
libzlib 1.2.13 h166bdaf_4 conda-forge
libzopfli 1.0.3 h9c3ff4c_0 conda-forge
locket 1.0.0 pyhd8ed1ab_0 conda-forge
lz4-c 1.9.3 h9c3ff4c_1 conda-forge
markdown 3.4.1 pyhd8ed1ab_0 conda-forge
markupsafe 2.1.2 py38h1de0b5d_0 conda-forge
matplotlib 3.6.3 py38h578d9bd_0 conda-forge
matplotlib-base 3.6.3 py38hd6c3c57_0 conda-forge
matplotlib-inline 0.1.6 pyhd8ed1ab_0 conda-forge
mpg123 1.31.2 hcb278e6_0 conda-forge
multidict 6.0.4 py38h1de0b5d_0 conda-forge
munkres 1.1.4 pyh9f0ad1d_0 conda-forge
mysql-common 8.0.32 h14678bc_0 conda-forge
mysql-libs 8.0.32 h54cf53e_0 conda-forge
natsort 8.2.0 pyhd8ed1ab_0 conda-forge
nccl 2.14.3.1 h0800d71_0 conda-forge
ncurses 6.3 h27087fc_1 conda-forge
networkx 3.0 pyhd8ed1ab_0 conda-forge
nspr 4.35 h27087fc_0 conda-forge
nss 3.82 he02c5a1_0 conda-forge
numpy 1.23.0 py38h3a7f9d9_0 conda-forge
oauthlib 3.2.2 pyhd8ed1ab_0 conda-forge
openjpeg 2.5.0 hfec8fc6_2 conda-forge
openssl 1.1.1s h7f8727e_0
opt_einsum 3.3.0 pyhd8ed1ab_1 conda-forge
packaging 23.0 pyhd8ed1ab_0 conda-forge
pandas 1.5.3 py38hdc8b05c_0 conda-forge
parso 0.8.3 pyhd8ed1ab_0 conda-forge
partd 1.3.0 pyhd8ed1ab_0 conda-forge
pcre2 10.40 hc3806b6_0 conda-forge
pexpect 4.8.0 pyh1a96a4e_2 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pillow 9.4.0 py38hb32c036_0 conda-forge
pip 22.3.1 pyhd8ed1ab_0 conda-forge
pixman 0.40.0 h36c2ea0_0 conda-forge
plotly 5.13.0 pyhd8ed1ab_0 conda-forge
ply 3.11 py_1 conda-forge
pooch 1.6.0 pyhd8ed1ab_0 conda-forge
prompt-toolkit 3.0.36 pyha770c72_0 conda-forge
protobuf 4.21.12 py38h8dc9893_0 conda-forge
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pulseaudio 16.1 h4ab2085_1 conda-forge
pure_eval 0.2.2 pyhd8ed1ab_0 conda-forge
pyasn1 0.4.8 py_0 conda-forge
pyasn1-modules 0.2.7 py_0 conda-forge
pycparser 2.21 pyhd8ed1ab_0 conda-forge
pydensecrf 1.0rc3 py38h8f669ce_4 conda-forge
pygments 2.14.0 pyhd8ed1ab_0 conda-forge
pyjwt 2.6.0 pyhd8ed1ab_0 conda-forge
pyopenssl 23.0.0 pyhd8ed1ab_0 conda-forge
pyparsing 3.0.9 pyhd8ed1ab_0 conda-forge
pyqt 5.15.7 py38h7492b6b_2 conda-forge
pyqt5-sip 12.11.0 py38hfa26641_2 conda-forge
pysocks 1.7.1 pyha2e5f31_6 conda-forge
python 3.8.16 h7a1cb2a_2
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python-flatbuffers 23.1.21 pyhd8ed1ab_0 conda-forge
python_abi 3.8 2_cp38 conda-forge
pytz 2022.7.1 pyhd8ed1ab_0 conda-forge
pyu2f 0.1.5 pyhd8ed1ab_0 conda-forge
pywavelets 1.4.1 py38h7e4f40d_0 conda-forge
pyyaml 6.0 py38h0a891b7_5 conda-forge
qt-main 5.15.6 h18908ee_6 conda-forge
re2 2022.06.01 h27087fc_1 conda-forge
readline 8.2 h5eee18b_0
requests 2.28.2 pyhd8ed1ab_0 conda-forge
requests-oauthlib 1.3.1 pyhd8ed1ab_0 conda-forge
rsa 4.9 pyhd8ed1ab_0 conda-forge
scikit-image 0.19.3 py38h8f669ce_2 conda-forge
scipy 1.10.0 py38h10c12cc_0 conda-forge
setuptools 66.1.1 pyhd8ed1ab_0 conda-forge
sip 6.7.5 py38hfa26641_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
snappy 1.1.9 hbd366e4_2 conda-forge
sqlite 3.40.1 h5082296_0
stack_data 0.6.2 pyhd8ed1ab_0 conda-forge
tenacity 8.1.0 pyhd8ed1ab_0 conda-forge
tensorboard 2.10.1 pyhd8ed1ab_0 conda-forge
tensorboard-data-server 0.6.1 py38h2b5fc30_4 conda-forge
tensorboard-plugin-wit 1.8.1 pyhd8ed1ab_0 conda-forge
tensorflow 2.10.0 cuda112py38hded6998_0 conda-forge
tensorflow-base 2.10.0 cuda112py38h6b2b66c_0 conda-forge
tensorflow-estimator 2.10.0 cuda112py38hf5dcc89_0 conda-forge
tensorflow-gpu 2.10.0 cuda112py38h0bbbad9_0 conda-forge
termcolor 2.2.0 pyhd8ed1ab_0 conda-forge
tifffile 2023.1.23.1 pyhd8ed1ab_0 conda-forge
tk 8.6.12 h27826a3_0 conda-forge
toml 0.10.2 pyhd8ed1ab_0 conda-forge
toolz 0.12.0 pyhd8ed1ab_0 conda-forge
tornado 6.2 py38h0a891b7_1 conda-forge
tqdm 4.64.1 pyhd8ed1ab_0 conda-forge
traitlets 5.8.1 pyhd8ed1ab_0 conda-forge
typing-extensions 4.4.0 hd8ed1ab_0 conda-forge
typing_extensions 4.4.0 pyha770c72_0 conda-forge
unicodedata2 15.0.0 py38h0a891b7_0 conda-forge
urllib3 1.26.14 pyhd8ed1ab_0 conda-forge
versioneer 0.28 pypi_0 pypi
wcwidth 0.2.6 pyhd8ed1ab_0 conda-forge
werkzeug 2.2.2 pyhd8ed1ab_0 conda-forge
wheel 0.38.4 pyhd8ed1ab_0 conda-forge
wrapt 1.14.1 py38h0a891b7_1 conda-forge
xcb-util 0.4.0 h516909a_0 conda-forge
xcb-util-image 0.4.0 h166bdaf_0 conda-forge
xcb-util-keysyms 0.4.0 h516909a_0 conda-forge
xcb-util-renderutil 0.3.9 h166bdaf_0 conda-forge
xcb-util-wm 0.4.1 h516909a_0 conda-forge
xorg-kbproto 1.0.7 h7f98852_1002 conda-forge
xorg-libice 1.0.10 h7f98852_0 conda-forge
xorg-libsm 1.2.3 hd9c2040_1000 conda-forge
xorg-libx11 1.7.2 h7f98852_0 conda-forge
xorg-libxau 1.0.9 h7f98852_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xorg-libxext 1.3.4 h7f98852_1 conda-forge
xorg-libxrender 0.9.10 h7f98852_1003 conda-forge
xorg-renderproto 0.11.1 h7f98852_1002 conda-forge
xorg-xextproto 7.3.0 h7f98852_1002 conda-forge
xorg-xproto 7.0.31 h7f98852_1007 conda-forge
xz 5.2.10 h5eee18b_1
yaml 0.2.5 h7f98852_2 conda-forge
yarl 1.8.2 py38h0a891b7_0 conda-forge
zfp 1.0.0 h27087fc_3 conda-forge
zipp 3.11.0 pyhd8ed1ab_0 conda-forge
zlib 1.2.13 h166bdaf_4 conda-forge
zlib-ng 2.0.6 h166bdaf_0 conda-forge
zstd 1.5.2 h3eb15da_6 conda-forge
```
Additional context
As I mentioned, I was able to train multiple models using `dice` with the following hyper-parameters on the same dataset linked above.
```json
{
"TARGET_SIZE": [512, 512],
"MODEL": "unet",
"NCLASSES": 2,
"BATCH_SIZE": 10,
"N_DATA_BANDS": 1,
"DO_TRAIN": true,
"PATIENCE": 10,
"MAX_EPOCHS": 10,
"VALIDATION_SPLIT": 0.6,
"FILTERS": 2,
"KERNEL": 7,
"STRIDE": 2,
"LOSS": "dice",
"DROPOUT": 0.1,
"DROPOUT_CHANGE_PER_LAYER": 0.0,
"DROPOUT_TYPE": "standard",
"USE_DROPOUT_ON_UPSAMPLING": false,
"ROOT_STRING": "shadowpick",
"FILTER_VALUE": 3,
"DOPLOT": false,
"USEMASK": true,
"RAMPUP_EPOCHS": 10,
"SUSTAIN_EPOCHS": 0.0,
"EXP_DECAY": 0.9,
"START_LR": 1e-07,
"MIN_LR": 1e-07,
"MAX_LR": 0.0001,
"AUG_ROT": 0,
"AUG_ZOOM": 0.0,
"AUG_WIDTHSHIFT": 0.05,
"AUG_HEIGHTSHIFT": 0.05,
"AUG_HFLIP": true,
"AUG_VFLIP": false,
"AUG_LOOPS": 3,
"AUG_COPIES": 3,
"TESTTIMEAUG": false,
"SET_GPU": "0",
"DO_CRF": false,
"SET_PCI_BUS_ID": true,
"WRITE_MODELMETADATA": true,
"OTSU_THRESHOLD": true
}
```
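For reference, the learning rates logged in the console output later in this thread (1e-07, then 1.009e-05, 2.008e-05, 3.007e-05, ...) are exactly what a linear ramp from START_LR to MAX_LR over RAMPUP_EPOCHS would produce with the values above. A minimal sketch of such a ramp-up/sustain/exponential-decay schedule, assuming (not confirmed from gym's source) that this is how the scheduler works:

```python
def lrfn(epoch, start_lr=1e-07, min_lr=1e-07, max_lr=1e-04,
         rampup_epochs=10, sustain_epochs=0, exp_decay=0.9):
    # Defaults mirror the dice config above. With the cat config, where
    # START_LR = MIN_LR = MAX_LR = 0.1, this schedule is flat at 0.1.
    if epoch < rampup_epochs:
        # Linear ramp: epoch 0 -> 1e-07, epoch 1 -> 1.009e-05,
        # epoch 2 -> 2.008e-05, ... matching the logged rates.
        return (max_lr - start_lr) / rampup_epochs * epoch + start_lr
    elif epoch < rampup_epochs + sustain_epochs:
        return max_lr  # hold at the peak
    else:
        # Exponential decay back down toward min_lr.
        return (max_lr - min_lr) * exp_decay ** (
            epoch - rampup_epochs - sustain_epochs) + min_lr
```

Note that under this reading, the failing `cat` config is held constant at 0.1 for the whole run, though the reporter saw `nan` across 1e-1 to 1e-7 as well.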
I also tried other versions of tensorflow-gpu (2.4, 2.6, 2.7, 2.8) with `kld`, but the loss was reported as `inf`.
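For context on why these three losses in particular might blow up under reduced precision, here is a hedged sketch of how the `cat`, `hinge`, and `kld` keywords presumably map to stock Keras loss objects (the exact mapping inside gym/doodleverse-utils may differ):

```python
import tensorflow as tf

# Assumed mapping from gym's LOSS strings to Keras losses; 'dice' is a
# custom loss (presumably from doodleverse-utils) and is omitted here.
LOSSES = {
    "cat":   tf.keras.losses.CategoricalCrossentropy(),  # -sum(y_true * log(y_pred))
    "hinge": tf.keras.losses.CategoricalHinge(),
    "kld":   tf.keras.losses.KLDivergence(),             # sum(y_true * log(y_true / y_pred))
}

# The log terms in 'cat' and 'kld' are the usual suspects: in float16,
# probabilities that underflow to 0 give log(0) = -inf, which then
# propagates through the batch mean as inf/nan.
loss_fn = LOSSES["kld"]
```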
Hi @CameronBodine,
My hunch is this is mixed precision (which can cause underflow/overflow and therefore `nan` or `inf` loss). Can you try to train a model, but with these lines in `train_model.py` commented out:
(segmentation_gym/train_model.py, lines 128 to 132 in 809466a)
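The embedded snippet isn't reproduced in this thread, but lines 128-132 presumably enable TensorFlow mixed precision. A minimal sketch of what turning it on and off looks like with the standard Keras API (not gym's exact code):

```python
import tensorflow as tf

# What the referenced lines presumably do: run compute ops in float16
# while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

# The suggested fix: comment that out, or explicitly restore full precision.
tf.keras.mixed_precision.set_global_policy("float32")
```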
Right on the money @ebgoldstein! Running now with `cat` loss. Let me know if I can report back any info, or try out anything else.
Good call @ebgoldstein and thanks @CameronBodine for reporting
It would also be useful if you can confirm if you can train using 'kld' and/or 'hinge' without mixed precision, thanks
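A quick way to confirm which precision policy is actually active before training (standard Keras API, nothing gym-specific):

```python
import tensorflow as tf

# Should print <Policy "float32"> once mixed precision is disabled.
print(tf.keras.mixed_precision.global_policy())
```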
great news @CameronBodine ..
I have run into this scenario several times.. and have always been able to train with any loss by falling back to full precision..
for now I am going to close this issue. but please reopen if there are any other problems..
@dbuscombe-usgs - feel free to reopen this.. i just saw your comment above...
> It would also be useful if you can confirm if you can train using 'kld' and/or 'hinge' without mixed precision, thanks

I can confirm that a loss is now reported for both `kld` and `hinge` after disabling mixed precision.
I'm adding more info related to using mixed precision, FYI. Not sure if it's helpful, but figured I would document it.
If I don't comment out the lines @ebgoldstein referenced above, I get the following error using `LOSS='dice'`:
```
$ python train_model.py
/mnt/md0/SynologyDrive/Modeling/99_ForTesting/Test_ExecScript/datasets
/mnt/md0/SynologyDrive/Modeling/99_ForTesting/Test_ExecScript/config/Test_ExecScript.json
Using GPU
Using single GPU device
2023-02-13 12:46:12.951058: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Version: 2.11.0
Eager mode: True
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Making new directory for example model outputs: /mnt/md0/SynologyDrive/Modeling/99_ForTesting/Test_ExecScript/modelOut
MODE "all": using all augmented and non-augmented files
2023-02-13 12:46:15.089657: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-13 12:46:15.815354: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14606 MB memory: -> device: 0, name: Quadro RTX 5000, pci bus id: 0000:65:00.0, compute capability: 7.5
3
1
.....................................
Creating and compiling model ...
INITIAL_EPOCH not specified in the config file. Setting to default of 0 ...
.....................................
Training model ...
Epoch 1: LearningRateScheduler setting learning rate to 1e-07.
Epoch 1/5
2023-02-13 12:46:28.331262: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8401
2023-02-13 12:46:29.121728: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-02-13 12:46:52.331351: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x7f6d74003af0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-02-13 12:46:52.331451: I tensorflow/compiler/xla/service/service.cc:181] StreamExecutor device (0): Quadro RTX 5000, Compute Capability 7.5
2023-02-13 12:46:52.345416: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-02-13 12:46:52.564638: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-02-13 12:46:52.656992: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
3/3 [==============================] - 43s 2s/step - loss: 0.8905 - mean_iou: 0.0391 - dice_coef: 0.1095 - val_loss: 0.8784 - val_mean_iou: 0.0356 - val_dice_coef: 0.1216 - lr: 1.0000e-07
Epoch 2: LearningRateScheduler setting learning rate to 1.0090000000000002e-05.
Epoch 2/5
3/3 [==============================] - 3s 1s/step - loss: 0.8870 - mean_iou: 0.0424 - dice_coef: 0.1130 - val_loss: 0.8772 - val_mean_iou: 0.0329 - val_dice_coef: 0.1228 - lr: 1.0090e-05
Epoch 3: LearningRateScheduler setting learning rate to 2.008e-05.
Epoch 3/5
3/3 [==============================] - 3s 1s/step - loss: 0.8706 - mean_iou: 0.0560 - dice_coef: 0.1294 - val_loss: 0.8745 - val_mean_iou: 0.0332 - val_dice_coef: 0.1255 - lr: 2.0080e-05
Epoch 4: LearningRateScheduler setting learning rate to 3.0070000000000002e-05.
Epoch 4/5
3/3 [==============================] - 3s 1s/step - loss: 0.8517 - mean_iou: 0.0740 - dice_coef: 0.1483 - val_loss: 0.8705 - val_mean_iou: 0.0387 - val_dice_coef: 0.1295 - lr: 3.0070e-05
Epoch 5: LearningRateScheduler setting learning rate to 4.0060000000000006e-05.
Epoch 5/5
3/3 [==============================] - 3s 1s/step - loss: 0.8346 - mean_iou: 0.1016 - dice_coef: 0.1654 - val_loss: 0.8659 - val_mean_iou: 0.0577 - val_dice_coef: 0.1341 - lr: 4.0060e-05
Traceback (most recent call last):
File "train_model.py", line 920, in <module>
model.save(weights.replace('.h5','_fullmodel.h5'))
File "/home/cbodine/miniconda3/envs/gym/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/cbodine/miniconda3/envs/gym/lib/python3.8/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 775, in variables
return self._variables
AttributeError: 'LossScaleOptimizerV3' object has no attribute '_variables'
```
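The traceback shows the failure is in `model.save()` after training completes: with mixed precision on, the optimizer is wrapped in a `LossScaleOptimizerV3`, which this Keras version apparently can't serialize. A hedged workaround sketch, if you only need the trained weights (untested against gym itself):

```python
# Save weights only, sidestepping optimizer serialization entirely.
# 'weights' is the .h5 path used in the traceback above; the
# '_weights.h5' suffix here is illustrative.
model.save_weights(weights.replace('.h5', '_weights.h5'))

# To reuse them, rebuild the same architecture and load:
#   model = make_model(...)   # hypothetical builder for the same unet
#   model.load_weights(weights.replace('.h5', '_weights.h5'))
```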
Thanks @CameronBodine
We should modify the code so that, unless Dice is the loss, mixed precision is disabled with a warning.
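A minimal sketch of that guard, assuming the precision policy is set in one place in train_model.py (names are illustrative, not gym's actual code):

```python
import tensorflow as tf

# Keep mixed precision only for the loss known to tolerate it;
# warn and fall back to full float32 for everything else.
if LOSS == "dice":
    tf.keras.mixed_precision.set_global_policy("mixed_float16")
else:
    print(f"WARNING: disabling mixed precision for LOSS='{LOSS}' "
          "(produces nan/inf losses in float16)")
    tf.keras.mixed_precision.set_global_policy("float32")
```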
On `nan` losses with Dice: switching mixed precision off is the quick/easy way to get finite losses. However, I still have good luck with modifying the LR scheduler; so far, I've managed to get most models to converge that way, but it is obviously a much more time-consuming process involving trial and error.