Doodleverse/segmentation_gym

Restore model weights with missing/mystery config file

dbuscombe-usgs opened this issue · 11 comments

Describe the bug

A user, Stephen Bosse, has a trained model weights file but no longer has the original config file used to create it, so the model architecture is unknown. This is difficult to reverse engineer because many viable combinations of TARGET_SIZE, BATCH_SIZE, NCLASSES, KERNEL_SIZE and STRIDE together dictate the model architecture.

To Reproduce
See the attached files and the code snippet below, taken from seg_images_in_folder.

Original implementation

TARGET_SIZE = [768, 1024]
NCLASSES = 1
KERNEL = 7
STRIDE = 2
BATCH_SIZE = 6
FILTERS = 6
N_DATA_BANDS = 3

model = custom_resunet((TARGET_SIZE[0], TARGET_SIZE[1], N_DATA_BANDS),
                FILTERS,
                nclasses=NCLASSES + 1 if NCLASSES == 1 else NCLASSES,
                kernel_size=(KERNEL, KERNEL),
                strides=STRIDE,
                dropout=DROPOUT,  # 0.1
                dropout_change_per_layer=DROPOUT_CHANGE_PER_LAYER,  # 0.0
                dropout_type=DROPOUT_TYPE,  # "standard"
                use_dropout_on_upsampling=USE_DROPOUT_ON_UPSAMPLING,  # False
                )

model.load_weights(weights)

results in

ValueError: Cannot assign value to variable ' conv2d_304/kernel:0': Shape mismatch.The variable shape (2, 2, 3, 6), and the assigned value shape (6, 3, 7, 7) are incompatible.

Other trials

NCLASSES=2 results in ValueError: Cannot assign value to variable ' conv2d_334/kernel:0': Shape mismatch.The variable shape (2, 2, 3, 6), and the assigned value shape (6, 3, 7, 7) are incompatible.

TARGET_SIZE= [1024,768] NCLASSES=1 results in ValueError: Cannot assign value to variable ' conv2d_364/kernel:0': Shape mismatch.The variable shape (2, 2, 3, 6), and the assigned value shape (6, 3, 7, 7) are incompatible.

TARGET_SIZE= [1024,768]
NCLASSES=1
N_DATA_BANDS= 1

results in ValueError: Cannot assign value to variable ' conv2d_394/kernel:0': Shape mismatch.The variable shape (2, 2, 1, 6), and the assigned value shape (6, 3, 7, 7) are incompatible.

Discovering the config parameters this way seems like a very convoluted task. Does anyone have any good ideas (other than retraining the model and keeping the config file this time)?
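To give a sense of how large the search space gets, here is a minimal sketch of a brute-force search over candidate settings. The candidate values and the `try_load` callback are hypothetical and not part of Gym; in practice `try_load` would wrap the `custom_resunet(...)` build and `model.load_weights(weights)` call shown above and return True when no ValueError is raised.

```python
from itertools import product

# Hypothetical candidate values for illustration; these ranges do not come from Gym.
CANDIDATES = {
    "TARGET_SIZE": [(768, 1024), (1024, 768)],
    "NCLASSES": [1, 2],
    "KERNEL": [3, 5, 7, 9, 11],
    "STRIDE": [1, 2],
    "FILTERS": [4, 6, 8],
    "N_DATA_BANDS": [1, 3],
}


def candidate_configs(candidates):
    """Yield one dict per combination of the candidate values."""
    keys = list(candidates)
    for values in product(*(candidates[k] for k in keys)):
        yield dict(zip(keys, values))


def find_matching_configs(candidates, try_load):
    """Return every config for which try_load(config) returns True.

    try_load is supplied by the caller (hypothetical here): it should build
    the model from the config, attempt to load the weights, and report success.
    """
    return [cfg for cfg in candidate_configs(candidates) if try_load(cfg)]


# Number of model builds a full sweep of this small grid would need.
n_combinations = sum(1 for _ in candidate_configs(CANDIDATES))
print(n_combinations)  # 240
```

Even this small grid needs 240 model builds, and each added parameter multiplies that count, which is why guessing by hand does not scale.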

Further, with the original config

TARGET_SIZE= [768,1024]
NCLASSES=1
KERNEL=7
STRIDE=2
FILTERS=6
N_DATA_BANDS= 3

BATCH_SIZE=1 results in ValueError: Cannot assign value to variable ' conv2d_424/kernel:0': Shape mismatch.The variable shape (2, 2, 3, 6), and the assigned value shape (6, 3, 7, 7) are incompatible.

BATCH_SIZE=2 results in ValueError: Cannot assign value to variable ' conv2d_454/kernel:0': Shape mismatch.The variable shape (2, 2, 3, 6), and the assigned value shape (6, 3, 7, 7) are incompatible.

BATCH_SIZE=3 results in ValueError: Cannot assign value to variable ' conv2d_484/kernel:0': Shape mismatch.The variable shape (2, 2, 3, 6), and the assigned value shape (6, 3, 7, 7) are incompatible.

The specific conv layer changes, but the error is always the same.

The same happens for trials of KERNEL (5, 7, 9, 11).
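One way to read these errors more systematically is to pull both shapes out of each message and compare them across trials. The sketch below does that with a regular expression; `shapes_from_error` is a helper name made up for this illustration, and the two messages are copied from the trials above. The assigned value shape presumably comes from the weights file, so it should stay fixed whichever config is tried, while the variable shape reflects the model that was just built.

```python
import re

# Matches a parenthesised tuple of integers such as "(2, 2, 3, 6)".
SHAPE_RE = re.compile(r"\((\s*\d+\s*(?:,\s*\d+\s*)*)\)")


def shapes_from_error(message):
    """Return (variable_shape, assigned_shape) parsed from a shape-mismatch message."""
    groups = SHAPE_RE.findall(message)
    if len(groups) != 2:
        raise ValueError("expected exactly two shapes in the message")
    variable_shape, assigned_shape = (
        tuple(int(n) for n in group.split(",")) for group in groups
    )
    return variable_shape, assigned_shape


# Error messages quoted in this issue, keyed by the setting that was varied.
ERRORS = {
    "N_DATA_BANDS=3": "Cannot assign value to variable ' conv2d_304/kernel:0': "
    "Shape mismatch.The variable shape (2, 2, 3, 6), and the assigned value "
    "shape (6, 3, 7, 7) are incompatible.",
    "N_DATA_BANDS=1": "Cannot assign value to variable ' conv2d_394/kernel:0': "
    "Shape mismatch.The variable shape (2, 2, 1, 6), and the assigned value "
    "shape (6, 3, 7, 7) are incompatible.",
}

for trial, message in ERRORS.items():
    variable_shape, assigned_shape = shapes_from_error(message)
    print(trial, "built:", variable_shape, "stored:", assigned_shape)
```

In these two trials only the third entry of the built shape changes (3 versus 1), tracking N_DATA_BANDS, while the stored shape is (6, 3, 7, 7) both times.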

I think it's the same Conv2D layer each time. The name only changes because the TF session is still open, so every new conv2d layer gets a sequentially higher number in its name. I don't think it's actually a different conv layer, since the expected shape of the layer stays the same.

There are so many configs that I unfortunately can't think of a way to do this quickly, so I think it's more up to the operator to keep good notes and good coding practice with the config file (hence why I tagged this with 'won't fix').

Thanks, that makes sense about the incrementing conv2d layer names. And yes, after a little more exploring, I'm inclined to agree that this seems to be an impossible task without more info.

Just a thought: What if the model and config file were zipped into a single npz file? Then you would always have the model and associated config file. Not sure how big of a code change that would require. And this obviously won't help Stephen's current issue.
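As a rough illustration of that idea, the sketch below bundles a weights file and its config into one archive. It uses a plain zip archive from the Python standard library rather than an npz file, and the function names and the `config.json` member name are assumptions made for this example, not anything Gym provides.

```python
import json
import zipfile
from pathlib import Path


def bundle_model(weights_path, config, bundle_path):
    """Write the weights file and a JSON copy of the config into one zip archive."""
    weights_path = Path(weights_path)
    with zipfile.ZipFile(bundle_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Store the weights under their own file name, without directories.
        zf.write(weights_path, arcname=weights_path.name)
        # Store the config as readable JSON next to the weights.
        zf.writestr("config.json", json.dumps(config, indent=2, sort_keys=True))


def read_bundled_config(bundle_path):
    """Return the config dict stored in a bundle written by bundle_model."""
    with zipfile.ZipFile(bundle_path) as zf:
        return json.loads(zf.read("config.json"))
```

The config stays human readable once the archive is unpacked, but not while it is zipped.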

Hi all,

Below are the original config params used to create the weights file for the water masking model. This was done on one of the first iterations of Seg Zoo. My hope was to use the weights file (and associated config) on new imagery (of the same stretch of coast) with the latest version of Seg Zoo, but it would not run because the new iteration calls for params that were not included in that original config file, such as MODELS, FILTERS, STRIDE, etc. It sounds like this may not be possible and that I should train a new model?
"TARGET_SIZE": [768,1024],
"KERNEL_SIZE": 7,
"NCLASSES": 1,
"BATCH_SIZE": 6,
"N_DATA_BANDS": 3,
"DO_CRF_REFINE": false,
"DO_TRAIN": true,
"USE_LOCATION": false,
"PATIENCE": 25,
"IMS_PER_SHARD": 50,
"MAX_EPOCHS": 200,
"VALIDATION_SPLIT": 0.75,
"RAMPUP_EPOCHS": 20,
"SUSTAIN_EPOCHS": 0.0,
"EXP_DECAY": 0.9,
"START_LR": 1e-7,
"MIN_LR": 1e-7,
"MAX_LR": 1e-5,
"MEDIAN_FILTER_VALUE": 3,
"DOPLOT": true,
"ROOT_STRING": "watermask-nadir-data",
"USEMASK": false,
"AUG_ROT": 0,
"AUG_ZOOM": 0,
"AUG_WIDTHSHIFT": 0.05,
"AUG_HEIGHTSHIFT": 0.05,
"AUG_HFLIP": true,
"AUG_VFLIP": true,
"AUG_LOOPS": 3,
"AUG_COPIES": 2,
"DO_AUG": true

Looking at the config and the trials from @dbuscombe-usgs above, I think stride = 1 is going to help. (I think some early versions of the code may have defaulted to a stride of 1.)

Regardless, @sbosse12, I do think it's worth just retraining a model, making sure you have the most recent version of Gym. It has changed a bunch, with many improvements.

Thanks Evan, I'm thinking you may be right.

Thanks, all. Stephen, hopefully a fresh model will resolve this. You're a victim of being an early adopter of Gym! Much has changed, and you may now find, for example, that you can use a larger batch size than before.

Also, @CameronBodine, I'll think about zipping up the configuration and weights files together. One issue is that the configuration file used for training should be the same one used for inference, and it should be human readable. But perhaps there is a way to link the two files together.
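One possible way to link the two files while keeping the config human readable is to record a checksum of the weights file inside the config. This is only a sketch under stated assumptions: the `WEIGHTS_FILE` and `WEIGHTS_SHA256` keys and the function names are hypothetical and are not existing Gym config entries.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_linked_config(weights_path, config, config_path):
    """Save the config as JSON with the weights file name and checksum added."""
    linked = dict(config)
    linked["WEIGHTS_FILE"] = Path(weights_path).name  # hypothetical key
    linked["WEIGHTS_SHA256"] = sha256_of(weights_path)  # hypothetical key
    Path(config_path).write_text(json.dumps(linked, indent=2))


def config_matches_weights(weights_path, config_path):
    """Return True if the checksum stored in the config matches the weights file."""
    linked = json.loads(Path(config_path).read_text())
    return linked.get("WEIGHTS_SHA256") == sha256_of(weights_path)
```

A check like `config_matches_weights` could run before inference, so a mismatched or swapped config is caught before the model is built.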

I agree that keeping it human readable is critical. I think the wiki just needs to make clear that the config file is very important and should not be lost.

I added this via https://github.com/Doodleverse/segmentation_gym/wiki/4_Creation-of-%60config%60-files/8d1b163b9df98f7e9872e246cf5c64cbde4442e9

Closing this for now.