INVALID_ARGUMENT: Cannot batch tensors with different shapes in component 0
mvilar22 opened this issue · 5 comments
When training the model I run into the following error:
File "train_model_offset.py", line 841, in <module>
history = model.fit(train_ds, steps_per_epoch=steps_per_epoch, epochs=MAX_EPOCHS,
File "/home/mvilar/miniconda3/envs/gym/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/mvilar/miniconda3/envs/gym/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
6 root error(s) found.
(0) INVALID_ARGUMENT: Cannot batch tensors with different shapes in component 0. First element had shape [512,512,3] and element 3 had shape [512,512,1].
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNext]]
[[replica_4/strided_slice/_1]]
[[div_no_nan_2/_547]]
(1) INVALID_ARGUMENT: Cannot batch tensors with different shapes in component 0. First element had shape [512,512,3] and element 3 had shape [512,512,1].
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNext]]
[[replica_4/strided_slice/_1]]
(2) INVALID_ARGUMENT: Cannot batch tensors with different shapes in component 0. First element had shape [512,512,3] and element 3 had shape [512,512,1].
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNext]]
[[div_no_nan_1/ReadVariableOp_2/_462]]
[[ArithmeticOptimizer/AddOpsRewrite_AddN_11/_604]]
[[div_no_nan_2/_543]]
(3) INVALID_ARGUMENT: Cannot batch tensors with different shapes in component 0. First element had shape [512,512,3] and element 3 had shape [512,512,1].
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNext]]
[[div_no_nan_1/ReadVariableOp_2/_462]]
[[ArithmeticOptimizer/AddOpsRewrite_AddN_11/_604]]
(4) INVALID_ARGUMENT: Cannot batch tensors with different shapes in component 0. First element had shape [512,512,3] and element 3 had shape [512,512,1].
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNext]]
[[div_no_nan_1/ReadVariableOp_2/_462]]
(5) INVALID_ARGUMENT: Cannot batch tensors with different shapes in component 0. First element had shape [512,512,3] and element 3 had shape [512,512,1].
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_109274]
I am using modified versions of make_dataset.py and train_model.py that change the color_offset so that pixels belonging to class 0 are not "painted" with any of the colors of the color map. I did this to address a significant overhead issue that was making the training process excessively time-consuming.
make_dataset.py runs just fine, but I run into this issue when I train. As I understand it, one of the images has a different shape than expected. However, I have iterated over the npz files, checking the shape of the arrays in each npz, and there are no arrays with shape (3,3,1). I checked it with this code:
import numpy as np
import glob

# Directory where the npz files are located
npz_directory = '/path/to/npz_directory/*.npz'

# Get the list of npz files in the directory
npz_files = glob.glob(npz_directory)
print(len(npz_files))

# Iterate over the npz files
for npz_file in npz_files:
    with np.load(npz_file) as npz_data:
        # Get the names of the arrays stored in the npz file
        array_names = npz_data.files
        # Iterate over the array names
        for array_name in array_names:
            array = npz_data[array_name]
            # Check if the array has shape (3, 3, 1)
            if array.shape == (3, 3, 1):
                print(f"The file {npz_file} contains an array with shape (3, 3, 1): {array_name}")
I ran this script on both the train_npzs and val_npzs folders, and nothing but the number of files was printed.
I have checked, and all the images provided to make_dataset.py are RGB jpg files and all the labels are grayscale jpg files, named exactly the same as the images but with a _label before the .jpg.
This is my config file:
{
"TARGET_SIZE": [512,512],
"MODEL": "resunet",
"NCLASSES": 25,
"KERNEL":3,
"STRIDE":2,
"BATCH_SIZE": 5,
"FILTERS":16,
"N_DATA_BANDS": 3,
"DROPOUT":0.25,
"DROPOUT_CHANGE_PER_LAYER":0.0,
"DROPOUT_TYPE":"standard",
"USE_DROPOUT_ON_UPSAMPLING":false,
"DO_TRAIN": true,
"LOSS":"cat",
"PATIENCE": 10,
"MAX_EPOCHS": 30,
"VALIDATION_SPLIT": 0.25,
"RAMPUP_EPOCHS": 20,
"SUSTAIN_EPOCHS": 0.0,
"EXP_DECAY": 0.9,
"START_LR": 1e-7,
"MIN_LR": 1e-7,
"MAX_LR": 1e-4,
"FILTER_VALUE": 0,
"DOPLOT": true,
"ROOT_STRING": "mydata",
"USEMASK": false,
"AUG_ROT": 0,
"AUG_ZOOM": 0.05,
"AUG_WIDTHSHIFT": 0.05,
"AUG_HEIGHTSHIFT": 0.05,
"AUG_HFLIP": true,
"AUG_VFLIP": true,
"AUG_LOOPS": 10,
"AUG_COPIES": 3,
"SET_GPU": "0,1,2,3,4",
"WRITE_MODELMETADATA": false,
"DO_CRF": false,
"LOSS_WEIGHTS": false,
"MODE": "all",
"SET_PCI_BUS_ID": true,
"TESTTIMEAUG": true,
"WRITE_MODELMETADATA": true,
"OTSU_THRESHOLD": true
}
Browsing the issues in the repo, some people seem to have had this issue when using images that only have 1 class. However, that should not be what's happening here, since all images should have 2 classes: class 0 (the "don't paint" class) and one other class.
Here is a screenshot of the files in train_images, train_labels:
Note that these files are png, because they have been created by make_dataset.py.
Here are some of the npzs. I don't know if any of these are the ones causing the issue, but just in case it's helpful:
sample.zip
And I think that's all, thanks in advance for your help!
Hi @mvilar22, I think you just need to change "N_DATA_BANDS" value in your config file to be 1 instead of 3. Then try training again.
Nope I misunderstood. Please disregard.
Hi @mvilar22 -
Since you are using modified versions of make_dataset and train_model, it will not be possible for me to debug this issue. However, I can share some general thoughts on what I think is going on:
- I have also run into issues with array sizes in the npz files (usually it is a greyscale image that got mixed in with color images). I have not run your array-checking code, but I want to mention a few points from looking at it. Your tensor shape issue seems to be that most image elements have a shape of (512,512,3), but you have one with a shape of (512,512,1) mixed in. As a result, your code needs to look for arrays of shape (512,512,1), and *not* (3,3,1) as you have listed. My recollection is that arr_0 is the image and arr_1 is the label, so make sure to restrict your check to arr_0 if you are looking for offending images. Here is my version of that array-checking code (note that it will delete offending npzs, so don't run it unless you are ready to do that):
https://github.com/ebgoldstein/WaterCam/blob/main/array_shape_test.py
- I strongly recommend you look at the 'remap classes' config, which allows you to use unmodified versions of make_dataset and train_model and adjust the number of classes within the Gym code, rather than relying on modified scripts. See: https://github.com/Doodleverse/segmentation_gym/wiki/04_Creation-of-%60config%60-files
- How many classes does your problem have? From your explanation, there are 2 classes (0 and 1), but from your config file, it looks like there are 25 classes ("NCLASSES": 25). I imagine you might run into problems because of this, but maybe you have modified the code to deal with this mismatch, or I am not understanding something correctly.
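The check described in the first point above can also be done without deleting anything. What follows is only a minimal sketch, not the linked script: it assumes (per the recollection above, unverified) that the image is stored under arr_0, and the function name find_mismatched_npz, the directory path, and the default expected shape are placeholders chosen for illustration.

```python
import glob
import os

import numpy as np


def find_mismatched_npz(npz_dir, expected_shape=(512, 512, 3), key="arr_0"):
    """Return (path, shape) pairs for npz files whose `key` array differs from `expected_shape`.

    Assumes the image is stored under `key` (arr_0 here); files that lack
    the key are reported with a shape of None. Nothing is deleted.
    """
    mismatched = []
    for path in sorted(glob.glob(os.path.join(npz_dir, "*.npz"))):
        with np.load(path) as data:
            if key not in data.files:
                mismatched.append((path, None))
                continue
            shape = data[key].shape
        if shape != tuple(expected_shape):
            mismatched.append((path, shape))
    return mismatched


if __name__ == "__main__":
    # Placeholder path: point this at the train or validation npz folder.
    for path, shape in find_mismatched_npz("/path/to/npz_directory"):
        print(f"{path}: arr_0 has shape {shape}")
```

Any file it reports with a trailing dimension of 1 is a candidate for the greyscale image that breaks batching.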
Hi @mvilar22 , I echo everything @ebgoldstein said.
We're not going to help troubleshoot an alternative workflow:
- we have enough on our hands at the moment just maintaining the current workflow, which uses labels that can be remapped using the config file or the utility that @ebgoldstein mentioned. I'd prefer you helped us maintain one codebase that works for all. Yes, this is more difficult for you in the shorter term, but it's better for everyone in the longer term, and this is a collaborative codebase after all, not a custom solution.
- labels starting at zero is necessary for binary problems, and I've never encountered a problem with starting from 0 - it is what Python expects, in general. If zeros are problematic, modifying the loss function to ignore them would be a simpler change, no?
- keras reports there is one class, because there is only one class - the folder called 'images'. This is not a bug or error, it is simply a hangover from how keras has historically read folders of images for classification problems, i.e. one folder per class. We're using this simply to pipe to tf.data, firstly to make non-augmented files, then to pipe to augmentation procedures. It is very tricky to modify the augmentation pipeline.
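To make the "modify the loss function to ignore zeros" idea concrete, here is a minimal sketch of the arithmetic only. It is written in plain NumPy rather than as a Keras loss, it is not Gym's loss, and the function name masked_categorical_crossentropy and its parameters are hypothetical; a usable training loss would need the same steps expressed with tensor ops.

```python
import numpy as np


def masked_categorical_crossentropy(y_true, y_pred, ignore_class=0, eps=1e-7):
    """Mean categorical cross-entropy over pixels whose true class is not `ignore_class`.

    y_true: one-hot labels, shape (..., n_classes)
    y_pred: predicted class probabilities, same shape as y_true
    """
    y_pred = np.clip(y_pred, eps, 1.0)
    # Per-pixel cross-entropy: -sum_c y_true[c] * log(y_pred[c])
    per_pixel = -np.sum(y_true * np.log(y_pred), axis=-1)
    # Weight 0 where the true class is the ignored one, 1 everywhere else
    keep = (np.argmax(y_true, axis=-1) != ignore_class).astype(per_pixel.dtype)
    # Average over kept pixels only; the max() avoids dividing by zero
    # when every pixel belongs to the ignored class
    return float(np.sum(per_pixel * keep) / max(np.sum(keep), 1.0))
```

With this weighting, pixels labelled 0 contribute nothing to the average, so the labels themselves can keep starting at zero.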
And 25 classes - you are entering new territory and should pay attention to class imbalance. The most I have successfully used is NCLASSES=12, so as @ebgoldstein asked, did you modify the code at all to deal with a larger number of classes? (if so, how?) I'd be nervous about using cat loss on a large number of classes, but maybe it's fine - out of curiosity, did you try Dice first?
Finally, I noticed you are using 4 GPUs (?) - you must have a large dataset - are you sure you don't have a single bad image? Are all of your npz files roughly the same size in bytes?
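One quick way to answer the file-size question is to compare each npz against the median size. This is only a rough sketch: the function name npz_size_outliers and the 50% tolerance are arbitrary placeholders, and if the npz files are compressed their sizes will vary naturally with image content, so a flagged file is a hint to inspect rather than proof of a bad image.

```python
import glob
import os
import statistics


def npz_size_outliers(npz_dir, tolerance=0.5):
    """Return (path, size_in_bytes) for npz files whose size differs from the
    median size by more than `tolerance`, expressed as a fraction of the median."""
    paths = sorted(glob.glob(os.path.join(npz_dir, "*.npz")))
    sizes = [(path, os.path.getsize(path)) for path in paths]
    if not sizes:
        return []
    median = statistics.median(size for _, size in sizes)
    return [(path, size) for path, size in sizes if abs(size - median) > tolerance * median]


if __name__ == "__main__":
    # Placeholder path: point this at the train or validation npz folder.
    for path, size in npz_size_outliers("/path/to/npz_directory"):
        print(f"{path}: {size} bytes")
```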
I looked at your sample images and note they are atypical of a standard image segmentation problem. This looks like an image classification problem to me: each of your subjects occupies most of the scene, and your labels are bounding boxes. Are you sure you need image segmentation for this task, rather than whole-image classification?
I use the following code
from numpy.lib.npyio import load
from glob import glob
import matplotlib
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np

files = glob("/home/marda/Downloads/sample/*.npz")

class_label_colormap = px.colors.qualitative.Light24
NUM_LABEL_CLASSES=25
cmap = matplotlib.colors.ListedColormap(class_label_colormap[:NUM_LABEL_CLASSES+1])

for counter,anno_file in enumerate(files):
    data = dict()
    with load(anno_file, allow_pickle=True) as dat:
        #create a dictionary of variables
        #automatically converted the keys in the npz file, dat to keys in the dictionary, data, then assigns the arrays to data
        for k in dat.keys():
            data[k] = dat[k]
        del dat
    plt.imshow(data['arr_0'])
    plt.imshow(np.argmax(data['arr_1'],-1), alpha=0.5, vmin=0, vmax=25, cmap=cmap)
    plt.axis('off')
    plt.savefig(f"{counter}.png",dpi=200,bbox_inches='tight')
    plt.close()
Hi, thanks for the insights!
I will try the suggestions and tweak things to see if I can get it to work, just to see if it's possible. You are right that this is more of an object detection problem than a segmentation one, but since I was already using this workflow and it's fairly accessible, I thought why not. Clearly it's not the best solution for this specific problem.
Thanks again for the help and for the fantastic tool you have here!