Doodleverse/segmentation_gym

INVALID_ARGUMENT: Cannot batch tensors with different shapes in component 0

mvilar22 opened this issue · 5 comments

When training the model I run into the following error:

File "train_model_offset.py", line 841, in <module>
    history = model.fit(train_ds, steps_per_epoch=steps_per_epoch, epochs=MAX_EPOCHS,
  File "/home/mvilar/miniconda3/envs/gym/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/mvilar/miniconda3/envs/gym/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:


6 root error(s) found.
  (0) INVALID_ARGUMENT:  Cannot batch tensors with different shapes in component 0. First element had shape [512,512,3] and element 3 had shape [512,512,1].
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[IteratorGetNext]]
	 [[replica_4/strided_slice/_1]]
	 [[div_no_nan_2/_547]]
  (1) INVALID_ARGUMENT:  Cannot batch tensors with different shapes in component 0. First element had shape [512,512,3] and element 3 had shape [512,512,1].
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[IteratorGetNext]]
	 [[replica_4/strided_slice/_1]]
  (2) INVALID_ARGUMENT:  Cannot batch tensors with different shapes in component 0. First element had shape [512,512,3] and element 3 had shape [512,512,1].
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[IteratorGetNext]]
	 [[div_no_nan_1/ReadVariableOp_2/_462]]
	 [[ArithmeticOptimizer/AddOpsRewrite_AddN_11/_604]]
	 [[div_no_nan_2/_543]]
  (3) INVALID_ARGUMENT:  Cannot batch tensors with different shapes in component 0. First element had shape [512,512,3] and element 3 had shape [512,512,1].
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[IteratorGetNext]]
	 [[div_no_nan_1/ReadVariableOp_2/_462]]
	 [[ArithmeticOptimizer/AddOpsRewrite_AddN_11/_604]]
  (4) INVALID_ARGUMENT:  Cannot batch tensors with different shapes in component 0. First element had shape [512,512,3] and element 3 had shape [512,512,1].
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[IteratorGetNext]]
	 [[div_no_nan_1/ReadVariableOp_2/_462]]
  (5) INVALID_ARGUMENT:  Cannot batch tensors with different shapes in component 0. First element had shape [512,512,3] and element 3 had shape [512,512,1].
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[IteratorGetNext]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_109274]

I am using a modified version of make_dataset.py and train_model.py that changes the color_offset so it doesn't "paint" the pixels of the images that belong to class 0 with any of the colors of the color map. This was done to address a significant overhead issue that was causing the training process to be excessively time-consuming.

make_dataset.py runs just fine, but I ran into this issue when I train. As I understand it, one of the images has a different shape than expected. However, I have iterated over the npz files, checking the shape of the arrays in each one, and there are no arrays with shape (3,3,1). I checked it with this code:

import numpy as np
import glob

# Directory where the npz files are located
npz_directory = '/path/to/npz_directory/*.npz'

# Get the list of npz files in the directory
npz_files = glob.glob(npz_directory)
print(len(npz_files))
# Iterate over the npz files
for npz_file in npz_files:
    with np.load(npz_file) as npz_data:
        # Get the names of the arrays stored in the npz file
        array_names = npz_data.files
        
        # Iterate over the array names
        for array_name in array_names:
            array = npz_data[array_name]
            
            # Check if the array has shape (3, 3, 1)
            if array.shape == (3, 3, 1):
                print(f"The file {npz_file} contains an array with shape (3, 3, 1): {array_name}")

I used this script with both the train_npzs and val_npzs folder paths, and nothing but the number of files was printed.

I have checked, and all the images provided to make_dataset.py are RGB jpg files and all the labels are grayscale jpg files, named exactly the same as the images but with a _label suffix before the .jpg.
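
For reference, this is roughly the kind of check I did on the input folders (the paths below are just placeholders):

import glob
import os
from PIL import Image

# Placeholder paths -- point these at your own image/label folders
image_dir = '/path/to/images'
label_dir = '/path/to/labels'

for img_path in glob.glob(os.path.join(image_dir, '*.jpg')):
    base = os.path.splitext(os.path.basename(img_path))[0]
    label_path = os.path.join(label_dir, base + '_label.jpg')

    # Every image should have a matching "<name>_label.jpg"
    if not os.path.exists(label_path):
        print(f"Missing label for {img_path}")
        continue

    # Images should be 3-band RGB, labels single-band grayscale
    with Image.open(img_path) as im:
        if im.mode != 'RGB':
            print(f"{img_path} is {im.mode}, expected RGB")
    with Image.open(label_path) as lab:
        if lab.mode != 'L':
            print(f"{label_path} is {lab.mode}, expected L (grayscale)")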

This is my config file:

"TARGET_SIZE": [512,512],
 "MODEL": "resunet",
 "NCLASSES": 25,
 "KERNEL":3,
 "STRIDE":2,
 "BATCH_SIZE": 5,
 "FILTERS":16,
 "N_DATA_BANDS": 3,
 "DROPOUT":0.25,
 "DROPOUT_CHANGE_PER_LAYER":0.0,
 "DROPOUT_TYPE":"standard",
 "USE_DROPOUT_ON_UPSAMPLING":false,
 "DO_TRAIN": true,
 "LOSS":"cat",
 "PATIENCE": 10,
 "MAX_EPOCHS": 30,
 "VALIDATION_SPLIT": 0.25,
 "RAMPUP_EPOCHS": 20,
 "SUSTAIN_EPOCHS": 0.0,
 "EXP_DECAY": 0.9,
 "START_LR":  1e-7,
 "MIN_LR": 1e-7,
 "MAX_LR": 1e-4,
 "FILTER_VALUE": 0,
 "DOPLOT": true,
 "ROOT_STRING": "mydata",
 "USEMASK": false,
 "AUG_ROT": 0,
 "AUG_ZOOM": 0.05,
 "AUG_WIDTHSHIFT": 0.05,
 "AUG_HEIGHTSHIFT": 0.05,
 "AUG_HFLIP": true,
 "AUG_VFLIP": true,
 "AUG_LOOPS": 10,
 "AUG_COPIES": 3,
 "SET_GPU": "0,1,2,3,4",
 "WRITE_MODELMETADATA": false,
 "DO_CRF": false,
 "LOSS_WEIGHTS": false,
 "MODE": "all",
 "SET_PCI_BUS_ID": true,
 "TESTTIMEAUG": true,
 "WRITE_MODELMETADATA": true,
 "OTSU_THRESHOLD": true      
}

Browsing the issues in the repo, some people seem to have had this issue when using images that only have 1 class. However, that should not be what's happening here, since all images should have 2 classes: class 0 (the "don't paint" class) and another class.
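
Just in case it helps, this is the kind of quick check I could run to confirm that, assuming arr_1 in each npz holds the one-hot label stack (the path is a placeholder):

import numpy as np
import glob

# Assuming 'arr_1' is the one-hot encoded label stack in each Gym npz
for npz_file in glob.glob('/path/to/npz_directory/*.npz'):
    with np.load(npz_file) as npz_data:
        label = np.argmax(npz_data['arr_1'], -1)   # collapse one-hot to class indices
        classes = np.unique(label)
        if len(classes) < 2:
            print(f"{npz_file} contains only class(es) {classes}")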

Here is a screenshot of the files in train_images, train_labels:
[screenshot: bug_report]

Note that these files are png, because they have been created by make_dataset.py.

Here are some of the npzs. I don't know if any of these are the ones causing the issue, but I'm attaching them just in case it's helpful:
sample.zip

And I think that's all, thanks in advance for your help!

Hi @mvilar22, I think you just need to change the "N_DATA_BANDS" value in your config file to be 1 instead of 3. Then try training again.

Nope, I misunderstood. Please disregard.

Hi @mvilar22 -

Since you are using a modified version of make_dataset and train_model, it will not be possible for me to debug this issue. However, I can share some general thoughts on what I think is going on:

  1. I have also run into issues with array sizes in the npz (usually it is a greyscale image that got mixed in with color images). I have not run your code to check for array shape, but I have a few points I want to mention from looking at it. It seems like your tensor shape issue is that most image elements have a shape of (512,512,3), but you have one with a shape of (512,512,1) mixed in. As a result, in your code you need to be looking for arrays of size (512,512,1), and *not* (3,3,1) as you have listed. My recollection is that arr_0 is the image and arr_1 is the label, so make sure to restrict your check to arr_0 if you are looking for offending images. Here is my version of that array checking code (note that it will delete offending npz's, so don't run it unless you are ready for that); a minimal sketch of the same idea, without the delete step, is shown after this list:

https://github.com/ebgoldstein/WaterCam/blob/main/array_shape_test.py

  2. I strongly recommend you look at the `remap classes` config, which allows you to use an unmodified version of make_dataset and train_model and adjust the number of classes within the Gym code, rather than relying on modifying make_dataset and train_model. See: https://github.com/Doodleverse/segmentation_gym/wiki/04_Creation-of-%60config%60-files

  3. How many classes does your problem have? From your explanation, there are 2 classes (0 and 1), but from your config file, it looks like there are 25 classes ("NCLASSES": 25). I imagine you might run into problems because of this, but maybe you have modified the code to deal with this mismatch, or I am not understanding something correctly.
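
Coming back to point 1, a minimal sketch of that shape check (looking only at arr_0, with a placeholder path, and without the delete step) would be something like:

import numpy as np
import glob

# Check only 'arr_0' (the image) in each Gym npz; 'arr_1' is the label
for npz_file in glob.glob('/path/to/npz_directory/*.npz'):
    with np.load(npz_file) as npz_data:
        img = npz_data['arr_0']
        # A (512,512,1) or 2-D array here is the kind of greyscale straggler
        # that breaks batching against the (512,512,3) images
        if img.shape != (512, 512, 3):
            print(f"{npz_file}: arr_0 has shape {img.shape}")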

Hi @mvilar22 , I echo everything @ebgoldstein said.

We're not going to help troubleshoot an alternative workflow:

  1. we have enough on our hands at the moment just maintaining the current workflow, which uses labels that can be remapped using the config file, or the utility that @ebgoldstein mentioned. I'd prefer you helped us maintain one codebase that works for all - yes, this is more difficult for you in the shorter term, but it's better for everyone in the longer term, and this is a collaborative codebase after all, not a custom solution.
  2. labels starting at zero is necessary for binary problems, and I've never encountered a problem with starting from 0 - it is what python expects, in general. If zeros are problematic, modifying the loss function to ignore them would be a simpler change, no?
  3. keras reports there is one class because there is only one class - the folder called 'images'. This is not a bug or error; it is simply a hangover from how keras has historically read folders of images for classification problems, i.e. one folder per class. We're using this simply to pipe to tf.data, firstly to make non-augmented files, then to pipe to augmentation procedures. It is very tricky to modify the augmentation pipeline.

And 25 classes: you are entering new territory and should pay attention to class imbalance. The most I have successfully used is NCLASSES=12, so as @ebgoldstein asked, did you modify the code at all to deal with a larger number of classes? (If so, how?) I'd be nervous about using cat loss on a large number of classes, but maybe it's fine - out of curiosity, did you try Dice first?

Finally, I noticed you are using 5 GPUs (?) - you must have a large dataset - are you sure you don't have a single bad image? Are all of your npz files roughly the same size in bytes?
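
For example, a rough check of the byte sizes (with a placeholder path) might look like:

import glob
import os

# Placeholder path -- point this at your npz folder
sizes = {f: os.path.getsize(f) for f in glob.glob('/path/to/npz_directory/*.npz')}

# Print the smallest and largest files; a clear outlier is suspicious
ranked = sorted(sizes, key=sizes.get)
for f in ranked[:5] + ranked[-5:]:
    print(f, sizes[f], 'bytes')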

I looked at your sample images and note they are atypical of a standard image segmentation problem - this looks like an image classification problem to me: each one of your subjects occupies most of the scene, and your labels are bounding boxes. Are you sure you need image segmentation for this task, rather than whole-image classification?

I used the following code:

from numpy.lib.npyio import load
from glob import glob
import matplotlib
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np 

files = glob("/home/marda/Downloads/sample/*.npz")

class_label_colormap = px.colors.qualitative.Light24
NUM_LABEL_CLASSES=25

cmap = matplotlib.colors.ListedColormap(class_label_colormap[:NUM_LABEL_CLASSES+1])

for counter,anno_file in enumerate(files):
    data = dict()
    with load(anno_file, allow_pickle=True) as dat:
        # create a dictionary of variables:
        # copy each key/array in the npz file (dat) into the dictionary (data)
        for k in dat.keys():
            data[k] = dat[k]
        del dat

        plt.imshow(data['arr_0'])
        plt.imshow(np.argmax(data['arr_1'],-1), alpha=0.5, vmin=0, vmax=25, cmap=cmap)
        plt.axis('off')
        plt.savefig(f"{counter}.png",dpi=200,bbox_inches='tight')
        plt.close()

[attached example output images: 7, 8, 9, 14, 2]

Hi, thanks for the insights!

I will try the suggestions and tinker with them to see if I can get it to work, just to see if it's possible. You are right that this is more of an object detection problem than a segmentation one, but since I was already using this workflow and it's fairly accessible, I thought why not. Clearly it's not the best solution for this specific problem.

Thanks again for the help and for the fantastic tool you have here!