
How to get the summary if the model output consists of int and str?

simon5u opened this issue · 1 comment

Describe the bug
torchinfo.py", line 448, in traverse_input_data
result = aggregate(
TypeError: unsupported operand type(s) for +: 'int' and 'str'

It seems that torchinfo.py cannot handle model outputs of mixed types (here, an int and a str). The failing branch is:

    elif isinstance(data, Iterable) and not isinstance(data, str):
        aggregate = aggregate_fn(data)
        result = aggregate(
            [traverse_input_data(d, action_fn, aggregate_fn) for d in data]
        )
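The same error message can be reproduced without torchinfo or LAVIS. The sketch below is a simplified stand-in for the branch quoted above, not torchinfo's actual code: `FakeTensor` and `traverse` are hypothetical names, and it assumes the aggregate in use is Python's built-in `sum`. Under that assumption, a `str` leaf inside a list is returned unchanged and then added to `sum`'s integer start value of `0`.

```python
from collections.abc import Iterable, Mapping


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch has no dependencies."""

    def __init__(self, nbytes):
        self.nbytes = nbytes


def traverse(data, action_fn, aggregate_fn):
    # Tensor-like leaves are mapped through action_fn.
    if isinstance(data, FakeTensor):
        return action_fn(data)
    # Mappings: recurse into the values, then aggregate them.
    if isinstance(data, Mapping):
        return aggregate_fn(
            [traverse(v, action_fn, aggregate_fn) for v in data.values()]
        )
    # Other iterables (but not str): recurse into each item, then aggregate.
    if isinstance(data, Iterable) and not isinstance(data, str):
        return aggregate_fn([traverse(d, action_fn, aggregate_fn) for d in data])
    # Anything else (for example a str) is returned unchanged.
    return data


# Same shape as the reproduction input: one tensor plus a list of strings.
samples = {"image": FakeTensor(1024), "text_input": ["a caption"]}

try:
    traverse(samples, action_fn=lambda t: t.nbytes, aggregate_fn=sum)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If this reading is right, the `str` that triggers the error may come from the `text_input` list in the input data rather than from the model output.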

To Reproduce
Steps to reproduce the behavior:

  1. Install LAVIS from https://github.com/salesforce/LAVIS. Versions used:
    salesforce-lavis 1.0.0
    transformers 4.25.0
  2. Run the following code to get the summary:
import torch
from PIL import Image

# load sample image
raw_image = Image.open("docs/_static/merlion.png").convert("RGB")

from lavis.models import load_model_and_preprocess
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

# preprocess the image
# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# generate caption
output = model.generate({"image": image})
# ['a large fountain spewing water into the air']

from torchinfo import summary
text_input = ["a large statue of a person spraying water from a fountain"]
samples = {"image": image, "text_input": text_input}
summary(model, input_data=[samples])

Expected behavior
torchinfo should produce the model summary.
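One direction that might lead to the expected behavior, assuming the failure comes from summing a non-numeric leaf, is an aggregate that skips such leaves. This is only a sketch: `tolerant_sum` is a hypothetical helper, not part of torchinfo, and it is not a tested patch for the library.

```python
from numbers import Number


def tolerant_sum(values):
    """Sum only the numeric entries, skipping leaves such as str."""
    return sum(
        v for v in values if isinstance(v, Number) and not isinstance(v, bool)
    )


# Hypothetical per-leaf results: two sizes in bytes and one string leaf.
sizes = [1024, "a caption", 2048]
total = tolerant_sum(sizes)
print(total)
```

The trade-off is that string inputs would silently contribute nothing to any size estimate, which may or may not be what the summary should report.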

I'm having the same issue. Is there any update on this?