How to get the summary if the model output consists of int and str?

Question

simon5u opened this issue a year ago · 1 comment

Describe the bug
torchinfo.py", line 448, in traverse_input_data
result = aggregate(
TypeError: unsupported operand type(s) for +: 'int' and 'str'

It seems that torchinfo.py cannot handle data that mixes different types, such as int and str. The failing branch aggregates the results of every element of an iterable:

    elif isinstance(data, Iterable) and not isinstance(data, str):
        aggregate = aggregate_fn(data)
        result = aggregate(
            [traverse_input_data(d, action_fn, aggregate_fn) for d in data]
        )
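
The following is a minimal sketch of how a branch like this can raise that exact error. It is not torchinfo's code: `FakeTensor`, `traverse`, `sum_aggregate`, and `total_size` are hypothetical names, and the sketch assumes that the aggregate function is a plain `sum` and that string leaves are passed through unchanged. Under those assumptions, a list holding one tensor-like value and one string ends up as `sum([int, str])`:

```python
from collections.abc import Iterable, Mapping


class FakeTensor:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch."""

    def __init__(self, nbytes):
        self.nbytes = nbytes


def traverse(data, action_fn, aggregate_fn):
    """Simplified recursive traversal, modelled on the excerpt above."""
    if isinstance(data, FakeTensor):
        return action_fn(data)
    if isinstance(data, Mapping):
        aggregate = aggregate_fn(data)
        return aggregate(
            {k: traverse(v, action_fn, aggregate_fn) for k, v in data.items()}
        )
    if isinstance(data, Iterable) and not isinstance(data, str):
        aggregate = aggregate_fn(data)
        return aggregate([traverse(d, action_fn, aggregate_fn) for d in data])
    # Anything else (for example a str) is returned unchanged.
    return data


def sum_aggregate(data):
    """Assumed aggregate: sum the values of a mapping, or the items of a list."""
    if isinstance(data, Mapping):
        return lambda d: sum(d.values())
    return sum


def total_size(data):
    """Add up the byte sizes of every tensor-like leaf."""
    return traverse(data, lambda t: t.nbytes, sum_aggregate)


# Only tensor-like leaves: the sum works.
print(total_size([FakeTensor(16), FakeTensor(8)]))

# A tensor-like leaf next to a string: sum() tries int + str and fails.
try:
    total_size([FakeTensor(16), "a caption"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

If this matches what happens inside torchinfo, any string leaf that reaches a summing aggregate would trigger the error, whether it comes from the model's inputs or from its outputs.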

To Reproduce
Steps to reproduce the behavior:

1. Install LAVIS from https://github.com/salesforce/LAVIS. Versions used:
   salesforce-lavis 1.0.0
   transformers 4.25.0
2. Run the following code to get the summary:

import torch
from PIL import Image

# load sample image
raw_image = Image.open("docs/_static/merlion.png").convert("RGB")

from lavis.models import load_model_and_preprocess
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)

# preprocess the image
# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# generate caption
output = model.generate({"image": image})
# ['a large fountain spewing water into the air']

from torchinfo import summary
text_input = ["a large statue of a person spraying water from a fountain"]
samples = {"image": image, "text_input": text_input}
# this call raises the TypeError shown above
summary(model, input_data=[samples])

Expected behavior
torchinfo should produce the model summary instead of raising a TypeError.

Answer 1 · 2023-12-27T09:21:39.000Z

I'm having the same issue. Is there any update on this?