huggingface/pytorch-image-models

[FEATURE] Chaining pooled output to classifier

ZeyuSun opened this issue · 2 comments

Motivation

Chaining the unpooled output to the classifier has already been implemented and can be done as follows:

import timm
import torch

model = timm.create_model('vit_medium_patch16_reg1_gap_256', pretrained=True)
output = model.forward_features(torch.randn(2, 3, 256, 256))
classified = model.forward_head(output)

Compared to the convolutional layer outputs ("pre-classifier" features), the outputs of the penultimate linear layer (pre-logits features) are equally useful in many tasks.
For example, we may want a vector embedding of an image to compute the intrinsic distance between two images.
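For example (a minimal sketch; 'resnet50' and the random tensors are placeholders for a real model and real images):

import timm
import torch
import torch.nn.functional as F

# Use pooled pre-classifier features as image embeddings
model = timm.create_model('resnet50', pretrained=True, num_classes=0)
model.eval()

with torch.no_grad():
    embeddings = model(torch.randn(2, 3, 224, 224))  # shape (2, num_features)

# Cosine similarity as one possible "intrinsic distance" between the two images
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)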

Current solutions are inefficient

The two current solutions either create a new network or change the network in place:

  • Create with no classifier:
    m = timm.create_model('resnet50', pretrained=True, num_classes=0)
  • Remove it later:
    m = timm.create_model('ese_vovnet19b_dw', pretrained=True)
    m.reset_classifier(0)

If I need to collect both the logits and the pre-logits features by iterating through the batches, I have to define two networks that differ only in the last layer.
This is suboptimal because all the boilerplate has to be replicated, and holding two copies may cause out-of-memory errors for large networks.
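A sketch of the duplication this forces (the toy loader stands in for a real DataLoader; note the backbone weights are held twice):

import timm
import torch

# Two networks that differ only in the last layer
model_logits = timm.create_model('resnet50', pretrained=True)
model_embed = timm.create_model('resnet50', pretrained=True, num_classes=0)
model_logits.eval()
model_embed.eval()

loader = [torch.randn(2, 3, 224, 224)]  # stand-in for a real DataLoader
all_logits, all_embeds = [], []
with torch.no_grad():
    for batch in loader:
        all_logits.append(model_logits(batch))  # features computed once here...
        all_embeds.append(model_embed(batch))   # ...and recomputed here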

Potential solutions

One potential solution for getting pre-logits features is calling model.forward_head with pre_logits=True. This works for most networks, but some networks do not accept the pre_logits argument (a quick check for this is sketched after the list):

repghost.py:304:    def forward_head(self, x):
ghostnet.py:291:    def forward_head(self, x):
inception_v3.py:374:    def forward_head(self, x):
tiny_vit.py:548:    def forward_head(self, x):
nasnet.py:556:    def forward_head(self, x):
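A quick way to detect such models (a small defensive sketch using inspect, not part of the timm API):

import inspect
import timm

def supports_pre_logits(model):
    """True if the model's forward_head signature accepts pre_logits."""
    return 'pre_logits' in inspect.signature(model.forward_head).parameters

model = timm.create_model('ghostnet_100', pretrained=False)
print(supports_pre_logits(model))  # False for the timm version referenced above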

A more general alternative is to set up a common interface that cuts the network in two and chains the halves. This requires passing the cut point to forward_features and forward_head.
However, this generality may not be necessary, because the convolutional features and the pre-logits features are arguably the two most important intermediate outputs.

@ZeyuSun that's what pre_logits is there for; it is supposed to be implemented across all models, and it can be added to the ones that were missed... I feel trying to go beyond that without breaking compatibility would be an exercise in overcomplication (for the functionality provided).

It should be noted that pooling can be quite intertwined in the 'head': some models have an extra conv/linear or norm between pooling and the final classifier, which is why the logical break was between 'features' and 'head', where head includes the pooling. Something I may add is a transformers-style Dict/Dataclass that can collect the separate items, but adding this without breaking compat will likely complicate typing and break torchscript for all models, so I'm hoping to do it once torchscript is finally deprecated...
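Roughly what I have in mind, as a hypothetical sketch (none of this exists in timm today):

from dataclasses import dataclass
import torch

@dataclass
class HeadOutput:
    # Hypothetical transformers-style container collecting the separate items
    unpooled: torch.Tensor  # forward_features output
    pooled: torch.Tensor    # features after pooling (pre-logits)
    logits: torch.Tensor    # final classifier output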

As is, this is the closest option, and it only repeats pooling (and possibly one conv or norm in some cases):

unpooled_features = model.forward_features(x)
pooled_features = model.forward_head(unpooled_features, pre_logits=True)
classified = model.forward_head(unpooled_features)

In cases where the head is a simple pool + classifier, the classifier can be applied directly to the pooled features:

unpooled_features = model.forward_features(x)
pooled_features = model.forward_head(unpooled_features, pre_logits=True)
classified = model.get_classifier()(pooled_features)
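End to end, using the model from the opening example (a runnable sketch; random input stands in for real images):

import timm
import torch

model = timm.create_model('vit_medium_patch16_reg1_gap_256', pretrained=True)
model.eval()

x = torch.randn(2, 3, 256, 256)
with torch.no_grad():
    unpooled_features = model.forward_features(x)                             # (B, N, C) tokens
    pooled_features = model.forward_head(unpooled_features, pre_logits=True)  # (B, C)
    classified = model.forward_head(unpooled_features)                        # (B, num_classes)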

I've also thought about pre_logits='both' to return both the last hidden features and the classifier output, but changing the output type signature is likely to cause big headaches with torchscript.

I didn't check all five networks whose forward_head doesn't accept pre_logits, but I did see that the "pre-logits" features are unflattened conv-layer outputs for some networks. I guess it isn't apparent for those networks what the vector representation of an image should be. If that's the case, then I agree that we shouldn't abstract out a "pre-logits vector representation" and should stick with using pre_logits=True.