About object detection training
louis624 opened this issue · 4 comments
Dear authors
Thank you for the great paper and its model architecture.
I have some questions related to the object detection experiments in your paper.
In Section 4.2 (Object Detection), it is written as follows:
We validate PiT through object detection on COCO dataset [24] in Deformable-DETR [44]. ... Since the original image resolution is too large for transformer-based backbones, we halve the image resolution for training and test of all backbones.
So, my questions are:
- Is PiT for object detection trained with a fixed size of 667 by 400 (half of 1333 and 800)? If so, were the images zero-padded in cases where the resized images were smaller than 667 by 400?
- For object detection, the input size is clearly different from that used for image classification. Does the patch size of PiT change, or does the number of patches change?
- If the number of patches for detection is kept the same as for image classification, does the patch embedding (conv_embedding) have a larger kernel size?
Thank you in advance.
Hi @louis624
Thank you for your interest in our paper.
Here are my answers.
1. Is PiT for object detection trained with a fixed size of 667 by 400 (half of 1333 and 800)? If so, were the images zero-padded in cases where the resized images were smaller than 667 by 400?
I'm sorry for the confusion. Let me explain our detection setting in detail.
We changed the following lines of the official Deformable-DETR code:
https://github.com/fundamentalvision/Deformable-DETR/blob/11169a60c33333af00a4849f1808023eba96a931/datasets/coco.py#L132-L152
Original
scales = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]

if image_set == 'train':
    return T.Compose([
        T.RandomHorizontalFlip(),
        T.RandomSelect(
            T.RandomResize(scales, max_size=1333),
            T.Compose([
                T.RandomResize([400, 500, 600]),
                T.RandomSizeCrop(384, 600),
                T.RandomResize(scales, max_size=1333),
            ])
        ),
        normalize,
    ])

if image_set == 'val':
    return T.Compose([
        T.RandomResize([800], max_size=1333),
        normalize,
    ])
Ours
scales = [400 - i * 16 for i in range(11)]

if image_set == 'train':
    return T.Compose([
        T.RandomHorizontalFlip(),
        T.RandomSelect(
            T.RandomResize(scales, max_size=666),
            T.Compose([
                T.RandomResize([200, 250, 300]),
                T.RandomSizeCrop(192, 300),
                T.RandomResize(scales, max_size=666),
            ])
        ),
        normalize,
    ])

if image_set == 'val':
    return T.Compose([
        T.RandomResize([400], max_size=666),
        normalize,
    ])
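As a side note, the new scales list is exactly the original multi-scale list halved (multiples of 16 instead of 32, from 240 up to 400). A quick standalone check, not code from the repository:

scales = [400 - i * 16 for i in range(11)]
print(scales)  # [400, 384, 368, 352, 336, 320, 304, 288, 272, 256, 240]

original = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
assert scales == sorted((s // 2 for s in original), reverse=True)

The crop parameters and max_size (1333 -> 666) are halved in the same way, which matches the paper's statement that the image resolution is halved for training and testing.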
So, it is not a fixed-size setting, and we didn't use any extra code for zero padding.
2. For object detection, the input size is clearly different from that used for image classification. Does the patch size of PiT change, or does the number of patches change?
3. If the number of patches for detection is kept the same as for image classification, does the patch embedding (conv_embedding) have a larger kernel size?
When the input size changes, PiT uses a different number of patches; we didn't change the kernel size of patch_embedding for object detection.
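As a concrete illustration, here is a quick standalone sketch of the patch-grid arithmetic. It assumes PiT-S-style values (patch_size=16, stride=8), which are not stated in this thread:

import math

def grid_size(size, patch_size=16, stride=8, padding=0):
    # Same formula as the `width` computation in PoolingTransformer.__init__ below.
    return math.floor((size + 2 * padding - patch_size) / stride + 1)

print(grid_size(224))                  # 27: patches per side at the classification size
print(grid_size(400), grid_size(666))  # 49 82: patch grid for a half-resolution COCO image

The embedding kernel is the same in both cases; only the grid of patches grows with the input.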
I think the PiT code we used for Deformable-DETR is the clearest answer to this question. We use the per-stage features instead of the cls_token used for image classification, and we interpolate pos_embed when the network processes a different input size, while keeping the kernel_size of patch_embedding unchanged.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from timm.models.layers import trunc_normal_

# conv_embedding, conv_head_pooling, and Transformer are the building blocks
# defined in the PiT codebase (pit.py); they are assumed to be in scope here.


class PoolingTransformer(nn.Module):
    def __init__(self, image_size, patch_size, stride,
                 num_classes, base_dims, depth, heads, mlp_ratio, in_chans=3,
                 attn_drop_rate=.0, drop_rate=.0, drop_path_rate=.0,
                 replace_stride_with_dilation=None):
        super(PoolingTransformer, self).__init__()

        total_block = sum(depth)
        padding = 0
        block_idx = 0
        if replace_stride_with_dilation is None:
            replace_stride_with_dilation = [False, False]
        self.dilation = 1

        # Size of the initial patch grid at the construction-time image_size.
        width = math.floor(
            (image_size + 2 * padding - patch_size) / stride + 1)

        self.base_dims = base_dims
        self.heads = heads
        self.num_classes = num_classes
        self.patch_size = patch_size
        self.embed_dim = base_dims[-1] * heads[-1]  # used by reset_classifier

        self.pos_embed = nn.Parameter(
            torch.randn(1, base_dims[0] * heads[0], width, width),
            requires_grad=True)
        self.patch_embed = conv_embedding(in_chans, base_dims[0] * heads[0],
                                          patch_size, stride, padding)

        self.cls_token = nn.Parameter(
            torch.randn(1, 1, base_dims[0] * heads[0]),
            requires_grad=True)
        self.pos_drop = nn.Dropout(p=drop_rate)

        self.transformers = nn.ModuleList([])
        self.pools = nn.ModuleList([])

        for stage in range(len(depth)):
            drop_path_prob = [drop_path_rate * i / total_block
                              for i in
                              range(block_idx, block_idx + depth[stage])]
            block_idx += depth[stage]

            self.transformers.append(
                Transformer(base_dims[stage], depth[stage], heads[stage],
                            mlp_ratio,
                            drop_rate, attn_drop_rate, drop_path_prob)
            )
            if stage < len(heads) - 1:
                stride = 2
                if replace_stride_with_dilation[stage]:
                    self.dilation *= stride
                    stride = 1
                self.pools.append(
                    conv_head_pooling(base_dims[stage] * heads[stage],
                                      base_dims[stage + 1] * heads[stage + 1],
                                      stride=stride,
                                      dilation=self.dilation)
                )

        self.norm = nn.LayerNorm(base_dims[-1] * heads[-1], eps=1e-6)

        # Classifier head
        self.head = nn.Linear(base_dims[-1] * heads[-1],
                              num_classes) if num_classes > 0 else nn.Identity()

        trunc_normal_(self.pos_embed, std=.02)
        trunc_normal_(self.cls_token, std=.02)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    @torch.jit.ignore
    def no_weight_decay(self):
        return {'pos_embed', 'cls_token'}

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes, global_pool=''):
        self.num_classes = num_classes
        self.head = nn.Linear(self.embed_dim,
                              num_classes) if num_classes > 0 else nn.Identity()

    def no_grad_head(self):
        self.head.weight.requires_grad_(False)
        self.head.bias.requires_grad_(False)
        self.norm.weight.requires_grad_(False)
        self.norm.bias.requires_grad_(False)

    def change_resolution(self, h, w):
        # Permanently resize the positional embedding to a new (h, w) grid.
        self.pos_embed = nn.Parameter(
            F.interpolate(self.pos_embed.data, (h, w), mode='bicubic'),
            requires_grad=True
        )

    def forward_features(self, x):
        x = self.patch_embed(x)
        # Interpolate pos_embed on the fly when the patch grid of the current
        # input differs from the construction-time grid.
        if x.shape[2:4] == self.pos_embed.shape[2:4]:
            pos_embed = self.pos_embed
        else:
            pos_embed = F.interpolate(self.pos_embed, x.shape[2:4],
                                      mode='bicubic')
        x = self.pos_drop(x + pos_embed)
        cls_tokens = self.cls_token.expand(x.shape[0], -1, -1)

        # Collect the feature map of every stage for the detection head.
        features = []
        for stage in range(len(self.pools)):
            x, cls_tokens = self.transformers[stage](x, cls_tokens)
            features.append(x)
            x, cls_tokens = self.pools[stage](x, cls_tokens)
        x, cls_tokens = self.transformers[-1](x, cls_tokens)
        features.append(x)

        return features, cls_tokens

    def forward(self, x):
        # Return per-stage feature maps instead of a classification logit.
        features, cls_tokens = self.forward_features(x)
        return features
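For completeness, a minimal usage sketch. The configuration values are hypothetical (PiT-S-style), and it assumes conv_embedding, conv_head_pooling, and Transformer are importable from the PiT codebase and handle non-square patch grids, as forward_features suggests:

import torch

model = PoolingTransformer(
    image_size=224, patch_size=16, stride=8, num_classes=0,
    base_dims=[48, 48, 48], depth=[2, 6, 4], heads=[3, 6, 12], mlp_ratio=4)

x = torch.randn(1, 3, 400, 666)  # a half-resolution COCO-style input
features = model(x)              # one feature map per stage for the detection head
print([tuple(f.shape) for f in features])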
I hope my answers resolve your questions about our detection setting.
Please let me know if you have any further questions.
Best
Thank you for the detailed answers to my questions!!!
Just one more question about the architecture.
In the architecture that you shared, there is a dilation argument for conv_head_pooling which does not exist in the conv_head_pooling class.
self.pools.append(
    conv_head_pooling(base_dims[stage] * heads[stage],
                      base_dims[stage + 1] * heads[stage + 1],
                      stride=stride, dilation=self.dilation)
)
In this case, since self.dilation is just 1, which is the default dilation of torch.nn.Conv2d, can I just ignore the dilation?
Thank you!
Yes, you can ignore the dilation option.
Because Deformable-DETR supports a dilation option for the backbone network, I implemented it for PiT as well.
But I didn't use it in the experiments, so you can simply ignore it.
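To make that concrete, here is a hypothetical version of the pooling construction that matches the classification-repo conv_head_pooling signature (no dilation parameter). Since self.dilation stays 1 with the default replace_stride_with_dilation, the resulting layer is identical:

self.pools.append(
    conv_head_pooling(base_dims[stage] * heads[stage],
                      base_dims[stage + 1] * heads[stage + 1],
                      stride=stride)  # dilation=1 is nn.Conv2d's default anyway
)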
Great! Thank you for the detailed explanations!!