cheerss/CrossFormer

Some questions about your paper and code

Huzhen757 opened this issue · 6 comments

Hi, I'm very interested in your work on multi-scale attention in Transformers, but I have some questions about your work:

  1. In Appendix 2 (DPB), why do the parameters i and j range from 0 to 2G-1 instead of 0 to G-1? Besides, the input to the DPB
    module is (1-G+i, 1-G+j). What is the reason for this setting? Why not just use i and j as inputs?

  2. When I debug your code, I added some parameters because I only have one 3090 with 24 GB of memory, like this:

parser = argparse.ArgumentParser('CrossFormer training and evaluation script', add_help=False)
parser.add_argument('--cfg', type=str, required=True, metavar="FILE",
                    default='/configs/small_patch4_group7_224.yaml', help='path to config file')
parser.add_argument(
    "--opts",
    help="Modify config options by adding 'KEY VALUE' pairs. ",
    default=None,
    nargs='+'
)
# easy config modification
parser.add_argument('--batch-size', type=int, default=32, help="batch size for single GPU")
parser.add_argument('--data-set', type=str, default='flower', help='dataset to use')
parser.add_argument('--data-path', type=str, help='path to dataset', default='/media/data2/huzhen/flower_data')
parser.add_argument('--zip', action='store_true', help='use zipped dataset instead of folder dataset')
parser.add_argument('--cache-mode', type=str, default='part', choices=['no', 'full', 'part'],
                    help='no: no cache, '
                         'full: cache all data, '
                         'part: sharding the dataset into nonoverlapping pieces and only cache one piece')
parser.add_argument('--resume', help='resume from checkpoint', default='')
parser.add_argument('--accumulation-steps', type=int, help="gradient accumulation steps")
parser.add_argument('--use-checkpoint', action='store_true',
                    help="whether to use gradient checkpointing to save memory")
parser.add_argument('--amp-opt-level', type=str, default='native', choices=['native', 'O0', 'O1', 'O2'],
                    help='mixed precision opt level, if O0, no amp is used')
parser.add_argument('--output', default='./Flower_weights', type=str, metavar='PATH',
                    help='root of output folder, the full path is <output>/<model_name>/<tag> (default: output)')
parser.add_argument('--tag', help='tag of experiment')
parser.add_argument('--eval', action='store_true', help='Perform evaluation only')
parser.add_argument('--throughput', action='store_true', help='Test throughput only')
parser.add_argument('--num_workers', type=int, default=8, help="")
parser.add_argument('--mlp_ratio', type=int, default=4, help="")
parser.add_argument('--warmup_epochs', type=int, default=20, help="#epochs for warm up")
parser.add_argument("--local_rank", type=int, required=True, default=0, help='local rank for DistributedDataParallel')
parser.add_argument('--device', default='cuda:2',
                    help='device to use for training / testing')

args, unparsed = parser.parse_known_args()

but it reports an error: An exception occurred: SystemExit 2.
The above is my parameter setting. Is there a problem?
I sincerely hope I can get your help!

  1. DPB receives relative positions as input. Take the one-dimensional case as an example: if the sequence length is G, the relative position between any two tokens ranges over [1-G, G-1], which spans 2G-1 values. The two-dimensional case is similar. Absolute position bias uses i and j as input directly, while we use relative position bias. You may wish to read the paper "Self-Attention with Relative Position Representations" for the difference between the two. (A small illustration of this index range is sketched after this list.)

  2. To run the model with 1 GPU, you could do this without changing any code:

CUDA_VISIBLE_DEVICES=0 python -u -m torch.distributed.launch --nproc_per_node 1 main.py --cfg configs/tiny_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output
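
As a small illustration of the index range in point 1 (this is only a sketch, not the DPB module from the repository): for a group of width G, the per-axis relative position between two tokens lies in [1-G, G-1], i.e. 2G-1 distinct values, and shifting it by G-1 gives the non-negative indices i, j in [0, 2G-2] that correspond to the inputs (1-G+i, 1-G+j) mentioned above:

import torch

G = 7  # group size along one axis

# coordinates of every token inside a G x G group
coords = torch.stack(torch.meshgrid(torch.arange(G), torch.arange(G), indexing='ij'))  # [2, G, G]
coords = coords.flatten(1)                                     # [2, G*G]

# per-axis relative position between every pair of tokens: values in [1-G, G-1]
rel = coords[:, :, None] - coords[:, None, :]                  # [2, G*G, G*G]
print(rel.min().item(), rel.max().item())                      # -6 6, i.e. 1-G and G-1

# shift to non-negative indices i, j in [0, 2G-2], which can index a (2G-1) x (2G-1) bias table
idx = rel + (G - 1)
print(idx.min().item(), idx.max().item())                      # 0 12, i.e. 0 and 2G-2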

OK, thanks for your careful reply.
The second point you mentioned is only about how to train with a single GPU from the terminal. But I want to ask how to use a single GPU to debug the code, because I want to see more details of the model's forward pass.

I set the config file path of the corresponding model in the '--cfg' parameter in the main script, set the '--local_rank' parameter to 0, and added a new 'device' parameter to specify which GPU to use.
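
For reference, one way such a modification could look (a sketch only, assuming the argument names mentioned above; the defaults are placeholders, not the repository's values):

# debug-friendly variants of the arguments above: defaults are filled in and required=True is
# dropped, so the script can be started from an IDE without torch.distributed.launch supplying them
parser.add_argument('--cfg', type=str, default='configs/small_patch4_group7_224.yaml',
                    help='path to config file')
parser.add_argument('--local_rank', type=int, default=0,
                    help='local rank for DistributedDataParallel (0 when debugging on a single GPU)')
parser.add_argument('--device', type=str, default='cuda:0',
                    help='device to use for training / testing')

Keeping --cfg and --local_rank as required=True is what makes argparse exit with SystemExit 2 when the script is launched without command-line arguments.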

I have solved this question, but while debugging the code I noticed something in the CrossFormerBlock: even though each block distinguishes whether SDA or LDA is used, the distinction does not seem to be reflected in the code, like this:
G = self.group_size
if self.lsda_flag == 0:  # 0 for SDA: [bs, 112, 112, 96] -> [bs, 16, 16, 7, 7, 96]
    x = x.reshape(B, H // G, G, W // G, G, C).permute(0, 1, 3, 2, 4, 5)
else:  # 1 for LDA: [bs, 112, 112, 96] -> [bs, 16, 16, 7, 7, 96]
    x = x.reshape(B, G, H // G, G, W // G, C).permute(0, 2, 4, 1, 3, 5)
x = x.reshape(B * H * W // G ** 2, G ** 2, C)  # [bs*16*16, 49, 96]

# multi-head self-attention
x = self.attn(x, mask=self.attn_mask)  # [num_groups*bs, 49, embed_dim]

In my opinion, for SDA the code is consistent with the paper: the feature map is divided into 7x7 groups, so there are (input size / 7) groups along each dimension, and the shape change [bs, 112, 112, 96] -> [bs, 16, 16, 7, 7, 96] fully complies with the SDA described in the paper. But for LDA, I am using the CrossFormer-S model, so the LDA interval I equals [8, 4, 2, 1].
However, the same kind of reshape operation as SDA is adopted, which is very puzzling to me. As mentioned in the paper, it should
reshape as: x = x.reshape(B, I, H // I, I, W // I, C).permute(0, 2, 4, 1, 3, 5).
Can you tell me why this is done? Thanks!

Sorry for the late response.

We have reconsidered your question and can reassure you that the code is consistent with the paper.

When the image size is 224x224 and I equals [8, 4, 2, 1], G is exactly equal to 7 for all layers (both for SDA and LDA), because I x G = S, where S is the width/height of the feature map. Thus, in our code for classification, the argument G is always 7 for this function.
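
A quick check of the arithmetic (assuming the standard 224x224 classification setting, where the stage feature-map sides are 56, 28, 14, and 7):

# feature-map side S per stage for a 224x224 input with 4x initial downsampling
S = [56, 28, 14, 7]
I = [8, 4, 2, 1]                      # LDA interval per stage
G = [s // i for s, i in zip(S, I)]
print(G)                              # [7, 7, 7, 7] -> the group size is 7 at every stage, for both SDA and LDA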

For SDA, the reshape operation is:

x = x.reshape(B, H//G, G, W//G, G, C).permute(0, 1, 3, 2, 4, 5) # after which, x is of shape [B, H//7, W//7, 7, 7, C]

So, for LDA, the case should be exactly the opposite:

I = H // G  # or W // G
x = x.reshape(B, G, I, G, I, C).permute(0, 2, 4, 1, 3, 5) # after which, x is of shape [B, I, I, 7, 7, C]

Since detection and segmentation use variable image sizes, the implementation there may be clearer. You may wish to refer to here.

Finally, if the reshape operation were done as you said, the shape of x after the permute would become [B, 7, 7, I, I, C], which is not consistent with our paper. It is worth noting that in LDA, embeddings that are far from each other belong to the same group, and there are always 7x7 embeddings in each group when the image size is 224x224.
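
A small sanity check of the two reshape operations (not part of the repository, just an illustration): labelling each spatial position with its flattened index shows that SDA groups adjacent positions while LDA groups positions spaced I apart, and both produce 7x7 groups:

import torch

B, H, W, C = 1, 56, 56, 1   # e.g. a stage-1 feature map; C = 1 keeps the printout readable
G = 7
I = H // G                  # = 8, the LDA interval

# label every spatial position with its flattened index
x = torch.arange(H * W).reshape(B, H, W, C)

# SDA: neighbouring G x G patches form a group
sda = x.reshape(B, H // G, G, W // G, G, C).permute(0, 1, 3, 2, 4, 5)
print(sda.shape)              # torch.Size([1, 8, 8, 7, 7, 1])
print(sda[0, 0, 0, :, 0, 0])  # tensor([0, 56, 112, ...]) -> adjacent rows of one local 7x7 patch

# LDA: positions sampled with stride I form a group
lda = x.reshape(B, G, I, G, I, C).permute(0, 2, 4, 1, 3, 5)
print(lda.shape)              # torch.Size([1, 8, 8, 7, 7, 1])
print(lda[0, 0, 0, 0, :, 0])  # tensor([0, 8, 16, ...]) -> positions spaced I apart within one group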

Since the issue has been inactive for a long time, I'll close it.