This is the official PyTorch implementation of RepLKNet, from the following CVPR-2022 paper:
Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs.
The paper is now released on arXiv: https://arxiv.org/abs/2203.06717.
Update: all the pretrained models, ImageNet-1K models, and Cityscapes/ADE20K/COCO models have been released.
Update: released a script to visualize the ERF. To get the ERF of your own model, you only need to add a few lines of code!
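The released script is the authoritative version; the idea, in brief, is to aggregate the absolute gradient of a central output activation with respect to the input over many images. A minimal sketch of that idea (the `forward_features` name is a stand-in for whatever feature extractor your model exposes, not a function from this repo):

```python
import torch

def erf_contribution(model, x):
    """Contribution of each input pixel to the central output activation."""
    x = x.clone().requires_grad_(True)
    feat = model.forward_features(x)  # stand-in: any [N, C, H, W] feature map
    center = feat[:, :, feat.size(2) // 2, feat.size(3) // 2].sum()
    center.backward()
    return x.grad.abs().sum(dim=1)  # [N, H, W] saliency; average over many images to visualize the ERF
```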
If you find the paper or this repository helpful, please consider citing:

```
@article{replknet,
  title={Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs},
  author={Ding, Xiaohan and Zhang, Xiangyu and Zhou, Yizhuang and Han, Jungong and Ding, Guiguang and Sun, Jian},
  journal={arXiv preprint arXiv:2203.06717},
  year={2022}
}
```
framework | link |
---|---|
MegEngine (official) | https://github.com/megvii-research/RepLKNet |
PyTorch (official) | https://github.com/DingXiaoH/RepLKNet-pytorch |
TensorFlow | https://github.com/shkarupa-alex/tfreplknet |
... | ... |
More re-implementations are welcome.
We have released an example for PyTorch. Please check `setup.py` and `depthwise_conv2d_implicit_gemm.py` (a replacement of `torch.nn.Conv2d`) in https://github.com/MegEngine/cutlass/tree/master/examples/19_large_depthwise_conv2d_torch_extension.

- Clone cutlass (https://github.com/MegEngine/cutlass) and enter the example directory: `cd examples/19_large_depthwise_conv2d_torch_extension`
- Run `./setup.py install --user`. If you get errors, check your `CUDA_HOME`.
- A quick check: `python depthwise_conv2d_implicit_gemm.py`
- Add `WHERE_YOU_CLONED_CUTLASS/examples/19_large_depthwise_conv2d_torch_extension` into your `PYTHONPATH` so that you can `from depthwise_conv2d_implicit_gemm import DepthWiseConv2dImplicitGEMM` anywhere. Then you may use `DepthWiseConv2dImplicitGEMM` as a replacement of `nn.Conv2d`, as shown in the sketch below.
- Run `export LARGE_KERNEL_CONV_IMPL=WHERE_YOU_CLONED_CUTLASS/examples/19_large_depthwise_conv2d_torch_extension` so that RepLKNet will use the efficient implementation. Or you may simply modify the related code (`get_conv2d`) in `replknet.py`.
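A minimal usage sketch, assuming the extension above is built, on your `PYTHONPATH`, and a CUDA GPU is available (the exact constructor signature may differ slightly; check `depthwise_conv2d_implicit_gemm.py`):

```python
import torch
from depthwise_conv2d_implicit_gemm import DepthWiseConv2dImplicitGEMM

# A depth-wise 31x31 convolution over 384 channels, used as a drop-in
# replacement of nn.Conv2d(384, 384, 31, padding=15, groups=384).
conv = DepthWiseConv2dImplicitGEMM(384, 31, bias=False).cuda()
x = torch.randn(2, 384, 64, 64, device="cuda")
y = conv(x)
print(y.shape)  # expected: torch.Size([2, 384, 64, 64])
```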
Our implementation mentioned in the paper has been integrated into MegEngine, and the engine will use it automatically. If you would like to use it in other frameworks such as TensorFlow, you may need to compile our released CUDA sources (the `*.cu` files in the above example should work with other frameworks) and use some tools to load them, just like `cutlass` and `torch.utils.cpp_extension` in the PyTorch example. We would appreciate it if you could share your experience with us.
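For reference, a minimal sketch of loading compiled CUDA/C++ sources as a PyTorch extension with `torch.utils.cpp_extension`; the file and extension names below are placeholders, not the actual files in this repo:

```python
from torch.utils.cpp_extension import load

# JIT-compile and load CUDA/C++ sources as a Python module (placeholder file names).
ext = load(
    name="large_dw_conv_ext",
    sources=["binding.cpp", "forward_fp32.cu", "backward_fp32.cu"],
    verbose=True,
)
# ext.<function> then exposes whatever is bound in binding.cpp.
```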
You may refer to the MegEngine source code: https://github.com/MegEngine/MegEngine/tree/8a2e92bd6c5ac02807b27d174dce090ee391000b/dnn/src/cuda/conv_bias/chanwise.

Pull requests (e.g., better or other implementations, or implementations on other frameworks) are welcome.
Catalog:
- Model code
- PyTorch pretrained models
- PyTorch large-kernel conv impl
- PyTorch training code
- PyTorch downstream models
- PyTorch downstream code
- A script to visualize the ERF
- How to obtain the shape bias
name | resolution | ImageNet-1K acc | #params | FLOPs | ImageNet-1K pretrained model |
---|---|---|---|---|---|
RepLKNet-31B | 224x224 | 83.5 | 79M | 15.3G | Google Drive, Baidu |
RepLKNet-31B | 384x384 | 84.8 | 79M | 45.1G | Google Drive, Baidu |
name | resolution | ImageNet-1K acc | #params | FLOPs | 22K pretrained model | 1K finetuned model |
---|---|---|---|---|---|---|
RepLKNet-31B | 224x224 | 85.2 | 79M | 15.3G | Google Drive, Baidu | Google Drive, Baidu |
RepLKNet-31B | 384x384 | 86.0 | 79M | 45.1G | - | Google Drive, Baidu |
RepLKNet-31L | 384x384 | 86.6 | 172M | 96.0G | Google Drive, Baidu | Google Drive, Baidu |
MegData-73M pretrained models (uploading):
name | resolution | ImageNet-1K acc | #params | FLOPs | MegData-73M pretrained model | 1K finetuned model |
---|---|---|---|---|---|---|
RepLKNet-XL | 320x320 | 87.8 | 335M | 128.7G | Google Drive, Baidu | Google Drive, Baidu |
For RepLKNet-31B/L with 224x224 or 384x384 inputs, we use `IMAGENET_DEFAULT_MEAN/STD` for preprocessing (see here). For example:

```
python -m torch.distributed.launch --nproc_per_node=8 main.py --model RepLKNet-31B --batch_size 32 --eval True --resume RepLKNet-31B_ImageNet-1K_224.pth --input_size 224
```

or

```
python -m torch.distributed.launch --nproc_per_node=8 main.py --model RepLKNet-31L --batch_size 32 --eval True --resume RepLKNet-31L_ImageNet-22K-to-1K_384.pth --input_size 384
```
For RepLKNet-XL, please note that we used `mean=[0.5,0.5,0.5]` and `std=[0.5,0.5,0.5]` for pretraining on the MegData73M dataset as well as for finetuning on ImageNet-1K. This mean/std setting is also referred to as `IMAGENET_INCEPTION_MEAN/STD` in timm (see here). Add `--imagenet_default_mean_and_std false` to use this mean/std setting (see here). As noted in the paper, we did not use small kernels for re-parameterization in RepLKNet-XL.

```
python -m torch.distributed.launch --nproc_per_node=8 main.py --model RepLKNet-XL --batch_size 32 --eval true --resume RepLKNet-XL_MegData73M_ImageNet1K.pth --imagenet_default_mean_and_std false --input_size 320
```
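For reference, a minimal sketch of the two preprocessing settings via timm's constants (assuming a recent timm version that exposes them):

```python
from timm.data.constants import (IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD,
                                 IMAGENET_INCEPTION_MEAN, IMAGENET_INCEPTION_STD)
from torchvision import transforms

# RepLKNet-31B/L: standard ImageNet statistics
normalize_31 = transforms.Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD)
# RepLKNet-XL: 0.5/0.5 ("Inception") statistics, matching --imagenet_default_mean_and_std false
normalize_xl = transforms.Normalize(IMAGENET_INCEPTION_MEAN, IMAGENET_INCEPTION_STD)
```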
To verify the equivalence of Structural Re-parameterization (i.e., that the outputs before and after `structural_reparam` are identical), add `--with_small_kernel_merged true`.
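The same check can be done in a few lines; a sketch, assuming the factory function `create_RepLKNet31B` and the `structural_reparam()` method in `replknet.py` (names may differ in your copy):

```python
import torch
from replknet import create_RepLKNet31B  # assumed factory function in replknet.py

model = create_RepLKNet31B(small_kernel_merged=False)
model.eval()
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    y_before = model(x)
    model.structural_reparam()  # merge the small-kernel branch into the large kernel
    y_after = model(x)
print((y_before - y_after).abs().max())  # should be numerically negligible
```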
You may use multi-node training on a SLURM cluster with submitit. Please install:

```
pip install submitit
```
If you have limited GPU memory (e.g., 2080Ti), use `--use_checkpoint True` to save GPU memory.
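Under the hood, this flag enables gradient checkpointing, which trades recomputation for memory; conceptually (a sketch, not the repo's exact code):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Instead of y = block(x), recompute block's activations during the backward pass:
block = torch.nn.Sequential(torch.nn.Conv2d(64, 64, 3, padding=1), torch.nn.ReLU())
x = torch.randn(2, 64, 32, 32, requires_grad=True)
y = checkpoint(block, x)  # activations inside `block` are not stored, only recomputed
y.sum().backward()
```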
Single machine:

```
python -m torch.distributed.launch --nproc_per_node=8 main.py --model RepLKNet-31B --drop_path 0.5 --batch_size 64 --lr 4e-3 --update_freq 4 --model_ema true --model_ema_eval true --data_path /path/to/imagenet-1k --warmup_epochs 10 --epochs 300 --use_checkpoint True --output_dir your_training_dir
```

Four machines:

```
python run_with_submitit.py --nodes 4 --ngpus 8 --model RepLKNet-31B --drop_path 0.5 --batch_size 64 --lr 4e-3 --update_freq 4 --model_ema true --model_ema_eval true --data_path /path/to/imagenet-1k --warmup_epochs 10 --epochs 300 --use_checkpoint True --job_dir your_training_dir
```
Single machine:

(coming soon)
We use the MMSegmentation framework. Just clone MMSegmentation, and

- Put `segmentation/replknet.py` into `mmsegmentation/mmseg/models/backbones/`. The only difference between `segmentation/replknet.py` and `replknet.py` is the `@BACKBONES.register_module` decorator.
- Add RepLKNet into `mmsegmentation/mmseg/models/backbones/__init__.py`. That is,

  ```python
  ...
  from .replknet import RepLKNet
  __all__ = ['ResNet', ..., 'RepLKNet']
  ```

- Put `segmentation/configs/*.py` into `mmsegmentation/configs/replknet/`.
- Download and use our weights. For example, to evaluate a model:

  ```
  python3 -m torch.distributed.launch --nproc_per_node=8 --master_port=29500 tools/test.py configs/replknet/RepLKNet-31B_1Kpretrain_upernet_80k_cityscapes_769.py RepLKNet-31B_ImageNet-1K_UperNet_Cityscapes.pth --launcher pytorch --eval mIoU
  ```
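If you write your own config instead of using ours, the backbone section would look roughly like the following. This is only a sketch: the backbone keyword arguments shown are assumptions for illustration (they mirror the RepLKNet-31B hyper-parameters from the paper); take the real ones from `segmentation/configs/*.py`.

```python
# Hypothetical mmsegmentation config fragment; check segmentation/configs/*.py
# for the actual arguments used by our released configs.
model = dict(
    type='EncoderDecoder',
    backbone=dict(
        type='RepLKNet',                      # registered via @BACKBONES.register_module
        large_kernel_sizes=[31, 29, 27, 13],  # per-stage kernel sizes of RepLKNet-31B
        layers=[2, 2, 18, 2],
        channels=[128, 256, 512, 1024],
        small_kernel=5,
        drop_path_rate=0.3,
        out_indices=(0, 1, 2, 3),             # feed all four stages to the head
    ),
    decode_head=dict(
        type='UPerHead',
        in_channels=[128, 256, 512, 1024],
        channels=512,
        num_classes=19,                       # e.g., Cityscapes
    ),
)
```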
Single-scale (ss) and multi-scale (ms) mIoU tested with UperNet (FLOPs are computed at 2048×512 input resolution for the ImageNet-1K pretrained models and 2560×640 for the 22K and MegData73M pretrained models, following Swin):
backbone | pretraining | dataset | train schedule | mIoU (ss) | mIoU (ms) | #params | FLOPs | download |
---|---|---|---|---|---|---|---|---|
RepLKNet-31B | ImageNet-1K | Cityscapes | 80k | 83.1 | 83.5 | 110M | 2315G | Google Drive, Baidu |
RepLKNet-31B | ImageNet-1K | ADE20K | 160k | 49.9 | 50.6 | 112M | 1170G | Google Drive, Baidu |
RepLKNet-31B | ImageNet-22K | ADE20K | 160k | 51.5 | 52.3 | 112M | 1829G | Google Drive, Baidu |
RepLKNet-31L | ImageNet-22K | ADE20K | 160k | 52.4 | 52.7 | 207M | 2404G | Google Drive, Baidu |
RepLKNet-XL | MegData73M | ADE20K | 160k | 55.2 | 56.0 | 374M | 3431G | Google Drive, Baidu |
We use the MMDetection framework. Just clone MMDetection, and

- Put `segmentation/replknet.py` into `mmdetection/mmdet/models/backbones/`. The only difference between `segmentation/replknet.py` and `replknet.py` is the `@BACKBONES.register_module` decorator.
- Add RepLKNet into `mmdetection/mmdet/models/backbones/__init__.py`. That is,

  ```python
  ...
  from .replknet import RepLKNet
  __all__ = ['ResNet', ..., 'RepLKNet']
  ```

- Put `detection/configs/*.py` into `mmdetection/configs/replknet/`.
- Download and use our weights. For example, to evaluate a model:

  ```
  python -m torch.distributed.launch --nproc_per_node=8 tools/test.py configs/replknet/RepLKNet-31B_22Kpretrain_cascade_mask_rcnn_3x_coco.py RepLKNet-31B_ImageNet-22K_CascMaskRCNN_COCO.pth --eval bbox --launcher pytorch
  ```
backbone | pretraining | method | train schedule | AP_box | AP_mask | #params | FLOPs | download |
---|---|---|---|---|---|---|---|---|
RepLKNet-31B | ImageNet-1K | FCOS | 2x | 47.0 | - | 87M | 437G | Google Drive, Baidu |
RepLKNet-31B | ImageNet-1K | Cascade Mask RCNN | 3x | 52.2 | 45.2 | 137M | 965G | Google Drive, Baidu |
RepLKNet-31B | ImageNet-22K | Cascade Mask RCNN | 3x | 53.0 | 46.0 | 137M | 965G | Google Drive, Baidu |
RepLKNet-31L | ImageNet-22K | Cascade Mask RCNN | 3x | 53.9 | 46.5 | 229M | 1321G | Google Drive, Baidu |
RepLKNet-XL | MegData73M | Cascade Mask RCNN | 3x | 55.5 | 48.0 | 392M | 1958G | Google Drive, Baidu |
- The mean/std values on MegData73M differ from ImageNet, so we used `mean=[0.5,0.5,0.5], std=[0.5,0.5,0.5]` for pretraining on MegData73M and finetuning on ImageNet-1K. Accordingly, we should set

  ```python
  img_norm_cfg = dict(mean=[127.5, 127.5, 127.5], std=[127.5, 127.5, 127.5], to_rgb=True)
  ```

  in MMSegmentation and MMDetection (please check here and here). For other models, we use the default ImageNet mean/std.
- For RepLKNet-XL on ADE20K and COCO, we batch-normalize the intermediate feature maps before feeding them into the heads. Just use `RepLKNet(..., norm_intermediate_features=True)`. For other models, there is no need to do so.
- For RepLKNet-31B/L on Cityscapes and ADE20K, we used 4 or 8 nodes of 8 2080Ti GPUs each. The batch size per GPU was smaller than the default (the default is 4 per GPU, see here), but the global batch size was larger, so we reduced the number of iterations accordingly to keep the total number of training examples the same (e.g., doubling the global batch size means halving the iterations). Please check the comments in the config files. If you wish to train with our config files, please set the batch size and number of iterations according to your own situation.
- Lowering the learning rate for lower-level layers may improve performance when finetuning on ImageNet-1K or downstream tasks, just like ConvNeXt and BEiT. I don't know whether the improvement will be significant. You may follow the implementation in ConvNeXt and BEiT; a sketch of the idea is shown below. If you need a showcase, please raise an issue.
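A minimal sketch of layer-wise learning-rate decay in the ConvNeXt/BEiT style; `get_layer_id` is a hypothetical helper mapping a parameter name to a depth index, and is not part of this repo:

```python
def param_groups_lrd(model, base_lr, weight_decay, decay_rate, num_layers, get_layer_id):
    """Build optimizer parameter groups whose lr decays with depth (deepest layers keep base_lr)."""
    groups = {}
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        layer_id = get_layer_id(name)  # 0 = stem, ..., num_layers = head
        scale = decay_rate ** (num_layers - layer_id)
        g = groups.setdefault(layer_id, {"params": [], "lr": base_lr * scale,
                                         "weight_decay": weight_decay})
        g["params"].append(p)
    return list(groups.values())

# usage sketch:
# optimizer = torch.optim.AdamW(param_groups_lrd(model, 4e-3, 0.05, 0.9, 12, my_layer_id_fn))
```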
- Tips on `drop_path_rate`: the bigger the model, the higher the drop_path; the bigger the pretraining data, the lower the drop_path.
The released PyTorch training script is based on the ConvNeXt codebase, which was built on the timm library and the DeiT and BEiT repositories.
This project is released under the MIT license. Please see the LICENSE file for more information.