RikoLi/PCL-CLIP

Is there an unsupervised implementation?

Opened this issue · 9 comments

Could you provide the implementation details and the log.txt for the unsupervised setting? The unsupervised code I modified myself cannot reach the metrics reported in the paper; it only gets to 80-something percent.

Hi!
The unsupervised part of the code has not been cleaned up yet, but you can refer to the unsupervised implementations in the following works, whose pipelines are essentially the same as ours:

  1. TMGF: https://github.com/RikoLi/WACV23-workshop-TMGF/tree/main
  2. ClusterContrast: https://github.com/alibaba/cluster-contrast-reid

Also, under the unsupervised setting, performance is noticeably affected by the clustering-related parameters and the contrastive temperature; you can refer to the settings used in the works above.
Good luck with your work!
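
To make the sensitivity to the contrastive temperature concrete, below is a minimal sketch of the ClusterContrast-style cluster-level InfoNCE objective used by the repositories linked above. The function name, signature, and the 0.05 default are illustrative assumptions, not code taken from either repository; ClusterContrast additionally updates the centroid memory with a momentum rule that is omitted here.

import torch
import torch.nn.functional as F

def cluster_contrastive_loss(
    features: torch.Tensor,       # (B, D) L2-normalized image features
    pseudo_labels: torch.Tensor,  # (B,) cluster index of each image, outliers already removed
    centroids: torch.Tensor,      # (C, D) L2-normalized cluster-centroid memory
    temperature: float = 0.05,    # contrastive temperature; results are sensitive to this value
) -> torch.Tensor:
    # Cosine similarity to every centroid, sharpened by the temperature,
    # then cross-entropy against the pseudo label (InfoNCE over centroids).
    logits = features @ centroids.t() / temperature
    return F.cross_entropy(logits, pseudo_labels)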


Got it, thank you very much. I will try tuning the parameters.

Hello @RikoLi, great work! Besides replacing the O2CAP backbone with PCL-CLIP, do I need to adjust any parameters? Should I load any pretrained weights?

With only the backbone changed, I have had difficulty getting the model to converge on the MSMT17 dataset.

The following is the code of the backbone that I used. Here I assume that cfg.MODEL.SIE_CAMERA and cfg.MODEL.SIE_VIEW are both False by default.

import torch
import torch.nn as nn
from timm.models.layers import trunc_normal_


def weights_init_kaiming(m):
    classname = m.__class__.__name__
    if classname.find("Linear") != -1:
        nn.init.kaiming_normal_(m.weight, a=0, mode="fan_out")
        nn.init.constant_(m.bias, 0.0)

    elif classname.find("Conv") != -1:
        nn.init.kaiming_normal_(m.weight, a=0, mode="fan_in")
        if m.bias is not None:
            nn.init.constant_(m.bias, 0.0)
    elif classname.find("BatchNorm") != -1:
        if m.affine:
            nn.init.constant_(m.weight, 1.0)
            nn.init.constant_(m.bias, 0.0)


def weights_init_classifier(m):
    classname = m.__class__.__name__
    if classname.find("Linear") != -1:
        nn.init.normal_(m.weight, std=0.001)
        if m.bias is not None:  # `if m.bias:` is ambiguous for a multi-element tensor
            nn.init.constant_(m.bias, 0.0)


import clip.clip as clip


def load_clip_to_cpu(backbone_name, h_resolution, w_resolution, vision_stride_size):
    url = clip._MODELS[backbone_name]
    model_path = clip._download(url)

    try:
        # loading JIT archive
        model = torch.jit.load(model_path, map_location="cpu").eval()
        state_dict = None

    except RuntimeError:
        state_dict = torch.load(model_path, map_location="cpu")

    model = clip.build_model(
        state_dict or model.state_dict(), h_resolution, w_resolution, vision_stride_size
    )

    return model


class ClipPCLReID(nn.Module):
    def __init__(self):
        super(ClipPCLReID, self).__init__()
        self.model_name = "ViT-B-16"
        self.STRIDE_SIZE = [16, 16]
        self.SIZE_TRAIN = [256, 128]
        self.in_planes = 768
        self.in_planes_proj = 512
        self.num_features = self.in_planes_proj

        self.bottleneck = nn.BatchNorm1d(self.in_planes)
        self.bottleneck.bias.requires_grad_(False)
        self.bottleneck.apply(weights_init_kaiming)

        self.bottleneck_proj = nn.BatchNorm1d(self.in_planes_proj)
        self.bottleneck_proj.bias.requires_grad_(False)
        self.bottleneck_proj.apply(weights_init_kaiming)

        # Number of patch tokens along height/width for a 256x128 input
        # with 16x16 patches at stride 16: 16 x 8.
        self.h_resolution = int((self.SIZE_TRAIN[0] - 16) // self.STRIDE_SIZE[0] + 1)
        self.w_resolution = int((self.SIZE_TRAIN[1] - 16) // self.STRIDE_SIZE[1] + 1)
        self.vision_stride_size = self.STRIDE_SIZE[0]
        clip_model = load_clip_to_cpu(
            self.model_name,
            self.h_resolution,
            self.w_resolution,
            self.vision_stride_size,
        )
        clip_model.to("cuda")

        self.image_encoder = clip_model.visual

        # Trick: freeze patch projection for improved stability
        # https://arxiv.org/pdf/2104.02057.pdf
        for _, v in self.image_encoder.conv1.named_parameters():
            v.requires_grad_(False)
        print(
            "Freeze patch projection layer with shape {}".format(
                self.image_encoder.conv1.weight.shape
            )
        )

    def forward(self, x):
        # No SIE camera/view embedding since cfg.MODEL.SIE_CAMERA and
        # cfg.MODEL.SIE_VIEW are both False.
        cv_embed = None
        _, image_features, image_features_proj = self.image_encoder(x, cv_embed)
        img_feature = image_features[:, 0]            # class token before projection (768-d)
        img_feature_proj = image_features_proj[:, 0]  # class token after projection (512-d)

        feat = self.bottleneck(img_feature)
        feat_proj = self.bottleneck_proj(img_feature_proj)

        # Concatenate the two BN-ed features into a single 1280-d output.
        out_feat = torch.cat([feat, feat_proj], dim=1)
        return out_feat

    def load_param(self, trained_path):
        param_dict = torch.load(trained_path)
        for i in param_dict:
            if not self.training and "classifier" in i:
                continue  # ignore classifier weights in evaluation
            self.state_dict()[i.replace("module.", "")].copy_(param_dict[i])
        print("Loading pretrained model from {}".format(trained_path))

    def load_param_finetune(self, model_path):
        param_dict = torch.load(model_path)
        for i in param_dict:
            self.state_dict()[i].copy_(param_dict[i])
        print("Loading pretrained model for finetuning from {}".format(model_path))


# def make_model(cfg, num_classes, camera_num, view_num):
#     model = TransReID(num_classes, camera_num, view_num, cfg)
#     return model


def clip_base(**kwargs):
    model = ClipPCLReID()
    return model

Hello, I have also run into the same problem on the MSMT17 dataset. Have you found the cause yet?

@RikoLi Hi, my unsupervised modification works fine on the Market dataset, but I still cannot get it to run well on MSMT17: the mAP usually converges at around 20-something and stays there. I hope I can get your help.

For MSMT17 unsupervised training, you can try the following configs:

  1. Use k-reciprocal Jaccard-distance-based DBSCAN clustering (see the O2CAP implementation) with eps=0.7, k1=30, k2=6. Note: the clustering parameters are important for performance (a minimal sketch follows this list).
  2. Freeze the patch projection layer during training.
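
As an illustration of point 1 (a sketch, not the O2CAP code itself), the snippet below clusters L2-normalized features into pseudo identities with a k-reciprocal Jaccard distance matrix and DBSCAN. The helper import path and min_samples value are assumptions based on ClusterContrast.

import torch
from sklearn.cluster import DBSCAN

def generate_pseudo_labels(features: torch.Tensor, eps: float = 0.7, k1: int = 30, k2: int = 6):
    # compute_jaccard_distance is the faiss-based k-reciprocal re-ranking helper
    # shipped with ClusterContrast / O2CAP; the import path below is an assumption
    # and may differ in your codebase.
    from clustercontrast.utils.faiss_rerank import compute_jaccard_distance

    dist = compute_jaccard_distance(features, k1=k1, k2=k2)  # (N, N) Jaccard distance matrix
    # min_samples=4 follows the ClusterContrast default; label -1 marks outliers,
    # which are usually dropped before building the cluster memory.
    cluster = DBSCAN(eps=eps, min_samples=4, metric="precomputed", n_jobs=-1)
    labels = cluster.fit_predict(dist)
    return labels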

For MSMT17 unsupervised training, you can try the following configs:

  1. Use k-reciprocal Jaccard-distance-based DBSCAN clustering (see the O2CAP implementation) with eps=0.7, k1=30, k2=6. Note: the clustering parameters are important for performance.
  2. Freeze the patch projection layer during training.
  3. Backbone pretrained weights (from TransReID-SSL): https://drive.google.com/file/d/18FL9JaJNlo15-UksalcJRXX-0dgo4Mz4/view?usp=sharing

But isn't this pretrained backbone supposed to be CLIP? Why are TransReID-SSL weights used here?
@RikoLi

Sorry, I mixed this up with a reply for another open-source codebase 😂. Thanks for the reminder, I will correct it!
The weights used here are indeed the pretrained weights of the CLIP image encoder.
As mentioned in #1, points 1 and 2 above matter a lot for the final performance; for the unsupervised setting, the clustering parameters are especially important.

After the following changes, I managed to reproduce the MSMT17 result on O2CAP (a minimal sketch of both changes follows below):

  1. change the optimizer to SGD
  2. normalize the output:
       out_feat = torch.nn.functional.normalize(out_feat)
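
For reference, a minimal sketch of these two changes, assuming the ClipPCLReID class from the snippet above is in scope; the wrapper class name and the SGD hyperparameters are illustrative assumptions, not values reported in this thread.

import torch
import torch.nn.functional as F

# Hypothetical wrapper around the ClipPCLReID backbone posted above: the only
# change is L2-normalizing the concatenated 1280-d output feature.
class NormalizedClipPCLReID(ClipPCLReID):
    def forward(self, x):
        out_feat = super().forward(x)
        return F.normalize(out_feat, dim=1)

model = NormalizedClipPCLReID()

# SGD in place of the original optimizer; the learning rate, momentum, and
# weight decay below are illustrative placeholders, not values from this thread.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=3.5e-4,
    momentum=0.9,
    weight_decay=5e-4,
)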