This is the official PyTorch implementation of "MuSc: Zero-Shot Industrial Anomaly Classification and Segmentation with Mutual Scoring of the Unlabeled Images" (MuSc, ICLR 2024).

✨MuSc (ICLR 2024)✨

Authors: Xurui Li1* | Ziming Huang1* | Feng Xue3 | Yu Zhou1,2

Institutions: 1Huazhong University of Science and Technology | 2Wuhan JingCe Electronic Group Co.,LTD | 3University of Trento

📖 Chinese README

🙈TODO list:

  • ⬜️ Reduce the inference time per image from 955.3 ms to 249.8 ms.
  • ⬜️ Compatibility with more industrial datasets.
  • ⬜️ Compatibility with more visual backbones, e.g. Vision Mamba.

📣Updates:

04/11/2024

  1. Comparisons with the zero-/few-shot methods from CVPR 2024 have been added to the section "Compare with SOTA k-shot methods".
  2. Fixed some bugs in models/backbone/_backbones.py.

03/22/2024

  1. Support for the BTAD dataset is provided.
  2. The code has been modified to support a larger batch_size.
  3. Parts of the code have been optimized for speed.
  4. Results of different backbones on the MVTec AD, VisA and BTAD datasets are provided.
  5. Detailed results on each dataset are provided.
  6. The inference time of different backbones is provided.
  7. Comparisons with SOTA zero-/few-shot methods are provided. This table will be updated continuously.
  8. Frequently asked questions from MuSc users are summarized and answered.
  9. A Chinese README has been added.

02/01/2024

Initial commits:

  1. The complete code of MuSc, as described in the paper, is released.
  2. The code is compatible with the image encoder (ViT) of CLIP and with ViTs pre-trained with DINO/DINO_v2.

🎖️Compare with SOTA k-shot methods [Go to Catalogue]

We will continuously update the following table to compare our MuSc with the newest zero-shot and few-shot methods. "-" indicates that the authors did not measure this metric in their paper.

MVTec AD

| Methods | Venue | Setting | AUROC-cls | F1-max-cls | AP-cls | AUROC-segm | F1-max-segm | AP-segm | PRO-segm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MuSc(ours) | ICLR 2024 | 0-shot | 97.8 | 97.5 | 99.1 | 97.3 | 62.6 | 62.7 | 93.8 |
| RegAD | ECCV 2022 | 4-shot | 89.1 | 92.4 | 94.9 | 96.2 | 51.7 | 48.3 | 88.0 |
| GraphCore | ICLR 2023 | 4-shot | 92.9 | - | - | 97.4 | - | - | - |
| WinCLIP | CVPR 2023 | 0-shot | 91.8 | 92.9 | 96.5 | 85.1 | 31.7 | - | 64.6 |
| WinCLIP | CVPR 2023 | 4-shot | 95.2 | 94.7 | 97.3 | 96.2 | 51.7 | - | 88.0 |
| APRIL-GAN | CVPR Workshop 2023 | 0-shot | 86.1 | 90.4 | 93.5 | 87.6 | 43.3 | 40.8 | 44.0 |
| APRIL-GAN | CVPR Workshop 2023 | 4-shot | 92.8 | 92.8 | 96.3 | 95.9 | 56.9 | 54.5 | 91.8 |
| FastRecon | ICCV 2023 | 4-shot | 94.2 | - | - | 97.0 | - | - | - |
| ACR | NeurIPS 2023 | 0-shot | 85.8 | 91.3 | 92.9 | 92.5 | 44.2 | 38.9 | 72.7 |
| RegAD+Adversarial Loss | BMVC 2023 | 8-shot | 91.9 | - | - | 96.9 | - | - | - |
| PACKD | BMVC 2023 | 8-shot | 95.3 | - | - | 97.3 | - | - | - |
| PromptAD | WACV 2024 | 0-shot | 90.8 | - | - | 92.1 | 36.2 | - | 72.8 |
| AnomalyCLIP | ICLR 2024 | 0-shot | 91.5 | - | 96.2 | 91.1 | - | - | 81.4 |
| InCTRL | CVPR 2024 | 8-shot | 95.3 | - | - | - | - | - | - |
| MVFA-AD | CVPR 2024 | 4-shot | 96.2 | - | - | 96.3 | - | - | - |
| PromptAD | CVPR 2024 | 4-shot | 96.6 | - | - | 96.5 | - | - | - |

VisA

| Methods | Venue | Setting | AUROC-cls | F1-max-cls | AP-cls | AUROC-segm | F1-max-segm | AP-segm | PRO-segm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MuSc(ours) | ICLR 2024 | 0-shot | 92.8 | 89.5 | 93.5 | 98.8 | 48.8 | 45.1 | 92.7 |
| WinCLIP | CVPR 2023 | 0-shot | 78.1 | 79.0 | 81.2 | 79.6 | 14.8 | - | 56.8 |
| WinCLIP | CVPR 2023 | 4-shot | 87.3 | 84.2 | 88.8 | 97.2 | 47.0 | - | 87.6 |
| APRIL-GAN | CVPR Workshop 2023 | 0-shot | 78.0 | 78.7 | 81.4 | 94.2 | 32.3 | 25.7 | 86.8 |
| APRIL-GAN | CVPR Workshop 2023 | 4-shot | 92.6 | 88.4 | 94.5 | 96.2 | 40.0 | 32.2 | 90.2 |
| PACKD | BMVC 2023 | 8-shot | 87.5 | - | - | 97.9 | - | - | - |
| AnomalyCLIP | ICLR 2024 | 0-shot | 82.1 | - | 85.4 | 95.5 | - | - | 87.0 |
| InCTRL | CVPR 2024 | 8-shot | 88.7 | - | - | - | - | - | - |
| PromptAD | CVPR 2024 | 4-shot | 89.1 | - | - | 97.4 | - | - | - |

📖Catalogue

👇Abstract: [Back to Catalogue]

This paper studies zero-shot anomaly classification (AC) and segmentation (AS) in industrial vision. We reveal that the abundant normal and abnormal cues implicit in unlabeled test images can be exploited for anomaly determination, which is ignored by prior methods. Our key observation is that for the industrial product images, the normal image patches could find a relatively large number of similar patches in other unlabeled images, while the abnormal ones only have a few similar patches.

We leverage such a discriminative characteristic to design a novel zero-shot AC/AS method by Mutual Scoring (MuSc) of the unlabeled images, which does not need any training or prompts. Specifically, we perform Local Neighborhood Aggregation with Multiple Degrees (LNAMD) to obtain the patch features that are capable of representing anomalies in varying sizes. Then we propose the Mutual Scoring Mechanism (MSM) to leverage the unlabeled test images to assign the anomaly score to each other. Furthermore, we present an optimization approach named Re-scoring with Constrained Image-level Neighborhood (RsCIN) for image-level anomaly classification to suppress the false positives caused by noises in normal images.

The superior performance on the challenging MVTec AD and VisA datasets demonstrates the effectiveness of our approach. Compared with the state-of-the-art zero-shot approaches, MuSc achieves a 21.1% PRO absolute gain (from 72.7% to 93.8%) on MVTec AD, a 19.4% pixel-AP gain and a 14.7% pixel-AUROC gain on VisA. In addition, our zero-shot approach outperforms most of the few-shot approaches and is comparable to some one-class methods.
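The key observation in the abstract (normal patches find many similar patches in other unlabeled images, abnormal patches find few) can be illustrated with a simplified NumPy sketch. This is not the repository's implementation: the function name `mutual_scores`, the `keep_ratio` parameter, its value and the plain L2 distance are assumptions made for illustration, and the LNAMD and RsCIN steps are left out.

```python
import numpy as np

def mutual_scores(features, keep_ratio=0.3):
    """Score every patch by how hard it is to match in the other images.

    features: array of shape [N, P, C] (N unlabeled images, P patches, C channels).
    Returns an [N, P] array; higher values mean "more anomalous".
    keep_ratio is a hypothetical parameter: only the closest fraction of the
    other images contributes to the score of a patch.
    """
    n, p, _ = features.shape
    scores = np.zeros((n, p))
    k = max(1, int(round(keep_ratio * (n - 1))))
    for i in range(n):
        nearest = []
        for j in range(n):
            if i == j:
                continue
            # Pairwise L2 distances between patches of image i and image j: [P, P]
            diff = features[i][:, None, :] - features[j][None, :, :]
            dist = np.linalg.norm(diff, axis=-1)
            # Distance from each patch of image i to its nearest patch in image j
            nearest.append(dist.min(axis=1))
        # Sort over the other images and average the k smallest distances
        nearest = np.sort(np.stack(nearest, axis=0), axis=0)
        scores[i] = nearest[:k].mean(axis=0)
    return scores

# Toy data: six nearly identical images, one patch perturbed in image 2.
rng = np.random.default_rng(0)
toy = 1.0 + 0.01 * rng.standard_normal((6, 4, 8))
toy[2, 1] += 5.0
toy_scores = mutual_scores(toy)
print(np.unravel_index(toy_scores.argmax(), toy_scores.shape))
```

In this toy setting the perturbed patch has no close match in any other image, so it receives the largest score, while the unperturbed patches all find near-duplicates elsewhere.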

[Figure: MuSc pipeline]

😊Compare with other 0-shot methods

[Figure: comparison with other 0-shot methods]

😊Compare with other 4-shot methods

[Figure: comparison with other 4-shot methods]

Environment:

  • Python 3.8
  • CUDA 11.7
  • PyTorch 2.0.1

Clone the repository locally:

git clone https://github.com/xrli-U/MuSc.git

Create virtual environment:

conda create --name musc python=3.8
conda activate musc

Install the required packages:

pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
pip install -r requirements.txt

👇Datasets Download: [Back to Catalogue]

Put all the datasets in the ./data folder.

data
|---mvtec_anomaly_detection
|-----|-- bottle
|-----|-----|----- ground_truth
|-----|-----|----- test
|-----|-----|----- train
|-----|-- cable
|-----|--- ...

data
|----visa
|-----|-- split_csv
|-----|-----|--- 1cls.csv
|-----|-----|--- ...
|-----|-- candle
|-----|-----|--- Data
|-----|-----|-----|----- Images
|-----|-----|-----|--------|------ Anomaly 
|-----|-----|-----|--------|------ Normal 
|-----|-----|-----|----- Masks
|-----|-----|-----|--------|------ Anomaly 
|-----|-----|--- image_anno.csv
|-----|-- capsules
|-----|--- ...

The VisA dataset needs to be preprocessed to separate the training set from the test set.

python ./datasets/visa_preprocess.py

data
|---btad
|-----|--- 01
|-----|-----|----- ground_truth
|-----|-----|----- test
|-----|-----|----- train
|-----|--- 02
|-----|--- ...
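Before running, it can help to confirm that a dataset folder matches the layout above. The following is a minimal sketch, not part of the repository: the helper name `find_incomplete_categories` is hypothetical, and it only covers the MVTec AD and BTAD layouts (category folders containing ground_truth, test and train), not the VisA layout.

```python
from pathlib import Path

# Subfolders that each category is expected to contain in the
# MVTec AD and BTAD trees shown above.
REQUIRED_SUBDIRS = ("ground_truth", "test", "train")

def find_incomplete_categories(dataset_root, required=REQUIRED_SUBDIRS):
    """Return {category_name: [missing subfolders]} for one dataset root,
    e.g. ./data/mvtec_anomaly_detection or ./data/btad."""
    root = Path(dataset_root)
    problems = {}
    for category in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = [name for name in required if not (category / name).is_dir()]
        if missing:
            problems[category.name] = missing
    return problems

# Example call (commented out because the folder may not exist yet):
# print(find_incomplete_categories("./data/mvtec_anomaly_detection"))
```

An empty dictionary means every category folder contains the three expected subfolders.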

💎Run MuSc: [Back to Catalogue]

We provide two ways to run our code.

python

python examples/musc_main.py

This run uses the configuration in ./configs/musc.yaml.

script

sh scripts/musc.sh

Arguments set in the script musc.sh take precedence over ./configs/musc.yaml.

The key arguments of the script are as follows:

  • --device: GPU_id.
  • --data_path: The directory of datasets.
  • --dataset_name: Dataset name.
  • --class_name: Category to be tested. If the parameter is set to ALL, all the categories are tested.
  • --backbone_name: Feature extractor name. Our code is compatible with CLIP, DINO and DINO_v2. For more details, see configs/musc.yaml.
  • --pretrained: Pretrained CLIP weights. The options are openai, laion400m_e31 and laion400m_e32.
  • --feature_layers: The backbone (ViT) layers from which features are extracted.
  • --img_resize: The size of the image fed into the model.
  • --divide_num: The number of subsets the whole test set is divided into.
  • --r_list: The aggregation degrees of our LNAMD module.
  • --output_dir: The directory that saves the anomaly prediction maps and metrics. This directory will be automatically created.
  • --vis: Whether to save the anomaly prediction maps.
  • --vis_type: Either single_norm or whole_norm. Selects whether each anomaly map is normalized individually or all maps are normalized together when visualizing.
  • --save_excel: Whether to save anomaly classification and segmentation results (metrics).
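As a rough illustration of how the arguments above combine, the sketch below assembles a command line in Python. Several things here are assumptions rather than facts from this README: that examples/musc_main.py accepts the same flags as the script, that list-valued flags such as --r_list take space-separated values, and the concrete values chosen for data_path, dataset_name and r_list. The helper `build_musc_command` is hypothetical. Check scripts/musc.sh and configs/musc.yaml for the actual defaults.

```python
import shlex

def build_musc_command(entry="examples/musc_main.py", **options):
    """Turn keyword options into a `python <entry> --flag value ...` argument list."""
    cmd = ["python", entry]
    for name, value in options.items():
        cmd.append(f"--{name}")
        if isinstance(value, (list, tuple)):
            # Assumed convention: list values are passed space-separated
            cmd.extend(str(v) for v in value)
        else:
            cmd.append(str(value))
    return cmd

cmd = build_musc_command(
    device=0,
    data_path="./data/mvtec_anomaly_detection/",  # assumed path
    dataset_name="mvtec_ad",                      # assumed identifier
    class_name="ALL",
    backbone_name="ViT-L-14-336",
    pretrained="openai",
    img_resize=518,
    r_list=[1, 3, 5],                             # assumed values
    vis_type="whole_norm",
)
print(shlex.join(cmd))
```

The printed string can be pasted into a shell, or the list can be passed to `subprocess.run(cmd)` directly.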

💎Classification optimization (RsCIN): [Back to Catalogue]

We provide additional code in the ./models/RsCIN_features folder to optimize the classification results of other methods with our RsCIN module. We use ViT-large-14-336 of CLIP to extract the image features of the MVTec AD and VisA datasets and store them in mvtec_ad_cls.dat and visa_cls.dat respectively. Their usage is shown in ./models/RsCIN_features/RsCIN.py.

Example

Before using our RsCIN module, move RsCIN.py, mvtec_ad_cls.dat and visa_cls.dat to your project directory.

import numpy as np
from RsCIN import Mobile_RsCIN

classification_results = np.random.rand(83) # the classification results of your method.
dataset_name = 'mvtec_ad' # dataset name
class_name = 'bottle' # category name in the above dataset
optimized_classification_results = Mobile_RsCIN(classification_results, dataset_name=dataset_name, class_name=class_name)

The optimized_classification_results are the anomaly classification scores optimized by our RsCIN module.

Apply to the custom dataset

You can extract the image features of each image in the custom dataset, and store them in the variable cls_tokens. The multiple window sizes in the Multi-window Mask Operation can be adjusted by the value of k_list.

import numpy as np
from RsCIN import Mobile_RsCIN

classification_results = np.random.rand(83) # the classification results of your method.
cls_tokens = np.random.rand(83, 768)  # shape[N, C] the image features, N is the number of images
k_list = [2, 3] # the multiple window sizes in the Multi-window Mask Operation
optimized_classification_results = Mobile_RsCIN(classification_results, k_list=k_list, cls_tokens=cls_tokens)

🎖️Results of different datasets: [Back to Catalogue]

All results are obtained with the default settings of our paper.
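The tables report AUROC, F1-max, AP and PRO. As a reminder of what F1-max measures, here is a small NumPy sketch that takes the best F1 score over all score thresholds. It is an illustration only and not necessarily how the repository computes the metric; the function name `f1_max` is chosen here for the example.

```python
import numpy as np

def f1_max(labels, scores):
    """Best F1 score over all thresholds.

    labels: 0/1 ground truth (1 = anomalous); scores: anomaly scores.
    A sample is predicted anomalous when its score is >= the threshold.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    best = 0.0
    for threshold in np.unique(scores):
        pred = scores >= threshold
        tp = np.sum(pred & labels)
        fp = np.sum(pred & ~labels)
        fn = np.sum(~pred & labels)
        denom = 2 * tp + fp + fn
        if denom > 0:
            best = max(best, 2 * tp / denom)
    return float(best)

print(f1_max([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

For this toy input the best threshold is 0.35, which yields two true positives and one false positive.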

MVTec AD

| Category | AUROC-cls | F1-max-cls | AP-cls | AUROC-segm | F1-max-segm | AP-segm | PRO-segm |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bottle | 99.92 | 99.21 | 99.98 | 98.48 | 79.17 | 83.04 | 96.10 |
| cable | 98.99 | 97.30 | 99.42 | 95.76 | 60.97 | 57.70 | 89.62 |
| capsule | 96.45 | 94.88 | 99.30 | 98.96 | 49.80 | 48.45 | 95.49 |
| carpet | 99.88 | 99.44 | 99.96 | 99.45 | 73.33 | 76.05 | 97.58 |
| grid | 98.66 | 96.49 | 99.54 | 98.16 | 43.94 | 38.24 | 93.92 |
| hazelnut | 99.61 | 98.55 | 99.79 | 99.38 | 73.41 | 73.28 | 92.24 |
| leather | 100.0 | 100.0 | 100.0 | 99.72 | 62.84 | 64.47 | 98.74 |
| metal_nut | 96.92 | 97.38 | 99.25 | 86.12 | 46.22 | 47.54 | 89.34 |
| pill | 96.24 | 95.89 | 99.31 | 97.47 | 65.54 | 67.25 | 98.01 |
| screw | 82.17 | 88.89 | 90.88 | 98.77 | 41.87 | 36.12 | 94.40 |
| tile | 100.0 | 100.0 | 100.0 | 97.90 | 74.71 | 78.90 | 94.64 |
| toothbrush | 100.0 | 100.0 | 100.0 | 99.53 | 70.19 | 67.79 | 95.48 |
| transistor | 99.42 | 95.00 | 99.19 | 91.38 | 59.24 | 58.40 | 77.21 |
| wood | 98.51 | 98.33 | 99.52 | 97.24 | 68.64 | 74.75 | 94.50 |
| zipper | 99.84 | 99.17 | 99.96 | 98.40 | 62.48 | 61.89 | 94.46 |
| mean | 97.77 | 97.37 | 99.07 | 97.11 | 62.16 | 62.26 | 93.45 |

VisA

| Category | AUROC-cls | F1-max-cls | AP-cls | AUROC-segm | F1-max-segm | AP-segm | PRO-segm |
| --- | --- | --- | --- | --- | --- | --- | --- |
| candle | 96.55 | 91.26 | 96.45 | 99.36 | 39.56 | 28.36 | 97.62 |
| capsules | 88.62 | 86.43 | 93.77 | 98.71 | 50.85 | 43.90 | 88.20 |
| cashew | 98.54 | 95.57 | 99.30 | 99.33 | 74.88 | 77.63 | 94.30 |
| chewinggum | 98.42 | 96.45 | 99.30 | 99.54 | 61.33 | 61.21 | 88.39 |
| fryum | 98.64 | 97.44 | 99.43 | 99.43 | 58.13 | 50.43 | 94.38 |
| macaroni1 | 89.33 | 82.76 | 88.64 | 99.51 | 21.90 | 15.25 | 96.37 |
| macaroni2 | 68.03 | 69.96 | 67.37 | 97.14 | 11.06 | 3.91 | 88.84 |
| pcb1 | 89.28 | 84.36 | 89.89 | 99.50 | 80.49 | 88.36 | 92.76 |
| pcb2 | 93.20 | 88.66 | 94.46 | 97.39 | 34.38 | 21.86 | 86.06 |
| pcb3 | 93.52 | 86.92 | 93.48 | 98.05 | 40.23 | 41.03 | 92.32 |
| pcb4 | 98.43 | 92.89 | 98.47 | 98.70 | 46.38 | 44.72 | 92.66 |
| pipe_fryum | 98.34 | 96.04 | 99.16 | 99.40 | 67.56 | 67.90 | 97.32 |
| mean | 92.57 | 89.06 | 93.31 | 98.71 | 48.90 | 45.38 | 92.43 |

BTAD

| Category | AUROC-cls | F1-max-cls | AP-cls | AUROC-segm | F1-max-segm | AP-segm | PRO-segm |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 01 | 98.74 | 97.96 | 99.53 | 97.49 | 59.73 | 58.76 | 85.05 |
| 02 | 90.23 | 95.38 | 98.41 | 95.36 | 58.20 | 55.16 | 68.64 |
| 03 | 99.52 | 88.37 | 95.62 | 99.20 | 55.64 | 57.53 | 96.62 |
| mean | 96.16 | 93.90 | 97.85 | 97.35 | 57.86 | 57.15 | 83.43 |

🎖️Results of different backbones: [Back to Catalogue]

The default backbone (feature extractor) in our paper is ViT-large-14-336 of CLIP. We also support other image encoders from CLIP, DINO and DINO_v2. For more details, see configs/musc.yaml.

MVTec AD

| Backbones | Pre-training | Image size | AUROC-cls | F1-max-cls | AP-cls | AUROC-segm | F1-max-segm | AP-segm | PRO-segm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B-32 | CLIP | 256 | 87.99 | 92.31 | 94.38 | 93.08 | 42.06 | 37.21 | 72.62 |
| ViT-B-32 | CLIP | 512 | 89.91 | 92.72 | 95.12 | 95.73 | 53.32 | 52.33 | 83.72 |
| ViT-B-16 | CLIP | 256 | 92.78 | 93.98 | 96.59 | 96.21 | 52.48 | 50.23 | 87.00 |
| ViT-B-16 | CLIP | 512 | 94.20 | 95.20 | 97.34 | 97.09 | 61.24 | 61.45 | 91.67 |
| ViT-B-16-plus-240 | CLIP | 240 | 94.77 | 95.43 | 97.60 | 96.26 | 52.23 | 50.27 | 87.70 |
| ViT-B-16-plus-240 | CLIP | 512 | 95.69 | 96.50 | 98.11 | 97.28 | 60.71 | 61.29 | 92.14 |
| ViT-L-14 | CLIP | 336 | 96.06 | 96.65 | 98.25 | 97.24 | 59.41 | 58.10 | 91.69 |
| ViT-L-14 | CLIP | 518 | 95.94 | 96.32 | 98.30 | 97.42 | 63.06 | 63.67 | 92.92 |
| ViT-L-14-336 | CLIP | 336 | 96.40 | 96.44 | 98.30 | 97.03 | 57.51 | 55.44 | 92.18 |
| ViT-L-14-336 | CLIP | 518 | 97.77 | 97.37 | 99.07 | 97.11 | 62.16 | 62.26 | 93.45 |
| dino_vitbase16 | DINO | 256 | 89.39 | 93.77 | 95.37 | 95.83 | 54.02 | 52.84 | 84.24 |
| dino_vitbase16 | DINO | 512 | 94.11 | 96.13 | 97.26 | 97.78 | 62.07 | 63.20 | 92.49 |
| dinov2_vitb14 | DINO_v2 | 336 | 95.67 | 96.80 | 97.95 | 97.74 | 60.23 | 59.45 | 93.84 |
| dinov2_vitb14 | DINO_v2 | 518 | 96.31 | 96.87 | 98.32 | 98.07 | 64.65 | 65.31 | 95.59 |
| dinov2_vitl14 | DINO_v2 | 336 | 96.84 | 97.45 | 98.68 | 98.17 | 61.77 | 61.21 | 94.62 |
| dinov2_vitl14 | DINO_v2 | 518 | 97.08 | 97.13 | 98.82 | 98.34 | 66.15 | 67.39 | 96.16 |

VisA

| Backbones | Pre-training | Image size | AUROC-cls | F1-max-cls | AP-cls | AUROC-segm | F1-max-segm | AP-segm | PRO-segm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B-32 | CLIP | 256 | 72.95 | 76.90 | 77.68 | 89.30 | 25.93 | 20.68 | 50.95 |
| ViT-B-32 | CLIP | 512 | 77.82 | 80.20 | 81.01 | 96.06 | 34.72 | 30.20 | 73.08 |
| ViT-B-16 | CLIP | 256 | 81.44 | 80.86 | 83.84 | 95.97 | 36.72 | 31.81 | 73.48 |
| ViT-B-16 | CLIP | 512 | 86.48 | 84.12 | 88.05 | 97.98 | 42.21 | 37.29 | 85.10 |
| ViT-B-16-plus-240 | CLIP | 240 | 82.62 | 81.61 | 85.05 | 96.11 | 37.84 | 33.43 | 72.37 |
| ViT-B-16-plus-240 | CLIP | 512 | 86.72 | 84.22 | 89.41 | 97.95 | 43.27 | 37.68 | 83.52 |
| ViT-L-14 | CLIP | 336 | 88.38 | 85.23 | 89.77 | 98.32 | 44.67 | 40.42 | 87.80 |
| ViT-L-14 | CLIP | 518 | 90.86 | 87.75 | 91.66 | 98.45 | 45.74 | 42.09 | 89.93 |
| ViT-L-14-336 | CLIP | 336 | 88.61 | 85.31 | 90.00 | 98.53 | 45.10 | 40.92 | 89.35 |
| ViT-L-14-336 | CLIP | 518 | 92.57 | 89.06 | 93.31 | 98.71 | 48.90 | 45.38 | 92.43 |
| dino_vitbase16 | DINO | 256 | 78.21 | 80.12 | 81.11 | 95.74 | 36.81 | 32.84 | 70.21 |
| dino_vitbase16 | DINO | 512 | 84.11 | 83.52 | 85.91 | 97.74 | 42.86 | 38.27 | 83.00 |
| dinov2_vitb14 | DINO_v2 | 336 | 87.65 | 86.24 | 88.51 | 97.80 | 41.68 | 37.06 | 85.01 |
| dinov2_vitb14 | DINO_v2 | 518 | 90.25 | 87.48 | 90.86 | 98.66 | 45.56 | 41.23 | 91.80 |
| dinov2_vitl14 | DINO_v2 | 336 | 90.18 | 88.47 | 90.56 | 98.38 | 43.84 | 38.74 | 88.38 |
| dinov2_vitl14 | DINO_v2 | 518 | 91.73 | 89.20 | 92.27 | 98.78 | 47.12 | 42.79 | 92.40 |

BTAD

| Backbones | Pre-training | Image size | AUROC-cls | F1-max-cls | AP-cls | AUROC-segm | F1-max-segm | AP-segm | PRO-segm |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B-32 | CLIP | 256 | 92.19 | 95.55 | 98.47 | 96.74 | 43.98 | 35.70 | 68.56 |
| ViT-B-32 | CLIP | 512 | 93.31 | 94.61 | 98.40 | 97.41 | 52.94 | 48.80 | 69.59 |
| ViT-B-16 | CLIP | 256 | 92.44 | 91.00 | 97.31 | 97.45 | 55.27 | 52.19 | 72.68 |
| ViT-B-16 | CLIP | 512 | 94.11 | 92.99 | 97.98 | 97.91 | 59.18 | 59.05 | 77.86 |
| ViT-B-16-plus-240 | CLIP | 240 | 92.86 | 93.99 | 97.96 | 97.68 | 54.81 | 51.33 | 73.47 |
| ViT-B-16-plus-240 | CLIP | 512 | 94.13 | 93.84 | 98.34 | 98.14 | 58.66 | 57.53 | 77.23 |
| ViT-L-14 | CLIP | 336 | 92.74 | 93.21 | 97.71 | 97.84 | 56.60 | 55.94 | 77.01 |
| ViT-L-14 | CLIP | 518 | 94.82 | 95.29 | 98.58 | 97.77 | 55.55 | 55.46 | 80.62 |
| ViT-L-14-336 | CLIP | 336 | 95.11 | 94.48 | 98.53 | 97.42 | 56.75 | 55.23 | 79.63 |
| ViT-L-14-336 | CLIP | 518 | 96.16 | 93.90 | 97.85 | 97.35 | 57.86 | 57.15 | 83.43 |
| dino_vitbase16 | DINO | 256 | 93.63 | 95.66 | 98.66 | 97.55 | 52.16 | 49.25 | 72.86 |
| dino_vitbase16 | DINO | 512 | 92.38 | 92.66 | 97.81 | 97.44 | 53.32 | 53.02 | 74.91 |
| dinov2_vitb14 | DINO_v2 | 336 | 93.60 | 91.65 | 97.19 | 98.08 | 63.28 | 65.32 | 74.35 |
| dinov2_vitb14 | DINO_v2 | 518 | 94.99 | 95.11 | 98.55 | 98.30 | 65.75 | 68.89 | 80.41 |
| dinov2_vitl14 | DINO_v2 | 336 | 94.15 | 92.64 | 97.61 | 98.19 | 63.86 | 66.03 | 76.33 |
| dinov2_vitl14 | DINO_v2 | 518 | 95.62 | 95.40 | 98.76 | 98.40 | 65.88 | 69.90 | 82.47 |

⌛Inference Time: [Back to Catalogue]

The table below shows the inference time per image for different backbones and image sizes. By default, the mutual scoring module processes 200 images, and the GPU is an NVIDIA RTX 3090.

| Backbones | Pre-training | Image size | Time (ms/image) |
| --- | --- | --- | --- |
| ViT-B-32 | CLIP | 256 | 48.33 |
| ViT-B-32 | CLIP | 512 | 95.74 |
| ViT-B-16 | CLIP | 256 | 86.68 |
| ViT-B-16 | CLIP | 512 | 450.5 |
| ViT-B-16-plus-240 | CLIP | 240 | 85.25 |
| ViT-B-16-plus-240 | CLIP | 512 | 506.4 |
| ViT-L-14 | CLIP | 336 | 266.0 |
| ViT-L-14 | CLIP | 518 | 933.3 |
| ViT-L-14-336 | CLIP | 336 | 270.2 |
| ViT-L-14-336 | CLIP | 518 | 955.3 |
| dino_vitbase16 | DINO | 256 | 85.97 |
| dino_vitbase16 | DINO | 512 | 458.5 |
| dinov2_vitb14 | DINO_v2 | 336 | 209.1 |
| dinov2_vitb14 | DINO_v2 | 518 | 755.0 |
| dinov2_vitl14 | DINO_v2 | 336 | 281.4 |
| dinov2_vitl14 | DINO_v2 | 518 | 1015.1 |

🙋🙋‍♂️Frequently Asked Questions: [Back to Catalogue]

Q: Why do large areas of high anomaly scores appear on normal images in the visualization?

A: To highlight abnormal areas, the visualization normalizes each anomaly map individually by default. Even if the overall response of a map is low, normalization therefore produces many highlighted areas. To normalize all anomaly maps together, add the vis_type parameter to the shell script and set it to whole_norm, or modify the testing->vis_type parameter in ./configs/musc.yaml.
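The difference between the two modes can be seen in a small NumPy sketch. It assumes min-max normalization, which this README does not confirm, and the function `normalize_maps` is written for this example rather than taken from the repository.

```python
import numpy as np

def normalize_maps(anomaly_maps, vis_type="single_norm"):
    """Min-max normalize anomaly maps of shape [N, H, W] to [0, 1].

    single_norm: each map uses its own minimum and maximum.
    whole_norm:  all maps share one global minimum and maximum.
    """
    maps = np.asarray(anomaly_maps, dtype=float)
    if vis_type == "single_norm":
        lo = maps.min(axis=(1, 2), keepdims=True)
        hi = maps.max(axis=(1, 2), keepdims=True)
    elif vis_type == "whole_norm":
        lo, hi = maps.min(), maps.max()
    else:
        raise ValueError(f"unknown vis_type: {vis_type}")
    # Guard against division by zero for constant maps
    return (maps - lo) / np.maximum(hi - lo, 1e-12)

normal_map = np.array([[0.01, 0.02], [0.02, 0.03]])  # low response everywhere
defect_map = np.array([[0.10, 0.20], [0.30, 0.90]])  # one strong response
maps = np.stack([normal_map, defect_map])

single = normalize_maps(maps, "single_norm")
whole = normalize_maps(maps, "whole_norm")
print(single[0].max(), whole[0].max())
```

With single_norm the low-response normal map is stretched to the full range and looks as "hot" as the defective one; with whole_norm it stays close to zero.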

Q: How do I choose an appropriate input image resolution?

A: The image resolution img_resize fed into the backbone is generally set to a multiple of the ViT patch size. Commonly used values are 224, 240, 256, 336, 512 and 518. The section "Results of different backbones" lists the two input resolutions commonly used with each feature extractor, for reference. The image resolution can be changed through the img_resize parameter in the shell script, or through the datasets->img_resize parameter in the ./configs/musc.yaml configuration file.
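The "multiple of the patch size" rule can be checked with a few lines of Python. The helper `nearest_valid_size` is hypothetical and not part of the repository; the patch sizes are read off the backbone names (for example 14 for ViT-L-14 and dinov2_vitb14, 16 for ViT-B-16, 32 for ViT-B-32).

```python
def nearest_valid_size(img_resize, patch_size):
    """Round img_resize to the nearest positive multiple of patch_size."""
    multiple = max(1, round(img_resize / patch_size))
    return multiple * patch_size

# (resolution, patch size) pairs; the last one is deliberately not a multiple.
for size, patch in [(518, 14), (336, 14), (512, 16), (256, 32), (500, 14)]:
    is_multiple = size % patch == 0
    print(size, patch, is_multiple, nearest_valid_size(size, patch))
```

For a patch size of 14, a requested resolution of 500 is not a multiple, and the nearest valid value is 504.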

@inproceedings{Li2024MuSc,
  title={MuSc: Zero-Shot Industrial Anomaly Classification and Segmentation with Mutual Scoring of the Unlabeled Images},
  author={Li, Xurui and Huang, Ziming and Xue, Feng and Zhou, Yu},
  booktitle={International Conference on Learning Representations},
  year={2024}
}

Our repo is built on PatchCore and APRIL-GAN; thanks for their clear and elegant code!

MuSc is released under the MIT Licence. It is fully open for academic research and also allows free commercial use. To apply for a commercial license, please contact yuzhou@hust.edu.cn.