
[ECCV 2022] A generalized long-tailed challenge that incorporates both the conventional class-wise imbalance and the overlooked attribute-wise imbalance within each class. The proposed IFL together with other baselines are also included.


Generalized Long-tailed Classification (GLT) Benchmarks

[ECCV 2022] This project introduces a new long-tailed challenge that incorporates both the conventional class-wise imbalance and the overlooked attribute-wise imbalance within each class. The proposed IFL together with other baselines are also included. This project is the official implementation of the ECCV 2022 paper Invariant Feature Learning for Generalized Long-Tailed Classification.

If my open source projects have inspired you, sponsorship would be a great help to my subsequent open source work. ❤️🙏

If you find that our paper or this project helps your research, please consider citing our paper in your publications.

@inproceedings{tang2022invariant,
  title={Invariant Feature Learning for Generalized Long-Tailed Classification},
  author={Tang, Kaihua and Tao, Mingyuan and Qi, Jiaxin and Liu, Zhenguang and Zhang, Hanwang},
  booktitle= {ECCV},
  year={2022}
}

[Project Page] [5min Slides] [30min Slides]

Contents

  1. Background
  2. Problem Formulation
  3. Install the Requirements
  4. Prepare GLT Datasets
  5. Evaluation Protocols and Metrics
  6. Invariant Feature Learning
  7. Conduct Training
  8. Conduct Testing
  9. Add Custom Models
  10. Observations

Background

Existing long-tailed classification methods focus only on class-wise imbalance (head classes have more samples than tail classes) and overlook attribute-wise imbalance (the intra-class distribution is also long-tailed because attributes vary). If we look at the samples inside each class in Figure 1, their attributes may also exhibit long-tailed distributions, e.g., there are more sitting dogs than swimming dogs, and more brown dogs than green dogs. Therefore, the class distribution alone cannot explain all the phenomena caused by imbalanced data. Specifically: 1) why is the performance within each class also long-tailed? 2) why are images usually misclassified as classes with similar attributes? The attribute bias is thus incorporated into the proposed generalized long-tailed classification to answer these questions.

However, most conventional long-tailed classification benchmarks, e.g., ImageNet-LT, Long-Tailed CIFAR-10/-100, or iNaturalist, can only evaluate the class bias, underestimating the role of the attribute bias in the long-tail challenge. To evaluate the robustness of models against both inter-class imbalance (class-level) and intra-class imbalance (attribute-level) at the same time, this project introduces two GLT benchmarks and three evaluation protocols.

The Generalized Long Tail.

Figure 1. The real-world long-tailed distribution is both class-wise and attribute-wise imbalanced.

Problem Formulation

Prevalent LT methods formulate the classification model as $p(Y|X)$, predicting the label $Y$ from the input image $X$, which can be further decomposed into $p(Y|X) \propto p(X|Y)\cdot p(Y)$. This formulation identifies the cause of class-wise bias as $p(Y)$, so it can be elegantly solved by Logit Adjustment. However, such a formulation rests on the strong assumption that the distribution $p(X|Y)$ does not change across domains, i.e., $p_{train}(X|Y) = p_{test}(X|Y)$, which cannot be guaranteed in real-world applications. Therefore, we relax the previous assumption $p_{train}(X|Y) = p_{test}(X|Y)$ to a more realistic one, i.e., only a subset of features $z_c$ is invariant across domains: $p_{train}(z_c|Y)=p_{test}(z_c|Y)$. This new assumption has to hold; otherwise, a robust classification model could not exist in the first place.

Specifically, we consider that each $X$ is generated by a set of underlying $(z_c, z_a)$, where the class-specific components $z_c$ are the invariant factors and the attribute-related variables $z_a$ have domain-specific distributions. Therefore, we can apply Bayes' theorem to convert the classification model $p(Y|X) = p(Y|z_c, z_a)$ into the following formula:

$$p(Y|z_c, z_a) = \frac{p(z_c|Y)}{p(z_c)} \cdot \frac{p(z_a|Y, z_c)}{p(z_a|z_c)} \cdot p(Y)$$

where the class-specific components $z_c$ only depend on $Y$ and are invariant across domains; the descriptive attributes $z_a$, which vary across instances, may depend on both $Y$ and $z_c$. We generally consider $p(z_c, z_a) = p(z_a|z_c)\cdot p(z_c)$ WITHOUT introducing any independence assumption. Note that we also DO NOT impose the disentanglement assumption that a perfect feature vector $\mathbf{z}=[z_c;z_a]$ with separated $z_c$ and $z_a$ can be obtained, as disentanglement is a challenging task on its own. If it could be obtained, a simple feature selection would be enough to reach the ideal classification model. This new formulation unites the class-wise bias with the attribute-wise biases (e.g., sub-population shift, domain shift, etc.) in general classification tasks.
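To make the class-prior term $p(Y)$ concrete, the following is a minimal sketch of post-hoc logit adjustment in plain Python: the log of the training class prior is subtracted from each logit, so that a head class no longer wins only because it is frequent. The function names and the `tau` scaling parameter are illustrative assumptions and are not taken from this repository's implementation.

```python
import math

def adjust_logits(logits, class_counts, tau=1.0):
    """Subtract tau * log(training prior) from each class logit."""
    total = sum(class_counts)
    return [z - tau * math.log(n / total) for z, n in zip(logits, class_counts)]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Toy example (made-up numbers): class 0 is a head class with 900 training
# images, class 1 is a tail class with 100 training images.
logits = [2.0, 1.0]
counts = [900, 100]
adjusted = adjust_logits(logits, counts)

print(argmax(logits), argmax(adjusted))
```

In this toy case the raw logits favor the head class, while the adjusted logits favor the tail class. Note that this correction only touches $p(Y)$; it leaves the attribute-related term untouched, which is the gap the GLT formulation points at.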

Install the Requirements

  • PyTorch >= 1.6.0 (CUDA 10.2)
  • torchvision >= 0.7.0
###################################
###  Step by Step Installation   ##
###################################

# 1. create and activate conda environment
conda create -n glt_benchmark pip python=3.6
conda activate glt_benchmark

# 2. install pytorch and torchvision
conda install pytorch torchvision cudatoolkit=10.2 -c pytorch

# 3. install other packages
pip install scikit-learn joblib randaugment pyyaml==5.4
conda install matplotlib

# 4. download this project
git clone https://github.com/KaihuaTang/Generalized-Long-Tailed-Benchmarks.pytorch.git

Prepare GLT Datasets

We propose two datasets for the Generalized Long-Tailed (GLT) classification tasks: ImageNet-GLT and MSCOCO-GLT.

  • For ImageNet-GLT (link), as with most other datasets, we don't have attribute annotations, so we use feature clusters within each class to represent K "pretext attributes". In other words, each cluster represents a meta attribute layout for this class.
  • For MSCOCO-GLT (link), we directly adopt attribute annotations from MSCOCO-Attribute to construct our dataset.

Please follow the above links to prepare the datasets.

Evaluation Protocols and Metrics

To systematically evaluate the robustness of models against class-wise imbalance, attribute-wise imbalance, and their joint effect, we introduce three protocols along with the above two benchmark datasets.

Class-wise Long Tail (CLT) Protocol

As in conventional long-tailed classification, we first adopt a class-wise and attribute-wise LT training set, called Train-GLT, which can be easily sampled from ImageNet and MSCOCO-Attribute using a class-wise LT distribution. We don’t need to intentionally ensure the attribute-wise imbalance, as it is ubiquitous and inevitable in any real-world dataset. The corresponding Test-CBL, which is i.i.d. sampled within each class, is a class-wise balanced and attribute-wise long-tailed testing set. (Train-GLT, Test-CBL), with the same attribute distributions and different class distributions, can thus evaluate the robustness against the class-wise long tail.

Attribute-wise Long Tail (ALT) Protocol

The training set Train-CBL of this protocol has the same number of images for each class and keeps the original long-tailed attribute distribution by i.i.d. sampling images within each class, so its bias only comes from the attributes. Meanwhile, Test-GBL, as the most important evaluation environment for the GLT task, has to balance both class and attribute distributions. Test-GBL for ImageNet-GLT samples an equal number of images from each “pretext attribute” and each class. Test-GBL for MSCOCO-GLT is trickier, because each object has multiple attributes, which makes strictly balancing the attribute distribution prohibitive. Hence, within each class we select a fixed-size subset that minimizes the standard deviation of the attribute counts as the Test-GBL. As long as Test-GBL is more balanced in attributes than Train-CBL, it can serve as a valid testing set for the ALT protocol. In summary, (Train-CBL, Test-GBL) have the same class distributions and different attribute distributions.
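The ImageNet-GLT style of balancing described above (an equal number of images from each class and each pretext attribute) can be sketched as follows. This is a simplified illustration under stated assumptions, not the dataset construction script of this project: every image is assumed to carry exactly one attribute id, `sample_balanced` and the tuple layout are hypothetical, and the multi-attribute MSCOCO-GLT case is not covered.

```python
import random
from collections import Counter, defaultdict

def sample_balanced(samples, per_cell, seed=0):
    """Pick `per_cell` images from every (class, attribute) cell.

    `samples` is a list of (image_id, class_id, attribute_id) tuples.
    Cells with fewer than `per_cell` images are skipped (an assumption
    made here so that all kept cells have the same size).
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for image_id, class_id, attr_id in samples:
        cells[(class_id, attr_id)].append(image_id)

    selected = []
    for key in sorted(cells):
        ids = cells[key]
        if len(ids) >= per_cell:
            selected.extend((i, *key) for i in rng.sample(ids, per_cell))
    return selected

# Toy data: the attribute distribution is long-tailed inside each class.
layout = {(0, 0): 8, (0, 1): 2, (1, 0): 5, (1, 1): 3}
toy = []
for (c, a), n in layout.items():
    toy.extend((f"img_{c}_{a}_{k}", c, a) for k in range(n))

balanced = sample_balanced(toy, per_cell=2)
cell_sizes = Counter((c, a) for _, c, a in balanced)
print(dict(cell_sizes))
```

After sampling, every (class, attribute) cell in the toy set holds the same number of images, which is the property Test-GBL aims for.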

Generalized Long Tail (GLT) Protocol

This protocol combines (Train-GLT, Test-GBL) from the above, so both class and attribute distributions are changed from training to testing. As the generalized evaluation protocol for the long-tailed challenge, an algorithm can only obtain satisfactory results when both class bias and attribute bias are well addressed in the final model.
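The three protocols can be summarized as a small lookup table. The sketch below encodes only what is stated above (which split pairs each protocol uses, and whether each split is balanced or long-tailed in class and attribute); the dictionary names and the helper `shifted_factors` are illustrative and do not exist in this repository.

```python
# Train/test split used by each protocol.
PROTOCOLS = {
    "CLT": {"train": "Train-GLT", "test": "Test-CBL"},
    "ALT": {"train": "Train-CBL", "test": "Test-GBL"},
    "GLT": {"train": "Train-GLT", "test": "Test-GBL"},
}

# Class-wise and attribute-wise distribution of each split.
SPLIT_BALANCE = {
    "Train-GLT": {"class": "long-tailed", "attribute": "long-tailed"},
    "Train-CBL": {"class": "balanced", "attribute": "long-tailed"},
    "Test-CBL": {"class": "balanced", "attribute": "long-tailed"},
    "Test-GBL": {"class": "balanced", "attribute": "balanced"},
}

def shifted_factors(protocol):
    """Return which distributions change from training to testing."""
    splits = PROTOCOLS[protocol]
    train = SPLIT_BALANCE[splits["train"]]
    test = SPLIT_BALANCE[splits["test"]]
    return [k for k in ("class", "attribute") if train[k] != test[k]]

for name in PROTOCOLS:
    print(name, shifted_factors(name))
```

Reading the table this way makes the design explicit: CLT shifts only the class distribution, ALT shifts only the attribute distribution, and GLT shifts both.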

Evaluation Metrics

Top-1 accuracy is commonly adopted as the only metric in conventional LT studies, yet it cannot reveal the limitation of the precision-accuracy trade-off. Therefore, in GLT classification, we report both Accuracy (#CorrectPredictions / #AllSamples), which is equal to Top-1 Recall on the class-wise balanced test sets, and Precision (1 / #Class * SUM over class (#CorrectPredictions / #SamplesPredictedAsThisClass)), to better evaluate the effectiveness of algorithms.
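The two metrics defined above can be computed as in the following minimal sketch. It is written in plain Python for illustration and is not the evaluation code of this project; the function name is hypothetical, and assigning a precision of 0 to classes that are never predicted is an assumption made here.

```python
from collections import Counter

def accuracy_and_precision(y_true, y_pred, num_classes):
    """Return (Accuracy, class-averaged Precision) as defined above."""
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)

    accuracy = sum(correct.values()) / len(y_true)

    # Per-class precision: correct predictions / samples predicted as the class.
    # Assumption: a class that is never predicted contributes 0.
    per_class = [
        correct[c] / predicted[c] if predicted[c] else 0.0
        for c in range(num_classes)
    ]
    precision = sum(per_class) / num_classes
    return accuracy, precision

# Toy predictions on a class-balanced test set, biased towards class 0.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 0, 2]
acc, prec = accuracy_and_precision(y_true, y_pred, num_classes=3)
print(acc, prec)
```

In the toy example the model over-predicts class 0, so that class reaches full recall but only 0.5 precision, while the other two classes show the opposite pattern. Reporting both numbers exposes this trade-off.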

Notice

To reproduce the reported experimental results, you need to remove momentum from the SGD optimizer. As the deadline was approaching, I realized that I had forgotten to add momentum to my SGD optimizer, so I had to accept the setting of 0 momentum. Since all the methods are re-implemented under the same optimizer, our conclusions and analyses still hold. If you build on this project, you can decide whether or not to add momentum at link.

Invariant Feature Learning

To tackle the proposed GLT challenge, we introduce an Invariant Feature Learning (IFL) method that deals with the attribute-wise bias at the feature level. It can be incorporated into previous LT algorithms to achieve GLT robustness. To better understand our algorithm, please see its framework and pseudo code (Link).

Conduct Training

Train Baseline Models

Run the following command to train a baseline model on Train-GLT of MSCOCO-GLT:

CUDA_VISIBLE_DEVICES=0,1 python main.py --cfg config/COCO_LT.yaml --output_dir checkpoints/YOUR_PATH --require_eval --train_type baseline --phase train

Run the following command to train a baseline model on Train-CBL of MSCOCO-GLT:

CUDA_VISIBLE_DEVICES=0,1 python main.py --cfg config/COCO_BL.yaml --output_dir checkpoints/YOUR_PATH --require_eval --train_type baseline --phase train

Run the following command to train a baseline model on Train-GLT of ImageNet-GLT:

CUDA_VISIBLE_DEVICES=0,1 python main.py --cfg config/ImageNet_LT.yaml --output_dir checkpoints/YOUR_PATH --require_eval --train_type baseline --phase train

Run the following command to train a baseline model on Train-CBL of ImageNet-GLT:

CUDA_VISIBLE_DEVICES=0,1 python main.py --cfg config/ImageNet_BL.yaml --output_dir checkpoints/YOUR_PATH --require_eval --train_type baseline --phase train

Train Other Models

You can easily switch between pre-defined algorithms by changing the value of --train_type. Details of our methods and the re-implemented algorithms are under config/algorithms_config.yaml. This project currently supports the following methods:

  1. --train_type baseline (Cross-Entropy Baseline Model)
  2. --train_type mixup (Cross-Entropy model with Mixup Augmentation)
  3. --train_type TDE (TDE model)
  4. --train_type BBN (BBN model)
  5. --train_type LA (Logit Adjustment method)
  6. --train_type LDAM (LDAM model)
  7. --train_type RIDE (RIDE model)
  8. --train_type TADE (TADE model)
  9. --train_type stage1 (The first stage feature learning for Decoupling)
  10. --train_type crt_stage2 (The second stage classifier fine-tuning using CRT classifier from Decoupling)
  11. --train_type lws_stage2 (The second stage classifier fine-tuning using LWS classifier from Decoupling)
  12. --train_type ride_stage2 (Trying to decouple stage 2 class-balanced classifier for RIDE)
  13. --train_type Focal (Focal loss)
  14. --train_type FocalLA (Combine Focal loss with Logit Adjustment)
  15. --train_type LFF (Learning from Failure model)
  16. --train_type LFFLA (Combine Learning from Failure with Logit Adjustment)
  17. --train_type center_dual (The proposed IFL algorithm that extends the center loss to its Invariant Risk Minimization (IRM) version with two environments)
  18. --train_type center_ride/center_dual_mixup/center_tade etc. (The variations of IFL that combines with other methods, e.g., "center_ride" combines IFL with RIDE model)

Conduct Testing

Testing a model trained on Train-GLT automatically evaluates both the CLT Protocol (Test-CBL) and the GLT Protocol (Test-GBL), so you can run the following command to evaluate your model:

CUDA_VISIBLE_DEVICES=0,1 python main.py --cfg config/ImageNet_LT.yaml --output_dir checkpoints/YOUR_PATH --require_eval --train_type baseline --phase test --load_dir checkpoints/YOUR_PATH/YOUR_CHECKPOINT.pth

Add Custom Models

To add a custom model: 1) First, design a "train_XXX.py" file templated on "train_baseline.py"; an additional "test_XX.py" may also be required if your custom algorithm contains special post-processing. 2) Next, add the config of your algorithm to "config/algorithms_config.yaml". 3) Finally, to match the config with the train/test frameworks, link them in "utils/train_loader.py" and "utils/test_loader.py".

Observations

  1. Attribute-wise imbalance is fundamentally different from class-wise imbalance for two reasons: 1) its statistical distribution is unavailable during training, as exhaustively annotating attributes is prohibitive; 2) multiple attributes tend to co-occur in one object, which makes instance-wise re-weighting/re-sampling less effective: every time we sample an object with a rare attribute, we simultaneously sample its co-occurring frequent attributes as well. (Specifically, in MSCOCO-GLT, even if we directly use the explicit attribute annotations, we are still unable to strictly balance the attribute distribution; we can only minimize its standard deviation.)

  2. Previous LT methods usually exhibit a precision-recall trade-off between head and tail classes. Therefore, we report both Accuracy, which is equal to Top-1 Recall on the class-wise balanced test sets, and Precision, to better evaluate the effectiveness of algorithms. We empirically found that recent approaches that improve both head and tail categories, though they lack a formal definition of this goal, are essentially trying to solve the GLT challenge. Because they benefit from feature learning, these ensemble learning and data augmentation approaches can also serve as good baselines for the proposed GLT.