/Improving-Mean-Absolute-Error-against-CCE

Mean Absolute Error Does Not Treat Examples Equally and Gradient Magnitude’s Variance Matters

Primary LanguageShellMIT LicenseMIT

IMAE for Noise-Robust Learning: Mean Absolute Error Does Not Treat Examples Equally and Gradient Magnitude’s Variance Matters

👍 Glad to know that our recent papers have inspired an ICML 2020 paper: Normalized Loss Functions for Deep Learning with Noisy Labels

  • Open discussion on deep robustness, please!
  • This paper worked on improving loss functions. Insteadm, in our work, we go deeper and propose a much more flexible framework to design the derivative straightforwardly without deriving it from a loss function, is not it very cool?
  • Besides, we propose to interpret the derivative magnitude of one data point as its weight. Rethinking of deep learning optimisation is probably needed here!

👍 Selected work partially impacted by our work

Citation

Please kindly cite us if you find our work useful.

@article{wang2019imae,
  title={{IMAE} for Noise-Robust Learning: Mean Absolute Error Does Not Treat Examples Equally and Gradient Magnitude’s Variance Matters},
  author={Wang, Xinshao and Hua, Yang and Kodirov, Elyor and Robertson, Neil M},
  journal={arXiv preprint arXiv:1903.12141},
  year={2019}
}

Open Reviews and Discussion

Since this paper is released, for your better reference, the ICCV-19 reviews results are released following the practice of OpenReview

A Open Question on whether clean or noisy validation set for ML/DL researchers caring about label noise

  • Reviewer#3's opinion in final justification: `The validation sets are required to be clean, which greatly decrease the contribution. Many existing methods employ noisy validation set to choose hyper-parameters, e.g., when the risk is consistent. As minimizing risks on the noisy validation set is asymptotically equal to minimizing risk on the clean data.'

  • My opinion discussed with my collaborators: Following the ML literature, a validation set should be clean as we should not expect a ML model to predict noisy data well. In other words, we cannot evaluate/decide a model’s performance on noisy validation/test data. Our goal is to avoid learning faults from noisy data and generalise better during inference.

Positive comments we collected

  • The proposed modification IMAE is quite simple and should be considerably more efficient than other methods that deal with label noise.
  • The theoretical analysis of CCE and MAE is thorough and provides an explanation of the tendency of CCE to overfit to incorrect labels and the underfitting of MAE to the correct labels.
  • The experiments show a significant improvement over CCE in the case of noisy labels which validates the approach. I also appreciate the experiment on MARS with realistic label noise. I appreciate the comparison on Clothing-1M provided by the authors. The results there suggest that under realistic label noise the method actually works well when compared to SotA methods.

Introduction

Research questions:

  • Why does MAE work much worse than CCE although it is noise-robust?
  • How to improve MAE against CCE to embrace noise-robustness and high generalisation performance?

Our work is a further study of robust losses following MAE [1] and GCE [2]. They proved MAE is more robust than CCE when noise exists. However, MAE’s underfitting phenomenon is not exposed and studied in the literature. We analysed it thoroughly and proposed a simple solution to embrace both high fitting ability (accurate) and test stability (robust).

IMAE is suitable for cases where inputs and labels may be unmatched.

Training DNNs requires rethinking data fitting and generalisation. Our main contribution is simple analysis and solution from the viewpoint of gradient magnitude with respect to logits.

Takeaways

  • By ‘CCE is noise-sensitive and overfits training data’, we mean CCE owns high data fitting accuracy but its final test accuracy drops a lot versus its best test accuracy.
  • By ‘MAE is robust’, we mean MAE’s final test accuracy drops only a bit versus its best test one.
  • By ‘MAE underfits training data’, we mean its training and best test accuracies are low.

Please see our empirical evidences in the paper.

MAE’s fitting ability is much worse than CCE. In other words, CCE overfits to incorrect labels while MAE underfits to correct labels.

  • The robustness/sensitive to noise is from the angle of test accuracy stability/trend, i.e., CCE’s final test accuracy drops a lot versus its best one while MAE’s final one is almost the same as its best one;
  • The claim ‘MAE works worse than CCE’ is from the aspect of best test accuracy since we generally apply early stopping to help CCE.

Results

Label noise is one of the most explicit cases where some observations and their labels are not matched in the training data. In this case, it is quite crucial to make your models learn meaningful patterns instead of errors.

Synthetic noise

Real-world unknown noise

Classification on Clothing 1M [a] is here

Hyper-paramter Analysis

Discussion

1. The idea of this paper is quite close to "training deep neural-networks using a noise adaptation layer"? They both intend to change the weight of each sample before sending to softmax, definitely they do in different ways. It decreases the novelty of this paper?

Their critical differences are: 1) Noise Adaption explicitly estimates latent true labels by an additional softmax layer while our IMAE reweights examples based on their input-to-label relevance scores; 2) IMAE reweights samples after softmax, i.e., scaling their gradients as shown in Eq. (22) in our paper.

2. Why uniform noise (symmetric/class-independent noise )?

We choose uniform noise because it is more challenging than asymmetric (class-dependent) noise which was verified in [d] Vahdat et al. Toward robustness against label noise in training deep discriminative neural networks. In NeurIPS, 2017.

3. Why is the performance still okay when noise rate is 80%?

By adding uniform noise, even up to 80%, the correct portion is still the majority, since the 80% are relocated to other 9 classes evenly.

Being natural and intuitive, the majority voting decides the meaningful data patterns to learn. We believe that if the noise accounts the majority, DNNs is hard to learn meaningful patterns. Therefore, the majority voting is our reasonable assumption.

4. The study from the gradient perspective is not new, e.g., Truncated Cauchy Non-Negative Matrix Factorization, ang GCE [2].

Yes, we agree the perspective itself is not new. However, we find how we analyse fundamentally and go to the simple solution via the gradient viewpoint is novel.

Truncated Cauchy Non-Negative Matrix Factorization (TPAMI-2017) and GCE [2] truncate large errors to filter out extreme outliers. Instead, our IMAE adjusts weighting variance without dropping any samples.

5. The robustness is not specific for label noise. I think the method works well for general noise, e.g., outliers.

Yes, that is a great point. Our IMAE is suitable for all cases where inputs and their labels are not semantically matched, which may come from noisy data or labels. Since we only evaluated on label noise, we did not exaggerate its efficacy.

We will test more cases in the future.

6. Is the validation data clean or not? If clean, this would greatly reduce the contribution of the paper.

Following the ML literature, a validation set should be clean as we should not expect a ML model to predict noisy data well. In other words, we cannot evaluate a model’s performance on noisy validation/test data. Our goal is to avoid learning faults from noisy data and generalise better during inference.

7. More experiments with comparison to prior work and more evaluation on real-world datasets with unknown noise?

Our focus is to analyse why CCE overfits while MAE underfits as presented in ablation studies in Table 2. Under unknown real-world noise in Table 3, we only compared with GCE [2] as it is the most related and demonstrated to be the state-of-the-art.

Classification on Clothing 1M [a] is here

References

[1] A. Ghosh, H. Kumar, and P. Sastry. Robust loss functions under label noise for deep neural networks. In AAAI, 2017.

[2] Z. Zhang and M. R. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS 2018.

[3] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

[4] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. Mars: A video benchmark for large-scale person re-identification. In ECCV, 2016.

[a] Xiao et al. Learning From Massive Noisy Labeled Data for Image Classification. In CVPR, 2015.

[b] Patrini et al. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.

[c] Goldberger et al. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.

[d] Vahdat et al. Toward robustness against label noise in training deep discriminative neural networks. In NeurIPS, 2017.

[e] Tanaka et al. Joint optimization framework for learning with noisy labels. In CVPR, 2018.

[f] Han et al. Masking: A new perspective of noisy supervision. In NeurIPS, 2018.

[g] Jenni et al. Deep bilevel learning. In ECCV, 2018.