GLUE-X

We collect 14 publicly available datasets as OOD test data and conduct evaluations on 8 classic NLP tasks over 21 popularly used models. Our findings confirm that OOD accuracy in NLP deserves more attention: significant performance decay relative to ID accuracy appears in all settings.
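
As a minimal sketch of this comparison, the performance decay on a task is simply the ID test accuracy minus the OOD test accuracy. The dataset names and numbers below are placeholders, not results from the paper:

id_accuracy = 0.94                              # ID test accuracy (placeholder value)
ood_accuracy = {"IMDB": 0.89, "Yelp": 0.90}     # OOD test accuracies (placeholder values)

for dataset, acc in ood_accuracy.items():
    decay = id_accuracy - acc
    print(f"{dataset}: OOD accuracy = {acc:.2f}, decay vs. ID = {decay:+.2f}")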

Fine-tune your language model

Please check out these examples from Hugging Face Transformers to fine-tune your custom models, along the lines of the sketch below.
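
A minimal fine-tuning sketch using the Transformers Trainer API. The model name, dataset, and hyperparameters here are illustrative choices, not the paper's exact setup:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Example ID training data: SST-2 from GLUE (an illustrative choice).
dataset = load_dataset("glue", "sst2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=32)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()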

Out-of-Distribution Tests (OOD)

The data for all OOD tests can be found here.
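
Once a model is fine-tuned, an OOD test pass might look like the sketch below. The checkpoint directory, file name, and column names are assumptions about a local setup, not the release's exact layout:

import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned checkpoint (assumed to be saved under "out/").
tokenizer = AutoTokenizer.from_pretrained("out")
model = AutoModelForSequenceClassification.from_pretrained("out").eval()

# Assumed OOD test file with "sentence" and "label" columns; for brevity
# this encodes the whole file at once rather than in batches.
df = pd.read_csv("ood_test.tsv", sep="\t")
inputs = tokenizer(list(df["sentence"]), truncation=True, padding=True,
                   max_length=128, return_tensors="pt")
with torch.no_grad():
    preds = model(**inputs).logits.argmax(dim=-1)

ood_acc = (preds.numpy() == df["label"].to_numpy()).mean()
print(f"OOD accuracy: {ood_acc:.4f}")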

Main Contributors

Shuibai Zhang (Code and Experiment Implementation); Linyi Yang (Guidance and Experiment Design); Wei Zhou (Website Implementation)

Citation

If you find this work helpful for your research, please consider citing the paper as follows.

@article{yang2022glue,
  title={GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective},
  author={Yang, Linyi and Zhang, Shuibai and Qin, Libo and Li, Yafu and Wang, Yidong and Liu, Hanmeng and Wang, Jindong and Xie, Xing and Zhang, Yue},
  journal={arXiv preprint arXiv:2211.08073},
  year={2022}
}