transformer_adapter_bias_evaluation

Natural Language Processing HPI SS 2021


Evaluation of Adapters with StereoSet

Bias in AI has become an increasingly important topic. The recent paper On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? by Bender et al. [Bender2021] addresses this issue and provides a comprehensive analysis of the risks posed by large language models (LMs). I summarize the key findings in this section. First, researchers need to collect vast amounts of data from the Internet to feed data-hungry language models. This carries the risk of sampling abusive language into the training data and teaching the model abusive and discriminating vocabulary from the very beginning. Because models pick up the bias embedded in their training data, training on biased data results in models that contain stereotypical associations regarding gender, race, ethnicity, and disability status [basta-etal-2019-evaluating]. When these models are deployed, the bias is reinforced in two ways. On the one hand, a vicious circle starts in the post-deployment phase: text generated by LMs ends up in the training data of future LMs. On the other hand, people disseminate text generated by LMs, which not only increases the quantity of abusive language but also poisons the social climate, as readers are either introduced to prejudices or feel reinforced in stereotypes they already hold. For groups that are discriminated against, bias in LMs can become a serious problem. The harm is not only psychological for individuals; it also has broader societal implications, as the reinforcement of sexist, racist, and other prejudices supports harmful ideologies that, in the worst case, may lead to violence.

In recognition of these problems, my project's goal is to benchmark NLP models on various tasks to check whether they produce discriminatory results. Building such a benchmark from scratch, however, would not only require access to data and models; it would also require a domain-specific and intersectional analysis of which inputs are well suited to test a model for fairness (not to mention that there is no standardized, easily quantifiable definition of fairness). Developing my own benchmark data set would therefore go beyond the scope of this lecture. Instead, my idea was to benchmark multiple adapters from AdapterHub on a publicly accessible benchmark data set called StereoSet, as sketched below.
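
To make the approach concrete, the following is a minimal sketch (not the project's actual notebook code) of how an adapter from AdapterHub could be loaded alongside StereoSet. It assumes the adapter-transformers package (the AdapterHub fork of HuggingFace Transformers) and the datasets library are installed; the model, adapter, and dataset identifiers are illustrative.

```python
# Minimal sketch: load a BERT model with an AdapterHub adapter and the
# StereoSet data. Assumes the adapter-transformers fork of Transformers
# and the HuggingFace datasets library; identifiers are illustrative.
from transformers import AutoTokenizer, AutoModelWithHeads
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelWithHeads.from_pretrained("bert-base-uncased")

# Download a task adapter from AdapterHub (here an SST-2 sentiment adapter
# as an example) and activate it for the forward pass.
adapter_name = model.load_adapter("sentiment/sst-2@ukp", source="ah")
model.set_active_adapters(adapter_name)

# StereoSet (intrasentence subset) from the HuggingFace hub: each example
# pairs a context with stereotype, anti-stereotype, and unrelated candidates.
stereoset = load_dataset("stereoset", "intrasentence", split="validation")
print(stereoset[0])
```

From this starting point, the candidate sentences can be scored with the adapter-equipped model to compute StereoSet's language modeling and stereotype scores, which is what the notebooks in this repository explore.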