iSEA: An Interactive Pipeline for Semantic Error Analysis of NLP Models

This is the official code repository for iSEA: An Interactive Pipeline for Semantic Error Analysis of NLP Models, by Jun Yuan, Jesse Vig, Nazneen Rajani.

Repository Overview

This repository contains the following two parts:

- pre-process/: Code for pre-processing the text documents. Using a pre-trained DistilBERT model as the example, a set of Jupyter notebooks walks through each step:
  - document preprocessing (tokenization, lemmatization, document embedding, etc.);
  - model performance;
  - high-level feature generation;
  - rule generation;
  - instance-level model explanation (SHAP values).
- ui/: Code and processed data for running the front end.
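As a rough illustration of what the pre-processing step produces for each document, the sketch below builds a per-document record with tokens, one simple high-level feature (token count), and an error flag. It is not the code from the notebooks: the `DocRecord` class, the `preprocess` function, and the regex tokenizer are hypothetical stand-ins, and the example sentences, labels, and predictions are made up.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class DocRecord:
    """Hypothetical per-document record; the field names are assumptions."""
    text: str
    tokens: List[str]
    length: int        # a simple high-level feature: number of tokens
    label: str         # ground-truth label
    prediction: str    # model output

    @property
    def is_error(self) -> bool:
        # A document counts as an error when the prediction differs from the label.
        return self.label != self.prediction


def preprocess(text: str, label: str, prediction: str) -> DocRecord:
    # Lowercase regex tokenization, used here only as a stand-in for the
    # tokenization and lemmatization performed in the notebooks.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return DocRecord(text, tokens, len(tokens), label, prediction)


# Made-up examples, not taken from MultiNLI or the Twitter dataset.
records = [
    preprocess("The hotel was not far from the beach.", "entailment", "neutral"),
    preprocess("Tickets are sold at the gate.", "neutral", "neutral"),
]
error_rate = sum(r.is_error for r in records) / len(records)
```

Records of this kind are the sort of input that the later steps (feature generation, rule generation, explanation) would consume.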

System Architecture

We first pre-compute all the necessary information, such as model output, analysis information, and error rules, on the server, and then present it in the user interface. In response to user input, the server computes subpopulation-level information (errors, document statistics, aggregated SHAP values, etc.) and returns it to the UI.
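A minimal sketch of the subpopulation-level computation, assuming each document is a dictionary with `tokens`, `label`, and `prediction` keys and that a rule is any predicate over a document. The function name `subpopulation_stats`, the document layout, and the sample data are illustrative assumptions, not the server's actual interface.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]


def subpopulation_stats(docs: List[Doc], rule: Callable[[Doc], bool]) -> Dict[str, float]:
    """Return size, error count, and error rate of the documents matching `rule`."""
    matched = [d for d in docs if rule(d)]
    n_errors = sum(1 for d in matched if d["label"] != d["prediction"])
    return {
        "size": len(matched),
        "errors": n_errors,
        # Guard against an empty subpopulation.
        "error_rate": n_errors / len(matched) if matched else 0.0,
    }


# Made-up documents for illustration only.
docs = [
    {"tokens": ["not", "good"], "label": "negative", "prediction": "positive"},
    {"tokens": ["not", "bad"], "label": "positive", "prediction": "positive"},
    {"tokens": ["great"], "label": "positive", "prediction": "positive"},
    {"tokens": ["not", "sure"], "label": "neutral", "prediction": "negative"},
]


def contains_not(doc: Doc) -> bool:
    # Hypothetical rule: the document contains the token "not".
    return "not" in doc["tokens"]


stats = subpopulation_stats(docs, contains_not)
overall = subpopulation_stats(docs, lambda d: True)
```

Comparing `stats["error_rate"]` with `overall["error_rate"]` shows whether the subpopulation selected by the rule is more error-prone than the dataset as a whole.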

Data & Model

In the paper, we present two use cases with the following data and models:

- For the MultiNLI dataset, we first train a DistilBERT model on the government genre and then analyze its performance on the travel genre. The checkpoint can be found here.
- For the sentiment analysis task on the Twitter dataset, we use our pipeline to analyze the errors that the open-source twitter-roberta-base-sentiment model makes on the test data.

To apply iSEA to your own data or model, follow the instructions in the pre-process/ folder for data preprocessing and the instructions in the ui/ folder for running the user interface.

Citation

When referencing this repository, please cite this paper:

@misc{yuan22isea,
      title={iSEA: An Interactive Pipeline for Semantic Error Analysis of NLP Models}, 
      author={Yuan, Jun and Vig, Jesse and Rajani, Nazneen},
      year={2022},
      eprint={2203.04408},
      archivePrefix={arXiv},
      primaryClass={cs.HC},
      url={https://arxiv.org/abs/2203.04408}
}

License

This repository is released under the BSD 3-Clause License.
