
Interpretability of Word Embeddings

Provides a tool to generate interpretable word vectors from existing embedding spaces.

Related Work

Senel et al., Semantic Structure and Interpretability of Word Embeddings.

Table of Contents

  • Requirements
  • Usage
  • Reproducing the results from the papers
  • Citation

Requirements

Python 3.8+

Every dependency can be found in the requirements.txt file.
pip install -r requirements.txt

If you wish to use the sparse method in preprocessing, you need to install SPAMS separately.

Usage

The code base is separated into 3 modules.

  • Preprocessing (optional, you can do your preprocessing steps separately)
    • train - Path to the embedding space (for training)
    • path - Path to a project folder (later you have to set the same folder for the calculations)
    • configuration - Path to a configuration JSON file (e.g. this)
      • It is built up like this: { "<priority>": { "name": "<name>", "params": { ... } }, ... } (see the example configuration after this list)
      • priority determines the execution order
      • name can be:
        • std, norm, center (with the same parameters as in NumPy)
        • whiten
          • parameters - method: 'zca', 'pca', 'cholesky', 'zca_cor', 'pca_cor'
        • sparse (only on systems where SPAMS is supported)
    • jobs - Deprecated
  • Calculation of Distance Matrix
    • train - Path to the embedding space (for training)
    • no_transform - Flag to skip the preprocessing step even if a configuration exists in the project folder
    • train_labels - Path to the file containing the labels (for SemCor, the *.data.xml file)
    • label_processor - Method to load the labels into memory (currently semcor-lexname only)
    • path - Path to project (a place to save files, or the same as provided during preprocessing)
    • distance - Distance to apply ('bhattacharyya', 'hellinger', 'bhattacharyya_normal', 'hellinger_normal', 'bhattacharyya_exponential', 'hellinger_exponential'); a conceptual sketch of the KDE-based variants follows this list
    • kde_kernel - Kernel to use if bhattacharyya or hellinger was provided as the distance ('gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine')
    • kde_bandwidth - Bandwidth to use for kernel density estimation
    • jobs - Number of processes to use during distance calculation (it is always min(provided, number_of_physical_cores))
  • Evaluation
    • test - Path to the embedding space (for evaluation)
    • no_transform - Flag to skip the preprocessing step even if a configuration exists in the project folder
    • test_labels - Path to the file containing the labels (for SemCor, the *.data.xml file)
    • label_processor - Method to load the labels into memory (currently semcor-lexname only)
    • path - Path to project (a place to save files, or the same as provided during preprocessing)
    • save - Flag to save the interpretable space
    • label_frequency - Apply label-frequency-based weighting to the output embedding
    • evaluation_method - How to measure interpretability (currently argmax only; see the sketch after this list)
    • devset_name - Name of the devset (useful if you wish to use one for parameter selection)
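
As referenced above, here is a minimal sketch of a preprocessing configuration. It assumes only the structure documented in this list ("<priority>" keys mapping to a method "name" and its "params"); the concrete methods and parameter values shown are illustrative, not prescriptive.

import json

# Hypothetical configuration: the keys are priorities that determine the
# execution order; each entry selects a preprocessing method and its parameters.
config = {
    "0": {"name": "center", "params": {}},
    "1": {"name": "whiten", "params": {"method": "zca"}},
    "2": {"name": "norm", "params": {}},
}

# Write it to a JSON file that can be passed as the configuration argument
# of the preprocessing module.
with open("preprocessing_config.json", "w") as f:
    json.dump(config, f, indent=2)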
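
For intuition about the KDE-based distances, the sketch below approximates the Hellinger distance between the kernel density estimates of two one-dimensional samples using scikit-learn. It is a conceptual illustration rather than the repository's implementation; the function name, evaluation grid, and sample data are assumptions made for the example.

import numpy as np
from sklearn.neighbors import KernelDensity

def hellinger_kde(x, y, kernel="gaussian", bandwidth=0.2, grid_size=512):
    """Approximate the Hellinger distance between the KDEs of two 1-D samples."""
    grid = np.linspace(min(x.min(), y.min()), max(x.max(), y.max()), grid_size)[:, None]
    step = grid[1, 0] - grid[0, 0]
    p = np.exp(KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(x[:, None]).score_samples(grid))
    q = np.exp(KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(y[:, None]).score_samples(grid))
    # Renormalize the densities on the grid, then integrate (sqrt(p) - sqrt(q))^2.
    p /= p.sum() * step
    q /= q.sum() * step
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * step)

rng = np.random.default_rng(0)
print(hellinger_kde(rng.normal(0.0, 1.0, 500), rng.normal(1.0, 1.0, 500)))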
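
The argmax evaluation follows the idea described in the papers: a word is assigned the semantic category whose coordinate carries the largest coefficient in the interpretable space. A minimal sketch of that accuracy computation, with hypothetical array names:

import numpy as np

def argmax_accuracy(interpretable_vectors, gold_labels):
    # interpretable_vectors: (n_words, n_categories) transformed embeddings
    # gold_labels: (n_words,) index of the true semantic category of each word
    predictions = np.argmax(interpretable_vectors, axis=1)
    return float(np.mean(predictions == gold_labels))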

Reproducing the results from the papers

TSD

The paper submitted to the 23rd International Conference on Text, Speech and Dialogue (TSD) is available here.

GloVe can be downloaded from here and the SemCat dataset is available here.

MSZNY2021 (Conference on Hungarian Computational Linguistics)

Link to the paper.

To reproduce the results from the MSZNY paper, download the following embeddings:

We generated the sparse embedding spaces with the following script. Parameters can be found in mszny_sparse.sh.

ACL-IJCNLP 2021 Student Workshop

The paper is available here.

We included the configuration file for the preprocessing step. We generated the sparse embeddings separately with the following script.

Citation

If you use the code or rely on the papers, please cite the following paper(s):

For contextual embeddings:

@inproceedings{ficsor-berend-2021-changing,
    title = "Changing the Basis of Contextual Representations with Explicit Semantics",
    author = "Ficsor, Tam{\'a}s  and
      Berend, G{\'a}bor",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-srw.25",
    doi = "10.18653/v1/2021.acl-srw.25",
    pages = "235--247",
    abstract = "The application of transformer-based contextual representations has became a de facto solution for solving complex NLP tasks. Despite their successes, such representations are arguably opaque as their latent dimensions are not directly interpretable. To alleviate this limitation of contextual representations, we devise such an algorithm where the output representation expresses human-interpretable information of each dimension. We achieve this by constructing a transformation matrix based on the semantic content of the embedding space and predefined semantic categories using Hellinger distance. We evaluate our inferred representations on supersense prediction task. Our experiments reveal that the interpretable nature of transformed contextual representations makes it possible to accurately predict the supersense category of a word by simply looking for its transformed coordinate with the largest coefficient. We quantify the effects of our proposed transformation when applied over traditional dense contextual embeddings. We additionally investigate and report consistent improvements for the integration of sparse contextual word representations into our proposed algorithm.",
}

For static embeddings:

@InProceedings{10.1007/978-3-030-58323-1_21,
  author="Ficsor, Tam{\'a}s
  and Berend, G{\'a}bor",
  editor="Sojka, Petr
  and Kope{\v{c}}ek, Ivan
  and Pala, Karel
  and Hor{\'a}k, Ale{\v{s}}",
  title="Interpreting Word Embeddings Using a Distribution Agnostic Approach Employing Hellinger Distance",
  booktitle="Text, Speech, and Dialogue",
  year="2020",
  publisher="Springer International Publishing",
  address="Cham",
  pages="197--205",
  abstract="Word embeddings can encode semantic and syntactic features and have achieved many recent successes in solving NLP tasks. Despite their successes, it is not trivial to directly extract lexical information out of them. In this paper, we propose a transformation of the embedding space to a more interpretable one using the Hellinger distance. We additionally suggest a distribution-agnostic approach using Kernel Density Estimation. A method is introduced to measure the interpretability of the word embeddings. Our results suggest that Hellinger based calculation gives a  1.35{\%} improvement on average over the Bhattacharyya distance in terms of interpretability and adapts better to unknown words.",
  isbn="978-3-030-58323-1"
}