Provides a tool to generate interpretable word vectors from existing embedding spaces.
Senel et al., Semantic Structure and Interpretability of Word Embeddings.
Python 3.8+
Every dependency can be found in the requirements.txt file.
pip install -r requirements.txt
If you wish to use the sparse method from preprocessing, you need to install SPAMS separately.
The code base is separated into three modules:
- Preprocessing (optional, you can do your preprocessing steps separately)
- train - Path to the embedding space (for training)
- path - Path to a project folder (you have to provide the same folder later for the distance calculation and evaluation steps)
- configuration - Path to a configuration JSON file (e.g. this)
- It builds up like this (see the sketch after this list):
{ "<priority>": { "name": "<name>", "params": { ... } }, ... }
- priority - modifies the execution order
- name - the method to apply, which can be:
- std, norm, center (with the same parameters as the corresponding numpy functions)
- whiten
- parameters - method: 'zca', 'pca', 'cholesky', 'zca_cor', 'pca_cor'
- sparse (only on systems where SPAMS is supported)
- parameters: those accepted by spams.trainDL and spams.lasso
- jobs - Deprecated
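As a concrete illustration of the configuration structure described above, a minimal config could be generated like this. This is only a sketch: the chosen steps, parameter values, and output file name are placeholder assumptions; only the overall JSON layout and the method/parameter names come from the list above.

```python
import json

# Minimal illustrative preprocessing configuration (placeholder values):
# priority "0" mean-centers the embedding, priority "1" whitens it with ZCA.
config = {
    "0": {"name": "center", "params": {}},
    "1": {"name": "whiten", "params": {"method": "zca"}},
    # A "sparse" step (SPAMS-backed) would take spams.trainDL / spams.lasso parameters here.
}

with open("preprocessing_config.json", "w") as f:
    json.dump(config, f, indent=2)
```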
- Calculation of Distance Matrix
- train - Path to the embedding space (for training)
- no_transform - Flag to skip the preprocessing step even if a config exists in the project folder
- train_labels - Path to the file containing labels (a SemCor *.data.xml file)
- label_processor - Method to load labels into memory (right now semcor-lexname only)
- path - Path to the project folder (a place to save files, or the same as provided during preprocessing)
- distance - Distance to apply: 'bhattacharyya', 'hellinger', 'bhattacharyya_normal', 'hellinger_normal', 'bhattacharyya_exponential', 'hellinger_exponential' (see the sketch after this list)
- kde_kernel - Kernel to use if bhattacharyya or hellinger was provided as the distance ('gaussian', 'tophat', 'epanechnikov', 'exponential', 'linear', 'cosine')
- kde_bandwidth - Bandwidth to use for kernel density estimation
- jobs - Number of processes to use during the distance calculation (always min(provided, number_of_physical_cores))
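To make the distance and KDE options above concrete, here is a minimal sketch of the general idea, not the repository's implementation: the values of one embedding dimension for words belonging to a semantic category are compared against a reference set of words via kernel density estimates and the Hellinger distance. scikit-learn's KernelDensity happens to support exactly the kernel names listed above, but its use here, the bandwidth, and the grid resolution are assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def hellinger_from_kde(category_values, reference_values,
                       kernel="gaussian", bandwidth=0.2, grid_size=200):
    """Estimate the distribution of one embedding dimension for words inside a
    semantic category and for a reference word set with KDE, then compare the
    two with the Hellinger distance (a rough numerical approximation)."""
    lo = min(category_values.min(), reference_values.min())
    hi = max(category_values.max(), reference_values.max())
    grid = np.linspace(lo, hi, grid_size)[:, None]

    def normalized_density(values):
        kde = KernelDensity(kernel=kernel, bandwidth=bandwidth).fit(values[:, None])
        p = np.exp(kde.score_samples(grid))
        return p / p.sum()  # normalise over the grid so it acts like a discrete pmf

    p = normalized_density(category_values)
    q = normalized_density(reference_values)
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
    return np.sqrt(max(0.0, 1.0 - bc))   # Hellinger distance; -log(bc) would be the Bhattacharyya distance

# Toy usage with synthetic data: the further apart the two samples, the larger the distance.
rng = np.random.default_rng(0)
print(hellinger_from_kde(rng.normal(1.0, 0.5, 300), rng.normal(0.0, 1.0, 3000)))
```

The '_normal' and '_exponential' variants presumably fit a parametric distribution instead of a KDE and use the corresponding closed-form distance.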
- Evaluation
- test - Path to the embedding space (for evaluation)
- no_transform - Flag to skip the preprocessing step even if a config exists in the project folder
- test_labels - Path to the file containing labels (a SemCor *.data.xml file)
- label_processor - Method to load labels into memory (right now semcor-lexname only)
- path - Path to the project folder (a place to save files, or the same as provided during preprocessing)
- save - Save the interpretable space
- label_frequency - Apply label-frequency-based weighting to the output embedding
- evaluation_method - How to measure interpretability (argmax only; see the sketch after this list)
- devset_name - Name of the devset (good if you wish to use one for parameter selection)
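As a rough illustration of the argmax evaluation idea (not the repository's code): in the interpretable space every dimension corresponds to a semantic category, so a word can be assigned the category of its largest coordinate and compared against the gold label.

```python
import numpy as np

def argmax_accuracy(interpretable_vectors: np.ndarray, gold_label_ids: np.ndarray) -> float:
    """Assign each word the category (dimension) with the largest coefficient
    and report how often that matches the gold category."""
    predicted = interpretable_vectors.argmax(axis=1)
    return float((predicted == gold_label_ids).mean())

# Toy usage: 4 words embedded in a 3-category interpretable space.
vectors = np.array([[0.9, 0.1, 0.0],
                    [0.2, 0.1, 0.7],
                    [0.3, 0.6, 0.1],
                    [0.5, 0.4, 0.1]])
gold = np.array([0, 2, 1, 0])
print(argmax_accuracy(vectors, gold))  # 1.0 for this toy example
```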
The paper, which was submitted to the 23rd International Conference on Text, Speech and Dialogue (TSD 2020), is available here.
GloVe can be downloaded from here and the SemCat dataset is available here.
Link to the paper.
To reproduce the results from the MSZNY paper, download the following embeddings:
We generated the sparse embedding spaces with the following script. Parameters can be found in mszny_sparse.sh.
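The actual script and its parameters live in mszny_sparse.sh; purely as a hedged illustration of what SPAMS-based sparse coding of a dense embedding matrix involves (placeholder file names, dictionary size, and regularisation strength, not the values used for the paper):

```python
import numpy as np
import spams

# Illustrative sparse coding of a dense embedding matrix with SPAMS
# (placeholder values; NOT the script used for the paper).
dense = np.load("dense_embeddings.npy")              # assumed shape: (num_words, dim)
X = np.asfortranarray(dense.T, dtype=np.float64)     # SPAMS expects samples in columns, Fortran order

D = spams.trainDL(X, K=1000, lambda1=0.05, iter=1000)   # learn an overcomplete dictionary
alpha = spams.lasso(X, D=D, lambda1=0.05)                # sparse codes as a scipy.sparse matrix (K x num_words)
sparse_embeddings = np.asarray(alpha.todense()).T        # back to (num_words, K)
np.save("sparse_embeddings.npy", sparse_embeddings)
```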
The paper is available here.
We included the configuration file for the preprocessing step. We generated the sparse embeddings separately with the following script.
If you are using the code or relying on the paper, please cite the following paper(s):
For contextual embeddings:
@inproceedings{ficsor-berend-2021-changing,
title = "Changing the Basis of Contextual Representations with Explicit Semantics",
author = "Ficsor, Tam{\'a}s and
Berend, G{\'a}bor",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-srw.25",
doi = "10.18653/v1/2021.acl-srw.25",
pages = "235--247",
abstract = "The application of transformer-based contextual representations has became a de facto solution for solving complex NLP tasks. Despite their successes, such representations are arguably opaque as their latent dimensions are not directly interpretable. To alleviate this limitation of contextual representations, we devise such an algorithm where the output representation expresses human-interpretable information of each dimension. We achieve this by constructing a transformation matrix based on the semantic content of the embedding space and predefined semantic categories using Hellinger distance. We evaluate our inferred representations on supersense prediction task. Our experiments reveal that the interpretable nature of transformed contextual representations makes it possible to accurately predict the supersense category of a word by simply looking for its transformed coordinate with the largest coefficient. We quantify the effects of our proposed transformation when applied over traditional dense contextual embeddings. We additionally investigate and report consistent improvements for the integration of sparse contextual word representations into our proposed algorithm.",
}
For static embeddings:
@InProceedings{10.1007/978-3-030-58323-1_21,
author="Ficsor, Tam{\'a}s
and Berend, G{\'a}bor",
editor="Sojka, Petr
and Kope{\v{c}}ek, Ivan
and Pala, Karel
and Hor{\'a}k, Ale{\v{s}}",
title="Interpreting Word Embeddings Using a Distribution Agnostic Approach Employing Hellinger Distance",
booktitle="Text, Speech, and Dialogue",
year="2020",
publisher="Springer International Publishing",
address="Cham",
pages="197--205",
abstract="Word embeddings can encode semantic and syntactic features and have achieved many recent successes in solving NLP tasks. Despite their successes, it is not trivial to directly extract lexical information out of them. In this paper, we propose a transformation of the embedding space to a more interpretable one using the Hellinger distance. We additionally suggest a distribution-agnostic approach using Kernel Density Estimation. A method is introduced to measure the interpretability of the word embeddings. Our results suggest that Hellinger based calculation gives a 1.35{\%} improvement on average over the Bhattacharyya distance in terms of interpretability and adapts better to unknown words.",
isbn="978-3-030-58323-1"
}