An implementation of the Densifier introduced by Rothe et al. 2016, which groups words according to a given separating signal, such as sentiment, concreteness, or frequency, as long as the embeddings encode it.
The training objective is to group words in an ultradense subspace (e.g. dim=1) according to the provided separating signal. Setting the ultradense space to dim==1 yields lexicons derived from the embeddings.
This code is specialized to the case dim==1. However, it should be straightforward to modify it to output ultradense spaces with dim>1, which can subsequently be fed to neural networks.
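For orientation, the training objective for dim==1 can be sketched in NumPy as follows. The function and argument names are illustrative, and the exact alpha/(1-alpha) weighting is an assumption, not necessarily what Densifier.py does:

    import numpy as np

    def ultradense_loss(Q, diff_pairs, same_pairs, alpha):
        # Q          : (d, d) orthogonal transformation matrix
        # diff_pairs : (n, d) rows of e_w - e_v for word pairs with different labels
        # same_pairs : (m, d) rows of e_w - e_v for word pairs with the same label
        # alpha      : hyperparameter balancing the two sub-objectives
        q = Q[0]                                        # P selects the first row of Q, i.e. the ultradense dimension
        separate = -np.abs(diff_pairs.dot(q)).sum()     # push words with different labels apart
        align = np.abs(same_pairs.dot(q)).sum()         # pull words with the same label together
        return (1.0 - alpha) * separate + alpha * align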
The code is written in NumPy. For implementations using autograd frameworks, see here (for TensorFlow users) and here (for Keras users). Note that running the Densifier on a GPU may not be ideal -- there is some overhead in moving data between tensors and ndarrays (and between GPU and CPU) to perform the expensive SVD; see this thread.
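For reference, the SVD mentioned above is the usual way of projecting the updated transformation back onto the set of orthogonal matrices after each gradient step; a minimal sketch, not the exact code in Densifier.py:

    import numpy as np

    def reorthogonalize(Q_updated):
        # For Q' with SVD Q' = U S V^T, the closest orthogonal matrix
        # (in Frobenius norm) is U V^T, so re-project after each update.
        U, _, Vt = np.linalg.svd(Q_updated)
        return U.dot(Vt)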
- Python 2.7
- NumPy 1.14.3
- SciPy 1.1.0
All code is written in Python 2.7 but should be compatible with Python 3.
python Densifier.py
--LR learning rate
--alpha hyperparameter balancing two sub-objectives
--EPC epochs
--OUT_DIM ultradense dimension size
--BATCH_SIZE batch size
--EMB_SPACE input embedding space
--SAVE_EVERY save every N steps
--SAVE_TO output trained transformation matrix
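An invocation might look like this (the file names and hyperparameter values below are placeholders, not recommended settings):

    python Densifier.py --EMB_SPACE twitter_embeddings.txt --OUT_DIM 1 --LR 1.0 \
        --alpha 0.5 --EPC 10 --BATCH_SIZE 100 --SAVE_EVERY 1000 --SAVE_TO Q_trained.npy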
Using the same Twitter embedding space as Rothe et al. 2016, this code performs roughly the same as the TensorFlow implementation on the SemEval2015 10B sentiment analysis task: 0.48 vs. 0.47 Kendall's tau in an earlier evaluation, and about 0.63 in the current one, which is close to the 0.65 reported by the original authors. Note that evaluations need to be carried out on the joint vocabulary.
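For concreteness, the evaluation is a rank correlation between the induced ultradense scores and a gold sentiment lexicon, restricted to the joint vocabulary; a sketch using SciPy (names below are illustrative, not taken from this repository):

    import numpy as np
    from scipy.stats import kendalltau

    def evaluate(emb, vocab, Q, gold_lexicon):
        # emb          : (V, d) embedding matrix
        # vocab        : dict mapping word -> row index into emb
        # Q            : (d, d) trained transformation; row 0 spans the ultradense space
        # gold_lexicon : dict mapping word -> gold sentiment score
        joint = [w for w in gold_lexicon if w in vocab]        # joint vocabulary only
        induced = emb[[vocab[w] for w in joint]].dot(Q[0])     # induced score per word
        gold = np.array([gold_lexicon[w] for w in joint])
        tau, _ = kendalltau(induced, gold)
        return tau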
To replicate the results in other languages and domains, the original test sets are required, but I did not find them.
- Efficiency: Rothe et al. 2016 reported that all experiments were finished within 5 minutes. Unfortunately, I am not able to achieve this speed, as it takes ~0.2 s to compute the SVD of a 400 x 400 dense matrix in NumPy; autograd frameworks can take even more time.
- Orthogonal constraint: Buechel et al. 2018 report that enforcing the orthogonal constraint makes no difference in performance, and similar observations hold in this implementation. Releasing the orthogonal constraint means we can now change the norm of $Q$, so we probably need to regularize the $l_2$ norm of $PQ$ so that the loss does not go to $-\infty$ (see the sketch after this list). Again, no improvements are observed in this case either. The orthogonal constraint keeps the optimization steps moving along the surface of the cube, but it is probably not significantly helpful when the evaluation metric is ranking-based.
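A sketch of this unconstrained variant, with an l2 penalty on $PQ$ to keep the loss bounded; the penalty weight lam is a hypothetical knob, not a flag of Densifier.py:

    import numpy as np

    def unconstrained_loss(Q, diff_pairs, same_pairs, alpha, lam):
        # Same sub-objectives as in the sketch above, but Q is no longer kept
        # orthogonal, so the norm of q = (PQ)^T is free; penalize ||q||^2 so the
        # separation term cannot be improved simply by scaling q up.
        q = Q[0]
        separate = -np.abs(diff_pairs.dot(q)).sum()
        align = np.abs(same_pairs.dot(q)).sum()
        return (1.0 - alpha) * separate + alpha * align + lam * np.dot(q, q)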
Papers referred to in this implementation:
@InProceedings{N16-1091,
author = "Rothe, Sascha
and Ebert, Sebastian
and Sch{\"u}tze, Hinrich",
title = "Ultradense Word Embeddings by Orthogonal Transformation",
booktitle = "Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies ",
year = "2016",
publisher = "Association for Computational Linguistics",
pages = "767--777",
location = "San Diego, California",
doi = "10.18653/v1/N16-1091",
url = "http://aclweb.org/anthology/N16-1091"
}
@InProceedings{N18-1173,
author = "Buechel, Sven
and Hahn, Udo",
title = "Word Emotion Induction for Multiple Languages as a Deep Multi-Task Learning Problem ",
booktitle = "Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) ",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "1907--1918",
location = "New Orleans, Louisiana",
doi = "10.18653/v1/N18-1173",
url = "http://aclweb.org/anthology/N18-1173"
}