This repository contains code and notes for my Prototype Fund project, carried out mainly between 01.03.2019 and 01.09.2019. The topic: explaining machine learning and natural language processing using news comments as an example, and visualizing language change.
The work is divided into several sub-projects:
- Website for explaining ML and NLP, as well as investigating language change in online comments: kommentare.vis.one, code
- Backend to serve local views on word embeddings (used for kommentare.vis.one): ptf-kommentare-backend
- Python package to construct (stable) word embeddings for small data: hyperhyper
- Python package to clean text: clean-text
- Python package for common text preprocessing for German: german
- Python package to lemmatize German text: german-lemmatizer
- Benchmark for SVD implementations: sparse-svd-benchmark
Here is a short guide on how to create your own videos. An example video is here.
- Divide your data into time slices and create a word embedding for each slice
- Save each embedding in gensim's `KeyedVectors` format (using hyperhyper to create stable word embeddings is advised)
- Install ffmpeg
- `pip install git+https://github.com/jfilter/adjustText && pip install gensim scikit-learn matplotlib colormath`
- Adapt the code in this notebook (so you also need either Jupyter Lab or Jupyter Notebook installed)
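The first step above, dividing your data into time slices, might look like the following minimal sketch. The input format (an iterable of ISO date string / text pairs) and the function name are assumptions for illustration; adapt them to your corpus. Each slice would then be used to train one embedding (e.g. with hyperhyper) and saved via gensim's `KeyedVectors`.

```python
from collections import defaultdict
from datetime import datetime


def slice_by_year(comments):
    """Group (ISO date string, text) pairs into yearly buckets.

    Hypothetical helper: the input format depends on your corpus.
    Each resulting slice would get its own word embedding,
    trained e.g. with hyperhyper (not shown here).
    """
    slices = defaultdict(list)
    for date_str, text in comments:
        year = datetime.fromisoformat(date_str).year
        slices[year].append(text)
    return dict(slices)


comments = [
    ("2015-03-01", "first comment"),
    ("2015-07-15", "another one"),
    ("2016-01-02", "a later comment"),
]
print(slice_by_year(comments))
# two slices: 2015 with two comments, 2016 with one
```

Depending on your data size, finer slices (months, quarters) may work as well, as long as each slice still contains enough text for a stable embedding.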
Right now, it's not that easy to create these videos. However, it's doable and I'm willing to help you. The important parts of the code are thoroughly commented. Please contact me for assistance.
Two papers for a more scientific background:
- Hamilton et al.: Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change
- Hellrich et al.: The Influence of Down-Sampling Strategies on SVD Word Embedding Stability
Some more papers here.
This work was funded by the German Federal Ministry of Education and Research.