Introduction to cross-lingual word-embeddings at Wikimania 2019
Word-embeddings allow machines to measure the semantic distance between a pair of words or sentences. This is done by converting each string (a word or a sentence) into a vector, which makes it possible to perform mathematical operations on those strings. For example, it is possible to measure the distance between //cat// and //dog//, which might be smaller (since they are both animals) than the distance between //cat// and //car//.
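As a minimal sketch of what "measuring the distance" can mean, the snippet below computes the cosine distance between small vectors. The three-dimensional vectors are hypothetical values invented for illustration; real embeddings typically have hundreds of dimensions, and cosine distance is only one common choice of metric.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy vectors, NOT taken from any trained model.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_distance(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_distance(toy_vectors["cat"], toy_vectors["car"])
print(f"cat-dog: {cat_dog:.3f}, cat-car: {cat_car:.3f}")
```

With these toy values the //cat//-//dog// distance comes out smaller than the //cat//-//car// distance, which is the behaviour one hopes to see with trained embeddings.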
Recently, researchers have been working on making those embeddings cross-lingual, so that distances can also be measured between strings in different languages. Translations such as //cat// [en] and //gato// [es] should therefore be very similar (ideally identical) in the vector space.
In the research team we have been using those cross-lingual embeddings to create section alignments across different projects, or to align template parameters.
The session will be organized as follows:
**First Part: Understanding and playing with cross-lingual word-embeddings**
- What is a word-embedding
- How to use FastText in Python.
- How to align models in different languages.
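To give a feel for the alignment step listed above, here is a small sketch under strong assumptions. It uses hypothetical two-dimensional vectors and learns a single rotation from a tiny seed dictionary, which is the 2-D special case of the orthogonal Procrustes approach often used for this task. It is not necessarily the method shown in the session, and real 300-dimensional models would need a general solver (for example, one based on SVD).

```python
import math

def learn_rotation(pairs):
    """Angle of the 2-D rotation that best maps each source vector onto
    its target vector in the least-squares sense."""
    dot = sum(x[0] * y[0] + x[1] * y[1] for x, y in pairs)
    cross = sum(x[0] * y[1] - x[1] * y[0] for x, y in pairs)
    return math.atan2(cross, dot)

def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * vec[0] - s * vec[1], s * vec[0] + c * vec[1]]

# Hypothetical toy embeddings: the Spanish space is the English one
# rotated by 90 degrees, so a perfect alignment exists by construction.
en = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]}
es = {"gato": [0.0, 1.0], "perro": [-0.6, 0.8], "coche": [-1.0, 0.0]}

# Seed dictionary of known translations used to learn the mapping.
seed_dictionary = [("gato", "cat"), ("perro", "dog")]
angle = learn_rotation([(es[s], en[t]) for s, t in seed_dictionary])

# Map a word that was NOT in the dictionary and find its nearest English word.
aligned_coche = rotate(es["coche"], angle)
nearest = min(en, key=lambda word: math.dist(en[word], aligned_coche))
print(nearest)
```

The point of the design is that the mapping is learned from a few known translation pairs and then applied to every other word, so unseen words such as //coche// land close to their translations.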
**Second Part: Use cases on section alignment and recommendation**
- How to query the section alignment API.
- How to query the section recommendation API.
If you are only interested in using the APIs, you are welcome to come just to the second part of the session.
Materials and recommendations:
If you want to do hands-on work and try your own alignments, you will need to install some packages and download some data in advance:
- You will need a machine with at least 16GB of RAM.
- Install Python 3.
- Install FastText for Python
- Download the models (bin+vec) in English and [Spanish](https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.zip). You could also download any pair of languages contained in this list: ["es", "en", "fr", "ar", "ru", "uk", "pt", "vi", "zh", "he", "it", "ta", "id", "fa", "ca"]
- Clone this repository.
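Once the data is downloaded, it can help to peek at the text (.vec) files before loading a full model. The sketch below assumes the usual fastText text layout (a header line with the word count and dimension, then one word followed by its values per line). The helper name `read_vec` and the `limit` parameter are hypothetical, introduced here for illustration and to keep memory use low.

```python
import io

def read_vec(handle, limit=None):
    """Parse a .vec-style text stream into a {word: [floats]} dictionary.

    Assumes a 'count dim' header line, then 'word v1 v2 ... vdim' per line.
    `limit` caps how many vectors are read.
    """
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        if len(values) == dim:  # skip malformed lines
            vectors[word] = values
    return vectors

# Tiny in-memory stand-in for a real file, with invented values.
sample = io.StringIO("2 3\ncat 0.1 0.2 0.3\ngato 0.1 0.2 0.4\n")
vectors = read_vec(sample)
print(sorted(vectors))
```

With a real download, the same function could be called on an opened file, for instance `read_vec(open("wiki.es.vec", encoding="utf-8"), limit=50000)`, where the file name is an assumption about what the archive contains.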
If you want to know more about word-embedding alignment, check this repository.