Supervised learning tutorial. Technique: Partial Least Squares Regression (PLSR)

Mapping semantic spaces for translation

Requirements

You'll need the scikit-learn package to run the code in this repo. If you need to install it on your system, do:

sudo pip install scikit-learn

Pen and paper exercise

In this tutorial, you will map from a small English semantic space to a Catalan semantic space. You may need a Catalan dictionary for the following exercises. Here's a good one: https://www.diccionaris.cat/.

The following are two toy semantic spaces, one for English, one for Catalan. Rows represent vectors, columns represent dimensions.

English space:

             nature  argue
horse         0.5     0.0
dog           0.7     0.0
house         0.2     0.1
parliament    0.0     0.7
politics      0.1     0.9
right         0.1     0.6
wrong         0.1     0.7

Catalan space:

             lluitar  arbre
cavall         0.1     0.3
gos            0.1     0.2
casa           0.0     0.2
parlament      0.5     0.0
política       0.6     0.0
correcte       0.4     0.0
equivocat      0.5     0.0

Now, you get a new vector in English, say:

         nature  argue
green     0.6     0.1

Which of the two Catalan words below do you think is the translation of green, according to your semantic spaces? Why? NB: you don't have to know any calculus to solve this by hand.

          lluitar  arbre
verd        0.1     0.5
vermell     0.2     0.2
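
Once you have an answer, you can check it with scikit-learn. Below is a minimal sketch that fits a PLSR model on the toy vectors above and compares the predicted vector for green to the two candidates (this is just an illustration; the real exercise uses plsr_regression.py):

import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Toy English space (columns: nature, argue), rows in the order
# horse, dog, house, parliament, politics, right, wrong.
en = np.array([[0.5, 0.0], [0.7, 0.0], [0.2, 0.1], [0.0, 0.7],
               [0.1, 0.9], [0.1, 0.6], [0.1, 0.7]])

# Toy Catalan space (columns: lluitar, arbre), rows in the order
# cavall, gos, casa, parlament, política, correcte, equivocat.
ca = np.array([[0.1, 0.3], [0.1, 0.2], [0.0, 0.2], [0.5, 0.0],
               [0.6, 0.0], [0.4, 0.0], [0.5, 0.0]])

# Fit the mapping from English to Catalan and predict a vector for 'green'.
pls = PLSRegression(n_components=2)
pls.fit(en, ca)
pred = pls.predict(np.array([[0.6, 0.1]]))[0]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare the predicted vector to the two candidate translations.
for word, vec in [('verd', np.array([0.1, 0.5])),
                  ('vermell', np.array([0.2, 0.2]))]:
    print(word, cosine(pred, vec))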

Inspecting the spaces in 2D

You have 2D 'pictures' of the English and Catalan spaces in your directory (english_space.png and catalan_space.png), generated by computing PCA on the full spaces and retaining the first two components of the data. Have a brief look at the spaces and consider how words have clustered in the space.

(Remember that the 'clustering' here may not be particularly good, as you are looking at 300 or 400 dimensions flattened into 2D. You may be missing a lot of perspective!)
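
For reference, a picture like this can be produced with scikit-learn's PCA. The sketch below assumes that each line of a .dm file is a word followed by its whitespace-separated vector values; check the files in data/ to confirm the format:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Load a space: one word per line, then its vector values (assumed format).
words, vectors = [], []
with open('data/english.subset.dm') as f:
    for line in f:
        fields = line.split()
        if not fields:
            continue
        words.append(fields[0])
        vectors.append([float(v) for v in fields[1:]])

# Keep the first two principal components and plot each word at its 2D position.
coords = PCA(n_components=2).fit_transform(np.array(vectors))
plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y), fontsize=7)
plt.savefig('my_english_space.png')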

Preliminaries to running the code

Familiarise yourself with the content of the data/ directory. The pairs.txt file contains gold standard translations from English to Catalan. english.subset.dm and catalan.subset.dm are subsets of an English and a Catalan semantic space corresponding to the words occurring in pairs.txt.

There are 166 pairs and we will be splitting the data into 120 pairs for training and 46 for testing. You can look at the test pairs by doing:

tail -n 46 data/pairs.txt
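
If you prefer to inspect the split in Python, here is a minimal sketch; it assumes pairs.txt holds one whitespace-separated English-Catalan pair per line:

# Reproduce the 120/46 split: first 120 pairs for training, last 46 for testing.
with open('data/pairs.txt') as f:
    pairs = [line.split() for line in f if line.strip()]

train_pairs, test_pairs = pairs[:120], pairs[120:]
print(len(train_pairs), 'training pairs,', len(test_pairs), 'test pairs')
print(test_pairs[:5])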

Just looking at the data and the associated visualisation (before running anything), can you tell where the model might do well and where it might fail?

Running the regression code

Running the code will split the data into training and test sets, train a PLSR model on the training data, and evaluate it on the test set. The command to run the code has the following format:

python3 -W ignore plsr_regression.py --ncomps=<n> --nns=<n> [-v | --verbose]

where --ncomps is the number of principal components to use in your analysis, and --nns is the number of nearest neighbours used to compute precision at k. For instance, you might try:

python3 -W ignore plsr_regression.py --ncomps=10 --nns=5

If you use the verbose option (-v or --verbose at the end of your command), the output first gives the predictions for each pair. For instance, it could be:

bird ocell ['arbre', 'peix', 'ocell', 'gos', 'animal'] 1

Here, bird should have been translated as ocell. The 5 nearest neighbours of the predicted vector are arbre, peix, ocell, gos and animal, so the gold translation can be found among those close neighbours; the trailing 1 records this as a hit.

The system then gives the precision @ k, where k is the number of nearest neighbours considered for evaluation.
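
For concreteness, here is an illustrative sketch of a precision-at-k computation; the names below are invented for this example and do not correspond to the actual code in plsr_regression.py:

import numpy as np

def precision_at_k(predicted, gold_words, ca_words, ca_matrix, k=5):
    # predicted: one predicted Catalan vector per test item (n_test x dims);
    # gold_words: the gold translation for each test item;
    # ca_words, ca_matrix: the Catalan vocabulary and its vectors.
    # Normalise rows so that dot products are cosine similarities.
    ca_norm = ca_matrix / np.linalg.norm(ca_matrix, axis=1, keepdims=True)
    hits = 0
    for vec, gold in zip(predicted, gold_words):
        sims = ca_norm @ (vec / np.linalg.norm(vec))
        # A hit if the gold translation is among the k nearest neighbours.
        neighbours = [ca_words[i] for i in np.argsort(-sims)[:k]]
        hits += gold in neighbours
    return hits / len(gold_words)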

Run your system for different values of --ncomps and --nns. Write your results in a table. Check that the results are what you expected.

Now, looking at the verbose output, what can you say about the system's errors? Do they confirm the hypotheses you made when looking at the semantic spaces?

Write up your experiments

Write a little report of what you've done (again, this is just practice for the exam!). Your report should contain the following sections:

  • Description of the task and data.
  • Your hypothesis: here, it could simply be how you expected the number of principal components to affect results.
  • The experiments you ran: which range of hyperparameters did you try (ncomps and nns)?
  • Results: insert the little table showing your results and discuss it with respect to your hypothesis.

For those wanting to do a tiny bit of coding...

Automate your hyperparameter search. Write a loop in plsr_regression.py that automatically tries different values of ncomps, without you having to feed each hyperparameter value to the system by hand.
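
If you want a quick starting point before editing the script itself, you can also drive the documented command line from a small wrapper; the value ranges below are arbitrary:

import subprocess

# Sweep over candidate hyperparameter values using the documented CLI.
for ncomps in [2, 5, 10, 20, 30, 50]:
    for nns in [1, 5, 10]:
        print('--- ncomps={}, nns={} ---'.format(ncomps, nns))
        subprocess.run(['python3', '-W', 'ignore', 'plsr_regression.py',
                        '--ncomps={}'.format(ncomps), '--nns={}'.format(nns)])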

Open-ended project

There is a small Italian semantic space in the data/ folder, with 1000 frequent Italian words. You can try to build the regression for a new train/test set for Italian! You will need to manually select the words you want in your training/test sets, and extract their English equivalents from english.subset.dm.