
A PyTorch-based model for lexical normalisation, as detailed in the paper Word-level Lexical Normalisation using Context-Dependent Embeddings.

Setting up

Place your data under data/datasets/<dataset_name>, e.g. data/datasets/us_accidents_self, using the file names train.txt and test.txt. The training and test datasets must each be a list of (word, correct_form) pairs, one pair per line, e.g.:

word <SELF>
eror error

When a word does not require normalisation, the second column should be <SELF>. When it does, the second column should be the correct form of that word. Documents are separated from one another by an additional newline character, i.e. a blank line.
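As an illustration of the format described above, here is a minimal sketch of a reader for it. It is not the loader used in this repository: the function name parse_dataset and the SELF_TOKEN constant are hypothetical, and the sketch assumes that the two columns are separated by whitespace and that a blank line marks the end of a document.

```python
from typing import List, Tuple

SELF_TOKEN = "<SELF>"  # marker meaning "this word needs no normalisation"


def parse_dataset(text: str) -> List[List[Tuple[str, str]]]:
    """Parse the two-column format into a list of documents.

    Each document is a list of (word, correct_form) pairs, with <SELF>
    expanded to the word itself. Hypothetical helper, not part of the repo.
    """
    documents: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current document (assumed separator).
            if current:
                documents.append(current)
                current = []
            continue
        # Assumes whitespace-separated columns; a line with a single
        # token raises ValueError here.
        word, correct_form = line.split(maxsplit=1)
        if correct_form == SELF_TOKEN:
            correct_form = word
        current.append((word, correct_form))
    if current:
        documents.append(current)
    return documents


sample = "word <SELF>\neror error\n\nteh the\ncat <SELF>\n"
docs = parse_dataset(sample)
print(docs)
```

Running a check like this over train.txt and test.txt before building the input data can catch malformed lines early, since any line without a second column fails loudly.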

A sample train and test set have been provided (under data/datasets/us_accidents_self). The full version of the provided sample can be found at

Modifying the config file

The config file can be modified according to your desired parameters. Notable options are:

CF_DATASET = "US Acc"           # The name of the dataset.
CF_PRETRAINED = False           # Whether to use pretrained word embeddings, which must be saved under `data/<dataset_embeddings>`.
CF_EMBEDDING_MODEL = "Uniform"  # The embedding model to use. We found "Uniform" (i.e. randomly generated embeddings following a uniform distribution) generally works best, as detailed in the paper.

At this stage, changing CF_DATASET also requires adjusting the dictionaries defined later in the code. The easiest way to run the code is therefore to replace the files under data/datasets/us_accidents_self with your own datasets and leave CF_DATASET unchanged (i.e. keep it as "US Acc").

Hyperparameters for the neural network are listed in the __init__ function. The defaults are the values that performed best on the US Accidents dataset.

Running the code

First, run the script to build the input data for the neural model:

$ python

Then, run the training script:

$ python

The script evaluates the model's performance on the test set after every epoch. The predictions from each epoch are saved under models/<model folder>, where <model folder> is an automatically generated name based on the parameters specified in