Pytorch-lexnorm
A Pytorch-based model for lexical normalisation, as detailed in the paper Word-level Lexical Normalisation using Context-Dependent Embeddings.
Setting up
Place your data under data/datasets/<dataset_name>
, e.g. data/datasets/us_accidents_self
, using the names train.txt
and test.txt
. The format for the training and test datasets must be a list of (word, correct_form), one per line, e.g:
word <SELF>
eror error
Whenever a word does not require normalisation, the second column should be <SELF>
. If it does require normalisation, the second column should be the correct form of that word. Each document is separated by an additional newline character.
A sample train and test set has been provided (under data/datasets/us_accidents_self
. The full version of the provided sample can be found at https://github.com/Michael-Stewart-Webdev/us-accidents-dataset.
Modifying the config file
The config file, config.py
, can be modified according to your desired parameters. Notable options are:
CF_DATASET = "US Acc" # The name of the dataset.
CF_PRETRAINED = False # Whether to use pretrained word embeddings, which must be saved under `data/<dataset_embeddings>`.
CF_EMBEDDING_MODEL = "Uniform" # The embedding model to use. We found "Uniform" (i.e. randomly generated embeddings following a uniform distribution) generally works best, as detailed in the paper.
At this stage modifying the CF_DATASET
requires adjusting the dictionaries defined later in the config.py
code, so the easiest way to run the code is to simply replace the datasets under data/datasets/us_accidents_self
with your own datasets and not modify the CF_DATASET
(i.e. keep it as "US Acc").
Hyperparameters for the neural network are listed in the __init__
function. The default hyperparameters were found to be the best performing hyperparameters for the US Accidents dataset.
Running the code
First, run the build_data.py
script to build the input data for the neural model:
$ python build_data.py
Then, run the training script:
$ python train.py
The script will evaluate the model's performance on the test set every epoch. The predictions at each epoch will be saved under models/<model folder>
, where <model folder>
is an automatically generated name based on the parameters specified in config.py
.