This repository contains source code for the paper Neural String Edit Distance.
Classification using neural edit distance: train_transliteration_classification.py
Baseline using Transformers: train_baseline_cognates_classification.py
Sequence generation using neural string edit distance: train_transliteration_generation.py
Baseline using sequence-to-sequence (RNN and Transformer): train_transliteration_s2s.py
The paper evaluates the method on cognate detection, Arabic-to-English transliteration, and grapheme-to-phoneme conversion. The data can be downloaded and preprocessed for the experiments using the following scripts:
- Cognate detection: download_cognates.sh
- Transliteration: download_transliteration_data.sh
- Grapheme to phoneme: download_cmu_dict_data.sh
Interpretability of the models is evaluated by measuring how well symbol
alignment is preserved. The ground-truth alignment data are prepared with a
state-of-the-art statistical aligner, using the script
alignment/prepare_alignment.sh, which also builds the alignment tools.
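As a rough illustration of what "how well symbol alignment is preserved" can mean, the sketch below compares predicted alignment links against ground-truth links using precision, recall, and F1. This is an assumption made for illustration: the exact metric used in the paper is not stated in this README, and the function name `alignment_f1` and the link representation as `(source_index, target_index)` pairs are hypothetical, not part of the repository.

```python
def alignment_f1(predicted, gold):
    """Precision, recall and F1 over sets of (source_index, target_index) links.

    Hypothetical helper for illustration; not taken from the repository.
    """
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    correct = len(predicted & gold)  # links present in both alignments
    precision = correct / len(predicted)
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy example: four gold links, one of the four predicted links is wrong.
gold_links = [(0, 0), (1, 1), (2, 2), (3, 3)]
predicted_links = [(0, 0), (1, 1), (2, 3), (3, 3)]
precision, recall, f1 = alignment_f1(predicted_links, gold_links)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Treating alignments as sets of index pairs keeps the comparison independent of how a particular aligner serializes its output.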
To extract the alignment from the neural string edit distance models, use
the run_viterbi.py script with the option --output-format=alignment.
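To give an intuition for what a Viterbi alignment is, here is a minimal sketch for the classic (non-neural) edit distance with unit costs. It is not the code in run_viterbi.py: the neural model scores edit operations with learned probabilities, whereas this sketch assumes a fixed cost of 1 for each insertion, deletion, and substitution. The function name `viterbi_alignment` and the output representation are hypothetical.

```python
def viterbi_alignment(source, target):
    """Return (distance, alignment) for classic unit-cost edit distance.

    The alignment is a list of (source_symbol, target_symbol) pairs, where
    None marks an insertion or a deletion. Illustrative sketch only.
    """
    n, m = len(source), len(target)
    # cost[i][j]: minimum edit cost between source[:i] and target[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete all of source[:i]
    for j in range(1, m + 1):
        cost[0][j] = j  # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])
            delete = cost[i - 1][j] + 1
            insert = cost[i][j - 1] + 1
            cost[i][j] = min(substitute, delete, insert)

    # Backtrace from the bottom-right cell to recover one best path.
    alignment = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (source[i - 1] != target[j - 1])):
            alignment.append((source[i - 1], target[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            alignment.append((source[i - 1], None))  # deletion
            i -= 1
        else:
            alignment.append((None, target[j - 1]))  # insertion
            j -= 1
    alignment.reverse()
    return cost[n][m], alignment


distance, alignment = viterbi_alignment("kitten", "sitting")
print(distance)
for src, tgt in alignment:
    print(src or "-", tgt or "-")
```

When several paths share the minimum cost, the backtrace above prefers substitutions over deletions over insertions; a different tie-breaking order would give an equally optimal but different alignment.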