[🔥 Best model]
[📀 Models]
[🤗 Demo]
Note: The documentation for this project is currently being written. I am working hard to make this project easily hackable so people can add new heuristics and train more models.
This repository contains a collection of neural spell correctors for the Central Kurdish language. These models have been trained on an extensive corpus of synthetically generated data and are able to correct a wide range of spelling errors, including typos and grammatical errors.
Using various heuristics, we generate a rich dataset by mapping sequences containing misspellings to the correct sequence. We do this by randomly inserting valid characters, deleting characters or patterns, substituting characters with random ones or their keyboard neighbors, swapping two adjacent characters, shuffling sentences, and replacing specific predefined patterns with targeted alternatives.
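As an illustration, here is a minimal sketch of what a couple of such distortion heuristics might look like. The function names, the keyboard-neighbor table, and the word-level distortion ratio below are illustrative only; the actual implementation lives in `prepare_data`.

```python
import random

# Illustrative subset of keyboard neighbors; not the real layout used in prepare_data.
KEYBOARD_NEIGHBORS = {"ک": ["گ", "ل"], "د": ["س", "ج"]}


def swap_adjacent(word: str) -> str:
    """Swap two adjacent characters at a random position."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def substitute_neighbor(word: str) -> str:
    """Replace a random character with one of its keyboard neighbors, if known."""
    if not word:
        return word
    i = random.randrange(len(word))
    neighbors = KEYBOARD_NEIGHBORS.get(word[i])
    if not neighbors:
        return word
    return word[:i] + random.choice(neighbors) + word[i + 1:]


def distort(sentence: str, ratio: float = 0.1) -> str:
    """Apply a randomly chosen heuristic to roughly `ratio` of the words."""
    words = sentence.split()
    for i, word in enumerate(words):
        if random.random() < ratio:
            words[i] = random.choice([swap_adjacent, substitute_neighbor])(word)
    return " ".join(words)
```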
The error injection framework in `prepare_data` offers a method to inject errors according to a distortion ratio. I conducted the following experiments to determine the ratio that allows the model to achieve the lowest Word Error Rate (WER) and Character Error Rate (CER) on the synthetic test set.
| Model Name | Dataset Distortion | CER | WER |
|---|---|---|---|
| bart-base | 5% | 5.39% | 34.73% |
| bart-base | 10% | 2.15% | 11.19% |
| bart-base | Mixed (5% + 10%) | 1.54% | 8.31% |
| bart-base | 15% | 2.17% | 12.3% |
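For reference, WER and CER on the synthetic test set can be computed with the Hugging Face `evaluate` library. A minimal sketch with placeholder predictions and references:

```python
import evaluate

# Placeholder lists; in practice these are the model's outputs on test.csv
# and the corresponding gold sentences.
predictions = ["..."]
references = ["..."]

wer = evaluate.load("wer").compute(predictions=predictions, references=references)
cer = evaluate.load("cer").compute(predictions=predictions, references=references)
print(f"WER: {wer:.4f}  CER: {cer:.4f}")
```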
The benchmark for this project is designed exclusively for single-word spelling corrections. The script `create_asosoft_benchmark.py` processes each word from the Amani dataset by searching for sentences containing the correct spelling, checking that the sentence is not included in `train.csv`, and replacing the word with the provided misspelling. This is a hacky way to obtain a gold-standard benchmark. The current best-performing model achieves the following results:
| Metric | Value |
|---|---|
| CER | 9.6545 |
| WER | 21.7558 |
| Bleu | 68.1724 |
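For illustration, the benchmark construction described above could look roughly like the following sketch. The data formats and column names are hypothetical; see `create_asosoft_benchmark.py` for the actual implementation.

```python
import csv


def build_benchmark(amani_pairs, corpus_sentences, train_sentences, out_path):
    """Hypothetical sketch: pair each (correct, misspelled) word with a corpus
    sentence that contains the correct spelling and is not part of train.csv."""
    train_set = set(train_sentences)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "target"])
        for correct, misspelled in amani_pairs:
            for sentence in corpus_sentences:
                if correct in sentence.split() and sentence not in train_set:
                    # Swap in the misspelling to form the noisy input; the
                    # original sentence stays as the gold target.
                    writer.writerow([sentence.replace(correct, misspelled), sentence])
                    break  # one sentence per word is enough
```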
The final generated dataset is also concatenated with the training dataset from the Script Normalization for Unconventional Writing project. Therefore, the model not only corrects spelling but also normalizes unconventional writing. "Unconventional writing" means using the writing system of one language to write in another language.
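Concatenating the two training sets is straightforward; a sketch with illustrative file names (the actual paths produced by the pipeline may differ):

```python
import pandas as pd

# Illustrative paths; both files are assumed to share the same input/target columns.
spell = pd.read_csv("data/train.csv")
norm = pd.read_csv("data/script_normalization_train.csv")

combined = pd.concat([spell, norm], ignore_index=True).sample(frac=1, random_state=42)
combined.to_csv("data/train_combined.csv", index=False)
```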
The Script Normalization authors also employ a similar approach to generate their data. However, it is not wise to evaluate a model on the synthetic test set, since the model can memorize the underlying patterns from the training set. Hence they provide a gold-standard benchmark for Central Kurdish and use Bleu and chrF to measure the performance of their model.
| Model | Bleu | chrF |
|---|---|---|
| Script Normalization | 12.7 | 69.6 |
| Bart-kurd-spell-base | 13.8 | 73.9 |
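Both scores can be computed with `sacrebleu`; a minimal sketch with placeholder outputs and references:

```python
import sacrebleu

# Placeholder lists; in practice these are the model outputs and the gold
# references from the Central Kurdish gold-standard benchmark.
hypotheses = ["..."]
references = [["..."]]  # sacrebleu expects one list per reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"Bleu: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```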
Keep in mind that both models have seen the same data for script normalization, but our model performs slightly better thanks to the additional spell-correction data.
Since the problem is framed as mapping a sequence containing misspellings to a correct sequence, we can train different encoder-decoder models such as T5.
- Run `train_tokenizer.py` to build a tokenizer for your chosen model with the `--tokenizer_name` argument.
- Create `data.txt` and put it in the `data` dir. Check `inspect_data.ipynb`.
- Check the arguments of `prepare_data/process_data.py` and run it to get `train.csv` and `test.csv`.
- Change the arguments in `train.sh` if you want to train a model other than Bart. In case you want to train T5, you need to add `--source_prefix "correct: "`.
- Evaluate the model on both `data/asosoft_benchmark.csv` and `data/Sorani-Arabic.csv` using `eval.sh`.
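Once trained, the model can be loaded like any Hugging Face seq2seq checkpoint for quick correction. A minimal usage sketch; the checkpoint path below is illustrative:

```python
from transformers import pipeline

# Illustrative path; point this at your trained checkpoint or the hub model linked above.
corrector = pipeline("text2text-generation", model="./outputs/bart-kurd-spell-base")

text = "..."  # a Central Kurdish sentence containing misspellings
print(corrector(text, max_length=128)[0]["generated_text"])
```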
Different heuristics could be added to the pipeline, for example, replacing ر at the start of every word with ڕ or replacing ك with ک. Both of these errors occur quite often in Central Kurdish texts online, but they can be handled by rules instead of being learned from the data, so it is more practical to address them with rule-based solutions such as KLPT. In case you can think of more heuristics, they can easily be added to the pipeline in the `get_text_distorter` function.
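For instance, here is a hypothetical heuristic that merges two adjacent words by dropping the space between them, a common typing error. The exact way `get_text_distorter` registers heuristics may differ; this is only a sketch of the shape such a function could take:

```python
import random


def merge_adjacent_words(sentence: str, probability: float = 0.1) -> str:
    """Occasionally drop the space between two adjacent words to simulate a
    common typing error; the untouched sentence remains the training target."""
    words = sentence.split()
    if len(words) < 2 or random.random() > probability:
        return sentence
    i = random.randrange(len(words) - 1)
    return " ".join(words[:i] + [words[i] + words[i + 1]] + words[i + 2:])
```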
PRs with additional models, evaluation, or data generation heuristics are welcome! 👍