Homework assignment for the Grazie lecture at HSE. The task is to create a simple GEC system and port it to JS or the JVM.
For each word, GEC extracts possible candidates with estimated probabilities. To prepare candidates, the following approach is used:

- Retrieve possible candidates with Hunspell.
- For each pair `<word; candidate>`, calculate features:
  - Normalized Damerau-Levenshtein distance.
  - Normalized Jaro-Winkler distance.
  - Normalized length of the longest common subsequence.
- Train a classifier to predict the probability of a correct fix from these features.
If `l` is the edit distance between `s1` and `s2`, then the normalized distance is `1 - l / max(|s1|, |s2|)`. For LCS, the normalized distance is `LCS / max(|s1|, |s2|)`.
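The normalization above can be sketched as follows, assuming the plain dynamic-programming definitions of edit distance and LCS. The helper names here are illustrative, not the project's actual code; Jaro-Winkler is omitted since it is already a similarity in [0, 1].

```python
def damerau_levenshtein(s1: str, s2: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    n, m = len(s1), len(s2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[n][m]

def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence."""
    n, m = len(s1), len(s2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if s1[i - 1] == s2[j - 1] \
                else max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

def normalized_features(word: str, candidate: str) -> tuple[float, float]:
    """Normalized edit-distance and LCS features, both in [0, 1]."""
    longest = max(len(word), len(candidate)) or 1
    edit_sim = 1 - damerau_levenshtein(word, candidate) / longest
    lcs_sim = lcs_length(word, candidate) / longest
    return edit_sim, lcs_sim
```

For example, `normalized_features("helo", "hello")` gives `(0.8, 0.8)`: one insertion over a maximum length of five, and an LCS of four characters.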
The dataset for training the model is taken from Norvig. For each correct spelling:

- a positive example is retrieved from the dataset,
- negative examples are retrieved from Hunspell's suggestions.
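The pair construction can be sketched as below. Here `suggest` stands in for Hunspell's suggestion call and the toy data for Norvig's dataset; both are illustrative assumptions, not the project's actual code.

```python
def build_training_pairs(dataset, suggest):
    """Build labeled <word, candidate> pairs for the classifier.

    dataset: iterable of (misspelling, correct_word) pairs.
    suggest: callable returning candidate corrections for a word,
             standing in for Hunspell's suggestion list.
    """
    pairs = []
    for misspelling, correct in dataset:
        # Positive example: the known correct fix from the dataset.
        pairs.append((misspelling, correct, 1))
        # Negative examples: suggestions that are not the correct fix.
        for candidate in suggest(misspelling):
            if candidate != correct:
                pairs.append((misspelling, candidate, 0))
    return pairs

# Usage with a stubbed suggestion function:
fake_suggest = lambda w: ["spell", "spill", "spelt"]
pairs = build_training_pairs([("speel", "spell")], fake_suggest)
```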
Currently, two models are supported: Logistic Regression and Random Forest Classifier. To choose a model, use the special config class.
To validate the model, the Aspell dataset is used.
Metrics:

- Accuracy@1: the fraction of examples where the correct candidate has the highest probability.
- Accuracy@5: the fraction of examples where the correct candidate is among the five most probable.
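Both metrics can be computed from a ranked candidate list; a minimal sketch (the helper name is illustrative):

```python
def accuracy_at_k(examples, k: int) -> float:
    """Fraction of examples whose correct fix is among the top-k candidates.

    examples: iterable of (correct_word, ranked_candidates) pairs, where
              ranked_candidates is sorted by predicted probability, descending.
    """
    examples = list(examples)
    if not examples:
        return 0.0
    hits = sum(correct in ranked[:k] for correct, ranked in examples)
    return hits / len(examples)

# Usage: first example is a top-1 hit, second only a top-3 hit.
data = [("spell", ["spell", "spill"]),
        ("there", ["their", "three", "there"])]
```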
This section describes how to use this GEC system. To install the Python dependencies (needed for model training):
pip install -r requirements.txt
To download the necessary data, use the `download_data.sh` script.
To train the model and save artifacts in ONNX format, use the `train` script:
PYTHONPATH="." python src/main/python/train.py \
--train $TRAIN_DATA_PATH \
--ckpt $CKPT_OUTPUT_FOLDER \
--test $OPTIONAL_TEST_DATA_PATH
This will train a logistic regression on the given train data, test it on the test data, and save the model in ONNX format.
To validate the GEC system, use the `validate` script:
PYTHONPATH="." python src/main/python/validate.py \
--model $PATH_TO_ONNX_MODEL \
--test $PATH_TO_TEST_DATA
| Model | Accuracy@1 | Accuracy@5 |
|---|---|---|
| Logistic Regression | 49.18 | 73.49 |
| Random Forest Classifier | 48.99 | 73.49 |
The `Checkpoints` folder contains the weights for these models.
`SpellCheckerKt` contains the GEC system ported from Python to Kotlin.
It provides inference functionality only, so the model must be trained before it can be used.
`AppKt` shows an example of using this system.
Like the Python validation, it is used to check model accuracy on test data.
To run the example, use Gradle:
gradle run --args="$PATH_TO_TEST_DATA $PATH_TO_ONNX_MODEL"
| Model | Accuracy@1 | Accuracy@5 |
|---|---|---|
| Logistic Regression | 48.99 | 73.67 |
| Random Forest Classifier | 46.62 | 73.86 |
The results differ slightly from the Python validation. This is likely due to small differences in floating-point arithmetic (both Kotlin and Python use the native C Hunspell implementation, and the distance algorithms are deterministic).
- Basically, the spellchecker depends only on Hunspell dictionaries and a training dataset. A bunch of such dictionaries already exist for many languages, so adding a new language requires only collecting data with known misspellings for that language.
- The current solution belongs to the Mixed class of GEC systems. Therefore, future work may include adding new features, based on rules or existing text algorithms, along with ranking-model improvements.