The project aims at implementing a set of algorithms for weighting transducers.
Requirements:
- apertium-weighting-tools (lt-weight): lttoolbox, hfst
- Most of the weighting scripts: python3
- Word2vec scripts: tqdm (a Python package for showing progress) and gensim (a Python package for training word2vec models)
  pip install tqdm gensim
- Evaluation scripts: tabulate (a Python package for generating evaluation results in Markdown table format)
  pip install tabulate
Add weights to a compiled dictionary using a set of weightlists.
- The weightlists are written as Xerox regular expressions, which are more powerful than weighted string pairs since they permit matching all analyses with certain prefixes or suffixes. BUT take care when using complex regexps, since composing a transducer with a complex regexp transducer is time- and memory-consuming (TODO: WHAT IS THE EXACT COMPLEXITY?)
- The script can take a sequence of weightlists; each weightlist acts as a fallback, weighting only those paths that weren't weighted by the preceding weightlists.
- It's also advised to use a default weightlist of the form
[?*]::LARGE_WEIGHT
as the last weightlist, so that paths that weren't matched by any of the weightlists aren't dropped from the final composed transducer and instead get a large default weight.
- Note: The script composes weighted FSTs using hfst-compose and assumes that the weights belong to a tropical semiring. So, if the input FST had a path mapping cat to dog with a weight of 1, and the weightlist entry was [d o g]::2, then the weighted FST will map cat to dog with a weight of 3 (1 + 2).
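The following minimal Python sketch (illustrative only, not part of lt-weight) shows the two tropical-semiring operations that matter here: weights are added along a composed path, and the cheapest of several competing paths is the preferred one.

# Illustrative sketch of tropical-semiring weight arithmetic (not part of lt-weight).

def path_weight(*weights: float) -> float:
    # Composing weighted steps "multiplies" them in the semiring sense,
    # which for the tropical semiring means adding the weights.
    return sum(weights)

def best_path(*weights: float) -> float:
    # Choosing among alternative paths takes the semiring "sum",
    # which for the tropical semiring is the minimum weight.
    return min(weights)

# The cat -> dog example above: 1 (input FST) + 2 (weightlist entry) = 3.
assert path_weight(1, 2) == 3
# If two alternative paths yield the same analysis with weights 3 and 4,
# the preferred (cheapest) one has weight 3.
assert best_path(3, 4) == 3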
TODO: SO?? ADVANTAGE IN ANY WAY?
TODO: PRINT USAGE
For the ambiguous English word saw, apertium-eng already gives it five distinct morphological analyses:
^saw/saw<n><sg>/saw<vblex><inf>/saw<vblex><pres>/saw<vblex><imp>/see<vblex><past>$
Let's assume we want to prioritize the analyses as follows:
- ^saw/see<vblex><past>$ first, as the verb see is quite common in the language.
- ^saw/saw<n><sg>$ next, as it's both a popular movie and a useful tool.
- ^saw/saw<vblex><imp>$ next, as imperative verbs seem to occur a lot in the linguist's opinion.
- Any other possible analysis should be given a high default weight.
$ cat saw.att
0 1 s s
1 2 a a
2 3 w w
3 9 ε <n>
9 8 ε <sg>
3 6 ε <vblex>
6 8 ε <inf>
6 8 ε <pres>
6 8 ε <imp>
1 4 a e
4 5 w e
5 7 ε <vblex>
7 8 ε <past>
8
$ lt-comp lr saw.att saw.bin
main@standard 10 13
$ echo '[s e e %<vblex%> %<past%>]::1' > wl1
$ echo '[s a w %<n%> %<sg%>]::2' > wl2
$ echo '[?* %<vblex%> %<imp%>]::3' > wl3
$ echo '[?*]::4' > wl4
$ ./lt-weight saw.bin weighted_saw.bin wl1 wl2 wl3 wl4
Reading from /tmp/tmp.mOItp9wzUO/transducer.hfst and
/tmp/tmp.mOItp9wzUO/weighted-regexp.hfst, writing to
/tmp/tmp.mOItp9wzUO/weighted-transducer.hfst
Composing text(/tmp/tmp.mOItp9wzUO/transducer.att) and xre(?)...
Reading from /tmp/tmp.mOItp9wzUO/transducer.hfst and
/tmp/tmp.mOItp9wzUO/weighted-regexp.hfst, writing to
/tmp/tmp.mOItp9wzUO/weighted-transducer.hfst
Composing text(/tmp/tmp.mOItp9wzUO/transducer.att) and xre(?)...
Reading from /tmp/tmp.mOItp9wzUO/transducer.hfst and
/tmp/tmp.mOItp9wzUO/weighted-regexp.hfst, writing to
/tmp/tmp.mOItp9wzUO/weighted-transducer.hfst
Composing text(/tmp/tmp.mOItp9wzUO/transducer.att) and xre(?)...
Reading from /tmp/tmp.mOItp9wzUO/transducer.hfst and
/tmp/tmp.mOItp9wzUO/weighted-regexp.hfst, writing to
/tmp/tmp.mOItp9wzUO/weighted-transducer.hfst
Composing text(/tmp/tmp.mOItp9wzUO/transducer.att) and xre(?)...
main@standard 10 13
$ echo 'saw' | lt-proc weighted_saw.bin -W
^saw/see<vblex><past><W:1.000000>/saw<n><sg><W:2.000000>/saw<vblex><imp><W:3.000000>/saw<vblex><pres><W:4.000000>/saw<vblex><inf><W:4.000000>$
Note:
- wl1 and wl2 can be merged into a single weightlist, since they aren't actually acting as fallbacks for each other.
- On the other hand, if wl3 and wl4 were merged into a single weightlist, then the analysis saw<vblex><imp> would actually be added twice to the weighted transducer, with two different weights, 3 and 4!
Generate a weightlist given an annotated corpus.
The annotated corpus is in the form ^surface_form/analyzed_form$.
The script estimates the weight for an analyzed form by calculating the probability of the analysis in the corpus:
P(analysis) = Count(analysis) / size of corpus
This model acts as a benchmark model for the other unsupervised
techniques.
To account for OOV analyses, Laplace smoothing is used (sketched below) such that:
- the weight for an analysis that isn't part of the corpus is
1 / (size of corpus + number of unique analyses in corpus + 1)
- the weight for an analysis that is part of the corpus is
(1 + Count(analysis)) / (size of corpus + number of unique analyses in corpus + 1)
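A minimal worked sketch of this estimate in Python (function and variable names are mine, not the script's; how the probability is turned into an actual tropical weight, e.g. via a negative log, is left out):

from collections import Counter

def laplace_probabilities(analyses):
    """Laplace-smoothed unigram probabilities for the analysed forms of a corpus.

    analyses: list of analysis strings taken from the ^surface_form/analyzed_form$ corpus.
    Returns a probability per seen analysis and the probability of any unseen analysis.
    """
    counts = Counter(analyses)
    corpus_size = len(analyses)
    denominator = corpus_size + len(counts) + 1    # corpus size + unique analyses + 1
    seen = {a: (1 + c) / denominator for a, c in counts.items()}
    unseen = 1 / denominator                       # probability of an OOV analysis
    return seen, unseen

# Toy corpus of three tokens with two unique analyses:
seen, unseen = laplace_probabilities(
    ["see<vblex><past>", "saw<n><sg>", "see<vblex><past>"])
print(seen["see<vblex><past>"])    # (1 + 2) / (3 + 2 + 1) = 0.5
print(unseen)                      # 1 / 6 ≈ 0.17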
A tweak to the weightlist generation (--tag_weightlist) was also added to give priority to analyses with tags that are common in the corpus.
E.g. if the noun tag <n> is highly probable in the corpus, then we might want to give OOV analyses carrying that tag a lower weight (higher probability).
However, using the tag weightlist means the weights are no longer probabilistic.
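The sketch below only illustrates the idea behind the tag weightlist (how --tag_weightlist actually builds its weights may differ): count how often each tag appears in the corpus, so that OOV analyses carrying frequent tags can be ranked above those carrying rare ones.

import re
from collections import Counter

def tag_frequencies(analyses):
    """Relative frequency of each tag (e.g. <n>, <vblex>) in an analysed corpus."""
    tags = [tag for analysis in analyses for tag in re.findall(r"<[^>]+>", analysis)]
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}

freqs = tag_frequencies(["see<vblex><past>", "saw<n><sg>", "dog<n><sg>"])
# <n> and <sg> each cover 2 of the 6 tag occurrences, <vblex> and <past> 1 each,
# so an OOV analysis tagged <n> would be favoured over one tagged <vblex>.
print(freqs)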
TODO: PRINT USAGE
TODO
- The --tag_weightlist option seems to be slow, since it involves composing the original FST with a regexp FST of the form [?*].
- Make use of more advanced techniques than unigram counts (e.g. n-grams).
To generate a tagged corpus similar to that used in the annotated-corpus-to-weightlist method, the script will:
- analyze a raw corpus using a compiled unweighted dictionary.
- make use of a constraint grammar to discard impossible analyses or select certain ones (a rough sketch of these first two steps follows this list).
- write all the possible analyses to a file, using the format ^surface_form/analyzed_form$ for each line.
- use annotated-corpus-to-weightlist to estimate a weightlist.
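A rough sketch of those first two steps, assuming the standard lt-proc and cg-proc tools and placeholder file names (reformatting the stream into one ^surface_form/analyzed_form$ entry per line and calling annotated-corpus-to-weightlist are omitted):

import subprocess

ANALYSER = "kaz.automorf.bin"   # placeholder: compiled unweighted dictionary
GRAMMAR = "kaz.rlx.bin"         # placeholder: compiled constraint grammar

with open("corpus.raw") as raw, open("corpus.analysed", "w") as out:
    # lt-proc produces every possible analysis for each surface form...
    analyse = subprocess.Popen(["lt-proc", ANALYSER], stdin=raw, stdout=subprocess.PIPE)
    # ...and cg-proc applies the constraint grammar to discard or select analyses.
    subprocess.run(["cg-proc", GRAMMAR], stdin=analyse.stdout, stdout=out, check=True)
    analyse.stdout.close()
    analyse.wait()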
TODO: PRINT USAGE
TODO
- Investigate the effect of using a constraint grammar and perform a thorough error analysis
Generate a simple weightlist with the same weight for all analyses. This model acts as the baseline for all the other techniques.
TODO: PRINT USAGE
TODO
Use hfst-reweight to directly give a weight of one to all the edges of the finite-state transducer.
TODO: PRINT USAGE
TODO
- Compare the results of using equal-weightlist and analysis-length-reweight. Weighting epsilon input might be the reason for the drift between the results of the two methods.
Generate a weightlist for words based on a word2vec CBOW (continuous bag of words) model.
- First, use the whole raw corpus to train a word2vec model.
- Then, use a sliding window and, for each center word, predict the most probable words given the current context.
- Finally, if the center word isn't ambiguous, just increment the count of its analysis. Otherwise, for each analysis of the ambiguous center word, count the number of similar words that aren't ambiguous AND have tags matching the tags of that analysis (see the sketch below).
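A rough sketch of the training and prediction steps using gensim (the toy sentences, model parameters, and matching logic described in the comments are illustrative, not the script's actual code):

from gensim.models import Word2Vec

# Toy tokenised corpus; the real script streams the whole raw corpus instead.
sentences = [["the", "man", "saw", "the", "dog"],
             ["i", "saw", "a", "movie"],
             ["he", "saw", "the", "wood"]]

# CBOW model (sg=0); predict_output_word requires negative sampling (negative > 0).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, negative=5)

# Sliding-window step: given the context around the centre word "saw",
# ask the model which words are most probable in that slot.
context = ["the", "man", "the", "dog"]
for word, prob in model.predict_output_word(context, topn=5):
    # In the real script, unambiguous predicted words whose tags match one of the
    # centre word's analyses would increment the count for that analysis.
    print(word, prob)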
TODO: PRINT USAGE
TODO
- Training a word2vec model on a raw corpus is both time- and memory-intensive (that's why the script currently avoids loading the whole dataset into memory at once).
- Refactor the word2vec training function
To evaluate and compare the performance of the weighting methods, cross validation is used.
A script to divide a tagged corpus into n folds (where n is a parameter)
TODO: PRINT USAGE
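A minimal sketch of the fold split (names are illustrative; see eval/corpus_split.py for the real implementation):

def split_into_folds(lines, n=5):
    """Deal the tagged-corpus entries (one ^surface/analysis$ line each) into n folds."""
    folds = [[] for _ in range(n)]
    for i, line in enumerate(lines):
        folds[i % n].append(line)
    return folds

# Each fold is then used once as the held-out test set while the remaining
# n - 1 folds are used to fit the weighting model under evaluation.
corpus = [f"^w{i}/w{i}<n><sg>$" for i in range(10)]
for k, fold in enumerate(split_into_folds(corpus, n=5)):
    print(k, fold)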
Each weighting method has a script for training n models and reporting the metrics for each one in tabular form.
TODO: PRINT USAGE
REPO="../../apertium-kaz"
BIN="${REPO}/kaz.automorf.bin"
TAGGED_CORPUS="${REPO}/corpus/kaz.tagged"
UNTAGGED_CORPUS="${REPO}/kaz-clean"
CONSTRAINT_GRAMMAR="${REPO}/kaz.rlx.bin"
BASE_DIR=$(mktemp -d)
FOLDS_DIR="folds"
CLEANED_CORPUS="${BASE_DIR}/kaz.cleaned"
apertium-cleanstream -n < "$TAGGED_CORPUS" > "$CLEANED_CORPUS"
python eval/corpus_split.py -o "$FOLDS_DIR" "$CLEANED_CORPUS"
python eval/unigram_fit.py -i "$FOLDS_DIR" -b "$BIN" -o temp_uni_bin
python eval/equalweight_fit.py -i "$FOLDS_DIR" -b "$BIN" -o temp_eq_bin
python eval/constraintgrammar_fit.py -i "$FOLDS_DIR" -cg "$CONSTRAINT_GRAMMAR" -corpus "$UNTAGGED_CORPUS" -b "$BIN" -o temp_cg_bin
python eval/analysis_length_fit.py -i "$FOLDS_DIR" -b "$BIN" -o temp_ana_bin
python eval/w2v_fit.py -i "$FOLDS_DIR" -b "$BIN" -o temp_w2v_bin -corpus "$UNTAGGED_CORPUS"
echo 'Uni'
python eval/metrics_report.py -i "$FOLDS_DIR" -b temp_uni_bin
echo 'Eq'
python eval/metrics_report.py -i "$FOLDS_DIR" -b temp_eq_bin
echo 'Cg'
python eval/metrics_report.py -i "$FOLDS_DIR" -b temp_cg_bin
echo 'Length'
python eval/metrics_report.py -i "$FOLDS_DIR" -b temp_ana_bin
echo 'W2V'
python eval/metrics_report.py -i "$FOLDS_DIR" -b temp_w2v_bin
TODO: Add the results on apertium repos
After a series of shotgun-debugging sessions, I found this file
(ftp://ftp.cis.upenn.edu/pub/cis639/public_html/docs/xfst.html)
and it was a great help in understanding how to use Xerox regexps, especially the "Commands <-> RE operators" section.