When training NMT (neural machine translation) systems, it is beneficial to have a large and varied corpus. Unfortunately, such a corpus is often not available. This submodule implements a pipeline for tagging, filtering, matching, substituting/correcting and evaluating the translation of named entities (NEs) in a parallel English to Icelandic corpus.
pip install git+https://github.com/mideind/MT-NE-Pipeline
For Icelandic NER, the included IceBERT-NER model is used. For English, we use Flair.
The following command takes as input a text file with one or more sentences per line and writes the NEs to the output file. Empty lines are preserved.
Guðrún fór í heimsókn til Einars Jónssonar.
Anna fékk gjöf frá Alexei, Pétri og Páli.
Núna með Tómar Línur, takk Joe!
Ha?
Running the NER on the example file and writing output to example.is.ner (use - to specify stdout).
mt ner tests/data/example.is example.is.ner --lang is
cat example.is.ner
# Produces
Person:0:6 Person:26:42
Person:0:4 Person:19:25 Person:27:32 Person:36:40
Organization:9:20 Person:27:30
The NEs are written to example.is.ner, in which each line corresponds to a line in the input. The NEs are formatted as label:start_char_idx:end_char_idx, i.e. the span of the label. The spans are separated by spaces. The BIO markers have been joined together to create each span.
For more options (GPU/batch_size) call mt ner --help
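The span format can be read back with a few lines of Python. This is a minimal sketch and not part of the package: `parse_ner_line` is a hypothetical helper name, and the sketch assumes that end indices are exclusive, which is consistent with the example output above.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]


def parse_ner_line(line: str) -> List[Span]:
    """Parse one line of space-separated `label:start:end` spans."""
    spans = []
    for item in line.split():
        # rsplit keeps a label intact even if it were to contain ':'.
        label, start, end = item.rsplit(":", 2)
        spans.append((label, int(start), int(end)))
    return spans


sentence = "Guðrún fór í heimsókn til Einars Jónssonar."
spans = parse_ner_line("Person:0:6 Person:26:42")
# Slice the sentence with each span to recover the entity text.
entities = [sentence[start:end] for _, start, end in spans]
print(entities)
```

An empty line in the NER file yields an empty list, which matches the way empty input lines are preserved.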
Similarly, for English:
Guðrún visited Einars Jónssonar.
Anna got a gift from Pétri, Páli and Alexei.
Now with Empty Lines, thanks Joe!
Huh?
mt ner tests/data/example.en example.en.ner --lang en
cat example.en.ner
# Produces
PER:0:6 PER:15:31
PER:0:4 PER:21:26 PER:28:32 PER:37:43
MISC:9:20 PER:29:32
Notice that the taggers do not produce the same tag sets.
To be able to filter and/or align NE markers we need to unify the tag sets.
mt normalize example.en.ner example.en.ner-norm
mt normalize example.is.ner example.is.ner-norm
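Conceptually, normalization maps each tagger's labels onto one shared tag set. The sketch below illustrates the idea only. The mapping is an assumption: the filtered output in this README shows Person/PER becoming P, while the O and L targets and the pass-through of unmapped labels such as MISC are guesses made for illustration, not the documented behaviour of mt normalize.

```python
# Assumed mapping: only the "P" target is visible in this README's output;
# "O" and "L" are guesses made for illustration.
TAG_MAP = {
    "Person": "P", "PER": "P",
    "Organization": "O", "ORG": "O",
    "Location": "L", "LOC": "L",
}


def normalize_ner_line(line: str) -> str:
    """Rewrite the label of every `label:start:end` span on a line."""
    out = []
    for item in line.split():
        label, start, end = item.rsplit(":", 2)
        # Labels without a mapping (e.g. MISC) pass through unchanged here.
        out.append(f"{TAG_MAP.get(label, label)}:{start}:{end}")
    return " ".join(out)


print(normalize_ner_line("Person:0:6 Person:26:42"))
print(normalize_ner_line("MISC:9:20 PER:29:32"))
```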
We can also embed the NEs directly into the sentences:
mt embed tests/data/example.is example.is.ner -
mt embed tests/data/example.en example.en.ner -
Filtering is based on parallel data and works as follows:
- Lines with no NEs are removed
- NE tag sets are normalized (like above)
- NE tags which are not Organization, Location or Person are filtered out
- Lines with unequal number of tags in each group are filtered out
- Then all the remaining lines are shuffled.
mt filter-text-by-ner tests/data/example.is tests/data/example.en example.is.ner example.en.ner example.is.filtered example.en.filtered example.is.ner.filtered example.en.ner.filtered
# Check the results
cat example.??*.filtered
Guðrún visited Einars Jónssonar.
Anna got a gift from Pétri, Páli and Alexei.
P:0:6 P:15:31
P:0:4 P:21:26 P:28:32 P:37:43
Guðrún fór í heimsókn til Einars Jónssonar.
Anna fékk gjöf frá Alexei, Pétri og Páli.
P:0:6 P:26:42
P:0:4 P:19:25 P:27:32 P:36:40
Only two lines remain.
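The filter rules can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation behind mt filter-text-by-ner: the function names are hypothetical, the O and L labels in the tag map are assumed, and the sketch returns the indices of the surviving line pairs instead of writing files.

```python
import random
from collections import Counter

# Assumed normalized labels; only "P" appears in the output above.
TAG_MAP = {"Person": "P", "PER": "P", "Organization": "O", "ORG": "O",
           "Location": "L", "LOC": "L"}


def tag_counts(ner_line):
    """Count the kept (Person/Organization/Location) tags on one NER line."""
    labels = (item.rsplit(":", 2)[0] for item in ner_line.split())
    return Counter(TAG_MAP[label] for label in labels if label in TAG_MAP)


def filter_parallel(src_ner, tgt_ner, seed=0):
    """Return shuffled indices of the line pairs that pass the filter rules."""
    kept = []
    for idx, (src, tgt) in enumerate(zip(src_ner, tgt_ner)):
        src_counts, tgt_counts = tag_counts(src), tag_counts(tgt)
        if not src_counts or not tgt_counts:
            continue  # no usable NEs on at least one side
        if src_counts != tgt_counts:
            continue  # unequal number of tags in some group
        kept.append(idx)
    random.Random(seed).shuffle(kept)
    return kept


is_ner = ["Person:0:6 Person:26:42",
          "Person:0:4 Person:19:25 Person:27:32 Person:36:40",
          "Organization:9:20 Person:27:30",
          ""]
en_ner = ["PER:0:6 PER:15:31",
          "PER:0:4 PER:21:26 PER:28:32 PER:37:43",
          "MISC:9:20 PER:29:32",
          ""]
kept = filter_parallel(is_ner, en_ner)
print(sorted(kept))
```

On the example data, the third pair is dropped because the Icelandic side has one organization and one person while the English side keeps only one person once MISC is discarded, and the fourth pair is dropped because it has no NEs.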
In our example, the English sentences are considered incorrect translations of the Icelandic sentences, because the names are not in the nominative case. We will now correct this.
mt correct example.is.filtered example.en.filtered example.is.ner.filtered example.en.ner.filtered example.is.corrected --to_nominative_case
# example.is.corrected
Guðrún visited Einar Jónsson.
Anna got a gift from Alexei, Pétur and Páll.
Additionally, a dictionary can be provided to make manual corrections. The dictionary is tried first, before any other correction.
mt correct example.is.filtered example.en.filtered example.is.ner.filtered example.en.ner.filtered example.is.corrected --to_nominative_case --corrections_tsv tests/data/corrections.tsv
# example.is.corrected
Guðrún visited Einar Jónsson.
Anna got a gift from Alexei Sergov, Pétur and Páll.
To evaluate an MT system w.r.t. BLEU run:
# This should give a perfect score.
lang=en
ref=testdata/example.ner-ext.$lang
sys=testdata/example.$lang
python mt_eval.py --ref $ref --ref-contains-entities --sys $sys --tgt_lang $lang
This will do the following:
- Read the NER markers from the REF.
- Report the BLEU score on the cleaned REF and SYS (as is).
- Run a NER on the SYS.
- Report on NER alignment:
  - Alignment count: How many NEs we were able to match between REF and SYS.
  - Alignment coverage: The fraction of NEs which we were able to align, from 0 to 1, where 1 is best. If REF and SYS do not contain equal counts of NEs, we use the smaller count.
  - Average alignment distance: The average distance in the alignment, from 0 to 1, where 0 is best.
  - Accuracy: The fraction of exact matches in the alignment (string comparison).
- Run the report on each distinct tag found in both REF and SYS.
This evaluation can be run with any combination of --ref/sys-contains-entities.
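The alignment metrics described above can be computed along these lines. This is a sketch under assumptions, not the code in mt_eval.py: `alignment_report` is a hypothetical name, the alignment is assumed to be a list of (ref_text, sys_text, distance) triples, the example distances are made up, and returning 0.0 for empty input is a choice made here.

```python
def alignment_report(alignment, n_ref, n_sys):
    """Summarise an alignment given as (ref_text, sys_text, distance) triples."""
    count = len(alignment)
    # Coverage is measured against the smaller of the two NE counts.
    smaller = min(n_ref, n_sys)
    coverage = count / smaller if smaller else 0.0
    avg_distance = sum(d for _, _, d in alignment) / count if count else 0.0
    # Accuracy is the share of aligned pairs that are exact string matches.
    accuracy = sum(r == s for r, s, _ in alignment) / count if count else 0.0
    return {"count": count, "coverage": coverage,
            "avg_distance": avg_distance, "accuracy": accuracy}


# Hypothetical alignment: REF has three NEs, SYS has two, both SYS NEs aligned.
alignment = [("Einar Jónsson", "Einar Jónsson", 0.0),
             ("Guðrún", "Gudrun", 0.25)]
report = alignment_report(alignment, n_ref=3, n_sys=2)
print(report)
```

Because coverage uses the smaller count, this example reaches full coverage even though one reference NE has no counterpart.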
(This can be skipped) The next step aligns the two tagged files and optionally prints some statistics. This step is run automatically by the filtering but can also be run on its own.
python aligner.py --is_ent testdata/is.ner --en_ent testdata/en.ner --output testdata/alignment.tsv
The columns are ner_tagger_1, source_1, ner_tagger_2, source_2, match_code, max_distance (1 - Jaro-Winkler similarity), alignment spans:
is hf 1 0.06999999999999995 0:1:Person-5:6:PER 5:7:Person-0:2:PER
is hf 1 0.12 0:1:Person-0:1:PER 4:5:Person-9:10:PER 6:7:Person-5:6:PER 8:9:Person-7:8:PER
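The alignment spans column can be parsed with a short helper. The sketch below is not part of the package, and the names in it are hypothetical. It assumes that each pair is written as start:end:label-start:end:label with the Icelandic span first, and that the indices are token positions with an exclusive end, which is what the example rows suggest.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int
    end: int
    label: str


def parse_span(text: str) -> Span:
    """Parse a single `start:end:label` span."""
    start, end, label = text.split(":", 2)
    return Span(int(start), int(end), label)


def parse_alignment(field: str) -> List[Tuple[Span, Span]]:
    """Parse space-separated `start:end:label-start:end:label` pairs."""
    pairs = []
    for item in field.split():
        left, right = item.split("-", 1)
        pairs.append((parse_span(left), parse_span(right)))
    return pairs


is_tokens = "Guðrún fór í heimsókn til Einars Jónssonar .".split()
en_tokens = "Einar Jónsson was visited by Guðrún .".split()
pairs = parse_alignment("0:1:Person-5:6:PER 5:7:Person-0:2:PER")
for is_span, en_span in pairs:
    # Recover the aligned token sequences on both sides.
    print(is_tokens[is_span.start:is_span.end],
          en_tokens[en_span.start:en_span.end])
```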
This step parses the named files, aligns the entities and POS-tags them.
python postagger.py --is_ent testdata/is.ner --en_ent testdata/en.ner --output testdata/en_is.pos.tsv
The resulting file contains tags indicating the entity ID and part of speech (POS) that a given name has on the Icelandic side.
<e:0:nkee-s:>Einar Jónsson</e0> was visited by <e:1:nven-s:>Guðrún</e1> . <e:1:nven-s:>Guðrún</e1> fór í heimsókn til <e:0:nkee-s:>Einars Jónssonar</e0> .
<e:0:nven-s:>Anna</e0> got a gift from <e:1:nkeþ-s:>Pétur</e1> , <e:2:nkeþ-s:>Páll</e2> and <e:3:nkeþ-s:>Alexei</e3> . <e:0:nven-s:>Anna</e0> fékk gjöf frá <e:3:nkeþ-s:>Alexei</e3> , <e:1:nkeþ-s:>Pétri</e1> og <e:2:nkeþ-s:>Páli</e2> .
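The embedded entity tags can be read back with a regular expression. This is a minimal sketch with hypothetical helper names, assuming the tag layout is exactly the one shown in the two lines above: an opening tag carrying the entity ID and the POS tag, and a closing tag that repeats the ID.

```python
import re

# <e:ID:POS:>text</eID>; the closing tag repeats the ID without colons.
ENTITY_RE = re.compile(r"<e:(\d+):([^:>]*):>(.*?)</e\1>")


def extract_entities(text):
    """Return (entity_id, pos_tag, surface_text) for every tagged entity."""
    return [(int(m.group(1)), m.group(2), m.group(3))
            for m in ENTITY_RE.finditer(text)]


def strip_entities(text):
    """Remove the markup and keep only the surface text."""
    return ENTITY_RE.sub(lambda m: m.group(3), text)


line = ("<e:1:nven-s:>Guðrún</e1> fór í heimsókn til "
        "<e:0:nkee-s:>Einars Jónssonar</e0> .")
found = extract_entities(line)
plain = strip_entities(line)
print(found)
print(plain)
```

Matching entities by ID across the two halves of a line then shows which Icelandic form corresponds to which English name.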
Finally, given a tab-separated list of genders (kk and kvk) and a sufficient number of names, such as
kk Þröstur Helagson
kk Jón Jónsson
kk Bubbi Morthens
kk Ingvar Gunnarsson
kvk Sigga
kvk Sigríður Einarsdóttir
we can then generate a synthetic parallel corpus with randomly inserted names (full names and first names) using
python patcher.py --input testdata/en_is.pos.tsv --output testdata/en_is.synth.tsv --names testdata/names.txt
which outputs
Jón was visited by Sigríður . Sigríður fór í heimsókn til Jóns .
Sigga got a gift from Bubbi Morthens , Ingvar and Jón . Sigga fékk gjöf frá Jóni , Bubba Morthens og Ingvari .
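The name list can be loaded and sampled by gender as sketched below. This is not the code in patcher.py, and the helper names are hypothetical. It assumes that the second character of the POS tag encodes gender (k for kk, v for kvk), which agrees with the tags in the examples above. Inflecting the chosen name into the right case, as the real output does, is not covered here.

```python
import random
from collections import defaultdict


def load_names(lines):
    """Group names by gender from `gender<TAB>name` lines."""
    names = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        gender, name = line.split(maxsplit=1)
        names[gender].append(name.strip())
    return names


# Assumption: the second character of the POS tag encodes gender.
POS_GENDER = {"k": "kk", "v": "kvk"}


def pick_name(pos_tag, names, rng):
    """Pick a random name whose gender matches the POS tag."""
    return rng.choice(names[POS_GENDER[pos_tag[1]]])


names = load_names(["kk\tJón Jónsson", "kk\tBubbi Morthens", "kvk\tSigga"])
rng = random.Random(0)  # seeded so that runs are repeatable
print(pick_name("nven-s", names, rng))
print(pick_name("nkee-s", names, rng))
```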