Efficient Low-Memory Aligner
This is a word alignment tool based on efmaral, with the following main differences:
- More compact data structures are used, so memory requirements are much lower (by orders of magnitude).
- The estimation of alignment variable marginals is done one sentence at a time, which also saves a lot of memory at no detectable cost in accuracy.
- New: User-specified Dirichlet priors, which can be generated by the `makepriors.py` script to allow models to be saved. See below under Input data format and Generating priors.
Technical details relevant to both `efmaral` and `eflomal` can be found in the following article:
To install the complete Python package, run:
python -m pip install .
If you want to compile and install only the C binary, run:
make -C src
sudo make -C src install
Change the `INSTALLDIR` parameter in the install step if you want to install somewhere other than the default `/usr/local/bin` (e.g. `make -C src -e INSTALLDIR=~/bin install`).
There are three main ways of using `eflomal`:
- Directly call the `eflomal` binary. Note that this requires some preprocessing.
- Use the `eflomal-align` command-line interface, which is partly compatible with that of `efmaral`. Run `eflomal-align --help` for instructions.
- Use the Cython module to call the `eflomal` binary. This takes care of the necessary preprocessing and file conversions. See the docstrings in `eflomal.pyx` for documentation.
In addition, there are convenience scripts for aligning and symmetrizing (with the `atools` program from `fast_align`) as well as evaluating with data from the WPT shared task datasets. These work the same way as in `efmaral`; please see its README for details.
When used with the `-s` and `-t` options for separate source/target files, the `eflomal-align` interface expects one sentence per line with space-separated tokens, similar to most word alignment software.
The `-i` option assumes a `fast_align`-style joint source/target file of the format
source sentence ||| target sentence
another source sentence ||| another target sentence
...
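If your data is in separate source and target files, the joint format is easy to produce or read back yourself. The following is a minimal sketch, not part of eflomal; the helper names `to_joint_lines` and `split_joint_line` are hypothetical, and the two inputs are assumed to have the same number of lines.

```python
def to_joint_lines(src_lines, trg_lines):
    """Yield 'source ||| target' lines from two parallel iterables of sentences.

    Assumes both iterables have the same length (zip stops at the shorter one).
    """
    for src, trg in zip(src_lines, trg_lines):
        yield f"{src.strip()} ||| {trg.strip()}"


def split_joint_line(line):
    """Split one joint line into (source_tokens, target_tokens)."""
    src, sep, trg = line.partition("|||")
    if not sep:
        raise ValueError(f"missing '|||' separator: {line!r}")
    return src.split(), trg.split()


if __name__ == "__main__":
    for joint in to_joint_lines(["a black cat\n"], ["kuro neko\n"]):
        print(joint)
        print(split_joint_line(joint))
```

Stripping each sentence before joining avoids stray newlines ending up in the middle of a joint line.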
The `--priors` option expects a file generated by `eflomal-makepriors` (see below). This file contains user-specified lexical, HMM and/or fertility distribution priors. Since the algorithm is asymmetric, HMM and fertility priors can be stored for both the forward and reverse directions. `eflomal-makepriors` handles this automatically; see the examples below.
Note that the default value of the Dirichlet priors (defined in `eflomal.c` as `LEX_ALPHA`, `JUMP_ALPHA` and `FERT_ALPHA`) will be added to whatever is specified in the priors file. This means that integer counts for whatever word forms you have data on are fine in the priors file.
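Put differently, the effective Dirichlet parameter for an event is the compiled-in default plus the value given in the priors file. A small arithmetic sketch of that rule follows; the default value used here is a placeholder chosen for illustration, not the actual constant from `eflomal.c`, and the function name is hypothetical.

```python
# Placeholder for illustration only; NOT the real value of LEX_ALPHA,
# JUMP_ALPHA or FERT_ALPHA, which are defined in eflomal.c.
ASSUMED_DEFAULT_ALPHA = 0.5


def effective_alpha(prior_count, default_alpha=ASSUMED_DEFAULT_ALPHA):
    """Return the default Dirichlet parameter plus the value from the priors file."""
    return default_alpha + prior_count


if __name__ == "__main__":
    # An event absent from the priors file keeps only the default.
    print(effective_alpha(0))
    # An event seen 12 times in the aligned training data gets 12 added.
    print(effective_alpha(12))
```

Because the default is always added, events missing from the priors file still receive a non-zero parameter.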
It is possible to use the special `<NULL>` token in the priors file, in case you want to encourage certain word forms to remain unaligned. Currently the `eflomal-makepriors` script does not generate these, and this feature has not been tested yet.
If you have a large file that you want to use as "training data", named `en-sv`, and a small file that you later want to align quickly, `en-sv.small`, start by aligning the large file as usual, specifying where to write the forward and reverse alignment output files:
eflomal-align -i en-sv --model 3 -f en-sv.fwd -r en-sv.rev
The above command will give you two intermediate files, `en-sv.fwd` and `en-sv.rev`. Now you can use these to generate priors based on the large aligned file. The priors will be stored in `en-sv.priors`:
eflomal-makepriors -i en-sv -f en-sv.fwd -r en-sv.rev --priors en-sv.priors
Alternatively, you can symmetrize `en-sv.fwd` and `en-sv.rev` into `en-sv.sym` and pass the same file to both `-f` and `-r`:
atools -c grow-diag-final-and -i en-sv.fwd -j en-sv.rev >en-sv.sym
eflomal-makepriors -i en-sv -f en-sv.sym -r en-sv.sym --priors en-sv.priors
Now, if you have another file to align, `en-sv.small`, simply use e.g.:
eflomal-align -i en-sv.small --priors en-sv.priors --model 3 \
-f en-sv.small.fwd -r en-sv.small.rev
This will be much faster than merging `en-sv` and `en-sv.small` and aligning them jointly, while being nearly as accurate (assuming `en-sv.small` is much smaller than `en-sv`).
The alignment output contains the same number of lines as the input files, where each line contains pairs of indexes. For instance, if the source input contains the following:
a black cat
and the target input is the following:
kuro neko
the correct output would be:
1-0 2-1
That is, `1-0` indicates that token 1 of the source (black) is aligned to token 0 of the target (kuro), and `2-1` that token 2 of the source (cat) is aligned to token 1 of the target (neko). `NULL` alignments are not present in the output.
Note that the forward and reverse alignments both use source-target order, so the output can be fed directly to `atools` (see `scripts/align_symmetrize.sh` for an example).
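Because both files use the same index order, corresponding lines can be compared directly as sets of links. The sketch below shows the two simplest symmetrization heuristics, intersection and union. It only illustrates the idea; it is not how `atools` is implemented and it is not the grow-diag-final-and heuristic. All function names are hypothetical.

```python
def read_links(line):
    """Parse '1-0 2-1' into a set of (source, target) index pairs."""
    return {tuple(map(int, pair.split("-"))) for pair in line.split()}


def format_links(links):
    """Format a set of index pairs back into a sorted 'i-j i-j' line."""
    return " ".join(f"{i}-{j}" for i, j in sorted(links))


def intersect(fwd_line, rev_line):
    """Keep only the links present in both directions (high precision)."""
    return format_links(read_links(fwd_line) & read_links(rev_line))


def union(fwd_line, rev_line):
    """Keep the links present in either direction (high recall)."""
    return format_links(read_links(fwd_line) | read_links(rev_line))


if __name__ == "__main__":
    fwd, rev = "0-0 1-0 2-1", "1-0 2-1"
    print(intersect(fwd, rev))
    print(union(fwd, rev))
```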
In case you made a mistake with the direction, you can fix it afterwards with `scripts/reverse_moses.py`.
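Conceptually, reversing the direction means swapping the two indexes in every pair. The following is a minimal sketch of that idea, not the contents of `scripts/reverse_moses.py`; the function name `reverse_links` is hypothetical.

```python
def reverse_links(line):
    """Swap source and target indexes in each pair: '1-0 2-1' becomes '0-1 1-2'."""
    return " ".join("-".join(reversed(pair.split("-"))) for pair in line.split())


if __name__ == "__main__":
    print(reverse_links("1-0 2-1"))
    # → 0-1 1-2
```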
The Python package provides an interface for aligning and estimating priors. Here is a simple example using the files in testdata:
import eflomal

aligner = eflomal.Aligner()

with open('test1.sv', 'r', encoding='utf-8') as src_data, \
     open('test1.en', 'r', encoding='utf-8') as trg_data, \
     open('test1.priors', 'r', encoding='utf-8') as priors_data:
    # Align with priors
    aligner.align(
        src_data, trg_data,
        links_filename_fwd='sv-en.fwd', links_filename_rev='sv-en.rev',
        priors_input=priors_data)

with open('test1.sv', 'r', encoding='utf-8') as src_data, \
     open('test1.en', 'r', encoding='utf-8') as trg_data, \
     open('sv-en.fwd', 'r', encoding='utf-8') as fwd_links, \
     open('sv-en.rev', 'r', encoding='utf-8') as rev_links, \
     open('sv-en.priors', 'w', encoding='utf-8') as priors_f:
    # Estimate priors
    priors_tuple = eflomal.calculate_priors(
        src_data, trg_data, fwd_links, rev_links)
    # Write priors to file
    eflomal.write_priors(priors_f, *priors_tuple)
Note that the output files for `Aligner.align()` are given as paths, not file objects, as they are written directly by the `eflomal` binary.
This is a comparison between eflomal, efmaral and fast_align.
The difference between efmaral and eflomal is in part due to different default parameters, in particular the number of iterations and the number of independent samplers.
Note that all timing figures below include alignments in both directions (run in parallel) and symmetrization.
Languages | Sentences | AER | CPU time (s) | Real time (s) |
---|---|---|---|---|
English-French | 1,130,551 | 0.081 | 1,232 | 337 |
English-Inuktitut | 340,601 | 0.203 | 161 | 44 |
Romanian-English | 48,681 | 0.298 | 159 | 33 |
English-Hindi | 3,530 | 0.467 | 31 | 6 |
Languages | Sentences | AER | CPU time (s) | Real time (s) |
---|---|---|---|---|
English-Swedish | 1,862,426 | 0.133 | 1,719 | 620 |
English-French | 1,130,551 | 0.085 | 763 | 279 |
English-Inuktitut | 340,601 | 0.235 | 122 | 46 |
Romanian-English | 48,681 | 0.287 | 161 | 46 |
English-Hindi | 3,530 | 0.483 | 98 | 10 |
Languages | Sentences | AER | CPU time (s) | Real time (s) |
---|---|---|---|---|
English-Swedish | 1,862,426 | 0.205 | 11,090 | 672 |
English-French | 1,130,551 | 0.153 | 3,840 | 241 |
English-Inuktitut | 340,601 | 0.287 | 477 | 47 |
Romanian-English | 48,681 | 0.325 | 208 | 17 |
English-Hindi | 3,530 | 0.672 | 24 | 2 |
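
The AER column in the tables above is the alignment error rate, where lower is better. As a reminder of how this metric is commonly defined (with "sure" and "possible" gold-standard links), here is a minimal sketch. It is not the evaluation script used to produce these figures, and the function name `aer` is hypothetical.

```python
def aer(predicted, sure, possible):
    """Alignment error rate: 1 - (|A & S| + |A & P|) / (|A| + |S|).

    predicted: set of (source, target) links produced by the aligner (A)
    sure:      set of sure gold links (S)
    possible:  set of possible gold links (P); sure links count as possible
    """
    possible = possible | sure
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0  # nothing predicted and nothing expected
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


if __name__ == "__main__":
    gold_sure = {(0, 0), (1, 1)}
    gold_possible = {(2, 2)}
    # One predicted link is sure, the other is wrong: 1 - (1 + 1) / (2 + 2)
    print(aer({(0, 0), (1, 2)}, gold_sure, gold_possible))
    # → 0.5
```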