Scripts to preprocess training and test data for alignment experiments and to run and evaluate FastAlign and Mgiza.
- Install all necessary dependencies
- Export install locations for dependencies:
export {MOSES_DIR,FASTALIGN_DIR,MGIZA_DIR}=/foo/bar
- Make sure you set a reasonable default locale, e.g.:
export LC_ALL=en_US.UTF-8
- Create folder for your test data:
mkdir -p test
- Download Test Data for German-English and move it into the folder
test
- Run preprocessing:
./preprocess/run.sh
- Run Fastalign:
./scripts/run_fast_align.sh
- Run Giza:
./scripts/run_giza.sh
(This might take multiple days)
All results are in percent in the format: AlignmentErrorRate (Precision/Recall)
Method |
DeEn |
EnDe |
Grow-Diag |
Grow-Diag-Final |
FastAlign |
28.4% (71.3%/71.8%) |
32.0% (69.7%/66.4%) |
27.0% (84.6%/64.1%) |
27.7% (80.7%/65.5%) |
Mgiza |
21.0% (86.2%/72.8%) |
23.1% (86.6%/69.0%) |
21.4% (94.3%/67.2%) |
20.6% (91.3%/70.2%) |
Method |
RoEn |
EnRo |
Grow-Diag |
Grow-Diag-Final |
FastAlign |
33.8% (71.8%/61.3%) |
35.5% (70.6%/59.4%) |
32.1% (85.1%/56.5%) |
32.2% (81.4%/58.1%) |
Mgiza |
28.7% (82.7%/62.6%) |
32.2% (79.5%/59.1%) |
27.9% (94.0%/58.5%) |
26.4% (90.9%/61.8%) |
Method |
EnFr |
FrEn |
Grow-Diag |
Grow-Diag-Final |
FastAlign |
16.4% (80.0%/90.1%) |
15.9% (81.3%/88.7%) |
10.5% (90.8%/87.8%) |
12.1% (87.7%/88.3%) |
Mgiza |
8.0% (91.4%/92.9%) |
9.8% (91.6%/88.3%) |
5.9% (97.5%/89.7%) |
6.2% (95.5%/91.6%) |
- Does not work on MacOs
- Tokenization of the Canadian Hansards seems to be off when accents are present in the English text:
Ms. H é l è ne Alarie
, Mr. Andr é Harvey :
, Mr. R é al M é nard