Repository for my Master thesis on The effects of word segmentation quality on word alignments. The thesis PDF can be found here. This repository handles the following functions:
- Datasets: English-German, English-Romanian, English-Hindi. To any other datasets, add a folder with the names of the language pairs in
data/input
and under it the txt files with the following format: 'eng_with_X.txt', for X number of sentences and for both languages, and the gold standard. See examples indata/input
- Alignment models: Fastalign, Eflomal
- Sampling methods: Dropout
- Tokenization: space mode, no space mode
These parameters and others can be set in settings.py
.
Fastalign installation
sudo apt-get install libgoogle-perftools-dev libsparsehash-dev
cd /path/to/project
mkdir tools
cd tools
git clone https://github.com/clab/fast_align.git
cd fast_align
mkdir build
cd build
sudo apt install cmake
cmake ..
make
Eflomal installation
cd /path/to/project/tools
git clone https://github.com/robertostling/eflomal.git
cd eflomal
make
sudo make install
python3 setup.py install
Install dependencies
pip -r install requirements.txt
Modify settings.py
for your desired parameters. To run all pipeline:
./run.sh
If you get an error like /bin/bash^M: bad interpreter: No such file
, run this:
sed -i -e 's/\r$//' run.sh # https://stackoverflow.com/questions/14219092/bash-script-and-bin-bashm-bad-interpreter-no-such-file-or-directory
./run.sh
.
├── data
│ ├── input
│ │ ├── eng-deu
│ │ │ ├── eng_with_10k.txt # input txt file with 10k english sentences
│ │ │ ├── deu_with_10k.txt
│ │ │ ├── eng_deu.gold # gold standard alignments
│ │ │ ├── eng.model # merge list for english, space mode
│ │ │ ├── deu.model
│ │ │ ├── eng_ns.model # merge list for english, no space mode
│ │ │ └── deu_ns.model
│ │ ├── eng-ron
│ │ └── eng-hin
│ ├── normal_bpe
│ │ ├── segmentations # files obtained by applying BPE to corpus
│ │ │ └── *.bpe
│ │ ├── fastalign # files obtained from fastalign
│ │ │ └── *.wgdfa
│ │ └── eflomal # files obtained from eflomal
│ │ └── .wgdfa
│ └── dropout_bpe
│ ├── segmentations
│ │ └── *.bpe
│ ├── fastalign
│ │ └── *.wgdfa
│ └── eflomal
├── doc # LaTeX files for the writing of the thesis
│ ├── figures
│ ├── sections
│ └── *.tex files
├── reports
│ ├── scores_normal_bpe # scores for BPE
│ │ └── *.csv, *.png
│ └── scores_dropout_bpe # scores for BPE dropout space/no space, and depending on dropout rate
│ ├── space
│ │ ├── 0.1
│ │ └── 0.2
│ | └── *.csv, *.png
│ └── no space
│ └── 0.1
│ └── *.csv, *.png
├── src # python files
│ ├── learn_bpe.py
│ ├── apply_bpe.py
│ ├── extract_alignments.py
│ └── calc_align_score.py
├── tools # fastalign, eflomal installation directories
│ ├── fastalign
│ └── eflomal
├── .gitignore
├── README.md
├── requirements.txt
└── settings.py