thesis

Repository for my Master thesis on The effects of word segmentation quality on word alignments. The thesis PDF can be found here. This repository handles the following functions:

Datasets: English-German, English-Romanian, English-Hindi. To any other datasets, add a folder with the names of the language pairs in data/input and under it the txt files with the following format: 'eng_with_X.txt', for X number of sentences and for both languages, and the gold standard. See examples in data/input
Alignment models: Fastalign, Eflomal
Sampling methods: Dropout
Tokenization: space mode, no space mode

These parameters and others can be set in settings.py.

Installation and run

Fastalign installation

sudo apt-get install libgoogle-perftools-dev libsparsehash-dev
cd /path/to/project
mkdir tools
cd tools
git clone https://github.com/clab/fast_align.git
cd fast_align
mkdir build
cd build
sudo apt install cmake
cmake ..
make

Eflomal installation

cd /path/to/project/tools
git clone https://github.com/robertostling/eflomal.git
cd eflomal
make
sudo make install
python3 setup.py install

Install dependencies

pip -r install requirements.txt

Modify settings.py for your desired parameters. To run all pipeline:

./run.sh

If you get an error like /bin/bash^M: bad interpreter: No such file, run this:

sed -i -e 's/\r$//' run.sh # https://stackoverflow.com/questions/14219092/bash-script-and-bin-bashm-bad-interpreter-no-such-file-or-directory
./run.sh

Project structure

.
├── data
│   ├── input
│   │   ├── eng-deu
│   │   │   ├── eng_with_10k.txt   # input txt file with 10k english sentences
│   │   │   ├── deu_with_10k.txt
│   │   │   ├── eng_deu.gold       # gold standard alignments
│   │   │   ├── eng.model          # merge list for english, space mode
│   │   │   ├── deu.model
│   │   │   ├── eng_ns.model       # merge list for english, no space mode
│   │   │   └── deu_ns.model
│   │   ├── eng-ron
│   │   └── eng-hin
│   ├── normal_bpe
│   │   ├── segmentations      # files obtained by applying BPE to corpus
│   │   │   └── *.bpe
│   │   ├── fastalign          # files obtained from fastalign 
│   │   │   └── *.wgdfa
│   │   └── eflomal            # files obtained from eflomal 
│   │       └── .wgdfa
│   └── dropout_bpe
│       ├── segmentations
│       │   └── *.bpe
│       ├── fastalign
│       │   └── *.wgdfa
│       └── eflomal
├── doc                        # LaTeX files for the writing of the thesis
│   ├── figures
│   ├── sections
│   └── *.tex files
├── reports
│   ├── scores_normal_bpe      # scores for BPE
│   │   └── *.csv, *.png
│   └── scores_dropout_bpe     # scores for BPE dropout space/no space, and depending on dropout rate
│       ├── space
│       │   ├── 0.1
│       │   └── 0.2
│       |       └── *.csv, *.png
│       └── no space
│           └── 0.1
│               └── *.csv, *.png
├── src                        # python files
│   ├── learn_bpe.py
│   ├── apply_bpe.py
│   ├── extract_alignments.py
│   └── calc_align_score.py
├── tools                        # fastalign, eflomal installation directories
│   ├── fastalign
│   └── eflomal
├── .gitignore
├── README.md
├── requirements.txt
└── settings.py

anebz/thesis

thesis

Installation and run

Project structure