STRAND Aligner

An implementation of the HTML alignment algorithm of STRAND:

P. Resnik and N. A. Smith. The web as a parallel corpus. Computational Linguistics, 29(3):349-380, 2003.

This also contains an implementation of the Gale and Church sentence alignment algorithm:

W. A. Gale and K. W. Church. A program for aligning sentences in bilingual corpora. Computational Linguistics, 19:75-102, March 1993.
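
For orientation, here is a minimal sketch of the length-based match cost at the core of Gale and Church style alignment. It is not taken from this repository: the function name match_cost is hypothetical, and the constants (c = 1, s2 = 6.8, and the pattern priors) are the values commonly quoted from the 1993 paper, so check them against the paper before relying on them. The sketch normalizes by the averaged length so that a zero-length side does not cause a division by zero.

```python
import math

# Assumed constants, as commonly quoted from Gale and Church (1993).
C = 1.0    # expected target characters per source character
S2 = 6.8   # variance of the length ratio per source character

# Assumed prior probabilities for each alignment pattern (source, target).
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.0099, (0, 1): 0.0099,
    (2, 1): 0.089, (1, 2): 0.089,
    (2, 2): 0.011,
}


def match_cost(len1, len2, pattern=(1, 1)):
    """Return -log(prior * P(length mismatch)) for two character lengths.

    Lower cost means the two spans are more plausible translations of
    each other, judged by length alone.
    """
    if len1 == 0 and len2 == 0:
        return 0.0
    mean = (len1 + len2 / C) / 2.0
    z = abs(C * len1 - len2) / math.sqrt(S2 * mean)
    # Two-sided tail of the standard normal: 2 * (1 - Phi(z)) == erfc(z / sqrt(2)).
    tail = math.erfc(z / math.sqrt(2.0))
    # Floor the probability so that log() never sees zero.
    tail = max(tail, 1e-300)
    return -math.log(PRIORS[pattern]) - math.log(tail)


if __name__ == "__main__":
    print(match_cost(100, 100))  # similar lengths: low cost
    print(match_cost(100, 200))  # very different lengths: higher cost
```

A full aligner would feed this cost into a dynamic program over all six patterns; that part is omitted here.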

Installation prerequisites

NOTE: You might need sudo permissions for some of the steps below.

pip install numpy
pip install cython
pip install beautifulsoup4
sudo apt-get install libxml2-dev libxslt-dev python-dev
pip install lxml
sudo apt-get install python-scipy
pip install nltk
Open a Python shell and run:
import nltk
nltk.download()
In the downloader prompt, type 'd' for download, then enter 'punkt'.
Check out maxent from https://github.com/lzhang10/maxent.git
cd into the maxent folder.
Follow the installation instructions at https://github.com/lzhang10/maxent/blob/master/INSTALL to install the C++ maxent library.
cd into the python directory within maxent and follow the installation instructions in the README file.
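
Once the steps above are done, a quick check like the following can confirm that the Python prerequisites are importable. This is only a sketch: the helper names are hypothetical, and the import name 'maxent' for the maxent Python binding is an assumption (adjust it if the binding installs under a different name).

```python
import importlib

# Import names for the packages listed above.
# 'maxent' is an assumed module name for the maxent Python binding.
REQUIRED_MODULES = ["numpy", "Cython", "bs4", "lxml", "scipy", "nltk", "maxent"]


def missing_modules(names):
    """Return the subset of names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def punkt_available():
    """Return True if nltk is installed and the 'punkt' tokenizer data is found."""
    try:
        import nltk
        nltk.data.find("tokenizers/punkt")
        return True
    except (ImportError, LookupError):
        return False


if __name__ == "__main__":
    for name in missing_modules(REQUIRED_MODULES):
        print("missing module: " + name)
    if not punkt_available():
        print("punkt tokenizer data not found")
```

If punkt is reported missing, nltk.download('punkt') fetches it without going through the interactive prompt.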

cd into the 'data' directory under STRANDAligner. It contains a folder named 'articles' that holds 360 parallel pages named 1.{en,kz}, etc.
sudo python ../strand/run_strand_batch -i articles
When the above command completes, the following will have been created in the 'data' directory:
- chunks_output directory - contains files 1.chunks, 2.chunks, etc. Each file contains lengths of aligned chunks from parallel texts.
- chunks_tagged_output directory - contains files 1.tagged_chunks, etc. Each file contains aligned chunks together with tags.
- df_percentage file - contains data needed for the learning algorithm used in our paper.
- strand_output directory - contains files 1.{en,kz}, etc. with chunks.
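
To confirm that a run produced everything listed above, a small check along these lines can be used from inside the 'data' directory. It is a sketch only: the function name is hypothetical, and it checks that each output exists without making any assumption about what is inside the files.

```python
import os

# Output names as listed above: three directories and one file.
EXPECTED_DIRS = ["chunks_output", "chunks_tagged_output", "strand_output"]
EXPECTED_FILES = ["df_percentage"]


def missing_outputs(data_dir):
    """Return the names of expected outputs that are absent from data_dir."""
    missing = []
    for name in EXPECTED_DIRS:
        if not os.path.isdir(os.path.join(data_dir, name)):
            missing.append(name)
    for name in EXPECTED_FILES:
        if not os.path.isfile(os.path.join(data_dir, name)):
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_outputs(".")
    if absent:
        print("missing outputs: " + ", ".join(absent))
    else:
        print("all expected outputs are present")
```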