- Empirically verified Zipf’s law on three freely available corpora: the King James Bible, The Jungle Book, and the SETIMES Turkish-Bulgarian parallel newspaper text.
- Reimplemented the “Dissociated Press” system, which generates random text from an n-gram model trained on a corpus.
- Implemented a bigram part-of-speech (POS) tagger from scratch, based on a Hidden Markov Model decoded with the Viterbi algorithm.
- Implemented the Cocke-Kasami-Younger (CKY) algorithm for bottom-up CFG parsing and applied it to the word problem and the parsing problem for English.
- Implemented the IBM Model 1 word aligner for statistical machine translation and trained it on 100,000 English-French sentence pairs. Additionally, compared the results with a simple baseline and with the fast_align implementation.
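
As an illustration of the last item, here is a minimal sketch of IBM Model 1 training with expectation-maximization. It is not the project's actual code: the function names `train_ibm_model1` and `align`, the toy sentence pairs, and the choice to omit the NULL source word are all assumptions made to keep the example short.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(f | e) with EM.

    pairs: list of (english_tokens, french_tokens) tuples (hypothetical format).
    Simplification: no NULL word is added to the English side.
    """
    french_vocab = {f for _, fs in pairs for f in fs}
    # Uniform initialization over the French vocabulary.
    uniform = 1.0 / len(french_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked at all
        # E-step: distribute each French word over the English words.
        for es, fs in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    posterior = t[(f, e)] / z
                    count[(f, e)] += posterior
                    total[e] += posterior
        # M-step: renormalize the expected counts per English word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def align(es, fs, t):
    """Return (french_index, english_index) links, choosing the best e for each f."""
    return [
        (j, max(range(len(es)), key=lambda i: t[(f, es[i])]))
        for j, f in enumerate(fs)
    ]


# Toy data (made up for this sketch, not taken from the project's corpus).
toy_pairs = [
    (["the", "house"], ["la", "maison"]),
    (["the", "flower"], ["la", "fleur"]),
    (["a", "flower"], ["une", "fleur"]),
]
toy_t = train_ibm_model1(toy_pairs)
print(align(["the", "house"], ["la", "maison"], toy_t))
```

On this toy data the co-occurrence of "la" with "the" in two sentences is enough for EM to separate it from "maison", so each French word ends up linked to the English word in the same position. A full implementation would normally also add a NULL source word and evaluate against gold alignments.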