Word Segmentation and Artificial Languages

Makefile
    Use this to run an experiment. By default, the Unigram grammar is run on
    the first random Hawaiian corpus.

    To run on a different corpus version (there are currently 4 sets; there is
    nothing special about them, they were simply generated by running Robert's
    scripts 4 times), use

        make SET=0[1234]

    To run on a different language (right now, only Berber is available in
    addition to Hawaiian), use

        make LANGUAGE=berber

    To keep the model as simple as possible, the base distribution is assumed
    to be fixed: words are generated assuming a uniform phoneme distribution
    and a constant stopping probability of 0.5. Hyper-parameter inference is
    still performed to get around having to set arbitrary hyper-parameters.
    (As of now, the original DP model is used instead of the slightly more
    expressive PYP model.)

runs/
    Where experiments are put. There are two folders for every experiment: a
    Tmp folder, which contains the actual data (inputs, outputs, trace files)
    for the different runs of the same experiment, and an Eval folder, which
    only contains a "summary" file that gives a single score.

    Experiments are referred to by the language plus the grammar name, e.g.
    runs/hawaiian_unigram(Tmp|Eval). To distinguish different corpus versions,
    each file has "s0[1234]" in its name. To illustrate,

        runs/berber_unigramEval/r00_Gunigram_n1000_w1_b1_g100_h0.01_R-1_s01.trscore
        runs/berber_unigramEval/r00_Gunigram_n1000_w1_b1_g100_h0.01_R-1_s02.trscore

    are the score files for the unigram model on two different random Berber
    corpora.

data/
    Contains test corpora, named corpus_<language>_<set>.txt, e.g.

        corpus_hawaiian_01.txt
        corpus_berber_04.txt

scripts/
    Contains scripts to generate artificial corpora and analyse their
    properties.

    corpus/
        Scripts to generate corpora.

    analysis/
        Scripts to perform posterior analysis (input / output projections).
        Use as follows:

            cat runs/hawaiian_unigramTmp/r00_Gunigram_n1000_w1_b1_g100_h0.01_R-1_s02_fold0*.trsws | python scripts/analysis/posteriorOutputprojections.py runs/hawaiian_unigramTmp/AGgold_02.txt

    ambiguity/
        Scripts to calculate segmentation ambiguity.

dictionaries/
    Contains dictionaries generated by Robert's MaxEnt-grammar-generator.

prog_seg/
    Contains scripts required by the Adaptor Grammar evaluation.

py-cfg/
    Contains the source code for Adaptor Grammars (use the Makefile in that
    directory to build); only needs to be compiled once.
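
The fixed base distribution described under Makefile (uniform phonemes, constant stopping probability of 0.5) can be sketched as follows. This is an illustrative sketch only, not code from the repository; the phoneme inventory used here is a made-up placeholder, since the actual inventories come from the corpus files.

```python
import random

# Hypothetical phoneme inventory for illustration; the real inventories
# are determined by the generated Hawaiian/Berber corpora.
PHONEMES = list("aeiouhklmnpw")

def sample_word(phonemes=PHONEMES, p_stop=0.5, rng=random):
    """Sample a word from the fixed base distribution: after emitting each
    phoneme (drawn uniformly), stop with constant probability p_stop."""
    word = [rng.choice(phonemes)]  # every word has at least one phoneme
    while rng.random() >= p_stop:
        word.append(rng.choice(phonemes))
    return "".join(word)

def base_prob(word, n_phonemes=len(PHONEMES), p_stop=0.5):
    """Probability of a word under this base distribution: each of the k
    phonemes has probability 1/n, with k-1 continuations and one stop."""
    k = len(word)
    return (1.0 / n_phonemes) ** k * (1.0 - p_stop) ** (k - 1) * p_stop
```

Because the stopping probability is constant, word lengths are geometrically distributed with mean 1/p_stop = 2 phonemes, and the probabilities sum to one over all possible strings.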