This is the second assignment for the course "Algorithms for Speech and Natural Language Processing" of the 2018/2019 MVA Master class.
- Python 3
- nltk, scikit-learn, PYEVALB packages.
pip install nltk
pip install scikit-learn
pip install PYEVALB
: - File containing training data (default='sequoia-corpus+fct.mrg_strict')
: - File containing test data (default='test_data')
: - Whether to split the data into train-eval datasets or train on the whole training set.
: - Ratio of the data to train on, used only with train_eval option (default=0.9)
: - Name of the output parse file (default='output_parse')
: - Number of closest words w.r.t formal similarity (default=2)
: - Number of closest words w.r.t embedding similarity (default=20)
: - Use Damerau-Levenshtein distance when true and Levenshtein distance otherwise.
: - Float in [0, 1], the interpolation parameter between bigram and unigram models (default=0.8)
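Two of the options above can be illustrated with a short sketch: the switch between Levenshtein and Damerau-Levenshtein distance, and the bigram/unigram interpolation weight. This is not the project's actual code. The function names `edit_distance` and `interpolated_prob` are hypothetical, the Damerau variant shown is the restricted "optimal string alignment" form, and the sketch assumes that the interpolation parameter weights the bigram estimate (the option description does not say which side it applies to).

```python
from collections import Counter


def edit_distance(a, b, damerau=True):
    """Levenshtein distance between a and b.

    With damerau=True, swapping two adjacent characters also costs 1
    (optimal string alignment variant of Damerau-Levenshtein).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (damerau and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[rows - 1][cols - 1]


def interpolated_prob(prev, word, unigrams, bigrams, lam=0.8):
    """lam * P(word | prev) + (1 - lam) * P(word).

    Assumption: lam weights the bigram model. P(word | prev) is estimated
    as count(prev, word) / count(prev), a simple maximum-likelihood estimate.
    """
    total = sum(unigrams.values())
    p_uni = unigrams[word] / total if total else 0.0
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni


# Toy corpus, for illustration only.
tokens = ["le", "chat", "dort", "le", "chien", "dort"]
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

print(edit_distance("chien", "cihen", damerau=True))   # 1 (one transposition)
print(edit_distance("chien", "cihen", damerau=False))  # 2 (two substitutions)
print(interpolated_prob("le", "chat", unigrams, bigrams, lam=0.8))
```

A transposed pair of letters counts as one edit under Damerau-Levenshtein and as two under plain Levenshtein, which changes which known words are considered closest to a misspelled one.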
To train on the first 90% of the Sequoia treebank dataset and evaluate on the last 10%, use the following command:
python --train_eval --train_size 0.9
bash --train_eval --train_size 0.9
With the default parameters you will get "output_parse" as the output file name.
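The train/eval split described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `split_train_eval` is a hypothetical helper, and it assumes the treebank has already been read into a list with one bracketed tree per entry, in file order.

```python
def split_train_eval(trees, train_size=0.9):
    """Return (train, eval): the first train_size fraction and the rest.

    The split is positional, not shuffled, so the evaluation set is
    always the tail of the corpus.
    """
    if not 0.0 <= train_size <= 1.0:
        raise ValueError("train_size must be in [0, 1]")
    cut = int(len(trees) * train_size)
    return trees[:cut], trees[cut:]


# Placeholder strings standing in for bracketed parse trees.
trees = ["tree_%d" % i for i in range(10)]
train, held_out = split_train_eval(trees, train_size=0.9)
print(len(train), len(held_out))  # 9 1
```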
To train on the whole sequoia treebank dataset and test on an input file with space-tokenized sentences, use:
python --data_test 'test_data'
bash --data_test 'test_data'
With the default parameters you will get "output_parse" as the output file name. If you don't specify --data_test, the parser tests on the 'test_data' file by default.
You can find the parse result for the last 10% of the Sequoia treebank dataset in the file 'evaluation_data.parser_output'.