Text tokenization and annotation scripts for Ruscorpora
tokenizer.py: annotates text with sentence and word tags. Uses punctuation and <p>, <tr>, <td>, <th>, <table>, <body> tags as sentence delimiters.
morpho_tagger.py: adds grammatical analysis <ana> tags to the words. Splits compound words into additional <w> parts according to the lemmer output.
annotate_texts.py: runs the complete two-stage tagging (tokenization, then morphological tagging). Accepts the regular morpho_tagger.py options.
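
The sentence and word tagging described for tokenizer.py can be illustrated with a minimal sketch. This is not the script's actual code: the function names, the regular expressions, and the <se> sentence tag are assumptions made for illustration. Only the <w> tag and the list of delimiter tags come from the description above.

```python
import re

# Block-level tags listed in the tokenizer.py description; treated as hard sentence breaks.
BLOCK_TAGS = ("p", "tr", "td", "th", "table", "body")
BLOCK_TAG_RE = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(BLOCK_TAGS), re.IGNORECASE)

# Simplified rule: a sentence ends at '.', '!', '?' or an ellipsis followed by whitespace.
SENT_END_RE = re.compile(r"(?<=[.!?…])\s+")
WORD_RE = re.compile(r"\w+")


def split_sentences(text):
    """Split text into sentences on block-level tags and final punctuation."""
    sentences = []
    for chunk in BLOCK_TAG_RE.split(text):
        for sent in SENT_END_RE.split(chunk):
            sent = sent.strip()
            if sent:
                sentences.append(sent)
    return sentences


def tag_words(sentence):
    """Wrap every run of word characters in <w>...</w>."""
    return WORD_RE.sub(lambda m: "<w>%s</w>" % m.group(0), sentence)


def annotate(text):
    """Return sentences wrapped in <se> (assumed tag name), with words in <w>."""
    return ["<se>%s</se>" % tag_words(s) for s in split_sentences(text)]
```

For example, annotate("<td>Hi there.</td>") returns a single sentence with both words wrapped in <w>. A real tokenizer would also have to handle abbreviations, numbers, and inline markup, which this sketch ignores.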
The compiled lemmer binding 'liblemmer_python_binding.so' is required by morpho_tagger.py and is not currently included in the repository.
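
Since the lemmer binding is not available, the way <ana> tags might be attached to <w> elements can only be sketched with a stand-in analyzer. Everything below is hypothetical: toy_analyze, its small lexicon, and the lex/gr attribute names are assumptions for illustration and do not describe the real lemmer interface or the exact output of morpho_tagger.py. Compound splitting is not shown.

```python
import re
from xml.sax.saxutils import quoteattr

WORD_TAG_RE = re.compile(r"<w>(.*?)</w>", re.DOTALL)


def toy_analyze(word):
    """Hypothetical stand-in for the lemmer: returns (lemma, grammar) pairs."""
    toy_lexicon = {
        "cats": [("cat", "S,pl")],
        "run": [("run", "V"), ("run", "S,sg")],
    }
    return toy_lexicon.get(word.lower(), [(word.lower(), "UNKNOWN")])


def add_ana(tagged_text, analyze=toy_analyze):
    """Insert one <ana/> element per analysis at the start of each <w> element."""
    def repl(match):
        word = match.group(1)
        anas = "".join(
            "<ana lex=%s gr=%s/>" % (quoteattr(lemma), quoteattr(gram))
            for lemma, gram in analyze(word)
        )
        return "<w>%s%s</w>" % (anas, word)

    return WORD_TAG_RE.sub(repl, tagged_text)
```

An ambiguous word receives several <ana> elements, one per analysis returned by the analyzer.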