ruscorpora_tagging

Text tokenization and annotation scripts for Ruscorpora

tokenizer.py: annotates text with sentence and word tags. Punctuation and the <p>, <tr>, <td>, <th>, <table>, <body> tags are treated as sentence delimiters.
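
A minimal sketch of the idea, not the actual tokenizer.py code. It assumes that sentences are wrapped in <se> and words in <w> (the <se> name is an assumption; only <w> is mentioned in this README), and it handles only punctuation delimiters, not the markup tags listed above. The names tag_sentence and tag_text are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Assumption: whitespace after sentence-final punctuation ends a sentence.
SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")
# A word is a run of letters/digits, optionally joined by internal hyphens.
WORD = re.compile(r"[^\W_]+(?:-[^\W_]+)*")


def tag_sentence(sentence):
    """Wrap each word in <w>, leaving punctuation and spaces between the tags."""
    out = []
    pos = 0
    for match in WORD.finditer(sentence):
        out.append(escape(sentence[pos:match.start()]))
        out.append("<w>" + escape(match.group()) + "</w>")
        pos = match.end()
    out.append(escape(sentence[pos:]))
    return "".join(out)


def tag_text(text):
    """Split text into sentences and wrap each one in <se> (assumed tag name)."""
    sentences = [s for s in SENTENCE_END.split(text.strip()) if s]
    return "\n".join("<se>" + tag_sentence(s) + "</se>" for s in sentences)


if __name__ == "__main__":
    print(tag_text("Мама мыла раму. Кто-то пришёл!"))
```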

morpho_tagger.py: adds grammatical analysis <ana> tags to the words. Splits compound words into additional <w> parts according to the lemmer output.
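
A rough sketch of what this stage produces, under stated assumptions. The real lemmer is a compiled binding, so a toy dictionary (TOY_LEMMER, hypothetical) stands in for it here. The lex and gr attribute names, the placement of <ana> inside <w>, and the grammar labels are assumptions for illustration, not taken from morpho_tagger.py.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical stand-in for the lemmer output:
# token -> list of (part, [(lemma, grammar), ...]).
# A token with more than one part models a compound word.
TOY_LEMMER = {
    "мыла": [("мыла", [("мыть", "V"), ("мыло", "S")])],
    "диван-кровать": [
        ("диван", [("диван", "S")]),
        ("кровать", [("кровать", "S")]),
    ],
}


def annotate_word(token, lemmer=TOY_LEMMER):
    """Return one <w> element per part, each with its <ana> analyses."""
    # Unknown tokens are kept as a single part with no analyses.
    parts = lemmer.get(token, [(token, [])])
    tagged = []
    for part, analyses in parts:
        anas = "".join(
            "<ana lex=%s gr=%s/>" % (quoteattr(lex), quoteattr(gr))
            for lex, gr in analyses
        )
        tagged.append("<w>" + anas + escape(part) + "</w>")
    return tagged


if __name__ == "__main__":
    for word in ("мыла", "диван-кровать", "xyz"):
        print(annotate_word(word))
```

An ambiguous word gets several <ana> tags inside one <w>, while a compound yields several <w> elements.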

annotate_texts.py: runs the complete two-stage tagging (tokenization, then morphological tagging). Takes the regular morpho_tagger.py options.

The compiled lemmer binding 'liblemmer_python_binding.so' is required by morpho_tagger.py and is not currently included in the repository.
