
Word segmentation to create unigrams in Portuguese (pt-br)


unigrams pt-br

Unigrams generated from 16 files provided by NILC - Núcleo Interinstitucional de Linguística Computacional.

Together, these files comprise more than 681,639,644 tokens:

  • Wikipedia (pt-br) - 2016
  • Google News
  • SubIMDB-PT
  • G1
  • PNL-Br
  • Literary works in the public domain
  • Lacio-Web
  • Portuguese e-books
  • Mundo Estranho
  • CHC
  • Fapesp
  • Textbooks
  • Folhinha
  • NILC subcorpus
  • Para seu filho ler
  • SARESP

The files are available at:

http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc

This file was created to provide unigrams for use with the word segmentation algorithm:

https://github.com/grantjenks/python-wordsegment
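A unigram file like this one supplies the word frequencies that drive such a segmenter. The sketch below shows the core idea with a toy hand-made frequency table and a hypothetical Portuguese input; it is a simplified illustration, not the actual python-wordsegment implementation (which also scores bigrams and loads its counts from files like the one in this repository).

```python
import math
from functools import lru_cache

# Toy unigram counts standing in for the pt-br frequency file (assumed values).
UNIGRAMS = {"meu": 50, "nome": 40, "bom": 30, "dia": 60, "tarde": 25}
TOTAL = sum(UNIGRAMS.values())

def score(word):
    # Log-probability of a candidate word; unseen words are penalized
    # in proportion to their length, similar to python-wordsegment.
    if word in UNIGRAMS:
        return math.log(UNIGRAMS[word] / TOTAL)
    return math.log(10.0 / (TOTAL * 10 ** len(word)))

def segment(text):
    # Dynamic programming over split points: the best segmentation of
    # text[:i] is the best segmentation of some prefix text[:j]
    # extended by the final word text[j:i].
    @lru_cache(maxsize=None)
    def best(i):
        if i == 0:
            return (0.0, ())
        candidates = []
        for j in range(max(0, i - 20), i):
            prev_score, prev_words = best(j)
            word = text[j:i]
            candidates.append((prev_score + score(word), prev_words + (word,)))
        return max(candidates, key=lambda c: c[0])
    return list(best(len(text))[1])

print(segment("bomdia"))  # → ['bom', 'dia']
```

With real corpus counts loaded in place of the toy table, the same maximization splits running text into the most probable sequence of unigrams.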

The scripts used to create this file are npl_word_segment.py and group_files.py.