
Word segmentation to create unigrams in Portuguese (pt-br)


unigrams pt-br

Unigrams generated from 16 files provided by NILC - Núcleo Interinstitucional de Linguística Computacional.

Together, these files comprise more than 681,639,644 tokens:

  • Wikipedia (pt-br) - 2016
  • Google News
  • SubIMDB-PT
  • G1
  • PNL-Br
  • Literary works in the public domain
  • Lacio-Web
  • Portuguese e-books
  • Mundo Estranho
  • CHC
  • Fapesp
  • Textbooks
  • Folhinha
  • NILC subcorpus
  • Para seu filho ler
  • SARESP

The files are available at:

http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc

This file was created to provide unigrams for use with the word segmentation algorithm:

https://github.com/grantjenks/python-wordsegment
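A unigram file like this one supplies the word frequencies that drive such a segmenter. The sketch below shows the core idea with a toy hand-made frequency table and a hypothetical Portuguese input; it is a simplified illustration, not the actual python-wordsegment implementation (which also scores bigrams and loads its counts from files like the one in this repository).

```python
import math
from functools import lru_cache

# Toy unigram counts standing in for the pt-br frequency file (assumed values).
UNIGRAMS = {"meu": 50, "nome": 40, "bom": 30, "dia": 60, "tarde": 25}
TOTAL = sum(UNIGRAMS.values())

def score(word):
    # Log-probability of a candidate word; unseen words are penalized
    # in proportion to their length, similar to python-wordsegment.
    if word in UNIGRAMS:
        return math.log(UNIGRAMS[word] / TOTAL)
    return math.log(10.0 / (TOTAL * 10 ** len(word)))

def segment(text):
    # Dynamic programming over split points: the best segmentation of
    # text[:i] is the best segmentation of some prefix text[:j]
    # extended by the final word text[j:i].
    @lru_cache(maxsize=None)
    def best(i):
        if i == 0:
            return (0.0, ())
        candidates = []
        for j in range(max(0, i - 20), i):
            prev_score, prev_words = best(j)
            word = text[j:i]
            candidates.append((prev_score + score(word), prev_words + (word,)))
        return max(candidates, key=lambda c: c[0])
    return list(best(len(text))[1])

print(segment("bomdia"))  # → ['bom', 'dia']
```

With real corpus counts loaded in place of the toy table, the same maximization splits running text into the most probable sequence of unigrams.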

The scripts used to create this file are npl_word_segment.py and group_files.py.