proycon/colibri-core

Provide vocabulary file

naiaden opened this issue · 1 comment

I would like a feature that lets me limit the classes to a given vocabulary. When you want to reproduce experiments by others, you are often given a vocabulary as well. Right now there is no trivial way to limit the words to a certain vocabulary without sacrificing efficiency in the encoding.

What I want is to pass a vocabulary as a parameter, so that the class file is limited to the words found in that vocabulary. All other words are mapped to OOV.
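To make the request concrete, here is a rough sketch of the behaviour I have in mind. It is plain Python and not Colibri Core's actual API: the names `build_class_encoding`, `encode` and the reserved id `OOV_CLASS = 0` are assumptions for illustration only. Frequent words get the lowest class numbers so the encoding stays compact, and anything outside the vocabulary falls back to the OOV class.

```python
from collections import Counter

# Assumption: a single reserved class id for out-of-vocabulary words.
OOV_CLASS = 0


def build_class_encoding(tokens, vocabulary):
    """Assign integer classes to in-vocabulary words, most frequent first."""
    counts = Counter(t for t in tokens if t in vocabulary)
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked, start=OOV_CLASS + 1)}


def encode(tokens, classes):
    """Replace each token by its class; unknown words become OOV_CLASS."""
    return [classes.get(t, OOV_CLASS) for t in tokens]


tokens = "the cat sat on the mat with the other cat".split()
vocabulary = {"the", "cat", "mat"}

classes = build_class_encoding(tokens, vocabulary)
encoded = encode(tokens, classes)
print(classes)  # {'the': 1, 'cat': 2, 'mat': 3}
print(encoded)  # [1, 2, 0, 0, 1, 3, 0, 1, 0, 2]
```

In this sketch, vocabulary words that never occur in the corpus get no class at all; whether they should still be assigned one is an open design choice.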

  • implement top-x classes as well, pruning tail of class encoding
  • ensure pattern model training properly ignores patterns with OOV
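The two items above could look roughly like this. Again this is only an illustrative Python sketch under my own assumptions (`top_x_classes`, `count_ngrams` and `OOV_CLASS = 0` are hypothetical names, not existing Colibri Core functions): keep only the x most frequent words as classes, map the tail to OOV, and skip every n-gram that contains the OOV class while counting patterns.

```python
from collections import Counter

# Assumption: a single reserved class id for out-of-vocabulary words.
OOV_CLASS = 0


def top_x_classes(tokens, x):
    """Keep only the x most frequent words; the tail gets no class."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically for determinism.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))[:x]
    return {word: i for i, word in enumerate(ranked, start=OOV_CLASS + 1)}


def count_ngrams(encoded, n):
    """Count n-grams over class ids, ignoring any n-gram containing OOV."""
    patterns = Counter()
    for i in range(len(encoded) - n + 1):
        ngram = tuple(encoded[i:i + n])
        if OOV_CLASS in ngram:
            continue  # patterns with OOV are not counted
        patterns[ngram] += 1
    return patterns


tokens = "the cat sat on the cat mat".split()
classes = top_x_classes(tokens, 2)
encoded = [classes.get(t, OOV_CLASS) for t in tokens]
bigrams = count_ngrams(encoded, 2)
print(classes)        # {'cat': 1, 'the': 2}
print(encoded)        # [2, 1, 0, 0, 2, 1, 0]
print(dict(bigrams))  # {(2, 1): 2}
```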