koaning/whatlies

Add Tokenizers

koaning opened this issue · 0 comments

The goal here is to allow users to experiment with tokenizers from the tokenizers library and train their own subword embeddings. The main point is that it is hard to train your own subword embeddings without a tokenizer you can train yourself. BPEmb is cool because it is pretrained, but that library leaves out a method to train your own.
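For reference, here is a minimal sketch of what training a subword tokenizer with the tokenizers library could look like (the corpus file and vocabulary size are made-up placeholders, not anything this issue prescribes):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build an untrained BPE tokenizer that splits on whitespace first.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train the subword vocabulary on a plain-text corpus.
# "corpus.txt" and vocab_size=5000 are placeholder values.
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# The resulting subword tokens could then feed an embedding trainer.
print(tokenizer.encode("subword embeddings are neat").tokens)
```

The subword tokens coming out of `encode(...).tokens` could then be passed to something like gensim's Word2Vec to actually train the embeddings, which is the part BPEmb does not offer.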

It would also open the door to experimental tokenizers, like phonetic ones and/or libraries like pyphen.
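As an illustration, pyphen's hyphenation dictionaries could act as a crude syllable-level tokenizer; this is just a sketch of the idea, not a proposed API:

```python
import pyphen

# Hyphenation dictionaries split words at syllable-ish boundaries,
# which could serve as an experimental "phonetic" tokenizer.
dic = pyphen.Pyphen(lang="en_US")

def syllable_tokenize(text):
    # Split each word at its hyphenation points.
    return [piece
            for word in text.split()
            for piece in dic.inserted(word).split("-")]

# Prints syllable-like pieces for each word in the sentence.
print(syllable_tokenize("experimental tokenizers are interesting"))
```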