facebookresearch/LASER

Port LASER-2,3 text preprocessing into Python

avidale opened this issue · 1 comment

Currently, text preprocessing is performed via third-party command line tools (Moses and SentencePiece), which makes their use less convenient, especially when processing one sentence at a time.

We will need to switch to their Python implementations (i.e. sacremoses and the Python interface for SentencePiece) and wrap them in a single interface, similar to Tokenizer in the transformers package, that is responsible for all the text preprocessing.
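A rough sketch of what such a wrapper could look like is below. The names `SentenceTokenizer` and `load_tokenizer` are hypothetical, not an existing LASER API. The exact sequence of Moses steps that LASER needs is not reproduced here: only punctuation normalization followed by SentencePiece encoding is shown, as an assumption about the minimal pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SentenceTokenizer:
    """Hypothetical wrapper: normalize text, then split it into subword pieces."""

    normalize: Callable[[str], str]     # e.g. a Moses-style punctuation normalizer
    encode: Callable[[str], List[str]]  # e.g. a SentencePiece encoder

    def tokenize(self, text: str) -> List[str]:
        # Preprocess one sentence in-process, without calling command line tools.
        return self.encode(self.normalize(text.strip()))

    def __call__(self, text: str) -> str:
        # Return the pieces joined by single spaces.
        return " ".join(self.tokenize(text))


def load_tokenizer(spm_model_path: str, lang: str = "en") -> SentenceTokenizer:
    """Build the wrapper from sacremoses and sentencepiece (imported lazily)."""
    from sacremoses import MosesPunctNormalizer
    import sentencepiece as spm

    normalizer = MosesPunctNormalizer(lang=lang)
    processor = spm.SentencePieceProcessor(model_file=spm_model_path)
    return SentenceTokenizer(
        normalize=normalizer.normalize,
        # out_type=str returns subword strings instead of integer ids.
        encode=lambda s: processor.encode(s, out_type=str),
    )
```

With a trained SentencePiece model on disk, `load_tokenizer("path/to/model")("Some sentence.")` would return the space-joined pieces for one sentence. Keeping the two steps as plain callables also makes it easy to swap in re-implemented Moses scripts later.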

Some of the Moses scripts may be available in the Stopes repository, and some of them might need to be re-implemented from scratch.

Done in #249