Helsinki-NLP/OpusFilter

SentenceEmbeddingFilter chunksize clashes with general chunksize

miau1 opened this issue · 1 comment

miau1 commented

The general chunksize in the common options is 100000 by default, while the SentenceEmbeddingFilter chunksize is 200 by default.

When using the score function with the default chunksizes, only 200 sentence pairs are processed per 100000 sentence pairs in the data. For example, if you are scoring data with fewer than 100k sentence pairs, the resulting score file will have only 200 scores. If the data has more than 100k pairs but fewer than 200k, the result will have 400 scores, and so on.
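The arithmetic above can be illustrated with a minimal sketch. This is a hypothetical reconstruction of the suspected interaction, not OpusFilter's actual implementation: the outer loop feeds chunks of the general chunksize to the filter, but the filter only scores its own first inner chunk and drops the rest.

```python
def chunked(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def buggy_score(pairs, filter_chunksize=200):
    # Suspected bug pattern: only the first inner chunk of the filter's own
    # chunksize gets scored; the remaining pairs in the chunk are dropped.
    first = next(chunked(pairs, filter_chunksize))
    return [0.0 for _ in first]  # dummy scores

def score_file(pairs, general_chunksize=100000, filter_chunksize=200):
    # Outer loop over the general chunksize from the common options.
    scores = []
    for chunk in chunked(pairs, general_chunksize):
        scores.extend(buggy_score(chunk, filter_chunksize))
    return scores

# 150000 input pairs span two general chunks, so only 2 * 200 = 400
# scores come out instead of 150000.
print(len(score_file(range(150000))))  # 400
```

This reproduces the reported counts: one batch of 200 scores per started general chunk of 100000 pairs.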

When using the filter function, filtering seems to always hang after processing 200 sentence pairs.

SentenceEmbeddingFilter's score method was indeed broken and is now fixed in the fix-sentence-emb-chunking branch. However, I couldn't reproduce the problem with filter. Could you provide a minimal example for it?