SentenceEmbeddingFilter chunksize clashes with general chunksize
miau1 opened this issue · 1 comment
The general `chunksize` in common options is 100000 by default. The SentenceEmbeddingFilter `chunksize` is 200 by default.
When using the `score` function with the default chunksizes, only 200 sentence pairs are processed per 100000 sentence pairs in the data. For example, if you are scoring data with fewer than 100k sentence pairs, the resulting score file will have only 200 scores. If the data has more than 100k pairs but fewer than 200k, the result will have 400 scores, and so on.
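The interaction described above can be sketched as follows. This is a minimal illustration of the buggy pattern, not OpusFilter's actual code: `chunked` and `buggy_score` are hypothetical helpers, and only the two default chunksize values are taken from the issue.

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of up to `size` items from an iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

GENERAL_CHUNKSIZE = 100000   # default general chunksize in common options
FILTER_CHUNKSIZE = 200       # default SentenceEmbeddingFilter chunksize

def buggy_score(pairs):
    """Buggy pattern: for each outer chunk, only the FIRST inner chunk
    of FILTER_CHUNKSIZE pairs is ever scored; the rest are dropped."""
    for outer in chunked(pairs, GENERAL_CHUNKSIZE):
        first_inner = next(chunked(outer, FILTER_CHUNKSIZE))
        for _pair in first_inner:
            yield 1.0  # placeholder score

# 150k pairs span two outer chunks, so only 2 * 200 = 400 scores come out
pairs = [("src", "tgt")] * 150000
scores = list(buggy_score(pairs))
print(len(scores))  # → 400, matching the reported behavior
```

With correct chunking, the inner loop would iterate over *all* inner chunks of each outer chunk, producing one score per input pair.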
When using the `filter` function, filtering seems to always hang after processing 200 sentence pairs.
SentenceEmbeddingFilter's `score` method was broken, and is now fixed in the fix-sentence-emb-chunking branch. However, I couldn't replicate the problem in `filter`. Can you create a minimal example for it?