Specify different "unit" types in filters.
BrightXiaoHan opened this issue · 2 comments
BrightXiaoHan commented
I want to filter a parallel corpus for "English-Chinese", but in "LengthFilter" and "LengthRatioFilter" I can only specify one unit type.
Is it possible to configure them like this?
min_length: [20, 10]
max_length: [100, 200]
unit: [char, word]
svirpioj commented
It's a good point that some of the languages in your parallel data might work better with character counts, while others work better with word counts. I had not thought about it before.
I needed to think a bit about how to do this nicely without breaking backwards compatibility (i.e. still allowing non-list input), but now it's there in the develop branch (#40).
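To illustrate the idea of accepting either a single value or one value per language, here is a minimal Python sketch. It is not the actual OpusFilter code from #40; the helper names `as_list`, `segment_length`, and `accept_pair` are hypothetical, and word counting is assumed to be a plain whitespace split.

```python
def as_list(value, n):
    """Return a list of length n: lists pass through, scalars are repeated."""
    if isinstance(value, (list, tuple)):
        if len(value) != n:
            raise ValueError(f"expected {n} values, got {len(value)}")
        return list(value)
    return [value] * n


def segment_length(text, unit):
    """Length of a segment in characters or whitespace-separated words."""
    if unit == "char":
        return len(text)
    if unit == "word":
        return len(text.split())
    raise ValueError(f"unknown unit: {unit!r}")


def accept_pair(segments, min_length=1, max_length=100, unit="word"):
    """Accept the pair only if every segment is within its own length limits."""
    n = len(segments)
    mins = as_list(min_length, n)
    maxs = as_list(max_length, n)
    units = as_list(unit, n)
    lengths = [segment_length(s, u) for s, u in zip(segments, units)]
    return all(lo <= length <= hi for lo, length, hi in zip(mins, lengths, maxs))


# Per-language settings: words for English, characters for Chinese.
pair = ("This is a short English sentence.", "这是一个简短的中文句子。")
ok_list = accept_pair(pair, min_length=[3, 5], max_length=[50, 100], unit=["word", "char"])

# Old-style scalar settings keep working and apply to both sides.
ok_scalar = accept_pair(("a b c", "d e"), min_length=1, max_length=5, unit="word")
```

With this kind of normalization, existing configurations that give a single value behave as before, and list input simply sets the value for each side of the pair.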
BrightXiaoHan commented
Thanks