Helsinki-NLP/OpusFilter

Specify different "unit" types in filters.

BrightXiaoHan opened this issue · 2 comments

I want to filter an "English-Chinese" parallel corpus, but in "LengthFilter" and "LengthRatioFilter" I can only specify one unit type.

Is it possible to configure it like this?

min_length: [20, 10]
max_length: [100, 200]
unit: [char, word]

It's a good point that some languages in the parallel data might work better with character counts and others with word counts. I had not thought about it before.

I needed to think a bit about how to do this nicely without breaking backwards compatibility (i.e. still allowing non-list input), but now it's there in the develop branch (#40).
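To illustrate the idea of accepting either a single value or a per-language list, here is a minimal Python sketch. It is not the actual OpusFilter code: the helper names `per_side`, `segment_length`, and `accept_pair` are hypothetical, and "word" is assumed to mean a whitespace-separated token.

```python
def per_side(value, n_sides):
    """Return one value per segment: broadcast a scalar, or check a list's length."""
    if isinstance(value, (list, tuple)):
        if len(value) != n_sides:
            raise ValueError(f"expected {n_sides} values, got {len(value)}")
        return list(value)
    # Non-list input (the old behaviour): use the same value for every side.
    return [value] * n_sides


def segment_length(text, unit):
    """Length of a segment in characters or in whitespace-separated words."""
    if unit == "char":
        return len(text)
    if unit == "word":
        return len(text.split())
    raise ValueError(f"unknown unit: {unit!r}")


def accept_pair(segments, min_length=1, max_length=100, unit="word"):
    """Accept the pair only if every segment is within its own length limits."""
    n = len(segments)
    mins = per_side(min_length, n)
    maxs = per_side(max_length, n)
    units = per_side(unit, n)
    return all(
        lo <= segment_length(seg, u) <= hi
        for seg, lo, hi, u in zip(segments, mins, maxs, units)
    )
```

With this sketch, `accept_pair(pair, min_length=[3, 5], max_length=[50, 30], unit=["word", "char"])` checks the first segment in words and the second in characters, while `accept_pair(pair, unit="char")` keeps working with a single unit for both sides.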

Thanks