bitextor/bifixer

Long sentences are not being removed apparently

Hi!

monofixer and bifixer should both remove sentences longer than 5000 (the checks use `len()`, so the limit is presumably in characters rather than words):

if not args.ignore_long and (len(sentence) > 5000):

if not args.ignore_long and (len(source_sentence) > 5000 or len(target_sentence) > 5000):
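As a quick check of what those conditions measure, here is a minimal sketch. It assumes the sentence variables are plain Python strings, which this thread does not confirm. In that case `len()` counts characters, not words, and the 6000-word test line used in this report exceeds 5000 by either measure.

```python
# Build the same long line used in the reproduction commands.
sentence = " ".join(["a"] * 6000)

char_count = len(sentence)          # 6000 letters + 5999 spaces
word_count = len(sentence.split())  # whitespace-separated tokens

print(char_count)  # 11999
print(word_count)  # 6000
```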

The problem is that this does not seem to be working:

pip3 install bifixer==0.8.3

# monofixer

python -c "print('asd'); print(' '.join(['a']*6000)); print('asd')" \
  | monofixer --scol 1 --ignore_duplicates  -q - - es \
  | wc -w
# 6002

python -c "print('asd'); print(' '.join(['a']*6000)); print('asd')" \
  | monofixer --scol 1 --ignore_duplicates --ignore_long -q - - es \
  | wc -w
# 6002

# bifixer

python -c "print('asd\tasd'); print('asd\t' + ' '.join(['a']*6000)); print('asd\tasd')" \
  | bifixer --scol 1 --tcol 2 --ignore_duplicates  -q - - en es \
  | wc -w
# 6005

python -c "print('asd\tasd'); print('asd\t' + ' '.join(['a']*6000)); print('asd\tasd')" \
  | bifixer --scol 1 --tcol 2 --ignore_long --ignore_duplicates  -q - - en es \
  | wc -w
# 6005

Am I doing something wrong?

Thank you!

Long sentences are not removed; they are just ignored (not processed, but still written to the output).
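The difference between "ignored" and "removed" can be illustrated with a small sketch. This is not the actual bifixer code: `process` and `fix_sentence` are hypothetical names, and the fixing step is a stand-in. Only the length condition is taken from the check quoted in this issue.

```python
MAX_LENGTH = 5000  # threshold from the condition quoted in this issue


def fix_sentence(sentence: str) -> str:
    """Hypothetical stand-in for the real fixing steps."""
    return " ".join(sentence.split())


def process(sentences, ignore_long=False):
    """Yield every input sentence; long ones are passed through unfixed."""
    for sentence in sentences:
        if not ignore_long and len(sentence) > MAX_LENGTH:
            # Skipped by the fixer, but still written to the output.
            yield sentence
            continue
        yield fix_sentence(sentence)


lines = ["asd", " ".join(["a"] * 6000), "asd"]
output = list(process(lines))

print(len(output))                                # 3: nothing was dropped
print(sum(len(line.split()) for line in output))  # 6002, as reported by wc -w
```

Under this reading, the word count stays at 6002 with or without `--ignore_long`, which matches the results above. Actually dropping long lines would need a separate filtering step.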

The documentation is wrong about this; I'm fixing it.

Oh! Ok, thank you!