
Output columns are not the same as the input (even wrong scores?)

Closed this issue · 6 comments


I usually run Bicleaner and Bicleaner AI with --score_only and then paste the scores onto the input. Then I realized that this extra step should not be needed: if I remove --score_only, the output should be the same, with the score as an extra column. I tried it, but I did not get the same result, and this happens with both Bicleaner and Bicleaner AI.

What I was doing (method 1):

zcat file.gz \
  | cache -k 3,4 bicleaner-ai-classify --scol src_text --tcol trg_text --score_only --header  -q - - \
    /home/cgarcia/tmp/bitextor-tests-min-100-again2/data/bicleaner-ai-models/en-fr/ \
  | paste <(zcat file.gz) - \
  | pigz -c > bicleaner1.gz

What I tried, which should be equivalent (method 2):

zcat file.gz \
  | cache -k 3,4 bicleaner-ai-classify --scol src_text --tcol trg_text --header  -q - - \
    /home/cgarcia/tmp/bitextor-tests-min-100-again2/data/bicleaner-ai-models/en-fr/ \
  | pigz -c > bicleaner2.gz

Some tests I ran that show something is wrong:

# The content is not the same, but it should be
zcat bicleaner1.gz | md5sum
# b592abb53b6eb827a962448175bf3183
zcat bicleaner2.gz | md5sum
# fc8574a953060015085d37ce8d3694fe

zcat file.gz | wc -l
# 20423
zcat bicleaner1.gz | wc -l
# 20423
zcat bicleaner2.gz | wc -l
# 20423

# The content is not the same, but is similar:
zcat bicleaner1.gz | head | md5sum
# 81eec3f263d37b86b216fd74b6a6d9c6
zcat bicleaner2.gz | head | md5sum
# 81eec3f263d37b86b216fd74b6a6d9c6

# Check whether the first lines of the Bicleaner output are the same as the input (yes, they are):
zcat bicleaner1.gz | head | cut -f1,2,3,4,5,6,7 | md5sum
# 14d20a90e2413a42ebfca3fb5f2cec0e
zcat bicleaner2.gz | head | cut -f1,2,3,4,5,6,7 | md5sum
# 14d20a90e2413a42ebfca3fb5f2cec0e
zcat file.gz | head | md5sum
# 14d20a90e2413a42ebfca3fb5f2cec0e

# Check if the input columns are the same (method 1 will be ok since we used paste)
diff <(zcat file.gz) <(zcat bicleaner1.gz | cut -f1,2,3,4,5,6,7) | wc -l
# 0 -> OK and we expected this result
diff <(zcat file.gz) <(zcat bicleaner2.gz | cut -f1,2,3,4,5,6,7) | wc -l
# 34686 -> Almost ALL the content is wrong, since the max value we might get is ~20423 * 2

# Are the scores the same?
diff <(zcat bicleaner1.gz | cut -f8) <(zcat bicleaner2.gz | cut -f8) | wc -l
# 0 -> yes, the calculated scores are the same, but were they calculated from the sentences in bicleaner1.gz or from those in bicleaner2.gz? If they were calculated from the sentences that Bicleaner prints (i.e. bicleaner2.gz), then the scores are wrong with respect to the input data...

bicleaner1.gz and bicleaner2.gz do not contain the same content, but they are similar, and the first lines (i.e. zcat bicleaner1.gz | head) are:

src_url trg_url src_text        trg_text        bleualign_score src_deferred_hash       trg_deferred_hash       bicleaner_ai_score
    Sign Up to Volunteer - Greenpeace Canada Skip to Navigation  Inscrivez-vous pour recevoir nos opportunités de bénévolat - Greenpeace Canada  0.258199        32f07ec71aaaf85d+597168137c5d5274       3e87dfbe387a9294     0.005
    Menu Close Menu Selected: Greenpeace Canada Change Country   Menu Fermer le menu Sélectionné•s: Greenpeace Canada Changer de pays    0.299956        1834e69b2464bb67+a85013050d06f02c       1834e69b2464bb67+94b4a053aab89548    1.000
    International | English      International | English 1.000000        8c6fa63ce1c6fcea        8c6fa63ce1c6fcea        0
    Africa | English     Africa | English        1.000000        1c38daf7d1165577        1c38daf7d1165577        0
    Africa | Français    Africa | Français       1.000000        1cbc20304b275a25        1cbc20304b275a25        0
    Arabic | العربية     Arabic | العربية        1.000000        483ee5a8c342794b        483ee5a8c342794b        0
    Argentina | Español  Argentina | Español     1.000000        6f2cef7310e9086c        6f2cef7310e9086c        0
    Australia | English  Australia | English     1.000000        d475692cddef69ba        d475692cddef69ba        0
    Austria | Deutsch    Austria | Deutsch       1.000000        7fe80ae35534ab7c        7fe80ae35534ab7c        0

I found a specific example of wrong data that might work for a little sample:

zcat file.gz | fgrep 'Toronto Mobilization-Hub Launch Party!!' | wc -l
# 5
zcat bicleaner2.gz | fgrep 'Toronto Mobilization-Hub Launch Party!!' | wc -l
# 5

diff <(zcat file.gz | fgrep 'Toronto Mobilization-Hub Launch Party!!') <(zcat bicleaner2.gz | cut -f1,2,3,4,5,6,7 | fgrep 'Toronto Mobilization-Hub Launch Party!!') | wc -l
# 6
# < Toronto Mobilization-Hub Launch Party!!      Toronto Mobilization-Hub Launch Party!! 1.000000        822e9d43295af2f 822e9d43295af2f
# <      Toronto Mobilization-Hub Launch Party!! Toronto Mobilization-Hub Launch Party!! 1.000000        822e9d43295af2f 822e9d43295af2f
# ---
# >   Toronto Mobilization-Hub Launch Party!! Toronto Mobilization-Hub Launch Party!!      1.000000        822e9d43295af2f 822e9d43295af2f
# >   Toronto Mobilization-Hub Launch Party!! Toronto Mobilization-Hub Launch Party!!      1.000000        822e9d43295af2f 822e9d43295af2f

This leads me to wonder... are the scores being calculated from the original content piped to Bicleaner/Bicleaner AI, or from the wrong sentences that both Bicleaners output when --score_only is not set? If the scores are definitely calculated from the input content, method 1 is a workaround for this bug; if not, I guess neither method 1 nor method 2 is valid.

I'm not sure why this bug happens, but the implications might be that the score is wrong or, at least, that the output columns provided from the input are not the same.
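For reference, the paste pattern of method 1 can be wrapped in a small function so the input is read only once and a line-count mismatch is caught instead of being silently pasted. This is only a sketch under assumptions: `score_and_paste` is a hypothetical helper name (not part of Bicleaner or cache), and it assumes the scorer prints exactly one line per input line, which is what --score_only together with --header appears to do in method 1.

```shell
# score_and_paste SCORER [ARGS...]   (hypothetical helper, for illustration only)
# Reads TSV lines on stdin, runs SCORER on them, and appends SCORER's output
# as a new last column. SCORER must print exactly one line per input line.
score_and_paste() (
    input_tmp=$(mktemp) || exit 1
    scores_tmp=$(mktemp) || { rm -f "$input_tmp"; exit 1; }
    status=0

    # Keep an untouched copy of the input; this is what gets pasted back.
    cat > "$input_tmp"

    if ! "$@" < "$input_tmp" > "$scores_tmp"; then
        status=1
    else
        n_in=$(awk 'END { print NR }' "$input_tmp")
        n_out=$(awk 'END { print NR }' "$scores_tmp")
        if [ "$n_in" -ne "$n_out" ]; then
            echo "score_and_paste: got $n_out score lines for $n_in input lines" >&2
            status=1
        else
            paste "$input_tmp" "$scores_tmp"
        fi
    fi

    rm -f "$input_tmp" "$scores_tmp"
    exit "$status"
)
```

With the commands from method 1 it would be used roughly as `zcat file.gz | score_and_paste cache -k 3,4 bicleaner-ai-classify --scol src_text --tcol trg_text --score_only --header -q - - "$model_dir" | pigz -c > bicleaner1.gz`, where `$model_dir` is a placeholder for the model path.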

I tested this bug on:

Models I used:

Thank you!


I've been taking a very naive look into this, and I found this issue with urls:

  • File:
zcat file.gz  | grep "7c83b028cf21f355"
	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355
	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355
	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355
  • Bicleaner1:
zcat bicleaner1.gz  | grep "7c83b028cf21f355"
	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355	0
	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355	0
	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355	0
  • Bicleaner2:
zcat bicleaner2.gz  | grep "7c83b028cf21f355"
	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355	0
	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355	0
	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355	0

Since your diff takes all the columns into account, including the URLs, this URL issue gives the impression that more lines differ than actually do. I tried diff <(zcat file.gz | cut -f3,4) <(zcat bicleaner2.gz | cut -f3,4) | wc -l and I get 0 differences.

So I think we just need to fix whatever Bicleaner is doing when writing URLs to the output.
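A whole-line diff only says that lines differ, not where. A per-column count can narrow it down. The following is a minimal sketch, assuming both files are uncompressed, tab-separated and line-aligned; `column_diffs` is a hypothetical helper name, not something shipped with Bicleaner or cache.

```shell
# column_diffs FILE_A FILE_B   (hypothetical helper, for illustration only)
# Prints "column<TAB>number_of_differing_lines" for each column that differs.
# Assumes both files are uncompressed, tab-separated and line-aligned.
column_diffs() {
    awk -F'\t' '
        NR == FNR { first[FNR] = $0; next }        # load FILE_A into memory
        {
            n = split(first[FNR], a, "\t")         # fields of the matching FILE_A line
            m = (n > NF) ? n : NF
            if (m > max) max = m
            for (i = 1; i <= m; i++)
                # Appending "" forces a string comparison, so 1.0 and 1.000000 count as different.
                if ((a[i] "") != ($i "")) differing[i]++
        }
        END {
            for (i = 1; i <= max; i++)
                if (i in differing) print i "\t" differing[i]
        }
    ' "$1" "$2"
}
```

For the files in this issue it could be run after something like `zcat file.gz > in.tsv` and `zcat bicleaner2.gz | cut -f1-7 > out.tsv`, then `column_diffs in.tsv out.tsv`. If only the URLs are affected, only columns 1 and 2 should be listed.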

@cgr71ii Could you try running it without cache, just plain bicleaner?

Hi! Thanks for the help!

I've tried without cache as you suggested, and cache seems to be the cause of the problem... I'd never have guessed it... It's true that I should have tried removing every component external to Bicleaner first :/

And I'm observing that the problem seems to be related to the -k flag of cache, not to cache itself:

zcat file.gz | md5sum
# 199661296e84d638d5adcb17a27994a8
zcat file.gz | cache cat | md5sum
# 199661296e84d638d5adcb17a27994a8
zcat file.gz | cut -f1,2,3,4,5,6,7 | md5sum
# 199661296e84d638d5adcb17a27994a8
zcat file.gz | cache cut -f1,2,3,4,5,6,7 | md5sum
# 199661296e84d638d5adcb17a27994a8

zcat file.gz | cache -k 3,4 cat | md5sum
# 8dfdb3f73082abfe119049beb03e5aa6
zcat bicleaner2.gz | cut -f1,2,3,4,5,6,7 | md5sum # All fields but bicleaner score
# 8dfdb3f73082abfe119049beb03e5aa6

It seems you're totally right pointing out cache. Do you think it could be a regression from kpu/preprocess#35? I'm going to install the version of cache previous to the issue and I'll come back here to tell if it worked.

So, in the end it seems it is not a problem related to Bicleaner. Anyway, I think it'd be nice if the issue is kept open until the problem is totally solved, even if it seems that the main cause is cache (if it is ok, of course).

Sure! I'll keep this open. Let me know what you find :)

I think I got it, and I think it is not a bug but a wrong assumption of mine... I assumed that when cache finds a pair of sentences it has already seen, all the other fields would be the same too, but that's not true (it's almost true: every field depends on the pair of sentences except the URLs...). E.g.:

long_url1	long_url2	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355
long_url3	long_url4	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355
long_url5	long_url6	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355

# after `cache -k3,4`:

long_url1	long_url2	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355
long_url1	long_url2	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355
long_url1	long_url2	Formation Greenspeaker	Formation Greenspeaker	1.000000	7c83b028cf21f355	7c83b028cf21f355

The previous example shows that the URLs are the only fields that are independent of the sentences (remember cache -k 3,4: the key columns 3 and 4 are the piece I had missed). So, I tried:

diff \
    <(zcat bicleaner1.gz | awk -F$'\t' '{print $3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}') \
    <(zcat bicleaner2.gz | awk -F$'\t' '{print $3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}') \
  | wc -l
# 0 YEP!!!

So, as in your example running zcat bicleaner2.gz | grep "7c83b028cf21f355", the URLs are the only fields that are wrong. I assumed all the fields depended on the ones I specified as the key (i.e. -k 3,4), but that is not true, so cache can't be applied directly to the whole line; it should be applied using paste, as in method 1.
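The effect can be reproduced without cache at all. The snippet below is a toy model, not the real implementation of cache: it assumes the tool remembers the first line seen for each key (columns 3 and 4) and replays that line for every later input line with the same key. The function names `toy_cache_k34` and `sample` are made up for this illustration.

```shell
# Toy model of `cache -k 3,4 cat` (an assumption about its behaviour, NOT the real tool):
# the first line seen for each (column 3, column 4) key is stored and replayed
# for every later line that has the same key.
toy_cache_k34() {
    awk -F'\t' '{
        key = $3 FS $4
        if (!(key in seen)) seen[key] = $0   # first occurrence: remember the whole line
        print seen[key]                      # repeated key: replay the remembered line
    }'
}

# Three pairs with identical sentences but different URLs, as in the example above.
sample() {
    printf 'long_url1\tlong_url2\tFormation Greenspeaker\tFormation Greenspeaker\t1.000000\t7c83b028cf21f355\t7c83b028cf21f355\n'
    printf 'long_url3\tlong_url4\tFormation Greenspeaker\tFormation Greenspeaker\t1.000000\t7c83b028cf21f355\t7c83b028cf21f355\n'
    printf 'long_url5\tlong_url6\tFormation Greenspeaker\tFormation Greenspeaker\t1.000000\t7c83b028cf21f355\t7c83b028cf21f355\n'
}

# All three output lines carry long_url1 and long_url2.
sample | toy_cache_k34
```

Any column that is not part of the key and is not fully determined by it can be overwritten this way, which is why passing only the scores through cache and pasting them onto the untouched input keeps the URLs intact.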

Sorry for the long reply and wasting your time :(

Closing issue.

Thank you for the support!

No problem, @cgr71ii ! Thanks for the investigations 🔬