Output columns are not the same as the input (even wrong scores?)
Closed this issue · 6 comments
Hi!
I usually run Bicleaner and Bicleaner AI with --score_only and then apply paste. Then I realized that this extra step should not be needed: if I remove --score_only, the output should be the same but with an extra column for the score. I tried it, but I did not get the same result, and this happens with both Bicleaner and Bicleaner AI.
What I was doing (method 1):
zcat file.gz \
| cache -k 3,4 bicleaner-ai-classify --scol src_text --tcol trg_text --score_only --header -q - - \
/home/cgarcia/tmp/bitextor-tests-min-100-again2/data/bicleaner-ai-models/en-fr/ \
| paste <(zcat file.gz) - \
| pigz -c > bicleaner1.gz
What I tried and should be equivalent (method 2):
zcat file.gz \
| cache -k 3,4 bicleaner-ai-classify --scol src_text --tcol trg_text --header -q - - \
/home/cgarcia/tmp/bitextor-tests-min-100-again2/data/bicleaner-ai-models/en-fr/ \
| pigz -c > bicleaner2.gz
Some tests I ran that show something is wrong:
# The content is not the same, but it should be
zcat bicleaner1.gz | md5sum
# b592abb53b6eb827a962448175bf3183
zcat bicleaner2.gz | md5sum
# fc8574a953060015085d37ce8d3694fe
zcat file.gz | wc -l
# 20423
zcat bicleaner1.gz | wc -l
# 20423
zcat bicleaner2.gz | wc -l
# 20423
# The content is not the same, but is similar:
zcat bicleaner1.gz | head | md5sum
# 81eec3f263d37b86b216fd74b6a6d9c6
zcat bicleaner2.gz | head | md5sum
# 81eec3f263d37b86b216fd74b6a6d9c6
# Check whether the first lines of the Bicleaner output match the input (yes, they do):
zcat bicleaner1.gz | head | cut -f1,2,3,4,5,6,7 | md5sum
# 14d20a90e2413a42ebfca3fb5f2cec0e
zcat bicleaner2.gz | head | cut -f1,2,3,4,5,6,7 | md5sum
# 14d20a90e2413a42ebfca3fb5f2cec0e
zcat file.gz | head | md5sum
# 14d20a90e2413a42ebfca3fb5f2cec0e
# Check if the input columns are the same (method 1 will be ok since we used paste)
diff <(zcat file.gz) <(zcat bicleaner1.gz | cut -f1,2,3,4,5,6,7) | wc -l
# 0 -> OK and we expected this result
diff <(zcat file.gz) <(zcat bicleaner2.gz | cut -f1,2,3,4,5,6,7) | wc -l
# 34686 -> Almost ALL the content is wrong, since the max value we might get is ~20423 * 2
# Are the scores the same?
diff <(zcat bicleaner1.gz | cut -f8) <(zcat bicleaner2.gz | cut -f8) | wc -l
# 0 -> yes, the calculated scores are the same. But were they calculated from the sentences in bicleaner1.gz or from the ones in bicleaner2.gz? If they were calculated from the sentences that Bicleaner prints (i.e. bicleaner2.gz), then the scores are wrong with respect to the input data...
bicleaner1.gz and bicleaner2.gz do not contain the same content, but they are similar, and the first lines (i.e. zcat bicleaner1.gz | head) are:
src_url trg_url src_text trg_text bleualign_score src_deferred_hash trg_deferred_hash bicleaner_ai_score
https://www.greenpeace.org/canada/en/act/volunteer/sign-up-to-volunteer/ https://www.greenpeace.org/canada/fr/agir/impliquez-vous/inscrivez-vous-pour-recevoir-nos-opportunites-de-benevolat/ Sign Up to Volunteer - Greenpeace Canada Skip to Navigation Inscrivez-vous pour recevoir nos opportunités de bénévolat - Greenpeace Canada 0.258199 32f07ec71aaaf85d+597168137c5d5274 3e87dfbe387a9294 0.005
https://www.greenpeace.org/canada/en/act/volunteer/sign-up-to-volunteer/ https://www.greenpeace.org/canada/fr/agir/impliquez-vous/inscrivez-vous-pour-recevoir-nos-opportunites-de-benevolat/ Menu Close Menu Selected: Greenpeace Canada Change Country Menu Fermer le menu Sélectionné•s: Greenpeace Canada Changer de pays 0.299956 1834e69b2464bb67+a85013050d06f02c 1834e69b2464bb67+94b4a053aab89548 1.000
https://www.greenpeace.org/canada/en/act/volunteer/sign-up-to-volunteer/ https://www.greenpeace.org/canada/fr/agir/impliquez-vous/inscrivez-vous-pour-recevoir-nos-opportunites-de-benevolat/ International | English International | English 1.000000 8c6fa63ce1c6fcea 8c6fa63ce1c6fcea 0
https://www.greenpeace.org/canada/en/act/volunteer/sign-up-to-volunteer/ https://www.greenpeace.org/canada/fr/agir/impliquez-vous/inscrivez-vous-pour-recevoir-nos-opportunites-de-benevolat/ Africa | English Africa | English 1.000000 1c38daf7d1165577 1c38daf7d1165577 0
https://www.greenpeace.org/canada/en/act/volunteer/sign-up-to-volunteer/ https://www.greenpeace.org/canada/fr/agir/impliquez-vous/inscrivez-vous-pour-recevoir-nos-opportunites-de-benevolat/ Africa | Français Africa | Français 1.000000 1cbc20304b275a25 1cbc20304b275a25 0
https://www.greenpeace.org/canada/en/act/volunteer/sign-up-to-volunteer/ https://www.greenpeace.org/canada/fr/agir/impliquez-vous/inscrivez-vous-pour-recevoir-nos-opportunites-de-benevolat/ Arabic | العربية Arabic | العربية 1.000000 483ee5a8c342794b 483ee5a8c342794b 0
https://www.greenpeace.org/canada/en/act/volunteer/sign-up-to-volunteer/ https://www.greenpeace.org/canada/fr/agir/impliquez-vous/inscrivez-vous-pour-recevoir-nos-opportunites-de-benevolat/ Argentina | Español Argentina | Español 1.000000 6f2cef7310e9086c 6f2cef7310e9086c 0
https://www.greenpeace.org/canada/en/act/volunteer/sign-up-to-volunteer/ https://www.greenpeace.org/canada/fr/agir/impliquez-vous/inscrivez-vous-pour-recevoir-nos-opportunites-de-benevolat/ Australia | English Australia | English 1.000000 d475692cddef69ba d475692cddef69ba 0
https://www.greenpeace.org/canada/en/act/volunteer/sign-up-to-volunteer/ https://www.greenpeace.org/canada/fr/agir/impliquez-vous/inscrivez-vous-pour-recevoir-nos-opportunites-de-benevolat/ Austria | Deutsch Austria | Deutsch 1.000000 7fe80ae35534ab7c 7fe80ae35534ab7c 0
I found a specific example of wrong data that might work for a little sample:
zcat file.gz | fgrep 'Toronto Mobilization-Hub Launch Party!!' | wc -l
# 5
zcat bicleaner2.gz | fgrep 'Toronto Mobilization-Hub Launch Party!!' | wc -l
# 5
diff <(zcat file.gz | fgrep 'Toronto Mobilization-Hub Launch Party!!') <(zcat bicleaner2.gz | cut -f1,2,3,4,5,6,7 | fgrep 'Toronto Mobilization-Hub Launch Party!!') | wc -l
# 6
# < https://greenwire.greenpeace.org/canada/fr/search/event/%20/participants%20DESC/list/100 https://greenwire.greenpeace.org/canada/fr/search/event/%20/participants%20DESC/list/50 Toronto Mobilization-Hub Launch Party!! Toronto Mobilization-Hub Launch Party!! 1.000000 822e9d43295af2f 822e9d43295af2f
# < https://greenwire.greenpeace.org/canada/en/search/event/%20/search_api_relevance%20DESC/list/20?page=7 https://greenwire.greenpeace.org/canada/fr/search/event/%20/field_event_date_value%20ASC/list/20?page=7 Toronto Mobilization-Hub Launch Party!! Toronto Mobilization-Hub Launch Party!! 1.000000 822e9d43295af2f 822e9d43295af2f
# ---
# > https://greenwire.greenpeace.org/canada/en/events/toronto-mobilization-hub-launch-party https://greenwire.greenpeace.org/canada/fr/node/88498 Toronto Mobilization-Hub Launch Party!! Toronto Mobilization-Hub Launch Party!! 1.000000 822e9d43295af2f 822e9d43295af2f
# > https://greenwire.greenpeace.org/canada/en/events/toronto-mobilization-hub-launch-party https://greenwire.greenpeace.org/canada/fr/node/88498 Toronto Mobilization-Hub Launch Party!! Toronto Mobilization-Hub Launch Party!! 1.000000 822e9d43295af2f 822e9d43295af2f
This leads me to wonder: are the scores being calculated from the original content piped to Bicleaner/Bicleaner AI, or from the wrong sentences that both Bicleaners print when --score_only is not set? If the scores are definitely calculated from the input content, method 1 is a workaround for this bug; if not, I guess that neither method 1 nor method 2 would be valid.
I'm not sure why this bug happens, but the implications might be that the score is wrong or, at least, that the output columns provided from the input are not the same.
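For reference, the pattern behind method 1 can be sketched generically: the scoring command only ever sees the key columns and prints one value per line, and the untouched input is pasted back next to it, so the command has no chance to rewrite the original columns. This is only a sketch; the function name paste_extra_column and its argument order are made up for illustration, and it says nothing about whether the scores themselves are right.

```shell
# Hypothetical helper (not part of Bicleaner or cache).
# usage: paste_extra_column FILE COLS COMMAND [ARGS...]
#   FILE    uncompressed tab-separated input
#   COLS    column list understood by cut -f, e.g. 3,4
#   COMMAND must print exactly one output line per input line
paste_extra_column() {
    file=$1
    cols=$2
    shift 2
    # Only the selected columns reach COMMAND; the original rows come
    # straight from FILE and are never modified.
    cut -f"$cols" "$file" | "$@" | paste "$file" -
}
```

For example, paste_extra_column in.tsv 3,4 some_scorer would append the output of the (hypothetical) some_scorer as a last column of in.tsv.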
I tested this bug on:
- https://github.com/bitextor/bicleaner/tree/d4e24b52f05123a353d67294276c006c0a94adbc
- https://github.com/bitextor/bicleaner/tree/96a3d55eca00a3bb937c08388a77b06d8859f084
- https://github.com/bitextor/bicleaner-ai/tree/39326eab2026b168fdb487cdc02ffe873f3345f2
Models I used:
- Bicleaner: https://github.com/bitextor/bicleaner-data/releases/download/v1.5/en-fr.tar.gz
- Bicleaner AI: https://github.com/bitextor/bicleaner-ai-data/releases/download/v1.0/lite-en-fr.tgz
Thank you!
I've been taking a very naive look into this, and I found this issue with urls:
- File:
zcat file.gz | grep "7c83b028cf21f355"
https://greenwire.greenpeace.org/canada/en/search/event/%20/participants%20ASC/thumbnail/50?type%5B0%5D=blog&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=3&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2 https://greenwire.greenpeace.org/canada/fr/search/event/%20/participants%20ASC/thumbnail/50 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355
https://greenwire.greenpeace.org/canada/en/search/event/%20/created%20DESC/thumbnail/100?type%5B0%5D=blog&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=4&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bpage=13&%3Bamp%3Bpage=1&%3Bpage=1 https://greenwire.greenpeace.org/canada/fr/search/event/%20/created%20DESC/thumbnail/20?page=4 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355
https://greenwire.greenpeace.org/canada/en/search/event/%20/search_api_relevance%20DESC/thumbnail/100?page=1 https://greenwire.greenpeace.org/canada/fr/search/event/%20/participants%20DESC/thumbnail/50?page=4 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355
- Bicleaner1:
zcat bicleaner1.gz | grep "7c83b028cf21f355"
https://greenwire.greenpeace.org/canada/en/search/event/%20/participants%20ASC/thumbnail/50?type%5B0%5D=blog&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=3&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2 https://greenwire.greenpeace.org/canada/fr/search/event/%20/participants%20ASC/thumbnail/50 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355 0
https://greenwire.greenpeace.org/canada/en/search/event/%20/created%20DESC/thumbnail/100?type%5B0%5D=blog&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=4&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bpage=13&%3Bamp%3Bpage=1&%3Bpage=1 https://greenwire.greenpeace.org/canada/fr/search/event/%20/created%20DESC/thumbnail/20?page=4 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355 0
https://greenwire.greenpeace.org/canada/en/search/event/%20/search_api_relevance%20DESC/thumbnail/100?page=1 https://greenwire.greenpeace.org/canada/fr/search/event/%20/participants%20DESC/thumbnail/50?page=4 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355 0
- Bicleaner2:
zcat bicleaner2.gz | grep "7c83b028cf21f355"
https://greenwire.greenpeace.org/canada/en/search/event/%20/participants%20ASC/thumbnail/50?type%5B0%5D=blog&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=3&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2 https://greenwire.greenpeace.org/canada/fr/search/event/%20/participants%20ASC/thumbnail/50 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355 0
https://greenwire.greenpeace.org/canada/en/search/event/%20/participants%20ASC/thumbnail/50?type%5B0%5D=blog&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=3&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2 https://greenwire.greenpeace.org/canada/fr/search/event/%20/participants%20ASC/thumbnail/50 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355 0
https://greenwire.greenpeace.org/canada/en/search/event/%20/participants%20ASC/thumbnail/50?type%5B0%5D=blog&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=1&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=3&%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bamp%3Bpage=2 https://greenwire.greenpeace.org/canada/fr/search/event/%20/participants%20ASC/thumbnail/50 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355 0
Since your diff takes all the columns into account, including the URLs, this issue with the URLs gives the impression that more lines differ than actually do. I tried diff <(zcat file.gz | cut -f3,4) <(zcat bicleaner2.gz | cut -f3,4) | wc -l and I get 0 differences.
So I think we just need to fix whatever Bicleaner is doing when writing URLs to the output.
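To narrow this kind of mismatch down faster, the same comparison can be done for every column at once. The following is a rough sketch, assuming two uncompressed tab-separated files with the same number of lines; the function name column_diff_counts is made up for illustration.

```shell
# Hypothetical helper: for each column, print how many rows differ
# between two tab-separated files that have the same number of lines.
column_diff_counts() {
    awk -F'\t' '
        NR == FNR { first[FNR] = $0; next }   # keep the whole first file in memory
        {
            n = split(first[FNR], a, "\t")
            if (NF > n) n = NF
            if (n > max) max = n
            for (i = 1; i <= n; i++)
                # appending "" forces a string comparison, so 1.0 and 1.000 count as different
                if ((a[i] "") != ($i "")) diff[i]++
        }
        END { for (i = 1; i <= max; i++) printf "%d\t%d\n", i, diff[i] + 0 }
    ' "$1" "$2"
}
```

Each output line is a column number followed by the number of rows where that column differs, so a problem limited to the URL columns would show non-zero counts only for columns 1 and 2.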
Hi! Thanks for the help!
I've tried without cache, as you pointed out, and it seems that is the cause of the problem... I'd never have guessed it... It's true that I should've tried removing all the components external to Bicleaner :/
And I'm observing that the problem seems to be related to the -k flag of cache, not to cache itself:
zcat file.gz | md5sum
# 199661296e84d638d5adcb17a27994a8
zcat file.gz | cache cat | md5sum
# 199661296e84d638d5adcb17a27994a8
zcat file.gz | cut -f1,2,3,4,5,6,7 | md5sum
# 199661296e84d638d5adcb17a27994a8
zcat file.gz | cache cut -f1,2,3,4,5,6,7 | md5sum
# 199661296e84d638d5adcb17a27994a8
zcat file.gz | cache -k 3,4 cat | md5sum
# 8dfdb3f73082abfe119049beb03e5aa6
zcat bicleaner2.gz | cut -f1,2,3,4,5,6,7 | md5sum # All fields but bicleaner score
# 8dfdb3f73082abfe119049beb03e5aa6
It seems you were totally right to point at cache. Do you think it could be a regression from kpu/preprocess#35? I'm going to install the version of cache from before that issue and I'll come back here to report whether it worked.
So, in the end, it seems the problem is not related to Bicleaner. Anyway, I think it'd be nice to keep the issue open until the problem is completely solved, even if the main cause seems to be cache (if that is OK, of course).
Sure! I'll keep this open. Let me know what you find :)
I think I got it, and I think it is not a bug, but a wrong assumption of mine... I assumed that if a pair of sentences is found by cache, all the fields would be the same, but that's not true (it's almost true: all the fields depend on the pair of sentences except the URLs...). E.g.:
long_url1 long_url2 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355
long_url3 long_url4 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355
long_url5 long_url6 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355
# after `cache -k3,4`:
long_url1 long_url2 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355
long_url1 long_url2 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355
long_url1 long_url2 Formation Greenspeaker Formation Greenspeaker 1.000000 7c83b028cf21f355 7c83b028cf21f355
In the previous example, it can be observed that the URLs are the only fields that are independent of the sentences (remember cache -k 3,4, where 3 and 4 are the missing piece I didn't notice). So, I tried:
diff \
<(zcat bicleaner1.gz | awk -F$'\t' '{print $3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}') \
<(zcat bicleaner2.gz | awk -F$'\t' '{print $3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}') \
| wc -l
# 0 YEP!!!
So, as in your example running zcat bicleaner2.gz | grep "7c83b028cf21f355", only the URLs are wrong. I assumed all the fields depended on the ones I specified (i.e. -k 3,4), but that is not true, so cache can't be applied directly, or it should be applied using paste as in method 1.
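The effect described above can be reproduced without cache at all. The awk function below is only a stand-in that mimics what was observed in this thread (the first full line seen for a key is replayed for every later line with the same key); it is not how cache is actually implemented, and the name replay_first_by_key is made up.

```shell
# Hypothetical stand-in for the observed behaviour of `cache -k 3,4 cat`:
# lines are keyed on columns 3 and 4, and the first line seen for a key
# is printed again for every later line that has the same key.
replay_first_by_key() {
    awk -F'\t' '
        {
            key = $3 FS $4
            if (!(key in cached)) cached[key] = $0   # first time: remember the whole line
            print cached[key]                        # afterwards: replay it, URLs included
        }
    '
}
```

Piping the three "Formation Greenspeaker" rows through this function gives three copies of the first row, which is the same pattern seen in bicleaner2.gz: the sentence columns are unchanged and only the columns outside the key (here, the URLs) end up wrong.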
Sorry for the long reply and wasting your time :(
Closing issue.
Thank you for the support!