pisa-engine/pisa

Extreme score from Query-Likelihood Quantized Index

J9rryGou opened this issue · 14 comments

I created a quantized index with the following commands:

cd /home/jg6226/code/raw_pisa/build
./bin/create_wand_data -c /hdd1/data/ssd2_data_backup/ssd2/data/index/cw09b/CW09B.url.inv -o /ssd2/data/index/cw09b_ql_index/CW09B.ql.quantized.wand --quantize --scorer qld -b 128

./bin/compress_inverted_index -c /hdd1/data/ssd2_data_backup/ssd2/data/index/cw09b/CW09B.url.inv -o /ssd2/data/index/cw09b_ql_index/CW09B.ql.quantized.index.opt -e block_simdbp --quantize --scorer qld --wand /ssd2/data/index/cw09b_ql_index/CW09B.ql.quantized.wand --check

Then I ran my modified evaluate_queries on a query set selected from TREC05:

cd /home/jg6226/code/20230101_pisa_termscore_small_size/pisa/build
./bin/evaluate_queries_didordered -e block_simdbp -a ranked_or -i /ssd2/data/index/cw09b_quantized_index/CW09B.quantized.index.opt -q /home/jg6226/data/Hit_Ratio_Project/TREC0506_query/cleaned_query/trec05_testing_queries.txt -k 1000 --scorer quantized --wand /ssd2/data/index/cw09b_quantized_index/CW09B.quantized.wand  --documents /home/jg6226/data/index/cw09b/CW09B.url.fwd.doclex --terms /home/jg6226/data/index/cw09b/CW09B.fwd.termlex -f /home/jg6226/data/Hit_Ratio_Project/TREC0506_query/evaluate_result/trec05_testing_quantized_output.txt -d

I found that some documents get extremely high scores. Is there anything wrong with my code?

@J9rryGou I fixed a bug with quantization in #573. Can you check if you're still getting this issue?

I just realized that --check has no effect when compressing with quantization. I will see if this can be implemented.

Sounds good.

I also tried running compress_inverted_index without --check, but the index still has the issue I mentioned above.

Yeah, not passing --check will have no effect; it's just being ignored. I'll work on implementing the check for quantized indexes, then maybe that can reveal something...

I haven't figured this one out yet, but I definitely see something is broken.

For one, a quantized index using any of the non-blocked encodings is fundamentally broken -- but I think I have an idea why and how to fix it.

Second, I see that at compression time, there is a score that wants to be written: 4294967295, which happens to be a 32-bit int with all 1s, or 2^32 - 1. Not sure yet why but it's a lead.

Also, BM25 doesn't seem to be affected.

Sounds good, thank you so much! It seems we are very close to finding the bug that occurs when using qld as the ranking function. And yes, all outputs using bm25 look good, according to the results from previous runs.

But I have a question about the 4294967295 score:
The quantized score should be in the range 0 to 255 (256 is very rare, but I did see it occur in quantized bm25 scores). Maybe PISA stores the quantized score like this: if it is in the range 0 to 255, it uses one byte; if it is 256, it uses 2 bytes. That would explain why it worked well before you modified the quantizer. For the quantized qld index, there are some extremely large scores, and PISA would store them with more bytes (maybe up to 8 bytes? I saw some scores that are even larger than 2^32 - 1, but I am not 100% sure). This could explain why the quantized qld index is about 47GB, whereas the quantized bm25 index is about 25GB.

So PISA does not store a quantized score in a fixed 1 byte, but uses more bytes if the score is very large?

For one, a quantized index using any of the non-blocked encodings is fundamentally broken -- but I think I have an idea why and how to fix it.

BTW, when you say this, do you mean that a quantized bm25 index using elias_fano encoding is also broken?

Yeah, I believe so, but I would have to confirm that. Some tests I wrote fail for those indexes, so there's clearly something wrong.

So PISA does not store a quantized score in a fixed 1 byte, but uses more bytes if the score is very large?

Scores are written the same way as frequencies, so the size depends on the encoding used. Quantization happens only when computing the score; if that score is 256, then the codec will write 256.
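
To make the size question concrete, here is a rough sketch of why a few huge values can inflate an index. It is not PISA's codec code: it assumes a simplified binary-packing scheme in which every value in a block is stored with the bit width of the block's largest value, and `bits_needed` and `packed_block_bits` are hypothetical helper names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bits needed to represent v (0 needs 0 bits in this sketch).
inline unsigned bits_needed(std::uint32_t v) {
    unsigned bits = 0;
    while (v != 0) {
        ++bits;
        v >>= 1;
    }
    return bits;
}

// Payload size in bits of one block under the assumed scheme:
// every value is stored with the width of the largest value in the block.
// This is a simplification, not the layout of any specific PISA encoding.
inline std::size_t packed_block_bits(const std::vector<std::uint32_t>& block) {
    assert(!block.empty() && "a block must contain at least one value");
    const std::uint32_t max_value = *std::max_element(block.begin(), block.end());
    return block.size() * bits_needed(max_value);
}
```

Under this assumption, a block of 128 scores that all fit in 8 bits takes 128 * 8 bits, while the same block with a single 4294967295 in it takes 128 * 32 bits, so a handful of out-of-range scores could plausibly account for a much larger index.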

@J9rryGou Actually nvm about what I said about non-blocked encodings. They also seem to work for BM25 after all.

@J9rryGou the culprit is how we encode frequencies: we always encode frequency - 1 (because they are all positive). When some scores are quantized to 0, it breaks down, because we end up with 2^32-1 after that subtraction (underflow).
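
A minimal sketch of that wraparound, assuming 32-bit unsigned storage. The function names are hypothetical and this is not the actual PISA encoder; it only reproduces the arithmetic described above.

```cpp
#include <cassert>
#include <cstdint>

// Unchecked variant: mirrors the "store value - 1" convention.
// Unsigned arithmetic wraps modulo 2^32, so an input of 0 becomes 4294967295.
inline std::uint32_t encode_minus_one(std::uint32_t value) {
    return value - 1u;
}

// Checked variant: makes the hidden precondition (value >= 1) explicit.
// Frequencies always satisfy it; quantized scores may not.
inline std::uint32_t encode_minus_one_checked(std::uint32_t value) {
    assert(value >= 1u && "value - 1 encoding requires a strictly positive input");
    return value - 1u;
}
```

For example, `encode_minus_one(3)` yields 2 as intended, while `encode_minus_one(0)` yields 4294967295, the same all-ones value observed at compression time.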

Could you please try the fix branch #575 and report back if it fixes the issue?

Note that I've discovered a different issue with the PL2 & DPH scorers, but both QLD and BM25 should work fine.

@J9rryGou I closed it with the fix in #575. If you encounter this issue again on the new version, feel free to reopen or open a new one.