PanakoStrategy Query Logic - maxListSize @ 250 needs an override
lucaslawes opened this issue · 2 comments
Possible minor refactoring to improve the recognition rate.
Testing Results
Tests on a high-powered system showed that using the first half of the query matches as firstHits and the second half as lastHits (see below) gives a slightly better recognition rate.
Suggestion
Add maxListSize to config.properties, maybe with a switch to allow the query algorithm to take half the query matches each time.
if (!overrideMaxListSize) {
    // View the first and last hits (max 250 each)
    int maxListSize = 250;
    int partSize = Math.min(maxListSize, Math.max(minimumUnfilteredHits, queryMatches.size() / 5));
    firstHits = queryMatches.subList(0, partSize);
    lastHits = queryMatches.subList(queryMatches.size() - partSize, queryMatches.size());
} else {
    // Taking half and half seems to achieve a better recognition rate
    var numQueryMatches = queryMatches.size();
    // Integer division rounds down, so an odd count leaves the middle match out
    var batchSize = numQueryMatches / 2;
    // subList's end index is exclusive, so no "- 1" is needed
    firstHits = queryMatches.subList(0, batchSize);
    lastHits = queryMatches.subList(numQueryMatches - batchSize, numQueryMatches);
}
Thanks for the bug report!
This is indeed a 'magic number' that should be set in the configuration settings. A switch in the configuration settings seems like a reasonable thing to do, especially if performance or query time is less of an issue.
The reason to take only 250 is performance: calculating a median on a small list is more efficient than on a potentially very large one (half the hits could be a lot). Figure 1 in the Panako 2.0 article shows exactly this idea. The impact on retrieval rate is expected to be limited, but it has not been thoroughly tested and might differ from one application to another: in noisy settings many spectral peaks might be present in the query but not in the reference database, and 250 hits might not be enough to reach 'agreement', i.e. a relevant median. That is another reason to add it to the configuration settings.
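To illustrate why the list size matters, here is a minimal sketch of a sort-based median. The class and method names are hypothetical and this is not necessarily how Panako computes its median; it only shows that the cost grows with the number of hits, since the whole list is copied and sorted.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Panako.
final class MedianSketch {

    // Returns the (upper) median of the values. The input is copied so the
    // caller's list is left untouched; sorting costs O(n log n).
    static int median(List<Integer> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("cannot take the median of an empty list");
        }
        List<Integer> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        return sorted.get(sorted.size() / 2);
    }
}
```

With a cap of 250 hits per part the sort stays cheap regardless of how many matches the query returns; without the cap the work scales with the full match count.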
The last commit should allow the requested functionality:
By setting PANAKO_HIT_PART_MAX_SIZE to a very high number (Integer.MAX_VALUE) and PANAKO_HIT_PART_DIVIDER to 2, the first and last parts should be equal to:
var numQueryMatches = queryMatches.size();
// Integer division rounds down, so an odd count leaves the middle match out
var batchSize = numQueryMatches / 2;
// subList's end index is exclusive, so no "- 1" is needed
firstHits = queryMatches.subList(0, batchSize);
lastHits = queryMatches.subList(numQueryMatches - batchSize, numQueryMatches);
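As a rough sketch of how the two settings might combine, the snippet below generalizes the original formula by replacing the hard-coded 250 and 5 with parameters. It assumes PANAKO_HIT_PART_MAX_SIZE plays the role of maxListSize and PANAKO_HIT_PART_DIVIDER the role of the divisor; the class, the method names and the final clamp to the list size are hypothetical additions, not taken from the commit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Panako.
final class HitPartSketch {

    // Size of each part: total / divider, at least minimumUnfilteredHits,
    // at most maxSize. The last clamp to `total` is an addition of this
    // sketch so that subList never receives an out-of-range index.
    static int partSize(int total, int maxSize, int divider, int minimumUnfilteredHits) {
        int wanted = Math.min(maxSize, Math.max(minimumUnfilteredHits, total / divider));
        return Math.min(wanted, total);
    }

    // First `size` matches (subList's end index is exclusive).
    static <T> List<T> firstPart(List<T> matches, int size) {
        return matches.subList(0, size);
    }

    // Last `size` matches.
    static <T> List<T> lastPart(List<T> matches, int size) {
        return matches.subList(matches.size() - size, matches.size());
    }

    public static void main(String[] args) {
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < 1001; i++) {
            matches.add(i);
        }
        // Previous hard-coded behaviour: a fifth of the matches, capped at 250.
        int defaults = partSize(matches.size(), 250, 5, 10);
        // Half and half: no effective cap, divider of 2.
        int halves = partSize(matches.size(), Integer.MAX_VALUE, 2, 10);
        System.out.println(defaults + " " + halves); // prints "200 500"
    }
}
```

With an odd number of matches, as in the example, the two halves leave the middle match out, which matches the integer division in the snippet above.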