greenelab/adage-backend

implement weight and boost properties to improve gene search

mhuyck opened this issue · 3 comments

While porting django-genes code to Python 3 for this project, a question came up about the Gene.weight property (see py3-adage-backend/adage/genes/models.py) and how it was used.

After reviewing the original django-genes codebase, it's clear that .weight supports an important search feature. Although it's not initially needed for Adage, it will likely be useful in the future and it will certainly be needed when it comes time to factor the py3 version of django-genes back out into a separate component. Details of my code review are below:

Weight is a search tuning parameter. Although it is not particularly useful for the Pseudomonas data we currently use in Adage, a fair amount of work went into django-genes (for Tribe, I assume) to add it because it was needed there.

When searching for genes across many data sources, the same gene name is often used to refer to different genetic locations, even within the same organism. So, when a user is searching for a gene to add to a list, there needs to be a way to sort through the duplicates. From the comments in django-genes/genes/search_indexes.py (lines 34-59) and django-genes/genes/management/commands/genes_load_geneinfo.py (lines 213-271), it appears that the weighting is done in such a way that the “more popular” gene hits rise to the top of the search list. genes_load_geneinfo.py has logic that counts the number of cross-references and aliases a gene has and gives the gene a higher weight when it has many of them. search_indexes.py then converts those weights into a boost parameter, which appears to be what actually changes where that gene ranks in the search results.
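To make the idea concrete, here is a rough sketch of how I read the two steps. It is not the django-genes code: `GeneRecord`, `compute_weight`, `weight_to_boost`, and `rank_hits` are hypothetical names, and both the "one point per alias or cross-reference" rule and the log scaling are assumptions chosen for illustration. The actual formulas live in the two files cited above.

```python
import math
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    """Hypothetical stand-in for a Gene row; not the django-genes model."""
    symbol: str
    entrez_id: int
    aliases: list = field(default_factory=list)
    xrefs: list = field(default_factory=list)


def compute_weight(gene):
    # Assumed rule: a base of 1, plus 1 for every alias and cross-reference.
    return 1.0 + len(gene.aliases) + len(gene.xrefs)


def weight_to_boost(weight):
    # Assumed damping: log scale, so a heavily annotated gene is favored
    # without completely swamping the text-match relevance.
    return 1.0 + math.log10(weight)


def rank_hits(hits):
    # Order same-name hits so the most annotated ("more popular") gene is first.
    return sorted(
        hits,
        key=lambda g: weight_to_boost(compute_weight(g)),
        reverse=True,
    )


if __name__ == "__main__":
    # Two made-up genes that share a symbol within one organism.
    well_known = GeneRecord("ABC1", 1, aliases=["X", "Y"],
                            xrefs=["e1", "e2", "e3", "e4", "e5", "e6", "e7"])
    obscure = GeneRecord("ABC1", 2)
    for gene in rank_hits([obscure, well_known]):
        print(gene.entrez_id, weight_to_boost(compute_weight(gene)))
```

With something like this, two genes that share a symbol would still both match the query, but the one with more aliases and cross-references would be listed first.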

So I take from this that we will need the weight parameter and the boost logic to return, somehow, before this code is folded back into django-genes.

How soon will we need this sort of logic for Adage? I guess that’s really a question for @cgreene, but I think it’s safe to say that if we expand to other organisms we will hit this duplicate gene name issue eventually.

I don't know how soon we'll need it. I think it was designed to address issues with search that showed up in Tribe's human genes.

Let's look into this issue more closely when we port Tribe to Python 3.

Sounds good to me.