Caleydo/tdp_publicdb

LEFT JOIN vs INNER JOIN in aggregated score DB query

Closed this issue · 0 comments

@zichner commented on Tue Oct 24 2017

Release: dev

When building an SQL query for a cell line list score that aggregates across a set of genes, the gene table is join by an INNER JOIN whereas the gene set table is joined by a LEFT JOIN.

Here is an example:

/api/tdp/db/publicdb/cellline_gene_frequency_copynumberclass_score/score?attribute=copynumberclass&species=human&table=copynumber&target=Cellline&value=-2&filter_panel_ensg=Cancer+Gene+Census
SELECT d.celllinename AS id, SUM((copynumberclass in (-2))::INT4) as count, COUNT(copynumberclass) as total 
FROM cellline.tdp_copynumber d 
INNER JOIN public.tdp_gene s ON d.ensg = s.ensg 
LEFT JOIN public.tdp_geneassignment ga ON (d.ensg = ga.ensg)  
WHERE s.species = 'human' AND ga.genesetname = 'Cancer Gene Census' GROUP BY d.celllinename;

Is there a reason for using different joins?
If not, please change the second join also into an INNER JOIN, which is much faster in some cases.
Additionally, you can harmonize the ON statement (either use always brackets or never).


@zichner commented on Mon Oct 30 2017

Just FYI: This issue is the most relevant at the moment. As soon as this change is made and deployed, I can start checking the DB queries for correctness and we can see if all steps of the described use cases are running fast enough or if we need to modify the DB queries or the use cases.


@mstreit commented on Mon Oct 30 2017

We discussed it this morning. @lehnerchristian will take care of it today.