LEFT JOIN vs INNER JOIN in aggregated score DB query
Closed this issue · 0 comments
@zichner commented on Tue Oct 24 2017
Release: dev
When building an SQL query for a cell line list score that aggregates across a set of genes, the gene table is join by an INNER JOIN
whereas the gene set table is joined by a LEFT JOIN
.
Here is an example:
/api/tdp/db/publicdb/cellline_gene_frequency_copynumberclass_score/score?attribute=copynumberclass&species=human&table=copynumber&target=Cellline&value=-2&filter_panel_ensg=Cancer+Gene+Census
SELECT d.celllinename AS id, SUM((copynumberclass in (-2))::INT4) as count, COUNT(copynumberclass) as total
FROM cellline.tdp_copynumber d
INNER JOIN public.tdp_gene s ON d.ensg = s.ensg
LEFT JOIN public.tdp_geneassignment ga ON (d.ensg = ga.ensg)
WHERE s.species = 'human' AND ga.genesetname = 'Cancer Gene Census' GROUP BY d.celllinename;
Is there a reason for using different joins?
If not, please change the second join also into an INNER JOIN
, which is much faster in some cases.
Additionally, you can harmonize the ON
statement (either use always brackets or never).
@zichner commented on Mon Oct 30 2017
Just FYI: This issue is the most relevant at the moment. As soon as this change is made and deployed, I can start checking the DB queries for correctness and we can see if all steps of the described use cases are running fast enough or if we need to modify the DB queries or the use cases.
@mstreit commented on Mon Oct 30 2017
We discussed it this morning. @lehnerchristian will take care of it today.