malariagen/malariagen-data-python

Advanced diplotype clustering orders genes in the CNV section by label rather than by genomic position

Closed this issue · 8 comments

If the cnv_region contains multiple genes, the CNV heatmap rows are ordered by gene label rather than by genomic position. This makes the structure of the CNVs harder to understand. It would be better to order the rows by genomic position.

Here's an example:

af1.plot_diplotype_clustering_advanced(
    region='X:8,438,477-8,460,887',
    snp_transcript='LOC125764232_t1',
    cnv_region='X:8,418,477-8,480,887',
    sample_sets=['1232-VO-KE-OCHOMO-VMF00044', '1231-VO-MULTI-WONDJI-VMF00043', '1236-VO-TZ-OKUMU-VMF00090'],
    sample_query="country in ['Kenya', 'Uganda', 'Tanzania'] and taxon == 'funestus'",
)

image

Interesting, didn't notice this!

Doesn't seem to be the case in gambiae? Is there something odd about that funestus locus?

image

Just had a look at your example in the Af1 GFF, they are already in the order of genomic position, although LOC125764275 (middle gene) is on reverse strand.

image

So for some reason CNVs at LOC125764275 are not getting called.

> Just had a look at your example in the Af1 GFF, they are already in the order of genomic position, although LOC125764275 (middle gene) is on reverse strand.

No I don't think so, here are the three genes in the region I wanted to show CNV data for...

image

The middle gene should be LOC125764232 but it's not.

Actually, maybe the problem is that the GFF isn't sorted...

Suggested fix is to sort the GFF when it is loaded within the genome_features() function.
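A minimal sketch of what that sort could look like, assuming the loaded GFF is held in a pandas DataFrame with `contig`, `start`, `end`, `type` and `ID` columns. The `sort_features` helper and the toy gene IDs and coordinates below are hypothetical, made up for illustration, and are not taken from the library or from the Af1 GFF.

```python
import pandas as pd

# Toy stand-in for an unsorted GFF table. The IDs and coordinates are
# invented so that label order and genomic order disagree.
df_gff = pd.DataFrame(
    {
        "contig": ["X", "X", "X"],
        "type": ["gene", "gene", "gene"],
        "start": [300, 100, 200],
        "end": [350, 150, 250],
        "ID": ["gene_1", "gene_3", "gene_2"],
    }
)


def sort_features(df):
    """Hypothetical helper: order features by contig, then start, then end."""
    return df.sort_values(["contig", "start", "end"], kind="stable").reset_index(
        drop=True
    )


df_sorted = sort_features(df_gff)

# Gene labels in genomic order, e.g. for ordering the CNV heatmap rows.
gene_order = df_sorted.loc[df_sorted["type"] == "gene", "ID"].tolist()
print(gene_order)
```

With the sort applied at load time, anything downstream that iterates over the genes in a region would see them in genomic order rather than in file order or label order.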

Did this ever get resolved? @leehart @alimanfoo

I can do it.