gladkia/igvR

support for more genomes

janstrauss1 opened this issue · 22 comments

Dear @paul-shannon,

I just realized that the setGenome method currently only supports hg38, hg19, mm10 and tair10, which limits the usefulness of igvR for the research community.

Do you plan to extend the list of supported genomes in the near future? Will it also be possible to load genomes from a file?

I would personally be very interested in support for the genome of the malaria parasite Plasmodium falciparum 3D7.

Many thanks in advance for your help!
Jan

@janstrauss1 I'd be glad -- working with Jim Robinson, the creator of igv.js -- to add additional genomes - and to document how to add your own. I will contact Jim directly. In my recollection we need the genome and a gff3: can you direct me to your preferred versions of these two?

@janstrauss1
I just reminded myself of the mechanism for supporting a new genome in igv.js and thus in igvR. While waiting for a reply from Jim on his ideas on contributing new genomes to igv.js for all to use, here is what I can pretty easily add support for - so that you can host your own genome, as long as you have access to a webserver which supports range requests. We need range requests in the webserver so that parts of the sequence can be read from an indexed fa.gz file into the browser on demand.
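
To make the role of range requests concrete, here is a minimal R sketch of how one line of a FASTA index (.fai) can turn a region into the byte range a client would ask the webserver for. It assumes an uncompressed FASTA file. The helper name `fai_byte_range` and the index values are hypothetical, and igv.js may well do this differently internally.

```r
# Hypothetical helper: map a 1-based, inclusive region to the byte range
# that holds it, using three of the columns a .fai line stores per sequence:
#   offset    - byte position of the sequence's first base
#   linebases - bases per FASTA line
#   linewidth - bytes per FASTA line (bases plus the newline)
fai_byte_range <- function(start, end, offset, linebases, linewidth) {
  # Byte position (0-based) of a single 1-based base
  byte_of <- function(pos) {
    zero_based <- pos - 1
    offset + (zero_based %/% linebases) * linewidth + (zero_based %% linebases)
  }
  c(first = byte_of(start), last = byte_of(end))
}

# Made-up index entry: sequence starts at byte 6, 60 bases and 61 bytes per line
r <- fai_byte_range(start = 50, end = 100, offset = 6, linebases = 60, linewidth = 61)

# Value for an HTTP Range header covering that region
range_header <- sprintf("bytes=%d-%d", r[["first"]], r[["last"]])
print(range_header)
```

A server without range support would ignore such a header and send the whole file, which is why the webserver requirement matters for large genomes.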

In a proposed new method for igvR:

setCustomGenome(igv, name="TAIR10",
                fastaURL="https://myHost.net/tair10/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa",
                indexURL="https://myHost.net/tair10/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.fai",
                aliasURL="https://myHost.net/tair10/chromosomeAliases.txt",
                annotationURL="https://myHost.net/TAIR10_genes.sorted.chrLowered.gff3.gz")

Chromosome aliases would be optional, but often helpful, so that for instance, "chr10:50-100", "Chr10:50-100" and "10:50-100" are all intelligible.
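
As an illustration of what aliasing buys you, the following R sketch reduces the three spellings above to one canonical chromosome name. The helper names `canonical_chrom` and `parse_locus` are hypothetical, and this is not how igv.js or igvR implement aliases; it only shows the idea.

```r
# Hypothetical helper: drop a leading "chr" (any case), so that
# "chr10", "Chr10" and "10" all compare equal.
canonical_chrom <- function(name) {
  sub("^chr", "", name, ignore.case = TRUE)
}

# Hypothetical helper: split a locus string such as "Chr10:50-100"
# into chromosome, start and end. Thousands separators are allowed.
parse_locus <- function(locus) {
  m <- regmatches(locus, regexec("^([^:]+):([0-9,]+)-([0-9,]+)$", locus))[[1]]
  if (length(m) != 4) stop("unrecognized locus: ", locus)
  list(chrom = canonical_chrom(m[2]),
       start = as.numeric(gsub(",", "", m[3])),
       end   = as.numeric(gsub(",", "", m[4])))
}

loci <- c("chr10:50-100", "Chr10:50-100", "10:50-100")
parsed <- lapply(loci, parse_locus)

# All three spellings resolve to the same chromosome name
chroms <- unique(vapply(parsed, function(p) p$chrom, character(1)))
print(chroms)
```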

Comments?

@paul-shannon, you could use the genome and gff3 from the most recent PlasmoDB release, available at https://plasmodb.org/.

Alternatively, you can get the files from ftp://ftp.sanger.ac.uk/pub/project/pathogens/Plasmodium/falciparum/3D7/.

Sounds good to me as far as I can judge.

However, for now I would already be happy if all the genomes hosted on the IGV genome server were supported. See https://software.broadinstitute.org/software/igv/Genomes for a full list.
It already includes several genomes for the Plasmodium parasite (listed at 97. - 105.)

@janstrauss1
Support for Pfalciparum 3D7 is now (at least minimally) in igvR 1.5.6. I am hosting three files that you directed me towards:

  • PlasmoDB-43_Pfalciparum3D7_Genome.fasta
  • PlasmoDB-43_Pfalciparum3D7_Genome.fasta.fai
  • PlasmoDB-43_Pfalciparum3D7.gff

Jim Robinson, igv.js principal author, explains that the 100+ IGV desktop genomes are not available to igv.js - I don't know why - but only the sixteen listed here. He welcomes further submissions if they are public, not embargoed. I will make sure that the sixteen genomes Jim offers are soon available in igvR.

Pfal3D7 is a good candidate for submission to igv.js. Can you check out my current version? It seems rather skimpy to me: the gff file does not support search by gene symbol, nor does the reference gene track offer gene symbol names. I don't know what the Pfalciparum community expects, but if you guide me, I am happy to make improvements. Then we can submit to Jim.

I fixed the vignette error - the missing autoscale argument in the devel vignette, but not yet in release.

@paul-shannon, thanks for adding support for the Pfalciparum 3D7 genome! The genome fasta and annotation gff file versions from PlasmoDB look generally ok to me.

However, according to the Wellcome Sanger Institute, the information at PlasmoDB is only updated periodically.

Thus, I would suggest using the most recent genome and annotation file versions, which are continually updated in GeneDB and provided on the Wellcome Sanger Institute FTP server at ftp://ftp.sanger.ac.uk/pub/genedb/releases/2019-05/Pfalciparum/.

@janstrauss1 "continually updated" hmm - I may not be in a position to keep the 3D7 genome as current as you wish.

With that in mind, the package now includes a full demo showing how to serve up annotation and reference genome tracks yourself, locally for use within igvR. You can therefore add and/or update any and all reference and annotation tracks whenever you wish. See

https://github.com/paul-shannon/igvR/tree/master/misc/serveYourOwnFiles

This demo uses a small genome, Rhodobacter sphaeroides, a minimal range-request-capable Python Flask webserver, and an html file, rhodobacter-sphaeroides-demo.html, to see it all in action.

If this is of interest, I'd be glad to respond to any questions or suggestions you have, and fix any bugs you encounter.
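
Before pointing igvR at a self-hosted genome, it can help to confirm that the webserver really honors range requests. The R sketch below is one way to check, not part of igvR: `honors_range` and `check_range_support` are hypothetical names, and the second function assumes the CRAN `curl` package and network access.

```r
# Hypothetical helper: decide from a response whether a byte-range request
# was honored. A range-capable server answers 206 Partial Content and sends
# a Content-Range header; a server that ignores the Range header answers 200
# and sends the whole file.
honors_range <- function(status_code, headers) {
  has_content_range <- any(grepl("^content-range:", headers, ignore.case = TRUE))
  isTRUE(status_code == 206) && has_content_range
}

# Hypothetical wrapper: request the first 100 bytes of a URL and inspect
# the reply. Assumes the 'curl' package is installed; not run here.
check_range_support <- function(url) {
  h <- curl::new_handle()
  curl::handle_setheaders(h, Range = "bytes=0-99")
  res <- curl::curl_fetch_memory(url, handle = h)
  honors_range(res$status_code, curl::parse_headers(res$headers))
}

# Offline illustration with hand-written response headers
ok  <- honors_range(206, c("HTTP/1.1 206 Partial Content",
                           "Content-Range: bytes 0-99/1000"))
bad <- honors_range(200, c("HTTP/1.1 200 OK", "Content-Length: 1000"))
print(c(range_capable = ok, range_ignored = bad))
```

Calling `check_range_support()` on the URL of your own fasta file should return TRUE if the server is suitable.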

@paul-shannon, that's fine! I didn't mean to ask you to keep the genome continuously up-to-date but rather wanted to make sure that you use the most recent genome and annotation file versions currently available.

Thanks a lot for including a full demo on how to use annotation and reference genome tracks locally! I will give it a try.

Thanks again for your great help!

@janstrauss1 I will replace the PlasmoDB genome for igvR with the next update to GeneDB. If you find igvR useful, and depend upon the current P. falciparum genome, I will automate this. Just let me know.

@paul-shannon, that's great thank you!
Yes, I find igvR very useful and will depend on the current P. falciparum genome in the future. I would therefore highly appreciate it if you could automate this process.
Many thanks for your efforts!

@janstrauss1 I am updating igvR for the upcoming Bioconductor release. I see that igv.js hosts a few more genomes, but still no Pfalciparum.

I confess: I have not done the automation I offered back in May. Is this still of active interest to you?

Dear @paul-shannon, thanks for the update!
Yes, I'm still very interested in using igvR with the most current Pfalciparum genome and would appreciate it if you could include the automation.
Many thanks in advance!

Hi Paul,

Could you please compare the annotation file used in igvR to the files available at https://plasmodb.org/common/downloads/Current_Release/Pfalciparum3D7/?
I can well imagine though that the files are still current and no update is necessary.

Maybe you can use the ./Current_Release/Pfalciparum3D7/ directory for automatic updates of the igvR annotation file?

Many thanks for your great support!

Jan

Dear @paul-shannon, I'm a new user of igvR and find it a perfect tool for visualizing my regions of interest in R! Just a follow-up on this thread about the plan to support custom genomes: is the proposed method available yet?

setCustomGenome(igv, name="TAIR10",
                fastaURL="https://myHost.net/tair10/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa",
                indexURL="https://myHost.net/tair10/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.fai",
                aliasURL="https://myHost.net/tair10/chromosomeAliases.txt",
                annotationURL="https://myHost.net/TAIR10_genes.sorted.chrLowered.gff3.gz")

I'd be definitely interested in giving it a try if possible!

Thanks!

Hi @orionzhou,
Thank you for the prompt to return to this task. I am working on it now.
Two requests, if I may:

  • Could you post your question (also) to https://support.bioconductor.org so that others in the community who may be interested will see this?
  • Can you provide URLs for fasta, fasta index, cytoband and a gene annotation track for your organism? That will permit me to test the code more broadly than with my own example organisms only.

Thanks for the reply, and yes -

here is the question posted on bioconductor

here are the genome files:

{
  "id": "maize",
  "name": "Zea mays B73v4",
  "fastaURL": "https://s3.msi.umn.edu/zhoup-igv-data/Zmays-B73/10.fasta",
  "indexURL": "https://s3.msi.umn.edu/zhoup-igv-data/Zmays-B73/10.fasta.fai"
}
{
  "name": "Genes",
  "format": "gff3",
  "url": "https://s3.msi.umn.edu/zhoup-igv-data/Zmays-B73/10.gff.gz",
  "indexURL": "https://s3.msi.umn.edu/zhoup-igv-data/Zmays-B73/10.gff.gz.tbi"
}

@orionzhou, I have an early version of `setCustomGenome` working with your maize data.

However, something seems amiss, which I have reproduced in minimal html/js scripts - one for hg38 (which loads fast) and one for maize (which takes five minutes). I downloaded all of the relevant files from a shell, using curl, and found that the full hg38 fasta takes as long to download from aws as 10.fasta does from your server - so download speed alone does not seem to be the issue.

Could you run the scripts in igvR/misc/serveYourOwnFiles/testZeaMaysFromMinnesota/ and report any insights you have? I may be doing something dumb.

Thanks @paul-shannon , the two human scripts load very fast, but it took ~1 minute to load the maize one on my side too. It seems to be due to the GFF file - I removed the gene track and it loads very fast then.

The IGV-web app works as normal though: https://s3.msi.umn.edu/zhoup-igv/index.html

@orionzhou,

Let's defer the slow maize gff load puzzle for now.

A new version of the package (1.9.4) offers this method, shown here with two examples from inst/unitTests/test_setCustomGenome.R.

Give me feedback, please. Both use hg38 from igv at aws. First, with all parameters set explicitly:

setCustomGenome(igv,
                id="hg38",
                genomeName="Human (GRCh38/hg38)",
                fastaURL="https://s3.amazonaws.com/igv.broadinstitute.org/genomes/seq/hg38/hg38.fa",
                fastaIndexURL="https://s3.amazonaws.com/igv.broadinstitute.org/genomes/seq/hg38/hg38.fa.fai",
                cytobandURL="https://s3.amazonaws.com/igv.broadinstitute.org/annotations/hg38/cytoBandIdeo.txt",
                chromosomeAliasURL=NA,
                geneAnnotationName="Refseq Genes",
                geneAnnotationURL="https://s3.amazonaws.com/igv.org.genomes/hg38/refGene.txt.gz",
                geneAnnotationTrackHeight=500,
                geneAnnotationTrackColor="red",
                initialLocus="chr5:88,621,308-89,001,037",
                visibilityWindow=5000000)

and then with the minimum parameters:

setCustomGenome(igv,
                id="hg38",
                genomeName="Human (GRCh38/hg38)",
                fastaURL="https://s3.amazonaws.com/igv.broadinstitute.org/genomes/seq/hg38/hg38.fa",
                fastaIndexURL="https://s3.amazonaws.com/igv.broadinstitute.org/genomes/seq/hg38/hg38.fa.fai")
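
For a genome hosted elsewhere, the minimal call needs only four values. The R sketch below builds them for the maize files posted earlier in this thread. `minimal_genome_args` is a hypothetical helper, not part of igvR, and it assumes the index sits next to the fasta with a ".fai" suffix, as it does in the URLs above.

```r
# Hypothetical helper: assemble the minimal named arguments that
# setCustomGenome takes, deriving the index URL from the fasta URL.
# Assumption: the index file is the fasta URL plus ".fai".
minimal_genome_args <- function(id, genomeName, fastaURL) {
  list(id = id,
       genomeName = genomeName,
       fastaURL = fastaURL,
       fastaIndexURL = paste0(fastaURL, ".fai"))
}

maize <- minimal_genome_args(
  id = "maize",
  genomeName = "Zea mays B73v4",
  fastaURL = "https://s3.msi.umn.edu/zhoup-igv-data/Zmays-B73/10.fasta")
str(maize)

# With igvR loaded and an open browser session in `igv`, the call would be:
# do.call(setCustomGenome, c(list(igv), maize))
```

Keeping the arguments in a list makes it easy to add the optional ones (gene annotation, cytoband, initial locus) later without retyping the call.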

Great! Thanks a bunch Paul!