MarWoes/wg-blimp

Support for different species

Opened this issue · 12 comments

Hi,

currently only the hg19 and hg38 human genome builds appear to be supported by this pipeline.
It would be great if support for other species could be added!
I'm particularly interested in using Mmul_10.

Definitely a good idea, I'll have a look at it!

Hi, I would love to see this extended to non-model organisms.
For example being able to over-ride the genome reference and annotation with our own by including a fasta file and a GFF file.
Thanks for a great tool though!

Thanks for your comment! I agree that wg-blimp is too limited as of yet.

As a first step, I will likely add an option to omit annotation altogether, so that the whole workflow can at least be executed and produce DMRs. Currently, segmentation does not work without annotation, but unannotated DMRs can still be called by specifying the following in the configuration file:

target_files:
- dmr/combined-dmrs.csv
- qc/multiqc_report.html
- qc/methylation_metrics.csv

I'll look into the easiest way to integrate multiple species, but I think the best approach would be to allow arbitrary GFF/BED files and to provide some pre-built annotation tracks (for example for human data).

Implemented as of 7812ec0 . mmul10 is added to the list of supported references. Gene annotations are now based on arbitrary GTF files, so in principle arbitrary organisms may now be run through the analysis pipeline.

The GUI still needs some tweaks so that it does not break when an annotation is missing. I'll try to do more in-depth testing and bug fixing and include the changes in the next release; hopefully everything will be updated by the end of next week.

For the time being, if there are any urgent analyses, wg-blimp may be installed through Bioconda and then reinstalled from source using pip install . to include all recent changes.
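
A minimal sketch of that install route, wrapped in a function so that nothing runs until it is called. The environment name, the channel order, and the function name are assumptions for illustration, not something prescribed in this thread:

```shell
# Sketch: install the released package (and its dependencies) from Bioconda,
# then overlay the current source tree on top of it with pip.
# The default environment name "wg-blimp" is a hypothetical choice.
install_wg_blimp_from_source() {
    env_name="${1:-wg-blimp}"

    # Released version plus all dependencies from Bioconda
    conda create -y -n "$env_name" -c conda-forge -c bioconda wg-blimp || return 1

    # Current development state of the repository
    git clone https://github.com/MarWoes/wg-blimp.git || return 1

    # Reinstall from source inside the same environment;
    # the subshell keeps the caller's working directory unchanged
    ( cd wg-blimp && conda run -n "$env_name" pip install . )
}
```

Calling install_wg_blimp_from_source (optionally with a different environment name as the first argument) performs the three steps in order and stops at the first failure.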

Thanks for adding this!

Included in release v0.9.6. If any issues remain, feel free to reopen!

Is it possible to use Ensembl GTFs or GFFs?

A conversion step using the wonderful gffread could be really helpful for getting a consistent format!

Yes, Ensembl GTFs may be used for analysis. In fact, the included mmul10 setting downloads its GTF files from Ensembl, whereas human data uses Gencode GTF files. Thanks for the pointer, I'll have a look at whether it's possible to support more formats!

Hi @MarWoes,

how is it going with this?

Would it now be possible to run the pipeline on any other species a user is interested in?

thanks,
Hequan

Hi @HeQSun ,

Thanks for reaching out! It is in theory possible to analyse arbitrary species with wg-blimp.

The only information you need is a reference fasta and a GTF file for gene annotation (Ensembl files should work). You can use the command wg-blimp create-config --genome_build 'None' to create a configuration file for your purposes. In the created YAML file you may then set the GTF parameter accordingly.
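
One sanity check that may be worth running on your own files first: Ensembl names sequences 1, 2, ... while UCSC and Gencode use chr1, chr2, ..., so a FASTA and a GTF taken from different sources may not share sequence names. The sketch below is not part of wg-blimp; the function name is hypothetical and it assumes both files are uncompressed:

```shell
# Print every sequence name that occurs in the GTF but not in the FASTA.
# No output means all annotated sequences exist in the reference.
gtf_names_missing_from_fasta() {
    fasta="$1"
    gtf="$2"
    fa_names=$(mktemp) || return 1
    gtf_names=$(mktemp) || return 1

    # FASTA headers: first word after '>'
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$fa_names"

    # GTF: column 1 of every non-comment, non-empty line
    awk -F'\t' '!/^#/ && NF {print $1}' "$gtf" | sort -u > "$gtf_names"

    # comm -13 keeps only the lines unique to the second (GTF) list
    comm -13 "$fa_names" "$gtf_names"

    rm -f "$fa_names" "$gtf_names"
}
```

For example, gtf_names_missing_from_fasta genome.fa annotation.gtf (both file names are placeholders) lists the names that would need renaming before the two files can be used together.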

Please note that for methylome segmentation through MethylSeekR it is necessary to provide CpG island information. These annotations were downloaded and put into this repository manually from UCSC, for example from here. If you want to utilize segmentation, download the corresponding table from that resource, gzip it, and set the YAML config parameter accordingly.
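
The gzip step can be sketched as follows. The function name and the example file name are hypothetical; the only assumption about the input is that it is the plain-text table saved from UCSC:

```shell
# Compress a downloaded UCSC table so it can be referenced from the YAML config.
# Prints the path of the compressed file on success.
compress_table() {
    table="$1"

    # Refuse to continue if the download is missing or empty
    [ -s "$table" ] || { echo "missing or empty: $table" >&2; return 1; }

    # -c writes to stdout, so the uncompressed original is kept
    gzip -c "$table" > "$table.gz" || return 1

    # -t verifies the integrity of the compressed file
    gzip -t "$table.gz" || return 1

    echo "$table.gz"
}
```

Running compress_table cpgIslandExt.txt (a placeholder file name) leaves the original in place and writes cpgIslandExt.txt.gz next to it; that path is what goes into the config.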

There is currently only limited documentation for arbitrary species support, so I'll re-open this issue and look into how to document this use case better.

I hope this helps a little bit. Don't hesitate to reach out if you have any further questions or anything is breaking!

Best,
Marius

Hi @MarWoes ,

thanks for your reply. I will try what you suggested and let you know if it works.

Best,
Hequan

Hey Marius,

The pipeline is stable and working for me! I started looking over the first batch of results yesterday. I wanted to update this issue with all the information needed to run the pipeline on mouse samples, in the hope that this will help others in the future.

After following the step-by-step guide to set up the pipeline, I ran my files using the config file attached below. Make sure you rename it to a .yaml file before running (I can only upload .txt files to this comment).
config.txt

Here is the command I use to run the pipeline:

$ wg-blimp run-snakemake-from-config --cores=8 config.yaml

Here is where I downloaded the files referenced in the config file for mouse samples.

# Reference genome from Gencode
$ wget http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/latest_release/GRCm39.genome.fa.gz
# Decompress the gzipped file (the name must match the downloaded file)
$ gzip -d GRCm39.genome.fa.gz

# Get an annotation of the reference genome, also from Gencode
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/latest_release/gencode.vM27.annotation.gtf.gz

Files from the UCSC Table Browser

CpG island (CGI) annotation file
https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1219084465_HRDjHL1J6LoO4NKjTqqdzM3L9RS0&clade=mammal&org=Mouse&db=mm39&hgta_group=allTracks&hgta_track=cpgIslandExt&hgta_table=0&hgta_regionType=genome&position=chr12%3A56%2C741%2C761-56%2C761%2C390&hgta_outputType=primaryTable&hgta_outFileName=cpi-GRCm38.csv

RepeatMasker file
https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1219084465_HRDjHL1J6LoO4NKjTqqdzM3L9RS0&clade=mammal&org=Mouse&db=mm39&hgta_group=allTracks&hgta_track=rmsk&hgta_table=0&hgta_regionType=genome&position=chr12%3A56%2C741%2C761-56%2C761%2C390&hgta_outputType=primaryTable&hgta_outFileName=rptmsk-GRCm38.csv

Hope this helps others!
Happy coding,
Jake