jolespin/veba

[Feature Request] Adding MEGAHIT to assembly module

erfanshekarriz opened this issue · 7 comments

Hi and thanks for the great suite!

I'm a deep sea genomic data analyst and I've found your pseudo-co-assembly method and the eukaryotic database extremely helpful in identifying microeukaryotic metagenomes in complex and difficult-to-reach environments.

I wanted to kindly request if a rapid assembly software like MEGAHIT could be added to the assembly module. I do acknowledge the high-quality performance of metaSPADEs, but for large datasets, the computational requirement can be extremely intensive (being the only assembly software my labmates and I cannot run on our server). I personally use MEGAHIT and was wondering if it (or any alignment software of its kind) could be added to the suite.

Best,

Erfan Shekarriz

Hi Erfan,

That is really exciting research and can't wait to see what your findings are with the the software suite! Once everything is published, please send it over and I'll make sure to link out the publication on my GitHub so other people can find it easier.

I wanted to kindly request if a rapid assembly software like MEGAHIT could be added to the assembly module. I do acknowledge the high-quality performance of metaSPADEs, but for large datasets, the computational requirement can be extremely intensive (being the only assembly software my labmates and I cannot run on our server). I personally use MEGAHIT and was wondering if it (or any alignment software of its kind) could be added to the suite.

Absolutely! I'll try to add this into my update I'm working on for v1.0.3 that I plan to release by the end of the month. I just need to run a few tests with MEGAHIT to make sure none of the scripts are dependent on the SPAdes identifiers.

If you don't want to wait until then, all you need to do is run MEGAHIT separately and map the reads to the resulting scaffolds/contigs.

For instance, here's some code that should do the trick. Just make sure to add the threads:

mkdir -p assembly_megahit
for ID in $(cat identifiers.list); do
   # Make the directory
   OUT_DIR=assembly_megahit/${ID}/output
   mkdir -p ${OUT_DIR}

   # Get the R1/R2 reads 
   R1=veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz
   R2=veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz

   # Run MEGAHIT2
   megahit -1 ${R1} -2 ${R2} -o ${OUT_DIR}
   CONTIGS=${OUT_DIR}/final.contigs.fa
 
   # Bowtie2 index 
   bowtie2-build --seed 0  ${CONTIGS} ${CONTIGS}
   
  # Bowtie2
   bowtie2 -x ${CONTIGS} -1 ${R1} -2 ${R2} --seed 0 --no-unal | samtools sort --reference ${CONTIGS} -T ${OUT_DIR}/samtool_sort_tmp > ${OUT_DIR}/mapped.sorted.bam

  # Samtools index 
   samtools index ${OUT_DIR}/mapped.sorted.bam

done

I'll keep you updated here and any progress towards this.

Let me know if you try the above and need any help.

Cheers,
-Josh

Dear Josh,

Awesome! Since I'm personally comfortable with bash scripting I've gone ahead and used megahit separately, but I thought it would be a useful feature for others :)

The publication is in final minor revision and should be out by the end of the month (or the next) so I will definitely send it to you after all is done! Thanks for your help and look forward to seeing VEBA expand.

Best of lucks,

Erfan

I ended up getting it all done today. Here's the release and change notes:
https://github.com/jolespin/veba/releases/tag/v1.0.3

Out of curiosity, how many eukaryotes were you able to recover and were any only able to recovered via pseudo-coassembly?

I ended up getting it all done today. Here's the release and change notes:
https://github.com/jolespin/veba/releases/tag/v1.0.3

Out of curiosity, how many eukaryotes were you able to recover and were any only able to recovered via pseudo-coassembly?

Hi Josh,

Awesome! Thanks for the efficiency. We're currently tweaking and fine tuning the protocol.

As of now we don't have any complete genomes, only incomplete fragments, but are tesing out different routes to see if we can pull anything (we still haven't leveraged all the tools on VEBA but plan to by the end of this month).

What I can tell you is that we've tried 9 routes/pipelines to retrieve microeukaryotic genomes from a global dataset in the deep sea. All but this have been unsuccessful.

Mostly because all eukaryotic MAGs in complex environments start with alignment-driven/ database-intensive methods. This to me is silly because eukaryotic genes have introns that are not conserved. Eukaryotic genomes are dominated by non-conserved non-coding sequences. Some eukaryotic protein domains have the same 3D structure but only 20% sequence similarity due to post-translational modifications.

This is an even bigger problem if your data comes from an under-studied and isolated biome like the deep sea. The taxa here highly diverge in terms of evolution and sequence for many years. This makes alignment virtually useless. The only sequences that can be detected using alignment methods are rRNA genes (which are non-coincidentally the most recorded in the NCBI database).

The VEBA pipeline helps in multiple ways. First that you filter out prokaryotic and human contamination which increases accuracy. Second that the assembly and binning (metabat2, concoct, maxbin2) are all de-novo with no alignment requirement. The binning is also guided by a probabilistic model based on an ab-initio deep-learning method for identifying eukaryotic contigs (Tiara). You also have a custom micro-Eukaryotic database for MetaEuk. A cherry on top is the pseudo-assembly approach which is very useful for isolated environments with divergent species. Because here the goal isn't to compare metagenomes per sample site, but rather what the genomes of these divergent prokaryotes look like in the extreme/niche isolated biomes (e.g plastic sphere, oil spills, hydrothermal vents). The pseudo-assembly addresses the coverage issue very well here without being guided by alignment! All and all it's great that you've done your research and avoided alignment at all costs until the very end.

For the future, I think it would be interesting to explore the possibility of replacing MetaEuk with a de-novo software (perhaps a deep-learning method) that can automatically extract exons (and cut introns) when finding eukaryotic ORFs, then feeding it to an RNAseq database. This way you can bypass the full splicing issue and leverage Hidden Markov Models for detecting domains and identifying genes using 3D modeling (like with PHYRE2).

Anyhow that's my little essay on why I think the pipeline could be an awesome solution for eukaryotic metagenome binning! As for our results, I will send you a final detailed report of our findings and see how we can make VEBA even better for the purpose.

Hope this all helps and I will reach out soon,

Best

Erfan

As of now we don't have any complete genomes, only incomplete fragments, but are tesing out different routes to see if we can pull anything (we still haven't leveraged all the tools on VEBA but plan to by the end of this month).

Depending on your dataset and your compute resource availability, have you tried bona fide co-assembly? If you're not recovering any complete genomes this could be a viable option but it would require you to reassemble using the concatenated reads.

Here's a walkthrough to do it with VEBA if you're interested.

For the future, I think it would be interesting to explore the possibility of replacing MetaEuk with a de-novo software (perhaps a deep-learning method) that can automatically extract exons (and cut introns) when finding eukaryotic ORFs, then feeding it to an RNAseq database. This way you can bypass the full splicing issue and leverage Hidden Markov Models for detecting domains and identifying genes using 3D modeling (like with PHYRE2).

I used MetaEuk because it was the only software that was conda installable and didn't require complicated license agreements like GENEMARK-EP+. My understanding of MetaEuk is that it only aligns that exons not the full protein sequence but this could be complicated with proteins that are highly divergent but still fold in a similar way. If you come across any eukaryotic gene modeling software that can be installed with either PyPI or Conda, please send them my way and I'll do my best to incorporate it.

Anyhow that's my little essay on why I think the pipeline could be an awesome solution for eukaryotic metagenome binning!

Your feedback is extremely appreciated. If you have any hiccups or any feature requests, send them my way and I'll try to incorporate them ASAP. I tried to think of everything but could always use insight from other researchers with different workflows. My goal for this software is to automate all of the tasks we would have to do manually which includes running steps that we usually do post hoc (e.g., sequence statistics, counts tables generation if BAM files are provided) and reusing files when available (e.g., VEBA pretty much maps once during the assembly phase and that's used for everything downstream).

@erfanshekarriz I'd like to add more marine fungi (and definitely deep sea fungi) to my microeukaryotic db. I'm releasing version 2 right now but want to include it for version 3 in the future.

Do you know of any databases that have these organisms? In particular, protein sequences from these organsisms.

Hi Josh!

I'm sorry for not getting back to you sooner. I was in the middle of the jungle and had absolutely no reception for a few weeks. Happy new year!

Do you know of any databases that have these organisms? In particular, protein sequences from these organisms.

As for your question, there are currently no well-curated databases for marine fungi specifically. Terrestrial and marine fungi groups share many species although I'm not entirely sure about their sequence divergence since it's a newly developed field (https://www.marinefungi.org/history-marine-mycology.php). Deep-sea fungi is even a farther stretch. My paper is essentially the 5th that explores the functional role of fungi in the cold seep microbiome. Hoping there is more to come!

I would recommend three different general sources for fungal genome models in general that include several marine species:

  1. Mycocosm: mycocosm.jgi.doe.gov/mycocosm/home
  2. Ensembl Fungi: https://fungi.ensembl.org/index.html
  3. FungiDB: https://fungidb.org/fungidb/app

For Deep Sea fungi you can see this review https://link.springer.com/chapter/10.1007/978-3-030-19030-9_17 that describe 120 species that have been isolated. I'm hoping most have been sequenced but not entirely sure. I will see if one day I can curate them into a database, but for not they have to be manually retrieved.

Best,

Erfan