marbl/MashMap

Decontamination of bacterial sequences in an assembly

svm-zhang opened this issue · 13 comments

Hello,

Thanks very much for MashMap!

I have an assembly that is contaminated with lots of bacterial sequences, and I'd like to try MashMap to tease them out. I downloaded all complete RefSeq bacterial genomes and put them in a single FASTA file. My assembly is highly fragmented, with 300k+ contigs. When I ran MashMap as below,

mashmap -r ${refseq} -q ${assembly} -s 500 -t ${NT} -o ${out}

the program consumed a large amount of RAM (>= 200 GB). My questions are:

  1. Can MashMap be used for this kind of analysis?
  2. If yes, should I split my bacterial database into chunks and run it in multiple batches?

Any advice would be appreciated!

cheers,
Simo

Hi Simo, MashMap should be applicable to the problem you described above.

Do check the identity threshold (--perc_identity, --pi) parameter; you may want to adjust it to your requirements. Increasing it (as well as the segment length) would lower the runtime and memory usage. As you said, splitting the database into chunks is also a valid option.
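For illustration, the chunked approach could look like the sketch below. This is only an outline, not a recipe from the MashMap authors: the file names, chunk count, and parameter values (-s 2500 --pi 95) are placeholders you would tune for your data.

```shell
refseq=refseq.fa      # placeholder: concatenated RefSeq bacterial genomes
assembly=assembly.fa  # placeholder: the fragmented assembly

# Split the multi-FASTA database into 4 chunks by interleaving records,
# so the chunks stay roughly equal in record count.
if [ -f "$refseq" ]; then
    awk -v n=4 '/^>/{c=(c+1)%n} {print > ("refseq_chunk" c ".fa")}' "$refseq"
fi

# Map the assembly against each chunk separately, then merge the results.
# (Guarded so the loop only runs where mashmap is actually installed.)
if command -v mashmap >/dev/null; then
    for chunk in refseq_chunk*.fa; do
        mashmap -r "$chunk" -q "$assembly" -s 2500 --pi 95 -o "${chunk%.fa}.out"
    done
    cat refseq_chunk*.out > mashmap_all.out
fi
```

Because each chunk's reference index is built independently, peak memory is bounded by the largest chunk rather than the whole database.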

I would expect the program to need a lot of RAM and runtime at the thresholds you provided.

Let me know if you have any questions. Will be curious to know how it goes.

One more thing that may be helpful: include the representative genome (corresponding to your assembly) in the database as well. This would improve the specificity of the method for the correct portions of your assembly.

Hi Jain,

Thanks very much for the tip! I just tried with higher --pi (95%) and the memory usage was dramatically reduced.

Regarding your latest comment, what did you mean by "representative genome"? You meant a related reference genome for the species I am assembling, right?

cheers,
Simo

Right

Thanks.

I have another question: if I'd like to decontaminate at the read level, I should go with Mash instead of MashMap, right?

Simo

It depends. I think Mash will help you find the source of contamination (e.g., which bacteria contribute to your assembly), but for finding which reads are contaminated, you may want to use a read mapper.

Mash doesn't look at individual reads, only k-mer level analysis, so the screen command can tell you which genomes are contained in your read set, but not which reads contribute to those genomes. MashMap gives you read-level information, so you could use Mash to find a set of candidate genomes and then MashMap to map the reads against them and see which reads are contaminating.
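For concreteness, that two-step workflow might be sketched as below. All file names are placeholders, and the assumption that column 5 of `mash screen` output is a usable genome file name depends on how the sketch was built; treat this as an outline, not a vetted pipeline.

```shell
# Step 1 (Mash): which candidate genomes are contained in the read set?
# Step 2 (MashMap): which individual reads map to those genomes?
if command -v mash >/dev/null && command -v mashmap >/dev/null; then
    mash sketch -o refs genomes/*.fa          # sketch the candidate genomes
    mash screen -w refs.msh reads.fa > screen.tab

    # Take the 5 best-supported genomes; column 1 is the identity estimate,
    # and column 5 is the sketch entry's ID (here, the genome file name).
    sort -gr screen.tab | head -5 | cut -f5 | xargs cat > candidates.fa

    # Map reads against the candidates to flag contaminated reads.
    mashmap -r candidates.fa -q reads.fa -o read_hits.out
fi
```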

This makes a lot of sense. Thanks very much!

Hi,

I've got similar questions to svm-zhang's. I downloaded all the bacterial genomes from NCBI and mapped my PacBio reads against them to remove the contaminated ones. I ran into two problems; the version I used was v2.0.

  1. When I used the default parameters, MashMap ran successfully in about 5 hours.
$path2mashmap -r $contaminants -q third_all.fasta -o mashmap.out

Start time is 2018/07/02--19:10
>>>>>>>>>>>>>>>>>>
Reference = [/home/niuyw/RefData/NCBI_contaminants/contaminants.fa]
Query = [third_all.fasta]
Kmer size = 16
Window size = 90
Segment length = 5000 (read split allowed)
Alphabet = DNA
Percentage identity threshold = 85%
Mapping output file = mashmap.out
Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
Execution threads  = 1
>>>>>>>>>>>>>>>>>>
INFO, skch::Sketch::build, minimizers picked from reference = 985588968
INFO, skch::Sketch::index, unique minimizers = 58991923
INFO, skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 14684157) ... (88712, 1)
INFO, skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 3592 times during lookup.
INFO, skch::main, Time spent computing the reference index: 4258.82 sec
INFO, skch::Map::mapQuery, [count of mapped reads, reads qualified for mapping, total input reads] = [463569, 4080228, 6633142]
INFO, skch::main, Time spent mapping the query : 18524.8 sec
INFO, skch::main, mapping results saved in : mashmap.out
Finish time is 2018/07/03--01:30

But when I adjusted the parameters -s and --pi to filter contaminated reads more strictly, it failed, and it seemed to be a RAM problem.

$path2mashmap -r $contaminants -t 8 -q third_all.fasta -s 500 --pi 70 -o mashmap1.out

>>>>>>>>>>>>>>>>>>
Reference = [/home/niuyw/RefData/NCBI_contaminants/contaminants.fa]
Query = [third_all.fasta]
Kmer size = 16
Window size = 1
Segment length = 500 (read split allowed)
Alphabet = DNA
Percentage identity threshold = 70%
Mapping output file = mashmap1.out
Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
Execution threads  = 8
>>>>>>>>>>>>>>>>>>
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

My questions are:

  • Is there any way to get it to run? The compute node I submitted my job to has 2 TB of RAM; is that enough? Or should I split my input data?
  • For contamination removal, what percentage identity threshold do you think is appropriate?
  • When I checked the 10th column of the output, I found the minimum value was about 80%. Shouldn't it be 85% (the default of --perc_identity)?
$ sort -t ' ' -k10,10n mashmap.out |head -3
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/100026/37064_49118 12054 7054 12053 + kraken:taxid|336810|NZ_CP021172.1 218034 75901 80900 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/100308/48537_58156 9619 4619 9618 - kraken:taxid|880070|NC_015914.1 6221273 4110216 4115215 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/103161/48125_53357 5232 232 5231 + kraken:taxid|336810|NZ_CP021172.1 218034 75901 80900 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104812/38481_50150 11669 0 4999 - kraken:taxid|877455|NC_015216.1 2583753 1814585 1819584 80.8247

Thank you in advance!

Hi @YiweiNiu, -s 500 --pi 70 is probably too strict for MashMap. If you want parameters stricter than the defaults, you can try -s 2500 --pi 85 or -s 2500 --pi 80; that should be good enough for your application. The stricter the parameters, the higher the resource requirements (both memory and time).

Also, the 10th column conveys the identity estimated from k-mer hits, which can be lower than the threshold. These mappings are output to ensure high sensitivity.
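Since the reported identity can fall below the threshold, one option is to post-filter the output yourself. A minimal sketch, assuming the v2.0 space-separated output format shown above (the file name and the 85% cutoff are placeholders):

```shell
# Keep only hits whose estimated identity (column 10 of MashMap v2 output)
# is >= 85, and collect the affected read names into a contaminant list.
if [ -f mashmap.out ]; then
    awk '$10 >= 85 { print $1 }' mashmap.out | sort -u > contaminated_reads.txt
fi
```

The resulting read-name list can then be fed to whatever tool you use to drop reads from the FASTA/FASTQ input.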

Thank you for your reply!

Could you please explain that in more detail? About 10% of my reads are shorter than 1 kb, and ~40% are shorter than 5 kb; that's why I wanted to lower the -s parameter.

And you forgot my second question :) When I checked the 10th column of the output, I found the minimum value was about 80%. Shouldn't it be 85% (the default of --perc_identity)?

Try -s 500 --pi 85 and see if you get output.

  • Are the pacbio reads sequenced from a whole-genome or a metagenome?
  • How big is your $contaminants database?
  • I would also recommend looking at MetaMaps in case it better suits your needs.

For the second question, please see my previous response again; I edited it later.

I tried -s 500 --pi 85 and it ran successfully.

The PacBio reads were from several insects (individuals were pooled to get enough DNA). The $contaminants database contains about 30,000 sequences, merging the archaea, bacteria, fungi, protozoa, and viral sequences from NCBI. So it's very large.