marbl/MashMap

Filtering for finding duplications

iminkin opened this issue · 4 comments

Hi,

Suppose I would like to map a genome to itself in order to find duplications. What is the correct filtering option (-f, --filter_mode) for this use case?

-f none should be appropriate, since it disables filtering altogether.

Please note that MashMap does not perform DP-based alignment; its mappings are estimated from Jaccard similarity alone. Depending on your needs, you may want to run Smith-Waterman on MashMap's output to discard false positives.
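For a self-mapping run, the output also contains the trivial hit of each region onto itself, so it usually needs a first pass before any alignment step. Below is a minimal Python sketch of that pass. It assumes the space-delimited output layout of query name, length, start, end, strand, target name, length, start, end, identity; please verify this against the version you are running. The names `Mapping`, `parse_mappings` and `drop_self_hits` are made up for illustration and are not part of MashMap.

```python
from typing import Iterable, List, NamedTuple


class Mapping(NamedTuple):
    query: str
    q_start: int
    q_end: int
    strand: str
    target: str
    t_start: int
    t_end: int
    identity: float


def parse_mappings(lines: Iterable[str]) -> List[Mapping]:
    """Parse whitespace-delimited mapping lines (assumed 10-column layout)."""
    mappings = []
    for line in lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        mappings.append(
            Mapping(
                query=fields[0],
                q_start=int(fields[2]),
                q_end=int(fields[3]),
                strand=fields[4],
                target=fields[5],
                t_start=int(fields[7]),
                t_end=int(fields[8]),
                identity=float(fields[9]),
            )
        )
    return mappings


def drop_self_hits(mappings: Iterable[Mapping]) -> List[Mapping]:
    """Drop mappings whose query and target intervals overlap on the same sequence.

    Note: this also drops overlapping tandem duplications, which may or may
    not be what you want.
    """
    return [
        m
        for m in mappings
        if not (m.query == m.target and m.q_start < m.t_end and m.t_start < m.q_end)
    ]
```

The surviving mappings are candidates only; they would still need an alignment step to confirm them.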

Thank you very much!

Apologies for reviving the thread, but I have a similar question. While going through your paper to see how you extracted segmental duplications, I ran into a few doubts about Section 3.2.1 (Methodology):

Q. Discarding regions mapping with < 90% id and 500bp or lower: Shouldn't the regions be merged together to see if they are part of a longer window? For example, consider the segment below:
chr1 100 1600
in which the regions are
chr1 380 880
chr1 900 1500

Q. Why 500bp cutoff? Should we discard everything below 1000bp given that later it is mentioned smaller segments like Alu may cause an inflated motif length?

Sorry, SDs are new to me, so forgive me if this sounds naive!

Q. Discarding regions mapping with < 90% id and 500bp or lower: Shouldn't the regions be merged together to see if they are a part of a longer window?

MashMap does attempt to merge adjacent windows, so this merging happens whenever possible.
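To make the example from the question concrete, here is a small Python sketch of gap-tolerant interval merging. It is not MashMap's internal merging logic; the function `merge_intervals` and its `max_gap` parameter are hypothetical and only illustrate how chr1 380-880 and chr1 900-1500 could collapse into one window when a 20 bp gap is tolerated.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)


def merge_intervals(intervals: List[Interval], max_gap: int = 0) -> List[Interval]:
    """Merge intervals on the same chromosome separated by at most max_gap bases."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            prev_chrom, prev_start, prev_end = merged[-1]
            # Extend the previous interval; max() handles nested intervals.
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


regions = [("chr1", 380, 880), ("chr1", 900, 1500)]
print(merge_intervals(regions, max_gap=50))  # the 20 bp gap is bridged
print(merge_intervals(regions, max_gap=0))   # the regions stay separate
```

With `max_gap=50` the two regions become a single chr1 380-1500 window; with `max_gap=0` they remain two separate regions.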

Q. Why 500bp cutoff? Should we discard everything below 1000bp given that later it is mentioned smaller segments like Alu may cause an inflated motif length?

This is because, at the start, we don't know the end points of the >=1000bp alignments. A 500 bp match could lie within such an alignment. This is a consequence of the alignment-free approximation, so we need to check with alignment whether the 500 bp mapping extends to a valid SD.
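The two-stage logic described here can be sketched roughly as follows in Python. This is an illustration only, not the pipeline from the paper: the function names are hypothetical, the lengths and identities in the usage lines are invented, and applying the 90% identity threshold again at the alignment stage is an assumption.

```python
def is_candidate(mapped_len: int, identity: float,
                 min_len: int = 500, min_identity: float = 90.0) -> bool:
    """Stage 1: keep an approximate mapping as a candidate (permissive cutoff)."""
    return mapped_len >= min_len and identity >= min_identity


def is_valid_sd(aligned_len: int, aligned_identity: float,
                min_len: int = 1000, min_identity: float = 90.0) -> bool:
    """Stage 2: after alignment, keep only alignments long enough to call an SD."""
    return aligned_len >= min_len and aligned_identity >= min_identity


# A 500 bp mapping survives stage 1 even though it is below 1000 bp ...
print(is_candidate(500, 92.0))
# ... and is kept only if the alignment around it extends far enough.
print(is_valid_sd(1400, 91.5))
print(is_valid_sd(600, 95.0))
```

The point of the permissive first cutoff is that a short mapping is not discarded before alignment has had a chance to extend it.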