This folder contains the code used in this publication:
Kapusta, Suh & Feschotte (2017) PNAS doi: 10.1073/pnas.1616702114
v4.0 gives identical results but is much easier to run.
From a multi-species alignment in MAF format and a Newick tree, this script outputs the microdeletions (1 to 30 nt) on the various branches. See its own README for more details.
perl <> -dir <dir_with_alignments> [-sp <SP>] [-lenc <X>] [-lene <X>] [-cpu <X>] [-v] [-h|help]
I. List of sizes of all consecutive empty data for a given species (no data on the browser).
Only reported when the empty data interrupts a scaffold (note that it could correspond to a misassembly).
Empty data = no "s" line. However, there could be an insertion (non-aligning bases): in that case,
the large indel is printed only if (lower case length - insertion length) > lene.
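The filter in case I can be sketched in Python as follows. This is only an illustration, not the script's Perl code: the function name `keep_indel` and its argument names are hypothetical, and `gap_len` is assumed to stand for the length the text above calls "lower case length".

```python
def keep_indel(gap_len, insertion_len, lene):
    """Return True if an empty stretch should be reported.

    gap_len       -- length of the stretch with no "s" line (assumed meaning)
    insertion_len -- non-aligning bases found in the species over that stretch
    lene          -- minimum length threshold (value given to -lene)
    """
    return (gap_len - insertion_len) > lene


# Hypothetical values: a 500 nt empty stretch and a threshold of 100
print(keep_indel(500, 50, 100))   # 450 > 100, so the indel is kept
print(keep_indel(500, 450, 100))  # 50 is not > 100, so it is skipped
```

The comparison is strict, as in the rule above, so a stretch whose corrected length equals the threshold is not reported.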
II. List of sizes of all consecutive empty blocks for a given species (C lines in the maf = contiguous):
"C -- the sequence before and after is contiguous implying that this region was either deleted in the
source or inserted in the reference sequence. The browser draws a single line or a '-' in base mode in these blocks."
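As a rough Python sketch of case II (not the script's Perl code), the function below sums the reference size over runs of consecutive blocks in which the species has an "e" line with status C. It assumes that the first "s" line of each block is the reference, that source names look like `assembly.chrom`, and that the size is measured in reference bases. The function names and the example MAF records are hypothetical.

```python
def split_blocks(maf_lines):
    """Group MAF lines into alignment blocks (each starts with an 'a' line)."""
    blocks, current = [], None
    for line in maf_lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and header/comment lines
        if fields[0] == "a":
            current = []
            blocks.append(current)
        elif current is not None:
            current.append(fields)
    return blocks


def consecutive_c_sizes(maf_lines, species):
    """Return one size per run of consecutive blocks with a C line for species."""
    sizes, run = [], 0
    for block in split_blocks(maf_lines):
        s_lines = [f for f in block if f[0] == "s"]
        # "s" line fields: s src start size strand srcSize text
        ref_size = int(s_lines[0][3]) if s_lines else 0
        # "e" line fields: e src start size strand srcSize status
        is_c = any(
            f[0] == "e" and len(f) == 7
            and f[1].split(".", 1)[0] == species and f[6] == "C"
            for f in block
        )
        if is_c:
            run += ref_size
        elif run:
            sizes.append(run)  # the run of C blocks ended here
            run = 0
    if run:
        sizes.append(run)
    return sizes


example = """\
##maf version=1
a score=0
s hg18.chr7 100 5 + 158545518 ACGTA
e panTro1.chr6 200 0 + 161576975 C

a score=0
s hg18.chr7 105 3 + 158545518 GGT
e panTro1.chr6 200 0 + 161576975 C

a score=0
s hg18.chr7 108 4 + 158545518 ACCA
s panTro1.chr6 200 4 + 161576975 ACCA
""".splitlines()

print(consecutive_c_sizes(example, "panTro1"))  # [8]
```

Two consecutive C blocks of 5 and 3 reference bases are merged into a single entry of 8, which is what "consecutive empty blocks" is taken to mean here.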
In both cases, the output columns are:
[0] chr(SP)
[1] start(SP)
[2] end(SP)
[3] chr/scaffold(Query)
[4] DeletionCoord(Query)
[5] length_with_lc
[6] length_without_lc
[7] length_insertion
lc = lowercase. In column 6, lowercase bases in the reference are removed.
The length in column 6 is therefore the minimum deletion length (underestimated when ancient TEs are present,
but this removes all TE insertions in the reference that would not be deletions).
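The difference between columns 5 and 6 can be illustrated with a short Python sketch. It is not taken from the script: the function name `deletion_lengths` is hypothetical, and it assumes that lowercase letters mark soft-masked repeats in the reference and that '-' is the alignment gap character.

```python
def deletion_lengths(ref_seq):
    """Return (length_with_lc, length_without_lc) for a reference segment."""
    bases = [b for b in ref_seq if b != "-"]  # ignore alignment gaps
    with_lc = len(bases)
    # Count only uppercase bases, i.e. drop soft-masked (lowercase) sequence
    without_lc = sum(1 for b in bases if b.isupper())
    return with_lc, without_lc


# 7 uppercase bases, 6 lowercase bases and 2 gap characters
print(deletion_lengths("ACGTacgtac--GGN"))  # (13, 7)
```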