/InvDeNovo

Python scripts for identifying inversions in fragmented de novo genome assemblies

Primary LanguagePython

InvDeNovo

Python scripts for identifying inversions in fragmented de novo genome assemblies.

InvDeNovo.py

Version 0.2

Author: Jesper Svedberg, Department of Organismal Biology, Uppsala University

E-mail: jesper.svedberg@ebc.uu.se

InvDeNovo.py tries to identify inversions in fragmented denovo assembles. It does this by looking for inversion breakpoints in a MUMmer alignment coordinate file. The script looks for a pairs of contigs that a) both align next to each other at two different locations on a chromosome and b) have both changed alignment direction at both locations. In other words it looks for the alignment pattern contig_1,contig_2,...,-contig_1,-contig_2. This is the signature of an inversion in the fragmented denovo assembly. The workflow is the following: Align the fragmented denovo assembly against a high quality reference assembly using the nucmer tool in the MUMmer package. Convert the delta file to a coordinate file in the BTAB format using "show-coords -Br file.delta > file.btab".

Usage: InvDeNovo.py -c target_chromosomes [-s score_cutoff -g allowed_gaps] file.btab

Usage of default values of -s and -g is recommended. The script can only search for one chromosome (ie one contig in the reference genome) at one time. InvDeNovo.py outputs results to the terminal.

The score cutoff function is semi-depricated. The primary problem when identifying inversions is spurious alignment of repetitive sequences. Initially filtering of hits was primarily done using this parameter, but I later added a filtering of the alignments generated by MUMmer, in order to remove such spurious alignments, and that strategy worked much better. Hit scores are still reported in the output, but it's not very informative using my datasets. Filtering using this parameters may still be useful for other datasets though.

The gaps allowed parameter controls the number of contigs between a contig pair. Again this was primarily used to deal with spurious alignment of repetitive sequences and may not be very useful. Lowering it to 0 will give the most stringent selection of hits.

Output: btab_filename chromosome contig1:contig2 gap_length coord_breakpoint1 coord_breakpoint2

If contig1 = -contig2, both breakpoints are located within one contig.

InvDeNovo_OneBreakpoint.py

Version 0.2

Author: Jesper Svedberg, Department of Organismal Biology, Uppsala University

E-mail: jesper.svedberg@ebc.uu.se

Companion script to InvDeNovo.py. The script tries to identify inversion in fragmented denovo assemblies, but it only looks for single breakpoints instead of both breakpoints of an inversion. This makes it possible to identify inversions where one breakpoint coincides with breaks in the fragmented assembly (ie, the breakpoint is located between contigs), or when you have complex tandem or overlapping inversions, but it also increases the false positive error rate. This calls for careful manual curation, and we've found that many candidate inversions are actually caused by assembly errors.

The script will look for single contigs that map at two different locations at the target chromosome and output the two alignment locations and the distance between them.

The script takes MUMmer BTAB files as input and outputs to the terminal.

Usage: InvDeNovo_OneBreakpoint.py -c target_chromosome [-s minimum_alignment_block_length -d minimum_alignment_distance -v] input.btab

-s sets minimum alignment block length. Repeats often have very short alignment blocks, and this parameter can filter out those.

-d sets the minimum distance between the two alignment coordinates.

-v sets verbose output.

Non-verbose output format: Inversion candidate: target_chromosome vs. contig alignment_distance coord_alignment1 coord_alignment2