Find2genomehaplo: A Python repository from melissamlwong

##Detect two haplotypes in genome assembly using reference genome

This program aims to identify two haplotype sequences within a genome assembly guided by a reference genome. A common problem in genome assembly of heterozygous genome using Illumina short reads is two haploid genome were assembled separately. The detection of these sequences is very important and interesting to some biologists, e.g. studying the origins, mutations and structural variations between the two haploid genomes especially from a hybrid species and investigating X-linked and Y-linked genes etc.

####Workflow:

Extract exon sequences from the reference genome
Identify scaffolds of heterozygous genome which contain exons from a reference genome using blastn
blast-2-sequences the top 2 best hit for each exon
Identify syntenic blocks using DAGchainer

##Getting started

####Install the following dependencies:

Executable for blastn
Executable for makeblastdb
Executable for dagchainer

*Note: Full path must be specified in the script if the executable is not found in the path

####Usage

usage: find2genomehaplo.py [-h] infile ref gff

Identify two haplotypes in a higly heterozygous genome assembly using
reference genome

positional arguments:
   infile      fasta file of heterozygous genome
   ref         fasta file of masked reference genome
   gff         gff3 file of reference genome
optional arguments:
  -h, --help  show this help message and exit

####Output files

The file "dagchainer.dat.aligncoords" contains the results for the two identified haplotypes. Filtering is required to summarize the results.
This pipeline keeps a number of temporary files which make it easier to re-run if the process is interrupted (exon.fa, allexon.blast, unique_pair.dat, top2_blast.tagset, b2seq.blast, dagchainer.dat)

melissamlwong/Find2genomehaplo