GENPOL is a consensus module designed to polish contig assemblies, enhancing the accuracy of genome assemblies. It supports both short-read and long-read sequencing technologies.
- Inputs: Requires three files: contigs in FASTA format, reads in FASTA or FASTQ format, and mapping files in BED format that include CIGAR strings.
- Outputs: Generates a set of polished contigs in FASTA format.
To install GENPOL:
git clone https://github.com/mawad89/GenPol.git
cd GenPol
Add the `GenPol` directory to your `PATH` to enable command-line access.
- Read Mapping
  Any mapping tool can be used; the recommended tools are:
  - Minimap2 for long reads.
  - BWA for short reads.
  Note: Ensure the CIGAR string is in extended format (using `=` for matches and `X` for mismatches instead of `M`).
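The extended-format requirement can be checked before any downstream step runs. The following is a minimal sketch and not part of GENPOL: the function names `parse_cigar` and `is_extended_cigar` are hypothetical. For long reads, minimap2 documents an `--eqx` option that emits `=`/`X` operators.

```python
import re

# One CIGAR operation: a length followed by an operator character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operator) pairs (hypothetical helper)."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing characters the pattern did not consume.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops

def is_extended_cigar(cigar):
    """Return True if no operation uses the ambiguous M operator."""
    return all(op != "M" for _, op in parse_cigar(cigar))
```

A CIGAR such as `10=1X5=` passes the check, while `16M` does not, because `M` does not distinguish matches from mismatches.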
- Convert Mapping Files
  Convert the mapped files into BED format, including CIGAR strings.
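As an illustration of this conversion, the sketch below turns one SAM alignment line into a BED-style line that carries the CIGAR string. The column layout (contig, start, end, read name, MAPQ, strand, CIGAR) is an assumption and may differ from what GENPOL's scripts expect; `sam_to_bed` is a hypothetical name.

```python
import re

# CIGAR operators that advance the position on the contig (reference).
REF_CONSUMING = set("MDN=X")

def sam_to_bed(sam_line):
    """Convert one SAM record to a BED-style line, or None if unmapped."""
    fields = sam_line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar = fields[:6]
    flag = int(flag)
    if flag & 4 or cigar == "*":      # unmapped read or missing CIGAR
        return None
    start = int(pos) - 1              # SAM is 1-based, BED is 0-based
    ref_len = sum(
        int(n)
        for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
        if op in REF_CONSUMING
    )
    strand = "-" if flag & 16 else "+"
    # Assumed column order; adjust to the layout the pipeline requires.
    return "\t".join(
        [rname, str(start), str(start + ref_len), qname, mapq, strand, cigar]
    )
```

The end coordinate is derived from the CIGAR string because SAM stores only the leftmost mapping position.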
- Contig Separation
  To reduce processing time, GENPOL polishes each contig individually. Separate the BED files by contig so that contigs can be processed in parallel.
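One way to perform the separation is to group BED lines by their first column and write one file per contig. This is a sketch under assumptions: the name `split_bed_by_contig` and the `<contig>.bed` output naming are hypothetical, and contig names are assumed to be safe to use as file names.

```python
from collections import defaultdict
from pathlib import Path

def split_bed_by_contig(bed_path, out_dir):
    """Write one BED file per contig and return {contig: path}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group lines by the contig name in column 1.
    groups = defaultdict(list)
    with open(bed_path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comments
            groups[line.split("\t", 1)[0]].append(line)

    paths = {}
    for contig, lines in groups.items():
        path = out_dir / f"{contig}.bed"   # assumed naming scheme
        path.write_text("".join(lines))
        paths[contig] = path
    return paths
```

The per-contig files can then be handed to separate worker processes, since no contig depends on another.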
- Run Variant Calling
  Run the `vc.py` script to identify variants within each read, recording the variant positions in both the reads and the contigs.
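To make the idea concrete, the sketch below walks an extended CIGAR string and records each mismatch, insertion, and deletion with its position in both the contig and the read. It illustrates the general technique only and is not the implementation in `vc.py`; the function name and the tuple layout are assumptions.

```python
import re

def call_variants(ref_start, cigar, read_seq):
    """Return (contig_pos, read_pos, op, length, bases) tuples, 0-based."""
    variants = []
    ref_pos, read_pos = ref_start, 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "X":    # mismatch: one record per substituted base
            for i in range(length):
                variants.append(
                    (ref_pos + i, read_pos + i, "X", 1, read_seq[read_pos + i])
                )
        elif op == "I":  # insertion: bases present only in the read
            bases = read_seq[read_pos:read_pos + length]
            variants.append((ref_pos, read_pos, "I", length, bases))
        elif op == "D":  # deletion: contig bases absent from the read
            variants.append((ref_pos, read_pos, "D", length, ""))
        # Advance the coordinates each operator consumes.
        if op in "M=XDN":
            ref_pos += length
        if op in "M=XIS":
            read_pos += length
    return variants
```

With plain `M` operators no mismatch can be detected this way, which is why the extended CIGAR format matters for this step.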
- Filter Variants
  Use the `filter.py` script to aggregate the variants identified by `vc.py`. This step filters out variants with less than 51% occurrence, retaining only those with sufficient support for accurate polishing.
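The 51% rule can be sketched as a majority filter. The example below assumes that "occurrence" means the fraction of reads covering a contig position that report the same variant; that definition, the `depth` mapping, and the function name are assumptions rather than documented behavior of `filter.py`.

```python
from collections import Counter

def filter_variants(calls, depth, min_fraction=0.51):
    """Keep variants reported by at least min_fraction of covering reads.

    calls: iterable of (contig_pos, op, length, bases), one per supporting read.
    depth: dict mapping contig_pos to the number of reads covering it.
    """
    counts = Counter(calls)
    kept = []
    for variant, support in sorted(counts.items()):
        covering = depth.get(variant[0], 0)
        # Skip positions without coverage to avoid dividing by zero.
        if covering and support / covering >= min_fraction:
            kept.append(variant)
    return kept
```

For example, a substitution seen in 6 of 10 covering reads is kept, while one seen in 5 of 10 is discarded.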
- Generate Consensus
  Run the `polish.py` script to generate the consensus sequence for each contig.
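Conceptually, consensus generation applies the retained variants to the contig sequence. The sketch below is one possible way to do this and is not the code in `polish.py`: it assumes non-overlapping variants given as `(contig_pos, op, length, bases)` tuples with 0-based positions, and applies them from right to left so that earlier coordinates stay valid.

```python
def apply_variants(contig_seq, variants):
    """Return the contig sequence with the given variants applied."""
    # At equal positions apply X, then D, then I, so that an insertion
    # is never removed by a deletion starting at the same coordinate.
    priority = {"I": 0, "D": 1, "X": 2}
    ordered = sorted(
        variants, key=lambda v: (v[0], priority[v[1]]), reverse=True
    )
    seq = list(contig_seq)
    for pos, op, length, bases in ordered:
        if op == "X":
            seq[pos:pos + length] = bases   # substitute bases
        elif op == "I":
            seq[pos:pos] = bases            # insert before pos
        elif op == "D":
            del seq[pos:pos + length]       # remove contig bases
    return "".join(seq)
```

Working from the highest coordinate downward avoids having to shift positions after each insertion or deletion.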
- Final Assembly
  Concatenate all polished contigs into a single FASTA file.
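The final concatenation can be done with any tool; a minimal Python sketch follows. The function name `concatenate_fasta` and the file names in the usage note are hypothetical.

```python
from pathlib import Path

def concatenate_fasta(fasta_paths, out_path):
    """Concatenate FASTA files in the given order into out_path."""
    with open(out_path, "w") as out:
        for path in fasta_paths:
            text = Path(path).read_text()
            # Make sure the next header starts on its own line.
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
```

For instance, `concatenate_fasta(sorted(Path("polished").glob("*.fasta")), "polished_assembly.fasta")` would merge every per-contig file found in an assumed `polished` directory.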
GENPOL is distributed under the MIT License. See the `LICENSE` file for details.