flye usage
emilydolivo97 opened this issue · 1 comments
I chose "nanopolish" to call variants.
When I read the "nanopolish" documentation, I found that I should provide my filtered FASTQ files in FASTA format (I don't have the data in FAST5 format).
For this purpose I used "seqtk" to convert each of my 10 FASTQ files to FASTA, and now I'm using "flye" for genome assembly. The problem is that the program is taking too long.
Is there a way to stop it at a specific stage, or to speed it up? The output of that stage must still be suitable for my nanopolish analysis.
This is my script:
import os
import subprocess


class FastqToAssembledFasta:
    def __init__(self, input_folder, output_folder):
        self.input_folder = input_folder
        self.output_folder = output_folder

    def convert_to_fasta(self):
        # Create the output folder if it doesn't exist
        os.makedirs(self.output_folder, exist_ok=True)
        # Get a list of all FASTQ files in the input folder
        fastq_files = [f for f in os.listdir(self.input_folder) if f.endswith('.fastq.gz')]
        # Convert each FASTQ file to FASTA format
        for fastq_file in fastq_files:
            input_path = os.path.join(self.input_folder, fastq_file)
            output_path = os.path.join(self.output_folder, fastq_file.replace('.fastq.gz', '.fasta'))
            seqtk_cmd = f'seqtk seq -a {input_path} > {output_path}'
            # check=True raises an error if seqtk fails instead of continuing silently
            subprocess.run(seqtk_cmd, shell=True, check=True)
        return self.output_folder  # Return the output folder containing the converted FASTA files

    def assemble_reads(self, fasta_folder):
        # Assemble the reads using Flye
        flye_cmd = f'flye --nano-raw {fasta_folder}/*.fasta --out-dir {self.output_folder} -t 8 --keep-haplotypes'
        subprocess.run(flye_cmd, shell=True, check=True)
        # Flye writes its result to assembly.fasta inside --out-dir, so the
        # original rename (assembly.fasta -> assembly.fasta) was a no-op and is dropped
        return os.path.join(self.output_folder, 'assembly.fasta')


input_folder = '/data/filtred_reads'
output_folder = '/data/converted_assembled_reads'

# Convert FASTQ files to FASTA format
converter = FastqToAssembledFasta(input_folder, output_folder)
fasta_folder = converter.convert_to_fasta()

# Assemble reads into a single FASTA file using Flye
converter.assemble_reads(fasta_folder)
Please see the manual for estimated running times for different datasets. It also explains how to stop and resume from different stages. If anything is unclear, feel free to follow up in this topic.
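As a rough illustration of the stop / resume options mentioned above, here is a minimal sketch that builds a Flye command line with `--stop-after` or `--resume`. The helper `build_flye_cmd` and the `FLYE_STAGES` tuple are hypothetical names introduced for this example, and the stage names are an assumption based on Flye's usual output subdirectories; please check them against the manual for your Flye version. Which stage yields output suitable for nanopolish is not something this sketch decides.

```python
import shlex

# Assumed stage names (not taken from this thread); verify against the Flye manual.
FLYE_STAGES = ("assembly", "consensus", "repeat", "contigger", "polishing")


def build_flye_cmd(reads, out_dir, threads=8, stop_after=None, resume=False):
    """Return the Flye command as an argument list (hypothetical helper)."""
    cmd = ["flye", "--nano-raw", *reads, "--out-dir", out_dir, "--threads", str(threads)]
    if stop_after is not None:
        if stop_after not in FLYE_STAGES:
            raise ValueError(f"unknown stage: {stop_after!r}")
        # Stop once the named stage has completed
        cmd += ["--stop-after", stop_after]
    if resume:
        # Continue a previous run from the last completed stage in out_dir
        cmd.append("--resume")
    return cmd


if __name__ == "__main__":
    # Placeholder paths; replace with your own read files and output folder
    first_run = build_flye_cmd(["reads.fasta"], "flye_out", stop_after="contigger")
    later_run = build_flye_cmd(["reads.fasta"], "flye_out", resume=True)
    print(shlex.join(first_run))
    print(shlex.join(later_run))
```

The argument list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids `shell=True`; in that case the read files need to be listed explicitly (for example with `glob.glob`), since the shell no longer expands `*.fasta`.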