institut-de-genomique/HAPO-G

Just to confirm whether I get the right output files

Closed this issue · 17 comments

Hello, I ran the HAPO-G yesterday for polishing, and it looks like it successfully finished (I got the message in the end "Thanks for using Hapo-G, have a great day :-)"). But I don't quite get the output files, and I attach a picture below to show. I don't see the polished fasta file. If the the assembly.fasta is the correct file, the file size seems a bit small (and I think this is a soft link to the original assembly, right? it appeared immediately I started the run.)? So may I ask where is the polished assembly? Thanks!
Screenshot 2022-02-28 at 12 59 31

Hello,
the polished assembly is in hapog_results/hapog.fasta.

Do some logs in logs/hapog_{chunk}.e contain errors?

Also, did you polish the assembly with long reads? If so, it is still very experimental and I don't recommend doing it with Hapo-G at the moment.

Hi, sorry, I realised that I got one sample finished. and the one can't be finished has this screen output

Checking dependencies...
        Found BWA.
        Found Samtools.

Indexing the BAM file...
Done in 13 seconds

Fragmenting the genome into 30 chunks of 4,240,170 bases (depending of scaffold sizes)
Done in 1 seconds

Extracting bam for each chunk
Done in 30 seconds

Launching Hapo-G on each chunk
ERROR: Hapo-G didn't finish successfully, exit code: -11
Faulty command: /miniconda3/envs/hapog/bin/build/hapog -b chunks_bam/chunks_3.bam -f chunks/chunks_3.fasta -o hapog_chunks/chunks_3.fasta -c hapog_chunks/chunks_3.changes

I had checked logs/hapog_chunks_3.e, and it's empty. What do you think that could cause this? I had this error -11 several times already.

Could you launch the faulty command independently to precisely see the error? I mean this one:

/miniconda3/envs/hapog/bin/build/hapog -b chunks_bam/chunks_3.bam -f chunks/chunks_3.fasta -o hapog_chunks/chunks_3.fasta -c hapog_chunks/chunks_3.changes

hmm, it seems strange... I run the faulty command mentioned above, and then I think it finished successfully... So I guess it's somehow random? Will it be a problem if I run several hapog command at the same time in our cluster? There should be no memory problem I think because we have a lot of nodes and a lot of memory per node

(hapog) [ucbtzz1@login15 polish2]$ /miniconda3/envs/hapog/bin/build/hapog -b chunks_bam/chunks_3.bam -f chunks/chunks_3.fasta -o hapog_chunks/chunks_3.fasta -c hapog_chunks/chunks_3.changes
Reference sequence : contig_110_polished
|**************************************************| 100%
Reference sequence : contig_111_polished
|**************************************************| 100%
Reference sequence : contig_112_polished
|**************************************************| 100%
Reference sequence : contig_114_polished
|**************************************************| 100%
Reference sequence : contig_117_polished
|**************************************************| 100%
Reference sequence : contig_119_polished
Segmentation fault

It did not finish successfully, you can see the Segmentation fault at the end of the log. Would it be possible for you to send us the fasta file and the bam file that are used in the faulty command to investigate the issue?

Hello, sorry for the late reply! Sure. How can I send it to you?

I'm sorry, I was sure I replied to your message. Could you please shoot an email to bistace@genoscope.cns.fr? I will send you a link to securely upload your data with Renater Filesender.

I'm sorry, I was sure I replied to your message. Could you please shoot an email to bistace@genoscope.cns.fr? I will send you a link to securely upload your data with Renater Filesender.

Hi, I have the same problem, how can I solve it? Thank you very much!

Checking dependencies...
Found BWA.
Found Samtools.

Generating bwa index...
Done in 897 seconds

Launching mapping on genome...
Done in 20070 seconds

Indexing the BAM file...
Done in 691 seconds

Fragmenting the genome into 23 chunks of 50,191,572 bases (depending of scaffold sizes)
Done in 8 seconds

Extracting bam for each chunk
Done in 388 seconds

Launching Hapo-G on each chunk
ERROR: Hapo-G didn't finish successfully, exit code: -9
Faulty command: /HAPO-G/build/hapog -b chunks_bam/chunks_5.bam -f chunks/chunks_5.fasta -o hapog_chunks/chunks_5.fasta -c hapog_chunks/chunks_5.changes

HI @wjq1981,
could you please open up a new issue with all log files?

Hello,
we've pushed a few changes on Conda and on Github that specifically solve some problems encountered when polishing with long reads. Could you please check it out?

Hello, sorry for the very late reply! I was occupied by other projects for a while. Now I tried the new version with combined HiFi and Illumina bam file, and it got stuck on "launching Hapo-G on each chunk" for more than 12 hours already... the bam file is 9G...

Here are my codes

minimap2 -t 20 -ax map-hifi $assembly.fasta ${HiFi.reads} > ${name}.hifi.sam
samtools view -Sb ${name}.hifi.sam | samtools sort -@ 20 > ${name}.hifi.bam
bwa index $assembly.fasta
bwa mem -t 20 -o ${name}.illumina.sam ${assembly.fasta} ${Illumina_seq1} ${Illumina_seq2}
samtools view -Sb ${name}.illumina.sam | samtools sort -@ 20 > ${name}.illumina.bam
samtools merge -@ 20 ${name}.merged.bam ${name}.hifi.bam ${name}.illumina.bam
samtools sort -@ 20 ${name}.merged.bam > ${name}.merged.sorted.bam
samtools index ${name}.merged.sorted.bam
hapog.py --genome ${assembly.fasta} -b ${name}.merged.sorted.bam -o ${name}_hapog -t 30 -u

and the output for hapog.py

$ hapog.py --genome assembly.fasta -b name.merged.sorted.bam -o name_hapog -t 30 -u

Checking dependencies...
	Found BWA.
	Found Samtools.

Indexing the BAM file...
Done in 157 seconds

Fragmenting the genome into 30 chunks of 5,073,883 bases (depending of scaffold sizes)
Done in 2 seconds

Extracting bam for each chunk
Done in 253 seconds

Launching Hapo-G on each chunk

Is it normal for long-reads data to take this long time?

Hello,

it should be pretty fast (i.e less than a day) on a small bam file. I would recommend that you wait for it to finish but just to be sure, do you see hapog processes on the machine when you execute top? Also, I would recommend running minimap2 with the --secondary=no options as some lines in the bam won't be formatted in the same way than with BWA and could cause some crashes.

Just to add on that, we never tested to give hapog short reads and long reads as a single input so there may be some side effects that we have not encountered yet.

Thanks!! Yes, it's running. Before I was not sure whether to add --secondary=no in minimap2, and now it's clear. Thanks for mentioning that! I did Illumina reads successfully before with the older version of hapog, and it was very fast. that's why I asked this time when it took more than 12 hours. I will update later whether it's finished or not. Thanks!

Hello! I'm happy to confirm that hapog works well with the merged bam file from hifi long-read and illumina short-read! Thanks a lot for the easy-to-use tool! btw, when we have both hifi and illumina, could you recommend how to use both in polishing? combined or separately?

I am glad it worked well!

I would say that separately would be better as there is a coverage cutoff set to 50x to limit the number of calculations we have to do. That means that for each position of the assembly, we only look at the first 50 alignments over that position. This could cause problems if the long-reads are after the short reads in the bam file because that could lead to long-reads not being used for correcting the assembly.

As for the order in which to use the reads, we did not look into that but intuitively, I would say that it may be better to polish first with short reads and then with long reads as repeats may not all be well corrected with only short reads.