mikolmogorov/Flye

--polish-target has reduced N50 and largest contig


Hello

Firstly, thanks for making this great tool.

I have used Flye 2.9.3-b1797 to assemble ONT reads for a plant genome (expected genome size 830 Mb). I installed Flye using bioconda. Before running Flye I removed reads shorter than 5 kb.

flye --nano-hq /u/project/vlsork/ldpeck/longreads/fastq/${INFILE%_*}_ALLpass.fl5kb.fastq.gz \
        --genome-size 830m -o flye-hq-${INFILE%_*} -t 7 --scaffold

Assembly                    assembly  
# contigs (>= 0 bp)         7408      
# contigs (>= 1000 bp)      7407      
# contigs (>= 5000 bp)      7381      
# contigs (>= 10000 bp)     7327      
# contigs (>= 25000 bp)     6986      
# contigs (>= 50000 bp)     6169      
Total length (>= 0 bp)      2947843323
Total length (>= 1000 bp)   2947842862
Total length (>= 5000 bp)   2947760873
Total length (>= 10000 bp)  2947336694
Total length (>= 25000 bp)  2941151963
Total length (>= 50000 bp)  2910673812
# contigs                   7395      
Largest contig              7272865   
Total length                2947819947
GC (%)                      35.47     
N50                         821055    
N90                         189133    
auN                         1272818.7 
L50                         922       
L90                         3789  
# N's per 100 kbp           0.47  

Then I ran --polish-target with two iterations

flye --polish-target flye-hq-${INFILE%_*}/assembly.fasta \
	--nano-hq /u/project/vlsork/ldpeck/longreads/fastq/${INFILE%_*}_ALLpass.fl5kb.fastq.gz \
	--iterations 2 --threads 7

Assembly                    polished_2
# contigs (>= 0 bp)         7173      
# contigs (>= 1000 bp)      7121      
# contigs (>= 5000 bp)      6960      
# contigs (>= 10000 bp)     6656      
# contigs (>= 25000 bp)     5754      
# contigs (>= 50000 bp)     4806      
Total length (>= 0 bp)      1557125882
Total length (>= 1000 bp)   1557095537
Total length (>= 5000 bp)   1556616863
Total length (>= 10000 bp)  1554379940
Total length (>= 25000 bp)  1538848805
Total length (>= 50000 bp)  1504360171
# contigs                   7034      
Largest contig              5266179   
Total length                1556930564
GC (%)                      35.44     
N50                         487344    
N90                         108678    
auN                         780367.3  
L50                         811       
L90                         3461      
# N's per 100 kbp           0.00 

You can see that polishing removed the N's and reduced the total number of contigs, but the N50 and the largest contig have both decreased. I have attached both Flye log files, from the original assembly step (flye.log) and from the polishing step (flye_polish.log).

Do you know why this might be?

Thanks

Lily

flye.log.gz
flye_polish.log
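For readers comparing the two reports, here is a minimal sketch of how N50 is commonly defined: the length of the contig at which the cumulative sum of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. This assumes the standard definition and is not taken from the code of the tool that produced the tables above. The function name `n50` and the toy lengths are hypothetical.

```python
def n50(lengths):
    """Return N50 for a list of contig lengths (standard definition, assumed)."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Toy contig lengths (made-up values, not from the assemblies in this thread).
toy = [100, 80, 50, 30, 20, 10]
toy_without_longest = [80, 50, 30, 20, 10]

print(n50(toy))
print(n50(toy_without_longest))
```

Because N50 depends on both the contig lengths and the total length, it is hard to compare between two assemblies whose total lengths differ as much as they do here.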

Hi Lily,

Total length has dropped quite a bit - this is unexpected. I think it may have something to do with scaffolding. If you want to add additional polishing iterations, you can use the -i argument during the assembly; it runs polishing on contigs rather than scaffolds. With new ONT data, 1 round of polishing is usually sufficient.

Hi @mikolmogorov

Thank you, I think you were right about the scaffolding flag. I was also surprised by the total lengths above: the expected genome size is 830 Mb, so the assembly had more than tripled in size. Running the command below, without --scaffold, I now get a total length much closer to what I expected.

Thanks

Lily

flye --nano-hq /u/project/vlsork/ldpeck/longreads/fastq/${INFILE%_*}_ALLpass.fl5kb.fastq.gz \
        --genome-size 830m -o flye-hq-${INFILE%_*} -t 7 --iterations 1

Assembly                    assembly 
# contigs (>= 0 bp)         7036     
# contigs (>= 1000 bp)      6994     
# contigs (>= 5000 bp)      6783     
# contigs (>= 10000 bp)     6142     
# contigs (>= 25000 bp)     4686     
# contigs (>= 50000 bp)     3536     
Total length (>= 0 bp)      995236639
Total length (>= 1000 bp)   995209587
Total length (>= 5000 bp)   994556442
Total length (>= 10000 bp)  989618918
Total length (>= 25000 bp)  965235033
Total length (>= 50000 bp)  924011244
# contigs                   6900     
Largest contig              5407364  
Total length                995031275
GC (%)                      35.43    
N50                         368393   
N90                         68235    
auN                         660728.6 
L50                         637      
L90                         3052     
# N's per 100 kbp           0.00
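
As a quick check of the size comparison discussed in this thread, the sketch below divides the "Total length" value from each of the three reports by the 830 Mb passed to --genome-size. The numbers are copied from the reports above. The dictionary labels and variable names are only illustrative.

```python
# Expected genome size, as given to Flye via --genome-size 830m.
EXPECTED_GENOME_SIZE = 830_000_000

# "Total length" rows copied from the three reports in this thread.
total_lengths = {
    "assembly (with --scaffold)": 2_947_819_947,
    "polished_2 (--polish-target, 2 iterations)": 1_556_930_564,
    "assembly (--iterations 1, no --scaffold)": 995_031_275,
}

# Ratio of each reported total length to the expected genome size.
ratios = {
    name: total / EXPECTED_GENOME_SIZE
    for name, total in total_lengths.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x expected genome size")
```

Under these numbers the scaffolded assembly is about 3.55 times the expected size, the polished output about 1.88 times, and the run without --scaffold about 1.20 times.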