ruanjue/wtdbg2

CCS data assembly far too small

fkyoung1992 opened this issue · 4 comments

Dear Prof. Ruan,
I assembled a plant genome (~600 Mb, ~1.94% heterozygosity) from ~400G of PacBio Sequel II CCS data with the following command:
wtdbg2 -t 0 -x ccs -g 600m -i ccs23.fastq.gz -o beichai34 -e 2
The k-mer distribution looked like this:
|
|
|
|
|
|
|
|
|
||
||
||
||
||
|||
|||
||||
|||||
|||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
********************** 1 - 201 **********************
Quatiles:
10% 20% 30% 40% 50% 60% 70% 80% 90% 95%
1 2 4 6 11 20 55 269 1779 9742
** PROC_STAT(0) **: real 2439.237 sec, user 6326.840 sec, sys 773.720 sec, maxrss 95177552.0 kB, maxvsize 130789572.0 kB
[Wed Apr 19 12:10:56 2023] - high frequency kmer depth is set to 13776
[Wed Apr 19 12:10:56 2023] - Total kmers = 728629161
[Wed Apr 19 12:10:56 2023] - average kmer depth = 7
[Wed Apr 19 12:10:56 2023] - 368836231 low frequency kmers (<2)
[Wed Apr 19 12:10:56 2023] - 4011 high frequency kmers (>13776)
In the end I obtained only 4 contigs (TOT 54784).
How should I adjust the parameters to get a reliable result in this case? Thanks a lot.

Then I used the default parameters, without the -x option, like this (please ignore the different input name; it is actually the same file):
wtdbg2 -t 36 -i SRR16122634.fastq.gz -g 600m -e 3 -o beichai
and obtained this k-mer distribution:
|
|
|
|
|
|
|
|
|
|
|
||
||
||
||
|||
|||
||||
|||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
********************** 1 - 201 **********************
Quatiles:
10% 20% 30% 40% 50% 60% 70% 80% 90% 95%
1 2 3 6 10 22 99 507 3494 20155

If the kmer distribution is not good, please kill me and adjust -k, -p, and -K

Cannot get a good distribution anyway, should adjust -S -s, also -A -e in assembly

** PROC_STAT(0) **: real 3155.775 sec, user 6807.670 sec, sys 931.790 sec, maxrss 118617004.0 kB, maxvsize 189408212.0 kB
[Mon Apr 17 16:54:12 2023] - high frequency kmer depth is set to 23594
[Mon Apr 17 16:54:12 2023] - Total kmers = 674175558
[Mon Apr 17 16:54:12 2023] - average kmer depth = 6
[Mon Apr 17 16:54:12 2023] - 364160349 low frequency kmers (<2)
[Mon Apr 17 16:54:12 2023] - 2058 high frequency kmers (>23594)
[Mon Apr 17 16:54:12 2023] - indexing 310013151 kmers, 2157318418 instances (at most)
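The k-mer counts in the two log excerpts above can be cross-checked with a little arithmetic. The sketch below is only an illustration: the dictionary keys and the function name `kmer_breakdown` are made up here, and the reading that "Total kmers" minus the low- and high-frequency k-mers gives the indexed count is an inference from the numbers, not something stated by wtdbg2 itself.

```python
# Numbers copied from the two log excerpts in this post.
# The labels and key names are hypothetical, chosen for this sketch.
runs = {
    "-x ccs -e 2": {"total": 728629161, "low": 368836231, "high": 4011},
    "defaults, -e 3": {"total": 674175558, "low": 364160349, "high": 2058},
}


def kmer_breakdown(total: int, low: int, high: int) -> dict:
    """Return the k-mers left after dropping low/high frequency ones,
    plus the share of k-mers flagged as low frequency (<2)."""
    return {
        "kept": total - low - high,
        "low_fraction": low / total,
    }


for label, counts in runs.items():
    result = kmer_breakdown(**counts)
    print(f"{label}: kept {result['kept']}, "
          f"low-frequency share {result['low_fraction']:.1%}")
```

For the second run the subtraction gives 310013151, the same number as in the "indexing 310013151 kmers" line, which suggests the reading is right. In both runs roughly half of the k-mers fall in the low-frequency (<2) bin.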

This time the assembly looked much better:
[Mon Apr 17 18:33:17 2023] Estimated: TOT 305317120, CNT 11722, AVG 26047, MAX 212992, N50 31744, L50 2891, N90 14848, L90 8550, Min 5120
[Mon Apr 17 18:33:35 2023] output 11722 contigs
But it is still only about half of the expected size. How should I adjust the parameters? Any response would be much appreciated!
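To put a number on "half of the expected size", the "Estimated:" line can be parsed and compared with the genome size passed to -g. This is a minimal sketch: the helper names `parse_estimated` and `parse_size` are hypothetical, and reading "600m" as 600,000,000 bases (powers of 1000) is an assumption made here.

```python
import re


def parse_estimated(line: str) -> dict:
    """Extract the NAME VALUE pairs that follow 'Estimated:' in a log line."""
    _, _, stats = line.partition("Estimated:")
    pairs = re.findall(r"([A-Za-z]\w*) (\d+)", stats)
    return {name: int(value) for name, value in pairs}


def parse_size(text: str) -> int:
    """Convert a size such as '600m' to a base count.

    Assumption: k/m/g suffixes are powers of 1000.
    """
    units = {"k": 10**3, "m": 10**6, "g": 10**9}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)


# Log line copied from the run with default parameters.
line = ("[Mon Apr 17 18:33:17 2023] Estimated: TOT 305317120, CNT 11722, "
        "AVG 26047, MAX 212992, N50 31744, L50 2891, N90 14848, L90 8550, "
        "Min 5120")

stats = parse_estimated(line)
fraction = stats["TOT"] / parse_size("600m")
print(f"assembled {stats['TOT']} bp, {fraction:.1%} of the expected size")
```

With the values above, the assembled total is about 50.9% of a 600 Mb genome.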

Please have a look at #259

Thank you for your reply.
As you suggested in the cited issue, I tried wtdbg2 -g 600m -t 0 -p 0 -k 19 -AS 4 -K 0.05 -s 0.3 -i SRR16122634.fastq.gz -o bei424, but obtained a much worse result (see below) than with the default parameters shown in my second comment.
[Tue Apr 25 15:20:13 2023] Estimated: TOT 528640, CNT 21, AVG 25174, MAX 81664, N50 59136, L50 4, N90 10240, L90 14, Min 5376
[Tue Apr 25 15:20:13 2023] output 21 contigs.

I tried hifiasm too, but it ran for weeks without producing any log or output files, which seems abnormal for a 600 Mb genome. So I really need your help. Do you have any other suggestions for wtdbg2?

Please check your fastq data first, and read #259 again.
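One way to act on the advice to check the fastq data is to stream the file once and summarise it. The sketch below is an assumption about what such a check could look like, not a procedure taken from #259. The function name `fastq_stats` and its result keys are hypothetical, and it assumes plain four-line FASTQ records.

```python
import gzip


def fastq_stats(path: str, genome_size: int) -> dict:
    """Stream a FASTQ file (gzipped if the name ends in .gz) and summarise it.

    Assumes four lines per record: @header, sequence, +, quality.
    """
    opener = gzip.open if path.endswith(".gz") else open
    reads = bases = malformed = 0
    shortest = longest = None
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            if not header.strip():
                continue  # tolerate stray blank lines
            seq = handle.readline().strip()
            plus = handle.readline()
            qual = handle.readline().strip()
            reads += 1
            bases += len(seq)
            # A record is suspicious if its markers are wrong or the
            # sequence and quality strings differ in length.
            if (not header.startswith("@") or not plus.startswith("+")
                    or len(seq) != len(qual)):
                malformed += 1
            shortest = len(seq) if shortest is None else min(shortest, len(seq))
            longest = len(seq) if longest is None else max(longest, len(seq))
    return {
        "reads": reads,
        "bases": bases,
        "mean_length": bases / reads if reads else 0.0,
        "min_length": shortest,
        "max_length": longest,
        "malformed": malformed,
        "coverage": bases / genome_size,
    }


# Example call with the file name from this thread:
# print(fastq_stats("SRR16122634.fastq.gz", 600_000_000))
```

The resulting coverage can then be compared with what is expected for the data set. The post quotes ~400G of data without saying whether that means bases or file size, so the measured base count is the number to trust.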