About read length in multi-pass read simulation

Question

wzboy1984 opened this issue a year ago · 3 comments

wzboy1984 commented a year ago

Dear author,

I'm confused by the read length setting in multi-pass read simulation, which the documentation describes as fixed to the --length-mean value.
What is the reason for disabling --length-sd and setting the read length to --length-mean in this case?

Note: for multi-pass sequencing in WGS simulation, the read length is set roughly equal to the --length-mean value, and --length-sd is disabled.

Best wishes,

Answer 1 · 2023-10-07T13:37:15.000Z

Thank you for using PBSIM.
Unlike PacBio CLR and Nanopore reads, the length variance of PacBio HiFi reads is small.
The HiFi read simulation can also be made to use --length-sd, but we believe that a constant HiFi read length is not a disadvantage for simulating HiFi reads.
We are always looking to improve PBSIM's HiFi read simulation and welcome your comments and suggestions.

Answer 2 · 2023-10-08T08:24:15.000Z

Thanks for your reply.
I was simulating HiFi reads for genome assembly. When all the HiFi reads had the same length, far fewer contained reads were found, and as a result the assembler took longer to run.
Usually, about 3 out of 4 reads are contained reads, so such a low percentage of contained reads is abnormal.
I suggest keeping --length-sd rather than fixing the HiFi read length.
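
To illustrate the effect, here is a minimal Python sketch. It is not PBSIM code: the function names, the normal length distribution, and the genome size, read count, and length values are all assumptions chosen for the demonstration. It places reads at uniform random positions and counts how many lie entirely inside another read, once with a fixed length and once with a nonzero standard deviation.

```python
import random


def simulate_reads(genome_len, n_reads, length_mean, length_sd, rng):
    """Return (start, end) intervals for reads placed uniformly at random.

    Assumption: lengths follow a normal distribution truncated at 100 bp.
    This is a toy model, not the distribution PBSIM actually uses.
    """
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(genome_len)
        if length_sd > 0:
            length = max(100, round(rng.gauss(length_mean, length_sd)))
        else:
            length = length_mean  # fixed length, as in the multi-pass case
        reads.append((start, start + length))
    return reads


def contained_fraction(reads):
    """Fraction of reads whose interval lies entirely inside another read."""
    # Sort by start ascending, then end descending, so every read that could
    # contain a given read appears before it in the list.
    ordered = sorted(reads, key=lambda r: (r[0], -r[1]))
    max_end = -1
    contained = 0
    for _start, end in ordered:
        if end <= max_end:
            # An earlier read starts no later and ends no earlier.
            contained += 1
        else:
            max_end = end
    return contained / len(ordered)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical setup: 1 Mb genome, 2000 reads of ~15 kb (about 30x depth).
    fixed = simulate_reads(1_000_000, 2000, 15_000, 0, rng)
    varied = simulate_reads(1_000_000, 2000, 15_000, 3_000, rng)
    print("fixed length  :", contained_fraction(fixed))
    print("length sd 3 kb:", contained_fraction(varied))
```

With a fixed length, a read can only be contained by another read that starts at exactly the same position, so the contained fraction in this toy model stays close to zero; with variable lengths it becomes a large share of the reads.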

Answer 3 · 2023-10-08T13:31:12.000Z

I understand the problem you are having.
We will improve PBSIM to use --length-sd in HiFi read simulation. However, there are some points we would like to consider regarding how to implement it, and the release is scheduled for next month.