bioinfologics/w2rap-contigger

K parameter determination

mictadlo opened this issue · 5 comments

Hi,
What is the best strategy to determine the K parameter?

Thank you in advance,

Michal

Hi Michal,

Choosing the K parameter is usually a trial-and-error process, as it depends on the characteristics of your genome and on the coverage and quality of the reads you're assembling. If your genome is quite small, you can run multiple assemblies with different values of K and evaluate each one. We usually start with the default K=200, then explore values around it (180, 220, etc.). You can push K higher if you have higher coverage. We generally don't use other software to estimate this parameter.
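A minimal sketch of such a sweep, assuming Python is available to drive the runs. It only builds one command line per K value and does not launch anything. The flag names (`-t`, `-m`, `-r`, `-o`, `-p`, `-K`), the read file names, and the `asm_k<K>` naming scheme are assumptions for illustration; check them, and which K values your build accepts, against the help output of your installed w2rap-contigger.

```python
import shlex


def sweep_commands(k_values, reads, threads=16, mem_gb=200):
    """Build one contigger command line per K value (nothing is executed).

    The flags below are assumed, not verified against a specific release.
    """
    commands = []
    for k in k_values:
        name = f"asm_k{k}"  # hypothetical per-run output directory and prefix
        cmd = [
            "w2rap-contigger",
            "-t", str(threads),   # assumed: number of threads
            "-m", str(mem_gb),    # assumed: memory limit in GB
            "-r", reads,          # assumed: comma-separated paired read files
            "-o", name,           # assumed: output directory
            "-p", name,           # assumed: output file prefix
            "-K", str(k),         # K value under test
        ]
        commands.append(shlex.join(cmd))
    return commands
```

Calling `sweep_commands([180, 200, 220], "reads_R1.fastq,reads_R2.fastq")` gives three command lines that can be pasted into a job script, with each assembly written to its own directory so the results can be evaluated side by side.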

Hope this helps.

Best,
Jon

Hi Jon,
Should I check it as described here? Could you provide examples of how to recognize when to increase or decrease K?

Thank you in advance.

Michal

Hi Michal,
Yes, these are the things we usually check. They are all methods of assembly validation: for example, checking whether the contigs accurately represent the PE reads (using KAT comp) or whether genes are reconstructed correctly (using BUSCO). If you have any RNA-seq data, you should align these reads, or build transcripts and align those, to your contigs to see how many genes are represented in each assembly. Different values of K will generate slightly different representations of your genome, so it's up to you to determine which one most accurately reflects the genome you are constructing. Longer contigs will look better in terms of N50 but are not necessarily the most accurate if they contain mis-assemblies.
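For reference, the N50 mentioned above can be computed directly from the contig lengths of each assembly. This is a minimal sketch in Python; the function names `fasta_lengths` and `n50` are made up for illustration and are not part of w2rap-contigger, KAT, or BUSCO. It measures contiguity only, so it says nothing about mis-assemblies.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from longest to shortest, reaches half of the assembly size."""
    if not lengths:
        return 0
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
```

With a contigs file open as `handle`, `n50(fasta_lengths(handle))` gives one number per assembly. A higher value for one K is only meaningful alongside the validation checks described above.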
Best,
Jon

Hi Jon,
Yes, we have RNA-seq data, and we were thinking of using one of the RNA-seq scaffolders and LR_Gapcloser. Have you ever used any of them?

Thank you in advance,

Michal

We only use RNA-seq reads for assembly validation, not scaffolding, and we generally don't do gap-closing, as this can introduce errors.