kishwarshafin/helen

Some problems occur when evaluating the NA12878 chromosome 21 data

huangnengCSU opened this issue · 4 comments

Hi,
I ran HELEN to polish a draft assembly of NA12878 chromosome 21, but there seem to be some problems with the polished results.
First, I ran MarginPolish to generate the image features with the command:
marginPolish read2assembly.sort.bam ../assembly.fasta ~/tools/MarginPolish/params/allParams.np.human.r94-g235.json -o chr21_margin -t 60 -f
Second, I ran helen to generate a more accurate assembly with the command:
helen polish -i output_files/ -m ~/tools/helen/models/HELEN_r941_guppy344_human.pkl -b 512 -w 4 -t 60 -o helenPolish -p chr21_helen -g
After that, I used pomoxis to evaluate the error rate of the polished assembly.
The following two figures show the results for MarginPolish and HELEN.
[Two images: pomoxis assessment results for the MarginPolish and HELEN assemblies]

Neng

Hello @huangnengCSU ,

I have a few questions:

  1. What guppy version are you using? I see you are using the guppy 2.3.5 model file for MP, but for HELEN you are using the guppy 3.4.4 model. Do you know which version of the basecaller you are using?

  2. How did you generate your assembly? What are the pomoxis numbers for the raw assembly?

  3. I see you are using hg38 as the "truth". That will have some potential issues, especially in the centromeric regions, segmental duplications, and SVs of NA12878. HG38 is not truly representative of any "one" sample. I would highly recommend using the BED file provided by GIAB and passing that to pomoxis for a better assessment.
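
For point 3, a minimal sketch of what a BED-restricted assessment could look like. It only prints the pomoxis command rather than running it. All file names are hypothetical placeholders, `build_assess_cmd` is a helper made up for this illustration, and the `-b` option (limit the assessment to the regions in a BED file) is assumed from the pomoxis help text, so please confirm it with `assess_assembly -h` for your installed version.

```shell
#!/usr/bin/env bash
# Dry-run sketch: build and print the assess_assembly command line.
# Nothing is executed; every file name here is a placeholder.
build_assess_cmd() {
    local assembly="$1" reference="$2" bed="$3" prefix="$4" threads="${5:-4}"
    # -i assembly to assess, -r reference ("truth"), -p output prefix,
    # -t alignment threads. -b (BED of regions to assess) is an assumption;
    # check `assess_assembly -h` before relying on it.
    echo "assess_assembly -i ${assembly} -r ${reference} -b ${bed} -p ${prefix} -t ${threads}"
}

# Hypothetical inputs: polished chr21 assembly, reference, GIAB confident regions.
build_assess_cmd chr21_helen.fa reference_chr21.fa giab_confident_chr21.bed helen_assess 60
```

Running the same helper for the raw, MarginPolish, and HELEN assemblies with the same BED file would keep the three error-rate numbers comparable.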

To kishwarshafin
The raw data was basecalled with Albacore v2.3.4. The long-read assembler we used was Flye. Thank you for your advice about the reference.

neng

@huangnengCSU ,

Albacore v2.3.4 is obsolete at this point, and we don't have a model that supports that basecaller. There are publicly available data basecalled with the latest basecaller that you can use. If you have issues with the newest basecallers, please let us know.
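
Since the model files are tied to a basecaller version, a rough pre-flight check can compare the version tags embedded in the two model file names before polishing. This is only a sketch: the helper names `guppy_tag` and `check_models` are made up here, and the naming pattern (`g235`, `guppy344`) is inferred from the two file names in this thread, not from any documented convention. It does not tell you which basecaller produced your reads.

```shell
#!/usr/bin/env bash
# Extract the three-digit basecaller tag (e.g. "235" from "g235",
# "344" from "guppy344") from a model file name.
# The pattern is an assumption based on the file names in this thread.
guppy_tag() {
    basename "$1" | sed -nE 's/.*(g|guppy)([0-9]{3}).*/\2/p'
}

# Compare the tags of a MarginPolish params file and a HELEN model file.
check_models() {
    local mp_tag helen_tag
    mp_tag="$(guppy_tag "$1")"
    helen_tag="$(guppy_tag "$2")"
    if [ -z "$mp_tag" ] || [ -z "$helen_tag" ]; then
        echo "unknown"                              # no tag found in a name
    elif [ "$mp_tag" = "$helen_tag" ]; then
        echo "match ${mp_tag}"
    else
        echo "mismatch ${mp_tag} vs ${helen_tag}"
    fi
}

check_models allParams.np.human.r94-g235.json HELEN_r941_guppy344_human.pkl
# prints: mismatch 235 vs 344
```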

To kishwarshafin
Thanks for answering the question, I will close the issue.