Question about model's training data

Question

huangnengCSU opened this issue 4 years ago · 2 comments

Hi:
MarginPolish and HELEN make an excellent pipeline for polishing ONT assemblies: it is easy to run and very accurate. I am using the latest models to polish some human data. What data did you use to train the models MP_r941_guppy344_human.json and HELEN_r941_guppy344_human.pkl? The training datasets for these two models are not mentioned in the paper. Which sample and which chromosomes were used: HG002, CHM13, or HG00733, and chr1-6 or chr1-19, chr21-22?

Neng

Answer 1 · 2020-12-14T20:23:26.000Z

@huangnengCSU ,

MP_r941_guppy344_human.json and HELEN_r941_guppy344_human.pkl use the same training described in the paper, but on data basecalled with the Guppy 3.4.4 model. The underlying training data is the same: HG002 chr1-19.

Just to update you, internally we have switched to a new polisher (https://github.com/kishwarshafin/pepper) that produces accuracy similar to or better than MarginPolish-HELEN if your data is basecalled with Guppy 3.0.5 or higher.

Answer 2 · 2020-12-15T00:44:34.000Z

@kishwarshafin
Thanks so much for your response. I will try your new polisher to generate a more accurate assembly.