
Question about model's training data

huangnengCSU opened this issue · 2 comments

MarginPolish && HELEN is such an excellent pipeline for polishing ONT assemblies: it is easy to run and has very high accuracy. I am using the latest model to polish some human data. I wonder what data you used to train the models MP_r941_guppy344_human.json and HELEN_r941_guppy344_human.pkl. The training datasets for these two models are not mentioned in the paper. Which sample and which chromosomes were used: HG002, CHM13, or HG00733, and chr1-6 or chr1-19, chr21-22?


@huangnengCSU ,

MP_r941_guppy344_human.json and HELEN_r941_guppy344_human.pkl use the same training setup that is explained in the paper, but with data basecalled with the guppy 3.4.4 model. The underlying training data is the same: HG002 chr1-19.

Just to update you, internally we have switched to a new polisher that produces similar or better accuracy compared to MarginPolish-HELEN if your data is basecalled with guppy 3.0.5 or higher.

Thanks so much for your response. I will try your new polisher to generate a more accurate assembly.