sg.config configuration (The homology of the de novo assembled genome is not known)
hushaoqiang opened this issue · 24 comments
Hello, thank you very much for your SubPhaser software, which is very useful for genotyping. I used HIFI and HIC to assemble the genome of Fragaria x ananassa but I do not know the homology of the 28 scaffolds. Should I put all the chromosomes in the sg.config file in one line?
I am looking forward to your reply.
谢谢老师您的解答,我将使用您所推荐的方式来做同源分析。另外还有个问题请教一下老师您,为了说详细点,我就是用中文了。
老师,使用Camarosa草莓的基因组序列想要试运行一下,基因组文件是发表的序列。配置文件内容:
Fvb1-1 Fvb1-2 Fvb1-3 Fvb1-4
Fvb2-1 Fvb2-2 Fvb2-3 Fvb2-4
Fvb3-1 Fvb3-2 Fvb3-3 Fvb3-4
Fvb4-1 Fvb4-2 Fvb4-3 Fvb4-4
Fvb5-1 Fvb5-2 Fvb5-3 Fvb5-4
Fvb6-1 Fvb6-2 Fvb6-3 Fvb6-4
Fvb7-1 Fvb7-2 Fvb7-3 Fvb7-4
运行命令是:subphaser -i F_ana_Camarosa_6-28-17_hardmasked.fasta -c sg.config
但是会出现:
22-04-28 18:21:39 [INFO] Consistent with subgenome assignment: 16 (6.93%); potential exchange: 149 (64.50%)
22-04-28 18:21:39 [INFO] Output: /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.bin.enrich
22-04-28 18:21:39 [INFO] ###Step: LTR
22-04-28 18:21:39 [INFO] Identifying LTR-RTs by ['ltr_harvest']
22-04-28 18:21:45 [INFO] Check point file: /home/hsq/hhh/subphaser/kmls/tmp/LTR.scn.ok
exists; skip this step
22-04-28 18:21:45 [INFO] 8186 LTRs identified
22-04-28 18:21:45 [INFO] Check point file: /home/hsq/hhh/subphaser/kmls/tmp/LTR.tesort.ok
exists; skip this step
22-04-28 18:21:45 [INFO] By TEsorter, 190 (2.3%) are classified as LTRs, of which 0 (0.0%) are intact with complete protein domains
22-04-28 18:21:45 [INFO] After filtering, 186 / 8186 (2.3%) LTRs retained
22-04-28 18:21:45 [INFO] Outputing coordinate
- LTR
maps to /home/hsq/hhh/subphaser/kmls/phase-results/k15_q200_f2.ltr.bin.count
22-04-28 18:21:45 [INFO] Start Pool with 32 process(es)
22-04-28 18:21:45 [INFO] Processed 0 sequences
Traceback (most recent call last):
File "/home/hsq/miniconda3/envs/SubPhaser/bin/subphaser", line 33, in
sys.exit(load_entry_point('subphaser==1.2.5', 'console_scripts', 'subphaser')())
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 779, in main
pipeline.run()
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 516, in run
ltr_bedlines, enrich_ltr_bedlines = self.step_ltr(d_kmers) if not self.disable_ltr else ([],[])
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/main.py", line 563, in step_l tr
Seqs.map_kmer3([ltrfile], d_kmers, fout=fout, k=self.k, ncpu=self.ncpu,
File "/home/hsq/miniconda3/envs/SubPhaser/lib/python3.8/site-packages/subphaser-1.2.5-py3.8.egg/subphaser/Seqs.py", line 97, in map_kmer3
logger.info('{} ({:.2%}) sequences contain subgenome-specific kmers'.format(mapped_seqs, mapped_seqs/i))
ZeroDivisionError: division by zero
k15_q200_f2.kmer_pca.pdf是成功生成的。生成的pca图与老师您附表中的图不一样。
请问一下老师我是不是需要调整参数或者其他设置有什么问题。
期待老师您的回复
Do not use hardmasked genome sequences, because subphaser is based on repeated sequences (TEs in most cases) in fact.
老师,请问您对开麦罗莎草莓做分析时(Fig. S27 Subgenome phasing with SubPhaser in the Fragaria x ananassa genome.),使用的文件是哪一个啊。
F_ana_Camarosa_6-28-17.fasta.gz. But you may not get the identical results but similar results.
老师,我刚刚又去GDR网站看了一下,我只看到了 F_ana_Camarosa_6-28-17_hardmasked.fasta.gz跟[F_ana_Camarosa_6-28-17_unmasked.fasta.gz两个文件,请问一下您告诉我的F_ana_Camarosa_6-28-17.fasta.gz文件这个在哪个链接下载啊,麻烦老师您了。
F_ana_Camarosa_6-28-17_unmasked.fasta.gz
老师,已成功运行,4套亚基因组成功分型。图非常好看,感谢老师您的耐心指导与为科研工作者所开发的SubPhaser软件。
Teacher, I have successfully found homologous chromosomes using MUMmer software, and then performed genotyping using SubPhaser. The results are very good.
Thank you!
Great. Thanks for your feedback.
Hi @zhangrengang, Thanks for this excellent software. I have been using this for my newly assembled allopolyploidy genome (2n = 4x = 40). The genome has two subgenomes, but I don't know the homologous pairs. So as you suggested in previous comments, I used mummer and minimapper2 to find the homologous pairs and assigned them in the configurations file.
But after running the SubPhaser with default parameters I got 19 chr in SG1 and only one chr in SG2. Idially, it should be 10 chr in each subgenome.
The differential k-mer heatmap also shows the similar result:
Can you suggest why it happend? Am I approaching rightly?
@kashiff007 It seems that your genome is contig-level, but not chromosome-level?
Yes, It is a newly assembled genome and from cytometry, we speculate that the largest 20 contigs are full genome length.
@kashiff007 Can you show the homoelogous relationship such as dot plot, and your config file? How much percent do the largest 20 contigs account for the whole genome?
here y-axis is my new genome and when we align with another closely related species Oropetium (x-asix), id gives the following plot. You can see each of the x-axis has two aligned positions in y-axis, which tells that these are homologous.
Based on this I made a config file which looks like:
ptg000016l ptg000003l
ptg000013l ptg000004l
ptg000022l ptg000015l
ptg000011l ptg000007l
ptg000012l ptg000006l
ptg000019l ptg000008l
ptg000017l ptg000018l
ptg000002l ptg000010l
ptg000001l ptg000005l
ptg000020l ptg000014l
Each row has two contigs (assuming homologous chromosomes) separated by space.
Total genoem size is 587 mb (587146212) and top 20 contigs are abount 573mb (573812465). or 97.72% of the total genome.
@kashiff007 It seems no problem. Do you have evidence for allo
?
is it something to do with clustering?
@kashiff007 If there are no much assembly errors (swtich errors, etc,), your genome is likely to be an auto-polyploid or be heavily recombined, which can not by phased by subphaser
. You can try to change the parameters, such as setting -q 100
or reducing the number of chromosome sets like:
ptg000016l ptg000003l
ptg000013l ptg000004l
ptg000022l ptg000015l
ptg000011l ptg000007l
ptg000012l
ptg000006l
ptg000019l
...
ptg000020l
ptg000014l
But these may also not work.
OK, I appreciate the explanation.
No. It may be very easy to know the homology by aligning your chromosomes to the genome of Fragaria vesca using tools such as minimap2 etc. You may be able to get a figure like below:
So the sg.config configuration can be set like:
1-1 1-2 1-3 1-4 2-1 2-2 2-3 2-4 ...
Hi, @zhangrengang
I'm soryy to disturb you. The plot is very fancy. Is it plotted by "dosPlotly" ?
Thanks!
@kedduck No, it is from the paper: Edger P P, Poorten T J, VanBuren R et. al. Origin and evolution of the octoploid strawberry genome [J]. Nature Genet., 2019, 51 (3): 541–547 [http://doi.org/10.1038/s41588-019-0356-4]
Thanks for your quick reply!