marbl/harvest-tools

How to use a repeat-masking file (.bed) with the reference genome

Opened this issue · 4 comments

Hi,
I am working with 70 assembled genomes to identify core SNPs, and I would like to mask repeats using a .bed file (generated with MUMmer). Please let me know how to use this file in the analysis with the following command.

parsnp -g /ref/ref_genomic -d scaffolds/*.fasta -c

Thank you!

gongyh commented

I have a similar issue. How can soft- or hard-masked genomes be used?

Hi @sekhwal and @gongyh,

There is currently no way to mask files through Parsnp. However, you can provide Parsnp with soft-masked or hard-masked genomes.
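For anyone who needs to produce the masked genomes from a .bed file first, here is a minimal Python sketch of one way to do it outside Parsnp. It is not part of Parsnp or harvesttools; the function names (`parse_bed`, `mask_sequence`, `mask_fasta`) are made up for illustration. It assumes standard BED coordinates (0-based start, end-exclusive) and that the first word of each FASTA header matches the name in the first BED column.

```python
from collections import defaultdict


def parse_bed(lines):
    """Collect BED intervals per sequence name (0-based start, end-exclusive)."""
    intervals = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and optional BED header lines.
        if not line or line.startswith(("#", "track ", "browser ")):
            continue
        name, start, end = line.split()[:3]
        intervals[name].append((int(start), int(end)))
    return intervals


def mask_sequence(seq, intervals, soft=False):
    """Hard-mask with N, or soft-mask by lowercasing, the given intervals."""
    bases = list(seq)
    for start, end in intervals:
        # Clamp to the sequence so out-of-range intervals are ignored.
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower() if soft else "N"
    return "".join(bases)


def mask_fasta(fasta_lines, bed_lines, soft=False):
    """Return masked FASTA text with one sequence line per record."""
    intervals = parse_bed(bed_lines)
    records = []  # list of [header, sequence] pairs, in input order
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            records.append([line, ""])
        elif line and records:
            records[-1][1] += line
    out = []
    for header, seq in records:
        name = header[1:].split()[0] if len(header) > 1 else ""
        out.append(header)
        out.append(mask_sequence(seq, intervals.get(name, []), soft=soft))
    return "\n".join(out) + "\n"
```

With real files, something like `mask_fasta(open("ref.fasta"), open("repeats.bed"))` would return the hard-masked text (both file names are placeholders); pass `soft=True` for lowercase soft-masking instead.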

Thanks for your reply. When I tested, soft-masked genomes gave the same results as unmasked ones. However, hard-masked bases are reported as SNPs unless all the genomes are strictly and correctly hard-masked.

Ahh I see, sorry for misunderstanding the issue.

In terms of the core-genome alignment:
Soft-masking the genomes won't impact the resulting core-genome alignment, but hard-masking might. If a hard-masked region lies between two alignment anchors, MUSCLE will likely align through the region. However, hard-masked regions cannot be selected as anchors.

In terms of the variants and resulting tree:
This is a good point, and there should be an option to only use SNPs if the reference allele is not hard-masked. I'll transfer this ticket to harvesttools, the program responsible for identifying the variants from the XMFA.
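To make the proposed behavior concrete, here is a rough Python sketch of the filtering idea. It is only an illustration under simplifying assumptions, not how harvesttools is implemented: `variable_columns` is a hypothetical helper, the input is a toy list of equal-length aligned sequences with the reference first, and gaps are not handled.

```python
def variable_columns(aligned, skip_masked_ref=True):
    """Return 0-based alignment columns where the sequences differ.

    aligned: equal-length aligned sequences, reference first.
    skip_masked_ref: if True, ignore columns whose reference base is N.
    """
    ref = aligned[0]
    snps = []
    for col, ref_base in enumerate(ref):
        if skip_masked_ref and ref_base.upper() == "N":
            continue  # reference allele is hard-masked, so do not call a SNP here
        # Compare case-insensitively so soft-masking never creates a variant.
        alleles = {seq[col].upper() for seq in aligned}
        if len(alleles) > 1:
            snps.append(col)
    return snps


# Toy alignment: column 2 is hard-masked in the reference only.
alignment = ["ACNTA", "ACGTA", "ATGTC"]
print(variable_columns(alignment))                         # [1, 4]
print(variable_columns(alignment, skip_masked_ref=False))  # [1, 2, 4]
```

Column 2 shows the problem reported above: an N in the reference against G in the other genomes looks like a variant unless masked reference positions are skipped.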