zjshi/Maast

SNP matrix for all genome

Opened this issue · 13 comments

Hey,

We only get a vcf file for tag genomes and SNP files for each genome in gt_results. Do you have any suggestions for making a SNP matrix of all genome?
Thanks!

Seconding this request. I plan to use it to run RAxML. I have attempted to use the resulting concat_allele.aln.fasta of maast tree command, but somehow RAxML only recognizes half of total SNPs listed in vcf file. Or maybe you have a way to tweak something when running RAxML using the concat_allele.aln.fasta?

Thank you very much!

Hi thanks for using Maast! I think concat_allele.aln.fasta should be good for RAxML or iqtree with little or minor changes. Can you possibly share your file? I am happy to take a look to see whether there would be a simple workaround.

Hi!

the command I used for RAxML is: raxmlHPC-PTHREADS -s ${DIRR}concat_allele.aln.fasta -f a -m GTRGAMMA -p 12345 -x 12345 -N 1000 -n Xoo -T 50

I have attached the logfile (job is still running as of now)
JobName.3693.txt
attached the logfile (job is still running as of now)

I have about ~51,100 SNPs based on core_snps.vcf but RAxML only recognizes half of them

Ok I see. Could you please verify the length of each concatenated allele sequences in concat_allele.aln.fasta? SNPs in core_snps.vcf are not necessarily all ended up in the MSA files due to several factors: bi-allelic nor not, covered by a good k-mer or not, prevalence of the site in the population, etc. I

based on seqkit stats, all entries in the concat_allele.aln.fasta have 48,920 bases.

What about the invariant sites (i.e. sites has the same allele across all genomes) in the MSA? Would it be possibly due to automatic removal of these sites by RAxML?

I assume invariant site was already removed through maast tree command? I made all arguments in default.
I will also try to run the aligned fasta in IQ-Tree, see if they would differ in terms of the number of distinct patterns identified.

Yes you are right. Maast will remove sites below min MAF(Minor Allele Frequency) and min MAC(Minor Allele Count). Please let me know how it goes with IQ-tree. At the same time I will look into it on my end. Thanks.

Based on this discussion and the docs, it appears Maast output is not suitable for IQTree, while snippy-core is. The difference is Maast outputs core SNPs, while snippy-core outputs core sites. Core sites include invariant sites, necessary for IQTree to determine branch lengths, while core SNPs do not. You can get around this requirement using ascertainment bias correction (+ASC) with IQTree, but ASC is still viewed with skepticism by many in the research community. Please see the relevant entries here, especially "What are the differences between alignment columns/sites and patterns?" and "Why does IQ-TREE complain about the use of +ASC model?"

http://www.iqtree.org/doc/Frequently-Asked-Questions

Perhaps the best way around this for Maast would be to output invariant core site counts like "snp-sites -C", which could then be used with "iqtree -fconst".

Hi @marade, I appreciate your comment! You made a great point about ASC should be applied to concatenated core SNP alleles for accurate branch length estimation. I would like to clarify that Maast ignores invariant sites by design to facilitate computation especially for tracking a large number of microbial strains. To me, core snp sites + core invariant sites is essentially the entire core genome for a set of input strains. If you are interested in inferring the phylogeny using all site from core genome, there seems to be no need at all for SNP calling. I would run the following maast genomes --fna-dir /path/to/genomes/ --rep-fna /path/to/rep_genome.fna --out-dir /path/Maast/output/ --skip-centroid --keep-redundancy, and ignore all output files except for tag_msa.fna, aka., the multiple sequence alignment of genomes using all core sites. It is also possible to tune core site definition with the option --min-prev. Hope this helps.

Thanks for your perspective @zjshi. I'll give the method you suggest a try.

Oh, forgot to mention one additional use for core sites, which is to support 'soft core' SNPs. This usage is well illustrated by the recently released Core-SNP-filter:

https://github.com/rrwick/Core-SNP-filter

The paper is quite interesting too. Is there a way to achieve something like that with Maast?