PROBIC/mSWEEP

question about parameters for poppunk

Closed this issue · 2 comments

Hello,

Many thanks for your tool. I was wondering if you have any recommendations on the PopPUNK clustering used as input for mSWEEP. I have a modest number of S. epi genomes. I am using DBSCAN for the model fit, followed by the model refinement option (since S. epi is recombinogenic). Are there any PopPUNK parameters that affect mSWEEP downstream, and would you have any recommendations for clustering S. epi?

Many thanks

Also, do you have the "strain" grouping associated with the Staph epi reference database provided on figshare? I see only three lineages, but wasn't sure if you have clustering at the strain level as well. Can mSWEEP only be used at the lineage level?

Hi!

Unfortunately, with PopPUNK the 'best' parameter choices depend a bit on the species. I would recommend running the clustering with several options, both with and without the refinement step (although for S. epi using the refinement step makes sense to me). Then select the clustering that gets good scores from PopPUNK and that also looks good when you check the results with PopPUNK's visualization options (https://poppunk.readthedocs.io/en/latest/visualisation.html, I like using the microreact option).

That said, we do have some plans to improve the documentation for using PopPUNK with mSWEEP but it will take some time to write out properly because I'm super busy right now :/

As for your second question, I'm not 100% sure what you mean by the "strain" grouping, sorry, but if you mean the 11-group clustering, it should be available in FigShare. If it is not there, I really need to add it. In any case, that clustering didn't perform very well in the paper, so I don't recommend using it. If you want the individual strains, you can extract their names from the fasta file directly with command line tools. For example, using grep the command would be something like grep "^>" fasta-file-name.fasta.
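If you would rather collect the names in a script than with grep, the same idea can be sketched in Python. This is only a minimal illustration: the function name fasta_headers is made up for this example, and it assumes a plain, uncompressed fasta file where every record starts with a ">" line.

```python
def fasta_headers(path):
    """Return the header of every fasta record, without the leading '>'.

    Hypothetical helper for illustration; assumes an uncompressed
    text file where each record begins with a line starting with '>'.
    """
    headers = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Drop the '>' marker and the trailing newline.
                headers.append(line[1:].strip())
    return headers
```

Calling fasta_headers("fasta-file-name.fasta") should give the same names that the grep command prints, just without the ">" prefix.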

If you have your own S. epi genomes that are similar to the ones in your samples, then it would be much better to use those as the reference, or add them to the existing reference data, and create a new clustering with PopPUNK or hierBAPS.
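Once you have a new clustering, it has to be turned into a grouping that lines up with your reference sequences. The sketch below shows one way that could look, under several assumptions that you should check against your own files: that the clustering is a CSV with one row per genome, that the column names are "Taxon" and "Cluster", and that the grouping file should hold one group label per line in the same order as the reference sequences. The function name write_group_indicators is hypothetical.

```python
import csv


def write_group_indicators(clusters_csv, ordered_names, out_path,
                           name_col="Taxon", cluster_col="Cluster"):
    """Write one cluster label per line, following ordered_names.

    Hypothetical helper; name_col and cluster_col are assumptions
    about the clustering CSV and may need to be changed.
    """
    # Map each genome name to its cluster label.
    with open(clusters_csv, newline="") as handle:
        cluster_of = {row[name_col]: row[cluster_col]
                      for row in csv.DictReader(handle)}

    # Fail loudly rather than silently writing a misaligned grouping.
    missing = [name for name in ordered_names if name not in cluster_of]
    if missing:
        raise KeyError(f"no cluster assigned to: {missing}")

    with open(out_path, "w") as out:
        for name in ordered_names:
            out.write(cluster_of[name] + "\n")
```

Here ordered_names would be the reference names in the order the sequences appear in your reference fasta. The explicit check for missing names is deliberate, since a grouping file that is one line short or out of order would assign sequences to the wrong groups.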