bacpop/PopPUNK

Possible issue with taxa labelling in --lineage-clustering

matbeale opened this issue · 17 comments

Hi John,

As discussed, I've been running the --lineage-clustering option.

poppunk --lineage-clustering --ref-db strain_db --external-clustering Global_TPA_SS14-mapped_2020-06-09.noINDELS.fix-masked.gubbins.SNPs.pyjar.renamed.tre_rpinecone-lineages.csv --output db_lineage-clustering-rv --ranks 1,2,3,4 --distances strain_db/strain_db.dists

It seems to have defined clusters, but when applied to a tree (in ggtree) the lineages bear no relation to the phylogenetic structure. This could reflect generating a poppunk database from assembly core genome and then comparing to a very constrained reference alignment derived tree, but I would expect to see at least some definition. You suggested this may relate to taxa labelling in the output.

image

I also note that the --external-clustering option I used here didn't result in an output file.

Many thanks, Mat

Hi Mat,

There looks to be a something slightly odd going on - I will try to take a look at this in the next couple of days.

Was the external clustering passed to into the lineage assignment command? If you can let me know the command line syntax, that'd be great.

Thanks,

Nick.

Hi Nick,

The command I've used to run the lineage clustering is
poppunk --lineage-clustering --ref-db strain_db --output db_lineage-clustering-rv2 --ranks 1,2,3,4 --distances strain_db/strain_db.dists

  • this lead to semi-random patterns you see alongside the tree above. Rerunning poppunk leads to the same clusters (at least at level 4).

At Rank 4 poppunk is grouping around 400 genomes into one cluster, and 90 into a second (with a few others in a third). Those proportions are similar to the divisions in the tree, so my suspicion is that this is a relabelling problem, rather than with the clustering itself.

However, it's worth considering if poppunk is identifying kmers that are not relevant. This particular dataset uses assemblies from direct sequencing (SureSelect-enriched metagenomic data), but the assemblies included all have good scores in CheckM (>95% completeness, <5% contamination). There should be virtually no accessory genome for this species (it's all core and extremely clonal). The tree is based on reference mapping + gubbins, with joint ancestral reconstruction performed after ML. At the least, I would have expected poppunk to be able to reproduce the two major lineages we see in all phylogenetic analyses, but my ultimate goal is to define a persistent and reproducible sublineage nomenclature (I currently use rpinecone, but the clusters there are dependent on the dataset).

I pasted the command I used for adding the --external-clustering in my original post, and it's the same as in the above command, but with the addition of --external-clustering Global_TPA_SS14-mapped_2020-06-09.noINDELS.fix-masked.gubbins.SNPs.pyjar.renamed.tre_rpinecone-lineages.csv. Note that when I do this in combination with --fit-model, it outputs the _external_clusters.csv file, but it seems to be absent when I run poppunk with --lineage-clustering.

Mat

Thanks Mat - I've been running some tests and it seems to be generating some sensible outputs at the moment. There were issues with this previously where we had a mismatch between the output order of sketchlib, and the input order expected by PopPUNK. It is possible these packages could be out of sync. The versions I've been working with are pp_sketch 1.4.0 and the latest version of the PopPUNK master branch from the git repo (6a45c49). Could you let me know which versions you are currently using?

I will have a look at the external clustering "feature".

The lack of external clustering is because the lineages are not yet included as a proper model (enhancement #79). In the mean time, it should be possible for the external clusters to be visualised alongside the lineage assignments using the --generate-viz mode, with the --viz-lineages flag used to specify the _lineages.csv output, and the external clusters supplied through the --info-csv argument. In the mid-term, I think it will be best to allow both --info-csv and --external-clusters to be supplied to the visualisation generation, particularly once a stand alone script (#77).

Ok, so I should be on the latest version of poppunk (pulled from the gitrepo on July 22nd immediately after John altered the setup.py script - commit# 6a45c49)

I think I'm runing the latest pp-sketchlib on bioconda (1.4.0, 02/07/2020).

Regarding the external clustering file, it's not a huge issue for me, since I usually merge and analyse everything in R/ggtree, but more of an observation - the separate methods you suggest fine to me. Thanks.

If you have some examples on the Sanger filesystem, I can take a look an example set directly, if that were helpful?

Sure.

The poppunk lineage db is at /lustre/scratch118/infgen/team216/mb29/Treponema/Global_TPE+TPA_uber-temporal-analysis_2020/phylo/gubbins_SS14-map_TPA-only_2020-06-09/poppunk/popunk_assemblies/db_lineage-clustering-rv2

The original kmer database is at /lustre/scratch118/infgen/team216/mb29/Treponema/Global_TPE+TPA_uber-temporal-analysis_2020/phylo/gubbins_SS14-map_TPA-only_2020-06-09/poppunk/popunk_assemblies/strain_db

Finally, the ML tree (ref based) I'm comparing things against is /lustre/scratch118/infgen/team216/mb29/Treponema/Global_TPE+TPA_uber-temporal-analysis_2020/phylo/gubbins_SS14-map_TPA-only_2020-06-09/iqtree-on-gubbins/Global_TPA_SS14-mapped_2020-06-09.noINDELS.fix-masked.gubbins.SNPs.aln.renamed.treefile). There a few taxa that are absent between the assembly dataset and ML tree (about 5 each way), but the labelling should be the same (have been trying to do a tanglegram between the two trees, but dendextend is giving me trouble right now). I can also see from a manual look at the poppunk clustering assignments that although poppunk gives 2 main clusters, the taxa names associated with those clades don't make much sense.

Thanks - so I set up a fresh conda environment with the latest PopPUNK git clone and ran the clustering:

conda create -n poppunk python=3.8
conda install poppunk graph-tool matplotlib
conda remove --force poppunk
git clone https://github.com/johnlees/PopPUNK
./PopPUNK/poppunk-runner.py  --lineage-clustering --ref-db db_lineage-clustering-rv --distances db_lineage-clustering-rv/db_lineage-clustering-rv.dists --output lineage_clustering_njc --full-db --microreact

These results look quite sensible against the PopPUNK tree, which resembles your phylogeny in its structure:
image

However, comparing a few taxa between the two trees suggests there might have been some scrambling of the names - this probably looks like a sketchlib/PopPUNK interface issue. @johnlees, do you have any suggestions for looking at these?

Yeah, that looks like the tree and clusters I got from poppunk too - but as you say, the names seem scrambled.

I've found this a bit of a frustrating issue, as it's been reported a few times, but I've never been able to replicate it myself. Perhaps @matbeale if you could give the location of your conda env I could try and step through the code myself to try and work it out?
@nickjcroucher do the results above look ok to you? For the rank 1 clusters it looks like some of them are polyphyletic (unless the colours are being reused)

@johnlees The rank 1 clusters are OK I think - there's some polyphyly, but that's either monophyletic lineages arising within an ancestral lineage, or areas where the tree is not well resolved. As all the distances are quite short it might be possible to get better resolution with a higher sketch size.

Just to check on where the problem might arise, I looked at the pairwise distances for two pairs of isolates: PHE140076A and TPA_BCC063 (close in PopPUNK - albeit on long-ish terminal branches - but not in ML tree), and PHE140076A and PHE170397A (close in ML tree, but not in PopPUNK). In the data shared by @matbeale, the distances are:

Query   Reference       Core    Accessory
PHE140076A      TPA_BCC063      0.00045607385   0.010949718
PHE140076A      PHE170397A      0.0006013586    0.003733085

So it does appear that the lineages are reflecting the pairwise distances, but the distances/labels are behaving oddly.

@johnlees my conda environment is at /lustre/scratch118/infgen/team216/mb29/tools/conda-dirs/poppunk-custom/, and the poppunk install from github (6a45c49) is at /nfs/users/nfs_m/mb29/programs/PopPUNK

@nickjcroucher yes the population structure is extremely clonal. We see slightly more definition in the reference mapped tree, and in the long term this may mean that assembly-derived rank1 clusters are unsuitable (although that would be my preference) - alternate sketch sizes would be good. FYI, using a 10-snp threshold on the ref alignment tree in rpinecone, I get around 30 clusters (plus around 10 singletons) for the ML tree - I don't expect poppunk to precisely replicate those, but a similar level of resolution would be ideal for the downstream work I'm doing (and would enable a reproducible taxonomy for the field - given low substitution rates, 10 SNPs reflects around 50 years).

@matbeale It's a bit off topic but I think trying to both increase the sketch size and the k-mer size range used might help with the tree definition - I think we went up to 100,000 for TB to get effectively SNP-level resolution.

Can confirm this is a bug in sketchlib (sorry for the long delay!). I will have a fix soon

This will be fixed in sketchlib v1.5.1 (tonight) and I will update PopPUNK to a new version to support this tomorrow, at which point I will let you know and close this issue.

This should now be fixed if you use PopPUNK v2.2.0 and pp-sketchlib v1.5.1. The latter is available via conda already, the former should appear on conda shortly (or can be taken from master if needed sooner)