Yugo Murawaki. 2024. Principal Component Analysis as a Sanity Check for Bayesian Phylolinguistic Reconstruction. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). (arXiv).
This is a refined version of old Python2 code available at https://github.com/murawaki/lexwave
- Obtain a BEAST XML configuration file
- Often published as part of supplementary materials of a paper
- Edit the XML file to sample ancestral node states
A line like
<log id="TreeWithMetaDataLogger.t:synthdata2" spec="beast.base.evolution.TreeWithMetaDataLogger" tree="@Tree.t:synthdata2"/>
should be rewritten as:
<log id="TreeWithTraitLogger.t:synthdata2" spec="beastclassic.evolution.likelihood.AncestralSequenceLogger" tree="@Tree.t:synthdata2" tag="ancestral-syndata2" data="@synthdata2" siteModel="@SiteModel.s:synthdata2"/
-
Run BEAST and obtain a
trees
file (e.g.,synthdata-synthdata2.trees
) -
Run FigTree to convert the
trees
file into a NEX file (we can skip this step, but we typically want to manually review the outcome before moving forward...)- Open the
trees
file - Click
Export Trees
- Check
Save all trees
- Check
Include Annotations (NEXUS & JSON only)
- Save as a NEX file (e.g.,
synthdata-synthdata2.nex
)
- Open the
-
Convert the NEX file into a Python pickle
python scripts/parse_tree.py synthdata-synthdata2.nex synthdata-synthdata2.pkl
-
Draw a PCA tree
python scripts/pca_tree.py synthdata-synthdata2.pkl ancestral-syndata2 synthdata-synthdata2.last.png
- The second argument specifies the tag given to
AncestralSequenceLogger
- The optional argument
--index=n
specifies then
-th tree in the pickle file - The optional argument
--dtype
must be specified if the substitution model isBinaryCovarion
orPDCovarion
because latent values must be mapped to binary values - You might want to edit the script as some plotting options are hard-coded
- The second argument specifies the tag given to
scripts/pca_kde.py
is similar to scripts/pca_tree.py
but add kernel density estimation of a given clade
- The clade is specified by the leaf nodes sorted alphabetically and joined with colons (e.g., GaroGaro:JingphoJingpho:Rabha)
Several recent studies (e.g., linguistic analysis of Robbeets et al. (2021)) define multiple site models to account for varying rates associated with basic vocabulary items. In such instances, a straightforward solution is to define a logger for each site model. Consequently, multiple copies of the same tree are generated, each providing distinct information about the node states. A postprocessing step is necessary to merge them into a single tree with complete node states.
tea254pdcov-ucln-fbd-constrained.xml
of Robbeets et al. (2021), for example, defines a different distribution for each basic vocabulary item. Consequently, we need to define a logger to each vocabulary item (the file saved astea254pdcov-ucln-fbd-constrained-modified.xml
):<logger id="treelog.t:fire" spec="Logger" fileName="$(filebase).fire.trees" logEvery="50000" mode="tree"> <log id="TreeWithMetaDataLogger.t:tree:fire" spec="beastclassic.evolution.likelihood.AncestralSequenceLogger" branchRateModel="@RelaxedClock.c:clock" tree="@Tree.t:tree" tag="ancestraldata" data="@orgdata.fire" siteModel="@SiteModel.s:fire" /> </logger> <logger id="treelog.t:nose" spec="Logger" fileName="$(filebase).nose.trees" logEvery="50000" mode="tree"> <log id="TreeWithMetaDataLogger.t:tree:nose" spec="beastclassic.evolution.likelihood.AncestralSequenceLogger" branchRateModel="@RelaxedClock.c:clock" tree="@Tree.t:tree" tag="ancestraldata" data="@orgdata.nose" siteModel="@SiteModel.s:nose" /> </logger> ...
- Save the list to a file (e.g.,
items
) - Run BEST and obtain numerous
trees
files - Combine these
tree
files into one pickle filepython scripts/combine.py tea254pdcov-ucln-fbd-constrained-modified.{}.trees items ancestraldata tea254pdcov-ucln-fbd-constrained-modified.last.pkl
- The first argument specifies the file path template instantiated by the second argument