For a given set of D datapoints in M dimensions, MDS yields a set of D embedded points in N dimensions
such that the distances between embedded points approximately preserve the distances between the original datapoints.
Here, we use Hamming distance as the datapoint distance and Euclidean distance as the embedding distance.
Therefore, our 2-D embedding gives Euclidean distances between points that approximate the Hamming distances between the corresponding sequences.
- Create a multiple sequence alignment
- Calculate all-vs-all Hamming distances between the aligned sequences
- Apply MDS to find a set of 2-D points whose Euclidean distances approximate the Hamming distances
- Visualize the 2-D embedding
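
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code from this repo: `hamming_matrix` and the toy `aligned` sequences are made up for the example, gaps are counted as ordinary mismatching characters (an assumption), and the `MDS` call assumes a scikit-learn release that accepts `dissimilarity="precomputed"`, so check the documentation for your installed version.

```python
import numpy as np
from sklearn.manifold import MDS


def hamming_matrix(aligned):
    """All-vs-all Hamming distances (mismatch counts) between aligned sequences."""
    lengths = {len(s) for s in aligned}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # Character array of shape (n_sequences, alignment_length).
    chars = np.array([list(s) for s in aligned])
    # Compare every pair of rows position by position and count mismatches.
    return (chars[:, None, :] != chars[None, :, :]).sum(axis=2)


# Toy alignment (hypothetical); '-' is a gap and is treated like any other character.
aligned = ["ACGT-A", "ACGTTA", "TCGT-A", "TGCATG"]

distances = hamming_matrix(aligned).astype(float)

# dissimilarity="precomputed" tells MDS the input is already a distance matrix.
# random_state is fixed so that repeated runs give the same layout.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
embedding = mds.fit_transform(distances)  # shape (n_sequences, 2)
```

Each row of `embedding` is the 2-D position of one sequence, so the two columns can be passed directly to a scatter plot for the visualization step.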
The method is very simple and invokes existing modules, so there is nothing to install for this repo.
Instead, look at sequenceMDS example 1.ipynb to see what commands to run.
For creating multiple sequence alignments, we recommend installing MAFFT.
For computing all-vs-all Hamming distances, we include several functions in hamming.py.
For MDS calculation, install sklearn.
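
For the alignment step, MAFFT can be driven from Python once it is installed and on your `PATH`. The sketch below is only one way to do it and is not taken from the notebook: `mafft_command`, `run_mafft` and `read_fasta` are hypothetical helper names, and `--auto` (which lets MAFFT choose its own strategy) is an assumed default that you may want to change.

```python
import subprocess


def mafft_command(input_fasta):
    """Build the argument list for a MAFFT run with automatic strategy selection."""
    return ["mafft", "--auto", input_fasta]


def run_mafft(input_fasta, output_fasta):
    """Align input_fasta with MAFFT and write the aligned FASTA to output_fasta."""
    # MAFFT writes the alignment to standard output, so redirect it to a file.
    with open(output_fasta, "w") as handle:
        subprocess.run(mafft_command(input_fasta), stdout=handle, check=True)


def read_fasta(path):
    """Return (names, sequences) from a FASTA file, joining wrapped sequence lines."""
    names, parts = [], []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A header line starts a new record.
                names.append(line[1:])
                parts.append([])
            else:
                # Sequence lines may be wrapped; collect them per record.
                parts[-1].append(line)
    return names, ["".join(p) for p in parts]
```

After `run_mafft("seqs.fasta", "aligned.fasta")` (both file names are placeholders), `read_fasta("aligned.fasta")` returns equal-length aligned sequences that are ready for the Hamming distance step.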
- MDS does not find a perfect match of 2-D distance to provided distance.
  However, for a well-specified Hamming matrix, it is typically accurate enough to serve for qualitative interpretation and for rule-of-thumb comparisons.
- The MDS algorithm will return slightly different solutions unless a random seed is specified.
  The biggest difference is typically the global rotation.
- Computing Hamming distance from a multiple sequence alignment, rather than all-vs-all pairwise sequence alignments, is an approximation to save compute time.
  With a sufficiently accurate multiple sequence alignment this is typically sufficient for visualization.
  For sequences without a coherent common alignment, such as sequences from separate protein families, you may want to compute all-vs-all alignments, e.g. using Smith-Waterman.
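
To make the first two notes concrete, the sketch below measures how far an embedding's Euclidean distances are from the target distances, and shows that rotating the embedding leaves that error unchanged. It is an illustration only: the helper names, the `coords` array and the `target` matrix are hypothetical, and mean absolute error is just one convenient summary, not necessarily the quantity MDS itself minimizes.

```python
import numpy as np


def pairwise_euclidean(coords):
    """Euclidean distance between every pair of rows in coords."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def mean_abs_error(coords, target):
    """Mean absolute gap between embedded and target distances over distinct pairs."""
    upper = np.triu_indices(len(coords), k=1)
    return np.abs(pairwise_euclidean(coords)[upper] - target[upper]).mean()


def rotate(coords, angle):
    """Rotate 2-D points about the origin by angle (radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return coords @ np.array([[c, -s], [s, c]]).T


# Hypothetical 2-D embedding of four sequences and the distance matrix it should match.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
target = np.array(
    [[0, 1, 1, 6], [1, 0, 2, 5], [1, 2, 0, 5], [6, 5, 5, 0]], dtype=float
)

error = mean_abs_error(coords, target)
# A rotated copy of the same layout matches the target distances equally well.
rotated_error = mean_abs_error(rotate(coords, 0.7), target)
```

Because a rotation does not change any pairwise distance, two runs that differ only by rotation are equally good embeddings; fixing the random seed simply makes the plotted orientation reproducible.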