
Small Python library for creating and visualizing dot plot matrices

Primary language: Python. License: BSD 3-Clause "New" or "Revised" License (BSD-3-Clause).

wotplot


wotplot is a small Python library for creating and visualizing dot plot matrices.

Notably, wotplot creates the exact dot plot matrix, describing (given some k ≥ 1) every single k-mer match between two sequences. Many tools for visualizing dot plots create only an approximation of this matrix (containing only the "best" matches) in order to save time; wotplot uses a few optimizations to make the creation and visualization of the exact dot plot matrix feasible even for entire prokaryotic genomes. Having this exact matrix can be useful for a variety of downstream analyses.
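To make "every single k-mer match" concrete, here is a minimal pure-Python sketch of what an exact dot plot records. This is not how wotplot computes its matrix; the function names (reverse_complement, naive_kmer_matches) and the (i, j) coordinate convention are made up for illustration, and wotplot's own matrix layout may differ.

```python
# Illustrative only: a naive way to list every k-mer match between two DNA
# sequences. The names here are hypothetical and are not part of wotplot.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string made of A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def naive_kmer_matches(s1, s2, k):
    """Return {(i, j): kind} for every k-mer match between s1 and s2.

    i and j are 0-indexed start positions in s1 and s2. kind is "forward",
    "reverse-complementary", or "palindromic" (both at once).
    """
    # Index the start positions of every k-mer in s2
    positions = {}
    for j in range(len(s2) - k + 1):
        positions.setdefault(s2[j:j + k], []).append(j)

    matches = {}
    for i in range(len(s1) - k + 1):
        kmer = s1[i:i + k]
        # Forward matches: the two k-mers are identical
        for j in positions.get(kmer, []):
            matches[(i, j)] = "forward"
        # Reverse-complementary matches: the k-mer in s2 equals the reverse
        # complement of the k-mer in s1
        for j in positions.get(reverse_complement(kmer), []):
            if (i, j) in matches:
                matches[(i, j)] = "palindromic"
            else:
                matches[(i, j)] = "reverse-complementary"
    return matches


# "CGT" is the reverse complement of "ACG"
print(naive_kmer_matches("ACG", "CGT", 3))  # {(0, 0): 'reverse-complementary'}
```

An approximate tool would keep only some of these entries; the exact matrix keeps all of them.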

Quick examples

Small dataset

This example is adapted from Figure 6.20 (bottom right) in Bioinformatics Algorithms, edition 2.

import wotplot

# Define our dataset
s1 = "AGCAGGAGATAAACCTGT"
s2 = "AGCAGGTTATCTACCTGT"
k = 3

# Create the matrix (setting binary=False means we'll distinguish forward,
# reverse-complementary, and palindromic matching k-mers from each other)
m = wotplot.DotPlotMatrix(s1, s2, k, binary=False)

# Convert the matrix to dense format and visualize it using matplotlib's
# imshow() function (for large matrices where dense representations are
# impractical, use viz_spy() instead; see below)
wotplot.viz_imshow(m)

Output dotplot from the above example

In the default colorscheme, red cells (🟥) indicate forward matches, blue cells (🟦) indicate reverse-complementary matches, purple cells (🟪) indicate palindromic matches, and white cells (⬜) indicate no matches.

Larger dataset: comparing two E. coli genomes

This example compares E. coli K-12 (from this assembly) with E. coli O157:H7 (from this assembly). I removed the two plasmid sequences from the O157:H7 assembly.

import wotplot
from matplotlib import pyplot

# (skipping the part where I loaded the genomes into memory as e1s and e2s...)

# Create the matrix (leaving binary=True by default)
# This takes about 3 minutes on a laptop with 8 GB of RAM
em = wotplot.DotPlotMatrix(e1s, e2s, 20, verbose=True)

# Visualize the matrix using matplotlib's spy() function
# This takes about 2 seconds on a laptop with 8 GB of RAM
fig, ax = pyplot.subplots()
wotplot.viz_spy(
    em, markersize=0.01, title="Comparison of two $E. coli$ genomes ($k$ = 20)", ax=ax
)
ax.set_xlabel(f"$E. coli$ K-12 substr. MG1655 ({len(e1s)/1e6:.2f} Mbp) \u2192")
ax.set_ylabel(f"$E. coli$ O157:H7 str. Sakai ({len(e2s)/1e6:.2f} Mbp) \u2192")
fig.set_size_inches(8, 8)

Output dotplot from the above example

When visualizing a binary matrix, the default colorscheme uses black cells (⬛) to indicate matches (forward, reverse-complementary, or palindromic) and white cells (⬜) to indicate no matches.
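The E. coli example above skips loading the genomes into memory. One possible way to do this without extra dependencies is sketched below. It assumes each genome is stored in a FASTA file containing a single record (e.g. after removing plasmids); the function name read_single_fasta and the file names are hypothetical, and uppercasing the sequence is my own choice rather than something wotplot is known to require.

```python
# Illustrative only: read a single-record FASTA file into a Python string.
# read_single_fasta is a hypothetical helper, not part of wotplot.
def read_single_fasta(path):
    """Return the sequence in a single-record FASTA file as one string."""
    chunks = []
    num_headers = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                # Skip blank lines
                continue
            if line.startswith(">"):
                num_headers += 1
                if num_headers > 1:
                    raise ValueError(f"{path} contains more than one record")
                continue
            # Assumption: uppercase everything so soft-masked (lowercase)
            # bases are treated the same as uppercase ones
            chunks.append(line.upper())
    return "".join(chunks)


# Hypothetical file names; replace these with the paths to your own files
# e1s = read_single_fasta("ecoli_k12.fna")
# e2s = read_single_fasta("ecoli_o157h7_no_plasmids.fna")
```

For multi-record files or compressed input, a dedicated FASTA parser is probably a better choice than this sketch.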

More detailed tutorial

Please see this Jupyter Notebook.

Installation

wotplot supports Python ≥ 3.6. You can install it and its dependencies using pip:

pip install wotplot

Performance

Optimizations made so far

I've tried to make this library reasonably performant. The main optimizations include:

  • We use suffix arrays (courtesy of the lovely pydivsufsort library) in order to reduce the memory footprint of finding shared k-mers.

  • We store the dot plot matrix in sparse format (courtesy of SciPy) in order to reduce its memory footprint.

  • We support visualizing the dot plot matrix's nonzero values using matplotlib's spy() function, which (at least for large sequences) is faster and more memory-efficient than converting the matrix to a dense format and visualizing it with something like imshow().
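To illustrate the first two ideas in the list above, here is a rough pure-Python sketch. It is not wotplot's actual implementation: the suffix array is built naively (pydivsufsort does this much faster), only forward matches are found, and all function names are hypothetical. The matches are kept as coordinate lists, so only matching cells use any memory.

```python
# Illustrative only: find shared k-mers with a suffix array and keep the
# results as (row, column) coordinates. None of these names come from wotplot.
def build_suffix_array(s):
    """Return the start positions of all suffixes of s, in sorted order.

    This naive construction is slow and memory-hungry; it is only meant to
    show what a suffix array is.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, kmer):
    """Return the sorted start positions of kmer in s, using binary search."""
    k = len(kmer)
    # Lower bound: first suffix whose length-k prefix is >= kmer
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + k] < kmer:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose length-k prefix is > kmer
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + k] <= kmer:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def shared_kmer_coords(s1, s2, k):
    """Return (rows, cols) for every forward k-mer match.

    rows holds start positions in s2 and cols holds start positions in s1.
    """
    sa = build_suffix_array(s1)
    rows, cols = [], []
    for j in range(len(s2) - k + 1):
        for i in find_occurrences(s1, sa, s2[j:j + k]):
            rows.append(j)
            cols.append(i)
    return rows, cols


rows, cols = shared_kmer_coords("ACGTACG", "TACG", 3)
print(rows, cols)  # [0, 1, 1] [3, 0, 4]
```

Coordinate lists like these are the kind of input that SciPy's coo_matrix constructor accepts, which is one way to end up with a sparse matrix instead of a dense one.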

That being said...

This library could be made a lot more efficient (I've been documenting ideas in issue #2), but right now it's good enough for my purposes. Feel free to open an issue / make a pull request if you'd like to speed it up ;)

Informal benchmarking

See this Jupyter Notebook for some very informal benchmarking results performed on a laptop with ~8 GB of RAM.

Even on this system, the library can handle reasonably large sequences: in the biggest example, the notebook demonstrates computing the dot plot of two random 100 Mbp sequences (using k = 20) in ~50 minutes. Dot plots of shorter sequences (e.g. 100 kbp or less) usually take only a few seconds to compute, at least for reasonably large values of k.
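A back-of-the-envelope calculation helps explain why the 100 Mbp example is feasible at all. The sketch below is not taken from the benchmarking notebook; it assumes one byte per cell for a hypothetical dense matrix, and it assumes uniformly random, independent bases when estimating the number of forward matches.

```python
# Illustrative only: rough numbers for two random 100 Mbp sequences, k = 20
n = m = 100_000_000
k = 20

# A dense matrix with one cell per pair of positions
dense_cells = n * m
# Assumption: 1 byte per cell; 1 petabyte = 1e15 bytes
dense_petabytes = dense_cells / 1e15
print(f"Dense matrix: {dense_cells:.1e} cells, ~{dense_petabytes:.0f} PB")

# Assumption: bases are uniform and independent, so two given k-mers are
# identical with probability 1 / 4**k
kmer_pairs = (n - k + 1) * (m - k + 1)
expected_forward = kmer_pairs / 4**k
print(f"Expected forward matches: ~{expected_forward:.0f}")
```

Under these assumptions the dense matrix would need about 10 PB, while the expected number of forward matches is only on the order of 10^4, so a sparse representation is very small in comparison. Real genomes share far more k-mers than random sequences do, so their matrices will be less sparse than this estimate suggests.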

Why does this library exist?

  1. This library separates the creation and visualization of dot plot matrices. Other tools that I tried produced pretty visualizations, but didn't give me easy access to the underlying matrix.

  2. I wanted something that worked well with matplotlib, so that I could create and tile lots of dotplots at once in complicated ways.

Limitations

  • Performance: Although I've tried to optimize this tool (see the "Performance" section above), it isn't the fastest or the most memory-efficient way to visualize dot plots.

  • Only static visualizations: The visualization methods included in the tool only support the creation of static plots. There are ways to make matplotlib visualizations interactive (e.g. using %matplotlib notebook within a Jupyter Notebook), but (1) I don't currently know enough about these methods to "officially" support them and (2) these visualizations will still probably pale in comparison to the outputs of dedicated interactive visualization software (e.g. ModDotPlot).

Setting up a development environment

First, fork wotplot -- this will make it easy to submit a pull request later.

After you've forked wotplot, you can download a copy of the code from your fork and install wotplot from this downloaded code. The following commands should do this; note that these commands assume (1) that you're using a Unix system and (2) that you have Python ≥ 3.6 and pip installed.

git clone https://github.com/your-github-username-goes-here/wotplot.git
cd wotplot
pip install -e ".[dev]"

After the above commands, you can check that wotplot was installed successfully by running its test suite:

make test

Acknowledgements

The small example given above, and my initial implementation of an algorithm for computing dot plots, were based on Chapter 6 of Bioinformatics Algorithms (Compeau & Pevzner).

The idea of using suffix arrays to speed up dot plot computation is not new; it is also implemented in Gepard (Krumsiek et al. 2007).

Dependencies

Testing dependencies

Contact

Feel free to open an issue if you have questions, suggestions, comments, or anything else.