This Python 3 module helps you speed up the generation of subplots in pseudo-parallel mode using matplotlib and multiprocessing. This can be useful if you are dealing with expensive preprocessing or plotting tasks, such as violin plots, per subplot.
This library uses Python's multiprocessing module to plot each cell individually. If provided, each process first evaluates a user-defined preprocessing function. Afterwards, every process calls a second user-defined plotting function, passing it the matplotlib axes to plot on. All created plots are then stored as images, retrieved, and assembled by the main thread into a subplot grid without any decoration.
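The steps above can be sketched roughly as follows. This is a minimal illustration and not the module's actual implementation: the names fake_cells, render_cell, assemble and plot_grid are hypothetical, and the sketch keeps the rendered images in memory as PNG bytes, which is an assumption about how the intermediate images could be passed around.

```python
import io
from multiprocessing import Pool

import matplotlib
matplotlib.use("Agg")  # headless backend, so workers never need a display
import matplotlib.pyplot as plt
import numpy as np


def fake_cells(n_cells, seed=0):
    """Generate small random datasets, one per grid cell (illustration only)."""
    rng = np.random.default_rng(seed)
    return list(rng.uniform(-1, 1, size=(n_cells, 100, 3)))


def render_cell(cell):
    """Worker step: plot one cell and return the result as PNG bytes.

    Hypothetical helper; the violin plot stands in for any expensive plot.
    """
    fig, ax = plt.subplots(figsize=(2, 2))
    ax.violinplot(cell)
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=50)
    plt.close(fig)  # free the figure inside the worker
    return buf.getvalue()


def assemble(images, grid_shape):
    """Main-process step: place the rendered images into an undecorated grid."""
    rows, cols = grid_shape
    fig, axes = plt.subplots(rows, cols, squeeze=False)
    for ax, png in zip(axes.ravel(), images):
        ax.imshow(plt.imread(io.BytesIO(png), format="png"))
        ax.axis("off")  # no ticks, labels or frames
    return fig


def plot_grid(cells, grid_shape, processes=None):
    """Render every cell in a worker process, then assemble the grid."""
    with Pool(processes) as pool:
        images = pool.map(render_cell, cells)
    return assemble(images, grid_shape)
```

A call such as plot_grid(fake_cells(4), grid_shape=(2, 2)) returns the assembled figure. On platforms that start workers with the spawn method (Windows, macOS), the call should sit under an if __name__ == "__main__": guard.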
This module is at a very early stage, so no PyPI releases are currently provided. However, you can simply install this module from git:
pip install git+https://github.com/paulgavrikov/parallel-matplotlib-grid/
Aside from the data, all you need to provide is the grid layout grid_shape and a plotting function plot_fn.
Here is an example:
from parallelplot import parallel_plot
import matplotlib.pyplot as plt
import numpy as np

# Called once per grid cell with that cell's data and the axes to draw on
def violin(data, fig, axes):
    axes.violinplot(data)

# Generate some fake data: 30 cells of shape (512, 512)
X = np.random.uniform(low=-1, high=1, size=(30, 512, 512))
parallel_plot(plot_fn=violin, data=X, grid_shape=(3, 10))
plt.show()
Want to preprocess your data before plotting? No problem! Just provide preprocessing_fn.
Here is an example where we apply a PCA transformation:
from parallelplot import parallel_plot
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Runs in each worker process before the plotting function
def preprocess(data):
    return PCA().fit_transform(data)

# Called once per grid cell with the preprocessed data and the axes to draw on
def violin(data, fig, axes):
    axes.violinplot(data)

# Generate some fake data: 30 cells of shape (512, 512)
X = np.random.uniform(low=-1, high=1, size=(30, 512, 512))
parallel_plot(plot_fn=violin, data=X, grid_shape=(3, 10), preprocessing_fn=preprocess)
plt.show()
There are some cases where this module is either useless or adds overhead. Here are a few of those:
- Your plotting and preprocessing functions execute fast, but your data is big. multiprocessing uses pickle as the input/output format for process tasks, which requires the data to be serialized. This can introduce significant overhead.
- Your data is larger than 4 GiB. For some reason, multiprocessing uses an ancient pickle format that only supports data up to 4 GiB in size. There are ways to bypass that, but it is probably not worth it, because pickling is slow and adds computational overhead of its own.
- You only have one core available. Sorry 'bout that.
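To get a feel for whether the cases above apply, you can time how long one cell takes to pickle and compare that with how long it takes to plot. The sketch below uses only the standard library; the helpers pickle_round_trip and worth_parallelizing are hypothetical and not part of this module, and the comparison is a rough heuristic rather than a precise cost model.

```python
import os
import pickle
import time


def pickle_round_trip(obj):
    """Return (size_in_bytes, seconds) for serializing and deserializing obj."""
    start = time.perf_counter()
    blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    pickle.loads(blob)
    return len(blob), time.perf_counter() - start


def worth_parallelizing(sample_cell, seconds_per_plot):
    """Rough heuristic (an assumption, not a guarantee).

    Parallel plotting only pays off if more than one core is available
    and pickling a cell is cheaper than plotting it.
    """
    cores = os.cpu_count() or 1
    _, pickle_seconds = pickle_round_trip(sample_cell)
    return cores > 1 and pickle_seconds < seconds_per_plot


# Example: one megabyte of raw bytes standing in for a cell's data
size, seconds = pickle_round_trip(bytes(1_000_000))
print(f"pickled size: {size} bytes, round trip: {seconds:.4f} s")
```

Measure seconds_per_plot by timing your own plotting function on a single cell, and pass a representative cell as sample_cell.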
Feel free to open a PR, or raise an issue for questions, feature requests, etc.