scalable image and time series analysis in python
Thunder is an ecosystem of tools for the analysis of image and time series data in Python. It provides data structures and algorithms for loading, processing, and analyzing these data, and can be useful in a variety of domains, including neuroscience, medical imaging, video processing, and geospatial and climate analysis. It can be used locally, but also supports large-scale analysis through the distributed computing engine Spark. All data structures and analyses in Thunder are designed to run identically and with the same API whether local or distributed.
Thunder is designed around modularity and composability: the core thunder package, in this repository, only defines common data structures and read/write patterns, and most functionality is broken out into several related packages. Each one is independently versioned, with its own GitHub repository for organizing issues and contributions.
This readme provides an overview of the core thunder package, its data types, and methods for loading and saving. Tutorials, detailed API documentation, and info about all associated packages can be found at the documentation site.
The core thunder package defines data structures and read/write patterns for images and series data. It is built on numpy, scipy, scikit-learn, and scikit-image, and is compatible with Python 2.7+ and 3.4+. You can install it using:
pip install thunder-python
Lots of functionality in Thunder, especially for specific types of analyses, is broken out into the following separate packages.
thunder-regression: mass univariate regression algorithms
thunder-factorization: matrix factorization algorithms
thunder-registration: registration for image sequences
You can install the ones you want with pip, for example:
pip install thunder-regression
pip install thunder-registration
Here's a short snippet showing how to load an image sequence (in this case random data), median filter it, transform it to a series, detrend and compute a Fourier transform on each pixel, then convert the result to an array.
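import thunder as td
# load an image sequence (random data in this case)
data = td.images.fromrandom()
# median filter each image, then regroup the data as series
ts = data.median_filter(3).toseries()
# detrend, compute a Fourier transform on each pixel, and convert to an array
frequencies = ts.detrend().fourier(freq=3).toarray()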
Most workflows in Thunder begin by loading data, which can come from a variety of sources and locations, and can be either local or distributed (see below).
The two primary data types are images and series. images are used for collections or sequences of images, and are especially useful when working with movie data. series are used for collections of one-dimensional arrays, often representing time series.
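To make the relationship between the two types concrete, here is a minimal pure-Python sketch. It is not Thunder code: it assumes an image sequence is indexed as (time, row, column) and that converting it to series yields one time series per pixel. The function name images_to_series is hypothetical, and Thunder itself works on ndarrays rather than nested lists.

```python
def images_to_series(frames):
    """Regroup a list of 2D frames into one time series per pixel.

    frames[t][r][c] -> series[r][c][t]. Hypothetical helper for
    illustration only; this is not part of Thunder's API.
    """
    n_rows = len(frames[0])
    n_cols = len(frames[0][0])
    return [
        [[frame[r][c] for frame in frames] for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Three 2x2 frames: each pixel holds (frame index * 10) + a pixel id 0..3.
frames = [
    [[t * 10 + 0, t * 10 + 1], [t * 10 + 2, t * 10 + 3]]
    for t in range(3)
]
series = images_to_series(frames)
print(series[0][1])  # time series of the pixel at row 0, column 1: [1, 11, 21]
```

The same values are present in both layouts; only the grouping changes, from one array per time point to one array per pixel.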
Once loaded, each data type can be manipulated through a variety of statistical operators, including simple statistical aggregations like mean, min, and max, or more complex operations like gaussian_filter, detrend, and subsample. Both images and series objects are wrappers for ndarrays: either a local numpy ndarray or a distributed ndarray using bolt and spark. Calling toarray() on an images or series object at any time returns a local numpy ndarray, which is an easy way to move between Thunder and other Python data analysis tools, like pandas and scikit-learn.
For a full list of methods on images and series data, see the documentation site.
Both images and series can be loaded from a variety of data types and locations. For all loading methods, the optional argument engine allows you to specify whether data should be loaded in 'local' mode, which is backed by a numpy array, or in 'spark' mode, which is backed by an RDD.
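All loading methods are available on the module for the corresponding data type, for example:
import thunder as td
# load a directory of TIFF files as images
data = td.images.fromtif('/path/to/tifs')
# wrap an existing numpy array (somearray) as local series
data = td.series.fromarray(somearray)
# pass a SparkContext (sc) as the engine to load in distributed mode
data_distributed = td.series.fromarray(somearray, engine=sc)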
The argument engine can be either None for local use or a SparkContext for distributed use with Spark. In either case, methods that load from files, e.g. fromtif or frombinary, can load from either a local filesystem or Amazon S3, with the optional argument credentials for S3 credentials. See the documentation site for a full list of data loading methods.
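The engine convention described above can be sketched in a few lines. This is an illustration under stated assumptions, not Thunder's implementation: resolve_mode and FakeSparkContext are hypothetical names, and the stand-in class exists only so the sketch runs without Spark installed.

```python
def resolve_mode(engine=None):
    """Return 'local' when no engine is given, otherwise 'spark'.

    Hypothetical helper illustrating the convention described above;
    Thunder's own dispatch logic may differ.
    """
    return 'local' if engine is None else 'spark'


class FakeSparkContext(object):
    """Stand-in for a SparkContext so the sketch runs without Spark."""


print(resolve_mode())                    # local
print(resolve_mode(FakeSparkContext()))  # spark
```

Keeping the choice in a single optional argument means the rest of a workflow can stay unchanged when moving from a laptop to a cluster.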
Thunder doesn't require Spark and can run locally without it, but Spark and Thunder work great together! To install and configure a Spark cluster, consult the official Spark documentation. Thunder supports Spark version 1.5+ (currently tested against 2.0.0), and uses the Python API PySpark. If you have Spark installed, you can install Thunder just by calling pip install thunder-python on both the master node and all worker nodes of your cluster. Alternatively, you can clone this GitHub repository, and make sure it is on the PYTHONPATH of both the master and worker nodes.
Once you have a running cluster with a valid SparkContext (this is created automatically as the variable sc if you call the pyspark executable), you can pass it as the engine to any of Thunder's loading methods, and this will load your data in distributed 'spark' mode. In this mode, all operations will be parallelized, and chained operations will be lazily executed.
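Lazy execution of chained operations can be illustrated with a toy model. The sketch below is neither Spark nor Thunder code; the class LazyPipeline and its map method are hypothetical, and only the method name toarray is borrowed from the text above. It shows the general idea: chaining records the steps, and work happens only when a result is requested.

```python
class LazyPipeline(object):
    """Record element-wise operations and apply them only on request."""

    def __init__(self, values, steps=()):
        self._values = list(values)
        self._steps = tuple(steps)

    def map(self, func):
        # Nothing is computed here; the function is only added to the plan.
        return LazyPipeline(self._values, self._steps + (func,))

    def toarray(self):
        # Evaluation happens now, applying every recorded step in order.
        result = self._values
        for func in self._steps:
            result = [func(v) for v in result]
        return result


calls = []


def double(v):
    calls.append(v)  # track when the function actually runs
    return v * 2


pipeline = LazyPipeline([1, 2, 3]).map(double).map(lambda v: v + 1)
print(len(calls))          # 0: nothing has run yet
print(pipeline.toarray())  # [3, 5, 7]
```

In a distributed engine, deferring execution this way lets the whole chain be planned and run in one pass over the data instead of materializing every intermediate result.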
Thunder is a community effort! The codebase so far is due to the excellent work of the following individuals:
Andrew Osheroff, Ben Poole, Chris Stock, Davis Bennett, Jascha Swisher, Jason Wittenbach, Jeremy Freeman, Josh Rosen, Kunal Lillaney, Logan Grosenick, Matt Conlen, Michael Broxton, Noah Young, Ognen Duzlevski, Richard Hofer, Owen Kahn, Ted Fujimoto, Tom Sainsbury, Uri Laseron, W J Liddy
If you run into a problem, have a feature request, or want to contribute, submit an issue or a pull request, or come talk to us in the chatroom!