DataCollection is an extended data structure for data science and machine learning tasks. It is designed to support advanced features while staying as simple to use as a Python list or iterable.
DataCollection is an experimental fork of towhee.DataCollection that aims to provide better MLOps support.
- Method-chaining API that makes building data processing pipelines much easier;
- Streaming computation that handles large-scale data beyond memory limits;
- Easier tabular data programming API;
- Hyper-parameter support in data processing;
- Building prototype ML pipelines with experiment config and metric tracking;
- Tuning ML pipeline performance with ease;
- Deploying ML applications into production;
pip install reactive
DataCollection is an enhancement to the list data type in Python. Creating a DataCollection from a list is as simple as:
>>> import reactive as rv
>>> dc = rv.of([0, 1, 2, 3])
>>> dc
[0, 1, 2, 3]
DataCollection is designed to behave as a drop-in replacement for the Python list:
>>> dc = rv.of([0, 1, 2, 3])
>>> dc
[0, 1, 2, 3]
# indexing
>>> dc[1], dc[2]
(1, 2)
# slicing
>>> dc[:2]
[0, 1]
# appending
>>> dc.append(4).append(5)
[0, 1, 2, 3, 4, 5]
DataCollection provides higher-order functions such as map and filter:
>>> rv.of([0, 1, 2, 3, 4]).map(lambda x: x*2)
[0, 2, 4, 6, 8]
>>> rv.of([0, 1, 2, 3, 4]).filter(lambda x: int(x%2)==0)
[0, 2, 4]
Both map and filter always return a new DataCollection, which makes the method-chaining style possible in Python:
>>> (
... rv.of([0, 1, 2, 3, 4])
... .filter(lambda x: x%2==1)
... .map(lambda x: x+1)
... .map(lambda x: x*2)
... )
[4, 8]
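The reason chaining works can be pictured with a small stand-in class. This is a minimal sketch, not the reactive implementation: MiniCollection is a hypothetical name introduced here, and the only assumption is that each method builds and returns a fresh collection rather than mutating the receiver.

```python
class MiniCollection(list):
    """Hypothetical stand-in for illustration; not the reactive implementation."""

    def map(self, fn):
        # Build a new collection instead of mutating self.
        return MiniCollection(fn(x) for x in self)

    def filter(self, pred):
        # Keep only the items for which pred(x) is truthy.
        return MiniCollection(x for x in self if pred(x))


result = (
    MiniCollection([0, 1, 2, 3, 4])
    .filter(lambda x: x % 2 == 1)
    .map(lambda x: x + 1)
    .map(lambda x: x * 2)
)
print(result)  # [4, 8]
```

Because every call returns the same collection type, any further method can be appended to the chain without intermediate variables.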
DataCollection is designed to be extensible by simply defining or importing functions:
>>> def add1(x):
...     return x + 1
>>> (
... rv.of([0, 1, 2, 3, 4])
... .add1()
... )
[1, 2, 3, 4, 5]
We can directly call the function add1, defined in the current context, as a method of DataCollection.
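One plausible way to get this behavior in plain Python is to resolve unknown attribute names to functions at call time. The sketch below is an assumption for illustration only: ExtensibleCollection and its explicit register decorator are hypothetical, whereas the text above describes functions being picked up from the surrounding context without registration.

```python
class ExtensibleCollection(list):
    """Hypothetical sketch; reactive's real lookup mechanism may differ."""

    # Explicit registry of element-wise functions (an assumption of this sketch).
    registry = {}

    @classmethod
    def register(cls, fn):
        cls.registry[fn.__name__] = fn
        return fn

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, e.g. for .add1
        try:
            fn = self.registry[name]
        except KeyError:
            raise AttributeError(name) from None
        # Return a zero-argument method that maps fn over the items.
        return lambda: ExtensibleCollection(fn(x) for x in self)


@ExtensibleCollection.register
def add1(x):
    return x + 1


extended = ExtensibleCollection([0, 1, 2, 3, 4]).add1()
print(extended)  # [1, 2, 3, 4, 5]
```

Names that are not registered still raise AttributeError, so ordinary list methods and typos behave as usual.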
DataCollection provides an easier way to process tabular data held in pandas, for example:
>>> import pandas as pd
>>> df = pd.DataFrame({"a": range(5)})
The DataFrame can be wrapped into a DataCollection, and functions can then be applied to different columns:
>>> dc = rv.from_pandas(df)
>>> def add1(x): return x+1
>>> def mul2(x): return x*2
>>> (
... dc.add1["a", "b"]()
... .mul2["b", "c"]()
... )
   a  b   c
0  0  1   2
1  1  2   4
2  2  3   6
3  3  4   8
4  4  5  10
where add1["a", "b"] means that we apply add1 to column a and store the output into column b.
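The fn["in", "out"]() syntax can be reproduced in plain Python with __getattr__, __getitem__ and __call__. The following is a minimal sketch under stated assumptions, not the library's code: MiniTable, ColumnOp, FUNCTIONS and register are hypothetical names, and a dict of lists stands in for the DataFrame.

```python
FUNCTIONS = {}  # hypothetical registry: name -> element-wise function


def register(fn):
    FUNCTIONS[fn.__name__] = fn
    return fn


class ColumnOp:
    """Binds a function to a table, then to (input, output) column names."""

    def __init__(self, table, fn):
        self.table = table
        self.fn = fn
        self.src = self.dst = None

    def __getitem__(self, cols):
        # op["a", "b"] receives the tuple ("a", "b")
        self.src, self.dst = cols
        return self

    def __call__(self):
        columns = dict(self.table.columns)  # shallow copy; input table is untouched
        columns[self.dst] = [self.fn(v) for v in columns[self.src]]
        return MiniTable(columns)


class MiniTable:
    """Hypothetical column store: column name -> list of values."""

    def __init__(self, columns):
        self.columns = columns

    def __getattr__(self, name):
        # Reached only when normal lookup fails, e.g. table.add1
        if name not in FUNCTIONS:
            raise AttributeError(name)
        return ColumnOp(self, FUNCTIONS[name])


@register
def add1(x):
    return x + 1


@register
def mul2(x):
    return x * 2


table = MiniTable({"a": list(range(5))})
out = table.add1["a", "b"]().mul2["b", "c"]()
print(out.columns)
# {'a': [0, 1, 2, 3, 4], 'b': [1, 2, 3, 4, 5], 'c': [2, 4, 6, 8, 10]}
```

Each step returns a new table, so column operations chain in the same way as map and filter do on a flat collection.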
DataCollection aims to provide a collection API similar to scala.collection, while trying to be more Pythonic. Deep learning and Python have fundamentally changed data science and the players in the field, and DataCollection tries to improve the readability and performance of ML-related Python code.
- Collection API;
  - stream and unstream execution;
  - map and filter;
  - flatten and fold;
  - batch and rolling;
- Tabular API;
  - SISO (single input and single output);
  - MIMO (multiple input and multiple output);
  - GraphQL support;
  - groupby API;
  - aggregate API;
- Execution Engine;
  - async and parallel execution;
  - native execution engine with Rust;
  - Arrow-based columnar storage;
- JIT Compiler;
  - numba compiler support;
  - jit compile hook;
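To make the stream, batch and rolling items in the list above more concrete, they can be pictured with plain Python generators. This is only a sketch using the standard library: the function names mirror the list items, but the semantics chosen here (a shorter final batch, only full windows) are assumptions and may not match what reactive does.

```python
from collections import deque
from itertools import islice


def batch(iterable, size):
    """Lazily yield consecutive lists of up to `size` items."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def rolling(iterable, window):
    """Lazily yield each full sliding window of `window` items as a list."""
    buf = deque(maxlen=window)  # oldest item is dropped automatically
    for item in iterable:
        buf.append(item)
        if len(buf) == window:
            yield list(buf)


stream = (x * x for x in range(7))  # lazy: nothing is computed yet
batches = list(batch(stream, 3))
windows = list(rolling(range(5), 3))
print(batches)  # [[0, 1, 4], [9, 16, 25], [36]]
print(windows)  # [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

Since both helpers consume their input one item at a time, only one batch or window is held in memory at once, which is the property streaming execution relies on for data larger than memory.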