pdcast - flexible type extensions for pandas

pdcast enhances the numpy/pandas typing infrastructure, allowing users to write powerful, modular extensions for arbitrary data.

pdcast provides a robust toolset for handling custom data types, including:

  • Automatic creation of ExtensionDtypes: pdcast simplifies and streamlines the creation of new data types for the pandas ecosystem.
  • Universal conversions: pdcast implements a single, overloadable conversion function that can losslessly convert data within its expanded type system.
  • Type inference and schema validation: pdcast can efficiently infer the types of arbitrary data and compare them against an external schema, increasing confidence and reliability in complex data pipelines.
  • First-class support for missing values and mixed-type data: pdcast implements a separate data type for missing values, and can naturally process composite vectors via a split-apply-combine strategy.
  • Data compression: pdcast can losslessly compress data into a more efficient representation, reducing memory usage and improving performance (see the sketch after this list).
  • Compatibility with third-party libraries: pdcast bridges the gap between dynamically-typed Python and statically-typed extension libraries, allowing users to optimize their code without sacrificing flexibility.
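
For a taste of the compression feature, pdcast's conversions accept a downcast argument (demonstrated in the round-trip example later in this README) that losslessly shrinks data to the smallest dtype capable of representing them. A minimal sketch; the exact output dtype may vary by platform:

>>> import pdcast
>>> pdcast.cast([1, 2, 3], "int", downcast=True)   # doctest: +SKIP
0    1
1    2
2    3
dtype: int8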

pdcast implements a rich type system for numpy/pandas dtype objects, adding support for:

  • Abstract hierarchies representing different subtypes and implementations. These are lightweight, efficient, and highly extensible, with new types added in as little as :ref:`10 lines of code <tutorial>`.

    >>> @register
    ... class CustomType(ScalarType):
    ...     name = "custom"
    ...     aliases = {"foo", "bar"}
    ...
    ...     def __init__(self, x=None):
    ...         super().__init__(x=x)
  • A configurable, domain-specific mini-language for resolving types. This represents a superset of the existing numpy/pandas syntax, with customizable aliases and semantics.

    >>> resolve_type("foo")
    CustomType(x=None)
    >>> resolve_type("foo").aliases.add("baz")
    >>> resolve_type("baz[x]")
    CustomType(x='x')
  • Vectorized type detection for example data in any format. This is highly optimized and works regardless of an example's .dtype attribute, allowing pdcast to infer the types of ambiguous sequences such as lists, tuples, generators, and dtype: object arrays, no matter their contents.

    >>> detect_type([1, 2, 3])
    PythonIntegerType()
    >>> detect_type([1, 2.3, 4+5j])   # doctest: +SKIP
    CompositeType({int[python], float64[python], complex128[python]})
  • Efficient type checks for vectorized data. These combine the above tools to perform isinstance()-like hierarchical checks for any node in the pdcast type system. If the data are properly labeled, then this is done in constant time, allowing users to add checks wherever they are needed.

    >>> import pandas as pd
    >>> df = pd.DataFrame({"a": [1, 2], "b": [1., 2.], "c": ["a", "b"]})
    >>> typecheck(df, {"a": "int", "b": "float", "c": "string"})
    True
    >>> typecheck(df["a"], "int")
    True
  • Support for composite and decorator types. These can be used to represent mixed data and/or add new functionality to an existing type without modifying its original implementation (for instance by marking it as sparse or categorical).

    >>> resolve_type("int, float, complex")  # doctest: +SKIP
    CompositeType({int, float, complex})
    >>> resolve_type("sparse[int, 23]")
    SparseType(wrapped=IntegerType(), fill_value=23)
  • Multiple dispatch based on the inferred type of one or more of a function's arguments. With the pdcast type system, this can be extended to cover vectorized data in any representation, including sequences that contain mixed elements.

    >>> @dispatch("x", "y")
    ... def add(x, y):
    ...     return x + y
    
    >>> @add.overload("int", "int")
    ... def add_integer(x, y):
    ...     return x - y
    
    >>> add([1, 2, 3], 1)
    0    0
    1    1
    2    2
    dtype: int[python]
    >>> add([1, 2, 3], [1, True, 1.0])
    0      0
    1      3
    2    4.0
    dtype: object
  • Metaprogrammable extension functions with dynamic arguments. These can be used to actively manage the values that are supplied to a function by defining validators for one or more arguments, whose results are substituted into the body of the function in-place. They can also be used to programmatically add new arguments at runtime, making them available to any virtual implementations that might request them (see the sketch after this list).

    >>> @extension_func
    ... def add(x, y, **kwargs):
    ...     return x + y
    
    >>> @add.argument
    ... def y(val, context: dict) -> int:
    ...     return int(val)
    
    >>> add(1, "2")
    3
    >>> add.y = 2
    >>> add(1)
    3
    >>> del add.y
    >>> add(1)
    Traceback (most recent call last):
       ...
    TypeError: add() missing 1 required positional argument: 'y'
  • Attachable functions with a variety of access patterns. These can be used to export a function to an existing class as a virtual attribute, dynamically modifying its interface at runtime. These attributes can be used to mask existing behavior while maintaining access to the original implementation or be hidden behind virtual namespaces to avoid conflicts altogether, similar to Series.str, Series.dt, etc.

    >>> pdcast.attach()
    >>> series = pd.Series([1, 2, 3])
    >>> series.element_type == detect_type(series)
    True
    >>> series.typecheck("int") == typecheck(series, "int")
    True
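
To illustrate the runtime arguments mentioned in the extension function bullet above, the ``**kwargs`` catch-all in add()'s signature suggests that @add.argument can also register names that were never part of the original signature. A hedged sketch, assuming the validator API treats new arguments the same as existing ones:

>>> @add.argument
... def z(val, context: dict) -> int:
...     return int(val)   # validated just like y above

>>> add(1, 2, z="3")   # doctest: +SKIP
3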

Together, these features enable a functional approach to extending pandas with small, fully encapsulated functions that perform special logic based on the types of their arguments. Users are thus able to surgically overload virtually any aspect of the pandas interface or add entirely new behavior specific to one or more of their own data types - all while maintaining the pandas tools they know and love.

pdcast combines its advanced features to implement its own super-charged :func:`cast() <pdcast.cast>` function, which can perform universal data conversions within its expanded type system. Here's a round-trip journey through each of the core families of the pdcast type system:

>>> import numpy as np

>>> class CustomObj:
...     def __init__(self, x):  self.x = x
...     def __str__(self):  return f"CustomObj({self.x})"
...     def __repr__(self):  return str(self)

>>> pdcast.to_boolean([1+0j, "False", None])  # non-homogenous to start
0     True
1    False
2     <NA>
dtype: boolean
>>> _.cast(np.dtype(np.int8))  # to integer
0       1
1       0
2    <NA>
dtype: Int8
>>> _.cast("double")  # to float
0    1.0
1    0.0
2    NaN
dtype: float64
>>> _.cast(np.complex128, downcast=True)  # to complex (minimizing memory usage)
0    1.0+0.0j
1    0.0+0.0j
2    NaN+0.0j
dtype: complex64
>>> _.cast("sparse[decimal, 1]")  # to decimal (sparse)
0      1
1      0
2    NaN
dtype: Sparse[object, Decimal('1')]
>>> _.cast("datetime", unit="Y", since="j2000")  # to datetime (years since j2000 epoch)
0   2001-01-01 12:00:00
1   2000-01-01 12:00:00
2                   NaT
dtype: datetime64[ns]
>>> _.cast("timedelta[python]", since="Jan 1st, 2000 at 12:00 PM")  # to timedelta (µs since j2000)
0    366 days, 0:00:00
1              0:00:00
2                  NaT
dtype: timedelta[python]
>>> _.cast(CustomObj)  # to custom Python object
0    CustomObj(366 days, 0:00:00)
1              CustomObj(0:00:00)
2                            <NA>
dtype: object[<class 'CustomObj'>]
>>> _.cast("categorical[str[pyarrow]]")  # to string (categorical with PyArrow backend)
0    CustomObj(366 days, 0:00:00)
1              CustomObj(0:00:00)
2                            <NA>
dtype: category
Categories (2, string): [CustomObj(0:00:00), CustomObj(366 days, 0:00:00)]
>>> _.cast("bool", true="*", false="CustomObj(0:00:00)")  # back to our original data
0     True
1    False
2     <NA>
dtype: boolean

New implementations for :func:`cast() <pdcast.cast>` can be added dynamically, with customization for both the source and destination types.

>>> @cast.overload("bool[python]", "int[python]")
... def my_custom_conversion(series, dtype, **unused):
...     print("calling my custom conversion...")
...     return series.apply(int, convert_dtype=False)

>>> pd.Series([True, False], dtype=object).cast(int)
calling my custom conversion...
0    1
1    0
dtype: object

Finally, pdcast's powerful suite of function decorators allow users to write their own specialized extensions for existing pandas behavior:

>>> @attachable
... @dispatch("self", "other")
... def __add__(self, other):
...     return getattr(self.__add__, "original", self.__add__)(other)

>>> @__add__.overload("int", "int")
... def add_integer(self, other):
...     return self - other

>>> __add__.attach_to(pd.Series)
>>> pd.Series([1, 2, 3]) + 1
0    0
1    1
2    2
dtype: int64
>>> pd.Series([1, 2, 3]) + [1, True, 1.0]
0      0
1      3
2    4.0
dtype: object

Or create entirely new attributes and methods above and beyond what pandas includes by default.

>>> @attachable
... @dispatch("series")
... def bar(series):
...     raise NotImplementedError("bar is only defined for floating point values")

>>> @bar.overload("float")
... def float_bar(series):
...     print("Hello, World!")
...     return series

>>> bar.attach_to(pd.Series, namespace="foo", pattern="property")
>>> pd.Series([1.0, 2.0, 3.0]).foo.bar
Hello, World!
0    1.0
1    2.0
2    3.0
dtype: float64
>>> pd.Series([1, 2, 3]).foo.bar
Traceback (most recent call last):
   ...
NotImplementedError: bar is only defined for floating point values

pdcast is available under an MIT license.

pdcast is open-source and welcomes contributions. To get involved, open an issue or pull request on GitHub, or contact the maintainer directly at eerkela42@gmail.com.

Related projects:

  • pdlearn - AutoML integration for pandas DataFrames using the pdcast type system.