petl-developers/petl

Support the brand new Pandas Dataframe alternatives

juarezr opened this issue · 0 comments

Problem description

It would be nice to support the newer DataFrame libraries in addition to Pandas.

Two interesting candidates would be:

Modin Overview

Scale your pandas workflow by changing a single line of code

Modin uses Ray or Dask to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical.

Polars Overview

Lightning-fast DataFrame library for Rust and Python

Polars is a lightning fast DataFrame library/in-memory query engine. Its embarrassingly parallel execution, cache efficient algorithms and expressive API makes it perfect for efficient data wrangling, data pipelines, snappy APIs and so much more.

Problem Description

Currently petl supports Pandas through the functions petl.io.pandas.fromdataframe and petl.io.pandas.todataframe.
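For reference, the existing Pandas round trip looks roughly like the sketch below. The helper name pandas_roundtrip is made up for illustration. It assumes both petl and pandas are installed, and it uses the top-level aliases etl.todataframe and etl.fromdataframe.

```python
def pandas_roundtrip(rows):
    """Convert a petl table to a pandas DataFrame and back again.

    `rows` is a list of rows whose first row is the header.
    This helper is illustrative only; it is not part of petl.
    """
    # Imported lazily because pandas is an optional dependency of petl.
    import petl as etl

    table = etl.wrap(rows)          # wrap plain rows as a petl table
    df = etl.todataframe(table)     # petl table -> pandas.DataFrame
    back = etl.fromdataframe(df)    # pandas.DataFrame -> petl table
    return df, back
```

Any new backend would presumably need an equivalent pair of conversions, which is where the questions below come in.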

Evolving this feature would require researching:

  • How do these libraries fit petl use cases?
  • What are the most ergonomic APIs to consider, whether by adding new functions or by extending existing ones?
  • What additional burden would proper support require? For example:
    • CI: acceptance tests
    • CD: impact on releases
    • documentation: API details, caveats, proper setup, FAQ, and troubleshooting
  • What happens when the upstream projects break compatibility between versions?
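To make the API question more concrete, here is one possible shape, offered as a starting point for discussion rather than a proposal for the final design. The names BACKENDS, table_to_columns, and to_frame are hypothetical and do not exist in petl. The sketch relies only on the fact that the DataFrame constructors of pandas, Modin, and Polars accept a dict of column lists, and it imports each backend lazily so that none of them becomes a hard dependency.

```python
import importlib

# Hypothetical registry: backend name -> module that exposes a DataFrame class.
BACKENDS = {
    "pandas": "pandas",
    "modin": "modin.pandas",
    "polars": "polars",
}


def table_to_columns(table):
    """Turn a petl-style table (header row, then data rows) into a dict of lists.

    Assumes unique field names and rows at least as long as the header.
    """
    it = iter(table)
    header = next(it, None)
    if header is None:
        return {}
    names = [str(name) for name in header]
    columns = {name: [] for name in names}
    for row in it:
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns


def to_frame(table, backend="pandas"):
    """Build a DataFrame with the chosen backend, importing it on demand."""
    module_name = BACKENDS[backend]  # KeyError for an unknown backend
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"backend {backend!r} needs the optional package "
            f"{module_name.split('.')[0]!r}"
        ) from exc
    return module.DataFrame(table_to_columns(table))
```

A single dispatching function keeps the public surface small, but it materializes every column in memory, which may not suit petl's lazy, row-oriented tables. Separate per-backend functions, mirroring the existing Pandas pair, would be the other obvious option.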