Build API for array processing built on dtype-next
ezmiller commented
Goal
Currently, tablecloth provides an easy-to-use wrapper over tech.ml.dataset’s high-performance dataset processing constructs. One part of the tech.ml stack that tablecloth has not directly covered is dtype-next, which provides a highly performant basis for array-like numerical processing, similar to NumPy. The project I am proposing aims to wrap dtype-next within tablecloth, providing a new easy-to-use API for numerical structures for the emerging Clojure data processing ecosystem.
Rough Outline of Steps
During this project, I will focus on the following tasks:
- Add a new function to tablecloth (perhaps named `column` or `array`) that will return a typed, countable, random-access data structure backed by dtype-next’s abstractions;
- Design two API pathways to interact with this structure: one that realizes the data fully at each step, providing more straightforward but less efficient interaction; and another, more performant but slightly harder to use, that allows users to wrap processing steps in a "transaction";
- Mimic the NumPy (and possibly R vector) APIs, ensuring an equivalently complete functional interface for numerical processing;
- Support a reader-friendly format for printing columns in the Clojure REPL (see techascent/tech.ml.dataset#203);
- Validate the usefulness of the API by implementing real-world examples with various characteristics (missing values, various data types, challenging sizes, etc.) and comparing the ergonomics with other platforms such as Python and R.
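To make the two pathways in the list above more concrete, here is a rough sketch of how they might differ. It assumes dtype-next's existing `tech.v3.datatype` and `tech.v3.datatype.functional` namespaces; the names `column`, `eager-pipeline`, and `deferred-pipeline` are hypothetical placeholders for illustration, not a proposed final API.

```clojure
(ns example.column-sketch
  (:require [tech.v3.datatype :as dtype]
            [tech.v3.datatype.functional :as dfn]))

;; Hypothetical constructor: `column` is only one of the candidate names.
(defn column
  "Returns a typed, countable, random-access container holding `data`."
  [datatype data]
  (dtype/make-container datatype data))

;; Pathway 1 (eager): each step is copied into a concrete container,
;; so intermediate results are always fully realized.
(defn eager-pipeline [col]
  (let [shifted (dtype/clone (dfn/+ col 1.0))]
    (dtype/clone (dfn/* shifted 2.0))))

;; Pathway 2 (deferred): the steps are composed as lazy readers and the
;; result is realized once, at the end of the "transaction".
(defn deferred-pipeline [col]
  (dtype/clone (dfn/* (dfn/+ col 1.0) 2.0)))

(def xs (column :float64 [1 2 3 4]))

(dtype/elemwise-datatype xs) ; element type of the container
(dtype/ecount xs)            ; number of elements
(nth xs 2)                   ; random access by index
```

Both pipelines should produce the same values; the difference is how many intermediate containers are allocated along the way. How the deferred pathway is exposed to users (a macro, a function, or something else) is left open here.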
Open Questions
- What will the name of this entity be? Some options could be: `array`, `column`, `buffer`, `column-vector`.
- Does it make sense for this API to live within tablecloth or might we want to break it out into its own library?
- Are there ways that this work needs to align with the work that @ribelo and @genmeblog are doing to define a syntax for operations on dataset columns (e.g. #47)?