Build API for array processing built on dtype-next

Question

Opened this issue 3 years ago · 0 comments

Goal

Currently, tablecloth provides an easy-to-use wrapper over tech.ml.dataset’s high-performance dataset processing constructs. One part of the tech.ml stack that tablecloth does not yet cover directly is dtype-next, which provides a high-performance foundation for array-like numerical processing, similar to NumPy. The project I am proposing aims to wrap dtype-next within tablecloth, giving the emerging Clojure data processing ecosystem a new, easy-to-use API for numerical structures.

Rough Outline of Steps

During this project, I will focus on the following tasks:

Add a new function to tablecloth (perhaps named column or array) that returns a typed, countable, random-access data structure backed by dtype-next’s abstractions;
Design two API pathways for interacting with this structure: one that fully realizes the data at each step, which is more straightforward but less efficient; and another, more performant but slightly harder to use, that lets users wrap several processing steps in a "transaction";
Mimic the NumPy (and possibly R vector) APIs, ensuring an equivalently complete functional interface for numerical processing;
Support a readable format for printing columns in the Clojure REPL (see techascent/tech.ml.dataset#203);
Validate the usefulness of the API by implementing real-world examples with varied characteristics (missing values, various data types, challenging sizes, etc.) and comparing the ergonomics with other platforms such as Python and R.
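
To make the difference between the two API pathways in the second task more concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not a proposed design: the names Column, Transaction, map, and realize are hypothetical and do not come from tablecloth or dtype-next, and the standard-library array module stands in for a typed buffer.

```python
# Hypothetical sketch: none of these names come from tablecloth or dtype-next.
from array import array


class Column:
    """A typed, countable, random-access column (eager pathway)."""

    def __init__(self, typecode, values):
        self._data = array(typecode, values)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]

    def map(self, fn):
        # Eager: allocates and fills a whole new column at every step.
        return Column(self._data.typecode, (fn(x) for x in self._data))

    def to_list(self):
        return list(self._data)


class Transaction:
    """Deferred pathway: records steps, then runs them in a single pass."""

    def __init__(self, column):
        self._column = column
        self._steps = []

    def map(self, fn):
        self._steps.append(fn)  # nothing is computed yet
        return self

    def realize(self):
        def run_all(x):
            for step in self._steps:
                x = step(x)
            return x

        # One pass over the data and one allocation for the result.
        return self._column.map(run_all)


prices = Column("d", [1.0, 2.0, 3.0])

# Eager: a new column is materialized after each step.
eager = prices.map(lambda x: x * 2).map(lambda x: x + 1)

# Deferred: both steps are fused and applied when realize() is called.
deferred = (
    Transaction(prices)
    .map(lambda x: x * 2)
    .map(lambda x: x + 1)
    .realize()
)
```

Both pathways yield the same values; the deferred one simply avoids building the intermediate column, which is the trade-off the task describes.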

Open Questions

What should this entity be called? Some options: array, column, buffer, column-vector.
Does it make sense for this API to live within tablecloth, or should it be broken out into its own library?
Does this work need to align with the work that @ribelo and @genmeblog are doing to define a syntax for operations on dataset columns (e.g. #47)?