scicloj/tablecloth

Build an API for array processing on top of dtype-next

Goal

Currently, tablecloth provides an easy-to-use wrapper over tech.ml.dataset’s high-performance dataset processing constructs. One part of the tech.ml stack that tablecloth does not yet cover directly is dtype-next, which provides a high-performance foundation for array-like numerical processing, similar to NumPy. The project I am proposing would wrap dtype-next within tablecloth, giving the emerging Clojure data processing ecosystem a new, easy-to-use API for numerical structures.

Rough Outline of Steps

During this project, I will focus on the following tasks:

  • Add a new function to tablecloth (perhaps named column or array) that will return a typed, countable, random-access data structure backed by dtype-next’s abstractions;
  • Design two API pathways to interact with this structure: one that realizes the data fully at each step, providing more straightforward but less efficient interaction; and another, more performant but slightly harder to use, that allows users to wrap processing steps in a "transaction";
  • Mimic the NumPy (and possibly R vector) APIs, ensuring an equivalently complete functional interface for numerical processing;
  • Ensure support for a reader-friendly format when printing columns in the Clojure REPL (see techascent/tech.ml.dataset#203);
  • Validate the usefulness of the API by implementing real-world examples with various characteristics (missing values, various data types, challenging sizes, etc.) and comparing the ergonomics with other platforms such as Python and R.
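
To make the two pathways in the second task more concrete, here is a minimal, language-agnostic sketch written in Python with only the standard library. It is an illustration of the idea, not a proposal for the actual tablecloth or dtype-next API: the names `EagerColumn`, `Transaction`, `map`, and `commit` are all hypothetical, and the `allocations` counter exists only to show how many buffers each pathway creates.

```python
from array import array
from typing import Callable, Iterable, Iterator, List


class EagerColumn:
    """Hypothetical typed, countable, random-access column.

    Every operation realizes its result into a fresh buffer.
    """

    allocations = 0  # illustration only: counts buffers created so far

    def __init__(self, values: Iterable[float]) -> None:
        self._data = array("d", values)  # "d" = 64-bit float storage
        EagerColumn.allocations += 1

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, i: int) -> float:
        return self._data[i]

    def __iter__(self) -> Iterator[float]:
        return iter(self._data)

    def map(self, f: Callable[[float], float]) -> "EagerColumn":
        # Eager pathway: the result is fully realized at each step.
        return EagerColumn(f(x) for x in self._data)

    def to_list(self) -> List[float]:
        return list(self._data)


class Transaction:
    """Hypothetical "transaction": records elementwise steps, runs them once."""

    def __init__(self, source: EagerColumn) -> None:
        self._source = source
        self._steps: List[Callable[[float], float]] = []

    def map(self, f: Callable[[float], float]) -> "Transaction":
        self._steps.append(f)  # nothing is computed yet
        return self

    def commit(self) -> EagerColumn:
        def run(x: float) -> float:
            for step in self._steps:
                x = step(x)
            return x

        # Single pass over the source, single output buffer.
        return EagerColumn(run(x) for x in self._source)


col = EagerColumn([1.0, 2.0, 3.0])

before = EagerColumn.allocations
eager = col.map(lambda x: x * 2).map(lambda x: x + 1)
eager_allocs = EagerColumn.allocations - before  # one buffer per step

before = EagerColumn.allocations
fused = Transaction(col).map(lambda x: x * 2).map(lambda x: x + 1).commit()
fused_allocs = EagerColumn.allocations - before  # one buffer in total

print(eager.to_list(), eager_allocs)
print(fused.to_list(), fused_allocs)
```

Both pathways produce the same values. The eager one is simpler to reason about at the REPL because every intermediate result can be inspected, while the transactional one avoids intermediate buffers at the cost of an explicit realization step.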

Open Questions

  • What will the name of this entity be? Some options could be: array, column, buffer, column-vector.
  • Does it make sense for this API to live within tablecloth, or might we want to break it out into its own library?
  • Are there ways that this work needs to align with the work that @ribelo and @genmeblog are doing to define a syntax for operations on dataset columns (e.g. #47)?