JuliaAI/ScientificTypes.jl

Make `nrows` a method separate from `schema`

ablaom opened this issue · 5 comments

Recall that ScientificTypes.schema is essentially the same as Tables.schema but with scitypes and nrows added as fields.

There are two reasons for separating out nrows:

  1. Computing nrows for a general table is typically more expensive than computing the schema. In particular, assuming progress towards #127, the compute times are significant in the extreme case of out-of-memory tables, for example. Tables.schema does not include anything about the number of rows for this reason.

  2. With nrows gone, we can view a schema as a table (implement the Tables.jl interface for it) which would be convenient. For example, for a large table DataFrame(schema(X)) would then work.

As an aside, the current implementation of schema(X).nrows is poor for row-based tables and @OkonSamuel improved this in MLJBase, where there is already a separate nrows method (extended from MLJModelInterface). This implementation could be ported to ScientificTypes to replace the current schema(X).nrows.

Sadly, this is breaking. I would propose adding the nrows method and deprecating the nrows field of Schema. We could already implement Schema as a table, simply ignoring this field.

Are there any objections or other thoughts regarding this plan?

@OkonSamuel @quinnj @juliohm @bkamins

If you want to introduce nrows method then probably both nrow and ncol should go to DataAPI.jl and then I would recommend using nrow function name, so that we have consistency with DataFrames.jl (and as a bonus it will be more easy for R users to discover).

@bkamins That's great feedback. Will overload DataAPI.jl nrow.

We need to add nrow and ncol to DataAPI.jl. I will open a PR there to discuss.

In case JuliaData/DataAPI.jl#40 takes long, probably you can expose nrow on your side, and only later make it a method of DataAPI.nrow (when it is there).