Make `nrows` a method separate from `schema`

Question

Make `nrows` a method separate from `schema`

ablaom opened this issue 3 years ago · 5 comments

Recall that ScientificTypes.schema is essentially the same as Tables.schema but with scitypes and nrows added as fields.

There are two reasons for separating out nrows:

Computing nrows for a general table is typically more expensive than computing the schema. In particular, assuming progress towards #127, the compute times are significant in the extreme case of out-of-memory tables, for example. Tables.schema does not include anything about the number of rows for this reason.
With nrows gone, we can view a schema as a table (implement the Tables.jl interface for it) which would be convenient. For example, for a large table DataFrame(schema(X)) would then work.

As an aside, the current implementation of schema(X).nrows is poor for row-based tables and @OkonSamuel improved this in MLJBase, where there is already a separate nrows method (extended from MLJModelInterface). This implementation could be ported to ScientificTypes to replace the current schema(X).nrows.

Sadly, this is breaking. I would propose adding the nrows method and deprecating the nrows field of Schema. We could already implement Schema as a table, simply ignoring this field.

Are there any objections or other thoughts regarding this plan?

@OkonSamuel @quinnj @juliohm @bkamins

Answer 1 · 2021-07-21T22:54:20.000Z

If you want to introduce nrows method then probably both nrow and ncol should go to DataAPI.jl and then I would recommend using nrow function name, so that we have consistency with DataFrames.jl (and as a bonus it will be more easy for R users to discover).

Answer 2 · 2021-07-22T21:50:57.000Z

@bkamins That's great feedback. Will overload DataAPI.jl nrow.

Answer 3 · 2021-07-22T22:11:58.000Z

We need to add nrow and ncol to DataAPI.jl. I will open a PR there to discuss.

Answer 4 · 2021-07-26T00:33:58.000Z

Waiting for JuliaData/DataAPI.jl#40

Answer 5 · 2021-07-26T07:15:41.000Z

In case JuliaData/DataAPI.jl#40 takes long, probably you can expose nrow on your side, and only later make it a method of DataAPI.nrow (when it is there).