Make `nrows` a method separate from `schema`
ablaom opened this issue · 5 comments
Recall that ScientificTypes.schema
is essentially the same as Tables.schema
but with scitypes
and nrows
added as fields.
There are two reasons for separating out nrows
:
-
Computing
nrows
for a general table is typically more expensive than computing theschema
. In particular, assuming progress towards #127, the compute times are significant in the extreme case of out-of-memory tables, for example.Tables.schema
does not include anything about the number of rows for this reason. -
With
nrows
gone, we can view aschema
as a table (implement the Tables.jl interface for it) which would be convenient. For example, for a large tableDataFrame(schema(X))
would then work.
As an aside, the current implementation of schema(X).nrows
is poor for row-based tables and @OkonSamuel improved this in MLJBase, where there is already a separate nrows
method (extended from MLJModelInterface). This implementation could be ported to ScientificTypes to replace the current schema(X).nrows
.
Sadly, this is breaking. I would propose adding the nrows
method and deprecating the nrows
field of Schema
. We could already implement Schema
as a table, simply ignoring this field.
Are there any objections or other thoughts regarding this plan?
If you want to introduce nrows
method then probably both nrow
and ncol
should go to DataAPI.jl and then I would recommend using nrow
function name, so that we have consistency with DataFrames.jl (and as a bonus it will be more easy for R users to discover).
We need to add nrow
and ncol
to DataAPI.jl. I will open a PR there to discuss.
- Waiting for JuliaData/DataAPI.jl#40
In case JuliaData/DataAPI.jl#40 takes long, probably you can expose nrow
on your side, and only later make it a method of DataAPI.nrow
(when it is there).