JuliaAI/MLJ.jl

Throw error when tables are presented with new column orders?

ablaom opened this issue · 1 comments

Over at MLJFlux, @tiemvanderdeure has pointed out the following issue, which is actually generic to MLJ.

As the example below shows, a user presenting a table for training a model cannot present new data for prediction with a different ordering of the table columns:

using MLJ, MLJFlux
using Random: bitrand  # `bitrand` comes from the Random standard library

N = 1000
X = (x1 = rand(Float32, N), x2 = randn(Float32, N), x3 = categorical(rand('a':'c', N)))
y = categorical(bitrand(N))

model = MLJFlux.NeuralNetworkBinaryClassifier(epochs = 10, builder=MLJFlux.MLP(; hidden=(5,4)), batch_size = 100)
mach = machine(model, X, y)
fit!(mach)

# this errors (the categorical column is now in the first position)
predict(mach, (x3 = X.x3, x1 = X.x1, x2 = X.x2))

# this is false! (x1 and x2 are swapped, so predictions silently change)
all(predict(mach, (x2 = X.x2, x1 = X.x1, x3 = X.x3)) .≈ predict(mach, X))

Here is my response from the original post:

Mmm. I think this kind of implicit assumption (that the columns of tables are ordered, and that they are presented in a consistent order) is everywhere in MLJ, and probably elsewhere. [Transferring this issue to MLJ].

One could either try to allow tables to be presented in any column order, or throw a warning when the original order is violated. Personally, I think the latter would be sufficient. If MLJ had a generic data front end for dealing with tables, apart from Tables.matrix, which drops the feature names, this could be an easy fix either way. But a lot of interfaces just don't save the feature names.
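To make the two options concrete, here is a minimal sketch in plain Julia, assuming the column names seen at training time have been saved somewhere. The helper `align_columns` and its `strict` keyword are hypothetical and not part of MLJ or Tables.jl; the sketch only handles column tables given as named tuples of vectors.

```julia
# Hypothetical helper (not an MLJ or Tables.jl function): compare the columns
# of a named-tuple table against the names recorded at training time.
function align_columns(X::NamedTuple, trained_names::Tuple{Vararg{Symbol}};
                       strict::Bool = false)
    given = keys(X)
    given == trained_names && return X  # same names in the same order
    if Set(given) != Set(trained_names)
        throw(ArgumentError("expected columns $trained_names but got $given"))
    end
    if strict
        # Option 1: refuse tables whose column order has changed.
        throw(ArgumentError("column order $given differs from training order $trained_names"))
    end
    # Option 2: warn, then reorder the columns by name.
    @warn "Column order differs from training order; reordering by name." given trained_names
    return NamedTuple{trained_names}(X)
end

X = (x1 = [1.0, 2.0], x2 = [3.0, 4.0], x3 = ['a', 'b'])
trained = keys(X)                              # names saved at "training" time
Xnew = (x3 = X.x3, x1 = X.x1, x2 = X.x2)       # same data, new column order
aligned = align_columns(Xnew, trained)
```

With `strict = true` the same call throws an `ArgumentError` instead of reordering. Either variant depends on the interface having kept the feature names, which is exactly the part many interfaces currently skip.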

I'd support some kind of resolution, but it's a big ask to adapt across the ecosystem.

Other users have also opened issues about this problem (e.g. #1023, and I think there are more).

As a user (and as a contributor as well), the fact that the input into an MLJ machine is a Tables.jl-compatible table made me assume that machines would treat it as tabular data, i.e. use column names. It personally caught me off guard that they don't, and I doubt that I'm the only one.

What makes this more confusing is that some MLJ models do use column names, e.g. those in MLJGLMInterface.jl.

> I'd support some kind of resolution, but it's a big ask to adapt across the ecosystem.

I see the point: there are a lot of models out there, and requiring all of them to use column keys is not going to work.

Maybe there could be an extra model trait in MMI (MLJModelInterface) indicating whether or not a model uses column keys, so that an example like the one above can be part of the test suite for those models.
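As a rough sketch of how such a trait could drive a test, the plain-Julia example below uses two toy models. Everything in it is hypothetical: `uses_column_names` is not an existing MMI trait, and `PositionalModel`, `NamedModel`, `toy_predict` and `is_order_invariant` are made up for illustration.

```julia
# Hypothetical trait: by default, assume a model treats columns positionally.
uses_column_names(::Type) = false

# Two toy "models", for illustration only.
struct PositionalModel end   # reads columns by position
struct NamedModel end        # reads columns by name
uses_column_names(::Type{NamedModel}) = true

# Toy prediction: a fixed weighted sum of two feature columns.
toy_predict(::PositionalModel, X::NamedTuple) = 1 .* X[1] .+ 10 .* X[2]
toy_predict(::NamedModel, X::NamedTuple) = 1 .* X.x1 .+ 10 .* X.x2

# Check that predictions are unchanged when the column order is reversed.
function is_order_invariant(predictfn, X::NamedTuple)
    permuted = NamedTuple{reverse(keys(X))}(X)
    return predictfn(permuted) == predictfn(X)
end

X = (x1 = [1.0, 2.0], x2 = [3.0, 4.0])

# A test suite could run the invariance check only for models declaring the trait.
for M in (PositionalModel, NamedModel)
    if uses_column_names(M)
        @assert is_order_invariant(Xin -> toy_predict(M(), Xin), X)
    end
end
```

Here the positional toy model fails the invariance check while the named one passes, which is the distinction the trait would record. Models that do not declare the trait are simply skipped, so existing interfaces would not need to change.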

Otherwise there is always FeatureSelector in MLJModels, which is great.