one_hot_encoder in Test

Question

dotRData opened this issue 7 years ago · 5 comments

How do we use one_hot_encoder on test data?
Let's say a new value appears in some column of the test set.
That adds an extra column to the test dataset, which is not a problem.

But let's say some values that were in the training data are missing from the test data.
Then one_hot_encoder will not create the corresponding columns,
and that might create a problem while scoring, since the model expects the columns it was trained on.
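
To make the problem concrete, here is a minimal sketch in base R. It does not use the package: naive_one_hot is a hypothetical stand-in that, like the behavior described above, creates one column per value found in the data it is given.

```r
# Hypothetical stand-in for an encoder that only looks at the data it receives:
# one 0/1 column per value observed in x.
naive_one_hot <- function(x, prefix) {
  values <- sort(unique(as.character(x)))
  out <- lapply(values, function(v) as.integer(x == v))
  names(out) <- paste(prefix, values, sep = ".")
  data.frame(out, check.names = FALSE)
}

train_color <- c("red", "green", "blue", "red")
test_color  <- c("red", "purple")  # "green" and "blue" are absent, "purple" is new

train_encoded <- naive_one_hot(train_color, "color")
test_encoded  <- naive_one_hot(test_color, "color")

# Columns the model was trained on but that the test set does not have
missing_in_test <- setdiff(names(train_encoded), names(test_encoded))
# Columns that only exist in the test set
extra_in_test <- setdiff(names(test_encoded), names(train_encoded))
```

Here missing_in_test holds the columns for "blue" and "green", which is the case that can break scoring.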

Answer 1 · 2018-01-12T09:49:14.000Z

Hi,

That's a good one.

A quick fix: I would recommend using sameShape, which lets you control the columns of your test set.

Beyond that, I don't know what the best approach is. Do you have an example of another package that lets you keep the same columns in train and test?

Answer 2 · 2018-01-12T13:08:08.000Z

Currently I am using this:
testData[, setdiff(names(trainData), names(testData)):=0]

I thought you might have a better way.
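
For readers not using data.table, the same idea can be sketched in base R. align_columns is a hypothetical helper, not a function from the package. Unlike the one-liner above, it also drops columns that only exist in the test set and restores the training column order, which is an extra assumption about what the scoring step needs.

```r
# Hypothetical helper: force `test` to have exactly the columns in `train_cols`.
align_columns <- function(test, train_cols, fill = 0) {
  for (col in setdiff(train_cols, names(test))) {
    # Level never seen in the test set: add the column filled with zeros
    test[[col]] <- rep(fill, nrow(test))
  }
  # Keep only the training columns, in the training order
  test[, train_cols, drop = FALSE]
}

train_cols <- c("color.blue", "color.green", "color.red")
test <- data.frame(color.purple = c(0, 1), color.red = c(1, 0))

aligned <- align_columns(test, train_cols)
```

After the call, aligned has the three training columns, and the row whose value was "purple" is all zeros.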

Answer 3 · 2018-01-16T08:17:09.000Z

I guess a future modification would be to make one_hot_encoder work the way fastScale does, for example.

First, a buildEncoding function would build the encoding parameters; one_hot_encoder could then apply them to both train and test.

This feature should be developed in the next version.
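
The two-step idea can be sketched in base R as follows. This is only an illustration of the proposal, not the package's implementation: build_encoding_sketch and apply_encoding_sketch are hypothetical names, and NA handling is left out.

```r
# Step 1 (hypothetical): learn the levels of each column from the training set only.
build_encoding_sketch <- function(data, cols) {
  lapply(data[cols], function(x) sort(unique(as.character(x))))
}

# Step 2 (hypothetical): encode any data set using the stored levels.
apply_encoding_sketch <- function(data, encoding) {
  # Start from the columns that are not encoded
  encoded <- data[, setdiff(names(data), names(encoding)), drop = FALSE]
  for (col in names(encoding)) {
    for (lvl in encoding[[col]]) {
      # Values that match no stored level simply give 0 in every column
      encoded[[paste(col, lvl, sep = ".")]] <-
        as.integer(as.character(data[[col]]) == lvl)
    }
  }
  encoded
}

train <- data.frame(id = 1:4, color = c("red", "green", "blue", "red"))
test  <- data.frame(id = 5:6, color = c("red", "purple"))

encoding      <- build_encoding_sketch(train, "color")
train_encoded <- apply_encoding_sketch(train, encoding)
test_encoded  <- apply_encoding_sketch(test, encoding)
```

Because the levels come from the stored encoding rather than from the data being encoded, train_encoded and test_encoded end up with the same columns in the same order.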

Answer 4 · 2018-01-16T14:37:41.000Z

Yes, buildEncoding might also take as input a minimum frequency for the levels present in the features. That way we would have control over the final dimension of the dataset.
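
A minimal sketch of that filter in base R. frequent_levels is a hypothetical helper, and treating the minimum frequency as a share of rows (rather than a count) is an assumption.

```r
# Hypothetical helper: keep only the levels whose share of rows is at least
# min_frequency (a share between 0 and 1; this interpretation is an assumption).
frequent_levels <- function(x, min_frequency = 0) {
  shares <- table(as.character(x)) / length(x)
  names(shares)[as.vector(shares) >= min_frequency]
}

color <- c("red", "red", "red", "green", "green", "blue")

all_levels    <- frequent_levels(color)                       # no filtering
common_levels <- frequent_levels(color, min_frequency = 0.3)  # drops rare levels
```

With the threshold at 0.3, "blue" (1 row out of 6) is dropped, so the encoded data would have two columns for this feature instead of three.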

Answer 5 · 2018-01-17T17:30:25.000Z

Good idea. I added it. It is implemented in branch v0.3.5 and will be merged soon.