one_hot_encoder in Test
dotRData opened this issue · 5 comments
How do we use one_hot_encoder on test data?
Let's say a new value appears in some column of the test data: the encoder adds an extra column to the test dataset, which is not a problem.
But let's say some values seen in training are missing from the test data: one_hot_encoder then produces no column for them, and that might create a problem while scoring.
Hi,
That's a good one.
A quick fix: I would recommend using sameShape, which allows you to control the columns of your test set.
Beyond that, I don't know what the best approach is. Do you have an example of another package that lets you keep the same columns in train and test?
Currently I am using this:
testData[, setdiff(names(trainData), names(testData)):=0]
I thought you might have some better way.
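For reference, the one-liner above only adds columns that are missing from the test set. A minimal sketch of a fuller alignment, assuming data.table is used as above: it also drops columns that exist only in test and restores the training column order. The helper name align_columns is hypothetical and is not part of the package.

```r
library(data.table)

# Hypothetical helper (not part of any package): give `test` exactly the
# columns of `train`, in the same order.
align_columns <- function(test, train, fill = 0) {
  test <- copy(test)  # work on a copy so the caller's table is not modified by reference
  missing_cols <- setdiff(names(train), names(test))
  extra_cols <- setdiff(names(test), names(train))
  # Columns seen in train but absent from test are added and filled with `fill`.
  if (length(missing_cols) > 0) test[, (missing_cols) := fill]
  # Columns that only exist in test (new levels) are dropped.
  if (length(extra_cols) > 0) test[, (extra_cols) := NULL]
  setcolorder(test, names(train))
  test
}

trainData <- data.table(x = c(1, 2), color.blue = c(1L, 0L), color.red = c(0L, 1L))
testData <- data.table(x = c(3, 4), color.green = c(1L, 0L), color.red = c(0L, 1L))
testAligned <- align_columns(testData, trainData)
```

Dropping the test-only columns is a design choice: a model fitted on trainData has no coefficient for them anyway.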
I guess a future modification would be to make one_hot_encoder work the way fastScale does, for example: first a buildEncoding function builds the encoding parameters, which one_hot_encoder can then apply to both train and test.
This feature should be developed in the next version.
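The build-then-apply idea can be sketched roughly as follows. This is only an illustration of the pattern, not the package's implementation: build_levels and apply_levels are hypothetical names, and the real buildEncoding may store different parameters.

```r
library(data.table)

# Hypothetical: record, for each categorical column, the levels seen in training.
build_levels <- function(dt, cols) {
  lapply(setNames(cols, cols), function(col) sort(unique(as.character(dt[[col]]))))
}

# Hypothetical: create one 0/1 column per stored level, whatever `dt` contains.
apply_levels <- function(dt, encoding) {
  out <- copy(dt)  # do not modify the input by reference
  for (col in names(encoding)) {
    values <- as.character(out[[col]])
    for (lvl in encoding[[col]]) {
      # Unseen levels and NA both give 0 for every stored level.
      set(out, j = paste(col, lvl, sep = "."), value = as.integer(values %in% lvl))
    }
    set(out, j = col, value = NULL)  # remove the original categorical column
  }
  out
}

train <- data.table(id = 1:3, color = c("red", "blue", "red"))
test <- data.table(id = 4:5, color = c("green", "red"))

encoding <- build_levels(train, "color")      # built once, on train only
trainEncoded <- apply_levels(train, encoding)
testEncoded <- apply_levels(test, encoding)   # same columns as trainEncoded
```

Because the levels are fixed at build time, a new level in test ("green" here) produces no extra column, and a level missing from test still gets an all-zero column.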
Yes, buildEncoding might also take as input a minimum frequency for the levels present in the features. That way we would have control over the final dimension of the dataset.
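A rough sketch of what such a filter could look like, assuming the minimum frequency is an absolute count (it could equally be a proportion). The function frequent_levels and its min_frequency argument are hypothetical names for illustration.

```r
# Hypothetical: keep only the levels that occur at least `min_frequency` times.
frequent_levels <- function(values, min_frequency = 1) {
  counts <- table(as.character(values))
  sort(names(counts)[as.vector(counts) >= min_frequency])
}

city <- c("Paris", "Paris", "Lyon", "Paris", "Nice", "Lyon")
kept <- frequent_levels(city, min_frequency = 2)
# kept is c("Lyon", "Paris"); "Nice" occurs once and would get no column
```

Only the kept levels would then be turned into columns, so rare levels no longer inflate the width of the encoded dataset.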
Good idea. I added it. It is implemented in branch v0.3.5, which will be merged soon.