ELToulemonde/dataPreparation

one_hot_encoder in Test

dotRData opened this issue · 5 comments

How do we use one_hot_encoder on test data?
Let's say a new value appears in some column:
it will add an extra column to the test dataset, which is not a problem.

But let's say some values are missing from the test data:
one_hot_encoder will then not create the corresponding column,
and that might cause a problem while scoring.

Hi,

That's a good one.

A quick fix: I would recommend using sameShape, which allows you to control the columns of your test set.

Beyond that, I don't know what the best approach is. Do you have an example of another package that lets you get the same columns in train and test?

Currently I am using this:
testData[, setdiff(names(trainData), names(testData)):=0]

I thought you might have a better way.
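For readers who want the workaround spelled out, here is a minimal base R sketch of the same idea: add every train column that is missing from test as zeros, then keep only the train columns, in train order. The helper name `align_columns` and the toy column names are hypothetical and are not part of dataPreparation; plain data.frames are used here instead of data.table to keep the example self-contained.

```r
# Hypothetical helper (not part of dataPreparation): make `test` have exactly
# the columns listed in `train_cols`, in that order.
align_columns <- function(test, train_cols) {
  # Columns seen in train but absent from test are added and filled with 0.
  missing_cols <- setdiff(train_cols, names(test))
  for (col in missing_cols) {
    test[[col]] <- rep(0, nrow(test))
  }
  # Selecting by train_cols drops test-only columns and fixes the order.
  test[, train_cols, drop = FALSE]
}

# Toy one-hot encoded sets: test lacks "color.blue" and has an extra "color.green".
train <- data.frame(x = c(1, 2), color.blue = c(1, 0), color.red = c(0, 1))
test  <- data.frame(color.green = c(1, 0), color.red = c(0, 1), x = c(3, 4))

aligned <- align_columns(test, names(train))
```

Unlike the one-liner above, this sketch also drops columns that exist only in test, which may or may not be what you want when scoring.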

I guess a future modification would be to make one_hot_encoder work the way fastScale does, for example...

That is, first a buildEncoding function to build the encoding parameters, which one_hot_encoder could then apply to both train and test.
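The two-step idea can be sketched in base R as follows. This is only an illustration of the pattern (learn the levels once on train, then apply them to any data set); the function names `build_encoding_sketch` and `apply_encoding_sketch` are hypothetical and do not reflect the package's actual signatures.

```r
# Hypothetical sketch: record, for each column, the levels seen in the data.
build_encoding_sketch <- function(data, cols) {
  lapply(setNames(cols, cols), function(col) {
    sort(unique(as.character(data[[col]])))
  })
}

# Hypothetical sketch: create one 0/1 column per recorded level, so the output
# columns depend only on the encoding, not on the values present in `data`.
apply_encoding_sketch <- function(data, encoding) {
  for (col in names(encoding)) {
    values <- as.character(data[[col]])
    for (lvl in encoding[[col]]) {
      data[[paste(col, lvl, sep = ".")]] <- as.integer(!is.na(values) & values == lvl)
    }
    data[[col]] <- NULL  # drop the original categorical column
  }
  data
}

train <- data.frame(color = c("red", "blue", "red"), stringsAsFactors = FALSE)
test  <- data.frame(color = c("green", "red"), stringsAsFactors = FALSE)

encoding      <- build_encoding_sketch(train, "color")
train_encoded <- apply_encoding_sketch(train, encoding)
test_encoded  <- apply_encoding_sketch(test, encoding)
```

In this sketch a level that was never seen in train ("green") simply gets zeros in every encoded column, and a level missing from test ("blue") still gets its column, so train and test end up with the same shape.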

This feature should be developed in the next version.

Yes, buildEncoding might also take as input a minimum frequency for the levels present in the features. That way we would have control over the final dimension of the dataset.
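A minimal sketch of what such a filter could look like, assuming the minimum frequency is expressed as a share of rows between 0 and 1 (that interpretation, the helper name `frequent_levels`, and the `min_frequency` argument are assumptions made for illustration):

```r
# Hypothetical helper: keep only the levels whose share of rows is at least
# `min_frequency`. Rare levels would then get no one-hot column at all.
frequent_levels <- function(values, min_frequency = 0) {
  values <- as.character(values)
  share <- table(values) / length(values)  # NA values are not counted as a level
  names(share)[as.vector(share) >= min_frequency]
}

colors <- c("red", "red", "red", "blue", "blue", "green")

kept_all  <- frequent_levels(colors)        # no filtering
kept_freq <- frequent_levels(colors, 0.3)   # "green" (1 row out of 6) is dropped
```

Raising the threshold reduces the number of encoded columns, at the cost of treating rare levels the same as unseen ones.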

Good idea, I added it. It is implemented in branch v0.3.5, which will be merged soon.