GateNLP/gateplugin-LearningFramework

Rethink feature scaling

johann-petrak opened this issue · 0 comments

The idea of scaling is to prevent some features from having a bigger influence on the model than others. Our current approaches may not do this properly and may also have other issues:

  • with sparse nominal values, it is easy to end up with features that take only a single value, so both the variance and the range are zero.
  • scaling to zero mean may not be what we want if 0 is the indication of absence.
  • if we look at just the sparse vector dimensions that belong to a single feature, we may want to scale them to the 0..1 range based on all the values from all of those dimensions, especially if we scale to 0..1 based on the original tf-idf values (see the first sketch below).
  • but for that we would need an easier way to do tf-idf transformations as part of the finishing step (see the second sketch below).
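To illustrate the 0..1 idea, here is a minimal sketch (class and method names are hypothetical, not the plugin's actual API): all stored values of one sparse feature are divided by the maximum over all of its dimensions. Because there is no mean subtraction, implicit zeros stay exactly 0 (so 0 keeps meaning "absent"), and the zero-range case from the first bullet is guarded against.

```java
import java.util.Map;

/**
 * Minimal sketch, not the plugin's actual API: scale the stored values of
 * one sparse feature into 0..1 by dividing by the maximum over ALL
 * dimensions that belong to that feature. Assumes non-negative values
 * (e.g. counts or tf-idf weights).
 */
public final class SparseMaxScaler {

  /** Scales the dimension->value map of one feature in place. */
  public static void scaleTo01(Map<Integer, Double> valuesByDimension) {
    double max = 0.0;
    for (double v : valuesByDimension.values()) {
      if (v > max) max = v;
    }
    // Zero-range/zero-variance case from the issue: a feature with a single
    // value of 0 everywhere has nothing to scale; avoid dividing by zero.
    if (max == 0.0) return;
    final double m = max;
    valuesByDimension.replaceAll((dim, v) -> v / m);
  }
}
```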
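And a rough sketch of what a tf-idf transformation as part of the finishing step could look like, assuming document frequencies are accumulated while instances are added and the transform is applied once when the representation is finished (all names here are hypothetical, not existing plugin API):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of a tf-idf finishing step: count document
 * frequencies while building the corpus, then rewrite raw term counts
 * as tf * idf when finishing.
 */
public final class TfIdfFinisher {
  private final Map<Integer, Integer> docFreq = new HashMap<>();
  private int numDocs = 0;

  /** Call once per instance while the corpus representation is being built. */
  public void countDocument(Map<Integer, Double> termCounts) {
    numDocs++;
    for (Integer dim : termCounts.keySet()) {
      docFreq.merge(dim, 1, Integer::sum);
    }
  }

  /** Finishing step: replace raw counts with tf * idf in place
   *  (smoothed idf so the result stays non-negative). */
  public void finishDocument(Map<Integer, Double> termCounts) {
    termCounts.replaceAll((dim, tf) ->
        tf * (Math.log((1.0 + numDocs) / (1.0 + docFreq.getOrDefault(dim, 0))) + 1.0));
  }
}
```

The resulting tf-idf values are non-negative, so they could then be fed through the 0..1 scaling sketch above without breaking the "0 means absent" convention.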