redpony/creg

Better support for feature engineering

Closed · 1 comment

For feature engineering, it would be nice not to have to train on a different feature file for each combination of features. One solution would be to allow multiple feature files to be loaded for the same training instances (the instance IDs should prevent any ambiguity); this has the advantage that features can be extracted in parallel. Another would be a command-line option that takes a regex matching the features to ablate.
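To make the regex-ablation idea concrete, here is a minimal sketch in Python. It is not creg code: the function name `ablate_features` and the in-memory layout (a dict from instance ID to a feature-value dict) are assumptions made purely for illustration.

```python
import re


def ablate_features(instances, pattern):
    """Return a copy of `instances` without features whose name matches `pattern`.

    `instances` maps instance ID -> {feature name: value}. This layout is a
    hypothetical stand-in for whatever the loader produces.
    """
    rx = re.compile(pattern)
    return {
        inst_id: {name: val for name, val in feats.items() if not rx.search(name)}
        for inst_id, feats in instances.items()
    }


instances = {
    "doc1": {"len": 12.0, "bigram_the_cat": 1.0, "bigram_cat_sat": 1.0},
    "doc2": {"len": 7.0, "bigram_a_dog": 1.0},
}

# Ablate every feature whose name starts with "bigram_".
ablated = ablate_features(instances, r"^bigram_")
```

With something like this, one feature file could serve every ablation experiment; only the regex changes between runs.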

I think processing multiple -[t]x arguments would be pretty simple: ReadLabeledInstances() would be called once for each of these, and the resulting feature-value maps unioned for corresponding instance IDs. Conflicting values for the same feature should trigger an error.

Empty files should be ignored (this is so I can pass -x /dev/null for disabled feature sets).

For now it should be OK to enforce the constraint that non-empty files contain all instances in the same order.
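The merge rules above (union per instance, error on conflicting values, skip empty files, require the same instance order) can be sketched roughly as follows. This is Python for illustration only, not the actual creg implementation; `merge_feature_files` is a hypothetical name, and each "file" is assumed to be already parsed into a list of (instance ID, feature dict) pairs.

```python
def merge_feature_files(files):
    """Union per-instance feature maps from several already-parsed files.

    `files` is a list of files; each file is a list of
    (instance_id, {feature name: value}) pairs (an assumed layout).
    """
    # Empty files (e.g. /dev/null for a disabled feature set) are ignored.
    non_empty = [f for f in files if f]
    if not non_empty:
        return []

    ids = [inst_id for inst_id, _ in non_empty[0]]
    merged = [(inst_id, dict(feats)) for inst_id, feats in non_empty[0]]

    for f in non_empty[1:]:
        # Enforce: every non-empty file lists all instances in the same order.
        if [inst_id for inst_id, _ in f] != ids:
            raise ValueError("feature files must list the same instances in the same order")
        for (inst_id, target), (_, feats) in zip(merged, f):
            for name, val in feats.items():
                # The same feature with a different value is an error.
                if name in target and target[name] != val:
                    raise ValueError(
                        f"conflicting values for feature {name!r} on instance {inst_id!r}"
                    )
                target[name] = val
    return merged


lexical = [("doc1", {"len": 12.0}), ("doc2", {"len": 7.0})]
syntactic = [("doc1", {"depth": 3.0}), ("doc2", {"depth": 5.0})]
disabled = []  # stands in for an empty file such as /dev/null

merged = merge_feature_files([lexical, syntactic, disabled])
```

Checking the ID sequence up front keeps the merge a simple lockstep walk over the files; relaxing the same-order constraint later would mean keying on instance ID instead.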

An alternative would be to allow a single feature file containing multiple nonconsecutive repetitions of the same instance which are then unioned together. Then the user could cat together multiple feature files as the input.
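A rough sketch of this single-file alternative, again in Python and again not creg code: `union_repeated_instances` is a hypothetical name, rows are assumed to be already parsed into (instance ID, feature dict) pairs, and the conflict check is carried over from the multi-file proposal as an assumption.

```python
def union_repeated_instances(rows):
    """Union the feature maps of rows that share an instance ID.

    `rows` is an iterable of (instance_id, {feature name: value}) pairs in
    file order; repetitions of an ID need not be adjacent. Instances are
    returned in order of first appearance.
    """
    merged = {}
    for inst_id, feats in rows:
        target = merged.setdefault(inst_id, {})
        for name, val in feats.items():
            # Assumed behavior: the same feature with a different value is an error.
            if name in target and target[name] != val:
                raise ValueError(
                    f"conflicting values for feature {name!r} on instance {inst_id!r}"
                )
            target[name] = val
    return merged


# Equivalent to cat-ing two feature files over the same instances.
rows = [
    ("doc1", {"len": 12.0}),
    ("doc2", {"len": 7.0}),
    ("doc1", {"depth": 3.0}),
    ("doc2", {"depth": 5.0}),
]

unioned = union_repeated_instances(rows)
```

Because rows are keyed by instance ID, this variant needs no same-order constraint, at the cost of holding every instance in memory until the input ends.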