keras-team/tf-keras

Feature request: Add a "pandas field selection layer" to allow saving a specification of what inputs are needed in what order / what the output of the model corresponds to

Opened this issue · 6 comments

System information.

TensorFlow version (you are using): 2.11.0 (though it should not play a role)
Are you willing to contribute it (Yes/No) : No (I am not familiar enough with the Keras internals, it would take me too much time to get familiar with these)

Describe the feature and the current behavior/state.

I regularly perform error correction / post processing of data, where the data are available as a big pandas dataframe, with each potential "entry to process" as a row in the dataframe, and each column in the dataframe as a data field present in each entry. Usually the input dataframe has many more columns than I end up using - effectively, I use only a few of the features in the end. To remember which columns I use, and in which order, I then need to save, in addition to the Keras model, a specification of the ordered list of feature columns I use as input to my model. Doing this by hand is tedious and error prone: I have to make sure the correct spec is kept alongside the correct Keras model dump, etc.

This reminds me a bit of the "problem" faced when normalizing / denormalizing the data input / output. This used to be a pain (the means and stds had to be saved separately and managed by hand), but it is now super easy thanks to the Normalization layers: since these are part of the network, the user does not need to worry about storing, restoring, and applying these coefficients by hand, and does not need to manage additional files that must be kept alongside the Keras model dump (a process that is simple in theory but error prone in practice). For me, these simple Normalization layers are a huge gain, and I would like to leverage the same approach for feature selection / output labeling.

Therefore, my question is the following: could we make the specification of which columns to use from a pandas dataframe, and in what order, part of the Keras model itself, by adding a new "pandas field selection layer"? This would remove the tedious and error prone bookkeeping (saving, restoring, etc.) of this spec, which users currently have to do by hand.

This could also be used "in reverse" to automatically turn the Keras model output into a pandas dataframe with named column(s). This makes it possible to implicitly document the model as a whole (things like "what is it producing and in what units" can be embedded in the network through the name of the output column(s)).

I am not an expert, but an API something like the following could be useful, partially copied from https://keras.io/api/layers/preprocessing_layers/numerical/normalization/ (open to discussions / suggestions of improvements of course :) ):

tf.keras.layers.PandasSelection(
    list_columns, invert=False, **kwargs
)

with arguments:

  • list_columns: the list of pandas columns, like ["column_name_feature_1", "column_name_feature_2", ...]
  • invert: if False, the layer can be used as the input layer to the model: it takes in a pandas dataframe and generates the individual samples with the ordered features corresponding to list_columns. If True, the layer can be used as the output of the model: it transforms the purely numeric output of the Keras model into a pandas dataframe with a column name for each output, as specified by list_columns.

The layer would raise a runtime exception if any of the listed columns cannot be found in the input pandas dataframe. The layer would also have a couple of attributes: for example, .list_columns would return the list_columns list, and a .reverse method would return the "reversed" version of the layer.
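To make the intended semantics of the two directions concrete, here is a minimal sketch in plain pandas / NumPy. It is not a Keras layer, and the function names select_columns and label_outputs are hypothetical; it only illustrates the column ordering, the exception on missing columns, and the relabeling of numeric outputs described above.

```python
import numpy as np
import pandas as pd


def select_columns(frame, list_columns):
    """Forward direction (invert=False): dataframe -> float32 array in spec order."""
    missing = [name for name in list_columns if name not in frame.columns]
    if missing:
        # Corresponds to the runtime exception requested in the proposal.
        raise KeyError(f"columns missing from the input dataframe: {missing}")
    return frame.loc[:, list_columns].to_numpy(dtype=np.float32)


def label_outputs(values, list_columns, index=None):
    """Reverse direction (invert=True): numeric array -> dataframe with named columns."""
    values = np.asarray(values)
    if values.ndim != 2 or values.shape[1] != len(list_columns):
        raise ValueError("expected a 2D array with one column per output name")
    return pd.DataFrame(values, columns=list_columns, index=index)


# The dataframe has more columns than the spec, and in a different order.
frame = pd.DataFrame({"c": [1.0, 2.0], "a": [3.0, 4.0], "b": [5.0, 6.0]})
features = select_columns(frame, ["a", "c"])  # column "a" first, then "c"
labeled = label_outputs(features, ["out_1", "out_2"], index=frame.index)
```

The real proposal would attach this behavior, together with the column list, to the saved model rather than to free functions.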

So my models would now look like this (of course, the first and last of the "real" neural network layers could be something other than fully connected layers):

# defining the inputs / outputs to use
pandas_column_extraction_layer = layers.PandasSelection(["column_name_1", "column_name_2", ...], invert=False)  # this is new!!
pandas_inv_labeling_layer = layers.PandasSelection(["output_name_1", "output_name_2", "output_name_3"], invert=True)  # this is new!!

# preparing my normalization / denormalization
labels_inv_normalization_layer = layers.Normalization(invert=True, input_shape=[len(pandas_inv_labeling_layer.list_columns),], axis=None)
labels_inv_normalization_layer.adapt(pandas_inv_labeling_layer.reverse(pandas_training_data))
#
predictors_normalization_layer = layers.Normalization()
predictors_normalization_layer.adapt(pandas_column_extraction_layer(pandas_training_data))

# the model itself; everything is part of the model spec and is fully saved / loaded with the default keras functions
input_layer = pandas_column_extraction_layer  # this is new!! takes in a pandas with well labeled columns
normalized_input = predictors_normalization_layer(input_layer)
fully_connected_start = keras.layers.Dense(60, activation="relu")(normalized_input)
... [the internals of the network]
fully_connected_end = keras.layers.Dense(60, activation="relu")(previous_internal_layer)
internal_output = keras.layers.Dense(len(pandas_inv_labeling_layer.list_columns))(fully_connected_end)
denormalized_output = labels_inv_normalization_layer(internal_output)
pandas_output = pandas_inv_labeling_layer(denormalized_output)  # this is new!! outputs a pandas with well labeled columns

keras_model = keras.Model(inputs=input_layer, outputs=pandas_output)

Now calling:

pandas_out = keras_model(pandas_in)

would work out of the box: pandas_out would be a pandas dataframe with the same number of rows as pandas_in and the set of columns defined in pandas_inv_labeling_layer.list_columns, and all of this metadata would be saved / restored with the save and load_model API.

Will this change the current api? How?

This will not change any existing API, this will only add an extra layer that can be used if the user wants and provides "automagic" management of metadata and inputs and outputs specs by leveraging pandas datasets labeling.

Who will benefit from this feature?

Potentially, all users who use pandas as an input to their Keras model, and use a given subset / ordering of the pandas columns as an input. These users would no longer need to implement the bookkeeping themselves, and could delegate it to a Keras layer that is part of the model spec, dump, load, etc.

Contributing

  • Do you want to contribute a PR? (yes/no): No (I am not familiar enough with the Keras internals, it would take me quite a bit of time to get familiar with these)

Hi,

Thanks for opening the feature request.

The standard way of handling a pandas dataframe in TensorFlow is to load the dataframe, convert the data to tensors, and then feed these to the Keras model.

This process is explained in detail in the following tutorial: https://www.tensorflow.org/tutorials/load_data/pandas_dataframe

Thanks for pointing to this. I am well aware of this, and I already manage to do this just fine in a way quite similar to what you point to.

What I suggest is that the spec and selection of the columns to use for the predictors, i.e. in this example the 2 lines:

numeric_feature_names = ['age', 'thalach', 'trestbps',  'chol', 'oldpeak']
numeric_features = df[numeric_feature_names]

could be done by a custom Keras layer so that this becomes part of the model and is saved and loaded automatically. A bit like the mean and std in the case of the normalization layer: this is simple to do by hand and this is what I used to do before the Normalization layer was implemented, but it is just so much simpler to have it be part of the model spec rather than needing to save and load a data structure holding the normalization coefficients.

The rationale is that when training many model variants using different sets of features, it becomes heavy / error prone to keep track of this column spec (additional information has to be saved alongside the Keras network dump).

Let me know if this is unclear. As a user that has to test many model variants 'playing around' with using different sets of predictors, having this selection being part of the model would make my life much simpler and less error prone - even though I already manage to do it and save / load this information writing additional code.

@jerabaul29 Thank you for the suggestion!
We usually do not have data type specific layers in Keras.
I believe it is easy to do with a custom layer that saves with the model.
Or you may be able to do it with FeatureSpace.

Thanks for the pointers to resources to look into :) .

So if I understand well, I could get inspiration from https://github.com/keras-team/keras/blob/master/keras/layers/preprocessing/normalization.py + https://keras.io/guides/serialization_and_saving/ to build a custom layer that holds the list of columns to use as its state, and uses this to select data in a pandas / transform NN output into a pandas? And this should work out of the box since a list of strings (describing the columns spec) is a native python type, right?

If this is correct, I think I could actually put together a small snippet of code that does this. Then the question is: do you think this need is common enough / useful to enough users that it is worth sharing a layer of this kind through Keras directly, as you do with the Normalization layer?

On this last point, I understand the fact of not having a data type specific layer. But at the same time, using pandas for preparing the input + keras to work with it is becoming nearly an "industry standard", so wondering if this could be useful to a wide group of users and standardization on how to do this could improve usability :) .

I (naively) tried to start playing around from something along the lines of:

import numpy as np
import tensorflow as tf
import pandas as pd
import keras

class PandasSelectionLayer(keras.layers.Layer):
    def __init__(self, list_columns, **kwargs):
        super().__init__(**kwargs)
        self.list_columns = list_columns

    def call(self, input_data):
        assert isinstance(input_data, pd.DataFrame)
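        # DataFrame does not support [:, cols] indexing directly; use .loc for label-based column selection
        return tf.convert_to_tensor(input_data.loc[:, self.list_columns].to_numpy().astype(np.float32))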

    def get_config(self):
        config = super().get_config()
        config.update(
            {
                "list_columns": self.list_columns,
            }
        )
        return config

list_cols = ["col_1", "col_2"]

pandas_input = PandasSelectionLayer(list_columns=list_cols)
fully_connected = keras.layers.Dense(60, activation="relu")(pandas_input)
output = keras.layers.Dense(1)(fully_connected)

keras_model = keras.Model(inputs=pandas_input, outputs=output)

but "of course" this does not work, as the PandasSelectionLayer is not a valid input layer and building one such object returns the object itself, not a tensor that can be passed to the next layer. Any advice / pointer to how I could build something that actually works? I guess I need to build an InputLayer, not a Layer, but I cannot find examples about how to build a custom InputLayer?

Another option I could use as a short term quick fix is to define a wrapping class instead of a custom layer. The wrapping class could combine the pandas columns spec + the trained model, and take care of saving / loading both at the same time. This requires a bit of extra boilerplate and would be quite a bit less convenient (for example, it could only handle an already trained network; anything more would need a lot of additional logic and methods), so a native layer would be better, but it can be an option.

Something like:

class PandasKeras():
    def __init__(self, trained_keras_network, columns_spec):
        ...

    def load(self):
        ...

    def save(self):
        ...

    def predict(self, pandas_in):
        ...
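A slightly fuller sketch of such a wrapper is given below, under several assumptions that are not part of the outline above: the constructor takes separate input_columns and output_columns lists instead of a single columns_spec, the spec is written to a JSON sidecar file next to the model (the ".columns.json" suffix is arbitrary), and load is a classmethod that receives a model_loader callable (for example keras.models.load_model) so that the wrapper itself only depends on pandas and NumPy. The wrapped object only needs predict and save methods.

```python
import json
from pathlib import Path

import numpy as np
import pandas as pd


class PandasKeras:
    """Bundles a trained model with its input / output column specs (hypothetical sketch)."""

    SPEC_SUFFIX = ".columns.json"  # arbitrary sidecar suffix, an assumption of this sketch

    def __init__(self, trained_keras_network, input_columns, output_columns):
        self.model = trained_keras_network
        self.input_columns = list(input_columns)
        self.output_columns = list(output_columns)

    def predict(self, pandas_in):
        missing = [c for c in self.input_columns if c not in pandas_in.columns]
        if missing:
            raise KeyError(f"columns missing from the input dataframe: {missing}")
        # Select the features in the stored order, whatever the dataframe layout is.
        features = pandas_in.loc[:, self.input_columns].to_numpy(dtype=np.float32)
        raw = np.asarray(self.model.predict(features))
        return pd.DataFrame(raw, columns=self.output_columns, index=pandas_in.index)

    def save(self, model_path):
        # Save the model itself, then the column spec right next to it.
        self.model.save(model_path)
        spec = {
            "input_columns": self.input_columns,
            "output_columns": self.output_columns,
        }
        Path(str(model_path) + self.SPEC_SUFFIX).write_text(json.dumps(spec, indent=2))

    @classmethod
    def load(cls, model_path, model_loader):
        # model_loader is any callable mapping a path to a model object.
        spec = json.loads(Path(str(model_path) + cls.SPEC_SUFFIX).read_text())
        return cls(model_loader(model_path), spec["input_columns"], spec["output_columns"])
```

This keeps the model and the spec together at save / load time, but they are still two files on disk, which is exactly the bookkeeping a native layer would avoid.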