georgian-io/Multimodal-Toolkit

Is there a way to save the preprocessing objects for inference? (OneHotEncoder, Scaler)

Opened this issue · 4 comments

Hi, thank you for developing this package! I want to be able to load an already saved model and then use it for inference, as in production. How can I make the inference dataset go through the same preprocessing steps, e.g. one-hot encoding of categorical variables and scaling?

Hi @kkristacia,
To load the model, you just need to run the same steps as when creating the model. The only difference is that when calling model = AutoModelWithTabular.from_pretrained(...), make sure you set the first argument, pretrained_model_name_or_path, to the path where you saved your model.

Similarly, to preprocess the inference dataset, I would recommend running the load_data_from_folder function with the same parameters you used while training. Use the same training data to reconstruct the encoders, and replace the test data with your inference data. I know this isn't optimal, so we'll definitely change it in a future version.
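The workaround above could be sketched like this: point load_data_from_folder at a folder whose test split is your inference data while the train split is the original training data, so the encoders are refit identically. The folder path, column names, and tokenizer source below are placeholder assumptions:

```python
# Sketch: reuse the training-time preprocessing by calling
# load_data_from_folder with the same arguments as during training.
# The folder contains the original train/val CSVs plus the inference
# data as the test CSV. All names here are placeholders.
from transformers import AutoTokenizer
from multimodal_transformers.data import load_data_from_folder

tokenizer = AutoTokenizer.from_pretrained('./saved_model')
train_dataset, val_dataset, test_dataset = load_data_from_folder(
    './inference_folder',            # train/val = original data, test = inference data
    text_cols=['title', 'review'],   # placeholder: same columns as at training time
    tokenizer=tokenizer,
    label_col='label',
    categorical_cols=['color'],
    numerical_cols=['price'],
)
# test_dataset now holds the inference data, one-hot encoded and scaled
# with encoders fit on the same training data as before.
```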

Please let me know if you run into any other issues, and I'll help you solve them! :)

Hi Akash, thanks for the clarification. Yeah, I was hoping for a way to avoid using the training data during inference. It would definitely be great if future versions had that functionality!

Hi Akash. Just to second this - it would be great if the preprocessing objects were saved for making inferences in production. Loading my whole dataset into my production environment would take up space unnecessarily. Love the toolkit, and looking forward to seeing an update in the future!
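In the meantime, the behavior requested here can be approximated by persisting the fitted scikit-learn preprocessing objects yourself. A minimal sketch, with toy data and a placeholder file name (sklearn's docs also recommend joblib for larger objects):

```python
# Workaround sketch: fit the preprocessing objects once on the training
# data, pickle them, and reload them at inference time -- so production
# never needs the training data itself.
import pickle

import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# --- at training time ---
train_cat = np.array([["red"], ["blue"], ["red"]])   # toy categorical column
train_num = np.array([[1.0], [2.0], [3.0]])          # toy numerical column

encoder = OneHotEncoder(handle_unknown="ignore").fit(train_cat)
scaler = StandardScaler().fit(train_num)

with open("preprocessors.pkl", "wb") as f:          # placeholder file name
    pickle.dump({"encoder": encoder, "scaler": scaler}, f)

# --- at inference time (production) ---
with open("preprocessors.pkl", "rb") as f:
    preprocessors = pickle.load(f)

cat_features = preprocessors["encoder"].transform(np.array([["blue"]])).toarray()
num_features = preprocessors["scaler"].transform(np.array([[2.0]]))
```

The pickled file is small (just the fitted encoder state), so shipping it to production avoids both the training data and any refitting.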

Thanks @dsunart! I'm reopening this issue as a feature request. It should be added in as part of our next release!