schuderer/mllaunchpad

PickleDataSource/Sink

schuderer opened this issue · 7 comments

A way to save/load arbitrary objects to/from pickle. Note that the file location has to be fixed in the configuration (i.e. not user-controllable) in order not to introduce any vulnerabilities.

Tracking:

  • documented example: 4cd79b4
  • Add type pickled_dataframe to FileDataSource

Would solving this solve #112?

Only for very specific use cases. I personally see additional value in having a data format that mitigates the problem of remembering dtypes and at the same time retains interchangeability with other tools.
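To illustrate the dtype point: a CSV round trip turns every value into a string, so the dtypes have to be remembered and restored by hand, while pickle preserves Python types unchanged. A minimal stdlib-only sketch (the dict here just stands in for tabular data):

```python
import csv
import io
import pickle

row = {"id": 7, "score": 0.93}

# CSV round trip: everything comes back as a string.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=row)
writer.writeheader()
writer.writerow(row)
buf.seek(0)
csv_row = next(csv.DictReader(buf))
print(type(csv_row["id"]))  # <class 'str'> -- dtype information is lost

# Pickle round trip: types survive unchanged.
pkl_row = pickle.loads(pickle.dumps(row))
print(type(pkl_row["id"]))  # <class 'int'>
```

The trade-off is the one mentioned above: pickle keeps dtypes but gives up the interchangeability with other tools that CSV offers.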

You can already pickle your data now using FileDataSource's binary_file type with put_raw() and get_raw().

Example for pickling:

Config:

datasinks:
  my_pickle_datasink:
    type: binary_file
    path: /some/file.pickle
    tags: [train]
    options: {}

Code:

import pickle
...
mypickle = pickle.dumps(myobject)
data_sinks["my_pickle_datasink"].put_raw(mypickle)

Unpickling works analogously, just with datasources:, get_raw(), and pickle.loads().
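The round trip itself is just standard pickle. A self-contained sketch (the data_sinks/data_sources calls are shown as comments because they need a running ML Launchpad context, and the datasource name is assumed to be configured analogously to the datasink above):

```python
import pickle

# Stand-in for whatever you want to persist; any picklable object works.
myobject = {"weights": [0.1, 0.2], "bias": 0.5}

# Writing side -- in model code this would be:
#   data_sinks["my_pickle_datasink"].put_raw(pickle.dumps(myobject))
mypickle = pickle.dumps(myobject)

# Reading side -- in model code this would be:
#   myobject = pickle.loads(data_sources["my_pickle_datasource"].get_raw())
restored = pickle.loads(mypickle)  # simulating the round trip in memory

assert restored == myobject
```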

Relevant documentation:
https://docs.python.org/3.6/library/pickle.html#module-interface
https://mllaunchpad.readthedocs.io/en/latest/mllaunchpad.html#mllaunchpad.resource.FileDataSource

Could you try whether this works for your use case? Considering that this is just one extra command for the user, I'm inclined to just add this example to the documentation and call it done. This way, ML Launchpad stays in charge of the I/O, and the user stays in charge of the data format and contents, be it pickle or otherwise.

Please post back here to let me know.

The above was implemented and works like a charm! Highly preferable over a CSV in some cases. I think it would make a great addition to the documentation. Thanks for the advice!

Great to hear it!

The problem with this workaround is that you have to deal with DataSource-specific code in your model code. That is, if you use this method to persist and load DataFrames, and if one day you decide to use a database instead of a file, you'll have to change your code (so by following this workaround, you give up ML Launchpad's advantage of cleanly separating model-specific code from the environment).

Maybe additionally to documenting this, we should still add a pickled_dataframe type to FileDataSource, where you could just use get_dataframe() as with any other DataSource. That would keep that advantage in place, even when pickling data.
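If such a type were added, the configuration might look roughly like this (a hypothetical sketch only; pickled_dataframe is proposed here, not implemented):

```yaml
datasources:
  my_df_source:
    type: pickled_dataframe   # hypothetical type, not yet available
    path: /some/dataframe.pickle
    tags: [train]
    options: {}
```

Model code would then stay generic, e.g. `my_df = data_sources["my_df_source"].get_dataframe()`, with no pickle-specific code on the user's side.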

Edit September 13 2021: I misinterpreted the idea of "environment-independent" for this use case, see my comment below.

@joosbuijsNL I added some documentation, hope it is clear: 4cd79b4

The problem with this workaround is that you have to deal with DataSource-specific code in your model code. That is, if you use this method to persist and load DataFrames, and if one day you decide to use a database instead of a file, you'll have to change your code (so by following this workaround, you give up ML Launchpad's advantage of cleanly separating model-specific code from the environment).

After giving this more thought, I have to correct my argument -- the pickling is not datasink-specific. The pickling just creates some binary data. And the method put_raw() is supposed to be available for any kind of datasink that accepts binary data (be it files, databases, data lakes, etc.). So, contrary to what I wrote earlier, you would not have to change the Python code at all, which makes the point moot.
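For example, switching from a file to some other binary-capable backend would only touch the config, not the put_raw()/get_raw() calls in model code. A hypothetical illustration (the type name and option below are made up for the sake of the example):

```yaml
datasinks:
  my_pickle_datasink:
    # was: type: binary_file with path: /some/file.pickle
    type: some_binary_store      # hypothetical datasink type
    location: /some/blob/target  # hypothetical option of that type
    tags: [train]
    options: {}
```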

When I wrote the above, my frame of reference was that the user wants to persist a dataframe, so the code has to deal only with dataframes. But what the user actually wants to persist is binary data. This puts pickling in user scope and outside the environment, and makes a special pickling data type unnecessary (and untidy).

So, if you feel you need to pickle, you can use put_raw()/get_raw() as is described above (and now also described in the docs).

For most CSV use cases, you might want to use the new dtypes_path option, which will be part of the next release.
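A rough configuration sketch of how that might look (the exact key name comes from the comment above, but its placement within the datasource config is an assumption; check the release notes when it ships):

```yaml
datasources:
  my_csv_source:
    type: csv
    path: /some/file.csv
    dtypes_path: /some/file.dtypes  # assumed placement of the new option
    tags: [train]
    options: {}
```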

Feel free to reopen and comment if this issue needs further attention.