schuderer/mllaunchpad

Make file data sources work with patterns

schuderer opened this issue · 0 comments

Note: I don't mean glob patterns here, but go ahead and read on.

Instead of configuring a file path like:

./bla/some/foo_23.csv

One should be able to configure a file path using a format-pattern like:

./bla/{whatever}/foo_{myid}.csv

This could work like the following:

  1. First, if params are provided as a dict parameter in get_dataframe(), we would replace the matching {}-patterns, similar to how it currently works with SQL, as a way to deal with files in a directory more dynamically. This would also be useful for the file data sink.
  2. For {}s for which no param is provided, just match * where the {} stands and append every resulting {}-string as a column to the end of the returned data frame (think: unification).
  3. Appending it as a column would also solve the question of what to do if a path with un-parametrized {}s matches more than one file: we would simply concatenate the dataframes, and with the {}-determined column the user code would be able to differentiate between files/params if needed. Maybe the concatenation should be made optional via a parameter, either in get_dataframe() or in the datasource config.
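
To make the three steps more concrete, here is a rough sketch of the path-resolution part only. It is not existing mllaunchpad code: the function name `resolve_pattern` and its return shape are made up for illustration, and it assumes that every {} is a named field, that the pattern uses forward slashes, and that a placeholder never spans a directory separator.

```python
import glob
import os
import re
import string


def resolve_pattern(pattern, params=None):
    """Hypothetical helper: return [(path, captured)] for files matching pattern.

    Fields found in `params` are substituted (step 1). The remaining fields
    are matched like `*` and their matched text is captured by name (step 2).
    """
    params = params or {}
    glob_parts, regex_parts, seen = [], [], set()
    for literal, field, _spec, _conv in string.Formatter().parse(pattern):
        glob_parts.append(glob.escape(literal))
        regex_parts.append(re.escape(literal))
        if field is None:
            continue  # trailing literal text, no placeholder after it
        if not field.isidentifier():
            raise ValueError(f"placeholder must be a plain name: {field!r}")
        if field in params:
            value = str(params[field])
            glob_parts.append(glob.escape(value))
            regex_parts.append(re.escape(value))
        elif field in seen:
            # Same name used twice: both places must hold the same text.
            glob_parts.append("*")
            regex_parts.append(f"(?P={field})")
        else:
            seen.add(field)
            glob_parts.append("*")
            regex_parts.append(f"(?P<{field}>[^/]+)")
    regex = re.compile("".join(regex_parts))
    results = []
    for path in sorted(glob.glob("".join(glob_parts))):
        # Normalize separators so the regex also works on Windows.
        match = regex.fullmatch(path.replace(os.sep, "/"))
        if match:
            results.append((path, match.groupdict()))
    return results
```

For step 3, each returned path could then be read into its own dataframe, the captured values assigned as extra columns (e.g. `whatever` and `myid`), and the frames concatenated, with a flag deciding whether multiple matches are allowed at all.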