AbsaOSS/cobrix

Multiple paths for source files

MaksymFedorchuk opened this issue · 2 comments

I need to read files from multiple folders, but so far I haven't found an option in Cobrix to do this. Is there a way to read multiple folders without creating multiple RDDs or datasets? If not, please treat this as an enhancement request.

Example:
source_folders = ["example1/folder_1/","example2/folder_2/"]
spark.read.format("cobol").load(source_folders)

Parquet and other popular formats already support multiple source paths.

That's a good idea! We'll check it out and implement it

While looking into this, I noticed that to support multiple paths in .load(...), the data source provider needs to be rewritten in terms of FileFormat instead of RelationProvider, so it might take some time to implement.
If the data files are in subdirectories of the same root folder, you can use "/path/*", and the data source will descend one level into each subfolder.
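To check whether the "/path/*" approach applies to a given list of folders, something like the following might help. It is only a sketch: `shared_root_glob` is a hypothetical helper name, not part of Cobrix, and it assumes POSIX-style paths. Note that the resulting pattern matches every subfolder of the root, not only the ones listed.

```python
import posixpath


def shared_root_glob(folders):
    """Return '<root>/*' if every folder sits directly under the same root.

    Returns None when the folders have different parents (or the list is
    empty), in which case a single glob cannot cover them.
    Hypothetical helper for illustration; not part of Cobrix.
    """
    # Drop trailing slashes so dirname() yields the parent folder.
    parents = {posixpath.dirname(f.rstrip("/")) for f in folders}
    if len(parents) != 1:
        return None
    root = parents.pop()
    return posixpath.join(root, "*") if root else "*"


# Folders under one root can be covered by a single pattern.
print(shared_root_glob(["data/folder_1/", "data/folder_2/"]))

# The folders from the original question have different parents,
# so no single "/path/*" pattern covers them.
print(shared_root_glob(["example1/folder_1/", "example2/folder_2/"]))

# Assumed usage (other Cobrix options omitted):
# pattern = shared_root_glob(folders)
# if pattern is not None:
#     df = spark.read.format("cobol").load(pattern)
```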

If rewriting from RelationProvider to FileFormat turns out to be too hard, we'll add a Cobrix extension option, for instance .option("paths", "/comma,/separated,/paths"), as a workaround for the time being.
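If the option ends up looking like the one proposed above, the caller side could be as simple as joining the folder list. This is a sketch under that assumption: the "paths" option is only a proposal at this point, and `to_paths_option` is a hypothetical helper, not a Cobrix API.

```python
def to_paths_option(folders):
    """Join folders into one comma-separated string for a "paths"-style option.

    Hypothetical helper: it assumes the option splits its value on commas,
    so a folder name that itself contains a comma is rejected.
    """
    cleaned = [f.strip() for f in folders if f.strip()]
    if not cleaned:
        raise ValueError("at least one folder is required")
    for folder in cleaned:
        if "," in folder:
            raise ValueError(f"folder contains a comma: {folder!r}")
    return ",".join(cleaned)


source_folders = ["example1/folder_1/", "example2/folder_2/"]
print(to_paths_option(source_folders))

# Assumed usage once such an option exists (other Cobrix options omitted):
# df = (spark.read.format("cobol")
#       .option("paths", to_paths_option(source_folders))
#       .load())
```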