Add support for partitioning by folders in Hive/Spark style
Partitioning is useful for splitting the output into folders and files while writing (and reading).
In the Excel case it can be useful for separating data by day, by customer, or in other ways, or for creating smaller spreadsheets that load faster.
The current behavior of this datasource when the DataFrameWriter.partitionBy() method is used is to generate a single file and overwrite it for each partition.
For example, when a dataframe is partitioned by year, month, and day, Spark writes two different layouts: one when using 'csv' (or parquet, avro, etc.) and another when using the Excel output from this datasource.
Using 'csv':
csv_output_dir/_SUCCESS
csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
Using 'org.zuinnote.spark.office.excel':
xls_output_dir/part-m-00000.xls
xls_output_dir/_SUCCESS
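For reference, a minimal Scala sketch (directory and column names are illustrative) that produces the CSV layout above with the built-in datasource:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-demo").getOrCreate()
import spark.implicits._

val df = Seq(
  ("a", 2019, 7, 10),
  ("b", 2019, 7, 11)
).toDF("value", "year", "month", "day")

// One folder level per partition column, Hive-style: year=2019/month=7/day=10/
df.write
  .partitionBy("year", "month", "day")
  .format("csv")
  .save("csv_output_dir")
```

The expectation is that this datasource produces the same folder layout, with the .xls files inside the leaf folders instead of a single part-m-00000.xls at the top level.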
May I ask in which mode you use it? Simple mode with simple data types, or in CellMode?
Thanks for the quick response.
Until today we haven't exported directly to Excel, because we were missing a component for the 'last mile'.
The workflow today consists of exporting to CSV files and post-processing them with a specific script that converts to Excel and applies formatting. But this has become very complex and fragile, and a burden to create and maintain.
In the cases where we do export, we simply write using the field names, types, and order from the Spark dataframe.
Just exporting partitioned files with headers and values will cut a lot of effort from workflow development.
As the Excel data is for human consumption, we foresee that users will request formatting and some data handling, and the template functionality present in this datasource seems very interesting for tackling that need as well. But this is a future problem; first we need to get the basics working well.
Sorry for the misunderstanding.
Just simple data types: string, int, bigint, datetime (java).
Thanks, again.
No problem. Just out of curiosity, do you use the following?

```scala
df.toDF.write.partitionBy("year", "month", "day")
  .format("org.zuinnote.spark.office.excel")
  .option("write.locale.bcp47", "us")
  .save("/home/user/office/output")
```
I want to adapt the integration tests to the scenario that you describe. The reason is that we already basically use the internal Spark APIs, so few or no modifications should be needed.
Sorry for the delay.
Yes, you are right! I'm basically using it this way:
```java
Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);
myDF.select("value1", "value2", "year", "month", "day")
    .write().format("org.zuinnote.spark.office.excel")
    .option("hadoopoffice.write.mimeType", "application/vnd.ms-excel")
    .option("hadoopoffice.write.locale.bcp47", "en")
    .option("hadoopoffice.write.header.write", "true")
    .partitionBy("year", "month", "day")
    .save("hdfs://user/spark/warehouse/output");
```
Just some additional details that perhaps could be useful:
- When using "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", the values for non-string columns were empty or 0. I couldn't figure out what happened (see the sketch below).
- I used the "hadoopoffice.write." prefix because the plain "write." prefix was not working, but I need to investigate further.
- I am missing an option for omitting the columns "year", "month", "day" from the final output. Currently all Spark datasources require the partition columns to be present in the dataframe when writing, and there is no way to avoid them appearing in the output.
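For reference, a Scala sketch of the xlsx variant that showed the empty/0 values, reusing the myDF from the snippet above (the output path is illustrative):

```scala
myDF.select("value1", "value2", "year", "month", "day")
  .write.format("org.zuinnote.spark.office.excel")
  // xlsx instead of xls; non-string columns came out empty or 0 here
  .option("hadoopoffice.write.mimeType",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
  .option("hadoopoffice.write.locale.bcp47", "en")
  .option("hadoopoffice.write.header.write", "true")
  .partitionBy("year", "month", "day")
  .save("hdfs://user/spark/warehouse/output_xlsx")
```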
Thanks for the support!
Great! 🥇
Some quick updates:
- Even if it already works, I will add partitioning to the documentation and as an integration test case.
- About "when using application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": I will try to reproduce it tomorrow. It is odd, because the integration test cases and unit test cases do not expose this behaviour.
- About "hadoopoffice.write": I will investigate, but there was a plan to deprecate the options without the hadoopoffice prefix anyway, to make them consistent across all platforms.
- About "missing a new option for omitting cols": this will be difficult, because that logic is handled at the Spark level. Spark always requires the partition columns as folders AND in the data itself (see the sketch below). One could of course define a template where those columns are simply made invisible, or something like that.
Thanks for the update.
- Great to see your support in developing this feature.
- I'll check again whether it works with xls and not with xlsx.
- I see no problem in using options prefixed with hadoopoffice.*. Perhaps a little update to the README will help users of the library.
- Yes, that's a pity, and I agree that an option for omitting columns would be better addressed in Spark itself. (I wonder if there is a backlog where a feature request for Spark can be registered.)
If you need some help testing, please contact me.
Thanks again for your helpful support.
I created an improvement issue in the Spark project for adding an opt-in option for omitting columns when saving with Spark:
https://issues.apache.org/jira/browse/SPARK-28505
Could you check whether you still have issues with xlsx? I cannot reproduce this; it is covered by the integration tests and they seem to work fine: https://github.com/ZuInnoTe/spark-hadoopoffice-ds/blob/master/src/it/scala/org/zuinnote/spark/office/excel/SparkScalaExcelDSSparkMasterIntegrationSpec.scala
I just pushed the support for partitioning including an integration test verifying that it works.
I also noted that the partitioned columns are not written by Spark into the file itself but only exist at the folder level (as you requested above). I don't know why you observed this for the CSV files.
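A quick way to check this (paths illustrative): the data files under each leaf folder contain only the non-partition columns, and on read Spark reconstructs the partition columns from the folder names:

```scala
// Partition columns live only in folder names such as year=2019/month=7/day=10;
// Spark re-derives them when the root path is read back.
val back = spark.read.format("csv").load("csv_output_dir")
back.printSchema() // schema includes year, month, and day again, inferred from the paths
```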
I can publish it on Maven Central as version 1.3.2 next week. You can then test it, or, if you do not want to wait for the official Maven Central version, just clone this Git repository and run publishToMavenLocal, then include 1.3.2 in your application.
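With sbt, the dependency would then look roughly like this (the coordinates are an assumption based on the project's package naming; check the README for the exact values):

```scala
// build.sbt sketch — group/artifact ids assumed, verify against the README
libraryDependencies += "com.github.zuinnote" %% "spark-hadoopoffice-ds" % "1.3.2"
```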
For the other issues that you mention, can you please create dedicated issues for them, so we don't mix them all in this issue?
Thanks a lot.
I just published 1.3.2. Can you please test whether it meets your needs with respect to partitioning? If it is successful, I propose to close this issue and to create new issues for the remaining points that you mention.
Sorry for being late.
I will test tomorrow and I will report any issues.
Good to know that the partitioned columns are not included. I think that happened only with the previous xls datasource library; CSV does not show this behaviour either.
I will also try to isolate a test case for the xlsx problem, if I can reproduce it.
Thanks for all the effort.
I tested version 1.3.2 and partitioning worked perfectly in a couple of scenarios: xls, xlsx, and compression.
Congratulations on the hard work!
I also created a test case reproducing the 'hadoopoffice.write.mimeType' XLSX problem. I will open a new issue and attach the test case.