Add support for partitioning by folders in Hive/Spark style
Partitioning is useful for splitting the output into folders and files while writing (and reading).
In the Excel case it can be useful for separating data by day, by customer, or in other ways, or for creating smaller spreadsheets that load faster.
The current behavior of this datasource when the DataFrameWriter.partitionBy() method is used is to generate a single file and overwrite it for each partition.
For example, when a dataframe is partitioned by year, month, and day, Spark writes two different layouts: one when using 'csv' (or parquet, avro, etc.) and another when using the Excel output from this datasource.
Using 'csv':
csv_output_dir/_SUCCESS
csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
Using 'org.zuinnote.spark.office.excel':
xls_output_dir/part-m-00000.xls
xls_output_dir/_SUCCESS
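For reference, a minimal Scala sketch (directory and column names are illustrative) that produces the CSV layout above with the built-in datasource:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-demo").getOrCreate()
import spark.implicits._

val df = Seq(
  ("a", 2019, 7, 10),
  ("b", 2019, 7, 11)
).toDF("value", "year", "month", "day")

// One folder level per partition column, Hive-style: year=2019/month=7/day=10/
df.write
  .partitionBy("year", "month", "day")
  .format("csv")
  .save("csv_output_dir")
```

The expectation is that this datasource produces the same folder layout, with the .xls files inside the leaf folders instead of a single part-m-00000.xls at the top level.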
May I ask in which mode you use it? Simple mode with simple data types, or in CellMode?
Thanks for the quick response.
Until today we haven't exported directly to Excel, because we were missing a component for the 'last mile'.
The workflow today consists of exporting to CSV files and post-processing them with a specific script that converts to Excel and applies formatting. But this has become very complex and fragile, and a burden to create and maintain.
In the cases where we do export, we simply write using the field names, types, and order from the Spark dataframe.
Just exporting partitioned files with headers and values will cut a lot of effort from workflow development.
As the Excel data is for human consumption, we foresee that users will request formatting and some data handling, and the template functionality present in this datasource seems very interesting for tackling that need as well. But this is a future problem; first we need to get the basics working well.
Sorry for the misunderstanding.
Just simple data types: string, int, bigint, datetime (java).
Thanks, again.
No problem. Just out of curiosity, do you use the following?

```scala
df.toDF.write.partitionBy("year", "month", "day")
  .format("org.zuinnote.spark.office.excel")
  .option("write.locale.bcp47", "us")
  .save("/home/user/office/output")
```
I want to adapt the integration tests to the scenario that you describe. The reason is that we already basically use the internal Spark APIs, so few or no modifications should be needed.
Sorry for the delay.
Yes, you are right! I'm basically using it this way:
```java
Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);
myDF.select("value1", "value2", "year", "month", "day")
    .write().format("org.zuinnote.spark.office.excel")
    .option("hadoopoffice.write.mimeType", "application/vnd.ms-excel")
    .option("hadoopoffice.write.locale.bcp47", "en")
    .option("hadoopoffice.write.header.write", "true")
    .partitionBy("year", "month", "day")
    .save("hdfs://user/spark/warehouse/output");
```
Just some additional details that perhaps could be useful:
- When using "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", the values for non-string columns were empty or 0. I couldn't figure out what happened (see the sketch below).
- I used the "hadoopoffice.write." prefix because the plain "write." prefix was not working, but I need to investigate further.
- I am missing an option for omitting the columns "year", "month", "day" from the final output. Currently all Spark datasources require the partition columns to be present in the dataframe when writing, and there is no way to avoid them appearing in the output.
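For reference, a Scala sketch of the xlsx variant that showed the empty/0 values, reusing the myDF from the snippet above (the output path is illustrative):

```scala
myDF.select("value1", "value2", "year", "month", "day")
  .write.format("org.zuinnote.spark.office.excel")
  // xlsx instead of xls; non-string columns came out empty or 0 here
  .option("hadoopoffice.write.mimeType",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
  .option("hadoopoffice.write.locale.bcp47", "en")
  .option("hadoopoffice.write.header.write", "true")
  .partitionBy("year", "month", "day")
  .save("hdfs://user/spark/warehouse/output_xlsx")
```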
Thanks for the support!
Great! 🥇
Some quick updates:
- Even if it already works, I will add partitioning to the documentation and as an integration test case.
- About "when using application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": I will try to reproduce it tomorrow. It is odd, because the integration test cases and unit test cases do not expose this behaviour.
- About "hadoopoffice.write": I will investigate, but there was a plan to deprecate the options without the hadoopoffice prefix anyway, to make them consistent across all platforms.
- About "missing a new option for omitting cols": this will be difficult, because that logic is handled at the Spark level. Spark always requires the partition columns as folders AND in the data itself (see the sketch below). One could of course define a template where those columns are simply made invisible, or something like that.
Thanks for the update.
- Great to see your support in developing this feature.
- I'll check again whether it works with xls and not with xlsx.
- I see no problem in using options prefixed with hadoopoffice.*. Perhaps a little update to the README will help users of the library.
- Yes, that's a pity, and I agree that an option for omitting columns would be better addressed in Spark itself. (I wonder if there is a backlog where a feature request for Spark can be registered.)
If you need some help testing, please contact me.
Thanks again for your helpful support.
I created an improvement issue in the Spark project for adding an opt-in option for omitting columns when saving with Spark:
https://issues.apache.org/jira/browse/SPARK-28505
Could you check whether you still have issues with xlsx? I cannot reproduce this; it is covered by the integration tests and they seem to work fine: https://github.com/ZuInnoTe/spark-hadoopoffice-ds/blob/master/src/it/scala/org/zuinnote/spark/office/excel/SparkScalaExcelDSSparkMasterIntegrationSpec.scala
I just pushed the support for partitioning including an integration test verifying that it works.
I also noted that the partitioned columns are not written by Spark into the file itself but only exist at the folder level (as you requested above). I don't know why you observed this for the CSV files.
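A quick way to check this (paths illustrative): the data files under each leaf folder contain only the non-partition columns, and on read Spark reconstructs the partition columns from the folder names:

```scala
// Partition columns live only in folder names such as year=2019/month=7/day=10;
// Spark re-derives them when the root path is read back.
val back = spark.read.format("csv").load("csv_output_dir")
back.printSchema() // schema includes year, month, and day again, inferred from the paths
```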
I can publish it on Maven Central as version 1.3.2 next week. You can then test it, or, if you do not want to wait for the official Maven Central version, just clone this Git repository and run publishToMavenLocal, then include 1.3.2 in your application.
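With sbt, the dependency would then look roughly like this (the coordinates are an assumption based on the project's package naming; check the README for the exact values):

```scala
// build.sbt sketch — group/artifact ids assumed, verify against the README
libraryDependencies += "com.github.zuinnote" %% "spark-hadoopoffice-ds" % "1.3.2"
```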
For the other issues that you mention, can you please create dedicated issues for them, so we don't mix them all in this issue?
Thanks a lot.
I just published 1.3.2. Can you please test whether it meets your needs with respect to partitioning? If it is successful, I propose to close this issue and to create new issues for the remaining points that you mention.
Sorry for being late.
I will test tomorrow and I will report any issues.
Good to know that the partitioned columns are not included. I think that happened only with the previous xls datasource library; CSV does not show this behaviour either.
I will also try to isolate a test case for the xlsx problem, if I can reproduce it.
Thanks for all the effort.
I tested version 1.3.2 and partitioning worked perfectly in a couple of scenarios: xls, xlsx, and compression.
Congratulations on the hard work!
I also created a test case reproducing the 'hadoopoffice.write.mimeType' XLSX problem. I will open a new issue and attach the test case.