delta-io/connectors

Table cells are all NULL if the Delta Lake table was earlier saved with the option "delta.columnMapping.mode" set to "name"

ThachNgocTran opened this issue · 8 comments

Reproducibility

If the Delta Lake table is saved using the following Python code (note the delta.columnMapping.mode option):

final_df.write.format("delta")\
    .option("path", f"hdfs://some_ip:9000/data-warehouse/ABC")\
    .option("delta.columnMapping.mode", "name")\
    .mode("overwrite")\
    .saveAsTable("ABC")

Later, in Power BI, the table is fetched using the Connector as follows:

let
    Source = fn_ReadDeltaTable(Hdfs.Files("some_ip:50070/data-warehouse/ABC"), [UseFileBuffer=true])
in
    Source

Even though the columns are recognized and displayed correctly, every single table cell is null.

Comment

As far as I know, the purpose of this setting is to disassociate Delta Lake table columns from the physical column names in the underlying Parquet files, which allows renaming or dropping columns and using column names that contain special characters. See the Column mapping documentation on Databricks.

Unfortunately, it looks like the Connector doesn't support this newer Delta Lake option.
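To illustrate why the cells come back null: with column mapping enabled, the schema stored in the table's _delta_log records a physical name per column (typically of the form col-&lt;uuid&gt;), and the Parquet files use only those physical names. A reader that looks up columns by their logical names therefore finds no matching data. Below is a minimal sketch for inspecting this mapping, assuming a locally readable copy of the table at a hypothetical path:

import glob
import json
import os

# Hypothetical local path to the Delta table (adjust to your environment).
table_path = "/data-warehouse/ABC"

# Scan the transaction log; each line of each JSON file is one action.
for log_file in sorted(glob.glob(os.path.join(table_path, "_delta_log", "*.json"))):
    with open(log_file) as f:
        for line in f:
            action = json.loads(line)
            if "metaData" in action:
                schema = json.loads(action["metaData"]["schemaString"])
                for field in schema["fields"]:
                    meta = field.get("metadata", {})
                    # Logical column name vs. the physical name actually
                    # stored in the Parquet files.
                    print(field["name"], "->",
                          meta.get("delta.columnMapping.physicalName"))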

Environment

  • Apache Spark v3.2.2
  • Hadoop/HDFS v3.3.4
  • openjdk 11.0.16.1 2022-08-12 (Temurin)
  • Python 3.10.4
  • Ubuntu 22 LTS
  • Power BI Desktop 2.109.642.0 64-bit (September 2022) for Windows 10

[Screenshot: Power BI when the issue occurs (Screenshot_2022-09-15_001)]

The Power BI connector doesn't support column mapping. It looks like it's also missing a protocol version check that would throw a better error message. cc @gbrueckl
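For illustration, such a check could read the protocol action from the table's transaction log: column mapping raises the table's minReaderVersion to 2, so a reader that only implements version 1 can refuse the table up front instead of silently returning nulls. A minimal sketch, written in Python for brevity rather than the connector's Power Query M, reusing the hypothetical local table path from above:

import glob
import json
import os

table_path = "/data-warehouse/ABC"   # hypothetical local path
SUPPORTED_READER_VERSION = 1         # what a reader without column mapping support handles

for log_file in sorted(glob.glob(os.path.join(table_path, "_delta_log", "*.json"))):
    with open(log_file) as f:
        for line in f:
            action = json.loads(line)
            if "protocol" in action:
                min_reader = action["protocol"]["minReaderVersion"]
                if min_reader > SUPPORTED_READER_VERSION:
                    raise ValueError(
                        f"Table requires Delta reader version {min_reader}, "
                        f"but only version {SUPPORTED_READER_VERSION} is supported."
                    )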

I will have a look, but I am quite sure this is currently not supported.
The options I see are either implementing column mapping properly or throwing a better error, as suggested.
Will keep you updated here!

@ThachNgocTran can you please try the version from here:
https://github.com/gbrueckl/connectors/blob/master/powerbi/fn_ReadDeltaTable.pq

It worked quite well in my personal tests, but it would be great if you could also validate it against your Delta tables.

There is one known limitation: it does not work properly when your table contains struct data types (but I guess this is a very rare edge case for tables used for reporting).

I had the same issue. I tested the proposed version, and it worked for me. Thanks a lot!

@gbrueckl Sorry for getting back to you this late. I have tested the new function fn_ReadDeltaTable(), and it works well when using .option("delta.columnMapping.mode", "name"): I no longer see null everywhere.

Before this fix, my workaround was to not use the option and to remove special characters ( ,;{}()\n\t=) from the Spark DataFrame's column names beforehand.
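For anyone stuck on the old version, a minimal sketch of that workaround (the helper function and its replacement rule are illustrative, not from the original code; final_df is the DataFrame from the repro snippet above):

import re
from pyspark.sql import DataFrame

def sanitize_column_names(df: DataFrame) -> DataFrame:
    # Replace the characters that Parquet rejects in column names
    # ( ,;{}()\n\t= ) with underscores.
    for old_name in df.columns:
        new_name = re.sub(r"[ ,;{}()\n\t=]", "_", old_name)
        if new_name != old_name:
            df = df.withColumnRenamed(old_name, new_name)
    return df

final_df = sanitize_column_names(final_df)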

Great work!!! 💯

Thanks for the testing and the positive feedback, @ThachNgocTran and @dominikpeter!
I just created PR #448 with the fix.

This repo has been deprecated, and the code has moved to the connectors module in the https://github.com/delta-io/delta repository. Please create new issues in https://github.com/delta-io/delta. See #556 for details.