delta-io/connectors

A row of nulls is added when loading a Delta Lake table stored in S3 into Power BI Desktop via Delta Sharing Server

ThachNgocTran opened this issue · 3 comments

Reproducibility

An example to generate and save a Delta Table to S3:

# Within Apache Spark's PySpark shell:
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "[access_key]")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "[secret_key]")
sc._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "[a_s3_compatible_storage]")

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Five sample rows; none of them is entirely null.
data2 = [
    ("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1),
]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

df = spark.createDataFrame(data=data2, schema=schema)

# Write as an external Delta table at the S3 path and register it as "test".
df.write.format("delta").option("path", "s3a://testfolder/test").mode("overwrite").saveAsTable("test")

core-site.xml in conf folder of Delta Sharing Server:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>[access_key]</value>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <value>[secret_key]</value>
  </property>

  <property>
    <name>fs.s3a.endpoint</name>
    <value>[a_s3_compatible_storage]</value>
  </property>

</configuration>

delta-sharing-server.yaml in conf folder of Delta Sharing Server:

# The format version of this config file
version: 1
# Config shares/schemas/tables to share
shares:
- name: "deltalake"
  schemas:
  - name: "[some_schema]"
    tables:
    - name: "test"
      location: "s3a://testfolder/test"
# Set the host name that the server will use
host: "spark01"
# Set the port that the server will listen on. Note: using ports below 1024 
# may require a privileged user in some operating systems.
port: 60000
# Set the url prefix for the REST APIs
endpoint: "/delta-sharing"
# Set the timeout of S3 presigned url in seconds
preSignedUrlTimeoutSeconds: 3600
# How many tables to cache in the server
deltaTableCacheSize: 10
# Whether we can accept working with a stale version of the table. This is useful when sharing
# static tables that will never be changed.
stalenessAcceptable: false
# Whether to evaluate user provided `predicateHints`
evaluatePredicateHints: false

# Authorization
authorization:
  bearerToken: [some_token]
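
One way to check whether the extra row comes from the server or from the Power BI connector is to read the same table with a different client. The following is a minimal sketch, not part of the original setup, that builds a Delta Sharing profile and a table URL matching the config above. The helper names `build_profile` and `table_url`, the `http` scheme, and the file name `config.share` are assumptions for illustration; the profile fields and the `<profile>#<share>.<schema>.<table>` URL shape follow the Delta Sharing profile file format.

```python
import json


def build_profile(host, port, prefix, bearer_token, scheme="http"):
    """Build a Delta Sharing profile dict (hypothetical helper).

    The scheme defaults to http as an assumption; use https if the server
    is served over TLS.
    """
    return {
        "shareCredentialsVersion": 1,
        "endpoint": f"{scheme}://{host}:{port}{prefix}",
        "bearerToken": bearer_token,
    }


def table_url(profile_path, share, schema, table):
    """Build the table URL a Delta Sharing client expects (hypothetical helper)."""
    return f"{profile_path}#{share}.{schema}.{table}"


# Values taken from delta-sharing-server.yaml above; placeholders kept as-is.
profile = build_profile("spark01", 60000, "/delta-sharing", "[some_token]")
profile_json = json.dumps(profile, indent=2)  # content to save as config.share
url = table_url("config.share", "deltalake", "[some_schema]", "test")
```

Once `profile_json` has been saved as `config.share`, `url` has the form accepted by `delta_sharing.load_as_pandas` in the `delta-sharing` Python package, which gives a second reading of the table to compare against Power BI.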

In Power BI Desktop, write the Power Query as follows:

let
    Source = DeltaSharing.Contents("[delta_sharing_server_ip]:60000/delta-sharing"),
    deltalake = Source{[Name="deltalake"]}[Data],
    sch = deltalake{[Name="[some_schema]"]}[Data],
    test1 = sch{[Name="test"]}[Data]
in
    test1

Then, the issue:

[Screenshot: the table loaded in Power BI Desktop, showing a row of null values]

Note: this null row appears at random positions, at least when the table is saved multiple times with the same code above. For example, it can show up at index 3.
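
To see whether the server itself reports more than five rows, the raw response of the table query endpoint could be inspected. A minimal sketch, assuming the newline-delimited JSON response has already been saved to a string: in the Delta Sharing protocol, each `file` line may carry a `stats` JSON string containing `numRecords`. The function name and the sample response are hypothetical; the sample is made up for illustration and is not output from this setup.

```python
import json


def record_counts_per_file(ndjson_text):
    """Return numRecords for each `file` line (None when stats are absent)."""
    counts = []
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line).get("file")
        if entry is None:
            continue  # skip the protocol and metaData lines
        stats = entry.get("stats")
        counts.append(json.loads(stats).get("numRecords") if stats else None)
    return counts


# Hypothetical response: a protocol line, a metaData line, and two data files.
SAMPLE_RESPONSE = "\n".join([
    json.dumps({"protocol": {"minReaderVersion": 1}}),
    json.dumps({"metaData": {"id": "hypothetical-table-id"}}),
    json.dumps({"file": {"url": "https://example.invalid/part-0", "id": "a",
                         "partitionValues": {}, "size": 1,
                         "stats": json.dumps({"numRecords": 3})}}),
    json.dumps({"file": {"url": "https://example.invalid/part-1", "id": "b",
                         "partitionValues": {}, "size": 1,
                         "stats": json.dumps({"numRecords": 2})}}),
])

print(record_counts_per_file(SAMPLE_RESPONSE))  # prints [3, 2]
```

If the counts reported for the real table sum to five, the extra row is more likely introduced on the client side; a file reporting zero records, or a total other than five, would point toward the written files or the server instead.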

Misc

The Delta Sharing server is started with:

./bin/delta-sharing-server -J-Xmx2048m -- --config ./conf/delta-sharing-server.yaml

The PySpark shell is started with:

pyspark --master spark://[delta_sharing_server_ip]:7077 \
  --packages io.delta:delta-core_2.12:2.0.0,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901,org.apache.hadoop:hadoop-common:3.3.1 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --num-executors 2 --executor-cores 3 --executor-memory 19G

Environment

  • Apache Spark v3.2.2
  • openjdk 11.0.16.1 2022-08-12 (Temurin)
  • Python 3.10.4
  • Ubuntu 22 LTS
  • Power BI Desktop 2.109.642.0 64-bit (September 2022) for Windows 10
  • Delta Sharing Server v0.5.1

@gbrueckl Sorry to bother you, but I think this issue might interest you. I saw your contribution to this connector (#87).

The Delta Sharing connector for Power BI does not have much to do with my Power BI connector.
Unfortunately, I also do not know who is developing the Delta Sharing connector, but I am sure one of the moderators will jump in here.
@dennyglee @scottsand-db

@ThachNgocTran Could you open an issue in https://github.com/delta-io/delta-sharing instead? We will put the Delta Sharing Power BI code there soon. Closing this, as this is not the right repo for Delta Sharing discussion.