audienceproject/spark-dynamodb

Incomplete schema inference while reading from DynamoDB table

Opened this issue · 3 comments

DynamoDB Table:
[Screenshot of the DynamoDB table, 2021-01-21]

I am reading the above table using the following code:

val df = spark.read
  .format("dynamodb")
  .option("tableName", config.tableName)
  .option("region", config.ddbConfig.region)
  .load()
df.show()

Result:
+----+-------------------+----+
|s_id|         created_on|p_id|
+----+-------------------+----+
| 002|2018-11-20 12:01:19|   2|
| 001|2018-11-19 12:01:19|   1|
| 006|2018-11-20 12:01:19|   6|
| 005|2018-11-19 12:01:20|   5|
| 004|2018-12-19 12:01:19|   4|
| 003|2019-11-19 12:01:19|   3|
+----+-------------------+----+

The "num" column is missing from the resulting DataFrame. Why did this happen? Is there a flag I need to set to ensure complete schema inference?

You can pass your schema explicitly; otherwise the connector infers the schema only from the first page of the DynamoDB scan, so attributes that don't appear in those items (like "num") are dropped.
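A minimal sketch of passing an explicit schema via the standard `.schema()` method on `DataFrameReader`. The column names come from the table above; the types (and `LongType` for "num" in particular) are assumptions, so adjust them to match your actual items:

```scala
import org.apache.spark.sql.types._

// Explicit schema covering all attributes, including "num",
// which schema inference missed. Types here are assumptions.
val tableSchema = StructType(Seq(
  StructField("s_id", StringType, nullable = true),
  StructField("created_on", StringType, nullable = true),
  StructField("p_id", LongType, nullable = true),
  StructField("num", LongType, nullable = true)
))

val df = spark.read
  .format("dynamodb")
  .option("tableName", config.tableName)
  .option("region", config.ddbConfig.region)
  .schema(tableSchema) // explicit schema bypasses first-page inference
  .load()
```

With an explicit schema the connector does not need to sample the table at all, so sparse attributes are always present (as nulls where an item lacks them).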

Thanks! This helps.

This library returned an empty DataFrame when I tried to read a DDB table that has both a hash key and a range key. Is this known behaviour?

@siah210 you should pass the schema with the `.schema()` method, just as you would for a normal DataFrame; that should work.