audienceproject/spark-dynamodb

Infer schema in Python

rbabu7 opened this issue · 5 comments

I don't see an option to provide a schema in PySpark, while the same option exists in Scala. Please let me know how to provide the equivalent of a Scala case class when reading data using PySpark.

Hello!

Thank you for using the library.

Currently the solution is to specify the schema manually in Python, as documented here:
https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#programmatically-specifying-the-schema

Example:

from pyspark.sql.types import StructType, StructField, StringType

fields = [StructField("someField", StringType(), True),
          StructField("someOtherField", StringType(), True)]
schema = StructType(fields)

# The schema is specified manually via .schema(schema)
dynamoDf = spark.read \
  .option("tableName", "SomeTableName") \
  .schema(schema) \
  .format("dynamodb") \
  .load()
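
As a quick sanity check (a small sketch assuming the snippet above), you can confirm that the manual schema was applied to the DataFrame:

# Prints the schema that was passed in, rather than an inferred one
dynamoDf.printSchema()
# root
#  |-- someField: string (nullable = true)
#  |-- someOtherField: string (nullable = true)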

Does this solve your problem?

Thanks,
Jacob

Thank you Jacob, this works for me. Also, is there a way to specify the GSI when querying DynamoDB, or is the library intelligent enough to figure it out on its own based on the predicate provided?

Hi rbabu7,

To read from a GSI, you can use the following option:

# The "indexName" option specifies the GSI to read from
dynamoDf = spark.read \
  .option("tableName", "SomeTableName") \
  .option("indexName", "YourIndexName") \
  .schema(schema) \
  .format("dynamodb") \
  .load()

Can we close the issue?

Thanks,
Jacob

Unfortunately it is not yet intelligent enough to use a Query instead of a Scan, even for a global secondary index.
See #62
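
To illustrate what this means in practice, here is a minimal sketch (assuming the GSI example above; the column and value names are hypothetical): even a filter on the index's partition key is currently executed as a Scan rather than a Query:

# This predicate still results in a Scan of the index,
# not a DynamoDB Query (see #62).
filteredDf = dynamoDf.filter(dynamoDf["someField"] == "someValue")
filteredDf.show()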

Thank you for your support.