audienceproject/spark-dynamodb

Infer schema in Python

rbabu7 opened this issue · 5 comments

I don't see an option to provide a schema in PySpark, while the same option exists in Scala. Please let me know how to provide the equivalent of a Scala case class when reading data using PySpark.

Hello!

Thank you for using the library.

Currently the solution is to specify the schema manually in Python, as documented here:
https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#programmatically-specifying-the-schema

Example:

from pyspark.sql.types import StructType, StructField, StringType

fields = [StructField("someField", StringType(), True),
          StructField("someOtherField", StringType(), True)]
schema = StructType(fields)

# The schema is specified manually via .schema(schema)
dynamoDf = spark.read \
  .option("tableName", "SomeTableName") \
  .schema(schema) \
  .format("dynamodb") \
  .load()
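
As a quick sanity check (a small sketch assuming the snippet above), you can confirm that the manual schema was applied to the DataFrame:

# Prints the schema that was passed in, rather than an inferred one
dynamoDf.printSchema()
# root
#  |-- someField: string (nullable = true)
#  |-- someOtherField: string (nullable = true)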

Does this solve your problem?

Thanks,
Jacob

Thank you Jacob, this works for me. Also, is there a way to specify the GSI when querying DynamoDB, or is the library intelligent enough to figure it out on its own based on the predicate provided?

Hi rbabu7,

To read from a GSI, you can use the following option:

# The "indexName" option specifies the GSI to read from
dynamoDf = spark.read \
  .option("tableName", "SomeTableName") \
  .option("indexName", "YourIndexName") \
  .schema(schema) \
  .format("dynamodb") \
  .load()

Can we close the issue?

Thanks,
Jacob

Unfortunately it is not yet intelligent enough to use a Query instead of a Scan, even for a global secondary index.
See #62
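
To illustrate what this means in practice, here is a minimal sketch (assuming the GSI example above; the column and value names are hypothetical): even a filter on the index's partition key is currently executed as a Scan rather than a Query:

# This predicate still results in a Scan of the index,
# not a DynamoDB Query (see #62).
filteredDf = dynamoDf.filter(dynamoDf["someField"] == "someValue")
filteredDf.show()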

Thank you for your support.