audienceproject/spark-dynamodb

DynamoDB load always uses full scan instead of the specified global secondary index in Python


In Python I'm trying to read a DynamoDB table through a global secondary index, with a provided schema and filters. It's a very large table and a full table scan takes approximately 4 hours, so we created a global secondary index to improve performance. However, we're not sure whether this library supports reading from an index, or whether we're using it incorrectly. Currently we use the following code, which does a full scan. I tried adding the commented-out indexName line to use the index, but that didn't work, and I couldn't find any examples of this.

    dynamo_df = (
        spark.read.schema(table_schema)
        .option("tableName", "table")
        # .option("indexName", "x-y-global-secondary-index")  # tried this; no effect
        .option("region", region)
        .option("throughput", 2500)
        .format("dynamodb")
        .load()
    )

    filtered_df = dynamo_df.filter((dynamo_df.x == x) & (dynamo_df.y > y))
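
For reference, this is roughly the lookup we want, expressed as a direct query against the index with boto3; it's a minimal sketch, assuming x is the index's partition key and y its sort key (placeholder names):

    import boto3
    from boto3.dynamodb.conditions import Key

    # Assumes "x" is the GSI's partition key and "y" its sort key.
    dynamodb = boto3.resource("dynamodb", region_name=region)
    table = dynamodb.Table("table")

    # Query the index directly instead of scanning the whole table.
    response = table.query(
        IndexName="x-y-global-secondary-index",
        KeyConditionExpression=Key("x").eq(x) & Key("y").gt(y),
    )
    items = response["Items"]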

Appreciate the help!