audienceproject/spark-dynamodb

Error when loading a small amount of data on a large cluster

kubicaj opened this issue · 0 comments

Hi,

I have a DynamoDB table that contains only a small amount of data (25 rows).
I use the following spark-dynamodb dependency:

        <dependency>
            <groupId>com.audienceproject</groupId>
            <artifactId>spark-dynamodb_2.11</artifactId>
            <version>1.0.4</version>
        </dependency>

I use the following very simple code to load the table:

dynamo_db_df = self.spark_session.read.option("tableName", "my-sample-table").format("dynamodb").load()
dynamo_db_df.show()
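In case it helps triage: on the large cluster I can try pinning the read parallelism, since the symptom only shows up when many executors scan a tiny table. This is a sketch of a possible workaround using the `readPartitions` and `defaultParallelism` options from the library README; I have not confirmed it changes the behaviour, and `my-sample-table` is just the sample table name from above.

```python
# Sketch: force the connector to scan the tiny table in a single segment,
# instead of splitting it across one segment per executor core.
# Options are taken from the spark-dynamodb README; untested workaround.
dynamo_db_df = (
    self.spark_session.read
    .option("tableName", "my-sample-table")
    .option("readPartitions", "1")       # one DynamoDB scan segment
    .option("defaultParallelism", "1")   # override the cluster's parallelism hint
    .format("dynamodb")
    .load()
)
dynamo_db_df.show()
```

If the columns come back populated with these options set, that would suggest the bug is in how the scan is segmented when the cluster's default parallelism far exceeds the table size.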

I have 2 types of running environments:

  1. Small cluster: AWS Glue job with worker_type = Standard (50 GB disk, 2 executors)
  2. Large cluster: AWS Glue job with worker_type = G.1X, num_workers = 32 (32 executors, each with 4 vCPUs and 16 GB of memory)

When I run it on the small cluster, the result looks fine:

+---+--------+-------------+-----+---------+--------------------+--------------------+
| Id|OrderNum|        Title|Price|     ISBN|        BookMetadata|             Authors|
+---+--------+-------------+-----+---------+--------------------+--------------------+
| 22|      22|Book 22 Title|  644|900917-22|[true, [Editor-22...|[[M, [USA, Author...|
| 18|      18|Book 18 Title|   97|377399-18|[true, [Editor-18...|[[M, [USA, Author...|
| 16|      16|Book 16 Title|  383|224276-16|[true, [Editor-16...|[[F, [USA, Author...|
|  2|       2| Book 2 Title|   73| 371411-2|[true, [Editor-2,...|[[F, [USA, Author...|
| 13|      13|Book 13 Title|  431|911648-13|[true, [Editor-13...|[[F, [USA, Author...|
|  8|       8| Book 8 Title|  521| 770005-8|[true, [Editor-8,...|[[F, [USA, Author...|
|  9|       9| Book 9 Title|  838| 915353-9|[true, [Editor-9,...|[[F, [USA, Author...|
|  1|       1| Book 1 Title|  782| 637081-1|[true, [Editor-1,...|[[M, [USA, Author...|
|  6|       6| Book 6 Title|  604|  33246-6|[true, [Editor-6,...|[[F, [USA, Author...|
| 24|      24|Book 24 Title|  826|370799-24|[true, [Editor-24...|[[M, [USA, Author...|
|  5|       5| Book 5 Title|  726| 503009-5|[true, [Editor-5,...|[[M, [USA, Author...|
|  4|       4| Book 4 Title|  172| 229720-4|[true, [Editor-4,...|[[M, [USA, Author...|
| 23|      23|Book 23 Title|  574|876365-23|[true, [Editor-23...|[[M, [USA, Author...|
| 19|      19|Book 19 Title|  694|574785-19|[true, [Editor-19...|[[M, [USA, Author...|
|  7|       7| Book 7 Title|  418| 732692-7|[true, [Editor-7,...|[[F, [USA, Author...|
| 11|      11|Book 11 Title|  360|582662-11|[true, [Editor-11...|[[M, [USA, Author...|
|  3|       3| Book 3 Title|  401| 722245-3|[true, [Editor-3,...|[[M, [USA, Author...|
| 20|      20|Book 20 Title|  185|464982-20|[true, [Editor-20...|[[F, [USA, Author...|
| 21|      21|Book 21 Title|  271|685657-21|[true, [Editor-21...|[[F, [USA, Author...|
| 25|      25|Book 25 Title|  688|521779-25|[true, [Editor-25...|[[M, [USA, Author...|
+---+--------+-------------+-----+---------+--------------------+--------------------+
only showing top 20 rows

When I run it on the large cluster, the row count is correct but all columns are empty:

++
||
++
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
++
only showing top 20 rows

However, when I load a table that has several GB of data and millions of rows, everything looks fine.

Could you please look into this?