Empty rows when loading a small amount of data on a large cluster
kubicaj opened this issue
kubicaj commented
Hi,
I have a DynamoDB table which contains only a small amount of data (25 rows).
I use the following spark-dynamodb library:
<dependency>
    <groupId>com.audienceproject</groupId>
    <artifactId>spark-dynamodb_2.11</artifactId>
    <version>1.0.4</version>
</dependency>
I have very simple code which I use for loading the DynamoDB table:
dynamo_db_df = self.spark_session.read.option("tableName", "my-sample-table").format("dynamodb").load()
dynamo_db_df.show()
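As a hypothesis (not confirmed), the difference between the two clusters might come from the connector splitting the scan into as many partitions as the cluster's default parallelism, which on 32 executors would slice a 25-row table very thinly. A workaround sketch using the library's documented `readPartitions` option to pin the scan to a single partition (the table name here is the sample one from above; this requires a live Spark session and DynamoDB table, so it is untested):

```python
# Workaround sketch (hypothesis): force the DynamoDB scan into a single
# partition instead of one partition per unit of default parallelism.
# "readPartitions" is a spark-dynamodb read option; "1" is an assumed
# value for a tiny table, not a verified fix.
dynamo_db_df = (
    self.spark_session.read
    .option("tableName", "my-sample-table")
    .option("readPartitions", "1")
    .format("dynamodb")
    .load()
)
dynamo_db_df.show()
```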
I have 2 types of running environments:
- Small cluster: AWS Glue job where worker_type = Standard (50 GB disk and 2 executors)
- Large cluster: AWS Glue job where worker_type = G.1X, num_workers = 32 (32 executors where each executor has 4 vCPU and 16 GB of memory)
When I run it on the small cluster, the result looks fine:
+---+--------+-------------+-----+---------+--------------------+--------------------+
| Id|OrderNum| Title|Price| ISBN| BookMetadata| Authors|
+---+--------+-------------+-----+---------+--------------------+--------------------+
| 22| 22|Book 22 Title| 644|900917-22|[true, [Editor-22...|[[M, [USA, Author...|
| 18| 18|Book 18 Title| 97|377399-18|[true, [Editor-18...|[[M, [USA, Author...|
| 16| 16|Book 16 Title| 383|224276-16|[true, [Editor-16...|[[F, [USA, Author...|
| 2| 2| Book 2 Title| 73| 371411-2|[true, [Editor-2,...|[[F, [USA, Author...|
| 13| 13|Book 13 Title| 431|911648-13|[true, [Editor-13...|[[F, [USA, Author...|
| 8| 8| Book 8 Title| 521| 770005-8|[true, [Editor-8,...|[[F, [USA, Author...|
| 9| 9| Book 9 Title| 838| 915353-9|[true, [Editor-9,...|[[F, [USA, Author...|
| 1| 1| Book 1 Title| 782| 637081-1|[true, [Editor-1,...|[[M, [USA, Author...|
| 6| 6| Book 6 Title| 604| 33246-6|[true, [Editor-6,...|[[F, [USA, Author...|
| 24| 24|Book 24 Title| 826|370799-24|[true, [Editor-24...|[[M, [USA, Author...|
| 5| 5| Book 5 Title| 726| 503009-5|[true, [Editor-5,...|[[M, [USA, Author...|
| 4| 4| Book 4 Title| 172| 229720-4|[true, [Editor-4,...|[[M, [USA, Author...|
| 23| 23|Book 23 Title| 574|876365-23|[true, [Editor-23...|[[M, [USA, Author...|
| 19| 19|Book 19 Title| 694|574785-19|[true, [Editor-19...|[[M, [USA, Author...|
| 7| 7| Book 7 Title| 418| 732692-7|[true, [Editor-7,...|[[F, [USA, Author...|
| 11| 11|Book 11 Title| 360|582662-11|[true, [Editor-11...|[[M, [USA, Author...|
| 3| 3| Book 3 Title| 401| 722245-3|[true, [Editor-3,...|[[M, [USA, Author...|
| 20| 20|Book 20 Title| 185|464982-20|[true, [Editor-20...|[[F, [USA, Author...|
| 21| 21|Book 21 Title| 271|685657-21|[true, [Editor-21...|[[F, [USA, Author...|
| 25| 25|Book 25 Title| 688|521779-25|[true, [Editor-25...|[[M, [USA, Author...|
+---+--------+-------------+-----+---------+--------------------+--------------------+
only showing top 20 rows
When I run it on the large cluster, the row count is correct but the rows are empty:
++
||
++
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
||
++
only showing top 20 rows
But when I load a table which has several GBs and millions of rows, everything looks fine.
Could you please check this?