carat-project/carat-dataset-tools

Export impossible


Hi,

Thank you for sharing your dataset.

I am trying to export the .gz files to CSV.
First, some context: I am running your code inside a Docker container with Spark 2.3, matching the versions in the pom.xml, because it was too difficult to get the right versions of the various dependencies installed on my machine.
Maven builds the jar successfully, but whenever I run this command:

```
spark-submit --class fi.helsinki.cs.nodes.carat.examples.SamplesFromJsonGz target/carat-dataset-tools-1.0.0-jar-with-dependencies.jar --input ../carat-data-top1k-users-2014-to-2018-08-25/top1k-salted-data-to-share-2014-01-01-to-2018-08-25.json-rdd/part-00001.gz --output dummy.csv
```

I am getting the following error:

```
2021-01-21 16:32:02 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-01-21 16:32:02 INFO SparkContext:54 - Running Spark version 2.3.1
2021-01-21 16:32:02 INFO SparkContext:54 - Submitted application: fi.helsinki.cs.nodes.carat.examples.SamplesFromJsonGz$
2021-01-21 16:32:02 INFO SecurityManager:54 - Changing view acls to: root
2021-01-21 16:32:02 INFO SecurityManager:54 - Changing modify acls to: root
2021-01-21 16:32:02 INFO SecurityManager:54 - Changing view acls groups to:
2021-01-21 16:32:02 INFO SecurityManager:54 - Changing modify acls groups to:
2021-01-21 16:32:02 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2021-01-21 16:32:03 INFO Utils:54 - Successfully started service 'sparkDriver' on port 34941.
2021-01-21 16:32:03 INFO SparkEnv:54 - Registering MapOutputTracker
2021-01-21 16:32:03 INFO SparkEnv:54 - Registering BlockManagerMaster
2021-01-21 16:32:03 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2021-01-21 16:32:03 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2021-01-21 16:32:03 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-b9fc9ca9-f3e0-4e28-9744-046afe161999
2021-01-21 16:32:03 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB
2021-01-21 16:32:03 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2021-01-21 16:32:03 INFO log:192 - Logging initialized @3259ms
2021-01-21 16:32:03 INFO Server:346 - jetty-9.3.z-SNAPSHOT
2021-01-21 16:32:03 INFO Server:414 - Started @3402ms
2021-01-21 16:32:03 INFO AbstractConnector:278 - Started ServerConnector@5305c37d{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2021-01-21 16:32:03 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040.
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3e2fc448{/jobs,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1af7f54a{/jobs/json,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6ebd78d1{/jobs/job,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4d157787{/jobs/job/json,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@68ed96ca{/stages,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6d1310f6{/stages/json,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3228d990{/stages/stage,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@255990cc{/stages/stage/json,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@51c929ae{/stages/pool,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3c8bdd5b{/stages/pool/json,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@29d2d081{/storage,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@40e4ea87{/storage/json,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@58783f6c{/storage/rdd,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3a7b503d{/storage/rdd/json,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@512d92b{/environment,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@62c5bbdc{/environment/json,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7bdf6bb7{/executors,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1bc53649{/executors/json,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@88d6f9b{/executors/threadDump,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@47d93e0d{/executors/threadDump/json,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@475b7792{/static,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@24855019{/,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3abd581e{/api,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3fabf088{/jobs/job/kill,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1e392345{/stages/stage/kill,null,AVAILABLE,@spark}
2021-01-21 16:32:03 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://97614b7cad51:4040
2021-01-21 16:32:03 INFO SparkContext:54 - Added JAR file:/mnt/carat-dataset-tools/target/carat-dataset-tools-1.0.0-jar-with-dependencies.jar at spark://97614b7cad51:34941/jars/carat-dataset-tools-1.0.0-jar-with-dependencies.jar with timestamp 1611246723961
2021-01-21 16:32:04 INFO Executor:54 - Starting executor ID driver on host localhost
2021-01-21 16:32:04 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42987.
2021-01-21 16:32:04 INFO NettyBlockTransferService:54 - Server created on 97614b7cad51:42987
2021-01-21 16:32:04 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2021-01-21 16:32:04 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, 97614b7cad51, 42987, None)
2021-01-21 16:32:04 INFO BlockManagerMasterEndpoint:54 - Registering block manager 97614b7cad51:42987 with 366.3 MB RAM, BlockManagerId(driver, 97614b7cad51, 42987, None)
2021-01-21 16:32:04 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, 97614b7cad51, 42987, None)
2021-01-21 16:32:04 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, 97614b7cad51, 42987, None)
2021-01-21 16:32:04 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@f973499{/metrics/json,null,AVAILABLE,@spark}
2021-01-21 16:32:09 INFO SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/mnt/carat-dataset-tools/spark-warehouse/').
2021-01-21 16:32:09 INFO SharedState:54 - Warehouse path is 'file:/mnt/carat-dataset-tools/spark-warehouse/'.
2021-01-21 16:32:09 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@67dc6b48{/SQL,null,AVAILABLE,@spark}
2021-01-21 16:32:09 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@57f2e67{/SQL/json,null,AVAILABLE,@spark}
2021-01-21 16:32:09 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6d3ad37a{/SQL/execution,null,AVAILABLE,@spark}
2021-01-21 16:32:09 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@26f5e45d{/SQL/execution/json,null,AVAILABLE,@spark}
2021-01-21 16:32:09 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@14df5253{/static/sql,null,AVAILABLE,@spark}
2021-01-21 16:32:10 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2021-01-21 16:32:13 INFO FileSourceStrategy:54 - Pruning directories with:
2021-01-21 16:32:13 INFO FileSourceStrategy:54 - Post-Scan Filters:
2021-01-21 16:32:13 INFO FileSourceStrategy:54 - Output Data Schema: struct<uuid: string, time: bigint, batteryLevel: bigint, triggeredBy: string, batteryState: string ... 8 more fields>
2021-01-21 16:32:13 INFO FileSourceScanExec:54 - Pushed Filters:
2021-01-21 16:32:14 INFO CodeGenerator:54 - Code generated in 744.706771 ms
2021-01-21 16:32:14 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 427.1 KB, free 365.9 MB)
2021-01-21 16:32:14 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 37.5 KB, free 365.8 MB)
2021-01-21 16:32:14 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on 97614b7cad51:42987 (size: 37.5 KB, free: 366.3 MB)
2021-01-21 16:32:14 INFO SparkContext:54 - Created broadcast 0 from rdd at Spark2Main.scala:64
2021-01-21 16:32:15 INFO FileSourceScanExec:54 - Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
2021-01-21 16:32:16 INFO CodeGenerator:54 - Code generated in 103.582089 ms
2021-01-21 16:32:17 INFO CodeGenerator:54 - Code generated in 791.195601 ms
2021-01-21 16:32:17 INFO SparkContext:54 - Starting job: show at SamplesFromJsonGz.scala:32
2021-01-21 16:32:17 INFO DAGScheduler:54 - Got job 0 (show at SamplesFromJsonGz.scala:32) with 1 output partitions
2021-01-21 16:32:17 INFO DAGScheduler:54 - Final stage: ResultStage 0 (show at SamplesFromJsonGz.scala:32)
2021-01-21 16:32:17 INFO DAGScheduler:54 - Parents of final stage: List()
2021-01-21 16:32:17 INFO DAGScheduler:54 - Missing parents: List()
2021-01-21 16:32:17 INFO DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[7] at show at SamplesFromJsonGz.scala:32), which has no missing parents
2021-01-21 16:32:17 INFO MemoryStore:54 - Block broadcast_1 stored as values in memory (estimated size 198.5 KB, free 365.7 MB)
2021-01-21 16:32:17 INFO MemoryStore:54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 43.2 KB, free 365.6 MB)
2021-01-21 16:32:17 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in memory on 97614b7cad51:42987 (size: 43.2 KB, free: 366.2 MB)
2021-01-21 16:32:17 INFO SparkContext:54 - Created broadcast 1 from broadcast at DAGScheduler.scala:1039
2021-01-21 16:32:17 INFO DAGScheduler:54 - Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[7] at show at SamplesFromJsonGz.scala:32) (first 15 tasks are for partitions Vector(0))
2021-01-21 16:32:17 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 1 tasks
2021-01-21 16:32:18 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 8393 bytes)
2021-01-21 16:32:18 INFO Executor:54 - Running task 0.0 in stage 0.0 (TID 0)
2021-01-21 16:32:18 INFO Executor:54 - Fetching spark://97614b7cad51:34941/jars/carat-dataset-tools-1.0.0-jar-with-dependencies.jar with timestamp 1611246723961
2021-01-21 16:32:18 INFO TransportClientFactory:267 - Successfully created connection to 97614b7cad51/172.17.0.2:34941 after 66 ms (0 ms spent in bootstraps)
2021-01-21 16:32:18 INFO Utils:54 - Fetching spark://97614b7cad51:34941/jars/carat-dataset-tools-1.0.0-jar-with-dependencies.jar to /tmp/spark-4d125529-6ecc-4412-810b-85cf4e32647e/userFiles-94e85a64-4e7c-48de-9f3c-9b38b63073bf/fetchFileTemp7472528122042604585.tmp
2021-01-21 16:32:18 INFO Executor:54 - Adding file:/tmp/spark-4d125529-6ecc-4412-810b-85cf4e32647e/userFiles-94e85a64-4e7c-48de-9f3c-9b38b63073bf/carat-dataset-tools-1.0.0-jar-with-dependencies.jar to class loader
2021-01-21 16:32:19 INFO CodeGenerator:54 - Code generated in 194.519738 ms
2021-01-21 16:32:19 INFO FileScanRDD:54 - Reading File path: file:///mnt/carat-data-top1k-users-2014-to-2018-08-25/top1k-salted-data-to-share-2014-01-01-to-2018-08-25.json-rdd/part-00001.gz, range: 0-3699801, partition values: [empty row]
2021-01-21 16:32:19 INFO CodeGenerator:54 - Code generated in 72.062843 ms
2021-01-21 16:32:19 INFO CodecPool:184 - Got brand-new decompressor [.gz]
2021-01-21 16:32:19 ERROR Executor:91 - Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException: Null value appeared in non-nullable field:

- field (class: "scala.Long", name: "time")
- root class: "fi.helsinki.cs.nodes.carat.sample.json.JsonSampleAppExtras"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply_5_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2021-01-21 16:32:19 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "time")
- root class: "fi.helsinki.cs.nodes.carat.sample.json.JsonSampleAppExtras"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply_5_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

2021-01-21 16:32:19 ERROR TaskSetManager:70 - Task 0 in stage 0.0 failed 1 times; aborting job
2021-01-21 16:32:19 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2021-01-21 16:32:19 INFO TaskSchedulerImpl:54 - Cancelling stage 0
2021-01-21 16:32:19 INFO DAGScheduler:54 - ResultStage 0 (show at SamplesFromJsonGz.scala:32) failed in 1.712 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NullPointerException: Null value appeared in non-nullable field:

- field (class: "scala.Long", name: "time")
- root class: "fi.helsinki.cs.nodes.carat.sample.json.JsonSampleAppExtras"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply_5_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
2021-01-21 16:32:19 INFO DAGScheduler:54 - Job 0 failed: show at SamplesFromJsonGz.scala:32, took 1.838794 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.NullPointerException: Null value appeared in non-nullable field:

- field (class: "scala.Long", name: "time")
- root class: "fi.helsinki.cs.nodes.carat.sample.json.JsonSampleAppExtras"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply_5_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3273)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
at org.apache.spark.sql.Dataset.show(Dataset.scala:723)
at org.apache.spark.sql.Dataset.show(Dataset.scala:682)
at org.apache.spark.sql.Dataset.show(Dataset.scala:691)
at fi.helsinki.cs.nodes.carat.examples.SamplesFromJsonGz$.sparkMain(SamplesFromJsonGz.scala:32)
at fi.helsinki.cs.nodes.util.Spark2Main$class.optMain(Spark2Main.scala:39)
at fi.helsinki.cs.nodes.carat.examples.SamplesFromJsonGz$.optMain(SamplesFromJsonGz.scala:15)
at fi.helsinki.cs.nodes.util.OptMain$class.main(OptMain.scala:46)
at fi.helsinki.cs.nodes.carat.examples.SamplesFromJsonGz$.main(SamplesFromJsonGz.scala:15)
at fi.helsinki.cs.nodes.carat.examples.SamplesFromJsonGz.main(SamplesFromJsonGz.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.NullPointerException: Null value appeared in non-nullable field:

- field (class: "scala.Long", name: "time")
- root class: "fi.helsinki.cs.nodes.carat.sample.json.JsonSampleAppExtras"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply_5_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown Source)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2021-01-21 16:32:19 INFO SparkContext:54 - Invoking stop() from shutdown hook
2021-01-21 16:32:19 INFO AbstractConnector:318 - Stopped Spark@5305c37d{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2021-01-21 16:32:19 INFO SparkUI:54 - Stopped Spark web UI at http://97614b7cad51:4040
2021-01-21 16:32:19 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2021-01-21 16:32:19 INFO MemoryStore:54 - MemoryStore cleared
2021-01-21 16:32:19 INFO BlockManager:54 - BlockManager stopped
2021-01-21 16:32:19 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2021-01-21 16:32:19 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2021-01-21 16:32:19 INFO SparkContext:54 - Successfully stopped SparkContext
2021-01-21 16:32:19 INFO ShutdownHookManager:54 - Shutdown hook called
2021-01-21 16:32:19 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-4d125529-6ecc-4412-810b-85cf4e32647e
2021-01-21 16:32:19 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-0c6fff3e-18ab-4094-a321-fcc10bf7eb13
```

I tried wrapping the attributes in Option[], but then the Maven build failed (roughly the change sketched below).
Did I miss something when exporting the data? Should I gunzip all the files first?
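For reference, the change I attempted looked roughly like the sketch below; it is not the actual class from the repository, and the field list is abridged to the columns visible in the logged schema:

```scala
// Sketch only: a minimal stand-in for JsonSampleAppExtras, not the real class.
// scala.Long cannot hold null, so a JSON record with a missing "time" value
// makes Spark's deserializer throw; Option[Long] makes the field nullable.
case class JsonSampleAppExtrasSketch(
  uuid: String,
  time: Option[Long], // was `time: Long`
  batteryLevel: Option[Long],
  triggeredBy: String,
  batteryState: String
)
```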

Thank you for your time.

Hi, I tried to answer this earlier, but sadly my phone's email client did not handle GitHub replies properly.
I think the issue is that Spark takes a folder as input, not a single file, so your command should be:

```
spark-submit --class fi.helsinki.cs.nodes.carat.examples.SamplesFromJsonGz target/carat-dataset-tools-1.0.0-jar-with-dependencies.jar --input ../carat-data-top1k-users-2014-to-2018-08-25/top1k-salted-data-to-share-2014-01-01-to-2018-08-25.json-rdd --output dummy.csv
```
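For background, Spark's file readers accept a directory and read every part file inside it, decompressing .gz transparently based on the file extension. A generic illustration (not the exact code in SamplesFromJsonGz):

```scala
import org.apache.spark.sql.SparkSession

// Generic illustration: pointing spark.read at the .json-rdd directory
// picks up all part-*.gz files inside it; gzip is handled transparently.
val spark = SparkSession.builder().appName("read-json-rdd").getOrCreate()
val samples = spark.read.json(
  "../carat-data-top1k-users-2014-to-2018-08-25/top1k-salted-data-to-share-2014-01-01-to-2018-08-25.json-rdd")
samples.show(5)
```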

Hi, thank you for your reply.
I am getting the same issue as before with the updated command line.

```
Caused by: java.lang.NullPointerException: Null value appeared in non-nullable field:
- field (class: "scala.Long", name: "time")
- root class: "fi.helsinki.cs.nodes.carat.sample.json.JsonSampleAppExtras"
```

Could the issue be my Spark version?

```
root@97614b7cad51:/mnt/carat-dataset-tools# spark-shell --version
Welcome to Spark version 2.3.1
Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_171
```

Regards, Alex.

Your versions look fine. I will have to try to reproduce this. I have actually not used this old carat-dataset-tools repository with the newly shared dataset; I have primarily used Python and our internal tools to work with it.

Thank you, I will keep the issue open until then.
I would be happy to have access to the Python code if possible (it's the language I am using for my analysis anyway).
Regards, Alex.