gbif/pipelines

Large event dataset crashing standalone occurrence interpreter

Closed this issue · 4 comments

This dataset is crashing the standalone occurrence interpreter. It looks like it used to run as a distributed interpretation, but is now being sent to standalone, even though it has a 2.5GB occurrence extension.

https://registry.gbif.org/dataset/a74db578-ac84-4907-ab8f-8de6eaa7df56/ingestion-history

(It would also be useful to have the logging MDC values on these log entries.)

INFO  [07-08 18:59:32,020+0000] [pipelines_balancer-1]    org.gbif.pipelines.tasks.balancer.handler.VerbatimMessageHandler: Getting records number from the file - hdfs://ha-nn/data/ingest/a74db578-ac84-4907-ab8f-8de6eaa7df56/201/archive-to-verbatim.yml
INFO  [07-08 18:59:32,024+0000] [pipelines_balancer-1]    org.gbif.pipelines.tasks.balancer.handler.VerbatimMessageHandler: Records number - 17199, Spark Runner type - STANDALONE
tim@iMac-2 MGYS00003194 % wc -l event.txt 
   17199 event.txt

It's counting the event (core) records when deciding on standalone, when it should be counting the occurrence records.
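The balancer's decision reduces to comparing the record count from the yml file against a threshold. A minimal sketch of that logic, with illustrative names and a hypothetical threshold and occurrence count (not the actual gbif/pipelines API), shows why using the event count misroutes this dataset:

```java
// Hypothetical sketch of the runner-selection logic in the balancer.
// Names, the 100,000 threshold, and the occurrence count are illustrative;
// only the 17,199 event count comes from the log above.
public class RunnerSelectionSketch {

  enum Runner { STANDALONE, DISTRIBUTED }

  /** Picks a Spark runner type from a record count and a standalone threshold. */
  static Runner chooseRunner(long recordCount, long standaloneThreshold) {
    return recordCount <= standaloneThreshold ? Runner.STANDALONE : Runner.DISTRIBUTED;
  }

  public static void main(String[] args) {
    long eventCount = 17_199;         // archiveToErCount read from archive-to-verbatim.yml
    long occurrenceCount = 1_200_000; // hypothetical: rows in the 2.5GB occurrence extension

    // Deciding on the event count sends the job to standalone...
    System.out.println("by events:      " + chooseRunner(eventCount, 100_000));
    // ...while the occurrence count would have routed it to distributed.
    System.out.println("by occurrences: " + chooseRunner(occurrenceCount, 100_000));
  }
}
```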

I looked into it a little:

hdfs dfs -cat hdfs://ha-nn/data/ingest/a74db578-ac84-4907-ab8f-8de6eaa7df56/201/archive-to-verbatim.yml
archiveToErCount: 17199

which I think is being read here

I think archiveToErCount is probably meant to hold the occurrence count after the pivot is performed, not the event count; or perhaps both numbers should be stored.
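Storing both numbers in the metrics file would let the balancer pick the right one. A minimal sketch of what the yml could contain after such a fix; archiveToErCount is the existing key, while archiveToOccCount and the counts passed in main are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch (not the real pipelines code) of emitting both counts into the
// archive-to-verbatim.yml metrics file. Only archiveToErCount exists today;
// archiveToOccCount is a hypothetical key for the post-pivot occurrence count.
public class MetricsSketch {

  /** Renders the metrics map as simple "key: value" yml lines. */
  static String toYaml(long eventCount, long occurrenceCount) {
    Map<String, Long> metrics = new LinkedHashMap<>();
    metrics.put("archiveToErCount", eventCount);       // existing key: event core rows
    metrics.put("archiveToOccCount", occurrenceCount); // hypothetical: occurrence rows after pivot
    StringBuilder sb = new StringBuilder();
    metrics.forEach((k, v) -> sb.append(k).append(": ").append(v).append('\n'));
    return sb.toString();
  }

  public static void main(String[] args) {
    // 17,199 is the event count from the log; the occurrence count is made up.
    System.out.print(toYaml(17_199, 1_200_000));
  }
}
```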

muttcg commented

Fix deployed to PROD