gbif/pipelines

Large event dataset crashing standalone occurrence interpreter

Closed this issue · 4 comments

This dataset is crashing the standalone occurrence interpreter. It looks like it used to run as a distributed interpretation, but is now being sent to standalone, even though it has a 2.5GB occurrence extension.

https://registry.gbif.org/dataset/a74db578-ac84-4907-ab8f-8de6eaa7df56/ingestion-history

(It would also be useful to have the logging MDC values on these log entries.)

INFO  [07-08 18:59:32,020+0000] [pipelines_balancer-1]    org.gbif.pipelines.tasks.balancer.handler.VerbatimMessageHandler: Getting records number from the file - hdfs://ha-nn/data/ingest/a74db578-ac84-4907-ab8f-8de6eaa7df56/201/archive-to-verbatim.yml
INFO  [07-08 18:59:32,024+0000] [pipelines_balancer-1]    org.gbif.pipelines.tasks.balancer.handler.VerbatimMessageHandler: Records number - 17199, Spark Runner type - STANDALONE
tim@iMac-2 MGYS00003194 % wc -l event.txt 
   17199 event.txt

It's counting the event (core) records when deciding on standalone, when it should be counting the occurrence records.
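The balancer's decision reduces to comparing the record count from the yml file against a threshold. A minimal sketch of that logic, with illustrative names and a hypothetical threshold and occurrence count (not the actual gbif/pipelines API), shows why using the event count misroutes this dataset:

```java
// Hypothetical sketch of the runner-selection logic in the balancer.
// Names, the 100,000 threshold, and the occurrence count are illustrative;
// only the 17,199 event count comes from the log above.
public class RunnerSelectionSketch {

  enum Runner { STANDALONE, DISTRIBUTED }

  /** Picks a Spark runner type from a record count and a standalone threshold. */
  static Runner chooseRunner(long recordCount, long standaloneThreshold) {
    return recordCount <= standaloneThreshold ? Runner.STANDALONE : Runner.DISTRIBUTED;
  }

  public static void main(String[] args) {
    long eventCount = 17_199;         // archiveToErCount read from archive-to-verbatim.yml
    long occurrenceCount = 1_200_000; // hypothetical: rows in the 2.5GB occurrence extension

    // Deciding on the event count sends the job to standalone...
    System.out.println("by events:      " + chooseRunner(eventCount, 100_000));
    // ...while the occurrence count would have routed it to distributed.
    System.out.println("by occurrences: " + chooseRunner(occurrenceCount, 100_000));
  }
}
```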

I looked into it a little:

hdfs dfs -cat hdfs://ha-nn/data/ingest/a74db578-ac84-4907-ab8f-8de6eaa7df56/201/archive-to-verbatim.yml
archiveToErCount: 17199

which I think is being read here

I think archiveToErCount is probably meant to hold the occurrence count after the pivot is performed, not the event count; or perhaps both numbers should be stored.
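Storing both numbers in the metrics file would let the balancer pick the right one. A minimal sketch of what the yml could contain after such a fix; archiveToErCount is the existing key, while archiveToOccCount and the counts passed in main are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch (not the real pipelines code) of emitting both counts into the
// archive-to-verbatim.yml metrics file. Only archiveToErCount exists today;
// archiveToOccCount is a hypothetical key for the post-pivot occurrence count.
public class MetricsSketch {

  /** Renders the metrics map as simple "key: value" yml lines. */
  static String toYaml(long eventCount, long occurrenceCount) {
    Map<String, Long> metrics = new LinkedHashMap<>();
    metrics.put("archiveToErCount", eventCount);       // existing key: event core rows
    metrics.put("archiveToOccCount", occurrenceCount); // hypothetical: occurrence rows after pivot
    StringBuilder sb = new StringBuilder();
    metrics.forEach((k, v) -> sb.append(k).append(": ").append(v).append('\n'));
    return sb.toString();
  }

  public static void main(String[] args) {
    // 17,199 is the event count from the log; the occurrence count is made up.
    System.out.print(toYaml(17_199, 1_200_000));
  }
}
```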

muttcg commented

Fix deployed to PROD