Large event dataset crashing standalone occurrence interpreter
Closed this issue · 4 comments
MattBlissett commented
This dataset is crashing the standalone occurrence interpreter. It looks like it used to run as a distributed interpretation, but is now being sent to standalone -- even with a 2.5GB occurrence extension.
https://registry.gbif.org/dataset/a74db578-ac84-4907-ab8f-8de6eaa7df56/ingestion-history
MattBlissett commented
(It would also be useful to have the logging MDC values on the entries.)
INFO [07-08 18:59:32,020+0000] [pipelines_balancer-1] org.gbif.pipelines.tasks.balancer.handler.VerbatimMessageHandler: Getting records number from the file - hdfs://ha-nn/data/ingest/a74db578-ac84-4907-ab8f-8de6eaa7df56/201/archive-to-verbatim.yml
INFO [07-08 18:59:32,024+0000] [pipelines_balancer-1] org.gbif.pipelines.tasks.balancer.handler.VerbatimMessageHandler: Records number - 17199, Spark Runner type - STANDALONE
timrobertson100 commented
tim@iMac-2 MGYS00003194 % wc -l event.txt
17199 event.txt
It's counting event records rather than occurrence records when deciding on the standalone runner
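The decision Tim describes can be sketched as below. The class, method, and threshold value are hypothetical stand-ins, not the real balancer code; the point is only that feeding the event count (17199) instead of the much larger occurrence count flips the choice to STANDALONE:

```java
// Hypothetical sketch of the balancer's runner selection.
// STANDALONE_LIMIT is an assumed cutoff, not the real configured value.
public class RunnerSelector {

    static final long STANDALONE_LIMIT = 100_000;

    static String chooseRunner(long recordCount) {
        // Small archives run in-process; large ones go to the cluster.
        return recordCount <= STANDALONE_LIMIT ? "STANDALONE" : "DISTRIBUTED";
    }

    public static void main(String[] args) {
        long eventCount = 17_199;         // what archiveToErCount held
        long occurrenceCount = 1_500_000; // illustrative figure for the 2.5GB extension

        System.out.println(chooseRunner(eventCount));      // STANDALONE
        System.out.println(chooseRunner(occurrenceCount)); // DISTRIBUTED
    }
}
```

With the event count the dataset looks tiny, so the balancer picks the standalone interpreter even though the occurrence extension is far too large for it.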
timrobertson100 commented
I looked a little and found:
hdfs dfs -cat hdfs://ha-nn/data/ingest/a74db578-ac84-4907-ab8f-8de6eaa7df56/201/archive-to-verbatim.yml
archiveToErCount: 17199
which I think is being read here.
It is probably meant to save the occurrence count, not the event count, in archiveToErCount after the pivot is performed; or perhaps both counts should be recorded.
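One possible shape for the fix suggested above would be to record both counts in the yml, so the balancer can choose based on the number of occurrences. The occurrenceCount field name here is hypothetical, not the actual field used:

```
archiveToErCount: 17199
occurrenceCount: 1500000   # illustrative value for the 2.5GB extension
```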
muttcg commented
Fix deployed to PROD