gbif/pipelines

VERBATIM_TO_IDENTIFIER runs small datasets on Spark

Closed this issue · 2 comments

The VERBATIM_TO_IDENTIFIER stage runs everything on Spark (YARN), even for tiny datasets such as this one.

We should either fix the config so that Spark is only used above a reasonable threshold (e.g. 1M records or >1GB uncompressed size) or rework this stage so that it doesn't require distributed computing at all.
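For illustration, a threshold-based switch could look something like the sketch below. This is only a hypothetical config fragment — the key names (`spark-switch`, `records-threshold`, `file-size-threshold-mb`) are made up for the example and do not correspond to actual pipelines configuration keys:

```yaml
# Hypothetical sketch: route small datasets to a standalone runner,
# large ones to Spark-on-YARN. Key names are illustrative only.
verbatim-to-identifier:
  spark-switch:
    records-threshold: 1000000      # use Spark only above ~1M records
    file-size-threshold-mb: 1024    # ...or above ~1GB uncompressed
  small-runner: STANDALONE          # in-process run for tiny datasets
  large-runner: DISTRIBUTED         # Spark (YARN) run for big datasets
```

The same idea is already used elsewhere in the pipelines for choosing between standalone and distributed interpretation, so the fix may be mostly a matter of wiring this stage into that mechanism.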

There is currently only one implementation of that workflow: YARN/Beam.

Deployed to PROD