gbif/crawler

wish: allow a registry admin to force a crawl, regardless of the date of the last run


For debugging purposes (e.g. to check the effect of data mapping fixes), it is sometimes necessary to re-trigger a crawl shortly after the last run. Due to the fixed lag period of seven days between crawl attempts, this is presently not easily possible: any earlier crawl request will be refused (resulting in log messages like "DEBUG message: Not eligible to crawl [e45c7d91-81c6-4455-86e3-2965a5739b1f] - crawled 5 days ago, which is within threshold of 7 days.").
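For illustration, here is a minimal sketch of the kind of threshold check behind that message; the class, method and constant names are placeholders, not the actual gbif/crawler code:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.UUID;

/**
 * Minimal sketch (not the real crawler code) of the eligibility check that
 * produces the "Not eligible to crawl" message: a dataset is skipped when
 * its last crawl finished less than a fixed number of days ago.
 */
public class CrawlEligibilityCheck {

  // Hypothetical constant standing in for the scheduler's configured lag period.
  private static final Duration MIN_INTERVAL = Duration.ofDays(7);

  public boolean isEligible(UUID datasetKey, Instant lastCrawlFinished, Instant now) {
    Duration sinceLastCrawl = Duration.between(lastCrawlFinished, now);
    if (sinceLastCrawl.compareTo(MIN_INTERVAL) < 0) {
      System.out.printf(
          "Not eligible to crawl [%s] - crawled %d days ago, which is within threshold of %d days.%n",
          datasetKey, sinceLastCrawl.toDays(), MIN_INTERVAL.toDays());
      return false;
    }
    return true;
  }
}
```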

At least for a registry administrator, it would be very helpful to be able to request a re-crawl at any time, even before the seven-day waiting period is up (and starting from a fresh harvest of the local data source, e.g. the DwC-A).

Also: to an outside user (e.g. a registered publisher requesting the crawl through the UI), the enforced lag period and the silent dropping of the crawl request are not immediately transparent; they will simply find that nothing happens when they request a re-crawl. Some feedback would be helpful here.

Due to the fixed lag period of seven days between crawl attempts, this is presently not easily possible: any earlier crawl request will be refused

That's not actually the case -- any dataset can be recrawled as soon as any current crawl of that dataset has completed. That debug message is from the regular 7-day scheduler; the message from the click-to-crawl process is a little older: "Requested crawl for dataset [e45c7d91-81c6-4455-86e3-2965a5739b1f] but crawl already scheduled or running, ignoring".
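As a sketch of that guard (the znode path and class names here are assumptions for illustration; the real crawler's ZooKeeper layout may differ), the click-to-crawl path essentially refuses a request while a crawl record for the dataset still exists:

```java
import java.util.UUID;
import org.apache.curator.framework.CuratorFramework;

/**
 * Sketch of the click-to-crawl guard described above: a request is ignored
 * while ZooKeeper already holds a record for a scheduled or running crawl.
 * The znode path and class names are placeholders, not the real layout.
 */
public class ClickToCrawlGuard {

  private final CuratorFramework curator;

  public ClickToCrawlGuard(CuratorFramework curator) {
    this.curator = curator;
  }

  public boolean requestCrawl(UUID datasetKey) throws Exception {
    String crawlPath = "/crawls/" + datasetKey;  // hypothetical path
    if (curator.checkExists().forPath(crawlPath) != null) {
      System.out.printf(
          "Requested crawl for dataset [%s] but crawl already scheduled or running, ignoring%n",
          datasetKey);
      return false;
    }
    // ...otherwise create the crawl record and queue the crawl.
    return true;
  }
}
```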

This dataset is stuck in limbo in the crawl queue, due to ... a bug. I don't know how it happened, but the DWCA was fragmented twice, the second time starting after the crawl record had already been cleaned from ZooKeeper: https://logs.gbif.org/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'2019-11-19T15:56:05.743Z',mode:absolute,to:'2019-11-19T16:11:34.831Z'))&_a=(columns:!(service,message),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:AWyLao3iHCKcR6PFXuPR,key:datasetKey,negate:!f,type:phrase,value:e45c7d91-81c6-4455-86e3-2965a5739b1f),query:(match:(datasetKey:(query:e45c7d91-81c6-4455-86e3-2965a5739b1f,type:phrase)))),('$state':(store:appState),meta:(alias:!n,disabled:!f,index:AWyLao3iHCKcR6PFXuPR,key:service,negate:!t,type:phrase,value:crawler-crawl-scheduler),query:(match:(service:(query:crawler-crawl-scheduler,type:phrase))))),index:AWyLao3iHCKcR6PFXuPR,interval:auto,query:(match_all:()),sort:!('@timestamp',desc))
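Unsticking it currently means clearing the leftover state by hand. A rough sketch of the kind of manual clean-up an operator might run follows; the connection string and znode path are placeholders, and this should only be done after confirming no crawl is genuinely in progress:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

/**
 * Sketch of a manual clean-up to unstick a dataset: inspect and, if confirmed
 * stale, delete the leftover crawl node so the dataset becomes eligible again.
 * Connection string and znode path are placeholders; the crawler's actual
 * ZooKeeper layout may differ.
 */
public class StaleCrawlCleanup {

  public static void main(String[] args) throws Exception {
    String datasetKey = "e45c7d91-81c6-4455-86e3-2965a5739b1f";
    String path = "/crawls/" + datasetKey;  // hypothetical path

    try (CuratorFramework curator = CuratorFrameworkFactory.newClient(
        "zookeeper.example.org:2181", new ExponentialBackoffRetry(1000, 3))) {
      curator.start();
      if (curator.checkExists().forPath(path) != null) {
        // Only delete after confirming no crawl is genuinely in progress.
        curator.delete().deletingChildrenIfNeeded().forPath(path);
        System.out.println("Removed stale crawl record: " + path);
      } else {
        System.out.println("No crawl record found at " + path);
      }
    }
  }
}
```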

It's something we need to handle, but I'm not sure it should be automated, as there's a high chance of making more of a mess when there isn't a problem -- i.e. clicking the button when a crawl is genuinely still in progress.

We have already fixed the case of an invalid archive preventing metadata updates, so I think there are now three cases where this button would be needed:

  • After changing a default value machine tag to reprocess the data
  • After changing interpretation code

I think appropriate re-runs of just the interpretation can already be done for pipelines (there is a "Rerun specific steps in a pipeline" button which can be used, but it needs some explanation of how to use it).

  • The other case is bugs in the old crawling, which are pretty rare anyway, and will become irrelevant once we switch that off early next year.