(crawl-job-extractor) Bunch of CJE papercuts and issues

Question

Closed this issue 6 months ago · 1 comments

Known problems with master:

The EC_URL table has been retired, so extraction from the DB no longer works in master. We probably don't need known URLs at all anymore, since we're mostly doing recrawls. Maybe the specs format should be simplified to just a CSV with domain and crawl depth.
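
To make the CSV idea concrete, here is a rough sketch of what reading such a spec could look like. Nothing here exists in the codebase: the two-column `domain,depth` layout, the `#` comment lines, and the names `CrawlSpecEntry` and `parse_crawl_spec` are all assumptions made up for illustration, and it's written in Python only to keep the sketch short.

```python
import csv
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlSpecEntry:
    """One row of the hypothetical simplified spec: a domain and its crawl depth."""
    domain: str
    crawl_depth: int


def parse_crawl_spec(text: str) -> list[CrawlSpecEntry]:
    """Parse 'domain,depth' rows, skipping blank lines and '#' comment lines."""
    # Drop blank lines and comments before handing the rest to the CSV reader.
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]

    entries = []
    for row in csv.reader(lines):
        if len(row) != 2:
            raise ValueError(f"expected 'domain,depth', got {row!r}")
        domain = row[0].strip().lower()
        depth = int(row[1].strip())  # raises ValueError on non-numeric depth
        if not domain or depth < 1:
            raise ValueError(f"invalid spec row: {row!r}")
        entries.append(CrawlSpecEntry(domain, depth))
    return entries
```

A spec file under this assumed format would then just be lines like `www.example.com,200`, which is easy to edit by hand or generate from a query.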

Notes from triggering the last crawl:

The process lacks a ProcessServiceHeartbeat, so it isn't visible in the control GUI. It could do with chatting a bit in the EventLog as well.
The workflow for recrawls is a mess. Right now recrawls are only possible with the same spec, and only if it's still related to the crawl. That combination isn't useful. There should either be a way to manually relate or de-relate specifications and crawls, or a way to explicitly specify a specification when doing a recrawl.
It would be nice if there was a way to merge specifications without using command-line tools. Possibly less necessary if we go the CSV route mentioned in the first point, but still, it would be nice to reduce the amount of routine work done over SSH ⌨️

Answer 1 · 2023-12-10T18:14:34.000Z

No longer an issue