MarginaliaSearch/MarginaliaSearch

(crawler-process) Duplicate IDs in spec handled improperly


If a crawlspec contains duplicate IDs, the corresponding website is crawled multiple times. If the duplicate IDs appear close together in the file, the result is corrupted crawl data.

Fix: Keep a set of seen IDs and deduplicate against it before launching new crawl processes.
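A minimal sketch of that approach, assuming a `CrawlSpecRecord` type with an `id()` accessor and a `startCrawlTask(...)` launcher (illustrative names, not necessarily the project's actual API):

```java
// Sketch only: CrawlSpecRecord, id(), and startCrawlTask(...) are assumed names.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CrawlSpecDeduplicationSketch {
    record CrawlSpecRecord(String id, String domain) {}

    void crawlAll(List<CrawlSpecRecord> specs) {
        Set<String> seenIds = new HashSet<>();

        for (CrawlSpecRecord spec : specs) {
            // Set.add() returns false if the id was already present,
            // so each id is only scheduled once
            if (!seenIds.add(spec.id())) {
                continue;
            }
            startCrawlTask(spec);
        }
    }

    void startCrawlTask(CrawlSpecRecord spec) {
        // launch the crawl for this spec (placeholder)
    }
}
```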

Fixed in 2619d19