(crawler-process) Duplicate IDs in spec handled improperly
vlofgren commented
If a crawlspec contains duplicate IDs, the website is crawled multiple times. If the duplicate IDs appear close together in the file, the crawls overlap and the resulting crawl data is corrupted.
Fix: Keep a set of seen IDs and deduplicate the spec before launching new crawl processes.
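
A minimal sketch of the fix, assuming a hypothetical `CrawlSpec` record with an `id()` accessor (the actual spec type and field names in crawler-process may differ). Later occurrences of an ID are dropped via `Set.add()`, which returns `false` when the element is already present:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CrawlSpecDeduplicator {
    // Hypothetical stand-in for the real crawl spec entry type.
    record CrawlSpec(String id, String url) {}

    /** Returns the specs with duplicate IDs removed,
     *  keeping only the first occurrence of each ID in file order. */
    static List<CrawlSpec> deduplicate(List<CrawlSpec> specs) {
        Set<String> seenIds = new HashSet<>();
        return specs.stream()
                // Stateful filter is safe here because the stream is sequential;
                // add() returns false for IDs we have already seen.
                .filter(spec -> seenIds.add(spec.id()))
                .toList();
    }
}
```

Running the deduplicated list through the existing process launcher would then guarantee at most one crawl task per ID, regardless of how the duplicates are distributed in the file.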