MarginaliaSearch/MarginaliaSearch

(crawler-process) Duplicate IDs in spec handled improperly


If a crawlspec contains duplicate IDs, the corresponding website is crawled multiple times. If the duplicate IDs appear close together in the file, the result is corrupted crawl data.

Fix: Keep a set of seen IDs and deduplicate against it before launching new crawl processes.
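A minimal sketch of that approach, assuming a `CrawlSpecRecord` type with an `id()` accessor and a `startCrawlTask(...)` launcher (illustrative names, not necessarily the project's actual API):

```java
// Sketch only: CrawlSpecRecord, id(), and startCrawlTask(...) are assumed names.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CrawlSpecDeduplicationSketch {
    record CrawlSpecRecord(String id, String domain) {}

    void crawlAll(List<CrawlSpecRecord> specs) {
        Set<String> seenIds = new HashSet<>();

        for (CrawlSpecRecord spec : specs) {
            // Set.add() returns false if the id was already present,
            // so each id is only scheduled once
            if (!seenIds.add(spec.id())) {
                continue;
            }
            startCrawlTask(spec);
        }
    }

    void startCrawlTask(CrawlSpecRecord spec) {
        // launch the crawl for this spec (placeholder)
    }
}
```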

Fixed in 2619d19