How to filter duplicate URLs
sdivens opened this issue · 2 comments
sdivens commented
When using yield Task('image', url=image_url, post=task.post), how can I filter out duplicate URLs?
sym233 commented
from typing import Set
# global: URLs that have already been scheduled
url_set: Set[str] = set()
# in task_generator:
for url in urls:
    if url not in url_set:
        url_set.add(url)
        yield Task(url=url, ...)
This is the method I use.
lorien commented
There is no built-in duplicate filter in Spider. You have to implement it yourself.
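A minimal sketch of such a hand-rolled filter, kept independent of the library so it runs on its own. The helper name `unique_urls`, the `seen_urls` set, and the example URLs are hypothetical, and treating URLs that differ only in their `#fragment` as duplicates is an assumption you may not want:

```python
from typing import Iterable, Iterator, Set
from urllib.parse import urldefrag


def unique_urls(urls: Iterable[str], seen: Set[str]) -> Iterator[str]:
    """Yield each URL the first time it appears, recording it in `seen`."""
    for url in urls:
        # Assumption: URLs that differ only by "#fragment" are the same resource.
        key = urldefrag(url).url
        if key not in seen:
            seen.add(key)
            yield url


# One shared set, so duplicates are also caught across separate calls.
seen_urls: Set[str] = set()
batch = [
    "http://example.com/a.png",
    "http://example.com/b.png",
    "http://example.com/a.png",
    "http://example.com/a.png#top",
]
first_pass = list(unique_urls(batch, seen_urls))
second_pass = list(unique_urls(batch, seen_urls))
```

Inside a handler, the loop would then look something like `for url in unique_urls(image_urls, self.seen_urls): yield Task('image', url=url, post=task.post)`, where `self.seen_urls` is a set you create once on your spider. The set lives in memory only, so it does not survive a restart.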