lorien/grab

How to filter duplicate urls

sdivens opened this issue · 2 comments

When using yield Task('image', url=image_url, post=task.post), how can I filter out duplicate URLs?

# global
from typing import Set

url_set: Set[str] = set()

# in task_generator:
for url in urls:
    if url not in url_set:
        url_set.add(url)
        yield Task(url=url, ...)

This is the approach I use.

There is no built-in duplicate filter in Spider. You have to implement it on your own.
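
For reference, here is a minimal sketch of the same set-based check kept on the spider instance instead of a module-level global, assuming the grab.spider API (Spider, Task, task_generator, the prepare() hook and the task_<name> handler convention). The seed URL and the extract_image_urls() helper are hypothetical placeholders standing in for your own parsing logic; the dedup check is the part that matters.

# A minimal sketch, not the library's own mechanism: Grab has no built-in
# duplicate filter, so the seen-URL set below is plain application code.
from grab.spider import Spider, Task


def extract_image_urls(grab):
    # Hypothetical placeholder parsing step: replace with your own selectors.
    # Duplicates are intentional here to show the filter at work.
    return [
        'https://example.com/a.jpg',
        'https://example.com/a.jpg',
        'https://example.com/b.jpg',
    ]


class ImageSpider(Spider):
    def prepare(self):
        # Runs once before the crawl starts: keep seen URLs on the instance
        # instead of in a module-level global.
        self.seen_urls = set()

    def task_generator(self):
        # Hypothetical seed page.
        yield Task('page', url='https://example.com/gallery')

    def task_page(self, grab, task):
        # Yield an image task only for URLs we have not seen before.
        # Pass through any extra task attributes you need (e.g. post=task.post
        # as in the question) the same way.
        for image_url in extract_image_urls(grab):
            if image_url not in self.seen_urls:
                self.seen_urls.add(image_url)
                yield Task('image', url=image_url)

    def task_image(self, grab, task):
        # Handle the downloaded image here (save to disk, etc.).
        pass


if __name__ == '__main__':
    bot = ImageSpider(thread_number=2)
    bot.run()

Note that the seen-URL set grows without bound over a long crawl; for very large crawls you may want to persist it or swap in a probabilistic structure such as a Bloom filter.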