xirtah/gopa-spider

Sites that redirect are all being marked as duplicates of each other

Closed this issue · 2 comments

Try indexing google.com and a lot of duplicates will pop up. I believe this is because google.com responds with a redirect, so every other site that redirects ends up with the same simhash.

Looks like the fetch joint is working correctly: it fetches the page and, if the response is a redirect, marks the task status as redirect and adds the new URL to the check queue. It looks like somewhere lower in the pipeline redirected tasks are being marked as duplicates because they all have the same simhash (i.e. empty).
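To illustrate the suspected failure mode, here is a minimal, self-contained sketch. It is not gopa's actual code and not necessarily what the fix does: `Task`, `TaskStatus`, `simhash`, `shouldCheckDuplicate` and `findDuplicates` are all hypothetical names, and the dedup step here uses exact simhash matches rather than a Hamming-distance threshold. It only shows that a body with no tokens always produces the same hash, and that skipping redirect tasks in the dedup step would avoid the false positives.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// TaskStatus is a hypothetical status enum; the real gopa types may differ.
type TaskStatus int

const (
	StatusFetched TaskStatus = iota
	StatusRedirect
)

// Task is a hypothetical stand-in for a crawl task.
type Task struct {
	URL    string
	Status TaskStatus
	Body   string
}

// simhash computes a simple 64-bit simhash over whitespace-separated tokens.
// With no tokens every bit weight stays at zero, so the result is always 0.
func simhash(text string) uint64 {
	var weights [64]int
	for _, tok := range strings.Fields(text) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		v := h.Sum64()
		for i := 0; i < 64; i++ {
			if v&(1<<uint(i)) != 0 {
				weights[i]++
			} else {
				weights[i]--
			}
		}
	}
	var out uint64
	for i, w := range weights {
		if w > 0 {
			out |= 1 << uint(i)
		}
	}
	return out
}

// shouldCheckDuplicate reports whether a task should take part in dedup.
// Assumption: redirects and empty bodies carry no content worth comparing.
func shouldCheckDuplicate(t Task) bool {
	return t.Status != StatusRedirect && strings.TrimSpace(t.Body) != ""
}

// findDuplicates returns the URLs whose simhash matches an earlier task.
// When guard is true, tasks rejected by shouldCheckDuplicate are skipped.
func findDuplicates(tasks []Task, guard bool) []string {
	seen := map[uint64]string{}
	var dups []string
	for _, t := range tasks {
		if guard && !shouldCheckDuplicate(t) {
			continue
		}
		h := simhash(t.Body)
		if _, ok := seen[h]; ok {
			dups = append(dups, t.URL)
			continue
		}
		seen[h] = t.URL
	}
	return dups
}

func main() {
	tasks := []Task{
		{URL: "http://google.com", Status: StatusRedirect},
		{URL: "http://example.org", Status: StatusRedirect},
		{URL: "http://example.net/page", Status: StatusFetched, Body: "hello crawler world"},
	}
	fmt.Println(findDuplicates(tasks, false)) // second redirect is flagged
	fmt.Println(findDuplicates(tasks, true))  // redirects are skipped
}
```

Without the guard, the two unrelated redirecting sites collide on the hash of an empty body. With the guard, redirect tasks never reach the comparison, and only the redirect target gets hashed once it is fetched from the check queue.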

Fixed by 181ec28