Sites that redirect are all being marked as duplicates of each other
Closed this issue · 2 comments
LzrBear commented
Try indexing google.com and a lot of duplicates will pop up. I believe this is because google.com is a redirect, so every other site that redirects ends up with the same simhash.
LzrBear commented
Looks like the fetch joint is working correctly. It fetches the page and, if the response is a redirect, marks the task status as redirect and adds the new URL to the check queue. It looks like somewhere lower in the pipeline redirected tasks are being marked as duplicates because they all have the same simhash (i.e. empty).
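
A rough sketch of what I think is going on, and one possible guard. This is not the project's actual code: `Task`, its `status` and `body` fields, `simhash`, and `find_duplicate` are all hypothetical names. It also assumes a token-based 64-bit simhash and exact fingerprint matching, where a real check would more likely compare Hamming distance. The point is only that an empty body always hashes to the same value, so the duplicate check probably needs to skip redirected tasks.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Task:
    # Hypothetical task record; field names are assumptions for illustration.
    url: str
    status: str  # e.g. "ok" or "redirect"
    body: str = ""


def simhash(text: str, bits: int = 64) -> int:
    """Token-based simhash. With no tokens, every weight stays 0 and the result is 0."""
    weights = [0] * bits
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64 bits of the token hash
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def find_duplicate(task: Task, seen: Dict[int, str]) -> Optional[str]:
    """Return the URL of an earlier task with the same fingerprint, or None.

    Redirected tasks and empty bodies are skipped, so they are never
    recorded in `seen` and never reported as duplicates.
    """
    if task.status == "redirect" or not task.body.strip():
        return None
    fp = simhash(task.body)
    if fp in seen:
        return seen[fp]
    seen[fp] = task.url
    return None
```

Without the early return in `find_duplicate`, every redirected task would hash to `simhash("")` and collide with the first redirect that was indexed, which matches the behaviour described above.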