xirtah/gopa-spider

When a page links to itself with auto-generated URL params, the spider gets stuck in a never-ending loop

Closed this issue · 2 comments

See the following example
http://www.defence.gov.au/uhtbin/cgisirsi/?ps=yU4kVfNSp1/SIRSI/0/49

This URL goes to a page with an OK button on it. The OK button links to a URL that is exactly the same as the source URL, except that it has a new, dynamically generated ps URL param.

Because every generated URL looks new, the spider endlessly keeps adding URLs to check.

A possible solution could be to take the hash of the page after removing all URLs from it. If the hash is a duplicate, stop processing the page and do not add its URLs to the fetch queue.
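
A rough sketch of that idea in Go, to make the proposal concrete. This is not gopa's code: `contentHash`, `seenPages`, and `shouldProcess` are hypothetical names, the regex-based URL stripping is a simplification (a real HTML parser would be more robust), and the in-memory map is only a stand-in for whatever store the spider actually uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"regexp"
)

// urlAttr matches href/src/action attributes with a quoted value.
// Assumption: URLs mostly live in these attributes.
var urlAttr = regexp.MustCompile(`(?i)\b(href|src|action)\s*=\s*("[^"]*"|'[^']*')`)

// bareURL matches absolute http(s) URLs that appear outside those attributes.
var bareURL = regexp.MustCompile(`(?i)https?://[^\s"'<>]+`)

// contentHash returns a hex SHA-256 of the page body with URLs blanked out,
// so pages that differ only in their links hash to the same value.
func contentHash(body string) string {
	stripped := urlAttr.ReplaceAllString(body, `$1=""`)
	stripped = bareURL.ReplaceAllString(stripped, "")
	sum := sha256.Sum256([]byte(stripped))
	return hex.EncodeToString(sum[:])
}

// seenPages is a hypothetical in-memory set of content hashes.
type seenPages map[string]struct{}

// shouldProcess reports whether the page content is new. It records the hash
// on first sight and returns false for any later page with the same hash.
func (s seenPages) shouldProcess(body string) bool {
	h := contentHash(body)
	if _, dup := s[h]; dup {
		return false
	}
	s[h] = struct{}{}
	return true
}
```

The spider would call `shouldProcess` after fetching a page and skip link extraction when it returns false. An exact hash only catches pages that are byte-identical once URLs are removed, so any other per-request content (timestamps, session IDs in the text) would still defeat it.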

Fixed the content simhash. Now, if a page is similar in content to one already seen, it is marked as a duplicate and not processed further. Fix made in the following check-ins: 58dc39c, a38816b.
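
For readers unfamiliar with simhash, here is a minimal sketch of the general technique in Go. It is not taken from the commits above: the whitespace tokenizer, the FNV-1a token hash, the equal token weights, and the names `simhash`, `hammingDistance`, and `isNearDuplicate` are all assumptions made for illustration, and the distance threshold is left to the caller.

```go
package main

import (
	"hash/fnv"
	"math/bits"
	"strings"
)

// simhash computes a 64-bit fingerprint of text. Each token votes on every
// bit position: +1 if the token's hash has that bit set, -1 otherwise.
// Texts that share most tokens end up with fingerprints that differ in few bits.
func simhash(text string) uint64 {
	var weights [64]int
	for _, tok := range strings.Fields(strings.ToLower(text)) {
		h := fnv.New64a()
		h.Write([]byte(tok))
		v := h.Sum64()
		for i := 0; i < 64; i++ {
			if v&(1<<uint(i)) != 0 {
				weights[i]++
			} else {
				weights[i]--
			}
		}
	}
	var fp uint64
	for i, w := range weights {
		if w > 0 {
			fp |= 1 << uint(i)
		}
	}
	return fp
}

// hammingDistance counts the bit positions in which two fingerprints differ.
func hammingDistance(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

// isNearDuplicate treats two texts as duplicates when their fingerprints
// differ in at most maxDistance bits (the threshold is a tuning choice).
func isNearDuplicate(a, b string, maxDistance int) bool {
	return hammingDistance(simhash(a), simhash(b)) <= maxDistance
}
```

Unlike an exact hash, this tolerates small differences such as a changed session token, at the cost of possibly treating genuinely different but similar pages as duplicates when the threshold is too loose.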