When a page links to itself with auto-generated URL params, the spider gets stuck in a never-ending loop
Closed this issue · 2 comments
LzrBear commented
See the following example:
http://www.defence.gov.au/uhtbin/cgisirsi/?ps=yU4kVfNSp1/SIRSI/0/49
This URL goes to a page with an OK button on it. The OK button links to a URL that is exactly the same as the source URL, except that it has a new, dynamically generated ps URL param.
Because every fetch produces a URL the spider has not seen before, it keeps adding URLs to check endlessly.
LzrBear commented
A possible solution could be to take the hash of the page after removing all URLs from it. If the hash is a duplicate, stop processing the page and do not add its URLs to the fetch queue.
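
A rough sketch of that idea in Python, to make it concrete. This is not taken from the spider's code: `content_fingerprint` and `SeenPages` are hypothetical names, and the regex is only a heuristic that blanks `href`, `src` and `action` attribute values rather than a real HTML parser. It assumes the only thing that changes between the looping pages is the URLs themselves; a page that also embeds a timestamp or a session token outside those attributes would still hash differently.

```python
import hashlib
import re

# Heuristic: match href/src/action attributes with a quoted value.
# Assumption for this sketch; a real crawler would likely use an HTML parser.
URL_ATTR = re.compile(
    r'''\b(href|src|action)\s*=\s*("[^"]*"|'[^']*')''',
    re.IGNORECASE,
)


def content_fingerprint(html: str) -> str:
    """Hash the page after blanking out every URL-bearing attribute value."""
    stripped = URL_ATTR.sub(r'\1=""', html)
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


class SeenPages:
    """Remembers fingerprints of pages that were already processed."""

    def __init__(self) -> None:
        self._fingerprints = set()

    def is_new(self, html: str) -> bool:
        """Return True the first time a page body is seen, False on repeats."""
        fingerprint = content_fingerprint(html)
        if fingerprint in self._fingerprints:
            return False
        self._fingerprints.add(fingerprint)
        return True


# Usage: only extract links and queue them when the page body is new.
seen = SeenPages()
page = '<html><body><a href="/uhtbin/cgisirsi/?ps=AAA/SIRSI/0/49">OK</a></body></html>'
if seen.is_new(page):
    pass  # extract the URLs here and add them to the fetch queue
```

With a check like this, the second fetch of the OK page would be recognised as a duplicate even though its ps param differs, so its link would never reach the queue.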