Indefinite loop
Command:
php sitemap.php file=sitemap.xml site=https://www.forum.2globalnomads.info
Output:
[+] Added: https://www.forum.2globalnomads.info/
[+] Added: https://www.forum.2globalnomads.info/./search.php?sid=522b2caac3959c510cf81dfa71395a08
[+] Added: https://www.forum.2globalnomads.info/./
[+] Added: https://www.forum.2globalnomads.info/././search.php?sid=29496a0b8ec0ea15213165add3ccfde1
[+] Added: https://www.forum.2globalnomads.info/././
[+] Added: https://www.forum.2globalnomads.info/./././search.php?sid=0eb538cef14c5efac802a5b86d31dc09
[+] Added: https://www.forum.2globalnomads.info/./././
[+] Added: https://www.forum.2globalnomads.info/././././search.php?sid=1e56e609ccae15b28e70f409780e8835
...
Interesting case.
../ and ./ are not handled yet. They should be simplified after the relative-to-absolute conversion.
I believe this is the problem you are talking about: Remove Dot Segments. There is a PHP gist that might be helpful here: https://gist.github.com/rdlowrey/5f56cc540099de9d5006
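For anyone following along, here is a rough sketch of what "Remove Dot Segments" (RFC 3986, section 5.2.4) amounts to. It is written in Python purely for illustration; it is not the crawler's PHP code and not the gist linked above. The function names remove_dot_segments and normalize_url are made up for this example, and the sketch assumes the path is already absolute (i.e. it runs after the relative-to-absolute conversion).

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Collapse '.' and '..' segments in an absolute URL path.

    Illustrative helper (hypothetical name), following the behaviour
    described in RFC 3986 section 5.2.4 for absolute paths.
    """
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == ".":
            # "current directory": contributes nothing
            continue
        if seg == "..":
            # "parent directory": drop the previous segment, but never
            # the leading empty segment that represents the root "/"
            if len(out) > 1:
                out.pop()
            continue
        out.append(seg)
    # A path ending in "." or ".." refers to a directory, so keep the
    # trailing slash (e.g. "/a/b/.." becomes "/a/").
    if segments[-1] in (".", ".."):
        out.append("")
    return "/".join(out)


def normalize_url(url: str) -> str:
    """Apply remove_dot_segments to the path component only."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=remove_dot_segments(parts.path)))


if __name__ == "__main__":
    print(normalize_url("https://www.forum.2globalnomads.info/./././"))
    print(normalize_url(
        "https://www.forum.2globalnomads.info/././search.php"
        "?sid=29496a0b8ec0ea15213165add3ccfde1"
    ))
```

With this kind of normalization, /./, /././ and /./././ all collapse to /, so the duplicate check sees a single URL instead of an ever-growing chain.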
Great find! Woke up this morning thinking I would have to read RFCs again.
Shouldn't be hard to implement this.
I think I got it. Hopefully I didn't break anything in the process.
I am not sure about that. In that forum there are only a handful of topics and posts, but the crawler is finding hundreds of URLs. Try and see: https://www.forum.2globalnomads.info/
That's because it's finding a lot of URLs you should have blacklisted, such as posting.php and search.php.
I will let it run to the end to see that it's not in an indefinite loop.
It ran out of memory and crashed before finishing. I am pretty sure there are still problems with phpBB3 forums.
In the temp sitemap file there were 7565 lines when it crashed. That's practically impossible without duplicates or looping.
I will take a closer look. There are too many safeguards against duplicate links. The issue is something else.
My closer look yielded results. The sid argument is mostly at fault here. #31 would fix this.
sid (or something similar) is pretty common in systems that do session handling. If I am not mistaken, it should be ignored by default. How about making it something that can be enabled with an option like
php sitemap.php --enable-sid
Sitemaps don't need sids, I assure you.
The sane default is to ignore all query arguments, with a small number of arguments whitelisted out of the box.
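To make the idea concrete, here is a minimal sketch of "ignore everything, keep a whitelist", again in Python for illustration only and not the crawler's actual implementation. The name canonicalize and the whitelist contents (f and t, the phpBB forum and topic ids seen in this thread) are assumptions for the example; the real default list is not decided here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical whitelist: only these query arguments survive.
ALLOWED_PARAMS = frozenset({"f", "t"})


def canonicalize(url: str, allowed=ALLOWED_PARAMS) -> str:
    """Drop every query argument that is not explicitly whitelisted."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed
    ]
    # Sort so that the same arguments in a different order map to one URL.
    kept.sort()
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    print(canonicalize(
        "https://www.forum.2globalnomads.info/viewtopic.php"
        "?f=3&t=31&sid=9a0e59593d7f1a504e0556c92f59dc5e&view=print"
    ))
```

Under these assumptions, sid and view are dropped without needing to be blacklisted one by one, and any session-style argument from other software gets the same treatment.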
Sounds good to me.
For the sake of interest, I analysed the data.
=> cat sitemap.xml.partial | grep "php?f=3" | tee >(wc -l) | cat
<loc>https://www.forum.2globalnomads.info/viewtopic.php?f=3&t=31&sid=9a0e59593d7f1a504e0556c92f59dc5e&view=print</loc>
........
<loc>https://www.forum.2globalnomads.info/viewforum.php?f=3&sid=793f5d0f359deb76f70c4be8e155bd41</loc>
62
The same page was indexed 62 times but with different sids and views.
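If anyone wants to repeat the analysis per page rather than per grep pattern, something like the following could work. It is a Python sketch under a few assumptions: the partial sitemap has one <loc> element per line as in the excerpt above, only sid is stripped (view is kept, so print views count separately), and the helper names drop_params and count_pages are invented for this example.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

LOC_RE = re.compile(r"<loc>(.*?)</loc>")


def drop_params(url: str, ignored=("sid",)) -> str:
    """Return the URL without the ignored query arguments."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def count_pages(lines, ignored=("sid",)) -> Counter:
    """Count how often each page appears once session ids are removed."""
    counts = Counter()
    for line in lines:
        for raw in LOC_RE.findall(line):
            # Sitemaps normally escape "&" as "&amp;"; undo that first.
            url = raw.replace("&amp;", "&")
            counts[drop_params(url, ignored)] += 1
    return counts


if __name__ == "__main__":
    # Assumes the file name used earlier in this thread.
    with open("sitemap.xml.partial", encoding="utf-8") as handle:
        for url, hits in count_pages(handle).most_common(10):
            print(hits, url)
```

Any URL with a count above 1 is a page that was added more than once only because its sid differed.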