vezaynk/Sitemap-Generator-Crawler

Indefinite loop

Opened this issue · 15 comments

Interesting case.

../ and ./ are not handled yet. They should be collapsed after the relative-to-absolute conversion.

Thyra commented

I believe this is the problem you are talking about: Remove Dot Segments (RFC 3986, section 5.2.4). There is a PHP gist that might be helpful here: https://gist.github.com/rdlowrey/5f56cc540099de9d5006
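For readers who have not seen the algorithm, here is a minimal sketch of dot-segment removal as RFC 3986, section 5.2.4 describes it. It is written in Python purely for illustration; it is not the crawler's PHP code and not the linked gist, and the function name `remove_dot_segments` is an assumption.

```python
def remove_dot_segments(path: str) -> str:
    """Collapse "." and ".." segments, following RFC 3986 section 5.2.4."""
    output = []  # finished segments, each kept with its leading "/"
    while path:
        # Step A: drop a leading "../" or "./"
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        # Step B: "/./" (or a trailing "/.") collapses to "/"
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        # Step C: "/../" (or a trailing "/..") collapses to "/"
        # and removes the last finished segment, if there is one
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        # Step D: a lone "." or ".." is discarded
        elif path in (".", ".."):
            path = ""
        # Step E: move the first segment (up to the next "/") to the output
        else:
            cut = path.find("/", 1)
            if cut == -1:
                cut = len(path)
            output.append(path[:cut])
            path = path[cut:]
    return "".join(output)
```

Applied to the path of an already absolute URL, this turns something like `/forum/./../index.php` into `/index.php`, so the same page is not queued again under a different spelling.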

Great find! Woke up this morning thinking I would have to read RFCs again.

Shouldn't be hard to implement this.

I think I got it. Hopefully I didn't break anything in the process.

I am not sure about that. In that forum there are only a handful of topics and posts, but the crawler is finding hundreds of URLs. Try and see: https://www.forum.2globalnomads.info/

That's because it's finding a lot of URLs you should have blacklisted, such as posting.php and search.php.

I will let it run to the end to see that it's not in an indefinite loop.

It ran out of memory and crashed before finishing. I am pretty sure there are still problems with phpBB3 forums.

In the temp sitemap file there were 7565 lines when it crashed. That's pretty much impossible without duplicates or looping.

I will take a closer look. There are too many safeguards against duplicate links. The issue is something else.

My closer look yielded results. The sid argument is at fault here, mostly. #31 would fix this.

sid (or similar functionality) is pretty common in systems that have session handling. If I am not completely wrong, it should be ignored by default. How about making it opt-in, so that it can be enabled with an option like:
php sitemap.php --enable-sid

Sitemaps don't need sids, I assure you.

The sane default is to ignore all query arguments, with a small set of arguments whitelisted by default.
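A rough sketch of what such a whitelist could look like, in Python for illustration only. The name `filter_query` and the whitelist contents (`f` and `t`, the phpBB forum and topic ids seen in this thread) are assumptions, not the crawler's actual configuration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical default whitelist: only these query arguments survive.
ALLOWED_ARGS = {"f", "t"}


def filter_query(url: str, allowed=ALLOWED_ARGS) -> str:
    """Return url with every query argument outside the whitelist removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in allowed
    ]
    # An empty query leaves no trailing "?" on the rebuilt URL.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

With this, `viewtopic.php?f=3&t=31&sid=...&view=print` reduces to `viewtopic.php?f=3&t=31`, so every sid variant maps to one URL before the duplicate check runs. The input is the raw URL, before `&` is escaped as `&amp;` for the sitemap.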

Sounds good to me.

For the sake of interest, I analysed the data.

=> cat sitemap.xml.partial | grep "php?f=3" | tee >(wc -l) | cat

<loc>https://www.forum.2globalnomads.info/viewtopic.php?f=3&amp;t=31&amp;sid=9a0e59593d7f1a504e0556c92f59dc5e&amp;view=print</loc>
........
<loc>https://www.forum.2globalnomads.info/viewforum.php?f=3&amp;sid=793f5d0f359deb76f70c4be8e155bd41</loc>
62

The same page was indexed 62 times but with different sids and views.
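To see how many distinct pages hide behind those entries, one option is to drop sid from every `<loc>` and count what is left. This is a small Python sketch under the assumption that each `<loc>` sits on one line, as in the output above; `strip_sid` and `count_pages` are hypothetical helper names.

```python
import re
from collections import Counter
from html import unescape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_sid(url: str) -> str:
    """Return url without its sid query argument."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "sid"
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def count_pages(xml_text: str) -> Counter:
    """Count <loc> entries per URL once sid has been removed."""
    locs = re.findall(r"<loc>(.*?)</loc>", xml_text)
    # unescape() turns the XML-escaped "&amp;" back into "&" before parsing.
    return Counter(strip_sid(unescape(loc)) for loc in locs)
```

Calling `count_pages(open("sitemap.xml.partial").read()).most_common(10)` would list the most duplicated pages first.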