vezaynk/Sitemap-Generator-Crawler

Retrieve and parse header before requesting the full page

vezaynk opened this issue · 20 comments

Doing so would fix a wide array of issues, namely the fact that the script currently downloads huge PDFs needlessly just to check their HTTP code. Absolutely pointless.

A follow-up issue would be to add a "max file size" flag to avoid downloading any files larger than a configured limit.
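For reference, cURL already exposes a built-in size cap that could back such a flag; a minimal sketch, assuming $ch is the existing handle and 2 MB is just an example threshold (it only takes effect when the server sends a Content-Length header):

// Sketch: refuse to download anything larger than ~2 MB.
// Only effective when the response advertises a Content-Length.
curl_setopt($ch, CURLOPT_MAXFILESIZE, 2 * 1024 * 1024);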

This will also address #26

Running just
$header = get_headers($url);
should be way more efficient.
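For illustration, a minimal sketch of that approach, assuming $url is the page being checked (the stream context is needed because get_headers() sends a full GET request by default; the $context parameter requires PHP 7.1+):

// Fetch only the response headers via a HEAD request.
$context = stream_context_create(['http' => ['method' => 'HEAD']]);
$headers = get_headers($url, true, $context);
$status  = $headers[0];                    // e.g. "HTTP/1.1 200 OK"
$type    = $headers['Content-Type'] ?? ''; // may be an array after redirects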

It would be, that is true. I want to do it even more efficiently, however (if possible, and it probably is). If I use get_headers, the headers would be requested twice, which will actually end up hurting performance on more light-weight sites. Splitting the request into two parts would be ideal.

Then you will have to use cURL, which should be way less efficient than the PHP function. This might be something you want to test.

It's using cURL anyways, no performance loss there.

I am sure that it is currently using cURL. You asserted that using cURL would make things slower (it may, it may not, I have no idea). I responded that I am using cURL anyways and that using more of its functions would not slow it down further, not that I am sure cURL is more performant than a given alternative.

Oh, OK, a misunderstanding. I thought you said that get_headers($url) uses cURL internally.

I think it should be way faster to use just get_headers() and run cURL only when needed.

As I said, it will certainly be faster for some cases, but for well-built websites it will be slower because it's going to send more requests than necessary.

I define a well-built website as one that has no broken links and no over-sized pages. I wouldn't want to punish websites that do it right in favor of websites that have plenty of broken links and such.

I am, however, having a hard time finding resources explaining how to achieve what I am looking for, so I may very well end up going with the alternative.

Are you currently crawling CSS, JS, and all other files for URLs? Using headers should eliminate going through unnecessary files when you know the media types in advance, and I believe there are more useless than useful files on an average website.

Unless someone links a CSS/JS/whatever file with an a[href], it wouldn't even look at it.

CSS and JS should not be directly linked, but there are often various downloads, like PDFs, that are href-linked.

Yes, they are href-linked and can optionally be indexed since #40. Images, however, are a better example of wasteful downloads.

Could always do an is_html() function that uses get_headers() and then retrieves the full page only if necessary.
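For illustration only, a sketch of such a helper, assuming the HEAD-request approach from earlier (is_html() and its exact behavior are hypothetical, not part of the script):

// Hypothetical helper: check the Content-Type header before downloading the body.
function is_html($url) {
    $context = stream_context_create(['http' => ['method' => 'HEAD']]);
    $headers = get_headers($url, true, $context);
    if ($headers === false) {
        return false; // request failed
    }
    $type = $headers['Content-Type'] ?? '';
    if (is_array($type)) {
        $type = end($type); // after redirects the header may repeat; keep the last value
    }
    return stripos($type, 'text/html') === 0;
}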

The method I was initially proposing would be done with something like this:

// cURL passes each chunk of downloaded body data to my_callback().
curl_setopt($ch, CURLOPT_WRITEFUNCTION, "my_callback");
// Returning anything other than strlen($data) aborts the transfer, so
// unwanted content can be rejected as soon as the first chunk arrives.
function my_callback($ch, $data) {
    return is_valid($data) ? strlen($data) : -1; // is_valid() is a placeholder check
}

It is not exactly what I was looking for, but it works for all intents and purposes. The way it works can be read in detail here, but in short, it can abort a cURL operation as soon as some data has been received.

This doesn't sound amazingly efficient to me, and as such I feel compelled to actually test how well it stacks up against just getting the headers and optionally making the full request afterwards, as @2globalnomads was proposing.

I lied. The better way to do this is with CURLOPT_HEADERFUNCTION.
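A rough sketch of what that could look like, assuming a check on the Content-Type header; this is not the branch's actual code, just one way to wire the callback up:

// cURL calls the header callback once per response header line,
// before any of the body is downloaded.
function header_callback($ch, $header) {
    // Abort non-HTML responses: returning -1 makes cURL treat it as a write error.
    if (stripos($header, 'Content-Type:') === 0 && stripos($header, 'text/html') === false) {
        return -1;
    }
    return strlen($header); // keep the transfer going
}
curl_setopt($ch, CURLOPT_HEADERFUNCTION, "header_callback");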

I created a branch to test that implementation; it's not the cleanest thing in the world, but it works. https://github.com/knyzorg/Sitemap-Generator-Crawler/tree/single-request

The performance improvements are predictably obvious with websites that have a lot of PDFs such as this one: http://rolf-herbold.de

The effect is ideal: performance for standard sites is unaffected, while the performance of sites like the above is improved (how much depends on connection speed).

On a side note, I'm actually working on a multi-threaded(?) version of your crawler via curl_multi_exec.

I have some non-database scraped sites that I crawl to build up the site, and single-threaded crawling takes forever.
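For context, parallel fetching with curl_multi generally looks something like the sketch below; this is a generic pattern, not the work-in-progress code, and $urls is an assumed list of pages to fetch:

// Generic curl_multi pattern: run several transfers concurrently.
$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}
do {
    curl_multi_exec($mh, $running); // drive all transfers
    curl_multi_select($mh);         // wait for activity instead of busy-looping
} while ($running > 0);
foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch);
    // ... extract links from $html here ...
    curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);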

What is this curl_multi_exec version you speak of?

After more testing: what I did is definitely not merge-able as-is. On some systems, handling a failed cURL transfer is more expensive than just downloading the file.