vezaynk/Sitemap-Generator-Crawler

Retrieve and parse header before requesting the full page

vezaynk opened this issue · 20 comments

Doing so would fix a wide array of issues, namely the fact that the script currently downloads huge PDFs needlessly just to check their HTTP code. Absolutely pointless.

A follow-up issue would be to add a "max file size" flag to avoid downloading any files larger than a configured limit.
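For reference, cURL already exposes a built-in size cap that could back such a flag; a minimal sketch, assuming $ch is the existing handle and 2 MB is just an example threshold (it only takes effect when the server sends a Content-Length header):

// Sketch: refuse to download anything larger than ~2 MB.
// Only effective when the response advertises a Content-Length.
curl_setopt($ch, CURLOPT_MAXFILESIZE, 2 * 1024 * 1024);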

This will also address #26

Running just
$header = get_headers($url);
should be way more efficient.
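For illustration, a minimal sketch of that approach, assuming $url is the page being checked (the stream context is needed because get_headers() sends a full GET request by default; the $context parameter requires PHP 7.1+):

// Fetch only the response headers via a HEAD request.
$context = stream_context_create(['http' => ['method' => 'HEAD']]);
$headers = get_headers($url, true, $context);
$status  = $headers[0];                    // e.g. "HTTP/1.1 200 OK"
$type    = $headers['Content-Type'] ?? ''; // may be an array after redirects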

It would be, that is true. I want to do it even more efficiently, however (if possible, and it probably is). If I use get_headers, the headers would be requested twice, which will actually end up hurting performance on more light-weight sites. Splitting the request into two parts would be ideal.

Then you will have to use cURL, which should be way less efficient than the PHP function. This might be something you want to test.

It's using cURL anyways, no performance loss there.

I am sure that it is currently using cURL. You asserted that using cURL would make things slower (it may, it may not, I have no idea). I responded that I am using cURL anyways and that using more of its functions would not slow it down further, not that I am sure cURL is more performant than a given alternative.

Oh, OK, a misunderstanding. I thought you said that get_headers($url) uses cURL internally.

I think it should be way faster to use just get_headers() and run cURL only when needed.

As I said, it will certainly be faster for some cases, but for well-built websites it will be slower because it's going to send more requests than necessary.

I define a well-built website as one that has no broken links and no over-sized pages. I wouldn't want to punish websites that do it right in favor of websites that have plenty of broken links and such.

I am, however, having a hard time finding resources explaining how to achieve what I am looking for, so I may very well end up going with the alternative.

Are you currently crawling CSS, JS, and all other files for URLs? Using headers should eliminate going through unnecessary files when you know the media types in advance, and I believe there are more useless than useful files on an average website.

Unless someone links a CSS/JS/whatever file with an a[href], it wouldn't even look at it.

CSS and JS should not be directly linked, but there are often various downloads, like PDFs, that are href-linked.

Yes, they are href-linked and can optionally be indexed since #40. Images, however, are a better example of wasteful downloads.

Could always do an is_html() function that uses get_headers() and then retrieves the full page only if necessary.
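For illustration only, a sketch of such a helper, assuming the HEAD-request approach from earlier (is_html() and its exact behavior are hypothetical, not part of the script):

// Hypothetical helper: check the Content-Type header before downloading the body.
function is_html($url) {
    $context = stream_context_create(['http' => ['method' => 'HEAD']]);
    $headers = get_headers($url, true, $context);
    if ($headers === false) {
        return false; // request failed
    }
    $type = $headers['Content-Type'] ?? '';
    if (is_array($type)) {
        $type = end($type); // after redirects the header may repeat; keep the last value
    }
    return stripos($type, 'text/html') === 0;
}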

The method I was initially proposing would be done with something like this:

// cURL passes each chunk of downloaded body data to my_callback().
curl_setopt($ch, CURLOPT_WRITEFUNCTION, "my_callback");
// Returning anything other than strlen($data) aborts the transfer, so
// unwanted content can be rejected as soon as the first chunk arrives.
function my_callback($ch, $data) {
    return is_valid($data) ? strlen($data) : -1; // is_valid() is a placeholder check
}

It is not exactly what I was looking for, but it works for all intents and purposes. The way it works can be read in detail here, but in short, it can abort a cURL operation as soon as some data has been received.

This doesn't sound amazingly efficient to me, and as such I feel compelled to actually test how well it stacks up against just getting the headers and optionally making the full request afterwards, as @2globalnomads was proposing.

I lied. The better way to do this is with CURLOPT_HEADERFUNCTION.
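A rough sketch of what that could look like, assuming a check on the Content-Type header; this is not the branch's actual code, just one way to wire the callback up:

// cURL calls the header callback once per response header line,
// before any of the body is downloaded.
function header_callback($ch, $header) {
    // Abort non-HTML responses: returning -1 makes cURL treat it as a write error.
    if (stripos($header, 'Content-Type:') === 0 && stripos($header, 'text/html') === false) {
        return -1;
    }
    return strlen($header); // keep the transfer going
}
curl_setopt($ch, CURLOPT_HEADERFUNCTION, "header_callback");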

I created a branch to test that implementation; it's not the cleanest thing in the world, but it works. https://github.com/knyzorg/Sitemap-Generator-Crawler/tree/single-request

The performance improvements are predictably obvious with websites that have a lot of PDFs such as this one: http://rolf-herbold.de

The effect is ideal: performance for standard sites is unaffected, while the performance of sites like the above is improved (how much depends on connection speed).

On a side note, I'm actually working on a multi-threaded(?) version of your crawler via curl_multi_exec.

I have some non-database scraped sites that I crawl to build up the site, and single-threaded crawling takes forever.
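For context, parallel fetching with curl_multi generally looks something like the sketch below; this is a generic pattern, not the work-in-progress code, and $urls is an assumed list of pages to fetch:

// Generic curl_multi pattern: run several transfers concurrently.
$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}
do {
    curl_multi_exec($mh, $running); // drive all transfers
    curl_multi_select($mh);         // wait for activity instead of busy-looping
} while ($running > 0);
foreach ($handles as $ch) {
    $html = curl_multi_getcontent($ch);
    // ... extract links from $html here ...
    curl_multi_remove_handle($mh, $ch);
}
curl_multi_close($mh);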

What is this curl_multi_exec version you speak of?

After more testing: what I did is definitely not merge-able as-is. On some systems, handling a failed cURL transfer is more expensive than just downloading the file.