duzun/hQuery.php

Scrape in background

xmadscientist opened this issue · 5 comments

I love what you've done with this! I was wondering if there was any way to have hQuery queue up as a background process. I built a tool with this API, but while it's scraping, no other pages on my local server will load until it is completely finished. Is there some sort of functionality for this?

duzun commented

This question is not directly related to the functionality of hQuery, but the answer is yes, there are ways.

The simplest solution is to have a cron job that calls a PHP script via CLI every minute or so to do the scraping.
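
For example, a minimal crontab entry could look like this (the script path and log location are hypothetical, adjust them to your setup):

* * * * * php /path/to/scrape.php >> /var/log/scrape.log 2>&1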

Another solution is to run PHP as a daemon like so:
nohup php scrape.php &
Inside scrape.php you would probably have something like:

while (true) {
    $url = getNextUrlFromDb(); // fetch the next queued URL, if any
    if ($url) {
        myScrapeMethod($url);  // scrape it
    }
    sleep(1);                  // throttle the loop a bit
}
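
For illustration, myScrapeMethod() could be a thin wrapper around hQuery, something like this sketch (getNextUrlFromDb() and saveLink() remain placeholders for your own queue and storage code):

require_once 'vendor/autoload.php';

use duzun\hQuery;

function myScrapeMethod($url) {
    $doc = hQuery::fromUrl($url, ['Accept' => 'text/html']);
    if (!$doc) return;                    // request failed, skip this URL
    foreach ($doc->find('a') ?: [] as $a) {
        saveLink($a->attr('href'));       // placeholder: store whatever you extract
    }
}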

There are other ways too.

But in any case, the scraper should not interfere with your webapp; if it does, that is a sign of a bad PHP + webserver configuration or of bad application design.
Are you calling session_start() in your script? If so, PHP's default file-based session handler locks the session file for the duration of the request, so while one request is running, the next request from the same browser is blocked until the first one finishes.
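
You can see the effect with a trivial test (a sketch, no hQuery involved):

// slow.php
session_start(); // acquires the lock on this session's file
sleep(10);       // holds the lock for 10 seconds
echo 'done';

Open slow.php in two tabs of the same browser: the second tab will hang until the first one finishes, because it waits on the same session lock.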

Yes, I am calling session_start(). Thing is, I need the scraper to run under a very specific set of conditions. Would a cron job still be viable for this?

duzun commented

session_start() by default relies on cookies to work.
A cron job runs PHP CLI, which is not in an HTTP context, thus there are no cookies.
Unless you have custom session handlers defined that can run in CLI mode and still produce valid sessions, you can't use session_start(), or at least there is no point in calling it.
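
A minimal illustration of why calling it from CLI is pointless (a sketch, assuming default session settings):

// cli_session.php, run as: php cli_session.php
session_start();
echo session_id(), PHP_EOL; // a brand-new id on every run: no cookie ever comes back,
                            // so this session will never be resumed by a browser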

Maybe you should revise your algorithm?

Thanks for that info. I have a user system that uses hQuery to scrape some data when the user needs it. Perhaps I will use a cron job to just run it periodically for all users in the system.

duzun commented

Psssst!
Here is a trick that could help in case you run PHP through FastCGI (e.g. nginx + php-fpm):

session_start();
// do stuff that requires the session...
session_write_close();    // save $_SESSION and release the session lock
fastcgi_finish_request(); // the user receives the HTTP response after this call, but PHP keeps running in background

// crawl here, after the session is closed and the user has received the response
$doc = hQuery::fromUrl('https://example.com/');
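
A couple of caveats: fastcgi_finish_request() exists only under PHP-FPM, and a long crawl can still hit max_execution_time, so a more defensive version of the tail end could look like this:

session_write_close();
if (function_exists('fastcgi_finish_request')) {
    fastcgi_finish_request(); // PHP-FPM only: flush the response, keep running
}
ignore_user_abort(true); // keep running even if the client disconnects
set_time_limit(0);       // don't let a long crawl be killed by max_execution_time

$doc = hQuery::fromUrl('https://example.com/');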