Testing how a spider scrapes a given HTML file
seb-jones opened this issue · 5 comments
Hello there,
Just a question. Is there a simple way to feature test a spider by giving it some HTML and inspecting what it returns, e.g. making assertions against what would be returned by collectSpider?
Many thanks
Seb
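Roughly, I'm imagining something like this (just a sketch; MySpider, the item count, and the 'title' field are placeholders, and I'm assuming scraped items expose a get() accessor):

use RoachPHP\Roach;

it('extracts the expected items', function () {
    // Sketch only: MySpider and the asserted values are placeholders.
    $items = Roach::collectSpider(MySpider::class);

    expect($items)->toHaveCount(1);
    expect($items[0]->get('title'))->toBe('Some Expected Title');
});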
I'm afraid there isn't a nice way to do this at the moment, but it's something I will probably add in the future.
Cool cool, thanks for the response :)
For what it's worth, I've managed to implement a fairly simple, albeit inelegant, way to do these kinds of tests in the meantime. It works by firing up a PHP dev server and pointing the spider at that URL by overriding the startUrls. Thought I'd share the code here in case it's useful to anyone:
// Imports from roach-php/core; beforeAll/it/afterAll are Pest helpers.
use RoachPHP\Roach;
use RoachPHP\Spider\Configuration\Overrides;

$serverProcess = null;

beforeAll(function () {
    global $serverProcess;

    // Serve the HTML fixtures with PHP's built-in dev server.
    $pipes = [];
    $serverProcess = proc_open('cd resources/html && php -S localhost:8123', [], $pipes);

    // Give the server a moment to start accepting connections.
    usleep(250_000);
});

it('scrapes an html page', function () {
    $scrapedItems = Roach::collectSpider(
        MySpider::class,
        new Overrides(startUrls: ['http://localhost:8123']),
    );

    // do some assertions on $scrapedItems
});

afterAll(function () {
    global $serverProcess;
    proc_terminate($serverProcess);
});
The above assumes that there is an index.html file in resources/html.
I imagine there's probably a nicer way to do it, but this seems to be working right now.
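For reference, a minimal version of the MySpider class used above might look something like this (just a sketch; the selector and item fields depend on what's actually in index.html):

use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;

class MySpider extends BasicSpider
{
    // Overridden by the Overrides object in the test above.
    public array $startUrls = [];

    public function parse(Response $response): \Generator
    {
        // Sketch: pull the heading out of the fixture HTML.
        yield $this->item([
            'title' => $response->filter('h1')->text(),
        ]);
    }
}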
FYI, I've already started working on testing helpers for this. https://twitter.com/warsh33p/status/1543150150205538304
Shouldn't take too much longer.
Nice! I look forward to trying them out.