Is there a documented way to scrape Single Page Applications?
bilogic opened this issue · 1 comments
bilogic commented
SPAs usually pass on a CSRF token for use in subsequent requests. Is there a Roach way of scraping such sites?
ksassnowski commented
If the CSRF token is part of the page's source, then you can extract it like any other piece of information. You would then have to figure out how exactly the site expects the CSRF token to be sent with each subsequent request, for example as a header.
You can then set the header from within your spider before dispatching new requests: https://roach-php.dev/docs/processing-responses#returning-custom-requests
So, assuming the CSRF token exists in the page source like this

<meta name="csrfToken" content="...">

your parse method could look something like this
public function parse(Response $response): \Generator
{
    // do your scraping here...

    $csrfToken = $response->filter('meta[name="csrfToken"]')->attr('content');

    $request = new Request(
        'POST',
        'https://next-url-to-crawl.com',
        $this->parse(...),
        // Assuming the csrf token should get passed in the X-CSRF-Token header
        ['headers' => ['X-CSRF-Token' => $csrfToken]],
    );

    yield ParseResult::fromValue($request);
}