unclear how well non-interactive crawler stacks are supported

Question

Opened this issue 3 years ago · 6 comments

When looking at https://stateofscraping.org one can see that Scrapy and Curl are also tested. It is however unclear how to support such non-interactive stacks myself with this framework.

An assignment contains multiple plugins/pages, for which each page has the following interface:

https://github.com/ulixee/double-agent/blob/89c194335b6c0382ac4d1dce235898d242db0a02/collect/interfaces/ISessionPage.ts#L1-L7

waiting for something to happen is not possible in such stacks, since they simply download a resource and close the connection;
clicking on something is equally impossible. It could be mimicked by going to the same URL or triggering the same request, if that is somehow possible, but that URL cannot be derived from the selector alone;
the current runners seem to simply ignore requests whose isRedirect property is true. Why is this?

In a way this question is related to #58, but it really focuses on how we ourselves would implement such a non-interactive stack runner. It would help if the curl and Scrapy stacks were also implemented in this repo, given that they are already shown in those test results.

Answer 1 · 2022-03-09T21:00:22.000Z

By the way, a few days ago I tried simply ignoring these click/wait interaction tasks and just going to the URLs, so perhaps that's the way to go? It's not clear either way. It would be a nice addition to the documentation to explain why these steps are required and which tests cannot be supported if one does not complete these parts of the assignments.

Perhaps in my proposal to make the runners usable as a library, it would also make sense to be able to configure the capabilities of the stack (e.g. no interaction, no screen, no JS, etc.). For stacks that lack certain capabilities it is a bit silly to be handed tasks that they have no ability to complete.
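To make the capability idea concrete, here is a minimal sketch of what such a configuration could look like. Everything in it is hypothetical: `StackCapabilities`, `SessionPageLike` and `findUnsupportedSteps` are names made up for illustration and do not exist in double-agent. Only the page fields (`url`, `waitForElementSelector`, `clickElementSelector`) come from this thread.

```typescript
// Hypothetical capability flags for a stack; not part of double-agent.
interface StackCapabilities {
  canInteract: boolean; // can click or type on a page
  canWaitForElements: boolean; // can wait for a selector to appear
}

// Local stand-in with only the page fields discussed in this thread.
interface SessionPageLike {
  url: string;
  waitForElementSelector?: string;
  clickElementSelector?: string;
}

interface UnsupportedStep {
  url: string;
  step: 'wait' | 'click';
  selector: string;
}

// List the page steps that a stack with the given capabilities cannot perform.
function findUnsupportedSteps(
  pages: SessionPageLike[],
  capabilities: StackCapabilities,
): UnsupportedStep[] {
  const unsupported: UnsupportedStep[] = [];
  for (const page of pages) {
    if (page.waitForElementSelector && !capabilities.canWaitForElements) {
      unsupported.push({ url: page.url, step: 'wait', selector: page.waitForElementSelector });
    }
    if (page.clickElementSelector && !capabilities.canInteract) {
      unsupported.push({ url: page.url, step: 'click', selector: page.clickElementSelector });
    }
  }
  return unsupported;
}
```

With something like this, an assignment could report up front which steps a curl-like stack will skip, instead of the runner silently ignoring them.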

Answer 2 · 2022-03-14T10:38:28.000Z

I made an example in my own repo with a curl implementation, as an easy to reproduce example. It makes use of some custom code and is an evolution of the interface part of PR #56.

If I run that runner, however, I get errors and most (if not all) of the data for the tests is missing, so it seems that stacks without the possibility of interaction (time/keyboard/mouse) aren't supported out of the box?

mac-os-10-11--chrome-89-0 IS MISSING browser-codecs
mac-os-10-11--chrome-89-0 IS MISSING browser-dom-environment
mac-os-10-11--chrome-89-0 IS MISSING browser-fingerprints
mac-os-10-11--chrome-89-0 IS MISSING http-assets
/my-fork-based-repo/double-agent/analyze/plugins/http-basic-cookies/lib/CheckGenerator.js:25
                throw new Error(`no cookies created for ${key}`);
                      ^

Error: no cookies created for http-SubDomainRedirect
    at CheckGenerator.extractChecks (/my-fork-based-repo/double-agent/analyze/plugins/http-basic-cookies/lib/CheckGenerator.js:25:23)
    at new CheckGenerator (/my-fork-based-repo/double-agent/analyze/plugins/http-basic-cookies/lib/CheckGenerator.js:13:14)
    at HttpCookies.runIndividual (/my-fork-based-repo/double-agent/analyze/plugins/http-basic-cookies/index.js:27:32)
    at Analyze.addIndividual (/my-fork-based-repo/double-agent/analyze/index.js:66:38)
    at analyzeAssignmentResults (/my-fork-based-repo/double-agent/runner/lib/analyzeAssignmentResults.js:40:31)
    at async configureTestAndAnalyzeStack (/my-fork-based-repo/stack-common/lib/stack.js:65:5)

Which makes me wonder how you got the results for curl mentioned in #58?

Runner Code:

import { IRunner, IRunnerFactory } from '@double-agent/runner/interfaces/runner';
import IAssignment from '@double-agent/collect-controller/interfaces/IAssignment';
import ISessionPage from '@double-agent/collect/interfaces/ISessionPage';

import util from 'util';
import { exec as execNonPromise } from 'child_process';
const exec = util.promisify(execNonPromise);

class CurlRunnerFactory implements IRunnerFactory {
  public runnerId(): string {
    return 'curl';
  }

  public async startFactory() {
    return;  // nothing to manage, we'll spawn on the fly
  }

  public async spawnRunner(assignment: IAssignment): Promise<IRunner> {
    return new CurlRunner(assignment.userAgentString);
  }

  public async stopFactory() {
    return;
  }
}

class CurlRunner implements IRunner {
  userAgentString: string;
  lastPage?: ISessionPage;

  constructor(userAgentString: string) {
    this.userAgentString = userAgentString;
  }

  public async run(assignment: IAssignment) {
    console.log('--------------------------------------');
    console.log('STARTING ', assignment.id, assignment.userAgentString);
    let counter = 0;
    try {
      for (const pages of Object.values(assignment.pagesByPlugin)) {
        counter = await this.runPluginPages(assignment, pages, counter);
      }
      console.log(`[%s.✔] FINISHED ${assignment.id}`, assignment.num);
    } catch (err) {
      console.log('[%s.x] Error on %s', assignment.num, this.lastPage?.url, err);
      process.exit(1); // non-zero exit code so callers can see that the run failed
    }
  }

  async runPluginPages(
    assignment: IAssignment,
    pages: ISessionPage[],
    counter: number,
  ) {
    let isFirst = true;
    let currentPageUrl: string | undefined;
    for (const page of pages) {
      this.lastPage = page;
      const step = `[${assignment.num}.${counter}]`;
      if (page.isRedirect) continue;
      if (isFirst || page.url !== currentPageUrl) {
        console.log('%s GOTO -- %s', step, page.url);
        const statusCode = await fetchResource(page.url, this.userAgentString);
        if (statusCode >= 400) {
          console.error(`${statusCode}, url: ${page.url}`);
          continue;
        }
        // Remember the fetched URL; without this the comparison above is always true.
        currentPageUrl = page.url;
      }
      isFirst = false;

      if (page.waitForElementSelector) {
        console.log('%s waitForElementSelector -- %s: Ignore no support by curl', step, page.waitForElementSelector);
      }

      if (page.clickElementSelector) {
        console.log('%s Wait for clickElementSelector -- %s: Ignore no support by curl', step, page.clickElementSelector);
      }
      counter += 1;
    }

    return counter;
  }

  async stop() {
    return;
  }
}

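async function fetchResource(url: string, userAgentString: string): Promise<number> {
  // Caution: url and userAgentString are interpolated into a shell command unescaped,
  // so a single quote in either value breaks the command.
  const { stdout } = await exec(`curl -k -s -o /dev/null -w "%{http_code}" -H 'user-agent: ${userAgentString}' -XGET '${url}'`);
  return parseInt(stdout.trim(), 10);
}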

export { CurlRunnerFactory };
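The fetchResource helper above builds a shell command by string interpolation, which breaks as soon as a URL or user agent contains a single quote. A possible variant, sketched here under the assumption that the `curl` binary is on the PATH, passes every value as its own argument through `execFile` so that no shell is involved. The names `buildCurlArgs` and `fetchStatusCode` are made up for this sketch.

```typescript
import { execFile } from 'child_process';
import { promisify } from 'util';

const execFileAsync = promisify(execFile);

// Build the curl argument list. Each value is a separate argv entry,
// so quotes or spaces in the URL or user agent need no escaping.
function buildCurlArgs(url: string, userAgentString: string): string[] {
  return [
    '-k', // skip TLS certificate verification, as in the runner above
    '-s', // silent: no progress output
    '-o', '/dev/null', // discard the response body
    '-w', '%{http_code}', // print only the HTTP status code
    '-H', `user-agent: ${userAgentString}`,
    '--url', url,
  ];
}

// Fetch a URL with curl and resolve with the HTTP status code it reports.
async function fetchStatusCode(url: string, userAgentString: string): Promise<number> {
  const { stdout } = await execFileAsync('curl', buildCurlArgs(url, userAgentString));
  return parseInt(stdout.trim(), 10);
}
```

This keeps the same curl flags as the runner, apart from dropping the redundant -XGET and passing the URL through --url.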

Answer 3 · 2022-03-14T10:39:33.000Z

Related to this, I would also suggest not failing the analyze code on such errors, but instead treating them as a failed test, giving it a score of 0 on that test, as that is essentially what it boils down to?
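As a rough illustration of that suggestion, the sketch below wraps a per-plugin analysis step so that a thrown error becomes a zero score plus a recorded message instead of aborting the whole run. It is not how double-agent is structured: `PluginOutcome`, `runPluginSafely` and the idea of a single numeric score per plugin are assumptions made for the example.

```typescript
// Hypothetical outcome record; double-agent's real result types may differ.
interface PluginOutcome {
  pluginId: string;
  score: number;
  error?: string; // set when the plugin could not be analyzed
}

// Run one plugin's analysis and convert any thrown error into a zero score.
function runPluginSafely(pluginId: string, analyze: () => number): PluginOutcome {
  try {
    return { pluginId, score: analyze() };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { pluginId, score: 0, error: message };
  }
}
```

The recorded message keeps the reason visible (for example the "no cookies created" error above) while the remaining plugins still get analyzed.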

Answer 4 · 2022-03-14T14:35:22.000Z

@GlenDC I'm not sure the best way to give some overall info to your PRs here.

In version one of DoubleAgent, we had one big combined repo that had:

Scraper Report
Tests
Dockers
"Runners" for various engines
All the export profiles

This proved to just be WAY too confusing to come into (as evidenced by @calebjclark trying to add some stuff into it).

We also started thinking about how to create results that a normal human being could reason through. The old scraper report made it hard to understand what was actually wrong when you failed a test. A lot of @calebjclark's work was translating our results into something that looked like pseudo-code on the new website design.

After the re-organization, we ended up with:

Scraper Report is in its own repo. The various scraper engine "runners" are in this repo. The point of the repo is to demonstrate how normal "stacks" could be detected, and "why". This repo is private because it didn't get finished. Caleb got really far on it, but decided that we needed to move temporarily to other things before we came back to this. Dockers are also in this repo.

This brings us back to your original question - below is the CURL implementation in that repo. NOTE: this hasn't been updated/run in a while. I'm not sure how well it currently runs.

forEachAssignment({ scraperFrameworkId }, async assignment => {
  const curl = new Curl();
  curl.setOpt('USERAGENT', assignment.useragent);
  curl.setOpt('SSL_VERIFYPEER', 0);
  curl.setOpt('COOKIEJAR', __dirname + '/cookiejar.txt');
  curl.setOpt('COOKIESESSION', 1);
  curl.setOpt('FOLLOWLOCATION', 1);
  curl.setOpt('AUTOREFERER', 1);

  for (const pages of Object.values(assignment.pagesByPlugin)) {
    for (const page of pages) {
      console.log(page);
      if (curl.getInfo(Curl.info.EFFECTIVE_URL) !== page.url) {
        try {
          console.log('GET ', page.url);
          await httpGet(curl, page.url);
        } catch (error) {
          console.log(`ERROR getting page.url: ${page.url}`, error);
          throw error;
        }
      }
      if (page.clickDestinationUrl) {
        try {
          console.log('GET click dest', page.clickDestinationUrl);
          await httpGet(curl, page.clickDestinationUrl);
        } catch (error) {
          console.log(`ERROR getting page.clickDestinationUrl: ${page.clickDestinationUrl}`, error);
          throw error;
        }
      }
    }
  }
  curl.close();
}).catch(console.log);

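async function httpGet(curl: Curl, url: string) {
  curl.setOpt('URL', url);
  const finished = new Promise<void>((resolve, reject) => {
    // Detach both handlers once either fires, so repeated calls on the same
    // handle do not accumulate listeners.
    const onEnd = () => {
      curl.off('error', onError);
      resolve();
    };
    const onError = (error: Error) => {
      curl.off('end', onEnd);
      reject(error);
    };
    curl.once('end', onEnd);
    curl.once('error', onError);
  });
  curl.perform();
  await finished;
}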

Moving to other repos:

Slab: the data files for the raw profiles got moved to another repo because they became too big and were polluting all the check-ins (not the "probes") and it can be recreated many different ways (ie, buy machines, hire mechanical turks, etc). Our current approach is using BrowserStack, but we might change that in the future (very likely). So this is a private implementation of generating profiles (which is basically just using a runner that uses selenium across a ton of browser/OS combos).
DoubleAgent: this remained the test suite itself. We have flows for running the tests against existing probes, but ideally you could also generate a profile easily using your own browser and clicking manually through all the results. That would give you a golden record that you can test your stack against.

Hopefully this helps give some background. Back to your questions:

An HTTP stack has a lot of ways it can be detected. CURL requests can be picked up all along the stack - TLS, TCP, the cookies you use, how you handle redirects, whether you load assets, how you "pretend" to submit a form or click a link, etc.
isRedirect is a way you can know that a link needs a follow-on link, and gives you the info you could use to properly set your headers, cookies, referers, etc -- all of which change based on how you got to a url and from which origin.
The analyze bug does seem to be a problem that may have cropped up since we were last working on non-javascript engines like CURL.
I don't know how much we want to move back into DoubleAgent from ScraperReport. But Scraper Report is also 3/4 finished, so I'm unsure the best path forward. Maybe @calebjclark can weigh in here.
If we make DoubleAgent a library you can embed in tests, we're going to need to try to mimic the website's way to show you pseudo-code of how the result looks vs how it "should have looked". This also calls into question "where" this code should live. I think we'd have to port some of the result comparisons back into DoubleAgent, but I think they're mostly built around html snippets at the moment.
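To illustrate the point about isRedirect and referers, here is a small sketch of how a non-interactive runner might turn a plugin's pages into a list of requests, each with the Referer it would carry. It rests on assumptions that are not confirmed anywhere in this thread: that a page flagged isRedirect is reached by the HTTP client following the previous response's redirect (as with FOLLOWLOCATION above), and that clickDestinationUrl is the URL a click would have navigated to. `SessionPageLike`, `PlannedRequest` and `planRequests` are names invented for the example.

```typescript
// Local stand-in with only the page fields used in this thread.
interface SessionPageLike {
  url: string;
  isRedirect?: boolean;
  clickDestinationUrl?: string;
}

// Hypothetical request description for a non-interactive client.
interface PlannedRequest {
  url: string;
  referer?: string; // URL the client was on when it "navigated" here
}

// Turn one plugin's pages into explicit requests, tracking the current URL
// so that each request can carry a plausible Referer.
function planRequests(pages: SessionPageLike[]): PlannedRequest[] {
  const plan: PlannedRequest[] = [];
  let currentUrl: string | undefined;
  for (const page of pages) {
    if (page.isRedirect) {
      // Assumption: already reached by following the previous redirect.
      currentUrl = page.url;
      continue;
    }
    if (page.url !== currentUrl) {
      plan.push({ url: page.url, referer: currentUrl });
      currentUrl = page.url;
    }
    if (page.clickDestinationUrl) {
      // Mimic the click by requesting its destination from the current page.
      plan.push({ url: page.clickDestinationUrl, referer: currentUrl });
      currentUrl = page.clickDestinationUrl;
    }
  }
  return plan;
}
```

A runner could then replay the plan with curl, setting the Referer header from each entry rather than relying on AUTOREFERER alone.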

Answer 5 · 2022-03-14T15:51:52.000Z

If you like, I do not mind helping with the design and development of that.

In my opinion you're already pretty close to allowing people to run it against their own stacks. Either way, that's a bit out of scope here, as I handled that part in issue #59. There I also showed how I did it, with minimal work. Your repo was pretty much there. I do not expect, nor do I think it is realistic given the time budget constraints, that it all has to be super shiny and fancy. The goal was simply to avoid having to modify the double-agent code in order to make it testable against one's own stack, and to ensure one doesn't have to pull in dependencies from the stacks implemented by Double-Agent as examples.

As far as I'm concerned that part is done, except for the part where we would have to find some alignment, if that is possible at all. Then it would just be about documenting some bits and getting to work on the results.

Furthermore, I am certainly not looking for fancy error reporting. Whether something looks like pseudocode or is just very verbose output files, I honestly do not mind. Again, I also do not mind contributing to that part; we just need to find a way to work together, if you fancy that idea.

At this stage of my fork of double-agent (and honestly the code changes are pretty minimal I would think) all that would still be required is the ability for one to generate their own assignments and analyse them. Once that is done the repo is as flexible as one can hope for, while still being useable out of the box as it is today for the example stacks :)

Answer 6 · 2022-03-14T15:53:22.000Z

By the way, I'm really looking for something similar to State of Scraping, but automated and in whatever format (I do not mind that part), to figure out which checks fail and which succeed for each of the individual layers and categories. That, and the ability to plug in some custom ones where desired.

I can contribute dev time to this, as well as ideas. My hope was that I could achieve that with double agent, but the reports that come out of the analyser do not tell me much, if anything at all.