cloudfour/lighthouse-parade

add option to limit number of pages crawled?

Opened this issue · 8 comments

Hey, this is a cool tool. Here's a feature idea, assuming you're open to it.

Currently one can use:

--max-crawl-depth 2

to get, say, the index page plus the pages linked from it.

But maybe there are a lot of linked pages, and you just need a representative sample: more than one page, but not tons of them.

So maybe another option like:

--max-crawled-pages N # or just --max-pages ?

and the crawler stops after it exhausts all pages allowed by the other options, or after N pages, whichever comes first.

One might then use it like:

lighthouse-parade --max-crawl-depth 2 --max-crawled-pages 20 example.com
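To make the proposal concrete, here is a minimal sketch (not the actual lighthouse-parade crawler) of how a page cap could combine with the existing depth limit. The `crawl` and `fetchLinks` names are hypothetical stand-ins for illustration only.

```javascript
// Sketch: breadth-first crawl bounded by both a depth limit and a page cap.
// `fetchLinks(url)` is a stand-in for the real crawler's link extraction.
function crawl(startUrl, fetchLinks, { maxDepth = Infinity, maxPages = Infinity } = {}) {
  const visited = new Set();
  const queue = [{ url: startUrl, depth: 1 }];
  while (queue.length > 0 && visited.size < maxPages) {
    const { url, depth } = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url); // this page would be handed to Lighthouse
    if (depth < maxDepth) {
      for (const link of fetchLinks(url)) {
        if (!visited.has(link)) queue.push({ url: link, depth: depth + 1 });
      }
    }
  }
  return [...visited];
}
```

With a toy link graph, `crawl('a', fetchLinks, { maxDepth: 2, maxPages: 3 })` stops after three pages even though depth 2 would otherwise allow more.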

Hmm, this is an interesting suggestion! One concern is that the pages that are crawled would be nondeterministic, i.e. if you ran lighthouse-parade twice with the same flags it could crawl a different set of pages because of pages loading at different speeds, throttling, etc. The "first n pages" is not necessarily a representative sample of all the pages on the site. Do you have a suggestion of how to make the crawled pages more deterministic & representative of the whole site?

Interesting challenge; that hadn't occurred to me.

For my use, I'd only ever want the page I point it at, plus some (and ideally yes, always the same) set of linked pages from that page. So like, index page plus 20 pages linked off it. In that case, it seems like it would always be deterministic to the extent the page itself isn't changing.

Given linked_page_limit = --max-linked-pages N (or --max-leaf/outer-pages N):

  1. Fetch the index page.
  2. Let all_index_links be an array of all links on the index page (with duplicates removed).
  3. Let index_links be the first linked_page_limit items in all_index_links.
  4. Fetch all pages in index_links.
  5. Run Lighthouse on the array [index page, ...index_links].
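Steps 2–3 above can be sketched in a few lines. This is a hypothetical helper, not part of lighthouse-parade; the determinism comes from keeping the links in document order and deduplicating with first-seen-wins.

```javascript
// Sketch: deduplicate the index page's links (preserving document order)
// and take the first `limit` of them. A Set keeps insertion order, so the
// same page contents always yield the same sample.
function selectIndexLinks(allIndexLinks, limit) {
  const unique = [...new Set(allIndexLinks)];
  return unique.slice(0, limit);
}
```

For example, `selectIndexLinks(['/a', '/b', '/a', '/c'], 2)` always picks `['/a', '/b']`.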

I could see the complexity increasing if this were used with a max crawl depth of more than 1, so maybe they're just mutually exclusive options? One would either specify to crawl some fixed depth entirely, or use this "index page plus N children" mode.

OTOH, yeah, if it is going to add too much complexity, perhaps it's not necessary just for my use case. (Maybe if I could just run lighthouse-parade multiple times, specifying a single page each time, and have those results all lumped into the same CSV/report, that could do the trick?)

Just wanted to voice support for this. --max-crawl-depth 2 on one site gives me 50 pages; with --max-crawl-depth 3, I stopped it somewhere past 2k pages.

There should be somewhere in between.

@mgifford the new version (currently on the next branch) will support stopping the command with Ctrl-C once you have enough output. The results up to that point will all be saved, so you can stop it whenever you want.

Excellent! Happy to hear this.

@calebeby what is the best way to test with the next branch?

I'm currently running with:
npx lighthouse-parade https://www.example.com ./lighthouse-parade-data --max-crawl-depth 3

Hi @mgifford! I published a beta on the next tag on npm: https://www.npmjs.com/package/lighthouse-parade?activeTab=versions. You can install it with npm i -g lighthouse-parade@next, or run it through npx like this: npx lighthouse-parade@next https://www.example.com/ ./lighthouse-parade-data --max-crawl-depth 3. I have been meaning to finalize the release for quite a while now, but have been super busy with school.

Let me know if you run into anything else!

That's great. You might want to add that to the README at https://github.com/cloudfour/lighthouse-parade

Thanks!