cloudfour/lighthouse-parade

Support filtering / limiting scope of URLs

Closed this issue · 8 comments

Andy D:

One thing I didn't see in the docs is whether it was possible to limit the depth or number of pages in the crawl - on some sites (retailers / publishers) I could see the crawl size getting pretty large

Others have asked about maybe some kind of flag to filter URLs. All seem to be thinking about the same use case: more efficiently analyzing chunks of a big site.

I like this idea! Should URLs passed via the flag be excluded both from crawling and lighthouse-ing? Or just from lighthouse-ing? i.e. if a URL is excluded, should the crawler discover pages that are linked from the excluded page? @emersonthis

Great question. My gut is that we'd want to filter the reports but "crawl through" ineligible URLs. I passed this question along to two of Jason's performance colleagues who mentioned this idea. I'll update here whenever I hear their thoughts.

In theory we could also support either behavior. Maybe with two different flags? As a product designer, I usually discourage punting decisions like this to the user, but there might be two valid use cases here, and I suspect the resulting implementation wouldn't be meaningfully more complicated either way.
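
To make the two options concrete, here's a rough sketch (every name in it is made up for illustration; none of this is lighthouse-parade's actual API):

```js
// Sketch of the two behaviors being discussed. All names are hypothetical.
const excludePatterns = [/\/fr\//];
const isExcluded = (url) => excludePatterns.some((pattern) => pattern.test(url));

// Decide what to do with a discovered page under each behavior.
function visit(url, { crawlThroughExcluded }) {
  const actions = { followLinks: true, runLighthouse: true };
  if (isExcluded(url)) {
    actions.runLighthouse = false;              // excluded URLs never get a report
    actions.followLinks = crawlThroughExcluded; // second behavior: still follow links on them
  }
  return actions;
}

visit("https://example.com/fr/about", { crawlThroughExcluded: true });
// => { followLinks: true, runLighthouse: false }
```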

Just discovered this tool and tried the spreadsheet, and it is so nice to have such a handy solution easily available, thank you!

I think the filtering would be a great enhancement!

One scenario where this could be useful would be when dealing with multiple translations/markets on a site without having the translation/market in the domain. If the pages are the same across all the languages except for the text content, you might want to ignore all the translated pages. So for instance test shop.com/* but exclude shop.com/fr/, shop.com/de/ etc.

Ignoring languages could of course miss potential font performance problems in a specific language. Maybe that could be fixed by supporting patterns in include/exclude paths, so you could still include and test the main entry point for shop.com/fr/ and shop.com/de but avoid crawling deeper into them.
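
As a sketch of what that could look like with plain regexes (the patterns and URLs here are only illustrative), an exclude pattern that matches anything below the translated sections, but not the section roots themselves, would keep the entry points in the report:

```js
// Hypothetical exclude patterns: skip everything below the translated sections
// but keep their landing pages, so the entry points still get tested.
const excludePatterns = [/^https:\/\/shop\.com\/(fr|de)\/.+$/];

const shouldSkip = (url) => excludePatterns.some((pattern) => pattern.test(url));

shouldSkip("https://shop.com/products/shoes"); // false (crawled and tested)
shouldSkip("https://shop.com/fr/");            // false (entry point still tested)
shouldSkip("https://shop.com/fr/produits/");   // true  (excluded)
```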

I'd love some kind of filter. My off-the-cuff idea for doing this would be to support two things:

First, a depth option (--depth 1 would crawl example.com and example.com/*/ but no deeper).

Second, I'd also like to be able to crawl starting from a given directory, e.g. https://example.com/events/ would crawl all pages in /events, including any pages in subdirectories of /events. So it would crawl https://example.com/events/January.html and also all documents in https://example.com/events/January/.
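
As a rough sketch of what I mean (none of this exists yet; depth would just count path segments below the starting URL):

```js
// Illustration of the proposed depth limit: depth counts path segments below
// the starting URL, and anything outside that directory is out of scope.
const depthFrom = (startUrl, url) => {
  const base = new URL(startUrl).pathname.replace(/\/$/, "");
  const path = new URL(url).pathname.replace(/\/$/, "");
  if (!path.startsWith(base)) return Infinity; // outside the starting directory
  return path.slice(base.length).split("/").filter(Boolean).length;
};

depthFrom("https://example.com/", "https://example.com/events/");                    // 1
depthFrom("https://example.com/events/", "https://example.com/events/January.html"); // 1
depthFrom("https://example.com/events/", "https://example.com/blog/");               // Infinity
```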

Yup, totally have the same use case! We have multiple brands under the same top-level domain; it would be nice to be able to choose to start from a given directory.

Having 14 languages, I'd like to limit the crawler too. It could also be nice to be able to limit more than just depth, like N pages per URL level.

@calebeby Looks like simplecrawler already supports discoverRegex and maxDepth options, so supporting most of what's described above should be as simple as adding new option flags and passing them through to the crawler.
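
For example, something along these lines (the option values, flag names, and exclude regex below are just illustrative, not anything the tool ships with):

```js
// Sketch of passing crawl-scope options through to simplecrawler.
const Crawler = require("simplecrawler");

const crawler = new Crawler("https://example.com/");

// Would come from a hypothetical --max-depth flag.
crawler.maxDepth = 2;

// Would come from a hypothetical exclude flag: skip translated sections
// entirely (second callback argument: true to fetch, false to skip).
crawler.addFetchCondition((queueItem, referrerQueueItem, callback) => {
  callback(null, !/^\/(fr|de)\//.test(queueItem.path));
});

crawler.on("fetchcomplete", (queueItem) => {
  console.log("Discovered", queueItem.url);
});

crawler.start();
```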

This has been released in 1.1.0