Starting crawl from subdirectory

Question

Closed this issue 10 months ago · 4 comments

Details

When I run a command with --site https://site/subdirector on my Mac, everything works as I'd like: it starts with that page, doesn't find a sitemap file, and falls back to crawling from https://site/subdirector. On a Windows machine, however, crawling starts at the domain root, https://site.

Is there a configuration option that forces it to start at the subdirectory? I tried -include /subdirector/.* but that doesn't seem to do it. With that option set, it just hangs.

The debug output shows "GET /api/reports 200 object - 0ms" repeating over and over.

Mac:
Successfully connected to https://teamsideline.com/Layouts/minimalist/Home.aspx?d=ZHcj%2bsPHK5g%2bZkLyQaVo0Q%3d%3d/, status code: 200. unlighthouse 07:50:32

───────────────────────────────────────────────────╮
│ │
│ ⛵ unlighthouse cli @ v0.5.0 │
│ │
│ ▸ Scanning: https://teamsideline.com/Layouts/minimalist/Home.aspx?d=ZHcj%2bsPHK5g%2bZkLyQaVo0Q%3d%3d/ │
│ ▸ Route Discovery: Crawler

Windows:
Successfully connected to https://teamsideline.com/. (Status: 200). Unlighthouse 2:50:40 PM
─────────╮
│ │
│ ⛵ Unlighthouse cli @ v0.11.4 │
│ │
│ ▸ Scanning: https://teamsideline.com/ │
│ ▸ Route Discovery: Crawler

Answer 1 · 2024-03-06T23:07:51.000Z

I notice this works with unlighthouse@0.5.1 but not with 0.6.0 or later.

Answer 2 · 2024-03-06T23:43:59.000Z

--include-urls does not solve this issue. It hangs in the same way as described in the original report.

Answer 3 · 2024-03-07T00:45:38.000Z

Hi @Robanna777, thanks for the issue.

It seems this wasn't supported and only worked by accident in earlier versions. I've pushed up a fix for it; you can use it as follows:

npx unlighthouse@0.11.5 --site https://teamsideline.com/sites/apex/home

Let me know if you have any issues with it.
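
To sanity-check which pages a subdirectory-scoped crawl ought to cover, a path-prefix test along these lines may help. This is only a minimal sketch of the idea, not Unlighthouse's actual implementation: the function name isWithinStartPath and the example.com URLs are made up for illustration, and it assumes the start URL's path should be treated as a prefix.

```typescript
// Hypothetical helper, NOT part of Unlighthouse: returns true when a
// discovered link lives under the path of the URL passed to --site.
function isWithinStartPath(siteUrl: string, candidateUrl: string): boolean {
  const site = new URL(siteUrl);
  // Resolve relative links against the start URL.
  const candidate = new URL(candidateUrl, site);

  // Links on another origin are never part of the crawl scope.
  if (candidate.origin !== site.origin) {
    return false;
  }

  // Drop trailing slashes so "/sub/" and "/sub" behave the same.
  // For a root start URL ("/") the prefix becomes "", which matches every path.
  const prefix = site.pathname.replace(/\/+$/, "");
  const path = candidate.pathname;

  // Require an exact match or a "/" boundary so "/sub" does not match "/subway".
  return path === prefix || path.startsWith(prefix + "/");
}

// Illustrative URLs only.
console.log(isWithinStartPath("https://example.com/sub", "https://example.com/sub/page"));
console.log(isWithinStartPath("https://example.com/sub", "https://example.com/"));
```

Checking for a "/" boundary rather than a bare string prefix keeps sibling paths that merely share a name prefix out of scope.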

Answer 4 · 2024-03-07T14:35:01.000Z

That's awesome. Thank you. That works perfectly.