harlan-zw/unlighthouse

Starting crawl from subdirectory

Closed this issue · 4 comments

Details

When I run Unlighthouse with --site https://site/subdirector on my Mac, everything works as I'd like: it starts with that page, doesn't find a sitemap file, and falls back to crawling from https://site/subdirector. On a Windows machine, however, crawling starts at the domain root, https://site.

Is there a configuration option I can use to force it to start at the subdirectory? I tried -include /subdirector/.* but that doesn't seem to do it. With that option, it just hangs.

Debug output shows "GET /api/reports 200 object - 0ms" repeating over and over.
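For anyone reading along, here is a minimal sketch of the two behaviours described above: reducing the --site value to its origin (which would drop the subdirectory) versus keeping only URLs whose path matches an include pattern. This is not Unlighthouse's actual code; the helper names origin_of and in_scope are hypothetical, and the assumption that the pattern is matched against the URL path from its start is mine.

```python
import re
from urllib.parse import urlsplit


def origin_of(site: str) -> str:
    """Reduce a URL to scheme://host, discarding any path or query."""
    parts = urlsplit(site)
    return f"{parts.scheme}://{parts.netloc}"


def in_scope(url: str, include_pattern: str) -> bool:
    """Return True when the URL's path matches the pattern from its start.

    Assumption: the include pattern is a regular expression applied to the
    path only. The real tool may match differently.
    """
    return re.match(include_pattern, urlsplit(url).path) is not None


site = "https://site/subdirector"
pattern = r"/subdirector/.*"

# Starting from the origin loses the subdirectory entirely.
print(origin_of(site))

# Pages below the subdirectory match the pattern, pages elsewhere do not.
print(in_scope("https://site/subdirector/page", pattern))
print(in_scope("https://site/other", pattern))

# The start URL itself has no trailing slash, so this regex rejects it.
print(in_scope(site, pattern))
```

One thing the sketch makes visible: as a plain regular expression, /subdirector/.* requires a slash after the directory name, so a start URL written without the trailing slash would not match it. Whether that has anything to do with the hang is only a guess.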

Mac:
Successfully connected to https://teamsideline.com/Layouts/minimalist/Home.aspx?d=ZHcj%2bsPHK5g%2bZkLyQaVo0Q%3d%3d/, status code: 200. unlighthouse 07:50:32

⛵ unlighthouse cli @ v0.5.0
▸ Scanning: https://teamsideline.com/Layouts/minimalist/Home.aspx?d=ZHcj%2bsPHK5g%2bZkLyQaVo0Q%3d%3d/
▸ Route Discovery: Crawler

Windows:
Successfully connected to https://teamsideline.com/. (Status: 200). Unlighthouse 2:50:40 PM
⛵ Unlighthouse cli @ v0.11.4
▸ Scanning: https://teamsideline.com/
▸ Route Discovery: Crawler

I notice this works with unlighthouse@0.5.1 but not with 0.6.0 or later.

--include-urls does not solve this issue. It hangs the same as the original issue.

Hi @Robanna777, thanks for the issue.

Seems like this wasn't supported and only worked by accident in earlier versions. I've pushed up a fix for it, which you can use like this:

npx unlighthouse@0.11.5 --site https://teamsideline.com/sites/apex/home

Let me know if you have any issues with it.

That's awesome. Thank you. That works perfectly.