Scraper indexes folders, not just their files
ArthurFlag opened this issue · 2 comments
Hi,
My website is behind authentication, so to index it I start a local HTTP server (either HTTPD via the Docker image or Python's `http.server`).
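As a point of reference, here is a minimal sketch of serving a docs folder locally with Python's `http.server` and checking that a page comes back — the directory and test page are placeholders, not the real site:

```python
import functools
import http.server
import pathlib
import tempfile
import threading
import urllib.request

# Placeholder docs directory with a single page (stands in for the real site).
docs = pathlib.Path(tempfile.mkdtemp())
(docs / "index.html").write_text("<h1>Docs</h1>")

# Serve that directory on an OS-assigned port on localhost.
handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=str(docs))
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Fetch the page back, as the scraper would.
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/index.html") as resp:
    status = resp.status
print(status)  # 200 if the page is served correctly
server.shutdown()
```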
I have content folders that do not contain an index.html file, only other HTML files:
├── myapp
│ ├── catalog.html
│ ├── users.html
│ └── myapp.html
And somehow, certain folders themselves are indexed (along with their content, which is good) 🤔:
This happens with only 2 folders out of the dozens I'm scraping, and there is nothing different about these folders compared to the rest.
I'm using this basic config file:
{
  "index_name": "newstore-index",
  "start_urls": [
    {
      "url": "http://127.0.0.1/docs/"
    }
  ],
  "selectors": {
    "default": {
      "lvl0": "h1",
      "lvl1": "h2",
      "lvl2": "h3",
      "lvl3": "h4",
      "lvl4": "h5",
      "text": "p, li"
    }
  }
}
I'm not sure where the problem lies: could it be the server or the scraper?
The source of the problem could be either. I would highly recommend using a sitemap; you can find some details here, and it will make crawling straightforward.
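As a sketch of that suggestion, assuming the DocSearch scraper's config format supports a `sitemap_urls` key alongside `start_urls` (the sitemap URL below is a placeholder):

```json
{
  "index_name": "newstore-index",
  "sitemap_urls": ["http://127.0.0.1/docs/sitemap.xml"],
  "start_urls": [
    { "url": "http://127.0.0.1/docs/" }
  ]
}
```

With a sitemap the scraper discovers pages from an explicit list instead of relying on the server's directory listings, which sidesteps the folder-indexing behavior.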
Some points to check:
- Make sure the missing pages are referenced from another crawled page via a hyperlink (an `<a>` tag).
- Make sure the missing pages are served with a `200` HTTP status.
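The first check can be scripted. Here is a small helper (hypothetical, standard library only) that collects the `<a href>` targets from a page, so you can confirm each missing file is linked from somewhere:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags on a page, to verify that
    every page the scraper should find is reachable via a hyperlink."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Sample HTML standing in for a crawled page (file names from the issue).
html = '<ul><li><a href="catalog.html">Catalog</a></li>' \
       '<li><a href="users.html">Users</a></li></ul>'
parser = LinkCollector()
parser.feed(html)
print(parser.links)  # ['catalog.html', 'users.html']
```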
If you want to investigate further, you can look under the hood and focus on Scrapy's logs by switching this parameter to `DEBUG`. You will then need to run the crawler from the source code.
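If I understand the suggestion correctly, this corresponds to Scrapy's standard `LOG_LEVEL` setting, so in the crawler's settings the change would look roughly like this (a sketch, not the scraper's actual file layout):

```python
# Scrapy settings fragment (sketch): raise log verbosity so every
# request, response, and extracted link is printed during the crawl.
LOG_LEVEL = "DEBUG"
```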
Closing this issue since it is related to a personal setup that does not involve only the scraper.
Great, thanks a lot (again) 👍