redirection to a forbidden domain happened without slash suffix character in the web crawler
msmygit opened this issue · 1 comments
msmygit commented
Setup
% langstream -V
LangStream CLI 0.5.0 (8162f382)
Web crawler configuration
pipeline:
  - name: "Crawl the WebSite"
    type: "webcrawler-source"
    configuration:
      seed-urls:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11"
      allowed-domains:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11"
      forbidden-paths: []
      ...
When we execute the command below,
langstream docker run test -app examples/docker-chatbot -s ./secrets.yaml
we get the following warning in the log and the page is not crawled:
15:23:56.896 [crawler-webcrawler-source-1-runner-465eeb4a-f140-4b8d-b683-ee51ee76f401] INFO a.l.a.webcrawler.WebCrawlerSource -- The last cycle didn't produce any new documents
15:23:56.896 [crawler-webcrawler-source-1-runner-465eeb4a-f140-4b8d-b683-ee51ee76f401] INFO a.l.a.webcrawler.crawler.WebCrawler -- Crawling url: https://aws.amazon.com/about-aws/whats-new/2023/11
15:23:57.086 [crawler-webcrawler-source-1-runner-465eeb4a-f140-4b8d-b683-ee51ee76f401] WARN a.l.a.webcrawler.crawler.WebCrawler -- A redirection to a forbidden domain happened (from https://aws.amazon.com/about-aws/whats-new/2023/11 to /about-aws/whats-new/2023/11/)
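Note that the redirect target in the warning is a relative path (/about-aws/whats-new/2023/11/), not an absolute URL. The sketch below is only an illustration of why that could matter. It assumes the crawler compares the redirect target against allowed-domains with a simple prefix match, which is a guess and not taken from the LangStream source; is_allowed is a hypothetical helper written for this example.

```python
from urllib.parse import urljoin


def is_allowed(url, allowed_prefixes):
    """Hypothetical prefix check; an assumption, not LangStream's actual code."""
    return any(url.startswith(prefix) for prefix in allowed_prefixes)


# Values copied from the configuration and the log line above.
requested = "https://aws.amazon.com/about-aws/whats-new/2023/11"
location = "/about-aws/whats-new/2023/11/"  # redirect target as printed in the log
allowed = ["https://aws.amazon.com/about-aws/whats-new/2023/11"]

# Resolve the relative redirect target against the URL that was requested.
resolved = urljoin(requested, location)

raw_ok = is_allowed(location, allowed)       # relative path checked as-is
resolved_ok = is_allowed(resolved, allowed)  # absolute URL checked after resolving

print(resolved)
print(raw_ok, resolved_ok)
```

Under that assumption, the raw relative path never matches an allowed-domains entry, while the resolved absolute URL does. If the crawler behaves this way, resolving the redirect target against the requested URL before the check would avoid the warning without changing the configuration.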
Workaround
Adding a trailing slash (/) to the seed-urls and allowed-domains entries fixed the error.
pipeline:
  - name: "Crawl the WebSite"
    type: "webcrawler-source"
    configuration:
      seed-urls:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11/"
      allowed-domains:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11/"
      forbidden-paths: []
      ...
eolivelli commented
Happy that you have found a solution.
I have one question: do you want to index only one page?