LangStream/langstream

Web crawler reports a redirection to a forbidden domain when the seed URL has no trailing slash

msmygit opened this issue · 1 comment

Setup

% langstream -V
LangStream CLI 0.5.0 (8162f382)

Web crawler configuration

pipeline:
  - name: "Crawl the WebSite"
    type: "webcrawler-source"
    configuration:
      seed-urls:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11"
      allowed-domains:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11"
      forbidden-paths: []
      ...

When we run the following command,

langstream docker run test -app examples/docker-chatbot -s ./secrets.yaml

the crawler logs the following warning and produces no documents,

15:23:56.896 [crawler-webcrawler-source-1-runner-465eeb4a-f140-4b8d-b683-ee51ee76f401] INFO  a.l.a.webcrawler.WebCrawlerSource -- The last cycle didn't produce any new documents
15:23:56.896 [crawler-webcrawler-source-1-runner-465eeb4a-f140-4b8d-b683-ee51ee76f401] INFO  a.l.a.webcrawler.crawler.WebCrawler -- Crawling url: https://aws.amazon.com/about-aws/whats-new/2023/11
15:23:57.086 [crawler-webcrawler-source-1-runner-465eeb4a-f140-4b8d-b683-ee51ee76f401] WARN  a.l.a.webcrawler.crawler.WebCrawler -- A redirection to a forbidden domain happened (from https://aws.amazon.com/about-aws/whats-new/2023/11 to /about-aws/whats-new/2023/11/)
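
The warning shows a relative redirect target (/about-aws/whats-new/2023/11/) being compared against an absolute allowed-domains entry. One plausible reading, which this issue does not confirm, is that the target is checked by prefix before being resolved against the source URL. The Python sketch below only illustrates that reading. The is_allowed helper and the prefix-matching rule are assumptions for illustration, not LangStream's actual code.

```python
from urllib.parse import urljoin

# Values taken from the configuration and the log line in this issue.
ALLOWED_DOMAINS = ["https://aws.amazon.com/about-aws/whats-new/2023/11"]
SOURCE_URL = "https://aws.amazon.com/about-aws/whats-new/2023/11"
REDIRECT_TARGET = "/about-aws/whats-new/2023/11/"


def is_allowed(url: str, allowed_prefixes: list[str]) -> bool:
    """Hypothetical check: a URL is allowed if it starts with any allowed prefix."""
    return any(url.startswith(prefix) for prefix in allowed_prefixes)


# Checking the raw (relative) redirect target fails, because it has no scheme or host.
raw_ok = is_allowed(REDIRECT_TARGET, ALLOWED_DOMAINS)

# Resolving the target against the source URL first yields an absolute URL.
resolved = urljoin(SOURCE_URL, REDIRECT_TARGET)
resolved_ok = is_allowed(resolved, ALLOWED_DOMAINS)

print(raw_ok)
print(resolved)
print(resolved_ok)
```

Under these assumptions, the relative target is rejected while the resolved absolute URL would be accepted, which matches the behaviour reported above.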

Workaround

Adding a trailing slash (/) to the seed-urls and allowed-domains entries fixed the error.

pipeline:
  - name: "Crawl the WebSite"
    type: "webcrawler-source"
    configuration:
      seed-urls:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11/"
      allowed-domains:
        - "https://aws.amazon.com/about-aws/whats-new/2023/11/"
      forbidden-paths: []
      ...
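
With the trailing slash already present, the server presumably has no reason to redirect, so the check is never triggered. If the URLs are generated by a script, the workaround can be applied with a small helper before the values are written into the pipeline file. This is a minimal sketch. The function name with_trailing_slash is hypothetical, and it assumes every URL points at a directory-style page, not a file such as page.html.

```python
from urllib.parse import urlsplit, urlunsplit


def with_trailing_slash(url: str) -> str:
    """Return url with a trailing slash on its path, leaving query and fragment intact."""
    parts = urlsplit(url)
    path = parts.path
    if not path.endswith("/"):
        path += "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


seed = with_trailing_slash("https://aws.amazon.com/about-aws/whats-new/2023/11")
print(seed)
```

The helper is idempotent, so it is safe to apply to URLs that already end with a slash.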

Happy that you have found a solution.

I have one question:
Do you want to index only one page?