aws-samples/aws-step-functions-kendra-web-crawler-search-engine

Lambda.TooManyRequestsException

Hi @cogwirrel, this is great work! I'm super impressed. I found your blog post and thought I should try it out, partly because I'm quite new to AWS and want to learn its features and products.

I'd like to mention a few things, and I have some questions.

Required flags

I've noticed that some flags are required. If they are not provided, the steps all complete successfully and the execution moves to Done without ever querying any URLs.

./crawl --base-url https://www.bundestag.de/ --name Bundestag --profile default --keywords /

Would create the following step graph:
[Screenshot: resulting step graph]

So I figured that it doesn't matter what keywords one provides; what makes the difference is the --start-paths flag.

./crawl --base-url https://www.bundestag.de/ --name Bundestag --profile default --start-paths /arbeit

Note that the --keywords flag is not used here, so I assume it is optional too, like the --name flag.
[Screenshot: resulting step graph]

It took me some time to figure this out, and I probably should have read the README more carefully. The README mentions that --name is optional, implying that the others are required, but it never explicitly states which flags are required.

What does --start-paths do?

You see, I got an error in the step where the URL actually gets crawled, and I indeed seem to have a problem crawling the entire Bundestag website (the site of Germany's federal parliament). I only got the members of parliament (/abgeordnete), which I find funny, because my --start-paths was /arbeit, which is a completely different section. My assumption was that when I define --start-paths, the crawler would read every page under the URI /arbeit and wouldn't switch to a different URI.

Lambda.TooManyRequestsException

I wonder what I'm doing wrong here, or if this is because of default limits set in this new AWS account. I'm having difficulty figuring out where to check the limits, and I wonder whether it's my own account's limits or some other error I'm not aware of.

I got an error saying that the threshold was exceeded:

{
    "details": {
        "cause": "Rate Exceeded. (Service: AWSLambda; Status Code: 429; Error Code: TooManyRequestsException; Request ID: <some request ID>; Proxy: null)",
        "error": "Lambda.TooManyRequestsException"
    },
    "redrive_count": "0",
    "id": "6",
    "type": "ExecutionFailed",
    "previous_event_id": "5",
    "event_timestamp": "1732178765867",
    "execution_arn": "arn:aws:states:eu-west-1:<some profile ID>:express:webcrawler-state-machine/<some ID>",
    "map_run_arn": "arn:aws:states:eu-west-1<some profile ID>:mapRun:webcrawler-state-machine/<some other ID>",
    "parent_execution_arn": "arn:aws:states:eu-west-1:<some profile ID>:execution:webcrawler-state-machine:Bundestag-continued-2024-11-21T08-44-19-059Z"
}

Hey @BirgitPohl,

Thanks very much! Sorry for my slow reply!

Required flags

I think the documentation around what's required and optional could definitely be improved! --base-url and --start-paths are required but --name and --keywords are optional :)
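
So a minimal invocation would look something like this (assuming the crawl script falls back to your default AWS credentials when --profile is omitted):

./crawl --base-url https://www.bundestag.de/ --start-paths /arbeit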

What does --start-paths do?

These are the paths within the website that the crawler starts from :) The crawler works by keeping track of the urls it sees in a queue, and the --start-paths specifies which urls should initially be populated in the queue.

In your case it sounds like the crawler started at /arbeit, but found links somewhere to /abgeordnete, and visited those too.

The key here is that --start-paths only specifies where the crawler should start, it doesn't restrict the pages that it will crawl. To restrict the pages it will crawl, you can use the --keywords option which filters the urls added to the queue. You might get the behaviour you expect by also adding --keywords /arbeit.
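
To make the mechanics concrete, here's a simplified sketch of the queue behaviour in TypeScript (hypothetical names, not the actual code from this repo):

// Rough sketch of the crawler's queue behaviour (hypothetical names,
// not the actual implementation in this repo).
async function crawl(
  baseUrl: string,
  startPaths: string[],
  keywords: string[],
  // hypothetical helper: fetch + index a page, return the links found on it
  crawlPage: (url: string) => Promise<string[]>,
): Promise<void> {
  // --start-paths only seeds the queue...
  const queue: string[] = startPaths.map((path) => new URL(path, baseUrl).href);
  const visited = new Set<string>();

  while (queue.length > 0) {
    const url = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);

    for (const link of await crawlPage(url)) {
      // ...while --keywords filters every discovered url before it is enqueued.
      // With no keywords, the crawler can follow any link it discovers.
      if (keywords.length === 0 || keywords.some((keyword) => link.includes(keyword))) {
        queue.push(link);
      }
    }
  }
}

So in your case, something like this should keep the crawl within /arbeit:

./crawl --base-url https://www.bundestag.de/ --name Bundestag --profile default --start-paths /arbeit --keywords /arbeit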

Lambda.TooManyRequestsException

Yes, this sounds like account limits! You can check your account's applied Lambda concurrency limit in the AWS console under Service Quotas → AWS Lambda → Concurrent executions; new accounts often start with a lower limit than the default of 1,000. It might be worth setting lower batch sizes and redeploying. You can find those batch sizes here:

/**
 * The default concurrency limit for the Distributed Map state's child executions
 *
 * Distributed Map state can support up to 10,000 concurrent executions but we need to consider the default Lambda
 * concurrency limit of 1000 per AWS region. To increase the concurrency limit for child executions, you can request
 * a quota increase for the Lambda concurrency limit and then update the concurrency limit for the child executions of
 * the Distributed Map state accordingly. You may also need to use provisioned concurrency for the Lambda function "CrawlPageAndQ"
 * to deal with the initial burst of concurrency.
 *
 * For a new deployment of the solution to work within the default Lambda concurrency limit, we set the default concurrency
 * limit for the Distributed Map state to 1000.
 */
export const DEFAULT_DISTRIBUTED_MAP_CONCURRENCY_LIMIT = 1000;

/**
 * The default number of urls to sync in parallel.
 *
 * Note that this "DEFAULT_PARALLEL_URLS_TO_SYNC" must be the same or bigger than the DEFAULT_DISTRIBUTED_MAP_CONCURRENCY_LIMIT.
 */
export const DEFAULT_PARALLEL_URLS_TO_SYNC = 1000;

You could try setting both to 100 and see if this helps :)
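
Concretely, that would mean editing the two constants shown above to something like:

export const DEFAULT_DISTRIBUTED_MAP_CONCURRENCY_LIMIT = 100;

export const DEFAULT_PARALLEL_URLS_TO_SYNC = 100;

(Setting both to the same value still satisfies the note above that DEFAULT_PARALLEL_URLS_TO_SYNC must be the same or bigger than DEFAULT_DISTRIBUTED_MAP_CONCURRENCY_LIMIT.)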

Hope this helps!

Cheers,
Jack