Lambda.TooManyRequestsException
Hi @cogwirrel, this is great work! I'm super impressed. I found your blog post on the internet and thought I should try it out, also because I'm quite new to AWS and need to learn its features, products, and everything.
I'd like to mention a few things, and I have some questions.
Required flags
I've noticed that some flags are required. If they are not provided, the steps all move straight to done without ever querying any URLs. For example, the following command:
./crawl --base-url https://www.bundestag.de/ --name Bundestag --profile default --keywords /
would create the following step graph:
So I figured it doesn't matter which keywords one provides; the difference comes from the --start-paths flag.
./crawl --base-url https://www.bundestag.de/ --name Bundestag --profile default --start-paths /arbeit
Note that the --keywords flag is not used here, so I assume it is optional too, like the --name flag.
It took me some time to figure this out, and I probably should have read the README more carefully. The README mentions that --name is optional, implying that the others are required, but it is never explicitly stated that they are required.
What does --start-paths do?
You see, I get an error in the step where the URLs actually get crawled, and I indeed seem to have a problem crawling the knowledge of the entire Bundestag website (the site of Germany's parliament). I only got the members of parliament (/abgeordnete), which I find funny, because my --start-paths was /arbeit, which is a completely different section. My assumption was that when I define --start-paths, the crawler would read every page under the URI /arbeit and wouldn't switch to a different URI.
Lambda.TooManyRequestsException
I wonder what I'm doing wrong here, or whether this is because of default limits set in this new AWS account. I'm having difficulty figuring out where to check the limits, and I wonder whether it's my own account's limits or some other error that I'm not aware of.
I got an error saying that the rate limit was exceeded:
{
  "details": {
    "cause": "Rate Exceeded. (Service: AWSLambda; Status Code: 429; Error Code: TooManyRequestsException; Request ID: <some request ID>; Proxy: null)",
    "error": "Lambda.TooManyRequestsException"
  },
  "redrive_count": "0",
  "id": "6",
  "type": "ExecutionFailed",
  "previous_event_id": "5",
  "event_timestamp": "1732178765867",
  "execution_arn": "arn:aws:states:eu-west-1:<some profile ID>:express:webcrawler-state-machine/<some ID>",
  "map_run_arn": "arn:aws:states:eu-west-1:<some profile ID>:mapRun:webcrawler-state-machine/<some other ID>",
  "parent_execution_arn": "arn:aws:states:eu-west-1:<some profile ID>:execution:webcrawler-state-machine:Bundestag-continued-2024-11-21T08-44-19-059Z"
}
Hey @BirgitPohl,
Thanks very much! Sorry for my slow reply!
Required flags
I think the documentation around what's required and optional could definitely be improved! --base-url and --start-paths are required, but --name and --keywords are optional :)
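For example, a minimal invocation with only the required flags (plus your --profile) would look something like this:
./crawl --base-url https://www.bundestag.de/ --start-paths /arbeit --profile default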
What does --start-paths do?
These are the paths within the website that the crawler starts from :) The crawler works by keeping track of the URLs it sees in a queue, and --start-paths specifies which URLs the queue should initially be populated with.
In your case it sounds like the crawler started at /arbeit, but found links somewhere to /abgeordnete, and visited those too.
The key here is that --start-paths only specifies where the crawler should start; it doesn't restrict the pages that it will crawl. To restrict those, you can use the --keywords option, which filters the URLs added to the queue. You might get the behaviour you expect by also adding --keywords /arbeit.
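Putting that together with the flags from your earlier run, the full command might look something like this:
./crawl --base-url https://www.bundestag.de/ --name Bundestag --profile default --start-paths /arbeit --keywords /arbeit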
Lambda.TooManyRequestsException
Yes, this sounds like account limits! The error means Lambda is throttling the crawler because it is being invoked with more concurrent executions than your account's Lambda concurrency limit allows, so it might be worth setting lower batch sizes and redeploying. You can find those batch sizes here:
You could try setting both to 100 and see if this helps :)
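To check where your account's limit sits, the AWS CLI can print the Lambda concurrency settings for the account (look at ConcurrentExecutions and UnreservedConcurrentExecutions in the output):
aws lambda get-account-settings --profile default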
Hope this helps!
Cheers,
Jack