decrypto-org/spider

Extend DB Results Cache to Pool

Closed this issue · 1 comment

To always provide the network with a collection of entries from different baseUrls, we extend the cache so that it always holds at least a certain number of entries, which can then be passed to the network. Those entries should be "random" enough that we do not bombard a single server with 100 simultaneous requests, which could lead to the detection and ban of our scraper. To prevent that, we can additionally add a rate limit within the network class that allows no more than 4 simultaneous connections per server (the Firefox default). If 4 connections are already open, a 5th entry received from the pool is stalled (introducing a backlog queue in the network module) and executed once one of the four pending requests returns. The logic works like this: once a request returns, we first check whether a stalled request exists for this domain. If so, we use it; otherwise, we get a new entry from the pool. The pool itself is responsible for always holding enough entries. A sketch of this dispatch logic follows below.
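
A minimal sketch of the per-host limit plus backlog, under the assumption of a Node/TypeScript setup; the identifiers (`NetworkEntry`, `Pool`, `Network`, `MAX_CONNECTIONS_PER_HOST`) are illustrative and not necessarily the actual names in this repository:

```typescript
// Sketch only - names below are hypothetical, not the real module API.

interface NetworkEntry {
    baseUrl: string; // host the request goes to
    path: string;    // resource to fetch on that host
}

/** Hands out entries spread across different baseUrls and keeps itself filled. */
interface Pool {
    getEntry(): NetworkEntry | undefined;
}

const MAX_CONNECTIONS_PER_HOST = 4; // Firefox default per-host connection limit

class Network {
    // Currently open connections per baseUrl
    private openConnections = new Map<string, number>();
    // Backlog queue: stalled entries per baseUrl, waiting for a free slot
    private backlog = new Map<string, NetworkEntry[]>();

    constructor(private pool: Pool) {}

    /** Try to dispatch an entry coming from the pool. */
    dispatch(entry: NetworkEntry): void {
        const open = this.openConnections.get(entry.baseUrl) ?? 0;
        if (open >= MAX_CONNECTIONS_PER_HOST) {
            // Already 4 connections open for this host: stall the request
            const queue = this.backlog.get(entry.baseUrl) ?? [];
            queue.push(entry);
            this.backlog.set(entry.baseUrl, queue);
            return;
        }
        this.openConnections.set(entry.baseUrl, open + 1);
        this.execute(entry);
    }

    /** Placeholder for the actual HTTP request. */
    private async execute(entry: NetworkEntry): Promise<void> {
        try {
            // ... perform the request against entry.baseUrl + entry.path ...
        } finally {
            this.onRequestFinished(entry.baseUrl);
        }
    }

    /** Once a request returns: prefer a stalled request for the same domain,
     *  otherwise ask the pool for a fresh entry. */
    private onRequestFinished(baseUrl: string): void {
        const open = this.openConnections.get(baseUrl) ?? 1;
        this.openConnections.set(baseUrl, open - 1);

        const queue = this.backlog.get(baseUrl);
        if (queue && queue.length > 0) {
            this.dispatch(queue.shift()!);
            return;
        }
        const next = this.pool.getEntry(); // pool guarantees it holds enough entries
        if (next) {
            this.dispatch(next);
        }
    }
}
```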

The pool now lives within the network module, since the network needs to handle the back-off cache anyway. That way, the pool logic sits in the same location, which also makes sense from an "interface" perspective: otherwise the two modules would be tightly coupled to keep caching and pooling working well together.
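
As a rough layout sketch (member names are assumptions for illustration, types reused from the sketch above), co-locating the pool and the back-off cache inside the network module could look like this:

```typescript
// Hypothetical layout: pool and back-off cache side by side in the network
// module, so the DB side only needs to expose a plain results-cache interface.
class NetworkModule {
    private pool: Pool;                                   // refills itself from the DB results cache
    private backOff = new Map<string, number>();          // baseUrl -> time until which we avoid this host
    private backlog = new Map<string, NetworkEntry[]>();  // stalled requests per baseUrl

    constructor(pool: Pool) {
        this.pool = pool;
    }

    // ... dispatch/backlog logic as sketched above ...
}
```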
Closing this issue for now.