lorien/grab

long term tasks using grab spider

EnzoRondo opened this issue · 1 comment

what I want: eliminate memory leaks when working with very long tasks (over 100 million links), and distribute the load across several processes running at the same time

how the desired solution looks:

  • I want to use multiple spider processes (20 or more), each handling around 500 threads (so 20 processes × 500 threads gives roughly 10k threads in total); see the sketch after this list

  • grab should release memory in a timely manner by shutting down processes and spawning new ones automatically

  • grab should distribute proxies evenly (we need to load our proxies equally)
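Here is a minimal sketch of what I mean, using the current API plus multiprocessing; it assumes a Grab version where Spider accepts thread_number and has load_proxylist, and the PageSpider class, the urls attribute, the proxies.txt file and the chunk sizes are made up for illustration. Each worker process runs its own spider with many network threads, and maxtasksperchild=1 recycles a worker after every chunk so leaked memory is returned to the OS:

```python
import multiprocessing

from grab.spider import Spider, Task


class PageSpider(Spider):
    def task_generator(self):
        # self.urls is a plain attribute set by the worker, not a Grab feature
        for url in self.urls:
            yield Task('page', url=url)

    def task_page(self, grab, task):
        # process the response here
        pass


def run_worker(urls):
    bot = PageSpider(thread_number=500)  # ~500 network threads per process
    bot.urls = urls
    # every worker loads the full proxy list; with rotation enabled the
    # proxies end up used roughly evenly across the whole job
    bot.load_proxylist('proxies.txt', 'text_file')
    bot.run()


def main(all_urls, num_workers=20, chunk_size=100_000):
    chunks = [all_urls[i:i + chunk_size]
              for i in range(0, len(all_urls), chunk_size)]
    # maxtasksperchild=1 restarts each worker process after one chunk,
    # which frees any memory leaked while crawling that chunk
    with multiprocessing.Pool(num_workers, maxtasksperchild=1) as pool:
        pool.map(run_worker, chunks)
```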

the current grab.spider design is based on threading, and I see no way to distribute tasks across several separate processes

I have found some interesting settings in the code, but I can't work out how they behave because of the lack of documentation:

parser_requests_per_process=10000,
parser_pool_size=1,
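My guess (I could not verify this) is that these are Spider constructor arguments: parser_pool_size would set how many separate parser processes are spawned, and parser_requests_per_process how many responses a parser process handles before it is restarted. If that is right, they could be passed like this:

```python
from grab.spider import Spider


class PageSpider(Spider):
    def task_page(self, grab, task):
        pass


bot = PageSpider(
    thread_number=500,                  # network threads
    parser_pool_size=4,                 # assumption: number of parser processes
    parser_requests_per_process=10000,  # assumption: restart a parser after 10k responses
)
bot.run()
```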

I am also interested in how people are handling high-load tasks these days, and what the right design is for serious volumes

I do not think I'll make big changes to Grab's internal design anymore.
The Grab design is outdated, deprecated and complicated.
I am thinking about creating a crawler engine from scratch, one designed for high load. It would be asyncio-based, I think.
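To illustrate the direction (just a sketch, not the actual engine): an asyncio crawler would bound concurrency with a semaphore and reuse a single HTTP session, e.g. with aiohttp (the library choice here is only an assumption).

```python
import asyncio

import aiohttp

CONCURRENCY = 1000


async def fetch(session, semaphore, url):
    # the semaphore keeps at most CONCURRENCY requests in flight
    async with semaphore:
        try:
            timeout = aiohttp.ClientTimeout(total=30)
            async with session.get(url, timeout=timeout) as resp:
                body = await resp.text()
                # parse the body and schedule new URLs here
                return url, resp.status, len(body)
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            return url, None, exc


async def crawl(urls):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks)


if __name__ == '__main__':
    results = asyncio.run(crawl(['https://example.com/'] * 10))
    print(results[:3])
```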