lorien/grab

long term tasks using grab spider

EnzoRondo opened this issue · 1 comment

what I want: eliminate memory leaks when working with very long tasks (over 100 million links), and distribute the load across several processes running at the same time

how the desired solution looks:

  • I want to use multiple spider processes (20 or more), each handling around 500 threads (so 20 processes × 500 threads gives roughly 10k threads in total); see the sketch after this list

  • grab should release memory in a timely manner by shutting down processes and spawning new ones automatically

  • grab should distribute proxies evenly (we need to load our proxies equally)
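Here is a minimal sketch of what I mean, using the current API plus multiprocessing; it assumes a Grab version where Spider accepts thread_number and has load_proxylist, and the PageSpider class, the urls attribute, the proxies.txt file and the chunk sizes are made up for illustration. Each worker process runs its own spider with many network threads, and maxtasksperchild=1 recycles a worker after every chunk so leaked memory is returned to the OS:

```python
import multiprocessing

from grab.spider import Spider, Task


class PageSpider(Spider):
    def task_generator(self):
        # self.urls is a plain attribute set by the worker, not a Grab feature
        for url in self.urls:
            yield Task('page', url=url)

    def task_page(self, grab, task):
        # process the response here
        pass


def run_worker(urls):
    bot = PageSpider(thread_number=500)  # ~500 network threads per process
    bot.urls = urls
    # every worker loads the full proxy list; with rotation enabled the
    # proxies end up used roughly evenly across the whole job
    bot.load_proxylist('proxies.txt', 'text_file')
    bot.run()


def main(all_urls, num_workers=20, chunk_size=100_000):
    chunks = [all_urls[i:i + chunk_size]
              for i in range(0, len(all_urls), chunk_size)]
    # maxtasksperchild=1 restarts each worker process after one chunk,
    # which frees any memory leaked while crawling that chunk
    with multiprocessing.Pool(num_workers, maxtasksperchild=1) as pool:
        pool.map(run_worker, chunks)
```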

the current grab.spider design is based on threading, and I see no way to distribute tasks across several separate processes

I have found some interesting settings in the code, but I can't work out how they behave because of the lack of documentation:

parser_requests_per_process=10000,
parser_pool_size=1,
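My guess (I could not verify this) is that these are Spider constructor arguments: parser_pool_size would set how many separate parser processes are spawned, and parser_requests_per_process how many responses a parser process handles before it is restarted. If that is right, they could be passed like this:

```python
from grab.spider import Spider


class PageSpider(Spider):
    def task_page(self, grab, task):
        pass


bot = PageSpider(
    thread_number=500,                  # network threads
    parser_pool_size=4,                 # assumption: number of parser processes
    parser_requests_per_process=10000,  # assumption: restart a parser after 10k responses
)
bot.run()
```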

I am also interested in how people are handling high-load tasks these days, and what the right design is for serious volumes

I do not think I'll make big changes to Grab's internal design anymore.
The Grab design is outdated, deprecated and complicated.
I am thinking about creating a crawler engine from scratch, one designed for high load. It would be asyncio-based, I think.
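To illustrate the direction (just a sketch, not the actual engine): an asyncio crawler would bound concurrency with a semaphore and reuse a single HTTP session, e.g. with aiohttp (the library choice here is only an assumption).

```python
import asyncio

import aiohttp

CONCURRENCY = 1000


async def fetch(session, semaphore, url):
    # the semaphore keeps at most CONCURRENCY requests in flight
    async with semaphore:
        try:
            timeout = aiohttp.ClientTimeout(total=30)
            async with session.get(url, timeout=timeout) as resp:
                body = await resp.text()
                # parse the body and schedule new URLs here
                return url, resp.status, len(body)
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            return url, None, exc


async def crawl(urls):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, url) for url in urls]
        return await asyncio.gather(*tasks)


if __name__ == '__main__':
    results = asyncio.run(crawl(['https://example.com/'] * 10))
    print(results[:3])
```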