Long-term tasks using Grab spider
EnzoRondo opened this issue · 1 comment
What I want: to eliminate memory leaks when working with very long tasks (over 100M links) and to distribute the load across several processes running at the same time.
How the desired solution looks:
- I want to use several spider processes (20 or more), each handling around 500 threads (in the end, 20 processes × 500 threads = about 10,000 threads in total)
- Grab should release memory in time by shutting down worker processes and spawning new ones automatically
- Grab should distribute proxies evenly (we need to load our proxies equally)
The current grab.spider design is based on threading, and I see no way to distribute tasks across several separate processes.
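To make the idea concrete, here is a rough sketch of the kind of workaround I have in mind: launching several independent spider processes with the standard multiprocessing module. This is not something Grab provides; the `ExampleSpider` class and the URL sharding are just placeholders.

```python
import multiprocessing

from grab.spider import Spider, Task


class ExampleSpider(Spider):
    """Placeholder spider; real task handlers go here."""

    def __init__(self, urls, **kwargs):
        super(ExampleSpider, self).__init__(**kwargs)
        self.urls = urls

    def task_generator(self):
        for url in self.urls:
            yield Task('page', url=url)

    def task_page(self, grab, task):
        pass  # parse grab.doc here


def run_worker(urls):
    # Each worker is a separate OS process running its own 500 threads
    bot = ExampleSpider(urls, thread_number=500)
    bot.run()


if __name__ == '__main__':
    all_urls = ['http://example.com/%d' % num for num in range(1000)]
    num_procs = 20
    # Naive sharding of the URL list across worker processes
    shards = [all_urls[i::num_procs] for i in range(num_procs)]
    procs = [multiprocessing.Process(target=run_worker, args=(shard,))
             for shard in shards]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
```

This covers only process spawning; it does not solve automatic recycling of leaking processes or even proxy distribution across workers.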
I have found some interesting settings in the code, but I can't work out how they work because of the lack of documentation:
parser_requests_per_process=10000,
parser_pool_size=1,
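If I read the source correctly, these look like `Spider` constructor arguments: `parser_pool_size` seems to set the number of parser processes, and `parser_requests_per_process` seems to set how many requests a parser process handles before it is restarted. That is only my guess from the code; a minimal sketch of how I would try them:

```python
from grab.spider import Spider, Task


class ExampleSpider(Spider):
    def task_generator(self):
        yield Task('page', url='http://example.com/')

    def task_page(self, grab, task):
        pass


# My reading of the undocumented options (unverified):
#   parser_pool_size            - number of parser processes in the pool
#   parser_requests_per_process - requests handled before a parser
#                                 process is recycled (memory cleanup?)
bot = ExampleSpider(
    thread_number=500,
    parser_pool_size=4,
    parser_requests_per_process=10000,
)
bot.run()
```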
I am also interested in how people solve high-load crawling tasks these days and what the right design is for handling serious volumes.
I do not think I will make big changes to Grab's internal design anymore.
Grab's design is outdated, deprecated, and complicated.
I am thinking about creating a crawler engine from scratch, designed for high load. It would be asyncio-based, I think.
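Roughly, the core of such an engine would look something like the sketch below: a fixed pool of asyncio workers pulling URLs from a queue and fetching them concurrently. The use of aiohttp here is just for illustration, nothing is decided.

```python
import asyncio

import aiohttp


async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()


async def worker(session, queue):
    # Pull URLs from the shared queue until cancelled
    while True:
        url = await queue.get()
        try:
            body = await fetch(session, url)
            print(url, len(body))
        except Exception as err:
            print(url, 'failed:', err)
        finally:
            queue.task_done()


async def main(urls, concurrency=100):
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    async with aiohttp.ClientSession() as session:
        # A fixed pool of coroutine workers instead of OS threads
        tasks = [asyncio.ensure_future(worker(session, queue))
                 for _ in range(concurrency)]
        await queue.join()
        for task in tasks:
            task.cancel()


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(
        main(['http://example.com/'])
    )
```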