TeamHG-Memex/scrapy-rotating-proxies

Using get_random() to select a proxy is not optimal

fredd-427 opened this issue · 0 comments

Hello,
I found that choosing a proxy from the list with get_random() is not optimal. In my case:

  • I crawl a site that uses DataDome to protect itself from crawling, so to avoid being banned I set DOWNLOAD_DELAY to 180 seconds
  • I have 2 proxies in ROTATING_PROXY_LIST
  • DOWNLOAD_DELAY=180
  • CONCURRENT_REQUESTS_PER_DOMAIN=1
  • CONCURRENT_REQUESTS=2 (equal to the number of proxies)
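
For reference, the setup listed above would look roughly like this in a Scrapy settings.py. This is only a sketch: the two proxy addresses are placeholders I made up for illustration, and the middleware configuration is left out.

```python
# settings.py (sketch of the configuration described above)

# Placeholder addresses, NOT the real proxies from this report.
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8000",
]

# Wait 180 seconds between requests to avoid being banned.
DOWNLOAD_DELAY = 180

# One request at a time per download slot.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# As many concurrent requests as there are proxies (2 here).
CONCURRENT_REQUESTS = len(ROTATING_PROXY_LIST)
```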

Sometimes get_random() returns the proxy that the spider is already using, so the request has to wait until that proxy's DOWNLOAD_DELAY has elapsed, even though the other proxy is idle.

Would it be possible to replace get_random() with a get_unused() function, i.e. one that returns the first "free" proxy that is not currently inside its DOWNLOAD_DELAY window?
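
To make the request concrete, here is a minimal sketch of what I mean by get_unused(). It is not the library's code: ProxyPool, last_used and the clock parameter are hypothetical names used only for illustration, and I assume the pool itself records when each proxy was last handed out.

```python
import random
import time


class ProxyPool:
    """Hypothetical stand-in for the proxy list; not the library's own class."""

    def __init__(self, proxies, delay, clock=time.monotonic):
        self.proxies = list(proxies)
        self.delay = delay        # seconds, e.g. the DOWNLOAD_DELAY value
        self.clock = clock        # injectable so the logic can be tested
        self.last_used = {}       # proxy -> time it was last handed out

    def get_random(self):
        """Baseline: may pick a proxy that is still inside its delay."""
        return random.choice(self.proxies) if self.proxies else None

    def get_unused(self):
        """Return the first proxy whose delay has elapsed.

        If every proxy is still inside its delay, fall back to the one
        that was used least recently, since it becomes free first.
        """
        if not self.proxies:
            return None
        now = self.clock()
        for proxy in self.proxies:
            last = self.last_used.get(proxy)
            if last is None or now - last >= self.delay:
                self.last_used[proxy] = now
                return proxy
        proxy = min(self.proxies, key=lambda p: self.last_used[p])
        self.last_used[proxy] = now
        return proxy
```

With two proxies and a delay of 180, two consecutive calls to get_unused() return two different proxies, whereas get_random() can return the same one twice in a row.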

Thank you,
fred

Attached logs:
1st log.txt: the log I observed with the problem (see the comments on the right)
2nd log.txt: a log without the problem (see the comments on the right)