Scrapy spider 3x slower after implementing this
Hi,
I apologize, I know this is probably the wrong place to post this, but I have the following issue: after enabling this library in my spider, my average crawl rate in pages per minute dropped to roughly a third of what it was. I have CONCURRENT_REQUESTS_PER_DOMAIN set to 32. I am trying to find the bottleneck and see what I can do to regain the previous speed without overloading the site.
I appreciate your work on this project and any advice you can give me.
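For context, here is roughly how the spider is configured. This is only a sketch: the concurrency value is the one from my actual setup, and the handler and reactor settings are what the scrapy-impersonate README prescribes, as I understand it:

```python
# settings.py (sketch; only CONCURRENT_REQUESTS_PER_DOMAIN is from my real config)
CONCURRENT_REQUESTS_PER_DOMAIN = 32

# Route all HTTP(S) traffic through scrapy-impersonate's download handler
DOWNLOAD_HANDLERS = {
    "http": "scrapy_impersonate.ImpersonateDownloadHandler",
    "https": "scrapy_impersonate.ImpersonateDownloadHandler",
}

# scrapy-impersonate runs on the asyncio reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```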
You're right, I was doing some tests with 100 random websites and CONCURRENT_REQUESTS set to 5, and scrapy-impersonate is much slower.
Twisted
scrapy crawl testing 3.43s user 1.03s system 25% cpu 17.223 total
scrapy crawl testing 3.67s user 1.13s system 27% cpu 17.578 total
scrapy-impersonate
scrapy crawl testing 3.44s user 1.10s system 19% cpu 23.003 total
scrapy crawl testing 3.57s user 1.18s system 19% cpu 24.631 total
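For reference, a minimal sketch of the kind of benchmark spider behind these numbers; the site list here is a placeholder for the 100 random websites, "chrome110" is just an example impersonation target, and each run above is the shell's time output for scrapy crawl testing:

```python
import scrapy


class TestingSpider(scrapy.Spider):
    name = "testing"
    custom_settings = {"CONCURRENT_REQUESTS": 5}

    def start_requests(self):
        # Placeholder list; the real test used 100 random websites.
        urls = ["https://example.com", "https://example.org"]
        for url in urls:
            # meta={"impersonate": ...} sends the request through
            # scrapy-impersonate; drop it for the plain Twisted baseline.
            yield scrapy.Request(url, meta={"impersonate": "chrome110"})

    def parse(self, response):
        # Only throughput is being measured, so just log the status.
        self.logger.info("%s -> %s", response.url, response.status)
```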
I'm sorry I did not get back to you sooner. My spider is complex, and I was pushed to make it work again, so I needed to prioritize. I appreciate you looking into it. In the end, I used a similar library: https://github.com/tieyongjie/scrapy-fingerprint. It is also slower than native Scrapy, but faster than scrapy-impersonate, so maybe you can compare both.
I just opened a PR to fix this; the results are now similar to Twisted's. Testing was performed with CONCURRENT_REQUESTS set to 32 and 84 random sites.
Twisted
test1: scrapy crawl impersonate 2.62s user 0.55s system 33% cpu 9.344 total
test2: scrapy crawl impersonate 2.29s user 0.39s system 46% cpu 5.814 total
scrapy-impersonate
test1: scrapy crawl impersonate 2.29s user 0.47s system 35% cpu 7.854 total
test2: scrapy crawl impersonate 2.25s user 0.45s system 49% cpu 5.444 total