jxlil/scrapy-impersonate

Scrapy spider 3x slower after implementing this


Hi,
I apologize, I know this is probably the wrong place to post this, but I have the following issue: after enabling this library in my spider, my average crawl rate in pages per minute dropped to roughly a third of what it was. I have CONCURRENT_REQUESTS_PER_DOMAIN set to 32. I am trying to find out where the bottleneck is and what I can do to regain the previous speed without overloading the site.

I appreciate your work on this project and any advice you can give me.

Hi @petrrutz

Could you share a small spider as a proof of concept? I can look into it later.

Thanks

You're right. I ran some tests against 100 random websites with CONCURRENT_REQUESTS set to 5, and scrapy-impersonate is indeed much slower.
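A minimal spider along the lines of the one I benchmarked (hypothetical reconstruction: the URL list is a placeholder for the 100 random sites, and "chrome110" is just an example browser label):

```python
import scrapy


class TestingSpider(scrapy.Spider):
    # Hypothetical reconstruction of the benchmark spider; the actual test
    # code was not shared in this thread.
    name = "testing"
    custom_settings = {"CONCURRENT_REQUESTS": 5}

    # Placeholder list standing in for the 100 random websites.
    urls = ["https://example.com", "https://example.org"]

    def start_requests(self):
        for url in self.urls:
            # The "impersonate" meta key selects the browser profile when the
            # scrapy-impersonate handler is active; the default handler simply
            # ignores it, so the same spider serves both runs.
            yield scrapy.Request(url, meta={"impersonate": "chrome110"})

    def parse(self, response):
        # No extraction: only fetch throughput is measured.
        self.logger.info("fetched %s (%d bytes)", response.url, len(response.body))
```

The only difference between the two runs is whether the download handler override is enabled in the settings; the "Twisted" baseline uses Scrapy's default handler.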

Twisted

scrapy crawl testing  3.43s user 1.03s system 25% cpu 17.223 total
scrapy crawl testing  3.67s user 1.13s system 27% cpu 17.578 total

scrapy-impersonate

scrapy crawl testing  3.44s user 1.10s system 19% cpu 23.003 total
scrapy crawl testing  3.57s user 1.18s system 19% cpu 24.631 total

I'm sorry I did not get back to you sooner. My spider is complex, and I was under pressure to get it working again, so I had to prioritize that. I appreciate you looking into it. In the end, I switched to a similar library, https://github.com/tieyongjie/scrapy-fingerprint. It is also slower than native Scrapy, but faster than scrapy-impersonate, so it may be useful to compare the two.

I just opened a PR to fix this; the results are now similar to Twisted's. Testing was performed with CONCURRENT_REQUESTS set to 32 against 84 random sites (a sketch for checking the crawl rate directly follows the timings below).

Twisted

test1: scrapy crawl impersonate  2.62s user 0.55s system 33% cpu 9.344 total
test2: scrapy crawl impersonate  2.29s user 0.39s system 46% cpu 5.814 total

scrapy-impersonate

test1: scrapy crawl impersonate  2.29s user 0.47s system 35% cpu 7.854 total
test2: scrapy crawl impersonate  2.25s user 0.45s system 49% cpu 5.444 total
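The totals above come from the shell's `time` command. As a hypothetical complement, the spider itself can report the pages-per-minute figure from the original report, so the two handler configurations can be compared on that metric directly:

```python
import time

import scrapy


class RateSpider(scrapy.Spider):
    # Hypothetical helper: log an average crawl rate when the spider closes,
    # independent of the shell's `time` output.
    name = "rate"
    start_urls = ["https://example.com"]  # placeholder URL list

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._started = time.monotonic()
        self._pages = 0

    def parse(self, response):
        self._pages += 1

    def closed(self, reason):
        elapsed = max(time.monotonic() - self._started, 1e-9)
        self.logger.info("%.1f pages/min", 60 * self._pages / elapsed)
```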