Doesn't seem to work with requests inside a Twisted Reactor process

Question

nomasprime opened this issue 5 years ago · 6 comments

Hi, I'm new to Python so apologies in advance if this is obvious or I don't explain things very well.

For a learning project I'm writing a web scraper with tests. In the following example, recording works as expected for the commented-out line, but requests made inside the CrawlerProcess/Twisted reactor aren't being picked up.

    @pytest.mark.vcr
    def test_parse(self):
        # assert requests.get("http://httpbin.org/ip").text == '{"ip": true}'
        BaseSpider.start_urls = ['thoughtbot']

        BaseSpider.custom_settings = {
            'ROBOTSTXT_OBEY': False
        }

        process = CrawlerProcess()
        process.crawl(BaseSpider)
        process.start()

I'm not sure if this is a bug or I'm just doing something wrong?

Answer 1 · 2020-06-25T18:33:48.000Z

Hi!

There could be multiple reasons for such behavior. To untangle what is going on, let's start with the environment. What versions of Python, vcrpy, pytest-recording, Twisted and (I assume) Scrapy are you using? You can get the package info from the pip freeze output. Also, what OS are you using? If, for example, it is Windows and CrawlerProcess actually spawns a new process, then the new process will not have the patches applied by VCR.py, and we can look into that. I'll try to reproduce it locally once I have more info about the environment.
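
If filtering the pip freeze output by hand is tedious, a small script along these lines could collect the same details. This is only a sketch: the package list is assumed from the libraries mentioned in this thread, and `collect_versions` is a hypothetical helper name, not part of any of those packages.

```python
from importlib import metadata  # standard library since Python 3.8
import platform

# Assumed list of relevant packages; adjust it for your own project.
PACKAGES = ["vcrpy", "pytest", "pytest-recording", "Twisted", "Scrapy"]


def collect_versions(names):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print interpreter and OS details alongside the package versions.
    print("Python", platform.python_version(), "on", platform.platform())
    for name, version in collect_versions(PACKAGES).items():
        print(f"{name}=={version}" if version else f"{name} (not installed)")
```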

Answer 2 · 2020-06-25T18:43:52.000Z

Thanks @Stranger6667, really appreciate your help.

I'm on MacOS 15.15.5 with Python 3.8.3.

pip output:

vcrpy==4.0.2
pytest==5.4.3
pytest-recording==0.8.1
Twisted==20.3.0
Scrapy==2.2.0

I saw the VCR compatibility doc and wondered if maybe Scrapy isn't compatible?

Answer 3 · 2020-06-25T19:05:14.000Z

Looks like Scrapy uses twisted.web.

Answer 4 · 2020-06-25T19:48:48.000Z

Unfortunately, VCR doesn't support twisted.web. I see a couple of things we can do about it:

1. Implement it in VCR.py. This is quite a hardcore option, to be honest, but still possible. In short, we would need to implement mocks for certain Twisted interfaces; I can't assess the complexity at the moment.
2. Since twisted.web uses sockets under the hood, there are other tools that might help, for example HTTMock.

In any case, pytest-recording will automatically get it working once it is implemented on the vcrpy side, which I think is the cleanest way to record/replay HTTP for twisted.web.
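
To illustrate why the requests call is recorded while the crawl is not, here is a toy sketch of library-level patching. It does not use VCR.py or Twisted at all: `SupportedClient`, `UnsupportedClient`, `replay` and `demo` are hypothetical stand-ins, and the only assumption is that a recorder replaces methods on the specific client classes it knows about.

```python
from unittest import mock


class SupportedClient:
    """Stand-in for an HTTP client that a recorder knows how to patch."""

    def fetch(self, url):
        return f"live response from {url}"


class UnsupportedClient:
    """Stand-in for a client built on a transport the recorder ignores."""

    def fetch(self, url):
        return f"live response from {url}"


def replay(self, url):
    """Replacement method that serves a canned response instead of the network."""
    return f"recorded response for {url}"


def demo(url="http://example.invalid/ip"):
    # Only SupportedClient.fetch is swapped out for the duration of the block.
    with mock.patch.object(SupportedClient, "fetch", replay):
        patched = SupportedClient().fetch(url)
        unpatched = UnsupportedClient().fetch(url)
    return patched, unpatched
```

Calling `demo()` returns a recorded response for the first client and a live one for the second, which mirrors the symptom in this issue: anything that does not go through a patched client bypasses the cassette.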

Answer 5 · 2020-06-25T20:02:22.000Z

I've raised the issue with VCR.py, and I'll look into HTTPretty, which I assume is what you meant.

Also found this answer on StackOverflow which talks about simply using Scrapy's built-in cache.
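
For reference, a minimal sketch of what the built-in cache approach might look like, using Scrapy's documented `HTTPCACHE_*` settings. The helper name `cached_crawl_settings`, its parameters and the default directory value are my own choices for illustration, and I haven't verified this against the test above.

```python
def cached_crawl_settings(cache_dir="httpcache", replay_only=False):
    """Build a Scrapy settings dict that stores and reuses HTTP responses.

    cache_dir and replay_only are hypothetical parameters of this helper;
    the dictionary keys are Scrapy setting names.
    """
    return {
        "ROBOTSTXT_OBEY": False,
        # Turn on Scrapy's HTTP cache middleware.
        "HTTPCACHE_ENABLED": True,
        # Directory where cached responses are stored.
        "HTTPCACHE_DIR": cache_dir,
        # 0 means cached responses never expire.
        "HTTPCACHE_EXPIRATION_SECS": 0,
        # When True, requests missing from the cache are ignored
        # instead of being downloaded.
        "HTTPCACHE_IGNORE_MISSING": replay_only,
    }
```

The resulting dict could then be passed as `CrawlerProcess(settings=cached_crawl_settings())`: a first run would populate the cache, and later runs with `replay_only=True` should avoid hitting the network.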

I'll have a play around with both.

Thanks @Stranger6667.

Answer 6 · 2020-06-25T20:12:24.000Z

@nomasprime yep, I meant HTTPretty

> Thanks @Stranger6667.

You are very welcome! :)