Doesn't seem to work with requests inside a Twisted Reactor process
nomasprime opened this issue · 6 comments
Hi, I'm new to Python so apologies in advance if this is obvious or I don't explain things very well.
For a learning project I'm writing a web scraper with tests. In the following example, recording works as expected for the commented-out line, but requests made inside the CrawlerProcess/Twisted reactor aren't being picked up.
@pytest.mark.vcr
def test_parse(self):
    # assert requests.get("http://httpbin.org/ip").text == '{"ip": true}'
    BaseSpider.start_urls = ['thoughtbot']
    BaseSpider.custom_settings = {
        'ROBOTSTXT_OBEY': False
    }
    process = CrawlerProcess()
    process.crawl(BaseSpider)
    process.start()
I'm not sure if this is a bug or I'm just doing something wrong?
Hi!
There could be multiple reasons for such behavior. To untangle what is going on, let's start with the environment. What versions of Python, vcrpy, pytest-recording, Twisted and (I assume) Scrapy are you using? You can get the package info from the pip freeze output. Also, what OS are you using? If, e.g., it is Windows and CrawlerProcess actually spawns a new process, then the new process will not have the patches applied by VCR.py, and we can look into this. I'll try to reproduce it locally once I have more info about the environment.
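To illustrate the point about a new process not inheriting patches, here is a minimal sketch using only the standard library. It says nothing about what CrawlerProcess itself does; json.dumps is just a stand-in for whatever an in-process patcher replaces, and the helper name dumps_in_child is made up for this example.

```python
import json
import subprocess
import sys
from unittest import mock


def dumps_in_child() -> str:
    """Run json.dumps({}) in a brand-new interpreter and return what it prints."""
    result = subprocess.run(
        [sys.executable, "-c", "import json; print(json.dumps({}))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Patch json.dumps in this process only (stand-in for an in-process patcher).
with mock.patch("json.dumps", return_value="patched"):
    in_parent = json.dumps({})   # the patch is active in this interpreter
    in_child = dumps_in_child()  # the child imports its own, unpatched json

print("parent:", in_parent)
print("child:", in_child)
```

The parent sees the patched function while the freshly started interpreter does not, which is why anything that launches a separate interpreter would bypass patches applied inside the test process.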
Thanks @Stranger6667, really appreciate your help.
I'm on macOS 10.15.5 with Python 3.8.3.
pip output:
vcrpy==4.0.2
pytest==5.4.3
pytest-recording==0.8.1
Twisted==20.3.0
Scrapy==2.2.0
I saw the VCR compatibility doc and wondered if maybe Scrapy isn't compatible?
Looks like Scrapy uses twisted.web.
Unfortunately, VCR doesn't support twisted.web. I see a couple of things we can do about it:
- Implement it in VCRpy. Quite a hardcore option, to be honest, but still possible. In short, we'd need to implement mocks for certain Twisted interfaces; I can't assess the complexity at the moment.
- Since twisted.web uses sockets under the hood, there are other tools that might help, for example HTTMock.
In any case, pytest-recording will automatically work once this is implemented on the vcrpy side, which I think is the cleanest way to record/replay HTTP for twisted.web.
I've raised the issue with VCRpy, and I'll look into HTTPretty (I assume that's what you meant).
Also found this answer on Stack Overflow, which talks about simply using Scrapy's built-in cache.
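For reference, a rough sketch of what turning on Scrapy's built-in HTTP cache could look like. The HTTPCACHE_* names are Scrapy settings; the helper with_http_cache and the choice of values are assumptions made for this example, not something taken from the linked answer.

```python
# Settings read by Scrapy's HTTP cache middleware.
HTTP_CACHE_SETTINGS = {
    "HTTPCACHE_ENABLED": True,
    # A relative path is resolved against Scrapy's project data directory.
    "HTTPCACHE_DIR": "httpcache",
    # 0 means cached responses never expire.
    "HTTPCACHE_EXPIRATION_SECS": 0,
}


def with_http_cache(custom_settings=None):
    """Return a copy of custom_settings with the cache settings merged in.

    Hypothetical helper: the cache settings take precedence over any
    HTTPCACHE_* keys already present in custom_settings.
    """
    merged = dict(custom_settings or {})
    merged.update(HTTP_CACHE_SETTINGS)
    return merged


# Example: merge with some other spider settings.
settings = with_http_cache({"DOWNLOAD_DELAY": 0})
print(settings)
```

The resulting dict would be assigned to the spider's custom_settings before the CrawlerProcess is created. Presumably the first run still goes to the network to fill the cache, and later runs are served from disk.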
I'll have a play around with both.
Thanks @Stranger6667.