selenium.common.exceptions.TimeoutException
soheilrt opened this issue · 12 comments
Hi,
I'm getting the following exception from Selenium. I guess it's caused by the latest Google Chrome update.
ChromeDriver version: 105.0.5195
Error stack:
[2022/08/31 12:17:42|crawl_immobilienscout.py|DEBUG ]: Got search URL https://www.immobilienscout24.de/Suche/shape/wohnung-mieten?shape=ZXxpX0lzdH1vQWRmQGd9QG5rQmN5QXRwQH1fRGVxQX1oU3FtQWJSfXtCX2lBfXxAd1dvRWVfQHtkQGlSd3lAbnJAeVRuYkFDQXtnQGpyQWFEdnhCbElkdUNqa0BibExuZ0BsbEJwUmhSenVAbUVmYkBmfUBycEBpUg..&petsallowedtypes=negotiable&numberofrooms=2.0-&price=-1500.0&livingspace=40.0-&exclusioncriteria=swapflat&pricetype=calculatedtotalrent&sorting=2&enteredFrom=result_list&pagenumber={0}
Traceback (most recent call last):
File "flathunt.py", line 110, in <module>
main()
File "flathunt.py", line 106, in main
launch_flat_hunt(config, heartbeat)
File "flathunt.py", line 30, in launch_flat_hunt
hunter.hunt_flats()
File "/home/srahmat/projects/soheil/flathunter/flathunter/hunter.py", line 54, in hunt_flats
for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
File "/home/srahmat/projects/soheil/flathunter/flathunter/hunter.py", line 33, in crawl_for_exposes
return chain(*[try_crawl(searcher, url, max_pages)
File "/home/srahmat/projects/soheil/flathunter/flathunter/hunter.py", line 33, in <listcomp>
return chain(*[try_crawl(searcher, url, max_pages)
File "/home/srahmat/projects/soheil/flathunter/flathunter/hunter.py", line 25, in try_crawl
return searcher.crawl(url, max_pages)
File "/home/srahmat/projects/soheil/flathunter/flathunter/abstract_crawler.py", line 158, in crawl
return self.get_results(url, max_pages)
File "/home/srahmat/projects/soheil/flathunter/flathunter/crawl_immobilienscout.py", line 48, in get_results
soup = self.get_page(search_url, self.driver, page_no)
File "/home/srahmat/projects/soheil/flathunter/flathunter/crawl_immobilienscout.py", line 117, in get_page
return self.get_soup_from_url(
File "/home/srahmat/projects/soheil/flathunter/flathunter/abstract_crawler.py", line 92, in get_soup_from_url
self.resolve_recaptcha(driver, checkbox, afterlogin_string)
File "/home/srahmat/.local/share/virtualenvs/flathunter-Lr2nvbiT/lib/python3.8/site-packages/backoff/_sync.py", line 105, in retry
ret = target(*args, **kwargs)
File "/home/srahmat/projects/soheil/flathunter/flathunter/abstract_crawler.py", line 207, in resolve_recaptcha
iframe_present = self._wait_for_iframe(driver)
File "/home/srahmat/projects/soheil/flathunter/flathunter/abstract_crawler.py", line 264, in _wait_for_iframe
iframe = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
File "/home/srahmat/.local/share/virtualenvs/flathunter-Lr2nvbiT/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 90, in until
raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message:
Stacktrace:
#0 0x5586c5881693 <unknown>
#1 0x5586c567ab0a <unknown>
#2 0x5586c56b35f7 <unknown>
#3 0x5586c56b37c1 <unknown>
#4 0x5586c56e6804 <unknown>
#5 0x5586c56d094d <unknown>
#6 0x5586c56e44b0 <unknown>
#7 0x5586c56d0743 <unknown>
#8 0x5586c56a6533 <unknown>
#9 0x5586c56a7715 <unknown>
#10 0x5586c58d17bd <unknown>
#11 0x5586c58d4bf9 <unknown>
#12 0x5586c58b6f2e <unknown>
#13 0x5586c58d59b3 <unknown>
#14 0x5586c58aae4f <unknown>
#15 0x5586c58f4ea8 <unknown>
#16 0x5586c58f5052 <unknown>
#17 0x5586c590f71f <unknown>
#18 0x7f93c6556609 <unknown>
No, I'm running it without any containers. I found the issue: the driver couldn't find the captcha within the given timeout.
The exception happens here: flathunter/flathunter/abstract_crawler.py, lines 261 to 269 in 0b5445d
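For context, the traceback above shows `_wait_for_iframe` calling `WebDriverWait(driver, 10).until(...)`, which raises `TimeoutException` when the reCAPTCHA iframe never becomes visible. The following is a minimal, stdlib-only sketch of that polling pattern, not flathunter's actual code. The helper name `wait_until` and its parameters are hypothetical; it returns `None` on timeout instead of raising, so a caller could treat "no captcha iframe appeared" as a normal outcome.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], Optional[T]],
               timeout: float = 10.0,
               poll_interval: float = 0.5,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Poll `condition` until it returns a truthy value or `timeout` elapses.

    Returns the truthy value, or None if the deadline is reached first.
    `clock` and `sleep` are injectable so the helper can be tested
    without real waiting.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            # Deadline reached: report the timeout to the caller as None.
            return None
        sleep(poll_interval)
```

With Selenium, the equivalent would presumably be to catch `TimeoutException` around the existing `WebDriverWait(...).until(...)` call. Note that this only changes how the timeout is reported; it does not make a missing captcha appear.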
Hi,
I'm facing a similar issue, and I also saw issues related to the warning I'm getting now: "Unable to find IS24 variable in window". I tried scaling up the instance I run it on in AWS and also tried running it locally on a Mac M1, but both attempts failed with the same error.
[2022/08/31 12:36:36|config.py |INFO ]: Using config /opt/flathunter/config.yaml
[2022/08/31 12:36:36|flathunt.py |DEBUG ]: Settings from config: <flathunter.config.Config object at 0x7f74bc14a3d0>
[2022/08/31 12:36:36|abstract_crawler.py |INFO ]: Initializing Chrome WebDriver for crawler "CrawlImmobilienscout"...
[2022/08/31 12:36:36|<WebDriverManager> |DEBUG ]: ====== WebDriver manager ======
[2022/08/31 12:36:36|<WebDriverManager> |DEBUG ]: Get LATEST chromedriver version for google-chrome 105.0.5195
[2022/08/31 12:36:36|<WebDriverManager> |DEBUG ]: Driver [/home/flathunter/.wdm/drivers/chromedriver/linux64/105.0.5195/chromedriver] found in cache
[2022/08/31 12:36:37|crawl_immobilienscout.py|DEBUG ]: Got search URL https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?haspromotion=false&numberofrooms=2.5-&price=-1600.0&livingspace=66.0-&exclusioncriteria=swapflat&pricetype=calculatedtotalrent&geocodes=110000000307,110000001101,110000001112,110000000101,110000000102,110000000201,110000001103,110000000301,110000000202,110000000302,110000000105,110000000106,110000001110,110000001111&sorting=2&enteredFrom=result_list&pagenumber={0}
[2022/08/31 12:36:39|twocaptcha_solver.py |INFO ]: Trying to solve geetest.
[2022/08/31 12:36:39|twocaptcha_solver.py |DEBUG ]: Got response from 2captcha/in: OK|71383588107
[2022/08/31 12:36:40|twocaptcha_solver.py |DEBUG ]: Got response from 2captcha/res: CAPCHA_NOT_READY
[2022/08/31 12:36:40|twocaptcha_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2022/08/31 12:36:45|twocaptcha_solver.py |DEBUG ]: Got response from 2captcha/res: CAPCHA_NOT_READY
[2022/08/31 12:36:45|twocaptcha_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2022/08/31 12:36:50|twocaptcha_solver.py |DEBUG ]: Got response from 2captcha/res: CAPCHA_NOT_READY
[2022/08/31 12:36:50|twocaptcha_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2022/08/31 12:36:55|twocaptcha_solver.py |DEBUG ]: Got response from 2captcha/res: CAPCHA_NOT_READY
[2022/08/31 12:36:55|twocaptcha_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2022/08/31 12:37:06|twocaptcha_solver.py |DEBUG ]: Got response from 2captcha/res: CAPCHA_NOT_READY
[2022/08/31 12:37:06|twocaptcha_solver.py |INFO ]: Captcha is not ready yet, waiting...
[2022/08/31 12:37:11|twocaptcha_solver.py |DEBUG ]: Got response from 2captcha/res: OK|{"geetest_challenge":"c02a11c5302d97b24c778310c3e208e0","geetest_validate":"b9da8cc9973777384c64066997b43e7e","geetest_seccode":"b9da8cc9973777384c64066997b43e7e|jordan"}
[2022/08/31 12:37:13|crawl_immobilienscout.py|WARNING ]: Unable to find IS24 variable in window
[2022/08/31 12:38:13|crawl_immobilienscout.py|DEBUG ]: Got search URL https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?haspromotion=false&numberofrooms=2.5-&price=-1600.0&livingspace=66.0-&exclusioncriteria=swapflat&pricetype=calculatedtotalrent&geocodes=110000000307,110000001101,110000001112,110000000101,110000000102,110000000201,110000001103,110000000301,110000000202,110000000302,110000000105,110000000106,110000001110,110000001111&sorting=2&enteredFrom=result_list&pagenumber={0}
Aug 31 12:38:24 flathunter flathunter[2865]: Traceback (most recent call last):
Aug 31 12:38:24 flathunter flathunter[2865]: File "flathunt.py", line 105, in <module>
Aug 31 12:38:24 flathunter flathunter[2865]: main()
Aug 31 12:38:24 flathunter flathunter[2865]: File "flathunt.py", line 101, in main
Aug 31 12:38:24 flathunter flathunter[2865]: launch_flat_hunt(config, heartbeat)
Aug 31 12:38:24 flathunter flathunter[2865]: File "flathunt.py", line 38, in launch_flat_hunt
Aug 31 12:38:24 flathunter flathunter[2865]: hunter.hunt_flats()
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/hunter.py", line 54, in hunt_flats
Aug 31 12:38:24 flathunter flathunter[2865]: for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/hunter.py", line 34, in crawl_for_exposes
Aug 31 12:38:24 flathunter flathunter[2865]: for searcher in self.config.searchers()
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/hunter.py", line 35, in <listcomp>
Aug 31 12:38:24 flathunter flathunter[2865]: for url in self.config.get('urls', [])])
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/hunter.py", line 25, in try_crawl
Aug 31 12:38:24 flathunter flathunter[2865]: return searcher.crawl(url, max_pages)
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/abstract_crawler.py", line 158, in crawl
Aug 31 12:38:24 flathunter flathunter[2865]: return self.get_results(url, max_pages)
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/crawl_immobilienscout.py", line 55, in get_results
Aug 31 12:38:24 flathunter flathunter[2865]: soup = self.get_page(search_url, self.driver, page_no)
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/crawl_immobilienscout.py", line 128, in get_page
Aug 31 12:38:24 flathunter flathunter[2865]: afterlogin_string=self.afterlogin_string
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/abstract_crawler.py", line 92, in get_soup_from_url
Aug 31 12:38:24 flathunter flathunter[2865]: self.resolve_recaptcha(driver, checkbox, afterlogin_string)
Aug 31 12:38:24 flathunter flathunter[2865]: File "/home/flathunter/.local/share/virtualenvs/flathunter--s35lxKo/lib/python3.7/site-packages/backoff/_sync.py", line 105, in retry
Aug 31 12:38:24 flathunter flathunter[2865]: ret = target(*args, **kwargs)
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/abstract_crawler.py", line 207, in resolve_recaptcha
Aug 31 12:38:24 flathunter flathunter[2865]: iframe_present = self._wait_for_iframe(driver)
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/abstract_crawler.py", line 265, in _wait_for_iframe
Aug 31 12:38:24 flathunter flathunter[2865]: (By.CSS_SELECTOR, "iframe[src^='https://www.google.com/recaptcha/api2/anchor?']")))
Aug 31 12:38:24 flathunter flathunter[2865]: File "/home/flathunter/.local/share/virtualenvs/flathunter--s35lxKo/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 90, in until
Aug 31 12:38:24 flathunter flathunter[2865]: raise TimeoutException(message, screen, stacktrace)
Aug 31 12:38:24 flathunter flathunter[2865]: selenium.common.exceptions.TimeoutException: Message:
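The log above shows the 2captcha polling cycle: the `res` endpoint answers `CAPCHA_NOT_READY` until the solution arrives as `OK|` followed by a JSON payload. Here is a small sketch of that loop, based only on the response strings visible in the log. The function names, the `fetch` callable, and the defaults for `max_attempts` and `delay` are assumptions for illustration, not the code in `twocaptcha_solver.py`.

```python
import json
import time
from typing import Callable, Optional


def parse_res_response(body: str) -> Optional[dict]:
    """Parse one `res` response as seen in the log above.

    Returns the decoded JSON payload, or None while the captcha is not
    ready. Assumes the payload after "OK|" is JSON, as in the geetest
    case; any other response raises ValueError.
    """
    if body == "CAPCHA_NOT_READY":
        return None
    # Split on the first "|" only: the payload itself may contain "|".
    status, sep, payload = body.partition("|")
    if status == "OK" and sep:
        return json.loads(payload)
    raise ValueError(f"Unexpected 2captcha response: {body!r}")


def poll_for_solution(fetch: Callable[[], str],
                      max_attempts: int = 24,
                      delay: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call `fetch` (hypothetical: returns the raw response body)
    until a solution is available or `max_attempts` is exhausted."""
    for _ in range(max_attempts):
        solution = parse_res_response(fetch())
        if solution is not None:
            return solution
        sleep(delay)
    raise TimeoutError("Captcha was not solved in time")
```

In this log the solver did return a solution, so the later `TimeoutException` comes from the separate wait for the reCAPTCHA iframe, not from this loop.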
Any luck @soheilrt?
I tried changing the timeout you mentioned, but without success. As I mentioned, scaling up the resources didn't help either.
Nope, still working on it. I guess they made a change in their captcha system. It requires a bit of manual reverse engineering.
I just tried this myself. Running locally, with headless turned off (so I can see and inspect the Selenium Chrome Instance), I get detected as a bot right away and no captcha is offered.
The same URL in a new incognito window (same request headers, same IP) gets the captcha request.
I see in the response headers that the 'reese84' cookie is set, which seems to be a sign that Imperva is in use. If we don't find another way around that (like using ScrapFly), that might be game-over for Immoscout.
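To make this check repeatable, here is a rough sketch that looks for the 'reese84' cookie in a list of `Set-Cookie` header values. The function name is hypothetical, and treating the cookie as an Imperva indicator is only the heuristic described in this comment, not a confirmed rule.

```python
from typing import Iterable


def sets_reese84_cookie(set_cookie_headers: Iterable[str]) -> bool:
    """Return True if any Set-Cookie header value sets a cookie named 'reese84'.

    Each item is expected to be the value of one Set-Cookie header,
    e.g. "reese84=...; Path=/; Secure".
    """
    for header in set_cookie_headers:
        # The cookie name is everything before the first "=" of the
        # first ";"-separated part; the remaining parts are attributes.
        name = header.split(";", 1)[0].split("=", 1)[0].strip()
        if name == "reese84":
            return True
    return False
```

For example, `sets_reese84_cookie(["reese84=abc; Path=/"])` returns `True`, while a response that only sets a session cookie returns `False`.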
I don't know whether anyone has tried this before, but maybe we could reverse engineer the application's API and use it instead of the web interface. It most likely doesn't have any captchas and might only have a rate limiter. What do you think?
Time to crack open Burp Suite :) I think the problem we have as an open source project is the cat-and-mouse game - the Immoscout developers can read these comments and implement countermeasures for anything we do here. Even if we reverse-engineered their API, they could just add a cookie check that propagates a verification token from the captcha, and we would be locked out again.
For me, the advantage of something like ScrapFly is that Immoscout has no insights into how ScrapFly does its scraping, and they have to somehow continue to allow real humans to use the website - they can't block arbitrarily sophisticated scrapers without blocking their customers. ScrapFly is also a paid service - if the scraping breaks, we talk to ScrapFly support instead of reverse-engineering again. I'll try signing up for the free tier this week and seeing if it's at least feasible. It's not super expensive (compared to ImmoScout Pro! :) )
Huginn is an interesting project. Maybe we end up migrating to that, or using AWS Mechanical Turk(!)
I managed to bypass the immobilienscout24 captcha system, but it's not yet well tested. It seems that we won't need captcha solvers if the user respects the website's rate limit (I guess 10-20 requests per minute or so).
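As an illustration of what respecting such a rate limit could look like on the client side, here is a minimal throttle sketch. The `Throttle` class and its default of 10 requests per minute are assumptions based on the guess above; this is not the bypass itself, and nothing here is confirmed to avoid the captcha.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum interval between calls to `wait()`."""

    def __init__(self, max_per_minute: float = 10,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._interval = 60.0 / max_per_minute  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            # Re-read the clock in case the sleep overshot.
            now = max(self._clock(), self._next_allowed)
        self._next_allowed = now + self._interval
        return delay
```

A crawler would call `throttle.wait()` once before each page request, so that consecutive requests are spaced at least six seconds apart with the default setting.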
I don't want to share the solution yet, both to make sure everything works well and to avoid giving their developers the chance to implement a countermeasure in advance. I will keep you updated. Stay tuned.
Very cool! We're all rooting for you :)