flathunters/flathunter

selenium.common.exceptions.TimeoutException

soheilrt opened this issue · 12 comments

Hi,

I'm getting the following exception from Selenium. I guess it's because of the latest Google Chrome update.

chrome driver version: 105.0.5195

Error stack:

[2022/08/31 12:17:42|crawl_immobilienscout.py|DEBUG   ]: Got search URL https://www.immobilienscout24.de/Suche/shape/wohnung-mieten?shape=ZXxpX0lzdH1vQWRmQGd9QG5rQmN5QXRwQH1fRGVxQX1oU3FtQWJSfXtCX2lBfXxAd1dvRWVfQHtkQGlSd3lAbnJAeVRuYkFDQXtnQGpyQWFEdnhCbElkdUNqa0BibExuZ0BsbEJwUmhSenVAbUVmYkBmfUBycEBpUg..&petsallowedtypes=negotiable&numberofrooms=2.0-&price=-1500.0&livingspace=40.0-&exclusioncriteria=swapflat&pricetype=calculatedtotalrent&sorting=2&enteredFrom=result_list&pagenumber={0}
Traceback (most recent call last):
  File "flathunt.py", line 110, in <module>
    main()
  File "flathunt.py", line 106, in main
    launch_flat_hunt(config, heartbeat)
  File "flathunt.py", line 30, in launch_flat_hunt
    hunter.hunt_flats()
  File "/home/srahmat/projects/soheil/flathunter/flathunter/hunter.py", line 54, in hunt_flats
    for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
  File "/home/srahmat/projects/soheil/flathunter/flathunter/hunter.py", line 33, in crawl_for_exposes
    return chain(*[try_crawl(searcher, url, max_pages)
  File "/home/srahmat/projects/soheil/flathunter/flathunter/hunter.py", line 33, in <listcomp>
    return chain(*[try_crawl(searcher, url, max_pages)
  File "/home/srahmat/projects/soheil/flathunter/flathunter/hunter.py", line 25, in try_crawl
    return searcher.crawl(url, max_pages)
  File "/home/srahmat/projects/soheil/flathunter/flathunter/abstract_crawler.py", line 158, in crawl
    return self.get_results(url, max_pages)
  File "/home/srahmat/projects/soheil/flathunter/flathunter/crawl_immobilienscout.py", line 48, in get_results
    soup = self.get_page(search_url, self.driver, page_no)
  File "/home/srahmat/projects/soheil/flathunter/flathunter/crawl_immobilienscout.py", line 117, in get_page
    return self.get_soup_from_url(
  File "/home/srahmat/projects/soheil/flathunter/flathunter/abstract_crawler.py", line 92, in get_soup_from_url
    self.resolve_recaptcha(driver, checkbox, afterlogin_string)
  File "/home/srahmat/.local/share/virtualenvs/flathunter-Lr2nvbiT/lib/python3.8/site-packages/backoff/_sync.py", line 105, in retry
    ret = target(*args, **kwargs)
  File "/home/srahmat/projects/soheil/flathunter/flathunter/abstract_crawler.py", line 207, in resolve_recaptcha
    iframe_present = self._wait_for_iframe(driver)
  File "/home/srahmat/projects/soheil/flathunter/flathunter/abstract_crawler.py", line 264, in _wait_for_iframe
    iframe = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
  File "/home/srahmat/.local/share/virtualenvs/flathunter-Lr2nvbiT/lib/python3.8/site-packages/selenium/webdriver/support/wait.py", line 90, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: 
Stacktrace:
#0 0x5586c5881693 <unknown>
#1 0x5586c567ab0a <unknown>
#2 0x5586c56b35f7 <unknown>
#3 0x5586c56b37c1 <unknown>
#4 0x5586c56e6804 <unknown>
#5 0x5586c56d094d <unknown>
#6 0x5586c56e44b0 <unknown>
#7 0x5586c56d0743 <unknown>
#8 0x5586c56a6533 <unknown>
#9 0x5586c56a7715 <unknown>
#10 0x5586c58d17bd <unknown>
#11 0x5586c58d4bf9 <unknown>
#12 0x5586c58b6f2e <unknown>
#13 0x5586c58d59b3 <unknown>
#14 0x5586c58aae4f <unknown>
#15 0x5586c58f4ea8 <unknown>
#16 0x5586c58f5052 <unknown>
#17 0x5586c590f71f <unknown>
#18 0x7f93c6556609 <unknown>

No, I'm running it without any containers. I found the issue: the driver couldn't find the captcha within the given timeout.

The exception happens here:

    def _wait_for_iframe(self, driver: selenium.webdriver.Chrome):
        """Wait for iFrame to appear"""
        try:
            iframe = WebDriverWait(driver, 10).until(EC.visibility_of_element_located(
                (By.CSS_SELECTOR, "iframe[src^='https://www.google.com/recaptcha/api2/anchor?']")))
            return iframe
        except NoSuchElementException:
            print("No iframe found, therefore no chaptcha verification necessary")
            return None

Hi,

I'm facing a similar issue, and I also saw issues related to the warning I'm getting now: Unable to find IS24 variable in window. I tried scaling up the instance I run it on in AWS and also tried running it locally on a Mac M1, but both failed with the same error.

[2022/08/31 12:36:36|config.py               |INFO    ]: Using config /opt/flathunter/config.yaml
[2022/08/31 12:36:36|flathunt.py             |DEBUG   ]: Settings from config: <flathunter.config.Config object at 0x7f74bc14a3d0>
[2022/08/31 12:36:36|abstract_crawler.py     |INFO    ]: Initializing Chrome WebDriver for crawler "CrawlImmobilienscout"...
[2022/08/31 12:36:36|<WebDriverManager>      |DEBUG   ]: ====== WebDriver manager ======
[2022/08/31 12:36:36|<WebDriverManager>      |DEBUG   ]: Get LATEST chromedriver version for google-chrome 105.0.5195
[2022/08/31 12:36:36|<WebDriverManager>      |DEBUG   ]: Driver [/home/flathunter/.wdm/drivers/chromedriver/linux64/105.0.5195/chromedriver] found in cache
[2022/08/31 12:36:37|crawl_immobilienscout.py|DEBUG   ]: Got search URL https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?haspromotion=false&numberofrooms=2.5-&price=-1600.0&livingspace=66.0-&exclusioncriteria=swapflat&pricetype=calculatedtotalrent&geocodes=110000000307,110000001101,110000001112,110000000101,110000000102,110000000201,110000001103,110000000301,110000000202,110000000302,110000000105,110000000106,110000001110,110000001111&sorting=2&enteredFrom=result_list&pagenumber={0}
[2022/08/31 12:36:39|twocaptcha_solver.py    |INFO    ]: Trying to solve geetest.
[2022/08/31 12:36:39|twocaptcha_solver.py    |DEBUG   ]: Got response from 2captcha/in: OK|71383588107
[2022/08/31 12:36:40|twocaptcha_solver.py    |DEBUG   ]: Got response from 2captcha/res: CAPCHA_NOT_READY
[2022/08/31 12:36:40|twocaptcha_solver.py    |INFO    ]: Captcha is not ready yet, waiting...
[2022/08/31 12:36:45|twocaptcha_solver.py    |DEBUG   ]: Got response from 2captcha/res: CAPCHA_NOT_READY
[2022/08/31 12:36:45|twocaptcha_solver.py    |INFO    ]: Captcha is not ready yet, waiting...
[2022/08/31 12:36:50|twocaptcha_solver.py    |DEBUG   ]: Got response from 2captcha/res: CAPCHA_NOT_READY
[2022/08/31 12:36:50|twocaptcha_solver.py    |INFO    ]: Captcha is not ready yet, waiting...
[2022/08/31 12:36:55|twocaptcha_solver.py    |DEBUG   ]: Got response from 2captcha/res: CAPCHA_NOT_READY
[2022/08/31 12:36:55|twocaptcha_solver.py    |INFO    ]: Captcha is not ready yet, waiting...
[2022/08/31 12:37:06|twocaptcha_solver.py    |DEBUG   ]: Got response from 2captcha/res: CAPCHA_NOT_READY
[2022/08/31 12:37:06|twocaptcha_solver.py    |INFO    ]: Captcha is not ready yet, waiting...
[2022/08/31 12:37:11|twocaptcha_solver.py    |DEBUG   ]: Got response from 2captcha/res: OK|{"geetest_challenge":"c02a11c5302d97b24c778310c3e208e0","geetest_validate":"b9da8cc9973777384c64066997b43e7e","geetest_seccode":"b9da8cc9973777384c64066997b43e7e|jordan"}
[2022/08/31 12:37:13|crawl_immobilienscout.py|WARNING ]: Unable to find IS24 variable in window
[2022/08/31 12:38:13|crawl_immobilienscout.py|DEBUG   ]: Got search URL https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?haspromotion=false&numberofrooms=2.5-&price=-1600.0&livingspace=66.0-&exclusioncriteria=swapflat&pricetype=calculatedtotalrent&geocodes=110000000307,110000001101,110000001112,110000000101,110000000102,110000000201,110000001103,110000000301,110000000202,110000000302,110000000105,110000000106,110000001110,110000001111&sorting=2&enteredFrom=result_list&pagenumber={0}
Aug 31 12:38:24 flathunter flathunter[2865]: Traceback (most recent call last):
Aug 31 12:38:24 flathunter flathunter[2865]: File "flathunt.py", line 105, in <module>
Aug 31 12:38:24 flathunter flathunter[2865]: main()
Aug 31 12:38:24 flathunter flathunter[2865]: File "flathunt.py", line 101, in main
Aug 31 12:38:24 flathunter flathunter[2865]: launch_flat_hunt(config, heartbeat)
Aug 31 12:38:24 flathunter flathunter[2865]: File "flathunt.py", line 38, in launch_flat_hunt
Aug 31 12:38:24 flathunter flathunter[2865]: hunter.hunt_flats()
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/hunter.py", line 54, in hunt_flats
Aug 31 12:38:24 flathunter flathunter[2865]: for expose in processor_chain.process(self.crawl_for_exposes(max_pages)):
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/hunter.py", line 34, in crawl_for_exposes
Aug 31 12:38:24 flathunter flathunter[2865]: for searcher in self.config.searchers()
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/hunter.py", line 35, in <listcomp>
Aug 31 12:38:24 flathunter flathunter[2865]: for url in self.config.get('urls', [])])
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/hunter.py", line 25, in try_crawl
Aug 31 12:38:24 flathunter flathunter[2865]: return searcher.crawl(url, max_pages)
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/abstract_crawler.py", line 158, in crawl
Aug 31 12:38:24 flathunter flathunter[2865]: return self.get_results(url, max_pages)
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/crawl_immobilienscout.py", line 55, in get_results
Aug 31 12:38:24 flathunter flathunter[2865]: soup = self.get_page(search_url, self.driver, page_no)
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/crawl_immobilienscout.py", line 128, in get_page
Aug 31 12:38:24 flathunter flathunter[2865]: afterlogin_string=self.afterlogin_string
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/abstract_crawler.py", line 92, in get_soup_from_url
Aug 31 12:38:24 flathunter flathunter[2865]: self.resolve_recaptcha(driver, checkbox, afterlogin_string)
Aug 31 12:38:24 flathunter flathunter[2865]: File "/home/flathunter/.local/share/virtualenvs/flathunter--s35lxKo/lib/python3.7/site-packages/backoff/_sync.py", line 105, in retry
Aug 31 12:38:24 flathunter flathunter[2865]: ret = target(*args, **kwargs)
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/abstract_crawler.py", line 207, in resolve_recaptcha
Aug 31 12:38:24 flathunter flathunter[2865]: iframe_present = self._wait_for_iframe(driver)
Aug 31 12:38:24 flathunter flathunter[2865]: File "/opt/flathunter/flathunter/abstract_crawler.py", line 265, in _wait_for_iframe
Aug 31 12:38:24 flathunter flathunter[2865]: (By.CSS_SELECTOR, "iframe[src^='https://www.google.com/recaptcha/api2/anchor?']")))
Aug 31 12:38:24 flathunter flathunter[2865]: File "/home/flathunter/.local/share/virtualenvs/flathunter--s35lxKo/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 90, in until
Aug 31 12:38:24 flathunter flathunter[2865]: raise TimeoutException(message, screen, stacktrace)
Aug 31 12:38:24 flathunter flathunter[2865]: selenium.common.exceptions.TimeoutException: Message:
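
For anyone reading the `twocaptcha_solver.py` lines in the log above: the solver is polled roughly every five seconds until the response changes from `CAPCHA_NOT_READY` to `OK|<payload>`. The sketch below reproduces only that polling pattern as it appears in the log. The `fetch_result` callable, the function name, and the `max_attempts` limit are hypothetical and are not taken from flathunter's code.

```python
import time
from typing import Callable

NOT_READY = "CAPCHA_NOT_READY"  # spelled as it appears in the log


def poll_for_solution(
    fetch_result: Callable[[], str],
    interval: float = 5.0,
    max_attempts: int = 24,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_result() until it returns 'OK|<payload>'; return the payload."""
    for _ in range(max_attempts):
        response = fetch_result().strip()
        if response == NOT_READY:
            # Solution not available yet: wait and ask again.
            sleep(interval)
            continue
        status, _, payload = response.partition("|")
        if status == "OK" and payload:
            return payload
        raise RuntimeError(f"Unexpected solver response: {response!r}")
    raise TimeoutError(f"No solution after {max_attempts} attempts")
```

In the log the solver does return a solution, so the later `TimeoutException` comes from the iframe wait, not from this polling step.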

Any luck @soheilrt?
I tried changing the timeout you mentioned, but without success. As I mentioned, scaling up the resources didn't help either.

Nope, still working on it. I guess they made a change in their captcha system. It requires a bit of manual reverse engineering.

I just tried this myself. Running locally, with headless turned off (so I can see and inspect the Selenium Chrome Instance), I get detected as a bot right away and no captcha is offered.
[Screenshot: 2022-08-31-160234_1060x784_scrot]

The same URL in a new incognito window (same request headers, same IP) gets the captcha request.

[Screenshot: 2022-08-31-160250_1325x1241_scrot]

I see in the response headers that the 'reese84' cookie is set, which seems to be a sign that Imperva is in use. If we don't find another way around that (like using ScrapFly), that might be game-over for Immoscout.
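
To make that check repeatable, here is a small sketch that looks for a `reese84` cookie among `Set-Cookie` header values. Only the cookie name comes from the observation above. The function name is made up, and reading the cookie as a sign of Imperva is the inference from this comment, not something the code can confirm.

```python
from typing import Iterable


def sets_cookie(set_cookie_headers: Iterable[str], name: str = "reese84") -> bool:
    """Return True if any Set-Cookie header value sets a cookie called `name`."""
    for header in set_cookie_headers:
        # A Set-Cookie value looks like "name=value; Attr1; Attr2=...".
        # Only the first "name=value" pair is the cookie itself.
        first_pair = header.split(";", 1)[0]
        cookie_name = first_pair.split("=", 1)[0].strip()
        if cookie_name == name:
            return True
    return False
```

With `requests`, for example, the same information is available through `response.cookies`, but the header-based version works on headers copied from the browser's dev tools as well.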

I don't know whether someone has tried this before. Maybe we can reverse engineer the application API and use it instead of the web interface. It most likely doesn't have any captchas and might only have a rate limiter. What do you think?

Time to crack open burpsuite :) I think the problem we have as an open source project is the cat-and-mouse game - the Immoscout developers can read these comments and implement countermeasures for anything we do here. Even if we reverse-engineered their API, they could just add a cookie check that propagates a verification token from the captcha, and we would be locked out again.

For me, the advantage of something like ScrapFly is that Immoscout has no insights into how ScrapFly does its scraping, and they have to somehow continue to allow real humans to use the website - they can't block arbitrarily sophisticated scrapers without blocking their customers. ScrapFly is also a paid service - if the scraping breaks, we talk to ScrapFly support instead of reverse-engineering again. I'll try signing up for the free tier this week and seeing if it's at least feasible. It's not super expensive (compared to ImmoScout Pro! :) )

Huginn is an interesting project. Maybe we end up migrating to that, or using AWS Mechanical Turk(!)

Well, I tried ScrapFly and the success rate is not promising; see below.

Screenshot from 2022-08-31 17-37-53

🎉 🎉 Good News 🎉 🎉

I managed to bypass the immobilienscout24 captcha system, but it's not yet well tested. It seems that we won't need captcha solvers if the user respects the website's rate limit (I guess 10-20 requests per minute or so).
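
Staying under a limit like that can be done with a simple throttle between requests. The sketch below illustrates only the rate-limiting idea, not the undisclosed bypass. The class name and the default of 10 requests per minute are assumptions based on the guess above; the real limit is not known.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space out calls so that at most `max_per_minute` happen per minute."""

    def __init__(
        self,
        max_per_minute: int = 10,  # assumed value, taken from the guess above
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_per_minute <= 0:
            raise ValueError("max_per_minute must be positive")
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Usage would be one `throttle.wait()` call before each page request, for example at the top of the loop that fetches result pages.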

I don't want to share the solution yet, so I can make sure everything works well and not give their developers the chance to implement a countermeasure in advance. I will keep you updated. Stay tuned.

Screenshot from 2022-08-31 23-19-36

Very cool! We're all rooting for you :)

A patch has been merged, issue fixed: #211