flathunters/flathunter

Immoscout: Bot detection/No captcha necessary

phi1eas opened this issue · 40 comments

Hi,

I am trying to run flathunter on immoscout24 using imagetyperz, and I run into the following issue:

$ pipenv run python3 flathunt.py
[2023/01/25 21:04:20|config.py               |INFO    ]: Using config path /home/max/flathunter/config.yaml
[2023/01/25 21:04:20|chrome_wrapper.py       |INFO    ]: Initializing Chrome WebDriver for crawler...
[2023/01/25 21:04:21|patcher.py              |INFO    ]: patching driver executable /home/max/.local/share/undetected_chromedriver/9418e1b60bf980e1_chromedriver
[2023/01/25 21:04:33|abstract_crawler.py     |INFO    ]: Timeout waiting for iframe element - no captcha verification necessary?
[2023/01/25 21:04:33|crawl_immobilienscout.py|WARNING ]: Unable to find IS24 variable in window
[2023/01/25 21:04:33|crawl_immobilienscout.py|ERROR   ]: IS24 bot detection has identified our script as a bot - we've been blocked

What I think is weird is this: If I do not pass "--headless" as a driver_argument, a Chromium window opens. This window has the immoscout bot detection page loaded. If I copy the URL from that window, and open this URL in a new tab in Chromium, I get the same page, but this time with the Captcha.

Is this because immoscout24 classified me as a bot, or is there something else going on?

This is my config.yaml:

loop:
    active: yes
    sleeping_time: 600

urls:
  - https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?enteredFrom=one_step_search

filters:

blacklist:
  - Innenstadt

durations:
    - name: John
      destination: Hauptbahnhof, München
      modes: 
          - gm_id: transit
            title: "Öff."
          - gm_id: bicycling
            title: "Rad"
    - name: Jane
      destination: Karlsplatz, München
      modes: 
          - gm_id: transit
            title: "Öff."
          - gm_id: driving
            title: "Auto"

message: |
    {title}
    Zimmer: {rooms}
    Größe: {size}
    Preis: {price}
    Ort: {address}

    {url}

google_maps_api:
    key: YOUR_API_KEY
    url: https://maps.googleapis.com/maps/api/distancematrix/json?origins={origin}&destinations={dest}&mode={mode}&sensor=true&key={key}&arrival_time={arrival}
    enable: False

captcha:
     imagetyperz:
           token: 4B59D2B4CC6B4DE0AFC09D310F77D8CE
#       2captcha:
#             api_key: alskdjaskldjfklj
     driver_arguments:
       - "--no-sandbox"
       - "--disable-gpu"
       - "--remote-debugging-port=9222"
       - "--disable-dev-shm-usage"
       - "window-size=1024,768"

notifiers:
    - telegram
#     - mattermost
#     - apprise

telegram:
  bot_token: (censored)
  notify_with_images: true
  receiver_ids:
      - (censored)

Thank you so much!

Hi @phi1eas ,

I've definitely had the same experience as you before: the URL in the chrome-driver frame gets detected, but the same URL in the normal browser window works fine. We used to have that regularly before we switched to the undetected-chromedriver library, but it's a cat-and-mouse game, and of course IS24 is always trying to improve their detection. You can see in #296 and #272 that you're not the only one hitting this. Unfortunately, it seems a bit random which users / setups get detected and which don't.

@ozeidan made a comment in #272 that they have been working on a solution based on an undetected-chromedriver-provided docker image. That might be something to look at if you want to look deeper into how to develop a long-term fix for this. But your setup looks fine, and pretty similar to mine - I don't think there's a problem there.

What I can recommend, if you are just doing a simple search in Berlin, is to use the hosted version at https://flathunter.codders.io . You can log in there with your Telegram account and set up a basic filter, and you will get messages about new flats in Berlin - no setup required from your side, and the Immoscout crawling is working at the time of writing.

The blacklist, google_maps_api and durations sections from your config can safely be deleted if you're not using those features - I don't know how that made it to the sample config.

Hope that helps!

Thank you so much for your quick and helpful response! I will look into your references and try to contribute where I can.

All the best!

Just to make sure I'm not missing something: Running flathunter without --headless driver argument, I get this site:

Screenshot from 2023-01-25 23-28-17

Now if I copy the link and open a new tab in the same window, I get this site with a captcha:

Screenshot from 2023-01-25 23-28-34

Doesn't this mean that there must be some different information passed by the browser if I manually open the link, as opposed to opening it within flathunter? Maybe we could use that?

Thanks again!

Yeah, I mean, obviously there must be a difference somewhere. The tricky part is working out where. You could try to spy on the traffic between the browsers and immoscout to see what the difference in requests is, but it might also be that some JavaScript is running in the page after it loads to decide whether or not to show the captcha. It could be about the position of your mouse, or the size of the window, or pretty much any property of the application (browsers running JavaScript give away a lot of clues).
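One way to start on that comparison is to dump a handful of JavaScript-visible properties from both browsers and diff them. The following is a minimal sketch, not part of flathunter: `PROBE_JS` and `diff_fingerprints` are hypothetical names, and the list of properties is only a guess at what a detection script might look at. In the automated browser the snippet could be run through Selenium's `execute_script`; in the manual browser it can be pasted into the DevTools console with the leading `return` removed.

```python
import json

# JavaScript that collects a few properties a detection script could inspect.
# The selection is illustrative, not a list of what IS24 actually checks.
PROBE_JS = """
return JSON.stringify({
  webdriver: navigator.webdriver,
  userAgent: navigator.userAgent,
  languages: navigator.languages,
  pluginCount: navigator.plugins.length,
  outerWidth: window.outerWidth,
  outerHeight: window.outerHeight
});
"""


def diff_fingerprints(manual: dict, automated: dict) -> dict:
    """Return {property: (manual_value, automated_value)} for every mismatch."""
    keys = sorted(set(manual) | set(automated))
    return {
        key: (manual.get(key), automated.get(key))
        for key in keys
        if manual.get(key) != automated.get(key)
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    manual = json.loads('{"webdriver": false, "pluginCount": 5, "outerWidth": 1024}')
    automated = json.loads('{"webdriver": true, "pluginCount": 0, "outerWidth": 1024}')
    for prop, (good, bad) in diff_fingerprints(manual, automated).items():
        print(f"{prop}: manual={good!r} automated={bad!r}")
```

Any property that differs between the good case and the bad case is a candidate for closer inspection; identical output would suggest the difference lies in the network traffic instead.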

But the fact that you can reproduce it, and that you have a good case and a bad case on the same machine, is already a solid start for investigating.

Ah. I should also say. The code we use to launch the window also blocks the GeeTest API call (I think the captcha is powered by GeeTest). We do this so that we can request the Captcha from Python without re-using the same captcha token twice. So that is obviously one difference between the automated browser and the manual browser. You can try disabling that (https://github.com/flathunters/flathunter/blob/main/flathunter/chrome_wrapper.py#L46) and see if that makes a difference. Flathunter won't be able to solve the captcha, but you'll be able to see if that's what's tripping the bot detection.
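For anyone who wants to run that experiment, here is a rough sketch of how request blocking can be toggled through the Chrome DevTools Protocol. It assumes the blocking is done with the CDP command `Network.setBlockedURLs` via Selenium's `execute_cdp_cmd`; the URL pattern and the helper names are placeholders, so check the linked line in chrome_wrapper.py for what flathunter really uses. The `RecordingDriver` stand-in only exists so the sketch runs without Chrome.

```python
# Assumed pattern for the GeeTest endpoint; verify against chrome_wrapper.py.
GEETEST_PATTERN = "https://api.geetest.com/get.php*"


def set_blocked_urls(driver, patterns):
    """Ask Chrome (via the DevTools Protocol) to block requests matching patterns.

    `driver` is expected to be a Selenium Chrome driver, which exposes
    execute_cdp_cmd(command, params). An empty list should lift the block.
    """
    driver.execute_cdp_cmd("Network.setBlockedURLs", {"urls": list(patterns)})
    driver.execute_cdp_cmd("Network.enable", {})


class RecordingDriver:
    """Stand-in driver that records CDP calls, so the sketch runs without Chrome."""

    def __init__(self):
        self.calls = []

    def execute_cdp_cmd(self, command, params):
        self.calls.append((command, params))
        return {}


if __name__ == "__main__":
    fake = RecordingDriver()
    set_blocked_urls(fake, [GEETEST_PATTERN])  # blocking, as described above
    set_blocked_urls(fake, [])                 # the experiment: no blocking
    for call in fake.calls:
        print(call)
```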

@phi1eas @ozeidan I just bumped the version of undetected-chromedriver to the latest (3.4.x). Maybe you can check if the issue is resolved for you in the latest.

Just merged in #313, which bumps undetected-chromedriver up a version again. Maybe try again and see if that's better?

23722 commented

I tried the updated version but had no luck. The output remains the same that @phi1eas described.

I set everything up today (Feb 28, running on Mac OSX 10.14 & sending notifications via Telegram, captchas solved w/ Imagetyperz), and I had the same problem (first without any driver_arguments and then even after adding "--headless").

However, I got it working (no longer detecting me as a bot) after I added the additional driver_arguments suggested by @codders in #296 (see here):

driver_arguments:
            - "--no-sandbox"
            - "--headless"
            - "--disable-gpu"
            - "--remote-debugging-port=9222"
            - "--disable-dev-shm-usage"
            - "window-size=1024,768"

UPDATE: Never mind, I guess it really is somehow stochastic / traffic-dependent? Because now I'm running it and being detected as a bot again (without any change to the config.yaml file) and getting the same output as in @phi1eas's original post.

@conorheins Damn - nice try! Thanks for the updates, and sorry to hear that you're struggling with the bot detection. I don't know if it would help you to turn down the looping frequency. It's really hard to see from here what makes a difference. As far as I can tell, it works okay most of the time for most users, but it's for sure not working for everyone all the time.

Thanks for the quick reply @codders -- good to know, I'll try messing with the looping frequency. To be clear, by that you mean decreasing the count in sleeping_time in the loop field of the config file?

Increasing the sleeping_time, yeah. If it sleeps for longer you're less likely to trigger spam protections.
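If a fixed, longer `sleeping_time` is not enough, one variation is to lengthen the pause further each time a crawl is blocked. Nothing in this thread says flathunter does this; the sketch below only illustrates the idea, and `next_sleep`, the doubling factor and the one-hour cap are arbitrary choices.

```python
def next_sleep(base: int, consecutive_blocks: int, cap: int = 3600) -> int:
    """Seconds to sleep before the next crawl.

    Doubles the base interval for every consecutive blocked attempt,
    never exceeding `cap`.
    """
    if consecutive_blocks < 0:
        raise ValueError("consecutive_blocks must be >= 0")
    return min(base * (2 ** consecutive_blocks), cap)


if __name__ == "__main__":
    # base=600 matches the sleeping_time in the config at the top of the thread.
    for blocks in range(4):
        print(blocks, next_sleep(600, blocks))
```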

Is there anything else I could try changing / playing with to make IS24 crawler work in Google Cloud Deployment? It doesn't work for me at all (gets blocked all the time)

@infctr If you've tried everything here, I'm not sure what else. What deployment region are you using in Google Cloud? For me, it's working reliably out of europe-west1 as a scheduled job.

I'm facing the same issue (local run on a Windows 10 laptop), so I tried commenting out the line that blocks the GeeTest API call, as suggested above (https://github.com/flathunters/flathunter/blob/main/flathunter/chrome_wrapper.py#L46). Flathunter still reports "Unable to find IS24 variable in window" and "IS24 bot detection has identified our script as a bot - we've been blocked". In the browser, the "Gleich geht's weiter" page quickly redirects to the "Ich bin kein Roboter" page, at first without a captcha; the captcha then appears after a second or so. With the line left in place, the captcha does not appear. So there is indeed some relation, but unfortunately the script can't get past the check either way.

Hi, I just read here about this problem: I wrote my own script with headless chrome and a php-wrapper for immoscout. I do not crawl the html-version, but the json-url they use for the map. It looks like this: https://www.immobilienscout24.de/Suche/controller/mapResults.go?searchUrl=/Suche/radius/wohnung-mieten?

My crawler gets blocked initially and then periodically after about 20 minutes. The blocker page from above without the captcha shows up then, the captcha is only displayed on the web-version.

I think they do some kind of browser-fingerprinting with the script they load from https://www.immobilienscout24.de/assets/immo-1-17 (I think an antibot-script from distil network?)

However, you can simply open the json-page in a new incognito window and reload it without solving any captcha, and you will get through. So my workaround right now is very silly: I copy the value of the cookie "reese84" from the incognito window to my script, then it runs again for about 20 minutes. I think immoscout just does some kind of whitelisting for your browser with the distil-script and sets a fresh reese84 cookie when the script does not detect you as a bot. And: solving a captcha on the web-version does not help in this case; you still get blocked on the json-version, and vice versa. (Test case: if you open the web-version with headless chrome (in non-headless mode) and pass the captcha, the data for the map from the json-url does not load.)

Anyway, your script works differently I suppose but maybe this info is helpful (or old for you, then sorry for the interruption) ...

Maybe this is a big misconception on my side, but the idea is: you check on your side whether a fresh cookie in your script solves the problem, and if so (probably not, because there might be some additional IP-range blocking), we could search for a service that automates this (a send-url-and-return-fresh-cookie API). I did not find such a service on 2captcha or imagetyperz.
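To try the fresh-cookie idea outside a browser, the copied value can be attached to a plain HTTP request. This is a standard-library sketch with a hypothetical helper `build_request`; the query string of the map URL is left out, and sending the same User-Agent as the browser the cookie came from is only a guess at what the server might cross-check.

```python
import urllib.request


def build_request(url: str, reese84: str, user_agent: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying the reese84 cookie."""
    request = urllib.request.Request(url)
    request.add_header("Cookie", f"reese84={reese84}")
    # The cookie was issued to a particular browser; reusing its User-Agent
    # is an assumption, not a documented requirement.
    request.add_header("User-Agent", user_agent)
    return request


if __name__ == "__main__":
    req = build_request(
        "https://www.immobilienscout24.de/Suche/controller/mapResults.go",
        reese84="PASTE_VALUE_FROM_INCOGNITO_WINDOW",
        user_agent="PASTE_USER_AGENT_OF_THAT_BROWSER",
    )
    print(req.get_method(), req.full_url)
    print(req.header_items())
    # To actually send it: urllib.request.urlopen(req, timeout=30)
```

If the response is the blocker page even with a fresh cookie, that would point to additional checks such as the IP-range blocking mentioned above.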

Screenshot-cookietoken

Hi @trendschau ,

Thanks for the detailed investigation and information. Is your code up on Github anywhere?

I'm not sure if what you describe relates to the problem that our users encounter or not. Right now, for many users, the captcha solving works "just fine" - I have an instance running on Google Cloud that has been scraping ImmoScout for years without problems using the Flathunter code. I have also noticed the reese84 cookie and I do think it is significant - here is a match on another project that seems to have dug a bit deeper into the problem: Jackiebibili/ticket_tracker_api@272c539 . Maybe @Jackiebibili has some clues for us. I also mentioned in #210 that I think this is related to Imperva bot protection, but I don't have good evidence for that.

It seems like ticket_tracker_api solved this with JS injection - that might be something we could try or investigate.

@codders totally agree, I don't know if it is related to the problem described in this ticket, but you can easily test it by adding a valid reese84 cookie to headless chrome. Since flathunter works fine for all other users, the reason for the blocking page might be totally different, but maybe the solution is similar.

My code is probably not of interest (very basic), but I cleaned it from all captcha-solving parts (not needed anymore) and pushed it to github. I never planned to publish it, so I am sorry for the spaghetti ... I think a super simplistic workaround might be a browser extension in another window, that stores cookies periodically on the file system in combination with a page refresh extension (something like https://github.com/ktty1220/export-cookie-for-puppeteer but without manual action). But I have to stop coding and start searching for a flat now ...
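The "extension writes cookies to disk, script reads them" idea could look roughly like this on the reading side. The sketch assumes the export is a JSON array of objects with `name` and `value` keys; that format is an assumption and should be checked against whatever the extension really writes. `read_cookie` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def read_cookie(path: Path, name: str = "reese84") -> Optional[str]:
    """Return the value of cookie `name` from a JSON export, or None if absent.

    Assumes the file holds a JSON array of objects with at least
    "name" and "value" keys.
    """
    cookies = json.loads(path.read_text(encoding="utf-8"))
    for cookie in cookies:
        if cookie.get("name") == name:
            return cookie.get("value")
    return None


if __name__ == "__main__":
    # Write a made-up export file, then read the cookie back from it.
    with tempfile.TemporaryDirectory() as tmp:
        export = Path(tmp) / "cookies.json"
        export.write_text(
            json.dumps([{"name": "reese84", "value": "example-token"}]),
            encoding="utf-8",
        )
        print(read_cookie(export))
```

The crawler would call `read_cookie` before each run, so a periodically refreshed export file keeps it supplied with a recent value.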

@trendschau Thanks for the tip and for the code! Yes - any hints are welcome to resolve this, and I'll be happy to try this (or even happier if someone else on the thread wants to make a PR). If it fixes the issue for the users that are struggling, it would be an amazing find.

Best of luck with your search!

@codders just to finish this: I found a way to automate the process with two browser extensions. Very dirty, but it seems to work for now, so immobilienscout has some open data there :D Btw the archive part of their website is completely unprotected as well, although that's not very helpful for flat searchers. Pushed the code in case it is of interest. Good luck to you all, too!

I solved this problem by injecting my cookie into the header of the GET request in abstract_crawler.py. It seems that if you have a valid cookie from one of your logged-in IS24 sessions, you can bypass the robot check. Btw I have a premium account, so this might be specific to paid users.

I see that @trendschau already pointed out a similar solution

So I’ve been playing with this as well, and I noticed that when I got detected as a bot (with no captcha showing, as above) I can log into IS24 in the running Chrome session with my user account (plus), and then the subsequent reloads work fine. Don't know yet for how long. Will report back.

@yanone did you try to use the set cookie feature?

Yes, I did, and it wouldn’t work, still blocked.
I figure what I did is probably technically identical to setting the cookie in the header, but apparently also not. Let's see for how long this runs, but I don't see a problem with logging into a Selenium session as long as they let me. I remember from another project that Safari won't let you touch it or else it breaks instantly, but with Chrome you can freely interact with the browser, which is nice.

The cookie probably didn't work because the Selenium app is a separate process: the Chrome that I opened manually and got the cookie from and the Chrome that the bot opens are two different instances.
So one probably needs to make sure that the cookie is truly coming from the same instance. And arguably I find logging in easier than extracting the cookie and then – now we’re getting closer – restarting the bot, which restarts the browser instance, too. This all needs to happen in the same process, is my guess.
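If the cookie needs to be set inside the bot's own browser instance, Selenium's `add_cookie` is the usual way to do that. The sketch below uses only `get`, `add_cookie` and `refresh` from the Selenium WebDriver API; `inject_cookie` and `FakeDriver` are hypothetical names, and the fake driver is there only so the example runs without a browser. Whether a cookie set this way passes the bot check is unconfirmed, and several commenters here report that it did not.

```python
def inject_cookie(driver, value: str, name: str = "reese84",
                  site: str = "https://www.immobilienscout24.de/") -> None:
    """Set a cookie inside the browser instance that `driver` controls.

    Selenium only accepts a cookie for the domain of the currently loaded
    page, so the site is opened first and reloaded afterwards.
    """
    driver.get(site)
    driver.add_cookie({"name": name, "value": value, "path": "/"})
    driver.refresh()


class FakeDriver:
    """Records calls so the sketch can run without Chrome."""

    def __init__(self):
        self.log = []

    def get(self, url):
        self.log.append(("get", url))

    def add_cookie(self, cookie):
        self.log.append(("add_cookie", cookie))

    def refresh(self):
        self.log.append(("refresh",))


if __name__ == "__main__":
    fake = FakeDriver()
    inject_cookie(fake, "PASTE_COOKIE_VALUE")
    for entry in fake.log:
        print(entry)
```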

Are you sure you are copying the correct cookie in the correct format? If it seems harder than logging in manually, there is probably something wrong :D

Update: It ran for about an hour, and now they've logged me out and are showing me the captcha page without a captcha again.

an hour is not so bad :D

Hello Hello wonderful people! I get almost the same error. My coding skills are intermediate/low, so I tried to play a little with the settings.
Is there anything new to fix this? My current workaround is to restart the bot manually, but at that point it's the same as refreshing the web page manually.

[2023/07/14 09:49:21|config.py |INFO ]: Using config path C:\Users\asus\flathunter/config.yaml
[2023/07/14 09:49:21|chrome_wrapper.py |INFO ]: Initializing Chrome WebDriver for crawler...
[2023/07/14 09:49:21|init.py |WARNING ]: could not detect version_main.therefore, we are assuming it is chrome 108 or higher
[2023/07/14 09:49:21|init.py |INFO ]: setting properties for headless
[2023/07/14 09:49:33|abstract_crawler.py |INFO ]: Timeout waiting for iframe element - no captcha verification necessary?

I'm seeing that the headed chromium browser isn't setting the same reese84 cookie that I have in my config file. Anyone else able to see that it is being set correctly?

I am also observing the same issue as reported in the ticket when I start the script. I have tried also the reese84 cookie approach but still it gets detected from the beginning.

Does anyone know how to resolve this issue? I've tried everything discussed here, different reese84 cookies values, etc...

Same for me. Would love an update! Getting blocked right out of the gate, even with my normal browser's reese84 cookie...

ewamal commented

Same here, at first it lasted at least a day, now I am getting blocked basically right away

To the commenters who are struggling: it would be great if you could leave some info about your setup - what OS, docker or direct, chromedriver arguments etc.
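A small helper can gather most of that in one go for pasting into a comment. This is a standard-library sketch with a hypothetical function name; the Docker check is a common heuristic (the presence of `/.dockerenv`), not a guarantee, and the driver arguments have to be copied in from your own config.yaml.

```python
import os
import platform
import sys


def describe_setup(driver_arguments):
    """Collect the details asked for above into a dict."""
    return {
        "os": f"{platform.system()} {platform.release()}",
        "python": sys.version.split()[0],
        # /.dockerenv exists inside most Docker containers; a heuristic only.
        "docker": os.path.exists("/.dockerenv"),
        "driver_arguments": list(driver_arguments),
    }


if __name__ == "__main__":
    # Replace with the driver_arguments from your own config.yaml.
    report = describe_setup(["--no-sandbox", "--headless"])
    for key, value in report.items():
        print(f"{key}: {value}")
```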

+1 my config running on windows docker desktop with reese84 cookie variable set:

captcha:
  2captcha:
    api_key: xxxxxxxxxxxxxxxx
  driver_arguments:
    - --no-sandbox
    - --headless
    - --disable-gpu
    - --remote-debugging-port=9222
    - --disable-dev-shm-usage
    - window-size=1024,768

2023-10-31 09:00:21 [2023/10/31 08:00:21|abstract_crawler.py |INFO ]: Timeout waiting for iframe element - no captcha verification necessary?
2023-10-31 09:00:21 [2023/10/31 08:00:21|immobilienscout.py |WARNING ]: Unable to find IS24 variable in window
2023-10-31 09:00:21 [2023/10/31 08:00:21|immobilienscout.py |ERROR ]: IS24 bot detection has identified our script as a bot - we've been blocked
2023-10-31 09:10:33 [2023/10/31 08:10:33|abstract_crawler.py |INFO ]: Timeout waiting for iframe element - no captcha verification necessary?
2023-10-31 09:10:33 [2023/10/31 08:10:33|immobilienscout.py |WARNING ]: Unable to find IS24 variable in window
2023-10-31 09:10:33 [2023/10/31 08:10:33|immobilienscout.py |ERROR ]: IS24 bot detection has identified our script as a bot - we've been blocked

same in WSL (Windows Subsystem for Linux) environment:
nuc12_ubuntu_sub@JSNUC12WSHi3:/opt/flathunter$ sudo -u flathunter /home/flathunter/.local/bin/pipenv run python flathunt.py
[2023/10/31 11:25:15|config.py |INFO ]: Using config path /opt/flathunter/config.yaml
[2023/10/31 11:25:15|chrome_wrapper.py |INFO ]: Initializing Chrome WebDriver for crawler...
[2023/10/31 11:25:16|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/undetected_chromedriver
[2023/10/31 11:25:17|init.py |INFO ]: setting properties for headless
[2023/10/31 11:25:27|abstract_crawler.py |INFO ]: Timeout waiting for iframe element - no captcha verification necessary?
[2023/10/31 11:25:27|immobilienscout.py |WARNING ]: Unable to find IS24 variable in window
[2023/10/31 11:25:27|immobilienscout.py |ERROR ]: IS24 bot detection has identified our script as a bot - we've been blocked

Not a flathunter issue, but I'm working on a project that has the same issues. I noticed that with headless chrome via puppeteer, the browser gets locked out without showing a captcha.

With headed chrome, I was able to bypass bot detection using the paid capsolver.com API (https://www.capsolver.com/blog/The-other-captcha/bypass-imperva-nodejs) and 2captcha for the geetest captcha. Guess I'll just keep running in headed mode for now, although it's probably a bit more resource hungry.

I deciphered the /assets/immo-1-17 script but couldn't figure out what exactly is going on yet. Since it's the only script being loaded in the headless lock-out case, the answer has to be in there.

fmmix commented

For me it never worked once with any kind of driver arguments or reese values

Maybe #514 will help some of you...