orangecoding/fredy

Immoscout support

carstenhag opened this issue · 17 comments

Supporting Immoscout would be great :). Tried out a bit and so far it looks good!

I realize that in the readme you say it's not supported yet, but that can be changed eventually... perhaps... maybe, right? :D

@carstenhag I'd love to support it. In fact, I supported Immoscout for a long time, but for a couple of months now they've put a lot of effort into blocking crawlers and bots like Fredy.

They're using a pretty effective way of determining whether or not you're a bot. If the algorithm decides you're a bot, you need to solve a captcha.

Here's what I found is happening:

  1. Immoscout is using reCAPTCHA to compute a score that indicates whether you're a bot
  2. If this score is too high, an additional check is applied (a localStorage value is being set)
  3. If this check fails and the score is too high, any further request is blocked until you solve a captcha
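
A rough way to picture the three steps above is the toy model below. It only mirrors the description in this comment (a higher score meaning "more bot-like"); the function name, the field names, and the threshold value are hypothetical and are not taken from Immoscout or reCAPTCHA.

```javascript
// Toy model of the described flow. All names and the threshold are assumptions.
const SCORE_THRESHOLD = 0.5; // hypothetical cut-off, not a real Immoscout value

function classifyVisitor({ score, localStorageCheckPassed }) {
  // Step 1: a score at or below the threshold is treated as a normal visitor.
  if (score <= SCORE_THRESHOLD) {
    return "allowed";
  }
  // Step 2: a suspicious score triggers the additional localStorage check.
  if (localStorageCheckPassed) {
    return "allowed";
  }
  // Step 3: suspicious score and failed check, so a captcha must be solved first.
  return "captcha-required";
}

console.log(classifyVisitor({ score: 0.9, localStorageCheckPassed: false }));
// prints "captcha-required"
```

A request-only crawler cannot set the localStorage value, so in this model it ends up in step 3 whenever its score is above the threshold.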

This, however, is not solvable at the moment. Sure, I could trick the localStorage check, but tricking reCAPTCHA would be a whole different level.

I'm working on this with various open-source devs, but if you have an idea, you're more than welcome to contribute ;)

I see, thanks for the extensive answer :). I know some solutions to reCaptcha from other tools (specifically JDownloader2):

  • provide a browser extension or app which are linked to the server. This app/browser extension opens the reCaptcha thing and you can perform the challenge there.
  • integrate captcha solving websites (see https://prowebscraper.com/blog/top-10-captcha-solving-services-compared/). This is probably the best solution for fredy I would guess. I think it costs less than a euro for hundreds of captcha solves, and people would probably pay for it (my sister would, I'm setting up a crawler for her).

provide a browser extension or app which are linked to the server. This app/browser extension opens the reCaptcha thing and you can perform the challenge there.

This wouldn't really make sense to me in an app like Fredy. The purpose is to run every x minutes and crawl on its own rather than requiring human interaction.

integrate captcha solving websites

Yes, I'm looking into something like this. The problem is that in order to solve reCAPTCHA, I'd need to implement the crawler core differently. Currently, my crawler is extremely lightweight: it's purely request based, not even a headless browser. With a captcha solver like the ones you mentioned, I'd need to use something like Puppeteer, which would introduce different problems (for instance when you just want to run it on a Linux server).
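
To illustrate what "request based" means here, the sketch below does one plain HTTP GET and pulls IDs out of the returned HTML, with no browser and no JavaScript execution. It is not Fredy's actual code: the function names and the `data-id` markup are hypothetical, and it assumes Node 18+ for the global `fetch`.

```javascript
// Hypothetical markup: each listing is assumed to carry a numeric data-id attribute.
function extractListingIds(html) {
  return [...html.matchAll(/data-id="(\d+)"/g)].map((match) => match[1]);
}

// Request-only crawl: a single HTTP GET, then string parsing.
// Nothing on the page is executed, so script-based checks never run.
async function crawl(searchUrl) {
  const response = await fetch(searchUrl); // global fetch, Node 18+
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return extractListingIds(await response.text());
}

// Parsing step in isolation, on a made-up HTML snippet:
const sampleHtml = '<li data-id="123"></li><li data-id="456"></li>';
console.log(extractListingIds(sampleHtml));
```

Because the page's scripts are never executed, a crawler like this has no way to run a captcha widget, which is why a solver service would force a switch to a real (headless) browser.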

Currently experimenting with cached versions of Immoscout. Maybe this could be a solution:

http://webcache.googleusercontent.com/search?q=cache:immobilienscout24.de
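
For experimenting with this idea, a small helper can build such a cache URL from a target address. This is only a sketch that follows the URL pattern shown above; the helper name is made up, and whether the cache service actually returns a usable, recent copy is not guaranteed.

```javascript
// URL prefix copied from the link above.
const CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:";

// Hypothetical helper: builds the cache lookup URL for a given page.
function buildCacheUrl(targetUrl) {
  // Encode the target so characters such as "?" or "&" do not break the query string.
  return CACHE_PREFIX + encodeURIComponent(targetUrl);
}

console.log(buildCacheUrl("immobilienscout24.de"));
// prints "http://webcache.googleusercontent.com/search?q=cache:immobilienscout24.de"
```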

Funnily it also works via archive.today, I just recorded https://archive.ph/Gw2qz for example

Funnily it also works via archive.today, I just recorded https://archive.ph/Gw2qz for example

Yes, unfortunately those snapshots are unreliable and most likely pretty old. The one you posted, for instance, is from yesterday. However, this is an interesting thing to look at. Two questions arise at this point:

  1. How can they scrape the whole thing without running into captcha hell?
  2. Is there a way to make it work for our purpose (by having a more up-to-date version)?

The purpose of archive.today is to create a snapshot of a website at a specific point in time. Therefore, the snapshot that carstenhag created will stay as it was yesterday.

I would suggest pointing that service at a server of our own to find out what their request looks like on the server side. Unfortunately, I was not able to find any source code for the service, so we cannot just copy their strategy.

Theoretically, it would also be possible to use archive.today as a proxy, but I do not think that they would like such a use of their service.

Yeah, agree, we can't use archive.today of course as they would be pretty annoyed by us. Just found it interesting that it works for them. I guess they have a browser running, because they also run js.

@carstenhag most likely some headless approach like https://pptr.dev/

If you find a headless browser that works for circumventing the captcha, we could provide it optionally as an additional Docker container. However, my attempts using the most recent version of Chrome + Selenium have failed so far, even when I turned off everything that indicates being in headless mode.

@saschnet As I mentioned, they're using reCAPTCHA by Google. I know a few of the guys who build it and I know a little bit about the internals, so I know reCAPTCHA works by testing various things. In the end, they build a score. If this score is too high, you're considered a bot. The score calculation changes every once in a while to make it harder for people to fight against, so I'm a bit hopeless tbh.

Interestingly enough, however, when I add the search URL here and wait for it to take a screenshot, it works 0o
https://web-capture.net/

Got progress... Seems like I can bring back the support sooner or later. It needs lots of polishing and checks on whether this approach will keep working in the future, but so far so good.


Ok, I now have a reproducible (but very experimental) way to support immoscout again. Will push the changes soonish.


@carstenhag @saschnet I've created a PR to bring back the Immoscout support and I would very much appreciate it if you could take a look at it.
#21

I briefly looked over it, but as I'm not that experienced with JS I didn't comment. Thanks for adding the support for Immoscout! :)