PaulMcInnis/JobFunnel

[DISCUSSION] Captcha

PaulMcInnis opened this issue · 8 comments

Hey everyone,

It seems that Indeed and others have caught on to scraping and have taken action to stop it.

We can integrate web-driven scraping, but this is not easily automated or tested.

I think this may be a serious problem for this tool in general. The regexes we have built still work, but captchas are catching the scrapers very easily, after fewer than a hundred jobs or so.

Does anyone have any ideas to help with this issue?

One option is to go the route of a web-driven scraper. Perhaps this tool could be made into some kind of browser extension?

Another option is to forgo scraping detailed job information entirely, but this will significantly degrade the matching and data quality.

Nllii commented

I tried using this code from geohot a couple of years back, but I never got it to work. It's not practical code, just a doodle.

https://github.com/geohot/lolrecaptcha

https://www.blackhat.com/docs/asia-16/materials/asia-16-Sivakorn-Im-Not-a-Human-Breaking-the-Google-reCAPTCHA-wp.pdf

Well, one aspect of this is that I don't want to automate the captcha dodging, since I think that is ethically dubious, but I think we may have other options for the workflow.

One data point that I'm having a bit of trouble collecting is how many jobs, on average, one can scrape before getting captcha'd (None error on detail scrape).
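For anyone who wants to report that number, here is a minimal sketch of how it could be counted. It assumes the detail scrape hands back `None` once the captcha kicks in, as described above. `jobs_before_captcha` and `scrape_detail` are hypothetical names for illustration, not part of JobFunnel's actual API.

```python
from typing import Callable, Iterable, Optional


def jobs_before_captcha(
    urls: Iterable[str],
    scrape_detail: Callable[[str], Optional[str]],
) -> int:
    """Count successful detail scrapes before the first None result.

    `scrape_detail` is a stand-in (assumption) for whatever function
    fetches a job's detail page and returns None when it is blocked.
    """
    scraped = 0
    for url in urls:
        if scrape_detail(url) is None:
            # First blocked detail scrape: stop counting here.
            break
        scraped += 1
    return scraped
```

Averaging the returned count over a few runs would give a rough idea of where the limit sits.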

Maybe it could pick from a list of proxies? That would probably get rid of the captcha altogether.
Edit: Also I'd like to add that at 200 jobs exactly, I got the captcha treatment.

Yeah, I get dinged pretty quick nowadays, I figure I'm on their $hit list 😆

Not a bad idea around the proxies, that would be an interesting feature, I'll create a little feature-stub for this.

Nllii commented

For proxies I have used https://github.com/TheSpeedX/PROXY-List, mainly to get around mega.io limiting uploads and downloads.
MBD (https://github.com/tonikelope/megabasterd.git) has a feature where, once it gets throttled, it switches to the next proxy in the list. I haven't used proxies with JobFunnel yet. Can't wait to try it out if I get blocked.
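A rough sketch of that rotate-on-throttle idea, in case it helps the feature stub. This is not how MBD or JobFunnel implements it; `ProxyRotator` and `fetch_via_proxy` are hypothetical names, and it assumes a blocked request shows up as a `None` result.

```python
from typing import Callable, Optional, Sequence


class ProxyRotator:
    """Keep using one proxy until it is blocked, then move to the next."""

    def __init__(self, proxies: Sequence[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._index = 0

    @property
    def current(self) -> str:
        return self._proxies[self._index]

    def fetch(
        self,
        url: str,
        fetch_via_proxy: Callable[[str, str], Optional[str]],
    ) -> Optional[str]:
        """Try each proxy at most once, starting from the current one.

        `fetch_via_proxy(url, proxy)` is an assumed callable that returns
        the page text, or None when the request was blocked or throttled.
        """
        for _ in range(len(self._proxies)):
            result = fetch_via_proxy(url, self.current)
            if result is not None:
                return result
            # Current proxy looks blocked: advance to the next in the list.
            self._index = (self._index + 1) % len(self._proxies)
        return None  # every proxy in the list was blocked
```

With the `requests` library, `fetch_via_proxy` could wrap something like `requests.get(url, proxies={"http": proxy, "https": proxy})` and return `None` when the response turns out to be a captcha page. How to detect that page is left open here.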

P.S. YouTube still gives me a captcha, but only once a week now. It had been every 4 hours since December 2020. I think they are outsourcing machine learning labels to me.

Based on this discussion, we will move forward with #145.