tasos-py/Search-Engines-Scraper

Results appear different than the browser

stuartathompson opened this issue · 5 comments

If I search for something on DuckDuckGo in an Incognito browser window, I get one list of results. Running the same search through this scraper gives a different set of results. Is there an explanation for this?

Thanks!

This has to do with headless browsing and DuckDuckGo.

For some odd reason, DuckDuckGo serves different results to a headless browser than to a visible one.

I had to do my own scraper with Puppeteer to get the "correct" results. Maybe consider a flag for using headless: false?

If it works with Puppeteer, it's probably related to JS. Puppeteer drives a full browser, while this repo uses a plain HTTP client. Running a full browser is heavy on resources, so I decided to use Python's requests lib instead. You could try setting a user-agent with .set_headers(), but I doubt it will help.
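For anyone who wants to see what "setting a user-agent" means for a plain HTTP client, here is a minimal sketch using only Python's standard library rather than this repo's API. The function name build_request and the example user-agent string are hypothetical, and passing the query as a q GET parameter to the no-JS endpoint is an assumption. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the no-JS HTML endpoint accepts the query as a "q" GET parameter.
NO_JS_ENDPOINT = "https://html.duckduckgo.com/html/"


def build_request(query, user_agent):
    """Build (but do not send) a GET request carrying a custom User-Agent."""
    url = NO_JS_ENDPOINT + "?" + urlencode({"q": query})
    return Request(url, headers={"User-Agent": user_agent})


# Hypothetical browser-like user-agent string, for illustration only.
req = build_request("test", "Mozilla/5.0 (X11; Linux x86_64)")

# urllib stores header names capitalized as "User-agent".
print(req.full_url)
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen() would actually send it. As noted above, a different user-agent alone may well not change the results.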

The issue is that DuckDuckGo serves phony results to headless browsers. You have to use a visible browser to get correct results or find another way to do it. I ended up writing my own scraper for DDG.

I think a headless/non-headless mode would be really valuable for this library.

So, the difference you noticed is because we're using the no-JS version of DuckDuckGo (https://html.duckduckgo.com/html/), while the regular results are fetched from a JS file (https://links.duckduckgo.com/d.js?q=test&t=D&l=us-en&s=0&a=h_&dl=en&ct=GR&ss_mkt=us&vqd=3-271302360671697817199458226164694755283-142931334088488469610097276646263969243&p_ent=&ex=-1&sp=0). Of course, we can get this file without an emulator and parse the results out of the JS code; the only problem is the vqd= parameter, but I'll see if I can reverse engineer it.
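To make the d.js URL above easier to read, here is a small sketch that splits it into its query parameters with Python's standard library. It only takes the URL apart; it does not fetch anything, and it makes no claim about what each parameter means or how vqd is generated. The helper name query_params is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# The d.js URL quoted in the comment above, copied verbatim.
DJS_URL = (
    "https://links.duckduckgo.com/d.js?q=test&t=D&l=us-en&s=0&a=h_&dl=en"
    "&ct=GR&ss_mkt=us"
    "&vqd=3-271302360671697817199458226164694755283-"
    "142931334088488469610097276646263969243"
    "&p_ent=&ex=-1&sp=0"
)


def query_params(url):
    """Return the URL's query parameters as a flat dict, keeping empty values."""
    parsed = parse_qs(urlsplit(url).query, keep_blank_values=True)
    # Each parameter appears once here, so keep only the first value.
    return {name: values[0] for name, values in parsed.items()}


params = query_params(DJS_URL)
for name, value in params.items():
    print(f"{name} = {value!r}")
```

The q value is the search term itself, while vqd is the long token that would have to be obtained or reverse engineered before a plain HTTP client could request this file.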

I changed DuckDuckGo to use the JS version, and now the results should be similar to those we get from a browser. I'll keep this issue open for a while, in case there are any bugs.