Results appear different than the browser

Question

stuartathompson opened this issue 3 years ago · 5 comments

If I search for something on DuckDuckGo in the browser in Incognito mode, I get one list of results. When I run the same search with this scraper, I get a different set of results. Is there an explanation for this?

Thanks!

Answer 1 · 2022-01-13T00:34:52.000Z

This has to do with headless browsing and DuckDuckGo.

For some odd reason, DuckDuckGo serves different results to the browser when you're in headless mode than when showing a real browser.

I had to do my own scraper with Puppeteer to get the "correct" results. Maybe consider a flag for using headless: false?

Answer 2 · 2022-01-14T06:12:45.000Z

If it works with Puppeteer, it's probably related to JS: Puppeteer drives a full browser that executes JavaScript, while this repo uses a plain HTTP client. Browser emulation is heavy on resources, so I decided to use Python's requests library instead. You could try setting a user agent with .set_headers(), but I doubt it will help.
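
For anyone who wants to try the user-agent idea outside this library, here is a minimal sketch using only the Python standard library. It builds the request without sending it. The endpoint is the no-JS URL mentioned later in this thread; passing the query as a `q` GET parameter, the user-agent string, and the helper name `build_search_request` are assumptions for illustration, not this repo's API.

```python
from urllib.parse import urlencode
from urllib.request import Request

# No-JS endpoint mentioned later in this thread.
NO_JS_ENDPOINT = "https://html.duckduckgo.com/html/"

# Illustrative browser-like user agent (an assumption, not a required value).
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_search_request(query, user_agent=USER_AGENT):
    """Build (but do not send) a GET request carrying a custom User-Agent."""
    # Assumption: the endpoint accepts the search term as a `q` query parameter.
    url = NO_JS_ENDPOINT + "?" + urlencode({"q": query})
    return Request(url, headers={"User-Agent": user_agent})


req = build_search_request("test")
print(req.full_url)
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

To actually fetch the page you would pass `req` to `urllib.request.urlopen`. With requests, the equivalent is the `headers=` argument of `requests.get`. Either way this only changes the header; it does not make the client execute JavaScript.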

Answer 3 · 2022-01-18T17:17:28.000Z

The issue is that DuckDuckGo serves phony results to headless browsers. You have to use a visible browser to get correct results or find another way to do it. I ended up writing my own scraper for DDG.

I think a headless/non-headless mode would be really valuable for this library.

Answer 4 · 2022-01-21T00:57:12.000Z

So, the difference you noticed is because we're using the no-JS version of DuckDuckGo (https://html.duckduckgo.com/html/), while the regular results are fetched from a JS file (https://links.duckduckgo.com/d.js?q=test&t=D&l=us-en&s=0&a=h_&dl=en&ct=GR&ss_mkt=us&vqd=3-271302360671697817199458226164694755283-142931334088488469610097276646263969243&p_ent=&ex=-1&sp=0). Of course, we can get this file without an emulator and parse the results out of the JS code; the only problem is the vqd= parameter, but I'll see if I can reverse engineer it.
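
To make the structure of that d.js URL easier to see, here is a small sketch that splits it into its query parameters with the standard library and rebuilds it with one value changed. The helper names `split_query` and `with_query` are hypothetical, and nothing here contacts DuckDuckGo. Whether a given vqd value stays valid for a different query is not something this thread establishes.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The example URL from the answer above, copied verbatim.
DJS_URL = (
    "https://links.duckduckgo.com/d.js?q=test&t=D&l=us-en&s=0&a=h_&dl=en"
    "&ct=GR&ss_mkt=us"
    "&vqd=3-271302360671697817199458226164694755283-142931334088488469610097276646263969243"
    "&p_ent=&ex=-1&sp=0"
)


def split_query(url):
    """Return the split URL and its query parameters as a dict."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as `p_ent=`.
    return parts, dict(parse_qsl(parts.query, keep_blank_values=True))


def with_query(url, **overrides):
    """Rebuild the URL with some query parameters replaced."""
    parts, params = split_query(url)
    params.update(overrides)
    return urlunsplit(parts._replace(query=urlencode(params)))


parts, params = split_query(DJS_URL)
print(parts.netloc)      # host serving the JS file
print(params["q"])       # the search term
print(params["vqd"])     # the token that still has to be obtained somehow

# Same URL with a different search term; the vqd value is carried over unchanged.
print(with_query(DJS_URL, q="python"))
```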

Answer 5 · 2022-02-01T10:40:52.000Z

I changed DuckDuckGo to use the JS version, and now the results should be similar to those we get from a browser. I'll keep this issue open for a while, in case there are any bugs.