Athlon1600/SerpScraper

Random results depending on proxy used

freddurst1805 opened this issue · 17 comments

Hello, and thank you for all the changes you made over the last few days.
I have a problem here. I use multiple proxies in combination with your library, but the results returned for the same keyword depend on which proxy makes the request, so they can vary a lot.

I tried to use the preference
$google->setPreference('google_domain', 'google.com');

but the result is changing anyway. Do you have any idea what causes this? How can it be solved?
Maybe we can communicate over Skype; this is my nick: freddurst1805.

Thank you :)

I'm not sure anyway how to use the ncr function

but the result is changing anyway.

Well, of course they do... Google personalizes search results based on your location. If you're connecting from an IP address based in California, Google will show you "California-related" search results. There is no way to turn this off at the moment.

I'm not sure anyway how to use the ncr function

If you don't want Google to redirect your search queries to a country-specific Google version, just do this:

$engine = new GoogleSearch();
$engine->ncr();

$engine->search("dragons");

Technically you only have to call it once per cookie session, but I would include it with each GoogleSearch instance just to be safe.

Also, you mentioned this in your last message:

I also needed to tweak your class GoogleSearch to integrate multi proxy management with specific port

what did you have to add? I might be able to include those features directly.

Here is my modified setProxy function, adapted to my proxy format (IP:PORT):

final public function setProxy($proxy, $new_profile = true){
    $this->default_options['curl'][CURLOPT_PROXY] = $proxy;

    // do we want to use a different cookie profile for this proxy?
    if($new_profile){
        $this->setProfileID($proxy);
    }
		
    $this->reloadClient();
}

It's very basic btw.

I tried to use ncr(), but now I get an error:
Fatal error: Uncaught exception 'GuzzleHttp\Exception\ConnectException' with message 'cURL error 6: Could not resolve host: cookies (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)' in \vendor\guzzlehttp\guzzle\src\Handler\CurlFactory.php on line 186

setProxy should be working now as it was already fixed:
dc6cb16

about ncr(), that will be fixed soon. I'm rewriting the whole script so it no longer depends on Guzzle and instead uses just pure cURL. Give it a day or two and I'll be done.

Super thank you :)
About the proxy: the problem is that my proxies are in IP:PORT format, not USERNAME:PASSWORD@IP:PORT.

the username:password portion is optional. If you just passed in 'ip:port', guzzle would understand that too.
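To illustrate the two proxy formats discussed above, here is a minimal sketch that accepts either 'ip:port' or 'username:password@ip:port'. The function name splitProxy and the example addresses are hypothetical and not part of SerpScraper or Guzzle; only PHP's built-in parse_url() is used.

```php
<?php
// Hypothetical helper (not part of SerpScraper): split a proxy string into
// an "ip:port" address and optional "user:pass" credentials.
function splitProxy(string $proxy): array
{
    // parse_url() needs a scheme to split the string reliably, so add one if missing.
    $url = strpos($proxy, '://') === false ? 'http://' . $proxy : $proxy;
    $parts = parse_url($url);

    if ($parts === false || empty($parts['host']) || empty($parts['port'])) {
        throw new InvalidArgumentException("Unrecognized proxy format: $proxy");
    }

    // Credentials are optional; report null when they are absent.
    $credentials = isset($parts['user'])
        ? $parts['user'] . ':' . ($parts['pass'] ?? '')
        : null;

    return [
        'address'     => $parts['host'] . ':' . $parts['port'],
        'credentials' => $credentials,
    ];
}
```

With plain cURL, the 'address' value would typically go into CURLOPT_PROXY and the 'credentials' value, when present, into CURLOPT_PROXYUSERPWD.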

Any news about your rewrite?
I'm about to put an application that uses your library into production, so let me know when you have some news :)
Do you need help with anything?

Thank you

yeah it's mostly finished. I'll have it on github in a couple hours.

ok, I think this should finally work:
9d32d24

Thank you.
Yes, it apparently works, but now it's the proxy that no longer works.
The proxy handling is no longer in the GoogleSearch class but in the Curl class, and the setProxy function does not connect the two classes.
I tried to fix it but it's still buggy. If you can fix it I will be really happy :)

I can't believe I forgot about it... here:
2b9ce97

The proxies do not seem to work anyway. I've been blocked by captcha.
I will try again tomorrow, let's keep in touch.

Have a good evening

OK, unfortunately the proxy usage doesn't seem to work anymore. I still get the captcha.
Maybe the setProxy function is not connected to cURL?

OK, my bad: I think I have a problem with my proxy provider. Thank you anyway :)

Hi there. It turns out it's not due to my proxies after all, so maybe something is wrong with the cURL request? I have a function that returns a proxy as IP:PORT, which I set for the request. But requests after the first 10 still keep returning the "captcha error".

function getGoogleContent($keyword){
        $google = new GoogleSearch();
        
        $proxy = $this->RotateProxy();
        $google->setProxy($proxy);
        
        $google->ncr();
        
        $google->setPreference('results_per_page', 100);
        $google->setPreference('google_domain', 'google.com');
        
        $results = array();

        $response = $google->search($keyword, 1);

        $results = ($response->error == false) ? $response->results : false;
        
        return $results;
}

Look, I really need something functional by December 20th, so I see two ways to do it:

  • Maybe we can work on this issue together?
  • Otherwise, can you tell me how to go back, with Git, to the first functional version that used Guzzle 5?

Thank you for your help
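The RotateProxy() method called in the snippet above is not shown in this thread. As a rough sketch of what such a rotation might look like, assuming a simple round-robin over a fixed list, the hypothetical ProxyRotator class below hands out proxies in order and wraps around at the end. The class name and the example addresses are made up for illustration.

```php
<?php
// Hypothetical class (not part of SerpScraper): round-robin proxy rotation.
final class ProxyRotator
{
    /** @var string[] proxies in "IP:PORT" format */
    private array $proxies;

    /** Index of the proxy that will be returned next. */
    private int $next = 0;

    public function __construct(array $proxies)
    {
        if (count($proxies) === 0) {
            throw new InvalidArgumentException('At least one proxy is required.');
        }
        // Re-index so the list is always 0..n-1, whatever keys were passed in.
        $this->proxies = array_values($proxies);
    }

    public function rotate(): string
    {
        $proxy = $this->proxies[$this->next];
        // Advance and wrap around to the start of the list.
        $this->next = ($this->next + 1) % count($this->proxies);
        return $proxy;
    }
}
```

Keeping one rotator instance alive across requests matters here: creating a new one per request would always return the first proxy, which would look exactly like a proxy setting that is being ignored.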

I looked at it again, and it seems that I forgot to rename the $new_profile_id variable to $new_profile, so you should have gotten an error. That part deals with cookie session storage for each proxy, so it's important. Try this now; it should work:

2f0ba85

Also, when you do testing, just run a Google search query for "what is my ip address" and then check the HTML contents of that page to see which IP address you're connecting from. Like this:

$response = $google->search('what is my ip address');
echo $response->html;

^ it should show:
"YOUR_PROXY_IP_HERE
Your public IP address"

Because everything seems to work for me at this point...
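The manual check described above could also be automated. The following is only a sketch, assuming an IPv4 proxy in IP:PORT format and that the page HTML contains the address as plain text; the function names findIpv4Addresses and pageShowsProxyIp are hypothetical and not part of the library.

```php
<?php
// Hypothetical helpers (not part of SerpScraper).

// Collect every syntactically valid IPv4 address that appears in the HTML.
function findIpv4Addresses(string $html): array
{
    preg_match_all('/\b(?:\d{1,3}\.){3}\d{1,3}\b/', $html, $matches);

    // The regex also accepts things like 999.1.1.1, so validate each candidate.
    $valid = array_filter($matches[0], function ($candidate) {
        return filter_var($candidate, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false;
    });

    return array_values(array_unique($valid));
}

// Check whether the page reports the proxy's IP as the connecting address.
function pageShowsProxyIp(string $html, string $proxy): bool
{
    // Assumes "IP:PORT"; keep only the IP part.
    $proxyIp = explode(':', $proxy)[0];
    return in_array($proxyIp, findIpv4Addresses($html), true);
}
```

In practice the $html argument would be the $response->html value from the snippet above, and a false result would suggest the request did not go through the proxy that was set.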

Thank you very much for all your efforts to help me. I'll try tomorrow, but I'm pretty sure this is it.
Thank you for the tip about returning my IP; that way I'll have a better overview.

Let me know if you need anything :)