GeneralMills/pytrends

Identical queries return different data

arosenbe opened this issue · 7 comments

Hi there and thanks for the awesome package.

I've been using pytrends to download time series from Google Trends for each state/metro-area in the US. I noticed that the results of some of my queries were changing considerably across successive runs. The data on the Google Trends website doesn't exhibit this sort of behavior. I'm wondering if this is the result of a bug or just user error (in which case I'd be grateful for some advice).

Here's a (usually) reproducible example

import time

import numpy
from pytrends.request import TrendReq

google_username = '********'
google_password = '********'

def get_df(google_username, google_password):
    pytrend = TrendReq(google_username, google_password)
    pytrend.build_payload(kw_list=['bagel'],
                          geo='US',
                          timeframe='2004-01-01 2010-01-01')
    df = pytrend.interest_over_time()
    return df

df1 = get_df(google_username, google_password)
time.sleep(600)
df2 = get_df(google_username, google_password)

print(df1.equals(df2))  # False
print(numpy.corrcoef(df1['bagel'], df2['bagel']))  # Not all 1, can be quite low

I've run this a few times, so I don't think it's related to Google changing over to a new random sample. My understanding is that Google only makes this change once per day (line 197 here). However, anecdotally, I seem to experience the largest discrepancies for less-searched terms and smaller geographic areas: exactly where I would expect sampling error/random noise to wreak the most havoc.
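If it really is sampling noise, one rough way to gauge (and partially smooth out) the run-to-run variation is to pull the same payload several times and average the runs. A minimal sketch along the lines of the example above; get_averaged_df is just an illustrative name, and it assumes the same username/password TrendReq constructor used earlier:

import time

import pandas
from pytrends.request import TrendReq

def get_averaged_df(google_username, google_password, n_runs=5, pause=60):
    # Pull the same payload n_runs times and average, to damp per-run noise.
    runs = []
    for _ in range(n_runs):
        pytrend = TrendReq(google_username, google_password)
        pytrend.build_payload(kw_list=['bagel'],
                              geo='US',
                              timeframe='2004-01-01 2010-01-01')
        runs.append(pytrend.interest_over_time()['bagel'])
        time.sleep(pause)  # space out requests
    combined = pandas.concat(runs, axis=1)
    print(combined.std(axis=1).describe())  # per-date spread across runs
    return combined.mean(axis=1)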

Let me know if I can provide more information, and thanks in advance!


P.S. I don't think I can provide an actually reproducible example because the results seem to be stochastic, and there's some positive probability that two samples from the same (discrete) distribution will yield the same results.

Hmm, I wonder if they're using a cookie to keep serving back the same results. I'm going to try accessing it from two different computers/Google accounts to see whether the results differ.

Hey @dreyco676, any results from the test above?

So I'm seeing the same thing. I don't think it's anything I can control for, as it's on Google's end. If you figure out a way to ensure consistency, let me know.

Thanks for the confirmation! I was worried that there wasn't going to be an easy fix on your end.

So my understanding is that the pytrends payload doesn't contain the same data as what's shown on the Google Trends site at runtime. If that's the case, do you have a sense of what data the payload does contain (e.g., old samples of Google Trends data or random noise)?

Huh? So there's a chance this library just returns random noise?

@LRonHubs It seems like the endpoint I'm hitting might not be 100% consistent for low-volume searches. It looks more like values being rounded up or down than noise being added to the data intentionally. I don't have any contact with the Google Trends team, so I don't know how or why it does this.
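To illustrate what rounding alone can do at low volume (a toy example with made-up counts, not pytrends' or Google's actual pipeline): when the raw counts are small, normalizing to an integer 0-100 scale turns tiny differences between pulls into large jumps.

import numpy

# Hypothetical raw weekly counts for a low-volume term from two pulls
# that differ by at most one search per week.
pull_a = numpy.array([3, 5, 4, 2, 6, 5, 3, 4])
pull_b = pull_a + numpy.array([0, 1, -1, 0, 0, 1, 0, -1])

# Google-Trends-style normalization: scale so the max is 100, round to integers.
scaled_a = numpy.round(100 * pull_a / pull_a.max())
scaled_b = numpy.round(100 * pull_b / pull_b.max())

print(scaled_a)  # a one-search difference becomes a jump of ~17 points
print(scaled_b)
print(numpy.corrcoef(scaled_a, scaled_b)[0, 1])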

@dreyco676 Thank you for mentioning low-volume searches. Do you know how to enable the "Include low search volume regions" functionality shown in the screenshot? As the previous posts have mentioned, pytrends tends to return different sets of cities across runs. Thanks a lot.

[Screenshot: Google Trends "Include low search volume regions" option]
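In case it helps: newer pytrends versions appear to expose an inc_low_vol flag on interest_by_region that seems to map to that checkbox. This is an assumption based on the current API rather than anything confirmed in this thread; a rough sketch:

from pytrends.request import TrendReq

# Newer pytrends API (no username/password); parameters here are assumptions.
pytrend = TrendReq(hl='en-US', tz=360)
pytrend.build_payload(kw_list=['bagel'], geo='US', timeframe='2004-01-01 2010-01-01')

# inc_low_vol is assumed to correspond to "Include low search volume regions".
by_city = pytrend.interest_by_region(resolution='CITY', inc_low_vol=True)
print(by_city.head())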