Inconsistent results when scraping the same timerange, place, keywords multiple times
annika-stechemesser opened this issue · 11 comments
Hello,
I used gtrendsR to scrape for two search words with an "or" connection (covid+corona) in one place (geocode US-CT-533) for the time range 2020-04-01 to 2021-07-01. Here is my line of code:
local_trends <- gtrends(keyword = "covid+corona", geo = local_geo, time = "2020-04-01 2021-07-01")$interest_over_time
I ran it multiple times and noticed that I got different results every time (see plot below). How is this possible, given that none of the parameters changed and the time range is in the past? Also, none of the versions I got with gtrends exactly matches the data I see in the browser when I enter the same inputs in the search.
Can you explain what is going on here and advise me how to get the correct data?
Thanks very much!
I think we have seen this before and it is explained as 'well they reserve the right to answer that way' as what we hit is not a fully defined API :-/ Maybe Google subsamples, and you found a query that shows that? Edit: Never mind!
But I better let @PMassicotte chime in...
That is strange, I cannot reproduce the problem on my side.
library(gtrendsR)
library(ggplot2)

# Run the same query six times and label each run
l <- list()
v <- 1:6
for (i in v) {
  df <- gtrends(keyword = "covid+corona", geo = "US", time = "2020-04-01 2021-07-01")$interest_over_time
  df$run <- paste("Run#", i)
  l[[i]] <- df
}

# Stack the runs and plot them on top of each other
df <- do.call(rbind, l)
ggplot(df, aes(x = date, y = hits, color = run)) +
  geom_line()
Created on 2022-08-29 with reprex v2.0.2
New try using your exact same GEO code:
library(gtrendsR)
library(ggplot2)

# Same loop as before, now with the sub-regional geo code from the report
l <- list()
v <- 1:6
for (i in v) {
  df <- gtrends(keyword = "covid+corona", geo = "US-CT-533", time = "2020-04-01 2021-07-01")$interest_over_time
  df$run <- paste("Run#", i)
  l[[i]] <- df
}

# Stack the runs and plot them on top of each other
df <- do.call(rbind, l)
ggplot(df, aes(x = date, y = hits, color = run)) +
  geom_line()
@annika-stechemesser Can you try my code and see if you have the same results?
If I run your code and loop through multiple scrapes without waiting, they match up. However, my graph looks slightly different from yours. I ran my various scrapes with a larger time delay between them, maybe that's it? I will try to run them spread out over a few hours and see what I get. Thanks a lot for the help!
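In case it helps, here is a minimal sketch of how runs could be spread out over time and tagged with the moment they were fetched. `collect_runs` and its arguments are hypothetical names, not part of gtrendsR; `fetch` is assumed to be any zero-argument function that returns a data frame, for example a wrapper around the `gtrends()` call above.

```r
# Hypothetical helper (not part of gtrendsR): repeat a query with a pause
# between runs and record when each run was fetched.
collect_runs <- function(fetch, n_runs = 6, wait_seconds = 3600) {
  runs <- vector("list", n_runs)
  for (i in seq_len(n_runs)) {
    df <- fetch()
    df$run <- paste("Run#", i)
    df$fetched_at <- Sys.time()  # lets you relate differences to elapsed time
    runs[[i]] <- df
    # No need to wait after the last run
    if (i < n_runs) Sys.sleep(wait_seconds)
  }
  do.call(rbind, runs)
}

# Usage with gtrendsR (needs network access, so it is left commented out):
# library(gtrendsR)
# df <- collect_runs(function() {
#   gtrends(keyword = "covid+corona", geo = "US-CT-533",
#           time = "2020-04-01 2021-07-01")$interest_over_time
# })
```

The one-hour default is only a guess; nothing in this thread establishes how long one has to wait before a different sample shows up.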
Google provides the following information: https://support.google.com/trends/answer/4365533?hl=en
According to Google, there are two types of samples one can access:
- “Real-time data is a sample covering the last seven days.”
- “Non-real-time data is a separate sample from real-time data and goes as far back as 2004 and up to 36 hours before your search.”
Appendix B of https://www.sciencedirect.com/science/article/abs/pii/S2452306221001210 may be an interesting read as well.
Also, the Medium article by Simon Rogers is telling: https://medium.com/google-news-lab/what-is-google-trends-data-and-what-does-it-mean-b48f07342ee8
Our hypothesis is that samples from the full Google Trends dataset are not retaken for each query. Instead, we suspect that the sample taken from the full dataset is held in an in-memory database somewhere on a Google Trends server instance, so that queries to Google Trends can be processed faster. If different IP addresses are routed to different instances, there might be different in-memory samples that give different results. Also, if instances are shut down or renewed, or the routing of traffic changes, the in-memory database may have to be resampled from the full Google Trends dataset.
We therefore assume that the result from Google Trends does not depend on the IP address per se. More precisely, we think it depends on the instance your query is routed to. This would also explain the inconsistencies in Google Trends data over time reported by Behnen, Kessler, Kruse, Schoenmakers, Zerr, and Gómez (2020), since in modern cloud services instances are scaled up and down dynamically, depending on traffic.
Thank you @JBleher, these comments have been really helpful. Running the code with a ~24h break gave different time series (see below). The same run in a non-delayed loop still gives the same data. I am not sure what to do with that statistically, but it does not seem to be a problem with gtrendsR. Thank you!
On a positive note, the time series you are querying seems to be calculated on enough search volume, so that the variation induced by different samples is rather small.
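To put a number on how small that variation is, one could summarise the spread across runs for each date. The following is only a sketch: `run_spread` is a hypothetical helper, and it assumes a stacked data frame with `date` and `hits` columns like the `df` built in the loops earlier in this thread. It also assumes that non-numeric entries in `hits` (such as "<1") can be treated as missing.

```r
# Hypothetical helper: per-date mean, standard deviation and range of hits
# across several runs of the same query.
run_spread <- function(df) {
  # Non-numeric hits (e.g. "<1") become NA; this is an assumption about
  # how you want to treat them.
  hits <- suppressWarnings(as.numeric(df$hits))
  groups <- split(hits, df$date)
  data.frame(
    date = names(groups),
    mean_hits = vapply(groups, mean, numeric(1), na.rm = TRUE),
    sd_hits = vapply(groups, sd, numeric(1), na.rm = TRUE),
    range_hits = vapply(groups, function(x) diff(range(x, na.rm = TRUE)), numeric(1)),
    row.names = NULL
  )
}

# Toy example with two runs and two dates
toy <- data.frame(
  date = rep(c("2020-04-05", "2020-04-12"), each = 2),
  hits = c(10, 14, 50, 50),
  run = rep(c("Run# 1", "Run# 2"), times = 2)
)
run_spread(toy)
```

A small `sd_hits` relative to `mean_hits` on most dates would be consistent with the observation that the sampling noise is minor for this query.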
Do you have any advice on how to force getting into a new batch? I tried changing IP address and deleting cookies manually in the browser but none of that worked so far. I would just like to see a bunch of variation for my request but am pretty unclear on how to get it... Does the cookie-URL parameter have anything to do with it? Thank you!
You may be able to use different servers in different locations. Lists of free proxy servers that you could use can be found on the internet. Or you could use the Tor network. However, be aware that some servers may be used by other people to circumvent rate limits. So you will still need to slow down the requests and add some tryCatch logic to handle potentially empty data sets...
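The slow-down and error handling mentioned above could look roughly like this. It is a sketch under assumptions: `fetch_with_retry` is a hypothetical helper, `fetch` stands for any zero-argument function wrapping your `gtrends()` call, and the retry count and waiting time are arbitrary defaults rather than known rate limits.

```r
# Hypothetical helper (not part of gtrendsR): call `fetch` until it returns
# a non-empty data frame, pausing between attempts.
fetch_with_retry <- function(fetch, max_tries = 3, wait_seconds = 60) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      fetch(),
      error = function(e) {
        message("Attempt ", attempt, " failed: ", conditionMessage(e))
        NULL
      }
    )
    # Treat errors, NULL and zero-row results as "empty" and try again
    if (is.data.frame(result) && nrow(result) > 0) {
      return(result)
    }
    if (attempt < max_tries) Sys.sleep(wait_seconds)
  }
  NULL  # every attempt failed or came back empty
}

# Usage with gtrendsR (needs network access, so it is left commented out):
# library(gtrendsR)
# df <- fetch_with_retry(function() {
#   gtrends(keyword = "covid+corona", geo = "US-CT-533",
#           time = "2020-04-01 2021-07-01")$interest_over_time
# })
```

Returning `NULL` after the last attempt leaves it to the caller to decide whether to skip that run or stop altogether.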