ropensci-archive/rtweet

search_fullarchive returning duplicate Tweets

Closed this issue · 3 comments

Problem

When using search_fullarchive to pull n = 3000 tweets, I get back a list of 100 elements: 4 are empty lists and the other 96 are data.frames of 34 rows and 38 columns. On closer inspection, all 96 data.frames contain the same 34 tweets, so instead of information on roughly 34 * 96 = 3264 tweets, I only have the same 34 tweets copied 96 times. I have the Premium tier of the Search Tweets API, and this call consumes 96 requests to the Twitter API. Thank you in advance for your help!
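As a quick way to confirm the duplication, here is a minimal sketch (assuming, as described above, that covid_data is the unparsed list of pages where each non-empty element is a data.frame with an id_str column; the variable names are from the call below):

```r
# Keep only the non-empty pages, then count distinct tweet ids across them.
pages <- Filter(function(x) is.data.frame(x) && nrow(x) > 0, covid_data)
ids <- unlist(lapply(pages, function(x) x$id_str))
length(ids)          # total rows across all pages
length(unique(ids))  # distinct tweets; far fewer than length(ids) confirms duplicates
```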

Reproduce the problem

`covid_data <- search_fullarchive(q="#covid19 -is:retweet has:geo", n = 3000, env_name = "researchpr", fromDate = "202202080000", toDate = "202205100000", token = twitter_token, parse = FALSE)`

rtweet version

1.0.2

Session info

R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.6

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rtweet_1.0.2   ngram_3.2.1    quanteda_3.2.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9         pillar_1.8.1       compiler_4.2.1     prettyunits_1.1.1 
 [5] tools_4.2.1        stopwords_2.3      progress_1.2.2     jsonlite_1.8.4    
 [9] lifecycle_1.0.3    tibble_3.1.8       gtable_0.3.1       lattice_0.20-45   
[13] pkgconfig_2.0.3    rlang_1.0.6        Matrix_1.5-1       fastmatch_1.1-3   
[17] DBI_1.1.3          cli_3.4.1          rstudioapi_0.14    curl_4.3.3        
[21] withr_2.5.0        dplyr_1.0.10       httr_1.4.4         askpass_1.1       
[25] generics_0.1.3     vctrs_0.5.1        hms_1.1.2          grid_4.2.1        
[29] tidyselect_1.2.0   glue_1.6.2         R6_2.5.1           fansi_1.0.3       
[33] ggplot2_3.3.6      magrittr_2.0.3     scales_1.2.1       ellipsis_0.3.2    
[37] assertthat_0.2.1   colorspace_2.0-3   utf8_1.2.2         stringi_1.7.8     
[41] openssl_2.0.5      RcppParallel_5.1.5 munsell_0.5.0      crayon_1.5.2 
llrs commented

I think this is a duplicate of #732 (pagination was broken, so you don't get all the results); I'll check it. If it is indeed a duplicate, it should already be fixed in the devel version of the package. You can try devel and let me know whether it solves the problem.
You might also be interested to know that the premium API should allow 500 results per request (fixed after #720 in devel).

Hello! I installed the development version using

install_github("ropensci/rtweet", ref = "devel")

and also set premium = TRUE. I've found that when I run

covid_data <- search_fullarchive(q="#covid19 -is:retweet has:geo", n = 50, env_name = "researchpr", premium = TRUE, fromDate = "202202080000", toDate = "202205100000", token = twitter_token, parse = FALSE)

it works as expected, but when I run

halalan_data <- search_fullarchive(q="#halalan22 -is:retweet has:geo", n = 50, env_name = "researchpr", premium = TRUE, fromDate = "202202080000", toDate = "202205100000", token = twitter_token, parse = FALSE)

the duplication described in my first post still occurs, even though the only thing that changed is the query text.

llrs commented

I'm sorry, but I don't have premium access, so I cannot test your query with the "-is:retweet has:geo" operators. I used this similar query instead:

 halalan_data <- search_fullarchive(q="#halalan22", n = 450,
                          env_name = "fullArchive",
                          fromDate = "202202080000", toDate = "202205100000",
                          parse = TRUE)

I did hit one error related to the new edit fields:

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed

I fixed that in devel. With parse = FALSE I got no error, and I never saw a duplicated tweet id_str. There are some tweets with duplicated text, but they come from different users (I checked via a cursory look at users_data()).
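For reference, one way to separate duplicated text from duplicated tweets, assuming halalan_data is a parsed tweets data.frame with id_str and full_text columns (column names may vary across rtweet versions):

```r
library(rtweet)

# No two rows should share a tweet id if pagination works correctly.
sum(duplicated(halalan_data$id_str))

# Tweets with identical text may still come from different accounts;
# users_data() returns author information row-aligned with the tweets.
dups <- duplicated(halalan_data$full_text) |
  duplicated(halalan_data$full_text, fromLast = TRUE)
unique(users_data(halalan_data)$screen_name[dups])
```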

If you can create a reproducible example that doesn't use any premium (paid) operator, it would help me debug and fix this before the next release this week.