search_fullarchive returning duplicate Tweets
Problem
When using search_fullarchive, I am trying to pull n = 3000 tweets, but it returns a list of 100 elements, 4 of which are empty lists. The other 96 elements are data.frames of 34 rows and 38 columns. Upon further inspection, each of the 96 data.frames contains the same 34 Tweets and their information, so instead of having information for 34 * 96 = 3264 Tweets, I only have the information for 34 Tweets copied 96 times over. I have the Premium version of the Search Tweets API, and this call consumes 96 requests to the Twitter API. Thank you in advance for your help!
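One way to confirm this kind of duplication is to compare the total number of rows with the number of distinct Tweet IDs across all pages. The following is a minimal sketch on simulated data, not on real API output: the `pages` object, its contents, and the assumption that each page has an `id_str` column are made up for illustration.

```r
# Simulated stand-in for the raw result described above:
# 96 identical pages of 34 rows each, plus 4 empty lists.
page <- data.frame(
  id_str = sprintf("%d", 1001:1034),   # hypothetical Tweet IDs
  text = paste("tweet", 1:34),
  stringsAsFactors = FALSE
)
pages <- c(rep(list(page), 96), rep(list(list()), 4))

# Keep only the elements that are non-empty data frames.
is_page <- vapply(
  pages,
  function(p) is.data.frame(p) && nrow(p) > 0,
  logical(1)
)

# Collect every Tweet ID from the non-empty pages.
ids <- unlist(lapply(pages[is_page], function(p) p$id_str), use.names = FALSE)

n_pages  <- sum(is_page)         # 96 non-empty pages
n_rows   <- length(ids)          # 3264 rows in total
n_unique <- length(unique(ids))  # only 34 distinct Tweets
```

If `n_unique` is much smaller than `n_rows`, the same page is being returned repeatedly instead of successive pages.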
Reproduce the problem
`covid_data <- search_fullarchive(q="#covid19 -is:retweet has:geo", n = 3000, env_name = "researchpr", fromDate = "202202080000", toDate = "202205100000", token = twitter_token, parse = FALSE)`
rtweet version
1.0.2
Session info
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.6
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] rtweet_1.0.2 ngram_3.2.1 quanteda_3.2.3
loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 pillar_1.8.1 compiler_4.2.1 prettyunits_1.1.1
[5] tools_4.2.1 stopwords_2.3 progress_1.2.2 jsonlite_1.8.4
[9] lifecycle_1.0.3 tibble_3.1.8 gtable_0.3.1 lattice_0.20-45
[13] pkgconfig_2.0.3 rlang_1.0.6 Matrix_1.5-1 fastmatch_1.1-3
[17] DBI_1.1.3 cli_3.4.1 rstudioapi_0.14 curl_4.3.3
[21] withr_2.5.0 dplyr_1.0.10 httr_1.4.4 askpass_1.1
[25] generics_0.1.3 vctrs_0.5.1 hms_1.1.2 grid_4.2.1
[29] tidyselect_1.2.0 glue_1.6.2 R6_2.5.1 fansi_1.0.3
[33] ggplot2_3.3.6 magrittr_2.0.3 scales_1.2.1 ellipsis_0.3.2
[37] assertthat_0.2.1 colorspace_2.0-3 utf8_1.2.2 stringi_1.7.8
[41] openssl_2.0.5 RcppParallel_5.1.5 munsell_0.5.0 crayon_1.5.2
I think this is a duplicate of #732 (pagination was broken, hence you don't get all the results); I'll check it. If it really is a duplicate, it should be fixed in the devel version of the package. You can try it and let me know whether this is really solved in devel or not.
You might also be interested to know that the premium API should allow for 500 Tweets per request (fixed after #720 in devel).
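To see what the larger page size would mean for request usage, here is a rough calculation. It assumes the figure of 500 is the maximum number of Tweets returned per request with premium, and that 100 is the limit without it; both numbers are assumptions for this sketch, and `requests_needed` is a hypothetical helper, not part of rtweet.

```r
# Hypothetical helper: number of paginated requests needed to collect
# n Tweets when each request returns at most per_request Tweets.
requests_needed <- function(n, per_request) {
  ceiling(n / per_request)
}

without_premium <- requests_needed(3000, 100)  # 30
with_premium    <- requests_needed(3000, 500)  # 6
```

Under these assumptions, a correctly paginated call for n = 3000 should need far fewer than the 96 requests reported above.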
Hello! I installed the developer version using
`install_github("ropensci/rtweet", ref = "devel")`
and also used premium = TRUE, and I've found that when I input
`covid_data <- search_fullarchive(q="#covid19 -is:retweet has:geo", n = 50, env_name = "researchpr", premium = TRUE, fromDate = "202202080000", toDate = "202205100000", token = twitter_token, parse = FALSE)`
it works as expected, but when I input
`halalan_data <- search_fullarchive(q="#halalan22 -is:retweet has:geo", n = 50, env_name = "researchpr", premium = TRUE, fromDate = "202202080000", toDate = "202205100000", token = twitter_token, parse = FALSE)`
the error stated in my first post still occurs, despite the only thing changing being the query text.
I'm sorry but I don't have premium and I cannot test your query with the "-is:retweet has:geo" operators. I used this similar query:
halalan_data <- search_fullarchive(q="#halalan22", n = 450,
env_name = "fullArchive",
fromDate = "202202080000", toDate = "202205100000",
parse = TRUE)
I did get one error related to the new edit fields:
Error in `.rowNamesDF<-`(x, value = value) :
duplicate 'row.names' are not allowed
Which I fixed in devel, but I got no error with parse = FALSE and I never got a duplicated tweet id_str. There are some duplicated tweets (same text), but they are from different users (I simply checked via a cursory look at the users_data()).
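The distinction between a truly duplicated Tweet (same id_str) and the same text posted by different users can be checked along these lines. This is only a sketch on made-up data: the `tweets` data frame, its `user_id` column, and the example texts are hypothetical and do not come from the actual query results.

```r
# Made-up example: four Tweets, two of which share the same text
# but come from different users.
tweets <- data.frame(
  id_str  = c("1", "2", "3", "4"),
  user_id = c("a", "b", "c", "a"),
  text    = c("Vote today! #halalan22", "Vote today! #halalan22",
              "Results are in", "Long lines at the precinct"),
  stringsAsFactors = FALSE
)

# A real duplicate would show up as a repeated id_str.
has_duplicated_ids <- any(duplicated(tweets$id_str))

# Repeated text alone is not a duplicate if the users differ.
has_duplicated_text <- any(duplicated(tweets$text))
has_same_user_repeat <- any(duplicated(tweets[, c("user_id", "text")]))
```

Here `has_duplicated_text` is TRUE while `has_duplicated_ids` and `has_same_user_repeat` are both FALSE, which matches the situation described: identical text, but distinct Tweets from distinct users.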
If you can create a reproducible example without any premium (paid) operator it would help me debug and fix this before the next release this week.