`lookup_tweets()` returns different variables (column names) for different status IDs
bretsw opened this issue · 5 comments
Problem
When trying to combine the dataframes returned by different calls to `lookup_tweets()`, I noticed that different variables (column names) get returned for different statuses. Anecdotally (I haven't done a large systematic check), I've identified at least 7 different configurations of returned variables (see below).
Expected behavior
I would expect that `lookup_tweets()` would always return the same variables (column names) in the same order, no matter what statuses are looked up, or when returning an empty dataframe.
Reproduce the problem
## Empty dataframe
names0a <- names(rtweet::lookup_tweets("X"))
names0b <- names(rtweet::lookup_tweets("1578252090102751232"))
## Group 1 with same variables
names1a <- names(rtweet::lookup_tweets("1580002144631279616"))
names1b <- names(rtweet::lookup_tweets("1578751613883551746"))
names1c <- names(rtweet::lookup_tweets("1578751608015695873"))
names1d <- names(rtweet::lookup_tweets("1578751601849688066"))
names1e <- names(rtweet::lookup_tweets("1578295296144343041"))
names1f <- names(rtweet::lookup_tweets("1578295296144343041"))
names1g <- names(rtweet::lookup_tweets("1578241812292161538"))
names1h <- names(rtweet::lookup_tweets("1580011666804473856"))
## Group 2 with same variables
names2a <- names(rtweet::lookup_tweets("1578824308260237312"))
names2b <- names(rtweet::lookup_tweets("1579955040399552512"))
## Group 3 with same variables
names3a <- names(rtweet::lookup_tweets("1580186891151777792"))
names3b <- names(rtweet::lookup_tweets("1580045974495297536"))
names3c <- names(rtweet::lookup_tweets("1580030817190846464"))
names3d <- names(rtweet::lookup_tweets("1579825294537474050"))
## Group 4 with same variables
names4a <- names(rtweet::lookup_tweets("1580212580249133056"))
## Group 5 with same variables
names5a <- names(rtweet::lookup_tweets("1580172355699023872"))
## Group 6 with same variables
names6a <- names(rtweet::lookup_tweets("1579969347942219776"))
## Comparing names within and across groups shows whether the sets of variables are the same or different.
which(names1a != names1b)
which(names1a != names2a)
## You can also view all the configurations at once.
View(tibble::tibble(names0a, names1a, names2a, names3a, names4a, names5a, names6a))
rtweet version
## copy/paste output
packageVersion("rtweet")
[1] ‘1.0.2’
Session info
## copy/paste output
sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6.7
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] pillar_1.7.0 compiler_4.2.1 cellranger_1.1.0 prettyunits_1.1.1
[5] progress_1.2.2 tools_4.2.1 digest_0.6.29 rtweet_1.0.2
[9] jsonlite_1.8.0 googledrive_2.0.0 evaluate_0.15 lifecycle_1.0.1
[13] tibble_3.1.8 gargle_1.2.0 pkgconfig_2.0.3 rlang_1.0.4
[17] DBI_1.1.3 cli_3.3.0 rstudioapi_0.13 curl_4.3.2
[21] yaml_2.3.5 xfun_0.31 fastmap_1.1.0 httr_1.4.4
[25] withr_2.5.0 dplyr_1.0.9 stringr_1.4.1 knitr_1.39
[29] hms_1.1.1 generics_0.1.2 vctrs_0.4.1 fs_1.5.2
[33] googlesheets4_1.0.1 tidyselect_1.1.2 glue_1.6.2 R6_2.5.1
[37] fansi_1.0.3 rmarkdown_2.15 purrr_0.3.4 beepr_1.3
[41] magrittr_2.0.3 ellipsis_0.3.2 htmltools_0.5.3 assertthat_0.2.1
[45] utf8_1.2.2 stringi_1.7.8 crayon_1.5.1 audio_0.1-10
Thank you very much for the detailed analysis. This is indeed a bug in rtweet. However, the calls all return the same columns (internally it creates a tibble with all the columns and fills them), just in a different order:
a <- tibble::tibble(sort(names0a), sort(names1a), sort(names2a), sort(names3a), sort(names4a), sort(names5a), sort(names6a))
b <- apply(a, 1, function(x){length(unique(x))})
View(a[b != 1, ])
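To make the distinction between "different columns" and "same columns, different order" explicit, here is a minimal sketch in base R. The two name vectors are made-up stand-ins for something like `names1a` and `names2a`, not real output from `lookup_tweets()`.

```r
# Hypothetical column-name vectors (illustrative only, not real rtweet output)
names_x <- c("created_at", "id", "id_str", "text")
names_y <- c("id", "created_at", "text", "id_str")

# TRUE when both vectors contain the same names, regardless of position
same_set <- setequal(names_x, names_y)

# TRUE only when the names also appear in the same order
same_order <- identical(names_x, names_y)

# Positions at which the two vectors disagree
moved <- which(names_x != names_y)
```

If `same_set` is `TRUE` while `same_order` is `FALSE`, no column is missing and only the ordering differs.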
I will sort the names before returning them to the user for easier usage. I will also check the other outputs, just in case. Meanwhile, instead of `rbind` you will need to use `merge`. Or simply sort the columns before `rbind`: `tweets[, sort(colnames(tweets))]`
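As a rough illustration of the sort-before-`rbind` workaround, the sketch below uses two small hand-made data frames in place of real `lookup_tweets()` results. The column names and values are assumptions chosen only for the example.

```r
# Toy stand-ins for two lookup_tweets() results (hypothetical data)
tweets_a <- data.frame(id_str = "1", text = "first", lang = "en",
                       stringsAsFactors = FALSE)
tweets_b <- data.frame(text = "second", lang = "en", id_str = "2",
                       stringsAsFactors = FALSE)

# Put both data frames into the same alphabetical column order
tweets_a <- tweets_a[, sort(colnames(tweets_a)), drop = FALSE]
tweets_b <- tweets_b[, sort(colnames(tweets_b)), drop = FALSE]

# The column names now line up position by position
stopifnot(identical(colnames(tweets_a), colnames(tweets_b)))

combined <- rbind(tweets_a, tweets_b)
```

Sorting both inputs the same way means the result does not depend on the order in which each call happened to return its columns.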
PS: Now that we are mentioning missing columns, currently the premium endpoints of API v1 return three new fields: "edit_controls", "edit_history", "editable". See #738 and #739. I am considering providing support for those too in the next version so it will also break code. Although, I might just stop upgrading those endpoints to focus on API v2. Just for your interest as package maintainer of a dependency.
Note, I hope to have the release ready by the end of this month (or next). I'm not sure if this will help with your users in ropensci/tidytags#80 or if you prefer to guard against any future omission on my part and change to merge/join in tidytags.
So, I decided not to sort it alphabetically but by the internal representation of tweets (this way is less disruptive). This is now fixed in the devel branch, but I will wait a bit before closing the issue in case you have some other feedback on the fix.
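The general idea of ordering columns by a fixed reference rather than alphabetically can be sketched as follows. This is not the actual rtweet fix: `reference_order` is a hypothetical canonical order invented for the example, and the data frame is toy data.

```r
# Hypothetical canonical column order (an assumption, not rtweet's internal one)
reference_order <- c("created_at", "id_str", "text")

# Toy data frame whose columns arrive in a different order
tweets <- data.frame(text = "hello", id_str = "1", created_at = "2022-10-12",
                     stringsAsFactors = FALSE)

# Reorder the columns to follow the reference order
tweets <- tweets[, reference_order, drop = FALSE]
```

Indexing by a fixed name vector gives every result the same layout, as long as each result contains all the reference columns.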
I see you addressed this in your own package. I am closing the issue.
Yes, thank you! I really appreciate your prompt attention to this, and this issue pointed out some problems in the tidytags code that I needed to address. This has strengthened the package all around.