ropensci/rtweet

`lookup_tweets()` returns different variables (column names) for different status IDs

bretsw opened this issue · 5 comments

Problem

When trying to combine the dataframes returned by different calls to lookup_tweets(), I noticed that different variables (column names) get returned for different statuses. Anecdotally (I haven't done a large systematic check), I've identified at least 7 different configurations of returned variables (see below).

Expected behavior

I would expect that lookup_tweets() would always return the same variables (column names) in the same order, no matter what statuses are looked up, or when returning an empty dataframe.

Reproduce the problem

## Empty dataframe
names0a <- names(rtweet::lookup_tweets("X"))
names0b <- names(rtweet::lookup_tweets("1578252090102751232"))

## Group 1 with same variables
names1a <- names(rtweet::lookup_tweets("1580002144631279616"))
names1b <- names(rtweet::lookup_tweets("1578751613883551746"))
names1c <- names(rtweet::lookup_tweets("1578751608015695873"))
names1d <- names(rtweet::lookup_tweets("1578751601849688066"))
names1e <- names(rtweet::lookup_tweets("1578295296144343041"))
names1f <- names(rtweet::lookup_tweets("1578295296144343041"))
names1g <- names(rtweet::lookup_tweets("1578241812292161538"))
names1h <- names(rtweet::lookup_tweets("1580011666804473856"))

## Group 2 with same variables
names2a <- names(rtweet::lookup_tweets("1578824308260237312"))
names2b <- names(rtweet::lookup_tweets("1579955040399552512"))

## Group 3 with same variables
names3a <- names(rtweet::lookup_tweets("1580186891151777792"))
names3b <- names(rtweet::lookup_tweets("1580045974495297536"))
names3c <- names(rtweet::lookup_tweets("1580030817190846464"))
names3d <- names(rtweet::lookup_tweets("1579825294537474050"))

## Group 4 with same variables
names4a <- names(rtweet::lookup_tweets("1580212580249133056"))

## Group 5 with same variables
names5a <- names(rtweet::lookup_tweets("1580172355699023872"))

## Group 6 with same variables
names6a <- names(rtweet::lookup_tweets("1579969347942219776"))

## Running this code within and across groups shows whether the sets of variables are the same or different.
which(names1a != names1b)
which(names1a != names2a)

## You can also view all the configurations at once.
View(tibble::tibble(names0a, names1a, names2a, names3a, names4a, names5a, names6a))

rtweet version

## copy/paste output
packageVersion("rtweet")

[1] ‘1.0.2’

Session info

## copy/paste output
sessionInfo()

R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6.7

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] pillar_1.7.0 compiler_4.2.1 cellranger_1.1.0 prettyunits_1.1.1
[5] progress_1.2.2 tools_4.2.1 digest_0.6.29 rtweet_1.0.2
[9] jsonlite_1.8.0 googledrive_2.0.0 evaluate_0.15 lifecycle_1.0.1
[13] tibble_3.1.8 gargle_1.2.0 pkgconfig_2.0.3 rlang_1.0.4
[17] DBI_1.1.3 cli_3.3.0 rstudioapi_0.13 curl_4.3.2
[21] yaml_2.3.5 xfun_0.31 fastmap_1.1.0 httr_1.4.4
[25] withr_2.5.0 dplyr_1.0.9 stringr_1.4.1 knitr_1.39
[29] hms_1.1.1 generics_0.1.2 vctrs_0.4.1 fs_1.5.2
[33] googlesheets4_1.0.1 tidyselect_1.1.2 glue_1.6.2 R6_2.5.1
[37] fansi_1.0.3 rmarkdown_2.15 purrr_0.3.4 beepr_1.3
[41] magrittr_2.0.3 ellipsis_0.3.2 htmltools_0.5.3 assertthat_0.2.1
[45] utf8_1.2.2 stringi_1.7.8 crayon_1.5.1 audio_0.1-10

llrs commented

Thank you very much for the detailed analysis. This is indeed a bug in rtweet. However, all the calls return the same columns (internally, rtweet creates a tibble with all the columns and fills them), just in a different order:

a <- tibble::tibble(sort(names0a), sort(names1a), sort(names2a), sort(names3a), sort(names4a), sort(names5a), sort(names6a))
b <- apply(a, 1, function(x){length(unique(x))})
View(a[b != 1, ])

I will sort the names before returning them to the user for easier usage. I will also check other outputs just in case. Meanwhile, instead of rbind you will need to use merge, or simply sort the columns before rbind: tweets[, sort(colnames(tweets))]
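As a minimal sketch of the sorting workaround, the snippet below uses two toy data frames as stand-ins for lookup_tweets() results (the column names and values are made up for illustration, not real API output). It contrasts a position-wise comparison of names with a set-wise one, then sorts the columns of both data frames before binding them.

```r
# Toy stand-ins for two lookup_tweets() results (illustrative data only):
# same columns, different order.
tweets_a <- data.frame(id_str = "1", text = "first", lang = "en",
                       stringsAsFactors = FALSE)
tweets_b <- data.frame(lang = "en", id_str = "2", text = "second",
                       stringsAsFactors = FALSE)

# A position-wise comparison reports mismatches whenever the order differs.
mismatched_positions <- which(names(tweets_a) != names(tweets_b))

# A set-wise comparison shows whether the same columns are present at all.
same_columns <- setequal(names(tweets_a), names(tweets_b))

# Sort the columns of each data frame by name so they line up, then bind.
sorted_a <- tweets_a[, sort(colnames(tweets_a)), drop = FALSE]
sorted_b <- tweets_b[, sort(colnames(tweets_b)), drop = FALSE]
combined <- rbind(sorted_a, sorted_b)
```

Checking setequal() first is a cheap way to tell an ordering problem apart from genuinely missing columns before deciding how to combine the results.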

PS: Now that we are mentioning missing columns, currently the premium endpoints of API v1 return three new fields: "edit_controls", "edit_history", "editable". See #738 and #739. I am considering providing support for those too in the next version so it will also break code. Although, I might just stop upgrading those endpoints to focus on API v2. Just for your interest as package maintainer of a dependency.

llrs commented

Note: I hope to have the release ready by the end of this month (or next). I'm not sure if this will help your users in ropensci/tidytags#80, or if you would prefer to guard against any future omission on my part and switch to merge/join in tidytags.

llrs commented

So, I decided not to sort the columns alphabetically but by the internal representation of tweets (this way is less disruptive). This is now fixed in the devel branch, but I will wait a bit before closing the issue in case you have other feedback on the fix.
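To illustrate the idea of ordering by a fixed reference rather than alphabetically, here is a rough sketch. The helper align_columns() and the reference_order vector are hypothetical names made up for this example; this is not the code in the devel branch, and the reference columns are only placeholders.

```r
# Hypothetical helper (not part of rtweet): reorder the columns of `df` to
# follow `reference`, adding any missing reference column filled with NA.
# Columns not listed in `reference` are kept at the end.
align_columns <- function(df, reference) {
  missing_cols <- setdiff(reference, names(df))
  for (col in missing_cols) {
    df[[col]] <- rep(NA, nrow(df))
  }
  extra_cols <- setdiff(names(df), reference)
  df[, c(reference, extra_cols), drop = FALSE]
}

# Placeholder reference order, for illustration only.
reference_order <- c("created_at", "id_str", "text")

tweet <- data.frame(text = "hello", id_str = "42", stringsAsFactors = FALSE)
aligned <- align_columns(tweet, reference_order)
names(aligned)
```

With a fixed reference like this, every result comes back in the same column order regardless of which fields a given status happens to include.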

llrs commented

I see you addressed this in your own package. I am closing the issue.

bretsw commented

Yes, thank you! I really appreciate your prompt attention to this, and this issue pointed out some problems in the tidytags code that I needed to address. This has strengthened the package all around.