as_dribble() and drive_ls() get stuck in folders 3 levels deep
gorkang opened this issue · 2 comments
Hi there!
When trying to use {pins} with Google Drive, I am encountering some issues. The latest seems to be related to {googledrive} having trouble finding a folder nested three levels deep, as in Level1/Level2/Level3/. My Google Drive has about 400K files and >1TB.
As you can see in the reprex below, as_dribble() and drive_ls() have no issues with folders one or two levels deep, but adding a third level makes them get stuck.
# 1 level folder
path = paste0("pins-testing/")
googledrive::as_dribble(path)
#> # A dribble: 1 × 4
#> name path id drive_resource
#> <chr> <chr> <drv_id> <list>
#> 1 pins-testing pins-testing/ 1MStG1e73DoRO8rxGG93uRBSQIBfUS6ai <named list [34]>
googledrive::drive_ls(path)
#> # A dribble: 3 × 3
#> name id drive_resource
#> <chr> <drv_id> <list>
#> 1 pid_X 1H0tSPVNNDeUnkjIEOAKQeQTSL7g_-Jea <named list [34]>
#> 2 pid_999x 1VZQA3fLm3Vsp00Q1e_FaayBXkSChmCFB <named list [33]>
#> 3 pid_999 1yG-KuNSikiACquLAvZbm24B30-2whzm- <named list [33]>
# 2 levels folder
path = paste0("pins-testing/pid_X/")
googledrive::as_dribble(path)
#> # A dribble: 1 × 4
#> name path id drive_resource
#> <chr> <chr> <drv_id> <list>
#> 1 pid_X pins-testing/pid_X/ 1H0tSPVNNDeUnkjIEOAKQeQTSL7g_-Jea <named list [34]>
googledrive::drive_ls(path)
#> # A dribble: 3 × 3
#> name id drive_resource
#> <chr> <drv_id> <list>
#> 1 SimpleName 1X0Bt8TXaoczBsKvUleI0MDG0EKlZtc-w <named list [34]>
#> 2 - 1mlkMFIpcb3Pe0AyWwWJsLT7DpdocC1ZY <named list [33]>
#> 3 20230908T062343Z-db9b5 1_1b8j5glMUMBtmsvseMsETeKAHd-yDOi <named list [34]>
# 3 levels folder
path = paste0("pins-testing/pid_X/20230908T062343Z-db9b5/")
# These two get stuck forever
# googledrive::drive_ls(path)
# httr::with_verbose(googledrive::as_dribble(path))
# MANUALLY PASTED THIS HERE
#>-> GET /drive/v3/files?orderBy=recency%20desc&q=%28trashed%20%3D%20false%29%20and%20%28mimeType%20%3D%20%27application%2Fvnd.google-apps.folder%27%29&supportsAllDrives=TRUE&fields=nextPageToken%2C%2A&pageToken=~%21%21~AI9FV7TmS1p5A_fnD_ADi00BMVvamke8nm9NmPnV1O9_k9OlCbRbYMQV-SR0Q7gXzFCEADgbCVf37JHJvP-_dSgcJwHAWQDflgmFECOZRgE4UujQEJvgyYEUF1aAL8ZOPqNKJF5smipiwCGMVpAs0W5CxDkfXxOApNuKj8m1IlGIK8XMNPxsvayoYa0Yf-MgfqUi0okfcb2OKy_WmTQrHSbp8E6380yR1JLTaCS7dxU1P41PbZCfpsqiwiilf018rNC31ySclHMptSYC1lyC6dJJFYR9eub0r9tr4UetbEMJr7t_AULQHi8FMa0sNQmmgb2qxt-wT7NX6YbFitdjnVujYG8uajjA5w%3D%3D HTTP/2
#>-> Host: www.googleapis.com
#>-> user-agent: googledrive/2.1.1 (GPN:RStudio; ) gargle/1.5.2 httr/1.4.7
#>-> accept-encoding: deflate, gzip, br, zstd
#>-> accept: application/json, text/xml, application/xml, */*
#>-> authorization: Bearer [EDITED]->
#><- HTTP/2 200
#><- vary: Origin, X-Origin
#><- pragma: no-cache
#><- expires: Mon, 01 Jan 1990 00:00:00 GMT
#><- cache-control: no-cache, no-store, max-age=0, must-revalidate
#><- date: Fri, 08 Sep 2023 06:46:37 GMT
#><- content-type: application/json; charset=UTF-8
#><- content-encoding: gzip
#><- server: ESF
#><- content-length: 10065
#><- x-xss-protection: 0
#><- x-frame-options: SAMEORIGIN
#><- x-content-type-options: nosniff
#><- alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
#><-
Created on 2023-09-08 with reprex v2.0.2
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.1 (2023-06-16)
#> os Ubuntu 22.04.3 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Atlantic/Canary
#> date 2023-09-08
#> pandoc 3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> askpass 1.2.0 2023-09-03 [1] RSPM (R 4.3.0)
#> cli 3.6.1 2023-03-23 [1] RSPM
#> curl 5.0.2 2023-08-14 [1] RSPM (R 4.3.0)
#> digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.1)
#> dplyr 1.1.3 2023-09-03 [1] RSPM (R 4.3.0)
#> evaluate 0.21 2023-05-05 [1] CRAN (R 4.3.0)
#> fansi 1.0.4 2023-01-22 [1] RSPM
#> fastmap 1.1.1 2023-02-24 [1] RSPM
#> fs 1.6.3 2023-07-20 [1] RSPM (R 4.3.0)
#> gargle 1.5.2 2023-07-20 [1] RSPM (R 4.3.0)
#> generics 0.1.3 2022-07-05 [1] RSPM
#> glue 1.6.2 2022-02-24 [1] RSPM
#> googledrive 2.1.1 2023-06-11 [1] CRAN (R 4.3.0)
#> htmltools 0.5.6 2023-08-10 [1] RSPM (R 4.3.0)
#> httr 1.4.7 2023-08-15 [1] RSPM (R 4.3.0)
#> jsonlite 1.8.7 2023-06-29 [1] RSPM (R 4.3.0)
#> knitr 1.43 2023-05-25 [1] RSPM (R 4.3.0)
#> lifecycle 1.0.3 2022-10-07 [1] RSPM
#> magrittr 2.0.3 2022-03-30 [1] RSPM
#> openssl 2.1.0 2023-07-15 [1] RSPM (R 4.3.0)
#> pillar 1.9.0 2023-03-22 [1] RSPM
#> pkgconfig 2.0.3 2019-09-22 [1] RSPM
#> purrr 1.0.2 2023-08-10 [1] RSPM (R 4.3.0)
#> R.cache 0.16.0 2022-07-21 [1] RSPM
#> R.methodsS3 1.8.2 2022-06-13 [1] RSPM
#> R.oo 1.25.0 2022-06-12 [1] RSPM
#> R.utils 2.12.2 2022-11-11 [1] RSPM
#> R6 2.5.1 2021-08-19 [1] RSPM
#> rappdirs 0.3.3 2021-01-31 [1] RSPM
#> reprex 2.0.2 2022-08-17 [1] RSPM
#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.3.0)
#> rmarkdown 2.24 2023-08-14 [1] RSPM (R 4.3.0)
#> rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.1)
#> sessioninfo 1.2.2 2021-12-06 [1] RSPM
#> styler 1.10.2 2023-08-29 [1] RSPM (R 4.3.0)
#> tibble 3.2.1 2023-03-20 [1] RSPM
#> tidyselect 1.2.0 2022-10-10 [1] RSPM
#> utf8 1.2.3 2023-01-31 [1] RSPM
#> vctrs 0.6.3 2023-06-14 [1] RSPM (R 4.3.0)
#> withr 2.5.0 2022-03-03 [1] RSPM
#> xfun 0.40 2023-08-09 [1] RSPM (R 4.3.0)
#> yaml 2.3.7 2023-01-23 [1] RSPM
#>
#> [1] /home/emrys/R/x86_64-pc-linux-gnu-library/4.3
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
I'm not working on googledrive right now, so this is a rather superficial response.
But a function such as drive_ls() is entirely a googledrive creation. The Drive API does not actually offer support for listing everything in a folder. So we are doing lots of recursive work inside googledrive, with many API calls, which is undoubtedly very slow. I can easily imagine that whatever you are trying to do ("3-levels deep folder", "My Google Drive has about 400K files") is running up against practical performance constraints of the rather naive implementation we have here.
So my very high-level advice is to approach this from a different angle.
You'll have to play around a bit, but the idea is to not create a request that forces googledrive to range over all 400K of your files trying to resolve a filepath with many components (folders within folders within folders).
Here is a horrible sketch (I'm sure this code does not work, but it should convey the idea):
library(googledrive)
library(tidyverse)
# Look up only the top-level folder by path
d1 <- drive_get("pins-testing/")
d1_listing <- drive_ls(d1)
# From here on, pass dribbles (which carry file IDs) instead of filepaths
d2 <- filter(d1_listing, name == "pid_X")
d2_listing <- drive_ls(d2)
d3 <- filter(d2_listing, name == "20230908T062343Z-db9b5")
d3_listing <- drive_ls(d3)
The key idea is to provide file IDs whenever possible instead of a filepath. This is what happens when you specify a target folder with a dribble instead of just its filepath. The approach above does this in a rather boneheaded stepwise way, but hopefully it makes things clearer. There's probably something less ugly that will work, but that should get you started.
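One possible way to generalize the stepwise idea above is a small helper that walks a path one folder at a time. This is only a minimal, untested-against-Drive sketch: drive_walk(), get_fun and ls_fun are hypothetical names, not part of googledrive. The lookup and listing functions are passed as arguments, defaulting to drive_get() and drive_ls(), and the helper assumes each folder name is unique within its parent.

```r
# Hypothetical helper (NOT part of googledrive): resolve a nested folder by
# listing one level at a time, so only the first lookup is done by path.
drive_walk <- function(path,
                       get_fun = googledrive::drive_get,
                       ls_fun = googledrive::drive_ls) {
  # Split "a/b/c/" into c("a", "b", "c"), dropping empty pieces
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]
  stopifnot(length(parts) >= 1)

  # The top-level folder is the only one looked up by name
  current <- get_fun(paste0(parts[1], "/"))
  if (nrow(current) != 1) {
    stop("Expected exactly one item named '", parts[1], "', found ", nrow(current))
  }

  # Every later step lists the current folder (identified by its ID)
  # and picks the child whose name matches the next path component
  for (part in parts[-1]) {
    listing <- ls_fun(current)
    hit <- listing[listing$name == part, ]
    if (nrow(hit) != 1) {
      stop("Expected exactly one item named '", part, "', found ", nrow(hit))
    }
    current <- hit
  }
  current
}

# Usage (needs an authenticated googledrive session, so left commented out):
# target <- drive_walk("pins-testing/pid_X/20230908T062343Z-db9b5/")
# googledrive::drive_ls(target)
#
# If the folder ID is already known (for example from an earlier listing),
# it can be passed directly with as_id(), skipping the walk entirely:
# googledrive::drive_ls(googledrive::as_id("1_1b8j5glMUMBtmsvseMsETeKAHd-yDOi"))
```

The helper stops with an error as soon as a path component is missing or ambiguous, rather than guessing which of several same-named folders was meant.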
Thanks @jennybc
In the end, my impression is that @juliasilge solved the issue by using basically the technique you hinted at here.
Thanks again!