tidyverse/googledrive

as_dribble() and drive_ls() get stuck in folders 3 levels deep

gorkang opened this issue · 2 comments

Hi there!

When trying to use {pins} with Google Drive, I am running into some issues. The latest one seems to be related to {googledrive} having trouble finding a folder three levels deep, as in Level1/Level2/Level3/. My Google Drive has about 400K files and >1TB.

As you can see in the reprex below, as_dribble() and drive_ls() have no issues with folders one or two levels deep, but adding a third level makes them get stuck.

# 1 level folder
path = paste0("pins-testing/")
googledrive::as_dribble(path)
#> # A dribble: 1 × 4
#>   name         path          id                                drive_resource   
#>   <chr>        <chr>         <drv_id>                          <list>           
#> 1 pins-testing pins-testing/ 1MStG1e73DoRO8rxGG93uRBSQIBfUS6ai <named list [34]>
googledrive::drive_ls(path)
#> # A dribble: 3 × 3
#>   name     id                                drive_resource   
#>   <chr>    <drv_id>                          <list>           
#> 1 pid_X    1H0tSPVNNDeUnkjIEOAKQeQTSL7g_-Jea <named list [34]>
#> 2 pid_999x 1VZQA3fLm3Vsp00Q1e_FaayBXkSChmCFB <named list [33]>
#> 3 pid_999  1yG-KuNSikiACquLAvZbm24B30-2whzm- <named list [33]>

# 2 levels folder
path = paste0("pins-testing/pid_X/")
googledrive::as_dribble(path)
#> # A dribble: 1 × 4
#>   name  path                id                                drive_resource   
#>   <chr> <chr>               <drv_id>                          <list>           
#> 1 pid_X pins-testing/pid_X/ 1H0tSPVNNDeUnkjIEOAKQeQTSL7g_-Jea <named list [34]>
googledrive::drive_ls(path)
#> # A dribble: 3 × 3
#>   name                   id                                drive_resource   
#>   <chr>                  <drv_id>                          <list>           
#> 1 SimpleName             1X0Bt8TXaoczBsKvUleI0MDG0EKlZtc-w <named list [34]>
#> 2 -                      1mlkMFIpcb3Pe0AyWwWJsLT7DpdocC1ZY <named list [33]>
#> 3 20230908T062343Z-db9b5 1_1b8j5glMUMBtmsvseMsETeKAHd-yDOi <named list [34]>


# 3 levels folder
path = paste0("pins-testing/pid_X/20230908T062343Z-db9b5/")

# These two get stuck forever
# googledrive::drive_ls(path)
# httr::with_verbose(googledrive::as_dribble(path))

# MANUALLY PASTED THIS HERE
#>-> GET /drive/v3/files?orderBy=recency%20desc&q=%28trashed%20%3D%20false%29%20and%20%28mimeType%20%3D%20%27application%2Fvnd.google-apps.folder%27%29&supportsAllDrives=TRUE&fields=nextPageToken%2C%2A&pageToken=~%21%21~AI9FV7TmS1p5A_fnD_ADi00BMVvamke8nm9NmPnV1O9_k9OlCbRbYMQV-SR0Q7gXzFCEADgbCVf37JHJvP-_dSgcJwHAWQDflgmFECOZRgE4UujQEJvgyYEUF1aAL8ZOPqNKJF5smipiwCGMVpAs0W5CxDkfXxOApNuKj8m1IlGIK8XMNPxsvayoYa0Yf-MgfqUi0okfcb2OKy_WmTQrHSbp8E6380yR1JLTaCS7dxU1P41PbZCfpsqiwiilf018rNC31ySclHMptSYC1lyC6dJJFYR9eub0r9tr4UetbEMJr7t_AULQHi8FMa0sNQmmgb2qxt-wT7NX6YbFitdjnVujYG8uajjA5w%3D%3D HTTP/2
#>-> Host: www.googleapis.com
#>-> user-agent: googledrive/2.1.1 (GPN:RStudio; ) gargle/1.5.2 httr/1.4.7
#>-> accept-encoding: deflate, gzip, br, zstd
#>-> accept: application/json, text/xml, application/xml, */*
#>-> authorization: Bearer [EDITED]
#><- HTTP/2 200 
#><- vary: Origin, X-Origin
#><- pragma: no-cache
#><- expires: Mon, 01 Jan 1990 00:00:00 GMT
#><- cache-control: no-cache, no-store, max-age=0, must-revalidate
#><- date: Fri, 08 Sep 2023 06:46:37 GMT
#><- content-type: application/json; charset=UTF-8
#><- content-encoding: gzip
#><- server: ESF
#><- content-length: 10065
#><- x-xss-protection: 0
#><- x-frame-options: SAMEORIGIN
#><- x-content-type-options: nosniff
#><- alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
#><- 

Created on 2023-09-08 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16)
#>  os       Ubuntu 22.04.3 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Atlantic/Canary
#>  date     2023-09-08
#>  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  askpass       1.2.0   2023-09-03 [1] RSPM (R 4.3.0)
#>  cli           3.6.1   2023-03-23 [1] RSPM
#>  curl          5.0.2   2023-08-14 [1] RSPM (R 4.3.0)
#>  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
#>  dplyr         1.1.3   2023-09-03 [1] RSPM (R 4.3.0)
#>  evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.0)
#>  fansi         1.0.4   2023-01-22 [1] RSPM
#>  fastmap       1.1.1   2023-02-24 [1] RSPM
#>  fs            1.6.3   2023-07-20 [1] RSPM (R 4.3.0)
#>  gargle        1.5.2   2023-07-20 [1] RSPM (R 4.3.0)
#>  generics      0.1.3   2022-07-05 [1] RSPM
#>  glue          1.6.2   2022-02-24 [1] RSPM
#>  googledrive   2.1.1   2023-06-11 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.6   2023-08-10 [1] RSPM (R 4.3.0)
#>  httr          1.4.7   2023-08-15 [1] RSPM (R 4.3.0)
#>  jsonlite      1.8.7   2023-06-29 [1] RSPM (R 4.3.0)
#>  knitr         1.43    2023-05-25 [1] RSPM (R 4.3.0)
#>  lifecycle     1.0.3   2022-10-07 [1] RSPM
#>  magrittr      2.0.3   2022-03-30 [1] RSPM
#>  openssl       2.1.0   2023-07-15 [1] RSPM (R 4.3.0)
#>  pillar        1.9.0   2023-03-22 [1] RSPM
#>  pkgconfig     2.0.3   2019-09-22 [1] RSPM
#>  purrr         1.0.2   2023-08-10 [1] RSPM (R 4.3.0)
#>  R.cache       0.16.0  2022-07-21 [1] RSPM
#>  R.methodsS3   1.8.2   2022-06-13 [1] RSPM
#>  R.oo          1.25.0  2022-06-12 [1] RSPM
#>  R.utils       2.12.2  2022-11-11 [1] RSPM
#>  R6            2.5.1   2021-08-19 [1] RSPM
#>  rappdirs      0.3.3   2021-01-31 [1] RSPM
#>  reprex        2.0.2   2022-08-17 [1] RSPM
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
#>  rmarkdown     2.24    2023-08-14 [1] RSPM (R 4.3.0)
#>  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] RSPM
#>  styler        1.10.2  2023-08-29 [1] RSPM (R 4.3.0)
#>  tibble        3.2.1   2023-03-20 [1] RSPM
#>  tidyselect    1.2.0   2022-10-10 [1] RSPM
#>  utf8          1.2.3   2023-01-31 [1] RSPM
#>  vctrs         0.6.3   2023-06-14 [1] RSPM (R 4.3.0)
#>  withr         2.5.0   2022-03-03 [1] RSPM
#>  xfun          0.40    2023-08-09 [1] RSPM (R 4.3.0)
#>  yaml          2.3.7   2023-01-23 [1] RSPM
#> 
#>  [1] /home/emrys/R/x86_64-pc-linux-gnu-library/4.3
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

I'm not working on googledrive right now, so this is a rather superficial response.

But a function such as drive_ls() is entirely a googledrive creation. The Drive API does not actually offer support for listing everything in a folder. So we are doing lots of recursive work inside googledrive, with many API calls, which is undoubtedly very slow. I can easily imagine that whatever you are trying to do ("3-levels deep folder", "My Google Drive has about 400K files") is running up against practical performance constraints of the rather naive implementation we have here.

So my very high-level advice is to approach this from a different angle.

You'll have to play around a bit, but the idea is to not create a request that forces googledrive to range over all 400K of your files trying to resolve a filepath with many components (folders within folders within folders).

Here is a horrible sketch (I'm sure this code does not work, but it should convey the idea):

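library(googledrive)
library(dplyr)  # only needed for filter()

# Resolve the top-level folder once, by name
d1 <- drive_get("pins-testing/")
# List its contents; passing a dribble means the folder is identified by ID
d1_listing <- drive_ls(d1)
# Pick the next folder out of the listing and repeat, one level at a time
d2 <- filter(d1_listing, name == "pid_X")
d2_listing <- drive_ls(d2)
d3 <- filter(d2_listing, name == "20230908T062343Z-db9b5")
d3_listing <- drive_ls(d3)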

The key idea is to provide file IDs whenever possible instead of a filepath. This is what happens when you specify a target folder with a dribble instead of just its filepath. The approach above does this in a rather boneheaded stepwise way, but hopefully it makes things clearer. There's probably something less ugly that will work, but that should get you started.
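For what it's worth, the stepwise idea can be wrapped in a small loop. This is only a sketch under assumptions: drive_walk() is a hypothetical helper, not something googledrive provides, and its get_fun and ls_fun arguments exist only so the logic can be tried offline with stand-in functions. It assumes each folder name is unique within its parent and that the first path component can be resolved by name, as in the reprex.

```r
# Hypothetical helper (not part of googledrive): resolve a nested folder by
# listing one level at a time, always handing the parent over as a dribble
# (i.e. by file ID) instead of as a multi-part path.
drive_walk <- function(path,
                       get_fun = googledrive::drive_get,
                       ls_fun = googledrive::drive_ls) {
  # Split "a/b/c/" into c("a", "b", "c"), dropping empty pieces
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  parts <- parts[nzchar(parts)]
  stopifnot(length(parts) >= 1)

  # Only the first component is looked up by name
  current <- get_fun(paste0(parts[1], "/"))
  if (nrow(current) != 1) {
    stop("Expected exactly one item named '", parts[1], "', found ", nrow(current))
  }

  # Every further component is found in a listing of its parent
  for (part in parts[-1]) {
    listing <- ls_fun(current)
    current <- listing[listing$name == part, ]
    if (nrow(current) != 1) {
      stop("Expected exactly one item named '", part, "', found ", nrow(current))
    }
  }
  current
}

# Intended usage (needs an authenticated googledrive session):
# d3 <- drive_walk("pins-testing/pid_X/20230908T062343Z-db9b5/")
# googledrive::drive_ls(d3)
#
# If the folder ID is already known from an earlier listing, it can be
# passed directly, skipping path resolution altogether:
# googledrive::drive_ls(googledrive::as_id("1_1b8j5glMUMBtmsvseMsETeKAHd-yDOi"))
```

The helper stops with an error when a name is missing or ambiguous at any level, since Drive allows several items with the same name in one folder and silently picking one would be risky.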

Thanks @jennybc

In the end, my impression is that @juliasilge solved the issue by using basically the technique you hinted at here.

Thanks again!