DyfanJones/noctua

Paginated glue responses not covered

Closed this issue · 24 comments

Hi Dyfan, one more for the week:

When noctua queries Glue for a list of databases with get_databases(), the list gets truncated whenever the result is paginated. The same happens with get_tables().

Not sure what the underlying reason for this is, but in one of our environments the first call to Glue returns only the first 23 databases. For one reason or another, setting the MaxResults parameter doesn't help here.

One fix would be to implement a separate paginated reader function and replace the native paws calls with that, right? Perhaps a more elegant implementation could be devised.

get_all_databases <- function(glue) {
  # First page of results
  cresults <- glue$get_databases()
  cdbs <- sapply(cresults$DatabaseList, function(x) x$Name)
  
  # Keep fetching while a continuation token is present (capped at 1000 pages)
  citers <- 0
  while (length(cresults$NextToken) > 0 && citers < 1000) {
    
    cresults <- glue$get_databases(NextToken = cresults$NextToken)
    cdbs <- c(cdbs, sapply(cresults$DatabaseList, function(x) x$Name))
    citers <- citers + 1
  }
  
  cdbs
}

(I tend to get nervous with these while loops so I added the max iteration count there...)
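
For completeness, calling it would look like this (assuming paws credentials are already configured in the environment):

glue <- paws::glue()
all_dbs <- get_all_databases(glue)
length(all_dbs)  # the full catalog, not just the first page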

Sorry @OssiLehtinen, I am not 100% sure what the issue is. Can you give me an example of what is happening? Is the issue that not all databases and tables are being returned, or that table names are getting truncated?

Ah, sorry for being unclear. The issue is that not all databases and tables are returned. For example, the connection panel in RStudio shows only the first 23 databases, even though many more exist in the catalog.

Ah I see :) I will have a look into it. Strange that it is only displaying the top 23 databases; if I remember correctly the Glue call should return 1,000 🤔.

But you are right, it does need the NextToken as it won't return everything.

Just thinking out loud (I don't know if it is true, so don't quote me 😆): is there a restriction on the RStudio connection tab, as in the number of databases and tables it will show?

Yeah it is pretty strange, and I don't quite understand why we're getting just the first 23 databases (even if we explicitly set MaxResults = 1000L in the paws function call). 23 is a pretty strange number of results to return.

It shouldn't be an issue with RStudio, as the problem also occurs with a direct call to glue$get_databases() and in an R session in a terminal.
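
For what it's worth, here is a minimal way to see the truncation outside RStudio (the 23, of course, is specific to our affected account):

glue <- paws::glue()
res <- glue$get_databases(MaxResults = 1000L)
length(res$DatabaseList)   # 23 in the affected account, despite MaxResults
length(res$NextToken) > 0  # TRUE: the service signals that more pages exist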

BTW, boto3 has the same issue, so it doesn't seem to be a bug specific to paws either. Of course, in principle it is a bug in both, as the MaxResults parameter isn't really respected. aws-cli returns all the databases, however, but maybe it has some built-in pagination handling.

p.s. I guess someone might have 1000+ databases (not saying that anyone should :D) and then pagination would be needed even if things were working correctly...

Interesting, I guess I need to make 24 temporary databases in my Athena just to see this happening :D hahaha

Haha, maybe you would see it, maybe not. The strange thing is that this happens only in one of our environments/accounts. One specific feature there is that we have quite a lot of content shared via Lake Formation to that account, but the 'breaking point' of the database list doesn't seem to correlate with anything.

Does this happen in dbGetTables and dbListTables? I think I will need to update these methods accordingly.

Yes, the same issue there. It must appear everywhere glue$get_databases() is used.

Interesting, and does this only happen when you are working in that environment?

Yes, only in there.

Hmmm, interesting. Are the permissions to AWS Glue the same in that environment compared to your others?

Or is that environment's Glue catalogue synced up correctly with your others?

Well, there certainly could be something weird going on, especially with the permissions granted through Lake Formation, but we haven't spotted anything relevant so far. And when we do another get_databases() query with the NextToken, we get the rest of the results, so in my opinion there is no fundamental problem with permissions.

I will let you know if we make some discoveries about what might cause this.

Well, the NextToken needs to be added to these functions. Hopefully this will fix it going forward :) Very interesting problem, as I haven't come across it before :)

Perhaps a helper like this could be handy:

glue <- paws::glue()

get_paginated_from_paws <- function(paws_fun, token_field = "NextToken", content_node = NULL, ...) {
  
  # Arguments destined for the paws call itself
  cargs <- list(...)
  cresults <- do.call(paws_fun, cargs)
  
  # By default, assume the first element of the response holds the content
  if (is.null(content_node)) content_node <- names(cresults)[1]
  
  ccontent <- cresults[[content_node]]
  
  # Follow the continuation token until it is empty (capped at 100 pages)
  citers <- 0
  while (length(cresults[[token_field]]) > 0 && citers < 100) {
    
    ccargs <- cargs
    ccargs[[token_field]] <- cresults[[token_field]]
    
    cresults <- do.call(paws_fun, ccargs)
    ccontent <- c(ccontent, cresults[[content_node]])
    citers <- citers + 1
  }
  
  ccontent
}


dbs <- get_paginated_from_paws(glue$get_databases)
sapply(dbs, function(x) x$Name)

get_paginated_from_paws(glue$get_tables, DatabaseName = "mydatabase")

The idea would be that you can give it different functions, such as glue$get_databases or glue$get_tables, as an argument and the function handles the looping. Some assumptions are made about the structure of the query results. It shouldn't add much overhead if NextToken is empty after the first call.
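
If the content happens not to be the first element of the response, the content_node argument can pin it down explicitly; for example, TableList is the field name in the GetTables response:

tbls <- get_paginated_from_paws(glue$get_tables, content_node = "TableList",
                                DatabaseName = "mydatabase")
sapply(tbls, function(x) x$Name)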

@OssiLehtinen thanks :) I will have a look to see how this works in practice :D

@OssiLehtinen are you available to test if branch aws-table-list fixes the issue you are coming across in your environment?

# Installation method for branch: aws-table-list
remotes::install_github("dyfanjones/noctua", ref="aws-table-list")

This works in my environment but I would like to make sure it is fixed in yours :)
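
If it helps, a quick sanity check after installing the branch could be something like this (connection arguments assumed to be whatever you normally use):

library(DBI)
con <- dbConnect(noctua::athena())
length(dbListTables(con))  # should now report the full catalog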

Thanks for this :) I have created 2 helper functions for the glue query, as I wanted to format the returning response from AWS. This is to help reduce the number of loops required to parse the named lists, and to help with performance :)
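
A rough sketch of that idea (illustrative only, not the actual branch code; get_database_names is a made-up name) is to parse each page as it arrives rather than re-looping over the accumulated named lists afterwards:

get_database_names <- function(glue) {
  names_out <- character()
  token <- NULL
  repeat {
    # Passing NextToken = NULL on the first call simply omits the parameter
    resp <- glue$get_databases(NextToken = token)
    # Extract the names while this page is in hand
    names_out <- c(names_out, vapply(resp$DatabaseList, `[[`, "", "Name"))
    token <- resp$NextToken
    if (length(token) == 0) break
  }
  names_out
}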

Everything seems to work correctly now. The connections tab shows all the databases, and dbListTables and dbGetTables also return the full list now. So, great work again, thanks!

p.s. the rstudio_conn_tab = F option really comes in handy, as reading the whole glue catalog is even slower now (obviously, it takes more time to actually read in the complete catalog :) )
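
In case it helps anyone else, this is roughly how I am switching it off; the exact placement of the option is an assumption on my part, so check the package documentation for the released signature:

# Assumed usage of the rstudio_conn_tab option mentioned above
library(DBI)
con <- dbConnect(noctua::athena(), rstudio_conn_tab = FALSE)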

You must have a fairly large Glue catalogue for it to be that slow :D From the R side of things it should be nice and fast when parsing the lists 😄 (well, as fast as I know how to make it currently 😋)

🤔 How are dbGetTables and dbListTables? Do they run slow-ish?

Yeah, it's some 1,200 tables or so at the moment. But I think using Lake Formation is the final nail in the coffin (quite a few of those tables are shared from other accounts to this specific account, where things are especially slow).

I totally agree the slowness shouldn't be an R side issue. I think I will raise an issue with AWS about it.

Oh, dbListTables and dbGetTables are equally slow. Also 'select * from information_schema.tables'.

@OssiLehtinen Thanks for giving an update on this 😄

PR #139 fixes this issue 😄