DyfanJones/noctua

Paginated glue responses not covered

Closed this issue · 24 comments

Hi Dyfan, one more for the week:

When noctua queries Glue for a list of databases with get_databases(), the list gets truncated whenever the result is paginated. The same happens with get_tables().

Not sure what the underlying reason for this is, but in one of our environments the first call to Glue returns only the first 23 databases. For one reason or another, setting the MaxResults parameter doesn't help here.

One fix would be to implement a separate paginated reader function and replace the native paws calls with that, right? Perhaps a more elegant implementation could be devised.

get_all_databases <- function(glue) {
  # First page of results
  cresults <- glue$get_databases()
  cdbs <- sapply(cresults$DatabaseList, function(x) x$Name)
  
  # Keep fetching while a continuation token is present (capped at 1000 pages)
  citers <- 0
  while (length(cresults$NextToken) > 0 && citers < 1000) {
    
    cresults <- glue$get_databases(NextToken = cresults$NextToken)
    cdbs <- c(cdbs, sapply(cresults$DatabaseList, function(x) x$Name))
    citers <- citers + 1
  }
  
  cdbs
}

(I tend to get nervous with these while loops so I added the max iteration count there...)
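
For completeness, calling it would look like this (assuming paws credentials are already configured in the environment):

glue <- paws::glue()
all_dbs <- get_all_databases(glue)
length(all_dbs)  # the full catalog, not just the first page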

Sorry @OssiLehtinen, I am not 100% sure what the issue is. Can you give me an example of what is happening? Is the issue that not all databases and tables are being returned, or that table names are getting truncated?

Ah, sorry for being unclear. The issue is that not all databases and tables are returned. For example, the connection panel in RStudio shows only the first 23 databases, even though many more exist in the catalog.

Ah I see :) I will have a look into it. Strange that it is only displaying the top 23 databases; if I remember correctly the Glue call should return 1,000 🤔.

But you are right, it does need the NextToken as it won't return everything.

Just thinking out loud (I don't know if it is true, so don't quote me 😆): is there a restriction on the RStudio connection tab, as in the number of databases and tables it will show?

Yeah it is pretty strange, and I don't quite understand why we're getting just the first 23 databases (even if we explicitly set MaxResults = 1000L in the paws function call). 23 is a pretty strange number of results to return.

It shouldn't be an issue with RStudio, as the problem also occurs with a direct call to glue$get_databases() and in an R session in a terminal.
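
For what it's worth, here is a minimal way to see the truncation outside RStudio (the 23, of course, is specific to our affected account):

glue <- paws::glue()
res <- glue$get_databases(MaxResults = 1000L)
length(res$DatabaseList)   # 23 in the affected account, despite MaxResults
length(res$NextToken) > 0  # TRUE: the service signals that more pages exist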

BTW, boto3 has the same issue, so it doesn't seem to be a bug specific to paws either. Of course, in principle it is a bug in both, as the MaxResults parameter isn't really respected. aws-cli returns all the databases, however, but maybe it has some built-in pagination handling.

p.s. I guess someone might have 1000+ databases (not saying that anyone should :D) and then pagination would be needed even if things were working correctly...

Interesting, I guess I need to make 24 temporary databases in my Athena just to see this happening :D hahaha

Haha, maybe you would see it, maybe not. The strange thing is that this happens only in one of our environments/accounts. One specific feature there is that we have quite a lot of content shared via Lake Formation to that account, but the 'breaking point' of the database list doesn't seem to correlate with anything.

Does this happen in dbGetTables and dbListTables? I think I will need to update these methods accordingly.

Yes, the same issue there. It must appear everywhere glue$get_databases() is used.

Interesting, and does this only happen when you are working in that environment?

Yes, only in there.

Hmmm, interesting. Are the permissions to AWS Glue the same in that environment compared to your others?

Or is that environment's Glue catalogue synced up correctly with your others?

Well, there certainly could be something weird going on, especially with the permissions granted through Lake Formation, but we haven't spotted anything relevant so far. And when we do another get_databases() query with the NextToken, we get the rest of the results, so in my opinion there is no fundamental problem with permissions.

I will let you know if we make some discoveries about what might cause this.

Well, the NextToken needs to be added to these functions. Hopefully this will fix it going forward :) Very interesting problem, as I haven't come across it before :)

Perhaps a helper like this could be handy:

glue <- paws::glue()

get_paginated_from_paws <- function(paws_fun, token_field = "NextToken", content_node = NULL, ...) {
  
  # Arguments destined for the paws call itself
  cargs <- list(...)
  cresults <- do.call(paws_fun, cargs)
  
  # By default, assume the first element of the response holds the content
  if (is.null(content_node)) content_node <- names(cresults)[1]
  
  ccontent <- cresults[[content_node]]
  
  # Follow the continuation token until it is empty (capped at 100 pages)
  citers <- 0
  while (length(cresults[[token_field]]) > 0 && citers < 100) {
    
    ccargs <- cargs
    ccargs[[token_field]] <- cresults[[token_field]]
    
    cresults <- do.call(paws_fun, ccargs)
    ccontent <- c(ccontent, cresults[[content_node]])
    citers <- citers + 1
  }
  
  ccontent
}


dbs <- get_paginated_from_paws(glue$get_databases)
sapply(dbs, function(x) x$Name)

get_paginated_from_paws(glue$get_tables, DatabaseName = "mydatabase")

The idea would be that you can give it different functions, such as glue$get_databases or glue$get_tables, as an argument and the function handles the looping. Some assumptions are made about the structure of the query results. It shouldn't add much overhead if NextToken is empty after the first call.
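
If the content happens not to be the first element of the response, the content_node argument can pin it down explicitly; for example, TableList is the field name in the GetTables response:

tbls <- get_paginated_from_paws(glue$get_tables, content_node = "TableList",
                                DatabaseName = "mydatabase")
sapply(tbls, function(x) x$Name)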

@OssiLehtinen thanks :) I will have a look to see how this works in practice :D

@OssiLehtinen are you available to test if branch aws-table-list fixes the issue you are coming across in your environment?

# Installation method for branch: aws-table-list
remotes::install_github("dyfanjones/noctua", ref="aws-table-list")

This works in my environment but I would like to make sure it is fixed in yours :)
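
If it helps, a quick sanity check after installing the branch could be something like this (connection arguments assumed to be whatever you normally use):

library(DBI)
con <- dbConnect(noctua::athena())
length(dbListTables(con))  # should now report the full catalog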

Thanks for this :) I have created 2 helper functions for the glue query, as I wanted to format the returning response from AWS. This is to help reduce the number of loops required to parse the named lists, and to help with performance :)
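
A rough sketch of that idea (illustrative only, not the actual branch code; get_database_names is a made-up name) is to parse each page as it arrives rather than re-looping over the accumulated named lists afterwards:

get_database_names <- function(glue) {
  names_out <- character()
  token <- NULL
  repeat {
    # Passing NextToken = NULL on the first call simply omits the parameter
    resp <- glue$get_databases(NextToken = token)
    # Extract the names while this page is in hand
    names_out <- c(names_out, vapply(resp$DatabaseList, `[[`, "", "Name"))
    token <- resp$NextToken
    if (length(token) == 0) break
  }
  names_out
}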

Everything seems to work correctly now. The connections tab shows all the databases, and dbListTables and dbGetTables also return the full list now. So, great work again, thanks!

p.s. the rstudio_conn_tab = F option really comes in handy, as reading the whole glue catalog is even slower now (obviously, it takes more time to actually read in the complete catalog :) )
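
In case it helps anyone else, this is roughly how I am switching it off; the exact placement of the option is an assumption on my part, so check the package documentation for the released signature:

# Assumed usage of the rstudio_conn_tab option mentioned above
library(DBI)
con <- dbConnect(noctua::athena(), rstudio_conn_tab = FALSE)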

You must have a fairly large Glue catalogue for it to be that slow :D From the R side of things it should be nice and fast when parsing the lists 😄 (well, as fast as I know how to make it currently 😋)

🤔 How are dbGetTables and dbListTables? Do they run slow-ish?

Yeah, it's some 1,200 tables or so at the moment. But I think using Lake Formation is the final nail in the coffin (quite a few of those tables are shared from other accounts to this specific account, where things are especially slow).

I totally agree the slowness shouldn't be an R side issue. I think I will raise an issue with AWS about it.

Oh, dbListTables and dbGetTables are equally slow. Also 'select * from information_schema.tables'.

@OssiLehtinen Thanks for giving an update on this 😄

PR #139 fixes this issue 😄