paws-r/paws

Infinite loop while paginating paws.management::cloudwatchlogs()$get_log_events()

Closed this issue · 12 comments

I am working on wlandau/crew.aws.batch#2, and things are mostly coming along well. However, I notice that when I try to use pagination to get the full log of an AWS Batch job, the call to get_log_events() hangs. Turning on verbose logging with options(paws.log_level = 3), it appears that the function call churns in an infinite loop even when the log is very small. I am using paws.common 0.6.4 and paws.management 0.4.0 (both current CRAN). Here is my reprex:

library(paws.common)
library(paws.compute)
library(paws.management)
options(paws.log_level = 3)
region <- "us-east-2"
client_batch <- batch(region = region)
client_cloudwatchlogs <- cloudwatchlogs(region = region)
definition <- client_batch$register_job_definition(
  jobDefinitionName = "crew-aws-batch-test",
  type = "container",
  platformCapabilities = "EC2",
  timeout = list(attemptDurationSeconds = 600),
  containerProperties = list(
    image = "alpine:latest",
    resourceRequirements = list(
      memory = list(value = "128", type = "MEMORY"),
      cpus = list(value = "1", type = "VCPU")),
    logConfiguration = list(
      logDriver = "awslogs",
      options = list(
        "awslogs-group" = "/aws/batch/job",
        "awslogs-region" = "us-east-2",
        "awslogs-stream-prefix" = "crew-aws-batch-test")
    )
  )
)
job <- client_batch$submit_job(
  jobName = "crew-aws-batch-test-job",
  jobQueue = "crew-aws-batch-job-queue",
  jobDefinition = "crew-aws-batch-test",
  timeout = list(attemptDurationSeconds = 600),
  containerOverrides = list(
    command = list("echo", "done with container\ndone with job"),
    resourceRequirements = list(
      memory = list(value = "128", type = "MEMORY"),
      cpus = list(value = "1", type = "VCPU")
    )
  )
)
while (TRUE) {
  info <- client_batch$describe_jobs(jobs = job$jobId)
  status <- info$jobs[[1L]]$status
  message(paste("job status:", status))
  if (status %in% c("SUCCEEDED", "FAILED")) {
    break
  }
  Sys.sleep(5)
}
log_stream_name <- info$jobs[[1L]]$container$logStreamName
pages <- paws.common::paginate(
  client_cloudwatchlogs$get_log_events(
    logGroupName = "/aws/batch/job",
    logStreamName = log_stream_name,
    startFromHead = FALSE
  )
)

On my end, the R console output looks like this (although I censored the sensitive information). It never stops printing messages like these.

INFO [2023-12-08 15:06:23.988]: -> CONNECT logs.us-east-2.amazonaws.com:443 HTTP/1.1
-> Host: logs.us-east-2.amazonaws.com:443
-> User-Agent: libcurl/8.1.2 r-curl/5.1.0 httr/1.4.7
-> Proxy-Connection: Keep-Alive
-> 
INFO [2023-12-08 15:06:24.233]: <- HTTP/1.1 200 Connection established
INFO [2023-12-08 15:06:24.233]: <- 
INFO [2023-12-08 15:06:24.654]: -> POST / HTTP/1.1
-> Host: logs.us-east-2.amazonaws.com
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> User-Agent: paws/0.6.4 (R4.3.2; darwin20; aarch64)
-> X-Amz-Target: Logs_20140328.GetLogEvents
-> Content-Type: application/x-amz-json-1.1
-> Content-Length: 134
-> X-Amz-Security-Token: CENSORED
-> X-Amz-Date: 20231208T200623Z
-> Authorization: CENSORED
-> 
INFO [2023-12-08 15:06:24.655]: >> {"logGroupName":"/aws/batch/job","logStreamName":"crew-aws-batch-test/default/CENSORED","startFromHead":false}

INFO [2023-12-08 15:06:24.854]: <- HTTP/1.1 200 OK
INFO [2023-12-08 15:06:24.854]: <- x-amzn-RequestId: 368eb364-dc05-4e9f-b45b-ed9a6f453e88
INFO [2023-12-08 15:06:24.854]: <- Content-Type: application/x-amz-json-1.1
INFO [2023-12-08 15:06:24.854]: <- Content-Length: 489
INFO [2023-12-08 15:06:24.854]: <- Date: Fri, 08 Dec 2023 20:06:24 GMT
INFO [2023-12-08 15:06:24.854]: <- 
INFO [2023-12-08 15:06:24.867]: -> POST / HTTP/1.1
-> Host: logs.us-east-2.amazonaws.com
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> User-Agent: paws/0.6.4 (R4.3.2; darwin20; aarch64)
-> X-Amz-Target: Logs_20140328.GetLogEvents
-> Content-Type: application/x-amz-json-1.1
-> Content-Length: 209
-> X-Amz-Security-Token: CENSORED
-> X-Amz-Date: 20231208T200624Z
-> Authorization: CENSORED
-> 
INFO [2023-12-08 15:06:24.867]: >> {"logGroupName":"/aws/batch/job","logStreamName":"crew-aws-batch-test/default/CENSORED","nextToken":"f/CENSORED/s","startFromHead":false}

Session info:

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Indiana/Indianapolis
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] crew.aws.batch_0.0.0.9001 testthat_3.2.0            paws.management_0.4.0    
[4] paws.compute_0.4.0        paws.common_0.6.4        

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        stringi_1.8.2     digest_0.6.33     magrittr_2.0.3    pkgload_1.3.3    
 [6] fastmap_1.1.1     jsonlite_1.8.8    rprojroot_2.0.4   processx_3.8.2    pkgbuild_1.4.2   
[11] sessioninfo_1.2.2 brio_1.1.3        urlchecker_1.0.1  ps_1.7.5          promises_1.2.1   
[16] httr_1.4.7        purrr_1.0.2       fansi_1.0.5       crew_0.7.0        cli_3.6.1        
[21] shiny_1.7.5.1     rlang_1.1.2       crayon_1.5.2      ellipsis_0.3.2    withr_2.5.2      
[26] remotes_2.4.2.1   cachem_1.0.8      devtools_2.4.5    tools_4.3.2       parallel_4.3.2   
[31] memoise_2.0.1     getip_0.1-3       nanonext_0.11.0   httpuv_1.6.11     curl_5.1.0       
[36] vctrs_0.6.5       R6_2.5.1          mime_0.12         lifecycle_1.0.4   stringr_1.5.1    
[41] fs_1.6.3          htmlwidgets_1.6.2 usethis_2.2.2     miniUI_0.1.1.1    desc_1.4.2       
[46] pkgconfig_2.0.3   callr_3.7.3       pillar_1.9.0      later_1.3.1       data.table_1.14.8
[51] glue_1.6.2        profvis_0.3.8     Rcpp_1.0.11       tibble_3.2.1      tidyselect_1.2.0 
[56] rstudioapi_0.15.0 xtable_1.8-4      htmltools_0.5.6.1 mirai_0.11.3      compiler_4.3.2   
[61] prettyunits_1.2.0

By the way, is there a completely straightforward and reliable way to get the CloudWatch log group given a job ID, job name, or even job ARN? I can't find it in the output of describe_jobs() or describe_job_definitions(), so for the sake of wlandau/crew.aws.batch#2, I am having users create their own job definitions and supply an explicit log group name with default "/aws/batch/job". Not ideal if someone's sys admins use a custom logging scheme which users don't know about, so I'd be interested in a better approach if there is one.

Sorry about that @wlandau I will try and get this in for paws.common 0.6.5. Off the top of my head I am not sure but I will have a little look into it :)

Hmmm it looks like get_log_events isn't a paginator in boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#paginators

From digging a little more, there seems to be some issues raised on botocore and aws-sdk around this:

boto/botocore#1474
aws/aws-sdk#422

Not sure what to do on this one as I would like to keep align with the other aws-sdks however it doesn't feel difficult to implement.

From my understanding it should be checking the primary_result_key for any results (similar to MaxItems). Initial idea for implementation:
https://github.com/DyfanJones/paws/blob/paginator_empty_result/paws.common/R/paginate.R#L59-L85

Here is a solution that aws js v3 uses. I believe we can implement something similar.

aws/aws-sdk-js-v3#3524

@wlandau I believe I have a solution to this based off the aws sdk* js v3:

remotes::install_github("dyfanjones/paws/paws.common", ref = "paginator_empty_result")

library(paws)

client = cloudwatchlogs(config(credentials(profile = "paws")))

pages <- paws.common::paginate(
  client$get_log_events(
    logGroupName = "/aws/sagemaker/NotebookInstances",
    logStreamName = "paws-demo/jupyter.log",
    startFromHead = TRUE
  ),
  StopOnSameToken = TRUE
)
length(pages)
#> [1] 2

Created on 2023-12-19 with reprex v2.0.2

Please try it out :)

For completeness, here is the typescript smithy reference: https://github.com/smithy-lang/smithy-typescript/blob/main/packages/core/src/pagination/createPaginator.ts#L20-L58

export function createPaginator<
  PaginationConfigType extends PaginationConfiguration,
  InputType extends object,
  OutputType extends object
>(
  ClientCtor: any,
  CommandCtor: any,
  inputTokenName: string,
  outputTokenName: string,
  pageSizeTokenName?: string
): (config: PaginationConfigType, input: InputType, ...additionalArguments: any[]) => Paginator<OutputType> {
  return async function* paginateOperation(
    config: PaginationConfigType,
    input: InputType,
    ...additionalArguments: any[]
  ): Paginator<OutputType> {
    let token: any = config.startingToken || undefined;
    let hasNext = true;
    let page: OutputType;

    while (hasNext) {
      (input as any)[inputTokenName] = token;
      if (pageSizeTokenName) {
        (input as any)[pageSizeTokenName] = (input as any)[pageSizeTokenName] ?? config.pageSize;
      }
      if (config.client instanceof ClientCtor) {
        page = await makePagedClientRequest(CommandCtor, config.client, input, ...additionalArguments);
      } else {
        throw new Error(`Invalid client, expected instance of ${ClientCtor.name}`);
      }
      yield page;
      const prevToken = token;
      token = (page as any)[outputTokenName];
      hasNext = !!(token && (!config.stopOnSameToken || token !== prevToken));
    }
    // @ts-ignore
    return undefined;
  };
}

Thanks for the patch, @DyfanJones! Unfortunately, I won’t be able to test until January when I get back from vacation. If #722 works on your end, please feel free to merge etc.

No worries, I won't be pushing this to the cran before the new year. I am just going to go over the logic before finally merging :)

Hi @DyfanJones, I hope you had a nice holiday.

I tested your work on this issue, both using #721 (comment) and https://github.com/wlandau/crew.aws.batch/blob/9714d49d32d94b1e044cd06f2fde0956f6ea556e/tests/monitor/test-jobs.R#L88-L130, and paws.common::paginate(StopOnSameToken = TRUE) works. Thanks for the patch!

By the way, I noticed the default argument of StopOnSameToken is FALSE. Which scenarios should it be TRUE, and which require StopOnSameToken = FASLE?

That is great news 👍

The implementation mimics how the JavaScript AWS SDK V3 paginates with the default set as false or missing stopOnSameToken in the config, this also helps paws and keeps previous pagination behaviour for other services.

For knowing what service to set StopOnSameToken = TRUE, that is a tricky question (currently I am not sure). When checking the documentation for aws sdk js v3 (https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/Package/-aws-sdk-client-cloudwatch-logs/Variable/paginateGetLogEvents/) it only says the following.

For some APIs, such as CloudWatchLogs events, the next page token will always be present.When true, this config field will have the paginator stop when the token doesn't change instead of when it is not present.

It might be worth having this documentation in paws as well. Sadly this isn't a solid answer. If anyone knows a better answer other than documentation I am happy add it in 😄

For completeness ====

Mini example for testing out aws sdk js v3 paginator logic.

let tokens = ["a", "b", "c", "c", "d"];

export interface PaginationConfiguration {
    stopOnSameToken?: boolean;
};

export function createPaginator(config: PaginationConfiguration): void {

    let hasNext: boolean = true;
    let token: string = undefined;
    let i: number = 0;

    while (hasNext) {
        const prevToken = token;
        token = tokens[i];
        hasNext = !!(token && (!config.stopOnSameToken || token !== prevToken));
        console.log(hasNext)
        i++;
    };
};

const paginatorConfig = { stopOnSameToken: true };
// const paginatorConfig = { };
// const paginatorConfig = { stopOnSameToken: false };

createPaginator(paginatorConfig);

// paginatorConfig.stopOnSameToken = true
// true
// true
// true
// false

// paginatorConfig.stopOnSameToken = false or
// paginatorConfig = {}
// true
// true
// true
// true
// true
// false

Note: paws.common 0.7.0 has been released to the cran