Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite loop while paginating paws.management::cloudwatchlogs()$get_log_events() #721

Closed
wlandau opened this issue Dec 8, 2023 · 12 comments
Assignees
Labels
bug 🐞 Something isn't working

Comments

@wlandau
Copy link

wlandau commented Dec 8, 2023

I am working on wlandau/crew.aws.batch#2, and things are mostly coming along well. However, I notice that when I try to use pagination to get the full log of an AWS Batch job, the call to get_log_events() hangs. Turning on verbose logging with options(paws.log_level = 3), it appears that the function call churns in an infinite loop even when the log is very small. I am using paws.common 0.6.4 and paws.management 0.4.0 (both current CRAN). Here is my reprex:

library(paws.common)
library(paws.compute)
library(paws.management)
options(paws.log_level = 3)
region <- "us-east-2"
client_batch <- batch(region = region)
client_cloudwatchlogs <- cloudwatchlogs(region = region)
definition <- client_batch$register_job_definition(
  jobDefinitionName = "crew-aws-batch-test",
  type = "container",
  platformCapabilities = "EC2",
  timeout = list(attemptDurationSeconds = 600),
  containerProperties = list(
    image = "alpine:latest",
    resourceRequirements = list(
      memory = list(value = "128", type = "MEMORY"),
      cpus = list(value = "1", type = "VCPU")),
    logConfiguration = list(
      logDriver = "awslogs",
      options = list(
        "awslogs-group" = "/aws/batch/job",
        "awslogs-region" = "us-east-2",
        "awslogs-stream-prefix" = "crew-aws-batch-test")
    )
  )
)
job <- client_batch$submit_job(
  jobName = "crew-aws-batch-test-job",
  jobQueue = "crew-aws-batch-job-queue",
  jobDefinition = "crew-aws-batch-test",
  timeout = list(attemptDurationSeconds = 600),
  containerOverrides = list(
    command = list("echo", "done with container\ndone with job"),
    resourceRequirements = list(
      memory = list(value = "128", type = "MEMORY"),
      cpus = list(value = "1", type = "VCPU")
    )
  )
)
while (TRUE) {
  info <- client_batch$describe_jobs(jobs = job$jobId)
  status <- info$jobs[[1L]]$status
  message(paste("job status:", status))
  if (status %in% c("SUCCEEDED", "FAILED")) {
    break
  }
  Sys.sleep(5)
}
log_stream_name <- info$jobs[[1L]]$container$logStreamName
pages <- paws.common::paginate(
  client_cloudwatchlogs$get_log_events(
    logGroupName = "/aws/batch/job",
    logStreamName = log_stream_name,
    startFromHead = FALSE
  )
)

On my end, the R console output looks like this (although I censored the sensitive information). It never stops printing messages like these.

INFO [2023-12-08 15:06:23.988]: -> CONNECT logs.us-east-2.amazonaws.com:443 HTTP/1.1
-> Host: logs.us-east-2.amazonaws.com:443
-> User-Agent: libcurl/8.1.2 r-curl/5.1.0 httr/1.4.7
-> Proxy-Connection: Keep-Alive
-> 
INFO [2023-12-08 15:06:24.233]: <- HTTP/1.1 200 Connection established
INFO [2023-12-08 15:06:24.233]: <- 
INFO [2023-12-08 15:06:24.654]: -> POST / HTTP/1.1
-> Host: logs.us-east-2.amazonaws.com
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> User-Agent: paws/0.6.4 (R4.3.2; darwin20; aarch64)
-> X-Amz-Target: Logs_20140328.GetLogEvents
-> Content-Type: application/x-amz-json-1.1
-> Content-Length: 134
-> X-Amz-Security-Token: CENSORED
-> X-Amz-Date: 20231208T200623Z
-> Authorization: CENSORED
-> 
INFO [2023-12-08 15:06:24.655]: >> {"logGroupName":"/aws/batch/job","logStreamName":"crew-aws-batch-test/default/CENSORED","startFromHead":false}

INFO [2023-12-08 15:06:24.854]: <- HTTP/1.1 200 OK
INFO [2023-12-08 15:06:24.854]: <- x-amzn-RequestId: 368eb364-dc05-4e9f-b45b-ed9a6f453e88
INFO [2023-12-08 15:06:24.854]: <- Content-Type: application/x-amz-json-1.1
INFO [2023-12-08 15:06:24.854]: <- Content-Length: 489
INFO [2023-12-08 15:06:24.854]: <- Date: Fri, 08 Dec 2023 20:06:24 GMT
INFO [2023-12-08 15:06:24.854]: <- 
INFO [2023-12-08 15:06:24.867]: -> POST / HTTP/1.1
-> Host: logs.us-east-2.amazonaws.com
-> Accept-Encoding: deflate, gzip
-> Accept: application/json, text/xml, application/xml, */*
-> User-Agent: paws/0.6.4 (R4.3.2; darwin20; aarch64)
-> X-Amz-Target: Logs_20140328.GetLogEvents
-> Content-Type: application/x-amz-json-1.1
-> Content-Length: 209
-> X-Amz-Security-Token: CENSORED
-> X-Amz-Date: 20231208T200624Z
-> Authorization: CENSORED
-> 
INFO [2023-12-08 15:06:24.867]: >> {"logGroupName":"/aws/batch/job","logStreamName":"crew-aws-batch-test/default/CENSORED","nextToken":"f/CENSORED/s","startFromHead":false}

Session info:

R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.1.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Indiana/Indianapolis
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] crew.aws.batch_0.0.0.9001 testthat_3.2.0            paws.management_0.4.0    
[4] paws.compute_0.4.0        paws.common_0.6.4        

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        stringi_1.8.2     digest_0.6.33     magrittr_2.0.3    pkgload_1.3.3    
 [6] fastmap_1.1.1     jsonlite_1.8.8    rprojroot_2.0.4   processx_3.8.2    pkgbuild_1.4.2   
[11] sessioninfo_1.2.2 brio_1.1.3        urlchecker_1.0.1  ps_1.7.5          promises_1.2.1   
[16] httr_1.4.7        purrr_1.0.2       fansi_1.0.5       crew_0.7.0        cli_3.6.1        
[21] shiny_1.7.5.1     rlang_1.1.2       crayon_1.5.2      ellipsis_0.3.2    withr_2.5.2      
[26] remotes_2.4.2.1   cachem_1.0.8      devtools_2.4.5    tools_4.3.2       parallel_4.3.2   
[31] memoise_2.0.1     getip_0.1-3       nanonext_0.11.0   httpuv_1.6.11     curl_5.1.0       
[36] vctrs_0.6.5       R6_2.5.1          mime_0.12         lifecycle_1.0.4   stringr_1.5.1    
[41] fs_1.6.3          htmlwidgets_1.6.2 usethis_2.2.2     miniUI_0.1.1.1    desc_1.4.2       
[46] pkgconfig_2.0.3   callr_3.7.3       pillar_1.9.0      later_1.3.1       data.table_1.14.8
[51] glue_1.6.2        profvis_0.3.8     Rcpp_1.0.11       tibble_3.2.1      tidyselect_1.2.0 
[56] rstudioapi_0.15.0 xtable_1.8-4      htmltools_0.5.6.1 mirai_0.11.3      compiler_4.3.2   
[61] prettyunits_1.2.0
@wlandau
Copy link
Author

wlandau commented Dec 8, 2023

By the way, is there a completely straightforward and reliable way to get the CloudWatch log group given a job ID, job name, or even job ARN? I can't find it in the output of describe_jobs() or describe_job_definitions(), so for the sake of wlandau/crew.aws.batch#2, I am having users create their own job definitions and supply an explicit log group name with default "/aws/batch/job". Not ideal if someone's sys admins use a custom logging scheme which users don't know about, so I'd be interested in a better approach if there is one.

@DyfanJones DyfanJones added the bug 🐞 Something isn't working label Dec 8, 2023
@DyfanJones
Copy link
Member

Sorry about that @wlandau I will try and get this in for paws.common 0.6.5. Off the top of my head I am not sure but I will have a little look into it :)

@DyfanJones
Copy link
Member

Hmmm it looks like get_log_events isn't a paginator in boto3 https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/logs.html#paginators

From digging a little more, there seems to be some issues raised on botocore and aws-sdk around this:

boto/botocore#1474
aws/aws-sdk#422

@DyfanJones
Copy link
Member

DyfanJones commented Dec 12, 2023

Not sure what to do on this one as I would like to keep align with the other aws-sdks however it doesn't feel difficult to implement.

From my understanding it should be checking the primary_result_key for any results (similar to MaxItems). Initial idea for implementation:
https://github.com/DyfanJones/paws/blob/paginator_empty_result/paws.common/R/paginate.R#L59-L85

@DyfanJones
Copy link
Member

Here is a solution that aws js v3 uses. I believe we can implement something similar.

aws/aws-sdk-js-v3#3524

@DyfanJones
Copy link
Member

DyfanJones commented Dec 19, 2023

@wlandau I believe I have a solution to this based off the aws sdk* js v3:

remotes::install_github("dyfanjones/paws/paws.common", ref = "paginator_empty_result")

library(paws)

client = cloudwatchlogs(config(credentials(profile = "paws")))

pages <- paws.common::paginate(
  client$get_log_events(
    logGroupName = "/aws/sagemaker/NotebookInstances",
    logStreamName = "paws-demo/jupyter.log",
    startFromHead = TRUE
  ),
  StopOnSameToken = TRUE
)
length(pages)
#> [1] 2

Created on 2023-12-19 with reprex v2.0.2

Please try it out :)

@DyfanJones
Copy link
Member

DyfanJones commented Dec 20, 2023

For completeness, here is the typescript smithy reference: https://github.com/smithy-lang/smithy-typescript/blob/main/packages/core/src/pagination/createPaginator.ts#L20-L58

export function createPaginator<
  PaginationConfigType extends PaginationConfiguration,
  InputType extends object,
  OutputType extends object
>(
  ClientCtor: any,
  CommandCtor: any,
  inputTokenName: string,
  outputTokenName: string,
  pageSizeTokenName?: string
): (config: PaginationConfigType, input: InputType, ...additionalArguments: any[]) => Paginator<OutputType> {
  return async function* paginateOperation(
    config: PaginationConfigType,
    input: InputType,
    ...additionalArguments: any[]
  ): Paginator<OutputType> {
    let token: any = config.startingToken || undefined;
    let hasNext = true;
    let page: OutputType;

    while (hasNext) {
      (input as any)[inputTokenName] = token;
      if (pageSizeTokenName) {
        (input as any)[pageSizeTokenName] = (input as any)[pageSizeTokenName] ?? config.pageSize;
      }
      if (config.client instanceof ClientCtor) {
        page = await makePagedClientRequest(CommandCtor, config.client, input, ...additionalArguments);
      } else {
        throw new Error(`Invalid client, expected instance of ${ClientCtor.name}`);
      }
      yield page;
      const prevToken = token;
      token = (page as any)[outputTokenName];
      hasNext = !!(token && (!config.stopOnSameToken || token !== prevToken));
    }
    // @ts-ignore
    return undefined;
  };
}

@wlandau
Copy link
Author

wlandau commented Dec 20, 2023

Thanks for the patch, @DyfanJones! Unfortunately, I won’t be able to test until January when I get back from vacation. If #722 works on your end, please feel free to merge etc.

@DyfanJones
Copy link
Member

No worries, I won't be pushing this to the cran before the new year. I am just going to go over the logic before finally merging :)

@wlandau
Copy link
Author

wlandau commented Jan 2, 2024

Hi @DyfanJones, I hope you had a nice holiday.

I tested your work on this issue, both using #721 (comment) and https://github.com/wlandau/crew.aws.batch/blob/9714d49d32d94b1e044cd06f2fde0956f6ea556e/tests/monitor/test-jobs.R#L88-L130, and paws.common::paginate(StopOnSameToken = TRUE) works. Thanks for the patch!

By the way, I noticed the default argument of StopOnSameToken is FALSE. Which scenarios should it be TRUE, and which require StopOnSameToken = FASLE?

@wlandau wlandau closed this as completed Jan 2, 2024
@DyfanJones
Copy link
Member

That is great news 👍

The implementation mimics how the JavaScript AWS SDK V3 paginates with the default set as false or missing stopOnSameToken in the config, this also helps paws and keeps previous pagination behaviour for other services.

For knowing what service to set StopOnSameToken = TRUE, that is a tricky question (currently I am not sure). When checking the documentation for aws sdk js v3 (https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/Package/-aws-sdk-client-cloudwatch-logs/Variable/paginateGetLogEvents/) it only says the following.

For some APIs, such as CloudWatchLogs events, the next page token will always be present.When true, this config field will have the paginator stop when the token doesn't change instead of when it is not present.

It might be worth having this documentation in paws as well. Sadly this isn't a solid answer. If anyone knows a better answer other than documentation I am happy add it in 😄

For completeness ====

Mini example for testing out aws sdk js v3 paginator logic.

let tokens = ["a", "b", "c", "c", "d"];

export interface PaginationConfiguration {
    stopOnSameToken?: boolean;
};

export function createPaginator(config: PaginationConfiguration): void {

    let hasNext: boolean = true;
    let token: string = undefined;
    let i: number = 0;

    while (hasNext) {
        const prevToken = token;
        token = tokens[i];
        hasNext = !!(token && (!config.stopOnSameToken || token !== prevToken));
        console.log(hasNext)
        i++;
    };
};

const paginatorConfig = { stopOnSameToken: true };
// const paginatorConfig = { };
// const paginatorConfig = { stopOnSameToken: false };

createPaginator(paginatorConfig);

// paginatorConfig.stopOnSameToken = true
// true
// true
// true
// false

// paginatorConfig.stopOnSameToken = false or
// paginatorConfig = {}
// true
// true
// true
// true
// true
// false

@DyfanJones
Copy link
Member

Note: paws.common 0.7.0 has been released to the cran

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐞 Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants