pambrose/prometheus-proxy

Timeout config parameter

ksanjay7 opened this issue · 47 comments

Hi,
Currently I have a prometheus-proxy in us-east-1 and I am trying to pull CloudWatch metrics from us-east-1 of the same account.
When I scrape the data, the scrape takes cloudwatch_exporter_scrape_duration_seconds 11.337143927, but when I scrape the current instance metrics using node_exporter (port 9100) I get the data immediately.

When I try to access these metrics through the proxy, I am able to get the node_exporter data (9100) without any issue. However, I am not able to get the CloudWatch data through the agent, as the request times out.

Could you please advise if there is a workaround for this, or if there is a timeout configuration on the proxy side?

Thanks in advance.

Hi,

I am afraid I do not understand your question. Are you asking if there is a timeout for scrapes?

Paul

Hi Paul,

Thanks for your reply.

Yes. When I try to scrape the data through the prometheus-agent, the proxy is unable to scrape if it takes more than 10 seconds. If the scrape completes in under 10 seconds, e.g. cloudwatch_exporter_scrape_duration_seconds 6.2806632, I can see the data in the proxy. So I'm looking for a configuration option in the proxy to change the default timeout.

~Sanjay

The scrape timeouts are controlled with the .conf file value:
proxy.internal.scrapeRequestTimeoutSecs

The default value is 5. Try setting it to a longer value and see if that solves the issue.
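
For example, something like this in your proxy .conf should work (a minimal sketch; the key nests under the proxy block):

proxy {
  internal.scrapeRequestTimeoutSecs = 30   // default is 5
  agent.port = 50051
}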

Hi Paul,

Sorry for the delay in replying.

Below is the situation -

Use case 1 -
If I have a scrape time of cloudwatch_exporter_scrape_duration_seconds 6.153901805 with this exporter configuration:
region: us-east-1
metrics:

  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: DiskReadOps
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]

In the proxy, I get the data, with cloudwatch_exporter_scrape_duration_seconds 2.383826874.

Use case 2 -

With more metrics than before, the scrape takes cloudwatch_exporter_scrape_duration_seconds 7.84222368, with this configuration:
metrics:

  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: DiskReadOps
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: DiskWriteOps
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: DiskReadBytes
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]

When I curl the proxy, I get the results intermittently (sometimes curl times out, and sometimes I see the results).

Use case 3 -
If I have all the EC2 metrics, the scrape takes cloudwatch_exporter_scrape_duration_seconds 11.136508655, with this configuration:

  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: DiskReadOps
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: DiskWriteOps
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: DiskReadBytes
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: DiskWriteBytes
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: NetworkIn
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: NetworkOut
    aws_dimensions: [InstanceId,InstanceType]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: NetworkPacketsIn
    aws_dimensions: [InstanceId,InstanceType]
    aws_statistics: [Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: NetworkPacketsOut
    aws_dimensions: [InstanceId,InstanceType]
    aws_statistics: [Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: MetadataNoToken
    aws_dimensions: [InstanceId,InstanceType]
    aws_statistics: [SampleCount]
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUCreditUsage
    aws_dimensions: [InstanceId,InstanceType]
    aws_statistics: [SampleCount]
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUCreditBalance
    aws_dimensions: [InstanceId,InstanceType]
    aws_statistics: [SampleCount]
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUSurplusCreditBalance
    aws_dimensions: [InstanceId,InstanceType]
    aws_statistics: [SampleCount]
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUSurplusCreditsCharged
    aws_dimensions: [InstanceId,InstanceType]
    aws_statistics: [SampleCount]
  - aws_namespace: AWS/EC2
    aws_metric_name: StatusCheckFailed
    aws_dimensions: [InstanceId,InstanceType]
    aws_statistics: [SampleCount]
  - aws_namespace: AWS/EC2
    aws_metric_name: StatusCheckFailed_Instance
    aws_dimensions: [InstanceId,InstanceType]
    aws_statistics: [SampleCount]
  - aws_namespace: AWS/EC2
    aws_metric_name: StatusCheckFailed_System
    aws_dimensions: [InstanceId,InstanceType]
    aws_statistics: [SampleCount]

The proxy never returns any results.

Below are my proxy settings -

proxy.conf
proxy {
  proxy.internal.scrapeRequestTimeoutSecs = 300
  agent.port = 50051

  tls {
    certChainFilePath = "certs/server1.pem"            // Server certificate chain file path
    privateKeyFilePath = "certs/server1.key"           // Server private key file path
    trustCertCollectionFilePath = "certs/ca.pem"       // Trust certificate collection file path
  }
}

docker run --rm -d -p 8082:8082 -p 8092:8092 -p 50051:50051 -p 8080:8080 \
  --mount type=bind,source="$(pwd)"/certs,target=/app/certs \
  --mount type=bind,source="$(pwd)/noauth.conf",target=/app/proxy.conf \
  --env PROXY_CONFIG=proxy.conf \
  --env ADMIN_ENABLED=true \
  --env METRICS_ENABLED=true \
  --name prometheus-proxy pambrose/prometheus-proxy:1.6.4

Is there something I'm missing to get the results to the proxy?

Did you try increasing the value of proxy.internal.scrapeRequestTimeoutSecs?

Thanks for your reply, Paul. Below are the proxy settings I have used:

proxy.conf
proxy {
  proxy.internal.scrapeRequestTimeoutSecs = 300
  agent.port = 50051

  tls {
    certChainFilePath = "certs/server1.pem"            // Server certificate chain file path
    privateKeyFilePath = "certs/server1.key"           // Server private key file path
    trustCertCollectionFilePath = "certs/ca.pem"       // Trust certificate collection file path
  }
}

Hi Sanjay,

Sorry for the delayed response.

Let me see if I can simplify things to better understand your issue.
Without a proxy or agent involved, can you successfully scrape the metrics from AWS?
When you add the proxy and agent, and you configure the agent to scrape the URL
you scraped in the previous step, are you saying you are not getting data?

Paul

Hi Paul,

Thanks for your reply.

I'm using 3 components here: the cloudwatch-exporter, the agent, and the proxy.
The cloudwatch-exporter was able to scrape all the metrics from CloudWatch, and the agent was able to get all metrics scraped as well.
However, the proxy was not able to scrape all the metrics successfully every time. When the set of metrics scraped from the exporter by the agent was small, the proxy was able to get the metrics. When I configure more metrics, the exporter -> agent scrape is still successful, but agent -> proxy does not work.

Regards, Sanjay

Hi Sanjay,

I do not understand what you mean when you say: "agent was able to get all metrics scraped."

Can you see the metrics from the Cloudwatch-exporter in your browser?
If you point the agent at the same URL, can you see them from the proxy?
If not, what do you see?

Paul

hi Paul,

Can you see the metrics from the Cloudwatch-exporter in your browser? - Yes, I see all the metrics when I curl "http://xx.xxx.xxx.xxx:9106/metrics".
If you point the agent at the same URL, can you see them from the proxy? - I see the metrics when the CloudWatch exporter is scraping just 2-3 metrics, but not when I scrape more than 3 metrics.

I have also tried increasing the scrape timeout as below, and it did not work when I try more than 3 metrics.
proxy.conf
proxy {
  proxy.internal.scrapeRequestTimeoutSecs = 300
  agent.port = 50051

  tls {
    certChainFilePath = "certs/server1.pem"            // Server certificate chain file path
    privateKeyFilePath = "certs/server1.key"           // Server private key file path
    trustCertCollectionFilePath = "certs/ca.pem"       // Trust certificate collection file path
  }
}

Okay, so "http://xx.xxx.xxx.xxx:9106/metrics" is the endpoint you want to reach and you can curl it directly and see all the metrics. But when you scrape "http://xx.xxx.xxx.xxx:9106/metrics" using the agent and then curl the proxy, you do not see the same results as the direct curl request against the exporter. Are random lines missing or does the file just cut off arbitrarily?

That's correct, Paul. I don't always see the same results from the proxy as from the CloudWatch exporter.
When the number of CloudWatch metrics to scrape is small, I see them in both the exporter scrape and the proxy. But when the number of metrics increases, I see them in the exporter scrape and not in the proxy; the proxy response is empty.

For example -
Use case 1 -
I see the same results both in the exporter scrape ("http://xx.xxx.xxx.xxx:9106/metrics") and in the proxy:

metrics:
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: DiskReadOps
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]

Use case 2 -

With more metrics scraped through CloudWatch than before, I see the metrics in the exporter scrape (curl "http://xx.xxx.xxx.xxx:9106/metrics") but not in the proxy. The proxy doesn't return any results:

metrics:
  - aws_namespace: AWS/EC2
    aws_metric_name: CPUUtilization
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: DiskReadOps
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: DiskWriteOps
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]
  - aws_namespace: AWS/EC2
    aws_metric_name: DiskReadBytes
    aws_dimensions: [InstanceId]
    aws_statistics: [Minimum, Maximum, Average]

And does the log on the agent say anything alarming?

What is the approximate line count of a direct curl scrape against "http://xx.xxx.xxx.xxx:9106/metrics" that fails on the proxy?

I see 828 lines in the cloudwatch-exporter scrape.
Yes Paul, I see an error in the agent log when the timeout is not set in the agent conf file.

00:40:01.272 INFO [Agent.kt:244] - Version: 1.6.4 Release Date: 4/30/20 [main]
00:40:01.415 INFO [AgentPathManager.kt:45] - Proxy path /xxxxxxxx9106 will be assigned to http://xxxx.xxxx.xxx.xxx:9106/metrics [main]
00:40:01.416 INFO [AgentPathManager.kt:45] - Proxy path /xxxx-devops will be assigned to http://xxxx.xxxx.xxx.xxx:9100/metrics [main]
00:40:01.419 INFO [AgentGrpcService.kt:116] - Creating gRPC stubs [main]
00:40:01.420 INFO [GrpcDsl.kt:48] - Creating connection for gRPC server at xxxx.xxxx.xxx.xxx:50051 using plaintext [main]
00:40:01.598 INFO [Agent.kt:109] - Assigning agent name: Unnamed-xxxxXXX [main]
00:40:01.603 INFO [Agent.kt:110] - Assigning proxy reconnect pause time: 3.00s [main]
00:40:01.604 INFO [GenericService.kt:97] - Admin service disabled [main]
00:40:01.605 INFO [GenericService.kt:115] - Metrics service disabled [main]
00:40:01.605 INFO [GenericService.kt:124] - Zipkin reporter service disabled [main]
00:40:01.612 INFO [GenericService.kt:181] - Adding service Agent{agentId=, agentName=Unnamed-xxxxXXX, proxyHost=xxxx.xxxx.xxx.xxx:50051, adminService=Disabled, metricsService=Disabled} [main]
00:40:01.683 INFO [GenericServiceListener.kt:29] - Starting Agent{agentId=, agentName=Unnamed-xxxxXXX, proxyHost=xxxx.xxxx.xxx.xxx:50051, adminService=Disabled, metricsService=Disabled} [main]
00:40:01.688 INFO [GenericServiceListener.kt:30] - Running Agent{agentId=, agentName=Unnamed-xxxxXXX, proxyHost=xxxx.xxxx.xxx.xxx:50051, adminService=Disabled, metricsService=Disabled} [Agent Unnamed-xxxxXXX]
00:40:01.689 INFO [GenericService.kt:136] - All Agent services healthy [Agent Unnamed-xxxxXXX]
00:40:01.694 INFO [AgentGrpcService.kt:142] - Connecting to proxy at xxxx.xxxx.xxx.xxx:50051 using plaintext... [Agent Unnamed-xxxxXXX]
00:40:01.983 INFO [AgentGrpcService.kt:144] - Connected to proxy at xxxx.xxxx.xxx.xxx:50051 using plaintext [Agent Unnamed-xxxxXXX]
00:40:02.017 INFO [AgentPathManager.kt:65] - Registered http://xxxx.xxxx.xxx.xxx:9106/metrics as /xxxxXXX9106 [Agent Unnamed-xxxxXXX]
00:40:02.026 INFO [AgentPathManager.kt:65] - Registered http://xxxx.xxxx.xxx.xxx:9100/metrics as /xxxxXXX-devops [Agent Unnamed-xxxxXXX]
00:40:02.069 INFO [Agent.kt:194] - Heartbeat scheduled to fire after 5.00s of inactivity [DefaultDispatcher-worker-1]
00:40:22.252 WARN [AgentHttpService.kt:76] - fetchScrapeUrl() kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 15000 ms - http://xxxx.xxxx.xxx.xxx:9106/metrics [DefaultDispatcher-worker-2]
kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 15000 ms
at kotlinx.coroutines.TimeoutKt.TimeoutCancellationException(Timeout.kt:116)
at kotlinx.coroutines.TimeoutCoroutine.run(Timeout.kt:86)
at kotlinx.coroutines.EventLoopImplBase$DelayedRunnableTask.run(EventLoop.common.kt:493)
at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:272)
at kotlinx.coroutines.DefaultExecutor.run(DefaultExecutor.kt:68)
at java.base/java.lang.Thread.run(Unknown Source)
19:00:58.371 WARN [AgentHttpService.kt:76] - fetchScrapeUrl() kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 15000 ms - http://xxxx.xxxx.xxx.xxx:9106/metrics [DefaultDispatcher-worker-1]

and in the proxy log (not sure if this is relevant):

ERROR [ScrapeRequestManager.kt:46] - Missing ScrapeRequestWrapper for scrape_id: 86 [grpc-default-executor-4]

Now I have added the timeout parameter below in agent.conf, and I don't see the timeout error in the log. However, I still don't see the scrape results in the proxy.

agent.conf
agent {
  proxy {
    hostname = "xx.xxx.xxx.xxx"
    port = 50051
    proxy.internal.scrapeRequestTimeoutSecs = 300
  }
}

proxy.conf
proxy {
  proxy.internal.scrapeRequestTimeoutSecs = 300
  agent.port = 50051

  tls {
    certChainFilePath = "certs/server1.pem"            // Server certificate chain file path
    privateKeyFilePath = "certs/server1.key"           // Server private key file path
    trustCertCollectionFilePath = "certs/ca.pem"       // Trust certificate collection file path
  }
}

So you increased proxy.internal.scrapeRequestTimeoutSecs and that stopped the timeouts, but you are not seeing any results in the proxy. That sounds like it is hanging on the request.

How about trying something like 30 seconds and then we will see if that times out.

And these timeouts are happening consistently?
And do you see any delays when you directly curl to the exporter?

I have changed it to 30 in both the agent and the proxy, restarted Docker, and I still see the same behaviour. The exporter is able to scrape successfully, but the proxy is not.
After setting the timeout parameter in the agent to 300, and now to 30, I don't see any timeout errors.

Below are the exporter stats for ~828 lines

# HELP cloudwatch_exporter_scrape_duration_seconds Time this CloudWatch scrape took, in seconds.
# TYPE cloudwatch_exporter_scrape_duration_seconds gauge
cloudwatch_exporter_scrape_duration_seconds 7.351928798
# HELP cloudwatch_exporter_scrape_error Non-zero if this scrape failed.
# TYPE cloudwatch_exporter_scrape_error gauge
cloudwatch_exporter_scrape_error 0.0
# HELP cloudwatch_requests_total API requests made to CloudWatch
# TYPE cloudwatch_requests_total counter
cloudwatch_requests_total{action="getMetricStatistics",namespace="AWS/EC2",} 1352.0
cloudwatch_requests_total{action="listMetrics",namespace="AWS/EC2",} 34.0

So when you change the timeout in the agent to 30, and then query the proxy, there are two possible outcomes:

  1. after 30 seconds you see a timeout on the agent when it fails to read the data from the exporter and you get nothing from the proxy
  2. you see no timeouts on the agent and you get data on the proxy.

I cannot envision another scenario. You should either see data from the proxy or see an error in the agent log.
But it sounds like that is not the case. You see no data and no error?

Yes, Paul. I see no error in the agent and no data in the proxy when my exporter scrape output is around ~828 lines and the timeout is set to 30. I was able to see the exporter scrape data in the proxy when the scrape output was ~274 lines, even with the default timeout or 30.
Not sure if this will help: the CloudWatch exporter and agent are in AWS, and the proxy is in Azure.

Hmmm, that is bizarre. If you are willing to dig a little deeper with this, I can add some instrumentation to the logging that will give us a better idea of what is happening with the exporter scrapes. Something wacky is going on.

Sure, Paul. I'm happy to make changes.

Hi Paul, I have moved the Prometheus agent to a larger instance type (better network performance) and cloudwatch_exporter_scrape_duration_seconds has dropped from 12 sec to 7 sec. However, the data from the agent to the proxy is still timing out. Is my timeout declaration correct?

agent.conf
agent {
  proxy {
    hostname = "xx.xxx.xxx.xxx"
    port = 50051
    proxy.internal.scrapeRequestTimeoutSecs = 30
  }
}

proxy.conf
proxy {
  proxy.internal.scrapeRequestTimeoutSecs = 30
  agent.port = 50051

  tls {
    certChainFilePath = "certs/server1.pem"            // Server certificate chain file path
    privateKeyFilePath = "certs/server1.key"           // Server private key file path
    trustCertCollectionFilePath = "certs/ca.pem"       // Trust certificate collection file path
  }
}

Hi Sanjay,

Sorry, I got distracted by something this week.

The setting is:
proxy {
  internal.scrapeRequestTimeoutSecs = 30
  agent.port = 50051
}

Good news: I think I found an issue with how I am doing timeouts on the agent. Let me test the fix and get a new release out to you. Sorry for this taking so long.

Paul

Thanks, Paul. Good to hear that; I will wait for the fix.

Hi Sanjay,

Okay, I think we are going to make some progress with your problem.

Please do the following:

  1. Please upgrade to 1.8.0
  2. Remove references to internal.scrapeRequestTimeoutSecs in the proxy and agent conf files.
  3. Add these lines to your agent conf file:
agent {
  scrapeTimeoutSecs = 15   // Assign this to your max waiting time

  admin {
    enabled = true
    debugEnabled = true
  }
}

You will now see timeouts in the agent log when they occur.
This config also turns on debugging, which will provide more verbose logging.

You also might want to turn on debugging on the proxy with:

proxy {
  admin {
    enabled = true
    port = 8092     // You can change this to whatever port you want
    debugEnabled = true
  }
}

After the proxy is running, you can go to the debug URL at:
http://proxy_hostname:8092/debug

You will then see requests and their result status as they come in from Prometheus.

Let me know what you see!

Paul

Hi Paul,

Thanks for the fix. I can see the metrics in the proxy when the exporter scrape time is less than 15 sec, which has solved my problem.

I then decided to test with more scrape data, so I tried a load where cloudwatch_exporter_scrape_duration_seconds is 43.095729379. I can see the metrics in the agent, but the proxy is timing out.

Error in agent logs -

01:04:33.164 WARN [AgentHttpService.kt:83] - fetchScrapeUrl() io.ktor.client.features.HttpRequestTimeoutException: Request timeout has been expired [url=http://xx.xx.xx.xx:9106/metrics, request_timeout=15000 ms] - http://xx.xx.xx.xx:9106/metrics [DefaultDispatcher-worker-2]
io.ktor.client.features.HttpRequestTimeoutException: Request timeout has been expired [url=http://xx.xx.xx.xx:9106/metrics, request_timeout=15000 ms]
at io.ktor.client.features.HttpTimeout$Feature$install$1$invokeSuspend$$inlined$apply$lambda$1.invokeSuspend(HttpTimeout.kt:120)
at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:56)
at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:738)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)

Below are the settings:

agent {
  scrapeRequestTimeoutSecs = 120
  proxy {
    hostname = "xx.xx.xx.xx"
    port = 50051
  }
  admin {
    enabled = true
    debugEnabled = true
  }
}

proxy {
  admin {
    enabled = true
    port = 9999     // You can change this to whatever port you want
    debugEnabled = true
  }
}

Am I missing anything in the configuration to override the default scrapeTimeoutSecs of 15?

Sounds like progress. Change scrapeRequestTimeoutSecs to scrapeTimeoutSecs in your agent config file.

My bad. Below is the current conf, and I still see the proxy timing out after 15 sec:
agent {
  scrapeTimeoutSecs = 120
  proxy {
    hostname = "xx.xx.xx.xx"
    port = 50051
  }
  admin {
    enabled = true
    debugEnabled = true
  }
}

01:33:58.930 WARN [AgentHttpService.kt:99] - fetchScrapeUrl() kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 15000 ms - http://xx.xx.xx.xx:9106/metrics [DefaultDispatcher-worker-2]
kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 15000 ms
at kotlinx.coroutines.TimeoutKt.TimeoutCancellationException(Timeout.kt:158)
at kotlinx.coroutines.TimeoutCoroutine.run(Timeout.kt:128)
at kotlinx.coroutines.EventLoopImplBase$DelayedRunnableTask.run(EventLoop.common.kt:497)
at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:274)
at kotlinx.coroutines.DefaultExecutor.run(DefaultExecutor.kt:68)
at java.base/java.lang.Thread.run(Unknown Source)

If you look at the top of the agent log, you will see the value of scrapeTimeoutSecs it is using.
If you are not seeing 120s, then make sure you are reading the right config file.

I see that it is using 120 sec. Below is the log for reference.
01:50:32.076 INFO [Agent.kt:276] - Version: 1.8.0 Release Date: 8/28/20 [main]
01:50:32.147 INFO [AgentOptions.kt:81] - proxyHostname: xx.xx.xx.xx [main]
01:50:32.148 INFO [AgentOptions.kt:86] - agentName: [main]
01:50:32.148 INFO [AgentOptions.kt:90] - consolidated: false [main]
01:50:32.151 INFO [AgentOptions.kt:94] - scrapeTimeoutSecs: 120s [main]
01:50:32.152 INFO [AgentOptions.kt:100] - chunkContentSizeKbs: 32768 [main]
01:50:32.152 INFO [AgentOptions.kt:104] - minGzipSizeBytes: 512 [main]
01:50:32.152 INFO [AgentOptions.kt:108] - overrideAuthority: [main]
01:50:32.152 INFO [BaseOptions.kt:127] - adminEnabled: true [main]
01:50:32.153 INFO [BaseOptions.kt:133] - adminPort: 8093 [main]
01:50:32.153 INFO [BaseOptions.kt:139] - metricsEnabled: false [main]
01:50:32.153 INFO [BaseOptions.kt:151] - metricsPort: 8083 [main]
01:50:32.153 INFO [BaseOptions.kt:145] - debugEnabled: true [main]
01:50:32.154 INFO [BaseOptions.kt:157] - certChainFilePath: [main]
01:50:32.154 INFO [BaseOptions.kt:163] - privateKeyFilePath: [main]
01:50:32.154 INFO [BaseOptions.kt:169] - trustCertCollectionFilePath: [main]
01:50:32.172 INFO [AgentPathManager.kt:48] - Proxy path /xx.xx.xx.xx will be assigned to http://xx.xx.xx.xx:9106/metrics [main]
01:50:32.174 INFO [AgentGrpcService.kt:122] - Creating gRPC stubs [main]
01:50:32.175 INFO [GrpcDsl.kt:48] - Creating connection for gRPC server at xx.xx.xx.xx:50051 using plaintext [main]
01:50:32.308 INFO [Agent.kt:101] - Agent name: Unnamed-xx.xx.xx.xx [main]
01:50:32.309 INFO [Agent.kt:102] - Proxy reconnect pause time: 3.00s [main]
01:50:32.309 INFO [Agent.kt:103] - Scrape timeout time: 120s [main]
01:50:32.322 INFO [Agent.kt:107] - Adding /debug endpoint [main]
01:50:32.334 INFO [Log.java:169] - Logging initialized @805ms to org.eclipse.jetty.util.log.Slf4jLog [main]
01:50:32.385 INFO [GenericService.kt:182] - Adding service AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]} [main]
01:50:32.385 INFO [GenericService.kt:115] - Metrics service disabled [main]
01:50:32.386 INFO [GenericService.kt:124] - Zipkin reporter service disabled [main]
01:50:32.386 INFO [GenericService.kt:182] - Adding service Agent{agentId=, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [main]
01:50:32.405 INFO [GenericServiceListener.kt:29] - Starting Agent{agentId=, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [main]
01:50:32.406 INFO [GenericServiceListener.kt:29] - Starting AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]} [Agent Unnamed-xx.xx.xx.xx]
01:50:32.561 INFO [GenericServiceListener.kt:30] - Running AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]} [AdminService STARTING]
01:50:32.562 INFO [GenericServiceListener.kt:30] - Running Agent{agentId=, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [Agent Unnamed-xx.xx.xx.xx]
01:50:32.562 INFO [GenericService.kt:136] - All Agent services healthy [Agent Unnamed-xx.xx.xx.xx]
01:50:32.584 INFO [AgentGrpcService.kt:147] - Connecting to proxy at xx.xx.xx.xx:50051 using plaintext... [Agent Unnamed-xx.xx.xx.xx]
01:50:32.804 INFO [AgentClientInterceptor.kt:51] - Assigned agentId: 3 to Agent{agentId=3, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [grpc-default-executor-0]
01:50:32.809 INFO [AgentGrpcService.kt:149] - Connected to proxy at xx.xx.xx.xx:50051 using plaintext [Agent Unnamed-xx.xx.xx.xx]
01:50:32.843 INFO [AgentPathManager.kt:67] - Registered http://xx.xx.xx.xx:9106/metrics as /xx.xx.xx.xx [Agent Unnamed-xx.xx.xx.xx]
01:50:32.848 INFO [Agent.kt:221] - Heartbeat scheduled to fire after 5.00s of inactivity [DefaultDispatcher-worker-2]
01:51:16.243 WARN [AgentHttpService.kt:99] - fetchScrapeUrl() kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 15000 ms - http://xx.xx.xx.xx:9106/metrics [DefaultDispatcher-worker-3]
kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 15000 ms
at kotlinx.coroutines.TimeoutKt.TimeoutCancellationException(Timeout.kt:158)
at kotlinx.coroutines.TimeoutCoroutine.run(Timeout.kt:128)
at kotlinx.coroutines.EventLoopImplBase$DelayedRunnableTask.run(EventLoop.common.kt:497)
at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:274)
at kotlinx.coroutines.DefaultExecutor.run(DefaultExecutor.kt:68)
at java.base/java.lang.Thread.run(Unknown Source)

Okay, I looked at the code and found an issue, but it does not explain what you are seeing. I made a change and manually tested for timeouts with different values in the agent.conf file and it worked. Give me 15 minutes and I will post a new release. It will allow you to set a timeout from the command line as well, which will give us some more info.

Grab the new 1.8.1 release and try it with your agent.conf file. If you see 15 seconds again, you can try overriding it with the command line option --timeout or the env var SCRAPE_TIMEOUT_SECS. To be clear, the agent.scrapeTimeoutSecs config setting should get the job done. If the CLI or env var options work, then it is an issue with the config file.

Paul,

Sorry for the trouble.
I'm not sure if I'm missing something, as I still see the same issue even though the log shows 120s.
I restarted Docker with the scrape parameter passed on the command line:

Agent -
docker run -d -p 8083:8083 -p 8093:8093 -v $(pwd)/agent.conf:/app/agent.conf --env AGENT_CONFIG=agent.conf --env SCRAPE_TIMEOUT_SECS=120 --env PROXY_HOSTNAME=xx.xx.xx.xx --env --debug --name prometheus-agent pambrose/prometheus-agent:1.8.1

Proxy -
docker run --rm -d -p 8082:8082 -p 8092:8092 -p 50051:50051 -p 8080:8080 \
  --mount type=bind,source="$(pwd)"/certs,target=/app/certs \
  --mount type=bind,source="$(pwd)/noauth.conf",target=/app/proxy.conf \
  --env PROXY_CONFIG=proxy.conf \
  --env ADMIN_ENABLED=true \
  --env METRICS_ENABLED=true \
  --name prometheus-proxy pambrose/prometheus-proxy:1.8.1

04:30:29.743 INFO [AgentOptions.kt:94] - scrapeTimeoutSecs: 120s [main]
04:30:29.743 INFO [AgentOptions.kt:100] - chunkContentSizeKbs: 32768 [main]
04:30:29.744 INFO [AgentOptions.kt:104] - minGzipSizeBytes: 512 [main]
04:30:29.744 INFO [AgentOptions.kt:108] - overrideAuthority: [main]
04:30:29.744 INFO [BaseOptions.kt:127] - adminEnabled: true [main]
04:30:29.744 INFO [BaseOptions.kt:133] - adminPort: 8093 [main]
04:30:29.745 INFO [BaseOptions.kt:139] - metricsEnabled: false [main]
04:30:29.745 INFO [BaseOptions.kt:151] - metricsPort: 8083 [main]
04:30:29.745 INFO [BaseOptions.kt:145] - debugEnabled: true [main]
04:30:29.745 INFO [BaseOptions.kt:157] - certChainFilePath: [main]
04:30:29.746 INFO [BaseOptions.kt:163] - privateKeyFilePath: [main]
04:30:29.746 INFO [BaseOptions.kt:169] - trustCertCollectionFilePath: [main]
04:30:29.763 INFO [AgentPathManager.kt:48] - Proxy path /xx.xx.xx.xx will be assigned to http://xx.xx.xx.xx:9106/metrics [main]
04:30:29.765 INFO [AgentGrpcService.kt:122] - Creating gRPC stubs [main]
04:30:29.765 INFO [GrpcDsl.kt:48] - Creating connection for gRPC server at xx.xx.xx.xx:50051 using plaintext [main]
04:30:29.900 INFO [Agent.kt:101] - Agent name: Unnamed-xx.xx.xx.xx [main]
04:30:29.900 INFO [Agent.kt:102] - Proxy reconnect pause time: 3.00s [main]
04:30:29.901 INFO [Agent.kt:103] - Scrape timeout time: 120s [main]
04:30:29.914 INFO [Agent.kt:107] - Adding /debug endpoint [main]
04:30:29.926 INFO [Log.java:169] - Logging initialized @817ms to org.eclipse.jetty.util.log.Slf4jLog [main]
04:30:29.978 INFO [GenericService.kt:182] - Adding service AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]} [main]
04:30:29.978 INFO [GenericService.kt:115] - Metrics service disabled [main]
04:30:29.978 INFO [GenericService.kt:124] - Zipkin reporter service disabled [main]
04:30:29.979 INFO [GenericService.kt:182] - Adding service Agent{agentId=, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [main]
04:30:29.998 INFO [GenericServiceListener.kt:29] - Starting Agent{agentId=, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [main]
04:30:29.999 INFO [GenericServiceListener.kt:29] - Starting AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]} [Agent Unnamed-xx.xx.xx.xx]
04:30:30.128 INFO [GenericServiceListener.kt:30] - Running AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]} [AdminService STARTING]
04:30:30.129 INFO [GenericServiceListener.kt:30] - Running Agent{agentId=, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [Agent Unnamed-xx.xx.xx.xx]
04:30:30.130 INFO [GenericService.kt:136] - All Agent services healthy [Agent Unnamed-xx.xx.xx.xx]
04:30:30.154 INFO [AgentGrpcService.kt:147] - Connecting to proxy at xx.xx.xx.xx:50051 using plaintext... [Agent Unnamed-xx.xx.xx.xx]
04:30:30.363 INFO [AgentClientInterceptor.kt:51] - Assigned agentId: 4 to Agent{agentId=4, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [grpc-default-executor-0]
04:30:30.368 INFO [AgentGrpcService.kt:149] - Connected to proxy at xx.xx.xx.xx:50051 using plaintext [Agent Unnamed-xx.xx.xx.xx]
04:30:30.403 INFO [AgentPathManager.kt:67] - Registered http://xx.xx.xx.xx:9106/metrics as /xx.xx.xx.xx [Agent Unnamed-xx.xx.xx.xx]
04:30:30.410 INFO [Agent.kt:221] - Heartbeat scheduled to fire after 5.00s of inactivity [DefaultDispatcher-worker-2]
04:31:09.114 WARN [AgentHttpService.kt:99] - fetchScrapeUrl() kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 15000 ms - http://xx.xx.xx.xx:9106/metrics [DefaultDispatcher-worker-3]
kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 15000 ms
at kotlinx.coroutines.TimeoutKt.TimeoutCancellationException(Timeout.kt:158)
at kotlinx.coroutines.TimeoutCoroutine.run(Timeout.kt:128)
at kotlinx.coroutines.EventLoopImplBase$DelayedRunnableTask.run(EventLoop.common.kt:497)
at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:274)
at kotlinx.coroutines.DefaultExecutor.run(DefaultExecutor.kt:68)
at java.base/java.lang.Thread.run(Unknown Source)

Wow, that is crazy! You can see the value is set to 120s at the top of the log, but it is still using 15s. The value is never touched again after it is printed to the log and yet it is somehow changing! I have been running it from the command line. Let me play with it from docker so I have the same setup as you. I will work on this first thing in the AM.

Sorry for the hassle here. I think we are close.

Hi Sanjay, I just pushed 1.8.2 to Docker Hub. Can you run that? It logs the value that is being assigned to the timeout. I want to make sure that what I think is happening is actually happening. This will give me a helpful clue. I am looking for a log line saying Setting scrapeTimeoutSecs = 120s. Thanks.

Thanks for your support, Paul.

05:40:35.455 INFO [Agent.kt:276] - Version: 1.8.2 Release Date: 9/1/20 [main]
05:40:35.521 INFO [AgentOptions.kt:82] - proxyHostname: xx.xx.xx.xx [main]
05:40:35.521 INFO [AgentOptions.kt:87] - agentName: [main]
05:40:35.521 INFO [AgentOptions.kt:91] - consolidated: false [main]
05:40:35.524 INFO [AgentOptions.kt:95] - scrapeTimeoutSecs: 120s [main]
05:40:35.524 INFO [AgentOptions.kt:101] - chunkContentSizeKbs: 32768 [main]
05:40:35.525 INFO [AgentOptions.kt:105] - minGzipSizeBytes: 512 [main]
05:40:35.525 INFO [AgentOptions.kt:109] - overrideAuthority: [main]
05:40:35.525 INFO [BaseOptions.kt:127] - adminEnabled: true [main]
05:40:35.525 INFO [BaseOptions.kt:133] - adminPort: 8093 [main]
05:40:35.526 INFO [BaseOptions.kt:139] - metricsEnabled: false [main]
05:40:35.526 INFO [BaseOptions.kt:151] - metricsPort: 8083 [main]
05:40:35.526 INFO [BaseOptions.kt:145] - debugEnabled: true [main]
05:40:35.526 INFO [BaseOptions.kt:157] - certChainFilePath: [main]
05:40:35.527 INFO [BaseOptions.kt:163] - privateKeyFilePath: [main]
05:40:35.527 INFO [BaseOptions.kt:169] - trustCertCollectionFilePath: [main]
05:40:35.545 INFO [AgentPathManager.kt:48] - Proxy path /xx.xx.xx.xx will be assigned to http://xx.xx.xx.xx:9106/metrics [main]
05:40:35.546 INFO [AgentGrpcService.kt:122] - Creating gRPC stubs [main]
05:40:35.547 INFO [GrpcDsl.kt:48] - Creating connection for gRPC server at xx.xx.xx.xx:50051 using plaintext [main]
05:40:35.681 INFO [Agent.kt:101] - Agent name: Unnamed-xx.xx.xx.xx [main]
05:40:35.682 INFO [Agent.kt:102] - Proxy reconnect pause time: 3.00s [main]
05:40:35.682 INFO [Agent.kt:103] - Scrape timeout time: 120s [main]
05:40:35.695 INFO [Agent.kt:107] - Adding /debug endpoint [main]
05:40:35.707 INFO [Log.java:169] - Logging initialized @801ms to org.eclipse.jetty.util.log.Slf4jLog [main]
05:40:35.759 INFO [GenericService.kt:182] - Adding service AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]} [main]
05:40:35.760 INFO [GenericService.kt:115] - Metrics service disabled [main]
05:40:35.760 INFO [GenericService.kt:124] - Zipkin reporter service disabled [main]
05:40:35.761 INFO [GenericService.kt:182] - Adding service Agent{agentId=, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [main]
05:40:35.779 INFO [GenericServiceListener.kt:29] - Starting Agent{agentId=, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [main]
05:40:35.780 INFO [GenericServiceListener.kt:29] - Starting AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]} [Agent Unnamed-xx.xx.xx.xx]
05:40:35.912 INFO [GenericServiceListener.kt:30] - Running AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]} [AdminService STARTING]
05:40:35.912 INFO [GenericServiceListener.kt:30] - Running Agent{agentId=, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [Agent Unnamed-xx.xx.xx.xx]
05:40:35.913 INFO [GenericService.kt:136] - All Agent services healthy [Agent Unnamed-xx.xx.xx.xx]
05:40:35.942 INFO [AgentGrpcService.kt:147] - Connecting to proxy at xx.xx.xx.xx:50051 using plaintext... [Agent Unnamed-xx.xx.xx.xx]
05:40:36.164 INFO [AgentClientInterceptor.kt:51] - Assigned agentId: 9 to Agent{agentId=9, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=8093, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [grpc-default-executor-0]
05:40:36.169 INFO [AgentGrpcService.kt:149] - Connected to proxy at xx.xx.xx.xx:50051 using plaintext [Agent Unnamed-xx.xx.xx.xx]
05:40:36.198 INFO [AgentPathManager.kt:67] - Registered http://xx.xx.xx.xx:9106/metrics as /xx.xx.xx.xx [Agent Unnamed-xx.xx.xx.xx]
05:40:36.205 INFO [Agent.kt:221] - Heartbeat scheduled to fire after 5.00s of inactivity [DefaultDispatcher-worker-2]
05:40:47.649 INFO [AgentHttpService.kt:81] - Setting scrapeTimeoutSecs = 120s [DefaultDispatcher-worker-2]
05:41:02.707 WARN [AgentHttpService.kt:88] - fetchScrapeUrl() kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 15000 ms - http://xx.xx.xx.xx:9106/metrics [DefaultDispatcher-worker-2]
kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 15000 ms
at kotlinx.coroutines.TimeoutKt.TimeoutCancellationException(Timeout.kt:158)
at kotlinx.coroutines.TimeoutCoroutine.run(Timeout.kt:128)
at kotlinx.coroutines.EventLoopImplBase$DelayedRunnableTask.run(EventLoop.common.kt:497)
at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:274)
at kotlinx.coroutines.DefaultExecutor.run(DefaultExecutor.kt:68)
at java.base/java.lang.Thread.run(Unknown Source)

Okay, I see what the problem is. I was not paying attention to the exception name. The exception is TimeoutCancellationException, which is not an HTTP timeout issue, i.e., something else is timing out elsewhere. I will look at it in the AM. Thanks for your patience.

I figured out the problem. It turns out there are two timeout values in ktor: one associated with the request and one in the CIOEngineConfig. The CIOEngineConfig value defaults to 15 seconds and that is where your timeout was coming from. I changed that to default to 90 seconds, so your requestTimeoutSecs values will prevail as long as they are under 90 seconds. I pushed a new 1.8.3 release. Give it a try and see what you think.

Works like a charm. Thank you, Paul, for all the support and patience in providing a fix.

That is great news! Thanks for hanging in there.

Hi Sanjay,

I found a memory leak in the code that I added to fix your timeout issue.
Please upgrade to 1.8.4.

Sorry about that.

Paul

Hi Paul,

I feel that the proxy's default timeout of 90 secs does not seem to take effect: when scrapeTimeoutSecs is not set in agent.conf, the timeout falls back to 15.

I did a couple of load-test cycles with the new container (and with 1.8.3 too), and I see the proxy timing out after 60 secs when scrapeTimeoutSecs = 90, and after 15 secs without the scrapeTimeoutSecs parameter.

The scenario I tested earlier versus today: the CloudWatch exporter scrape was previously taking ~57 sec, and today I increased the load to ~64 sec and saw the proxy time out at 60, where I expected 90. Unfortunately, I also don't see the timeout error in the log.

agent {
  scrapeTimeoutSecs = 90
  proxy {
    hostname = "xx.xx.xx.xx"
    port = 50051
  }
  admin {
    enabled = true
    debugEnabled = true
  }
}

proxy {
  agent.port = 50051

  admin {
    enabled = true
    port = 9999     // You can change this to whatever port you want
    debugEnabled = true
  }
}

21:52:19.457 INFO [Agent.kt:276] - Version: 1.8.4 Release Date: 9/3/20 [main]
21:52:19.521 INFO [AgentOptions.kt:82] - proxyHostname: xx.xx.xx.xx [main]
21:52:19.521 INFO [AgentOptions.kt:87] - agentName: [main]
21:52:19.521 INFO [AgentOptions.kt:91] - consolidated: false [main]
21:52:19.524 INFO [AgentOptions.kt:95] - scrapeTimeoutSecs: 90.0s [main]
21:52:19.525 INFO [AgentOptions.kt:101] - chunkContentSizeKbs: 32768 [main]
21:52:19.525 INFO [AgentOptions.kt:105] - minGzipSizeBytes: 512 [main]
21:52:19.525 INFO [AgentOptions.kt:109] - overrideAuthority: [main]
21:52:19.525 INFO [BaseOptions.kt:127] - adminEnabled: true [main]
21:52:19.526 INFO [BaseOptions.kt:133] - adminPort: xx.xx.xx.xx [main]
21:52:19.526 INFO [BaseOptions.kt:139] - metricsEnabled: false [main]
21:52:19.526 INFO [BaseOptions.kt:151] - metricsPort: 8083 [main]
21:52:19.526 INFO [BaseOptions.kt:145] - debugEnabled: true [main]
21:52:19.527 INFO [BaseOptions.kt:157] - certChainFilePath: [main]
21:52:19.527 INFO [BaseOptions.kt:163] - privateKeyFilePath: [main]
21:52:19.527 INFO [BaseOptions.kt:169] - trustCertCollectionFilePath: [main]
21:52:19.545 INFO [AgentPathManager.kt:48] - Proxy path /xx.xx.xx.xx will be assigned to http://xx.xx.xx.xx:9106/metrics [main]
21:52:19.547 INFO [AgentGrpcService.kt:122] - Creating gRPC stubs [main]
21:52:19.548 INFO [GrpcDsl.kt:48] - Creating connection for gRPC server at xx.xx.xx.xx:50051 using plaintext [main]
21:52:19.687 INFO [Agent.kt:101] - Agent name: Unnamed-xx.xx.xx.xx [main]
21:52:19.687 INFO [Agent.kt:102] - Proxy reconnect pause time: 3.00s [main]
21:52:19.688 INFO [Agent.kt:103] - Scrape timeout time: 90.0s [main]
21:52:19.700 INFO [Agent.kt:107] - Adding /debug endpoint [main]
21:52:19.709 INFO [Log.java:169] - Logging initialized @815ms to org.eclipse.jetty.util.log.Slf4jLog [main]
21:52:19.759 INFO [GenericService.kt:182] - Adding service AdminService{port=xx.xx.xx.xx, paths=[/ping, /version, /healthcheck, /threaddump, /debug]} [main]
21:52:19.759 INFO [GenericService.kt:115] - Metrics service disabled [main]
21:52:19.759 INFO [GenericService.kt:124] - Zipkin reporter service disabled [main]
21:52:19.760 INFO [GenericService.kt:182] - Adding service Agent{agentId=, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=xx.xx.xx.xx, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [main]
21:52:19.779 INFO [GenericServiceListener.kt:29] - Starting Agent{agentId=, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=xx.xx.xx.xx, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [main]
21:52:19.780 INFO [GenericServiceListener.kt:29] - Starting AdminService{port=xx.xx.xx.xx, paths=[/ping, /version, /healthcheck, /threaddump, /debug]} [Agent Unnamed-xx.xx.xx.xx]
21:52:19.908 INFO [GenericServiceListener.kt:30] - Running AdminService{port=xx.xx.xx.xx, paths=[/ping, /version, /healthcheck, /threaddump, /debug]} [AdminService STARTING]
21:52:19.908 INFO [GenericServiceListener.kt:30] - Running Agent{agentId=, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=xx.xx.xx.xx, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [Agent Unnamed-xx.xx.xx.xx]
21:52:19.909 INFO [GenericService.kt:136] - All Agent services healthy [Agent Unnamed-xx.xx.xx.xx]
21:52:19.939 INFO [AgentGrpcService.kt:147] - Connecting to proxy at xx.xx.xx.xx:50051 using plaintext... [Agent Unnamed-xx.xx.xx.xx]
21:52:20.157 INFO [AgentClientInterceptor.kt:51] - Assigned agentId: 4 to Agent{agentId=4, agentName=Unnamed-xx.xx.xx.xx, proxyHost=xx.xx.xx.xx:50051, adminService=AdminService{port=xx.xx.xx.xx, paths=[/ping, /version, /healthcheck, /threaddump, /debug]}, metricsService=Disabled} [grpc-default-executor-0]
21:52:20.162 INFO [AgentGrpcService.kt:149] - Connected to proxy at xx.xx.xx.xx:50051 using plaintext [Agent Unnamed-xx.xx.xx.xx]
21:52:20.189 INFO [AgentPathManager.kt:67] - Registered http://xx.xx.xx.xx:9106/metrics as /xx.xx.xx.xx [Agent Unnamed-xx.xx.xx.xx]
21:52:20.200 INFO [Agent.kt:221] - Heartbeat scheduled to fire after 5.00s of inactivity [DefaultDispatcher-worker-2]

Okay, let me test that scenario.

Okay, there are 3 timeouts:

  1. proxy.internal.scrapeRequestTimeoutSecs (defaults to 90s; it was 60, but I increased it to 90)
  2. agent.internal.cioTimeoutSecs (defaults to 90s)
  3. agent.scrapeTimeoutSecs (defaults to 15s)

So long as agent.scrapeTimeoutSecs is less than 90s, that is the only one you need to adjust. If you want to go above 90s, then you will have to increase #1 and #2 accordingly.
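
For example, to allow scrapes above 90s, something like this in the two conf files should do it (a sketch with illustrative values; the key paths follow the same nesting used earlier in this thread):

proxy {
  internal.scrapeRequestTimeoutSecs = 180
}

agent {
  internal.cioTimeoutSecs = 180
  scrapeTimeoutSecs = 150
}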

If you timed out after 60s in 1.8.3, then the log message would have been in the proxy.

Give 1.8.5 a try and see what you think.

Paul

Sure, Paul. I will try to test it during peak traffic time and provide feedback. Currently I'm getting data of the same volume in less than 60 secs.

Hi Paul, sorry for the delay in responding. I was able to get the data in the proxy and the agent. One change I made in Prometheus was to increase the default scrape timeout from 10s to 90s, as I was seeing timeout errors in Prometheus.

No worries. So everything is working properly?

Yes, Paul. No issues observed.