When spawning more than 16 requests with Java 11 HTTP client: com.clickhouse.client.ClickHouseException: Code: 159. Execution timed out
devurandom opened this issue · 10 comments
Describe the bug
When spawning more than 16 requests in parallel to a ClickHouse server using the Java HTTP client, all requests will hang and eventually time out (unless I also set MAX_THREADS_PER_CLIENT=1
or greater):
#error {
:cause nil
:via
[{:type java.util.concurrent.ExecutionException
:message com.clickhouse.client.ClickHouseException: Code: 159. Execution timed out
:at [java.util.concurrent.FutureTask report FutureTask.java 122]}
{:type com.clickhouse.client.ClickHouseException
:message Code: 159. Execution timed out
:at [com.clickhouse.client.ClickHouseException of ClickHouseException.java 147]}
{:type java.util.concurrent.TimeoutException
:message nil
:at [java.util.concurrent.CompletableFuture timedGet CompletableFuture.java 1960]}]
:trace
[[java.util.concurrent.CompletableFuture timedGet CompletableFuture.java 1960]
[java.util.concurrent.CompletableFuture get CompletableFuture.java 2095]
[com.clickhouse.client.ClickHouseClient executeAndWait ClickHouseClient.java 878]
[ch_timeout_mwe$execute_BANG_ invokeStatic ch_timeout_mwe.clj 34]
[ch_timeout_mwe$execute_BANG_ invoke ch_timeout_mwe.clj 31]
[ch_timeout_mwe$ping invokeStatic ch_timeout_mwe.clj 40]
[ch_timeout_mwe$ping invoke ch_timeout_mwe.clj 36]
[ch_timeout_mwe$_main$iter__158__162$fn__163$fn__164$fn__165 invoke ch_timeout_mwe.clj 56]
[clojure.core$binding_conveyor_fn$fn__5823 invoke core.clj 2047]
[clojure.lang.AFn call AFn.java 18]
[java.util.concurrent.FutureTask run FutureTask.java 317]
[java.util.concurrent.ThreadPoolExecutor runWorker ThreadPoolExecutor.java 1144]
[java.util.concurrent.ThreadPoolExecutor$Worker run ThreadPoolExecutor.java 642]
[java.lang.Thread run Thread.java 1583]]}
Afterwards the application is broken; the only way to make requests to ClickHouse possible again is by restarting the JVM.
Repeatedly sending 16 or fewer requests does not exhibit the hanging requests.
Steps to reproduce
- Follow the instructions in https://github.com/devurandom/clickhouse-timeout-mwe to run the sample application.
- Observe that everything works fine when spawning 16 or less requests, or when passing a value greater than 0 as
MAX_THREADS_PER_CLIENT
. - Observe that the application hangs and all requests to ClickHouse throw
java.util.concurrent.TimeoutException
when spawning 17 or more requests while passingMAX_THREADS_PER_CLIENT
as 0.
Expected behaviour
ClickHouse Java HTTP client works, regardless of how many requests I tell it to make at the same time.
Code example
See https://github.com/devurandom/clickhouse-timeout-mwe.
Configuration
Environment
- Client version: ClickHouse HTTP Client 0.6.2, using the built-in Java HTTP Client
- Language version: Java 11, Clojure 1.11.3
- OS: Linux 6.9
ClickHouse server
- ClickHouse Server version: ClickHouse Cloud
- ClickHouse Server non-default settings, if any: ClickHouse Cloud
CREATE TABLE
statements for tables involved: Reproduction works without any tables
I believe I am facing the same issue, using Spring Boot Webflux (Reactor).
When creating the ClickHouseNode
and ClickHouseClient
as a bean:
@Configuration
public class DatabaseDriverConfig {
@Bean
public ClickHouseNode nodeFactory() {
return ClickHouseNode.builder()
.host(clickhouse.getHost())
.port(ClickHouseProtocol.HTTP, clickhouse.getPort())
.database(clickhouse.getDatabase())
.credentials(ClickHouseCredentials.fromUserAndPassword(
clickhouse.getUsername(),
clickhouse.getPassword()
))
.build();
}
@Bean
public ClickHouseClient clientFactory(ClickHouseNode node) {
return ClickHouseClient.builder()
.option(ClickHouseClientOption.CLIENT_NAME, "query-engine-api")
// the below response has async enabled - I have omitted it from this builder.
.nodeSelector(ClickHouseNodeSelector.of(node.getProtocol()))
.build();
}
}
And then using it to make calls to Clickhouse:
public Mono<PageInfo> fetchPageInfo(UUID cid, Integer pageNo) {
final String q = """
SELECT COALESCE(COUNT(*), 0) as total_elements
FROM seer.event
""";
return Mono.fromFuture(client.read(node)
.format(ClickHouseFormat.RowBinaryWithNamesAndTypes)
.query(q)
.execute())
.map(ClickHouseResponse::firstRecord)
.map(x -> {
int totalElements = x.getValue("total_elements").asInteger();
return PageInfo.builder()
.pageNo(pageNo)
.totalPages(totalElements / pageNo)
.totalElements(totalElements)
.build();
})
.onErrorResume(NoSuchElementException.class, e -> Mono.just(PageInfo.builder()
.pageNo(pageNo)
.totalPages(0)
.totalElements(0)
.build()));
}
The first 10 requests work, the moment I make the 11th request the connection times-out and is no longer usable. The only way to get the client to work again is to restart the API.
The error looks like the following:
2024-07-21T15:04:01.580+01:00 ERROR 71784 --- [query-engine-api] [kHouseWorker-21] a.w.r.e.AbstractErrorWebExceptionHandler : [47bd0242-11] 500 Server Error for HTTP GET "/event?start=2024-01-01T13:13:13&end=2024-05-01T13:13:13&pageNo=1&limit=500"
com.clickhouse.client.ClickHouseException: Timeout deadline: 180000 MILLISECONDS, actual: 180007 MILLISECONDS, server ClickHouseNode [uri=http://localhost:8123/seer]@-737830262
at com.clickhouse.client.ClickHouseException.of(ClickHouseException.java:168) ~[clickhouse-client-0.6.0.jar:clickhouse-client 0.6.0 (revision: cb8e198)]
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
*__checkpoint ⇢ Handler com.hexploits.seer.api.queryengine.controller.EventController#fetchEvents(UUID, UUID, String, ServerWebExchange) [DispatcherHandler]
*__checkpoint ⇢ HTTP GET "/event?start=2024-01-01T13:13:13&end=2024-05-01T13:13:13&pageNo=1&limit=500" [ExceptionHandlingWebHandler]
Original Stack Trace:
at com.clickhouse.client.ClickHouseException.of(ClickHouseException.java:168) ~[clickhouse-client-0.6.0.jar:clickhouse-client 0.6.0 (revision: cb8e198)]
at com.clickhouse.client.AbstractClient.lambda$execute$0(AbstractClient.java:275) ~[clickhouse-client-0.6.0.jar:clickhouse-client 0.6.0 (revision: cb8e198)]
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
Caused by: org.apache.hc.core5.http.ConnectionRequestTimeoutException: Timeout deadline: 180000 MILLISECONDS, actual: 180007 MILLISECONDS
at org.apache.hc.client5.http.impl.classic.InternalExecRuntime.acquireEndpoint(InternalExecRuntime.java:120) ~[httpclient5-5.2.3.jar:5.2.3]
at org.apache.hc.client5.http.impl.classic.ConnectExec.execute(ConnectExec.java:125) ~[httpclient5-5.2.3.jar:5.2.3]
at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51) ~[httpclient5-5.2.3.jar:5.2.3]
at org.apache.hc.client5.http.impl.classic.ProtocolExec.execute(ProtocolExec.java:192) ~[httpclient5-5.2.3.jar:5.2.3]
at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51) ~[httpclient5-5.2.3.jar:5.2.3]
at org.apache.hc.client5.http.impl.classic.HttpRequestRetryExec.execute(HttpRequestRetryExec.java:96) ~[httpclient5-5.2.3.jar:5.2.3]
at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51) ~[httpclient5-5.2.3.jar:5.2.3]
at org.apache.hc.client5.http.impl.classic.RedirectExec.execute(RedirectExec.java:115) ~[httpclient5-5.2.3.jar:5.2.3]
at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51) ~[httpclient5-5.2.3.jar:5.2.3]
at org.apache.hc.client5.http.impl.classic.InternalHttpClient.doExecute(InternalHttpClient.java:170) ~[httpclient5-5.2.3.jar:5.2.3]
at org.apache.hc.client5.http.impl.classic.CloseableHttpClient.execute(CloseableHttpClient.java:123) ~[httpclient5-5.2.3.jar:5.2.3]
at com.clickhouse.client.http.ApacheHttpConnectionImpl.post(ApacheHttpConnectionImpl.java:241) ~[clickhouse-http-client-0.6.0.jar:clickhouse-http-client 0.6.0 (revision: cb8e198)]
at com.clickhouse.client.http.ClickHouseHttpClient.send(ClickHouseHttpClient.java:118) ~[clickhouse-http-client-0.6.0.jar:clickhouse-http-client 0.6.0 (revision: cb8e198)]
at com.clickhouse.client.AbstractClient.sendAsync(AbstractClient.java:161) ~[clickhouse-client-0.6.0.jar:clickhouse-client 0.6.0 (revision: cb8e198)]
at com.clickhouse.client.AbstractClient.lambda$execute$0(AbstractClient.java:273) ~[clickhouse-client-0.6.0.jar:clickhouse-client 0.6.0 (revision: cb8e198)]
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
When spawning more than 16 requests in parallel to a ClickHouse server using the Java HTTP client, all requests will hang and eventually time out (unless I also set
MAX_THREADS_PER_CLIENT=1
or greater, the default being 0):
As a heads up - the following does not work for me:
@Bean
public ClickHouseClient clientFactory(ClickHouseNode node) {
return ClickHouseClient.builder()
.option(ClickHouseClientOption.CLIENT_NAME, "query-engine-api")
.nodeSelector(ClickHouseNodeSelector.of(node.getProtocol()))
.option(ClickHouseClientOption.ASYNC, true)
.option(ClickHouseClientOption.MAX_THREADS_PER_CLIENT, 1)
.build();
}
I've also tried setting the max threads per client to 5, 10 and 15 - with none of them working as intended. The moment the 11th request comes through the request times-outs.
Removing the implementation 'org.apache.httpcomponents.client5:httpclient5:5.2.3'
dependency fixes this issue as you fall back to HTTP_URL_CONNECTION
which doesn't have this issue.
Hope this helps someone else!
Removing the
implementation 'org.apache.httpcomponents.client5:httpclient5:5.2.3'
dependency fixes this issue as you fall back toHTTP_URL_CONNECTION
which doesn't have this issue.
I am using the Java 11 HTTP client, cf. https://github.com/devurandom/clickhouse-timeout-mwe/blob/88a48525a6ef68b21cb9dbea0fe0d5019f3ac36b/src/ch_timeout_mwe.clj#L20-L23.
Good day, @devurandom @CameronMukherjee !
The problem looks like when response object is not closed. It will keep connection open and would not let it return to a connection pool (that is in apache HC built-in).
With HttpURLConnection it will create unlimited number of connection and may eventually fail by open file descriptors.
Update: Setting ClickHouseClientOption/MAX_THREADS_PER_CLIENT 2
(instead of the default of 0), which makes the problem disappear in my minimal example, does not seem to workaround the issue in our larger application. It is still unclear to me why.
Removing the
implementation 'org.apache.httpcomponents.client5:httpclient5:5.2.3'
dependency fixes this issue as you fall back toHTTP_URL_CONNECTION
which doesn't have this issue.
I can confirm that changing https://github.com/devurandom/clickhouse-timeout-mwe/blob/88a48525a6ef68b21cb9dbea0fe0d5019f3ac36b/src/ch_timeout_mwe.clj#L20-L23 from {ClickHouseHttpOption/CONNECTION_PROVIDER HttpConnectionProvider/HTTP_CLIENT}
to {ClickHouseHttpOption/CONNECTION_PROVIDER HttpConnectionProvider/HTTP_URL_CONNECTION}
also works around the issue in my minimal example. (I did not try it in our service for the reason given by @chernser.)
The problem looks like when response object is not closed.
So this is related to #1712 ?
It will keep connection open and would not let it return to a connection pool (that is in apache HC built-in).
My minimal example reproduces this without the Apache HTTP client. It uses the Java 11 HTTP client instead.
With HttpURLConnection it will create unlimited number of connection and may eventually fail by open file descriptors.
So setting {ClickHouseHttpOption/CONNECTION_PROVIDER HttpConnectionProvider/HTTP_URL_CONNECTION}
is not a workaround that would work in a long running application or service. Thanks for this clarification!
The machine I reproduced https://github.com/devurandom/clickhouse-timeout-mwe on has 8 as "processors" reported by (.availableProcessors (Runtime/getRuntime))
, while our actual application runs on a machine / in a container with 2 "processors" (cf. https://opentelemetry.io/docs/specs/semconv/runtime/jvm-metrics/#metric-jvmcpucount).
This seems relevant in the context of com.clickhouse.client.http.ClickHouseHttpClient#newConnection which uses the executor created by com.clickhouse.client.AbstractClient#init via com.clickhouse.client.ClickHouseClient#getExecutorService and com.clickhouse.data.ClickHouseDataStreamFactory.DefaultExecutors with (+ 1 (* 2 (.availableProcessors (Runtime/getRuntime))))
threads if MAX_THREADS_PER_CLIENT
is unset (i.e. 0).
Setting -XX:ActiveProcessorCount=2
indeed changes the number of requests ClickHouse can handle without hanging:
(.availableProcessors (Runtime/getRuntime))
=> 2
(-main "${ENDPOINT}" "${USERNAME}" "${PASSWORD}" 2 0)
Spawning threads...
Wait for all threads:
Waiting for thread 0
Waiting for thread 1
Collected all threads!
=> nil
(-main "${ENDPOINT}" "${USERNAME}" "${PASSWORD}" 4 0)
Spawning threads...
Wait for all threads:
Waiting for thread 0
Waiting for thread 1
Waiting for thread 2
Waiting for thread 3
Collected all threads!
=> nil
(-main "${ENDPOINT}" "${USERNAME}" "${PASSWORD}" 5 0)
Spawning threads...
Wait for all threads:
Waiting for thread 0
ERROR: #error {
:cause nil
:via
[{:type java.util.concurrent.ExecutionException
:message com.clickhouse.client.ClickHouseException: Code: 159. Execution timed out
:at [java.util.concurrent.FutureTask report FutureTask.java 122]}
{:type com.clickhouse.client.ClickHouseException
:message Code: 159. Execution timed out
:at [com.clickhouse.client.ClickHouseException of ClickHouseException.java 147]}
{:type java.util.concurrent.TimeoutException
:message nil
:at [java.util.concurrent.CompletableFuture timedGet CompletableFuture.java 1960]}]
:trace
[[java.util.concurrent.CompletableFuture timedGet CompletableFuture.java 1960]
[java.util.concurrent.CompletableFuture get CompletableFuture.java 2095]
[com.clickhouse.client.ClickHouseClient executeAndWait ClickHouseClient.java 878]
[ch_timeout_mwe$execute_BANG_ invokeStatic ch_timeout_mwe.clj 34]
[ch_timeout_mwe$execute_BANG_ invoke ch_timeout_mwe.clj 31]
[ch_timeout_mwe$ping invokeStatic ch_timeout_mwe.clj 40]
[ch_timeout_mwe$ping invoke ch_timeout_mwe.clj 36]
[ch_timeout_mwe$_main$iter__1943__1947$fn__1948$fn__1949$fn__1950 invoke ch_timeout_mwe.clj 56]
[clojure.core$binding_conveyor_fn$fn__5823 invoke core.clj 2047]
[clojure.lang.AFn call AFn.java 18]
[java.util.concurrent.FutureTask run FutureTask.java 264]
[java.util.concurrent.ThreadPoolExecutor runWorker ThreadPoolExecutor.java 1136]
[java.util.concurrent.ThreadPoolExecutor$Worker run ThreadPoolExecutor.java 635]
[java.lang.Thread run Thread.java 840]]}
...
The -main
function is the one in https://github.com/devurandom/clickhouse-timeout-mwe.
We are now using java.util.concurrent.Semaphore
to limit the number of request we submit to the ClickHouse client to 2 times the number of processors ((* 2 (.availableProcessors (Runtime/getRuntime)))
) and our service finally sends out requests, receives responses and runs stably.
@devurandom Thank you for the information!
- Do you close Response objects?
- Am I right that deadlock happens if task is submitted to the executor service when no idle workers?
- Would you please describe your high-level logic? where do you lock?
Thanks!
I used the word "deadlock" in my previous comment which might have caused confusion. I should have written "hang" (or "block" or similar). I corrected that in my original comment above.
Do you close Response objects?
In the minimal example, if a response would be received instead of a timeout exception, the response would be closed, cf. https://github.com/devurandom/clickhouse-timeout-mwe/blob/main/src/ch_timeout_mwe.clj#L31-L34.
The problem is that no response is ever received by the code of the minimal example, because there seems to be a problem within the ClickHouse client that prevents a request from ever being processed. The ClickHouse client throws a timeout exception before it ever returns a response.
Am I right that deadlock happens if task is submitted to the executor service when no idle workers?
I cannot answer this question, as I do not know enough about the inner workings of the ClickHouse client. I just stepped through the ClickHouse client code with a debugger and took notes.
Would you please describe your high-level logic? where do you lock?
In the minimal example, no locks are used. It first spawns a bunch of futures on Clojure's own thread pool (not shared with e.g. the Java 11 HTTP client or the ClickHouse client) and then waits for them to complete, cf. https://github.com/devurandom/clickhouse-timeout-mwe/blob/main/src/ch_timeout_mwe.clj#L55-L63.