Unable to run on Kubernetes Cluster
randomthought opened this issue · 2 comments
Firstly, thanks for the great work!
I am having difficulty getting simple_tensorflow_serving to work on a Kubernetes cluster. It seems to be something with H2O, but the logs are not descriptive enough for me to pinpoint it. It just keeps hanging on the "Connection refused" error below.
01-15 20:11:07.286 10.0.0.41:54321 180 main INFO: H2O started in 2983ms
01-15 20:11:07.286 10.0.0.41:54321 180 main INFO:
01-15 20:11:07.286 10.0.0.41:54321 180 main INFO: Open H2O Flow in your web browser: http://10.0.0.41:54321
01-15 20:11:07.287 10.0.0.41:54321 180 main INFO:
01-15 20:11:09.699 10.0.0.41:54321 180 FJ-126-3 INFO: Cloud of size 2 formed [/10.0.0.5:54321, /10.0.0.41:54321]
2019-01-15 20:11:14 INFO Try to get function from file: ./models/h2o_prostate_model/preprocess_function.marshal
2019-01-15 20:11:14 INFO Try to get function from file: ./models/h2o_prostate_model/postprocess_function.marshal
2019-01-15 20:11:14 INFO Try to initialize and connect the h2o server
Checking whether there is an H2O instance running at http://localhost:54321. connected.
Warning: Your H2O cluster version is too old (8 months and 27 days)! Please download and install the latest version from http://h2o.ai/download/
01-15 20:11:14.371 10.0.0.41:54321 180 #28758-13 INFO: POST /4/sessions, parms: {}
01-15 20:11:14.377 10.0.0.41:54321 180 #28758-13 INFO: Locking cloud to new members, because water.api.schemas4.SessionIdV4
01-15 20:11:14.414 10.0.0.41:54321 180 #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:14.717 10.0.0.41:54321 180 #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.020 10.0.0.41:54321 180 #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.322 10.0.0.41:54321 180 #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.625 10.0.0.41:54321 180 #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.928 10.0.0.41:54321 180 #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
Thanks for reporting.
Have you set up the H2O cluster to run with a single H2O instance? It looks like a network problem, but I'm not sure why it fails to connect to the localhost service.
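One way to check the single-instance question is to read the "Cloud of size N formed" line that H2O prints at startup. Below is a minimal sketch in plain Python. The helper name `parse_cloud_members` is hypothetical and not part of H2O or simple_tensorflow_serving, and the regular expression assumes the log format shown in the report above.

```python
import re

# Matches lines such as:
#   "... INFO: Cloud of size 2 formed [/10.0.0.5:54321, /10.0.0.41:54321]"
# The pattern is an assumption based on the log lines pasted in this issue.
CLOUD_RE = re.compile(r"Cloud of size (\d+) formed \[(.*?)\]")


def parse_cloud_members(log_text):
    """Return the members from the last 'Cloud of size N formed' line, or []."""
    matches = CLOUD_RE.findall(log_text)
    if not matches:
        return []
    _size, members = matches[-1]
    # Each member is printed as "/ip:port"; drop the leading slash.
    return [m.strip().lstrip("/") for m in members.split(",") if m.strip()]


line = (
    "01-15 20:11:09.699 10.0.0.41:54321 180 FJ-126-3 INFO: "
    "Cloud of size 2 formed [/10.0.0.5:54321, /10.0.0.41:54321]"
)
members = parse_cloud_members(line)
print(members)
if len(members) > 1:
    print("H2O joined a multi-node cloud; a single instance would list one member.")
```

For the log in the report this lists two members, 10.0.0.5:54321 and 10.0.0.41:54321, and the repeated errors are tagged with `#.5:54321`, so the refused connection may be to the other member rather than to localhost.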
I also got the same issue. Error log:
11-19 14:40:18.737 10.237.73.201:54321 18656 #80:54323 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
Below is the config I tried:
conf$spark.executor.instances <- 171
spark.yarn.executor.memoryOverhead<- 2048
conf$spark.executor.memory <- "18g"
conf$spark.executor.cores <- 5
spark.yarn.driver.memoryOverhead<- 39936
conf$spark.driver.memory<-"57.6g"
conf$spark.driver.cores<- 5
conf$'sparklyr.shell.executor-memory' <- "32g"
conf$'sparklyr.shell.driver-memory' <- "32g"
conf$spark.yarn.am.memory <- "32g"
conf$spark.dynamicAllocation.enabled <- "false"