tobegit3hub/simple_tensorflow_serving

Unable to run on Kubernetes Cluster

randomthought opened this issue · 2 comments

Firstly, thanks for the great work!

I am having difficulty getting simple_tensorflow_serving to work on a Kubernetes cluster. The problem seems to be related to H2O, but the logs are not descriptive enough for me to pinpoint it. The server just hangs, repeating the connection refused error below.

01-15 20:11:07.286 10.0.0.41:54321       180    main      INFO: H2O started in 2983ms
01-15 20:11:07.286 10.0.0.41:54321       180    main      INFO:
01-15 20:11:07.286 10.0.0.41:54321       180    main      INFO: Open H2O Flow in your web browser: http://10.0.0.41:54321
01-15 20:11:07.287 10.0.0.41:54321       180    main      INFO:
01-15 20:11:09.699 10.0.0.41:54321       180    FJ-126-3  INFO: Cloud of size 2 formed [/10.0.0.5:54321, /10.0.0.41:54321]
2019-01-15 20:11:14 INFO     Try to get function from file: ./models/h2o_prostate_model/preprocess_function.marshal
2019-01-15 20:11:14 INFO     Try to get function from file: ./models/h2o_prostate_model/postprocess_function.marshal
2019-01-15 20:11:14 INFO     Try to initialize and connect the h2o server
Checking whether there is an H2O instance running at http://localhost:54321. connected.
Warning: Your H2O cluster version is too old (8 months and 27 days)! Please download and install the latest version from http://h2o.ai/download/
01-15 20:11:14.371 10.0.0.41:54321       180    #28758-13 INFO: POST /4/sessions, parms: {}
01-15 20:11:14.377 10.0.0.41:54321       180    #28758-13 INFO: Locking cloud to new members, because water.api.schemas4.SessionIdV4
01-15 20:11:14.414 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:14.717 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.020 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.322 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.625 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
01-15 20:11:15.928 10.0.0.41:54321       180    #.5:54321 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused
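
One way to narrow this down is to pull the cloud membership out of the log and compare it with the peer that refuses connections. The snippet below is a minimal sketch, not part of simple_tensorflow_serving or H2O: `cloud_members` is a hypothetical helper, and it assumes the `Cloud of size N formed [...]` line always has the shape shown above. In this log the cloud contains 10.0.0.5:54321 and 10.0.0.41:54321, and the errors come from a thread labelled `#.5:54321`, which presumably refers to the 10.0.0.5 member.

```python
import re

# Matches lines such as:
#   INFO: Cloud of size 2 formed [/10.0.0.5:54321, /10.0.0.41:54321]
CLOUD_RE = re.compile(r"Cloud of size (\d+) formed \[(.*?)\]")


def cloud_members(log_text):
    """Return (size, [(host, port), ...]) from the last 'Cloud of size' line.

    Hypothetical helper for illustration; returns None if no such line exists.
    """
    matches = CLOUD_RE.findall(log_text)
    if not matches:
        return None
    size, members = matches[-1]
    peers = []
    for item in members.split(","):
        # Each entry looks like "/10.0.0.5:54321"; drop the leading slash.
        host, _, port = item.strip().lstrip("/").rpartition(":")
        peers.append((host, int(port)))
    return int(size), peers
```

If the reported cloud size is larger than the number of H2O instances you intended to start, the local instance has probably joined another H2O node on the same network.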

Thanks for reporting.

Have you set up the H2O cluster to run with a single H2O instance? This looks like a network problem, but I'm not sure why it fails to connect to the localhost service.
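
To check whether this is a network problem, a quick test is to try opening a TCP connection to each cloud member from inside the pod. The following is a rough sketch using only the Python standard library; `can_connect` is a hypothetical helper, and the addresses in the usage comment are simply the ones from the log above. It only tests the single port shown in the log and says nothing about any other ports H2O may use between nodes.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Example usage, run from inside the affected pod:
#   for host, port in [("10.0.0.5", 54321), ("10.0.0.41", 54321)]:
#       print(host, port, can_connect(host, port))
```

If the local address is reachable but the other member is not, that would suggest the pod cannot reach its H2O peer, which would be consistent with the repeated `Connection refused` errors.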

I also got the same issue. Error log:

11-19 14:40:18.737 10.237.73.201:54321 18656 #80:54323 ERRR: Got IO error when sending batch UDP bytes: java.net.ConnectException: Connection refused

Below is the config I tried:

conf$spark.executor.instances <- 171
spark.yarn.executor.memoryOverhead <- 2048
conf$spark.executor.memory <- "18g"
conf$spark.executor.cores <- 5

spark.yarn.driver.memoryOverhead <- 39936
conf$spark.driver.memory <- "57.6g"
conf$spark.driver.cores <- 5

conf$'sparklyr.shell.executor-memory' <- "32g"
conf$'sparklyr.shell.driver-memory' <- "32g"
conf$spark.yarn.am.memory <- "32g"
conf$spark.dynamicAllocation.enabled <- "false"