Multiple Learners Training Job Fails
nkpng2k opened this issue · 2 comments
Raising this issue as per my conversation with Tommy earlier today.
minikube version: v0.25.2
kubectl version:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"", Minor:"", GitVersion:"v1.9.4", GitCommit:"bee2d1505c4fe820744d26d41ecd3fdd4a3d6546", GitTreeState:"clean", BuildDate:"2018-03-21T21:48:36Z", GoVersion:"go1.9.1", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes: 1.9.4
Running a single-learner job with the same Python script works without issues; the whole process completes.
Running a multi-learner job (the only change in the manifest is learners: 2) fails.
Logs are as follows:
Nicholass-MBP:FfDL npng$ $CLI_CMD list
Getting all models ...
ID Name Framework Training status Submitted Completed
training-C6DTcIMmR h2o3_automl h2o3:latest COMPLETED N/A N/A
training-TOnw5SGiR h2o3_automl h2o3:latest FAILED N/A N/A
2 records found.
Nicholass-MBP:FfDL npng$ $CLI_CMD logs training-TOnw5SGiR
Getting model training logs for 'training-TOnw5SGiR'...
Status: FAILED
Cannot read trained model log: rpc error: code = Unknown desc = NoSuchKey: The specified key does not exist.
status code: 404, request id: , host id: Nicholass-MBP:FfDL npng$
Nicholass-MBP:FfDL npng$ kubectl get pods
NAME READY STATUS RESTARTS AGE
alertmanager-78676b6756-2l2zb 1/1 Running 0 32m
etcd0 1/1 Running 0 32m
ffdl-lcm-dd5f59b55-bm52q 1/1 Running 0 32m
ffdl-restapi-7789dbdf5f-2j4mh 1/1 Running 0 32m
ffdl-trainer-59bd46cfdb-9csqr 1/1 Running 2 32m
ffdl-trainingdata-688bf5f44b-48wqb 1/1 Running 5 32m
ffdl-ui-6545f7dd5b-lpqcd 1/1 Running 0 32m
grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9 0/1 ImagePullBackOff 0 6m
jobmonitor-3be3332c-e2f4-4a2b-4775-3069398a12ba-64f9b94465s7gmh 1/1 Running 0 6m
learner-1-3be3332c-e2f4-4a2b-4775-3069398a12ba-f8d8b8c98-6drgz 0/7 Pending 0 6m
learner-2-3be3332c-e2f4-4a2b-4775-3069398a12ba-979949d49-9jv9f 0/7 Pending 0 6m
mongo-0 1/1 Running 0 32m
prometheus-556d97b566-fmgkp 2/2 Running 0 32m
pushgateway-665b6c4b9-hg85s 2/2 Running 0 32m
storage-0 1/1 Running 0 32m
Nicholass-MBP:FfDL npng$
Nicholass-MBP:FfDL npng$ kubectl describe pod grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9
Name: grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9
Namespace: default
Node: minikube/192.168.99.100
Start Time: Fri, 27 Apr 2018 15:45:33 -0700
Labels: app=grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba
pod-template-hash=3168270778
service=dlaas-parameter-server
training_id=training-TOnw5SGiR
Annotations: <none>
Status: Pending
IP: 172.17.0.16
Controlled By: ReplicaSet/grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd
Containers:
grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba:
Container ID:
Image: docker.io/ffdl/parameter-server:master-97
Image ID:
Port: 50051/TCP
State: Waiting
Reason: ImagePullBackOff
Ready: False
Restart Count: 0
Limits:
cpu: 500m
memory: 1048576k
Requests:
cpu: 500m
memory: 1048576k
Environment:
JOBID: 1111
NUM_LEARNERS: 2
TCP_PORT: 50051
ZK_DIR: training-TOnw5SGiR/parameter-server
ZK_DIR: training-TOnw5SGiR/parameter-server
DLAAS_ETCD_ADDRESS: <set to the key 'DLAAS_ETCD_ADDRESS' in secret 'lcm-secrets'> Optional: false
DLAAS_ETCD_USERNAME: <set to the key 'DLAAS_ETCD_USERNAME' in secret 'lcm-secrets'> Optional: false
DLAAS_ETCD_PASSWORD: <set to the key 'DLAAS_ETCD_PASSWORD' in secret 'lcm-secrets'> Optional: false
DLAAS_ETCD_PREFIX: <set to the key 'DLAAS_ETCD_PREFIX' in secret 'lcm-secrets'> Optional: false
FOR_TEST: 1
DLAAS_JOB_ID: training-TOnw5SGiR
ZNODE_NAME: singleshard
DATA_STORE_AUTHURL: http://s3.default.svc.cluster.local
MODEL_STORE_OBJECTID: dlaas-models/training-TOnw5SGiR.zip
RESULT_STORE_AUTHURL: http://s3.default.svc.cluster.local
RESULT_STORE_TYPE: s3_datastore
RESULT_STORE_USERNAME: test
MODEL_STORE_APIKEY: test
DATA_DIR: h2o3_training_data
DATA_STORE_TYPE: s3_datastore
MODEL_STORE_USERNAME: test
MODEL_DIR: /model-code
GPU_COUNT: 0.000000
RESULT_DIR: h2o3_trained_model
DATA_STORE_OBJECTID: h2o3_training_data
SCHED_POLICY: dense
RESULT_STORE_OBJECTID: h2o3_trained_model/training-TOnw5SGiR
LOG_DIR: /logs
MODEL_STORE_AUTHURL: http://s3.default.svc.cluster.local
MODEL_STORE_TYPE: s3_datastore
DATA_STORE_USERNAME: test
DATA_STORE_APIKEY: test
RESULT_STORE_APIKEY: test
TRAINING_COMMAND: python h2o3_baseline.py --trainDataFile ${DATA_DIR}/higgs_train_10k.csv --target response --memory 1
TRAINING_ID: training-TOnw5SGiR
Mounts:
/etc/certs/ from etcd-ssl-cert-vol (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-nllw4 (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
etcd-ssl-cert-vol:
Type: Secret (a volume populated by a Secret)
SecretName: lcm-secrets
Optional: false
default-token-nllw4:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-nllw4
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m default-scheduler Successfully assigned grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9 to minikube
Normal SuccessfulMountVolume 6m kubelet, minikube MountVolume.SetUp succeeded for volume "etcd-ssl-cert-vol"
Normal SuccessfulMountVolume 6m kubelet, minikube MountVolume.SetUp succeeded for volume "default-token-nllw4"
Normal Pulling 5m (x4 over 6m) kubelet, minikube pulling image "docker.io/ffdl/parameter-server:master-97"
Warning Failed 5m (x4 over 6m) kubelet, minikube Failed to pull image "docker.io/ffdl/parameter-server:master-97": rpc error: code = Unknown desc = Error response from daemon: pull access denied for ffdl/parameter-server, repository does not exist or may require 'docker login'
Warning Failed 5m (x4 over 6m) kubelet, minikube Error: ErrImagePull
Warning Failed 5m (x6 over 6m) kubelet, minikube Error: ImagePullBackOff
Normal BackOff 1m (x21 over 6m) kubelet, minikube Back-off pulling image "docker.io/ffdl/parameter-server:master-97"
I think this is what is blocking the rest of the processes:
Normal Pulling 5m (x4 over 6m) kubelet, minikube pulling image "docker.io/ffdl/parameter-server:master-97"
Warning Failed 5m (x4 over 6m) kubelet, minikube Failed to pull image "docker.io/ffdl/parameter-server:master-97": rpc error: code = Unknown desc = Error response from daemon: pull access denied for ffdl/parameter-server, repository does not exist or may require 'docker login'
Warning Failed 5m (x4 over 6m) kubelet, minikube Error: ErrImagePull
Warning Failed 5m (x6 over 6m) kubelet, minikube Error: ImagePullBackOff
Normal BackOff 1m (x21 over 6m) kubelet, minikube Back-off pulling image "docker.io/ffdl/parameter-server:master-97"
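To make the triage above repeatable, here is a minimal sketch that scans the text output of kubectl get pods and lists the pods that are not fully up. The function name unhealthy_pods and the shortened SAMPLE text are illustrative assumptions, not part of FfDL or kubectl; the parsing assumes the default five-column layout shown in the listing above.

```python
from typing import List, Tuple


def unhealthy_pods(kubectl_output: str) -> List[Tuple[str, str]]:
    """Return (pod name, status) for pods that are not fully ready.

    Assumes the default `kubectl get pods` columns:
    NAME READY STATUS RESTARTS AGE
    """
    rows = kubectl_output.strip().splitlines()[1:]  # skip the header row
    flagged = []
    for row in rows:
        fields = row.split()
        if len(fields) < 3 or "/" not in fields[1]:
            continue  # not a pod row in the expected layout
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, total = ready.split("/", 1)
        # Flag anything that is not Running/Completed, or only partly ready.
        if status not in ("Running", "Completed") or ready_count != total:
            flagged.append((name, status))
    return flagged


# Shortened copy of the listing from this issue (illustrative input only).
SAMPLE = """NAME READY STATUS RESTARTS AGE
ffdl-lcm-dd5f59b55-bm52q 1/1 Running 0 32m
grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9 0/1 ImagePullBackOff 0 6m
learner-1-3be3332c-e2f4-4a2b-4775-3069398a12ba-f8d8b8c98-6drgz 0/7 Pending 0 6m
"""

if __name__ == "__main__":
    for pod_name, pod_status in unhealthy_pods(SAMPLE):
        print(pod_name, pod_status)
```

Running kubectl describe pod on each flagged name then shows the Events section, which is where the image pull failure appears.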
Hi @nkpng2k, you can fix this error by adding h2o3 to the supported native distributed frameworks list at FfDL/lcm/service/lcm/service_impl.go, line 54 (commit d8e5caa).
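The idea behind this fix can be illustrated with a small sketch. It is not the FfDL code: the real list lives in the Go file referenced above and its contents are not shown in this thread, so the names needs_parameter_server and native_frameworks are hypothetical. The sketch also assumes, based on the grpc-ps pod in the logs, that a parameter server is only deployed for multi-learner jobs whose framework is not on that list.

```python
from typing import Set


def needs_parameter_server(framework: str, learners: int,
                           native_frameworks: Set[str]) -> bool:
    """Decide whether a job would need a separate parameter-server pod.

    Hypothetical illustration: frameworks that handle distribution
    themselves are assumed to skip the parameter server entirely.
    """
    # "h2o3:latest" -> "h2o3"; the version tag is ignored for the lookup.
    name = framework.split(":", 1)[0].lower()
    return learners > 1 and name not in native_frameworks


if __name__ == "__main__":
    # Before the fix: h2o3 is not listed, so a 2-learner job asks for a
    # parameter server (whose image could not be pulled in this issue).
    print(needs_parameter_server("h2o3:latest", 2, set()))
    # After the fix: h2o3 is listed, so no parameter server is requested.
    print(needs_parameter_server("h2o3:latest", 2, {"h2o3"}))
```

Under these assumptions, a single-learner job never asks for a parameter server, which matches the observation that the one-learner run completed.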