IBM/FfDL

Multiple Learners Training Job Fails

nkpng2k opened this issue · 2 comments

Raising this issue as per my conversation with Tommy earlier today.

minikube version: v0.25.2
kubectl version:
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"", Minor:"", GitVersion:"v1.9.4", GitCommit:"bee2d1505c4fe820744d26d41ecd3fdd4a3d6546", GitTreeState:"clean", BuildDate:"2018-03-21T21:48:36Z", GoVersion:"go1.9.1", Compiler:"gc", Platform:"linux/amd64"}

Kubernetes: 1.9.4

Running a single-learner job with the same Python script works without issues; the whole process completes.

Running a multi-learner job (the only change in the manifest is learners: 2) fails.

Logs are as follows:

Nicholass-MBP:FfDL npng$ $CLI_CMD list
Getting all models ...
ID                   Name          Framework     Training status   Submitted   Completed
training-C6DTcIMmR   h2o3_automl   h2o3:latest   COMPLETED         N/A         N/A
training-TOnw5SGiR   h2o3_automl   h2o3:latest   FAILED            N/A         N/A

2 records found.
Nicholass-MBP:FfDL npng$ $CLI_CMD logs training-TOnw5SGiR
Getting model training logs for 'training-TOnw5SGiR'...
Status: FAILED
Cannot read trained model log: rpc error: code = Unknown desc = NoSuchKey: The specified key does not exist.
	status code: 404, request id: , host id: Nicholass-MBP:FfDL npng$
Nicholass-MBP:FfDL npng$ kubectl get pods
NAME                                                              READY     STATUS             RESTARTS   AGE
alertmanager-78676b6756-2l2zb                                     1/1       Running            0          32m
etcd0                                                             1/1       Running            0          32m
ffdl-lcm-dd5f59b55-bm52q                                          1/1       Running            0          32m
ffdl-restapi-7789dbdf5f-2j4mh                                     1/1       Running            0          32m
ffdl-trainer-59bd46cfdb-9csqr                                     1/1       Running            2          32m
ffdl-trainingdata-688bf5f44b-48wqb                                1/1       Running            5          32m
ffdl-ui-6545f7dd5b-lpqcd                                          1/1       Running            0          32m
grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9     0/1       ImagePullBackOff   0          6m
jobmonitor-3be3332c-e2f4-4a2b-4775-3069398a12ba-64f9b94465s7gmh   1/1       Running            0          6m
learner-1-3be3332c-e2f4-4a2b-4775-3069398a12ba-f8d8b8c98-6drgz    0/7       Pending            0          6m
learner-2-3be3332c-e2f4-4a2b-4775-3069398a12ba-979949d49-9jv9f    0/7       Pending            0          6m
mongo-0                                                           1/1       Running            0          32m
prometheus-556d97b566-fmgkp                                       2/2       Running            0          32m
pushgateway-665b6c4b9-hg85s                                       2/2       Running            0          32m
storage-0                                                         1/1       Running            0          32m
Nicholass-MBP:FfDL npng$
Nicholass-MBP:FfDL npng$ kubectl describe pod grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9
Name:           grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9
Namespace:      default
Node:           minikube/192.168.99.100
Start Time:     Fri, 27 Apr 2018 15:45:33 -0700
Labels:         app=grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba
                pod-template-hash=3168270778
                service=dlaas-parameter-server
                training_id=training-TOnw5SGiR
Annotations:    <none>
Status:         Pending
IP:             172.17.0.16
Controlled By:  ReplicaSet/grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd
Containers:
  grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba:
    Container ID:
    Image:          docker.io/ffdl/parameter-server:master-97
    Image ID:
    Port:           50051/TCP
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  1048576k
    Requests:
      cpu:     500m
      memory:  1048576k
    Environment:
      JOBID:                  1111
      NUM_LEARNERS:           2
      TCP_PORT:               50051
      ZK_DIR:                 training-TOnw5SGiR/parameter-server
      ZK_DIR:                 training-TOnw5SGiR/parameter-server
      DLAAS_ETCD_ADDRESS:     <set to the key 'DLAAS_ETCD_ADDRESS' in secret 'lcm-secrets'>   Optional: false
      DLAAS_ETCD_USERNAME:    <set to the key 'DLAAS_ETCD_USERNAME' in secret 'lcm-secrets'>  Optional: false
      DLAAS_ETCD_PASSWORD:    <set to the key 'DLAAS_ETCD_PASSWORD' in secret 'lcm-secrets'>  Optional: false
      DLAAS_ETCD_PREFIX:      <set to the key 'DLAAS_ETCD_PREFIX' in secret 'lcm-secrets'>    Optional: false
      FOR_TEST:               1
      DLAAS_JOB_ID:           training-TOnw5SGiR
      ZNODE_NAME:             singleshard
      DATA_STORE_AUTHURL:     http://s3.default.svc.cluster.local
      MODEL_STORE_OBJECTID:   dlaas-models/training-TOnw5SGiR.zip
      RESULT_STORE_AUTHURL:   http://s3.default.svc.cluster.local
      RESULT_STORE_TYPE:      s3_datastore
      RESULT_STORE_USERNAME:  test
      MODEL_STORE_APIKEY:     test
      DATA_DIR:               h2o3_training_data
      DATA_STORE_TYPE:        s3_datastore
      MODEL_STORE_USERNAME:   test
      MODEL_DIR:              /model-code
      GPU_COUNT:              0.000000
      RESULT_DIR:             h2o3_trained_model
      DATA_STORE_OBJECTID:    h2o3_training_data
      SCHED_POLICY:           dense
      RESULT_STORE_OBJECTID:  h2o3_trained_model/training-TOnw5SGiR
      LOG_DIR:                /logs
      MODEL_STORE_AUTHURL:    http://s3.default.svc.cluster.local
      MODEL_STORE_TYPE:       s3_datastore
      DATA_STORE_USERNAME:    test
      DATA_STORE_APIKEY:      test
      RESULT_STORE_APIKEY:    test
      TRAINING_COMMAND:       python h2o3_baseline.py --trainDataFile ${DATA_DIR}/higgs_train_10k.csv --target response --memory 1
      TRAINING_ID:            training-TOnw5SGiR
    Mounts:
      /etc/certs/ from etcd-ssl-cert-vol (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nllw4 (ro)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  etcd-ssl-cert-vol:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  lcm-secrets
    Optional:    false
  default-token-nllw4:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nllw4
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason                 Age               From               Message
  ----     ------                 ----              ----               -------
  Normal   Scheduled              6m                default-scheduler  Successfully assigned grpc-ps-3be3332c-e2f4-4a2b-4775-3069398a12ba-75bd6c4ccd-h4bs9 to minikube
  Normal   SuccessfulMountVolume  6m                kubelet, minikube  MountVolume.SetUp succeeded for volume "etcd-ssl-cert-vol"
  Normal   SuccessfulMountVolume  6m                kubelet, minikube  MountVolume.SetUp succeeded for volume "default-token-nllw4"
  Normal   Pulling                5m (x4 over 6m)   kubelet, minikube  pulling image "docker.io/ffdl/parameter-server:master-97"
  Warning  Failed                 5m (x4 over 6m)   kubelet, minikube  Failed to pull image "docker.io/ffdl/parameter-server:master-97": rpc error: code = Unknown desc = Error response from daemon: pull access denied for ffdl/parameter-server, repository does not exist or may require 'docker login'
  Warning  Failed                 5m (x4 over 6m)   kubelet, minikube  Error: ErrImagePull
  Warning  Failed                 5m (x6 over 6m)   kubelet, minikube  Error: ImagePullBackOff
  Normal   BackOff                1m (x21 over 6m)  kubelet, minikube  Back-off pulling image "docker.io/ffdl/parameter-server:master-97"

I think this is what is blocking the rest of the processes:

  Normal   Pulling                5m (x4 over 6m)   kubelet, minikube  pulling image "docker.io/ffdl/parameter-server:master-97"
  Warning  Failed                 5m (x4 over 6m)   kubelet, minikube  Failed to pull image "docker.io/ffdl/parameter-server:master-97": rpc error: code = Unknown desc = Error response from daemon: pull access denied for ffdl/parameter-server, repository does not exist or may require 'docker login'
  Warning  Failed                 5m (x4 over 6m)   kubelet, minikube  Error: ErrImagePull
  Warning  Failed                 5m (x6 over 6m)   kubelet, minikube  Error: ImagePullBackOff
  Normal   BackOff                1m (x21 over 6m)  kubelet, minikube  Back-off pulling image "docker.io/ffdl/parameter-server:master-97"

Hi @nkpng2k, you can fix this error by adding h2o3 to the list of supported native distributed frameworks:

NativeFrameworks = []string{"tensorflow", "caffe2", "mxnet", "horovod"}
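To illustrate why extending that list would avoid the failing grpc-ps pod, here is a minimal sketch. It assumes that the parameter-server pod is only created for multi-learner jobs whose framework is not in the native list; that is inferred from the logs above, not confirmed against the FfDL source. The helpers `isNativeFramework` and `needsParameterServer` are hypothetical names used only for illustration, and the package-level declaration is likewise just for this example.

```go
package main

import "fmt"

// NativeFrameworks mirrors the list quoted above, with "h2o3" appended
// as the suggested fix.
var NativeFrameworks = []string{"tensorflow", "caffe2", "mxnet", "horovod", "h2o3"}

// isNativeFramework is a hypothetical helper (not FfDL's actual API) that
// reports whether a framework name appears in NativeFrameworks.
func isNativeFramework(name string) bool {
	for _, f := range NativeFrameworks {
		if f == name {
			return true
		}
	}
	return false
}

// needsParameterServer encodes the assumption stated in the lead-in:
// only multi-learner jobs on non-native frameworks get a parameter-server pod.
func needsParameterServer(framework string, learners int) bool {
	return learners > 1 && !isNativeFramework(framework)
}

func main() {
	// With "h2o3" in the list, a 2-learner job would skip the parameter server.
	fmt.Println(needsParameterServer("h2o3", 2))
	// A framework name that is not in the list would still request one.
	fmt.Println(needsParameterServer("not-in-list", 2))
}
```

Under that assumption, a 2-learner h2o3 job would no longer try to pull docker.io/ffdl/parameter-server:master-97, so the learners would not be left waiting on a pod stuck in ImagePullBackOff.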

Closed with #88