Training status stays PENDING and does not change
bleachzk opened this issue · 9 comments
LCM logs:
{"level":"info","msg":"transport: http2Server.HandleStreams failed to read frame: read tcp [::1]:8443-\u003e[::1]:53622: read: connection reset by peer","time":"2018-06-20T08:56:58Z"}
Hi @bleachzk, can I have some details about your job (e.g. $CLI_CMD show training-gSR-qONmR)? If you are requesting GPUs for your training job, do you have any GPU resources available on your Kubernetes cluster?
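For reference, one generic way to check whether your nodes expose GPU resources to the scheduler (a plain kubectl sketch, not FfDL-specific; the resource names in the grep pattern are assumptions for a typical setup):
# list allocatable resources on each node and look for GPU entries
kubectl describe nodes | grep -i -A 6 'Allocatable'
kubectl describe nodes | grep -i 'nvidia.com/gpu\|alpha.kubernetes.io/nvidia-gpu'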
Model definition:
Name: tf_convolutional_network_tutorial
Description: Convolutional network model using tensorflow
Framework: tensorflow:1.7.0-gpu-py3
Training:
Status: PENDING
Submitted: N/A
Completed: N/A
Resources: 2.00 CPUs | 1.00 GPUs | 2.00 GB Mem | 1 node(s)
Command: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz
--trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz
--testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001
--trainingIters 20000
Input data : sl-internal-os-input
Output data: sl-internal-os-output
Data stores:
ID: sl-internal-os-input
Type: s3_datastore
Connection:
auth_url: http://s3.default.svc.cluster.local
bucket: tf_training_data
password: test
type: s3_datastore
user_name: test
ID: sl-internal-os-output
Type: s3_datastore
Connection:
auth_url: http://s3.default.svc.cluster.local
bucket: tf_trained_model
password: test
type: s3_datastore
user_name: test
Summary metrics:
OK
Hi @bleachzk, did you deploy ffdl-lcm with the device-plugin tag (e.g. helm install --set lcm.version="device-plugin" .)? ffdl-lcm:latest will use accelerators for GPU resources. After you switch ffdl-lcm to the device-plugin tag, all new GPU jobs should consume nvidia.com/gpu resources.
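If you want to confirm which GPU resource name a learner pod actually requests, here is a generic check (the pod name is a placeholder):
# print the resource limits of every container in the pod
kubectl get pod <learner-pod-name> -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits}{"\n"}{end}'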
Since accelerators are deprecated in Kubernetes 1.10, we will add a new pre-0.1 patch to FfDL this week to use device-plugin as the default.
@Tomcli after upgrading to v0.1, the learner pod fails to start with this error:
MountVolume.SetUp failed for volume "cosoutputmount-107082b8-77ec-4686-61f6-b87d630babfb" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: s3fs: error while loading shared libraries: libfuse.so.2: cannot open shared object file: No such file or directory
Hi @bleachzk, with our new v0.1 release, we require users to install the s3fs drivers on each of their nodes (e.g. using the storage-plugin helm chart or following the ibmcloud-object-storage-plugin instructions).
Since the s3fs installation varies across Kubernetes environments, I can point you to more specific instructions if you let me know what kind of Kubernetes environment you are using.
Thanks.
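As a quick sanity check for the error above, you can confirm whether the fuse runtime library is present on a node (a generic Linux check, not FfDL-specific):
# look for libfuse.so.2 in the dynamic linker cache
ldconfig -p | grep libfuse
If nothing is printed, installing the s3fs/fuse packages for your distribution should pull it in.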
System version: CentOS 7.2 (3.10.0-514.26.2.el7.x86_64)
Kubernetes version: 1.10
Docker version: CE 18.03
@Tomcli
@bleachzk For your cluster, I think you need to install the s3fs drivers and copy the driver binary ibmc-s3fs onto each of your worker nodes.
# install s3fs (the command below is for Debian/Ubuntu; on CentOS 7 the package is typically s3fs-fuse from EPEL, e.g. sudo yum install epel-release && sudo yum install s3fs-fuse)
sudo apt-get install s3fs
# copy the ibmc-s3fs flex volume driver into the kubelet plugin directory
sudo mkdir -p /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo cp <FfDL repo>/bin/ibmc-s3fs /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo chmod +x /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/ibmc-s3fs
# restart kubelet so it picks up the new flex volume driver
sudo systemctl restart kubelet
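To verify the driver landed where kubelet expects it (assuming a systemd-managed kubelet; the grep pattern is an assumption):
ls /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/
journalctl -u kubelet | grep -i flexvolume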
Then, install the storage-plugin helm chart if you haven't done so already.
helm install storage-plugin --set cloud=false
Then your learner pods should be able to mount any S3 Object Storage.
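Once you resubmit a job, one generic way to confirm the volume mounted correctly is to inspect the learner pod's events (the pod name is a placeholder):
# the Events section will show MountVolume errors if the s3fs mount still fails
kubectl describe pod <learner-pod-name> | grep -A 10 Events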