Training status stays PENDING and does not change
bleachzk opened this issue · 9 comments
LCM logs:
{"level":"info","msg":"transport: http2Server.HandleStreams failed to read frame: read tcp [::1]:8443-\u003e[::1]:53622: read: connection reset by peer","time":"2018-06-20T08:56:58Z"}
Hi @bleachzk, can I have some details about your job (e.g. $CLI_CMD show training-gSR-qONmR)? If you are requesting GPUs for your training job, do you have any GPU resources available on your Kubernetes cluster?
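For reference, one generic way to check whether your nodes expose GPU resources to the scheduler (a plain kubectl sketch, not FfDL-specific; the resource names in the grep pattern are assumptions for a typical setup):
# list allocatable resources on each node and look for GPU entries
kubectl describe nodes | grep -i -A 6 'Allocatable'
kubectl describe nodes | grep -i 'nvidia.com/gpu\|alpha.kubernetes.io/nvidia-gpu'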
Model definition:
Name: tf_convolutional_network_tutorial
Description: Convolutional network model using tensorflow
Framework: tensorflow:1.7.0-gpu-py3
Training:
Status: PENDING
Submitted: N/A
Completed: N/A
Resources: 2.00 CPUs | 1.00 GPUs | 2.00 GB Mem | 1 node(s)
Command: python3 convolutional_network.py --trainImagesFile ${DATA_DIR}/train-images-idx3-ubyte.gz
--trainLabelsFile ${DATA_DIR}/train-labels-idx1-ubyte.gz --testImagesFile ${DATA_DIR}/t10k-images-idx3-ubyte.gz
--testLabelsFile ${DATA_DIR}/t10k-labels-idx1-ubyte.gz --learningRate 0.001
--trainingIters 20000
Input data : sl-internal-os-input
Output data: sl-internal-os-output
Data stores:
ID: sl-internal-os-input
Type: s3_datastore
Connection:
auth_url: http://s3.default.svc.cluster.local
bucket: tf_training_data
password: test
type: s3_datastore
user_name: test
ID: sl-internal-os-output
Type: s3_datastore
Connection:
auth_url: http://s3.default.svc.cluster.local
bucket: tf_trained_model
password: test
type: s3_datastore
user_name: test
Summary metrics:
OK
Hi @bleachzk, did you deploy ffdl-lcm with the device-plugin tag (e.g. helm install --set lcm.version="device-plugin" .)? ffdl-lcm:latest will use accelerators for GPU resources. After you switch ffdl-lcm to the device-plugin tag, all new GPU jobs should consume nvidia.com/gpu resources.
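If you want to confirm which GPU resource name a learner pod actually requests, here is a generic check (the pod name is a placeholder):
# print the resource limits of every container in the pod
kubectl get pod <learner-pod-name> -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.limits}{"\n"}{end}'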
Since accelerators are deprecated in Kubernetes 1.10, we will add a new pre-0.1 patch to FfDL this week to use device-plugin as the default.
@Tomcli after upgrading to v0.1, the learner pod fails to start with this error:
MountVolume.SetUp failed for volume "cosoutputmount-107082b8-77ec-4686-61f6-b87d630babfb" : mount command failed, status: Failure, reason: Error mounting volume: s3fs mount failed: s3fs: error while loading shared libraries: libfuse.so.2: cannot open shared object file: No such file or directory
Hi @bleachzk, with our new v0.1 release, we require users to install the s3fs drivers on each of their nodes (e.g. using the storage-plugin helm chart or following the ibmcloud-object-storage-plugin instructions).
Since the s3fs installation varies across Kubernetes environments, I can point you to more specific instructions if you let me know what kind of Kubernetes environment you are using.
Thanks.
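As a quick sanity check for the error above, you can confirm whether the fuse runtime library is present on a node (a generic Linux check, not FfDL-specific):
# look for libfuse.so.2 in the dynamic linker cache
ldconfig -p | grep libfuse
If nothing is printed, installing the s3fs/fuse packages for your distribution should pull it in.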
System version: CentOS 7.2 (3.10.0-514.26.2.el7.x86_64)
Kubernetes version: 1.10
Docker version: CE 18.03
@Tomcli
@bleachzk For your cluster, I think you need to install the s3fs drivers and copy the driver binary ibmc-s3fs onto each of your worker nodes.
# install s3fs (the command below is for Debian/Ubuntu; on CentOS 7 the package is typically s3fs-fuse from EPEL, e.g. sudo yum install epel-release && sudo yum install s3fs-fuse)
sudo apt-get install s3fs
# copy the ibmc-s3fs flex volume driver into the kubelet plugin directory
sudo mkdir -p /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo cp <FfDL repo>/bin/ibmc-s3fs /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs
sudo chmod +x /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/ibmc-s3fs
# restart kubelet so it picks up the new flex volume driver
sudo systemctl restart kubelet
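To verify the driver landed where kubelet expects it (assuming a systemd-managed kubelet; the grep pattern is an assumption):
ls /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ibm~ibmc-s3fs/
journalctl -u kubelet | grep -i flexvolume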
Then, install the storage-plugin helm chart if you haven't done so already.
helm install storage-plugin --set cloud=false
Then your learner pods should be able to mount any S3 Object Storage.
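Once you resubmit a job, one generic way to confirm the volume mounted correctly is to inspect the learner pod's events (the pod name is a placeholder):
# the Events section will show MountVolume errors if the s3fs mount still fails
kubectl describe pod <learner-pod-name> | grep -A 10 Events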