Pod keeps restarting because there is no configuration file in the federated-learning-surface-defect example with image v0.5.1

Question

xinzongyan opened this issue 2 years ago · 5 comments

What happened:
The federated-learning-surface-defect-detection-train worker needs a configuration file, but none is provided, and the documentation does not mention it.

Logs of the aggregation worker:

[root@board1 ~]# kubectl logs -f federated-learning-surface-defect-detection-aggregation-c67cq
2023-01-11 03:17:01.881770: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2023-01-11 03:17:01.881802: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-01-11 03:17:03.286409: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-01-11 03:17:03.286435: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2023-01-11 03:17:03.286457: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (federated-learning-surface-defect-detection-aggregation-c67cq): /proc/driver/nvidia/version does not exist
2023-01-11 03:17:03.286692: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-11 03:17:03.292278: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3312000000 Hz
2023-01-11 03:17:03.292504: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x224f970 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-01-11 03:17:03.292518: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
  File "aggregate.py", line 35, in <module>
    run_server()
  File "aggregate.py", line 29, in run_server
    chooser=simple_chooser)
  File "/home/lib/sedna/service/server/aggregation.py", line 280, in __init__
    server = Config().server._asdict()
  File "/home/plato/plato/config.py", line 136, in __new__
    raise ValueError("A configuration file must be supplied.")
ValueError: A configuration file must be supplied.

Logs of the train worker:

[root@board2 ~]# docker logs -f k8s_train-worker_federated-learning-surface-defect-detection-train-94npn_default_9b39d8bf-c0bc-48b7-b929-d6a646e8b60d_2
2023-01-11 03:17:46.909824: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2023-01-11 03:17:46.909851: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-01-11 03:17:49.472157: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2023-01-11 03:17:49.472186: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2023-01-11 03:17:49.472209: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (board2): /proc/driver/nvidia/version does not exist
2023-01-11 03:17:49.472517: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-11 03:17:49.477353: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 3312000000 Hz
2023-01-11 03:17:49.477533: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x267cac0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-01-11 03:17:49.477543: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2023-01-11 03:17:49.478930: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 102629376 exceeds 10% of free system memory.
2023-01-11 03:17:49.537921: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 102629376 exceeds 10% of free system memory.
Traceback (most recent call last):
  File "train.py", line 60, in <module>
    main()
  File "train.py", line 55, in main
    transmitter=s3_transmitter)
  File "/home/lib/sedna/core/federated_learning/federated_learning.py", line 196, in __init__
    server = Config().server._asdict()
  File "/home/plato/plato/config.py", line 136, in __new__
    raise ValueError("A configuration file must be supplied.")
ValueError: A configuration file must be supplied.
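
Both tracebacks end at the same point: the first construction of the Config object raises because no configuration file was given. The pattern might look roughly like the sketch below. It is an illustration only, not the code in plato/config.py; the class name `DemoConfig`, the `DEMO_CONFIG_FILE` environment variable, and the JSON file format are all assumptions invented for this example.

```python
import json
import os
from collections import namedtuple


class DemoConfig:
    """Hypothetical singleton config that is loaded on first instantiation."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            # Assumption: the file path comes from an environment variable.
            # The real library may locate its configuration differently.
            filename = os.environ.get("DEMO_CONFIG_FILE")
            if not filename:
                raise ValueError("A configuration file must be supplied.")

            with open(filename, encoding="utf-8") as f:
                raw = json.load(f)

            instance = super().__new__(cls)
            # Expose the "server" section as a namedtuple so that
            # DemoConfig().server._asdict() works as in the traceback.
            Server = namedtuple("Server", list(raw["server"].keys()))
            instance.server = Server(**raw["server"])
            cls._instance = instance
        return cls._instance
```

With this kind of design, any worker that touches the config before a file has been made available fails immediately at startup, which would be consistent with the pod restarting repeatedly.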

What you expected to happen:
The container runs normally and trains on the dataset.

How to reproduce it (as minimally and precisely as possible):
I used the image kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.5.1.
Please use this image to reproduce the problem.

Anything else we need to know?:

Environment:

Sedna Version

$ kubectl get -n sedna deploy gm -o jsonpath='{.spec.template.spec.containers[0].image}'
kubeedge/sedna-gm:v0.5.1

$ kubectl get -n sedna ds lc -o jsonpath='{.spec.template.spec.containers[0].image}'
kubeedge/sedna-lc:v0.5.1

Kubernetes Version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:57:25Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

KubeEdge Version

$ cloudcore --version
Version: v1.12.1

$ edgecore --version
Version: v1.12.1

Answer 1 · 2023-01-11T09:58:23.000Z

Have you solved it? If so, how? @xinzongyan

Answer 2 · 2023-01-11T10:06:47.000Z

No. I changed the image to v0.4.0, and that version runs to completion.

Answer 3 · 2023-01-11T10:24:07.000Z

Do you mean changing the image to v0.4.0 in build_image.sh and then running kubectl create -f xxx.yaml? @xinzongyan

Answer 4 · 2023-01-12T03:06:32.000Z

Use this image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0 @ylfbx329
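
The workaround above amounts to pinning the train image to the v0.4.0 tag in whichever manifest creates the job. Here is a minimal sketch of that edit, assuming the manifest references the image by name and tag. The helper `pin_image_tag` and the sample manifest text are hypothetical, and the real job YAML may be laid out differently.

```python
import re


def pin_image_tag(manifest_text: str, image: str, tag: str) -> str:
    """Replace the tag on every reference to `image` in the manifest text."""
    # Match the image name followed by ":<tag>", where the tag is made of
    # word characters, dots, and hyphens.
    pattern = re.compile(re.escape(image) + r":[\w.\-]+")
    # A function replacement avoids any backslash interpretation in the result.
    return pattern.sub(lambda _m: f"{image}:{tag}", manifest_text)


# Hypothetical manifest fragment, for illustration only.
sample = """\
containers:
  - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.5.1
    name: train-worker
"""

pinned = pin_image_tag(
    sample,
    "kubeedge/sedna-example-federated-learning-surface-defect-detection-train",
    "v0.4.0",
)
print(pinned)
```

The rewritten text can then be saved and applied with kubectl create -f as usual.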

Answer 5 · 2023-02-10T02:42:12.000Z

/cc @XinYao1994 Would you mind taking a look at this issue?