Error: Unable to extract container id (with cgroup v1 on CentOS 8)
maestro4 opened this issue · 8 comments
Issue description
- issue description: The SWCI node task reports errors while running the MNIST PYT example, and SWOP reports the error "Unable to extract container id". Unfortunately, the container is not being built. However, I can successfully build the container from inside the swop1 container by calling the Docker socket with curl, using a Dockerfile that I created from the example https://github.com/HewlettPackard/swarm-learning/blob/master/examples/mnist-pyt/swci/taskdefs/user_env_pyt_build_task.yaml .
- occurrence: consistent
- error messages: SWOP: "Unable to extract container id" and SWCI: Taskrunner state error
- commands used for starting containers:
- docker logs [APLS, SPIRE, SN, SL, SWCI]:
SWOP:
2022-07-28 12:41:18,176 : swarm.swop : INFO : SL Nodes validation is started
2022-07-28 12:41:18,176 : swarm.swop : INFO : Attempting to contact API-Server at : <IP>:30304
2022-07-28 12:41:18,222 : swarm.swop : INFO : API-Server is UP!
2022-07-28 12:41:18,226 : swarm.swop : INFO : SWOPCtx :
============================================================
===== NODE UID : b22a7728-a0b0-45cb-9a9b-7b0ffb1673d8 =====
============================================================
/usr/lib/python3.8/site-packages/urllib3/connection.py:460: SubjectAltNameWarning: Certificate for <IP> has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/urllib3/urllib3/issues/497 for details.)
warnings.warn(
2022-07-28 12:51:39,838 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_pyt_build_task , opId : 9632004996807340828 - Begins
2022-07-28 12:51:42,856 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_pyt_build_task , opId : 9632004996807340828 - Ends
2022-07-28 12:51:48,884 : swarm.swop : INFO : SWOPBuildTask: Validating profile
2022-07-28 12:51:55,063 : swarm.swop : ERROR : Unable to extract container id
2022-07-28 12:51:58,078 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 1 , Current Task : user_env_pyt_build_task , opId : 9632004996807340828 Done
2022-07-28 12:52:24,177 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11113303237863304723 - Begins
2022-07-28 12:52:27,196 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11113303237863304723 - Ends
2022-07-28 12:52:30,382 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 2 , Current Task : swarm_mnist_task , opId : 11113303237863304723 Done
SWCI:
SWCI:0 > ######################################################################
SWCI:0 > # (C)Copyright 2021,2022 Hewlett Packard Enterprise Development LP
SWCI:0 > ######################################################################
SWCI:0 >
SWCI:0 > # Assumption : SWOP is already running
SWCI:0 >
SWCI:0 > # SWCI context setup
SWCI:0 > EXIT ON FAILURE
SWCI:0 > EXIT ON FAILURE IS TURNED ON
SWCI:1 > wait for <IP>
API Server is UP!
SWCI:2 > create context test-mnist <IP>
API Server is UP!
CONTEXT CREATED : test-mnist
/usr/lib/python3.8/site-packages/urllib3/connection.py:455: SubjectAltNameWarning: Certificate for <IP> has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/urllib3/urllib3/issues/497 for details.)
warnings.warn(
SWCI:3 > switch context test-mnist
DEFAULT CONTEXT SET TO : test-mnist
SWCI:4 > EXIT ON FAILURE OFF
SWCI:4 > EXIT ON FAILURE IS TURNED OFF
SWCI:5 >
SWCI:5 > #Change to the directory where we are mounting the host
SWCI:5 > cd /platform/swarm/usr
SWCI:5 > Current Directory : /platform/swarm/usr
SWCI:6 >
SWCI:6 > # Create and finalize build task
SWCI:6 > EXIT ON FAILURE
SWCI:6 > EXIT ON FAILURE IS TURNED ON
SWCI:7 > create task from taskdefs/user_env_pyt_build_task.yaml
Task definition is valid
Task Registered : user_env_pyt_build_task
Appending Task Body
batch start : 1 , len : 4 Successful
batch start : 5 , len : 4 Successful
batch start : 9 , len : 4 Successful
batch start : 13 , len : 4 Successful
batch start : 17 , len : 1 Successful
Task creation Successful
WARNING: Task should be finalized by user explicitly
SWCI:8 > finalize task user_env_pyt_build_task
Task Finalized
SWCI:9 > get task info user_env_pyt_build_task
NAME : user_env_pyt_build_task
TASKTYPE : MAKE_USER_CONTAINER
CREATETIME : 2022-07-28 12:51:12
AUTHOR : HPESwarm
CONTENTLINES : 18
PREREQ : ROOTTASK
OUTCOME : user-env-pyt1.5-swop
FINALIZED : True
SWCI:10 > get task body user_env_pyt_build_task
0000: ---
0001: BuildContext : sl-cli-lib
0002: BuildSteps :
0003: - FROM docker.io/pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime
0004: -
0005: - RUN apt-get update && apt-get install \
0006: - build-essential python3-dev python3-pip \
0007: - python3-setuptools --no-install-recommends -y
0008: -
0009: - RUN conda install pip ruamel.yaml
0010: -
0011: - RUN pip3 install --upgrade pip protobuf && pip3 install \
0012: - matplotlib opencv-python pandas sklearn future
0013: -
0014: - RUN mkdir -p /tmp/hpe-swarmcli-pkg
0015: - COPY swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
0016: - RUN pip3 install /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
0017: BuildType : INLINE
SWCI:11 > list tasks
ROOTTASK
user_env_pyt_build_task
SWCI:12 > EXIT ON FAILURE OFF
SWCI:12 > EXIT ON FAILURE IS TURNED OFF
SWCI:13 >
SWCI:13 > # Assign build task to taskrunner
SWCI:13 > EXIT ON FAILURE
SWCI:13 > EXIT ON FAILURE IS TURNED ON
SWCI:14 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
TaskRunner Reset
SWCI:15 > ASSIGN TASK user_env_pyt_build_task TO defaulttaskbb.taskdb.sml.hpe WITH 2 PEERS
Task assigned to TaskRunner
SWCI:16 > WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe
WAITING FOR TASKRUNNER TO COMPLETE
WAITING FOR TASKRUNNER TO COMPLETE
WAITING FOR TASKRUNNER TO COMPLETE
WAITING FOR TASKRUNNER TO COMPLETE
TASKRUNNER FINISHED
STATE : ERROR
TIME : 2022-07-28 12:51:55
SWCI:17 > EXIT ON FAILURE OFF
SWCI:17 > EXIT ON FAILURE IS TURNED OFF
SWCI:18 >
SWCI:18 > # Build task was already run. Now build and run swarm run tasks
SWCI:18 >
SWCI:18 > # Create and finalize swarm run task
SWCI:18 > EXIT ON FAILURE
SWCI:18 > EXIT ON FAILURE IS TURNED ON
SWCI:19 > create task from taskdefs/swarm_mnist_task.yaml
Task definition is valid
Task Registered : swarm_mnist_task
Appending Task Body
batch start : 1 , len : 4 Successful
batch start : 5 , len : 4 Successful
batch start : 9 , len : 4 Successful
batch start : 13 , len : 2 Successful
Task creation Successful
WARNING: Task should be finalized by user explicitly
SWCI:20 > finalize task swarm_mnist_task
Task Finalized
SWCI:21 > get task info swarm_mnist_task
NAME : swarm_mnist_task
TASKTYPE : RUN_SWARM
CREATETIME : 2022-07-28 12:52:00
AUTHOR : HPESwarm
CONTENTLINES : 15
PREREQ : user_env_pyt_build_task
OUTCOME : swarm_mnist_task
FINALIZED : True
SWCI:22 > get task body swarm_mnist_task
0000: ---
0001: Command : model/mnist_pyt.py
0002: Entrypoint : python3
0003: WorkingDir : /tmp/test
0004: PrivateContent : /tmp/test/data-and-scratch
0005: SharedContent :
0006: - Src : /home/smadan/git/swarm-learning/workspace/mnist-pyt/model
0007: Tgt : /tmp/test/model
0008: MType : BIND
0009: Envvars :
0010: - DATA_DIR : data-and-scratch/app-data
0011: - SCRATCH_DIR : data-and-scratch/scratch
0012: - MODEL_DIR : model
0013: - MAX_EPOCHS : 2
0014: - MIN_PEERS : 4
SWCI:23 > list tasks
ROOTTASK
user_env_pyt_build_task
swarm_mnist_task
SWCI:24 > EXIT ON FAILURE OFF
SWCI:24 > EXIT ON FAILURE IS TURNED OFF
SWCI:25 >
SWCI:25 > # Assign run task
SWCI:25 > EXIT ON FAILURE
SWCI:25 > EXIT ON FAILURE IS TURNED ON
SWCI:26 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
TaskRunner Reset
SWCI:27 > ASSIGN TASK swarm_mnist_task TO defaulttaskbb.taskdb.sml.hpe WITH 4 PEERS
Task assigned to TaskRunner
SWCI:28 > WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe
WAITING FOR TASKRUNNER TO COMPLETE
WAITING FOR TASKRUNNER TO COMPLETE
TASKRUNNER FINISHED
STATE : ERROR
TIME : 2022-07-28 12:52:29
SWCI:29 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
TaskRunner Reset
SWCI:30 > EXIT ON FAILURE OFF
SWCI:30 > EXIT ON FAILURE IS TURNED OFF
SWCI:31 >
SWCI:31 > # List and reset training contract
SWCI:31 > EXIT ON FAILURE
SWCI:31 > EXIT ON FAILURE IS TURNED ON
SWCI:32 > LIST CONTRACTS
defaultbb.cqdb.sml.hpe
SWCI:33 > RESET CONTRACT defaultbb.cqdb.sml.hpe
Contract Reset
SWCI:34 > EXIT ON FAILURE OFF
SWCI:34 > EXIT ON FAILURE IS TURNED OFF
SWCI:35 >
SWCI:35 > # Exit
SWCI:35 > EXIT
SWCI:35 > EXITING
Swarm Learning Version:
- Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )
docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sn 1.0.0 0fbeb1e14459 3 months ago 1.23 GB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swci 1.0.0 3c76a7bb4f87 3 months ago 1.07 GB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swop 1.0.0 f0d463e98f17 3 months ago 953 MB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sl 1.0.0 d1c9f233521e 3 months ago 1.2 GB
OS and ML Platform
- details of host OS:
cat /etc/centos-release
CentOS Linux release 8.5.2111
- details of ML platform used: pytorch
- details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes): 2 machines, 2 SL nodes, 2 SN nodes
Quick Checklist: Respond [Yes/No]
- APLS server web GUI shows available Licenses? Yes
- If Multiple systems are used, can each system access every other system? Yes
- Is Password-less SSH configuration setup for all the systems? Yes
- If GPU or other protected resources are used, does the account have sufficient privileges to access and use them? Yes
- Is the user id a member of the docker group? yes
Additional notes
- Are you running documented example without any modification? Almost. I modified the IPs in the SWOP profiles, added the SWARM_LOG_LEVEL=DEBUG environment variable to the run_swop script, and applied the workaround from #103.
Thanks to Yoshio Sugiyama (IMOKURI). This problem has already been resolved in #103. I solved the same problem using that solution. Please close this issue to prioritize the pending ones. Thanks.
Actually, Yoshio Sugiyama (IMOKURI) asked me to create a new issue as the workaround from #103 doesn't work for me.
I also tried on CentOS Stream 8 and could not reproduce the issue.
(I did not use the workaround from #103.)
My SWOP log
Are you using CentOS 8 instead of CentOS Stream 8?
(CentOS 8 is already EOL, so you might want to use another OS.)
What would be the result of the following command?
docker exec <Container Name of SWOP> cat /proc/self/cgroup
$ docker exec swop1 cat /proc/self/cgroup
12:hugetlb:/
11:net_cls,net_prio:/
10:rdma:/
9:pids:/user.slice/user-1361.slice/session-2653.scope
8:blkio:/system.slice/sshd.service
7:cpuset:/
6:memory:/user.slice/user-1361.slice/session-2653.scope
5:perf_event:/
4:cpu,cpuacct:/
3:devices:/user.slice
2:freezer:/
1:name=systemd:/user.slice/user-1361.slice/user@1361.service/user.slice/podman-688920.scope/29a15e1074e18656d30438dd4acffe05f7da56d90a87e356929001d856bfab34
We are actually using podman and not docker on our systems. We do have /var/run/docker.sock in the containers and I could successfully test with curl the creation of containers through the socket.
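To make the output above easier to reason about, here is a minimal Python sketch of one way a process could look up its own container ID from cgroup v1 data. It is not SWOP's actual implementation, which is not shown in this thread; the function name `extract_container_id` and the "first 64-character hex string in any cgroup path" rule are assumptions made for illustration only.

```python
import re

# A full container ID is normally a 64-character lowercase hex string.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def extract_container_id(cgroup_text):
    """Return the first 64-hex-digit ID found in any cgroup path, or None.

    `cgroup_text` is the content of /proc/self/cgroup (cgroup v1 layout).
    """
    for line in cgroup_text.splitlines():
        # Each cgroup v1 line has the form "hierarchy-ID:controllers:path".
        # Split at most twice so colons inside the path are preserved.
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        match = _CONTAINER_ID.search(parts[2])
        if match:
            return match.group(0)
    return None
```

Applied to the podman output above, this sketch finds the ID only on the `name=systemd` line; the `memory`, `pids`, and other controller lines point at the user session or at `/`. A lookup that inspects only one specific controller line, or that expects a `/docker/<id>` style path, would find nothing in this layout, which might explain the error here, though the thread does not confirm that.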
We also tried using a pull_image task followed by swarm_mnist_task. The pull_image task succeeds and the image is pulled correctly, but swarm_mnist_task still fails with "Unable to extract container id":
2022-07-29 12:35:50,762 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_pyt_build_task , opId : 11808671601640825250 - Begins
2022-07-29 12:35:53,782 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_pyt_build_task , opId : 11808671601640825250 - Ends
2022-07-29 12:35:53,798 : swarm.swop : INFO : SWOPDockerPullTask: Validating profile
2022-07-29 12:35:53,948 : swarm.swop : INFO : SWOPDockerPullTask: Profile validated
2022-07-29 12:35:56,961 : swarm.swop : INFO : SWOPDockerPullTask: Using Default login credentials
2022-07-29 12:35:59,976 : swarm.swop : INFO : SWOPDockerPullTask: Docker pull started
2022-07-29 12:36:07,994 : swarm.swop : INFO : SWOPDockerPullTask: Docker Pull Successful
2022-07-29 12:36:11,008 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 1 , Current Task : user_env_pyt_build_task , opId : 11808671601640825250 Done
2022-07-29 12:36:36,087 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11739075930308445596 - Begins
2022-07-29 12:36:39,105 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11739075930308445596 - Ends
2022-07-29 12:36:39,275 : swarm.swop : ERROR : Unable to extract container id
2022-07-29 12:36:42,289 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 2 , Current Task : swarm_mnist_task , opId : 11739075930308445596 Done
Thanks for the logs.
I think swarm learning does not work with podman at this time.
If possible, could you please install docker and try swarm learning?
(I think you can uninstall podman and buildah and install docker)
Unfortunately, all (GPU) systems in our organization are multi-user systems. Docker is not considered safe on such systems, so our IT only allows podman.
Can I do something to make the swarm-learning library compatible with podman?
Currently, Swarm Learning is not qualified on podman.
Closing this issue, as the actual issue of extracting the container ID is resolved in the latest 1.1.0 release.