HewlettPackard/swarm-learning

Error: Unable to extract container id (with cgroup v1 on CentOS 8)

maestro4 opened this issue · 8 comments

Issue description

  • issue description: The SWCI node reports task errors while running the MNIST PYT example, and SWOP reports "Unable to extract container id". Unfortunately, the container is not built. I can successfully build the container with curl from inside the swop1 container, using docker.socket and a Dockerfile that I created from the example https://github.com/HewlettPackard/swarm-learning/blob/master/examples/mnist-pyt/swci/taskdefs/user_env_pyt_build_task.yaml .
  • occurrence: consistent
  • error messages: SWOP: "Unable to extract container id" and SWCI: Taskrunner state error
  • commands used for starting containers:
  • docker logs [APLS, SPIRE, SN, SL, SWCI]:
    SWOP:
2022-07-28 12:41:18,176 : swarm.swop : INFO : SL Nodes validation is started
2022-07-28 12:41:18,176 : swarm.swop : INFO : Attempting to contact API-Server at : <IP>:30304
2022-07-28 12:41:18,222 : swarm.swop : INFO : API-Server is UP!
2022-07-28 12:41:18,226 : swarm.swop : INFO : SWOPCtx :
============================================================
===== NODE UID :  b22a7728-a0b0-45cb-9a9b-7b0ffb1673d8 =====
============================================================

/usr/lib/python3.8/site-packages/urllib3/connection.py:460: SubjectAltNameWarning: Certificate for <IP> has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/urllib3/urllib3/issues/497 for details.)
  warnings.warn(
2022-07-28 12:51:39,838 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_pyt_build_task , opId : 9632004996807340828 - Begins
2022-07-28 12:51:42,856 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_pyt_build_task , opId : 9632004996807340828 - Ends
2022-07-28 12:51:48,884 : swarm.swop : INFO : SWOPBuildTask: Validating profile
2022-07-28 12:51:55,063 : swarm.swop : ERROR : Unable to extract container id
2022-07-28 12:51:58,078 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 1 , Current Task : user_env_pyt_build_task , opId : 9632004996807340828 Done
2022-07-28 12:52:24,177 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11113303237863304723 - Begins
2022-07-28 12:52:27,196 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11113303237863304723 - Ends
2022-07-28 12:52:30,382 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 2 , Current Task : swarm_mnist_task , opId : 11113303237863304723 Done

SWCI:

SWCI:0 > ######################################################################
SWCI:0 > # (C)Copyright 2021,2022 Hewlett Packard Enterprise Development LP
SWCI:0 > ######################################################################
SWCI:0 >
SWCI:0 > # Assumption : SWOP is already running
SWCI:0 >
SWCI:0 > # SWCI context setup
SWCI:0 > EXIT ON FAILURE
SWCI:0 > EXIT ON FAILURE IS TURNED ON
SWCI:1 > wait for <IP>
    API Server is UP!
SWCI:2 > create context test-mnist <IP>
    API Server is UP!
    CONTEXT CREATED : test-mnist
/usr/lib/python3.8/site-packages/urllib3/connection.py:455: SubjectAltNameWarning: Certificate for <IP> has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/urllib3/urllib3/issues/497 for details.)
  warnings.warn(
SWCI:3 > switch context test-mnist
    DEFAULT CONTEXT SET TO : test-mnist
SWCI:4 > EXIT ON FAILURE OFF
SWCI:4 > EXIT ON FAILURE IS TURNED OFF
SWCI:5 >
SWCI:5 > #Change to the directory where we are mounting the host
SWCI:5 > cd /platform/swarm/usr
SWCI:5 > Current Directory : /platform/swarm/usr
SWCI:6 >
SWCI:6 > # Create and finalize build task
SWCI:6 > EXIT ON FAILURE
SWCI:6 > EXIT ON FAILURE IS TURNED ON
SWCI:7 > create task from taskdefs/user_env_pyt_build_task.yaml
    Task definition is valid
    Task Registered : user_env_pyt_build_task
    Appending Task Body
    batch start : 1 , len : 4 Successful
    batch start : 5 , len : 4 Successful
    batch start : 9 , len : 4 Successful
    batch start : 13 , len : 4 Successful
    batch start : 17 , len : 1 Successful
    Task creation Successful
    WARNING: Task should be finalized by user explicitly
SWCI:8 > finalize task user_env_pyt_build_task
    Task Finalized
SWCI:9 > get task info user_env_pyt_build_task
    NAME         : user_env_pyt_build_task
    TASKTYPE     : MAKE_USER_CONTAINER
    CREATETIME   : 2022-07-28 12:51:12
    AUTHOR       : HPESwarm
    CONTENTLINES : 18
    PREREQ       : ROOTTASK
    OUTCOME      : user-env-pyt1.5-swop
    FINALIZED    : True
SWCI:10 > get task body user_env_pyt_build_task
    0000: ---
    0001: BuildContext : sl-cli-lib
    0002: BuildSteps   :
    0003:     - FROM docker.io/pytorch/pytorch:1.5-cuda10.1-cudnn7-runtime
    0004:     -
    0005:     - RUN apt-get update && apt-get install           \
    0006:     -    build-essential python3-dev python3-pip     \
    0007:     -    python3-setuptools --no-install-recommends -y
    0008:     -
    0009:     - RUN conda install pip ruamel.yaml
    0010:     -
    0011:     - RUN pip3 install --upgrade pip protobuf && pip3 install \
    0012:     -    matplotlib opencv-python pandas sklearn future
    0013:     -
    0014:     - RUN mkdir -p /tmp/hpe-swarmcli-pkg
    0015:     - COPY swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
    0016:     - RUN pip3 install /tmp/hpe-swarmcli-pkg/swarmlearning-client-py3-none-manylinux_2_24_x86_64.whl
    0017: BuildType : INLINE
SWCI:11 > list tasks
    ROOTTASK
    user_env_pyt_build_task
SWCI:12 > EXIT ON FAILURE OFF
SWCI:12 > EXIT ON FAILURE IS TURNED OFF
SWCI:13 >
SWCI:13 > # Assign build task to taskrunner
SWCI:13 > EXIT ON FAILURE
SWCI:13 > EXIT ON FAILURE IS TURNED ON
SWCI:14 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
    TaskRunner Reset
SWCI:15 > ASSIGN TASK user_env_pyt_build_task TO defaulttaskbb.taskdb.sml.hpe WITH 2 PEERS
    Task assigned to TaskRunner
SWCI:16 > WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe
    WAITING FOR TASKRUNNER TO COMPLETE
    WAITING FOR TASKRUNNER TO COMPLETE
    WAITING FOR TASKRUNNER TO COMPLETE
    WAITING FOR TASKRUNNER TO COMPLETE
    TASKRUNNER FINISHED
      STATE : ERROR
      TIME  : 2022-07-28 12:51:55
SWCI:17 > EXIT ON FAILURE OFF
SWCI:17 > EXIT ON FAILURE IS TURNED OFF
SWCI:18 >
SWCI:18 > # Build task was already run. Now build and run swarm run tasks
SWCI:18 >
SWCI:18 > # Create and finalize swarm run task
SWCI:18 > EXIT ON FAILURE
SWCI:18 > EXIT ON FAILURE IS TURNED ON
SWCI:19 > create task from taskdefs/swarm_mnist_task.yaml
    Task definition is valid
    Task Registered : swarm_mnist_task
    Appending Task Body
    batch start : 1 , len : 4 Successful
    batch start : 5 , len : 4 Successful
    batch start : 9 , len : 4 Successful
    batch start : 13 , len : 2 Successful
    Task creation Successful
    WARNING: Task should be finalized by user explicitly
SWCI:20 > finalize task swarm_mnist_task
    Task Finalized
SWCI:21 > get task info swarm_mnist_task
    NAME         : swarm_mnist_task
    TASKTYPE     : RUN_SWARM
    CREATETIME   : 2022-07-28 12:52:00
    AUTHOR       : HPESwarm
    CONTENTLINES : 15
    PREREQ       : user_env_pyt_build_task
    OUTCOME      : swarm_mnist_task
    FINALIZED    : True
SWCI:22 > get task body swarm_mnist_task
    0000: ---
    0001: Command : model/mnist_pyt.py
    0002: Entrypoint : python3
    0003: WorkingDir : /tmp/test
    0004: PrivateContent : /tmp/test/data-and-scratch
    0005: SharedContent :
    0006:   - Src   : /home/smadan/git/swarm-learning/workspace/mnist-pyt/model
    0007:     Tgt   : /tmp/test/model
    0008:     MType : BIND
    0009: Envvars :
    0010:   - DATA_DIR : data-and-scratch/app-data
    0011:   - SCRATCH_DIR : data-and-scratch/scratch
    0012:   - MODEL_DIR : model
    0013:   - MAX_EPOCHS : 2
    0014:   - MIN_PEERS : 4
SWCI:23 > list tasks
    ROOTTASK
    user_env_pyt_build_task
    swarm_mnist_task
SWCI:24 > EXIT ON FAILURE OFF
SWCI:24 > EXIT ON FAILURE IS TURNED OFF
SWCI:25 >
SWCI:25 > # Assign run task
SWCI:25 > EXIT ON FAILURE
SWCI:25 > EXIT ON FAILURE IS TURNED ON
SWCI:26 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
    TaskRunner Reset
SWCI:27 > ASSIGN TASK swarm_mnist_task TO defaulttaskbb.taskdb.sml.hpe WITH 4 PEERS
    Task assigned to TaskRunner
SWCI:28 > WAIT FOR TASKRUNNER defaulttaskbb.taskdb.sml.hpe
    WAITING FOR TASKRUNNER TO COMPLETE
    WAITING FOR TASKRUNNER TO COMPLETE
    TASKRUNNER FINISHED
      STATE : ERROR
      TIME  : 2022-07-28 12:52:29
SWCI:29 > RESET TASKRUNNER defaulttaskbb.taskdb.sml.hpe
    TaskRunner Reset
SWCI:30 > EXIT ON FAILURE OFF
SWCI:30 > EXIT ON FAILURE IS TURNED OFF
SWCI:31 >
SWCI:31 > # List and reset training contract
SWCI:31 > EXIT ON FAILURE
SWCI:31 > EXIT ON FAILURE IS TURNED ON
SWCI:32 > LIST CONTRACTS
    defaultbb.cqdb.sml.hpe
SWCI:33 > RESET CONTRACT defaultbb.cqdb.sml.hpe
    Contract Reset
SWCI:34 > EXIT ON FAILURE OFF
SWCI:34 > EXIT ON FAILURE IS TURNED OFF
SWCI:35 >
SWCI:35 > # Exit
SWCI:35 > EXIT
SWCI:35 > EXITING

Swarm Learning Version:

  • Find the docker tag of the Swarm images ( $ docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning )
docker images | grep hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sn    1.0.0       0fbeb1e14459  3 months ago  1.23 GB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swci  1.0.0       3c76a7bb4f87  3 months ago  1.07 GB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/swop  1.0.0       f0d463e98f17  3 months ago  953 MB
hub.myenterpriselicense.hpe.com/hpe_eval/swarm-learning/sl    1.0.0       d1c9f233521e  3 months ago  1.2 GB

OS and ML Platform

  • details of host OS:
cat /etc/centos-release
CentOS Linux release 8.5.2111
  • details of ML platform used: pytorch
  • details of Swarm learning Cluster (Number of machines, SL nodes, SN nodes): 2 machines, 2 SL nodes, 2 SN nodes

Quick Checklist: Respond [Yes/No]

  • APLS server web GUI shows available Licenses? Yes
  • If Multiple systems are used, can each system access every other system? Yes
  • Is Password-less SSH configuration setup for all the systems? Yes
  • If GPU or other protected resources are used, does the account have sufficient privileges to access and use them? Yes
  • Is the user id a member of the docker group? yes

Additional notes

  • Are you running the documented example without any modification? Almost: I additionally modified the IPs in the SWOP profiles, added the SWARM_LOG_LEVEL=DEBUG environment variable to the run_swop script, and used the workaround from #103.

Thanks to Yoshio Sugiyama (IMOKURI). This problem has already been resolved in #103 . I solved my problem using that solution. Please close this issue to prioritize the pending one. Thanks.

Actually, Yoshio Sugiyama (IMOKURI) asked me to create a new issue, as the workaround from #103 does not work for me.

I also tried on CentOS Stream 8 and could not reproduce the issue.
(I did not use #103 work around.)

My SWOP log

[screenshot of SWOP log]

Are you using CentOS 8 instead of CentOS Stream 8?
(CentOS 8 is already EOL, so you might want to use another OS.)

What would be the result of the following command?

docker exec <Container Name of SWOP> cat /proc/self/cgroup 
$ docker exec swop1 cat /proc/self/cgroup
12:hugetlb:/
11:net_cls,net_prio:/
10:rdma:/
9:pids:/user.slice/user-1361.slice/session-2653.scope
8:blkio:/system.slice/sshd.service
7:cpuset:/
6:memory:/user.slice/user-1361.slice/session-2653.scope
5:perf_event:/
4:cpu,cpuacct:/
3:devices:/user.slice
2:freezer:/
1:name=systemd:/user.slice/user-1361.slice/user@1361.service/user.slice/podman-688920.scope/29a15e1074e18656d30438dd4acffe05f7da56d90a87e356929001d856bfab34
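A hypothetical sketch of why extraction might fail here (SWOP's real parsing code is not public): a docker-style lookup that expects the container id under a `docker/` path segment finds nothing in the podman-style scope above, while a plain 64-hex-digit match does. The patterns below are assumptions for illustration, not SWOP's actual logic.

```shell
# The podman-style cgroup line from the output above:
CGROUP_LINE='1:name=systemd:/user.slice/user-1361.slice/user@1361.service/user.slice/podman-688920.scope/29a15e1074e18656d30438dd4acffe05f7da56d90a87e356929001d856bfab34'

# Hypothetical docker-style extraction: id expected under a "docker/" segment.
DOCKER_ID=$(echo "$CGROUP_LINE" | grep -oE 'docker[/-][0-9a-f]{64}' | grep -oE '[0-9a-f]{64}')

# Broader extraction: any 64-hex-digit token, which also matches podman's layout.
GENERIC_ID=$(echo "$CGROUP_LINE" | grep -oE '[0-9a-f]{64}')

echo "docker-style match:  '${DOCKER_ID}'"   # empty -> "Unable to extract container id"
echo "generic match:       '${GENERIC_ID}'"  # the 64-hex podman container id
```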

We are actually using podman, not docker, on our systems. We do have /var/run/docker.sock in the containers, and I could successfully test creating containers through the socket with curl.
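For reference, the curl-based build test mentioned above can be sketched as follows, using the Docker Engine API over the unix socket. The API version (v1.41), the tag, and the context.tar name are assumptions; adjust them to your setup.

```shell
# Build through the Docker Engine API over the unix socket (sketch).
SOCK=/var/run/docker.sock
TAG=user-env-pyt1.5-swop
BUILD_URL="http://localhost/v1.41/build?t=${TAG}"

# The build context (Dockerfile plus the Swarm client wheel) must be tarred first:
#   tar -cf context.tar Dockerfile swarmlearning-client-*.whl
if [ -S "$SOCK" ]; then
    curl --unix-socket "$SOCK" -X POST \
         -H 'Content-Type: application/x-tar' \
         --data-binary @context.tar "$BUILD_URL"
fi
```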

We have also tried to use a pull_image task with swarm_mnist_task. The pull_image task succeeds, but swarm_mnist_task fails with "Unable to extract container id", even though the image is pulled correctly:

2022-07-29 12:35:50,762 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_pyt_build_task , opId : 11808671601640825250 - Begins
2022-07-29 12:35:53,782 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : user_env_pyt_build_task , opId : 11808671601640825250 - Ends
2022-07-29 12:35:53,798 : swarm.swop : INFO : SWOPDockerPullTask: Validating profile
2022-07-29 12:35:53,948 : swarm.swop : INFO : SWOPDockerPullTask: Profile validated
2022-07-29 12:35:56,961 : swarm.swop : INFO : SWOPDockerPullTask: Using Default login credentials
2022-07-29 12:35:59,976 : swarm.swop : INFO : SWOPDockerPullTask: Docker pull started
2022-07-29 12:36:07,994 : swarm.swop : INFO : SWOPDockerPullTask: Docker Pull Successful
2022-07-29 12:36:11,008 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 1 , Current Task : user_env_pyt_build_task , opId : 11808671601640825250 Done
2022-07-29 12:36:36,087 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11739075930308445596 - Begins
2022-07-29 12:36:39,105 : swarm.swop : INFO : SWOPExecutor: ENROLL TO Task : swarm_mnist_task , opId : 11739075930308445596 - Ends
2022-07-29 12:36:39,275 : swarm.swop : ERROR : Unable to extract container id
2022-07-29 12:36:42,289 : swarm.swop : INFO : SWOPExecutor: Total Tasks: 2 , Current Task : swarm_mnist_task , opId : 11739075930308445596 Done

Thanks for the logs.

I think swarm learning does not work with podman at this time.

If possible, could you please install docker and try swarm learning?
(I think you can uninstall podman and buildah and install docker)

Unfortunately, in our organization all (GPU) systems are multi-user. Docker is not considered safe on these systems, so our IT only allows podman.

Can I do something to make the swarm-learning library compatible with podman?
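One possibly relevant avenue (an untested sketch; whether Swarm Learning accepts it is unverified): rootless podman can serve a Docker-compatible API socket, and Docker clients can be pointed at it via DOCKER_HOST. The paths assume a typical systemd user session.

```shell
# Enable podman's Docker-compatible API socket for the current user
# (ignored here if the systemd user session is unavailable).
systemctl --user enable --now podman.socket 2>/dev/null || true

# Point Docker clients at the podman socket.
export DOCKER_HOST="unix:///run/user/$(id -u)/podman/podman.sock"
echo "$DOCKER_HOST"
```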

Currently, Swarm Learning is not qualified on podman.

Closing this issue, as the actual issue of extracting the container ID is resolved in the latest 1.1.0 release.