Problem creating an SPU device under WSL: Ubuntu 20.04
integrationex01 opened this issue · 4 comments
Environment: WSL: Ubuntu 20.04
Version: SecretFlow: 1.1.0b0
Code executed: spu_device = sf.SPU(aby3_config) (docs: Tutorial: SPU Basics)
Log output:
(SPURuntime pid=19246) 2023-09-20 14:00:38.165 [info] [default_brpc_retry_policy.cc:DoRetry:52] socket error, sleep=1000000us and retry
(SPURuntime pid=19245) 2023-09-20 14:00:38.165 [info] [default_brpc_retry_policy.cc:DoRetry:52] socket error, sleep=1000000us and retry
(SPURuntime pid=19246) 2023-09-20 14:00:39.165 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=1 addr=127.0.0.1:56231} (0x0x4a73900): Connection refused [R1][E112]Not connected to 127.0.0.1:56231 yet, server_id=1'
(SPURuntime pid=19246) 2023-09-20 14:00:39.165 [info] [default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry
(SPURuntime pid=19245) 2023-09-20 14:00:39.165 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=1 addr=127.0.0.1:56231} (0x0x4cdfd00): Connection refused [R1][E112]Not connected to 127.0.0.1:56231 yet, server_id=1'
(SPURuntime pid=19245) 2023-09-20 14:00:39.165 [info] [default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry
(SPURuntime pid=19246) 2023-09-20 14:00:40.166 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=1 addr=127.0.0.1:56231} (0x0x4a73900): Connection refused [R1][E112]Not connected to 127.0.0.1:56231 yet, server_id=1 [R2][E112]Not connected to 127.0.0.1:56231 yet, server_id=1'
(SPURuntime pid=19246) 2023-09-20 14:00:40.166 [info] [default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry
(SPURuntime pid=19245) 2023-09-20 14:00:40.166 [info] [default_brpc_retry_policy.cc:LogHttpDetail:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=1 addr=127.0.0.1:56231} (0x0x4cdfd00): Connection refused [R1][E112]Not connected to 127.0.0.1:56231 yet, server_id=1 [R2][E112]Not connected to 127.0.0.1:56231 yet, server_id=1'
(SPURuntime pid=19245) 2023-09-20 14:00:40.166 [info] [default_brpc_retry_policy.cc:DoRetry:75] aggressive retry, sleep=1000000us and retry
(SPURuntime pid=19248) 2023-09-20 14:00:40.214 [info] [default_brpc_retry_policy.cc:DoRetry:69] not retry for reached rcp timeout, ErrorCode '1008', error msg '[E1008]Reached timeout=2000ms @127.0.0.1:34335'
I can't tell which part is at fault: is this a WSL networking problem or something else? Any help would be appreciated, thanks.
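To help narrow this down, output like the above can be summarized by log level and by the address that refused the connection. The following is only a minimal sketch using the Python standard library; the regular expressions and the `summarize` helper are assumptions based on the lines pasted here, not part of SecretFlow or brpc.

```python
import re
from collections import Counter

# Assumed line layout, inferred from the pasted log:
# (SPURuntime pid=N) DATE TIME [level] [file:function:line] message
LINE_RE = re.compile(
    r"^\(SPURuntime pid=(?P<pid>\d+)\)\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} [\d:.]+)\s+"
    r"\[(?P<level>\w+)\]\s+"
    r"\[(?P<source>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)

# Picks out the target address of a refused connection inside the message.
REFUSED_RE = re.compile(r"addr=(?P<addr>[\d.]+:\d+)\}.*Connection refused")


def summarize(lines):
    """Count log levels and refused target addresses in the given lines."""
    levels = Counter()
    refused = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that does not look like a runtime log line
        levels[match.group("level")] += 1
        hit = REFUSED_RE.search(match.group("message"))
        if hit:
            refused[hit.group("addr")] += 1
    return levels, refused
```

Feeding the captured log lines to `summarize` returns two counters: one keyed by level (for example `info`) and one keyed by the `host:port` that refused connections.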
This is probably not a WSL networking problem.
Could you first check whether the ports in the config are already in use? Thx~
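For reference, the check can also be done from Python without extra tools. This is a minimal sketch using only the standard `socket` module; `parse_address` and `is_port_listening` are hypothetical helper names, and the addresses passed in would come from whatever the cluster config contains.

```python
import socket
from typing import Tuple


def parse_address(address: str) -> Tuple[str, int]:
    """Split a 'host:port' string into (host, port)."""
    host, _, port = address.rpartition(":")
    return host, int(port)


def is_port_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

Calling `is_port_listening(*parse_address(node["address"]))` for each node before creating the SPU device shows whether another process already occupies that port. Note that once the SPU runtimes are up, the same check is expected to return True because the runtimes themselves are listening.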
This is the config used at runtime:
Code: aby3_config = sf.utils.testing.cluster_def(parties=["alice", "bob", "carol"])
aby3_config nodes: [{'party': 'alice', 'address': '127.0.0.1:55661'}, {'party': 'bob', 'address': '127.0.0.1:41041'}, {'party': 'carol', 'address': '127.0.0.1:42505'}]
This is what the corresponding ports look like:
(sf) :$ sudo netstat tnulp | grep 55661
tcp 0 0 localhost:55661 localhost:44110 ESTABLISHED
tcp 0 0 localhost:44112 localhost:55661 ESTABLISHED
tcp 0 0 localhost:55661 localhost:44112 ESTABLISHED
tcp 0 0 localhost:44110 localhost:55661 ESTABLISHED

(sf) :$ sudo netstat tnulp | grep 41041
tcp 0 0 localhost:40610 localhost:41041 TIME_WAIT
tcp 0 0 localhost:41041 localhost:40616 ESTABLISHED
tcp 0 0 localhost:40614 localhost:41041 ESTABLISHED
tcp 0 0 localhost:41041 localhost:40614 ESTABLISHED
tcp 0 0 localhost:40616 localhost:41041 ESTABLISHED

(sf) :$ sudo netstat tnulp | grep 42505
tcp 0 0 localhost:42505 localhost:52662 ESTABLISHED
tcp 0 0 localhost:52662 localhost:42505 ESTABLISHED
tcp 0 0 localhost:52668 localhost:42505 ESTABLISHED
tcp 0 0 localhost:42505 localhost:52668 ESTABLISHED

(sf) :$ sudo lsof -i:55661
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ray::SPUR 8602 30u IPv4 37020 0t0 TCP localhost:55661 (LISTEN)
ray::SPUR 8602 54u IPv4 38036 0t0 TCP localhost:55661->localhost:44110 (ESTABLISHED)
ray::SPUR 8602 57u IPv4 38037 0t0 TCP localhost:55661->localhost:44112 (ESTABLISHED)
ray::SPUR 8604 51u IPv4 36202 0t0 TCP localhost:44110->localhost:55661 (ESTABLISHED)
ray::SPUR 8613 51u IPv4 35103 0t0 TCP localhost:44112->localhost:55661 (ESTABLISHED)

(sf) :$ sudo lsof -i:42505
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ray::SPUR 8602 65u IPv4 37056 0t0 TCP localhost:52668->localhost:42505 (ESTABLISHED)
ray::SPUR 8604 61u IPv4 39155 0t0 TCP localhost:52662->localhost:42505 (ESTABLISHED)
ray::SPUR 8613 30u IPv4 35099 0t0 TCP localhost:42505 (LISTEN)
ray::SPUR 8613 61u IPv4 27402 0t0 TCP localhost:42505->localhost:52662 (ESTABLISHED)
ray::SPUR 8613 65u IPv4 40171 0t0 TCP localhost:42505->localhost:52668 (ESTABLISHED)

(sf) :$ sudo lsof -i:41041
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ray::SPUR 8602 64u IPv4 37055 0t0 TCP localhost:40616->localhost:41041 (ESTABLISHED)
ray::SPUR 8604 30u IPv4 36198 0t0 TCP localhost:41041 (LISTEN)
ray::SPUR 8604 64u IPv4 38039 0t0 TCP localhost:41041->localhost:40614 (ESTABLISHED)
ray::SPUR 8604 65u IPv4 38041 0t0 TCP localhost:41041->localhost:40616 (ESTABLISHED)
ray::SPUR 8613 64u IPv4 40163 0t0 TCP localhost:40614->localhost:41041 (ESTABLISHED)
I just noticed there is a problem with the project this issue was filed under (it is yacl, not secretflow). Sorry about that.
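These log lines are at info level, not error. The ports of the compute nodes come up at slightly different times, so messages like these can appear during the first few seconds after startup; they do not indicate a failure. Just continue with the tutorial.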
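The startup timing described above can be illustrated with a small retry loop: a peer that is not yet listening refuses connections at first, then accepts them once its port is up. This is only a sketch using the standard library, not how SPU or brpc implement their retry policy; `wait_for_port` and its parameters are hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0,
                  interval: float = 0.5) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A refused connection raises an OSError subclass; retry after a pause.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

If `wait_for_port` returns True for every address in the cluster config within a few seconds, the early "Connection refused" messages were most likely just the peers starting at slightly different times.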