secretflow/yacl

Problem when reusing a yacl::link::Context

WandQ opened this issue · 1 comments

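I create a link with yacl::link::FactoryBrpc().CreateContext(lctx_desc, rank) and pass it to the PSI library as its communication channel. After the first PSI task finishes, I leave the link open and run a second PSI task on it. If the first task succeeds, the second one also runs fine. If the first task fails partway through, the second PSI task also fails with a communication error.

The error log from the second task looks like this: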

I0923 18:34:58.588133   259     0 external/com_github_brpc_brpc/src/brpc/server.cpp:1204] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=61531.
W0923 18:34:58.588263   259     0 external/com_github_brpc_brpc/src/brpc/server.cpp:1210] Builtin services are disabled according to ServerOptions.has_builtin_services

[2024-09-23 18:35:03.601] [info] [csv_checker.cc:245] Executing script to get duplicates: LC_ALL=C tail -n +2 /var/folders/_4/gt37q6v92mqfg9zxm6w63fc00000gn/T/968c6e84-11d1-4ad9-a51e-d848be38ccf9.psi_checked | LC_ALL=C sort --parallel=8 --buffer-size=1G --stable | LC_ALL=C uniq -d > /var/folders/_4/gt37q6v92mqfg9zxm6w63fc00000gn/T/968c6e84-11d1-4ad9-a51e-d848be38ccf9.psi_checked_duplicates

[2024-09-23 18:35:06.675] [info] [csv_checker.cc:245] Executing script to get duplicates: LC_ALL=C tail -n +2 /var/folders/_4/gt37q6v92mqfg9zxm6w63fc00000gn/T/8d925c92-2153-4e78-b481-0bee88bf6f72.psi_checked | LC_ALL=C sort --parallel=8 --buffer-size=1G --stable | LC_ALL=C uniq -d > /var/folders/_4/gt37q6v92mqfg9zxm6w63fc00000gn/T/8d925c92-2153-4e78-b481-0bee88bf6f72.psi_checked_duplicates
[2024-09-23 18:35:09.701] [info] [bucket_psi.cc:43] Exception caught: [external/yacl/yacl/link/transport/channel.cc:427] Get data timeout, key=root:3:ALLGATHER
msg: Exception caught: [external/yacl/yacl/link/transport/channel.cc:427] Get data timeout, key=root:3:ALLGATHER

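I have two questions:
1. How can I make the second PSI task run successfully after the first one fails, without re-creating the link? (Creating a link takes 2 seconds each time, which hurts task performance.)
2. Is there a way to destroy the link and reconnect before starting the second task? I ran the following in order: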
auto lctx = yacl::link::FactoryBrpc().CreateContext(lctx_desc, rank);
lctx->WaitLinkTaskFinish();
auto lctx2 = yacl::link::FactoryBrpc().CreateContext(lctx_desc, rank);
and it fails with this error:
0 external/com_github_brpc_brpc/src/brpc/server.cpp:1097] Fail to listen 127.0.0.1:61530
libc++abi: terminating with uncaught exception of type yacl::IoError: [external/yacl/yacl/link/transport/brpc_link.cc:104] brpc server failed start
SIGABRT: abort
PC=0x19cd4ed78 m=0 sigcode=0
signal arrived during cgo execution

  1. A link cannot be reused after an error has occurred.
  2. After lctx->WaitLinkTaskFinish(); call lctx.reset() to release the old link.
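
To illustrate why the reset() call matters, here is a minimal, self-contained sketch. It does not use yacl at all: FakeLink is a hypothetical stand-in that "binds" a port in its constructor and frees it in its destructor, and the helper names are made up for this example. It assumes, as the lctx.reset() call above suggests, that the context is held through a smart pointer, so the listening port is only freed once the last owner lets go of the old object.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <stdexcept>

// Hypothetical stand-in for a link context that owns a listening port.
// This is NOT yacl code; it only models "constructor binds, destructor releases".
class FakeLink {
 public:
  explicit FakeLink(int port) : port_(port) {
    // insert() reports false in .second when the port is already taken.
    if (!BoundPorts().insert(port).second) {
      throw std::runtime_error("Fail to listen: port already in use");
    }
  }
  ~FakeLink() { BoundPorts().erase(port_); }
  FakeLink(const FakeLink&) = delete;
  FakeLink& operator=(const FakeLink&) = delete;

 private:
  // Process-wide registry of ports that are currently "bound".
  static std::set<int>& BoundPorts() {
    static std::set<int> ports;
    return ports;
  }
  int port_;
};

// Creates a second link on the same port while the first is still alive.
// Returns true if that succeeds.
bool CanCreateWhileOldAlive(int port) {
  auto first = std::make_shared<FakeLink>(port);
  try {
    auto second = std::make_shared<FakeLink>(port);
    return true;
  } catch (const std::runtime_error&) {
    return false;  // the old link still holds the port
  }
}

// Releases the first link with reset() before creating the second one.
// Returns true if the second creation succeeds.
bool CanCreateAfterReset(int port) {
  auto first = std::make_shared<FakeLink>(port);
  first.reset();  // last owner gone: destructor runs and frees the port
  try {
    auto second = std::make_shared<FakeLink>(port);
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

In this model, CanCreateWhileOldAlive(61530) fails the same way the second CreateContext call does in the log, while CanCreateAfterReset(61530) succeeds. Note that std::shared_ptr::reset() only destroys the object when no other copy of the pointer exists, so if another component still holds a copy of the old context, the port would presumably stay bound until that copy is dropped as well.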