secretflow/yacl

Problem when testing yacl communication on real machines

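I adapted part of factory_test.cc to run on two real machines, and something seems wrong.
Environment: both machines run Ubuntu, with addresses 172.18.0.2 and 172.18.0.3, assigned ranks 0 and 1 respectively.
The machine with rank 0 runs the following code: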

//Mytest.cpp
#include <iostream>
#include <string>
#include <vector>
#include <map>
#include <memory>
#include <type_traits>
#include <variant>
#include <unistd.h>
#include <future>
#include <limits>
#include "fmt/format.h"
#include "gtest/gtest.h"
#include "yacl/link/context.h"
#include "yacl/link/link.h"
#include "yacl/link/factory.h"

class FactoryTest{
 public:
  FactoryTest()
  {
    static int desc_count = 0;
    contexts_.resize(2);
    yacl::link::ContextDesc desc;
    desc.id = fmt::format("world_{}", desc_count++);
    desc.brpc_retry_count = 20;
    desc.parties.push_back(yacl::link::ContextDesc::Party("alice", "172.18.0.2:63927"));
    desc.parties.push_back(yacl::link::ContextDesc::Party("bob", "172.18.0.3:63921"));
    auto create_brpc = [&](int self_rank) {
      contexts_[self_rank] = yacl::link::FactoryBrpc().CreateContext(desc, self_rank);
    };
    std::vector<std::future<void>> creates;
    creates.push_back(std::async(create_brpc, 0));
    for (auto& f : creates) {
      f.get();
    }
    std::cout << "Connect to Bob successfully\n";
  }

  void work()
  {
    auto test = [&](int self_rank)
    {
      int dst_rank = 1 - self_rank;
      this->contexts_[self_rank]->SendAsync(dst_rank, "Hello I am 0", "test");
      yacl::Buffer r = this->contexts_[self_rank]->Recv(dst_rank, "test");
      std::string r_str(r.data<const char>(), r.size());
      std::cout << self_rank << " Receive "  << r_str << '\n';
    };
    std::vector<std::future<void>> tests;
    tests.push_back(std::async(test, 0));
    for (auto& f : tests) {
      f.get();
    }
  }

  ~FactoryTest()
  {
    auto wait = [&](int self_rank) {
      contexts_[self_rank]->WaitLinkTaskFinish();
    };
    std::vector<std::future<void>> waits;
    waits.push_back(std::async(wait, 0));
    for (auto& f : waits) {
      f.get();
    }
  }
  std::vector<std::shared_ptr<yacl::link::Context>> contexts_;
};

int main() {
  FactoryTest F;
  sleep(2);
  F.work();
  return 0;
}

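The code on machine 1 differs mainly in the value of self_rank. Because the programs are started by hand, the two machines may start a few seconds apart; I start the program on machine 1 first, then the one on machine 0. The code above runs without problems. Machine 0 prints: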

0 Receive Hello I am 1

Machine 1 prints:

1 Receive Hello I am 0

However, if the sleep(2) statement is removed, the test fails with the following errors. Machine 0 reports:

I0924 02:51:37.530009 1192314 /repository/brpc-1.6.0/src/brpc/server.cpp:1127] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=63927.
Connect to Bob successfully
I0924 02:51:56.632742 1192407 /repository/brpc-1.6.0/src/brpc/socket.cpp:2465] Checking Socket{id=0 addr=172.18.0.3:63921} (0x7fbacc067020)
terminate called after throwing an instance of 'yacl::IoError'
what(): [/repository/yacl/yacl/link/transport/channel.cc:351] Get data timeout, key=world_0:P2P-1:1->0
Stacktrace:
#0 yacl::link::transport::Channel::Recv()+0x4d68b8

Aborted (core dumped)

Machine 1 reports:


[2023-09-24 02:51:55.515] [info] [default_brpc_retry_policy.cc:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=0 addr=172.18.0.2:63927} (0x0x7f8a34067000): Connection refused [R1][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R2][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R3][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R4][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R5][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R6][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R7][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R8][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R9][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R10][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R11][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R12][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R13][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R14][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R15][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R16][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R17][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R18][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R19][E112]Not connected to 172.18.0.2:63927 yet, server_id=0'
[2023-09-24 02:51:55.515] [info] [default_brpc_retry_policy.cc:75] aggressive retry, sleep=1000000us and retry
I0924 02:51:56.516082 769 /repository/brpc-1.6.0/src/brpc/socket.cpp:2465] Checking Socket{id=0 addr=172.18.0.2:63927} (0x7f8a34067000)
1 Receive Hello I am 0
I0924 02:51:56.516975 695 /repository/brpc-1.6.0/src/brpc/socket.cpp:2525] Revived Socket{id=0 addr=172.18.0.2:63927} (0x7f8a34067000) (Connectable)
[2023-09-24 02:51:56.522] [error] [channel.cc:98] SendImpl error [/repository/yacl/yacl/link/transport/brpc_link.cc:187] send, rpc failed=112, message=[E111]Fail to connect Socket{id=0 addr=172.18.0.2:63927}
(0x0x7f8a34067000): Connection refused [R1][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R2][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R3][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R4][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R5][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R6][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R7][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R8][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R9][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R10][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R11][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R12][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R13][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R14][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R15][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R16][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R17][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R18][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R19][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R20][E112]Not connected to 172.18.0.2:63927 yet, server_id=0
Stacktrace:
#0 yacl::link::transport::BrpcLink::SendRequest()+0x4cb5cf
#1 (unknown)+0x7f8a34002da0

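Some [info] entries are omitted above. Machine 1 did print "1 Receive Hello I am 0", but machine 0 apparently never received a message. I am sure that the program on machine 0 was started within 5 seconds of the one on machine 1.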

To make sure the contexts on both sides are connected, call the context's ConnectToMesh method before calling SendAsync, for example:

    auto create_brpc = [&](int self_rank) {
      contexts_[self_rank] = yacl::link::FactoryBrpc().CreateContext(desc, self_rank);
      contexts_[self_rank]->ConnectToMesh();
    };

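With this, sleep(2) is no longer needed.
The error occurs because the peer's socket has not been set up yet when SendAsync is called; there is always some timing gap between the two sides.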

contexts_[self_rank]->ConnectToMesh(); ensures that both parties have their services up.
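The idea of waiting for the peer before the first send can be illustrated without yacl. The following is only a stand-alone sketch: `WaitForPeer` and `MakeLateStartingPeer` are hypothetical names, the probe is a plain callback rather than a network check, and nothing here claims to reproduce how ConnectToMesh is implemented.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Polls `peer_is_ready` until it returns true or `timeout` elapses.
// Returns true if the peer became ready in time, false otherwise.
// (Hypothetical helper; in a real setup the probe would be a network check.)
bool WaitForPeer(const std::function<bool()>& peer_is_ready,
                 std::chrono::milliseconds timeout,
                 std::chrono::milliseconds poll_interval) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!peer_is_ready()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // Give up: the peer never came up within the timeout.
    }
    std::this_thread::sleep_for(poll_interval);
  }
  return true;
}

// Hypothetical probe that simulates a peer started a little later:
// it answers "not ready" for the first `failures` calls, then "ready".
std::function<bool()> MakeLateStartingPeer(int failures) {
  int remaining = failures;
  return [remaining]() mutable { return remaining-- <= 0; };
}
```

A caller would run `WaitForPeer(...)` once after creating its context and only send when it returns true, which replaces a fixed sleep(2) with an explicit readiness condition.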

Hi, although ConnectToMesh works around this problem, the root cause has been confirmed to be a bug in brpc's retry mechanism: apache/brpc#2395 (comment)
@maths644311798

@huocun-ant
Yacl seems to use brpc-1.6.0, but PR apache/brpc#2419 (comment) changes master. What is Yacl's plan going forward?

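  1. yacl is considering adding its own retry strategy that is independent of the brpc retry policy, which would be more flexible and controllable.
  2. The brpc version will probably be upgraded later as well, depending on whether it turns out to be necessary.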
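As a rough illustration of what an application-level retry strategy decoupled from the RPC framework could look like, here is a minimal sketch. It is not yacl's design: `RetryOptions`, `RetryWithBackoff`, the default values and the injected `sleep_fn` are all assumptions made for this example.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>

// Hypothetical options; none of these names come from yacl or brpc.
struct RetryOptions {
  int max_attempts = 3;                              // Total tries, including the first.
  std::chrono::milliseconds initial_interval{1000};  // Wait after the first failure.
  std::chrono::milliseconds max_interval{8000};      // Upper bound for the wait.
  int multiplier = 2;                                // Growth factor between waits.
};

// Calls `op` until it returns true or `max_attempts` is reached.
// Between failed attempts it calls `sleep_fn` with a growing, capped interval.
// Returns the number of attempts used on success, or 0 if every attempt failed.
int RetryWithBackoff(
    const std::function<bool()>& op, const RetryOptions& opts,
    const std::function<void(std::chrono::milliseconds)>& sleep_fn) {
  std::chrono::milliseconds interval = opts.initial_interval;
  for (int attempt = 1; attempt <= opts.max_attempts; ++attempt) {
    if (op()) {
      return attempt;
    }
    if (attempt == opts.max_attempts) {
      break;  // No point in sleeping after the final failure.
    }
    sleep_fn(interval);
    interval = std::min(interval * opts.multiplier, opts.max_interval);
  }
  return 0;
}
```

Passing the sleep function in as a parameter keeps the policy itself free of any transport or timing dependency, so a caller can plug in `std::this_thread::sleep_for` in production and a recording stub in tests.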