Problems when testing yacl communication on real machines
Closed this issue · 5 comments
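Based on part of the factory_test.cc test, I adapted the code to run on two real machines, and something seems wrong.
Environment: both machines run Ubuntu, with addresses 172.18.0.2 and 172.18.0.3, assigned ranks 0 and 1 respectively.
The machine with rank 0 runs the following code: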
// Mytest.cpp
#include <iostream>
#include <string>
#include <vector>
#include <map>
#include <memory>
#include <type_traits>
#include <variant>
#include <unistd.h>
#include <future>
#include <limits>
#include "fmt/format.h"
#include "gtest/gtest.h"
#include "yacl/link/context.h"
#include "yacl/link/link.h"
#include "yacl/link/factory.h"

class FactoryTest {
 public:
  FactoryTest() {
    static int desc_count = 0;
    contexts_.resize(2);
    yacl::link::ContextDesc desc;
    desc.id = fmt::format("world_{}", desc_count++);
    desc.brpc_retry_count = 20;
    desc.parties.push_back(yacl::link::ContextDesc::Party("alice", "172.18.0.2:63927"));
    desc.parties.push_back(yacl::link::ContextDesc::Party("bob", "172.18.0.3:63921"));
    auto create_brpc = [&](int self_rank) {
      contexts_[self_rank] = yacl::link::FactoryBrpc().CreateContext(desc, self_rank);
    };
    std::vector<std::future<void>> creates;
    creates.push_back(std::async(create_brpc, 0));
    for (auto& f : creates) {
      f.get();
    }
    std::cout << "Connect to Bob successfully\n";
  }

  void work() {
    auto test = [&](int self_rank) {
      int dst_rank = 1 - self_rank;
      this->contexts_[self_rank]->SendAsync(dst_rank, "Hello I am 0", "test");
      yacl::Buffer r = this->contexts_[self_rank]->Recv(dst_rank, "test");
      std::string r_str(r.data<const char>(), r.size());
      std::cout << self_rank << " Receive " << r_str << '\n';
    };
    std::vector<std::future<void>> tests;
    tests.push_back(std::async(test, 0));
    for (auto& f : tests) {
      f.get();
    }
  }

  ~FactoryTest() {
    auto wait = [&](int self_rank) {
      contexts_[self_rank]->WaitLinkTaskFinish();
    };
    std::vector<std::future<void>> waits;
    waits.push_back(std::async(wait, 0));
    for (auto& f : waits) {
      f.get();
    }
  }

  std::vector<std::shared_ptr<yacl::link::Context>> contexts_;
};

int main() {
  FactoryTest F;
  sleep(2);
  F.work();
  return 0;
}
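The code on machine 1 mainly changes the value of self_rank above. Because the programs are started by hand, the two machines may start a few seconds apart; I start the program on machine 1 first, then the one on machine 0. The code above runs without problems. Machine 0 prints: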
0 Receive Hello I am 1
Machine 1 prints:
1 Receive Hello I am 0
However, if the sleep(2) statement is removed from the code, the next test run produces the following errors. Machine 0 reports:
I0924 02:51:37.530009 1192314 /repository/brpc-1.6.0/src/brpc/server.cpp:1127] Server[yacl::link::transport::internal::ReceiverServiceImpl] is serving on port=63927.
Connect to Bob successfully
I0924 02:51:56.632742 1192407 /repository/brpc-1.6.0/src/brpc/socket.cpp:2465] Checking Socket{id=0 addr=172.18.0.3:63921} (0x7fbacc067020)
terminate called after throwing an instance of 'yacl::IoError'
what(): [/repository/yacl/yacl/link/transport/channel.cc:351] Get data timeout, key=world_0:P2P-1:1->0
Stacktrace:
#0 yacl::link::transport::Channel::Recv()+0x4d68b8
Aborted (core dumped)
Machine 1 reports:
…
[2023-09-24 02:51:55.515] [info] [default_brpc_retry_policy.cc:29] cntl ErrorCode '112', http status code '200', response header '', error msg '[E111]Fail to connect Socket{id=0 addr=172.18.0.2:63927} (0x0x7f8a34067000): Connection refused [R1][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R2][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R3][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R4][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R5][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R6][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R7][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R8][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R9][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R10][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R11][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R12][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R13][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R14][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R15][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R16][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R17][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R18][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R19][E112]Not connected to 172.18.0.2:63927 yet, server_id=0'
[2023-09-24 02:51:55.515] [info] [default_brpc_retry_policy.cc:75] aggressive retry, sleep=1000000us and retry
I0924 02:51:56.516082 769 /repository/brpc-1.6.0/src/brpc/socket.cpp:2465] Checking Socket{id=0 addr=172.18.0.2:63927} (0x7f8a34067000)
1 Receive Hello I am 0
I0924 02:51:56.516975 695 /repository/brpc-1.6.0/src/brpc/socket.cpp:2525] Revived Socket{id=0 addr=172.18.0.2:63927} (0x7f8a34067000) (Connectable)
[2023-09-24 02:51:56.522] [error] [channel.cc:98] SendImpl error [/repository/yacl/yacl/link/transport/brpc_link.cc:187] send, rpc failed=112, message=[E111]Fail to connect Socket{id=0 addr=172.18.0.2:63927}
(0x0x7f8a34067000): Connection refused [R1][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R2][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R3][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R4][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R5][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R6][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R7][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R8][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R9][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R10][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R11][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R12][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R13][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R14][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R15][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R16][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R17][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R18][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R19][E112]Not connected to 172.18.0.2:63927 yet, server_id=0 [R20][E112]Not connected to 172.18.0.2:63927 yet, server_id=0
Stacktrace:
#0 yacl::link::transport::BrpcLink::SendRequest()+0x4cb5cf
#1 (unknown)+0x7f8a34002da0
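Some [info] paragraphs are omitted above. Machine 1 did print "1 Receive Hello I am 0", but machine 0 apparently never received a message. I am sure that the program on machine 0 was started within 5 seconds after the program on machine 1.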
To make sure the contexts of both parties are connected, you can call the Context's ConnectToMesh method before calling SendAsync, for example:
auto create_brpc = [&](int self_rank) {
  contexts_[self_rank] = yacl::link::FactoryBrpc().CreateContext(desc, self_rank);
  contexts_[self_rank]->ConnectToMesh();
};
With this, sleep(2) is no longer needed.
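The error occurs here because the peer's socket has not been established yet when SendAsync is called; there is always some time gap between the two sides. Calling contexts_[self_rank]->ConnectToMesh(); ensures that both parties have their services up.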
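The time gap described above can be illustrated without yacl at all. The following is a minimal in-process sketch, assuming a made-up ToyReceiver class and RunScenario helper (both hypothetical, not part of yacl or brpc). It only models the idea "a send that arrives before the peer is listening is refused, while waiting for the peer first succeeds"; it does not reproduce what ConnectToMesh does internally.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical stand-in for one party's receiver service (not yacl code).
class ToyReceiver {
 public:
  // Marks the receiver as listening and wakes up any waiting sender.
  void Start() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      serving_ = true;
    }
    cv_.notify_all();
  }

  // Blocks until the receiver is listening or the timeout expires.
  bool WaitUntilServing(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mu_);
    return cv_.wait_for(lock, timeout, [this] { return serving_; });
  }

  // Returns false (think "connection refused") if not listening yet.
  bool Deliver(const std::string& msg) {
    std::lock_guard<std::mutex> lock(mu_);
    if (!serving_) return false;
    last_message_ = msg;
    return true;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool serving_ = false;
  std::string last_message_;
};

// Runs one send against a peer that starts 50 ms late.
// With wait_for_peer == false the send is attempted before the peer exists
// as a listener; with wait_for_peer == true the sender waits for it first.
bool RunScenario(bool wait_for_peer) {
  ToyReceiver peer;
  bool delivered = false;
  if (!wait_for_peer) {
    // The peer has not started yet, so this send is refused.
    delivered = peer.Deliver("Hello I am 0");
  }
  std::thread late_starter([&peer] {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    peer.Start();
  });
  if (wait_for_peer) {
    delivered = peer.WaitUntilServing(std::chrono::seconds(5)) &&
                peer.Deliver("Hello I am 0");
  }
  late_starter.join();
  return delivered;
}
```

Calling RunScenario(false) returns false and RunScenario(true) returns true, which mirrors the difference between sending right away and waiting for the peer before the first send.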
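Hi, although this problem could previously be worked around with ConnectToMesh, the root cause has now been confirmed to be a bug in brpc's retry mechanism: apache/brpc#2395 (comment)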
@maths644311798
@huocun-ant Yacl seems to use brpc-1.6.0, but the PR apache/brpc#2419 (comment) modifies master. What is Yacl's plan going forward?
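- yacl is considering adding a new retry strategy that is independent of the brpc retry policy, which would be more flexible and controllable
- A brpc version upgrade will probably also happen later, depending on whether it turns out to be necessary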
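For readers wondering what a retry strategy outside the RPC library's own policy could look like, here is a generic sketch of an application-level retry loop with exponential backoff. RetryOptions and RetryWithBackoff are hypothetical names invented for this illustration; this is not yacl's planned design, only one common way to keep retry behaviour in the caller's hands.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical settings; the names are illustrative, not yacl options.
struct RetryOptions {
  int max_attempts = 5;
  std::chrono::milliseconds initial_delay{10};
  double backoff_factor = 2.0;
  std::chrono::milliseconds max_delay{1000};
};

// Calls `attempt` until it returns true or max_attempts is reached.
// Returns the number of attempts used on success, or 0 if all failed.
int RetryWithBackoff(const std::function<bool()>& attempt,
                     const RetryOptions& opts) {
  std::chrono::milliseconds delay = opts.initial_delay;
  for (int i = 1; i <= opts.max_attempts; ++i) {
    if (attempt()) return i;
    if (i == opts.max_attempts) break;  // no sleep after the last failure
    std::this_thread::sleep_for(delay);
    // Grow the delay geometrically, capped at max_delay.
    auto grown = static_cast<std::chrono::milliseconds::rep>(
        static_cast<double>(delay.count()) * opts.backoff_factor);
    delay = std::min(std::chrono::milliseconds(grown), opts.max_delay);
  }
  return 0;
}
```

A caller would wrap its own send in the callback, for example a lambda that returns true once the send succeeds. Because the loop lives in the caller, the attempt count and delays can be tuned per call site.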