trpc-group/tnet

Abnormal packet sending

Closed this issue · 14 comments

During testing, all servers use tnet as the network library. Process A connects to process B's server via tnet.DialTCP. After a large number of clients connect and start interacting with process A, process A communicates frequently with process B, and after a while process B begins to fail when unpacking messages.
Log analysis combined with tcpdump captures shows that process B received a malformed message packet. The capture shows the message was sent by process A, yet in process A's Send method there is no log or record of that message ever being sent.
Process A's own sending and receiving report no errors either. What could cause this? Judging from the tcpdump capture, it looks as if process A's DialTCP connection sent, at the lower layer, data belonging to another client connection, which caused process B's unpacking failure.
From the code, DialTCP and Service currently use the same TCPHandler callback.
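For context, here is a minimal sketch of the topology described above: one tnet service and a tnet.DialTCP client sharing a single TCPHandler. The signatures used here (tnet.Listen, tnet.NewTCPService, tnet.DialTCP, Service.Serve) are assumptions about the usual tnet API shape and may differ across versions; this only illustrates the setup, not the actual reproduction code.

```go
package main

import (
	"context"
	"log"
	"time"

	"trpc.group/trpc-go/tnet"
)

// handler is the single TCPHandler shared by the listening service and the
// dialed connection, matching the setup described in the report.
func handler(conn tnet.Conn) error {
	// Decode one message from conn and react to it; framing details omitted.
	return nil
}

func main() {
	// Server side (process B in the report); signatures assumed.
	ln, err := tnet.Listen("tcp", "127.0.0.1:9000")
	if err != nil {
		log.Fatal(err)
	}
	svc, err := tnet.NewTCPService(ln, handler)
	if err != nil {
		log.Fatal(err)
	}
	go func() {
		if err := svc.Serve(context.Background()); err != nil {
			log.Println(err)
		}
	}()

	// Client side (process A dialing process B over one long-lived connection).
	conn, err := tnet.DialTCP("tcp", "127.0.0.1:9000", time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Process A forwards traffic from many inbound client connections
	// through this one conn, which is where the corrupted packets were seen.
	if _, err := conn.Write([]byte("hello")); err != nil {
		log.Fatal(err)
	}
}
```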

Hello, could you please provide a reproducible repository and specify your working environment, including the Go version, operating system version, and architecture?

Go version: 1.21.3
OS: Linux 3.10.0-1160.66.1.el7.x86_64

Is anyone looking into this issue?

@silencelbl Could you please directly provide a reproducible github repo? You can push your code to your public repository and provide a link here.

I can think of a common scenario where this issue might occur. The root cause could be port reuse, where multiple processes are listening on the same port. To troubleshoot this, you can check if there are multiple processes using the same port.

@silencelbl I still believe it would be best if you could directly provide a minimal reproducible code repository. This would minimize the communication effort required. Otherwise, I would need to ask for details about each issue mentioned in your description in order to reproduce them accurately. Could you please provide a repository? It would save time for both of us. Thank you.

I'll try to find time over the next couple of days to put together code that reproduces the problem. Could we add each other on VX (WeChat)?

Although the test environment runs multiple server processes on a single machine, the clients run separately on a dedicated load-generation machine, and we have verified that each server process listens on its own port. After the anomaly appeared with tnet, we switched back to an implementation based on the native net package and the problem no longer reproduced. I will organize the code and share a copy with you.

The reproduction code has been uploaded, and I've invited you to the repository.

OK, I've received the invitation. I am working on it.

@silencelbl, I have figured out the cause. You reuse the buffer that is passed to tnet.Conn.Writev, but you only enable the SafeWrite option on the server side. You should also enable SafeWrite on the client side:
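For reference, a minimal sketch of what enabling SafeWrite on the client-side connection might look like. The exact way to switch it on (here a hypothetical Conn.SetSafeWrite method) and the DialTCP/Writev signatures are assumptions drawn from this thread, so check the tnet version you use for the precise API; the point is only that the client connection needs the same copy-on-write guarantee as the server side before its buffers are reused.

```go
package main

import (
	"log"
	"time"

	"trpc.group/trpc-go/tnet"
)

func main() {
	// Process A dials process B; signature assumed.
	conn, err := tnet.DialTCP("tcp", "127.0.0.1:9000", time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Assumed switch mirroring the server-side SafeWrite option: with it
	// enabled, tnet copies the data handed to Write/Writev, so the caller
	// may reuse its buffer as soon as the call returns.
	conn.SetSafeWrite(true)

	buf := make([]byte, 0, 1024)
	for i := 0; i < 3; i++ {
		// Reusing buf like this is exactly the pattern that corrupted
		// process B's stream when SafeWrite was off on the client side:
		// the event loop may flush the slice only after it has been rewritten.
		buf = append(buf[:0], []byte("request payload")...)
		if _, err := conn.Writev(buf); err != nil {
			log.Fatal(err)
		}
	}
}
```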

Thanks, the problem is solved and did not recur during testing! Within the same process, the conn/tcp/udp connections should all set SafeWrite, right @WineChord?

As long as you manage the buffer passed to Writev on your own, you should always set SafeWrite to true.
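In other words, if the caller keeps ownership of the buffer, either turn on SafeWrite (as above) or hand tnet a private copy. Below is a tiny illustrative helper, with the same assumptions about the tnet API as the earlier sketches (notably that Writev accepts byte slices); the helper name is hypothetical and not part of tnet.

```go
package main

import "trpc.group/trpc-go/tnet"

// safeSend gives the event loop its own copy of payload, so the caller can
// keep reusing its buffer even on a connection without SafeWrite enabled.
// This trades one allocation per message for the same safety that SafeWrite
// would otherwise provide.
func safeSend(conn tnet.Conn, payload []byte) error {
	out := make([]byte, len(payload))
	copy(out, payload)
	_, err := conn.Writev(out)
	return err
}
```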