trpc-group/tnet

Abnormal packet sending

Closed this issue · 14 comments

During testing, all servers use tnet as the network library. Process A connects to process B's server via tnet.DialTCP. After a large number of clients connect and start interacting with process A, process A communicates frequently with process B, and after a while process B begins to fail when unpacking messages.
Log analysis combined with tcpdump captures shows that process B received a malformed message packet. The capture shows the message was sent by process A, yet in process A's Send method there is no log or record of that message ever being sent.
Process A's own sending and receiving report no errors either. What could cause this? Judging from the tcpdump capture, it looks as if process A's DialTCP connection sent, at the lower layer, data belonging to another client connection, which caused process B's unpacking failure.
From the code, DialTCP and Service currently use the same TCPHandler callback.
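For context, here is a minimal sketch of the topology described above: one tnet service and a tnet.DialTCP client sharing a single TCPHandler. The signatures used here (tnet.Listen, tnet.NewTCPService, tnet.DialTCP, Service.Serve) are assumptions about the usual tnet API shape and may differ across versions; this only illustrates the setup, not the actual reproduction code.

```go
package main

import (
	"context"
	"log"
	"time"

	"trpc.group/trpc-go/tnet"
)

// handler is the single TCPHandler shared by the listening service and the
// dialed connection, matching the setup described in the report.
func handler(conn tnet.Conn) error {
	// Decode one message from conn and react to it; framing details omitted.
	return nil
}

func main() {
	// Server side (process B in the report); signatures assumed.
	ln, err := tnet.Listen("tcp", "127.0.0.1:9000")
	if err != nil {
		log.Fatal(err)
	}
	svc, err := tnet.NewTCPService(ln, handler)
	if err != nil {
		log.Fatal(err)
	}
	go func() {
		if err := svc.Serve(context.Background()); err != nil {
			log.Println(err)
		}
	}()

	// Client side (process A dialing process B over one long-lived connection).
	conn, err := tnet.DialTCP("tcp", "127.0.0.1:9000", time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Process A forwards traffic from many inbound client connections
	// through this one conn, which is where the corrupted packets were seen.
	if _, err := conn.Write([]byte("hello")); err != nil {
		log.Fatal(err)
	}
}
```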

Hello, could you please provide a reproducible repository and specify your working environment, including the Go version, operating system version, and architecture?

Go version: 1.21.3
OS: Linux 3.10.0-1160.66.1.el7.x86_64

Is anyone looking into this issue?

@silencelbl Could you please directly provide a reproducible github repo? You can push your code to your public repository and provide a link here.

I can think of a common scenario where this issue might occur. The root cause could be port reuse, where multiple processes are listening on the same port. To troubleshoot this, you can check if there are multiple processes using the same port.

@silencelbl I still believe it would be best if you could directly provide a minimal reproducible code repository. This would minimize the communication effort required. Otherwise, I would need to ask for details about each issue mentioned in your description in order to reproduce them accurately. Could you please provide a repository? It would save time for both of us. Thank you.

I'll try to find time over the next couple of days to put together code that reproduces the problem. Could we add each other on VX (WeChat)?

Although the test environment runs multiple server processes on a single machine, the clients run separately on a dedicated load-generation machine, and we have verified that each server process listens on its own port. After the anomaly appeared with tnet, we switched back to an implementation based on the native net package and the problem no longer reproduced. I will organize the code and share a copy with you.

The reproduction code has been uploaded, and I've invited you to the repository.

OK, I've received the invitation. I am working on it.

@silencelbl, I have figured out the cause. You reuse the buffer that is passed to tnet.Conn.Writev, but you only enable the SafeWrite option on the server side. You should also enable SafeWrite on the client side:
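For reference, a minimal sketch of what enabling SafeWrite on the client-side connection might look like. The exact way to switch it on (here a hypothetical Conn.SetSafeWrite method) and the DialTCP/Writev signatures are assumptions drawn from this thread, so check the tnet version you use for the precise API; the point is only that the client connection needs the same copy-on-write guarantee as the server side before its buffers are reused.

```go
package main

import (
	"log"
	"time"

	"trpc.group/trpc-go/tnet"
)

func main() {
	// Process A dials process B; signature assumed.
	conn, err := tnet.DialTCP("tcp", "127.0.0.1:9000", time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Assumed switch mirroring the server-side SafeWrite option: with it
	// enabled, tnet copies the data handed to Write/Writev, so the caller
	// may reuse its buffer as soon as the call returns.
	conn.SetSafeWrite(true)

	buf := make([]byte, 0, 1024)
	for i := 0; i < 3; i++ {
		// Reusing buf like this is exactly the pattern that corrupted
		// process B's stream when SafeWrite was off on the client side:
		// the event loop may flush the slice only after it has been rewritten.
		buf = append(buf[:0], []byte("request payload")...)
		if _, err := conn.Writev(buf); err != nil {
			log.Fatal(err)
		}
	}
}
```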

Thanks, the problem is solved and did not recur during testing! Within the same process, the conn/tcp/udp connections should all set SafeWrite, right @WineChord?

As long as you manage the buffer passed to Writev on your own, you should always set SafeWrite to true.
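In other words, if the caller keeps ownership of the buffer, either turn on SafeWrite (as above) or hand tnet a private copy. Below is a tiny illustrative helper, with the same assumptions about the tnet API as the earlier sketches (notably that Writev accepts byte slices); the helper name is hypothetical and not part of tnet.

```go
package main

import "trpc.group/trpc-go/tnet"

// safeSend gives the event loop its own copy of payload, so the caller can
// keep reusing its buffer even on a connection without SafeWrite enabled.
// This trades one allocation per message for the same safety that SafeWrite
// would otherwise provide.
func safeSend(conn tnet.Conn, payload []byte) error {
	out := make([]byte, len(payload))
	copy(out, payload)
	_, err := conn.Writev(out)
	return err
}
```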