ADLINK-IST/opensplice

Communication problem on arm64 Ubuntu

Opened this issue · 2 comments

Environment

OpenSplice Community 6.9.181127OSS in DDSI peer mode. Config, QoS
Node1(192.168.13.11): Ubuntu 18.04, Python3.6, Intel x64
Node2(192.168.13.101): Ubuntu 18.04 (JetPack 4.4), Python3.6, NVIDIA Xavier (ARM64)
Node3(192.168.13.201): Ubuntu 16.04 (JetPack 3.3), Python3.5, NVIDIA TX2 (ARM64)

Problem Description

Node1 and Node2 can communicate with each other, but Node3 cannot communicate with the others. Ping is OK.
When Node1 and Node3 are running, ospl-info.log contains warnings such as "thread tev failed to make progress" and "thread dq.builtins failed to make progress". Are these the cause of the problem? How can I solve it?
Node1 log
Node3 log

It's a bit difficult to judge the logs without knowing exactly what actions triggered it.

It looks like we have a single connection (socket) being used for both reading and writing. I think a write is started on the socket, after which a read fails and the reader closes the socket. The writer then tries to use the closed socket and reports an error. The simplest way to eliminate the error is not to generate it when the connection has already been closed, since that is expected behaviour. If connections are not closed cleanly, you may always get tev warnings, because when TCP reads or writes block we try to hold on to the connection for as long as possible before cleaning it up. You can always reduce the configurable read/write connection timeouts.
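To make the scenario above concrete, here is a minimal Python sketch of one socket shared by a reader and a writer. It is not OpenSplice code: the `SharedConnection` class, its method names, and the 2-second timeout are hypothetical and only illustrate the idea of staying quiet when the writer finds a connection that the reader has already closed.

```python
import socket
import threading


class SharedConnection:
    """Hypothetical wrapper around one socket used for both reads and writes."""

    def __init__(self, sock, timeout_s=2.0):
        self._sock = sock
        # A shorter timeout bounds how long a blocked read or write holds the connection.
        self._sock.settimeout(timeout_s)
        self._lock = threading.Lock()
        self._closed = False

    def close(self):
        # Idempotent close, so reader and writer can both call it safely.
        with self._lock:
            if not self._closed:
                self._closed = True
                self._sock.close()

    def read(self, nbytes=4096):
        try:
            data = self._sock.recv(nbytes)
        except OSError:  # covers timeouts and connection resets
            data = b""
        if not data:
            # EOF or failure: the reader closes the shared socket.
            self.close()
        return data

    def write(self, payload):
        with self._lock:
            if self._closed:
                # Expected case: the reader already closed the socket, so no error is raised.
                return False
            try:
                self._sock.sendall(payload)
                return True
            except OSError:
                pass
        # The write itself failed on a connection that was still considered open.
        self.close()
        return False
```

With this shape, a write that arrives after the reader has closed the socket simply returns `False` instead of failing on a closed descriptor.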

I don't think it's really a 'known issue' in DDSI, but more like regular TCP behaviour. You can change operating-system defaults (i.e. the settings in /proc/sys/net/ipv4/tcp_* on Linux), but these defaults are usually quite sane and changing them risks causing all kinds of weird symptoms. You can also increase the DDSI lease-time so that it outlives the TCP timeouts. But you can imagine multiple hosts timing out at roughly the same time, the lease-renew thread being scheduled at arbitrary moments, and so on, so a good number is difficult to pick (and the higher the lease-timeout, the less responsive the system becomes).
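Before changing any operating-system defaults, it can help to see what they currently are. The following is a small sketch that only reads a few of the Linux TCP settings under /proc/sys/net/ipv4. The helper name `read_tcp_settings` and the selection of settings are my own choices for illustration, not a list of settings known to matter for this issue.

```python
import os

# A few standard Linux TCP tunables; the selection is an assumption made for illustration.
TCP_SETTINGS = (
    "tcp_retries2",
    "tcp_syn_retries",
    "tcp_keepalive_time",
    "tcp_keepalive_intvl",
    "tcp_keepalive_probes",
)


def read_tcp_settings(names=TCP_SETTINGS, base="/proc/sys/net/ipv4"):
    """Return {setting name: current value as a string, or None if unreadable}."""
    values = {}
    for name in names:
        path = os.path.join(base, name)
        try:
            with open(path) as handle:
                values[name] = handle.read().strip()
        except OSError:
            # Not on Linux, or the setting does not exist on this kernel.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_tcp_settings().items():
        print(f"{name} = {value}")
```

Running it on each node and comparing the output is a cheap way to check whether the node that fails to communicate has different TCP settings from the two that work.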

Thanks for replying. Do you mean the warnings are a result rather than a cause, and the real reason is a communication failure? Does that mean modifying the system TCP configuration or the DDSI lease-time may remove the warnings while communication still fails?
Anyway, could you please be more specific about how to modify the DDSI lease-time?