Unprocessed bundles and connection errors in 3-node scenario
teschmitt opened this issue · 3 comments
I am using coreemu-lab to simulate a scenario involving three nodes, with one node ferrying bundles between the other two. All nodes run dtnd with the same arguments:
dtnd --cla mtcp --nodeid $(hostname) --endpoint "dtn://this-is-our/~group" --interval 2s --janitor 3s --peer-timeout 4s
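Before the simulation starts, M bundles are pre-loaded into n1 and n3 with dtnsend. Roughly sketched, the pre-loading looks like this (the loop, payload, and exact dtnsend flags here are paraphrased; the real invocation is in the attached scenario files):
# pre-load NUMMSGS (= M) bundles for the shared group endpoint
for i in $(seq 1 "$NUMMSGS"); do
  echo "message $i from $(hostname)" | dtnsend --receiver "dtn://this-is-our/~group"
done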
There are three nodes, n1 (IP: 10.0.0.1), n2 (10.0.0.2), and n3 (10.0.0.3), of which n1 and n3 are pre-loaded with a certain number of messages (M) addressed to the group endpoint shared by all nodes. Initially, no node has a connection to any other:
[n1] [n3]
[n2]
After 15 seconds, n2 moves into range of n1 and receives M bundles:
[n1] [n3]
|
[n2]
After another 15 seconds, n2 moves into range of n3, where it stays until the end of the simulation at T+120s. Here, it should receive M bundles from n3 and forward the M bundles originating from n1 to n3:
[n1] [n2]--[n3]
Depending on M, n2 will exhibit faulty behavior. E.g., for M=1000, we get the following bundle transfer stats:
node | sent | recvd
n1 | 1000 | 0
n2 | ---- | 1067
n3 | 1000 | 1000
A look at the dtnd logs from n2 and n3 shows that after neighbor discovery at about T+33s, n2 sends all bundles originating from n1 to n3, but only processes 67 bundles from n3 and then just idles until the end of the experiment. Logs on n3 show that all bundles have been sent to n2.
With M=5000, connection errors start popping up as n2 tries to forward bundles to n1 at 10.0.0.1 long after it has gone out of range. This causes long stalls of about 35 to 50 seconds in the sending process, during which dtnd freezes. Here are two consecutive log entries whose timestamps show the stall duration:
...
2022-12-28T22:25:16.315Z INFO dtn7::core::processing > [inconspicuous log entry]
2022-12-28T22:26:24.574Z ERROR dtn7::cla::mtcp > Error connecting to remote 10.0.0.1:16162
...
So the idling until the end of the experiment in the M=1000 case could actually just be a stall that gets cut off when the experiment runs out of time. Weirdly enough, the M=5000 setup actually sees all bundles transferred completely, even in the presence of the errors and stalls.
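The 35 to 50 second stalls look a lot like a TCP connect to the now-unreachable peer that blocks until the kernel gives up retrying SYNs. That can be made visible without dtnd by probing the stale peer address from the log above with nc (timings depend on the kernel's tcp_syn_retries setting, so this is only an illustration):
$ time nc -z 10.0.0.1 16162       # no timeout: hangs for the whole SYN retry backoff
$ time nc -z -w 5 10.0.0.1 16162  # explicit 5 second timeout: gives up quickly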
I've attached the scenario setup. M can be regulated through NUMMSGS in experiment.conf. Also included are the logs referenced in this issue: connerr.zip
Thank you for taking the time to report this issue and provide an easy-to-use minimal scenario. But I'm not sure how to reproduce your errors. When I run this scenario on my machine with M=1000, I get the following output:
--------------------------------------------------------------------
SIMULATION RESULTS
--------------------------------------------------------------------
Found connection errors on n2: 0
Message stats:
node | sent | recvd
n1 | 1000 | 1000
n2 | ---- | 2000
n3 | 1000 | 1000
I did have to add cp /shared/bin/* /usr/local/bin to the pre hook in your experiment.conf, but without this, dtnsend should not work at all, as the binary is missing from the coreemu-lab image.
For M=5000 I get these results:
--------------------------------------------------------------------
SIMULATION RESULTS
--------------------------------------------------------------------
Found connection errors on n2: 0
Message stats:
node | sent | recvd
n1 | 5000 | 5000
n2 | ---- | 10000
n3 | 5000 | 5000
There might be an issue on your local machine running the docker container.
Did you try this on multiple machines?
OK, I've cross-checked this issue on another machine (a Linux VM running on an M1 MacBook) and I could not reproduce the error until I loaded the ebtables kernel module. As a matter of fact, this module was also loaded on the machine I originally encountered this error on. Often, but not always, all bundles were transmitted successfully, but there were always connection errors in the dtnd logs. Taking a look at the logs, I can see that these errors cause long pauses in transmission. Might this be some sort of feud between ebtables and dtn7-rs?
Here is the Dockerfile I used to run these experiments:
FROM rust:1.62.1 as builder
WORKDIR /root
RUN cargo install --locked --bins --examples --root /usr/local --git https://github.com/dtn7/dtn7-rs --rev 0bd550ce dtn7
FROM gh0st42/coreemu-lab:1.0.0
COPY --from=builder /usr/local/bin/* /usr/local/bin/
RUN echo "export USER=root" >> /root/.bashrc
ENV USER root
EXPOSE 22
EXPOSE 5901
EXPOSE 50051
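For completeness, I build and run the image roughly like this; the image tag and the /shared mount point are my own choices, and I assume coreemu-lab needs --privileged to set up its network namespaces (the published ports match the EXPOSE lines above):
$ docker build -t dtn7-clab .
$ docker run --rm -it --privileged \
    -v "$(pwd)":/shared \
    -p 2222:22 -p 5901:5901 -p 50051:50051 \
    dtn7-clab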
EDIT: both ebtables and sch_netem are loaded when the errors crop up:
$ lsmod | grep -E "ebtables|sch_netem"
ebtables 45056 1 ebtable_filter
sch_netem 20480 0
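To see whether these modules are actually in use during a run rather than just loaded, one idea (I have not dug deeper into this) is to check for the rules and qdiscs CORE sets up while the scenario is running, since as far as I understand the basic wireless model relies on ebtables to cut links between out-of-range nodes and on netem for the link characteristics:
$ sudo ebtables -L               # wireless filtering rules, if any, show up here
$ tc qdisc show | grep netem     # netem qdiscs attached to the emulated links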
As I cannot easily reproduce the problem and it does not happen on other machines, I will close the issue for now.
If you get new insights or can reproduce the bug, please reopen the issue.