NXP/isochron

Random Premature/Late Transmissions with Taprio Traffic Controller

Opened this issue · 4 comments

Hey,

I'm currently using isochron to measure the delay from one I210 to another using a taprio traffic controller over VLAN. Unfortunately, it seem to be getting a lot of variable results. Sometimes the program will run perfectly fine for over a minute, other times it will result in late/premature transmissions within seconds, discarding all other timestamps.

The taprio traffic controller configuration is as follows:

qdisc taprio 100: root refcnt 9 tc 3 map 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0
queues offset 0 count 1 offset 1 count 1 offset 2 count 1
clockid TAI	base-time 0 cycle-time 10000000 cycle-time-extension 0
	index 0 cmd S gatemask 0x4 interval 2000000
	index 1 cmd S gatemask 0x2 interval 2000000
	index 2 cmd S gatemask 0x1 interval 6000000

where isochron has socket priority 6 (using traffic class 1).

The isochron send command I am using is as follows:
sudo isochron send -i enp1s0.5 -s 64 --client 10.0.0.1 -c 0.01 -t 1 -w 0.002 -F ${fileprefix}-isochron.dat -n 6500 -o -O 37 --cpu-mask $((1 << 1)) --sched-fifo --sched-priority 99 -4 -J 10.0.0.1 -p 6 -S 0.002 -Q
and the following is the isochron receive:
taskset -a $((1 << 1)) sudo isochron rcv -i enp5s0.5 -t 1 -O 37 --sched-fifo --sched-priority 99 -4

The resulting isochron output on the send side is:

         Now:   1692646016.985340048
First wakeup:   1692646017.485900000
   Base time:   1692646017.486200000
  Cycle time:            0.001000000
Late transmission by 1 cycles detected for seqid 57 scheduled for 1692646017.542200000: TX hwts 1692646017.543384569
Timed out waiting for TX timestamps, 64944 timestamps unacknowledged
seqid 58 missing timestamps: hw,
...

I've noticed that increasing the window size of isochron causes ptp (which is running on traffic class 2) to fault:

ptp4l[4324.490]: timed out while polling for tx timestamp
ptp4l[4324.490]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[4324.490]: port 1 (enp1s0.5): send delay request failed
ptp4l[4324.490]: port 1 (enp1s0.5): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)

and decreasing the window size any more results in isochron having consistent premature/late transmissions.

I was mainly wondering if this was common behavior with isochron, if there were any configuration changes that could be made to help stabilize the results, and finally if there are any other known causes that may be interfering with the isochron processes.

So isochron uses gatemask 0x2, and PTP uses gatemask 0x4.

I've noticed that increasing the window size of isochron causes ptp (which is running on traffic class 2) to fault

When you increase isochron's window beyond 0.002 seconds (2 ms, the interval of the "gatemask 0x2" entry, it starts overlapping with the PTP schedule entry.

And when that happens, sometimes isochron will wake up just early enough to send a timestampable packet and "steal" ptp4l's chance of doing the same thing. It's an i210 hardware limitation where there cannot be more than one TX timestampable packet in flight at any given time, see the many threads (only on isochron) on the topic:
#15 (comment)
#13

Check ethtool -S enp1s0 | grep tx_hwtstamp_skipped to make sure I'm right.

and decreasing the window size any more results in isochron having consistent premature/late transmissions.

I don't think those are a problem of the tool. I forget if the log is saved when the program aborts due to a premature/late transmission. If it is, can you do a postmortem of the log file with isochron-report, and see what were the S, T, t and s times for the failed (last) seqid? If we don't find something obviously miscalculated/off-looking, I'd tend to trust the data.

Can you please post your taprio configuration script for I210? I know the igb driver doesn't even support full taprio offload. I think it must use the txtime-assist feature (flags 0x1, paired with tc-etf child Qdiscs, and those have a weird/finicky "delta" fudge factor as well, that needs to be carefully tuned). But I don't have any experience with Intel NICs - I don't have a set of commands that I know for sure works.

It almost sounds like there's no actual packet scheduling in your NIC, and you're just relying on pure chance and on the operating system's scheduler to send packets out at the right time.

Hey, I should've specified that when I change the isochron window times, I also change the taprio configurations to match them. With regards to the isochron window size being bigger causing ptp to fault, this was the taprio configuration I was using:

qdisc taprio 100: root refcnt 9 tc 3 map 0 0 0 0 0 0 1 2 0 0 0 0 0 0 0 0
queues offset 0 count 1 offset 1 count 1 offset 2 count 1
clockid TAI	base-time 0 cycle-time 15000000 cycle-time-extension 0
	index 0 cmd S gatemask 0x4 interval 2000000
	index 1 cmd S gatemask 0x2 interval 10000000
	index 2 cmd S gatemask 0x1 interval 3000000

and the isochron send command would also change to reflect these changes (different window size and cycle time).

Here is the command that I am using to change the taprio configurations:

sudo tc qdisc replace dev enp1s0 parent root handle 100 taprio num_tc 3 queues 1@0 1@1 1@2 map 0 0 0 0 0 0 1 2 \
        sched-entry S 04 2000000 \
        sched-entry S 02 2000000 \
        sched-entry S 01 6000000 \
        clockid CLOCK_TAI

You do seem to be right about the isochron packet replacing the ptp packet and it can be seen in the ethtool command (output: tx_hwtstamp_skipped: 67), but I'm still not sure why they seem to be doing so when the traffic shaper and isochron are configured correctly to one another.

I'm still not sure why they seem to be doing so when the traffic shaper and isochron are configured correctly to one another

Because, as mentioned, with "flags 0" (implicit if you don't specify it) and "clockid CLOCK_TAI", you use software-only packet scheduling (the hrtimer-based kernel implementation from net/sched/sch-taprio.c). That will do a best-effort attempt of delivering packets for transmission to the NIC. You are not protected against gate overruns. It is entirely plausible that, even with isochron sitting quietly in its time slot, the software qdisc will still schedule a ptp4l packet a bit late, and isochron will tramp over it.

With "flags 0", you get just about as much TSN out of the i210 NIC as from any non-TSN Realtek, Mediatek, or $insert_vendor card here. With the added downside that the i210 can't handle multiple TX timestamps in flight.

I think this wiki is maintained by Intel people. See how they recommend configuring the taprio root with flags 0x1, and how they recommend attaching tc-etf child Qdiscs to taprio's classes: https://tsn.readthedocs.io/qdiscs.html