NXP/isochron

Missing HW and SW timestamps

Opened this issue · 8 comments

SPYFF commented

Hi! Just a generic question, not necessarily about isochron itself. When I run the software with smaller and smaller cycle times, like two- or one-digit microseconds, more and more missing timestamps are reported. Is there something I can tune, or is this expected behavior? Is it possible to increase the buffer size of the MSG_ERRQUEUE, or is that irrelevant to this problem? I use an Intel i225 (igc driver), by the way.

The isochron report shows strange zero timestamps for those seqids:

5477 1657634261.805760775 1657634261.850920957 1657634261.850922060 0.000000000 1657634261.850929143 1657634261.850969224
5478 1657634261.805760780 1657634261.850926407 1657634261.850927447 0.000000000 1657634261.850934596 1657634261.850969270
5479 1657634261.805760785 1657634261.850931946 1657634261.850932950 1657634261.850940063 1657634261.850940100 1657634261.850969319
5480 1657634261.805760790 1657634261.850937442 1657634261.850939053 0.000000000 1657634261.850946321 1657634261.850969364
5481 1657634261.805760795 1657634261.850943630 1657634261.850945347 1657634261.850952424 1657634261.850952465 1657634261.850969409
5482 1657634261.805760800 1657634261.850949799 1657634261.850952194 0.000000000 1657634261.850959300 1657634261.850969456

For larger cycle times, I get all of the timestamps (rx: sw, hw; tx: sw, hw).
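
For reference, the mechanism I'm asking about is the socket error queue: with SO_TIMESTAMPING enabled on the sending socket, TX completions (including the raw HW timestamp) are read back via recvmsg() with MSG_ERRQUEUE. Below is a minimal sketch of draining that queue, not isochron's actual code, with setup and error handling omitted:

/* Minimal sketch (not isochron's code): drain HW TX timestamps from the
 * socket error queue. Assumes SO_TIMESTAMPING was enabled with
 * SOF_TIMESTAMPING_TX_HARDWARE | SOF_TIMESTAMPING_RAW_HARDWARE.
 */
#include <linux/errqueue.h>
#include <linux/net_tstamp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <time.h>

static void drain_tx_timestamps(int fd)
{
    char data[256], ctrl[512];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
    };
    struct cmsghdr *cm;

    /* Non-blocking: stop once the error queue is empty */
    while (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) >= 0) {
        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
            if (cm->cmsg_level == SOL_SOCKET &&
                cm->cmsg_type == SCM_TIMESTAMPING) {
                struct scm_timestamping tss;

                memcpy(&tss, CMSG_DATA(cm), sizeof(tss));
                /* ts[0] is the SW timestamp, ts[2] the raw HW timestamp */
                printf("hw tx ts: %lld.%09ld\n",
                       (long long)tss.ts[2].tv_sec, tss.ts[2].tv_nsec);
            }
        }
        msg.msg_controllen = sizeof(ctrl);
    }
}

If packets are sent faster than this queue is drained, or faster than the NIC can latch timestamps, completions can presumably go missing, which looks like what I'm seeing.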

SPYFF commented

For example, the following command by default gives almost 50% HW timestamp loss.

isochron send -i enp3s0 -s 64 -c 0.00005 --client 10.0.0.20 --num-frames 1000 -F isochron.dat --sync-threshold 2000

If I increase the rcvbuf of the data socket (which also carries the error queue), this is reduced to 5-10 missing timestamps. However, no matter what I do, I cannot reduce the missing timestamps to zero.
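
For reference, increasing the receive buffer comes down to the standard SO_RCVBUF socket option (a generic sketch, not necessarily how isochron exposes it; note the kernel clamps the request to net.core.rmem_max unless SO_RCVBUFFORCE is used):

/* Generic sketch: request a larger receive buffer on the socket whose
 * error queue delivers the TX timestamps. The kernel clamps the request
 * to net.core.rmem_max and internally doubles it for bookkeeping.
 */
#include <stdio.h>
#include <sys/socket.h>

static int bump_rcvbuf(int fd, int bytes)
{
    socklen_t len = sizeof(bytes);
    int actual = 0;

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)))
        return -1;

    /* Read back what the kernel actually granted */
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len))
        return -1;

    printf("requested %d bytes, got %d\n", bytes, actual);
    return 0;
}

/* e.g. bump_rcvbuf(fd, 4 * 1024 * 1024); the value is arbitrary */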

SPYFF commented

Ok, I think I found the problem: basically I'm too dumb for Linux. Sorry for the noise; I guess it was also an igc-specific problem.
With bpftrace I found that if isochron and igc's PTP TX timestamp worker (igc_ptp_tx_work) are scheduled onto different CPU cores, I lose some or many timestamps. With large cycle times it was not an issue, but with smaller cycle times it is. Running the measurement with --cpu-mask=$core_of_igc_kworker I got many more HW TX timestamps without losing a single one (all 1000 in my case; running it with larger frame counts like 5000 or more is however still not OK).
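
For context, I assume --cpu-mask just pins the isochron process to the given cores; doing the same by hand comes down to sched_setaffinity(). A minimal sketch (the CPU number is whichever core the igc kworker happens to run on):

/* Minimal sketch: pin the calling process to a single CPU core.
 * (I assume isochron's --cpu-mask does something equivalent internally.)
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set)) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}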

If you have any hints or tweaks to minimize the chance of losing HW timestamps, I would be glad to hear them, but I think this is safe to close because it's not related to isochron.

Hi Ferenc,
Sorry for the late response; I just came back from vacation.
It sounds like the problem is caused at least partially by the igc driver's inability to perform TX timestamping for more than 1 packet at a time:
https://elixir.bootlin.com/linux/v5.18.11/source/drivers/net/ethernet/intel/igc/igc_main.c#L1451
I see there's a "tx_hwtstamp_skipped" ethtool -S counter; could you check whether that is what is incrementing? Sadly, I don't have the necessary hardware for this.
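
For anyone reading along, the logic at that line is roughly the following (paraphrased, not a verbatim excerpt, and field names may differ slightly between kernel versions): a single adapter-wide bit tracks the one in-flight timestamp request, and any request that arrives while it is set is dropped and counted.

/* Roughly the igc xmit-path logic linked above (paraphrased, not verbatim):
 * only one TX timestamp request may be in flight at a time, guarded by an
 * adapter-wide bit; further requests are dropped and counted as skipped.
 */
if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
    if (!test_and_set_bit_lock(__IGC_PTP_TX_IN_PROGRESS, &adapter->state)) {
        skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
        adapter->ptp_tx_skb = skb_get(skb);  /* the one skb being timestamped */
        adapter->ptp_tx_start = jiffies;
    } else {
        adapter->tx_hwtstamp_skipped++;      /* the ethtool -S counter */
    }
}

The counter can be watched with ethtool -S enp3s0 while a measurement is running.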

SPYFF commented

Hi!

Thanks for the help; the tx_hwtstamp_skipped counter is indeed incrementing (but only when I run isochron on the same core as the igc kworker). Do you think it might be worth mentioning this issue on the netdev list for the Intel devs, or do you see some quick fix that I could apply here?

Sorry again for the delay. I don't see a quick fix. I think it's a problem that the driver decides to drop TX timestamping requests willy-nilly, and it should be reported on the netdev list to see what can be done. At the very least, the driver could queue the packet until the current one is no longer being timestamped. This could be done by anyone with the hardware, since it's just some extra logic, rather than adding support for a different set of timer registers.

SPYFF commented

Vinicius replied to the issue on netdev with a patchset that uses all four registers for timestamping.


And does it solve the problem?

SPYFF commented

I compiled the kernel, but the testbed is occupied for a while, so I haven't had the opportunity to test it yet. I'll be back soon.