pylessard/python-can-isotp

No connection via virtual environement

pompushko opened this issue · 20 comments

Hello, I have virtual server and virtual vcan0 interface. And this code:

    tpsock = isotp.socket(timeout=5)
    tpsock.set_opts(tx_stmin=500000)
    ecu_address = isotp.Address(isotp.AddressingMode.NormalFixed_29bits, source_address=parsed_data.TesterAddressClient, target_address=parsed_data.ECUAddressServer)
    conn = IsoTPSocketConnection(args.interface, rxid=int.from_bytes(parsed_data.CANIdECU, 'big'), txid=int.from_bytes(parsed_data.CANIdTester, 'big'), tpsock=tpsock, address=ecu_address)
    with Client(conn,  request_timeout=5) as client:
        response = client.change_session(3)
        if response.code == Response.Code.PositiveResponse:
            logger_app.logger.info("UDS session changed to Extended")
  File "/home/me/.local/lib/python3.10/site-packages/udsoncan/client.py", line 128, in decorated
    return func(self, *args, **kwargs)
  File "/home/me/.local/lib/python3.10/site-packages/udsoncan/client.py", line 186, in change_session
    response = self.send_request(req)
  File "/home/me/.local/lib/python3.10/site-packages/udsoncan/client.py", line 1950, in send_request
    self.conn.send(payload)
  File "/home/me/.local/lib/python3.10/site-packages/udsoncan/connections.py", line 65, in send
    self.specific_send(payload)
  File "/home/me/.local/lib/python3.10/site-packages/udsoncan/connections.py", line 316, in specific_send
    self.tpsock.send(payload)
  File "/home/me/.local/lib/python3.10/site-packages/isotp/tpsock/__init__.py", line 84, in send
    return self._socket.send(*args, **kwargs)

When I try to run my application, I got [TimeoutError] : timed out error.
Even in candump vcan0 no data.

But! When I try cansend vcan0 01a#11223344AABBCCDD, I see data:

candump vcan0
  vcan0  01A   [8]  11 22 33 44 AA BB CC DD
  vcan0  01A   [8]  11 22 33 44 AA BB CC DD
  vcan0  01A   [8]  11 22 33 44 AA BB CC DD

Thank you!

Where is the server? A different process?

Somebody once told me: "Problems are like bananas, they come in bunch". That's the 3rd report of problem with socket.send() in the last 2 weeks.

So far, these 2 issues reported similar problems and most plausible cause is a known race condition in the kernel.

I will look more in details regarding this, but for now, any additional info you can provide will be helpful. For instance,

  • Have you been able to run this code previously?
  • Have you changed something.
  • Version of kernel
  • etc

Well. Sorry for small amount of info :)
This is Ubuntu 22.04 server in GCP
Let me grab more info from that server :)

  1. Have you been able to run this code previously?
    Unfortunately, no, this is fresh install of server. I made many same installs on Raspberry Pi, so I was surprised...
  2. Have you changed something.
    Nope, new server installation :)
  3. Version of kernel
    6.2.0-1016-gcp #18~22.04.1-Ubuntu SMP Fri Sep 29 04:56:44 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I started upgrade up to Ubuntu 23.04. Lets see...

Are you saying that you were running similar code on a raspberry pi and it worked?

Are you saying that you were running similar code on a raspberry pi and it worked?

Exactly. On 3-4 devices: Pi3, Pi4, PiZero...

@lumagi: Thoughts?

Well. After upgrading of Ubuntu... nothing was changed :D

Linux gcpvm-1 6.2.0-1016-gcp #18-Ubuntu SMP Fri Sep 22 16:23:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/os-release
PRETTY_NAME="Ubuntu 23.04"
NAME="Ubuntu"
VERSION_ID="23.04"
VERSION="23.04 (Lunar Lobster)"
VERSION_CODENAME=lunar
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=lunar
LOGO=ubuntu-logo
lumagi commented

Judging from the stack trace and the code, it's likely that this is also the race condition I mentioned in the other two threads.
In general, all that's needed to see this race condition is to use IsoTPSocketConnection with a kernel version that's affected. The bug first appeared as a regression after patch torvalds/linux@79e19fa was introduced before release of kernel version 6.3. Since the patch was also backported into the stable kernel versions and not only the next release, it seemed to have been picked up by all major distributions, even if they're using an older kernel version.

I guess anyone that's using your UDS client with a kernel ISO-TP socket could potentially be affected. The problem is that the error message is really counter intuitive. The time out happens because the send and receive side of the IsoTPSocketConnection are fighting and only one can win. The other one times out. So it's not really a time out issue on the bus but inside the kernel.

The patch with the fix is being applied to the stable kernel versions as well. That means that most or all affected Linux distributions should also pick up the patch eventually.

Since the issue only occurs with the kernel socket, the simplest way to verify if it's really the kernel race condition is to replace IsoTPSocketConnection with the Python implementation. If the problem goes away, it was likely the aforementioned bug.

It is really horrible to see the effect of the backported patch that finally ADDED a regression.
@lumagi : Hopefully your fix shows up in the stable kernels soon. At least I got a notification that the fix is queued up for 6.1 and 6.5 stable kernels. So lets see when Ubuntu applies the fix to the kernel 6.2 used above.
@pompushko when Ubuntu does not update the 6.2 in time you might also install the latest 6.1 or 6.5 kernels via Ubuntu PPA

It is really horrible to see the effect of the backported patch that finally ADDED a regression. @lumagi : Hopefully your fix shows up in the stable kernels soon. At least I got a notification that the fix is queued up for 6.1 and 6.5 stable kernels. So lets see when Ubuntu applies the fix to the kernel 6.2 used above. @pompushko when Ubuntu does not update the 6.2 in time you might also install the latest 6.1 or 6.5 kernels via Ubuntu PPA

Thank you :)

I upgraded... but this is insane, because it works before upgrading :D
modprobe: FATAL: Module vcan not found in directory /lib/modules/6.2.0-1017-gcp

This solve in my case to add virtual can:
sudo apt-get install -y linux-modules-extra-$(uname -r)

But still doesnt works at all. No data ;(

Seems like still old kernel

If you check out that kernel at the Ubuntu site:

https://launchpad.net/ubuntu/+source/linux-gcp/6.2.0-1017.19

The changes were summarized at Thu, 05 Oct 2023 21:59:43 -0500
But the fix was applied on 13th which later causes the stable upstream.

Therefore
https://launchpad.net/ubuntu/+archive/primary/+sourcefiles/linux-gcp/6.2.0-1017.19/linux-gcp_6.2.0-1017.19.diff.gz

Contains "can: isotp: isotp_ops: fix poll() to not report false EPOLLOUT events" - but not the fix for it!

So you still have to wait for the fix to show up in stable kernels or you install a 6.6-rc6 kernel which definitely contains the fix.

But even https://kernel.ubuntu.com/mainline/v6.6-rc6/ does not exist at time of writing this post (only v6.6-rc5).
In 1-2 days its worth to re-check for 6.6-rc6 there.

@hartkopp Do you think there could be a way to detect the existence of the bug at runtime so a warning can be issued?

I made more experiments :)

fresh install of Ubuntu 22.04 server. And fresh pip3 modules and the last version of python3
Also, I updated the kernel up to 6.5 to try:
Linux canbus-dev 6.5.0-1004-oem #4-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 15 19:52:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I have my working script on Raspberry Pi. But when I run client.change_session(2) on VM with Virtual CAN, I got this one error after some client.read_data_by_identifier (2 or 3 times I read some data from UDS server):

[TimeoutError] : timed out
Traceback (most recent call last):
  File "/home/username/fca-flash-bundle-master/main.py", line 261, in <module>
    sys.exit(main())
  File "/home/username/fca-flash-bundle-master/main.py", line 123, in main
    response = client.change_session(2)
  File "/home/username/.local/lib/python3.10/site-packages/udsoncan/client.py", line 174, in decorated
    return func(self, *args, **kwargs)
  File "/home/username/.local/lib/python3.10/site-packages/udsoncan/client.py", line 232, in change_session
    response = self.send_request(req)
  File "/home/username/.local/lib/python3.10/site-packages/udsoncan/client.py", line 2143, in send_request
    self.conn.send(payload)
  File "/home/username/.local/lib/python3.10/site-packages/udsoncan/connections.py", line 65, in send
    self.specific_send(payload)
  File "/home/username/.local/lib/python3.10/site-packages/udsoncan/connections.py", line 321, in specific_send
    self.tpsock.send(payload)
  File "/home/username/.local/lib/python3.10/site-packages/isotp/tpsock/__init__.py", line 89, in send
    return self._socket.send(*args, **kwargs)
TimeoutError: timed out

here is candump:

  vcan0  18DA20F1   [3]  02 10 03
  vcan0  18DAF120   [7]  06 50 03 00 1E 0B B8
  vcan0  18DA20F1   [4]  03 22 F1 90
  vcan0  18DAF120   [8]  10 14 62 F1 90 5A 41 52
  vcan0  18DA20F1   [3]  30 00 00
  vcan0  18DAF120   [8]  21 45 41 45 48 56 36 4D
  vcan0  18DAF120   [8]  22 37 36 35 30 38 37 34
  vcan0  18DA20F1   [4]  03 22 F1 A0
  vcan0  18DAF120   [8]  10 43 62 F1 A0 35 30 35
  vcan0  18DA20F1   [3]  30 00 00
  vcan0  18DAF120   [8]  21 35 35 36 32 35 20 20
  vcan0  18DAF120   [8]  22 20 5A 41 52 45 41 45
  vcan0  18DAF120   [8]  23 48 56 36 4D 37 36 35
  vcan0  18DAF120   [8]  24 30 38 37 34 43 44 43
  vcan0  18DAF120   [8]  25 4D 2E 30 32 20 20 20
  vcan0  18DAF120   [8]  26 20 01 20 20 20 20 20
  vcan0  18DAF120   [8]  27 20 20 20 20 20 20 20
  vcan0  18DAF120   [8]  28 20 20 20 20 20 20 20
  vcan0  18DAF120   [6]  29 00 52 50 5C 14

@hartkopp Do you think there could be a way to detect the existence of the bug at runtime so a warning can be issued?

I have no idea. The problem only shows up in Phython environments so far.

Currently the fixes are available in Linux 6.6, and 6.5.8+ and 6.1.59+
Other Kernel versions that have applied the original poll() "fix" seem to be broken - this also applies to 5.10 and 5.15 LTS kernels. :-(

fresh install of Ubuntu 22.04 server. And fresh pip3 modules and the last version of python3 Also, I updated the kernel up to 6.5 to try: Linux canbus-dev 6.5.0-1004-oem #4-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 15 19:52:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Hi @pompushko ,

I created an out-of-tree isotp driver in a special branch "mainline-5.4+"

https://github.com/hartkopp/can-isotp/tree/mainline-5.4+

If you pull this repo and switch to the "mainline-5.4+" branch you should be able to build a can-isotp.ko module on your current system which covers all the latest fixes.

Once the module is build (just type make) you should modprobe this can-isotp.ko module which should do the job. Can you please give it a try?

fresh install of Ubuntu 22.04 server. And fresh pip3 modules and the last version of python3 Also, I updated the kernel up to 6.5 to try: Linux canbus-dev 6.5.0-1004-oem #4-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 15 19:52:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Hi @pompushko ,

I created an out-of-tree isotp driver in a special branch "mainline-5.4+"

https://github.com/hartkopp/can-isotp/tree/mainline-5.4+

If you pull this repo and switch to the "mainline-5.4+" branch you should be able to build a can-isotp.ko module on your current system which covers all the latest fixes.

Once the module is build (just type make) you should modprobe this can-isotp.ko module which should do the job. Can you please give it a try?

Yes. All works as expected :D
Thank you for that fix :)

  vcan0  18DA40F1   [8]  21 C0 05 06 A7 A8 07 E3
  vcan0  18DA40F1   [8]  22 02 F0 05 C0 10 A8 27
  vcan0  18DA40F1   [8]  23 F2 02 C0 30 C0 00 16
  vcan0  18DA40F1   [8]  24 A0 6E 20 A8 00 E2 04
  vcan0  18DA40F1   [8]  25 9F 8C 00 08 77 74 97
  vcan0  18DA40F1   [8]  26 0C 6F B0 A8 10 E2 04
  vcan0  18DA40F1   [8]  27 9F 8C 00 08 77 74 97
  vcan0  18DA40F1   [8]  28 0C 06 AD A8 1D E2 08
  vcan0  18DA40F1   [8]  29 A8 2D E2 42 A8 3D E2
  vcan0  18DA40F1   [8]  2A 4D 9F 8C 00 08 77 74
  vcan0  18DA40F1   [8]  2B 97 0C C0 04 DA 64 6E
  vcan0  18DA40F1   [8]  2C 35 A8 04 E2 09 07 2F
  vcan0  18DA40F1   [8]  2D F7 EB C3 C4 17 2F F8

But I start facing another one issue during request_transfer_exit()

response = client.request_transfer_exit()
            if response.code == Response.Code.PositiveResponse:
                logger_app.logger.info("Flashing completed")

This is physical interface for cangw:

  can0  18DA40F1   [8]  2C 11 4B E8 00 12 8F 0E
  can0  18DA40F1   [8]  2D 00 12 8F 0E 00 12 8F
  can0  18DA40F1   [8]  2E 0E 00 12 8F 0E 00 12
  can0  18DA40F1   [8]  2F 8F 0E 00 12 8F 0E 00
  can0  18DA40F1   [8]  20 12 8F 0E 00 12 8F 0E
  can0  18DA40F1   [8]  21 00 12 8F 0E 00 12 8F
  can0  18DA40F1   [8]  22 0E 00 12 8F 0E 00 12
  can0  18DA40F1   [8]  23 8F 0E 00 12 8F 0E 00
  can0  18DA40F1   [8]  24 12 8F 0E 00 12 8F 0E
  can0  18DAF140   [3]  02 76 D9
  can0  18DA40F1   [2]  01 37
  can0  18DAF140   [4]  03 7F 37 78

This is virtual interface:

 (1698409922.584110)  vcan0  18DA40F1   [8]  2C 11 4B E8 00 12 8F 0E
 (1698409922.584508)  vcan0  18DA40F1   [8]  2D 00 12 8F 0E 00 12 8F
 (1698409922.584899)  vcan0  18DA40F1   [8]  2E 0E 00 12 8F 0E 00 12
 (1698409922.585309)  vcan0  18DA40F1   [8]  2F 8F 0E 00 12 8F 0E 00
 (1698409922.585699)  vcan0  18DA40F1   [8]  20 12 8F 0E 00 12 8F 0E
 (1698409922.586096)  vcan0  18DA40F1   [8]  21 00 12 8F 0E 00 12 8F
 (1698409922.586486)  vcan0  18DA40F1   [8]  22 0E 00 12 8F 0E 00 12
 (1698409922.586885)  vcan0  18DA40F1   [8]  23 8F 0E 00 12 8F 0E 00
 (1698409922.587267)  vcan0  18DA40F1   [8]  24 12 8F 0E 00 12 8F 0E
 (1698409922.828261)  vcan0  18DAF140   [3]  02 76 D9
 (1698409922.829107)  vcan0  18DA40F1   [2]  01 37
 (1698409922.854382)  vcan0  18DAF140   [4]  03 7F 37 78

I'm not sure if that issue is a python-can or iso-tp...
But I got this one:

  File "/home/user/.local/lib/python3.11/site-packages/udsoncan/client.py", line 974, in request_transfer_exit
    response = self.send_request(request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/udsoncan/client.py", line 2184, in send_request
    raise TimeoutException('Did not receive response in time. %s time has expired (timeout=%.3f sec)' %
udsoncan.exceptions.TimeoutException: Did not receive response in time. Global request timeout time has expired (timeout=49.974 sec)

UPD: problem in my UDS server

Shall we close that issue then?

Shall we close that issue then?

Definitely :D

Just to finish this up:

The problem has been introduced into the mainline kernel with this patch (2023-04-05)
torvalds/linux@79e19fa

And it has been fixed with this patch (2023-10-06)
torvalds/linux@d9c2ba6

Both patches have been backported to the Linux Longterm stable kernels.

These stable kernels do only contain the first fix but not the second fix and are therefore broken regarding this python-can issue:
5.10.178 - 5.10.199
5.15.107 - 5.15.137
6.1.24 - 6.1.58

So updating your system to the latest stable kernels (5.10.200+, 5.15.138+, 6.1.59+, 6.5.8+) will fix it.