Evaluate switching from txdbus to other Python DBus package
Closed this issue · 34 comments
Description
A problem with using txdbus as the solution to talk to the BlueZ DBus service is that it seems to leak file descriptors somewhere. Right now this is solved by having a store of twisted reactors in the __init__.py of bleak.backends.bluez.
This might be improved by using another package. Candidates to evaluate:
- DBussy: https://github.com/ldo/dbussy
- Jeepney: https://jeepney.readthedocs.io/en/latest/
Pros
- Both are pure Python and would reduce the number of dependencies for Bleak
- It might resolve the file descriptor problem
- An asyncio-only solution on Linux
Cons
- Twisted (which powers txdbus) is a very stable and efficient package. Replacing it with newer, less tested packages might lead to unexpected problems.
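To get a feel for the Jeepney route, here is a minimal sketch of calling a BlueZ adapter method over the system bus with Jeepney's blocking I/O. This is only a sketch: Bleak would need the asyncio integration instead, and the adapter path /org/bluez/hci0 is an assumption.

# Minimal Jeepney sketch: call StartDiscovery on a BlueZ adapter.
# Assumes an adapter is exposed at /org/bluez/hci0.
from jeepney import DBusAddress, new_method_call
from jeepney.io.blocking import open_dbus_connection

adapter = DBusAddress(
    "/org/bluez/hci0",
    bus_name="org.bluez",
    interface="org.bluez.Adapter1",
)

connection = open_dbus_connection(bus="SYSTEM")
try:
    # StartDiscovery takes no arguments; discovered devices show up afterwards
    # as ObjectManager/PropertiesChanged signals on the bus.
    reply = connection.send_and_get_reply(new_method_call(adapter, "StartDiscovery"))
    print(reply)
finally:
    connection.close()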
I think I would be in favor of exploring one of the alternatives. I did some work last year to fix some bugs in txdbus. One bug (which fortunately doesn't affect Bleak) I was not able to fix because it would have required changing Twisted, and I found that to just be too complicated.
Both of the packages you have listed look good.
I am running into some weird issues with Twisted right now that seem to be linked to this issue. It seems to be very related to the asyncio issue that was raised. We will likely be looking into this in the coming weeks. If anyone else has input before we dive in, it would be much appreciated.
Here is the problem that I am seeing when trying to read a characteristic, for reference:
Traceback (most recent call last):
File ".../bleak/backends/bluezdbus/client.py", line 454, in read_gatt_char
).asFuture(asyncio.get_event_loop())
twisted.internet.error.ConnectionDone: Connection was closed cleanly.
@cspensky Please try! I think Jeepney is more promising than DBussy, even though I like the latter's name better.
Is there any progress on that? I'm having the same issue, but in my case it happens when the notify callback of a characteristic is called. I would very much appreciate it if this was solved soon.
@mitea1 as far as I know, no effort has been spent trying to solve this yet. Please give it a go if you feel up for a challenge!
@hbldh I'm afraid I will not find the time to dive deeply into this and solve it.
@cspensky Do you know if this only happens on specific BlueZ versions? I'm using 5.50 and bleak 0.9.1. If there were a version without this issue, I could downgrade to it.
@mitea1 I have been seeing this same issue, and have not found any solution (despite a reasonable amount of effort). My next step was going to be to switch the DBus package to fix it. Unfortunately, other things at work have taken my attention recently. I hope to have some time this month to address this issue. Please do let me know if you make any headway.
For reference, I am running BlueZ 5.48 and am on the develop branch of bleak.
@hbldh Do I understand correctly that at the moment there is no combination of bleak/BlueZ versions that works 100% without these DBus issues? So only the .NET backend works?
Just asking because I could probably switch to another bleak/BlueZ version if there is one that works without these issues.
Do I understand correctly that at the moment there is no combination of bleak/BlueZ versions that works 100% without these DBus issues?
No, I would not say that you are right. I have not personally experienced this in any of my usage of Bleak on Ubuntu, or at least not to an extent that has been annoying enough for me to remember it. I have used a variety of BlueZ versions over the course of several years, but I have not kept track of which. Some Linux OS versions and BlueZ releases have more problems than others, see e.g. #332. But I have no record of which have worked well, since such errors have been isolated occurrences that were solved either by identifying problematic hardware or by downgrading BlueZ or the Linux kernel.
What do you mean by 100%? That this will never happen? I cannot and will not guarantee that on any system. It is Bleak, plus several OS subsystems working together, combined with hardware, wireless communication and a connected peripheral with hardware and software of its own. Of course some strange disconnections will happen. Do not expect flawless execution with BLE.
So only the .NET backend works?
No. I would say that all three backends "work".
@cspensky I made an interesting observation:
I use Bleak partly to send (write) data to an "RX" characteristic and get responses from another "TX" characteristic that I subscribe to. This process is abstracted in a send_request() method. Inside this method I do the following:
- subscribe to the TX char (using start_notify())
- write to the RX char (using write_gatt_char())
- unsubscribe from the TX char (using stop_notify())
I noticed that I get this error after every second call of send_request().
Now I changed things a little bit: after I connect to the device I call start_notify() only once and unsubscribe just before disconnecting. So send_request() basically only consists of:
- write to the RX char (using write_gatt_char())
Now it works without any issues.
So to make it short: my assumption is that BlueZ has a problem if we use start_notify() several times in an ongoing session (no disconnection), even if you unsubscribe from the characteristic in between. So the solution for me was to call start_notify() only once after Bleak connects to a device.
@hbldh Maybe this sounds plausible to you, or you might have an explanation for it?
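For reference, a minimal sketch of the "subscribe once, write many times" pattern described above. The UUIDs and the address are placeholders (Nordic UART style), not from the original setup.

# Sketch: subscribe to TX once per connection, write to RX repeatedly.
import asyncio
from bleak import BleakClient

RX_CHAR = "6e400002-b5a3-f393-e0a9-e50e24dcca9e"  # placeholder write characteristic
TX_CHAR = "6e400003-b5a3-f393-e0a9-e50e24dcca9e"  # placeholder notify characteristic

def on_tx(sender, data):
    print("notification:", sender, data)

async def send_request(client, payload):
    # Only writes; the TX subscription is set up once after connecting.
    await client.write_gatt_char(RX_CHAR, payload)

async def main(address):
    client = BleakClient(address)
    await client.connect()
    try:
        await client.start_notify(TX_CHAR, on_tx)  # subscribe once per connection
        for i in range(5):
            await send_request(client, bytes([i]))
            await asyncio.sleep(0.5)
        await client.stop_notify(TX_CHAR)  # unsubscribe just before disconnecting
    finally:
        await client.disconnect()

asyncio.run(main("AA:BB:CC:DD:EE:FF"))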
@mitea1 Thank you for the insights! I am also using a TX and RX characteristic with notifications. Based on your findings, I will see if I can try to get to the bottom of the issue. May I ask what type of device you are communicating with? I saw this when interacting with an iPhone XR.
@cspensky Bleak is running inside a Docker image (balenalib/raspberrypi3-debian-python:3.7) on an RPi4, communicating with a proprietary embedded device that uses the Nordic SDK.
Traceback (most recent call last):
File ".../bleak/backends/bluezdbus/client.py", line 454, in read_gatt_char
).asFuture(asyncio.get_event_loop())
twisted.internet.error.ConnectionDone: Connection was closed cleanly.
Just out of curiosity... why are you using the read_gatt_char() functionality at all if you are using RX- and TX-like characteristics with notifications? Shouldn't you use the notify callback for the TX characteristic of the device you are communicating with instead, like I do? Otherwise you would have to poll the characteristic the whole time.
Another alternative could be the "Bluetooth Management sockets": http://www.bluez.org/the-management-interface/. I do not think it is a good idea (lots of work, etc.), but it deserves to be mentioned.
python-dbus-next is another good option.
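For comparison, a rough sketch of the same StartDiscovery call using dbus-next's asyncio API; the adapter path /org/bluez/hci0 is again an assumption.

# Rough dbus-next sketch (assumes a BlueZ adapter at /org/bluez/hci0).
import asyncio
from dbus_next.aio import MessageBus
from dbus_next.constants import BusType

async def start_discovery():
    bus = await MessageBus(bus_type=BusType.SYSTEM).connect()
    # dbus-next builds a proxy from introspection data and exposes each
    # D-Bus method as a call_* coroutine on the interface object.
    introspection = await bus.introspect("org.bluez", "/org/bluez/hci0")
    adapter = bus.get_proxy_object("org.bluez", "/org/bluez/hci0", introspection)
    await adapter.get_interface("org.bluez.Adapter1").call_start_discovery()

asyncio.run(start_discovery())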
After playing with dbus-next for a bit, it seems to be a pretty heavyweight solution. Is anyone else seeing pretty high CPU usage, especially when scanning? I may poke around a bit and see if one of the other solutions is a bit more lightweight... Would love to get some input before I jump into this.
Can you do some profiling to see what is using the CPU? I'm surprised that it would be much different from txdbus.
I just ran a quick instance with cProfile where I ran a scanner until 3 devices were found and connected to.
Sorted by total time, it seems like the unmarshalling of BlueZ data is the main culprit. There is a clear spike in CPU usage when scanning, which makes sense for a pure Python 3 solution.
ncalls tottime percall cumtime percall filename:lineno(function)
1287 14.84 0.01153 14.84 0.01153 ~:0(<method 'poll' of 'select.epoll' objects>)
2550 0.2162 8.48e-05 0.6628 0.0002599 traceback.py:321(extract)
25169 0.1803 7.163e-06 0.1803 7.163e-06 ~:0(<built-in method posix.stat>)
75383 0.08302 1.101e-06 0.1328 1.762e-06 unmarshaller.py:49(read)
1287 0.0561 4.359e-05 17.48 0.01358 base_events.py:1784(_run_once)
19584 0.05558 2.838e-06 0.2065 1.054e-05 linecache.py:53(checkcache)
1239 0.05236 4.226e-05 0.665 0.0005367 unmarshaller.py:241(_unmarshall)
215290/214199 0.0449 2.096e-07 0.04518 2.109e-07 ~:0(<built-in method builtins.len>)
1655 0.04296 2.596e-05 0.09222 5.572e-05 __init__.py:284(__init__)
21673 0.03959 1.827e-06 0.1033 4.766e-06 unmarshaller.py:158(read_ctype)
1655 0.03751 2.267e-05 0.03751 2.267e-05 ~:0(<method 'write' of '_io.TextIOWrapper' objects>)
246 0.03678 0.0001495 0.03678 0.0001495 ~:0(<method 'read' of '_io.BufferedReader' objects>)
36102/2445 0.03607 1.475e-05 0.4629 0.0001893 unmarshaller.py:233(read_argument)
19806 0.03494 1.764e-06 0.1212 6.122e-06 traceback.py:285(line)
36598 0.03487 9.528e-07 0.03487 9.528e-07 traceback.py:292(walk_stack)
40556 0.03451 8.509e-07 0.06644 1.638e-06 unmarshaller.py:122(align)
34088 0.03195 9.373e-07 0.03981 1.168e-06 linecache.py:147(lazycache)
14 0.03127 0.002234 0.03252 0.002323 ~:0(<built-in method _imp.create_dynamic>)
18763 0.03078 1.64e-06 0.07633 4.068e-06 linecache.py:15(getline)
34088 0.03051 8.95e-07 0.03051 8.95e-07 traceback.py:243(__init__)
15086/11300 0.02908 2.573e-06 0.07188 6.361e-06 signature.py:251(verify)
102468 0.02838 2.77e-07 0.0358 3.494e-07 ~:0(<built-in method builtins.isinstance>)
8730/8517 0.0269 3.158e-06 0.2647 3.108e-05 unmarshaller.py:181(read_variant)
4505/1807 0.02534 1.403e-05 0.4487 0.0002483 unmarshaller.py:204(read_array)
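For reference, a profile like this can be collected with something along these lines. This is my reconstruction, not the exact script that produced the table above; it only scans, whereas the run above also connected to three devices.

# Rough sketch of profiling a BleakScanner run with cProfile.
import asyncio
import cProfile
import pstats

from bleak import BleakScanner

async def scan():
    # Scan for a few seconds; discovered devices are returned as a list.
    return await BleakScanner.discover(timeout=5.0)

profiler = cProfile.Profile()
profiler.enable()
asyncio.run(scan())
profiler.disable()

pstats.Stats(profiler).sort_stats("tottime").print_stats(20)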
Thanks for that. txdbus is pure Python too, so if it has better performance, maybe the unmarshaller can be copied from there.
The Jeepney library specifically has performance as a non-goal, so I think that rules out that library.
Unfortunately txdbus had some other issues for me, so I was never able to really stress test it. The dbus-next migration seemed to solve those.
I used to contribute to PyBluez, which used pygattlib, which was just a gatttool wrapper for Python.
Is there a particular reason that we want Bleak to use D-Bus or pure Python instead of a more low-level, faster approach?
I will likely need to implement something more performant for my needs, and if there is a desire to have this incorporated in Bleak, I'm happy to drive that effort. I plan to keep the Bleak API intact either way, so it will be compatible (the idea of a universal API in Python is fantastic!).
This doesn't appear to be as big of a problem on OSX, because it seems like CoreBluetooth does a lot of the heavy lifting with data processing. I have not had the opportunity to test on Windows yet.
Is performance important for advertising data, characteristic notifications or both?
Is there a particular reason that we want Bleak to use D-Bus or pure Python instead of a more low-level, faster approach?
Using such low-level calls requires elevated privileges, requires in-depth knowledge of the Bluetooth protocol, and could potentially interfere with BlueZ, which will still be running in the background anyway.
For me personally, I think I would rather spend the time fixing BlueZ than trying to work around it.
I'd love to have it be more performant all around. I'm not proposing going away from BlueZ, but using a more direct interface. Something like gattlib's Python interface (pygattlib) should have significantly less overhead than a pure-Python D-Bus solution.
The library still seems to be maintained as well.
Ah, I see I misunderstood how that library worked (based on the readme, it sounds like I'm not the only one).
The only downside I see is that it can be a pain for users on non-x86 architectures to compile the library. But maybe it makes sense to add it as an optional backend for those who need the performance benefit?
This was the same conclusion that @Bernstern and I came to yesterday. I plan to add gattlib as an optional backend on Linux. Some preliminary testing has led me to believe that it will be pretty straightforward and significantly faster. There will hopefully be a pull request in the near future.
@cspensky I might have missed something in the discussion above, but I just want to weigh in with my opinion regarding the inclusion of gattlib in Bleak.
I used to use pygattlib a lot about four to five years ago. It was for high-speed transfer of data from MbientLab sensors. It handled very high rates of transfer and I was very content with it. It was at the time (2016-2019) not maintained and there were some bugs that were problematic, but manageable. Then came the announcement that BlueZ would deprecate gatttool and only support the new DBus API that they were working on. I realized that in order to keep a working BLE testing environment in Python I would need to do something else. It does not matter how much I prefer not using such a verbose and "chatty" solution as the DBus system; it was the direction of BlueZ, which is the de facto standard for Bluetooth in most Linux distributions.
Consequently, I created Bleak as an exercise in cross-platform API building and asyncio coding. pygattlib had all the threading in the C++ parts and I disliked not being able to debug easily. Thus I stipulated some ground rules for myself and went to work. This package is, with contributions from several others, the result.
If you, by some impressive feat of software engineering, manage to coax gattlib into an asyncio coat, then I will be very impressed. It might be doable, I don't know, because I have never explored that path. I would still prefer it not to be in the pip-installable bleak package though, even as an optional backend available via extras. I think you cannot escape e.g. compilation, extra apt-installable packages and permission/capability handling with libcap or similar.
My suggestion is that you make a new package and host it yourselves on PyPI. I would gladly include references to it in the Bleak docs and README, but since it relies on a "possibly-soon-deprecated" part of BlueZ and probably will violate some of the initial ideas of Bleak, I argue that it should be an external add-on in that case.
Another option is to actually use pygattlib instead of Bleak. It is very good and it is what I used as a model for Bleak, and if it has gotten active maintenance once again, then I highly recommend it for high-throughput demands. DBus is an unavoidable bottleneck, I think, regardless of how optimized the marshalling/unmarshalling you write is. You should, in my opinion, use Bleak for what it is written for (cross-platform without much hassle) and pygattlib when you need high speed on a Linux system.
@hbldh Thank you for the detailed response. Indeed, I agree that Bleak is the best of the various attempts that I have worked with, hence my current involvement in this repository. Thank you for all of your work!
I did start down the gattlib-as-a-Bleak-backend path out of curiosity but have since pulled the plug. For interested parties, it does seem to be possible to make it play nicely with async; you can see my branches here: bleak and https://github.com/Allthenticate/gattlib/tree/development. It appears that D-Bus in general is the bottleneck and not the Python code in particular. I ran some tests in pure C vs. the Python version and did not see much difference, unfortunately. The library that I was comparing the speed against was bluepy. However, it appears that they are circumventing D-Bus and doing some hacky stuff where they wrap a compiled binary with Python to get that speed gain.
The only way I've really found to get our CPU usage down is by adding scanning filters or, in our case, acting as a peripheral instead of a central. Thus, I plan to jump on board with bless and help push that forward.
I am working on replacing txdbus with dbus-next. Or rather with asyncdbus, which is a fork of mine that supports Trio as well as asyncio, using the anyio adapter.
Scanning works already, and getting it to work was easy enough that I expect the rest to fall into place without too much hassle. Progress is somewhat limited by the fact that asyncdbus needs anyio v3, which is not even in alpha yet.
asyncdbus: https://github.com/M-o-a-T/asyncdbus, also on PyPI
anyio: https://github.com/agronholm/anyio (at the moment you need the start_task branch)
@smurfix That sounds great, but be advised that there is already work done on porting to dbus-next (in the dbus-next-2 branch) and we are planning on merging that as soon as it is stable. You might want to base your efforts on that if you want to stay compatible with bleak.
Yeah, I saw. My problem here is that there's no good way to do this the anyio/Trio way without deprecating .connect/.disconnect and replacing them with an async context manager. That applies to dbus-next as well as bleak, so a direct merge is a bit difficult.
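For illustration, the two connection styles side by side; BleakClient has supported the async-with form for a while, and the address below is a placeholder.

# Explicit connect()/disconnect() vs. the async context manager style.
import asyncio
from bleak import BleakClient

ADDRESS = "AA:BB:CC:DD:EE:FF"  # placeholder device address

async def explicit_style():
    client = BleakClient(ADDRESS)
    await client.connect()
    try:
        print("connected to", client.address)
    finally:
        await client.disconnect()

async def context_manager_style():
    # Connection lifetime is tied to the with-block; no explicit disconnect call.
    async with BleakClient(ADDRESS) as client:
        print("connected to", client.address)

asyncio.run(context_manager_style())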
That being said, I learned (and copied) a lot from the code in the dbus-next-2 branch, to the point that I can now use my code to reliably interrogate my [censored] blood pressure cuff. I have pushed the latest versions.
The cuff reader is in https://github.com/smurfix/ble-stuff.
I plan to check with the dbus-next maintainers about merging my fork back, but (a) that will have to wait for anyio v3 to be reasonably stable/ready, and (b) as I wrote, it does require some incompatible changes.
I'm not sure this issue is the best place to mention it, but the current performance of Bleak with the BlueZ backend is really poor. I mean that even very basic code can consume a lot of processor time on small processors.
import asyncio
from bleak import BleakScanner
from bleak.backends.device import BLEDevice

def device_detection_callback(device: BLEDevice, advertisement_data):
    print(device, advertisement_data)

async def run():
    async with BleakScanner(scanning_mode='passive') as scanner:
        scanner.register_detection_callback(device_detection_callback)
        await asyncio.sleep(3)
        devices = await scanner.get_discovered_devices()
        print('New devices: ', devices)
        await asyncio.sleep(1)

asyncio.run(run())
This code consumes up to 60-70% CPU running on an imx6ull 680 MHz processor (arm32, single core). The results are similar for both the txdbus- and dbus-next-based backends.
I compared the results with the full-featured software https://github.com/espruino/EspruinoHub, based on the JS library "noble" for accessing BLE. CPU time is less than 3% for that software, including scanning BLE and sending data to MQTT.
I suppose that it would be hard to reach the same performance using Python-only dependencies. Maybe some approaches from noble could be used for the bleak backend?
@devbis That is problematic indeed... Is it Bleak or the DBus system that consumes the CPU? If it is Bleak and not the DBus system, then I have to look at what actually takes time.
Noble does not - as far as I know - use the DBus API of BlueZ, but implements its own solution, so it is not burdened by the overhead of DBus. It is an impressive feat and I have the utmost respect for the effort the developers have made to achieve such performance in Noble.
However, I am neither capable of nor interested in copying that for Bleak. Bleak was meant to be the opposite of Noble in a way: only use existing, built-in, OS-native BLE solutions. Bleak will not implement communication of its own or regress to using old solutions. In case you want that, there is the pygattlib project, which avoids DBus by doing something similar to Noble.
@devbis Unfortunately, we had the same issues on both OSX and Linux. It appears to be a fundamental problem with parsing the high-throughput scanning data in Python, which is inherently slow. You can cut the CPU time a bit by filtering based on UUID or RSSI, for example. I submitted a fix a while back that addressed some of this on the dbus-next branch. However, it was still too CPU-intensive for our needs. We now use Bleak for testing, but had to abandon it for anything more than poking around with devices. A fast, albeit EOL'd, alternative that has a C plugin and does not use D-Bus is https://github.com/IanHarvey/bluepy
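As a rough illustration of that kind of filtering with a recent BleakScanner API (the service_uuids parameter is assumed to be available, and the UUID below is just a placeholder for whatever service you actually care about):

# Sketch: restrict scanning to one service UUID to reduce advertisement traffic.
import asyncio
from bleak import BleakScanner

TARGET_SERVICE = "0000180d-0000-1000-8000-00805f9b34fb"  # placeholder: Heart Rate service

def on_detection(device, advertisement_data):
    print(device, advertisement_data)

async def main():
    async with BleakScanner(on_detection, service_uuids=[TARGET_SERVICE]):
        await asyncio.sleep(10.0)

asyncio.run(main())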
@cspensky If you get a chance it would be great to know if 0.19.0 resolves the performance concerns