android-binder-ipc

A project to analyse and test the Android Binder IPC mechanism as well as its related C++ library interface


30 Mar 2012
No updates for quite a while now - I've been busy with an iPhone app and with finding a full-time job :(

Anyway, I've had my Galaxy Nexus for a few days now. I'm still trying to get the hang of it; once I have a feel for its response time and latency, I'll load and test the new driver on it.





29 Feb 2012
Just fixed another issue that caused apps to crash on my U8150. It stems from some poorly thought-out design details in the binder protocol and framework implementation; see the comments in the source code.

As of today, the new binder driver runs on Goldfish 4.0.3 (ICS) and U8150 2.2 (Froyo). It's fully compatible with the old driver (at least as far as everything discovered so far goes), which means no changes are required in the framework. All you need to do is apply the patch to the kernel and load the new kernel onto the device.

As always, this is just a start - a nice one though. Further tests, bug fixes and improvements are on the way.

Todo/Reminder:
* Priority inheritance to be reviewed and added.





17 Feb 2012
It's been a while since this file was last updated, although the driver has been moving forward all the time.

Fixed a few things, and a couple of days ago the driver could get Goldfish to boot all the way; the system appeared normal at the end of booting. The only issue is that after the boot animation stops, the home screen doesn't show up. It's been driving me nuts...

To summarize the major changes made in the last few days:

* mmap and the shared kernel/user buffer were finally added (a minimal user-space sketch follows this list). I could have stuck to the inlined buffer mode, but there is just too much overhead in doing that, and the framework, the driver and applications are all built around the fact that a shared buffer is used. Another benefit is that a full drop-in replacement of the existing driver is now possible, without requiring patches to the framework or service manager.

* a simple memory allocator (a fast slob) is implemented to manage the shared buffer used for transaction data (a rough sketch of the idea appears at the end of this entry). It seems to be working great; the only limitation is an upper limit on the maximum buffer size that can be allocated. I will restructure the interface a bit so the service manager and the rest of the framework can use different buffer allocation sizes.

* passing shared file handles across the process boundary is now supported.

* object reference counting has just been added; it seems to be working fine, although further tests are required.

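Here's the minimal user-space sketch mentioned above. It's illustrative only, not this project's actual code: the device node name and mapping size are assumptions, and the framework's ProcessState does essentially the same dance against the stock driver.

/* Hedged sketch - user-space side of the shared kernel/user buffer. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

#define MAP_SIZE (1 * 1024 * 1024)      /* assumed size of the shared area */

int main(void)
{
    int fd = open("/dev/binder", O_RDWR);
    if (fd < 0) {
        perror("open /dev/binder");
        return 1;
    }

    /* The driver copies transaction data straight into this read-only
     * mapping, so incoming payloads are consumed in place rather than
     * being copied a second time into user space. */
    void *buf = mmap(NULL, MAP_SIZE, PROT_READ, MAP_PRIVATE, fd, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* ... BINDER_WRITE_READ ioctls then return offsets into buf ... */
    return 0;
}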
So far, the driver is not much different from the existing driver in terms of functionality. I think the only thing left is the re-nice stuff when sending messages. It's trivial and shouldn't affect the function of the driver, so I decided to do it a little later (after some careful thought, as usual, of course).

I'll be focusing on getting the home screen up - it appears to be the last hurdle to a working version.
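And here's the rough allocator sketch mentioned in the list above: a fixed-slot, bitmap-based toy just to illustrate the "allocate fast, up to a maximum size" idea. The names, slot size and bitmap scheme are all my assumptions, not the driver's actual fast slob, and locking is omitted for brevity.

/* Illustrative fixed-slot allocator over the shared buffer (not the
 * driver's actual implementation). */
#include <linux/bitops.h>
#include <linux/types.h>

#define BUF_SIZE   (1 << 20)                 /* assumed size of shared area */
#define SLOT_SIZE  512                       /* assumed per-allocation cap  */
#define NR_SLOTS   (BUF_SIZE / SLOT_SIZE)

struct fast_slob {
    void          *base;                     /* start of the shared buffer  */
    unsigned long  bitmap[BITS_TO_LONGS(NR_SLOTS)];
};

static void *slob_alloc(struct fast_slob *s, size_t size)
{
    unsigned long slot;

    if (size > SLOT_SIZE)                    /* the documented upper limit  */
        return NULL;

    slot = find_first_zero_bit(s->bitmap, NR_SLOTS);
    if (slot >= NR_SLOTS)
        return NULL;                         /* shared buffer exhausted     */

    set_bit(slot, s->bitmap);
    return s->base + slot * SLOT_SIZE;
}

static void slob_free(struct fast_slob *s, void *p)
{
    clear_bit((p - s->base) / SLOT_SIZE, s->bitmap);
}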



31 Jan 2012
Okay, I decided to use the same test program, binder_tester, to test the performance of the old driver. It was easy to add support for the old driver, but the initial result didn't quite impress me - in fact I was a little disappointed, as on average the round-trip time on the old driver is around 17us, whereas, as shown earlier, the new driver averaged around 33us.

That wasn't a good feeling after all this effort. So I tried all sorts of measurements to find out why the performance of the new driver was bad. Both the timestamping and the perf-sched tool pointed out that the new driver has an 8-10us scheduling delay, where the old one has around 5us. Multiply that by two and you get an extra 6-10us, which roughly matched the round-trip time difference. But why should the scheduler treat the two drivers differently? Again, I spent hours and hours trying to find a behavioural difference between the two drivers. Tried increasing/decreasing the transaction data size - nah, didn't work. Tried generating the same response commands at the same points, so the application literally made the same number of ioctl() calls for both drivers - didn't work either. Even tried reducing the two data copies in the driver to one (by faking it) so it acted just like the old driver - no luck. I timestamped many different points in the driver; none of them gave much of a clue, except that they all pointed to the large scheduling delay.

Eventually, I ran out of ideas. I thought I'd make one last attempt by adding the same instrumentation points to the old driver, so I could see on which path it actually did better. The moment I tried, I was shocked - once the old driver was equipped with the same instrumentation points, it yielded the same or even a slightly longer round-trip time. I couldn't understand why a dozen memory writes would cause that big a scheduling delay. Anyway, seeing this, I decided to disable all the instrumentation points in both drivers and redo the performance test. This time, I got better - or should I say expected - results. The new driver without those instrumentation points gets around a 16us round-trip time. Hooray! The perf-sched tool shows both drivers get a similar scheduling delay, about 5-6us.

So finally all this work paid off. The new driver, with a clean design and implementation, beats the old driver, even though it does twice the memory copying. BTW, the instrumentation buffer I used in the test is 272 bytes, carried back and forth as transaction data in the transaction and reply messages, which I think should be bigger than most IPC calls (?). Anyway, at least it proves that at that data size or less, the memory-copying delay is negligible - to be precise, it's about 1-2us measured on my test machine.

So here are some test results to show off,

* new driver
C_SEND		S_RECV		S_REPLY		C_RECV
0.000000	0.000007	0.000009	0.000016

* old driver
C_SEND		S_RECV		S_REPLY		C_RECV
0.000000	0.000007	0.000010	0.000017

The unit is seconds. C_SEND to C_RECV is exactly the round-trip time measured in the user application. The results shown are averaged over 10,000 iterations.
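For reference, this is roughly how such a round trip can be timed from user space; binder_tester's actual code may differ, and do_transaction() here is a hypothetical placeholder for one two-way binder call.

#include <stdio.h>
#include <time.h>

extern void do_transaction(void);      /* hypothetical: one two-way call  */

static double now_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void)
{
    const int iters = 10000;
    double total = 0.0;

    for (int i = 0; i < iters; i++) {
        double t0 = now_us();          /* C_SEND */
        do_transaction();
        total += now_us() - t0;        /* C_RECV */
    }
    printf("avg round trip: %.2f us\n", total / iters);
    return 0;
}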

Finally, the last remaining question is whether the 5-6us scheduling delay seen with both drivers can be reduced. As learned above, either the speculative reads/writes incurred by those instrumentation points - trivial as they are (8 bytes each, up to 12 times) - or maybe do_gettimeofday upsets the scheduler, or the data cache, or both. I'll be looking into this a little later.
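To make the cost concrete, each instrumentation point discussed above amounts to roughly the following in the driver: one do_gettimeofday() call plus an 8-byte store. This is a sketch only; the function and parameter names are invented, not taken from the driver.

#include <linux/time.h>
#include <linux/types.h>

/* Illustrative instrumentation point: one do_gettimeofday() plus one
 * 8-byte write into a per-transaction timestamp slot. */
static inline void binder_stamp(u64 *ts_buf, int *ts_idx)
{
    struct timeval tv;

    do_gettimeofday(&tv);
    ts_buf[(*ts_idx)++] = (u64)tv.tv_sec * 1000000ULL + tv.tv_usec;
}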

More tests will be done over the following days to compare the performance of the two drivers in other scenarios, especially concurrency tests. Stability and functionality will also be tested and improved as I go.


28 Jan 2012
The test program binder_tester works like a charm. It timestamps the transaction-reply round-trip time, plus all the critical points the data goes through. Example shown below:

  C_SEND	   K_IOC	 K_WR_IN	   K_ENQ	   K_DEQ	  K_COPY	  S_RECV	 S_REPLY	   K_IOC	 K_WR_IN	   K_ENQ	   K_DEQ	  K_COPY	  C_RECV	  C_EXIT	
0.000000	0.000001	0.000002	0.000004	0.000012	0.000013	0.000014	0.000016	0.000017	0.000018	0.000019	0.000028	0.000030	0.000031	0.000033	
0.000000	0.000002	0.000003	0.000004	0.000012	0.000013	0.000015	0.000016	0.000017	0.000018	0.000019	0.000028	0.000030	0.000031	0.000033	
0.000000	0.000002	0.000003	0.000004	0.000012	0.000013	0.000014	0.000016	0.000017	0.000018	0.000019	0.000028	0.000030	0.000031	0.000032	

The test shows that with a simple C implementation of the server and client, a two-way transaction round-trip time is about 34us, compared to the 150us-300us I have seen with binderAddInts, which is implemented on top of the framework.

Further analysis shows that the majority of that round-trip time (34us) is spent on context switching, which can be demonstrated by the delay between K_ENQ and K_DEQ (this happens twice for a two-way message). Those two delays take about half of the total time (~8us each). There are at least two context switches and one task switch in that delay, as shown below:

client binder_thread_write (K_ENQ) 
client binder_thread_read (read TRANSACTION_COMPLETE) -- kernel to user switch
client ioctl (to read reply)                          -- user to kernel switch
client blocked on waiting on reply                    -- task switch/wakeup
server binder_thread_read (K_DEQ)
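To relate those steps to what the client actually does, here's a hedged sketch of the two BINDER_WRITE_READ ioctls behind a round trip, using the stock binder command stream (the new driver is protocol-compatible). The header location and buffer layout are simplified assumptions.

#include <sys/ioctl.h>
#include <linux/binder.h>   /* assumed header; the project ships its own copy */

static void client_round_trip(int fd, void *txn, size_t txn_size,
                              void *rbuf, size_t rbuf_size)
{
    struct binder_write_read bwr = {0};

    /* 1st ioctl: txn carries BC_TRANSACTION, queued for the server (K_ENQ);
     * the read half returns BR_TRANSACTION_COMPLETE - the kernel/user
     * switches listed above */
    bwr.write_buffer = (unsigned long)txn;
    bwr.write_size   = txn_size;
    bwr.read_buffer  = (unsigned long)rbuf;
    bwr.read_size    = rbuf_size;
    ioctl(fd, BINDER_WRITE_READ, &bwr);

    /* 2nd ioctl: nothing left to write; block until the server's BR_REPLY
     * arrives - the task switch / wake-up that accounts for most of the
     * K_ENQ to K_DEQ delay */
    bwr.write_size    = 0;
    bwr.read_consumed = 0;
    ioctl(fd, BINDER_WRITE_READ, &bwr);
}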

Other timestamps suggest a kernel-to-user or user-to-kernel context switch takes about 1us, which leaves about 6us spent on a task switch / wake-up. I say "wake-up" because on SMP (which I'm using), switching tasks could simply mean waking up the ready task on another CPU - although that's not always the case, as the scheduler considers other factors, such as cache pollution, before scheduling a task onto a different CPU. To further complicate matters, I'm using BFS as the scheduler; the 3.1.6 vanilla scheduler behaves about 20-30% worse.

Another thing the above results suggest is that the framework-based test app binderAddInts can't be used for testing the performance of the driver, as the time spent in the driver (~20-25us) is just a fraction of the total time taken by a binderAddInts two-way message. It also proves that, at the very least, the extra memory copying I added in the new driver didn't or couldn't do any harm, while it simplifies the driver logic.

 









!!! Tested on x86 32-bit only (for now) - I'm using Ubuntu 10.04

To Build:

1. Build binder module(s)
	$ cd module; make
	* Two drivers are built: binder_new.ko, the version I'm working on, and binder_old.ko, the existing binder driver
	* For binder_old.ko, see gen_deps.sh & deps.c for how missing symbols are patched :) 

2. Build service manager
	$ cd servicemanager; make

3. Build binder library
	$ cd libs; make

4. Build the test program
	$ cd test; make


To Test (run as root for now)
1. Load driver
	# cd module; insmod binder_new.ko

2. Start service manager
	# cd servicemanager; ./servicemanager &

3. Run the test program 
	# cd test
	# ./binderAddInts -n 10000
	# ./binderAddInts -n 10000 -d 0

Notes:
1. The two drivers are currently incompatible. To run the above tests on the old driver, comment out the INLINE_TRANSACTION_DATA macro in the following two files and recompile everything:
	libs/binder/IPCThreadState.cpp
	servicemanager/binder.c
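Concretely, the toggle looks something like this in both files (illustrative; the exact line may differ):

/* #define INLINE_TRANSACTION_DATA */   /* commented out: build for the old driver */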


2. With the old driver, run the above two tests - one with a delay between iterations and the other without - and you'll see the average time per iteration is much longer for the one without a delay. It's most likely due to inefficient locking in the binder driver, i.e. the global binder_lock. So I guess that's step one for this project to address.

	# ./binderAddInts -n 10000
serverCPU:  unbound
clientCPU:  unbound
iterations: 10000
iterDelay: 0.001
Time per iteration min: 0.000151486 avg: 0.000181443 max: 0.00497661

	# ./binderAddInts -n 10000 -d 0
serverCPU:  unbound
clientCPU:  unbound
iterations: 10000
iterDelay: 0
Time per iteration min: 0.000139962 avg: 0.000258004 max: 0.00490384


3. The current results for the new driver. Not much better yet, but these are just the first tests; tuning is on the way.
	# ./binderAddInts -n 10000
serverCPU:  unbound
clientCPU:  unbound
iterations: 10000
iterDelay: 0.001
Time per iteration min: 0.000142057 avg: 0.000203101 max: 0.00329881

	# ./binderAddInts -n 10000 -d 0
serverCPU:  unbound
clientCPU:  unbound
iterations: 10000
iterDelay: 0
Time per iteration min: 0.000139124 avg: 0.000242037 max: 0.00243285