Possible System ANR caused by Rethink 0.5.5n 241117
Closed this issue · 4 comments
Rethink possibly does something it shouldn’t when ConnectivityService tries to figure out how many bytes have been transferred.
- IMPACT: resulting recursive processing causes deadlock in ConnectivityService that surfaces as a modal dialog “Process system isn't responding”
Reported after 5 seconds in logcat:
ActivityManager: ANR in system
- this is what displays the dialog
It appears:
- NetworkStatsService is retrieving number of transferred bytes, while
- NetworkPolicyManagerService wants to change a power-save whitelist, and
- those 3 threads share a mutex lock
- the problem could be caused by any recursion among the 4 threads, or
- NetworkStats or NetworkPolicy improperly processing messages while locks are held
After a logcat session and “adb bugreport” I could extract some stack traces that may indicate the issue to someone familiar with the Rethink codebase:
3 threads are blocked, one is labeled “android.ui”:
Thread 10 blocked trying to retrieve basic network information
"Binder:1761_2" prio=5 tid=10 Blocked
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x16880738 self=0x6f8ce5cb90
| sysTid=1784 nice=0 cgrp=default sched=0/0 handle=0x6d9c9d5cb0
| state=S schedstat=( 93178365016 20996661122 312707 ) utm=6596 stm=2720 core=0 HZ=100
| stack=0x6d9c8de000-0x6d9c8e0000 stackSize=991KB
| held mutexes=
at com.android.server.net.NetworkPolicyManagerService.isUidNetworkingBlocked(NetworkPolicyManagerService.java:5658)
- waiting to lock <0x01ac25bd> (a java.lang.Object) held by thread 73
at android.net.NetworkPolicyManager.isUidNetworkingBlocked(NetworkPolicyManager.java:561)
at android.net.connectivity.com.android.server.ConnectivityService.isNetworkWithCapabilitiesBlocked(ConnectivityServic
e.java:2244)
at android.net.connectivity.com.android.server.ConnectivityService.filterNetworkInfo(ConnectivityService.java:2295)
at android.net.connectivity.com.android.server.ConnectivityService.getFilteredNetworkInfo(ConnectivityService.java:230
5)
at android.net.connectivity.com.android.server.ConnectivityService.getActiveNetworkInfo(ConnectivityService.java:2323)
at android.net.IConnectivityManager$Stub.onTransact(IConnectivityManager.java:800)
at android.os.Binder.execTransactInternal(Binder.java:1184)
at android.os.Binder.execTransact(Binder.java:1143)
Thread 12 is blocked trying to do something to the power-save temporary whitelist apps
"android.ui" prio=5 tid=12 Blocked
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x16880ab0 self=0x6f8ce5e760
| sysTid=1858 nice=-2 cgrp=default sched=0/0 handle=0x6d878e1cb0
| state=S schedstat=( 48030805827 21559711586 518964 ) utm=2827 stm=1974 core=2 HZ=100
| stack=0x6d877de000-0x6d877e0000 stackSize=1039KB
| held mutexes=
at com.android.server.net.NetworkPolicyManagerService$NetworkPolicyManagerInternalImpl.onTempPowerSaveWhitelistChange(
NetworkPolicyManagerService.java:5758)
- waiting to lock <0x01ac25bd> (a java.lang.Object) held by thread 73
at com.android.server.DeviceIdleController.addPowerSaveTempWhitelistAppDirectInternal(DeviceIdleController.java:2924)
at com.android.server.DeviceIdleController$LocalService.addPowerSaveTempWhitelistAppDirect(DeviceIdleController.java:2
071)
at com.android.server.am.ActivityManagerService.pushTempAllowlist(ActivityManagerService.java:14762)
at com.android.server.am.ActivityManagerService$UiHandler.handleMessage(ActivityManagerService.java:1622)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loopOnce(Looper.java:201)
at android.os.Looper.loop(Looper.java:288)
at android.os.HandlerThread.run(HandlerThread.java:67)
at com.android.server.ServiceThread.run(ServiceThread.java:44)
at com.android.server.UiThread.run(UiThread.java:45)
thread 73 looking for a byte-count
"NetworkPolicy" prio=5 tid=73 Blocked [0/1876]
| group="main" sCount=1 ucsCount=0 flags=1 obj=0x16885e38 self=0x6f8ceeebd0
| sysTid=2082 nice=0 cgrp=default sched=0/0 handle=0x6d39a67cb0
| state=S schedstat=( 214175918817 13176636000 107735 ) utm=18313 stm=3103 core=0 HZ=100
| stack=0x6d39964000-0x6d39966000 stackSize=1039KB
| held mutexes=
at com.android.server.net.NetworkStatsService.internalGetHistoryForNetwork(NetworkStatsService.java:868)
- waiting to lock <0x0894e281> (a java.lang.Object) held by thread 72
at com.android.server.net.NetworkStatsService.internalGetSummaryForNetwork(NetworkStatsService.java:846)
at com.android.server.net.NetworkStatsService.getNetworkTotalBytes(NetworkStatsService.java:880)
at com.android.server.net.NetworkStatsService.access$2100(NetworkStatsService.java:175)
at com.android.server.net.NetworkStatsService$NetworkStatsManagerInternalImpl.getNetworkTotalBytes(NetworkStatsService.java:1677)
at com.android.server.net.NetworkPolicyManagerService.getNetworkTotalBytes(NetworkPolicyManagerService.java:5496)
at com.android.server.net.NetworkPolicyManagerService.getTotalBytes(NetworkPolicyManagerService.java:5491)
at com.android.server.net.NetworkPolicyManagerService.updateNetworkEnabledNL(NetworkPolicyManagerService.java:1891)
at com.android.server.net.NetworkPolicyManagerService.updateNetworksInternal(NetworkPolicyManagerService.java:1666)
- locked <0x01ac25bd> (a java.lang.Object)
- locked <0x0ebaa280> (a java.lang.Object)
at com.android.server.net.NetworkPolicyManagerService.access$400(NetworkPolicyManagerService.java:335)
at com.android.server.net.NetworkPolicyManagerService$3.onSubscriptionsChanged(NetworkPolicyManagerService.java:1012)
at android.telephony.TelephonyRegistryManager$1.lambda$onSubscriptionsChanged$0(TelephonyRegistryManager.java:116)
at android.telephony.TelephonyRegistryManager$1$$ExternalSyntheticLambda0.run(unavailable:2)
at android.os.Handler.handleCallback(Handler.java:938)
at android.os.Handler.dispatchMessage(Handler.java:99)
at android.os.Looper.loopOnce(Looper.java:201)
at android.os.Looper.loop(Looper.java:288)
at android.os.HandlerThread.run(HandlerThread.java:67)
thread 72 holds locks and is not Blocked
"NetworkStats" prio=5 tid=72 Runnable
| group="main" sCount=0 ucsCount=0 flags=0 obj=0x16885d58 self=0x6f8ceed000
| sysTid=2080 nice=0 cgrp=default sched=0/0 handle=0x6d3ab71cb0
| state=R schedstat=( 1797470426316 10061066054 97414 ) utm=176679 stm=3067 core=7 HZ=100
| stack=0x6d3aa6e000-0x6d3aa70000 stackSize=1039KB
| held mutexes= "mutator lock"(shared held)
at android.net.NetworkStats.findIndex(NetworkStats.java:758)
at android.net.NetworkStats.combineValues(NetworkStats.java:702)
at android.net.NetworkStats.combineAllValues(NetworkStats.java:748)
at com.android.server.net.NetworkStatsFactory.adjustForTunAnd464Xlat(NetworkStatsFactory.java:392)
at com.android.server.net.NetworkStatsFactory.readNetworkStatsDetail(NetworkStatsFactory.java:363)
- locked <0x085a6e68> (a java.lang.Object)
at com.android.server.net.NetworkStatsService.readNetworkStatsUidDetail(NetworkStatsService.java:1917)
at com.android.server.net.NetworkStatsService.getNetworkStatsUidDetail(NetworkStatsService.java:1933)
at com.android.server.net.NetworkStatsService.recordSnapshotLocked(NetworkStatsService.java:1440)
at com.android.server.net.NetworkStatsService.performPollLocked(NetworkStatsService.java:1529)
at com.android.server.net.NetworkStatsService.handleNotifyNetworkStatusLocked(NetworkStatsService.java:1293)
at com.android.server.net.NetworkStatsService.handleNotifyNetworkStatus(NetworkStatsService.java:1269)
- locked <0x0894e281> (a java.lang.Object)
at com.android.server.net.NetworkStatsService.access$500(NetworkStatsService.java:175)
at com.android.server.net.NetworkStatsService$NetworkStatsHandler.handleMessage(NetworkStatsService.java:385)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loopOnce(Looper.java:201)
at android.os.Looper.loop(Looper.java:288)
at android.os.HandlerThread.run(HandlerThread.java:67)
Thanks for the bug report. Must have taken quite some time as ANR dumps aren't all that small.
How frequently do you see this? Is there a set of steps that can reproduce this, even if intermittently?
We perhaps must be careful in our use of ConnectivityService APIs, we use quite a few deprecated ones where we shouldn't...
It is just general use that on rare occasions coincide with statistics collection, I think
It took 4 days for it to happen so it is not frequent, but I was determined to get some data on it
— At the time, airplane mode was on, just unlocked and looking at an off-line browser tapping around
After occurring first time after boot, the phone does show occasional slowness, like 30 s without displaying a dialog, or a dialog 2 hours later. The phone is in bad state after restarting a thread or something
— A reboot is back to normal: play again
An easy fix would be moving addPowerSaveTempWhitelistAppDirect off the UI thread using app-scoped Kotlin coroutine?
while(true) and a semaphore
launch the coroutine from invoking method from the Application component onCreate
The impact will then be nothing even if it locks up. Costs nothing, a coroutine is an object instance
coroutines in Kotlin is a time-shared thread-group, poor man’s parallelism
If it’s then still suspected to be broken, it can be instrumented with display in the UI
Thanks. If you are comfortable, please consider emailing me (mz at celzero dot com) the full ANR dump. No pressure.
An easy fix would be moving addPowerSaveTempWhitelistAppDirect off the UI thread using app-scoped Kotlin coroutine?
That code is Android framework's (Android's code that runs in each app's process or acts as a "server" processing app's requests). Albeit, these code paths may have been triggered by our code. We have to do a sweep to see which one it might be; not straight forward without the entire ANR trace (if you can't share it is okay).
This has nothing to do with Rethink, it’s some hardware issue with the device
- The System ANR comes directly from native code, unaffected by any app
- all 4 threads are doing good things, none is locked-up
- The system_service UI thread executes PUSH_TEMP_ALLOWLIST_UI_MSG, a message clearly designed for it and the thread itself appears to draw on-top things like volume sliders and such, independent of any app
At present, I can’t remember if I saw the dialog on another device
This phone is declared working for like 4 days at a time, after which reboot to continue