[go: nahoru, domu]

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moving NotificationDismissBroadcastReceiver logic to NotificationsProcessingService #7779

Merged
merged 3 commits into from
Jun 12, 2018

Conversation

mzorz
Copy link
Contributor
@mzorz mzorz commented May 18, 2018

comes from #7773 (comment)

Thank you @malinajirka for your comments there, that made me look closer! 🙇

How will creating a new Thread prevent the application to end up in a deadlock? Won't the only effect be that instead of UI thread and Thread XYZ being in deadlock there will be Thread ABC and Thread XYZ in deadlock? So it will actually fix the ANR but not the really issue?

I think you nailed it, good question @malinajirka! And I realize my description of the problem in that PR didn't do justice to the exact specific problem. TBH I had the same doubts as well. I think the way I approached it is thinking we may fix the ANR even though we may not fix the root cause of the problem, as this fix would help avoid the conditions on which the deadlock is likely to happen, which in the end is already a probably good enough improvement.

Let me explain a bit more how I think things go then - looking into it a bit closer, the dump shows this:

"main" tid=1 Blocked 
[...]
at org.wordpress.android.push.GCMMessageService.removeNotification (GCMMessageService.java)
- waiting to lock <0x0d44704c> (a java.lang.Class<org.wordpress.android.push.GCMMessageService>) held by thread 23
  at org.wordpress.android.ui.notifications.NotificationDismissBroadcastReceiver.onReceive (NotificationDismissBroadcastReceiver.java:20)

Note that thread 23 does not appear in the dump (been going through a lot of other reports and none of them seem to have the dump for the blocking thread).

Basically that broadcast receiver is set to be called as the delete intent for the notification, i.e. when the user either dismisses it (slide off the screen) or taps on it, the notification is dismissed and this intent gets triggered, and received by the NotificationDismissBroadcastReceiver. What it does internally is it calls removeNotification(), which is locked on the class object itself (GCMMessageService.class) as you've seen.
At the same time the Activity gets launched and the onCreate() method is executed. This method also calls a synchronized method within GCMMessageService.
Here comes the important part: both the Activity lifecycle and the BroadcastReceiver are executed on the main thread (see https://developer.android.com/reference/android/content/BroadcastReceiver#onreceive). I'm tempted to say that's where the deadlock happens, but I'm with you in that there's no 2 threads in this situation, it's the same thread (main thread) locking on the same lock object.

After looking into the reports a bit more, I found something interesting.

There are other similar instances happening, all seem to be due to the synchronized methods in GCMMessageService class
1 - this one happens in WPMainActivity.onResume as it calls GCMMessageService.removeAllNotifications

"main" tid=1 Blocked 
[...]
| held mutexes=
[...]
at org.wordpress.android.push.GCMMessageService.removeAllNotifications (GCMMessageService.java)
- waiting to lock <0x0a59d727> (a java.lang.Class<org.wordpress.android.push.GCMMessageService>) held by thread 71
  at org.wordpress.android.ui.main.WPMainActivity.onResume (WPMainActivity.java:491)

2 - WPMainActivity.onCreate calls WPMainActivity.launchWithNoteId which holds on GCMMessageService.getNotificationsCount()

held mutexes=
  at org.wordpress.android.push.GCMMessageService.getNotificationsCount (GCMMessageService.java)
- waiting to lock <0x0b0f597b> (a java.lang.Class<org.wordpress.android.push.GCMMessageService>) held by thread 68
  at org.wordpress.android.ui.main.WPMainActivity.launchWithNoteId (WPMainActivity.java:399)
  at org.wordpress.android.ui.main.WPMainActivity.onCreate (WPMainActivity.java:267)

In #1, the user taps on a notification and thus that's why WPMainActivity.onResume is executed (main pendingIntent for the notification is to wake it up and launch the NotificationsDetailActivity), but then, also the broadcast receiver should be triggered (and is not shown in the dump). Assuming only the second one to arrive gets dumped, thenit would make sense that this is happening sometimes for one (broadcastreceiver) and sometimes for other entry point in the app where the broadcastreceiver "locked" the object first.

There are many more, and I find a pattern that the blocking thread is not in the dump (the blocked thread only is), but when the blocked thread is NotificationDismissBroadcastReceiver.onReceive then the blocking must be elsewhere, and when the blocked thread is something else, the NotificationDismissBroadcastReceiver.onReceive doesn't appear. This also seems to make sense according to the amount of reports, as we have approximately half of reports on NotificationDismissBroadcastReceiver.onReceive being blocked, and the rest of the reports are other threads supposedly being blocked by the other threads. Given odds of execution should be even, I tend to think that this is, in fact, a deadlock problem.

So, knowledge adds up and gives some meaning to all this information:
1 - tapping on any notification will produce both the actual intent (bringing the app to the foreground, then launching NotificationDetailActivity to show the Note detail), and trigger the NotificationDismissBroadcastReceiver receiver.
2 - the amount of reports and variety tend to show it's always something happening between the NotificationDismissBroadcastReceiver and catching the other thread wanting to access a GCMMessageService synchronized method as well, on any of various entry points which match the lifecycle of the notification-tapped intent.
3 - given this, I think we might have a better chance solving the issue by encapsulating the BroadcastReceiver execution in a different thread, rather than doing it separately on each call from WPMainActivity or NotificationsDetailActivity (also this should be more future-proof).

Therefore, this PR moves the logic to the already existing NotificationsProcessingService, adding a specific action to handle notification dismissal processing, and gets rid of the broadcast receiver.

I think it works fine, and hope will resolve the issue!

@malinajirka
Copy link
Contributor

Thank you @mzorz for great analysation and summary of all the information we have. Couple of notes/questions.

AFAIUnderstand current solution moves the logic from BroadcastReceiver (UI thread) to a service (working thread). If that's the case, my comment from the previous PR is still valid, isn't it?

I think the way I approached it is thinking we may fix the ANR even though we may not fix the root cause of the problem, as this fix would help avoid the conditions on which the deadlock is likely to happen, which in the end is already a probably good enough improvement.

It might fix the ANR, but I'm not sure that I see how it would help avoid the conditions on which the deadlock is likely to happen.
Moreover I believe right now we get reports that the deadlock is actually happening just because it's happening on the UI thread and therefore it results in ANR. If we hide it and run it on a different thread we'll never get any report again even though the code won't work. So I'm not convinced it would actually be an improvement.

when the user either dismisses it (slide off the screen) or taps on it, the notification is dismissed and this intent gets triggered, and received by the NotificationDismissBroadcastReceiver. What it does internally is it calls removeNotification(), which is locked on the class object itself (GCMMessageService.class) as you've seen.
At the same time the Activity gets launched and the onCreate() method is executed. This method also calls a synchronized method within GCMMessageService.

Is it really at the same time. Since both these methods are executed on one thread (UI), how can they be executed at the same time? I think they are both added into the queue of the UI Thread Handler class and they are executed sequentially. Which one goes first is not defined (which might also be the reason why NotificationDismissBroadcastReceiver.onReceive is in the dump half of the times).

In #1, the user taps on a notification and thus that's why WPMainActivity.onResume is executed (main pendingIntent for the notification is to wake it up and launch the NotificationsDetailActivity), but then, also the broadcast receiver should be triggered (and is not shown in the dump).

I believe it's not shown in the thread dump, because it's not on the stack yet. It's waiting in the queue of the UI Thread Handler, but it never gets it's chance, because the NotificationsDetailActivity is blocked on the lock and it results in ANR.

After reading through all the information, I'm starting to think it's not actually a deadlock situation - at least not in the real sense. We have just one resource GCMMessageService.class. It seems more like someone inside the synchronized block ends up in an infinite loop or something like that.

Can't the the real issue be hidden in modifying collection within a for/foreach loops.
e.g. rebuildAndUpdateNotificationsOnSystemBarForThisNote -> NOTIFICATION_HELPER.rebuildAndUpdateNotificationsOnSystemBar -> where we call remove() and put() on the collection. I believe it's not defined in Java what happens when the app modifies a collection while iterating over it. Wdyt?

@mzorz
Copy link
Contributor Author
mzorz commented May 21, 2018

As discussed elsewhere, putting this one on hold in favor of testing #7783 first. Marked this one with [Status] Not Ready for Review

@mzorz
Copy link
Contributor Author
mzorz commented Jun 11, 2018

With the 10.1 release, I've been checking the ANR reports only to find out our attempts in #7783 didn't seem to make any difference. I did find something interesting though, that I think goes in favor of thinking the ANRs are because of some operation taking longer, rather than having an infinite loop somewhere. While this is a particular case, I think it helps exemplify.

This following ANR shows there's a network operation going on to retrieve the notification's icon, which takes just too long and the other waiting operation (triggered by NotificationDismissedReceiver) sits there waiting for the lock.

==============================================================
The "AsyncTask #2" (tid=29) thread is performing network I/O.
==============================================================

"main" prio=5 tid=1 Blocked
  | group="main" sCount=1 dsCount=0 obj=0x76446550 self=0x7f81ac7a00
  | sysTid=19315 nice=-4 cgrp=default sched=0/0 handle=0x7f862f2a98
  | state=S schedstat=( 1808509990 369277565 967 ) utm=124 stm=56 core=1 HZ=100
  | stack=0x7fc69fc000-0x7fc69fe000 stackSize=8MB
  | held mutexes=
  at org.wordpress.android.push.GCMMessageService.removeNotification (GCMMessageService.java:196)
- waiting to lock <0x0c958dfc> (a java.lang.Class<org.wordpress.android.push.GCMMessageService>) held by thread 29
  at org.wordpress.android.ui.notifications.NotificationDismissBroadcastReceiver.onReceive (NotificationDismissBroadcastReceiver.java:20)




"AsyncTask #2" prio=5 tid=29 Native
  | group="main" sCount=1 dsCount=0 obj=0x32c05940 self=0x7f5ee23600
  | sysTid=19369 nice=0 cgrp=default sched=0/0 handle=0x7f5db2a450
  | state=S schedstat=( 43679069 259833692 178 ) utm=3 stm=1 core=2 HZ=100
  | stack=0x7f5da28000-0x7f5da2a000 stackSize=1037KB
  | held mutexes=
  #00  pc 000000000006abdc  /system/lib64/libc.so (__connect+4)
  #01  pc 000000000000122c  /system/lib64/libnetd_client.so (???)
  #02  pc 000000000001687c  /system/lib64/libopenjdk.so (NET_Connect+96)
  #03  pc 0000000000026d30  /system/lib64/libopenjdk.so (PlainSocketImpl_socketConnect+528)
  #04  pc 00000000000fcd2c  /system/framework/arm64/boot.oat (Java_java_net_PlainSocketImpl_socketConnect__Ljava_net_InetAddress_2II+168)
  at java.net.PlainSocketImpl.socketConnect (Native method)
  at java.net.AbstractPlainSocketImpl.doConnect (AbstractPlainSocketImpl.java:334)
- locked <0x0a7f81df> (a java.net.SocksSocketImpl)
  at java.net.AbstractPlainSocketImpl.connectToAddress (AbstractPlainSocketImpl.java:196)
  at java.net.AbstractPlainSocketImpl.connect (AbstractPlainSocketImpl.java:178)
  at java.net.SocksSocketImpl.connect (SocksSocketImpl.java:356)
  at java.net.Socket.connect (Socket.java:586)
  at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket (PlainSocketFactory.java:124)
  at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection (DefaultClientConnectionOperator.java:193)
  at org.apache.http.impl.conn.AbstractPoolEntry.open (AbstractPoolEntry.java:172)
  at org.apache.http.impl.conn.AbstractPooledConnAdapter.open (AbstractPooledConnAdapter.java:130)
  at org.apache.http.impl.client.DefaultRequestDirector.executeOriginal (DefaultRequestDirector.java:1334)
  at org.apache.http.impl.client.DefaultRequestDirector.execute (DefaultRequestDirector.java:700)
  at org.apache.http.impl.client.AbstractHttpClient.execute (AbstractHttpClient.java:560)
  at org.apache.http.impl.client.AbstractHttpClient.execute (AbstractHttpClient.java:492)
  at org.apache.http.impl.client.AbstractHttpClient.execute (AbstractHttpClient.java:470)
  at org.wordpress.android.util.ImageUtils.downloadBitmap (ImageUtils.java:154)
  at org.wordpress.android.push.GCMMessageService$NotificationHelper.getLargeIconBitmap (GCMMessageService.java:590)
  at org.wordpress.android.push.GCMMessageService$NotificationHelper.buildAndShowNotificationFromNoteData (GCMMessageService.java:425)
  at org.wordpress.android.push.GCMMessageService$NotificationHelper.handleDefaultPush (GCMMessageService.java:332)
  at org.wordpress.android.push.GCMMessageService$NotificationHelper.access$100 (GCMMessageService.java:304)
  at org.wordpress.android.push.GCMMessageService.synchronizedHandleDefaultPush (GCMMessageService.java:112)
- locked <0x0c958dfc> (a java.lang.Class<org.wordpress.android.push.GCMMessageService>)
  at org.wordpress.android.push.GCMMessageService.onMessageReceived (GCMMessageService.java:135)
  at com.google.android.gms.gcm.GcmListenerService.zzo (unavailable)
  at com.google.android.gms.gcm.GcmListenerService.zzn (unavailable)
  at com.google.android.gms.gcm.GcmListenerService.zzm (unavailable)
  at com.google.android.gms.gcm.GcmListenerService.zza (unavailable)
  at com.google.android.gms.gcm.GcmListenerService$1.run (unavailable)
  at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1133)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:607)
  at java.lang.Thread.run (Thread.java:762)

I think we should be right to try merging this PR (which moves handling of the Broadcast Receiver to the ProcessingService) - and try adding some end to end tracking that we can check later. If we find as many non-closing (only starting) track hits instead of closed loops, then it would make sense to investigate further.

Added the tracking here c782ac6402

Ready for another round @malinajirka

@mzorz mzorz added this to the 10.3 milestone Jun 11, 2018
@malinajirka
Copy link
Contributor

Thanks @mzorz !

our attempts in #7783 didn't seem to make any difference

I'm sorry to hear that, however it did worth a try;). Thanks!

and try adding some end to end tracking that we can check later. If we find as many non-closing (only starting) track hits instead of closed loops, then it would make sense to investigate further.

Sounds like a good plan! 👍

LGTM

@malinajirka malinajirka merged commit ec44b1a into develop Jun 12, 2018
@mzorz mzorz deleted the fix/anr-open-notification-detail-take2 branch June 18, 2018 11:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants