gstreamer-java/gst1-java-core

Possible crash on async explicit dispose in 1.4.0

ewaza opened this issue · 3 comments

ewaza commented

Hi all,

I've run into an issue on a production system after upgrading from version 1.3.0 to 1.4.0. I haven't been able to pinpoint the exact cause, because there seem to be a lot of changes in the reference checking code.

I do have an example that reproduces the scenario: gst1-java-core-crash-issue-example.zip. When running the JUnit test, it will crash sooner or later with:

GLib-GObject:ERROR:../../../gobject/gobject.c:3208:toggle_refs_notify: assertion failed: (tstack.n_toggle_refs == 1)

I've tested on multiple systems with the same result. When using version 1.3.0, the test runs to the end without any problems.

Thanks for the report. Firstly, the provided example is a little broken - the ordering of the two tasks is not guaranteed.

However, there is a potential problem. If you move the explicit disposal into Gst::invokeLater then it might be OK - it seems to be OK here. We do strongly recommend using the Gst executor for all interaction with the library - massive numbers of threads can definitely be an issue, although this particular issue might appear with just one. I think it's caused by explicit disposal while the native object is still active on another thread. It's possible we might be able to provide a bit more safety here.
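Something along these lines, for illustration - a rough sketch only, where pipeline stands for whatever reference your test holds:

import org.freedesktop.gstreamer.Gst;

// Perform the explicit disposal on the Gst executor thread rather than
// from one of your own worker threads.
Gst.invokeLater(() -> {
    pipeline.stop();     // take the pipeline to State.NULL first
    pipeline.dispose();  // explicit disposal, serialized with other Gst work
});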

The fixes between 1.3.0 and 1.4.0 are probably the trigger but not the cause - the old implementation was much more dependent on the garbage collector to clear up native references behind the scenes, particularly with messages.

Do you still need explicit disposal? Do you actually need to create so many pipelines in your production system?

ewaza commented

Thank you very much for your extensive response and sorry for my late reply.

You're absolutely right about my example being a bit broken. The discardPipeline.drainPermits() call should happen outside of the tasks and before scheduling them. The system I tried to mimic actually does this correctly. Also, fixing my example did not change the behavior.

However, as you mentioned, either dropping explicit disposal or performing it via Gst::invokeLater seems to prevent the crashes, so I will definitely see whether this can work reliably on our system.

Regarding the number of pipelines: our production system has parallel real-time requirements that make a number of concurrently active pipelines unavoidable. However, the number is far lower than in my example, which is why we experienced the issue only about once a week.

Instead of doing a create/dispose cycle for each session, I could use a pool (or a couple of pools) of pipelines that are recycled. Is this something you recommend for more stability? In the long run we should probably split the work over multiple OS processes and do some recycling there.

Another potential workaround - at least your test passes with it here - is to use

pipeline.getBus().setSyncHandler(msg -> BusSyncReply.PASS);

That causes the Message objects to be created eagerly on the streaming thread again - basically the 1.3.0 behaviour. This causes a lot of messages to hold references until the GC runs, though, which is why the changes were made. Allowing the GC to clean up the Pipeline instead, if that works for you, is still probably better.
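In context that looks something like the following - a sketch, with your usual async listeners attached as normal:

import org.freedesktop.gstreamer.Bus;
import org.freedesktop.gstreamer.BusSyncReply;

// The pass-through sync handler forces eager creation of the Message
// wrappers on the streaming thread, approximating the 1.3.0 behaviour.
Bus bus = pipeline.getBus();
bus.setSyncHandler(msg -> BusSyncReply.PASS);
bus.connect((Bus.ERROR) (source, code, message) ->
        System.err.println("Error " + code + " from " + source + " : " + message));
bus.connect((Bus.EOS) source -> System.out.println("EOS from " + source));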

Regarding the number of pipelines: our production system has parallel real-time requirements that make a number of concurrently active pipelines unavoidable. ... Instead of doing a create/dispose cycle for each session, I could use a pool (or a couple of pools) of pipelines that are recycled. Is this something you recommend for more stability?

Multiple concurrent pipelines shouldn't be a problem. I tried a version of your test that uses the Gst executor to configure, start and dispose all pipelines, and that seemed OK. Remember that all the pipelines have their own threads anyway, so you don't need to parallelize the setup and control steps. I doubt pooling is worth the hassle, but it really depends on the system.
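Reduced to a sketch, my version does roughly the following for each pipeline - the class and launch string are illustrative, not taken from your example, and Gst.init(...) is assumed to have already run:

import org.freedesktop.gstreamer.Gst;
import org.freedesktop.gstreamer.Pipeline;

class SessionLifecycle {

    private Pipeline pipeline;

    void start() {
        // Configure and start on the Gst executor; the pipeline's own
        // streaming threads still do the media work in parallel.
        Gst.invokeLater(() -> {
            pipeline = (Pipeline) Gst.parseLaunch("videotestsrc ! fakesink");
            pipeline.play();
        });
    }

    void shutdown() {
        // Tear down on the same executor thread, serialized with setup
        // and with all other interaction with the library.
        Gst.invokeLater(() -> {
            pipeline.stop();
            pipeline.dispose();
        });
    }
}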

Incidentally, I can offer support via https://www.codelerity.com for looking at the specifics of client systems, if that's of interest in future. Without going in depth, help here is always going to be somewhat generic.

I'll keep this issue open for now - trying to think whether there's a way we can stop the crash from happening while keeping lazy creation of the native wrappers for the message and its source element.