haxenme/nme

relatively rare timer crash. can't get why yet

Closed this issue · 7 comments

nulld commented

Please excuse me - it definetly relates to my pull request. #679
Crash is quite rare - ocured only once in 3 days for now ( it's small rollout on ~1000 active users).
Can anyone help me to figure out how it can be possible (crash and generated code below)

 (haxe::Timer_obj::nmeCheckTimers(double)) (SourceCode: /builds/zLuWkBZg/1/wnw/charm-mobile-client/bin/android/haxe/cpp/./src/haxe/Timer.cpp:143) (see HXLINE( 192) in code bellow)
 (nme::app::IPollClient_obj::onPoll(Dynamic, double)) (SourceCode: /builds/zLuWkBZg/1/wnw/charm-mobile-client/bin/android/haxe/cpp/include/nme/app/IPollClient.h:22)
 (nme::app::Application_obj::pollClients(double)) (SourceCode: /builds/zLuWkBZg/1/wnw/charm-mobile-client/bin/android/haxe/cpp/./src/nme/app/Application.cpp:339)
 (nme::app::Window_obj::nmeProcessWindowEvent(Dynamic)) (SourceCode: /builds/zLuWkBZg/1/wnw/charm-mobile-client/bin/android/haxe/cpp/./src/nme/app/Window.cpp:226)
 (nme::app::__Window_objnmeProcessWindowEvent(hx::Object*, Dynamic const&)) (SourceCode: /builds/zLuWkBZg/1/wnw/charm-mobile-client/bin/android/haxe/cpp/./src/nme/app/Window.cpp:358)
 (val_call1) (SourceCode: /opt/haxelib/haxelib-dev/hxcpp/src/hx/CFFI.cpp:720)
 (external_handler_native(nme::Event&, void*)) (SourceCode: /opt/haxelib/haxelib-dev/nme/project/src/common/ExternalInterface.cpp:1542)
 (nme::Stage::HandleEvent(nme::Event&)) (SourceCode: /opt/haxelib/haxelib-dev/nme/project/src/common/Stage.cpp:297)
 (nme::AndroidStage::OnPoll()) (SourceCode: /opt/haxelib/haxelib-dev/nme/project/src/android/AndroidFrame.cpp:249)
 (Java_org_haxe_nme_NME_onPoll) (SourceCode: /opt/haxelib/haxelib-dev/nme/project/src/android/AndroidFrame.cpp:965)
void Timer_obj::nmeCheckTimers(Float inStamp){
            	HX_GC_STACKFRAME(&_hx_pos_102f69689d37b2bb_186_nmeCheckTimers)
HXDLIN( 186)		if (::hx::IsNotNull( ::haxe::Timer_obj::sRunningTimers )) {
HXLINE( 188)			int i = 0;
HXLINE( 189)			while((i < ::haxe::Timer_obj::sRunningTimers->length)){
HXLINE( 191)				 ::haxe::Timer timer = ::haxe::Timer_obj::sRunningTimers->__get(i).StaticCast<  ::haxe::Timer >();
HXLINE( 192)				if (timer->mRunning) {
HXLINE( 193)					timer->nmeCheck(inStamp);
            				}
HXLINE( 195)				if (!(timer->mRunning)) {
HXLINE( 197)					::haxe::Timer_obj::sRunningTimers->setCtx( HX_CTX, i,::haxe::Timer_obj::sRunningTimers->__get((::haxe::Timer_obj::sRunningTimers->length - 1)).StaticCast<  ::haxe::Timer >());
HXLINE( 199)					_hx_array_set_size_exact(::haxe::Timer_obj::sRunningTimers,(::haxe::Timer_obj::sRunningTimers->length - 1));
            				}
            				else {
HXLINE( 205)					i = (i + 1);
            				}
            			}
            		}
            	}

So I guess timer is null. Which would mean that i>=sRunningTimers.length. Can't really see how.
The other possibility is thread safety. If a different thread pushes to the time list while this loop is running, it might get messed up.

It is also possible that SetSize is not GC-moving safe - have not fully studied it. I should add a test for it. It is also not great to use for resizing here because it does not allow wiggle room in the size and may cause excessive reallocing. Since splice will optimize away the return result if you do not use it, I would recommend just removing the 'setSize' path and always use splice.

For simplicity, you could also not bother moving the last element and just splice(i,1).

nulld commented

The other possibility is thread safety. If a different thread pushes to the time list while this loop is running, it might get messed up.

Yes, it's very likely the cause of my problem, thanks. We sometimes use haxe.Timer.delay to call something in haxe thread.

Since splice will optimize away the return result if you do not use it, I would recommend just removing the 'setSize' path and always use splice.

Thanks for advice, exaclly this aproach was introduced here:
#658
But because of it closed by author and not able to be merged, made another one. Please review it.

It is also possible that SetSize is not GC-moving safe - have not fully studied it. I should add a test for it.

It will be great. Btw, we have rarely crashes in StringMap can be moving gc be cause of it too?

Routine hx::TStringElement<Dynamic>::getHash() at /opt/haxelib/haxelib-dev/hxcpp/src/hx/Hash.h:81
Routine hx::Hash<hx::TStringElement<Dynamic> >::set(String, Dynamic const&) at /opt/haxelib/haxelib-dev/hxcpp/src/hx/Hash.h:565
Routine __string_hash_set(hx::Object*, Dynamic&, String, Dynamic const&, bool) at /opt/haxelib/haxelib-dev/hxcpp/src/hx/Hash.cpp:314
Routine haxe::ds::StringMap_obj::set(String, Dynamic) at /builds/zLuWkBZg/1/wnw/charm-mobile-client/bin/android/haxe/cpp/./src/haxe/ds/StringMap.cpp:73

We have this gc flags set for now:

	<haxedef name="HXCPP_GC_GENERATIONAL"/>
	<haxedef name="HXCPP_GC_MOVING"/>

There are tests for StringMap with Gc: https://github.com/HaxeFoundation/hxcpp/blob/master/test/haxe/TestStringHash.hx
Although, maybe not with Dynamic values - these could be added (possibly using Null<Int>)
It can be tricky to reproduce - but if it can be reproduced then it might not be too hard to fix, and any ideas about what makes it more likely to crash would be good.
StringMap is also not thread safe, so something to think about.

nulld commented

but if it can be reproduced then it might not be too hard to fix, and any ideas about what makes it more likely to crash would be good.

Finally, got it in debugger.
Mn47Tl86cw

Crashed thread stack:

CharmFarmGd.exe!hx::MarkAllocUnchecked(void * inPtr, hx::MarkContext * __inCtx) Line 2000
	at d:\devel\haxelib-dev\hxcpp\src\hx\gc\immix.cpp(2000)
CharmFarmGd.exe!hx::MarkAlloc(void * inPtr, hx::MarkContext * __inCtx) Line 487
	at d:\devel\haxelib-dev\hxcpp\include\hx\gc.h(487)
CharmFarmGd.exe!hx::MarkMember<String>(String & outT, hx::MarkContext * __inCtx) Line 49
	at d:\devel\haxelib-dev\hxcpp\include\hx\gctypeinference.h(49)
CharmFarmGd.exe!hx::StringData::__Mark(hx::MarkContext * __inCtx) Line 2272
	at d:\devel\haxelib-dev\hxcpp\src\string.cpp(2272)
CharmFarmGd.exe!hx::MarkContext::processMarkStack() Line 1877
	at d:\devel\haxelib-dev\hxcpp\src\hx\gc\immix.cpp(1877)
CharmFarmGd.exe!GlobalAllocator::ThreadLoop(int inId) Line 4397
	at d:\devel\haxelib-dev\hxcpp\src\hx\gc\immix.cpp(4397)
CharmFarmGd.exe!GlobalAllocator::SThreadLoop(void * inInfo) Line 4434
	at d:\devel\haxelib-dev\hxcpp\src\hx\gc\immix.cpp(4434)

Some other threads are doing something gc related, too (doesn't them should wait some semaphore or smth atm?)

CharmFarmGd.exe!hx::MarkObjectAlloc(hx::Object * inPtr, hx::MarkContext * __inCtx) Line 496
	at d:\devel\haxelib-dev\hxcpp\include\hx\gc.h(496)
CharmFarmGd.exe!hx::MarkMember<ru::stablex::ui::widgets::Widget_obj>(hx::ObjectPtr<ru::stablex::ui::widgets::Widget_obj> & outT, hx::MarkContext * __inCtx) Line 16
	at d:\devel\haxelib-dev\hxcpp\include\hx\gctypeinference.h(16)
CharmFarmGd.exe!com::nevosoft::waw::view::gui::DailyQuest_obj::__Mark(hx::MarkContext * __inCtx) Line 5766
	at d:\devel\charm-mobile-client\bin_test\windows\haxe\cpp\src\com\nevosoft\waw\view\gui\dailyquest.cpp(5766)
CharmFarmGd.exe!hx::MarkContext::processMarkStack() Line 1877
	at d:\devel\haxelib-dev\hxcpp\src\hx\gc\immix.cpp(1877)
CharmFarmGd.exe!GlobalAllocator::ThreadLoop(int inId) Line 4397
	at d:\devel\haxelib-dev\hxcpp\src\hx\gc\immix.cpp(4397)
CharmFarmGd.exe!GlobalAllocator::SThreadLoop(void * inInfo) Line 4434
	at d:\devel\haxelib-dev\hxcpp\src\hx\gc\immix.cpp(4434)

Like this one:

CharmFarmGd.exe!HxSemaphore::Wait() Line 275
	at d:\devel\haxelib-dev\hxcpp\include\hx\thread.h(275)
CharmFarmGd.exe!LocalAllocator::PauseForCollect() Line 5931
	at d:\devel\haxelib-dev\hxcpp\src\hx\gc\immix.cpp(5931)
CharmFarmGd.exe!LocalAllocator::CallAlloc(int inSize, unsigned int inObjectFlags) Line 6058
	at d:\devel\haxelib-dev\hxcpp\src\hx\gc\immix.cpp(6058)
CharmFarmGd.exe!hx::ImmixAllocator::alloc(hx::ImmixAllocator * alloc, unsigned int inSize, bool inContainer, const char * inName) Line 372
	at d:\devel\haxelib-dev\hxcpp\include\hx\gc.h(372)
CharmFarmGd.exe!hx::Object::operator new(unsigned int inSize, bool inContainer, const char * inName) Line 170
	at d:\devel\haxelib-dev\hxcpp\include\hx\object.h(170)
CharmFarmGd.exe!_hx_std_socket_new(bool udp, bool ipv6) Line 180
	at d:\devel\haxelib-dev\hxcpp\src\hx\libs\std\socket.cpp(180)
CharmFarmGd.exe!sys::net::Socket_obj::init() Line 76
	at d:\devel\charm-mobile-client\bin_test\windows\haxe\cpp\src\sys\net\socket.cpp(76)
CharmFarmGd.exe!sys::net::Socket_obj::__construct() Line 54
	at d:\devel\charm-mobile-client\bin_test\windows\haxe\cpp\src\sys\net\socket.cpp(54)
CharmFarmGd.exe!sys::net::Socket_obj::__alloc(hx::ImmixAllocator * _hx_ctx) Line 227
	at d:\devel\charm-mobile-client\bin_test\windows\haxe\cpp\src\sys\net\socket.cpp(227)
CharmFarmGd.exe!sys::Http_obj::customRequest(bool post, hx::ObjectPtr<haxe::io::Output_obj> api, hx::ObjectPtr<sys::net::Socket_obj> sock, String method) Line 165
	at d:\devel\charm-mobile-client\bin_test\windows\haxe\cpp\src\sys\http.cpp(165)

@hughsando, I've save minidump and can send to you if it can help.
Btw, can U clear to me - is it safe to alocate something( strings, for example ) in one thread, and then use it in another? Don't get - how GC track this things..

So the thread usage looks fine. This looks like your socket thread has called "new" which has triggered a collection. The collection pauses the haxe threads (who will enter a "wait" state) while the collection happens, There are 2 (or more) threads calling "MarkMember". This is ok, since these threads are the GC threads started in SThreadLoop and system is designed to work "lock free" as much as possible while marking and will go dormant once the haxe threads are restarted.
It is generally ok to allocate in one thread and use in another - as long as the ownership is clear. String setting is not atomic since there are pointer and length components , so it is possible one thread may see the pointer before the length is set and vice-versa if you are randomly reading a member or static variable.
The crash here is happening where a StringData (the Dynamic version of a String) tries to mark its member, but the pointer is messed up. This is either a GC marking failure - probably missing the owner of the StringData somehow, or a some unsafe thread setting of an object (Map or Array).
The mini-dump might not show that much since I really need to go back in time a bit to see where it initially goes wrong.
The two things I would look at are:

  • make sure you are not doing too much stuff in background threads (socket handler?). One thing I find very useful here is to use nme.app.Application.runOnMainThread to pass a local result to the global state.
  • Investigate GC failures. Here, try the haxedef: HXCPP_GC_DEBUG_LEVEL=1 or uncomment the define in Immix.cpp
    This will cause the marking to happen in single-threaded mode, and the crash stack will show the total ownership chain of the offending object, which will give a good clue about which component is messed-up

You can also try running with the haxedefs HXCPP_CHECK_POINTER and HXCPP_GC_CHECK_POINTER. This will add runtime checks that may pickup the problem early.

You can also try uncommenting define HX_GC_VERIFY in Immix.cpp. This may pinpoint a GC failure as soon as it happens, but may also mask the bug so it is a bit of a double-edged sword. But if you get a crash stack with this on, then it would give good information.

So, if you are hunting this bug, I would first try HXCPP_GC_CHECK_POINTER and HXCPP_GC_DEBUG_LEVEL=1 and see if you can get a callstack. If this does not work, you can uncomment HX_GC_VERIFY.

nulld commented

So, if you are hunting this bug, I would first try HXCPP_GC_CHECK_POINTER and HXCPP_GC_DEBUG_LEVEL=1

Awesome advice, thank U very much! Seems I found the promlematic code:
I had the map, where I store loader.url => loader:

var loadingLoaders:Map<String, Loader>;

Which I count from main haxe thread like this: Lambda.count(loadingLoaders)
But at the same moment Loader was doing his loading job in worker thread.

Now I've done like this var loadingLoaders:Map<String, Int>; (loader.url => loader.id) and can't reproduce crash anymore.
But one moment stays unclear to me - can I store loader.url as a map key in main thread and safely use it (I meen loader.url) in the worker thread (for e.g. doing haxe.Http.request(loader.url))? Won't it lead to some kind of race condition?

You can use a string as soon as it is created because they are immutable - but you still need to take care with maps. If you think of them like arrays where when they change size (add or remove item), internally they may create a copy of themselves and switch their state over to the copy. If the "set" happens after the copy is made, but before the state gets set, things will break.
There are 2 ways to always get this right: 1 use a mutex to guard and accesses to the maps, or 2 only ever read or write the map from one thread.
Say the main thread creates a loader for a given resource URI. The main thread is the only one that accesses "loadingLoaders", so it creates a Loader object, sets it in the map and then calls "loader.startThread".
The loader thread does its thing and then finishes and want to remove itself from the map. You could combine this with the delivery of the result like:

function startThread() {
  Thread.create( () -> {
    var http = new Http(uri);
     ...
     var result = http.data.
     Application.runOnMainThread( () -> {
        // Absolutely fine to do this here...
        loaderLoder.remove(uri);
       // and this too...
       loder.onLoad(result);
     });
  });
}