Removing client window via script API causes crash (v1.13.0 beta)

Question

Closed this issue 3 years ago · 13 comments

I've been using the prebuilt Mac app today and it has been crashing consistently about 1.5-2 hours after launch. There is nothing in the debug log; the symptom is the rainbow beach ball, with the app locked up until I force quit. If I can help debug, let me know!

Answer 1 · 2021-09-19T15:50:43.000Z

Do you have any more info on the crashes? Is it completely random or reproducible in any way? Is it always in that 1-2h range? I'm guessing it's probably induced by higher scroll levels while hunting, etc.

I've been trying to push longer sessions over the last couple of days (6-8h) but haven't run into any crashes yet. Honestly, I was expecting some issues to arise: in the past I've seen random crashes that took anywhere from 1 day to a month of uptime to reproduce, and all the measures I had put in place to prevent them were reworked by another contributor in the last release.

This is why I didn't want to rush this release, although I guess that didn't help much.

Answer 2 · 2021-09-19T20:42:24.000Z

Yes, it was after about 1-2 hours of active hunting, so a lot of scroll. It happened 3 times in 1 afternoon (after 1-2 hours each time) before I reverted to 1.12. I'll do some more testing with a fresh install and see if I can identify a way to reproduce it. Not sure if it's relevant, but I'm on macOS Big Sur 11.5.2 with an M1 chip, and I don't experience this issue with v1.12.

Answer 3 · 2021-09-24T18:51:00.000Z

Hi, I wasn't able to reproduce the crash in a multi-hour session either. One option is to build the version from before pull request #83.
Another potential problem could be pull request #91.
And finally, it might be a Qt bug on Mac related to the links in QPlainTextEdit, so another potential culprit is #92.
I don't have a Mac build environment right now, so I cannot test.
A side note: maybe introduce a way to record a session and replay it later? Now that the network decoupling is done, it could be possible. I can have a look at it.

Answer 4 · 2021-10-03T19:15:07.000Z

I've been running more tests for about 2 weeks now and I've got nothing. I'm starting to doubt it has anything to do with memory leaks or deadlocks (which would explain completely random crashes), having processed over 50 million lines of data under moderate load (much higher stress levels than are possible in any realistic scenario).

If it's a bug on Mac, I don't know how we're going to find it, considering how many changes were made in that last release.

I don't think there's any need to look into recording sessions, since that's already all in the debug log; it just contains too much sensitive information. What could be relevant, though, is your settings profile, if you could share it (with any private data omitted). Mainly, your highlights could trigger something that I'm missing right now.

Answer 5 · 2021-10-03T19:48:14.000Z

It might actually not be a bad idea to experiment with the prerecorded session after a crash happens. It could be helpful to identify if it's indeed a random crash or it's triggered by some specific event.

I have a mock server that can feed debug logs back to the game client.

https://github.com/matoom/frostbite/blob/master/support/mock.rb

You just need to point the input file (file_path = 'mock.xml') at the recorded debug log, run the script, and then connect to 127.0.0.1 as the auth host instead of eaccess.play.net.
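For anyone who wants the general shape of the replay idea without reading the script, here is a minimal sketch. It is not the contents of mock.rb: the function names, the port, and the optional delay are placeholders I made up, and it does not reproduce whatever handshake the client expects from the auth host. It only shows the loop that streams a recorded log to a connected client.

```ruby
require 'socket'

# Write each recorded line to an IO-like object, optionally pausing
# between lines. Returns the number of lines sent.
# (Hypothetical helper, not taken from mock.rb.)
def replay_lines(io, lines, delay: 0)
  count = 0
  lines.each do |line|
    io.write(line)
    sleep(delay) if delay > 0
    count += 1
  end
  count
end

# Accept a single client and stream the recorded log to it.
# The host and port defaults are placeholders; use whatever the client
# is configured to connect to.
def serve_replay(file_path, host: '127.0.0.1', port: 7900, delay: 0)
  server = TCPServer.new(host, port)
  client = server.accept
  File.open(file_path, 'r') { |f| replay_lines(client, f.each_line, delay: delay) }
ensure
  client&.close
  server&.close
end

# Example call (blocks until a client connects):
#   serve_replay('mock.xml')
```

Keeping the write loop separate from the socket setup makes it possible to exercise the replay against an in-memory IO before pointing a real client at it.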

Answer 6 · 2021-10-04T00:46:26.000Z

@matoom Okay, I ran the app for about 4 hours today and got a crash. I'm replaying the debug log now via the mock server and will let you know if it crashes in the same place! Regarding highlights, I went with a vanilla build today, so no custom highlights at all. The other relevant info may be that I'm running Lich, and I hook into the script API to create custom windows via my own Lich scripts (so I always set the API port to 49166).

Update: running back the full session through the mock server did not cause a crash.

Answer 7 · 2021-10-04T20:40:34.000Z

I've been running 1.13.0 today without my custom windows integration, and so far no crashes.. could be related 🤔

Answer 8 · 2021-10-05T18:32:12.000Z

After 2 days of running my usual routine WITHOUT using the script API, I've had no crashes. I'm ready to say that the bug is somewhere in the script API implementation. Is there a way I can help debug down this route?

Answer 9 · 2021-10-05T19:05:55.000Z

I believe I've isolated the crasher -- in my script API model, I can consistently crash the app when calling to remove a window via the script API like so:

  # Remove stream window
  #
  # @param [String] name unique window id (see #Client::window_list)
  # @return [int] 1 on success; 0 on fail
  def window_remove(name)
    @@fb_api_socket.puts "CLIENT WINDOW_REMOVE?#{ERB::Util.url_encode(name)}\n"
    @@fb_api_socket.gets('\0').chomp('\0').to_i
  end

Answer 10 · 2021-10-06T07:39:40.000Z

Thanks for the insight.
@matoom it looks like there could be a multithreading error in the code that removes a window. Here is the extract:

    WindowWriterThread* writer = streamWriters.value(id);
    writer->wait();
    delete writer;
    streamWriters.remove(id);

To start with, we can move streamWriters.remove(id); directly after the first line, so the writer is taken out of the list of observers before we delete it.
Then writer->wait(); is no longer needed, and in fact it could do harm: at that point the thread has not yet been notified to stop. We also already wait in the destructor of WorkQueueThread; maybe ->stop() should be used instead.
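The ordering argument can be illustrated outside Qt. The sketch below is a Ruby analogy, not Frostbite's code: WriterThread and WindowRegistry are hypothetical names, and a Ruby Queue stands in for the writer's work queue. It shows the two points above: the writer is unregistered first so nobody else can reach it, and the thread is told to stop before anything waits on it.

```ruby
# Hypothetical stand-in for a per-window writer thread.
class WriterThread
  STOP = Object.new # sentinel that tells the loop to exit

  attr_reader :lines

  def initialize
    @queue = Queue.new
    @lines = []
    @thread = Thread.new do
      loop do
        item = @queue.pop
        break if item.equal?(STOP)
        @lines << item # a real writer would update its window here
      end
    end
  end

  def write(line)
    @queue << line
  end

  # Notify the loop first, then wait. Joining without the notification
  # would block forever, because the loop is parked in @queue.pop.
  def stop
    @queue << STOP
    @thread.join
  end

  def alive?
    @thread.alive?
  end
end

# Hypothetical registry playing the role of the id-to-writer map.
class WindowRegistry
  def initialize
    @writers = {}
    @lock = Mutex.new
  end

  def add(id)
    @lock.synchronize { @writers[id] ||= WriterThread.new }
  end

  def write(id, line)
    writer = @lock.synchronize { @writers[id] }
    return 0 unless writer
    writer.write(line)
    1
  end

  # Unregister first so no new writes can reach the writer,
  # then stop it outside the lock.
  def remove(id)
    writer = @lock.synchronize { @writers.delete(id) }
    return 0 unless writer
    writer.stop
    1
  end
end
```

Whether the real fix should call a stop method explicitly or rely on the destructor is a question for the Qt code; the sketch only shows why a bare wait before any stop signal is suspicious.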

Answer 11 · 2021-10-06T08:11:19.000Z

Yes, I've just confirmed your observations. By opening a telnet session to the Frostbite scripting API, one can easily reproduce the crash even without connecting to the DR servers:

➜  ~ telnet localhost 3000
Trying ::1...
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
CLIENT WINDOW_ADD?test&Test
1\0CLIENT WINDOW_WRITE?test&Hello
1\0CLIENT WINDOW_REMOVE?test

This is where the crash happens.
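The same three commands can be scripted, which may be handy for checking a fix repeatedly. This is a rough sketch under a few assumptions: the helper names are made up, replies are assumed to end with the literal two characters \0 (as in the Ruby client snippet quoted earlier in the thread), and URL-encoding every argument is my guess based on how that snippet encodes the window name.

```ruby
require 'socket'
require 'erb'

# Reply terminator: a literal backslash followed by 0 (single quotes),
# matching the gets('\0') call in the client snippet quoted earlier.
TERMINATOR = '\0'

# Build a command line such as "CLIENT WINDOW_ADD?test&Test".
# Encoding every argument is an assumption, not documented behavior.
def api_command(name, *args)
  encoded = args.map { |arg| ERB::Util.url_encode(arg) }.join('&')
  "CLIENT #{name}?#{encoded}"
end

# Send one command and return the reply without its terminator,
# or nil if the connection closed before a reply arrived.
def send_command(io, command)
  io.puts(command)
  reply = io.gets(TERMINATOR)
  reply && reply.chomp(TERMINATOR)
end

# Add a window, write to it, then remove it (the step that crashed).
def reproduce(io)
  [
    api_command('WINDOW_ADD', 'test', 'Test'),
    api_command('WINDOW_WRITE', 'test', 'Hello'),
    api_command('WINDOW_REMOVE', 'test')
  ].map { |command| send_command(io, command) }
end

# Example call, using the port from the telnet session above:
#   TCPSocket.open('localhost', 3000) { |sock| p reproduce(sock) }
```

On an affected build the last call would presumably get no reply or raise a connection error, since the client goes down while handling WINDOW_REMOVE.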

Answer 12 · 2021-10-06T08:16:54.000Z

I've opened a pull request that fixes the issue you found, @hennii.

Answer 13 · 2021-10-15T19:52:00.000Z

That's good news, and the fix seems to check out as far as I can tell. I didn't notice any lingering threads or other issues while testing, so I guess there's no need to call stop.

I'll push a new release containing those 2 fixes soon.