Restarting i3 results in crash

Question

Restarting i3 results in crash

aksr opened this issue 8 years ago · 41 comments

aksr commented 8 years ago

i3 version 4.12-133-ge51a89e

http://sprunge.us/MTHY

What I did:

i3 config: http://sprunge.us/BZhG
run goldendict*
restart (via key-keybinding)

What I saw:
i3 crashes

What I expected instead:
For i3 to be restarted normally.

A few things more...
This doesn't happen with 4.12 (I mean i3 doesn't crash with above setting.).
I'm not sure if the same is true for freezes.

*The above bug can be reproduced using firefox (no need for goldendict).
Just the config file should be properly changed.
For example: for_window [instance="^Navigator$" class="^Firefox$"] mark f, border normal 1

Also, I'm not sure if the freezes are somehow related to this bug.
(I'm not yet able to locate when or why they happen.)
But, to get i3 to be responsive again—I have to send signal SIGCONT to i3.

If you need more info, or, if I'm not specific enough, please do tell.

Answer 1 · 2016-10-21T15:23:13.000Z

I don’t see a link to logs.i3wm.org. Did you follow http://i3wm.org/docs/debugging.html? (In case you actually provided a link to a logfile, please ignore me.)

Answer 2 · 2016-10-21T15:24:03.000Z

Can you bisect this issue between 4.12 and the commit you're using?

Answer 3 · 2016-10-21T15:33:12.000Z

I'm not sure If I have enough time atm. I tried to get some input on IRC before making noise on the github.

Answer 4 · 2016-10-21T15:34:38.000Z

Pulling a gdb backtrace would also help. :-)

If you're not faster I'll try to repro this on the weekend and try to get it fixed. @stapelberg We should wait with 4.13 for this.

Answer 5 · 2016-10-21T15:36:50.000Z

Fair enough, thank you.

Answer 6 · 2016-10-22T03:31:02.000Z

Even the version 4.12 freezes randomly(?) (i3 is made responsive by sending signal SIGCONT to it).
I'm not able to find out when or why it freezes (yet).
There are others with the same issue: https://marc.ttias.be/arch-general/2016-09/msg00067.php
This means, freezes and the crash (described above) are two separate bugs.

I should probably open another bug issue.

Answer 7 · 2016-10-22T08:37:59.000Z

For the freeze a backtrace might be helpful as well because we'd perhaps see where it's stuck. But yeah, it sounds like a separate issue.

Answer 8 · 2016-10-22T08:40:11.000Z

I cannot reproduce the issue, so please provide the backtrace and, ideally, bisect the issue. :-)

Answer 9 · 2016-10-22T08:41:44.000Z

Which issue, the first or the second?

Answer 10 · 2016-10-22T08:42:51.000Z

The crash. :-) For the freeze please open a separate issue as you suggested.

Answer 11 · 2016-10-22T08:51:56.000Z

As soon as I have the time.

Answer 12 · 2016-10-30T20:08:15.000Z

@aksr Any news?

I'll remove the blocking link again since it's not reproducible and nobody else has reported it, so it doesn't appear to be a problem affecting the majority of users.

Answer 13 · 2016-10-31T15:23:57.000Z

@Airblader I don't have enought time atm (still).

Answer 14 · 2016-11-05T14:37:01.000Z

@Airblader: This is by far the strangest bug I encountered so far (and I've been using i3 at least for 5 years now).

I think I located it:

Phase 1
1. http://sprunge.us/daAf
2. run goldendict
3. restart i3
(i3 crashes or exits or something similar)

Take notice of the 'mark g' part in the config file.

Phase 2
1. remove the 'mark g' part from the config
2. run steps 1–3
(problem isn't there anymore)

The problem appears to be related to this commit: e51a89e

Strange:
I had mark to be set on the startup for two programs (goldendict and firefox).
It doesn't matter which program I run, the same problem appears.
I used goldendict because it's more lightweight.

More Strange:
The problem doesn't appear if the last line: bind Mod4+52 floating enable ... of the config file isn't there (or, at least, it appears like that).

I'm hoping we are getting closer to solve this.

Answer 15 · 2016-11-05T22:24:41.000Z

Thanks for the feedback! Since external links die, for reference the content of the link is

bind Mod4+33 exec --no-startup-id /usr/bin/dmenu_run
bind Mod4+Control+27 restart
bind Mod4+52 floating enable, resize grow height 800 px, resize grow width 1200 px, move absolute position center
for_window [instance="^goldendict$" class="^Goldendict$"] mark g, border normal 1

The problem appears to be related to this commit: e51a89e

Why do you think that? Did you bisect it?

I'm hoping we are getting closer to solve this.

A backtrace would tell us what happens right away and still seems like the best way to tackle this in my eyes. I have a pretty good idea what's wrong, but it's all about where exactly it goes wrong.

Answer 16 · 2016-11-06T10:35:54.000Z

Why do you think that? Did you bisect it?

Yes.

Answer 17 · 2016-11-06T17:34:31.000Z

Alright, thanks. Now we just need the backtrace so we know where exactly it faults.

Answer 18 · 2016-11-14T23:32:19.000Z

@aksr Can you get us that backtrace? :-)

Answer 19 · 2016-11-15T05:55:11.000Z

I'm sorry, I'll do this as soon as I can.

Answer 20 · 2016-11-29T05:33:25.000Z

@Airblader Is this good/enough?

http://sprunge.us/JFRY
http://sprunge.us/IeXN

Answer 21 · 2016-11-29T07:07:06.000Z

@aksr Looks good at first glance. I'll look at it in detail later. Thanks!

Answer 22 · 2016-11-29T20:09:10.000Z

So the problem actually is that we are dumping a container with layout L_DEFAULT which fails an assertion in dump_node.

@stapelberg I'd opt for removing that assertion and demoting it to a log message. As long as L_DEFAULT is in the code (and actively used, albeit temporarily) we should expect it to occur. On that note, should we remove the layout default command?

Answer 23 · 2016-11-30T18:05:24.000Z

L_DEFAULT should only ever be used in con->workspace_layout, not con->layout (which is what causes the assertion). I can’t find any assignments to layout, and I think (but haven’t verified) that all containers should get a layout value immediately after being created. Do you know of any places in the code where that isn’t true?

Instead of applying a workaround, I think we should try to understand how i3 can get into this situation, as the issue might be symptomatic of a larger state corruption issue and/or logic bug.

Answer 24 · 2016-11-30T20:08:32.000Z

Yeah, I guess you're right.

@aksr When it happens again, could you provide backtrace and i3 log? We need them both of the same run so we can take the faulty container from the backtrace and trace it through the log file.

Answer 25 · 2016-12-01T05:34:29.000Z

I'm not sure will the i3 log be of any use, anyway here you go:
gdb: http://sprunge.us/UcSA
i3log: http://sprunge.us/ALYQ (this is a complete log)

Answer 26 · 2016-12-01T06:45:24.000Z

That doesn't really look like an i3 log file. How did you get it? You can get a better log if you switch to another VT (Ctrl+Alt+F2) and issue

DISPLAY=:0 i3-dump-log > /tmp/i3.log

But you need to do that while the i3 session is still showing the (likely black) rectangle (crash screen). Also make sure you have debuglog enabled (see debugging docs on the website). Thanks for your patience!

Answer 27 · 2016-12-01T16:48:26.000Z

I don't get any rectangle or anything similar.
Maybe this: http://sprunge.us/MEDL

Answer 28 · 2016-12-01T17:26:13.000Z

Yes, that's better. Thank you. :-) My current guess is that the problem here is that the order of JSON keys is not guaranteed and sometimes we read the mark before the layout, resulting in a container with incorrect layout.

Answer 29 · 2016-12-01T17:33:02.000Z

@stapelberg So this leaves us with a few options:

When restoring a layout, do not call con_mark but set it »manually«. When an invalid layout containing the same mark multiple times is loaded we wouldn't catch this anymore, though.
Add a flag to con_mark to suppress event sending. But then we'd not send the events during a restart. I'm not sure whether this is good or bad (next to whether the solution itself is all that great).
Before calling con_mark we check the layout and set it to something else if it's the default layout. Sounds fairly hack-ish to me.
We weaken the assertion and simply dump the default layout at this point.
We memorize the marks, but only call con_mark once the container has finished parsing.

I think I'd prefer some kind of 2. or 5. All the others share the common problem that the event would contain a half-parsed container structure which sounds bad. I think we should decide whether a restart should send IPC events like this and depending on that answer go for either 5. (if we want events) or 2. (if we don't want events).

Thoughts?

Answer 30 · 2016-12-18T17:48:47.000Z

Option 5 is the way to go, I think. Fundamentally, we should always try restore all serialized attributes before applying state which depends on these attributes.

Answer 31 · 2017-05-15T07:47:18.000Z

Hi,
I'm curious, what's the status of the fix?
This bug is present in 4.13 as well.
Thanks.

Answer 32 · 2017-05-15T09:19:45.000Z

I'm curious, what's the status of the fix?

The issue is still open, that means no one has fixed it yet.

Answer 33 · 2017-05-19T01:42:15.000Z

I am also getting this issue. I guess I just have to keep my finger off of the restart button.

Answer 34 · 2017-05-19T06:27:32.000Z

We should fix this before the next release.

Answer 35 · 2017-05-19T06:52:21.000Z

@f0lie @charonbz @aksr Could you try the commit I pushed which hopefully fixes this? Please beware, it's untested and could also just end up crashing your session; so don't keep anything important open that needs saving.

Answer 36 · 2017-05-19T15:40:25.000Z

I would like to help the project out since I use it so much. But I am not sure how do I run this version on my laptop and go back to my older version.

Answer 37 · 2017-05-20T14:17:53.000Z

@Airblader I have tried your commit. It fixes the problem.

Answer 38 · 2017-05-21T06:39:47.000Z

@Airblader I have added a testcase for this when I was working on this issue. If you would like to use it, you can use it directly or ask me to send the PR.

Answer 39 · 2017-05-22T06:52:30.000Z

@hwangcc23 Thanks for confirming. The problem I have with the testcase is that it doesn't reliably reproduce the issue (which I think is impossible to do). But it's still better than nothing, I assume. Feel free to send a PR for the fix + test. Thanks!

Answer 40 · 2017-09-07T16:16:13.000Z

Is this still present after #2779 and #2901?

Answer 41 · 2017-09-07T19:13:58.000Z

This issue should be closed already, the auto-hook just didn't catch it.