i3/i3

Restarting i3 results in crash

aksr opened this issue · 41 comments

aksr commented

i3 version 4.12-133-ge51a89e

http://sprunge.us/MTHY

What I did:

  1. i3 config: http://sprunge.us/BZhG
  2. run goldendict*
  3. restart (via key-keybinding)

What I saw:
i3 crashes

What I expected instead:
For i3 to be restarted normally.

A few things more...
This doesn't happen with 4.12 (I mean i3 doesn't crash with above setting.).
I'm not sure if the same is true for freezes.

*The above bug can be reproduced using firefox (no need for goldendict).
Just the config file should be properly changed.
For example: for_window [instance="^Navigator$" class="^Firefox$"] mark f, border normal 1

Also, I'm not sure if the freezes are somehow related to this bug.
(I'm not yet able to locate when or why they happen.)
But, to get i3 to be responsive again—I have to send signal SIGCONT to i3.

If you need more info, or, if I'm not specific enough, please do tell.

i3bot commented

I don’t see a link to logs.i3wm.org. Did you follow http://i3wm.org/docs/debugging.html? (In case you actually provided a link to a logfile, please ignore me.)

Can you bisect this issue between 4.12 and the commit you're using?

aksr commented

I'm not sure If I have enough time atm. I tried to get some input on IRC before making noise on the github.

Pulling a gdb backtrace would also help. :-)

If you're not faster I'll try to repro this on the weekend and try to get it fixed. @stapelberg We should wait with 4.13 for this.

aksr commented

Fair enough, thank you.

aksr commented

Even the version 4.12 freezes randomly(?) (i3 is made responsive by sending signal SIGCONT to it).
I'm not able to find out when or why it freezes (yet).
There are others with the same issue: https://marc.ttias.be/arch-general/2016-09/msg00067.php
This means, freezes and the crash (described above) are two separate bugs.

I should probably open another bug issue.

For the freeze a backtrace might be helpful as well because we'd perhaps see where it's stuck. But yeah, it sounds like a separate issue.

I cannot reproduce the issue, so please provide the backtrace and, ideally, bisect the issue. :-)

aksr commented

Which issue, the first or the second?

The crash. :-) For the freeze please open a separate issue as you suggested.

aksr commented

As soon as I have the time.

@aksr Any news?

I'll remove the blocking link again since it's not reproducible and nobody else has reported it, so it doesn't appear to be a problem affecting the majority of users.

aksr commented

@Airblader I don't have enought time atm (still).

aksr commented

@Airblader: This is by far the strangest bug I encountered so far (and I've been using i3 at least for 5 years now).

I think I located it:

Phase 1
1. http://sprunge.us/daAf
2. run goldendict
3. restart i3
(i3 crashes or exits or something similar)

Take notice of the 'mark g' part in the config file.

Phase 2
1. remove the 'mark g' part from the config
2. run steps 1–3
(problem isn't there anymore)

The problem appears to be related to this commit: e51a89e

Strange:
I had mark to be set on the startup for two programs (goldendict and firefox).
It doesn't matter which program I run, the same problem appears.
I used goldendict because it's more lightweight.

More Strange:
The problem doesn't appear if the last line: bind Mod4+52 floating enable ... of the config file isn't there (or, at least, it appears like that).

I'm hoping we are getting closer to solve this.

Thanks for the feedback! Since external links die, for reference the content of the link is

bind Mod4+33 exec --no-startup-id /usr/bin/dmenu_run
bind Mod4+Control+27 restart
bind Mod4+52 floating enable, resize grow height 800 px, resize grow width 1200 px, move absolute position center
for_window [instance="^goldendict$" class="^Goldendict$"] mark g, border normal 1

The problem appears to be related to this commit: e51a89e

Why do you think that? Did you bisect it?

I'm hoping we are getting closer to solve this.

A backtrace would tell us what happens right away and still seems like the best way to tackle this in my eyes. I have a pretty good idea what's wrong, but it's all about where exactly it goes wrong.

aksr commented

Why do you think that? Did you bisect it?

Yes.

Alright, thanks. Now we just need the backtrace so we know where exactly it faults.

@aksr Can you get us that backtrace? :-)

aksr commented

I'm sorry, I'll do this as soon as I can.

@aksr Looks good at first glance. I'll look at it in detail later. Thanks!

So the problem actually is that we are dumping a container with layout L_DEFAULT which fails an assertion in dump_node.

@stapelberg I'd opt for removing that assertion and demoting it to a log message. As long as L_DEFAULT is in the code (and actively used, albeit temporarily) we should expect it to occur. On that note, should we remove the layout default command?

L_DEFAULT should only ever be used in con->workspace_layout, not con->layout (which is what causes the assertion). I can’t find any assignments to layout, and I think (but haven’t verified) that all containers should get a layout value immediately after being created. Do you know of any places in the code where that isn’t true?

Instead of applying a workaround, I think we should try to understand how i3 can get into this situation, as the issue might be symptomatic of a larger state corruption issue and/or logic bug.

Yeah, I guess you're right.

@aksr When it happens again, could you provide backtrace and i3 log? We need them both of the same run so we can take the faulty container from the backtrace and trace it through the log file.

aksr commented

I'm not sure will the i3 log be of any use, anyway here you go:
gdb: http://sprunge.us/UcSA
i3log: http://sprunge.us/ALYQ (this is a complete log)

That doesn't really look like an i3 log file. How did you get it? You can get a better log if you switch to another VT (Ctrl+Alt+F2) and issue

DISPLAY=:0 i3-dump-log > /tmp/i3.log

But you need to do that while the i3 session is still showing the (likely black) rectangle (crash screen). Also make sure you have debuglog enabled (see debugging docs on the website). Thanks for your patience!

aksr commented

I don't get any rectangle or anything similar.
Maybe this: http://sprunge.us/MEDL

Yes, that's better. Thank you. :-) My current guess is that the problem here is that the order of JSON keys is not guaranteed and sometimes we read the mark before the layout, resulting in a container with incorrect layout.

@stapelberg So this leaves us with a few options:

  1. When restoring a layout, do not call con_mark but set it »manually«. When an invalid layout containing the same mark multiple times is loaded we wouldn't catch this anymore, though.
  2. Add a flag to con_mark to suppress event sending. But then we'd not send the events during a restart. I'm not sure whether this is good or bad (next to whether the solution itself is all that great).
  3. Before calling con_mark we check the layout and set it to something else if it's the default layout. Sounds fairly hack-ish to me.
  4. We weaken the assertion and simply dump the default layout at this point.
  5. We memorize the marks, but only call con_mark once the container has finished parsing.

I think I'd prefer some kind of 2. or 5. All the others share the common problem that the event would contain a half-parsed container structure which sounds bad. I think we should decide whether a restart should send IPC events like this and depending on that answer go for either 5. (if we want events) or 2. (if we don't want events).

Thoughts?

Option 5 is the way to go, I think. Fundamentally, we should always try restore all serialized attributes before applying state which depends on these attributes.

Hi,
I'm curious, what's the status of the fix?
This bug is present in 4.13 as well.
Thanks.

I'm curious, what's the status of the fix?

The issue is still open, that means no one has fixed it yet.

f0lie commented

I am also getting this issue. I guess I just have to keep my finger off of the restart button.

We should fix this before the next release.

@f0lie @charonbz @aksr Could you try the commit I pushed which hopefully fixes this? Please beware, it's untested and could also just end up crashing your session; so don't keep anything important open that needs saving.

f0lie commented

I would like to help the project out since I use it so much. But I am not sure how do I run this version on my laptop and go back to my older version.

@Airblader I have tried your commit. It fixes the problem.

@Airblader I have added a testcase for this when I was working on this issue. If you would like to use it, you can use it directly or ask me to send the PR.

@hwangcc23 Thanks for confirming. The problem I have with the testcase is that it doesn't reliably reproduce the issue (which I think is impossible to do). But it's still better than nothing, I assume. Feel free to send a PR for the fix + test. Thanks!

Is this still present after #2779 and #2901?

This issue should be closed already, the auto-hook just didn't catch it.