Yelp/dumb-init

tests failed for armv7hl and aarch64

muayyad-alsadi opened this issue · 18 comments

tests/child_processes_test.py::test_all_processes_receive_term_on_exit_if_setsid[1] FAILED [  2%]
tests/child_processes_test.py::test_all_processes_receive_term_on_exit_if_setsid[0] FAILED [  3%]
....
        # read a line from print_signals, figure out its pid
        line = proc.stdout.readline()
        match = re.match(b'ready \(pid: ([0-9]+)\)\n', line)
>       assert match, line
E       AssertionError: b''
E       assert None
tests/child_processes_test.py:95: AssertionError

guessing: these architectures are often emulated which makes them more sensitive to the race that this test tries to avoid via sleep .1 -- could you try making that sleep larger (or simply retry) and seeing if it fixes the issue?

I have retried, as you can see I've built it on koji and on copr and it failed on both.

how can I adjust the sleep?

there's a sleep .1 in those tests, make it a bigger number?

I'll try that.

these architectures are often emulated

In Fedora, we build on native hardware.

Is there any way for us to test in that build environment? It's always really hard to try to debug these issues since they don't reproduce anywhere we can access. Even being able to tweak the code a bit (e.g. to add debugging output) and submit a test run or something would be helpful.

It could be the case that the systems are too slow or overloaded that the tests fail, but it's kind of surprising that they fail with such consistency if that's the case.

In Fedora, we build on native hardware.

Correction, all the links that start with https://koji are from a native environment.
The links that start with https://copr are from an emulated environment.

Is there any way for us to test in that build environment?

Yes, there is. All Fedora packagers (that includes both Muayyad and me) have access to the build system and can submit builds of the Fedora dumb-init package with customized sources - by adding patches or generating a custom tarball. Upstream maintainers don't have that access (unless they are also Fedora packages), but we can coordinate that (e.g. by an open Pull Request here with the debugging changes).

We also have access to Test Machine Resources For Package Maintainers - that includes some real HW with Fedora on armv7hl.

You can find me as mhroncok on freenode IRC #fedora-python channel, uually being online during Europe/Prague afternoons/evenings for a synchronized communication (I'm done for today though).

BTW I don't know anything about the issue here, got diverged here from a Fedora procedural thing.

Can try to reproduce it on the arm test machine tomorrow.

I've nerd slipped a bit. This problem does not show if I run tox on current master on a Fedora 30 armv7hl. Neither with Python 3.7 nor 3.8.

#155 also looks like the same traceback on a similar architecture but from Debian, though it looks like currently the build is working on all Debian architectures (including arm64 which is the same as aarch64 in Fedora). Here's the patch Debian applied: https://sources.debian.org/src/dumb-init/1.2.2-1.1/debian/patches/increase-test-sleep-time.patch/

We also applied the 1 second patch in 0708f27, but it looks like Fedora still has 0.1 seconds in its tests.

Is there any way for us to test in that build environment?

yes, please create a FAS account.

building is as simple as clicking "pull request" on github

pagure is an opensource github alternative

as you can see, we host a SPEC file (and maybe patches)

https://src.fedoraproject.org/rpms/dumb-init

Fedora does not encourage down-stream patches, if the patch is good for fedora and debian, then it's going to be good for everybody like OpenSuSE.

We also applied the 1 second

if you like I can make it an environment variable that defaults to 0.1

'{python} -m testing.print_signals & sleep {sleep}'.format(
                python=sys.executable, sleep=os.environ.get('TEST_SLEEP', '0.1')
            ),

this PR will help us adjust/tune sleep without patching

#196

I don't think the PR is necessary, we already changed it to 1 second, fedora is using an old copy of that test?

  1. I prefer to ship latest stable released version, instead of random commits.
  2. dynamic is better than hard-coded
  3. why waste 1 second per test, if most platforms would work with 0.1, I can make the SPEC file pass proper value based on architecture
  4. what if MIPS or whatever added that needs 5 seconds ..etc.
  1. you're not ok taking an already committed patch but instead you're ok taking another uncommitted patch?
  2. I disagree, complexity in tests is 👎
  3. it's the test suite, not a performance critical path
  4. if python can't start in 1 second something is probably very very very wrong

you're not ok taking an already committed patch but instead you're ok taking another uncommitted patch?

I'm ok shipping a merged upstream patch, it's just a suggestion.

I disagree, complexity in tests is -1

reading env with sane default is part of 12 factor, and it's an old unix practice since 1970
it's a well-established way for such use case.

it's the test suite, not a performance critical path

dumb-init have 178 test, so it's 178 seconds instead of 17 seconds, dumb-init develops when doing hotfixes will hate you!

if python can't start in 1 second something is probably very very very wrong

why wait 1 second if 0.1 works.

I'm not sure why you're getting defensive about this and I'm not comfortable continuing this conversation :( this is a question of 2 tests not the entire suite.

Mention me directly if you need help with the environments ;)

it's now fixed after applying the mentioned patch

https://src.fedoraproject.org/rpms/dumb-init/tree/master
https://koji.fedoraproject.org/koji/buildinfo?buildID=1414770

@asottile I'm not familiar with what each test do or how many times it's called. As I said before I'm ok shipping an upstream patch. the conversation that just happened should happen so that people benefit from it being documented in the ticket.