ZivaVatra/flac2all

Errors on invoking latest version

graysky2 opened this issue · 23 comments

% flac2all vorbis,mp3,test --vorbis-options='quality=2' -o ./fromFlac ./files
Traceback (most recent call last):
  File "/usr/bin/flac2all", line 33, in <module>
    sys.exit(load_entry_point('flac2all==5.5', 'console_scripts', 'flac2all')())
  File "/usr/lib/python3.9/site-packages/flac2all_pkg/__init__.py", line 439, in main
    clustered_encode(localworkers=True)
  File "/usr/lib/python3.9/site-packages/flac2all_pkg/__init__.py", line 101, in clustered_encode
    tsock.bind("tcp://*:2019")
  File "/usr/lib/python3.9/site-packages/zmq/sugar/socket.py", line 208, in bind
    super().bind(addr)
  File "zmq/backend/cython/socket.pyx", line 540, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 28, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

Or just:

% flac2all f:libfdk_aac .  
Traceback (most recent call last):
  File "/usr/bin/flac2all", line 33, in <module>
    sys.exit(load_entry_point('flac2all==5.5', 'console_scripts', 'flac2all')())
  File "/usr/lib/python3.9/site-packages/flac2all_pkg/__init__.py", line 439, in main
    clustered_encode(localworkers=True)
  File "/usr/lib/python3.9/site-packages/flac2all_pkg/__init__.py", line 101, in clustered_encode
    tsock.bind("tcp://*:2019")
  File "/usr/lib/python3.9/site-packages/zmq/sugar/socket.py", line 208, in bind
    super().bind(addr)
  File "zmq/backend/cython/socket.pyx", line 540, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 28, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

Hey, sorry, the last year or so has been difficult for me, so I haven't had time to attend to my hobbies.

I will try to replicate it. What exactly did you do to trigger the error?

I have seen that error happen when you have another instance of flac2all running (because they both try to bind to the same address). Does it happen on a clean run with no other flac2all instances running?

Sorry to hear about your hardships... hope things are looking up for you and your family. Just try converting some flac files like this: flac2all f:libfdk_aac . My ffmpeg is compiled with libfdk-aac.

Hi Graysky,

So far I have been unable to replicate this on the current master branch. Is this an issue on the last official release, or with the latest master branch?

Thanks!

@ZivaVatra - I just built from master and it worked, but it keeps running even after the conversion has completed.

% flac2all f:libfdk_aac .

And

% ps aux | grep flac2all

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
facade  2777781  0.7  0.0 110240 23420 pts/0    Sl+  07:29   0:00 /usr/bin/python /usr/bin/flac2all f:libfdk_aac .
facade  2777785  0.0  0.0 110264 15060 pts/0    S+   07:29   0:00 /usr/bin/python /usr/bin/flac2all f:libfdk_aac .
facade  2777858  0.4  0.0 110240 15404 pts/0    Sl+  07:29   0:00 /usr/bin/python /usr/bin/flac2all f:libfdk_aac .
facade  2777861  0.0  0.0      0     0 pts/0    Z+   07:29   0:00 [flac2all] <defunct>
facade  2777864  0.0  0.0      0     0 pts/0    Z+   07:29   0:00 [flac2all] <defunct>
facade  2777867  0.4  0.0 110240 15408 pts/0    Sl+  07:29   0:00 /usr/bin/python /usr/bin/flac2all f:libfdk_aac .
facade  2777870  0.0  0.0      0     0 pts/0    Z+   07:29   0:00 [flac2all] <defunct>
facade  2777874  0.0  0.0      0     0 pts/0    Z+   07:29   0:00 [flac2all] <defunct>
facade  2777878  0.0  0.0      0     0 pts/0    Z+   07:29   0:00 [flac2all] <defunct>

Hi Graysky,

Looks like it is an issue with the multi-process handling. I've been able to replicate it, but so far have not been able to find a solution. The process join() calls should reap all the child processes (and wait for them before the parent terminates), but for some reason they don't.

I've seen this with other multi-process programs I've written in Python, so I'm not sure if it is a bug in Python or something about how I write multi-process code.

Still investigating (when I find time, which is in very short supply atm unfortunately), but even if it does not terminate the children, at least it completes the conversion without failure.

I've just found this software (thanks!) and hit this issue; came here and discovered you'd posted this less than 1 hour ago, by coincidence. It's not happening for me on every execution, which suggests it may be a race condition closing the final worker. I also wanted to add for others arriving here that you can just killall flac2all to kill the process and free up the ports that zmq is bound to, allowing you to run the program again, in case that isn't obvious.

(edit: python 3.10.4 on Arch, fwiw)

You're welcome, and welcome! :-) That is quite a coincidence indeed!

Do you run in clustered mode with remote workers? Or just locally on a single machine?

So, if it were a race condition on the final worker, I would expect only one worker to be left behind, but as graysky has shown (and my testing has replicated), multiple child processes end up as orphans.

Orphan processes are processes whose parent has died; they get adopted by init and just keep running until they complete. As they keep running after the parent, the children don't release their ZMQ sockets, so if you try to run flac2all again you get the "Address already in use" error. The "quick fix" is to kill the children (which releases the sockets) and start flac2all again.

The thing is, this should never happen. Python's "process.join()" method is supposed to attach to each of the processes and wait until it terminates. If a race condition (or any other issue) results in a child process not terminating, then the worker logic should force-kill all the misbehaving child processes after a 30-second timeout, after which the parent process should terminate.

However, we are not seeing this. Instead, the parent process terminates without waiting for the children (and the children ignore the "force kill after 30-second timeout" logic), which should be impossible.
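In code terms, the pattern I expect to hold is roughly this (a minimal sketch only, with placeholder names; it is not the actual flac2all shutdown code):

import multiprocessing as mp
import time

def reap_workers(workers, timeout=30):
    # Wait up to `timeout` seconds for each worker to exit on its own.
    for p in workers:
        p.join(timeout)
    # Anything still alive after the timeout gets force-killed.
    for p in workers:
        if p.is_alive():
            p.terminate()        # SIGTERM first
            p.join(5)
        if p.is_alive():
            p.kill()             # SIGKILL as a last resort (Python 3.7+)
            p.join()

if __name__ == "__main__":
    procs = [mp.Process(target=time.sleep, args=(1,)) for _ in range(4)]
    for p in procs:
        p.start()
    reap_workers(procs)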

The only things that come to my mind as potential causes are:

  1. There is a bug in the Python multiprocessing join() function (unlikely, considering how often it is used in code)
  2. For whatever reason, the children are ignoring the SIGKILL signal (this can happen if they are stuck in kernel space, e.g. blocked on a syscall). If that were the issue, though, a simple "kill" command would not work either; you would most likely have to reboot the machine to clear out the hung processes. But as noted, a simple kill does clear out the children from the last run.

It is a bit of a head scratcher really, and as you say, it does not happen every time.

@ZivaVatra Thanks for this project! I encountered the issue where it hangs at the end as well. I took a quick look at your main, and I wonder if it's a good idea to do sys.exit().

You're welcome @rogerdahl, glad you like it! Why do you think it is a bad idea to do sys.exit()? That is generally the "clean" way to end a program. Which line are you referring to?

@ZivaVatra I'm only guessing based on the issues with the program. But in the multiprocessing module, there's some careful orchestration of main and worker processes happening behind the scenes. I think that, if you force a worker to exit with sys.exit(), it's not going to report back to the main process. So the problem would be with the sys.exit() in worker_process().
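To illustrate the suggestion (just a sketch; worker_process() here is a stand-in for the real function in flac2all):

def worker_process(tasks):
    for task in tasks:
        pass  # ... run the encode job for this task ...
    # Suggested change: return from the worker target instead of calling
    # sys.exit(0), and let the child process end normally.
    return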

Ok, I had an idea for what might be the root cause, and I have pushed a change to the branch linked to this issue. Initial unit tests seem to pass, so feel free to have a go on it and see if the problem persists.

I will probably run more intensive tests over the next few days to see if I can trigger the issue again myself.

@rogerdahl Thanks for the info. I am not sure if that would be the issue, simply because I don't use any built-in orchestration, but rather handle it myself. The multiprocessing module just forks the children; the rest is pure Unix child management.

This evening I had an idea, which I have pushed out as a change on the branch. It boils down to how the operating system syscall "wait" works. I found out while debugging a multi-process Perl program recently that calling wait() only waits for a single child to finish, then it returns, even if other children are still running. The symptoms were exactly like flac2all (with orphan children left behind).

The solution there was to call wait() in an infinite loop and only break out when wait reports "no more children". So I have applied the same here: an infinite loop that only terminates when os.wait() raises ChildProcessError with errno 10 (no child processes).
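For reference, the loop boils down to something like this (an illustrative sketch, not a verbatim copy of the change on the branch):

import os

while True:
    try:
        pid, status = os.wait()    # blocks until any one child exits
    except ChildProcessError:      # errno 10 (ECHILD): no children left
        break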

So let's see if that solves it. If not, then I will have to dig deeper into the code. Fingers crossed though that this fixes it :-)

@ZivaVatra It does sound likely that you found the basic issue. As you mentioned, your implementation does not follow the regular pattern of how the multiprocessing module is used. If you were to let the multiprocessing module do more of the work, I think you'd end up with something that would not need the final wait() loop.

Part of the issue is that the multiprocessing module aims to be a drop-in replacement for the threading module, so it has a join() method (to match the threading module), where it really should have a wait() method (because, at the OS level, join() is to wait for threads, while wait() is to wait for processes).

So that's why it looked like things should work earlier, because you already had a join() for the worker processes, which should boil down to an OS wait().

I'm not sure if this is the same issue, but it could be related (or the root cause).

I was having an issue where the last thread would not terminate after all tasks were complete and I would have to kill the process manually.

I traced the problem to a continuous recv timeout->continue loop in the worker thread that put the worker into a state where it no longer received commands.

I have a commit that fixes the problem for me by moving the "READY" command to the top of the loop. This ensures it is sent whenever a timeout occurs, which keeps the thread able to receive new commands such as the "EOL" command that terminates it.

taylorasdfjkl@e326e41
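For anyone following along, the shape of the change is roughly this (a sketch with made-up names and a DEALER socket assumed; the actual flac2all worker code differs):

import zmq

def worker_loop(endpoint="tcp://localhost:2019"):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.DEALER)
    sock.connect(endpoint)
    sock.setsockopt(zmq.RCVTIMEO, 1000)   # 1-second receive timeout

    while True:
        # "READY" is sent at the top of every iteration, so even after a
        # receive timeout the worker announces itself again and can still
        # be handed the "EOL" command that tells it to terminate.
        sock.send_string("READY")
        try:
            task = sock.recv_string()
        except zmq.Again:
            continue                      # timed out; loop and re-announce
        if task == "EOL":
            break
        # ... run the encode job described by `task` ...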

Hi @taylorasdfjkl , could you issue a Merge request against my repo, then I can review and merge for further testing?

Note that the multiprocessing module has stability issues on Unix. The problem is that multiprocessing is accomplished with an OS-level 'fork', which makes the parent's context available in the child, including file handles and locks. Due to this, Python 3.14 will change the default start method for multiprocessing on Unix from 'fork' to 'spawn'. Mac and Windows already use 'spawn' by default.

https://docs.python.org/3.12/library/multiprocessing.html#contexts-and-start-methods

I got bitten by this by having the child processes use the logging setup from the parent. So logging messages for debugging actually itself caused the processes to become more unstable.
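For reference, the start method can be selected per context without changing the global default (a generic example, not something flac2all currently does):

import multiprocessing as mp

if __name__ == "__main__":
    # 'spawn' starts a fresh interpreter in the child instead of inheriting
    # the parent's memory, open file handles and locks via fork().
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=print, args=("hello from a spawned child",))
    p.start()
    p.join()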

I don't think the Python multiprocessing has any stability issues on Unix. At least I've never come across any.

It seems that what Python calls "spawn" bears no relation to what the rest of the computing world refers to as spawn, so it confused me for a moment, as fork() is superior to that in just about every way.

Effectively, spawn in Python is "fork(), but with only the resources deemed necessary". It has no relevance here, as we don't share any handles across the processes; everything is handled via ZeroMQ.

The issue here may well be as simple as taylorasdfjkl's change. I will see about either getting an MR from them to merge and test, or copying the changes manually, to see if it resolves the issue.

I got the MR 😃 It is now merged into the 57-errors-on-invoking-latest-version branch; feel free to check that branch out and test.

I will run some tests too, and if no issues are found or reported here, I will merge into the version5 branch.

Ok, I will take the absence of comments as a sign that it is all working fine, so I will merge to version5 now and close the ticket. Thanks for bringing it to my attention!

This is still happening as far as I can tell: one worker hangs at the end, and the next run throws the same error:

Traceback (most recent call last):
  File "/home/user/git/flac2all/flac2all_pkg/__init__.py", line 446, in <module>
    main()
  File "/home/user/git/flac2all/flac2all_pkg/__init__.py", line 439, in main
    clustered_encode(localworkers=True)
  File "/home/user/git/flac2all/flac2all_pkg/__init__.py", line 101, in clustered_encode
    tsock.bind("tcp://*:2019")
  File "/usr/lib/python3.12/site-packages/zmq/sugar/socket.py", line 302, in bind
    super().bind(addr)
  File "zmq/backend/cython/socket.pyx", line 564, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 28, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use (addr='tcp://*:2019')

@CameronReinhart What flac2all branch did you test with?