rosenbjerg/FFMpegCore

Pipe name collisions (all pipe instances are busy)

reima opened this issue · 4 comments

reima commented

When processing a large number of files (> 100,000) in parallel (>= 8 threads) using pipes as inputs and outputs, we frequently get this exception:

System.IO.IOException: All pipe instances are busy.
   at System.IO.Pipes.NamedPipeServerStream.Create(String pipeName, PipeDirection direction, Int32 maxNumberOfServerInstances, PipeTransmissionMode transmissionMode, PipeOptions options, Int32 inBufferSize, Int32 outBufferSize, PipeSecurity pipeSecurity, HandleInheritability inheritability, PipeAccessRights additionalAccessRights)
   at FFMpegCore.Arguments.PipeArgument.Pre()
   at FFMpegCore.FFMpegArguments.Pre()
   at FFMpegCore.FFMpegArgumentProcessor.Process(ProcessArguments processArguments, CancellationTokenSource cancellationTokenSource)
   at FFMpegCore.FFMpegArgumentProcessor.ProcessAsynchronously(Boolean throwOnError, FFOptions ffMpegOptions)

We're running .NET 6.0 on Windows 10/11.

I believe this problem was introduced by PR #374, which shortened the random part of the pipe name to 5 hex digits. This allows multiple pipes to accidentally share the same name.

With 5 hex digits there are only 16^5 = 1,048,576 different pipe names to choose from. To demonstrate the high likelihood of collisions, consider the following program, which simulates the concurrent use of pipes:

using System;
using System.Collections.Generic;

public class Program
{
    // Number of hex characters used for the random part of the name.
    private const int RandomChars = 5;

    // Number of names in use at the same time.
    private const int MaxConcurrentItems = 16;

    private const int MaxItems = 100_000_000;

    public static void Main()
    {
        var namesInUse = new Queue<string>();
        var distinctNamesInUse = new HashSet<string>();

        for (var items = 0; items < MaxItems; items++)
        {
            // Retire the oldest name once the concurrency limit is reached.
            if (namesInUse.Count >= MaxConcurrentItems)
            {
                distinctNamesInUse.Remove(namesInUse.Dequeue());
            }

            // New name: the first RandomChars characters of a fresh GUID.
            var name = Guid.NewGuid().ToString("N").Substring(0, RandomChars);

            namesInUse.Enqueue(name);
            if (!distinctNamesInUse.Add(name))
            {
                Console.WriteLine($"First collision after {items} items.");
                return;
            }
        }

        Console.WriteLine($"No collisions after {MaxItems} items.");
    }
}

(run it on .NET Fiddle)

Here are a couple of results:

First collision after 33614 items.
First collision after 62839 items.
First collision after 59237 items.
First collision after 1515 items.
First collision after 148182 items.
First collision after 6946 items.
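
These numbers line up with a rough back-of-the-envelope estimate: each new name is drawn while 15 other names are still in use, so the chance of a collision on any given draw is about 15/16^5, and the expected number of draws until the first collision is its reciprocal. A minimal .NET 6 snippet (my own sketch, not library code) to check:

using System;

// Pool of 16^5 possible 5-hex-digit names; 16 names in use at any time.
const double poolSize = 1_048_576;
const int concurrent = 16;

// Each new draw can collide with any of the (concurrent - 1) names still in use.
double pPerDraw = (concurrent - 1) / poolSize;

Console.WriteLine($"collision probability per draw ≈ {pPerDraw:E2}");           // ≈ 1.4e-05
Console.WriteLine($"expected draws until first collision ≈ {1 / pPerDraw:F0}"); // ≈ 69905

An expected first collision after roughly 70,000 draws is consistent with the measured values above, given how skewed the geometric distribution is.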

I understand that using the full GUID for the random part hit a name length limitation on macOS (#365). But shortening the random part to only 5 hex digits is unwarranted (IMO).

This is the reported error from the issue:

The path '/var/folders/n2/5qzqsyf97cq3tz608mpdqw2r0000gn/T/CoreFxPipe_FFMpegCore_1325ce6c-6e8e-497a-8c0a-c46703282209' is of an invalid length for use with domain sockets on this platform.  The length must be between 1 and 104 characters, inclusive. (Parameter 'path')

Actual value was /var/folders/n2/5qzqsyf97cq3tz608mpdqw2r0000gn/T/CoreFxPipe_FFMpegCore_1325ce6c-6e8e-497a-8c0a-c46703282209.

The pipe path here has a length of 107 characters. The maximum length is 104 characters. Dropping the dashes from the GUID string using ToString("N") already loses 4 characters, barely bringing it into the required length range with 103 characters. To get some breathing room, dropping an additional 16 of the GUID characters brings us down to 87 characters, well within the allowed length range.
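
For reference, the same length arithmetic in code (the directory prefix is copied verbatim from the error message above):

using System;

var prefix = "/var/folders/n2/5qzqsyf97cq3tz608mpdqw2r0000gn/T/CoreFxPipe_FFMpegCore_";
Console.WriteLine(prefix.Length + 36); // full GUID with dashes: 107
Console.WriteLine(prefix.Length + 32); // GUID in "N" format:    103
Console.WriteLine(prefix.Length + 16); // 16 GUID characters:     87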

This still leaves us with 16 hex characters for the random part, which massively increases the number of unique pipe names to 16^16 = 18,446,744,073,709,551,616. Collisions should be orders of magnitude less likely to occur.
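
To put a number on "orders of magnitude", here is the per-draw estimate from above repeated with the larger pool (a sketch under the same assumptions):

using System;

// Same estimate as before, but with a pool of 16^16 names.
double pPerDraw = 15 / Math.Pow(16, 16);                                         // ≈ 8.1e-19
Console.WriteLine($"expected draws until first collision ≈ {1 / pPerDraw:E1}"); // ≈ 1.2e+18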

(Note that to maximize randomness, one should not use the first 16 characters of the GUID string. They include the version digit at zero-based index 12, which is always 4 for Guid.NewGuid() and provides no randomness. Instead use e.g. the last 16 characters.)

Therefore my suggestion is to use 16 characters of the GUID instead of 5.
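
For illustration, a minimal sketch of the proposed naming scheme. The FFMpegCore_ prefix is inferred from the error message above, not taken from the library's source:

using System;

// A fresh GUID as 32 hex characters; the version digit '4' sits at zero-based index 12.
var guid = Guid.NewGuid().ToString("N");

// Use the last 16 characters so the fixed version digit is never included.
var suffix = guid.Substring(guid.Length - 16);

var pipeName = $"FFMpegCore_{suffix}";
Console.WriteLine(pipeName); // e.g. "FFMpegCore_8c0ac46703282209"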

@ep1kt3t0s Pinging you to get your thoughts on this, as you submitted the change that shortened the GUID to 5 hex characters.

ep1kt3t0s commented

I think it's better to make this configurable rather than fixing a specific format.

Also, when processing a large number of files like 100K concurrently, wouldn't it be better to extract the logic into a CLI and run it through containers, so it can also scale in the cloud and any temporary or undeleted files won't be a concern?

In my opinion, if the library supports processing just one file at a time, that's good enough. The run instances should be encapsulated from each other in isolated environments. Otherwise the question becomes: how many files should this library support processing, 1K, 100K, 10M? But maybe you have a more specific use case?

reima commented

@ep1kt3t0s Thanks for weighing in.

I think it's better to make this configurable rather than fixing a specific format.

In my opinion, surfacing such implementation details and forcing the library user to choose sensible values would be an unnecessary burden. I'd prefer the library to just work, without jumping through additional configuration hoops. Is there any reason why you would be opposed to using 16 random characters in the pipe names?

Also, when processing a large number of files like 100K concurrently, wouldn't it be better to extract the logic into a CLI and run it through containers, so it can also scale in the cloud and any temporary or undeleted files won't be a concern?

I'd rather not invest engineering resources into the additional complexity of container orchestration when there's a simpler solution at hand. We're never going to run our workload in the cloud anyway.

In my opinion, if the library supports processing just one file at a time, that's good enough.

Why shouldn't the library support processing multiple files concurrently when the effort to do so is low? Requiring single-file processing severely limits the usefulness of the library. For example, you couldn't build a desktop application that batch-processes multimedia files concurrently. Shipping a container runtime with a desktop application does not sound sensible to me.

Besides, the intent of the library author is to support concurrent processing, which is evident from the fact that the pipe name is randomized at all. Otherwise, two fixed pipe names (one for input and one for output) would suffice.

The run instances should be encapsulated from each other in isolated environments. Otherwise the question becomes: how many files should this library support processing, 1K, 100K, 10M?

I don't understand why concurrent processing in a single environment would require defining a limit on the number of files that can be processed. The important factor is how probable a name collision is when processing X files concurrently. Processing more files increases the likelihood that at least one collision occurs, but it doesn't change the per-file collision probability.

ep1kt3t0s commented

There can be a default convention with the possibility of changing the configuration in the library, but my point still stands. You don't want to invest engineering hours to support processing 100K files, yet you ask others to change the library to support your extensive use case.

reima commented

There's a difference between a single-line change (which I am happy to make myself) and the infrastructure work needed to spin up and orchestrate a swarm of containers. I'd rather invest in the simpler solution, both in terms of implementation effort and future maintenance.

As there haven't been any objections to my proposed changes, I went ahead and submitted a PR.