dme-compunet/YoloV8

Postprocess takes up too much time

FunJoo opened this issue · 10 comments

FunJoo commented

First of all, thank you so much for open-sourcing this repository. I've been looking for an ML.NET implementation of YOLOv8 for a long time and even spent some time trying to understand ML.OnnxRuntime (but I still couldn't grasp it).

I've reviewed your source code, and compared to other open-source repositories, it's much easier for me to understand. However, I did notice an issue: the OutputParser.Parse() method seems to take up too much time. I wonder if there's a way to resolve this.

These are some tests:
CPU - i7-10750H
GPU - RTX2060 Laptop

- Detect - CPU
Image origin size: 2448x2048
imgsz: 640x640
Preprocess: 0.07468 s
Inference: 0.24290 s
Postprocess: 0.54611 s

- Detect - GPU
Image origin size: 2448x2048
imgsz: 640x640
Preprocess: 0.0879 s
Inference: 0.06722 s
Postprocess: 0.34554 s

- Segment - CPU
Image origin size: 3840x2748
imgsz: 800x800
Preprocess: 0.10734 s
Inference: 0.58066 s
Postprocess: 3.99814 s

- Segment - GPU
Image origin size: 3840x2748
imgsz: 800x800
Preprocess: 0.11179 s
Inference: 0.36416 s
Postprocess: 4.10433 s
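
(For reference, a minimal sketch of how per-stage timings like these can be measured with System.Diagnostics.Stopwatch; the RunPreprocess/RunInference/RunPostprocess calls below are placeholders for my own code, not this repository's API.)

        using System;
        using System.Diagnostics;

        // Hypothetical timing helper; "stage" stands in for whatever step is being measured.
        static double Measure(Action stage)
        {
            var sw = Stopwatch.StartNew();
            stage();
            sw.Stop();
            return sw.Elapsed.TotalSeconds;
        }

        // Usage (placeholder stage methods):
        // Console.WriteLine($"Preprocess:  {Measure(RunPreprocess):F5} s");
        // Console.WriteLine($"Inference:   {Measure(RunInference):F5} s");
        // Console.WriteLine($"Postprocess: {Measure(RunPostprocess):F5} s");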

FunJoo commented

I'm planning to try using SkiaSharp (which is said to have better image-processing performance) or a DLL exported from C++ or Rust for the image processing in the future. If there's a significant performance improvement, can I submit a pull request?
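
(Calling into a native DLL from C# would go through P/Invoke; here is a rough sketch of what that could look like. The "imageproc" library name and the resize_bilinear export are made-up placeholders, not an existing binary.)

        using System;
        using System.Runtime.InteropServices;

        public static class NativeImaging
        {
            // Hypothetical native export: "imageproc" and this signature are illustrative only.
            // The corresponding C++/Rust function would read and write raw pixel buffers.
            [DllImport("imageproc", CallingConvention = CallingConvention.Cdecl)]
            private static extern void resize_bilinear(
                IntPtr src, int srcWidth, int srcHeight,
                IntPtr dst, int dstWidth, int dstHeight);

            public static void Resize(byte[] src, int srcWidth, int srcHeight,
                                      byte[] dst, int dstWidth, int dstHeight)
            {
                // Pin the managed buffers so the native code can access them safely.
                var srcHandle = GCHandle.Alloc(src, GCHandleType.Pinned);
                var dstHandle = GCHandle.Alloc(dst, GCHandleType.Pinned);

                try
                {
                    resize_bilinear(srcHandle.AddrOfPinnedObject(), srcWidth, srcHeight,
                                    dstHandle.AddrOfPinnedObject(), dstWidth, dstHeight);
                }
                finally
                {
                    srcHandle.Free();
                    dstHandle.Free();
                }
            }
        }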

FunJoo commented

I tried SkiaSharp, but there was no significant improvement in image-processing performance. So, based on my previous experience with System.Drawing.Bitmap, I changed how the code traverses ImageSharp images.
The current optimizations include:

  1. Using a single layer of Parallel.For.
    In DetectionOutputParser.Parse() from line 34 to line 64:
        // Iterate over the candidate boxes (the last dimension of the output tensor) in parallel.
        Parallel.For(0, output.Dimensions[2], i =>
        {
            for (int j = 0; j < metadata.Classes.Count; j++)
            {
                var confidence = output[0, j + 4, i];

                if (confidence <= parameters.Confidence)
                    continue;

                // same as the original code ...
            }
        });
  2. Using image.DangerousTryGetSinglePixelMemory(out Memory<TPixel> memory) instead of image[x,y].
    In ImageSharpExtensions.ForEachPixel():
        var width = image.Width;
        var height = image.Height;
        var totalPixels = width * height;

        // Try to expose the whole image as a single contiguous pixel buffer.
        var flag = image.DangerousTryGetSinglePixelMemory(out Memory<TPixel> memory);

        Parallel.For(0, totalPixels, index =>
        {
            int x = index % width;
            int y = index / width;

            var point = new Point(x, y);
            var pixel = memory.Span[index];  // This line throws an index-out-of-range error for certain images. It seems related to image size conversion, but it works normally for most images.

            action(point, pixel);
        });

- Detect - GPU
Image origin size: 2448x2048
Image count: 19
imgsz: 640x640

Preprocess: 0.0879 s ===> 0.03151 s
Inference: 0.06722 s ===> 0.05268 s // I don't know why it's getting faster
Postprocess: 0.34554 s ===> 0.00428 s
The processing times are averages over the 19 images.

These optimizations perform well for the detect task, but I'm still not satisfied with the processing speed for segment: SegmentationOutputParser.ProcessMask in particular is still too slow, although the preprocess and postprocess times have indeed become faster (I have already applied changes similar to those in DetectionOutputParser.Parse()).
Therefore, I'm planning to move all the image processing code to C++ in my project.

The documentation for the DangerousTryGetSinglePixelMemory function says that memory corruption can occur while accessing the Span; this may be related to the error you are getting. Maybe ProcessPixelRows will give a better result.
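
A ForEachPixel variant built on ProcessPixelRows could look roughly like this (just a sketch assuming the Action<Point, TPixel> callback shape used above, not code from the repository):

        using System;
        using SixLabors.ImageSharp;
        using SixLabors.ImageSharp.PixelFormats;

        public static class ImageSharpExtensions
        {
            public static void ForEachPixel<TPixel>(this Image<TPixel> image, Action<Point, TPixel> action)
                where TPixel : unmanaged, IPixel<TPixel>
            {
                image.ProcessPixelRows(accessor =>
                {
                    for (int y = 0; y < accessor.Height; y++)
                    {
                        // GetRowSpan is valid even when the image is not backed by
                        // a single contiguous buffer.
                        Span<TPixel> row = accessor.GetRowSpan(y);

                        for (int x = 0; x < row.Length; x++)
                        {
                            action(new Point(x, y), row[x]);
                        }
                    }
                });
            }
        }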

The bottleneck in decoding the segmentation results is upscaling the masks from 120x120 to the original size. That requires an interpolation algorithm to fill in the additional pixels; I'm currently doing it with ImageSharp, and that is what slows down the process. I could skip the upscaling, but that produces pixelated masks, so I chose good masks in exchange for a slightly longer postprocess.
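
To give an idea of what that interpolation amounts to, here is a rough sketch of bilinear upscaling on a plain float mask, independent of ImageSharp (the float[,] layout and method name are illustrative, not the actual ProcessMask code):

        using System;

        public static class MaskUpscaling
        {
            // Bilinear upscaling of a raw mask (values in 0..1); assumes both target dimensions are > 1.
            public static float[,] UpscaleBilinear(float[,] source, int targetWidth, int targetHeight)
            {
                int srcWidth = source.GetLength(0);
                int srcHeight = source.GetLength(1);
                var result = new float[targetWidth, targetHeight];

                for (int y = 0; y < targetHeight; y++)
                {
                    for (int x = 0; x < targetWidth; x++)
                    {
                        // Map the target pixel back into source coordinates.
                        float srcX = x * (srcWidth - 1) / (float)(targetWidth - 1);
                        float srcY = y * (srcHeight - 1) / (float)(targetHeight - 1);

                        int x0 = (int)srcX, y0 = (int)srcY;
                        int x1 = Math.Min(x0 + 1, srcWidth - 1);
                        int y1 = Math.Min(y0 + 1, srcHeight - 1);
                        float fx = srcX - x0, fy = srcY - y0;

                        // Blend the four surrounding source values.
                        float top = source[x0, y0] * (1 - fx) + source[x1, y0] * fx;
                        float bottom = source[x0, y1] * (1 - fx) + source[x1, y1] * fx;
                        result[x, y] = top * (1 - fy) + bottom * fy;
                    }
                }

                return result;
            }
        }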

Would you be willing to submit a PR with the fixes you've made so far?

Thanks for your work!

FunJoo commented

I've submitted the PR.
In Image.ForEachPixel(), I ended up using the following approach to avoid the exceptions. It works for most images; if the flag is false, the mask is simply not drawn on the result image.

        // Only traverse the pixels when the image is backed by a single contiguous buffer.
        var flag = image.DangerousTryGetSinglePixelMemory(out Memory<TPixel> memory);

        if (flag)
        {
            Parallel.For(0, totalPixels, index =>
            {
                int x = index % width;
                int y = index / width;

                var point = new Point(x, y);
                var pixel = memory.Span[index];

                action(point, pixel);
            });
        }

I'm still looking for the reason why flag is set to false.

@FunJoo Your PR is merged; can you confirm that it works without errors?

FunJoo commented

I've verified it. Both the tests with the demo weights and dataset and the tests with my custom weights and dataset pass successfully.

@FunJoo I fixed some things in the segmentation postprocess; can you check its performance now?

FunJoo commented

Amazing! The postprocess is much faster now than before. Thank you; I think it's ready to use in my project.

- Detect
Image origin size: 2448x2048
imgsz: 640x640
Average Postprocess Time: 0.00428 s ==> 0.00102 s

- Segment
Image origin size: 3840x2748
imgsz: 800x800
Average Postprocess Time: 1.18653 s ==> 0.25893 s