dme-compunet/YoloV8

Postprocess takes up too much time

FunJoo opened this issue · 10 comments

FunJoo commented

First of all, thank you so much for open-sourcing this repository. I've been looking for an ML.NET implementation of YOLOv8 for a long time and even spent some time trying to understand ML.OnnxRuntime (but I still couldn't grasp it).

I've reviewed your source code, and compared to other open-source repositories, it's much easier for me to understand. However, I did notice an issue: the OutputParser.Parse() method seems to take up too much time. I wonder if there's a way to resolve this.

These are some tests:
CPU - i7-10750H
GPU - RTX2060 Laptop

- Detect - CPU
Image origin size: 2448x2048
imgsz: 640x640
Preprocess: 0.07468 s
Inference: 0.24290 s
Postprocess: 0.54611 s

- Detect - GPU
Image origin size: 2448x2048
imgsz: 640x640
Preprocess: 0.0879 s
Inference: 0.06722 s
Postprocess: 0.34554 s

- Segment - CPU
Image origin size: 3840x2748
imgsz: 800x800
Preprocess: 0.10734 s
Inference: 0.58066 s
Postprocess: 3.99814 s

- Segment - GPU
Image origin size: 3840x2748
imgsz: 800x800
Preprocess: 0.11179 s
Inference: 0.36416 s
Postprocess: 4.10433 s
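
(For reference, a minimal sketch of how per-stage timings like these can be measured with System.Diagnostics.Stopwatch; the RunPreprocess/RunInference/RunPostprocess calls below are placeholders for my own code, not this repository's API.)

        using System;
        using System.Diagnostics;

        // Hypothetical timing helper; "stage" stands in for whatever step is being measured.
        static double Measure(Action stage)
        {
            var sw = Stopwatch.StartNew();
            stage();
            sw.Stop();
            return sw.Elapsed.TotalSeconds;
        }

        // Usage (placeholder stage methods):
        // Console.WriteLine($"Preprocess:  {Measure(RunPreprocess):F5} s");
        // Console.WriteLine($"Inference:   {Measure(RunInference):F5} s");
        // Console.WriteLine($"Postprocess: {Measure(RunPostprocess):F5} s");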

FunJoo commented

I'm planning to try using SkiaSharp (which is said to have better image-processing performance) or a DLL exported from C++ or Rust for the image processing in the future. If there's a significant performance improvement, can I submit a pull request?
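
(Calling into a native DLL from C# would go through P/Invoke; here is a rough sketch of what that could look like. The "imageproc" library name and the resize_bilinear export are made-up placeholders, not an existing binary.)

        using System;
        using System.Runtime.InteropServices;

        public static class NativeImaging
        {
            // Hypothetical native export: "imageproc" and this signature are illustrative only.
            // The corresponding C++/Rust function would read and write raw pixel buffers.
            [DllImport("imageproc", CallingConvention = CallingConvention.Cdecl)]
            private static extern void resize_bilinear(
                IntPtr src, int srcWidth, int srcHeight,
                IntPtr dst, int dstWidth, int dstHeight);

            public static void Resize(byte[] src, int srcWidth, int srcHeight,
                                      byte[] dst, int dstWidth, int dstHeight)
            {
                // Pin the managed buffers so the native code can access them safely.
                var srcHandle = GCHandle.Alloc(src, GCHandleType.Pinned);
                var dstHandle = GCHandle.Alloc(dst, GCHandleType.Pinned);

                try
                {
                    resize_bilinear(srcHandle.AddrOfPinnedObject(), srcWidth, srcHeight,
                                    dstHandle.AddrOfPinnedObject(), dstWidth, dstHeight);
                }
                finally
                {
                    srcHandle.Free();
                    dstHandle.Free();
                }
            }
        }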

FunJoo commented

I tried SkiaSharp, but there was no significant improvement in image-processing performance. So, based on my previous experience with System.Drawing.Bitmap, I changed how the code traverses ImageSharp images.
The current optimizations include:

  1. Using a single layer of Parallel.For.
    In DetectionOutputParser.Parse() from line 34 to line 64:
        // Iterate over the candidate boxes (the last dimension of the output tensor) in parallel.
        Parallel.For(0, output.Dimensions[2], i =>
        {
            for (int j = 0; j < metadata.Classes.Count; j++)
            {
                var confidence = output[0, j + 4, i];

                if (confidence <= parameters.Confidence)
                    continue;

                // same as the original code ...
            }
        });
  2. Using image.DangerousTryGetSinglePixelMemory(out Memory<TPixel> memory) instead of image[x,y].
    In ImageSharpExtensions.ForEachPixel():
        var width = image.Width;
        var height = image.Height;
        var totalPixels = width * height;

        // Try to expose the whole image as a single contiguous pixel buffer.
        var flag = image.DangerousTryGetSinglePixelMemory(out Memory<TPixel> memory);

        Parallel.For(0, totalPixels, index =>
        {
            int x = index % width;
            int y = index / width;

            var point = new Point(x, y);
            var pixel = memory.Span[index];  // This line throws an index-out-of-range error for certain images. It seems related to image size conversion, but it works normally for most images.

            action(point, pixel);
        });

- Detect - GPU
Image origin size: 2448x2048
Image count: 19
imgsz: 640x640

Preprocess: 0.0879 s ===> 0.03151 s
Inference: 0.06722 s ===> 0.05268 s // I don't know why it's getting faster
Postprocess: 0.34554 s ===> 0.00428 s
The processing times are averages over the 19 images.

These optimizations perform well for the detect task, but I'm still not satisfied with the processing speed for segment: SegmentationOutputParser.ProcessMask in particular is still too slow, although the preprocess and postprocess times have indeed become faster (I have already applied changes similar to those in DetectionOutputParser.Parse()).
Therefore, I'm planning to move all the image processing code to C++ in my project.

The documentation for the DangerousTryGetSinglePixelMemory function says that memory corruption can occur while accessing the Span; this may be related to the error you are getting. Maybe ProcessPixelRows will give a better result.
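
A ForEachPixel variant built on ProcessPixelRows could look roughly like this (just a sketch assuming the Action<Point, TPixel> callback shape used above, not code from the repository):

        using System;
        using SixLabors.ImageSharp;
        using SixLabors.ImageSharp.PixelFormats;

        public static class ImageSharpExtensions
        {
            public static void ForEachPixel<TPixel>(this Image<TPixel> image, Action<Point, TPixel> action)
                where TPixel : unmanaged, IPixel<TPixel>
            {
                image.ProcessPixelRows(accessor =>
                {
                    for (int y = 0; y < accessor.Height; y++)
                    {
                        // GetRowSpan is valid even when the image is not backed by
                        // a single contiguous buffer.
                        Span<TPixel> row = accessor.GetRowSpan(y);

                        for (int x = 0; x < row.Length; x++)
                        {
                            action(new Point(x, y), row[x]);
                        }
                    }
                });
            }
        }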

The bottleneck in decoding the segmentation results is upscaling the masks from 120x120 to the original size. That requires an interpolation algorithm to fill in the additional pixels; I'm currently doing it with ImageSharp, and that is what slows down the process. I could skip the upscaling, but that produces pixelated masks, so I chose good masks in exchange for a slightly longer postprocess.
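
To give an idea of what that interpolation amounts to, here is a rough sketch of bilinear upscaling on a plain float mask, independent of ImageSharp (the float[,] layout and method name are illustrative, not the actual ProcessMask code):

        using System;

        public static class MaskUpscaling
        {
            // Bilinear upscaling of a raw mask (values in 0..1); assumes both target dimensions are > 1.
            public static float[,] UpscaleBilinear(float[,] source, int targetWidth, int targetHeight)
            {
                int srcWidth = source.GetLength(0);
                int srcHeight = source.GetLength(1);
                var result = new float[targetWidth, targetHeight];

                for (int y = 0; y < targetHeight; y++)
                {
                    for (int x = 0; x < targetWidth; x++)
                    {
                        // Map the target pixel back into source coordinates.
                        float srcX = x * (srcWidth - 1) / (float)(targetWidth - 1);
                        float srcY = y * (srcHeight - 1) / (float)(targetHeight - 1);

                        int x0 = (int)srcX, y0 = (int)srcY;
                        int x1 = Math.Min(x0 + 1, srcWidth - 1);
                        int y1 = Math.Min(y0 + 1, srcHeight - 1);
                        float fx = srcX - x0, fy = srcY - y0;

                        // Blend the four surrounding source values.
                        float top = source[x0, y0] * (1 - fx) + source[x1, y0] * fx;
                        float bottom = source[x0, y1] * (1 - fx) + source[x1, y1] * fx;
                        result[x, y] = top * (1 - fy) + bottom * fy;
                    }
                }

                return result;
            }
        }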

Would you be willing to submit a PR with the fixes you've made so far?

Thanks for your work!

FunJoo commented

I've submitted the PR.
In Image.ForEachPixel(), I ended up using the following approach to avoid the exceptions. It works for most images; if the flag is false, the mask is simply not drawn on the result image.

        // Only traverse the pixels when the image is backed by a single contiguous buffer.
        var flag = image.DangerousTryGetSinglePixelMemory(out Memory<TPixel> memory);

        if (flag)
        {
            Parallel.For(0, totalPixels, index =>
            {
                int x = index % width;
                int y = index / width;

                var point = new Point(x, y);
                var pixel = memory.Span[index];

                action(point, pixel);
            });
        }

I'm still looking for the reason why flag is set to false.

@FunJoo Your PR is merged; can you confirm that it works without errors?

FunJoo commented

I've verified it. Both the tests with the demo weights and dataset and the tests with my custom weights and dataset pass successfully.

@FunJoo I fixed some things in the segmentation postprocess; can you check its performance now?

FunJoo commented

Amazing! The postprocess is much faster now than before. Thank you; I think it's ready to use in my project.

- Detect
Image origin size: 2448x2048
imgsz: 640x640
Average Postprocess Time: 0.00428 s ==> 0.00102 s

- Segment
Image origin size: 3840x2748
imgsz: 800x800
Average Postprocess Time: 1.18653 s ==> 0.25893 s