dme-compunet/YoloV8

To improve efficiency, I recommend modifying the Preprocess function.

ErrorGz opened this issue · 9 comments

Here are the main modifications:

  1. Change the input array from a 4-dimensional index to a flat 1-dimensional index.
  2. Change the loop to a parallel loop.

private Tensor Preprocess(Image image)
{
var modelSize = _metadata.ImageSize;

  var xPadding = 0;
  var yPadding = 0;

  int targetWidth;
  int targetHeight;

  if (_parameters.ProcessWithOriginalAspectRatio)
  {
      var xRatio = (float)modelSize.Width / image.Width;
      var yRatio = (float)modelSize.Height / image.Height;

      var ratio = Math.Min(xRatio, yRatio);

      targetWidth = (int)(image.Width * ratio);
      targetHeight = (int)(image.Height * ratio);

      xPadding = (modelSize.Width - targetWidth) / 2;
      yPadding = (modelSize.Height - targetHeight) / 2;
  }
  else
  {
      targetWidth = modelSize.Width;
      targetHeight = modelSize.Height;
  }

  image.Mutate(x => x.Resize(targetWidth, targetHeight));

  float[] inputArray = new float[image.Width * image.Height * 3];
  var TotalPixel = image.Width * image.Height;
  Parallel.For(0, image.Height, y =>
  {
      image.ProcessPixelRows(row =>
      {
          var pixelSpan = row.GetRowSpan(y);
          int rowBase = y * image.Width;

          for (int x = 0; x < image.Width; x++)
          {
              var R_offset = rowBase + x;
              var G_offset = TotalPixel + rowBase + x;
              var B_offset = TotalPixel * 2 + rowBase + x;

              inputArray[R_offset] = pixelSpan[x].R / 255f;
              inputArray[G_offset] = pixelSpan[x].G / 255f;
              inputArray[B_offset] = pixelSpan[x].B / 255f;
          }
      });
  });

  var dimensions = new int[] { 1, 3, image.Height, image.Width };
  var input = new DenseTensor<float>(inputArray, dimensions);



  // Original implementation, kept for comparison:
  //var dimensions = new int[] { 1, 3, modelSize.Height, modelSize.Width };
  //var input = new DenseTensor<float>(dimensions);

  //image.ForEachPixel((point, pixel) =>
  //{
  //    var x = point.X + xPadding;
  //    var y = point.Y + yPadding;

  //    var r = pixel.R / 255f;
  //    var g = pixel.G / 255f;
  //    var b = pixel.B / 255f;

  //    input[0, 0, y, x] = r;
  //    input[0, 1, y, x] = g;
  //    input[0, 2, y, x] = b;
  //});



  return input;

}

Furthermore, upon reviewing the source code, I noticed that after the resize operation in the Preprocess function, memory is still allocated based on the original size. I am unsure if this is intentional. Please verify this.

private Tensor Preprocess(Image image)
{
var modelSize = _metadata.ImageSize;

    var xPadding = 0;
    var yPadding = 0;

    int targetWidth;
    int targetHeight;

    if (_parameters.ProcessWithOriginalAspectRatio)
    {
        var xRatio = (float)modelSize.Width / image.Width;
        var yRatio = (float)modelSize.Height / image.Height;

        var ratio = Math.Min(xRatio, yRatio);

        targetWidth = (int)(image.Width * ratio);
        targetHeight = (int)(image.Height * ratio);

        xPadding = (modelSize.Width - targetWidth) / 2;
        yPadding = (modelSize.Height - targetHeight) / 2;
    }
    else
    {
        targetWidth = modelSize.Width;
        targetHeight = modelSize.Height;
    }

    image.Mutate(x => x.Resize(targetWidth, targetHeight)); // resized to the target size

    var dimensions = new int[] { 1, 3, modelSize.Height, modelSize.Width }; // but allocated at the model size
    var input = new DenseTensor<float>(dimensions);

    image.ForEachPixel((point, pixel) =>
    {
        var x = point.X + xPadding;
        var y = point.Y + yPadding;

        var r = pixel.R / 255f;
        var g = pixel.G / 255f;
        var b = pixel.B / 255f;

        input[0, 0, y, x] = r;
        input[0, 1, y, x] = g;
        input[0, 2, y, x] = b;
    });

    return input;
}

Another suggestion is to add a comparison between first.Point and second.Point in the PlotImage function. The code is as follows:

if (first.Confidence < options.KeypointConfidence || second.Confidence < options.KeypointConfidence || Point.Equals(first.Point, second.Point))
    continue;
FunJoo commented

I am also focusing on performance-related issues.

  1. About your first question, you can refer to Postprocess takes up too much time #1 and Improve postprocess time #2.
    When I was optimizing before, I used image.DangerousTryGetSinglePixelMemory(), but I saw that sstainba/Yolov8.Net is using image.DangerousGetPixelRowMemory().
    If the performance improves after you use image.ProcessPixelRows(), it is very likely that line 8 of Yolov8.Extensions.ImageSharpExtensions.cs is returning false. I tested it on images from one of my projects and hit a failure of image.DangerousTryGetSinglePixelMemory() (see Improve postprocess time #2 for details), which is why the author retained the for-loop code. I look forward to your further analysis and feedback, and I am also following this issue.
  2. About your second question, you can refer to There are false positives occurring in the detection task #13.

Based on my debugging, I found that the following statements in the Preprocess function consume the most CPU time: input[0, 0, y, x] = r; input[0, 1, y, x] = g; input[0, 2, y, x] = b;. My understanding is that the slow addressing through the DenseTensor's 4-dimensional indexer is the cause. I converted it to a 1-dimensional float[] inputArray, which allows for faster addressing. Overall, the large efficiency gain comes not from parallel processing but from the optimized array addressing. Both obtaining the entire image as one contiguous memory block with DangerousTryGetSinglePixelMemory and obtaining one row at a time with ProcessPixelRows are operations supported by the official documentation.
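To make the addressing point concrete, here is a minimal standalone sketch (the ToNchw helper and its raw pixels input are illustrative, not part of the library): it flattens row-major RGB bytes into NCHW planes, computing each plane offset once instead of going through a 4-dimensional indexer on every write.

```csharp
// Sketch only: flat NCHW layout with per-plane offsets.
// `pixels` stands in for the image data (row-major, 3 bytes per pixel).
static float[] ToNchw(byte[] pixels, int width, int height)
{
    int plane = width * height;        // pixels per channel plane
    var input = new float[3 * plane];
    const float inv255 = 1f / 255f;

    for (int i = 0; i < plane; i++)
    {
        input[i]             = pixels[i * 3]     * inv255; // R plane
        input[plane + i]     = pixels[i * 3 + 1] * inv255; // G plane
        input[2 * plane + i] = pixels[i * 3 + 2] * inv255; // B plane
    }
    return input;
}
```

The same flat array can then be wrapped in a DenseTensor with dimensions { 1, 3, height, width }, since DenseTensor accepts a backing buffer in its constructor.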

Hi @ErrorGz, thanks for your optimization suggestions.

Change the input array from a 4-dimensional index to a flat 1-dimensional index.

I'm not sure it matters much, because DenseTensor also stores its data in a one-dimensional array behind the scenes.

Change the loop to a parallel loop.

ForEachPixel also uses a parallel loop, but in the next version I'm going to change it: first try to get all the data in one memory block with DangerousTryGetSinglePixelMemory, and if the image data is split across several places in memory, fall back to DangerousGetPixelRowMemory.
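A minimal sketch of that fallback using ImageSharp's API (the ExtractNchw helper name is hypothetical; the library's actual implementation may differ):

```csharp
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;

static float[] ExtractNchw(Image<Rgb24> image)
{
    int plane = image.Width * image.Height;
    var input = new float[3 * plane];

    // Fast path: the whole image lives in one contiguous memory block.
    if (image.DangerousTryGetSinglePixelMemory(out var memory))
    {
        var pixels = memory.Span;
        for (int i = 0; i < pixels.Length; i++)
        {
            input[i]             = pixels[i].R / 255f;
            input[plane + i]     = pixels[i].G / 255f;
            input[2 * plane + i] = pixels[i].B / 255f;
        }
    }
    else
    {
        // Fallback: the pixel buffer is split, so read it row by row.
        for (int y = 0; y < image.Height; y++)
        {
            var row = image.DangerousGetPixelRowMemory(y).Span;
            int rowBase = y * image.Width;
            for (int x = 0; x < row.Length; x++)
            {
                input[rowBase + x]             = row[x].R / 255f;
                input[plane + rowBase + x]     = row[x].G / 255f;
                input[2 * plane + rowBase + x] = row[x].B / 255f;
            }
        }
    }
    return input;
}
```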

memory is still allocated based on the original size. I am unsure if this is intentional. Please verify this.

It must be like this because the model expects an input of size imgsz.

I've made some additional optimizations independently and successfully enabled fp16 support.
Looks like what I came up with on my own is very similar to the comments here.
Was able to achieve around 20-25 ms total execution time at 640x640 with the nano model.
That's about 10-20 ms for preprocessing and 2-8 ms for the actual inference on a 2080 Ti. I can share my changes if there's any interest.

  1. calculating the linearized index directly is much faster than using the read-only span extension
var dim1 = dimensions[1] * dimensions[2] * dimensions[3];
var dim2 = dimensions[2] * dimensions[3];

input.Buffer.Span[0 * dim1 + 0 * dim2 + y * dimensions[3] + x] = r;
input.Buffer.Span[0 * dim1 + 1 * dim2 + y * dimensions[3] + x] = g;
input.Buffer.Span[0 * dim1 + 2 * dim2 + y * dimensions[3] + x] = b;
  2. multiply instead of divide in the parallel loop
    note: this is probably a trivial difference, but it doesn't hurt. I also tried experimenting with SIMD, but didn't see any noticeable performance benefit
private static readonly float Multiplier = 1 / 255f;

var r = pixel.R * Multiplier;
var g = pixel.G * Multiplier;
var b = pixel.B * Multiplier;
  3. don't use the func callback of the ImageSharp ForEachPixel extension, as it creates extra overhead
    note: I also tested using Parallel.For on each row of the image with a regular inner for loop over each pixel in the row, and according to the high-precision performance counter, that method is about 15-20% faster.
image.DangerousTryGetSinglePixelMemory(out var memory);

Parallel.For(0, image.Height * image.Width, index =>
{
    var pixel = memory.Span[index];
    var x = index % image.Width;
    var y = index / image.Width;
    // do rest of tensor loading
});
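The row-parallel variant mentioned in the note might look like this (a sketch under stated assumptions: ImageSharp's Image<Rgb24>, a preallocated NCHW buffer, and the illustrative name LoadTensorRowParallel):

```csharp
using System.Threading.Tasks;
using SixLabors.ImageSharp;
using SixLabors.ImageSharp.PixelFormats;

// Sketch: one Parallel.For iteration per row, plain inner loop per pixel.
// Each row writes a disjoint slice of `input`, so no synchronization is needed.
static void LoadTensorRowParallel(Image<Rgb24> image, float[] input)
{
    int width = image.Width;
    int plane = width * image.Height;
    const float inv255 = 1f / 255f;  // multiply instead of divide

    Parallel.For(0, image.Height, y =>
    {
        var row = image.DangerousGetPixelRowMemory(y).Span;
        int rowBase = y * width;
        for (int x = 0; x < width; x++)
        {
            var pixel = row[x];
            input[rowBase + x]             = pixel.R * inv255;
            input[plane + rowBase + x]     = pixel.G * inv255;
            input[2 * plane + rowBase + x] = pixel.B * inv255;
        }
    });
}
```

Compared with the per-pixel Parallel.For above, this amortizes the scheduling overhead over a whole row instead of a single pixel, which is consistent with the 15-20% figure reported in the note.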

I have some additional changes and of course support for Half/Float16 data types, but these were some of the highlights.

Thanks @RaidMax, your ideas are really good! Can you submit a PR with the optimizations and improvements you've made?

Regarding 3, in the latest version the extraction of pixels from the image is not done with a callback; everything is inside the ProcessToTensor method.

Looks like after reviewing the most recent code, most of the optimizations have already been made. I cloned the repo about a month ago and neglected to check for updates until recently. I can make a PR with the fp16 support, as anyone looking for a performance bump could definitely utilize it.

I believe that the new code has achieved the optimization goal.