stereolabs/zed-open-capture

Race condition on Frame returned by `VideoCapture::getLastFrame`

peds opened this issue · 7 comments

peds commented

Preliminary Checks

  • This issue is not a duplicate. Before opening a new issue, please search existing issues.
  • This issue is not a question, feature request, or anything other than a bug report directly related to this project.

Description

There is a race condition on the Frame object that VideoCapture::getLastFrame returns a reference to.
When a user program calls VideoCapture::getLastFrame (e.g. in zed_ov_video_example.cpp), it receives a reference to the Frame object mLastFrame. Even though a lock is used in VideoCapture::getLastFrame, the lock releases when the function returns. So mLastFrame can continue to be updated in the video capture thread here; without the user program knowing that this is happening.
This results in occasional "corrupt" frames where some parts of the current frame that the user program is working with would get updated by data from future frames. The data corruption is noticeable as visual "glitches" when the frames are being stored or displayed in the user program. The following image is the left frame captured from a wooden stick that was being moved back and forth in front of the camera. The images were being retrieved in a while loop similar to zed_ov_video_example.cpp, and they were being saved using OpenCV's imwrite function.

zed_oc_glitch_example_left_frame

Even though OpenCV's imwrite call blocks the loop while the frame is being written to disk, the frame is being updated by another thread. That’s because the frame is a reference to mLastFrame, which is being updated continuously in the capture thread that is created in VideoCapture::startCapture() here.

This issue does NOT happen on all of the frames, and is hard to detect on frames without fast moving objects. It happens on some of the frames, and it is more frequent when the grab loop is being blocked by a file write call, and the disk that it's being written to is slow. For example on a Jetson Nano with an SD card. However, I have also observed the issue on my laptop and desktop with NVMe drives, but less frequently. The issue also happens when the frames are only being displayed (e.g. uisng OpenCV's imshow) but less frequently.

Steps to Reproduce

  1. Add an imwite call to the loop in zed_ov_video_example.cpp to save images to the disk.
  2. Set resolution to 2K (so saving to disk takes longer).
  3. Move an object in front of the camera to help with identifying the glitches
  4. Save a few hundred frames
  5. Inspect the saved frames and look for glitches similar to the image that was included in the description

Expected Result

All the frames should be intact with no glitches.

Actual Result

Some of the frames that were saved are corrupt and have glitches similar to the image that was included in the description

ZED Camera model

ZED2

Environment

This issue was observed on various platforms:

Desktop PC
CPU: Intel Xeon
OS: Ubuntu 20

Laptop PC
CPU: Intel i5
OS: Ubuntu 20

Nvidia Jetson Nano (dev board)
CPU: ARM Cortex-A57
OS: Linux4Tegra

Anything else?

I made a fix for this. I would be happy to create a PR.

Hi @peds
thank you for reporting this issue.
I think that there are two possible solutions to fix it:

  1. double buffering (or triple...)
  2. returning a copied buffer

The first solution introduces a possible frame latency. The UVC driver already implements a multi-buffer mechanism (we use double buffering), but it can be not enough:

memcpy(mLastFrame.data, (unsigned char*) mBuffers[mCurrentIndex].start, mBuffers[mCurrentIndex].length);

The second solution requires more memory and can be slow for hosts with slow memory.

Any thoughts?

@peds can you test if this branch solves the problem:
https://github.com/stereolabs/zed-open-capture/tree/fix_race_cond

peds commented

Hi @peds thank you for reporting this issue. I think that there are two possible solutions to fix it:

1. double buffering (or triple...)

2. returning a copied buffer

The first solution introduces a possible frame latency. The UVC driver already implements a multi-buffer mechanism (we use double buffering), but it can be not enough:

memcpy(mLastFrame.data, (unsigned char*) mBuffers[mCurrentIndex].start, mBuffers[mCurrentIndex].length);

The second solution requires more memory and can be slow for hosts with slow memory.

Any thoughts?

Hi @Myzhar,

Thank you for getting back to me so fast!
I did implement a solution that does work by making a copy of the last frame that is only updated when getLastFrame is called. I think this is similar to your second proposed solution?

Did you implement double buffering in https://github.com/stereolabs/zed-open-capture/tree/fix_race_cond?
I will test this and get back to you.

Yes, it takes advantage of the double buffering already available with UVC

peds commented

The issue persists even with double buffering, but it seems to be more subtle.
zed_oc_glitch_example_left_frame_double_buffer

I think the most reliable solution would be to return a copy of mLastFrame. The following worked for me.

static void copyFrame(Frame& dstFrame, Frame& srcFrame){
    if (srcFrame.data != nullptr) {
        int bufSize = srcFrame.width * srcFrame.height * srcFrame.channels;
        if(dstFrame.data == nullptr ){
            dstFrame.data = new unsigned char[bufSize];
        }
        // memcpy( dstFrame.data, srcFrame.data, bufSize);
        std::copy( srcFrame.data, srcFrame.data + bufSize, dstFrame.data);
    }
    dstFrame.frame_id = srcFrame.frame_id;
    dstFrame.timestamp = srcFrame.timestamp;
    dstFrame.width = srcFrame.width;
    dstFrame.height = srcFrame.height;
    dstFrame.channels = srcFrame.channels;
}

const Frame& VideoCapture::getLastFrame( uint64_t timeout_msec )

{
    // ----> Wait for a new frame
    uint64_t time_count = timeout_msec*10;
    while( !mNewFrame )
    {
        if(time_count==0)
        {
            // const std::lock_guard<std::mutex> lock(mBufMutex);
            // copyFrame(mLastFrameCopy, mLastFrame);
            return mLastFrameCopy;
        }
        time_count--;
        usleep(100);
    }
    // <---- Wait for a new frame

    // Get the frame mutex
    const std::lock_guard<std::mutex> lock(mBufMutex);
    mNewFrame = false;
    copyFrame(mLastFrameCopy, mLastFrame);
    return mLastFrameCopy;
}

Declaration of Frame mLastFrameCopy is also added to

Frame mLastFrame; //!< Last grabbed frame

And

    if(mLastFrameCopy.data)
    {
        delete [] mLastFrameCopy.data;
        mLastFrameCopy.data = nullptr;
    }

is added to //deinint in

if(mLastFrame.data)

Edit:
I commented out

const std::lock_guard<std::mutex> lock(mBufMutex);
copyFrame(mLastFrameCopy, mLastFrame);

in getLastFrame before the timeout return, because the last frame has not changed.

Yes, a copy is always the best solution in similar cases... I will test it further

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment otherwise it will be automatically closed in 5 days