OGRECave/ogre-next

Investigate Vulkan commit b85e1a64bbb9d539042a08924990235aaa5e825f

Closed this issue · 21 comments

Relevant link.

I asked @eugenegff the following:

I've been avoiding it due to its difficulty, but I am reviewing this change and trying to merge the disjoint branches.

The problem: once VulkanQueue::commitAndNextCommandBuffer( NewFrameIdx ) is called, mDynamicBufferCurrentFrame MUST advance. The new code breaks this invariant.

Not advancing would most likely lead to unnecessary full stalls at best, and at worst to race conditions with the GPU (I can't yet see how it would turn into race conditions, but full stalls are definitely possible).

The commit log says "new command buffers could happen without frame advancing(...)". Um no.

If new cmd buffers are wanted without advancing the frame, then commitAndNextCommandBuffer( SubmissionType::FlushOnly ) must be called.
That's the reason FlushOnly exists.
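
To make the contract concrete, here's a minimal sketch (simplified, not verbatim OgreNext code; queue stands for a VulkanQueue instance):

// Submit pending work mid-frame WITHOUT advancing mDynamicBufferCurrentFrame:
queue->commitAndNextCommandBuffer( SubmissionType::FlushOnly );

// Submit and advance to the next frame index; the per-frame VkFence is
// appended here, so mDynamicBufferCurrentFrame MUST advance:
queue->commitAndNextCommandBuffer( SubmissionType::NewFrameIdx );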

But it does sound like, whatever caused the bug this commit is trying to solve, the proper solution is likely different.

Thus I would loooove to have a repro to play with.

(relevant message from Skype, 14 May 2023)

The assert was triggered by a somewhat long chain of operations without frame rendering. Eventually some part of Ogre called commitAndNextCommandBuffer standalone, not as part of VaoManager::update(), and with the old code mDynamicBufferCurrentFrame and mFrameCount were incremented in _notifyNewCommandBuffer one extra time, causing the later assert.
The new code is very similar in structure to the Metal render system's code, incrementing mDynamicBufferCurrentFrame and mFrameCount at the end of XxxVaoManager::update().

I think that extra call to commitAndNextCommandBuffer came from VulkanAsyncTextureTicket::downloadFromGpu.


Thanks for the info.

I SEE IT NOW!!!

I theorize at least two ways to repro:

User error

  1. Perform some rendering operations without calling RenderSystem::_update.
  2. Call RenderSystem::_endFrameOnce.
  3. Repeat.

Calling RenderSystem::_endFrameOnce twice without a RenderSystem::_update in between will result in this error.
It may go unnoticed if there are no queued operations, because VulkanQueue::commitAndNextCommandBuffer checks:

if( mPendingCmds.empty() )
    return;

Which means the 2nd RenderSystem::_endFrameOnce will do nothing. But if there are queued operations, it will trigger the problem.

Thus basically the code would be:

int iteration = 5;
while( iteration-- )
{
  renderSomething();
  // !!! missing call to mRenderSystem->_update() !!!
  mRenderSystem->_endFrameOnce();
}

Not user error

  1. OgreNext will call Root::_updateAllRenderTargets.
  2. As normal operation of updating the Compositor, CompositorManager2::_updateImplementation will end up calling VaoManager::_update (via RenderSystem::_update).
  3. Root will now call _fireFrameRenderingQueued.
  4. Inside the listener, the user calls RenderSystem::_update followed by RenderSystem::_endFrameOnce.
    • For example ParallaxCorrectedCubemap::updateExpensiveCollectedDirtyProbes and IrradianceFieldRaster::renderProbes can do this.
    • The user could do this by hand too.
  5. OgreNext will now call CompositorManager2::_swapAllFinalTargets normally.
    • This will call RenderSystem::_endFrameOnce() which leads to VulkanQueue::commitAndNextCommandBuffer( SubmissionType::EndFrameAndSwap )

The problem is that if we flatten the operations they become like this:

mVaoManager->_update(); // Called by CompositorManager2::_updateImplementation
mVaoManager->_update(); // Called by user inside Root::_fireFrameRenderingQueued
mRenderSystem->_endFrameOnce(); // Called by user inside Root::_fireFrameRenderingQueued
mRenderSystem->_endFrameOnce(); // Called by CompositorManager2::_swapAllFinalTargets

Thus we end up with two mRenderSystem->_endFrameOnce in a row without an mVaoManager->_update.

I wonder if this causes subtle problems in other RenderSystems (Metal is the most sensitive after Vulkan, D3D11 and GL likely won't care).

One simple Vulkan-side solution is to call VaoManager::_update (or at least the relevant pieces of it) upon receiving VulkanVaoManager::_notifyNewCommandBuffer, so the Vulkan RenderSystem can handle the edge case of _endFrameOnce getting called twice by accident.
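
A minimal sketch of that idea (assuming a hypothetical mFrameWasUpdated flag; this is not actual OgreNext code):

void VulkanVaoManager::_notifyNewCommandBuffer()
{
    // Hypothetical guard: if _update() never ran since the last frame advance,
    // run the per-frame bookkeeping now, so a second _endFrameOnce() in a row
    // can't corrupt mDynamicBufferCurrentFrame / mFrameCount.
    if( !mFrameWasUpdated )
        _update();  // or just the relevant pieces of it

    mDynamicBufferCurrentFrame =
        ( mDynamicBufferCurrentFrame + 1u ) % mDynamicBufferMultiplier;
    ++mFrameCount;
    mFrameWasUpdated = false;
}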

Another solution is to alter / revamp the listeners so that this condition doesn't happen, and to detect, for all RenderSystems, when _endFrameOnce gets called twice in a row without an update before it.
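
Detection could look roughly like this (a sketch with an assumed mUpdatePerformed flag; the warning string is the one quoted later in this thread):

void RenderSystem::_endFrameOnce()
{
    if( !mUpdatePerformed )  // assumed flag, set to true by RenderSystem::_update()
    {
        Ogre::LogManager::getSingleton().logMessage(
            "WARNING: Calling RenderSystem::_endFrameOnce() twice in a row "
            "without calling RenderSystem::_update. This can lead to strange "
            "results." );
    }
    mUpdatePerformed = false;
    // ... actual end-of-frame work ...
}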

OK, based on talks with Eugene, they tried to solve it by transplanting Metal's RS logic into Vulkan.

I'll explain why that doesn't work:

We use Metal 1 without heaps, which means Metal still does reference tracking of the command buffers (thanks to ObjC), so they get destroyed when they're no longer referenced.
That's not the case if we use Metal 2 with heaps and/or commandBufferWithUnretainedReferences (which we don't). In other words, Metal 1 is some sort of hybrid, in-between D3D11 and Vulkan.

But Vulkan feels more like writing a GPU driver. When we submit a command buffer, we must wait ourselves until the GPU is done using it to destroy it (which will happen in VaoManager::mDynamicBufferMultiplier frames).
In Vulkan we don't actually destroy the command buffer, we just recycle its memory for a new frame.

That's why VulkanQueue::commitAndNextCommandBuffer is forced to advance the frame: When we call it with SubmissionType::NewFrameIdx or greater, we append the VkFence (we have one per frame in range [0; mDynamicBufferMultiplier)) to the command buffer.

Thus we know that once that VkFence is signaled, the frame is over (and thus, we can do anything like recycling the oldest command buffer).
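
In raw Vulkan terms, the pattern is roughly the following (a simplified sketch, not OgreNext code; device and the fences are assumed to be created elsewhere, the fences with VK_FENCE_CREATE_SIGNALED_BIT so the first waits don't deadlock):

#include <vulkan/vulkan.h>

VkDevice device;        // assumed initialized elsewhere
VkFence fences[3];      // one per frame; 3 == mDynamicBufferMultiplier
uint32_t frameIdx = 0u; // equivalent of mDynamicBufferCurrentFrame

void submitAndAdvance( VkQueue queue, VkCommandBuffer cmdBuf )
{
    VkSubmitInfo submitInfo = {};
    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.commandBufferCount = 1u;
    submitInfo.pCommandBuffers = &cmdBuf;
    // The per-frame fence rides along with this submission...
    vkQueueSubmit( queue, 1u, &submitInfo, fences[frameIdx] );

    // ...so the frame index must advance; otherwise the very next wait on
    // fences[frameIdx] would block on the work we just submitted (full stall).
    frameIdx = ( frameIdx + 1u ) % 3u;

    // Recycle the oldest frame's command buffer only after its fence signals.
    vkWaitForFences( device, 1u, &fences[frameIdx], VK_TRUE, UINT64_MAX );
    vkResetFences( device, 1u, &fences[frameIdx] );
}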

By forcing _notifyNewCommandBuffer to not advance the frame (i.e. do what Metal does) we end up with a full stall instead:

relevantFence = mFences[vaoMgr->mDynamicBufferCurrentFrame];
submitCommandBuffer( relevantFence ); // Not actual code from OgreNext (simplifying for sake of explanation)

if( submissionType >= SubmissionType::NewFrameIdx )
{
    mPerFrameData[dynBufferFrame].mCurrentCmdIdx = 0u;
    mVaoManager->_notifyNewCommandBuffer();
}

newCommandBuffer();

But inside newCommandBuffer() we wait for the oldest frame to finish:

void VulkanQueue::newCommandBuffer()
{
    const size_t currFrame = mVaoManager->waitForTailFrameToFinish();
    mCurrentCmdBuffer = getCmdBuffer( currFrame );
    // ...
}

And we can simplify waitForTailFrameToFinish to this:

size_t VulkanVaoManager::waitForTailFrameToFinish()
{
    waitOn( mFences[mDynamicBufferCurrentFrame] );
    return mDynamicBufferCurrentFrame;
}

In other words, we just reset mFences[mDynamicBufferCurrentFrame], send it to the GPU, and immediately afterwards wait on that very same fence. Which means a full stall.

The code works, but only because the logic bug no longer happens (mFrameCount no longer advances); the price is that the GPU is now forced to run synchronously with the CPU.

On Metal we don't have this issue because the fencing logic is slightly different, in large part because we don't have to worry as much about the lifetimes of command buffers.

@eugenegff : I reverted your change and attempted to fix it differently.

These changes are living in the EndFrameOnceFailure branch. Could your team test it when they've got the time? Thanks!

The fix is a little naive because it tries to call VulkanVaoManager::_update when the error is detected; which may lead to unintended side effects (i.e. VaoManager::_update calls Root::_renderingFrameEnded).

I will try to reproduce the original issue and check whether it is resolved by the EndFrameOnceFailure branch.

So, with only the first (workaround-reverting) commit of the EndFrameOnceFailure branch, the bug is still reproducible; the callstack is slightly different from before.

Screenshot 2024-03-02 at 12 03 31

Unfortunately, the second commit did not fix the assert.

Screenshot 2024-03-02 at 12 20 47

BTW, here we try to render a single screenshot (a preview of our library item) that will be cached in the file system. That means we do not want artifacts from incomplete texture streaming, so we use code like this:

// Wait for any in-flight texture streaming before rendering the preview.
if( !textureGpuManager->isDoneStreaming() )
    textureGpuManager->waitForStreamingCompletion();

auto loadRequestsCounter = textureGpuManager->getLoadRequestsCounter();
Ogre::Root::getSingleton().renderOneFrame();

// If rendering triggered new streaming requests, wait and render once more.
if( loadRequestsCounter != textureGpuManager->getLoadRequestsCounter() )
{
    textureGpuManager->waitForStreamingCompletion();
    Ogre::Root::getSingleton().renderOneFrame();
}

The idea was to waitForStreamingCompletion and render with fully loaded textures; if some new texture streaming requests were made during frame rendering, repeat the waiting and rendering.

Note that all item previews probably use unique sets of textures, so waitForStreamingCompletion() is essential.
The crash occurred on the ~30th screenshot, when mDelayedDestroyBuffers became non-empty(?). No other rendering is usually performed by the app between those screenshot generations, just repeated waitForStreamingCompletion + renderOneFrame.

I tried removing all waitForStreamingCompletion() calls in the code above. Almost all previews are now generated without textures, but the crashes are gone.

AHHHH!!!!

I just realized that the following can happen:

  1. stall() (which occurs naturally when OgreNext runs out of staging memory)
  2. Add more buffers to mDelayedDestroyBuffers
  3. _notifyNewCommandBuffer()
    • mFrameCount will advance but _update() won't be called because I'm checking only for if( mFenceFlushed == FenceFlushed )

I thought that stalling was safe because _notifyDeviceStalled would destroy all delayed buffers.
However, I did not account for the fact that more delayed buffers can be added in between (i.e. step 2).

I'll see how this can be addressed.

Grrrr I'm testing that theory but I can't reproduce it.

After trying it, it appears to be safe because the delayed buffer ends up with the right and desired frameCount value.

Are you able to upload the Ogre.log file of the session that causes the crash? Perhaps that may shed more light as to what's going on.

Unfortunately the log is absolutely innocent, with the last record ~5 s before the assertion.

LiveHomeOgre2.log

Logcat shows the assert more than a second after the noisy but innocent whining from MButtonToggleGroup:
2024-03-02 18:34:06.712 17852-17852 MButtonToggleGroup com.belightsoft.livehome3d.pro W Child order wasn't updated

Logcat.log

Screenshot 2024-03-02 at 18 49 12

In the assert:

assert( mFrameCount - itor->frame == mDynamicBufferMultiplier &&
        "Delayed buffer must be destroyed in the last buffered frame!" );

mFrameCount == 617
itor->frame == 611
mDynamicBufferMultiplier == 3

Well that was indeed interesting:

The new code is supposed to log:

WARNING: Calling RenderSystem::_endFrameOnce() twice in a row without calling RenderSystem::_update. This can lead to strange results.

But it seems that never happened, which suggests the new code I wrote was not triggered at all.

mFrameCount == 617
itor->frame == 611
mDynamicBufferMultiplier == 3

This is also very useful: 617 - 611 is 6, which is mDynamicBufferMultiplier * 2.
_notifyDeviceStalled advances mFrameCount by mDynamicBufferMultiplier:

mFrameCount += mDynamicBufferMultiplier;

So I'm more inclined to suspect there are two stalls involved in triggering the bug. But I'm a bit surprised I don't see the following message:

Texture memory budget exceeded. Stalling GPU.
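
For reference, the arithmetic behind the two-stall suspicion (an assumed sequence, not confirmed by the logs):

// Buffer queued for delayed destruction at mFrameCount == 611.
// Stall #1: _notifyDeviceStalled -> mFrameCount += 3 -> 614
// Stall #2: _notifyDeviceStalled -> mFrameCount += 3 -> 617
// Assert fires: mFrameCount - itor->frame == 617 - 611 == 6, but 3 was expected.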

BINGO! I managed to repro with this simple code:

void EndFrameOnceFailureGameState::simulateSpuriousBufferDisposal()
{
    Ogre::RenderSystem *renderSystem = mGraphicsSystem->getRoot()->getRenderSystem();
    Ogre::VaoManager *vaoManager = renderSystem->getVaoManager();

    // Create, map, and immediately destroy a dynamic buffer so it lands in
    // mDelayedDestroyBuffers, waiting mDynamicBufferMultiplier frames to be freed.
    Ogre::IndexBufferPacked *dummyBuffer = vaoManager->createIndexBuffer(
        Ogre::IT_16BIT, 16000, Ogre::BT_DYNAMIC_PERSISTENT, 0, false );
    dummyBuffer->map( 0, dummyBuffer->getNumElements() );
    dummyBuffer->unmap( Ogre::UO_UNMAP_ALL );
    vaoManager->destroyIndexBuffer( dummyBuffer );
    dummyBuffer = 0;
}

const size_t iterations = size_t( std::ceil( Ogre::Math::RangeRandom( 1.0f, 50.0f ) ) );
for( size_t i = 0u; i < iterations; ++i )
{
    for( size_t j = 0u; j < iterations; ++j )  // 'j' to avoid shadowing the outer 'i'
        simulateSpuriousBufferDisposal();
    vaoManager->_update();
}

In the end your Ogre.log was very useful! The fact that there were no errors or warnings at all was in itself useful information.

Well!!! After a nap and some printf debugging, the solution was ridiculously simple. Kind of obvious once identified.

It should be fixed now. Let me know if you find any issues.

I'm now writing a test to check for regressions on this; then I'll run the whole set of unit tests to ensure the new snippet didn't introduce any regressions.

All tests passed!

Hopefully this is it and fixes the problem for you.

Just one heads up: I'm slightly concerned that your team is using BT_DEFAULT_SHARED on mobile for better performance. My concern is simply that your team's fix to the bug in this ticket essentially synchronized the CPU & GPU to run serially instead of in parallel.

That would hide any or most potentially-wrong access (i.e. race conditions) of BT_DEFAULT_SHARED buffers. Be on the lookout for that.

Hi Matias!
The main idea of using BT_DEFAULT_SHARED on UMA platforms is to save RAM. Indeed, our app is a kind of 3D modeller for heavy scenes, and we have some tasks where we have to operate on geometry data on the CPU. So we use BT_DEFAULT_SHARED for meshes only. Unfortunately, mobile platforms do not have virtual memory; we can use only up to ~60% of physical RAM. We can't waste memory on geometry shadow buffers. That's why it's important to be thrifty and use shared memory for geometry data.
In most cases we just read geometry data. Nevertheless, sometimes we have to change it, and we do that serially.

Yes! I know you have good reasons to use it. It is a shame we couldn't sit down and think hard about how to make it work in parallel.

Remember that you can always force serialization/stalls by calling vaoManager->waitForSpecificFrameToFinish( vaoManager->getFrameCount() ).
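
For example (a sketch; the surrounding setup is assumed, and modifySharedMeshData is a hypothetical app-side function):

Ogre::VaoManager *vaoManager = root->getRenderSystem()->getVaoManager();

// Block until every frame submitted so far has finished on the GPU...
vaoManager->waitForSpecificFrameToFinish( vaoManager->getFrameCount() );

// ...after which the CPU can safely read or modify BT_DEFAULT_SHARED geometry.
modifySharedMeshData();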

Unrelated: right now I'm trying to merge all the branches (master will become 3.0; master needs to be merged into warm_up, and warm_up will become the new master; I also want to merge warm_up into ParticleFX2, and ParticleFX2 may become the new master).

Your team will benefit a lot from warm_up branch in performance; because it implements the following:

  1. Vulkan shader cache.
  2. Parallel shader & PSO compilation.
  3. Parallel texture loading (the current texture streaming only loads one file at a time in another thread).

These make a huge performance difference, especially on Android and especially on cheap phones, which are low-spec but come with 4, 6, or even 8 cores.

Yes, these are really cool features!
We've already had some problems due to shader compilation time on Android - even a device-lost error. I hope we can really benefit from parallel shader & PSO compilation before the public beta. We are going to get our Android beta out soon.

Thank you, Matias!

@darksylinc No more asserts with latest fixes. Thank you!

Awesome!

I merged those fixes into master. The branch you were using is now branch v3-0.
And the new master contains what will become 4.0.

A short summary of what happened:

  • master -> merged into warm_up
  • master -> renamed to v3-0
  • warm_up -> renamed to master

I still have to figure out what to do with the ParticleFX2 branch, which is ahead of master, since the latest up-to-date documentation of master is currently living in ParticleFX2 (including the multithreaded shader compilation changes).