/EveryCulling

This library integrates multiple culling methods into one library.

Primary LanguageC++MIT LicenseMIT

EveryCulling ( ~2022.11 )

This library integrates multiple culling methods into One library.

This System contain ViewFrustumCulling, Masked SW Occlusion Culling, Distance Culling
All culling methods implemented in this library are actually used in the commercial game, game engines(Unreal engine, MS Flight Simulator 2020...)

This project tries to integrate them into one library and make them easy to use.

Core Feature

This library is targeting Maximizing SIMD, Cache hit, Multi Threading.

  1. SIMD : Data is stored for using SIMD Intrinsics ( Object's Datas has SoA layout, check EntityBlock.h ) ( Require AVX2 ( _mm256 ) )
  2. Cache Hit : SoA!!. Data of entities is stored in SoA layout. ( Structure of Arrays, check EntityBlock.h )
  3. Multi Threading : All culling methods of this library is executed by multiple threads. Data of multiple objects is stored in a entity block and the entity block is aligned to cache line size. Then The threads work on each entity block. The structure of entity block prevents data race and cache coherency problem ( false sharing, check EntityBlock.h ), Locking is not required. And It reduces cache miss.

Fully implemented features

Culling Order : Distance Culling ( Cheap ) -> View Frustum Culing -> Masked SW Occlusion Culling ( Expensive )

View Frustum Culling from Frostbite Engine of EA Dice ( 100% )

Source code

Video
Slide Resource
GDC Talk Video
한국어 블로그 글

MultiThreaded View Frustum Culling is 8ms faster than SingleThreaded View Frustum Culling ( in Stress Test )

Feature 1 : Transform Data of Entities is stored linearlly to maximize utilizing SIMD. ( Data oriented Design )

For Maximizing Cache Hitting, Data is allocated adjacently.

Ordinary Way

float objectData[] = { FrustumPlane1-NormalX, FrustumPlane1-NormalY, FrustumPlane1-NormalZ, FrustumPlane1-Distance, FrustumPlane2-NormalX, FrustumPlane2-NormalY, FrustumPlane2-NormalZ, FrustumPlane2-Distance, FrustumPlane3-NormalX, FrustumPlane3-NormalY, FrustumPlane3-NormalZ, FrustumPlane3-Distance, FrustumPlane4-NormalX, FrustumPlane4-NormalY, FrustumPlane4-NormalZ, FrustumPlane4-Distance }

Data oriented Design

float objectData[] = { FrustumPlane1-NormalX, FrustumPlane2-NormalX, FrustumPlane3-NormalX, FrustumPlane4-NormalX, FrustumPlane1-NormalY, FrustumPlane2-NormalY, FrustumPlane3-NormalY, FrustumPlane4-NormalY, FrustumPlane1-NormalZ, FrustumPlane2-NormalZ, FrustumPlane3-NormalZ, FrustumPlane4-NormalZ, FrustumPlane1-Distance, FrustumPlane2-Distance, FrustumPlane3-Distance, FrustumPlane4-Distance }

And Frustun checking use SIMD instruction.

Combine upper two feature!.

Plane1 NormalX    Plane2 NormalX    Plane3 NormalX    Plane4 NormalX     
Plane1 NormalY    Plane2 NormalY    Plane3 NormalY    Plane4 NormalY     
Plane1 NormalZ    Plane2 NormalZ    Plane3 NormalZ    Plane4 NormalZ     
Plane1 Distance   Plane2 Distance   Plane3 Distance   Plane4 Distance     

Ojbect PointX     Ojbect PointX     Ojbect PointX      Ojbect PointX
Ojbect PointY     Ojbect PointY     Ojbect PointY      Ojbect PointY
Ojbect PointZ     Ojbect PointZ     Ojbect PointZ      Ojbect PointZ
      1                 1                 1                  1
      
      
Plane5 NormalX    Plane6 NormalX    Plane5 NormalX    Plane6 NormalX     
Plane5 NormalY    Plane6 NormalY    Plane5 NormalY    Plane6 NormalY     
Plane5 NormalZ    Plane6 NormalZ    Plane5 NormalZ    Plane6 NormalZ     
Plane5 Distance   Plane6 Distance   Plane5 Distance   Plane6 Distance     

Ojbect PointX     Ojbect PointX     Ojbect PointX      Ojbect PointX
Ojbect PointY     Ojbect PointY     Ojbect PointY      Ojbect PointY
Ojbect PointZ     Ojbect PointZ     Ojbect PointZ      Ojbect PointZ
      1                 1                 1                  1

                            |
                            | Multiply each row
                            V
                                    
Plane1 NormalX * Ojbect PointX    Plane2 NormalX * Ojbect PointX    Plane3 NormalX * Ojbect PointX    Plane4 NormalX * Ojbect PointX     
Plane1 NormalY * Ojbect PointY    Plane2 NormalY * Ojbect PointY    Plane3 NormalY * Ojbect PointY    Plane4 NormalY * Ojbect PointY
Plane1 NormalZ * Ojbect PointZ    Plane2 NormalZ * Ojbect PointZ    Plane3 NormalZ * Ojbect PointZ    Plane4 NormalZ * Ojbect PointZ
     Plane1 Distance * 1               Plane2 Distance * 1               Plane3 Distance * 1               Plane4 Distance * 1   
     
Plane5 NormalX * Ojbect PointX    Plane6 NormalX * Ojbect PointX    Plane5 NormalX * Ojbect PointX    Plane6 NormalX * Ojbect PointX     
Plane5 NormalY * Ojbect PointY    Plane6 NormalY * Ojbect PointY    Plane5 NormalY * Ojbect PointY    Plane6 NormalY * Ojbect PointY
Plane5 NormalZ * Ojbect PointZ    Plane6 NormalZ * Ojbect PointZ    Plane5 NormalZ * Ojbect PointZ    Plane6 NormalZ * Ojbect PointZ
     Plane1 Distance * 1               Plane2 Distance * 1               Plane3 Distance * 1               Plane4 Distance * 1   

                            |
                            | Sum all row
                            V
                        
Plane1 Dot Object      Plane2 Dot Object      Plane3 Dot Object      Plane4 Dot Object
Plane5 Dot Object      Plane6 Dot Object      Plane5 Dot Object      Plane6 Dot Object

                            |
                            |   If Dot is larget than 0, Set 1
                            V
   1      0      1      1                                                1      1      1      1                                                                  
   1      1      1      1                                                1      1      1      1
                                                                         
             |                                                 or                  |
             |   To be rendered, All value should be 1                             |   To be rendered, All value should be 1
             V                                                                     V
                                                                         
         Culled!!!!!                                                            Rendered

Feature 2 : Entities is divided to Entity Blocks.

Entity Block 1 : Entity 1 ~ Entity 50 
Entity Block 2 : Entity 51 ~ Entity 100 
Entity Block 3 : Entity 101 ~ Entity 150
      
               |
               |
               V
             
Thread 1 : Check Frustum of Entity Block 1, 4, 7
Thread 2 : Check Frustum of Entity Block 2, 5, 8

Because Each Entity Blocks is seperated, Threads can check whether object is culled without data race.

To minimize waiting time(wait calculating cull finish) , Passing cull job to thread should be placed at foremost of rendering loop.
In My experiment, Waiting time is near to zero.

Masked SW ( CPU ) Occlusion Culling From Intel ( 100% )

Source Code

Stage 1 : Solve Mesh Role Stage ( Choose occluder based on object's screen space aabb area )
Stage 2 : Bin Occluder Triangle Stage ( Dispatch(Bin) triangles to screen tiles based on triangle's screen space vertex data for following rasterizer stage. This is for not using lock when multiple threads rasterize triangles on depth buffer )
Stage 3 : Multithread Rasterize Occluder Triangles ( Threads do job rasterizing each tile's binned triangles, calculate max depth value of tile )
Stage 4 : Multithread Query depth buffer ( Occludee Test ) ( Compare aabb of occludee's min depth value with tile depth buffer. check 52p https://www.ea.com/frostbite/news/culling-the-battlefield-data-oriented-design-in-practice )

Stage 2, 3 is pretty slow. So they are performed alternately. Rasterization ( Stage 3 ) is performed with previous frame's binned triangle. It makes slight error when query depth buffer. But It's acceptable and it makes occlusion culling run faster. It's good trade-off.

Reference paper
개발 일지
Video1, Video2. Performance comparision between ON/OFF
동작 원리 한국어 설명 : "Masked Software Occlusion Culling"는 어떻게 작동하는가?
References : references 1, references 2, references 3

Please read this :
SW Occlusion Culling doesn't make always good performance. It can makes rendering slower than not using it because Rasterizing occluder is expensive computation.
But if your application is gpu bound, SW Occlusion Culling will make your application run faster.

As you know, CPU isn't good at this computation. So this algorithm should be used carefully

This algorithm is used in MS Flight Simulator 2020.

Distance Culling From Unreal Engine ( 100% )

Code directory : https://github.com/SungJJinKang/EveryCulling/tree/main/CullingModule/DistanceCulling

This feature is referenced from Unreal Engine.
You can see How this feature works from here
Objects's visibility is decided based on distance between object and camera. ( Distance is computed using SIMD for performance )
With this feature, You can make tiny object not to be rendered when it is far from camera.
If Object's DesiredMaxDrawDistance is larger than distance between object and camera, The object is culled.

Code Example

You can see how i use this library in my inhouse game engine DoomsEngine

  1. Initialize CullingSystem
EveryCulling::SetThreadCount( Thread Count to do cull job );
EveryCulling::mMaskedSWOcclusionCulling->mSolveMeshRoleStage.mOccluderAABBScreenSpaceMinArea = Screen Space AABB Minimum Area in Masked SW Occlusion Culling ( Used for deciding occluders of entities )
  1. Initialize entity data for culling
culling::EntityBlockViewer entityBlockViewer = EveryCulling::AllocateNewEntity();
entityBlockViewer.SetMeshVertexData // Used in Masked SW Occlusion Culling
(
    Vertex Data,
    Vertice Count,
    MeshIndices Data,
    MeshIndices Count ,
    Vertex Stride
);
entityBlockViewer.SetDesiredMaxDrawDistance(mDesiredMaxDrawDistance); // Used in Distance Culling
  1. Update datas for cull job ( Should be done every frame )
Update GlobalData for cull job

mCullingSystem.SetCameraCount(Camera Count)
mCullingSystem.ResetCullJob();

culling::EveryCulling::GlobalDataForCullJob cullingSettingParameters;
std::memcpy(cullingSettingParameters.mViewProjectionMatrix.data(), Camera ViewProjection Matrix ( Mat4x4, Column major ), sizeof(culling::Mat4x4));
cullingSettingParameters.mFieldOfViewInDegree = Camera FOV;
cullingSettingParameters.mCameraFarPlaneDistance = Camera ClippingPlaneFar;
cullingSettingParameters.mCameraNearPlaneDistance = Camera ClippingPlaneNear;
std::memcpy(cullingSettingParameters.mCameraWorldPosition.data(), Camera World Position Data, sizeof(culling::Vec3));                  
std::memcpy(cullingSettingParameters.mCameraRotation.data(), Camera Rotation Data ( Quaternion ), sizeof(culling::Vec4));                  

EveryCulling::UpdateGlobalDataForCullJob(camera->CameraIndexInCullingSystem, cullingSettingParameters);            
EveryCulling::PreCullJob();  

------------------
Update Entity Data

for(entity : EntityList)
{
    entity.EntityBlockViewer.UpdateEntityData
    (
        Entity World Position Data,
        Entity AABB LowerBound World Position Data,
        Entity AABB UpperBound World Position Data,
        Entity Model Matrix Data ( Mat4x4 )
    );
}
  1. Update Front to Back Sorting Data of entities ( Important for Peforamance of Masked SW Occlusion Culling ) ( Should be updated every frame )
EveryCulling.ResetSortedEntityCount();
int entityOrder = 0;
for(entity : Front to Back Sorted Entity List)
{
    EveryCulling.SetSortedEntityInfo
    (
        CameraIndex,
        entityOrder,
        entity.EntityBlockViewer
    );

    entityOrder++;
}
  1. Pass Culling Job to SubThreads
Your Job System.PassJobToThread(EveryCulling::GetThreadCullJobInLambda(CameraIndex));        
  1. Draw Entity
for(entity : EntityList)
{
  if(entity.EntityBlockViewer.GetIsCulled(CameraIndex) == false)
  {
    entity.Draw();
  }
}

References