[Roadmap] Meta Schedule Rules
junrushao commented
Schedule Rule
- (Junru) Auto-Inline #503
- (Junru) Multi-Level-Tiling #503
- (Bohan) Parallel-Vectorize-Unroll #516
- (Bohan) Random-Compute-Location #516
- (Siyuan) Add-RFactor #551
- (Ruihang) Cross-Thread-Reduction #556
Auto Tensor Core Rule
- Multi-Level-Tiling (with Tensor Core) #524
- Rewrite-Reduction-Block (with Tensor Core) #524
- Rewrite-Cooperative-Fetch (with Tensor Core) #524
- Rewrite-Tensor-Core #524
Postproc Rule
- (Bohan, Hongyi) Rewrite-Parallel-Vectorize-Unroll #499
- (Junru) Rewrite-Reduction-Block #509
- (Bohan) Disallow-Dynamic-Loops #499
- (Junru) Rewrite-Cooperative-Fetch #509
- (Junru) Rewrite-Unbound-Blocks #509
- (Bohan) Verify-GPU-Code #499
Mutator Rule
- (Junru) Mutate-Tile-Size #534
- (Junru) Mutate-Unroll #534
- (Junru) Mutate-Parallel #534
- (Bohan) Mutate-Compute-Location #548
Appendix
Legacy CPU
SearchRule
ms.rule.inline_pure_spatial(strict_mode=True)
ms.rule.add_rfactor(
    max_jobs_per_core=16,
    max_innermost_factor=64,
)
ms.rule.multi_level_tiling(
    structure="SSRSRS",
    must_cache_read=False,
    cache_read_scope="global",
    can_cache_write=True,
    must_cache_write=False,
    cache_write_scope="global",
    consumer_inline_strict=True,
    fusion_levels=[1, 2],
)
ms.rule.parallelize_vectorize_unroll(
    max_jobs_per_core=16,
    max_vectorize_extent=32,
    unroll_max_steps=[0, 16, 64, 512],
    unroll_explicit=True,
)
ms.rule.random_compute_location()
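The `structure` strings passed to `multi_level_tiling` (`"SSRSRS"` in this CPU configuration, `"SSSRRSRS"` in the CUDA one) can be read as one character per tiling level. The sketch below assumes `S` marks a level of spatial loops and `R` a level of reduction loops, which is the usual Meta Schedule convention but is not stated in this issue. It is plain Python and does not call the `ms` API; the helper name `count_tile_levels` is hypothetical.

```python
from collections import Counter


def count_tile_levels(structure: str) -> dict:
    """Count spatial (S) and reduction (R) tile levels in a structure string.

    Assumption: each character is one tiling level, 'S' for spatial loops
    and 'R' for reduction loops.
    """
    unknown = set(structure) - {"S", "R"}
    if unknown:
        raise ValueError(f"unexpected level kinds: {sorted(unknown)}")
    counts = Counter(structure)
    return {"spatial": counts["S"], "reduction": counts["R"]}


# Structure strings copied from the Legacy CPU and Legacy CUDA rules.
cpu_levels = count_tile_levels("SSRSRS")
cuda_levels = count_tile_levels("SSSRRSRS")
```

Under that reading, the CPU rule would split each spatial loop into four tiles and each reduction loop into two, while the CUDA rule would use five spatial and three reduction levels.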
Postproc
ms.postproc.rewrite_parallel_vectorize_unroll()
ms.postproc.rewrite_reduction_block()
ms.postproc.disallow_dynamic_loops()
Mutator
ms.mutator.mutate_tile_size(): 0.90
ms.mutator.mutate_compute_location(): 0.05
ms.mutator.mutate_auto_unroll(): 0.03
ms.mutator.mutate_parallel(max_jobs_per_core=16): 0.02
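The numbers after each mutator appear to be selection probabilities (they sum to 1.0). A minimal sketch of how such a weighted choice could work, assuming one mutator is drawn per mutation step; it uses only the standard library, and `pick_mutator` and `CPU_MUTATOR_WEIGHTS` are hypothetical names rather than part of the `ms` API.

```python
import random

# Weights copied from the Legacy CPU mutator list above.
CPU_MUTATOR_WEIGHTS = {
    "mutate_tile_size": 0.90,
    "mutate_compute_location": 0.05,
    "mutate_auto_unroll": 0.03,
    "mutate_parallel": 0.02,
}


def pick_mutator(weights: dict, rng: random.Random) -> str:
    """Draw one mutator name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Seeded generator so the sketch is reproducible.
rng = random.Random(0)
picks = [pick_mutator(CPU_MUTATOR_WEIGHTS, rng) for _ in range(1000)]
```

With these weights, the large majority of draws land on `mutate_tile_size`, so tile-size changes would dominate the search while the other mutators are applied only occasionally.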
Legacy CUDA
SearchRule
ms.rule.multi_level_tiling(
    structure="SSSRRSRS",
    must_cache_read=True,
    cache_read_scope="shared",
    can_cache_write=True,
    must_cache_write=True,
    cache_write_scope="local",
    consumer_inline_strict=False,
    fusion_levels=[3],
    vector_load_max_len=4,
    tile_binds=["blockIdx.x", "vthread", "threadIdx.x"],
)
ms.rule.inline_pure_spatial(strict_mode=False)
ms.rule.parallelize_vectorize_unroll(
    max_jobs_per_core=-1,  # disable parallelize
    max_vectorize_extent=-1,  # disable vectorize
    unroll_max_steps=[0, 16, 64, 512, 1024],
    unroll_explicit=True,
)
Postproc
ms.postproc.rewrite_cooperative_fetch()
ms.postproc.rewrite_unbound_blocks()
ms.postproc.rewrite_parallel_vectorize_unroll()
ms.postproc.rewrite_reduction_block()
ms.postproc.disallow_dynamic_loops()
ms.postproc.verify_gpu_code()
Mutator
ms.mutator.mutate_tile_size(): 0.90
ms.mutator.mutate_auto_unroll(): 0.10