[Roadmap] Meta Schedule Rules

Question

Opened this issue 3 years ago · 0 comments

junrushao commented 3 years ago

Schedule Rule

(Junru) Auto-Inline #503
(Junru) Multi-Level-Tiling #503
(Bohan) Parallel-Vectorize-Unroll #516
(Bohan) Random-Compute-Location #516
(Siyuan) Add-RFactor #551
(Ruihang) Cross-Thread-Reduction #556

Auto Tensor Core Rule

Multi-Level-Tiling (with Tensor Core) #524
Rewrite-Reduction-Block (with Tensor Core) #524
Rewrite-Cooperative-Fetch (with Tensor Core) #524
Rewrite-Tensor-Core #524

Postproc Rule

(Bohan, Hongyi) Rewrite-Parallel-Vectorize-Unroll #499
(Junru) Rewrite-Reduction-Block #509
(Bohan) Disallow-Dynamic-Loops #499
(Junru) Rewrite-Cooperative-Fetch #509
(Junru) Rewrite-Unbound-Blocks #509
(Bohan) Verify-GPU-Code #499

Mutator Rule

(Junru) Mutate-Tile-Size #534
(Junru) Mutate-Unroll #534
(Junru) Mutate-Parallel #534
(Bohan) Mutate-Compute-Location #548

Appendix

Legacy CPU

SearchRule

ms.rule.inline_pure_spatial(strict_mode=True)
ms.rule.add_rfactor(
  max_jobs_per_core=16,
  max_innermost_factor=64,
)
ms.rule.multi_level_tiling(
  structure="SSRSRS",
  must_cache_read=False,
  cache_read_scope="global",
  can_cache_write=True,
  must_cache_write=False,
  cache_write_scope="global",
  consumer_inline_strict=True,
  fusion_levels=[1, 2],
)
ms.rule.parallelize_vectorize_unroll(
  max_jobs_per_core=16,
  max_vectorize_extent=32,
  unroll_max_steps=[0, 16, 64, 512],
  unroll_explicit=True,
)
ms.rule.random_compute_location()
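
For readers unfamiliar with the `structure` argument, here is a minimal sketch of how a string such as "SSRSRS" can be read. It assumes that each character is one tile level, with `S` covering the spatial loops and `R` covering the reduction loops, so that "SSRSRS" splits every spatial loop into four tiles and every reduction loop into two. The helper `tile_order` and the loop names `i`, `j`, `k` are hypothetical and are not part of the `ms` API.

```python
def tile_order(structure, spatial_loops, reduction_loops):
    """Return the loop-tile order implied by a tiling structure string.

    Assumption: 'S' is a tile level over all spatial loops and 'R' is a
    tile level over all reduction loops. Tiles are named <loop><level>.
    """
    counters = {"S": 0, "R": 0}
    loops = {"S": spatial_loops, "R": reduction_loops}
    order = []
    for kind in structure:
        if kind not in loops:
            raise ValueError(f"unexpected tile kind: {kind!r}")
        level = counters[kind]
        # Every loop of this kind contributes one tile at the current level.
        order.extend(f"{name}{level}" for name in loops[kind])
        counters[kind] += 1
    return order


# Matmul-like loop nest: i and j are spatial, k is a reduction.
cpu_order = tile_order("SSRSRS", ["i", "j"], ["k"])
print(cpu_order)
# ['i0', 'j0', 'i1', 'j1', 'k0', 'i2', 'j2', 'k1', 'i3', 'j3']
```

Under the same reading, the CUDA structure "SSSRRSRS" yields five spatial levels and three reduction levels.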

Postproc

ms.postproc.rewrite_parallel_vectorize_unroll()
ms.postproc.rewrite_reduction_block()
ms.postproc.disallow_dynamic_loops()

Mutator

ms.mutator.mutate_tile_size(): 0.90
ms.mutator.mutate_compute_location(): 0.05
ms.mutator.mutate_auto_unroll(): 0.03
ms.mutator.mutate_parallel(max_jobs_per_core=16): 0.02
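
The numbers after each mutator read as selection probabilities (they sum to 1.0). As a rough illustration, assuming the search picks one mutator per mutation step according to these weights, the choice could be sketched as below. The dictionary `MUTATOR_PROBS` and the helper `sample_mutator` are hypothetical names used only for this example; the mutators are represented by plain strings rather than real `ms.mutator` objects.

```python
import random

# Probabilities copied from the Legacy CPU mutator list above.
MUTATOR_PROBS = {
    "mutate_tile_size": 0.90,
    "mutate_compute_location": 0.05,
    "mutate_auto_unroll": 0.03,
    "mutate_parallel": 0.02,
}


def sample_mutator(rng):
    """Pick one mutator name, weighted by its configured probability."""
    names = list(MUTATOR_PROBS)
    weights = list(MUTATOR_PROBS.values())
    return rng.choices(names, weights=weights, k=1)[0]


# Draw many samples to see how often each mutator would be applied.
rng = random.Random(0)
counts = {name: 0 for name in MUTATOR_PROBS}
for _ in range(10_000):
    counts[sample_mutator(rng)] += 1
print(counts)
```

With these weights, tile-size mutation dominates the search and the other mutators are applied only occasionally.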

Legacy CUDA

SearchRule

ms.rule.multi_level_tiling(
  structure="SSSRRSRS",
  must_cache_read=True,
  cache_read_scope="shared",
  can_cache_write=True,
  must_cache_write=True,
  cache_write_scope="local",
  consumer_inline_strict=False,
  fusion_levels=[3],
  vector_load_max_len=4,
  tile_binds=["blockIdx.x", "vthread", "threadIdx.x"],
)
ms.rule.inline_pure_spatial(strict_mode=False)
ms.rule.parallelize_vectorize_unroll(
  max_jobs_per_core=-1,  # disable parallelize
  max_vectorize_extent=-1,  # disable vectorize
  unroll_max_steps=[0, 16, 64, 512, 1024],
  unroll_explicit=True,
)
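
The `tile_binds` argument can be read alongside `structure`. The sketch below assumes that the i-th entry of `tile_binds` is bound to the i-th spatial tile level, so that with "SSSRRSRS" the three outermost `S` levels map to `blockIdx.x`, `vthread`, and `threadIdx.x`, and the remaining levels stay unbound. This pairing is an assumption made for illustration, and `bind_spatial_tiles` is a hypothetical helper, not an `ms` function.

```python
def bind_spatial_tiles(structure, tile_binds):
    """Map each spatial tile level to a thread axis, or None if unbound.

    Assumption: tile_binds[i] applies to the i-th 'S' in the structure.
    """
    bindings = {}
    spatial_level = 0
    for kind in structure:
        if kind != "S":
            continue  # reduction levels are never bound in this sketch
        if spatial_level < len(tile_binds):
            bindings[f"S{spatial_level}"] = tile_binds[spatial_level]
        else:
            bindings[f"S{spatial_level}"] = None
        spatial_level += 1
    return bindings


cuda_bindings = bind_spatial_tiles(
    "SSSRRSRS", ["blockIdx.x", "vthread", "threadIdx.x"]
)
print(cuda_bindings)
# {'S0': 'blockIdx.x', 'S1': 'vthread', 'S2': 'threadIdx.x', 'S3': None, 'S4': None}
```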

Postproc

ms.postproc.rewrite_cooperative_fetch()
ms.postproc.rewrite_unbound_blocks()
ms.postproc.rewrite_parallel_vectorize_unroll()
ms.postproc.rewrite_reduction_block()
ms.postproc.disallow_dynamic_loops()
ms.postproc.verify_gpu_code()

Mutator

ms.mutator.mutate_tile_size(): 0.90
ms.mutator.mutate_auto_unroll(): 0.10