tlc-pack/tvm-tensorir

[Roadmap] Meta Schedule Rules


Schedule Rule

  • (Junru) Auto-Inline #503
  • (Junru) Multi-Level-Tiling #503
  • (Bohan) Parallel-Vectorize-Unroll #516
  • (Bohan) Random-Compute-Location #516
  • (Siyuan) Add-RFactor #551
  • (Ruihang) Cross-Thread-Reduction #556

Auto Tensor Core Rule

  • Multi-Level-Tiling (with Tensor Core) #524
  • Rewrite-Reduction-Block (with Tensor Core) #524
  • Rewrite-Cooperative-Fetch (with Tensor Core) #524
  • Rewrite-Tensor-Core #524

Postproc Rule

  • (Bohan, Hongyi) Rewrite-Parallel-Vectorize-Unroll #499
  • (Junru) Rewrite-Reduction-Block #509
  • (Bohan) Disallow-Dynamic-Loops #499
  • (Junru) Rewrite-Cooperative-Fetch #509
  • (Junru) Rewrite-Unbound-Blocks #509
  • (Bohan) Verify-GPU-Code #499

Mutator Rule

  • (Junru) Mutate-Tile-Size #534
  • (Junru) Mutate-Unroll #534
  • (Junru) Mutate-Parallel #534
  • (Bohan) Mutate-Compute-Location #548

Appendix

Legacy CPU

SearchRule

ms.rule.inline_pure_spatial(strict_mode=True)
ms.rule.add_rfactor(
  max_jobs_per_core=16,
  max_innermost_factor=64,
)
ms.rule.multi_level_tiling(
  structure="SSRSRS",
  must_cache_read=False,
  cache_read_scope="global",
  can_cache_write=True,
  must_cache_write=False,
  cache_write_scope="global",
  consumer_inline_strict=True,
  fusion_levels=[1, 2],
)
ms.rule.parallelize_vectorize_unroll(
  max_jobs_per_core=16,
  max_vectorize_extent=32,
  unroll_max_steps=[0, 16, 64, 512],
  unroll_explicit=True,
)
ms.rule.random_compute_location()
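
As a reading aid for the `structure` argument of `ms.rule.multi_level_tiling`: each `S` marks a tile level for spatial loops and each `R` a tile level for reduction loops. The sketch below counts how many pieces each loop kind would be split into. The helper name `tiles_per_loop_kind` is hypothetical and not part of the `ms` API; it only illustrates how to read the string.

```python
from collections import Counter


def tiles_per_loop_kind(structure: str) -> dict:
    """Count tile levels per loop kind in a tiling structure string.

    Hypothetical helper for illustration only; not part of the ms API.
    'S' = tile level for spatial loops, 'R' = tile level for reduction loops.
    """
    unknown = set(structure) - {"S", "R"}
    if unknown:
        raise ValueError(f"unexpected characters in structure: {sorted(unknown)}")
    counts = Counter(structure)
    return {"spatial": counts["S"], "reduction": counts["R"]}


# Legacy CPU structure: spatial loops split 4 ways, reduction loops 2 ways.
cpu_tiles = tiles_per_loop_kind("SSRSRS")

# Legacy CUDA structure: spatial loops split 5 ways, reduction loops 3 ways.
cuda_tiles = tiles_per_loop_kind("SSSRRSRS")
```

Under this reading, the longer CUDA string adds extra spatial levels, which is consistent with the three entries in `tile_binds` for that configuration.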

Postproc

ms.postproc.rewrite_parallel_vectorize_unroll()
ms.postproc.rewrite_reduction_block()
ms.postproc.disallow_dynamic_loops()

Mutator

ms.mutator.mutate_tile_size(): 0.90
ms.mutator.mutate_compute_location(): 0.05
ms.mutator.mutate_auto_unroll(): 0.03
ms.mutator.mutate_parallel(max_jobs_per_core=16): 0.02
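
The numbers after each mutator sum to 1.0, so they read naturally as selection probabilities. Assuming that interpretation, a minimal sketch of how one mutator could be drawn per mutation step follows. The dictionary keys are plain strings standing in for the mutator objects, and `pick_mutator` is a hypothetical helper, not the search strategy's actual code.

```python
import random

# Assumed interpretation: each value is the probability of picking that mutator.
# Keys are placeholder names standing in for the ms.mutator.* objects.
MUTATOR_PROBS = {
    "mutate_tile_size": 0.90,
    "mutate_compute_location": 0.05,
    "mutate_auto_unroll": 0.03,
    "mutate_parallel": 0.02,
}


def pick_mutator(probs: dict, rng: random.Random) -> str:
    """Draw one mutator name with probability proportional to its weight."""
    names = list(probs)
    weights = [probs[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
draws = [pick_mutator(MUTATOR_PROBS, rng) for _ in range(1000)]
tile_size_share = draws.count("mutate_tile_size") / len(draws)
```

With these weights, most mutation steps change tile sizes, and the other mutators are applied only occasionally.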

Legacy CUDA

SearchRule

ms.rule.multi_level_tiling(
  structure="SSSRRSRS",
  must_cache_read=True,
  cache_read_scope="shared",
  can_cache_write=True,
  must_cache_write=True,
  cache_write_scope="local",
  consumer_inline_strict=False,
  fusion_levels=[3],
  vector_load_max_len=4,
  tile_binds=["blockIdx.x", "vthread", "threadIdx.x"],
)
ms.rule.inline_pure_spatial(strict_mode=False)
ms.rule.parallelize_vectorize_unroll(
  max_jobs_per_core=-1,  # disable parallelize
  max_vectorize_extent=-1,  # disable vectorize
  unroll_max_steps=[0, 16, 64, 512, 1024],
  unroll_explicit=True,
)

Postproc

ms.postproc.rewrite_cooperative_fetch()
ms.postproc.rewrite_unbound_blocks()
ms.postproc.rewrite_parallel_vectorize_unroll()
ms.postproc.rewrite_reduction_block()
ms.postproc.disallow_dynamic_loops()
ms.postproc.verify_gpu_code()

Mutator

ms.mutator.mutate_tile_size(): 0.90
ms.mutator.mutate_auto_unroll(): 0.10