DSL Mapper is tested on the control_replication
branch of Legion.
The resources for learning Legion's C++ mapping interface is on Legion website and in Legion manual, Chapter 7
Index Task Launch Placement (Sharding and Slicing)
git clone git@github.com:Anjiang-Wei/DSLMapper.git
cp -r DSLMapper/src/dsl_mapper.cc DSLMapper/src/dsl_mapper.h DSLMapper/src/compiler $your_project_source_dir
Register the callback function to the Legion runtime. The callback will invoke the function to create mappers (i.e., replacing the default mapper).
Besides, users can turn on logging wrapper to record the mapping decisions by passing -wrapper
in the command line.
User can try the DSL mapper while having an existing customized C++ mapper, and we allow users to choose whether to choose DSL mapper or the original customized C++ mapper from command line (-dslmapper
). This is function that we change for the circuit mapper to support it.
In this way, users can choose whether to use DSL mapper by passing -dslmapper
in the command line (without recompilation).
The original compilation flow needs to be customized in order to integrate DSL mapper:
- Allow warnings in compilation. Comment out the
-Werror
flags. Here is the change we make for the circuit example - Make sure the right mapper registration function is used. Here is the circuit example's compilation script and corresponding registration function
Note, using Legion's namespace and having user-defined typedef together could sometimes cause compilation problems. In DSL Mapper, we use using namespace Legion; using namespace Legion::Mapping;
If it happens to collide with application code's typedef, users can change their code to be more specific about the namespace and the types, e.g., changing Rect<3>
to Legion::Rect<3,long long>
-mapping mapping_file
: specify the mapping policy file (using the DSL). Updating the policy file does not need recompilation.-wrapper
: turn on mapper logging wrapper, for debugging purposes-tm:disable_backpressure
: disable backpressure-tm:disable_untrack_valid_regions
: disable memory collection-tm:use_semantic_name
: use semantic name to specify regions, explained in region placement; the default option is to use 0-based index-tm:select_source_by_bandwidth
: select sources for copies by bandwidth; the default option is to sort by the size of intersecting area. Refer to the implementation of the functiondefault_policy_select_sources
for details.
Every statement in the program should end with ;
(like C
)
Functions should start with def FuncName(Arg1Type arg1, ...)
and the function body needs to be wrapped with { ... }
, e.g., used in Index Task Placement
Comments should start with #
(like Python
)
Task foo CPU; # for task named "foo", run on CPU
Task * GPU,OMP,CPU; # for any other task, by default try running on GPU first (if a GPU variant exists), then try OpenMP Processor, finally try CPU.
The task named foo
will use CPU
. For other tasks, they will use the default (fallback) strategies.
The wildcard *
is a way to describe fallback policies, after which is a priority-list of processors kinds.
The supported processor kinds are
CPU
:Processor::LOC_PROC
in LegionGPU
:Processor::TOC_PROC
in LegionOMP
: OpenMP,Processor::OMP_PROC
in LegionIO
:Processor::IO_PROC
in LegionPY
: Python,Processor::PY_PROC
in LegionPROC
:Processor::PROC_SET
in Legion
Implementation:
select_task_options
, dsl_default_policy_select_initial_processor
, validate_processor_mapping
There are two modes for Region placement. One is called semantic naming mode, and the other is called index-based mode.
To support semantic naming mode, you need to pass -tm:use_semantic_name
in the command line when using DSL mapper.
Region * * GPU FBMEM; # for any tasks, any regions, if mapped onto GPU, use GPU FrameBuffer Memory as default
Region * * CPU SYSMEM; # for any tasks, any regions, if mapped onto CPU, use CPU System Memory as default
# for $task, $region, if mapped onto $proc_type, specify $mem_type to use
Region * rp_shared GPU ZCMEM;
Region * rp_ghost GPU ZCMEM;
To name the regions, users can attach semantic names (runtime->attach_name
) to regions in Legion. Regent will automatically attach the variable names as the semantic names.
Note: for region names, we need to use the region names that are passed in (instead of the names used in task definition/declaration). The implication is that as long as the task uses the rp_shared
region, no matter what argument name the task uses for declaration, the rp_shared
region will be placed onto zero-copy memory. In the above example, since the task is a wildcard (*
), all tasks using the rp_shared
will be placed onto zero-copy memory.
In Regent, the semantic name binding happens when you invoke certain APIs (e.g., partition
) and bind it to a variable, e.g., rp_shared
and rp_ghost
in the Circuit example. Typically, regions that are across the partition boundaries needs faster communication, and thus may benefit from placing onto ZCMEM.
Users can also choose to use index-based approach for placing regions (without passing -tm:use_semantic_name
in command line).
Region foo 0 GPU ZCMEM; # If task foo is mapped onto GPU, place its first region onto ZeroCopy memory
The index is for task.regions
, starting from 0.
The supported memory kinds are the following:
SYSMEM
(Memory::SYSTEM_MEM
in Legion)FBMEM
(Memory::GPU_FB_MEM
in Legion, FrameBuffer Memory on GPU)ZCMEM
(Memory::Z_COPY_MEM
in Legion, ZeroCopy Memory for GPU)RDMEM
(Memory::REGDMA_MEM
in Legion, RDMA memory)SOCKMEM
(Memory::SOCKET_MEM
in Legion)VIRTUAL
(PhysicalInstance::get_virtual_instance()
in Legion, useful for inner tasks where no memory allocation is needed)
The DSL mapper's implementation will try to get the memory that has the highest bandwidth with the chosen processor and the specified memory kind. This can be important when one GPU has affinity with multiple ZCMEM
, one OMP
has affinity with multiple SOCKMEM
, and our implementation will choose the best memory (i.e., the highest bandwidth) in those cases.
Implementation:
map_task
, get_handle_names
, maybe_append_handle_name
, dsl_default_policy_select_target_memory
, query_best_memory_for_proc
# Task, Region, Processor Kind, List of Constraints
Layout * * * SOA C_order; # Other choices: AOS F_order Exact Align==128 Compact
SOA
refers toStruct of Array
whileAOS
refers toArray of Struct
. It is translated toOrderingConstraint
in Legion.C_order
andF_order
are two different orderings. It is translated toOrderingConstraint
in Legion.Align
can specify memory alignment, and we support operators including==
,<=
,>=
,!=
. This corresponds toAlignmentConstraint
in Legion.OrderingConstraint
andAlignmentConstraint
are useful because certain APIs from external libraries (e.g., CuBLAS) may require the data to be in a specific layout or the data to be aligned in memory.Compact
: it is translated toLEGION_COMPACT_SPECIALIZE
ofSpecializedConstraint
.Exact
: it corresponds toLEGION_COMPACT_SPECIALIZE
ofSpecializedConstraint
(settingexact=true
). Users can also pass-dm:exact_region
in command line to request exact region allocation (globally).
Implementation:
map_task
, dsl_default_create_custom_instances
, dsl_default_policy_select_constraints
, dsl_default_policy_select_layout_constraints
To turn on this feature, pass -tm:enable_backpressure
in command line.
InstanceLimit task_4 1;
InstanceLimit task_6 1;
On each node, only one task_4
can be mapped at the same time. Only one task_6
can be mapped at the same time. Typically, tasks are mapped ahead of time, and backpressure can avoid mapping too many tasks at the same time ahead of time, which can avoid consuming too much memory. The above example comes from the mapping policy for Cannon.
Implementation:
select_tasks_to_map
, map_task
, map_task_post_function
, report_profiling
, get_mapper_sync_model
The backpressure example in Legion repository has more detailed explanation.
To turn on this feature, pass -tm:untrack_valid_regions
in the command line.
# $task $region
CollectMemory task_4 *;
CollectMemory
can tell Legion runtime to untrack certain read-only regions as valid regions so that those memories can be collected (i.e., adding the region into untracked_valid_regions
in MapTaskOutput
struct). This is useful when users know that certain read-only regions will only be used once for some tasks, and keeping the regions in memory costs too much memory consumption.
Here the region supports both semantic naming (if passing -tm:use_semantic_name
), index naming (starting from 0
), and also wildcard (*
).
Implementation:
map_task
, map_task_post_function
mcpu = Machine(CPU); # 2-dim tuple: nodes * CPU processors, assuming one Legion runtime per node
mgpu = Machine(GPU); # 2-dim tuple: nodes * GPU processors, assuming one Legion runtime per node
def linearblock(Task task) {
# task.ipoint is a n-dim tuple (in this case n=1) indicating index point within the index launch domain
# task.ispace is a n-dim tuple (in this case n=1) indicating launch domain, not used here
return mgpu[task.ipoint[0] / mgpu.size[1], task.ipoint[0] % mgpu.size[1]]; # return one point in a machine model
}
def block_shard_cpu(Task task) {
return mcpu[task.ipoint[0] / mcpu.size[1], 0]; # only one CPU per node
}
# specify $task_name(s) and sharding+slicing function
IndexTaskMap calculate_new_currents,distribute_charge,update_voltages linearblock;
IndexTaskMap init_piece,init_pointers,print_summary block_shard_cpu;
IndexTaskMap
can specify a list of comma-separated task names and the corresponding sharding+slicing functions to use. The above code snippet is from the circuit example.
On a high level, users want to map each N-dim point from the launch domain to a point in the machine model to pick the processor.
Machine($PROC_KIND)
can be used to initialize a 2-dim machine model with respect to a processor kind. The initialized machine model is a 2-dim tuple, and we can access the number of nodes with mcpu.size[0]
and the number of CPU processors in each node with mcpu.size[1]
.
The function signature should always be taking one argument Task
, and return one point in a machine model. Inside the function, users need to specify that for each index point, how to map the index point (task.ipoint[0]
for 1d index launch) to one point in the machine model.
If turning on the logging wrapper, users can see the following logs running on a 2 node, 4 GPU per node machine:
SELECT_SHARDING_FUNCTOR for calculate_new_currents
0 <- (0) (1) (2) (3)
1 <- (4) (5) (6) (7)
The SELECT_SHARDING_FUNCTOR
will report the decision of choosing nodes. The task calculate_new_currents
mapped onto GPU will use linearblock
function, and it corresponds to task.ipoint[0] / mgpu.size[1]
for choosing nodes. Here each node has 4 GPUs, so it is task.ipoint[0] / 4
.
SLICE_TASK for calculate_new_currents
<0>..<0> -> 1d00000000000003
<1>..<1> -> 1d00000000000004
<2>..<2> -> 1d00000000000005
<3>..<3> -> 1d00000000000006
The SLICE_TASK
will report the decision of choosing processors within the node. The task calculate_new_currents
will use task.ipoint[0] % mgpu.size[1]
(i.e., task.ipoint[0] % 4
in this case) so that within each node, each index point will be placed on a different GPU processor.
Note: if running multiple Legion runtimes per node, i.e., multiple ranks per node, then mcpu.size[0]
, mgpu.size[0]
becomes the number of ranks, and mcpu.size[1]
, mgpu.size[1]
becomes the number of processors within that rank.
Implementation:
select_sharding_functor
, shard
, slice_task
, dsl_slice_task
, dsl_decompose_points
, etc.
def block1d(Task task) {
ip = task.ipoint;
# is = task.ispace;
# special case -- one point task per gpu
return mgpu[ip[0] / mgpu.size[1], ip[0] % mgpu.size[1]];
}
def hierarchicalblock1d(Task task) {
ip = task.ipoint;
is = task.ispace;
node_idx = ip[0] * mgpu.size[0] / is[0]; # block over node
blk_size = is[0] / mgpu.size[0];
gpu_ip = ip[0] % blk_size; # index within the node-partitioned block for mapping to GPU
gpu_idx = gpu_ip * mgpu.size[1] / blk_size; # block over gpu
return mgpu[node_idx, gpu_idx];
}
def cyclic1d(Task task) {
ip = task.ipoint;
# cyclic over node, cyclic over gpu
return mgpu[ip[0] % mgpu.size[0], ip[0] / mgpu.size[0] % mgpu.size[1]];
}
The merge
transformation is a method supported on a machine model, and it takes two integers (dim1
, dim2
) as the arguments. The dim1
and dim2
dimensions will be merged into the dim1
dimension. More specifically, if dim1 < dim2
, then the new merged dimension will be dim1
; if dim1 > dim2
, then the new merged dimension will be dim1-1
. The returned machine model model_new
will have one dimension smaller than model_old
.
model_new = model_old.merge(dim1, dim2);
We can guarantee that:
- The merged dimension in the new machine model
merge_dim = dim1 < dim2 ? dim1 : (dim1 - 1)
- We specify that
dim2
is the faster changing dimension during merging. More specifically, suppose themodel_old.size[dim2]
isdim2_volume
. Then the processsor indexed bymodel_new[..., i, ...]
wherei
is at indexmerge_dim
refers to the same processor as indexingmodel_old
withi / dim2_volume
indim1
andi % dim2_volume
indim2
. model_new.size[merge_dim] == model_old.size[dim1] * model_old.size[dim2]
model_new.size
will be aN-1
-dim tuple ifmodel_old
is aN
-dim tuple
A real example of the merge
transformation is below, extracted from solomonik. In general, merge
transformation can return a new machine model with fewer dimensions, which can be useful to index launches with lower dimensions.
m_2d = Machine(GPU); # nodes * processors
m_1d = m_2d.merge(0, 1);
def block_primitive(IPoint x, ISpace y, MSpace z, int dim1, int dim2) {
return x[dim1] * z.size[dim2] / y.size[dim1];
}
def block1d(Task task) {
return m_1d[block_primitive(task.ipoint, task.ispace, m_1d, 0, 0)];
}
IndexTaskMap init_cublas block1d;
The task init_cublas
is a 1D index launch. The machine model m_1d
is obtained by applying the merge
transformation to the m_2d
machine model. Then the it does a blockwise distribution. Here we define a function block_primitive
which is a more general blockwise function that will be useful for later examples. It takes index point IPoint x
, index space ISpace y
, and machine model MSpace z
as input, and also allows users to specify which dimensions of the index launch int dim1
and which dimension of the machine model int dim2
to be blockwise mapped. In this case, because both the machine model m_1d
and the task's index launch (task.ipoint
, task.ispace
) are 1D, dim1
and dim2
are both set to 0
.
The split
transformation is a method supported on a machine model, and it takes two integers (split_dim
, split_factor
) as the arguments. The split_dim
of the original machine model will be splitted into two dimensions split_dim
(with size split_factor
) and split_dim+1
. Therefore, the returned machine model model_new
will have one dimension bigger than model_old
. You may think of split
as a reverse transformation of merge
.
model_new = model_old.split(split_dim, split_factor);
We can guarantee that:
- The processor indexed by
model_new[..., i, j, ...]
(wherei
,j
is at indexsplit_dim
andsplit_dim+1
) is the same asmodel_old[..., i + j * split_factor]
model_new.size[split_dim] * model_new.size[split_dim+1] == model_old.size[split_dim]
model_new.size[split_dim] == split_factor
model_new.size
will be aN+1
-dim tuple ifmodel_old
is aN
-dim tuple
Due to the more expressivity that autosplit provides, split
transformation is not used quite often in practice.
The swap
transformation is a method supported on a machine model, and it takes two integers (dim1
, dim2
) as the arguments. The returned machine model model_new
will swap the two dimensions of dim1
and dim2
.
model_new = model_old.swap(dim1, dim2);
We can guarantee that:
- The processor indexed by
model_new[..., i, ..., j, ...]
(wherei
,j
is at indexdim1
anddim2
) is the same asmodel_old[..., j, ..., i, ...]
model_new.size[dim1] == model_old.size[dim2]
model_new.size[dim2] == model_old.size[dim1]
model_new.size
will be aN
-dim tuple ifmodel_old
is aN
-dim tuple
swap
could be useful when the default ordering of split
transformation is not what you expect. swap
transformation is not used quite often in practice.
The slice
transformation is a method supported on a machine model, and it takes three integers (dim
, low
, high
) as the arguments. The returned machine model model_new
will be a machine model whose dim
dimension only contains a subset of processors that ranges from low
to high
(both ends included), with other dimensions the same as model_old
.
model_new = model_old.slice(dim, low, high);
We can guarantee that:
- The processor indexed by
model_new[..., i, ...]
(wherei
is at indexdim
) is the same asmodel_old[..., i + low, ...]
model_new.size[dim] == high - low + 1
model_new.size[other_dim] == model_old.size[other_dim]
(other_dim
not equal todim
)model_new.size
will be aN
-dim tuple ifmodel_old
is aN
-dim tuple
The reverse
transformation is a method supported on a machine model, and it takes one integers (dim
) as the argument. The returned machine model model_new
will be a machine model whose dim
dimension's index will be reversed.
model_new = model_old.reverse(dim);
We can guarantee that:
- The processor indexed by
model_new[..., i, ...]
(wherei
is at indexdim
) is the same asmodel_old[..., model_old.size[dim] - 1 - i, ...]
model_new.size[any_dim] == model_old.size[any_dim]
the machine model's size does not changemodel_new.size
will be aN
-dim tuple ifmodel_old
is aN
-dim tuple
We have not found a real use case for reverse
so far. It could be useful when the processors are not exactly the same so that the order within the dimension would make a difference.
The balance_split
transformation is a method supported on a machine model, and it takes two integers (split_dim
, num_dim
) as the arguments. The split_dim
of the original machine model will be split into num_dim
dimensions (i.e., splitting the original size of dimensionsplit_dim
into split_dim
, split_dim+1
, split_dim+2
if num_dim=3
). Therefore, the returned machine model model_new
will have num_dim-1
more dimensions than model_old
. The splitting is done in a "balanced" way so that split dimensions' sizes are close to each other (as square/cubic as possible).
model_new = model_old.balance_split(split_dim, num_dim);
Let's see an example to understand how balance_split
works.
Suppose model_old.size[split_dim]=6
, applying model_old.split(split_dim, 2)
will return a model_new
satisfying model_new.size[split_dim]=3
and model_new.size[split_dim+1]=2
. There are two ways to factor 6:
We also specify that the ordering in terms of the sizes of the newly created dimensions (split_dim
, split_dim+1
, split_dim+2
) to be decreasing. Therefore, we will split 6 into 3X2 rather than 2X3.
We can guarantee that:
model_new.size
will be aN+num_dim-1
-dim tuple ifmodel_old
is aN
-dim tuple
Due to the more expressivity that autosplit provides, balance_split
transformation is not used quite often in practice. We only support num_dim
to be 2
or 3
for now.
The auto_split
transformation is a method supported on a machine model, and it takes an integer dim
and a tuple of integers vec
as the arguments. The dim
of the original machine model will be split into more dimensions, which will be the same number of dimension of the tuple vec
.
model_new = model_old.auto_split(dim, vec);
Suppose vec
is a 3D tuple: vec=(4,2,4)
and we want to split model_old
's dim
dimension, for instance, model_old.size[dim]=4
. The balanced way to split 32 into 3 numbers with respect to (4,2,4)
is 4=2*1*2
. As a result, model_new[dim]=2, model_new[dim+1]=1, model_new[dim+2]=2
.
Generally, given a N-dim tuple auto_split
aims to automatically split the model_old
's dim
dimension of size
$O_1 * O_2 * ... * O_N = O$ - Define
$W_i = \frac{L_i}{O_i}$ . We guarantee that$\Sigma_{i \neq j} (W_i * W_j)$ is minimized
It can be proven that the minimum is achieved when
This transformation primitive is quite useful in practice. If auto_split
can minimize inter-node communication assuming stencil computation pattern.
We can guarantee that:
model_new.size
will be aN+vec.len
-dim tuple ifmodel_old
is aN
-dim tuple
Below is a real use case for auto_split
, extracted from solomonik
def block_primitive(IPoint x, ISpace y, MSpace z, int dim1, int dim2) {
return x[dim1] * z.size[dim2] / y.size[dim1];
}
def cyclic_primitive(IPoint x, ISpace y, MSpace z, int dim1, int dim2) {
return x[dim1] % z.size[dim2];
}
m_2d = Machine(GPU); # nodes * processors
def auto3d(Task task) {
m_4d = m_2d.auto_split(0, task.ispace); # split the original 0 dim into 0,1,2 dim
# subspace: task.ispace / m_4d[:-1]
m_6d = m_4d.auto_split(3, task.ispace / m_4d[:-1]); # split the processor (previosly 1, now 3) dim into 3,4,5 dim w.r.t subspace
upper = tuple(block_primitive(task.ipoint, task.ispace, m_6d, i, i) for i in (0,1,2));
lower = tuple(cyclic_primitive(task.ipoint, task.ispace, m_6d, i, i + 3) for i in (0,1,2));
return m_6d[*upper, *lower];
}
IndexTaskMap task_5 auto3d; # task_5 launch space: (rpoc, rpoc, c)
The first transformation is done via m_2d.auto_split(0, task.ispace)
where 0
stands for the node dimension of the original machine model, andtask.ispace
is a tuple of integers representing the launch domain. The launch domain is 3D, so the original dimension 0 will be split into 3 dimensions, so the resulting machine model is m_4d
, and its last dimension (dimension 3) corresponds to m_2d
's last dimension (dimension 1, representing processors per node).
Imagine that the launching domains will be mapped onto the nodes, and each node will get a subset of the index points. The tuple which represents each nodes' workload can be computed as another tuple: task.ispace / m_4d[:-1]
. Here we use m_4d[:-1]
to get the first dimensions of a machine model as a tuple of integers, and we directly support /
operator for two tuple of integers.
To express the mapping more easily, we need m_4d.auto_split(3, task.ispace / m_4d[:-1])
to obtain a 6D machine model. Here we split the last dimension (3
) of m_4d
, with respect to the workload tuple.
It's not hard to intuitively understand the resulting 6D machine model -- the first 3 dimensions correspond to node dimension, and the last 3 dimensions correspond to the processor dimension. For each point in the launching domain, we choose nodes in a blockwise way while we choose processors in a cyclic way. The way to specify this is to compute the upper
3D tuple and lower
3D tuple respectively withtuple
construction, user-defined functions (block_primitive
and cyclic_primitive
). The syntax (tuple
, for
) is similar to Python's list comprehension
Finally, we can use *
operator to turn a tuple of integers into integers (e.g., unwrapping (1,2,1)
into 1,2,1
) for indexing the machine model to pick a specific processor via m_6d[*upper, *lower]
.
Implementation:
MSpace.cc
m_2d = Machine(GPU); # nodes * processors
def same_point(Task task) {
return m_2d[*task.parent.processor(m_2d)]; # same point as its parent
}
SingleTaskMap task_4 same_point;
For tasks that are not index launch, users can also specify where to place tasks.
Typically, the default mapper will use some heuristic depending on the depth of the task (task.get_depth()
in default_policy_select_initial_processor
): round-robin local processors (within the same node) for top-level (0-level) tasks, round-robin all processors for 1-level tasks (beyond the same node) if not tagged with SAME_ADDRESS_SPACE
, round-robin local processors (within the same node) of the same kind for 2-level task or deeper tasks. Different applications may want different strategies to get better data locality, and users can customize it with DSL mapper.
The above example code specifies that task_4
will be placed on the same processor as its parent task (originating processsor). task.parent
is another Task
object, and task.parent.processor(m_2d).processor(m_2d)
will return a tuple representing the position with respect to the machine model m_2d
, e.g., (1, 1) means that task_4
's parent is placed on the second node's second GPU. *
is used here to turn (1, 1)
into 1, 1
with which we can use to index the m_2d
again so that we place task_4
on exactly the same processor as its parent.
def same_node(Task task) {
return m_2d[task.parent.processor(m_2d)[0], *]; # same node as its parent
}
Users can also specify *
to specify a set of points in the machine model, to define a round-robin strategy. The above same_node
function specifies that the task will be placed on the same node as its parent (task.parent.processor(m_2d)[0]
), and any processor on that node (*
) is acceptable. The runtime will make the choice dynamically.
Conventional Legion Benchmarks
Distributed GEMM Algorithms
Users can write print statement and {}
in the language. The number of {}
and the arguments for printing needs to match.
print("{}, {}, {}", a, b, c);
Users can inspect the mapping decisions by turning on logging wrapper via command line (-wrapper
). An example log is circuit, and more logs are collected for reference