exo-lang/exo

Proposal for Returning Relevant Cursors

Opened this issue · 18 comments

Problem:

We have been talking about having scheduling ops support the idea of "Returning relevant cursors". For example, when using stage_mem, we generate an allocation (which we might wanna further schedule to for example set the memory), and at least one of a load stage or store stage:

def foo(n: size, A: i8[n] @ DRAM):
    assert n % 4 == 0
    for io in seq(0, n / 4):
        for ii in seq(0, 4):
            A[4 * io + ii] += 0.2
foo = stage_mem(foo, "for ii in _:_", "A[io*4:io*4+4]", "tile")
def foo(n: size, A: i8[n] @ DRAM):
    assert n % 4 == 0
    for io in seq(0, n / 4):
        tile: i8[4] @ DRAM      <----- Allocation
        for i0 in seq(0, 4):        <----- Load Stage
            tile[i0] = A[i0 + 4 * io]
        for ii in seq(0, 4):
            tile[ii] += 0.2
        for i0 in seq(0, 4):        <------ Store Stage
            A[i0 + 4 * io] = tile[i0]

It would be convenient if scheduling operations returned cursors to those relevant pieces of the procedure. Currently the alternative to this is using the argument cursor as an anchor and using navigation to get to such cursors from the scheduling code. The goal is to offload that work from the user to the tool.

Proposals:
Proposal 1: Return a dataclass algonside the proc. The dataclass contains fields which are the relevant cursors.

Definition goes along with the scheduling operations definition, serves as a documentation of what can be returned:

@dataclass
class stage_mem_cursors:
        alloc: AllocCursor = InvalidCursor()
        load_stage: AssignCursor | LoopCursor = InvalidCursor()
        store_stage: AssignCursor | Reduce_Crursor | LoopCursor = InvalidCursor()

Usage:

foo, cursors = stage_mem(foo, "for ii in _:_", "A[io*4:io*4+4]", "tile")
load_stage = cursors.load_stage # Save for later
foo, cursors = set_memory(foo, cursors.alloc, DRAM_STATIC)

Advantages:

  1. Explicit API
  2. Uniform return values convention (Procedure, dataclass of relevant cursors)
  3. It is easy to be forward compatiable and add more cursors in the future

Disatvantages:

  1. Have to define a dataclass per scheduling op
  2. Might be too verbose: cursors.load_stage
  3. Users may need to name for each returned dataclass to save them as temporary varaibles. Alternatively, overwrite previous one just like we do for proedures; you need to manually save which cursors you actually care about. So, the advantage your really save is not having to do a lot of navigation which I think is worth it.
  4. Unclear what to return when there is nothing to return. Some "Null" dataclass.

Proposal 2:

Similar to proposal 1, but we keep the return value of primitives to be just a procedure. Then, we implement a standard library wrapper around each primitive that returns a procedure and a dataclass of relative cursors.

Advantages:

  1. Keeping the primitives implementation as small as possible.

Disatvantages:

  1. Duplicate shceduling operations that perform the same rewrite.
  2. Naming problems:
    • Use same name
    • or add a suffix to the wrappers e.g. _c.

Proposal 3:
Instead of returning multiple relevant cursor, we can return exactly on relevant cursor along with the procedure. In the case of stage_mem, this could be the new block cursor including the load stage, the original block we staged around, and the store stage.

  • There is again the question here of whether this would a part of the primitive or a standard library wrapper.

Advantages:

  1. Might be easier to use since you don't have to think about the name bindings (in a dataclass or a dictionary) anymore.

Disatvantages:

  1. Returning one curosor is not alawys a viable options when the relevant cursor are not co-located i.e. they cannot be all grouped in one block. It is esepcially difficult with operations that could have an aribitrary number relevant cursors in arbitrary places in the procedure.
  2. We will still require a fair amount of navigation in the user-code.

@yamaguchi1024 @skeqiqevian. Can you add any other propsals you had in mind?

Some thoughts:

  1. I think there are a few semi-orthogonal sub-issues here that would be worth breaking out when thinking about the design. In particular, I am thinking of (a) what is the return value (just proc, proc + dataclass, proc + one cursor); (b) how should the "return cursors" functionality be exposed? (e.g. should this be an alternate version of a primitive or should the expected signature of a scheduling operation primitive change?); (c) slightly different is "how should this be implemented?" (e.g. purely in user-library vs. changing the primitives' implementation?)

  2. Another option (for dimension (b)) is to support a flag (e.g. p, cs = sched_op(p, c, ..., return_cursors=True)) that changes the returned value. If you think of this functionality as "advanced" then a little bit of extra typing might be appropriate.

  3. I like that the dataclass idea can be combined with lvalue pattern matching to unpack the different cursors "positionally" e.g. p, (alloc, load, store) = stage_mem(...) or unpack the cursors by name as you suggested @SamirDroubi. In the event that a scheduling op naturally wants to return only one or zero cursors, what would we do? Still return a dataclass for uniformity of the interface? (That doesn't seem like a terrible idea...?)

  4. The "return one cursor" variant is probably important for a combinator user library along the lines of the one we sketched in our PLDI submission. However, I don't think that option is either maximally simple (no cursor returned) or maximally useful/informative (the multiple cursor return). One could imagine e.g. a new combinator that allows one to wrap an op and specify which of the dataclass entries should be returned as "the" cursor. Bottom line is we can implement better "one cursor" convention libraries on top of the other options.

  5. The second foo code (post stage_mem) has a bunch of arrows pointing to parts of it. This reminds me of the "pattern matching multiple cursor groups" idea we floated as a more self-documenting introspection mechanism. Suppose we added some such feature. We would have to return a set of cursors in that case; one for each group. Shouldn't that ultimately use the same convention that we decide on here? If so, maybe thinking about that feature could help us feel more confident about a decision here (e.g. exactly what something like the dataclass idea should look like)

  1. I think splitting the problem into thoss three dimensions makes sense and is helpful to think more clearly about it.
  2. If you don't care about the returned cursors, you could simply ignore it by writing p, _ = sched(...). On the other hand, having to specify return_cursros=True is more verbose (and more characters to write). I guess ultimately it would depend on how frequently you would need to specify that argument. I would say the nice thing about having this option is someone who doesn't want to think about cursors, can completely operate as if they didn't exist.
  3. Unpacking should be simple with dataclasses, we just need to make sure to make the dataclass iterable by implementing __iter__. I am on the fence of the case where there is exactly one cursor because just returning the cursor wouldn't change the number of returned values, just the type of the second one. When there is no cursors to return, I think it would be nice to have uniformity at least in terms of number of returned arguments. Although it is a bit weird that every use of some op is of the form: p, _ = sched(...).
  4. I thought about the combinators and my thinking is what you likely do is build lambdas on the fly. The one returned cursor by the lambda might likely be one of the many returned cursors by the scheduling operation. The other way is what you described which is to have a wrapping combinator: we can make the dataclasses indexable by the order in which the fields are defined, and then we can build the combinator you described to generate a version that returns exactly one cursor given an index.
  5. I think we should be able to also define dataclasses dynamically for this feature as well and build the same functionality: https://docs.python.org/3/library/dataclasses.html#dataclasses.make_dataclass.

Continuing the conversation where we're not converged:

  1. yes, p,_ = op(p, ...) isn't unreasonable. I think the argument from the other side is that it's conceptually another spinning wheel. There was an old proposal which would simplify this to op(...) at the cost of not having as much explicit control. Absent other considerations, we should try to avoid the situation where there are a bunch of extra parameters/return-values that always have to be handled because such designs are more complicated to learn and tend to produce more syntactic boilerplate. If the return_cursors is an "advanced" feature, you can imagine it being as simple as adding op(..., rc=1) (using "truthiness" instead of True) or something similarly brief.
  2. (a) You can get dataclass to autogenerate the __iter__ implementation.
  3. (b) One proposal going with rc=1 style is that if it returns 0 cursors, then p,_ = op(..., rc=1) makes sense even if it is unnecessary. I think the awkward case is p,(c,) = op(..., rc=1) but maybe that's OK given that this is invoking the advanced feature?
  4. You can use a string with the dataclasses or an index. Effectively, we would have a system with a few different standard signature conventions for scheduling operators. e.g. Basic is p = op(p, ...); Basic at location is p = op(p, c, ...); Combinator is p,c = op(p,c, ...); Full Navigation is p,(c1,...) = op(p,c, ...)
  5. Yes.

An additional thought (not to roll into this proposal but thinking further ahead) is that it would be great if there was a "standard" way to decorate some user-defined scheduling ops in order to encourage adhering to a convention and maybe generate some boilerplate functionality as a consequence. There could be different conventions (as suggested above). You could imagine such a decorator allowing us to install different kinds of cursor coercion functionality (already exists for primitives, but not user ops), maybe help with documentation, and maybe package rc=1 optional functionality etc. You could also imagine auto-coercion between some different operator conventions.

  1. Yeah, I agree with you that there should be a supported scheduling mode that allows that user to ignore anything cursors related. I think the consise arguemnt you described makes sense, it is not clear what it does from looking at it, but it is in every operation and will be common in code that users will just learn.
  2. Agreed.
  3. I agree it should be okay to expect the user to unpack the dataclass in that one off case.
  4. Agreed.
  5. I am not sure exactly what you mean by auto-coersion, it would be great if you could elaborate more on this.
  6. I was generally thinking that we would extend the @sched_op decorator so that it automatically adds the new default argument rc=False.
  7. We can potentially make the dataclass be automatically generated and associated with a scheduling op via the same decorator or a different one: @returned_cursors("alloc", "load", "block", "store") which then would serve as a docuemntation. If we go with the two decorators apporach, then the sched_op decorator should likely enforce the returned_cursors decorator.
  8. How existing and new decorators can be exported to the user world is really important discussion as well. The current decorator and argument processors can greatly simplify the user ops implementation and provide easy documentation. I think simply exporting them as is may not be a terrible idea since the almost all return API cursors. There are some exceptions (NewExprA) where the return value is a non-API object. One option is to not export any thing that could return a non-API object for now and make the user-level argument signature be a string instead which is what you would do now anyways.
  9. Another remaining point of the discussion is (1.c). How should this be implemented?

To summarize, the design we arrived at above:
a. Relevant cursors will be returned as a dataclass.
b. Using this functionality will require opting-in by specifcying a default parameter rc=True (return cursors) which will be the last parameter of a scheduling op.
c. This functionlity will be added to the primitives themsevles.
d. Cases where the are 0 or 1 relevant cursors will still return a dataclass when the the functionality is inovked for consisntency.
e. The per-op dataclass will be automatically defined by a decorator or explicitly.
f. The default argument will either be enforced or automatically added by a decorator.

I think (e) and (f) present a new design decision:
Option 1:

  1. Require each scheduling operation to have an extra final argument **kwargs
  2. We modify the argument processor to enforce what goes in that. Currently, we only allow rc with boolean values.
  3. We can add a decorator that dynamically defines the dataclass e.g. (@returned_cursors("alloc", "load", "block", "store")). This decorator binds the type of the dataclass in **kwargs

Option 2:

  1. Require each operation to have the last argument be rc similarly to how we enforce the first argument to be a procedure.
  2. To define the dataclass:
    a) Statically (explicitly) define a dataclass in the globalscope with some naming convention e.g. "{op_name}_rc"
    b) Use a decorator @returned_cursors("alloc", "load", "block", "store") to the function that binds the dynamically-created definition of the dataclass to the globalscope under some naming convention.

I think Option 2 with (2.b) is the cleanest and most explicit. The rc argument will be explicit in the operation defintion. It exposes the dataclass definition to the users, so they can introspect it with help(). The information about what cursors are returned is clearly coupled with the scheduling operation via the decorator and we can add them to the information displayed when a user calls help() on a scheduling operation.

(e) and (f) were referring to the primitives here. I think we need to have something if we are to implement this feature for the primitives. The other alternative is to hold off on implementing this feature until we have the second discussion about how decorators are exported by the API.

Yeah that makes sense. That's a cool idea, I didn't think about making the implementation always return the value by default, then optionally supress it in the object. Thanks!

Tangential to design but wanted to note that once this gets built out, we're free to deprecate some of the awkward forwarding behavior (e.g. divide_loop where the divided loop forwards to the outermost loop). That was only originally present because we didn't have a good way to access newly created cursors.

The last piece of this design is what should be returned by the operations. I think we need at least a few policies so that there is some consistency in what we return. Here is an attempt at a policy and what it translates to for the operations:

  1. The input cursor (forwarded) needs to be among the returned cursors.
  2. We return the most granular cursors for statements. For example, even if we could return a block cursor (of a fixed size) that encompasses the cursors, we would instead return a cursor to each statement instead.
  3. We return cursors to every statement/expressions added/modified.
  4. The cursors are ordered in the dataclass by sequential order of execution.
  5. In the absence of a returned cursor that does have a field in the dataclass, we assign it to an invalid cursor. Unless the field represents a collection of things, then we set it to the empty version of that collection.
  6. If a cursor is deleted, do we return a cursor to the gap (and with respect to what statement)? What if there is no previous or next statement?

Let's see what this translates to for the scheduling operations:

Operation Returned Cursors Discussion
simplify - A list of cursors too all updated expressions This is according to (3). Is this an overkill? Are these cursors really useful? I think it will also complicate the implementation of the operation.
rename No returned cursors
make_instr No returned cursors
insert_pass - Forwarded gap
- Pass stmt cursor
According to (1) and (3)
delete_pass - list of gaps after the statement previous to the pass According to (3) and (6)
reorder_stmts - Block of the two statements - First statement
- Second statement
According to (1), (2) and (3). Here the input is a block, so we should return the block of the two. But also the most granular thing to return is a cursor to both statements. However, it seems weird to do so since the block is a block of the two statements.
parallelize_loop - cursor to the loop According to (1)
commute_expr - list of all the commuted cursors According to (1).
rewrite_expr - cursor to the new expression According to (1).
bind_expr - alloc cursor
- write cursor (the bind)
- list of all accesses to the new buffer
According to (1), (2), and (3)
extract_subproc - call cursor
- the new subprocedure
According to (1) and (3). Currently, the subprocedure is directly returned by the operation, so this changes that convention and adds it instead to the dataclass
inline - Block of the inline statements According to (1)
replace - A cursor to the call According to (1)
call_eqv - A cursor to the call According to (1)
set_precision - A cursor to the allocation or argument According to (1).
set_window - A cursor to the argument According to (1).
set_memory - A cursor to the allocation or argument According to (1).
bind_config - A cursor to the config write
- A cursor to the config read
According to (1) and (3).
delete_config - A cursor to the gap after the statement previous to the deletion According to (1) and (6).
write_config - A cursor to the config write According to (1)
resize_dim, expand_dim, rearrange_dim, bound_alloc, divide_dim, mult_dim - A cursor to the alloc
- A cursor to the modified dimensions
- A cursor to every access to the buffer - A cursor to the modified access expressions
According to (1) and (3). Some of these seem like an overkill. Is it really useful to return any cursors to the accesses
unroll_buffer - A block of the created allocations
- Cursors to the access
According to (1) and (3)
lift_alloc, sink_alloc, autolift_alloc - A cursor to the allocation According to (1)
delete_buffer - A cursor to the gap after the previous statement According to (1) and (6)
reuse_buffer - A cursor to the reused allocation
- A cursor to the gap after the statement previous to the eliminated allocation
- A cursor to every new access to the reused buffer
According to (1) and (3)
inline_window - A cursor to the gap after the statement before the window statement
- A cursor to every inlined access to the previously windowed buffer.
stage_window - alloc
- load
- cursor to expression passed as an argument
- store
According to (1), (2), (3). Should the load/store cursors be to the load/store statement themselves or to the outer most statement in the load/store nest (which could be a loop or a load/store statement
stage_mem - alloc
- load
- block
- store
Same discussion as stage_window
divide_with_recompute - outer loop
- inner loop
- a list of every modified index expression
According to (1), (2), and (3). Is returning every modified index expression an overkill or useful?
divide_loop - outer loop
- inner loop
- guard
- tail loop
- A cursor to every new index expression
According to (1) and (3)
mult_loops - A cursor to the first loop
- A cursor to every new index expression
Same discussion as above.
join_loops - A cursor to the loop According to (1)
join_loops - A cursor to the first loop
- A cursor to the second loop
According to (1) and (2)
shift_loop - A cursor to the loop
- A cursor to every modified index expression
According to (1) and (3)
reorder_loops - Outer loop
- inner loop
According to (1) and (3)
merge_writes - A cursor to the final statement According to (1) and (3)
inline_assign - A cursor to the gap
- A cursor to every new access
According to (3).
lift_reduce_constant - block cursor
- A cursor to the new statement
According to (1) and (3).
fission, autofission - A cursor to the first loop
- A cursor to the second loop
According to (1) and (3).
fuse - A cursor to the new loop According to (1)
remove_loop - A cursor to the guard
- A cursor the block
According to (1) and (3)
add_loop - A cursor to the loop
- A cursor to the guard
- A cursor to the block
According to (1) and (3)
unroll_loop - A cursor to the block According to (1) and (3)
lift_scope - A cursor to the if statement - A cursor to the two loops (or one loop?) Unclear here what to return in terms of loops since there could be one or two
eliminate_dead_code in case of if: the block, in case of for: there is nothing to return so just the gap Unclear what to return here: a unifying datalcass for both, different dataclass depending on the operation, split the rewrite into two rewrites
specialize - A cursor to the if statement
- A cursor to the if block
- A cursor to the else block
According to (1), (2), and (3).
add_unsafe_guard - A cursor to the guard
- A cursor to the block
According to (1) and (3)
bound_and_guard - A cursor to the loop
- A cursor to the guard
According to (1) and (3)

So far we have been thinking of this returned object as something that holds cursors. There is a question of whether it needs to be built with the intention of potentially returning types other than cursors. For example, extract_proc above returns the extracted subprocedure; how should that be returned when you specify rc=1: within this dataclass or should there be three arguements returned Proc, Porc, CursorsDataclass. Another related thing that may come up is a "success" flag. Let's say an operation doesn't throw errors when it fails, but rather reports a True/False value on whether it succeeded. How should that flag be returned? Within the dataclass or as a third argument. I don't think we have a great example of why you might want to support a success flag instead of just throwing an error, but it is just an example I thought of where an operation might want to return something other than just cursors.

(sorry for any weird formatting; composed this on my phone)

After going over this list I realize there is a useful way we can approach this: documentation.

If we force ourselves to document the behavior in a standard way then (a) we solve the missing documentation problem and (b) it should be more obvious whether that documentation makes sense.

In particular, the extract_subproc primitive reminds me that not all of our primitives strictly conform to certain “conventions” (ie what signature is expected by higher order scheduling operators). The RC feature is fundamentally about adding another such signature convention. We should document which conventions we support in which ways. That will likely streamline a lot of work. We should discuss this point in a design meeting. I will try to post to Slack.

————-

The case of simplify suggests that (3) is too strong. My opinion is simplify should not support RC.

Insert Pass should probably just return a cursor to the pass statement.

On delete pass, I think I prefer No RC support. However there is a deeper issue with the convention. What happens if you delete a pass that is the only statement inside a loop? That’s not grammatical. Does it fail? This is related in a non-obvious way to the presence of n+1 “gaps” in an n statement block, and the problem of special cases for the first and last gap. eg suppose I assume that a returned gap cursor is always attached to the preceding statement when returned from delete pass. Oh shit. This is impossible to guarantee. The opposite convention is also impossible to guarantee. So a user is highly likely to infer unreliable invariants regardless of what behavior we set. As a result, I suggest we say no RC here to prevent the otherwise unavoidable ambiguity bugs.

On reorder stmts it should just be the block cursor returned. (2 shouldn’t be absolute). There are two variants of reorder statements that should be defined in a standard library: reorder before and reorder after. These should take and return single statement cursors in the obvious way.

Parallelize loop seems fine

Commute_exprs seems fine, but maybe just don’t support RC or return the input cursor?

What is rewrite-expr again? Is this an unsafe escape hatch? (Behavior seems fine)

I 100% agree with bind_expr.

extract-subproc — see discussion at the top

in-line, replace, call-eqv— agree

Set_* — is it possible to have a cursor to an argument?

bind-Config, write Config — sure

Delete config — same issue as for deleting a pass statement

Resize dim etc — just return a cursor to the allocation. The other cursors ought to be easy to retrieve using an inspection (better name than introspection?) feature of some sort. One wants to do that in other circumstances, so better to be more idiomatic about the way to do so.

Unroll buffer — drop cursors to accesses; see above

*_alloc — agreed

Delete buffer — same issue as other deletes

Reuse buffer — should just be a cursor to the alloc

In-line window — same issue as delete. More open to the list of cursors to in-line sites in this case. Might be harder to retrieve another way? But also totally fine to just not support RC on this for consistency?

Stage window — there is no expression passed as argument? You mean the statement around which one stages? (Oh I see that I mean stage-mem) I agree with that. Again no need for every use site of the new buffer.

On *_loop ops, drop the lists of cursors to modified expressions. The two cursors to inner and outer or similar are OK. I think just an outer cursor is fine too. If there are two successive loops (not nested) then multiple cursors should absolutely be returned.

Merge writes— yup

In-line assign — same comment as for deletion

Lift-reduce constant — I don’t understand what this op is or does…

Fission, auto fission — this should return a gap cursor to the gap between the two loops as well. The cursor should be attached to the loop before or after the gap based on which statement the argument cursor was attached to.

Fuse — agreed

Remove loop — should only return one cursor

Add_loop — should only return one cursor

Unroll-loop — agreed

Lift-scope — not sure about this one. Maybe this trouble is a sign that this primitive is mis-designed.

Eliminate dead code — same comment as other deletions

Specialize — if it’s always an if block then why return the other cursors? They have obvious paths from the first cursor

Last two — same comment about gratuitous cursors to sunstructures

I have thought about extract_subproc more and I wonder if really returning the subproc by default (the way it is done now) is necessary. Let's say it wasn't returned, I would expect to be able to write code like the following:

proc = extract_subproc(proc, stmt)
subproc = proc.forward(stmt).subproc()

and with this new feature, you would also be able to do:

proc, cursors = extract_subproc(proc, stmt, rc=True)
subproc = cursors.subproc

It generally seems like the behavior of returning the subprocedure is an artifact of the pre-cursors Exo rather than being something that is necessary.

Actually, I guess if you want to be able to schedule the subprocedure without using any cursor-related features. You do need to get the subproc somehow. Perhaps, I am wrong then.