cornell-zhang/heterocl

New synthesis mode required


Currently, HeteroCL has several target modes, but none of them support pure kernel code synthesis.

"vhls" target and "debug" mode returns host/kernel code without execution. hcl.platform needs to collaborate with .to to move computations from host to xcel. If .to is not used, all the computations will be done on the host.

However, sometimes users only want to optimize their applications on the FPGA and do not care much about the host part (e.g., consider using different primitives to optimize a single convolution layer on chip). HeteroCL's current facilities offer no convenient path for these users, which leads to several disadvantages:

1. Users need to manually set up data streaming for all the compute modules. This is a burden if users simply want all the computations to be executed on the FPGA and to profile the performance of the application. I agree that in most cases users need to specify where the compute functions are executed, but it would be better to provide an option that quickly puts all of their functions on-chip.
2. Users cannot focus on the module they want to optimize. For example, a simple add function using .to generates the following three loops. However, only the performance of Loop2 is what users care about. Loop1 and Loop3 are automatically generated by HeteroCL; they should already be optimized, and there is no need to synthesize them again and again. The latency in the HLS report also counts these loops.
void test(hls::stream<ap_int<32> >& B_channel, hls::stream<ap_int<32> >& C_channel) {
  #pragma HLS INTERFACE axis port=B_channel offset=slave bundle=gmem0
  #pragma HLS INTERFACE axis port=C_channel offset=slave bundle=gmem1
  #pragma HLS INTERFACE s_axilite port=return bundle=control
  ap_int<32> B[320];
  Loop1: for (ap_int<32> B0 = 0; B0 < 10; ++B0) {
    for (ap_int<32> B1 = 0; B1 < 32; ++B1) {
      B[(B1 + (B0 * 32))] = B_channel.read();
    }
  }
  ap_int<32> C[320];
  Loop2: for (ap_int<32> args = 0; args < 10; ++args) {
    for (ap_int<32> args0 = 0; args0 < 32; ++args0) {
      C[(args0 + (args * 32))] = (B[(args0 + (args * 32))] + 1);
    }
  }
  Loop3: for (ap_int<32> C0 = 0; C0 < 10; ++C0) {
    for (ap_int<32> C1 = 0; C1 < 32; ++C1) {
      C_channel.write(C[(C1 + (C0 * 32))]);
    }
  }
}
3. Performance degradation cannot be detected by testing. The current tests in HeteroCL only run simulations, but modifications to the code may degrade performance and lead to unexpected results. #233 is an example, and it is not covered by the test cases since no synthesis is run now.

Though kernel code cannot be executed without a host, I think this kind of synthesis is important for quick profiling and performance improvement at the beginning of application development. Thus, I suggest adding a new mode that only generates kernel code and runs HLS directly (it can be viewed as a combination of the "vhls" mode and the "csyn" mode).

To avoid conflicting with hcl.platform and .to, which by default place all the computations on the host, using a string target would be one choice. An interface like hcl.build(s, target="vhls_csyn") would invoke this mode.
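
A minimal sketch of how such a mode might look from the user's side, assuming the proposed "vhls_csyn" target string is adopted (the name is only a placeholder):

import heterocl as hcl

# Declare a small computation; no hcl.platform and no .to calls are needed.
A = hcl.placeholder((10, 32), "A")
def kernel(A):
    return hcl.compute(A.shape, lambda *args: A[args] + 1, "B")

s = hcl.create_schedule([A], kernel)

# Hypothetical mode: emit only the kernel code and invoke Vivado HLS
# synthesis on it directly, skipping host code generation and execution.
f = hcl.build(s, target="vhls_csyn")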

@chhzh123 Yes, this is a very good point.

For VHLS, I suggest we only generate the HLS code for the design under test (DUT) if the mode is set to csim, csyn, cosim, or some combination of them. There is no need to create the host interface unless we target actual hardware emulation or execution. What are the names for those modes?

@hecmay with the .to() primitive, do we also create a local buffer of the original size when we do streaming between two internal blocks?

> @hecmay with the .to() primitive, do we also create a local buffer of the original size when we do streaming between two internal blocks?

Yes.

Agreed. The abstraction is not clear enough, and it's too much work for users to use .to for each of the arguments.

We can change the default setting: if no data movement is specified, then all computations are offloaded to the FPGA. Using too many .to calls is indeed a burden on users.

> @hecmay with the .to() primitive, do we also create a local buffer of the original size when we do streaming between two internal blocks?

> Yes.

This unfortunately will result in a low-throughput and area-inefficient design.

With streaming, we should read one data element per cycle from the innermost loop (pipelined to II=1). For the producer, we also write out one element per cycle so that the data rate matches. This way we don't need an extra buffer for the consumer.
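
A rough HeteroCL-level sketch of that pattern, assuming .to can stream a tensor from one stage into another and that .pipeline gives II=1 on the inner loops (the primitive names follow the usual HeteroCL schedule API, but treat the exact signatures as an assumption):

import heterocl as hcl

A = hcl.placeholder((10, 32), "A")

def kernel(A):
    B = hcl.compute(A.shape, lambda *args: A[args] + 1, "B")   # producer
    C = hcl.compute(A.shape, lambda *args: B[args] * 2, "C")   # consumer
    return C

s = hcl.create_schedule([A], kernel)

# Pipeline the innermost loops so the producer writes and the consumer
# reads one element per cycle (II = 1), matching their data rates.
s[kernel.B].pipeline(kernel.B.axis[1])
s[kernel.C].pipeline(kernel.C.axis[1])

# Stream B through a FIFO between the two stages instead of materializing
# a full-size local buffer on the consumer side (assumed .to usage).
s.to(kernel.B, s[kernel.C])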

> Agreed. The abstraction is not clear enough, and it's too much work for users to use .to for each of the arguments.

> We can change the default setting: if no data movement is specified, then all computations are offloaded to the FPGA. Using too many .to calls is indeed a burden on users.

Let's do something similar to PyTorch. Instead of always forcing the user to specify explicit data placement with .to(), we should also allow compute offload to target.xcel. We need a separate PR for this new feature.

> @hecmay with the .to() primitive, do we also create a local buffer of the original size when we do streaming between two internal blocks?

> Yes.

> This unfortunately will result in a low-throughput and area-inefficient design.

> With streaming, we should read one data element per cycle from the innermost loop (pipelined to II=1). For the producer, we also write out one element per cycle so that the data rate matches. This way we don't need an extra buffer for the consumer.

I suppose that should only work for the one-read-one-write case. We can support this specific one-read-one-write case without generating any local buffer; the local buffer is generated for the other, more generic cases, where there can be multiple producers and/or consumers.

> Agreed. The abstraction is not clear enough, and it's too much work for users to use .to for each of the arguments.
> We can change the default setting: if no data movement is specified, then all computations are offloaded to the FPGA. Using too many .to calls is indeed a burden on users.

> Let's do something similar to PyTorch. Instead of always forcing the user to specify explicit data placement with .to(), we should also allow compute offload to target.xcel. We need a separate PR for this new feature.

This sounds good. We can create a stage or scope for the FPGA device as PyTorch does. That actually makes things much easier than the data-centric approach.

import torch

cuda = torch.device('cuda')            # device handle assumed by the snippet
with torch.cuda.device(1):
    a = torch.tensor([1., 2.], device=cuda)
    b = torch.tensor([1., 2.]).cuda()
    c = a + b
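
For illustration only, a hypothetical HeteroCL analog of that scope might read as follows (hcl.device does not exist today; it is just a sketch of the idea):

import heterocl as hcl

target = hcl.platform.zc706
A = hcl.placeholder((10, 32), "A")

def kernel(A):
    # Hypothetical scope: any stage created inside is placed on the
    # accelerator, mirroring PyTorch's `with torch.cuda.device(...)`.
    with hcl.device(target.xcel):      # illustrative API, not implemented
        B = hcl.compute(A.shape, lambda *args: A[args] + 1, "B")
    return B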

> I suppose that should only work for the one-read-one-write case. We can support this specific one-read-one-write case without generating any local buffer; the local buffer is generated for the other, more generic cases, where there can be multiple producers and/or consumers.

Not necessarily. We do this when the tensor object being passed is streamable, meaning that the read and write orders are sequential and we have the same number of reads and writes.

Let's enable the simple cases first. Later we will need to use a polyhedral checker to verify the safety of the code transformation.
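
As a toy illustration of the "streamable" condition (plain Python, not HeteroCL code): the consumer must read elements in exactly the order, and with exactly the count, that the producer writes them; anything else still needs a real buffer.

def is_streamable(write_order, read_order):
    # A tensor can be lowered to a FIFO only if the reads happen in the same
    # sequential order and the same number of times as the writes.
    return write_order == read_order

writes = list(range(320))                                    # producer writes B[0] .. B[319] in order
assert is_streamable(writes, list(range(320)))               # sequential, reuse-free reads: OK
assert not is_streamable(writes, [0, 0] + list(range(1, 320)))  # re-reading B[0]: needs a buffer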

> I suppose that should only work for the one-read-one-write case. We can support this specific one-read-one-write case without generating any local buffer; the local buffer is generated for the other, more generic cases, where there can be multiple producers and/or consumers.

> Not necessarily. We do this when the tensor object being passed is streamable, meaning that the read and write orders are sequential and we have the same number of reads and writes.

> Let's enable the simple cases first. Later we will need to use a polyhedral checker to verify the safety of the code transformation.

Yes. That makes total sense. We actually have a simple test case for that scenario (i.e. the streamed multicasting), though it is not very stable and can be easily broken...

> Agreed. The abstraction is not clear enough, and it's too much work for users to use .to for each of the arguments.
> We can change the default setting: if no data movement is specified, then all computations are offloaded to the FPGA. Using too many .to calls is indeed a burden on users.

> Let's do something similar to PyTorch. Instead of always forcing the user to specify explicit data placement with .to(), we should also allow compute offload to target.xcel. We need a separate PR for this new feature.

> This sounds good. We can create a stage or scope for the FPGA device as PyTorch does. That actually makes things much easier than the data-centric approach.

> with torch.cuda.device(1):
>     a = torch.tensor([1., 2.], device=cuda)
>     b = torch.tensor([1., 2.]).cuda()
>     c = a + b

So what do you propose to do? @hecmay
I think the default setting should remain the same: if no data movement is specified, computations are executed on the host. Otherwise, it may be confusing when .to is added.

As an alternative, using a single .to to place the whole module onto the FPGA would be preferable (see the following example from PyTorch). There's no need to create a scope and move all the tensors explicitly, and it looks the same as the current .to facility.

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # do something

device = torch.device("cuda")
net = Net().to(device)
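
A hypothetical HeteroCL counterpart of that single call might look like this (the kernel-level .to overload shown is not an existing interface, just the shape of the proposal):

import heterocl as hcl

target = hcl.platform.zc706
A = hcl.placeholder((10, 32), "A")

def kernel(A):
    return hcl.compute(A.shape, lambda *args: A[args] + 1, "B")

s = hcl.create_schedule([A], kernel)
# Hypothetical: one call offloads the whole kernel to the accelerator,
# instead of one .to per tensor argument.
s.to(kernel, target.xcel)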

> Agreed. The abstraction is not clear enough, and it's too much work for users to use .to for each of the arguments.
> We can change the default setting: if no data movement is specified, then all computations are offloaded to the FPGA. Using too many .to calls is indeed a burden on users.

> Let's do something similar to PyTorch. Instead of always forcing the user to specify explicit data placement with .to(), we should also allow compute offload to target.xcel. We need a separate PR for this new feature.

> This sounds good. We can create a stage or scope for the FPGA device as PyTorch does. That actually makes things much easier than the data-centric approach.

> with torch.cuda.device(1):
>     a = torch.tensor([1., 2.], device=cuda)
>     b = torch.tensor([1., 2.]).cuda()
>     c = a + b

> So what do you propose to do? @hecmay
> I think the default setting should remain the same: if no data movement is specified, computations are executed on the host. Otherwise, it may be confusing when .to is added.

> As an alternative, using a single .to to place the whole module onto the FPGA would be preferable (see the following example from PyTorch). There's no need to create a scope and move all the tensors explicitly, and it looks the same as the current .to facility.

> class Net(nn.Module):
>     def __init__(self):
>         # do something

> net = Net().to(device)

This is a good way to do data movement. We need to figure out how to combine it with the current data-centric data movement approach. I still need to think about it.

> For VHLS, I suggest we only generate the HLS code for the design under test (DUT) if the mode is set to csim, csyn, cosim, or some combination of them. There is no need to create the host interface unless we target actual hardware emulation or execution. What are the names for those modes?

Currently, these modes must be used with hcl.platform. For example, to do synthesis, we need to write the following code.

import heterocl as hcl
import numpy as np

# 1. Declare computation
A = hcl.placeholder((10, 32), "A")
def kernel(A):
    B = hcl.compute(A.shape, lambda *args : A[args] + 1, "B")
    return B

# 2. Create schedule
s = hcl.create_schedule([A], kernel)

# 3. Specify the target platform and mode
target = hcl.platform.zc706
target.config(compile="vivado_hls", mode="csyn")

# 4. Data movement
s.to(A, target.xcel)
s.to(kernel.B, target.host)

# 5. Build the kernel
#    (a misleading interface here, since no code is generated at this point)
f = hcl.build(s, target)

# 6. Create required arrays
np_A = np.random.randint(10, size=(10, 32))
np_B = np.zeros((10, 32))
hcl_A = hcl.asarray(np_A)
hcl_B = hcl.asarray(np_B, dtype=hcl.Int(32))

# 7. Generate kernel code and do synthesis
f(hcl_A, hcl_B)

If only csim/csyn/cosim is called, steps 4, 6, and 7 are redundant. But if we only generate the HLS code, it conflicts with .to, since HeteroCL places computations on the host by default. That is to say, without step 4, only CPU code is generated.
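
Under the proposed kernel-only mode, the same example could shrink to something like the following sketch, where only steps 1, 2, 3, and 5 remain (again, the exact interface is only a suggestion):

import heterocl as hcl

# 1. Declare computation
A = hcl.placeholder((10, 32), "A")
def kernel(A):
    return hcl.compute(A.shape, lambda *args: A[args] + 1, "B")

# 2. Create schedule
s = hcl.create_schedule([A], kernel)

# 3. Specify the target platform and mode
target = hcl.platform.zc706
target.config(compile="vivado_hls", mode="csyn")

# 5. Generate kernel-only code and launch HLS synthesis immediately;
#    no .to calls, no input arrays, and no host execution are needed.
hcl.build(s, target)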

Like I suggested, we should also add support for compute placement so that programmers do not have to always use .to() to specify the data movement. The programming interface will be similar to PyTorch, which also supports either data or compute placement.

For the time being, we still need .to() to determine which functions need to be synthesized with HLS. We don't have to generate the host code though if only csyn is specified.

@chhzh123 Just had a discussion with Sean. I will create a ZeroCopy mode for the .to primitive, so that you will be able to generate the kernel function without any read/write nested for loops. Also, I will change the VHLS CodeGen a bit to automatically generate labels for the loops; this should make the analysis easier.

After I add the aforementioned features, I will create a simple primitive for compute placement. This would make our lives easier: we would not have to call .to so many times to perform the host-device splitting.
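
The thread does not pin down the final spelling of these features; purely as an illustration, usage might end up looking something like this (the ZeroCopy name comes from the comment above, but the keyword argument and its value are hypothetical):

import heterocl as hcl

target = hcl.platform.zc706
target.config(compile="vivado_hls", mode="csyn")

A = hcl.placeholder((10, 32), "A")
def kernel(A):
    return hcl.compute(A.shape, lambda *args: A[args] + 1, "B")

s = hcl.create_schedule([A], kernel)

# Hypothetical ZeroCopy data movement: the kernel accesses its arguments
# through the interface directly, so no read/write copy loops are emitted.
s.to(A, target.xcel, mode="zero_copy")          # illustrative keyword only
s.to(kernel.B, target.host, mode="zero_copy")
f = hcl.build(s, target)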