Latency measurement through `CSR(MCYCLE)` adds one extra cycle
Unike267 opened this issue ยท 8 comments
Hi @stnolting !
Back here again ๐
I'm performing a CFU (Custom Function Unit) latency measurement using the CSR(MYCYCLE)
.
For this purpose I've added a multiplier with NEORV32 through this mode of connection.
In this context I've made the following code in C:
//0000000000000001 x 0000000000000001
static uint32_t fir = 0x00010001;
//0000000000000010 x 0000000000000010
static uint32_t sec = 0x00020002;
//0000000000000100 x 0000000000000100
static uint32_t thi = 0x00040004;
//0000000000001000 x 0000000000001000
static uint32_t fou = 0x00080008;
// Intro
neorv32_uart0_printf("\n CFU-mpw \n");
// Perform 4 multiplication through custom instruction (funct3=001)
neorv32_cpu_csr_write(CSR_MCYCLE, 0);
neorv32_cfu_r3_instr(0b1111111, 0b001, fir, 0);
neorv32_cpu_csr_read(CSR_MCYCLE);
neorv32_cpu_csr_write(CSR_MCYCLE, 0);
neorv32_cfu_r3_instr(0b1111111, 0b001, sec, 0);
neorv32_cpu_csr_read(CSR_MCYCLE);
neorv32_cpu_csr_write(CSR_MCYCLE, 0);
neorv32_cfu_r3_instr(0b1111111, 0b001, thi, 0);
neorv32_cpu_csr_read(CSR_MCYCLE);
neorv32_cpu_csr_write(CSR_MCYCLE, 0);
neorv32_cfu_r3_instr(0b1111111, 0b001, fou, 0);
neorv32_cpu_csr_read(CSR_MCYCLE);
// End
neorv32_uart0_printf("\nEND\n");
Note: in the custom instruction the two 16-bit input operators are contained in rs1
, rs2
don't care and the 32-bit result is stored in rd
.
Obtaining the following assembly code:
asm volatile ("csrw %[input_i], %[input_j]" : : [input_i] "i" (csr_id), [input_j] "r" (csr_data));
214: 00000793 li a5,0
218: b0079073 csrw mcycle,a5
neorv32_cfu_r3_instr(0b1111111, 0b001, fir, 0);
21c: 00010737 lui a4,0x10
220: 00170713 add a4,a4,1 # 10001 <__neorv32_ram_size+0xe001>
224: fef7170b .word 0xfef7170b
asm volatile ("csrr %[result], %[input_i]" : [result] "=r" (csr_data) : [input_i] "i" (csr_id));
228: b0002773 csrr a4,mcycle
asm volatile ("csrw %[input_i], %[input_j]" : : [input_i] "i" (csr_id), [input_j] "r" (csr_data));
22c: b0079073 csrw mcycle,a5
neorv32_cfu_r3_instr(0b1111111, 0b001, sec, 0);
230: 00020737 lui a4,0x20
234: 00270713 add a4,a4,2 # 20002 <__neorv32_ram_size+0x1e002>
238: fef7170b .word 0xfef7170b
asm volatile ("csrr %[result], %[input_i]" : [result] "=r" (csr_data) : [input_i] "i" (csr_id));
23c: b0002773 csrr a4,mcycle
asm volatile ("csrw %[input_i], %[input_j]" : : [input_i] "i" (csr_id), [input_j] "r" (csr_data));
240: b0079073 csrw mcycle,a5
neorv32_cfu_r3_instr(0b1111111, 0b001, thi, 0);
244: 00040737 lui a4,0x40
248: 00470713 add a4,a4,4 # 40004 <__neorv32_ram_size+0x3e004>
24c: fef7170b .word 0xfef7170b
asm volatile ("csrr %[result], %[input_i]" : [result] "=r" (csr_data) : [input_i] "i" (csr_id));
250: b0002773 csrr a4,mcycle
asm volatile ("csrw %[input_i], %[input_j]" : : [input_i] "i" (csr_id), [input_j] "r" (csr_data));
254: b0079073 csrw mcycle,a5
neorv32_cfu_r3_instr(0b1111111, 0b001, fou, 0);
258: 00080737 lui a4,0x80
25c: 00870713 add a4,a4,8 # 80008 <__neorv32_ram_size+0x7e008>
260: fef7170b .word 0xfef7170b
asm volatile ("csrr %[result], %[input_i]" : [result] "=r" (csr_data) : [input_i] "i" (csr_id));
264: b00027f3 csrr a5,mcycle
The code executes four multiplications to perform the CFU latency measurement. The first measurement will be performed between the instruction pc: 218
where the CSR: MYCYCLE
is set to 0 and the instruction pc: 228
where the first latency will be read from this register. In the same way, the second measurement is performed between pc: 22c
and pc: 23c
, the third between pc: 240
and pc: 250
and the fourth between pc: 254
and pc: 264
.
The complex design (NEORV32 + multiplier) is simulated and the following results are obtained:
Note: The PROGRAM COUNTER
, INSTRUCTION
, CSR(MYCYCLE)
and CFU
signals are from NEORV32 and the FIFO_IN
, FIFO_OUT
and FULL/EMPTY
signals are from multiplier.
The question is that all the instructions to perform the CFU custom instruction lui
, add
and .work
take 11 cycles to complete and the CSR(MYCYCLE)
counts 12 cycles.
As we can see in the following image:
Is it possible that the latency measurement through CSR(MYCYCLE) is adding one extra cycle? ๐ค
I look forward to your reply,
Cheers! ๐
/cc @umarcor
Isn't there something missing? The variables for the return data??
neorv32_cfu_r3_instr(0b1111111, 0b001, fir, 0);
neorv32_cpu_csr_read(CSR_MCYCLE);
The cycle counter will increment in every clock cycle if the CPU is not in sleep mode. The CSR instructions to read/write any CSR are identical in terms of execution time, but they" commit in different stages. In this cas the CSR write is one cycle faster than the CSR read. That might be the reason for the +1 offset?!
Anyway, I suggest to also add the execute engine's state register to you waveform. Then you can see when an instruction has really completed execution - just looking at the program counter is not explicit enough for this.
Hi @stnolting !
Isn't there something missing? The variables for the return data??
The program is not designed to perform a implementation in FPGA. Therefore, it has been simplified to the maximum and the instructions/variables have been reduced to those necessary to excite the signals to be captured/measured.
The idea is to obtain directly in CI the result of the latency measurement for each modes/accelerators.
For this purpose the complex (NEORV32 + multiplier) is simulated through VUnit and the CSR(MYCYCLE)
value is extracted using the following code.
mycycle_capture: process
begin
done <= false;
wait until start and rising_edge(clk);
for x in 0 to test_items-1 loop
wait until rising_edge(clk) and csr_we = '0' and csr_valid = '1' and csr_addr = x"B00" and csr_rdata_o /= x"00000000"; -- CSR MYCYCLE ADDR IS 0xB00
info(logger, "Data " & to_string(x+1) & "/" & to_string(test_items) & " latency is " & to_string(to_integer(unsigned(csr_rdata_o))) & " cycles");
wait until rising_edge(clk);
end loop;
wait until rising_edge(clk);
done <= true;
wait;
end process;
To obtain the latency when the simulation is performed.
Note: uart.tx: mpw means "multp_wfifos", it is the name of one of the accelerators under test.
For this reason I didn't use variables to return data.
It should be noted that I've tested in FPGA the correct operation of the design with a program with variables to return data to print these values via uart. ๐
Note: in this case I also measured 12 cycles of latency.
In this cas the CSR write is one cycle faster than the CSR read. That might be the reason for the +1 offset?!
In the first comment in the second image we can see that the csrw
(CSR write) instruction pc: 218
takes three cycles to execute and the csrr
(CSR read) instruction pc: 228
also takes three cycles to execute. I can't see why CSR write is one cycle faster. ๐
Please, could you re-explain this to me?
Anyway, I suggest to also add the execute engine's state register to you waveform. Then you can see when an instruction has really completed execution - just looking at the program counter is not explicit enough for this.
Okey, thanks for the hint! ๐
I will relaunch the simulation to see this information and will comment the results. ๐
Cheers! ๐
Hi again!
I've added the execute engine's state register in the waveform.
Since the vcd
format cannot display type elements I've made the following assignment:
with execute_engine.state select
states <= "0001" when DISPATCH,
"0010" when TRAP_ENTER,
"0011" when TRAP_EXIT,
"0100" when RESTART,
"0101" when FENCE,
"0110" when SLEEP,
"0111" when EXECUTE,
"1000" when ALU_WAIT,
"1001" when BRANCH,
"1010" when BRANCHED,
"1011" when SYSTEM,
"1100" when MEM_REQ,
"1101" when MEM_WAIT,
"0000" when others;
I know that the ghw
format supports type elements but the simulation is so large that it gets stuck in this format. ๐
Anyway I've obtained the following results:
For csrw
instruction:
- First cycle: EXECUTE
- Second cycle: SYSTEM
- Third cycle: DISPATCH
For csrr
instruction:
- First cycle: EXECUTE
- Second cycle: SYSTEM
- Third cycle: DISPATCH
At first look, it seems that both instructions take the same time to execute. In addition, it looks that the cycles are captured through csrr
when the SYSTEM is consulted adding one extra cycle to the measurement, that is the EXECUTE cycle. ๐ค
What do you think?
Cheers! ๐
Since the vcd format cannot display type elements I've made the following assignment:
That's smart!
At first look, it seems that both instructions take the same time to execute. In addition, it looks that the cycles are captured through csrr when the SYSTEM is consulted adding one extra cycle to the measurement, that is the EXECUTE cycle.
That's right! All CSR access instruction always require the same amount of cycles. However, read and write access have have not the same latency. Write accesses happen in the 2nd cycle after the SYSTEM state. Read accesses (or actually the write-back to the register file) already happen in the cycle right after SYSTEM.
That's why the write access seems to be one cycle faster in the overall chronological order. However, I guess this is a very relative view ๐
Hi @stnolting !
That's right! All CSR access instruction always require the same amount of cycles. However, read and write access have have not the same latency. Write accesses happen in the 2nd cycle after the SYSTEM state. Read accesses (or actually the write-back to the register file) already happen in the cycle right after SYSTEM.
Okey!
Then, following your indication, I've understood this:
If I'm right you must subtract one cycle to the cfu_r3_instr
latency measurement and in general to any instruction/s that you are going to measure between csrw(mcycle = 0)
and csrr
.
Could you please confirm if I am correct in my assumption?
Cheers! ๐
Hi!
To reinforce my argument I've simulated the NEORV32 with this brief program:
//Intro
neorv32_uart0_printf("\n CSR \n\n");
neorv32_cpu_csr_write(CSR_MCYCLE, 0);
neorv32_cpu_csr_read(CSR_MCYCLE);
neorv32_uart0_printf("\nEND-CSR\n");
//END
In assembler:
asm volatile ("csrw %[input_i], %[input_j]" : : [input_i] "i" (csr_id), [input_j] "r" (csr_data));
1f0: 00000793 li a5,0
1f4: b0079073 csrw mcycle,a5
asm volatile ("csrr %[result], %[input_i]" : [result] "=r" (csr_data) : [input_i] "i" (csr_id));
1f8: b00027f3 csrr a5,mcycle
To see if the consecutive execution of csrw(mcycle = 0)
and csrr
instructions counts in the read operation 1 or 0 cycles.
Giving two possible scenarios:
- Read operation counts 0 cycles. The measurement through
CSR(MCYCLE)
doesn't add any extra cycle. - Read operation counts 1 cycle. The measurement through
CSR(MCYCLE)
adds one extra cycle.
In this context I've obtained the following waveform:
And the following simulation result:
In view of these results, we can confirm that the latency measurement through CSR(MCYCLE)
adds one extra cycle to whatever we want to measure between csrw(mcycle = 0)
and csrr
. If this were not the case, the result of this test should have been 0 cycles.
To fix this issue I've subtracted one cycle to the csr_rdata_o
in the process of extracting CSR(MCYCLE)
information.
From:
info(logger, "Data " & to_string(x+1) & "/" & to_string(test_items) & " latency is " & to_string(to_integer(unsigned(csr_rdata_o))) & " cycles");
To:
info(logger, "Data " & to_string(x+1) & "/" & to_string(test_items) & " latency is " & to_string(to_integer(unsigned(csr_rdata_o)) - 1) & " cycles");
What do you think about this conclusion?
Cheers! ๐
In view of these results, we can confirm that the latency measurement through CSR(MCYCLE) adds one extra cycle to whatever we want to measure between csrw(mcycle = 0) and csrr.
Right. This is a side-effect of the actual hardware implementation.
This is what I would do: initialize mcycle only once at the beginning. Then, just read the "current time" from that CSR and compute the time deltas. Thus, you will always be using read accesses for mcycle resulting in identical latency:
// initialize mcycle (low-part only!)
neorv32_cpu_csr_write(CSR_MCYCLE, 0);
...
uint32_t time_start = neorv32_cpu_csr_read(CSR_MCYCLE); // start time
... // your code to be benchmarked
uint32_t time_end = neorv32_cpu_csr_read(CSR_MCYCLE); // end time
uint32_t execution_cycles = time_end - time_start; // time delta
Hi @stnolting !
Right. This is a side-effect of the actual hardware implementation.
Okey! Issue solved! ๐ ๐
This is what I would do: initialize mcycle only once at the beginning. Then, just read the "current time" from that CSR and compute the time deltas. Thus, you will always be using read accesses for mcycle resulting in identical latency:
I appreciated your tip!
Thank you so much for your quick responses! ๐
Cheers! ๐