NoC experiments #9

petervdonovan · 2023-01-20T20:21:15Z

This includes the software that I have used to estimate the maximum achievable bandwidth of the NoC.

I realize now that NetworkInterface really is needed since in order for a communication protocol to work properly the received messages must be placed in buffers corresponding to the sender cores.

Sending an integer takes on the order of 12 cycles in the worst case, with no handshaking other than checking the valid bit.

This uses self-modifying code. Probably more efficient to do it with a sequence of forward branches conditioned on read cycle.

The synchronization itself costs about 13 cycles, which is cheap. Synchronization is something that an HRTT only has to do once per rendezvous, so 13 cycles is no big deal.

It doesn't work entirely properly because although we check that all parties are ready to communicate, we do not sync precisely when the batch starts. Instead, we just sync with a blocking read, which is not precise enough. It can't be precise enough because the poll takes 5 cycles, which is as large as the interval between flits. It does "kind of" work though, which is progress.
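To illustrate why a 5-cycle poll cannot align to a 5-cycle flit interval, here is a minimal sketch of the timing, not the actual assembly. It assumes a simplified model in which the receiver samples the valid bit once every `poll_period` cycles; the function names are hypothetical.

```c
#include <assert.h>

/* Simplified model (an assumption, not the real polling loop): the receiver
 * samples the valid bit at cycles poll_start, poll_start + poll_period, ...
 * Returns the first sampling cycle at or after event_cycle, i.e. the cycle
 * at which a flit that became valid at event_cycle is first observed. */
static unsigned first_observed_cycle(unsigned event_cycle, unsigned poll_start,
                                     unsigned poll_period) {
    assert(poll_period > 0);
    unsigned t = poll_start;
    while (t < event_cycle) {
        t += poll_period;
    }
    return t;
}

/* Largest delay between arrival and observation, taken over every arrival
 * cycle within one poll period (polling starts at cycle 0). */
static unsigned worst_case_jitter(unsigned poll_period) {
    unsigned worst = 0;
    for (unsigned event = 1; event <= poll_period; event++) {
        unsigned delay = first_observed_cycle(event, 0, poll_period) - event;
        if (delay > worst) {
            worst = delay;
        }
    }
    return worst;
}
```

Under this model, `worst_case_jitter(5)` is 4 cycles, which is nearly a whole flit interval, so the blocking read alone cannot pin down which TDM slot the batch starts in.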

Much longer is impractical because the code size would be excessive. A little FPGA cannot have a giant IMEM scratchpad :(

The interleaving of the prints looks funny because the print from one of the cores appears to happen a couple of cycles later than the prints from the other two cores. However, all 126 numbers are correctly received, in the correct order, by all receiving cores. That's 126 numbers broadcast in roughly 1233 cycles, sent as 4 packets. By my measurement, the synchronization overhead incurred before sending a packet is 115 cycles (multiple measurements all gave something like 114.5 cycles); this is the assembly alone, not the C, which is probably doing a function call, saving things on the stack, etc. Of the 1233 cycles, 126*5=630 are necessarily required by the TDM schedule, 115*4=460 are synchronization overhead at the start of sending each packet, and the remaining 143 or so are probably the C glue code. The proportion that is overhead will vary with the packet size: smaller packets -> more packets for the same amount of data -> more overhead.
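The cycle accounting above can be written down as a small helper, as a sketch for rechecking the arithmetic with other packet counts. The constants are the measured figures from this run; the struct and function names are hypothetical and are not part of the benchmark code.

```c
#include <assert.h>

#define CYCLES_PER_WORD 5          /* TDM schedule: one word every 5 cycles */
#define SYNC_CYCLES_PER_PACKET 115 /* measured per-packet sync overhead */

/* Where the cycles of one broadcast went. */
struct cycle_budget {
    unsigned tdm;   /* cycles the TDM schedule requires for the payload */
    unsigned sync;  /* synchronization overhead, paid once per packet */
    unsigned other; /* whatever is left, e.g. C glue code */
};

/* Split a measured cycle count into the three buckets above. */
static struct cycle_budget break_down(unsigned total_cycles, unsigned words,
                                      unsigned packets) {
    struct cycle_budget b;
    b.tdm = words * CYCLES_PER_WORD;
    b.sync = packets * SYNC_CYCLES_PER_PACKET;
    assert(total_cycles >= b.tdm + b.sync);
    b.other = total_cycles - b.tdm - b.sync;
    return b;
}
```

For this run, `break_down(1233, 126, 4)` yields 630 TDM cycles, 460 synchronization cycles, and 143 remaining cycles.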

I want to have multiple bursts, to amortize the initial hundred-cycle synchronization overhead while still allowing enough time between bursts for the sender to either jump back to the start of an unrolled loop or for the sender to load words from main memory before sending them from the register file. In this commit the program in low_level_interface_noc still seems to work.

It took some time to get the assembly to work as it did in the previous commit. It is getting hard to manage register allocation.

lhstrh

Impressive work, @petervdonovan!

programs/benchmarks/noc/latency_aligned/noc_latency_aligned.c

lhstrh · 2023-01-20T20:44:09Z

+        // "nop\n\t"
+        // "nop\n\t"
+        "li t4, 0x40000000\n\t"  // wishbone base address
+        // FIXME: Why does this loop have to go through one iteration extra the first time around, compared to the number of iterations that it makes thereafter?


Co-authored-by: Marten Lohstroh <[email protected]>

That is 87% of the maximum possible on this NoC, and by sending more data we can amortize even further.
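As a rough check of the 87% figure, the sketch below takes the 1023 words in 5867 cycles reported for this commit and assumes that "maximum possible" means one word per 5-cycle TDM slot, the interval used in the 126-number measurement. The function name is hypothetical.

```c
#include <assert.h>

/* Achieved throughput as a whole-number percentage of the ideal rate of one
 * word every cycles_per_word cycles. Integer division truncates. */
static unsigned percent_of_peak(unsigned words, unsigned measured_cycles,
                                unsigned cycles_per_word) {
    assert(measured_cycles > 0);
    unsigned ideal_cycles = words * cycles_per_word;
    return (100u * ideal_cycles) / measured_cycles;
}
```

With these inputs, `percent_of_peak(1023, 5867, 5)` gives 87, while the earlier 126-word run gives `percent_of_peak(126, 1233, 5)` = 51, which shows how much the larger transfer amortizes the fixed overhead.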

This is 14 cycles out of 115 -- not a big performance difference. More importantly, it avoids clobbering a few registers, which the initialize_asm can now write to without their data being lost.

The relevant functions are commented out because GCC inline asm apparently does not permit clobbering caller-saved registers.

This is close to 600 lines of assembly all written in a very brittle way. It is not tested yet. Also note that the result of porting it will not really take advantage of all the code generator's dynamic checks etc. because it is translated very directly from C macros.

I have not verified that the assembly is _correct_...

I stopped working on this 20 hours ago; just checkpointing old work here.

petervdonovan added 24 commits December 30, 2022 01:12

Tentative start on NoC benchmarks.

fdf8d34

Measure 11 cycles of latency by cheating.

1cff46f

store->nop->load -> wrong WB read. Bug?

841e500

Actually 35 cycles of latency it seems.

f9afb86

Adjust and comment on noc_latency_aligned.

ec82902

Experiment with the NoC interface.

6abf115

More tinkering.

828d100

Get a basic test working in simulation.

93e1631

Failed attempt at synchronization.

7e59e25

Successful attempt at synchronization.

d77fd4d

Factor more assembly out into macros.

d65c22f

First draft of the sender side of the protocol.

8eefda3

Refactor the assembly a bit.

085bcd8

More assembly refactoring.

5ecc536

First draft of receive words macro.

614a56f

Receive a sequence of words correctly.

8a49d64

Send packets of length up to 64.

b9ee3ad

Add C API for read_n_words_and_print.

132d4c7

Add C API for broadcast_count.

a8d065d

Make small modifications.

7ba4017

Get the extended protocol to work properly.

8b863f9

lhstrh reviewed Jan 20, 2023

petervdonovan and others added 5 commits January 20, 2023 12:52

Update programs/benchmarks/noc/latency_aligned/noc_latency_aligned.c

309dad6

This sends 1023 words in 5867 cycles.

b8fdadb

Optimize out a SYNC5.

8e70867

Bugfix; move header-only lib to flexpret.

d5983a3

Start creating a BroadcastMemory program.

4c64694

petervdonovan added 5 commits February 3, 2023 18:08

Assembly generation "hello world".

b18fd08

Top-level definitions parse for BroadcastCount.

6827fcc

BroadcastCount assembly is generated.

9bf0027

Struggle to get assembly to work.

f066f42
