NoC experiments #9

Draft
wants to merge 34 commits into
base: main

petervdonovan

This includes the software that I have used to estimate the maximum achievable bandwidth of the NoC.

I realize now that NetworkInterface really is needed: for a
communication protocol to work properly, received messages must be
placed in buffers corresponding to the sender cores.
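
To make the per-sender buffering concrete, here is a minimal software sketch in C. It is not the NetworkInterface from this PR: `NUM_CORES`, `BUF_WORDS`, `rx_buffers_t`, `rx_push`, and `rx_pop` are hypothetical names and sizes chosen only for illustration. It shows one small FIFO per sender core, so that words from different senders cannot be interleaved in a single queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CORES 4 /* assumption: number of cores on the NoC */
#define BUF_WORDS 8 /* assumption: capacity of each per-sender FIFO */

/* One ring buffer of received words for a single sender core. */
typedef struct {
    uint32_t words[BUF_WORDS];
    size_t head;  /* index of the oldest word */
    size_t count; /* number of words currently stored */
} sender_fifo_t;

/* Receive side: one FIFO per possible sender. */
typedef struct {
    sender_fifo_t from[NUM_CORES];
} rx_buffers_t;

/* Store a word that arrived from `sender`. Returns false if the sender
 * index is out of range or that sender's FIFO is full. */
bool rx_push(rx_buffers_t *rx, unsigned sender, uint32_t word) {
    assert(rx != NULL);
    if (sender >= NUM_CORES) return false;
    sender_fifo_t *f = &rx->from[sender];
    if (f->count == BUF_WORDS) return false;
    f->words[(f->head + f->count) % BUF_WORDS] = word;
    f->count++;
    return true;
}

/* Take the oldest word received from `sender`. Returns false if the
 * sender index is out of range or nothing from that sender is waiting. */
bool rx_pop(rx_buffers_t *rx, unsigned sender, uint32_t *out) {
    assert(rx != NULL && out != NULL);
    if (sender >= NUM_CORES) return false;
    sender_fifo_t *f = &rx->from[sender];
    if (f->count == 0) return false;
    *out = f->words[f->head];
    f->head = (f->head + 1) % BUF_WORDS;
    f->count--;
    return true;
}
```

With this layout a receiver can ask for "the next word from core 1" without having to inspect or skip words that arrived from core 2 in the meantime.
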
Sending an integer takes on the order of 12 cycles in the worst case,
with no handshaking other than checking the valid bit.
This uses self-modifying code. It would probably be more efficient to
use a sequence of forward branches conditioned on the read cycle.
The synchronization itself costs about 13 cycles. It's cheap. And
synchronization is something that an HRTT only has to do once per
rendezvous, so 13 cycles is no big deal.
It doesn't work entirely properly: although we check that all parties
are ready to communicate, we do not synchronize precisely on when the
batch starts. Instead, we just sync with a blocking read, which is not
precise enough. It can't be precise enough because the poll takes 5
cycles, which is as long as the interval between flits.

It does "kind of" work though, which is progress.
Much longer is impractical because the code size would be excessive. A
little FPGA cannot have a giant IMEM scratchpad :(
The interleaving of the prints looks funny because the print from one of
the cores appears to happen a couple of cycles later than the prints
from the other two cores. However, all 126 numbers are correctly
received, in the correct order, by all receiving cores. That's 126
numbers broadcast in roughly 1233 cycles. 126*5=630 of those cycles are
necessarily required by the TDM schedule, 115*4=460 of those cycles are
synchronization overhead at the start of sending a packet, and the
remaining 143 or so cycles are probably the C glue code.
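
The cycle accounting above can be reproduced with a few lines of C. This is only an illustrative sketch: `cycle_budget_t` and `split_cycles` are hypothetical names, not part of this PR, and the inputs are the figures quoted in the commit message (1233 total cycles, 126 words at 5 cycles each, 4 packets at 115 cycles of synchronization each).

```c
#include <assert.h>
#include <stdint.h>

/* Breakdown of a measured cycle count into the three parts named above. */
typedef struct {
    uint32_t tdm_cycles;   /* cycles required by the TDM schedule */
    uint32_t sync_cycles;  /* per-packet synchronization overhead */
    uint32_t other_cycles; /* whatever is left, e.g. C glue code */
} cycle_budget_t;

/* Split `total` measured cycles into TDM, synchronization, and remainder. */
cycle_budget_t split_cycles(uint32_t total, uint32_t words,
                            uint32_t cycles_per_word, uint32_t packets,
                            uint32_t sync_per_packet) {
    cycle_budget_t b;
    b.tdm_cycles = words * cycles_per_word;
    b.sync_cycles = packets * sync_per_packet;
    /* The two known parts must fit inside the measured total. */
    assert(b.tdm_cycles + b.sync_cycles <= total);
    b.other_cycles = total - b.tdm_cycles - b.sync_cycles;
    return b;
}
```

Calling `split_cycles(1233, 126, 5, 4, 115)` yields 630 TDM cycles, 460 synchronization cycles, and 143 remaining cycles, matching the numbers in the message.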

Obviously the proportion that is overhead will vary depending on the
packet size -- smaller packets -> more packets for the same amount of
data -> more overhead. In this case 4 packets were sent.
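
As a rough illustration of how the overhead share depends on the number of packets, the sketch below computes the fraction of cycles spent on the TDM schedule when only the TDM cycles and the per-packet synchronization are counted. The function name `tdm_fraction` is hypothetical, and the model deliberately ignores the C glue code, so it is an approximation rather than a measurement.

```c
#include <assert.h>
#include <stdint.h>

/* Fraction of (TDM + synchronization) cycles that the TDM schedule itself
 * accounts for. Assumption: glue-code cycles are ignored. */
double tdm_fraction(uint32_t words, uint32_t cycles_per_word,
                    uint32_t packets, uint32_t sync_per_packet) {
    assert(words > 0 && cycles_per_word > 0);
    double tdm = (double)words * cycles_per_word;
    double sync = (double)packets * sync_per_packet;
    return tdm / (tdm + sync);
}
```

With the figures from this run (126 words, 5 cycles per word, 115 cycles of synchronization per packet), 4 packets give 630/1090, or about 0.58. Under the same model a single packet would give 630/745, or about 0.85, and one packet per word would give about 0.04.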

By my measurement, the synchronization overhead incurred before sending
a packet -- and this is the assembly, not the C which is probably doing
a function call, saving things on the stack, etc. -- is 115 cycles.
Multiple measurements all gave something like 114.5 cycles.
I want to have multiple bursts, to amortize the initial hundred-cycle
synchronization overhead while still allowing enough time between bursts
for the sender either to jump back to the start of an unrolled loop or
to load words from main memory before sending them from the register
file.

In this commit the program in low_level_interface_noc still seems to
work.
It took some time to get the assembly to work as it did in the previous
commit. It is getting hard to manage register allocation.

@lhstrh lhstrh left a comment

Impressive work, @petervdonovan!

// "nop\n\t"
// "nop\n\t"
"li t4, 0x40000000\n\t" // wishbone base address
// FIXME: Why does this loop have to go through one iteration extra the first time around, compared to the number of iterations that it makes thereafter?

Idem

petervdonovan and others added 5 commits January 20, 2023 12:52
That is 87% of the maximum possible on this NoC, and by sending more
data we can amortize even further.
This is 14 cycles out of 115 -- not a big performance difference. More
importantly, it avoids clobbering a few registers, which
initialize_asm can now write to without their data being lost.
The relevant functions are commented out because GCC inline asm
apparently does not permit clobbering caller-saved registers.
This is close to 600 lines of assembly all written in a very brittle
way. It is not tested yet. Also note that the result of porting it will
not really take advantage of all the code generator's dynamic checks
etc. because it is translated very directly from C macros.
I have not verified that the assembly is _correct_...
I stopped working on this 20 hours ago; just checkpointing old work
here.

2 participants