NoC experiments #9
Draft: petervdonovan wants to merge 34 commits into main from noc-experiments
I realize now that NetworkInterface really is needed since in order for a communication protocol to work properly the received messages must be placed in buffers corresponding to the sender cores.
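As a rough software illustration of the per-sender buffering described above, here is a minimal sketch in C. It is not the hardware NetworkInterface from this PR: the names `network_interface_t`, `ni_deliver`, and `ni_receive`, the core count, and the buffer depth are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CORES 4 /* assumed core count, not taken from the PR */
#define BUF_WORDS 8 /* assumed per-sender buffer depth */

/* One small ring buffer per sending core. */
typedef struct {
    uint32_t words[BUF_WORDS];
    unsigned head;  /* index of the oldest stored word */
    unsigned count; /* number of words currently stored */
} sender_buffer_t;

/* Hypothetical receive side: incoming words are sorted by sender. */
typedef struct {
    sender_buffer_t from[NUM_CORES];
} network_interface_t;

/* Store a word that arrived from `sender`. Returns 0, or -1 if that buffer is full. */
int ni_deliver(network_interface_t *ni, unsigned sender, uint32_t word) {
    assert(sender < NUM_CORES);
    sender_buffer_t *b = &ni->from[sender];
    if (b->count == BUF_WORDS) return -1;
    b->words[(b->head + b->count) % BUF_WORDS] = word;
    b->count++;
    return 0;
}

/* Pop the oldest word sent by `sender`. Returns 0, or -1 if nothing is pending. */
int ni_receive(network_interface_t *ni, unsigned sender, uint32_t *out) {
    assert(sender < NUM_CORES);
    sender_buffer_t *b = &ni->from[sender];
    if (b->count == 0) return -1;
    *out = b->words[b->head];
    b->head = (b->head + 1) % BUF_WORDS;
    b->count--;
    return 0;
}
```

With this layout a receiver can ask for "the next word from core 2" without caring how messages from other cores were interleaved on the network, which is the property a protocol on top needs.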
On the order of 12 cycles worst case to send an integer, with no handshaking other than checking the valid bit.
This uses self-modifying code. It would probably be more efficient to do it with a sequence of forward branches conditioned on the read cycle.
The synchronization itself costs about 13 cycles, which is cheap. Synchronization is something that an HRTT only has to do once per rendezvous, so 13 cycles is no big deal.
It doesn't work entirely properly because although we check that all parties are ready to communicate, we do not sync precisely when the batch starts. Instead, we just sync with a blocking read, which is not precise enough. It can't be precise enough because the poll takes 5 cycles, which is as large as the interval between flits. It does "kind of" work though, which is progress.
Much longer is impractical because the code size would be excessive. A little FPGA cannot have a giant IMEM scratchpad :(
The interleaving of the prints looks funny because the print from one of the cores appears to happen a couple of cycles later than the prints from the other two cores. However, all 126 numbers are correctly received, in the correct order, by all receiving cores. That's 126 numbers broadcast in roughly 1233 cycles, sent as 4 packets. Of those cycles, 126*5=630 are necessarily required by the TDM schedule, 115*4=460 are synchronization overhead at the start of sending each packet, and the remaining 143 or so are probably the C glue code. By my measurement, the synchronization overhead incurred before sending a packet is 115 cycles (multiple measurements all gave something like 114.5). That figure is for the assembly alone, not the C, which is probably doing a function call, saving things on the stack, etc. Obviously the proportion that is overhead will vary depending on the packet size: smaller packets -> more packets for the same amount of data -> more overhead.
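The cycle accounting in this message can be reproduced with a small helper. This is only a sketch: `cycle_budget_t` and `cycle_budget` are hypothetical names for this example, and the inputs are the figures quoted above (1233 total cycles, 126 words, 5 cycles per word, 4 packets, 115 cycles of synchronization per packet).

```c
#include <assert.h>

/* Split of a measured cycle count into its three contributions. */
typedef struct {
    unsigned tdm;   /* cycles dictated by the TDM schedule */
    unsigned sync;  /* synchronization overhead across all packets */
    unsigned other; /* whatever is left, e.g. C glue code */
} cycle_budget_t;

/* Hypothetical helper: attribute `total` cycles to TDM slots, per-packet sync, and the rest. */
cycle_budget_t cycle_budget(unsigned total, unsigned words, unsigned cycles_per_word,
                            unsigned packets, unsigned sync_per_packet) {
    cycle_budget_t b;
    b.tdm = words * cycles_per_word;
    b.sync = packets * sync_per_packet;
    assert(b.tdm + b.sync <= total); /* the measurement must cover both terms */
    b.other = total - b.tdm - b.sync;
    return b;
}
```

Calling `cycle_budget(1233, 126, 5, 4, 115)` gives 630 TDM cycles, 460 synchronization cycles, and 143 remaining cycles, matching the numbers in the message.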
I want to have multiple bursts, to amortize the initial hundred-cycle synchronization overhead while still allowing enough time between bursts for the sender to either jump back to the start of an unrolled loop or for the sender to load words from main memory before sending them from the register file. In this commit the program in low_level_interface_noc still seems to work.
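To make the amortization argument concrete, here is a simple cost model in C. It is an assumption-laden sketch, not a description of the code in this commit: `packet_cycles` is a hypothetical name, and the burst length and inter-burst gap in the usage note are illustrative values, not measurements. Only the 115-cycle synchronization cost and the 5 cycles per word come from the messages above.

```c
#include <assert.h>

/*
 * Hypothetical model: one synchronization per packet, followed by
 * `bursts` bursts of `words_per_burst` words, with an idle gap between
 * consecutive bursts for the sender to reload registers or jump back
 * to the start of its unrolled loop.
 */
unsigned packet_cycles(unsigned sync, unsigned bursts, unsigned words_per_burst,
                       unsigned cycles_per_word, unsigned gap) {
    assert(bursts >= 1);
    return sync + bursts * words_per_burst * cycles_per_word + (bursts - 1) * gap;
}
```

Under these assumptions, a single 32-word burst costs `packet_cycles(115, 1, 32, 5, 0)` = 275 cycles, about 8.6 cycles per word, while four such bursts with an assumed 10-cycle gap cost `packet_cycles(115, 4, 32, 5, 10)` = 785 cycles for 128 words, about 6.1 cycles per word. The fixed synchronization cost is spread over more words, which is the point of sending multiple bursts.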
It took some time to get the assembly to work as it did in the previous commit. It is getting hard to manage register allocation.
lhstrh
reviewed
Jan 20, 2023
Impressive work, @petervdonovan!
// "nop\n\t"
// "nop\n\t"
"li t4, 0x40000000\n\t" // wishbone base address
// FIXME: Why does this loop have to go through one iteration extra the first time around, compared to the number of iterations that it makes thereafter?
Idem
Co-authored-by: Marten Lohstroh <[email protected]>
That is 87% of the maximum possible on this NoC, and by sending more data we can amortize even further.
This is 14 cycles out of 115 -- not a big performance difference. More importantly, it avoids clobbering a few registers, which the initialize_asm can now write to without their data being lost.
The relevant functions are commented out because GCC inline asm apparently does not permit clobbering caller-saved registers.
This is close to 600 lines of assembly all written in a very brittle way. It is not tested yet. Also note that the result of porting it will not really take advantage of all the code generator's dynamic checks etc. because it is translated very directly from C macros.
I have not verified that the assembly is _correct_...
I stopped working on this 20 hours ago; just checkpointing old work here.
This includes the software that I have used to estimate the maximum achievable bandwidth of the NoC.