[GIT PULL] man/io_uring_internal: Man page about high lvl inner workings of io_uring #1256
base: master
Thanks for kicking this off! I'll add some comments in the diff.
man/io_uring_internals.7
Outdated
.PP
.B io_uring
is a linux specific, asynchronous API that allows the submission of requests to
the kernel that are typically otherwise performed via a syscall. Requests are
Typically yes, but not exclusively. Not sure if it bears mentioning or not...
Yeah, generally the first couple sentences here could imo be shortened or cut, as they are kind of just introductory rambling. Mostly wanted to just get to these two sentences:
An important detail here is that after a request has been submitted to the kernel some CPU time has to be spent in kernel space to perform the
required submission and completion related tasks.
The mechanism used to provide this CPU time, as well as what process does so
and when is different in
.I io_uring
than for the traditional API provided by regular syscalls.
.I Submission Queue
(SQ) and completion notifications are passed back to the application via the
.I Completion Queue
(CQ). An important detail here is that after a request has been submitted to
Not sure that sentence reads that well, I'm having a hard time trying to make sense of it.
man/io_uring_internals.7
Outdated
The tasks required in kernel space on the submission side are mostly checking
the SQ for newly arrived SQEs, parsing and check them for validity and
permissions and then passing them on to the responsible system, such as a
block device driver. An important note here is that
I'd mention "such as a block device driver, networking stack, etc" or something like that. Don't want to make this sound storage centric.
Yeah, will do.
If this fails, e.g. due to the respective system not supporting non-blocking
submissions,
.I io_uring
will
This isn't fully accurate. If an IO can be issued in a non-blocking fashion, then one of two things can happen:
1) It's done. Examples of this would be writing to or reading from a pipe/socket, or reading/writing to/from a regular file where the result is either in the page cache already (for a read), or io_uring was able to just copy it to the page cache (for a write). For these cases, a CQE will be posted immediately, even before io_uring_enter(2) returns.
2a) It wasn't done, but submitted async. io_uring will get a callback at some point when the operation completes, and a CQE will be posted. Examples of this are async reads/writes to a storage device.
2b) It wasn't done, but the file in question can signal readiness for when the operation can be retried. Examples of this are any pollable file, like a pipe, socket, etc. When io_uring receives the callback that data can now be read/written, it will retry the operation. Importantly, this retry happens from the task that submitted the IO. There's no async thread involved in this operation.
2c) It wasn't done, and the file has limited async support. Eg it cannot signal when it's ready to do IO. For this case, and only this case, does io_uring punt to an async worker to do the IO.
I don't want to imply that io_uring just willy nilly punts to async workers, as that is not the case, and that would not be very efficient. It's a last resort kind of thing, for when the driver / file type is pretty basic and doesn't support more than very basic primitives.
Now, for the application, it doesn't really matter which of the 2 cases end up happening, as completions are posted as it expects. But for efficiency reasons, it very much does matter, and there's a common theme where people assume that io_uring is just a thread work pool. That is very much WRONG, and this man page should not perpetuate that myth, it should help clear up the misunderstanding.
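The cases listed in this comment can be summarized as a small decision function. This is only an illustrative model of the description above, not kernel code: the names `issue_outcome`, `issue_conditions`, and `classify_issue` are hypothetical and exist neither in the kernel nor in liburing, and the order of the checks is an assumption made for the sketch.

```c
#include <stdbool.h>

/* Hypothetical model of the outcomes of a non-blocking issue attempt.
 * None of these names exist in the kernel or in liburing. */
enum issue_outcome {
    DONE_INLINE,     /* completed right away; CQE posted before io_uring_enter(2) returns */
    ASYNC_CALLBACK,  /* 2a: submitted async; CQE posted when the completion callback fires */
    POLL_RETRY,      /* 2b: file signals readiness; retried from the submitting task */
    IOWQ_PUNT        /* 2c: last resort; handed to an io-wq worker */
};

/* Conditions that, in this simplified model, decide the outcome. */
struct issue_conditions {
    bool completes_inline; /* e.g. data already in the page cache, socket has data */
    bool native_async;     /* e.g. async reads/writes to a storage device */
    bool pollable;         /* e.g. pipe or socket that can signal readiness */
};

/* Returns the outcome for one request; the io-wq punt is reached only
 * when none of the cheaper paths applies. */
static enum issue_outcome classify_issue(struct issue_conditions c)
{
    if (c.completes_inline)
        return DONE_INLINE;
    if (c.native_async)
        return ASYNC_CALLBACK;
    if (c.pollable)
        return POLL_RETRY;
    return IOWQ_PUNT;
}
```

In this model, a read from a socket with no data queued maps to `POLL_RETRY`, while the same read with data already available maps to `DONE_INLINE`. Only a file that is neither natively async nor pollable ends up at `IOWQ_PUNT`.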
I forgot that 1. is even a thing when writing this :D, so this should definitely be mentioned explicitly.
Also explaining 2a) and 2b) in more detail like this is probably a good idea.
man/io_uring_internals.7
Outdated
.SH The Completion Side Work
.PP

The tasks required in kernel space on the completion side mostly come in the
General theme - task is not the best word to use, because it implies a relationship to a thread/process. Not sure what's a better word to use here, just tossing it out there.
Yeah true. Many of the alternatives are also overloaded. I guess "work" would be better. The heading already uses it.
was to reduce or entirely avoid the overheads of syscalls to provide the
required CPU time in kernel space. The mechanism that
.I io_uring
utilizes to achieve this differs depending on the configuration with different
utilize
I think utilizes is correct here?
utilizes to achieve this differs depending on the configuration with different
trade-offs between configurations in respect to e.g. CPU efficiency and latency.

With the default configuration the primary mechanism to provide the kernel space
Not sure what this "provide the kernel space CPU time" means here?
"Someone" needs to run some code in kernel space (to do the kernel-side submission, post the CQE, etc.), be that one of the submitting processes after the context switch during a syscall, or e.g. the SQ poll thread, or to a limited extent the io-wq threads. So "someone", e.g. the caller of io_uring_enter or the SQ poll thread, would "provide the kernel space CPU time" and use it to run the relevant code in kernel space. That's how I have been thinking about this, but yeah, maybe not the best wording...
optionally wait until a specified amount of completions have arrived before
returning.

If polled I/O is used all completion related work is performed during the
I like this section!
man/io_uring_internals.7
Outdated
.SH Submission Queue Polling
.PP

Sq polling introduces a dedicated kernel thread that performs essentially all
Here, and in other spots, it's important to note that io_uring does NOT utilize any kernel threads. In Linux, a kernel thread is a special kind of thread that is entirely decoupled from any other real process/thread running in userspace. It doesn't have any files, mm, etc associated with it.
What io_uring uses are "io threads", which are exactly like a thread created with eg pthread_create() in the sense that they share any resources that the original task has, and any credentials, namespaces, etc. The only thing that makes them different is that they are created by io_uring, and they never exit to userspace. They sit around and do work, if needed, and then go away when they are no longer needed.
Hence I would probably explain this io thread concept when it's initially encountered in this man page, and then subsequently refer to io threads and remove any mention of kernel threads.
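To make the pthread_create() comparison concrete, here is a minimal sketch of the userspace side of the analogy only: a thread created with pthread_create() sees the creator's memory and can use a file descriptor the creator opened. The names `shared_counter`, `worker_arg`, `worker`, and `demo_shared_resources` are made up for this illustration. Real io threads are created by io_uring inside the kernel and never run userspace code like this.

```c
#include <pthread.h>
#include <unistd.h>

/* Hypothetical demo names; this only illustrates the pthread side of the
 * analogy, not how io_uring creates its io threads. */
static int shared_counter;          /* lives in the address space shared by all threads */

struct worker_arg {
    int fd;                         /* descriptor opened by the creating thread */
};

static void *worker(void *p)
{
    struct worker_arg *arg = p;

    shared_counter = 42;            /* same mm: the store is visible to the creator */
    if (write(arg->fd, "x", 1) != 1) /* same file table: the fd is valid here */
        return (void *)1;
    return NULL;
}

/* Returns 0 if the worker could use both the shared memory and the shared fd. */
static int demo_shared_resources(void)
{
    int pipefd[2];
    pthread_t t;
    void *ret = NULL;
    char c = 0;
    int ok;

    if (pipe(pipefd) != 0)
        return -1;

    struct worker_arg arg = { .fd = pipefd[1] };
    if (pthread_create(&t, NULL, worker, &arg) != 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }
    pthread_join(t, &ret);

    /* The byte written by the worker arrives through the creator's pipe. */
    ok = ret == NULL && read(pipefd[0], &c, 1) == 1 && c == 'x' &&
         shared_counter == 42;
    close(pipefd[0]);
    close(pipefd[1]);
    return ok ? 0 : -1;
}
```

Calling `demo_shared_resources()` from a program's `main` should return 0 on a typical Linux system. Older toolchains may need `-pthread` when compiling.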
Ah I see, my mistake. I thought: thread that never exits to user space, i.e. only runs in kernel space == kernel thread. The distinction makes a lot of sense though. I assume this is done for permission management on files (credentials) and zero copy stuff (mm)? Until now I had only noticed the difference through io_uring's worker threads showing up in traces with tracy or perf, which did not seem to be the case for other (or what I thought were other) kernel threads.
Sq polling introduces a dedicated kernel thread that performs essentially all
submission and completion related tasks from fetching SQEs from the SQ,
submitting requests, polling requests, if configured for I/O poll and posting
CQEs. Notably, async punt requests are still processed by the IO WQ, to not
See above explanation of how requests are issued and when io-wq is actually used, applies here too.
man/io_uring_internals.7
Outdated
.SH IO Work Queue
.PP

The IO WQ is a kernel thread pool used to execute any requests that can not be
io-wq is a pool of io threads [...]
man/io_uring_internals.7
Outdated
request on to a IO WQ thread that then performs the blocking submission. While
this mechanism ensures that
.IR io_uring ,
unlike e.g. AIO, never blocks on any of the submission paths, it is, as the
in any of the submission paths.
Stop the sentence there. And then I don't understand what the rest of that original sentence is trying to say?
Yeah. Cutting the sentence there makes sense. What I tried to say there is mostly redundant with the following sentences anyway: that one could probably have guessed that it's not ideal from the name alone. It is called the async punt for a reason. It's a fallback. There is probably a reason why it is not the first thing that is attempted. Anyway, not really a reason why that comment should be here.
name of this mechanism, the async punt, suggests not ideal. The blocking
nature of the submission, the passing of the request to another thread, as
well as the scheduling of the IO WQ threads are all ideally avoided
overheads. Significant IO WQ activity can thus be seen as an indicator that
This is very true.
@axboe Thanks for the notes! I left some comments and will type up the corrections tomorrow if I get to it.
Adds a man page with details about the inner workings of io_uring that are likely to be useful for users as they relate to frequently misused flags of io_uring such as IOSQE_ASYNC and the taskrun flags. This mostly describes what needs to be done on the kernel side for each request, who does the work and most notably what the async punt is. Signed-off-by: Constantin Pestka <[email protected]>
3ea13c6 to f7338fd
Ok, finally got around to addressing your comments @axboe. The main things, I guess, are: I added a small section explaining the io threads separately and expanded the explanation of the scenarios during submission. Also a couple of other minor rewordings etc. Let me know if I missed something or got something wrong in the corrections :)
Thanks, I'll take another look!
requests for submissions and process arrived completions within the same
.IR io_uring_enter (2)
call. Applications can set the flag
.I IORING_ENTER_GETEVENTS
For someone who doesn't have a deep understanding of io_uring (me), it is still not clear what IORING_ENTER_GETEVENTS actually does. The wording says that it allows processing completions in io_uring_enter; how are completions processed when this flag is missing, then?
This adds the first of the new man 7 pages suggested in #1241
It contains a high-level overview of what needs to be done on the kernel side for all requests, who does the work with a given configuration, explains the async punt and describes what io_uring kernel threads exist in which case.
Let me know if I got anything wrong, something is missing, etc.
Also, wasn't sure about the name of the page.