[GIT PULL] man/io_uring_internal: Man page about high lvl inner workings of io_uring #1256

Open
wants to merge 1 commit into
base: master
from


CPestka
Contributor

@CPestka CPestka commented Oct 5, 2024

This adds the first of the new man 7 pages suggested in #1241

It contains a high-level overview of what needs to be done on the kernel side for all requests and who does the work with a given configuration, explains the async punt, and describes which io_uring kernel threads exist in which case.

Let me know if I got anything wrong, if something is missing, etc.
Also, I wasn't sure about the name of the page.


git request-pull output:

The following changes since commit 206650ff72b6ea4d76921f9c91ebfffd9902e6a0:

  test/fixed-hugepage: skip test on -ENOMEM (2024-09-27 10:27:10 -0600)

are available in the Git repository at:

  https://github.com/CPestka/liburing man_internals

for you to fetch changes up to fc5266b9626ac9798dce455f031871f4817c7cea:

  man/io_uring_internal: Add man page about relevant internals for users (2024-10-05 15:00:14 +0200)

----------------------------------------------------------------
CPestka (1):
      man/io_uring_internal: Add man page about relevant internals for users

 man/io_uring_internals.7 | 225 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 225 insertions(+)
 create mode 100644 man/io_uring_internals.7


Pull Request Guidelines

  1. To make everyone easily filter pull request from the email
    notification, use [GIT PULL] as a prefix in your PR title.
[GIT PULL] Your Pull Request Title
  2. Follow the commit message format rules below.
  3. Follow the Linux kernel coding style (see: https://github.com/torvalds/linux/blob/master/Documentation/process/coding-style.rst).

Commit message format rules:

  1. The first line is title (don't be more than 72 chars if possible).
  2. Then an empty line.
  3. Then a description (may be omitted for truly trivial changes).
  4. Then an empty line again (if it has a description).
  5. Then a Signed-off-by tag with your real name and email. For example:
Signed-off-by: Foo Bar <[email protected]>

The description should be word-wrapped at 72 chars. Some things should
not be word-wrapped. They may be some kind of quoted text - long
compiler error messages, oops reports, Link, etc. (things that have a
certain specific format).

Note that all of this goes in the commit message, not in the pull
request text. The pull request text should introduce what this pull
request does, and each commit message should explain the rationale for
why that particular change was made. The git tree is canonical source
of truth, not github.

Each patch should do one thing, and one thing only. If you find yourself
writing an explanation for why a patch is fixing multiple issues, that's
a good indication that the change should be split into separate patches.

If the commit is a fix for an issue, add a Fixes tag with the issue
URL.

Don't use GitHub anonymous email like this as the commit author:

Use a real email address!

Commit message example:

src/queue: don't flush SQ ring for new wait interface

If we have IORING_FEAT_EXT_ARG, then timeouts are done through the
syscall instead of by posting an internal timeout. This was done
to be both more efficient, but also to enable multi-threaded use
the wait side. If we touch the SQ state by flushing it, that isn't
safe without synchronization.

Fixes: https://github.com/axboe/liburing/issues/402
Signed-off-by: Jens Axboe <[email protected]>

By submitting this pull request, I acknowledge that:

  1. I have followed the above pull request guidelines.
  2. I have the rights to submit this work under the same license.
  3. I agree to a Developer Certificate of Origin (see https://developercertificate.org for more information).

@CPestka CPestka changed the title from "man/io_uring_internal: Man page about high lvl inner workings of io_uring" to "[GIT PULL] man/io_uring_internal: Man page about high lvl inner workings of io_uring" Oct 5, 2024
@axboe
Owner

axboe commented Oct 6, 2024

Thanks for kicking this off! I'll add some comments in the diff.

.PP
.B io_uring
is a linux specific, asynchronous API that allows the submission of requests to
the kernel that are typically otherwise performed via a syscall. Requests are
Owner

Typically yes, but not exclusively. Not sure if it bears mentioning or not...

Contributor Author

Yeah, generally the first couple sentences here could imo be shortened or cut, as they are kind of just introductory rambling. Mostly wanted to just get to these two sentences:

An important detail here is that after a request has been submitted to the kernel some CPU time has to be spent in kernel space to perform the
required submission and completion related tasks.
The mechanism used to provide this CPU time, as well as what process does so
and when is different in
.I io_uring
than for the traditional API provided by regular syscalls.

.I Submission Queue
(SQ) and completion notifications are passed back to the application via the
.I Completion Queue
(CQ). An important detail here is that after a request has been submitted to
Owner

Not sure that sentence reads that well, I'm having a hard time trying to make sense of it.

The tasks required in kernel space on the submission side are mostly checking
the SQ for newly arrived SQEs, parsing and check them for validity and
permissions and then passing them on to the responsible system, such as a
block device driver. An important note here is that
Owner

I'd mention "such as a block device driver, networking stack, etc" or something like that. Don't want to make this sound storage centric.

Contributor Author

Yeah, will do.

If this fails, e.g. due to the respective system not supporting non-blocking
submissions,
.I io_uring
will
Owner

This isn't fully accurate. If an IO can be issued in a non-blocking fashion, then one of two things can happen:

1) It's done. Examples of this would be writing to a pipe/socket (for example), reading from a pipe/socket, or reading/writing to/from a regular file where the result is either in the page cache already (for a read), or io_uring was able to just copy it to the page cache (for a write). For these cases, a CQE will be posted immediately, even before io_uring_enter(2) returns.

2a) It wasn't done, but submitted async. io_uring will get a callback at some point when the operation completes, and a CQE will be posted. Examples of this are async reads/writes to a storage device.

2b) It wasn't done, but the file in question can signal readiness for when the operation can be retried. Examples of this are any pollable file, like a pipe, socket, etc. When io_uring receives the callback that data can now be read/written, it will retry the operation. Importantly, this retry happens from the task that submitted the IO. There's no async thread involved in this operation.

2c) It wasn't done, and the file has limited async support. Eg it cannot signal when it's ready to do IO. For this case, and only this case, does io_uring punt to an async worker to do the IO.

I don't want to imply that io_uring just willy nilly punts to async workers, as that is not the case, and that would not be very efficient. It's a last resort kind of thing, for when the driver / file type is pretty basic and doesn't support more than very basic primitives.

Now, for the application, it doesn't really matter which of the 2 cases end up happening, as completions are posted as it expects. But for efficiency reasons, it very much does matter, and there's a common theme where people assume that io_uring is just a thread work pool. That is very much WRONG, and this man page should not perpetuate that myth, it should help clear up the misunderstanding.
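
The cases above can be read as a decision order. The following is only a minimal sketch of that order, not kernel code: every identifier in it (`issue_outcome`, `file_caps`, `classify_issue`) is hypothetical and exists neither in the kernel nor in liburing.

```c
#include <stdbool.h>

/* Hypothetical model of the outcomes listed above; not a kernel or liburing API. */
enum issue_outcome {
    DONE_INLINE,     /* case 1: CQE posted before io_uring_enter(2) returns */
    ASYNC_CALLBACK,  /* case 2a: submitted async, CQE posted on completion callback */
    POLL_RETRY,      /* case 2b: wait for readiness, retry from the submitting task */
    PUNT_TO_WORKER   /* case 2c: last resort, hand the request to an async worker */
};

/* Assumed, simplified description of what the target file supports. */
struct file_caps {
    bool native_async;  /* e.g. async reads/writes to a storage device */
    bool pollable;      /* e.g. pipe or socket that can signal readiness */
};

/* Returns which path a non-blocking issue attempt ends up on. */
enum issue_outcome classify_issue(bool completed_inline, struct file_caps caps)
{
    if (completed_inline)
        return DONE_INLINE;
    if (caps.native_async)
        return ASYNC_CALLBACK;
    if (caps.pollable)
        return POLL_RETRY;
    /* Only files with limited async support reach the worker pool. */
    return PUNT_TO_WORKER;
}
```

In this model a socket read that finds no data maps to `POLL_RETRY`, and only a file with neither capability maps to `PUNT_TO_WORKER`, which matches the point that the worker pool is a last resort.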

Contributor Author

I forgot that 1. is even a thing when writing this :D, so this should definitely be mentioned explicitly.
Also explaining 2a) and 2b) in more detail like this is probably a good idea.

.SH The Completion Side Work
.PP

The tasks required in kernel space on the completion side mostly come in the
Owner

General theme - task is not the best word to use, because it implies a relationship to a thread/process. Not sure what's a better word to use here, just tossing it out there.

Contributor Author

Yeah true. Many of the alternatives are also overloaded. I guess "work" would be better. The heading already uses it.

was to reduce or entirely avoid the overheads of syscalls to provide the
required CPU time in kernel space. The mechanism that
.I io_uring
utilizes to achieve this differs depending on the configuration with different
Owner

utilize

Contributor Author

I think utilizes is correct here?

utilizes to achieve this differs depending on the configuration with different
trade-offs between configurations in respect to e.g. CPU efficiency and latency.

With the default configuration the primary mechanism to provide the kernel space
Owner

Not sure what this "provide the kernel space CPU time" means here?

Contributor Author

"Someone" needs to run some code in kernel space (to do the kernel-side submission, post the CQE, etc.), be that one of the submitting processes after the context switch during a syscall, the SQ poll thread, or, to a limited extent, the io-wq threads. So "someone", e.g. the caller of io_uring_enter or the SQ poll thread, would "provide the kernel space CPU time" and use it to run the relevant code in kernel space. That's how I have been thinking about this, but yeah, maybe not the best wording...

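To make the "who provides the kernel space CPU time" idea concrete, here is a minimal sketch of the mapping described in this thread. It is an illustration under assumptions, not the kernel's logic: `MODEL_SETUP_SQPOLL` is a local stand-in for the real `IORING_SETUP_SQPOLL` setup flag, and the `executor` enum and `submission_executor` function are hypothetical names.

```c
#include <stdbool.h>

/* Local stand-in for the real IORING_SETUP_SQPOLL setup flag (assumption). */
#define MODEL_SETUP_SQPOLL (1u << 1)

/* Hypothetical labels for whoever runs the kernel-side work of a request. */
enum executor {
    SUBMITTING_TASK,  /* the caller of io_uring_enter(2) */
    SQ_POLL_THREAD,   /* the dedicated SQ polling thread */
    IO_WQ_WORKER      /* an io-wq worker, only for punted requests */
};

/* Returns who runs the submission work for one request in this model. */
enum executor submission_executor(unsigned setup_flags, bool punted)
{
    /* Punted requests go to io-wq regardless of the ring configuration. */
    if (punted)
        return IO_WQ_WORKER;
    if (setup_flags & MODEL_SETUP_SQPOLL)
        return SQ_POLL_THREAD;
    return SUBMITTING_TASK;
}
```

With the default configuration (`setup_flags == 0`) the model returns `SUBMITTING_TASK`, i.e. the process that made the syscall pays for the work itself.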
optionally wait until a specified amount of completions have arrived before
returning.

If polled I/O is used all completion related work is performed during the
Owner

I like this section!

.SH Submission Queue Polling
.PP

Sq polling introduces a dedicated kernel thread that performs essentially all
Owner

Here, and in other spots, it's important to note that io_uring does NOT utilize any kernel threads. In Linux, a kernel thread is a special kind of thread that is entirely decoupled from any other real process/thread running in userspace. It doesn't have any files, mm, etc associated with it.

What io_uring uses are "io threads", which are exactly like a thread created with eg pthread_create() in the sense that they share any resources that the original task has, and any credentials, namespaces, etc. The only thing that makes them different is that they are created by io_uring, and they never exit to userspace. They sit around and do work, if needed, and then go away when they are no longer needed.

Hence I would probably explain this io thread concept when it's initially encountered in this man page, and then subsequently refer to io threads and remove any mention of kernel threads.

Contributor Author

Ah I see, my mistake. I thought: a thread that never exits to user space, i.e. only runs in kernel space == kernel thread. The distinction makes a lot of sense though. I assume this is done for permission management on files (credentials) and zero copy stuff (mm)? Until now I had only noticed the difference through io_uring's worker threads showing up in traces with tracy or perf, which did not seem to be the case for other kernel threads (or what I thought were kernel threads).

Sq polling introduces a dedicated kernel thread that performs essentially all
submission and completion related tasks from fetching SQEs from the SQ,
submitting requests, polling requests, if configured for I/O poll and posting
CQEs. Notably, async punt requests are still processed by the IO WQ, to not
Owner

See above explanation of how requests are issued and when io-wq is actually used, applies here too.

.SH IO Work Queue
.PP

The IO WQ is a kernel thread pool used to execute any requests that can not be
Owner

io-wq is a pool of io threads [...]

request on to a IO WQ thread that then performs the blocking submission. While
this mechanism ensures that
.IR io_uring ,
unlike e.g. AIO, never blocks on any of the submission paths, it is, as the
Owner

in any of the submission paths.

Stop the sentence there. And then I don't understand what the rest of that original sentence is trying to say?

Contributor Author

Yes. Cutting the sentence there makes sense. What I tried to say there is mostly redundant with the following sentences anyway: one could probably have guessed from the name alone that it's not ideal. It is called the async punt for a reason. It's a fallback, and there is probably a reason why it is not the first thing that is attempted. Anyway, not really a reason why that comment should be here.

name of this mechanism, the async punt, suggests not ideal. The blocking
nature of the submission, the passing of the request to another thread, as
well as the scheduling of the IO WQ threads are all ideally avoided
overheads. Significant IO WQ activity can thus be seen as an indicator that
Owner

This is very true.

@CPestka
Contributor Author

CPestka commented Oct 12, 2024

@axboe Thanks for the notes! I left some comments and will type up the corrections tomorrow if I get to it.

Adds a man page with details about the inner workings of io_uring that
are likely to be useful for users as they relate to frequently misused
flags of io_uring such as IOSQE_ASYNC and the taskrun flags. This
mostly describes what needs to be done on the kernel side for each
request, who does the work and most notably what the async punt is.

Signed-off-by: Constantin Pestka <[email protected]>
@CPestka
Contributor Author

CPestka commented Oct 20, 2024

OK, finally got around to addressing your comments @axboe. The main things, I guess, are: I added a small section explaining the io threads separately and expanded the explanation of the scenarios during submission. Also a couple of other minor rewordings etc. Let me know if I missed something or got something wrong in the corrections :)

@axboe
Owner

axboe commented Oct 22, 2024

Thanks, I'll take another look!

requests for submissions and process arrived completions within the same
.IR io_uring_enter (2)
call. Applications can set the flag
.I IORING_ENTER_GETEVENTS

For someone who doesn't have a deep understanding of io_uring (me), it is still not clear what IORING_ENTER_GETEVENTS actually does. The wording says that it allows processing completions in io_uring_enter; how are completions processed when this flag is missing, then?
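
One possible reading of the flag, as a minimal sketch rather than the kernel's implementation: completions land in the CQ ring whether or not the flag is set, and the flag only decides whether io_uring_enter(2) also waits until a minimum number of them is available. The struct and function names below are hypothetical, `MODEL_ENTER_GETEVENTS` is a local stand-in for `IORING_ENTER_GETEVENTS`, and polled I/O and deferred task-run setups are deliberately left out of the model.

```c
#include <stdbool.h>

/* Local stand-in for the real IORING_ENTER_GETEVENTS flag (assumption). */
#define MODEL_ENTER_GETEVENTS (1u << 0)

/* Hypothetical, simplified view of the CQ ring indices shared with the app. */
struct cq_model {
    unsigned head;  /* advanced by the application as it consumes CQEs */
    unsigned tail;  /* advanced by the kernel as it posts CQEs */
};

/* Number of posted but not yet consumed CQEs; unsigned math handles wraparound. */
unsigned cq_ready(const struct cq_model *cq)
{
    return cq->tail - cq->head;
}

/* In this model the call only waits if the flag is set and too few CQEs are ready. */
bool enter_would_wait(const struct cq_model *cq, unsigned enter_flags,
                      unsigned min_complete)
{
    if (!(enter_flags & MODEL_ENTER_GETEVENTS))
        return false;  /* min_complete is ignored without the flag */
    return cq_ready(cq) < min_complete;
}
```

Under these assumptions, an application that never sets the flag can still find completions by comparing head and tail of the CQ ring; it just does not get to sleep in the kernel until they arrive.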
