Run copy codec first to get stuff cached #27

Open
nemequ opened this issue Oct 3, 2015 · 1 comment

Comments


nemequ commented Oct 3, 2015

The first codec to run may be at a small disadvantage on the first run, since the kernel is unlikely to have the file cached yet. It's probably quite a small effect: the operation is usually run multiple times, and most of the time won't be counted as CPU time anyway. Still, it would be easy to just run the copy codec once first for each file being benchmarked.
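
For illustration, a minimal cache-warming sketch (not the actual squash-benchmark code; the `warm_cache` name and buffer size are just placeholders) could look something like this:

```c
/* Hypothetical sketch: read the whole file once so the kernel page cache
 * is warm before any codec is timed. Names are illustrative only. */
#include <stdio.h>

static void warm_cache (const char* path) {
  FILE* f = fopen (path, "rb");
  if (f == NULL)
    return;

  char buf[64 * 1024];
  /* Discard the data; the point is just to pull the file into the page cache. */
  while (fread (buf, 1, sizeof (buf), f) > 0)
    ;

  fclose (f);
}
```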


travisdowns commented Jan 4, 2017

FWIW, I've recently done some tests on this (this mostly follows from the discussion in squash 131, but I think the further discussion belongs here, since this issue is open and the changes, I think, need to go in the benchmark).

The results (read-only) on kernel 4.4 are basically that IO adds:

  • ~100 ms of time per 1 GB if you directly mmap the input file
  • ~120 ms of time per 1 GB if you read() the files

Note that these times are very similar, but they actually come from very different sources:

(1) The mmap time is divided between two sources: the kernel work to actually map the pages, and the user-side work to fault in and/or TLB-miss the resulting pages in the process's address space. On the kernel side the work is basically setting up the VMA and PTE entries, with the PTEs being the important cost: the VMA entry is one per mmap while PTEs are one per page, so the per-page overhead dominates once the mmap is reasonably large. This work costs ~100 ms per GB on recent (4.4-ish) kernels, roughly constant over a large variety of mmap sizes, from say 1 MB to 1,000 MB. If you go much smaller than 1 MB, you start to see increasing overhead from the per-mmap call cost, but the effect is gradual: even at 0.1 MB it only slows down by about 30%, and at 0.01 MB by about 100%.

The user-land cost is basically faulting in each page as it is accessed. Since pages are 4K, in principle you take a fault every 4K of data. In practice, on newer kernels you actually take a fault only every 16 pages, due to faultaround.
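
As a rough illustration of the pattern being measured (just a sketch, not code from the benchmark; `sum_mapped`, the 4K stride, and the abbreviated error handling are assumptions for clarity), mapping a file and touching one byte per page looks roughly like this:

```c
/* Minimal sketch: map a file read-only and touch one byte per 4K page,
 * which is where the fault-in cost discussed above shows up. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

uint64_t sum_mapped (const char* path) {
  int fd = open (path, O_RDONLY);
  struct stat st;
  fstat (fd, &st);

  const unsigned char* p = mmap (NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  /* p == MAP_FAILED should be checked in real code. */

  uint64_t sum = 0;
  for (off_t off = 0; off < st.st_size; off += 4096)
    sum += p[off];   /* each new page triggers a (possibly batched) page fault */

  munmap ((void*) p, st.st_size);
  close (fd);
  return sum;
}
```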

You can use MAP_POPULATE to reduce the number of page faults in the consuming code to essentially zero, at the cost of a (much longer) mmap call. Here the kernel basically populates all the PTEs during the mmap call, so there are no faults later. Userspace still needs one TLB fill for each 4K page, but this doesn't involve the kernel, and modern x86 hardware is pretty good at it (next-page prefetch plus more than one hardware page walker). On other hardware the TLB misses may have a bigger impact, but it depends on page size and hardware capabilities.
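
A sketch of the MAP_POPULATE variant, under the same assumptions as above (illustrative names, abbreviated error handling; MAP_POPULATE itself is Linux-specific):

```c
/* Sketch: with MAP_POPULATE the mmap() call itself becomes much slower,
 * but the PTEs are pre-populated, so the later access loop takes
 * essentially no page faults. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

void* map_populated (const char* path, size_t* size_out) {
  int fd = open (path, O_RDONLY);
  struct stat st;
  fstat (fd, &st);

  void* p = mmap (NULL, st.st_size, PROT_READ,
                  MAP_PRIVATE | MAP_POPULATE, fd, 0);

  close (fd);            /* the mapping stays valid after close() */
  *size_out = st.st_size;
  return p;              /* MAP_FAILED on error */
}
```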

MAP_POPULATE also physically reads the file if that's necessary. This doesn't actually seem to speed things up much (perhaps 5%), because fault handling is apparently very fast (a few hundred ns) and faultaround already reduces the fault count by 16x.

(2) read() with a fixed buffer, on the other hand, has a totally different set of costs. Here userland doesn't take any page faults at all beyond the cost of initially faulting in the buffer (which, for a smallish buffer, is microscopic compared to faulting in the entire file). The cost is instead that the kernel has to copy the data from the page cache into the userland buffer (whereas mmap effectively gives you "zero copy" reads). Beyond the copy itself there are other costs, such as validating the userspace buffer and looking up the pages in the page cache, but at least the latter is probably roughly shared with the mmap path. The copy used to be quite slow on x86, because the kernel doesn't use SSE/AVX registers; that rules out the fastest memcpy implementations, since you end up copying 8 bytes at a time rather than 16 or 32. More recently, though, tricks have been added to use REP MOVS, which can indirectly take advantage of the wide internal data paths the fastest algorithms use (at some startup cost).
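
For comparison, a sketch of the read()-into-a-fixed-buffer pattern described above (again, names, buffer size, and the checksumming are illustrative, not from the benchmark):

```c
/* Sketch: the fixed buffer is faulted in once, and on every iteration the
 * kernel copies data from the page cache into it. */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

uint64_t sum_read (const char* path) {
  int fd = open (path, O_RDONLY);
  if (fd < 0)
    return 0;

  static unsigned char buf[256 * 1024];  /* fixed, reused buffer */
  uint64_t sum = 0;
  ssize_t n;

  while ((n = read (fd, buf, sizeof (buf))) > 0) {
    for (ssize_t i = 0; i < n; i += 4096)
      sum += buf[i];                     /* touch the data so it isn't optimized away */
  }

  close (fd);
  return sum;
}
```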

Overall, read() ended up slightly slower, at about ~8 GB/s versus ~10 GB/s for mmap, but on other kernels or hardware the results could flip, depending on what features the kernel implements and how the hardware behaves. A "pure" memcpy in user space runs at about 20 GB/s, so in-memory (fully buffered) IO is still about half the speed of pure memory access. Later kernels (4.8 or so) open up other options, such as large-page support for tmpfs and memfd, which could change all of this.
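
If it's useful, here is a rough sketch of how one might measure that user-space memcpy baseline (the 1 GB buffer size and single-shot timing are simplifications; a real measurement would do warm-up and multiple iterations):

```c
/* Sketch: copy a large pre-faulted buffer once and report GB/s. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main (void) {
  const size_t size = 1024 * 1024 * 1024;  /* 1 GB */
  char* src = malloc (size);
  char* dst = malloc (size);
  if (src == NULL || dst == NULL)
    return 1;

  memset (src, 1, size);                   /* fault both buffers in first */
  memset (dst, 2, size);

  struct timespec t0, t1;
  clock_gettime (CLOCK_MONOTONIC, &t0);
  memcpy (dst, src, size);
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  printf ("%.2f GB/s\n", size / secs / 1e9);

  free (src);
  free (dst);
  return 0;
}
```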
