Add an in-memory mode #131

travisdowns · 2015-09-26T07:36:08Z

For slow codecs, the exact details of IO aren't that important, as long as you aren't doing physical disk IO (i.e., as long as the reads are in the page cache, and the writes are effectively hidden by the page cache and delayed writeback).

For fast codecs, say those around 1 GiB/s or faster, the exact details of how the IO is performed can matter a lot. Especially on the write side, there are a lot of gremlins and non-determinism in how a modern OS like Linux will handle IO. For example, something like the write() syscall typically writes into the page cache, but not to the disk. Later, the OS will flush to the disk, but exactly when that happens depends on many factors such as how much time has elapsed, how much pressure there is for free pages, how many pages are dirty, etc. Sometimes writes may block on actual IO if you have enough dirty pages in the cache - so a typical pattern is that the first N MBs of data written to a quiet system is very fast, followed by a crash in speed down to close to the underlying disk speed once you hit some number of dirty pages. Sometimes the crash is well below the actual IO speed, depending on how the queued IO is handled, and then it covers back to the physical speed.

Well the exact details don't matter, but safe to say that getting repeatable IO performance can be tricky.

In light of this, it might make sense to have an enhancement which allows an "in memory" mode, at least for the purpose of benchmarking (which IMO is one of the really valuable things about squash). The write side can be purely in memory, as you can just "sink" the output of the codec to nowhere (perhaps you want to checksum it?). On the read side, it may not be feasible to buffer the entire file in memory, but perhaps that's fine for smaller files?

nemequ · 2015-09-26T16:08:16Z

I agree that I/O is complicated, but if people want a purely in-memory mode all they have to do is not use the I/O or splicing APIs. Actually, what you're suggesting can't really be done with the I/O or splicing APIs since they take FILE*s, but for a purely in-memory test you would need to provide input and output buffers—i.e. use squash_compress and squash_decompress.

I'm closing this issue, but if you want you can open up an issue for the benchmark. My response will be:

First off, most of the issues around exactly how I/O is handled by the kernel are glossed over by using CPU time when benchmarking. The current issue with a performance regression when using memory-mapped files on x86 (which wasn't there a couple months ago) aside, I think CPU time does a pretty good job of eliminating I/O from the equation.

I would be willing to add a flag to the benchmark to cause it to read the entire file into a malloc()ed buffer, malloc() another buffer for the output, then simply time squash_compress/squash_decompress operations, but I can't make that the default. Some of the larger files will fail with some codecs on machines with less memory. Until I added the mmap code to the old file-to-file API it would do that, and the OOM-killer would routinely strike it down.

Basically, this could be used for people wanting to run this benchmark code themselves. I feel like that is a bigger audience than it probably should be… If you are running a benchmark yourself you should probably be doing it in your application, with your data. Squash makes this really easy—all you have to do to change the codec is change a single string. The only real exception, IMHO, is codec developers who are trying to optimize their implementation.

travisdowns · 2015-10-02T23:05:07Z

Ugh, I had totally forgotten than the benchmark is a separate project when I wrote that. You are completely correctly the the request doesn't make sense in the context of the squash API. I really meant for it to be a feature of the benchmark as you pointed out.

The problem of IO-incurred CPU time is only glossed over for codecs that aren't "extremely fast". For example, perhaps some IO strategy incurs 100 ms of CPU time per 1 GB read. Effectively, the IO itself can run at 10 GB/s (where time is measured CPU time, not wall time). For a codec running at 100 MB/s, or even 1 GB/s, that's just a small adjustment to the timing (about 1% error in the former case and about 10% in the latter). For a codec otherwise running at 20 GB/s, however, the benchmark would only report ~6.67 GBs, or about one third of the true value.

Such fast codecs are not likely to be used with actual typical IO devices, but rather in-memory or over some other very fast interface, so the true number is useful.

So perhaps I can re-open this against the benchmark at some point. There are a few options beyond the entire-file-in-buffer + time full buffer API approach. For example, you could still call the streaming API but measure the sum of time taken in each API call, or the IO could be run on a separate thread or process and only the thread/process running the plugin could be measured.

nemequ · 2015-10-03T00:19:07Z

The problem of IO-incurred CPU time is only glossed over for codecs that aren't "extremely fast". For example, perhaps some IO strategy incurs 100 ms of CPU time per 1 GB read. Effectively, the IO itself can run at 10 GB/s (where time is measured CPU time, not wall time). For a codec running at 100 MB/s, or even 1 GB/s, that's just a small adjustment to the timing (about 1% error in the former case and about 10% in the latter). For a codec otherwise running at 20 GB/s, however, the benchmark would only report ~6.67 GBs, or about one third of the true value.

I have some doubts about this, but testing would be needed (surprise!). AFAIK the kernel tends to be pretty good about keeping stuff cached, so my guess would be that even for very fast codecs the time required is insignificant. Of course this breaks down when the machine doesn't have enough memory to keep the file around, but trying to do everything in-memory will not work well there anyways—you'll end up thrashing and it would be much worse than if you just mmaped the file.

This does bring up an interesting point that the first plugin to run may have a small disadvantage. Perhaps for each file we should run the copy codec once first (and discard the results) to try to get the kernel to cache stuff. Filed as quixdb/squash-benchmark#27.

So perhaps I can re-open this against the benchmark at some point. There are a few options beyond the entire-file-in-buffer + time full buffer API approach. For example, you could still call the streaming API but measure the sum of time taken in each API call,

Perhaps, but codecs which only support the all-at-once API will basically degrade to placing the entire file in a buffer, they would just also have the overhead of an extra memcpy and potentially some reallocs to get to that point.

or the IO could be run on a separate thread or process and only the thread/process running the plugin could be measured.

I'm already planning on using the POSIX AIO API in the future (see #127). I'm not sure how helpful it will be for CPU time but wall-clock should be a big win, and it should be much more elegant than a thread.

Anyways, definitely feel free to open this back up in the benchmark. It wouldn't be very difficult to implement.

travisdowns · 2015-10-03T03:29:42Z

Right, in the mmap read case, you might get "free" IO (actually still some loss in some scenarios due to no huge pages in the page cache) - but that doesn't apply to read() or write() syscalls (those definitely incur CPU time per byte) and in some cases not to mmap writes (that case is relatively complex).

Agree it's all just speculation until there are some real numbers and I'll try to get some (I don't have the wide variety of hardware you have to test on but I can certainly try on Intel/Linux and OSX.

On Oct 2, 2015, at 5:19 PM, Evan Nemerson [email protected] wrote:

The problem of IO-incurred CPU time is only glossed over for codecs that aren't "extremely fast". For example, perhaps some IO strategy incurs 100 ms of CPU time per 1 GB read. Effectively, the IO itself can run at 10 GB/s (where time is measured CPU time, not wall time). For a codec running at 100 MB/s, or even 1 GB/s, that's just a small adjustment to the timing (about 1% error in the former case and about 10% in the latter). For a codec otherwise running at 20 GB/s, however, the benchmark would only report ~6.67 GBs, or about one third of the true value.

I have some doubts about this, but testing would be needed (surprise!). AFAIK the kernel tends to be pretty good about keeping stuff cached, so my guess would be that even for very fast codecs the time required is insignificant. Of course this breaks down when the machine doesn't have enough memory to keep the file around, but trying to do everything in-memory will not work well there anyways—you'll end up thrashing and it would be much worse than if you just mmaped the file.

This does bring up an interesting point that the first plugin to run may have a small disadvantage. Perhaps for each file we should run the copy codec once first (and discard the results) to try to get the kernel to cache stuff. Filed as quixdb/squash-benchmark#27.

So perhaps I can re-open this against the benchmark at some point. There are a few options beyond the entire-file-in-buffer + time full buffer API approach. For example, you could still call the streaming API but measure the sum of time taken in each API call,

Perhaps, but codecs which only support the all-at-once API will basically degrade to placing the entire file in a buffer, they would just also have the overhead of an extra memcpy and potentially some reallocs to get to that point.

or the IO could be run on a separate thread or process and only the thread/process running the plugin could be measured.

I'm already planning on using the POSIX AIO API in the future (see #127). I'm not sure how helpful it will be for CPU time but wall-clock should be a big win, and it should be much more elegant than a thread.

Anyways, definitely feel free to open this back up in the benchmark. It wouldn't be very difficult to implement.

—
Reply to this email directly or view it on GitHub.

nemequ closed this as completed Sep 26, 2015

travisdowns mentioned this issue Jan 4, 2017

Run copy codec first to get stuff cached quixdb/squash-benchmark#27

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add an in-memory mode #131

Add an in-memory mode #131

travisdowns commented Sep 26, 2015

nemequ commented Sep 26, 2015

travisdowns commented Oct 2, 2015

nemequ commented Oct 3, 2015

travisdowns commented Oct 3, 2015

Add an in-memory mode #131

Add an in-memory mode #131

Comments

travisdowns commented Sep 26, 2015

nemequ commented Sep 26, 2015

travisdowns commented Oct 2, 2015

nemequ commented Oct 3, 2015

travisdowns commented Oct 3, 2015