scrypt: Adds parallel feature and max_memory argument #178

Open
wants to merge 1 commit into
base: master

fpgaminer

WARNING: Do NOT merge yet; discussion is required.

Overview

Hello! This is my first contribution to this project. I hope it is helpful. I welcome comments and am happy to update my pull request as needed.

This pull request adds parallelism to the scrypt implementation, gated behind a new crate feature parallel. It uses rayon internally. The original scrypt function should work exactly as it did before, and does not use parallelism. Instead a new scrypt_parallel function is added which allows the user to specify num_threads.

scrypt_parallel also features a new max_memory argument that limits the memory usage of scrypt. Setting this to 1GB, for example, keeps the memory allocated for V at or below 1GB regardless of the scrypt params and number of threads. Total usage may exceed the limit slightly, on the order of hundreds of bytes, because the storage of B and temporary variables is not counted.

New tests and benchmarks were added to cover the new features.

Implementation details

The parallelism is straightforward and just uses rayon across the p parameter.
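As a rough illustration of what "parallel across p" means, here is a minimal sketch that uses std::thread::scope as a stand-in for rayon. The names toy_mix_lane, mix_sequential, and mix_parallel are hypothetical and are not taken from this pull request, and toy_mix_lane is a placeholder rather than the real ROMix. The point is only that B splits into p independent lanes of 128 * r bytes, so the lanes can be processed on separate threads without changing the result. It assumes r >= 1.

```rust
use std::thread;

/// Toy stand-in for running ROMix on one 128 * r byte lane of B.
/// This is NOT the real scrypt_ro_mix; it only gives each lane some work.
fn toy_mix_lane(lane: &mut [u8]) {
    for (i, byte) in lane.iter_mut().enumerate() {
        *byte = byte.wrapping_mul(31).wrapping_add(i as u8);
    }
}

/// Sequential reference: process the p lanes one after another.
fn mix_sequential(b: &mut [u8], r: usize) {
    for lane in b.chunks_mut(128 * r) {
        toy_mix_lane(lane);
    }
}

/// Parallel version: split the p lanes into at most `num_threads` groups
/// and process each group on its own scoped thread.
fn mix_parallel(b: &mut [u8], r: usize, num_threads: usize) {
    let mut lanes: Vec<&mut [u8]> = b.chunks_mut(128 * r).collect();
    if lanes.is_empty() {
        return;
    }
    let threads = num_threads.max(1);
    // Ceiling division so that no more than `threads` groups are created.
    let group_size = (lanes.len() + threads - 1) / threads;
    thread::scope(|s| {
        for group in lanes.chunks_mut(group_size) {
            s.spawn(move || {
                for lane in group.iter_mut() {
                    toy_mix_lane(&mut **lane);
                }
            });
        }
    });
}
```

Because the lanes never overlap, the borrow checker accepts handing each group to a different thread, and the output matches the sequential version for any num_threads.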

max_memory is a bit more involved. This is accomplished through the use of a memory-compute trade off for the scrypt algorithm. romix::scrypt_ro_mix has been modified with a log_f argument. At log_f = 0 it operates just as it did before. At log_f=1 it uses half the memory at the cost of more compute. And so forth.
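The trade-off can be sketched on a toy model. The code below is an assumption about how such a scheme typically works, not a copy of the changes to romix in this pull request: it stores only every 2^log_f-th entry of V and recomputes the missing entries from the nearest earlier checkpoint. toy_block_mix and toy_ro_mix are hypothetical names, and the "block" is a single u64 mixed by an arbitrary function instead of Salsa20/8 BlockMix.

```rust
/// Toy stand-in for BlockMix on a single u64 "block" (NOT Salsa20/8).
fn toy_block_mix(x: u64) -> u64 {
    let x = x ^ (x >> 31);
    x.wrapping_mul(0x9E37_79B9_7F4A_7C15)
        .rotate_left(17)
        .wrapping_add(1)
}

/// ROMix over the toy block type, storing only every 2^log_f-th entry of V.
/// Returns the final block and the number of toy_block_mix calls performed.
fn toy_ro_mix(b: u64, log_n: u32, log_f: u32) -> (u64, u64) {
    assert!(log_f <= log_n);
    let n = 1usize << log_n;
    let f = 1usize << log_f;
    let mut ops = 0u64;

    // Phase 1: walk the chain V[0], V[1], ..., keeping only the checkpoints.
    let mut v = Vec::with_capacity(n >> log_f);
    let mut x = b;
    for i in 0..n {
        if i % f == 0 {
            v.push(x);
        }
        x = toy_block_mix(x);
        ops += 1;
    }

    // Phase 2: data-dependent lookups. A missing V[j] is rebuilt by
    // re-running the chain from the checkpoint at or before index j.
    for _ in 0..n {
        let j = (x as usize) & (n - 1);
        let mut vj = v[j >> log_f];
        for _ in 0..(j & (f - 1)) {
            vj = toy_block_mix(vj);
            ops += 1;
        }
        x = toy_block_mix(x ^ vj);
        ops += 1;
    }
    (x, ops)
}
```

Calling toy_ro_mix with the same input and different log_f values returns the same final block, while the checkpoint vector shrinks to n >> log_f entries and the operation count grows.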

It's easiest to understand log_f in terms of the size of V and the computational cost of scrypt_ro_mix. V must be n >> log_f blocks in length. The total number of BlockMix operations performed by scrypt_ro_mix is ops(n, log_f) = 2 * n + 0.5 * n * (2**log_f - 1).
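To make the numbers concrete, the two quantities above can be written as small helper functions. The names v_blocks and ops_for are hypothetical and exist only for this illustration; the formulas are the ones stated above.

```rust
/// Number of blocks V must hold: n >> log_f, with n = 2^log_n.
fn v_blocks(log_n: u32, log_f: u32) -> u64 {
    (1u64 << log_n) >> log_f
}

/// BlockMix count from the formula above:
/// ops(n, log_f) = 2 * n + 0.5 * n * (2^log_f - 1).
fn ops_for(log_n: u32, log_f: u32) -> f64 {
    let n = (1u64 << log_n) as f64;
    let f = (1u64 << log_f) as f64;
    2.0 * n + 0.5 * n * (f - 1.0)
}
```

In units of n, the cost is 2n at log_f = 0, 2.5n at log_f = 1, 3.5n at log_f = 2, and 5.5n at log_f = 3, while V shrinks by half at each step.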

The addition of this log_f functionality allows us to implement the max_memory argument. This means any scrypt parameters can be computed even on machines that otherwise don't have the memory for it. It's also very useful for parallelism as the memory usage of scrypt normally scales linearly with the number of threads used (since each thread needs its own V).

scrypt_log_f is the new workhorse, implementing the higher level scrypt algorithm, with the addition of num_threads to control parallelism and log_f to control memory usage.

scrypt now calls scrypt_log_f internally with log_f = 0 and num_threads = 1 to emulate previous behavior.

scrypt_parallel calls scrypt_log_f internally as well, but it calculates log_f automatically from the provided max_memory parameter.

It's always better to use all available cores on a machine to compute scrypt, even if that means increasing log_f to make that possible. Consider for example computing scrypt(log_n=20, r=8, p=4) on a four core, 1GB machine. That machine only has enough memory to compute scrypt(log_n=20, r=8, p=1), so without log_f it can only use a single core. Increasing log_f to 2 allows the use of all four cores. It turns out that, despite log_f increasing the amount of work, it's still overall faster as that compute is offset by the additional cores that can work.

(See the section "log_f proof" for the proof. Note that we ignore the overhead bytes needed for B and temps, to simplify.)

So scrypt_parallel can use max_memory to automatically set log_f to the optimal value, assuming num_threads is less than or equal to the number of cores on the machine.
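A minimal sketch of how log_f could be derived from max_memory, under the same simplification as above (the bytes for B and temporaries are ignored). choose_log_f is a hypothetical helper written for this description; the actual calculation inside scrypt_parallel may differ.

```rust
/// Hypothetical helper: the smallest log_f such that each worker's V fits
/// in its share of `max_memory`. Returns None if no log_f in 0..=log_n
/// fits, or if the size computation would overflow a u64.
fn choose_log_f(log_n: u32, r: u64, p: u64, num_threads: u64, max_memory: u64) -> Option<u32> {
    // No more than p lanes can ever run at the same time.
    let workers = num_threads.min(p).max(1);
    let per_worker = max_memory / workers;
    // Full-size V: n blocks of 128 * r bytes each.
    let n = 1u64.checked_shl(log_n)?;
    let full_v = 128u64.checked_mul(r)?.checked_mul(n)?;
    // Each increment of log_f halves V; take the first value that fits.
    (0..=log_n).find(|&log_f| (full_v >> log_f) <= per_worker)
}
```

For the example above, choose_log_f(20, 8, 4, 4, 1 << 30) yields Some(2): V is 1GiB at log_f = 0, and four workers sharing 1GiB each get 256MiB, which requires halving V twice. The None case corresponds to the situation where max_memory is too small for the given parameters.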

Benchmarks

commit f8735676f0591ebfb2a59a2a076c2c3fa80cd5be
CARGO_PROFILE_BENCH_LTO=true CARGO_PROFILE_BENCH_CODEGEN_UNITS=1 RUSTFLAGS="-C target-cpu=native" cargo +nightly bench

test scrypt_15_8_1          ... bench:  56,664,081 ns/iter (+/- 3,567,575)
test scrypt_15_8_4          ... bench: 222,945,177 ns/iter (+/- 6,051,303)
test scrypt_parallel_15_8_4 ... bench:  72,227,336 ns/iter (+/- 4,555,001)


commit 02fcc37f63894cd699335c7f3f2e914dc577c7ea
CARGO_PROFILE_BENCH_LTO=true CARGO_PROFILE_BENCH_CODEGEN_UNITS=1 RUSTFLAGS="-C target-cpu=native" cargo +nightly bench
test scrypt_15_8_1 ... bench:  55,516,757 ns/iter (+/- 3,521,891)

Speed of scrypt remains unaffected by these changes (within the margin of error).

Discussion

I don't think the API I picked for the new scrypt_* functions is ideal. I'm not sure what the ideal API is for the new features. So feedback would be great!

Since log_f always has an optimal value based on desired memory usage and number of threads, it seems like scrypt_parallel presents the more useful API to the user. Hence why I hid scrypt_log_f.

But maybe some users might want direct control of log_f for some reason?

And max_memory is also kind of a tricky argument. It requires the user to know how much memory is available on the machine. The sysinfo crate provides these kinds of stats, so I guess I would expect users to use something like that, possibly dividing it by some factor so they don't use all of the memory, and then feeding that as max_memory.

I thought about including that kind of functionality directly in this crate. But the sysinfo crate didn't look very straightforward and robust, so I don't personally feel comfortable including it in a cryptography crate. I guess we could put it behind a feature flag?

Of course an API user is welcome to just set max_memory to a huge value if they "don't care", and get the same behavior as scrypt but with the addition of parallelism.

Finally, the addition of the max_memory argument adds another failure case to scrypt_parallel. There are situations where max_memory is too small for the given parameters, so scrypt_parallel needs to error out. Since there were no existing errors I just put in a panic for now. Should a new error be added?

It's important to note that the existing scrypt function would also panic in low memory situations, due to OOM. So it's more like scrypt_parallel makes that situation explicit. (Though scrypt_parallel can still OOM since it doesn't currently account for all allocations.)

log_f proof

# Suppose we have a four core, 1GB machine and need to compute scrypt(log_n=20, r=8, p=4).
# At f=1 scrypt needs 1GB of memory, so we can only use 1 core of this machine.
# At f=4 scrypt only needs 0.25GB, which means we can use all 4 cores.  But each core has to do a lot more work.
# Which is better?
# More generally, is it _always_ better to trade compute for memory if it means we can use all the cores on a given machine?
# Let's find out!

# The number of operations that need to be performed is given by:
# ops(n, f, p) = (2 * n + 0.5 * n * (f - 1)) * p
#
# From our example we assume that when f=1 we can only use one core for computation.  When f=2 we can use 2 cores.
# When f=16 we can use 16 cores.  Etc.
# So we can formulate computation time like so:
# time(n, f, p) = ops(n, f, p) / f => (2 * n + 0.5 * n * (f - 1)) * p / f
#
# By increasing f, ops(n, f, p) increases, but we can also use more cores, hence the division by f.
# 
# Now we can ask the question, is it always better to increase f?
#
# time(n, f + 1, p) <? time(n, f, p)
# 
# Note: assumes n > 0, f > 0, p > 0
# (2 * n + 0.5 * n * ((f + 1) - 1)) * p / (f + 1) <? (2 * n + 0.5 * n * (f - 1)) * p / f
# Divide by p:
# (2 * n + 0.5 * n * f) / (f + 1) <? (2 * n + 0.5 * n * (f - 1)) / f
# Divide by n:
# (2 + 0.5 * f) / (f + 1) <? (2 + 0.5 * (f - 1)) / f
# Multiply by 2:
# (4 + f) / (f + 1) <? (4 + f - 1) / f
# Multiply by f * (f + 1):
# (4 + f) * f <? (3 + f) * (f + 1)
# Expand:
# 4 * f + f**2 <? 3 * f + 3 + f**2 + f
# Subtract 4 * f + f**2
# 0 <? 3
# 0 < 3
# QED
#
# Thus we show that increasing f _always_ results in faster computation, as long as there are cores available.
# This of course also assumes that the p parameter is also large enough to keep all cores busy.
# For example when trying to compute scrypt(log_n=20, r=8, p=4) we can never use more than four cores, so it doesn't
# make sense to increase f beyond what's needed to allow four cores to run.
# And of course we're also assuming there's memory bandwidth aplenty.
#
# So for a given set of scrypt parameters, the optimal f-factor is set using:
# cores = min(machine_cores, p)
# iter_memory = machine_memory / cores
# f = ceil(n * 128 * r / iter_memory)
#
# Though in practice it's only reasonable for f to be a power of two, so the nearest power of two greater than or equal to f
# should be used.
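
The inequality can also be checked numerically. This is only a sanity check of the model used in the proof (one core per unit of f, unlimited memory bandwidth), not a benchmark; the function names are chosen for this sketch.

```rust
/// ops(n, f, p) = (2 * n + 0.5 * n * (f - 1)) * p
fn ops(n: f64, f: f64, p: f64) -> f64 {
    (2.0 * n + 0.5 * n * (f - 1.0)) * p
}

/// time(n, f, p) = ops(n, f, p) / f, assuming f cores can run at once.
fn time(n: f64, f: f64, p: f64) -> f64 {
    ops(n, f, p) / f
}

/// True if time(n, f + 1, p) < time(n, f, p) for every f in 1..max_f.
fn time_always_improves(n: f64, p: f64, max_f: u32) -> bool {
    (1..max_f).all(|f| time(n, f64::from(f + 1), p) < time(n, f64::from(f), p))
}
```

For n = 2^20 and p = 4 the model gives time = 8n at f = 1 and 3.5n at f = 4, so using all four cores is about 2.3x faster in this model despite the extra work per core.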

Side Note: SSE

I didn't do it in this pull request, but while digging into the code I also noticed that the use of SSE would make scrypt run significantly faster. I figured I'd put a note here about it, just so there's a note about it somewhere, even though it's not relevant to this pull request.

Implementing scrypt using SSE is relatively straightforward. Colin Percival's scrypt implementation in tarsnap has a good example. The easiest way to do this is to just directly port that implementation over to Rust using core::arch. I did that and saw a nice improvement from 46ns per salsa20_8 call to 33ns per salsa20_8 call. That should translate to scrypt as a whole running in ~70% of the time!

But core::arch would introduce unsafe into this crate and isn't future proof. I tried to find a way to re-write the existing Rust code in such a way that the compiler itself could infer SIMD instructions. See this thread: https://users.rust-lang.org/t/can-the-compiler-infer-sse-instructions/59976

While the compiler is or will be capable of doing that, for some permutations of the code, those optimization passes are really unstable now, for example varying their outputs depending on whether you use rotate_left or a manual implementation.

Again, this isn't relevant to this pull request. Just wanted to put those findings down somewhere.

TODOs

These todos should block merging until resolved:

  • Should scrypt_parallel return an error on low memory, or panic?
  • Figure out final API
  • Someone else should review all changes and verify correctness, especially the changes to romix

Final Note

Thank you for taking the time to review my pull request. I hope it's useful! Sorry for the long pull request notes, but I wanted to be sure to explain everything thoroughly, as this is a security sensitive codebase.

@tarcieri tarcieri requested review from newpavlov and tarcieri May 25, 2021 21:33
@tarcieri
Member

Thanks for opening this PR. Definitely looks like interesting work! But it might take some time to review, just FYI.

@fpgaminer
Author

No worries.

Regarding the failing automated test: looks like it's because of an unused argument which occurs when the parallel feature is off. I can fix that once the exact API is settled.

@smessmer

Is this still being pursued?

@tarcieri
Member

This is fine but needs a rebase and the test failures fixed

@smessmer

smessmer commented Jan 8, 2023

@fpgaminer ?

@fpgaminer
Author

I can rebase and fix this up when I get a chance, but I haven't seen any discussion on the API from the maintainers.

@tarcieri
Member

tarcieri commented Jan 8, 2023

@fpgaminer I think the API is mostly fine. I like the explicit parameterization of max_memory versus trying to decide on that heuristically by querying the OS, which won't work in no_std environments and only gives you a point-in-time snapshot versus a reasonable sanity limit.

Regarding error handling, you might add an Error enum and add to it the cases you're encountering, with variants that are only present when the parallel feature is enabled.
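
One possible shape for that suggestion, as a sketch only: the enum and its variant names below are hypothetical and not taken from the crate or from this pull request. InvalidOutputLen is a placeholder for whatever always-available cases end up in the enum, and InsufficientMemory stands for the new max_memory failure, compiled in only when the parallel feature is enabled.

```rust
use core::fmt;

/// Hypothetical error type; the variant names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Error {
    /// Placeholder for a case that exists regardless of features.
    InvalidOutputLen,
    /// `max_memory` is too small for the given scrypt parameters.
    /// Only compiled in when the `parallel` feature is enabled.
    #[cfg(feature = "parallel")]
    InsufficientMemory,
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Error::InvalidOutputLen => f.write_str("invalid output length"),
            #[cfg(feature = "parallel")]
            Error::InsufficientMemory => {
                f.write_str("max_memory is too small for the scrypt parameters")
            }
        }
    }
}
```

Gating the match arm with the same cfg attribute as the variant keeps the match exhaustive in both configurations. One trade-off of this design is that the set of variants changes with the enabled features, so downstream code that matches on Error exhaustively needs the same cfg gates or a wildcard arm.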

dns2utf8 pushed a commit to dns2utf8/password-hashes that referenced this pull request Jan 24, 2023
...gated under the `alloc` feature.

Allocates a `Vec<u8>` to return an individual `XofReader::read`
invocation into.

Unlike `ExtensibleOutput::finalize_vec`, it doesn't consume the reader
and can be called an unlimited number of times.