
perf: compute subtrees in parallel when computing proof #169

Open · wants to merge 17 commits into base: main

Conversation

@estensen estensen commented Dec 2, 2024

Merkle trees are binary trees, which makes parallel processing of subtrees possible. By parallelizing across 8 subtrees, we get an ~83% reduction in processing time for two mainnet blocks. The implementation detects the number of available CPU cores and runs serial processing, 4 parallel subtrees (e.g. on an RPi4), or 8 parallel subtrees.
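The core idea can be sketched with the standard library alone. This is illustrative only: `fold` is a toy stand-in for the crate's real SHA-256 hasher, and `root` is a made-up name, not the PR's code.

```rust
use std::thread;

// Toy stand-in for hashing a subtree's leaves into one value.
fn fold(leaves: &[u64]) -> u64 {
    leaves.iter().fold(0u64, |acc, x| acc.wrapping_mul(31).wrapping_add(*x))
}

// The root's two children are independent subtrees, so they can be
// folded on separate threads; only the final combine step is serial.
fn root(leaves: &[u64]) -> u64 {
    let (left, right) = leaves.split_at(leaves.len() / 2);
    let (l, r) = thread::scope(|s| {
        let left_handle = s.spawn(|| fold(left));
        let r = fold(right); // reuse the current thread for the right half
        (left_handle.join().unwrap(), r)
    });
    l.wrapping_mul(31).wrapping_add(r)
}
```

The same recursion applies one level down to get 4 or 8 independent subtrees, which is what the benchmarks below measure.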

Process 2 subtrees in parallel

Prove Benchmark - File: benches/21315748.json - size 247/123
                        time:   [250.73 ms 251.78 ms 253.11 ms]
                        change: [-44.361% -43.919% -43.531%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe

Prove Benchmark - File: benches/21327802.json - size 261/130
                        time:   [251.22 ms 252.05 ms 252.89 ms]
                        change: [-44.012% -43.797% -43.578%] (p = 0.00 < 0.05)
                        Performance has improved.

Process 4 subtrees in parallel compared to 2

Prove Benchmark - File: benches/21315748.json - size 247/123
                        time:   [158.15 ms 159.54 ms 161.30 ms]
                        change: [-42.477% -41.625% -40.843%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking Prove Benchmark - File: benches/21327802.json - size 261/130: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.7s or enable flat sampling.
Prove Benchmark - File: benches/21327802.json - size 261/130
                        time:   [158.71 ms 159.15 ms 159.40 ms]
                        change: [-43.690% -43.042% -42.435%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 10 measurements (30.00%)
  2 (20.00%) low mild
  1 (10.00%) high severe

Process 8 subtrees in parallel compared to 4

Prove Benchmark - File: benches/21315748.json - size 247/123
                        time:   [92.659 ms 93.286 ms 93.841 ms]
                        change: [-41.520% -40.689% -39.987%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking Prove Benchmark - File: benches/21327802.json - size 261/130: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.2s or enable flat sampling.
Prove Benchmark - File: benches/21327802.json - size 261/130
                        time:   [93.627 ms 94.510 ms 95.063 ms]
                        change: [-40.912% -40.329% -39.725%] (p = 0.00 < 0.05)
                        Performance has improved.

Avoid creating extra buffers for subtrees

Prove Benchmark - File: benches/21315748.json - size 247/123
                        time:   [79.151 ms 79.399 ms 79.855 ms]
                        change: [-14.591% -13.868% -13.113%] (p = 0.00 < 0.05)
                        Performance has improved.

Prove Benchmark - File: benches/21327802.json - size 261/130
                        time:   [82.095 ms 83.493 ms 84.412 ms]
                        change: [-14.100% -12.710% -11.136%] (p = 0.00 < 0.05)
                        Performance has improved.

Hasher abstractions from #161

ssz-rs/benches/compute_proof.rs (outdated review threads, resolved)
Comment on lines 43 to 46
let proof = dummy_transactions
    .prove(black_box(path))
    .expect("Failed to generate proof");
black_box(proof)


Suggested change
let proof = dummy_transactions
    .prove(black_box(path))
    .expect("Failed to generate proof");
black_box(proof)

let _ = black_box(dummy_transactions
    .prove(path)
    .expect("Failed to generate proof"));

The compiler cannot optimize away an input to a function, so passing the result through black_box is enough here.
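The effect can be seen outside Criterion with the std version of `black_box` (a sketch; `expensive` is an illustrative name, and Criterion's `black_box` serves the same purpose):

```rust
use std::hint::black_box;

fn expensive(x: u64) -> u64 {
    x.wrapping_mul(31).wrapping_add(7)
}

fn main() {
    // black_box on the input stops the compiler from constant-folding the
    // call; black_box on the output stops it from discarding the result as
    // dead code. Wrapping the whole expression covers both concerns.
    let out = black_box(expensive(black_box(42)));
    assert_eq!(out, 1309);
}
```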

@estensen estensen changed the title feat: add bench for computing proof perf: compute subtrees in parallel when computing proof Dec 4, 2024
@estensen estensen marked this pull request as ready for review December 4, 2024 19:48


Can you move this test data file so that we have a path like benches/test_data/block_transactions/21315748.json?

Comment on lines +307 to +313
if leaf_count >= 16 && num_cores >= 8 {
    compute_merkle_tree_parallel_8(&mut buffer, node_count);
} else if leaf_count >= 16 && num_cores >= 4 {
    compute_merkle_tree_parallel_4(&mut buffer, node_count);
} else {
    compute_merkle_tree_serial(&mut buffer, node_count);
}


I think you can simplify these checks by looking at the leaf_count < 16 case first.
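The reordering might look like this; the `Strategy` enum and `choose_strategy` are illustrative names standing in for the PR's `compute_merkle_tree_*` calls:

```rust
#[derive(Debug, PartialEq)]
enum Strategy {
    Serial,
    Parallel4,
    Parallel8,
}

// Handle the small-tree case first; after that, only the core count matters.
fn choose_strategy(leaf_count: usize, num_cores: usize) -> Strategy {
    if leaf_count < 16 {
        return Strategy::Serial;
    }
    match num_cores {
        n if n >= 8 => Strategy::Parallel8,
        n if n >= 4 => Strategy::Parallel4,
        _ => Strategy::Serial,
    }
}
```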

Comment on lines +335 to +353
// Create buffers for each section
let mut left_left_buf = vec![0u8; nodes.left_left.len() * BYTES_PER_CHUNK];
let mut left_right_buf = vec![0u8; nodes.left_right.len() * BYTES_PER_CHUNK];
let mut right_left_buf = vec![0u8; nodes.right_left.len() * BYTES_PER_CHUNK];
let mut right_right_buf = vec![0u8; nodes.right_right.len() * BYTES_PER_CHUNK];

// Copy data to section buffers
copy_nodes_to_buffer(buffer, &nodes.left_left, &mut left_left_buf);
copy_nodes_to_buffer(buffer, &nodes.left_right, &mut left_right_buf);
copy_nodes_to_buffer(buffer, &nodes.right_left, &mut right_left_buf);
copy_nodes_to_buffer(buffer, &nodes.right_right, &mut right_right_buf);

// Process all sections in parallel
rayon::scope(|s| {
    s.spawn(|_| process_subtree(&mut left_left_buf, nodes.left_left.len()));
    s.spawn(|_| process_subtree(&mut left_right_buf, nodes.left_right.len()));
    s.spawn(|_| process_subtree(&mut right_left_buf, nodes.right_left.len()));
    s.spawn(|_| process_subtree(&mut right_right_buf, nodes.right_right.len()));
});

@thedevbirb thedevbirb Dec 9, 2024


Perhaps you can try to look into buffer.chunks_mut(). I tried this code and it compiles (that doesn't mean it is correct; it's rather an example):

    let mut chunks = buffer.chunks_mut(buffer.len() / 4);
    let chunk_1 = chunks.next().unwrap();
    let chunk_2 = chunks.next().unwrap();
    let chunk_3 = chunks.next().unwrap();
    let chunk_4 = chunks.next().unwrap();

    // Process all sections in parallel
    rayon::scope(|s| {
        s.spawn(|_| process_subtree(chunk_1, nodes.left_left.len()));
        s.spawn(|_| process_subtree(chunk_2, nodes.left_right.len()));
        s.spawn(|_| process_subtree(chunk_3, nodes.right_left.len()));
        s.spawn(|_| process_subtree(chunk_4, nodes.right_right.len()));
    });

If you carefully exploit this method, I think you can get rid of the large number of allocations happening in this code, which can be a bottleneck.

estensen (author)


Thanks, this improved perf by 13% and uses less memory:
aaea182

estensen (author)


Celebrated a bit early: chunks_mut can split the buffer into subtree-length chunks, but that is not what we want here, since subtree ranges overlap in memory.

After this PR is merged I'll create an issue for improving it in the future.

estensen (author)


To be clear,

//              0
//       /            \
//      1              2
//    /    \        /    \
//   3      4      5      6
//  / \    / \    / \    / \
// 7   8  9  10  11 12  13 14

AFAIK chunks_mut(4) will create chunks like this:

  1. 0, 1, 2, 3
  2. 4, 5, 6, 7
  3. 8, 9, 10, 11
  4. 12, 13, 14

We want (current implementation)

  1. 3, 7, 8
  2. 4, 9, 10
  3. 5, 11, 12
  4. 6, 13, 14
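The index sets above can be generated with a small helper over the heap layout (children of node i at 2i+1 and 2i+2); `subtree_indices` is an illustrative name, not the PR's code. Note the sets are non-contiguous, which is exactly why `chunks_mut` (which yields contiguous slices) cannot express them.

```rust
// Collect the node indices of the subtree rooted at `root`, spanning
// `levels` levels of a heap-ordered tree (children of i are 2i+1, 2i+2).
fn subtree_indices(root: usize, levels: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut level_start = root;
    let mut width = 1;
    for _ in 0..levels {
        // Nodes of one level within a subtree are contiguous in heap order.
        out.extend(level_start..level_start + width);
        level_start = 2 * level_start + 1;
        width *= 2;
    }
    out
}
```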

estensen (author)


Or, when the tree is mostly padded with zero subtrees, we want something more like this, since zero subtrees don't require hashing:

  1. 7
  2. 8
  3. 4, 9, 10
  4. 2, 5, 6, 11, 12, 13, 14
