
perf: compute subtrees in parallel when computing proof #169

Open · wants to merge 17 commits into base: main

Conversation

@estensen estensen commented Dec 2, 2024

Merkle trees are binary trees, which makes parallel processing of subtrees possible. By parallelizing across 8 subtrees, we get an ~83% reduction in processing time for two mainnet blocks. The implementation detects the number of available CPU cores and runs serial processing, 4 parallel subtrees (e.g. on an RPi4), or 8 parallel subtrees.
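The core idea can be sketched with the standard library alone. This is illustrative only: `fold` is a toy stand-in for the crate's real SHA-256 hasher, and `root` is a made-up name, not the PR's code.

```rust
use std::thread;

// Toy stand-in for hashing a subtree's leaves into one value.
fn fold(leaves: &[u64]) -> u64 {
    leaves.iter().fold(0u64, |acc, x| acc.wrapping_mul(31).wrapping_add(*x))
}

// The root's two children are independent subtrees, so they can be
// folded on separate threads; only the final combine step is serial.
fn root(leaves: &[u64]) -> u64 {
    let (left, right) = leaves.split_at(leaves.len() / 2);
    let (l, r) = thread::scope(|s| {
        let left_handle = s.spawn(|| fold(left));
        let r = fold(right); // reuse the current thread for the right half
        (left_handle.join().unwrap(), r)
    });
    l.wrapping_mul(31).wrapping_add(r)
}
```

The same recursion applies one level down to get 4 or 8 independent subtrees, which is what the benchmarks below measure.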

Process 2 subtrees in parallel

Prove Benchmark - File: benches/21315748.json - size 247/123
                        time:   [250.73 ms 251.78 ms 253.11 ms]
                        change: [-44.361% -43.919% -43.531%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) high mild
  1 (10.00%) high severe

Prove Benchmark - File: benches/21327802.json - size 261/130
                        time:   [251.22 ms 252.05 ms 252.89 ms]
                        change: [-44.012% -43.797% -43.578%] (p = 0.00 < 0.05)
                        Performance has improved.

Process 4 subtrees in parallel compared to 2

Prove Benchmark - File: benches/21315748.json - size 247/123
                        time:   [158.15 ms 159.54 ms 161.30 ms]
                        change: [-42.477% -41.625% -40.843%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking Prove Benchmark - File: benches/21327802.json - size 261/130: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.7s or enable flat sampling.
Prove Benchmark - File: benches/21327802.json - size 261/130
                        time:   [158.71 ms 159.15 ms 159.40 ms]
                        change: [-43.690% -43.042% -42.435%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 10 measurements (30.00%)
  2 (20.00%) low mild
  1 (10.00%) high severe

Process 8 subtrees in parallel compared to 4

Prove Benchmark - File: benches/21315748.json - size 247/123
                        time:   [92.659 ms 93.286 ms 93.841 ms]
                        change: [-41.520% -40.689% -39.987%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking Prove Benchmark - File: benches/21327802.json - size 261/130: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.2s or enable flat sampling.
Prove Benchmark - File: benches/21327802.json - size 261/130
                        time:   [93.627 ms 94.510 ms 95.063 ms]
                        change: [-40.912% -40.329% -39.725%] (p = 0.00 < 0.05)
                        Performance has improved.

Avoid creating extra buffers for subtrees

Prove Benchmark - File: benches/21315748.json - size 247/123
                        time:   [79.151 ms 79.399 ms 79.855 ms]
                        change: [-14.591% -13.868% -13.113%] (p = 0.00 < 0.05)
                        Performance has improved.

Prove Benchmark - File: benches/21327802.json - size 261/130
                        time:   [82.095 ms 83.493 ms 84.412 ms]
                        change: [-14.100% -12.710% -11.136%] (p = 0.00 < 0.05)
                        Performance has improved.

Hasher abstractions from #161

ssz-rs/benches/compute_proof.rs (outdated review threads, resolved)
Comment on lines 43 to 46
let proof = dummy_transactions
    .prove(black_box(path))
    .expect("Failed to generate proof");
black_box(proof)


Suggested change
let proof = dummy_transactions
    .prove(black_box(path))
    .expect("Failed to generate proof");
black_box(proof)

let _ = black_box(dummy_transactions
    .prove(path)
    .expect("Failed to generate proof"));

The compiler cannot optimize away an input to a function, so passing the result through black_box is enough here.
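The effect can be seen outside Criterion with the std version of `black_box` (a sketch; `expensive` is an illustrative name, and Criterion's `black_box` serves the same purpose):

```rust
use std::hint::black_box;

fn expensive(x: u64) -> u64 {
    x.wrapping_mul(31).wrapping_add(7)
}

fn main() {
    // black_box on the input stops the compiler from constant-folding the
    // call; black_box on the output stops it from discarding the result as
    // dead code. Wrapping the whole expression covers both concerns.
    let out = black_box(expensive(black_box(42)));
    assert_eq!(out, 1309);
}
```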

@estensen estensen changed the title feat: add bench for computing proof perf: compute subtrees in parallel when computing proof Dec 4, 2024
@estensen estensen marked this pull request as ready for review December 4, 2024 19:48


Can you move this test data file so that we have a path like benches/test_data/block_transactions/21315748.json?

Comment on lines +307 to +313
if leaf_count >= 16 && num_cores >= 8 {
    compute_merkle_tree_parallel_8(&mut buffer, node_count);
} else if leaf_count >= 16 && num_cores >= 4 {
    compute_merkle_tree_parallel_4(&mut buffer, node_count);
} else {
    compute_merkle_tree_serial(&mut buffer, node_count);
}


I think you can simplify these checks by looking at the leaf_count < 16 case first.
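The reordering might look like this; the `Strategy` enum and `choose_strategy` are illustrative names standing in for the PR's `compute_merkle_tree_*` calls:

```rust
#[derive(Debug, PartialEq)]
enum Strategy {
    Serial,
    Parallel4,
    Parallel8,
}

// Handle the small-tree case first; after that, only the core count matters.
fn choose_strategy(leaf_count: usize, num_cores: usize) -> Strategy {
    if leaf_count < 16 {
        return Strategy::Serial;
    }
    match num_cores {
        n if n >= 8 => Strategy::Parallel8,
        n if n >= 4 => Strategy::Parallel4,
        _ => Strategy::Serial,
    }
}
```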

Comment on lines +335 to +353
// Create buffers for each section
let mut left_left_buf = vec![0u8; nodes.left_left.len() * BYTES_PER_CHUNK];
let mut left_right_buf = vec![0u8; nodes.left_right.len() * BYTES_PER_CHUNK];
let mut right_left_buf = vec![0u8; nodes.right_left.len() * BYTES_PER_CHUNK];
let mut right_right_buf = vec![0u8; nodes.right_right.len() * BYTES_PER_CHUNK];

// Copy data to section buffers
copy_nodes_to_buffer(buffer, &nodes.left_left, &mut left_left_buf);
copy_nodes_to_buffer(buffer, &nodes.left_right, &mut left_right_buf);
copy_nodes_to_buffer(buffer, &nodes.right_left, &mut right_left_buf);
copy_nodes_to_buffer(buffer, &nodes.right_right, &mut right_right_buf);

// Process all sections in parallel
rayon::scope(|s| {
    s.spawn(|_| process_subtree(&mut left_left_buf, nodes.left_left.len()));
    s.spawn(|_| process_subtree(&mut left_right_buf, nodes.left_right.len()));
    s.spawn(|_| process_subtree(&mut right_left_buf, nodes.right_left.len()));
    s.spawn(|_| process_subtree(&mut right_right_buf, nodes.right_right.len()));
});

@thedevbirb thedevbirb Dec 9, 2024


Perhaps you can try to look into buffer.chunks_mut(). I tried this code and it compiles (that doesn't mean it is correct; it's rather an example):

    let mut chunks = buffer.chunks_mut(buffer.len() / 4);
    let chunk_1 = chunks.next().unwrap();
    let chunk_2 = chunks.next().unwrap();
    let chunk_3 = chunks.next().unwrap();
    let chunk_4 = chunks.next().unwrap();

    // Process all sections in parallel
    rayon::scope(|s| {
        s.spawn(|_| process_subtree(chunk_1, nodes.left_left.len()));
        s.spawn(|_| process_subtree(chunk_2, nodes.left_right.len()));
        s.spawn(|_| process_subtree(chunk_3, nodes.right_left.len()));
        s.spawn(|_| process_subtree(chunk_4, nodes.right_right.len()));
    });

If you carefully exploit this method, I think you can get rid of the large number of allocations happening in this code, which can be a bottleneck.

estensen (author)


Thanks, this improved perf by 13% and uses less memory:
aaea182

estensen (author)


Celebrated a bit early: chunks_mut can split the buffer into subtree-length chunks, but that is not what we want here, since subtree ranges overlap in memory.

After this PR is merged I'll create an issue for improving it in the future.

estensen (author)


To be clear,

//              0
//       /            \
//      1              2
//    /    \        /    \
//   3      4      5      6
//  / \    / \    / \    / \
// 7   8  9  10  11 12  13 14

AFAIK chunks_mut(4) will create chunks like this:

  1. 0, 1, 2, 3
  2. 4, 5, 6, 7
  3. 8, 9, 10, 11
  4. 12, 13, 14

We want (current implementation)

  1. 3, 7, 8
  2. 4, 9, 10
  3. 5, 11, 12
  4. 6, 13, 14
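The index sets above can be generated with a small helper over the heap layout (children of node i at 2i+1 and 2i+2); `subtree_indices` is an illustrative name, not the PR's code. Note the sets are non-contiguous, which is exactly why `chunks_mut` (which yields contiguous slices) cannot express them.

```rust
// Collect the node indices of the subtree rooted at `root`, spanning
// `levels` levels of a heap-ordered tree (children of i are 2i+1, 2i+2).
fn subtree_indices(root: usize, levels: usize) -> Vec<usize> {
    let mut out = Vec::new();
    let mut level_start = root;
    let mut width = 1;
    for _ in 0..levels {
        // Nodes of one level within a subtree are contiguous in heap order.
        out.extend(level_start..level_start + width);
        level_start = 2 * level_start + 1;
        width *= 2;
    }
    out
}
```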

estensen (author)


Or, when the tree is mostly padded with zero subtrees, we want something more like this, since zero subtrees don't require hashing:

  1. 7
  2. 8
  3. 4, 9, 10
  4. 2, 5, 6, 11, 12, 13, 14
