-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf: compute subtrees in parallel when computing proof #169
base: main
Are you sure you want to change the base?
Conversation
ssz-rs/benches/compute_proof.rs
Outdated
let proof = dummy_transactions | ||
.prove(black_box(path)) | ||
.expect("Failed to generate proof"); | ||
black_box(proof) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let proof = dummy_transactions | |
.prove(black_box(path)) | |
.expect("Failed to generate proof"); | |
black_box(proof) | |
let _ = black_box(dummy_transactions | |
.prove(path) | |
.expect("Failed to generate proof")); |
The compiler cannot optimize away an input to a function
ssz-rs/benches/21315748.json
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move this test data file so that we have a path like benches/test_data/block_transactions/21315748.json
?
if leaf_count >= 16 && num_cores >= 8 { | ||
compute_merkle_tree_parallel_8(&mut buffer, node_count); | ||
} else if leaf_count >= 16 && num_cores >= 4 { | ||
compute_merkle_tree_parallel_4(&mut buffer, node_count); | ||
} else { | ||
compute_merkle_tree_serial(&mut buffer, node_count); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can simplify these checks by looking at the leaf_count < 16
case first.
// Create buffers for each section | ||
let mut left_left_buf = vec![0u8; nodes.left_left.len() * BYTES_PER_CHUNK]; | ||
let mut left_right_buf = vec![0u8; nodes.left_right.len() * BYTES_PER_CHUNK]; | ||
let mut right_left_buf = vec![0u8; nodes.right_left.len() * BYTES_PER_CHUNK]; | ||
let mut right_right_buf = vec![0u8; nodes.right_right.len() * BYTES_PER_CHUNK]; | ||
|
||
// Copy data to section buffers | ||
copy_nodes_to_buffer(buffer, &nodes.left_left, &mut left_left_buf); | ||
copy_nodes_to_buffer(buffer, &nodes.left_right, &mut left_right_buf); | ||
copy_nodes_to_buffer(buffer, &nodes.right_left, &mut right_left_buf); | ||
copy_nodes_to_buffer(buffer, &nodes.right_right, &mut right_right_buf); | ||
|
||
// Process all sections in parallel | ||
rayon::scope(|s| { | ||
s.spawn(|_| process_subtree(&mut left_left_buf, nodes.left_left.len())); | ||
s.spawn(|_| process_subtree(&mut left_right_buf, nodes.left_right.len())); | ||
s.spawn(|_| process_subtree(&mut right_left_buf, nodes.right_left.len())); | ||
s.spawn(|_| process_subtree(&mut right_right_buf, nodes.right_right.len())); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you can try to look into buffer.chunks_mut()
. I tried this code and compiles (doesn't mean it is correct, but rather an example):
let mut chunks = buffer.chunks_mut(buffer.len() / 4);
let chunk_1 = chunks.next().unwrap();
let chunk_2 = chunks.next().unwrap();
let chunk_3 = chunks.next().unwrap();
let chunk_4 = chunks.next().unwrap();
// Process all sections in parallel
rayon::scope(|s| {
s.spawn(|_| process_subtree(chunk_1, nodes.left_left.len()));
s.spawn(|_| process_subtree(chunk_2, nodes.left_right.len()));
s.spawn(|_| process_subtree(chunk_3, nodes.right_left.len()));
s.spawn(|_| process_subtree(chunk_4, nodes.right_right.len()));
});
If you carefully exploit this method I think you can get rid of the high amount of allocation happening in this code which can be a bottleneck
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ty, this improved perf with 13%, and uses less memory
aaea182
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
celebrated a bit early, chunks_mut
can split the buffer into subtree len sizes, but this is not what we want here. Subtree ranges overlap in memory.
After this PR is merged I'll create an issue for improving it in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear,
// 0
// / \
// 1 2
// / \ / \
// 3 4 5 6
// / \ / \ / \ / \
// 7 8 9 10 11 12 13 14
AFAIK chunks_mut(4) will create iterators like this
- 0, 1, 2, 3
- 4, 5, 6, 7
- 8, 9, 10, 11
- 12, 13, 14
We want (current implementation)
- 3, 7, 8
- 4, 9, 10
- 5, 11, 12
- 6, 13, 14
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or when the tree is mostly padded with zero subtrees we want something more similar to this, since zero subtrees won't require hashing
- 7
- 8
- 4, 9, 10
- 2, 5, 6, 11, 12, 13, 14
Merkle trees are binary trees, enabling parallel processing of subtrees. By parallelizing 8 subtrees, we get ~83% reduction in processing time for two mainnet blocks. This implementation detects the number of available CPU cores and runs either serial processing, parallel 4 subtrees (RPi4) or 8 subtrees.
Process 2 subtrees in parallel
Process 4 subtrees in parallel compared to 2
Process 8 subtrees in parallel compared to 4
Avoid creating extra buffers for subtrees
Hasher abstractions from #161