Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify auxiliary columns build #1140

Merged
merged 17 commits into from
Feb 13, 2024
Merged

Simplify auxiliary columns build #1140

merged 17 commits into from
Feb 13, 2024

Conversation

Al-Kindi-0
Copy link
Collaborator

Describe your changes

This PR proposes a simplification to the auxiliary columns building procedure.
This is a wip/poc version without any optimizations.

Checklist before requesting a review

  • Repo forked and branch created from next according to naming convention.
  • Commit messages and codestyle follow conventions.
  • Relevant issues are linked in the PR description.
  • Tests added for new functionality.
  • Documentation/comments updated according to changes.

@Al-Kindi-0 Al-Kindi-0 requested a review from bobbinth November 7, 2023 19:20
@Al-Kindi-0 Al-Kindi-0 marked this pull request as draft November 7, 2023 19:22
Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you! I left a few comments inline - mostly about improving code readability/modularity.

processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
@Al-Kindi-0 Al-Kindi-0 marked this pull request as ready for review November 23, 2023 15:54
Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you! I left a few small comments inline. A couple of additional thoughts:

  • It seems like this change results in roughly doubling of the time it takes to build the auxiliary trace (tested by running the Fibonacci example for $2^20$ cycles). This is not a huge deal as overall contribution to proof generation time is pretty small, but I'd like to understand which parts of the new approach contribute the most to the runtime.
  • I'm wondering if it would make sense to combine some of the functions. For example, the logic in build_mloadw_request and build_mstorew_request is nearly identical - so, combining these two should be pretty straight-forward. Similar approach can probably be applied to some other functions as well.

Also, we should probably remove the old "hint" structure as I don't think it is needed for anything chiplet-related any more (this can be done in a separate commit so that we can roll things back if needed).

Overall, I'd probably make this PR about chiplet bus only and then address other buses/virtual tables in different PRs (which build on this PR). This way, we have relatively small PRs for each change, and then we can merge them together once the whole structure has been migrated.

processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
@Al-Kindi-0
Copy link
Collaborator Author

* It seems like this change results in roughly doubling of the time it takes to build the auxiliary trace (tested by running the Fibonacci example for 220 cycles). This is not a huge deal as overall contribution to proof generation time is pretty small, but I'd like to understand which parts of the new approach contribute the most to the runtime.

Added a description of the timings for the Fibonacci example above.

* I'm wondering if it would make sense to combine some of the functions. For example, the logic in `build_mloadw_request` and `build_mstorew_request` is nearly identical - so, combining these two should be pretty straight-forward. Similar approach can probably be applied to some other functions as well.

I have now combined a few additional functions. Let me know if you see further simplifications.

Also, we should probably remove the old "hint" structure as I don't think it is needed for anything chiplet-related any more (this can be done in a separate commit so that we can roll things back if needed).

Will remove them in at the end in a separate commit, if that is OK.

Overall, I'd probably make this PR about chiplet bus only and then address other buses/virtual tables in different PRs (which build on this PR). This way, we have relatively small PRs for each change, and then we can merge them together once the whole structure has been migrated.

Makes sense.

Copy link
Contributor

@bobbinth bobbinth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Thank you! I left a few more small comments inline - after these, we are good to merge. But as mentioned in the previous review, let's hold off on merging until the whole structure is refactored.

processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
@bobbinth bobbinth force-pushed the al-simplify-aux branch 5 times, most recently from 2eb75d6 to cf6db3a Compare December 13, 2023 02:08
@bobbinth
Copy link
Contributor

I've cleaned up commit history a bit and merged all downstream PRs into this one (and also rebased this PR from the latest next).

Let's not merge it yet though. We can probably do a bit more cleanup - and I'd like to do it on top of this PR before we merge into next.

@bobbinth
Copy link
Contributor

bobbinth commented Dec 13, 2023

Overall, with the changes as they stand, the time to build auxiliary trace increased about 10x (from ~25ms to ~250ms). But we haven't optimized this at all, and nothing is parallelized. So, overall, I think we should be able to get this down to 25 ms (or maybe even less).

And if GKR changes work, we may be able to reduce it even more (though, I think we'll need to understand GKR proving time a bit more before we can make this claim).

@bobbinth
Copy link
Contributor

One of the things we can do in terms of code improvements is to consolidate all MainTrace structs and move them into the air crate (under air/src/trace).

@Al-Kindi-0
Copy link
Collaborator Author

One of the things we can do in terms of code improvements is to consolidate all MainTrace structs and move them into the air crate (under air/src/trace).

Makes sense!

@Al-Kindi-0
Copy link
Collaborator Author

Overall, with the changes as they stand, the time to build auxiliary trace increased about 10x (from ~25ms to ~250ms). But we haven't optimized this at all, and nothing is parallelized. So, overall, I think we should be able to get this down to 25 ms (or maybe even less).

That's plausible. Will try to optimize it after refactoring MainTrace.

And if GKR changes work, we may be able to reduce it even more (though, I think we'll need to understand GKR proving time a bit more before we can make this claim).

I would bet yes since as far as I can see there are no inversions when evaluating the GKR circuit. However, there will be some extra costs with the GKR prover that are non-negligible.

@Al-Kindi-0
Copy link
Collaborator Author

I have now consolidated the helper struct and methods, improved documentation further and did some minor cleanups. Managed to find a nice way to build the column and batch invert at the same time, this should hopefully reduce the build times.

@Al-Kindi-0
Copy link
Collaborator Author

I have also looked at parallelizing the auxiliary columns build and I don't see a way to do it without making some changes to Winterfell. I propose that we tackle this in a separate PR as it is a self-contained task.

@bobbinth
Copy link
Contributor

I have also looked at parallelizing the auxiliary columns build and I don't see a way to do it without making some changes to Winterfell. I propose that we tackle this in a separate PR as it is a self-contained task.

The place to parallelize is roughly here - so, I think we can avoid changes in Winterfell. I do agree that it should be a separate PR - but probably based on this PR before we merge.

Copy link
Contributor

@plafer plafer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing the hints stuff is very refreshing! However, I think we lost some good abstractions that were present before (as mentioned in the review).

Not a full review yet; I still need to think more about how to bring some of these abstractions back. Suggestions obviously welcome :)

{
let mut multiplicand = E::ONE;
if i % 8 == 7 {
let op_label = get_op_label(ONE, ZERO, is_xor, ZERO);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use the column values from the trace itself (selector0, etc) instead of hardcoding the column values? Harcoding values like this makes it all the more difficult to change the layout of the chiplets in the future.

I would personally change get_op_label() to

fn get_op_label(main_trace: &MainTrace, row_idx: usize) -> Felt {
  // fetch selector columns from main trace and compute label
}

and let the compiler optimize the memory accesses

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can. Some of the "hardcoding" is due to there being only one possible value of the flag given some previous check.
I have now factored the computation of the flag outside of the if blocks.

E: FieldElement<BaseField = Felt>,
{
let mut multiplicand = E::ONE;
if i % 8 == 7 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if i % 8 == 7 {
if i % bitwise::OP_CYCLE_LEN == bitwise::OP_CYCLE_LEN - 1 {

Can we limit the use of magic numbers elsewhere too (e.g. in build_hasher_chiplet_responses)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
pub struct BusTraceBuilder {}

/// Builds the chiplets bus auxiliary trace column.
fn build_bus_aux_column<E>(main_trace: &ColMatrix<Felt>, alphas: &[E]) -> Vec<E>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was nice about the previous implementation is that we only had one function to build all aux columns (AuxColumnBuilder::build_aux_column()). Now this logic is duplicated for every column (build_bus_aux_column(), build_vtable_aux_column(), stack's AuxTraceBuilder::build_aux_column(), etc.).

We could bring back a different AuxColumnBuilder that would look something like

trait AuxColumnBuilder {
  // except with (much better) naming
  fn init_result_1(main_trace, alphas);
  fn init_result_2(main_trace, alphas);

  fn result_1(prev_result, main_trace, alphas, row_idx);
  fn result_2(prev_result, main_trace, alphas, row_idx);
 
  // provided method
  fn build_aux_column(main_trace: &ColMatrix<Felt>, alphas: &[E]) -> Vec<E>
  where
      E: FieldElement<BaseField = Felt>,
  {
        let main_trace = MainTrace::new(main_trace);
        let mut result_1: Vec<E> = Self::init_result_1(&main_trace, alphas);
        let mut result_2: Vec<E> = Self::init_result_2(&main_trace, alphas)

        let mut result_2_acc = E::ONE;
        for i in 0..main_trace.num_rows() - 1 {
            result_1[i + 1] = Self::result_1(result_1[i], &main_trace, alphas, row_idx);
            result_2[i + 1] = Self::result_2(result_2[i], &main_trace, alphas, row_idx);
            result_2_acc *= result_2[i + 1];
        }

        let mut acc_inv = result_2_acc.inv();

        for i in (0..main_trace.num_rows()).rev() {
            result_1[i] *= acc_inv;
            acc_inv *= result_2[i];
        }
        result_1
  }
}

The only thing left is to find better names than the ones I used here. And potentially making main_trace and alphas fields in the structs to avoiding passing them around all the time.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now factored the logic above into a trait for aux column building. I have used the self. instead of Self:: as it makes implementing aux column building for the stack overflow table less cumbersome.

processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
processor/src/chiplets/aux_trace/mod.rs Show resolved Hide resolved
processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
+ alphas[3].mul_base(node_index);

let bit = node_index.as_int() & 1;
let left_word = build_value(&alphas[8..12], &state[4..8]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use constants here, instead of 8..12 and 4..8? It makes it hard to understand why these specific values were chosen

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! Although for the alphas it is less straightforward as the values don't have a special meaning, except for the alphas used in computing the header values.

Comment on lines +917 to +877
alphas[0]
+ alphas[1].mul_base(op_label)
+ alphas[2].mul_base(ctx)
+ alphas[3].mul_base(addr)
+ alphas[4].mul_base(clk)
+ alphas[5].mul_base(value0)
+ alphas[6].mul_base(value1)
+ alphas[7].mul_base(value2)
+ alphas[8].mul_base(value3)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use build_value() here? This applies to all other places in this file where we reimplement build_value() by hand.

I'll have to think about it some more, but it'd be nice to have a trait similar to the previous LookupTableRow, adapted to the changes in this PR, which would replace build_value().

processor/src/chiplets/aux_trace/mod.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@plafer plafer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's best to hold off to make big refactors on top of this, as the final shape of the aux column building code will probably require all chiplets to be refactored first (see last comment in review). So I would address the points I brought up in my reviews, and then the PR would be in a good state to merge IMO 🙂

processor/src/decoder/auxiliary.rs Outdated Show resolved Hide resolved
processor/src/trace/mod.rs Outdated Show resolved Hide resolved
result_2[0] = E::ONE;

let mut result_2_acc = E::ONE;
for i in 0..main_trace.num_rows() - 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, every chiplet performs its own loop over the entire trace. It would probably be more cache-efficient to have just one loop, in which we every chiplet performs its work in the body - but we'd need to measure the performance delta to be sure.

It would also make further optimizations easier, such as parallelizing the loop (as opposed to parallelize n loops for n chiplets).

@bobbinth
Copy link
Contributor

bobbinth commented Jan 8, 2024

I think it's best to hold off to make big refactors on top of this, as the final shape of the aux column building code will probably require all chiplets to be refactored first (see last comment in review).

I like the idea of standardizing the code a bit more - but I wouldn't do it in this PR directly. Instead, we could do it in PRs built on top of this one, and then merge into this PR. This would make incremental reviews easer.

The goal here is to get the structure in the PR to be relatively clean, and only then merge into next.

@bobbinth bobbinth force-pushed the al-simplify-aux branch 2 times, most recently from 3d06daa to 4c963d2 Compare February 8, 2024 08:29
@bobbinth bobbinth merged commit d1aa43c into next Feb 13, 2024
15 checks passed
@bobbinth bobbinth deleted the al-simplify-aux branch February 13, 2024 22:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants