From 262bf5ddcd062a28d4b9e594a253972756baa46d Mon Sep 17 00:00:00 2001 From: Kunming Jiang Date: Sun, 29 Dec 2024 20:38:58 -0500 Subject: [PATCH] First draft of the new documentation --- docs/spartan_parallel.md | 148 ++++++++++++++++++++++++++++----------- 1 file changed, 109 insertions(+), 39 deletions(-) diff --git a/docs/spartan_parallel.md b/docs/spartan_parallel.md index cd02f947..a323e077 100644 --- a/docs/spartan_parallel.md +++ b/docs/spartan_parallel.md @@ -5,7 +5,7 @@ - (Compile Time) Circuit preprocessing and commitment - (Runtime) Witness preprocessing and commitment - (Runtime) Sumcheck on all circuits and all instances -- (Runtime) Opening on circuit and witness commitments +- (Runtime) Opening on witness and circuit commitments - (Runtime) Shift and program IO proofs ## High-Level Idea @@ -125,8 +125,13 @@ At runtime, `spartan_parallel` reads in from `circ_blocks` through the struct `R * `addr_phy_mems_list`, `addr_vir_mems_list`: memory accesses throughout the program execution (including initialization), ordered by address. * `addr_ts_bits_list`: bit split of timestamp difference, used by memory coherence check. +### Circuit Sorting +> Relevant sections: `BLOCK SORT` and `PAIRWISE SORT` of `src/lib.rs` + +As later illlustrated by polynomial binding, the most efficient parallel proof requires the circuits (blocks) to be sorted by decreasing order of number of instances (executions). When the circuits are generated at compile time, however, the number of executions on each block is unknown. As such, during runtime, the prover and the verifier needs to re-order the block circuits and commitments based on their number of execution. The same sort is also performed on register consistency, stack, and heap coherence circuits according to their number of instances: `consis_num_proofs`, `total_num_phy_mem_accesses`, and `total_num_vir_mem_accesses`. + ### Witness Preprocessing and Commitment -> Relevant file: `src/lib.rs` +> Relevant section: `CHALLENGES AND WITNESSES FOR PERMUTATION` of `src/lib.rs` Apart from the witnesses provided by each block execution, the prover also needs to compute additional witnesses used by permutation and consistency checks. This includes, most notably: * `perm_w0 = [tau, r, r^2, ...]`: the randomness used by the random linear permutation. This value is can be efficiently generated by the verifier and does not require commitment. @@ -139,21 +144,39 @@ Apart from the witnesses provided by each block execution, the prover also needs - $ri_k$ and $ro_k$ are only used by register transition to record the RLC of $i_k$ and $o_k$ individually * `block_w3_shifted`, `perm_exec_w3_shifted`, `phy_mem_w3_shifted`, `vir_mem_w3_shifted`: each $w3$ shifted by one row. This is so that $D_k$ can obtain $\pi_{k+1}$ and $v_{k+1}$ for its computation. See shift proofs section for more details. -Note that all of the above witnesses have different sizes: -* `block_vars_matrix`: `block_num_proofs[i] * num_vars_per_block[i]` for each block `i` -* `perm_exec_w2`: `consis_num_proofs * num_ios`, where `num_ios` is the total number of inputs and outputs of a block, and is the same across all blocks. -* `addr_phy_mems_list`, `phy_mem_w2`: `total_num_phy_mem_accesses * 4` -_XXX: we should be able to reduce the length to `total_num_phy_mem_accesses * 2`._ -* `addr_vir_mems_list`, `vir_mem_w2`: `total_num_vir_mem_accesses * 8` -_XXX: we should be able to reduce the length to `total_num_phy_mem_accesses * 4`._ -* `block_w2`: `block_num_proofs[i] * (num_ios + block_num_phy_ops[i] + block_num_vir_ops[i])` -* `perm_exec_w3`, `perm_exec_shifted`: `consis_num_proofs * 8` -* `phy_mem_w3`, `phy_mem_w3_shifted`: `total_num_phy_mem_accesses * 8` -* `vir_mem_w3`, `vir_mem_w3_shifted`: `total_num_vir_mem_accesses * 8` -* `block_w3`, `block_w3_shifted`: `block_num_proofs[i] * 8` for each block `i` - All witnesses are committed using regular dense polynomial commitment schemes. `block_vars_matrix`, `block_w2`, `block_w3`, and `block_w3_shifted` are committed by each type of block. We note that we can use tricks similar to circuit commitment above to batch commit and batch open witness commitments. +### Summary on Commitments +In total, `spartan_parallel` needs to commmit to the following items: +#### Compile time (all sparse polynomials) +* `BLOCK_CORRECTNESS`: `num_block` circuits of size `num_cons_per_block[i] * (8 * num_vars_per_block[i])` +* `CONSIS_CHECK`: one consistency circuit of size `1 * 16` +* `PHY_MEM_COHERE`: one stack coherence circuit of size `4 * 8` +* `VIR_MEM_COHERE`: one heap coherence circuit of size `(8 + max_ts_width) * (4 * max_ts_width)`, where `max_ts_width` is the number of bits required to express the heap access timestamp +* `PERM_ROOT`: one permutation circuit of size `(num_ios + 4) * (4 * num_ios)` +#### Runtime, round 1 (commitment hash required to generate RLC challenges) +* `block_vars`: $P$ instances of `block_num_proofs[i] * num_vars_per_block[i]` +* `exec_inputs`: `consis_num_proofs * num_ios` +* `init_phy_mems_list`: `total_num_init_phy_mem_accesses * 4` +* `addr_phy_mems_list`: `total_num_phy_mem_accesses * 4` +> _XXX: we should be able to reduce the length to `total_num_phy_mem_accesses * 2`._ +* `init_vir_mems_list`: `total_num_init_vir_mem_accesses * 8` +* `addr_vir_mems_list`: `total_num_vir_mem_accesses * 8` +> _XXX: we should be able to reduce the length to `total_num_phy_mem_accesses * 4`._ +* `addr_ts_bits_list`: `total_num_vir_mem_accesses * mem_addr_ts_bits_size` +#### Runtime, round 2 (witnesses computed from RLC challenges) +* `block_w2`: $P$ instances of `block_num_proofs[i] * (num_ios + block_num_phy_ops[i] + block_num_vir_ops[i])` +* `block_w3`, `block_w3_shifted`: $P$ instances of `block_w3_shifted`: `block_num_proofs[i] * 8` +* `perm_exec_w2`: `consis_num_proofs * num_ios` +* `perm_exec_w3`, `perm_exec_w3_shifted`: `consis_num_proofs * 8` +* `phy_mem_addr_w2`: `total_num_phy_mem_accesses * 4` +> _XXX: we should be able to reduce the length to `total_num_phy_mem_accesses * 2`._ +* `phy_mem_addr_w3`, `phy_mem_addr_w3_shifted`: `total_num_phy_mem_accesses * 8` +* `vir_mem_addr_w2`: `total_num_vir_mem_accesses * 8` +> _XXX: we should be able to reduce the length to `total_num_phy_mem_accesses * 4`._ +* `vir_mem_addr_w3`, `vir_mem_addr_w3_shifted`: `total_num_vir_mem_accesses * 8` + + ## Sumcheck Proof on Circuits and Instances > Relevant files: `src/customdensepolynomial.rs`, `src/r1csproof.rs` and `src/sumcheck.rs` @@ -182,17 +205,28 @@ $$\sum \tilde{\text{eq}} \cdot (\tilde{Az} \cdot \tilde{Bz} - \tilde{Cz}) = 0$$ $$(r_A\cdot \tilde{A} + r_B\cdot \tilde{B} + r_C\cdot \tilde{C})\cdot \tilde{z} = r_A\cdot \tilde{Az} + r_B\cdot \tilde{Bz} + r_C\cdot \tilde{Cz}$$ for some random $r_A$, $r_B$, $r_C$. -To implement data-parallelism, we divide Spartan into 4 steps. +To implement data-parallelism, we divide Spartan into the following steps. -#### Obtaining $\tilde{Az}, \tilde{Bz}, \tilde{Cz}$ -> Relevant files: `src/r1csinstance.rs` and `src/customdensepolynomial.rs` +### Combining Witnesses Together +> Relevant file: `src/r1csproof.rs` While in regular Spartan, $Az$ is simply a length-$X$ vector, obtained by multiplication of a $X\times Y$ matrix $A$ by a length-$Y$ vector $z$, the data-paralleled version is slightly more complicated. -The prover's first task is to construct a $P\times Q_i\times W\times Y_i$ struct `z_mat` through a 4-dimensional vector. The size of $Q_i$ and $Y_i$ depends on the entry of the $P$ dimension, while the size of $W$ does not. Further, even though sumcheck requires the size of each dimension to be _conceptually_ a power of 2, `z_mat` only needs to store the entries that contain non-zero values. +The prover's first task is to construct a $P\times Q_i\times W\times Y_i$ struct `z_mat` through a 4-dimensional vector. This is, conceptually, piecing together $W$ different witnesses of size $P\times Q_i\times Y_i$ together. In the case of block correctness proof $\mathcal{C}_i$, where +$$z'_{i, j} = (z_{i, j}, r, rz_{i, j}, \pi_{i, j}, \pi'_{i, j})$$ +$z, rz, \pi, \pi'$ are different for each circuit and each instance, and thus can be naturally expressed as $P\times Q_i\times Y_i$ vectors. $r$, which is the randomness used by RLC, however, is the same across all circuits and instances, and thus needs to be copied $P\times Q_i$ times. We note that this construction presents opportunities and challenges: on one hand, since every evaluation on the $r$ polynomial is the same, bindings on the $P$ and $Q_i$ dimensions are free, and thus it does not require a commitment opening proof; on the other hand, for this binding shortcut to work, $r$ needs to be the same on all $2^{q_i}$ bases, so $Q_i$ has to be a power of 2 (technically $P$ also needs to be a power of 2, but since $P$ is small, the verifier can manually bind $p$ variables to a polynomial). + +> To provide more intuition on this problem, assume that for some circuit $i$, $Q_i$ is not a power of 2. Let $Q_\text{pad}$ be the next power of 2 of $Q_i$, the sumcheck still needs to operate for $q_\text{pad}$ rounds on a vector of size $Q_\text{pad}$. What one can do, however, is to set entries $Q_i$ to $Q_\text{pad}$ to 0. If the prover needs to access any of these entries, it already knows that they are 0, and can skip the memory access and any multiplication on it. The problem with $r$ is that every entry needs to be the same (to skip the binding), and thus entries $Q_i$ to $Q_\text{pad}$ cannot be zero. + +From the construction we can deduce that the size of $Q_i$ and $Y_i$ depends on the entry of the $P$ dimension, while the size of $W$ does not. Further, even though sumcheck requires the size of each dimension to be _conceptually_ a power of 2, `z_mat` allows vectors on the $P, W, Y_i$ dimensions to only store the non-zero entries at the front. + +> XXX: current construction pads $Y_i$ to a power of 2, pending improvements. Moreover, there are no fundamental challenges to trim the $Q_i$ dimension, except that they make the code extremely messy and hard to parallelize. + +### Obtaining $\tilde{Az}, \tilde{Bz}, \tilde{Cz}$ +> Relevant files: `src/r1csinstance.rs` and `src/customdensepolynomial.rs` To obtain $Az$, $Bz$, $Cz$, the prover treats `z_mat` as $P$ counts of $Q_i \times (W \cdot Y_i)$ matrix. Since $A$, $B$, $C$ can be expressed as $P$ counts of $X_i\times (W \cdot Y_i)$ matrices, this allows the prover to perform $P$ matrix multiplications to obtain $P \times Q_i \times X_i$ tensors $Az$, $Bz$, $Cz$ and their MLE $\tilde{Az}$ (`poly_Az`), $\tilde{Bz}$, $\tilde{Cz}$. This process is described in `R1CSinstance::multiply_vec_block`. Note that: -* Conceptually, `poly_Az` of every block $i$ has $p + q_\text{max} + x_\text{max}$ variables. However, the value of the variables indexed at $[p, p + q_\text{max} - q_i)$ and $[p + q_\text{max}, p + q_\text{max} + x_\text{max} - x_i)$ does not affect the evaluation of the polynomial. +* Conceptually, `poly_Az` of every block $i$ has $p + q_\text{max} + x_\text{max}$ variables. However, the value of the variables indexed at $[p, p + q_\text{max} - q_i)$ and $[p + q_\text{max}, p + q_\text{max} + x_\text{max} - x_i)$ does not affect the evaluation for parts of the polynomial. * Each circuit $i$ has different $Q_i$ and $X_i$, so $Az$ is expressed as a 3-dimensional vector, and the prover stores its MLE in a concise structure `DensePolynomialPqx`. #### Bindings on a `DensePolynomialPqx` @@ -236,34 +270,70 @@ for p in 0..num_instances: ``` which merges entry $2x$ and $2x+1$ into entry $x$. Note that if `num_inputs[p] = 1`, then the binding simply multiplies the first and only entry by $1-r$ without further modification. - - - -#### Sumcheck 1 +### Sumcheck 1 > Relevant functions: `R1CSProof::prove_phase_one` and `SumcheckInstanceProof::prove_cubic_with_additive_term_disjoint_rounds` Similar to the regular Spartan, sumcheck 1 is of the following form: -$$\sum \tilde{\text{eq}} \cdot (\tilde{Az} \cdot \tilde{Bz} - \tilde{Cz}) = 0$$ +$$\sum_{\tau\in\{0, 1\}^{p + q_\text{max} + x_\text{max}}} \tilde{\text{eq}}(\tau) \cdot (\tilde{Az}(\tau) \cdot \tilde{Bz}(\tau) - \tilde{Cz}(\tau)) = 0$$ Except that $\tilde{Az}$, $\tilde{Bz}$, and $\tilde{Cz}$ are now $p + q_\text{max} + x_\text{max}$-variate polynomials, which means the sumcheck involves $p + q_\text{max} + x_\text{max}$ rounds and returns with the challenge $r = r_p || r_q || r_x$. However, we want the prover to only perform $\sum_i Q_i \cdot X_i$ computations (as opposed to $P \cdot Q_\text{max} \cdot X_\text{max}$). -We first note that binding the $P$ variables first can cause inefficiencies. This is because each binding on a variable of the $P$ dimension collapses two vectors on the $Q_i$ dimension, which may be of different lengths. As for a toy example, assume that a polynomial $G$ only has 2 dimensions $P\times Q_i$, and let $P = 4$ and $Q_i = [4, 4, 2, 2]$. The polynomial would thus contain 4 variables: -$$\tilde{G}(x_{p, 0}, x_{p, 1}, x_{q, 0}, x_{q, 1})$$ +The solution is the same approach to the binding problem. `spartan_parallel` always performs sumcheck evaluation from right to left. Since the evaluation polynomial is of degree 3, at each round, the prover performs the following steps: +1. Bind the right-most variable in $\tilde{\text{eq}}, \tilde{Az}, \tilde{Bz}, \tilde{Cz}$ to 0, 1, 2, and 3 and form a degree-3 univariate polynomial. +2. Use the univariate polynomial to generate a challenge $r_j$. +3. Bind the right-most variable in $\tilde{\text{eq}}, \tilde{Az}, \tilde{Bz}, \tilde{Cz}$ to $r_j$. -Binding $x_{p, 0}$ to $r$ is equivalent to the following operations: -$$G_0 = \langle(1 - r)\cdot G_{0, 0} + r\cdot G_{2, 0}, (1 - r)\cdot G_{0, 1} + r\cdot G_{2, 1}, (1 - r)\cdot G_{0, 2}, (1 - r)\cdot G_{0, 3}\rangle$$ -$$G_1 = \langle(1 - r)\cdot G_{1, 0} + r\cdot G_{3, 0}, (1 - r)\cdot G_{1, 1} + r\cdot G_{3, 1}, (1 - r)\cdot G_{1, 2}, (1 - r)\cdot G_{1, 3}\rangle$$ +### Obtaining $\tilde{ABC}, \tilde{Z}$, and Sumcheck 2 +> Relevant functions: `R1CSProof::prove_phase_two` and `SumcheckInstanceProof::prove_cubic_disjoint_rounds` -Since $Q_2 = Q_3 = 2$, $G_{2, 2}, G_{2, 3}, G_{3, 2}, G_{3, 3}$ are all 0s, thus the prover does not access nor perform operations on them. As a result, in the first round, the prover's work is $\sum_i Q_i = 12$ multiplications. However, after the first round, the prover is left with $P = 2$ and $Q_i = [4, 4]$. So its work binding $x_{p, 1}$ would be 8 multiplications. +At the end of sumcheck 1, the prover generates the following claims: +$$\tilde{Az}(r_p, r_q, r_x), \tilde{Bz}(r_p, r_q, r_x), \tilde{Cz}(r_p, r_q, r_x)$$ -Now consider the alternative of binding $x_{q, 1}$ first. All bindings are performed within the $Q$ dimension: -$$G_0 = \langle(1 - r)\cdot G_{0, 0} + r\cdot G_{0, 1}, (1 - r)\cdot G_{0, 2} + r\cdot G_{0, 3}\rangle$$ -$$G_1 = \langle(1 - r)\cdot G_{1, 0} + r\cdot G_{1, 1}, (1 - r)\cdot G_{1, 2} + r\cdot G_{1, 3}\rangle$$ -$$G_2 = \langle(1 - r)\cdot G_{2, 0} + r\cdot G_{2, 1}\rangle$$ -$$G_3 = \langle(1 - r)\cdot G_{3, 0} + r\cdot G_{3, 1}\rangle$$ +Sumcheck 2 now checks that $\tilde{Az}, \tilde{Bz}, \tilde{Cz}$ are computed correctly. Similar to Spartan, `spartan_parallel` first samples random coefficients $r_A, r_B, r_C$ and computes +$$s = r_A\cdot \tilde{Az}(r) + r_B\cdot \tilde{Bz}(r) + r_C\cdot \tilde{Cz}(r)$$ +as `claim_phase2`. Next, it generates $\tilde{ABC}_{r_x}$ as $r_A\cdot \tilde{A} + r_B\cdot\tilde{B} + r_C\cdot\tilde{C}$ binded to $r_x$ and $\tilde{Z}_{r_q}$ as $\tilde{Z}$ binded to $r_q$. Note that the prover cannot bind both $\tilde{ABC}$ and $\tilde{Z}$ to $r_p$, as $\tilde{Az}$ is computed through a pairwise multiplication on the $P$ dimension, not a vector dot product. -This again costs 12 multiplications. However, this time it leaves us with $P = 4$ and $Q_i = [2, 2, 1, 1]$, and the next binding of $x_{q, 0}$ costs only 6 multiplications. +> In other words, $\tilde{Az}_{r_p} = \tilde{eq}_{r_p} \cdot (\tilde{A} \circ_p \tilde{Z})$, but $\tilde{A}_{r_p}\cdot \tilde{Z}_{r_p} = \tilde{eq}_{r_p} \cdot \tilde{A} \cdot \tilde{eq}_{r_p} \cdot \tilde{Z}$. The two operations are not the same. + +However, to compensate for the missing $r_p$, `spartan_parallel` inserts an additional term $\tilde{eq}_{r_p}(\tau) = \tilde{eq}(r_p, \tau[0..p])$. Since $\tilde{eq}_{r_p}(\tau)$ only depends on the first $p$ variables of $\tau$, it can be expressed using only $P$ entries. Sumcheck 2 of `spartan_parallel` is thus: +$$\sum_{\tau\in\{0, 1\}^{p + w + y}} \tilde{eq}_{r_p}(\tau)\cdot \tilde{ABC}_{r_x}(\tau)\cdot \tilde{Z}_{r_q}(\tau) = s$$ + +Is a degree-3 sumcheck with $p + w + y$ rounds. Since $\tilde{ABC}_{r_x}$ and $\tilde{Z}_{r_q}$ are $P\times W\times Y_i$ `DensePolynomialPqx`'s, the sumcheck and binding is again performed from right to left. Finally, we note that the cost of this sumcheck is independent of $Q$ and thus is independent of the size of each execution. This makes sumcheck 2 extremely cheap. + +## Opening Witness and Circuit Commitment +> Relevant structs: `PolyEvalProof` and `R1CSEvalProof` + +At the end of sumcheck 2, the prover reduces the entire proof down to just two claims: $\tilde{ABC}(r_p', r_x, r_w, r_y)$ and $\tilde{Z}(r_p', r_q, r_w, r_y)$. Note that $r_p'$ are the last $p$ challenges derived from sumcheck 2, and are different from $r_p$ in sumcheck 1. + +Finally, the prover opens all commitments on the given points, this includes: +* $3 \times P$ sparse polynomial commitments on circuits ($A$, $B$, $C$), each of size $X_i \times (W Y_i)$. +* $W\times P$ dense polynomial commitments on witnesses ($Z$), each of size $Q_i \times Y_i$. + +We leave the investigation of batched commitment opening as future work. + +## Shift and Program IO Proofs +> Relevant structs: `ShiftProofs` and `IOProofs` in `src/lib.rs` + +After all sumchecks, the prover still needs to show the following: +* Each of the shifted polynomials (`block_w3_shifted`, `perm_exec_w3_shifted`, `phy_mem_w3_shifted`, `vir_mem_w3_shifted`) are performed correctly. We first motivate the necessity of shifted polynomials, then briefly describe the shift proof process. + +> XXX: There might be ways to shift a polynomial without additional commitment. If you know any, please let me know. + +* The program inputs and outputs are correctly reflected in the witnesses. + +### Polynomial Shifts and Shift Proofs + +We recall our grand product construction: given a list of $w3_k = [v_k, x_k, \pi_k, D_k], k\in[0, Q)$, want to compute the cumulative product $\pi_k$ through $D_k = x_k \cdot (\pi_{k+1} + (1 - v_{k+1}))$ and $\pi_k = v_k\cdot D_k$. We note that the same computation is applied to every $w3_k$, and thus the computation should be easily parallelizable. Naively, one would generate a circuit $\mathcal{C}_\text{perm}$ for one instance of $w3_k$, and then execute that circuit $Q$ times. However, this problem to this approach is that the computation also involves entries of $w3_{k+1}$ ($\pi_{k+1}$, $v_{k+1}$). Alternatively, one can also construct $\mathcal{C}_\text{perm}$ to be satisfied by $w3'_k = [v_k, x_k, \pi_k, D_k, v_{k+1}, x_{k+1}, \pi_{k+1}, D_{k+1}]$, but the prover still needs to prove that the last four entries of $w3'_k$ matches with the first four entries of $w3'_{k+1}$. Our solution is to cut $w3'$ in two halves (i.e. set $W = 2, Y = 4$). This translates to two $Q\times 4$ commitments: `w3 = [v0, x0, pi0, D0, v1, x1, pi1, D1, ...]` for the left half and `w3_shifted = [v1, x1, pi1, D1, v2, x2, pi2, D2, ..., 0, 0, 0, 0]` for the right half. Finally, to prove that `w3_shifted` is `w3` shifted by 4 entries, the prover treats entries of `w3` and `w3_shifted` as coefficients to univariate polynomials $\tilde{w3}$, $\tilde{w3_s}$ and shows that for some random challenge $r$, +$$\tilde{w3}(r) = v_0 + r\cdot x_0 + r^2\cdot pi_0 + r^3\cdot D_0 + r^4\tilde{w3_s}(r)$$ + +Note that the prover also needs to show that $v_0, x_0, pi_0, D_0$ are indeed the first four entries of $\tilde{w3}$, which is quite difficult were $\tilde{w3}$ a univariate polynomial. Instead, for these openings, $\tilde{w3}$ is re-interpreted as a multilinear polynomial. Thus $v_0 = \tilde{w3}(0, 0, \dots, 0), x_0 = \tilde{w3}(0, 0, \dots, 1)$, etc. For this strategy to work, however, `spartan_parallel` must choose a polynomial commitment scheme that allows both univariate and multilinear opening. + +Finally, we remark that the same shift strategy can be applied to verify memory and register consistency checks on consecutive states. -As a result, `spartan_parallel` always performs sumcheck bindings from right to left. +### Proving the Correctness of Program Input and Output -_XXX: `bit_reverse` was already designed to help the binding from right to left. Upon further thoughts, however, this seems unnecessary. I'm currently in the process of removing bit-reverse and opt for direct right-to-left sumcheck binding, which is also how ceno does it._ \ No newline at end of file +The final step is to check that the witnesses indeed contain program inputs and outputs. The prover does so by opening corresponding entries on `exec_inputs` ($i', o'$, input and output state of each block sorted by time of execution). It performs the following opening: +1. The first execution is on block 0. +2. The final execution terminates at a designated exit block. +3. Inputs of the first block execution match program input. +4. Outputs of the final block execution match program output. \ No newline at end of file