-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite MAST to use structure-of-arrays/table-based representation #1217
Comments
Thank you for the great write up! This makes a lot of sense to me (and something like this should probably have been implemented right from the start). A couple of clarifying questions:
How would we use this structure to store node types for branch node (e.g., to figure out if it it is a pub struct Program {
nodes: Vec<Node>,
}
pub enum Node {
Span(Span),
Join(Join),
...
}
pub struct Join {
body: [NodeId; 2],
hash: Digest,
}
pub type NodeId = usize; In either case, how would we handle resolving "local" vs. "external" references. By "local", I mean a node which is present in the tree, and by "external" I mean something that lives outside of the tree (e.g., a MAST root of some well-known program). A potential alternative approach is to use digests as node IDs: pub struct Program {
nodes: BTreeMap<NodeId, Node>,
}
pub enum Node {
Span(Span),
Join(Join),
...
}
pub struct Join {
body: [NodeId; 2],
hash: NodeId,
}
pub type NodeId = Digest; We do lose the nice cache properties and compact memory representation here as we need to do a lookups into the
|
I guess one way to still use vectors could be like so: pub struct Program {
nodes: Vec<Node>,
}
pub enum Node {
Span(Span),
Join(Join),
...
Ref(Ref), // reference to an external node
}
pub struct Join {
body: [NodeId; 2],
hash: Digest,
}
pub struct Ref {
hash: Digest,
}
pub type NodeId = usize; I think this should work. One of the potential downsides is that the integrity of the structure is not "self-enforced" (e.g., due to a bug, we could end up with multiple nodes having the same hash located in different indexes - which would not be possible with the |
I think having a dedicated node type for external references is likely to be more ergonomic (and efficient), but you've got the right idea in terms of how to handle the various node types. The core idea is pretty simple/flexible, but the basic principle is that a
It definitely requires a stricter API in terms of the ways in which you can let people manipulate it, but there are a number of useful patterns in Rust for dealing with this sort of thing while retaining ergonomics. In particular, the use of the newtype pattern (you'd never use The actual // u32 because we can't construct a MAST with more than that number of nodes anyway
#[derive(Debug, Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct NodeId(u32);
impl Default for NodeId {
fn default() -> Self {
Self::INVALID
}
}
impl NodeId {
/// A sentinel value for invalid node ids can occasionally be useful
const INVALID: Self = Self(u32::MAX);
pub const fn as_usize(&self) -> u32 {
self.0 as u32
}
pub unsafe fn from_usize(index: usize) -> Self {
let id = Self(index.try_into().expect("invalid node id"));
assert_ne!(id, Self::INVALID, "invalid node id");
id
}
}
impl std::ops::Index<NodeId> for Vec<Node> {
type Output = Node;
fn index(&self, id: NodeId) -> &Self::Output {
self.index(id.as_usize())
}
} You can make the level of safety as rigorous as you want, but there is a balance between preventing bugs and ergonomics to consider. In particular, the next level of rigor is using generational indices, so that a In general, the risk of bugs of this type is low here, because the structure is immutable once built, and is only being used to represent a single MAST. Where this pattern can run into issues in terms of bugs is when you have some kind of structure that you hand out a In any case, while the efficiency gains are certainly noticeable, I don't think that's the primary benefit here anyway (just a nice to have). The main benefit of this structure is that it corresponds to a much better serializable format, and will also make it easier to introduce on-the-fly code loading of blocks which are not part of the initial tree. That latter bit is the main topic of my next proposal, but it got delayed yesterday, I'll have it up today for sure. |
All makes sense. We will probably need to build a map btw, flashing it out a bit more, I think pub struct Program {
nodes: Vec<Node>,
/// Node at which the program execution starts.
entrypoint: NodeId,
/// A list of nodes IDs which are referenced more than once.
roots: Vec<NodeId>,
} The |
It might be useful to keep it around, since it would allow dynexec/dyncall to determine whether or not the MAST it needs is in the current tree, or needs to be fetched from the cache/object store. But we can probably strip the mapping down to just those nodes which have their hash "captured" via
Could make that even more compact using a dense bitset, i.e. a bit for every index in I have some thoughts on the binary format for MAST in general, should I make that part of this proposal, or open a separate issue for that? |
Thinking a bit more about this, I defined The main benefit I see in tracking this info (or maybe building it on the fly during serialization) is that it would allow us to serialize such sub-trees super efficiently (e.g., spending something like
I'm not expecting the number of such nodes to be great (probably much smaller than the total number of nodes) - so, we may not need to do anything fancy. But I guess we'll need to test out this hypothesis.
I think a separate issue would be better. In my mind, there are actually 2 binary formats:
|
Closed by #1349. |
This is the first of a small set of issues I'm opening to try and set the stage for some of the things we need on the compiler/packaging side of things, but which also bring a number of benefits to the VM as well. I'm planning to work on implementing these things once there is consensus, but want to make sure everyone feels good about the proposals and has an opportunity to provide feedback on implementation details.
Proposal
This proposal is to reimplement the in-memory representation of MAST to use a more efficient, in both space and time, structure-of-arrays/table-based representation, rather than the tree structure currently used today. Interestingly, it seems like the seeds of this approach were planted with
CodeBlockTable
, but didn't take the idea to its logical conclusion. This proposal does just that. I'm guessing most/all of you have a pretty good feel for the reasons why this would be an improvement just in terms of memory and raw execution speed, but to recap:CodeBlock
enum-of-structs type wraps the set of various block types that compose the MAST, and which form the leaves and branches of the T in MAST. For the block types which are branches in the tree, namelyJoinBlock
andSplitBlock
, their children are expressed as a boxed slice ofCodeBlock
. Thus, the tree is navigated by traversing from the rootCodeBlock
to the leaves in a depth-first fashion.Rather than representing MAST as a literal tree in-memory, we can use a far more compact representation that is also cache-friendly, and significantly easier to query/visit. In essence, you go from a structure like this:
To something like this:
In essence, you have a table of nodes which can reference each other using their index in that table. Not only does this have nice perf characteristics, but it makes nodes directly accessible regardless of their depth in the tree, allows nodes to be trivially compacted if they are equivalent (e.g. MAST blocks with the same hash could be stored just once). It also becomes very easy to extend the tree with additional metadata without it being awkward to access. For example, something like so:
The compiler uses a similar type of structure in a few places, most importantly for representing the dataflow graph of a function. But perhaps a more notable implementation of this idea is in WebAssembly, which stores each type of "entity" in a module-scoped table, and all references to a given entity use its index in the table, rather than the entity itself.
Rationale
This change would pave the way for some other proposals, which I'll be submitting over the next day or two. The key thing this specific proposal will enable, aside from the other benefits mentioned, is straightforward serialization of the MAST representation of a program (or its component parts) in a compact format more suitable for packaging and distribution. The idea would then be to ship this representation rather than Miden Assembly sources when publishing a Miden contract/program for consumption. There are some other important properties gained from this representation, but more on that in my next issue.
The text was updated successfully, but these errors were encountered: