-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement extensible subsystem for on-demand storage/provisioning of MAST objects #1226
Comments
This looks great! Thank you again for putting so much though into this. I agree with most things above and I think it describes a pretty good end state (there are some minor things we can discuss in the future, but these won't have an impact on the overall direction/design). This is a pretty big change, however, and one thing we should try to do is break it down into small incremental updates. Below, are some of the first steps that I think we could take and then we can expand from there: Step 1: update MAST representation to be table-basedThis is basically what is described in #1217. This will involve:
Regarding point 1: we already discussed some high-level approaches to this struct and I'm curious if you have more concrete thoughts on this. A couple of points from my end:
impl Component {
pub fn get_node(&self, hash: Digest) -> Option<&MastNode> {
todo!()
}
} With this, it would be pretty easy to migrate the VM from using Step 2: implement
|
Thanks @bobbinth! You're overall sketch of the process more or less aligns with what I was expecting in terms of roadmap, so I'm glad to see I haven't forgotten something significant.
Definitely, and though it wasn't especially clear from my writeup, I wasn't expecting this to be an all-or-nothing, or even all-at-once kind of change, but rather a roadmap for how to get from where we are now to a VM that has the functionality I described. The overall steps you've laid out are pretty much the lines along which I was expecting to divide up the tasks at a high-level, but some of those are more involved and may warrant further breakdown.
Sure, I can be more specific with my thoughts here: First, the NOTE: I'm using more general terminology in the code below, i.e. /// Assume this is the more robust handle implementation I outlined in #1217
/// But the key points are:
/// * Has the same layout as a u32 when wrapped in an Option (due to niche optimization)
/// * Can be used to index slices just like a u32, but with some additional safety
/// * Could be modified to contain a generational index if we expect the MAST to be
/// mutated while handles are floating around
/// * Implements TryFrom<u32/usize>/Into<usize> for convenience
#[repr(transparent)]
#[derive(Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct MastNodeId(core::num::NonZeroU32);
impl MastNodeId {
pub fn new(id: usize) -> Result<Self, InvalidNodeIdError> {
if id == u32::MAX {
Err(InvalidNodeIdError)
} else {
unsafe { Self(NonZeroU32::new_unchecked(id + 1) }
}
}
pub fn as_usize(self) -> usize {
(self.0.get() - 1) as usize
}
}
/// A [MastForest] represents either a disjoint set of MASTs, or an arborescence,
/// the distinction depending both on whether the forest represents a library or a
/// program, as well as the dependency structure of the code itself. For example,
/// an exectuable program would almost always be an arborescence, since it has
/// a single entrypoint/root, and code not reachable from that entrypoint must be dead.
/// Whereas, a library might consist of one or more entrypoints/roots.
///
/// In any case, this single data structure can represent both, and the semantics
/// would allow merging trees/forests into larger ones; as well as splitting a
/// forest into smaller forests, or into its constituent trees. These operations
/// are of particular usefulness when it comes to things like serialization/deserialization,
/// as well as granular storage (and loading) of trees at runtime.
#[derive(Default)]
pub struct MastForest {
/// All of the blocks local to the trees comprising the MAST forest
nodes: Vec<MastNode>,
/// Dense bitset where each set bit indicates that the corresponding
/// index in `nodes` is a MAST root. Multiple ways of representing
/// this information is possible, but I'm choosing this one to illustrate.
roots: SmallVec<[u64; 1]>,
/// The "entrypoint", when set, is the root of the entire forest, i.e.
/// a path exists from this node to all other roots in the forest. This
/// corresponds to the executable entry point. When not set, the forest
/// may or may not have such a root in `roots`, but is not required.
/// Whether or not the entrypoint is set distinguishes a MAST which is
/// executable, versus a MAST which represents a library.
///
/// NOTE: The entrypoint is also present in `roots` if set
entrypoint: Option<MastNodeId>,
/// FUTURE: Used to fetch nodes which are referenced, but are not present
/// in `nodes`. Doesn't necessarily have to be made available here though,
/// this is just one way such a thing could be used.
loader: Arc<dyn ObjectStore>,
}
impl Mast {
pub fn is_root(&self, id: MastNodeId) -> bool {
let id = id.as_usize();
let chunk = id / 64;
let bit = (id % 64) as u64;
self.roots[chunk] & bit == bit
}
pub fn roots(&self) -> impl Iterator<Item = MastNodeId> {
let last_chunk = self.roots.len().saturating_sub(1);
let trailing = self.nodes.len() % 64;
self.roots.iter().copied().enumerate().flat_map(move |(chunk, bitset)| {
let start = (chunk * 64) as u32;
let end = if chunk == last_chunk { start + trailing } else { start + 64 };
(start..end).filter_map(|id| {
let id = MastNodeId::new(id).unwrap();
if self.is_root(id) { Some(id) } else { None }
})
})
}
/// We could also maintain a digest -> id map, but if we only expect to look
/// for roots, this should be sufficient
pub fn get(&self, hash: Digest) -> Option<&CodeBlock> {
self.roots().map(|id| &self.nodes[id]).find(|node| node.digest() == hash)
/// FUTURE: Fall back to loading from object store
.or_else(|| self.loader.get(hash))
}
pub fn has(&self, hash: Digest) -> bool { self.get(hash).is_some() }
pub fn insert_root(&mut self, node: MastNode) -> MastNodeId {
let id = self.insert_node(node);
let index = id.as_usize();
let root_index = index / 64;
let bit = (index % 64) as u64;
if root_index == self.roots.len() {
self.roots.push(1u64 << bit);
} else {
self.roots[root_index] |= (1u64 << bit);
}
id
}
pub fn insert_node(&mut self, node: MastNode) -> MastNodeId {
let index = self.nodes.len();
let id = MastNodeId::new(index).expect("too many nodes");
self.nodes.push(node);
id
}
pub fn build(&mut self) -> MastBuilder<'_> {
MastBuilder::new(self)
}
}
/// This structure is used to build a MastForest while following the
/// control flow of a program naturally.
///
/// # Example
///
/// let mut forest = MastForest::default();
/// let mut builder = forest.build();
/// # Start a new `begin .. end` block
/// let mut entry = builder.begin();
/// # Start building a `if.true .. else .. end` construct
/// let mut split = builder.split();
/// # The split builder starts in the "true" block
/// split.call(a);
/// # And must be explicitly switched to the "false" block
/// split.else();
/// split.call(b);
/// # When both blocks are built, control flow is explicitly rejoined,
/// # consuming the split builder.
/// split.end();
/// # Likewise with the `begin .. end` block
/// entry.end();
/// # Lastly, the MAST built so far is added to the underlying forest
/// builder.build();
pub struct MastBuilder<'mast> {
mast: &'mast mut MastForest,
}
impl<'mast> MastBuilder<'mast> {
fn new(mast: &'mast mut MastForest) -> Self {
Self { mast }
}
/// Save the tree built so far to the underlying `MastForest`
pub fn build(self) { ... }
} The The current Beyond that, the only other "interesting" bits are in serialization/deserialization, as well as the object store/registry which were the subject of this proposal. I'll open a proposal for the binary representation of |
I like
I like the idea of putting the btw, what is the reason for using
I think we will probably have to maintain the |
Primarily to avoid monomorphizing all of the Of course, if the hypothetical In any case, that was my rationale. I'd be inclined to switch to making it generic if it was important for performance that we monomorphize code paths which interact with it, but the overhead of loading objects from the store is going to dwarf the actual dynamic dispatch, so the ergonomic benefits seem to far outweigh the downsides. It's also quite likely that the
Could you elaborate on the instances where you expect to look up a node in the For example, the basic storage implementation I'm imagining, breaks up a In practice, its likely there would be a little more involved there, e.g. we'd need to load whatever metadata is associated with that MAST root to allow us to initialize its memory in a new context (whether that's directly, or by placing the data into the advice provider). There is some trickiness involved there that I'm still mulling over, but in any case, it doesn't have any bearing on the specific thing we're talking about here. In any case, I just want to make sure I understand the circumstances in which it would be useful/necessary to maintain the full digest/node id map. |
I guess this depends on how we implement MAST traversal on the VM side. If it is based on
If we do it this way, the only time we'd use the MAST-root-based One thing to note, the implementation of Also, one thing I just realized is that with the approach we've been discussing so far, it seems like we'd need to assign node IDs to all nodes loaded from the store (or at least to re-compute them). This is because node IDs make sense only in the context of a given forest. So, for MAST trees coming from the store, "local" node IDs may conflict with the ones already in the forest. This may not be a big deal, but it does prevent us from building a "hot cache" of MAST components - e.g., something that lives in memory and is available for traversal without any additional work. |
It's true that trying to merge two That said, I think there are other reasons why we may not want to make this quite as transparent in the The traversal of a program would then be decoupled from a specific pub struct Continuation {
/// The forest containing the node we'll continue at
forest: Arc<MastForest>,
/// The node at which control will resume
node: NodeId,
/// If applicable, the child index in `node` where control should resume.
/// For nodes with only a single child, or with no children, this would always be zero.
index: usize,
} As we traverse the MAST depth-first, we push a continuation on the stack when visiting a child of some node, and pop a continuation off the stack when we've finished executing the current node and need to continue somewhere further up the tree. In this conceptual model the "current continuation" is the continuation on top of the stack. For example: graph TD;
subgraph A
direction TB
subgraph A^1
direction TB
A1("Join { id: 0, digest: 0x0, children: [1, 10] }");
B1("Join { id: 1, digest: ..., children: [2, 6] }");
C1("Span { id: 2, digest: ..., children: [3, 4, 5] }");
D1("Call { id: 3, digest: ..., callee: 0x2 }");
E1("Span { id: 6, digest: ..., children: [7, 8, 9] }");
F1("Call { id: 10, digest: ..., callee: 0x1 }");
end
subgraph A^2
G1("Span { id: 11, digest: 0x1, children: [...] }");
end
end
subgraph B
A2("Join { id: 0, digest: 0x2, children: [1, 5] }");
B2("Span { id: 1, digest: ..., children: [2, 3, 4] }");
C2("Span { id: 5, digest: ..., children: [6, 7, 8] }");
end
A1-->B1;
A1-->F1;
B1-->C1;
B1-->E1;
C1-->D1;
A2-->B2;
A2-->C2;
Here we have two forests,
So the bottom line here is that despite node IDs being local to a The key thing about this approach is that it when we request a MAST root from a
I think using the scheme I sketched out above, we get best of both worlds - efficient |
I think the approach you've described with continuations will work, but it also wonder if we need to add this much complexity (at least initially). Specifically, a much simpler approach would be to do something like this: pub struct MastForest {
nodes: BTreeMap<Digest, MastNode>,
roots: Vec<Digest>,
entrypoint: Option<Digest>,
loader: Arc<dyn ObjectStore>,
}
impl MastForest {
pub fn get(&self, node_hash: Digest) -> Option<MastNode> {
if let Some(node) = self.nodes.get(node_hash) {
match node {
MastNode::External(ext_hash) => self.loader.get(ext_hash),
_ => node
}
} else {
None
}
}
} The main benefit of this approach is that changes on the VM side would be minimal (maybe like an hour or two of work) and we'd be able to implement the whole transition way faster. The obvious downside is that this will be much less memory efficient and traversals will take longer, but:
Note, that this struct does not need to affect the serialization format. We can make serialization format more or less independent from the in-memory representation. |
I just caught up to this whole discussion; it all sounds very exciting! I wanted to share 2 quick thoughts: 1. Being mindful of registry attacksThe registry idea sounds great. Only thing I'll say is we should be mindful of security, since registries can be used for attacks (see NPM). Specifically, the prover should always verify the MAST that it received for a given root. Theoretically, this wouldn't be needed, since the verifier should catch any unsound MAST and reject any proof that used an execution trace built using it. But in practice, an attacker could use a registry to exploit a bug in the AIR constraints, such that the prover would build a proof using this invalid MAST, and the verifier would accept it. Of course, such attacker could run its own prover to build a proof that exploits the constraints, but at least, this would prevent anyone using our prover to generate malicious proofs due to a registry attack. 2. Plan ahead for MAST breaking changesWe should plan ahead for any future breaking changes to MAST and/or the serialization format in our packaging solution. For example, say we add a new node type in the MAST, anyone running an outdated prover would fail to deserialize that new MAST. Or similarly for the serialization format, I'd expect a registry to understand many (all?) different versions of the serialization format as it changed through time, and be able to deserialize whichever version into its current MAST (if compatible). We should also keep in mind that theoretically, hash collisions are possible across breaking changes of the MAST. A dummy example is if from version A to version B, we swap the hash domain between join & split nodes. Then, a tree with a JOIN as root in version A would have the same hash as the same tree in version B, except for the root changed to a SPLIT. Although we (hopefully) would never do that literally, we could unintentionally introduce a bug in a similar but more complex manner, which would allow unintended code to be fetched from the registry and executed/proved. Now, we should expect the verifier to catch that invalid code was executed (it basically almost always will), so it's not a huge safety concern. But at the very least, it would be a very annoying bug to figure out and fix, which would probably require people purge their local registries, etc. So this suggests we should think of the "content unique IDs" (referring to the content-addressable property of MAST) no longer as only |
In 0xPolygonMiden/compiler#132 (review) we discussed two ways of serializing MAST: (1) "fast" (trusted) and (2) "compact" (un-trusted). I'm imagining that the registry would contain the "compact" variant, and so, on download, the user would re-build the MAST to verify that the roots match. There is another issue here though: while verifying that a given program has a given MAST root is enough to verify the correct execution of this program, it may not be enough to execute it. There are two reasons for this:
Given the above, we may need to provide two commitments for a given library: a set of MAST roots as well as sequential hash of the serialized MAST (including decorators). Upon download the user could then verify that both hashes are correct and be sure that they have the right program.
This would be a potentially very serious security vulnerability and I think we should make sure we avoid such scenarios. One way to do this would be to change domain separators for MAST nodes any time we introduce breaking changes (say as we go from v1.0 to v2.0). This way, we'd be guaranteed that version MAST v1.0 will never have the same hash as MAST v2.0 - even for identical programs. |
With #1349 we will have the first part of the MAST migration implemented. The next step would be to add support for the
The first approach may be a bit simpler, but in the long term term we'll need the second approach. At this point yet I'm not sure we should start with the 1st approach to build the registry iteratively or if we should go for the second approach right away. One other thing to consider: we may need an in-memory registry anyways for things like tests. In either case though, I think we'll need to make the following changes: Object storeFirst, we need to define a trait for pub trait ObjectStore {
/// Returns MAST forest corresponding to the specified digest, or None if the MAST forest for this
/// digest could not be found in this [ObjectStore].
fn get(&self, node_digest: Digest) -> Option<&MastForest>
} External nodesSecond, we'll need to add
But what to do about the procedures which we can resolve to MAST roots but do not want to include into the assembled MAST forest? For example, let's say we are compiling a program that uses One way to address the above is to instantiate the assembler with its own object store. And then, if a procedure is found in this object store, we'd output an The pub struct Assembler {
/// The global [ModuleGraph] for this assembler. All new [AssemblyContext]s inherit this graph
/// as a baseline.
module_graph: Box<ModuleGraph>,
/// The global procedure cache for this assembler.
procedure_cache: ProcedureCache,
/// Whether to treat warning diagnostics as errors
warnings_as_errors: bool,
/// Whether the assembler enables extra debugging information.
in_debug_mode: bool,
/// Whether the assembler allows unknown invocation targets in compiled code.
allow_phantom_calls: bool,
/// An object store loaded with MAST forests which we assume to be available to the VM at runtime.
store: Arc<dyn ObjectStore>,
} But this would also require a deeper refactoring of the assembler and is not something we need on the first iteration - so, maybe we do it as a follow up after the basics are implemented. Loading the object storeLoading of the object store would left to specific implementations. Assuming we want to implement a simple in-memory object store, it could look something like this: pub struct MemObjectStore {
procedures: BTreeMap<Digest, MastForest>,
}
impl MemObjectStore {
pub fn load(&mut self, mast: MastForest) -> Result<(), SomeError> {
// here, we'd break the incoming MAST forest into individual procedures such that the entries that
// we add to self.procedures correspond to individual procedure MASTs.
}
} To support the above functionality, we'll also need to refactor how we identify procedures in the pub struct MastForest {
/// All of the nodes local to the trees comprising the MAST forest.
nodes: Vec<MastNode>,
/// Roots of all procedures defined within this MAST forest .
roots: Vec<MastNodeId>,
/// This corresponds to the executable entry point. Whether or not the entrypoint is set distinguishes
/// a MAST which is executable, versus a MAST which represents a library.
entrypoint: Option<MastNodeId>,
kernel: Kernel,
} I'm also wondering if Processor changesIn the processor, we could add the object store directly to the struct Process<H>
where
H: Host,
{
system: System,
decoder: Decoder,
stack: Stack,
range: RangeChecker,
chiplets: Chiplets,
host: RefCell<H>,
max_cycles: u32,
enable_tracing: bool,
store: Arc<dyn ObjectStore>,
} Or we could add it to the pub trait Host {
...
/// Returns MAST forest corresponding to the specified digest, or None if the MAST forest for this
/// digest could not be found in this [Host].
fn get_mast_forest(&self, node_digest: Digest) -> Option<&MastForest>;
...
} Intuitively, it feels like putting this functionality into the When the VM encounters an MastNode::External(node_digest) => {
let mast_forest = self.host.get_mast_forest(node_digest)?;
let node_id = mast_forest.get_node_id_by_digest(node_digest)?;
self.execute_mast_node(node_id, mast_forest)
} In the above, I'm not handling the case when an |
I think some concepts might be getting conflated here: in the design I had in mind, we need two components, the registry (which manages packages), and the object store (which manages MAST objects). The latter needs both in-memory and on-disk stores, while the registry doesn't need its own storage at all (or rather, the storage associated with registry data depends on the specific registry implementation). Below I've quickly sketched out the broad strokes of these components and how I see them interacting with one another (and with the general workflow of executing programs): RegistryThe registry provides two things:
The registry basically exists to facilitate looking up a package and installing its objects (and those of its dependencies, if available) into the object store. The core registry API is dead simple, and tailored towards the needs of the VM with regards to packaging. Depending on the specific registry implementation, you might have additional tooling above and beyond that provided by the core registry API, e.g. publishing, checking for updates. The simplest possible registry would look for package files in a directory on-disk (e.g. A slightly more sophisticated registry might follow a similar approach, but instead require the directory to be managed via a CLI tool, which not only installs/removes packages from the directory, but also maintains an on-disk index (e.g. When the time comes where we have an actual centralized registry akin to crates.io, the local registry might maintain just an index of packages it has installed (so as to avoid attempting to install objects from a package which was already installed), and route all metadata requests to the remote registry server. In this case, storage for the registry itself is purely optional. Object StoreThe object store provides two things:
The object store itself requires both in-memory (hot) storage and on-disk (cold) storage:
If the VM is instantiated and discarded after each program, then the in-memory store can be very naive, and never unload anything. This consumes more memory the more complicated the call graph of a program is, but doesn't matter too much for most programs. The on-disk store ensures that instantiating a VM frequently is inexpensive (as all the work of preparing objects to be loaded into memory has been done). In this use case, you could theoretically omit either the in-memory or the on-disk store, but doing so would impact runtime performance significantly. Because the in-memory store would be so simple in this case, it makes little sense to only implement one of them IMO. If the VM is instantiated once, and run as a long-lived process like a server, then the in-memory store will need to select a caching algorithm (e.g. ARC or LRU), and use that to determine what objects to keep in-memory, and which objects to drop and reload from disk when needed. In this use case, both the in-memory and on-disk stores serve a critical role. Putting It TogetherSo with the design for those components in mind, let's say you want to invoke a program you have assembled locally, with the VM, here's a general sketch of how I imagine that working:
Similarly, a program could be executed by installing it via the registry, and then invoking the entrypoint as described in Step 3 above. If at any point, a referenced MAST root was requested from the object store, but not found, execution of the program will trap, and an error will be returned as the result. Whether or not the full dependency tree is installed when a package is installed isn't super important. If you defer installation of dependencies until they are actually used, then you need to ensure that when a given MAST root is requested and not found in the object store, that you can look up what package in the current program's dependency tree, exports that root, and install the relevant package at that point. Doing it this way allows you to treat missing dependencies as a non-issue until an execution trace actually hits that dependency. The simplest approach is to just require everything up front, so that you don't waste time executing a program that is doomed to fail if a key dependency is missing. I'll follow up here on the rest of your comment, but wanted to make sure the vision for how all of this ties together was laid out in a shorter, more cohesive description, now that some of the work has been done in the assembler that lays the foundation for it. If you spot any issues with what I've sketched out, or have a different approach in mind, let me know. |
I think we also need the following methods: /// Inserts all of the objects comprising the given MAST forest to the object store
fn insert(&mut self, forest: &MastForest);
/// Removes the referenced object from the object store
fn remove(&mut self, node_digest: Digest) -> Result<(), InvalidObjectError>;
The main reason we need the kernel in the assembler is because we need to know the names of the kernel procedures and their corresponding MAST roots. Currently, we require the kernel to be assembled first, and then we use the information gleaned from that assembly to compile programs on top of that kernel. To avoid needing to do it this way, we need to get the same information by some other means (e.g. package metadata).
This was the case prior to @plafer's PR, but since we removed that node type, I think we just panic now. In any case, if we restore that code, this one is solved for.
In the assembler refactoring, all procedure calls were done via external/proxy nodes, so this was also a non-issue. The small tweak I proposed in @plafer's PR (prior to us deciding to remove that node type entirely), was to check if the referenced MAST root is part of the current forest, and if so, resolve it as part of the forest, otherwise resolve it as an external/proxy node. This meant that the code local to a program would use forest-local references, while code from dependencies would use external/proxy references. Additional heuristics could be used to further refine how that choice is made (i.e. deliberately choose to use a proxy reference even when the node is available in the current forest), but I don't believe that is likely to be necessary.
Just to be clear, no extra info is required. When assembling a given program, we aren't assembling the whole world, we're assembling just the code we are given. To the extent we need extra info in the assembler, it is package metadata for the dependencies of the program being compiled, so that we can resolve calls to those dependencies to their corresponding MAST roots. We don't have packaging implemented yet, so at this point there isn't even a way to compile a program that references dependencies by name, without also providing the code for the dependency at the same time. Once we implement packaging, that is no longer the case, but at that point the assembler can use the package metadata to resolve procedure references without having to compile the dependency, and thus will use external/proxy nodes for those references. TL;DR - this is solved automatically as a side effect of implementing packages.
It isn't necessary for the assembler to have the actual objects at all - it need only know the set of exports from a given package (i.e. the fully-qualified procedure names and their corresponding MAST roots). We can already tell if a given MAST root refers to a concrete node that is part of the current compilation unit or not.
I don't think the object store has or needs user-provided implementations (user-configurable, sure/maybe). The registry is different, that very obviously needs to be pluggable. See the Object Store section of my previous comment for an elaboration on what its implementation consists of. To the extent that anything needs to be pluggable in the object store component, it would be the implementation of the on-disk/cold object storage. IMO that is unnecessary, and we could just provide configuration rather than a plug-in system for those details, but if anywhere made sense, it would be there.
This metadata is already supposed to be part of the
We also need it to merge forests and/or progressively extend a forest. I think these operations will be quite common throughout the execution of a program, so removing that field would vastly limit the utility of the
The As an aside, I prefer to think of
I think rather than request specific operations like This way, users can easily instantiate default impls (or third-party impls) for those traits without needing to implement a bunch of trait methods. Not to mention it ties all of those traits to the
I think this is basically what I was just describing above (assuming the
The one thing missing here is merging the loaded object (a |
Ah yes - I was sloppy with terminology here. Everything I wrote was about the object store (and not about the registry).
I'm not sure this is a requirement: we could have a purely in-memory implementation of the object store, and we can also have a more sophisticated implementation that works the way you've described.
I think one of the main differences between how you and I think about it is that in my mind the VM is instantiated only with an For example, I think of the above methods as being methods on a specific implementation of the
I think this may be tricky as depending on the execution path the program may require a different set of dependencies to be available. So, in a sense, the set of dependencies may be "dynamic" and depend on the input with which a given program is executed. In my mind, the dependencies would be fetched from the object store at runtime (rather than prior to program execution). This way, we don't need to fetch something that would not eventually be requested during execution. But I think you mention something similar a few paragraphs down from the above quote.
Right - I'm imagining that we'll eventually end up having kernel packages and so an assembler would be instantiated with a kernel package (or build one on the fly) and the only useful info from that package to the assembler would be a map
I think this is another main point of difference: I'm still not sure we need to merge MAST forests and my thinking so far has been based on the idea of "switching forests". My reasoning for this is as follows:
So, to summarize the way I'm envisioning the process:
|
The current design of Today's pub struct Procedure {
nodes: Vec<MastNode>,
node_id_by_hash: BTreeMap<RpoDigest, MastNodeId>,
entrypoint: MastNodeId,
} We would represent pub enum MastNode {
// ...
External(Digest)
} Then, a pub struct Program {
procedures: BTreeMap<Digest, Procedure>
}
impl Program {
pub fn insert(&mut self, proc: Procedure) {
self.procedures.insert(proc.hash(), proc)
}
} The VM's processor would actually execute an pub struct AbstractProgram {
store: ObjectStore,
program: Program
}
impl AbstractProgram {
// Note: If this fails, VM execution halts
pub fn get_procedure_by_digest(&mut self, proc_digest: Digest) -> Option<Procedure> {
if let Some(proc) = self.program.procedures.get(proc_digest) {
// procedure already loaded in program, just return it
Some(proc)
} else {
let proc = self.store.get(proc_digest)?;
// Cache the procedure, and return. This has the effect of automatically resolving all
// `External` nodes with this hash in the program to this procedure.
//
// Note: we'd want procedures to be behind `Arc`
self.program.insert(proc.clone());
Some(proc)
}
}
} where the pub trait ObjectStore {
// use `&mut self` to allow caching (without interior mutability at least)
fn get(&mut self, proc_digest: Digest) -> Option<Procedure>;
} For me, the main benefit of this set of abstraction is that it maps more closely to how I think about program execution than the other suggestions. For example, External calls are also resolved very cleanly. You only need to update the "procedure cache" (i.e. the Similarly, a Now, I'm not sure if using a different approach for "execution" and "package management" is a good idea; this is just meant as my idea of what would be the ideal representation strictly for execution. |
A few comments about this approach: If I understood it correctly, it is opinionated in a way that we assume that So, to sum up - I think introducing
The way I think about it, if we go with So, as far as the processor is concerned MAST forests are immutable, and whether MAST forests are mutable or immutable inside the object store, would be up to a specific implementation of the object store. Here is how I'm thinking some of the structs could change concretely: pub enum MastNode {
// ...
External(Digest)
}
pub struct MastForest {
/// All of the nodes local to the trees comprising the MAST forest.
nodes: Vec<MastNode>,
/// Roots of all procedures defined within this MAST forest .
roots: Vec<MastNodeId>,
}
pub struct Program {
mast_forest: MastForest,
entrypoint: MastNodeId,
kernel: Kernel,
} Note that we no longer have For the processor, we'll need to make the following change: pub trait ObjectStore {
// if object store needs to be mutated, this would be done via interior mutability
fn get(&self, node_digest: Digest) -> Option<Arc<MastForest>>;
}
pub trait Host {
// ...
/// Returns MAST forest corresponding to the specified digest, or None if the MAST forest for this
/// digest could not be found in this [Host].
fn get_mast_forest(&self, node_digest: Digest) -> Option<Arc<MastForest>>;
// ...
}
pub struct DefaultHost<A, S> {
adv_provider: A,
store: S, // we could also have it as Arc<dyn ObjectStore>,
} We'll also need to provide a very simple implementation of in-memory object store which could look something like this: pub struct MemObjectStore {
masts: BTreeMap<Digest, Arc<MastForest>>,
}
impl MemObjectStore {
pub fn insert(&mut self, mast: MastForest) -> Result<(), SomeError> {
for root in mast.roots() {
self.masts.insert(root, Arc::new(mast));
}
}
}
impl ObjectStore for MemObjectStore {
fn get(&self, node_digest: Digest) -> Option<Arc<MastForest>> {
self.masts.get(node_digest).cloned()
}
} In the future, we'll also implement more sophisticated object stores which may read data from disk and/or request packages form registry - though, how exactly this will work we can figure out when we get there. Lastly, the way the MastNode::External(node_digest) => {
let mast_forest = self.host.get_mast_forest(node_digest)?.as_ref();
let node_id = mast_forest.find_root(node_digest)?;
self.execute_mast_node(node_id, mast_forest)
} The above assumes that there is a I've omitted the changes to the assembler from the above description. We can discuss them separately - but I think they shouldn't be too big. |
After #1370, we'll have a few more things left to do before we can close this issue. Specifically, I'm thinking we need to do the following:
To compile a library, we first need to define a I think the minimal functional version could look something like this: pub struct Library {
mast_forest: MastForest,
// a path for every `root` in the associated [MastForest]
exports: Vec<LibraryPath>,
} We'd probably also need to have other properties (e.g., Once we have the pub fn assemble_library<T: Iterator<Item = impl Compile>>(
&self,
modules: T,
matadata: LibraryMetadata, // name, version etc.
) -> Result<Library, Report> There could be other ways to do this though. @bitwalker - I'm curious if you were thinking of a different approach which would work better for the compiler. If we do go with this approach, we should probably rename the current Next, we'll need a way to load the libraries into the assembler. For this, I think we can re-purpose the current One other thing to note is that when loading a library into the assembler, we would need to specify whether the library is "vendored" or not. So, the pub fn add_library(&mut self, library: Library, vendored: bool) -> Result<(), Report> The difference between vendored and non-vendored libraries is that:
Another thing we may want to consider is treating kernels as libraries. Specifically, we could add another method to the assembler specifically for compiling kernels: pub fn assemble_kernel<T: Iterator<Item = impl Compile>>(
&self,
modules: T,
) -> Result<Library, Report> In the above, one of the modules in the provided module list would be required to be Once we have this, we'd change the pub fn with_kernel(kernel: Library) -> Self |
Closing as most of this has been implemented in #1359, #1370, #1401, #1409, #1411, #1416, #1418, and #1422. There are a few outstanding issues remain for which I'll create separate PRs/issues. These include:
|
@bobbinth FYI, we already support |
Ah - nice! I updated my comment above.
What I meant by this is that it might make sense to modify the |
This is the second, and most significant of the three issues I planned to open related to MAST. It is intended to build on the foundation laid by #1217, and - in my opinion anyway - is the lynchpin to a number of important features that are related to packaging/distribution/deployment.
Background
Miden is currently oriented around shipping Miden Assembly files (either in
.masm
form or bundled into a.masl
library). These files must be placed on disk in a precise structure, and then the VM invoked with the right set of incantations to make sure everything necessary to assemble and run the program is available. The assembler then processes these into a MAST and executes the program, all in one shot.This isn't ideal for a number of reasons:
Binary Size
I believe one of the primary reasons MASM ended up being the de-facto format for distribution was due to it being viewed as significantly smaller in terms of binary size - but I think the actual case for this is a bit flimsy in the big picture. While it is almost certainly true that serializing our current MAST structure would indeed be bloated in comparison to the original MASM, there are some counter-balancing properties here:
libfoo
). If everyone is shipping MASM artifacts (containing all of the MASM code that their program uses), you end up with thousands of redundant copies oflibfoo
being stored. With MAST however, regardless of how many times a given version oflibfoo
is used, only a single copy need be stored.Proposal
The background above isn't what this proposal is about though, it just is the necessary backdrop for it. For now, let's assume we're moving to a world where MAST is our object code format (i.e. both in-memory and on disk, MAST is the equivalent of e.g. ELF executables, static archives/dynamic libraries).
I think I can summarize the key things that this enables:
In short, my primary proposal here is to introduce some changes to the VM in how it provisions MAST for execution, both initially at program startup, as well as during execution of the program. Specifically, the introduction of two high-level abstractions, with the necessary set of interfaces for use in the various parts of the VM that need to interact with them:
Object Store
This component would be responsible for storage and provisioning of the binary objects corresponding to a given MAST root. Conceptually the storage is content-addressable, where the MAST root is the content hash. This component has the following responsibilities:
The primary users of this component are three different parties:
Package Registry
This component is not a full package management solution, nor a registry like crates.io, but a core piece of VM infrastructure to make it package-aware, and to allow straightforward integration between package management tooling, the object store, and the VM. It has the following responsibilities:
Putting It Together
These two components work in tandem to provide a unified and convenient experience for publishing/distributing/and executing Miden programs. It can be extended to support complex tools and centralized registries like Cargo/crates.io, or be kept completely offline and using the local filesystem, or somewhere in between. Fundamentally though, the VM will assume that all code is either part of the program MAST which it started executing (the status quo), or that it can be loaded just-in-time from the object store. Any failed attempt to load MAST from the object store should result in a trap (and appropriate error code).
TODOs
There are a few related items we'll want to also tackle along these lines:
exec.HASH
, mirroringcall.HASH
, so that procedures can be explicitly invoked using their MAST root. From the brief scan of the code that Bobbin and I did, this seems like it's basically there already, just need to tie up the loose ends.warg
, which is designed for the WebAssembly component ecosystem, but which we would retool for Miden "components". The main benefit with it, or a solution like it, is that it is relatively Cargo-like in nature, i.e. it can support source-control deps, GitHub repos, a centralized "official" registry, as well as private/offline registries. That lets us build out some decent tools without having to build our own crates.io.Final Thoughts
I tried to avoid letting this go too long by getting into example code and things like that, but I'm happy to show how I think some of these interfaces will actually look like/work. Interested to hear your feedback!
The text was updated successfully, but these errors were encountered: