-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add Rust lazy reader #77
Conversation
2247179
to
a58b9c0
Compare
In the existing rust reader code, we usually use let xxx_reader = XxxReader::from_compatible_slice(raw_data.as_slice()).map_err(|_| ...) But in the new lazy reader code, I can't find a similar method to do this, could you please give an example of how to do this? |
could you implement the iterator for lazy reader:
it's useful to write code like:
|
If a slice data is already loaded into memory, it's not the scenario to lazy reader. |
I mean the lazy reader lacks a way to verify the data format of the slice after lazy loading. |
We can verify format like this:
|
tools/codegen/src/generator/languages/rust_lazy_reader/generator.rs
Outdated
Show resolved
Hide resolved
tools/codegen/src/generator/languages/rust_lazy_reader/generator.rs
Outdated
Show resolved
Hide resolved
703b9e9
to
2309307
Compare
Find the example here: https://github.com/XuJiandong/molecule/blob/b23d1f92df8944d1286dd6f7c93a218edf66d98e/bindings/rust/src/lazy_reader.rs#L485 With cursor, we can construct data structures directly or via Into trait. |
} | ||
|
||
impl LazyReaderGenerator for ast::Array { | ||
fn gen_rust<W: io::Write>(&self, output: &mut W) -> io::Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For array of bytes, please add an additional function so one can read the full data directly into a memory buffer, see this for an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way of solving this problem, is that we can add Iterator support for fixvec as well.
tools/codegen/src/generator/languages/rust_lazy_reader/generator.rs
Outdated
Show resolved
Hide resolved
tools/codegen/src/generator/languages/rust_lazy_reader/generator.rs
Outdated
Show resolved
Hide resolved
`out of bound` will occur when `reader` try to read the data beyond that. | ||
reader: interface to read underlying data | ||
*/ | ||
pub fn new(total_size: usize, reader: Box<dyn Read>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's think about this API: it suits the case when you are initializing a Cursor from a Vec<u8>
structure, but what if you are initializing in a smart contract, from a syscall?
We would need to first build a reader implementing Read trait by reading syscalls, however, to initialize a Cursor, you would also need total_size
, which is doable by... doing a syscall.
Thinking it another way, a DataSource can well be initialized by a single syscall, which loads enough data for the cache field, and also gets the total size of data in IO.
By requiring total_size
here, we first need a syscall to get the total size, then DataSource requires another syscall to fill in the cache part. We are wasting one syscall here.
This is worth thinking, when refactoring the code related to DataSource.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The total_size
is required for safety/security reason: we can always check its bounds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure on this: why total_size
enhances security?
Ok(()) | ||
} | ||
|
||
pub fn validate(&self) -> Result<(), Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a good idea to revisit here, and distinguish functions that are used internally, and externally used functions. It really seems to me that this validate function, is used internally in the Cursor? In that sense there is not need to put that pub
here.
And there might also be other methods in this structure that are only required internally, so we might want to think about those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometime we need to adjust cursor manually, see c code: https://github.com/XuJiandong/omnilock/blob/90ad8ba6744b9c5d261c713b6b71c30740d54a98/c/cobuild.c#L269-L271
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like the comment below, I don't really see any operation that cannot be covered by a slicing operation. We can provide a single slicing operation for cutting a cursor into any desirable range. Then the only place we need to do validation, is within the slicing operation.
))) | ||
} else { | ||
let offset = calculate_offset(item_size, item_index, NUMBER_SIZE)?; | ||
cur2.add_offset(offset)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a lot of patterns like this, where we clone a cursor, add something to its offset, then manipulate the size by directly changing it, or via sub_size. Are we really doing slicing here? Can we fix those code by a single slicing operation(and also removing both add_offset
and sub_size
)? That will simplify the code a lot and reduce the attack vector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometime we need to adjust cursor manually, see c code: https://github.com/XuJiandong/omnilock/blob/90ad8ba6744b9c5d261c713b6b71c30740d54a98/c/cobuild.c#L269-L271
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me a slicing operation is all you need here. Depending on the exact API of slicing operation, we can do one of the following 2:
mol2_slice_by_length(cursor, offset, cursor->length() - offset);
Or
mol2_slice_by_end(cursor, offset, cursor->end());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you check "calculate_offset" carefully: they shows in code 4 times and shares nothing common.
We can't get any benefit doing so. note: there are slice_by_offset
, slice_by_index
already
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agre with this assessment. We are mixing 2 independent things together:
- How do I safely get a slice of the data from a cursor, given an offset and either a length, or and end value?
- Molecule uses indices to maintain molecule offsets, how do I calculate the offsets of a certain field?
The fact that we have slice_by_offset
, slice_by_index
and other functions all working on a single cursor type feels extremely messy to me. If I were doing this, I would design 2 separate components:
- A cursor only has a slicing operation that takes an offset and a length, internally, it checks if the offset and length are vaild, if not, an error would be thrown. If they are indeed valid, a new cursor will be returned.
- An offset calculation function that only calculates the offset for dynvec or a table. If you really want it, you can have
slice_by_index
function that first calculates the offset, then calls the slicing operation above to build the final cursor. I personally find it messy and possible dangerous to have adynvec_slice_by_index
function, that directly manipulates a cursor, this is not a proper abstraction. When the cursor logic requires altering, we will need to touch a ton of functions here, ensuring they all obey cursor's validation rules. This is a nightmare to maintain.
} | ||
|
||
pub fn slice_by_offset(&self, offset: usize, size: usize) -> Result<Cursor, Error> { | ||
let mut cur2 = self.clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code reads better if we simply calculate the new offset, then create a new cursor using new offset, size, and data source.
tools/codegen/src/generator/languages/rust_lazy_reader/generator.rs
Outdated
Show resolved
Hide resolved
@quake @xxuejie @eval-exec @XuJiandong Hi, folks, I don't use this feature, so I don't have opinions on this PR, so would you reach agreement on this PR. |
I will address the issues raised by @xxuejie when I have the time. Currently, I'm busy with other matters. |
port Rust version from original repo: https://github.com/XuJiandong/moleculec-c2
Including fuzzing tests
add corpus
Use quote instead of printing code in generator
* Support iterator * Change read_at as method * sub_size returns Result * Check alignment on get_item_count * Snake names * Some typos
* Add verify methods * Add iterator test cases * Make lazy_reader.rs panic free * Add impl_cursor_primitive * Format error strings
Add test cases
14c0a88
to
eaae0e2
Compare
* support custom unicode id * Array get item return Vec<u8> * union uses Enum
eaae0e2
to
7ac63f2
Compare
In the latest commit, We have improved following:
|
ecc64af
to
7ac63f2
Compare
Fix typo
* Refactor errors Remove strings because it takes more space in contracts. * Check read_len == 0 on lazy_reader.rs: read_at * Add test case for zero byte union * Add clone * fix length on read * add slice_by_start --------- Co-authored-by: joii2020 <[email protected]>
In the current implementation, the molecule requires that all data be loaded into memory before deserialization. This is a significant limitation in smart contracts as ckb-vm only has 4M memory. This PR aims to resolve this issue by implementing a lazy reader.
If we examine the Molecule Spec, we can retrieve specific data by navigating through "hops". By reading only the header, we can estimate where to navigate and avoid reading the rest of the data. In many scenarios where only certain parts of the data are required, a lazy reader mechanism can be utilized.
Reading data from a data source via syscall is costly. It would be advantageous to read more data for future use with each syscall. Currently, there is support for caching every reading. For further information, refer to read_at (in Rust).
Main changes:
Finally, it is ported from original repo: https://github.com/XuJiandong/moleculec-c2