feat: Add Rust lazy reader #77

XuJiandong · 2023-08-11T02:55:39Z

In the current implementation, Molecule requires all data to be loaded into memory before deserialization. This is a significant limitation in smart contracts, as ckb-vm only has 4 MB of memory. This PR aims to resolve this issue by implementing a lazy reader.

If we examine the Molecule Spec, we can retrieve specific data by navigating through "hops". By reading only the header, we can determine where to navigate next and avoid reading the rest of the data. In the many scenarios where only certain parts of the data are required, a lazy reader mechanism can be used.
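A minimal sketch of the "hops" idea, assuming the table layout from the Molecule spec (a 4-byte little-endian total size, followed by one 4-byte little-endian offset per field). The `ReadAt` trait, `field_range`, and `locate` are hypothetical names used only for illustration; they are not the API added by this PR.

```rust
/// Hypothetical lazy data source (an assumption of this sketch): copies bytes
/// starting at `offset` into `buf` and returns how many bytes were copied.
pub trait ReadAt {
    fn read_at(&self, buf: &mut [u8], offset: usize) -> usize;
}

/// In-memory stand-in; in a contract this would be backed by syscalls.
impl ReadAt for Vec<u8> {
    fn read_at(&self, buf: &mut [u8], offset: usize) -> usize {
        if offset >= self.len() {
            return 0;
        }
        let n = buf.len().min(self.len() - offset);
        buf[..n].copy_from_slice(&self[offset..offset + n]);
        n
    }
}

/// Reads one little-endian u32 header word at `offset`.
fn read_u32_le<R: ReadAt>(src: &R, offset: usize) -> Option<usize> {
    let mut word = [0u8; 4];
    if src.read_at(&mut word, offset) != 4 {
        return None;
    }
    Some(u32::from_le_bytes(word) as usize)
}

/// One hop: absolute byte range of field `index` of the table occupying
/// `base..limit`, found by reading at most four header words.
fn field_range<R: ReadAt>(
    src: &R,
    base: usize,
    limit: usize,
    index: usize,
) -> Option<(usize, usize)> {
    let total = read_u32_le(src, base)?;
    if total < 8 || base.checked_add(total)? != limit {
        return None; // empty table or inconsistent size
    }
    let first = read_u32_le(src, base + 4)?;
    if first < 8 || first % 4 != 0 || first > total {
        return None; // malformed header
    }
    let count = first / 4 - 1;
    if index >= count {
        return None;
    }
    let start = read_u32_le(src, base + 4 + 4 * index)?;
    let end = if index + 1 < count {
        read_u32_le(src, base + 8 + 4 * index)?
    } else {
        total
    };
    if start < first || start > end || end > total {
        return None;
    }
    Some((base + start, base + end))
}

/// Follows a path of field indices through nested tables, touching only
/// header words and never the field payloads.
pub fn locate<R: ReadAt>(src: &R, path: &[usize]) -> Option<(usize, usize)> {
    let mut range = (0, read_u32_le(src, 0)?);
    for &index in path {
        range = field_range(src, range.0, range.1, index)?;
    }
    Some(range)
}
```

Each hop costs a handful of 4-byte reads regardless of how large the skipped fields are, which is the property the description relies on.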

Reading data from a data source via syscall is costly, so it is advantageous to read extra data for future use with each syscall. Currently, caching is supported for every read. For further information, refer to read_at (in Rust).
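The read-ahead idea can be sketched roughly as follows. This is an illustration under assumptions, not the PR's read_at: `CountingSource` is a hypothetical in-memory stand-in for a syscall that counts its invocations, and `CachedReader` is a hypothetical wrapper that fetches a full cache window on each miss.

```rust
/// Hypothetical stand-in for a costly source such as a syscall; it counts
/// how many times it is invoked.
pub struct CountingSource {
    data: Vec<u8>,
    calls: usize,
}

impl CountingSource {
    pub fn new(data: Vec<u8>) -> Self {
        CountingSource { data, calls: 0 }
    }

    fn read(&mut self, buf: &mut [u8], offset: usize) -> usize {
        self.calls += 1;
        if offset >= self.data.len() {
            return 0;
        }
        let n = buf.len().min(self.data.len() - offset);
        buf[..n].copy_from_slice(&self.data[offset..offset + n]);
        n
    }
}

/// Serves small reads from a cache window that is refilled on a miss.
pub struct CachedReader {
    source: CountingSource,
    cache: Vec<u8>,
    cache_start: usize,
    cache_cap: usize,
}

impl CachedReader {
    pub fn new(source: CountingSource, cache_cap: usize) -> Self {
        CachedReader { source, cache: Vec::new(), cache_start: 0, cache_cap }
    }

    /// Number of times the underlying source has been hit so far.
    pub fn source_calls(&self) -> usize {
        self.source.calls
    }

    /// Copies bytes at `offset` into `buf`, returning how many were copied.
    pub fn read_at(&mut self, buf: &mut [u8], offset: usize) -> usize {
        // Requests larger than the cache bypass it entirely.
        if buf.len() > self.cache_cap {
            return self.source.read(buf, offset);
        }
        let cache_end = self.cache_start + self.cache.len();
        let hit = offset >= self.cache_start
            && offset
                .checked_add(buf.len())
                .map_or(false, |end| end <= cache_end);
        if !hit {
            // Miss: fetch a whole window so that nearby reads become hits.
            let mut fresh = vec![0u8; self.cache_cap];
            let n = self.source.read(&mut fresh, offset);
            fresh.truncate(n);
            self.cache = fresh;
            self.cache_start = offset;
        }
        let rel = offset - self.cache_start;
        let n = buf.len().min(self.cache.len() - rel);
        buf[..n].copy_from_slice(&self.cache[rel..rel + n]);
        n
    }
}
```

With a 16-byte window, two adjacent 4-byte reads cost a single call to the source; a read outside the window triggers one refill.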

Main changes:

Add a new plugin, named "rust-lazy-reader"
Add test cases
Add fuzzing test

Finally, this is ported from the original repo: https://github.com/XuJiandong/moleculec-c2

quake · 2023-08-22T23:22:48Z

In the existing Rust reader code, we usually use XxxReader::from_compatible_slice to deserialize the data and verify the Molecule format:

let xxx_reader = XxxReader::from_compatible_slice(raw_data.as_slice()).map_err(|_| ...)

But in the new lazy reader code, I can't find a similar method. Could you please give an example of how to do this?

quake · 2023-08-22T23:59:23Z

Could you implement the iterator for the lazy reader:

molecule/tools/codegen/src/generator/languages/rust/iterator.rs

Line 69 in 1b3bfe2

pub fn iter<'t>(&'t self) -> #reader_iterator<'t, 'r> {

It's useful for writing code like:

let tx = lazy_load_tx();
for input in tx.inputs().iter() {
   ...
}
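A rough sketch of what such an iterator could look like over a lazily read fixvec (a 4-byte little-endian item count followed by fixed-size items, per the Molecule spec). `ReadFn`, `LazyFixVec`, and `LazyFixVecIter` are hypothetical names for illustration and are not the types generated by this PR.

```rust
/// Hypothetical lazy source (an assumption of this sketch): returns exactly
/// `len` bytes starting at `offset`, or None when the range is out of bounds.
pub type ReadFn<'a> = &'a dyn Fn(usize, usize) -> Option<Vec<u8>>;

/// A fixvec whose items are fetched only when asked for.
pub struct LazyFixVec<'a> {
    read: ReadFn<'a>,
    item_size: usize,
    count: usize,
}

impl<'a> LazyFixVec<'a> {
    /// Reads only the 4-byte item count up front.
    pub fn new(read: ReadFn<'a>, item_size: usize) -> Option<Self> {
        let header = read(0, 4)?;
        if header.len() != 4 {
            return None;
        }
        let mut word = [0u8; 4];
        word.copy_from_slice(&header);
        let count = u32::from_le_bytes(word) as usize;
        Some(LazyFixVec { read, item_size, count })
    }

    pub fn len(&self) -> usize {
        self.count
    }

    /// Fetches a single item by index.
    pub fn get(&self, index: usize) -> Option<Vec<u8>> {
        if index >= self.count {
            return None;
        }
        let offset = index.checked_mul(self.item_size)?.checked_add(4)?;
        (self.read)(offset, self.item_size)
    }

    pub fn iter(&self) -> LazyFixVecIter<'_, 'a> {
        LazyFixVecIter { vec: self, next: 0 }
    }
}

pub struct LazyFixVecIter<'t, 'a> {
    vec: &'t LazyFixVec<'a>,
    next: usize,
}

impl<'t, 'a> Iterator for LazyFixVecIter<'t, 'a> {
    type Item = Vec<u8>;

    /// Yields items one read at a time; iteration ends at the item count or
    /// at the first failed read.
    fn next(&mut self) -> Option<Vec<u8>> {
        let item = self.vec.get(self.next)?;
        self.next += 1;
        Some(item)
    }
}
```

A `for item in v.iter() { ... }` loop then reads each item on demand. Note that this sketch ends iteration silently on a failed read; an iterator over `Result` items would surface the error instead.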

XuJiandong · 2023-08-23T01:12:52Z

> In the existing Rust reader code, we usually use XxxReader::from_compatible_slice to deserialize the data and verify the Molecule format:
> let xxx_reader = XxxReader::from_compatible_slice(raw_data.as_slice()).map_err(|_| ...)
> But in the new lazy reader code, I can't find a similar method. Could you please give an example of how to do this?

If the slice data is already loaded into memory, that is not the scenario the lazy reader is meant for.

quake · 2023-08-23T01:21:26Z

> If the slice data is already loaded into memory, that is not the scenario the lazy reader is meant for.

I mean the lazy reader lacks a way to verify the data format of the slice after lazy loading.

XuJiandong · 2023-08-23T01:42:24Z

>> If the slice data is already loaded into memory, that is not the scenario the lazy reader is meant for.

> I mean the lazy reader lacks a way to verify the data format of the slice after lazy loading.

We can verify the format like this:

molecule/tools/codegen/src/generator/languages/rust/reader/implementation.rs

Line 136 in 1b3bfe2

    
fn verify(slice: &[u8], compatible: bool) -> molecule::error::VerificationResult<()> {

XuJiandong · 2023-08-25T08:27:44Z

> In the existing Rust reader code, we usually use XxxReader::from_compatible_slice to deserialize the data and verify the Molecule format:
> let xxx_reader = XxxReader::from_compatible_slice(raw_data.as_slice()).map_err(|_| ...)
> But in the new lazy reader code, I can't find a similar method. Could you please give an example of how to do this?

Find the example here: https://github.com/XuJiandong/molecule/blob/b23d1f92df8944d1286dd6f7c93a218edf66d98e/bindings/rust/src/lazy_reader.rs#L485

With a cursor, we can construct data structures directly or via the Into trait.

xxuejie · 2023-12-28T01:54:03Z

tools/codegen/src/generator/languages/rust_lazy_reader/generator.rs

+}
+
+impl LazyReaderGenerator for ast::Array {
+    fn gen_rust<W: io::Write>(&self, output: &mut W) -> io::Result<()> {


For an array of bytes, please add an additional function so that one can read the full data directly into a memory buffer; see this for an example.

Another way of solving this problem is to add Iterator support for fixvec as well.
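To illustrate why a bulk read matters for byte arrays, here is a small sketch comparing a per-item read with a single read into a `[u8; N]` buffer. `ByteSource`, `read_array_per_item`, and `read_array_bulk` are hypothetical names; the source is an in-memory stand-in that counts reads in place of real syscalls.

```rust
use std::cell::Cell;

/// Hypothetical data source that counts how many reads it serves.
pub struct ByteSource {
    data: Vec<u8>,
    reads: Cell<usize>,
}

impl ByteSource {
    pub fn new(data: Vec<u8>) -> Self {
        ByteSource { data, reads: Cell::new(0) }
    }

    pub fn reads(&self) -> usize {
        self.reads.get()
    }

    /// Copies bytes at `offset` into `buf`, returning how many were copied.
    pub fn read_at(&self, buf: &mut [u8], offset: usize) -> usize {
        self.reads.set(self.reads.get() + 1);
        if offset >= self.data.len() {
            return 0;
        }
        let n = buf.len().min(self.data.len() - offset);
        buf[..n].copy_from_slice(&self.data[offset..offset + n]);
        n
    }
}

/// Item-by-item access: one read per byte of the array.
pub fn read_array_per_item<const N: usize>(src: &ByteSource, offset: usize) -> Option<[u8; N]> {
    let mut out = [0u8; N];
    for (i, slot) in out.iter_mut().enumerate() {
        let mut one = [0u8; 1];
        if src.read_at(&mut one, offset.checked_add(i)?) != 1 {
            return None;
        }
        *slot = one[0];
    }
    Some(out)
}

/// Bulk access: the whole array is filled by a single read.
pub fn read_array_bulk<const N: usize>(src: &ByteSource, offset: usize) -> Option<[u8; N]> {
    let mut out = [0u8; N];
    if src.read_at(&mut out, offset) == N {
        Some(out)
    } else {
        None
    }
}
```

For an N-byte array, the bulk variant issues one read where the per-item variant issues N.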

xxuejie · 2023-12-28T02:29:04Z

bindings/rust/src/lazy_reader.rs

+    `out of bound` will occur when `reader` try to read the data beyond that.
+    reader: interface to read underlying data
+     */
+    pub fn new(total_size: usize, reader: Box<dyn Read>) -> Self {


Let's think about this API: it suits the case where you are initializing a Cursor from a Vec<u8> structure, but what if you are initializing one in a smart contract, from a syscall?

We would first need to build a reader implementing the Read trait on top of syscalls. However, to initialize a Cursor, you would also need total_size, which is doable by... doing a syscall.

Thinking about it another way, a DataSource can well be initialized by a single syscall, which loads enough data for the cache field and also gets the total size of the data in IO.

By requiring total_size here, we first need a syscall to get the total size, and then DataSource requires another syscall to fill in the cache part. We are wasting one syscall here.

This is worth thinking about when refactoring the code related to DataSource.
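A minimal sketch of the single-call initialization described above, assuming the underlying load call reports the full data length along with the bytes it copied. The `Loader` trait, `VecLoader`, and this `DataSource` shape are hypothetical and do not reflect the PR's actual types.

```rust
/// Hypothetical loader (an assumption of this sketch): copies up to
/// `buf.len()` bytes starting at `offset` and returns
/// `(bytes_copied, total_len)`, where `total_len` is the full data length.
pub trait Loader {
    fn load(&mut self, buf: &mut [u8], offset: usize) -> (usize, usize);
}

/// In-memory stand-in that counts calls, playing the role of a syscall.
pub struct VecLoader {
    pub data: Vec<u8>,
    pub calls: usize,
}

impl Loader for VecLoader {
    fn load(&mut self, buf: &mut [u8], offset: usize) -> (usize, usize) {
        self.calls += 1;
        let start = offset.min(self.data.len());
        let n = buf.len().min(self.data.len() - start);
        buf[..n].copy_from_slice(&self.data[start..start + n]);
        (n, self.data.len())
    }
}

pub struct DataSource<L: Loader> {
    loader: L,
    total_size: usize,
    cache: Vec<u8>,
}

impl<L: Loader> DataSource<L> {
    /// One load both fills the cache and learns the total size, so the
    /// caller does not have to pass `total_size` in.
    pub fn new(mut loader: L, cache_cap: usize) -> Self {
        let mut cache = vec![0u8; cache_cap];
        let (copied, total_size) = loader.load(&mut cache, 0);
        cache.truncate(copied);
        DataSource { loader, total_size, cache }
    }

    pub fn total_size(&self) -> usize {
        self.total_size
    }

    pub fn cached_len(&self) -> usize {
        self.cache.len()
    }

    pub fn loader(&self) -> &L {
        &self.loader
    }
}
```

After construction, both the total size and the first cache window are available, and the loader has been called exactly once.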

> The total_size is required for safety/security reasons: we can always check its bounds.

I'm not sure about this: why does total_size enhance security?

xxuejie · 2023-12-28T02:34:01Z

bindings/rust/src/lazy_reader.rs

+        Ok(())
+    }
+
+    pub fn validate(&self) -> Result<(), Error> {


It might be a good idea to revisit this and distinguish functions that are used internally from those that are used externally. It really seems to me that this validate function is only used internally in the Cursor? In that case there is no need to make it pub here.

And there might also be other methods in this structure that are only required internally, so we might want to think about those.

> Sometimes we need to adjust the cursor manually, see C code: https://github.com/XuJiandong/omnilock/blob/90ad8ba6744b9c5d261c713b6b71c30740d54a98/c/cobuild.c#L269-L271

As in the comment below, I don't really see any operation that cannot be covered by a slicing operation. We can provide a single slicing operation for cutting a cursor into any desired range. Then the only place we need to do validation is within the slicing operation.

xxuejie · 2023-12-28T02:38:17Z

bindings/rust/src/lazy_reader.rs

+            )))
+        } else {
+            let offset = calculate_offset(item_size, item_index, NUMBER_SIZE)?;
+            cur2.add_offset(offset)?;


I see a lot of patterns like this, where we clone a cursor, add something to its offset, then manipulate the size by changing it directly or via sub_size. Are we really doing slicing here? Can we replace that code with a single slicing operation (and also remove both add_offset and sub_size)? That would simplify the code a lot and reduce the attack surface.

> Sometimes we need to adjust the cursor manually, see C code: https://github.com/XuJiandong/omnilock/blob/90ad8ba6744b9c5d261c713b6b71c30740d54a98/c/cobuild.c#L269-L271

To me a slicing operation is all you need here. Depending on the exact API of the slicing operation, we can do one of the following two:

mol2_slice_by_length(cursor, offset, cursor->length() - offset);

Or

mol2_slice_by_end(cursor, offset, cursor->end());

> If you check "calculate_offset" carefully: it appears in the code 4 times and the call sites share nothing in common.
> We can't get any benefit from doing so. Note: slice_by_offset and slice_by_index already exist.

I don't agree with this assessment. We are mixing 2 independent things together:

How do I safely get a slice of the data from a cursor, given an offset and either a length or an end value?

Molecule uses indices to maintain offsets. How do I calculate the offsets of a certain field?

The fact that we have slice_by_offset, slice_by_index and other functions all working on a single cursor type feels extremely messy to me. If I were doing this, I would design 2 separate components:

A cursor that only has a slicing operation taking an offset and a length. Internally, it checks whether the offset and length are valid; if not, an error is returned. If they are valid, a new cursor is returned.

An offset calculation function that only calculates the offset for a dynvec or a table. If you really want it, you can have a slice_by_index function that first calculates the offset, then calls the slicing operation above to build the final cursor. I personally find it messy and possibly dangerous to have a dynvec_slice_by_index function that directly manipulates a cursor; this is not a proper abstraction. When the cursor logic requires altering, we will need to touch a ton of functions here, ensuring they all obey the cursor's validation rules. This is a nightmare to maintain.
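The two-component split proposed above could be sketched as follows. This is only an illustration under assumptions: the `Cursor` here is backed by an in-memory `Rc<Vec<u8>>` rather than a lazy data source, and `Error`, `item_range`, and this `slice_by_index` are hypothetical names, not the PR's implementation.

```rust
use std::rc::Rc;

#[derive(Debug, PartialEq)]
pub enum Error {
    OutOfBound,
    Header,
}

/// Component 1: a cursor whose only way to change range is `slice`.
#[derive(Clone)]
pub struct Cursor {
    data: Rc<Vec<u8>>, // in-memory stand-in for a lazy data source
    offset: usize,
    size: usize,
}

impl Cursor {
    pub fn new(data: Vec<u8>) -> Self {
        let size = data.len();
        Cursor { data: Rc::new(data), offset: 0, size }
    }

    pub fn len(&self) -> usize {
        self.size
    }

    /// The single place where bounds are validated.
    pub fn slice(&self, offset: usize, len: usize) -> Result<Cursor, Error> {
        let end = offset.checked_add(len).ok_or(Error::OutOfBound)?;
        if end > self.size {
            return Err(Error::OutOfBound);
        }
        Ok(Cursor { data: Rc::clone(&self.data), offset: self.offset + offset, size: len })
    }

    pub fn bytes(&self) -> &[u8] {
        &self.data[self.offset..self.offset + self.size]
    }

    /// Reads a little-endian u32 through `slice`, so it is bounds-checked too.
    fn read_u32(&self, offset: usize) -> Result<usize, Error> {
        let word = self.slice(offset, 4)?;
        let mut buf = [0u8; 4];
        buf.copy_from_slice(word.bytes());
        Ok(u32::from_le_bytes(buf) as usize)
    }
}

/// Component 2: pure offset arithmetic for a dynvec or table. Returns the
/// (offset, length) of item `index` without touching cursor internals.
pub fn item_range(cur: &Cursor, index: usize) -> Result<(usize, usize), Error> {
    let total = cur.read_u32(0)?;
    if total != cur.len() {
        return Err(Error::Header);
    }
    let first = cur.read_u32(4)?;
    if first < 8 || first % 4 != 0 {
        return Err(Error::Header);
    }
    let count = first / 4 - 1;
    if index >= count {
        return Err(Error::OutOfBound);
    }
    let start = cur.read_u32(4 + 4 * index)?;
    let end = if index + 1 < count { cur.read_u32(8 + 4 * index)? } else { total };
    if start < first {
        return Err(Error::Header);
    }
    let len = end.checked_sub(start).ok_or(Error::Header)?;
    Ok((start, len))
}

/// Optional convenience: compute the range, then defer to `slice`.
pub fn slice_by_index(cur: &Cursor, index: usize) -> Result<Cursor, Error> {
    let (offset, len) = item_range(cur, index)?;
    cur.slice(offset, len)
}
```

With this shape, a change to the cursor's validation rules only touches `slice`; the offset arithmetic cannot produce an out-of-range cursor because it never constructs one itself.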

xxuejie · 2023-12-28T02:39:36Z

bindings/rust/src/lazy_reader.rs

+    }
+
+    pub fn slice_by_offset(&self, offset: usize, size: usize) -> Result<Cursor, Error> {
+        let mut cur2 = self.clone();


The code reads better if we simply calculate the new offset, then create a new cursor using the new offset, size, and data source.

yangby-cryptape · 2024-01-16T10:49:03Z

@quake @xxuejie @eval-exec @XuJiandong

Hi folks, I don't use this feature, so I don't have an opinion on this PR. Could you reach an agreement on it among yourselves?
We can merge it in the next version (0.9.x).

XuJiandong · 2024-01-16T11:22:16Z

> @quake @xxuejie @eval-exec @XuJiandong
>
> Hi folks, I don't use this feature, so I don't have an opinion on this PR. Could you reach an agreement on it among yourselves? We can merge it in the next version (0.9.x).

I will address the issues raised by @xxuejie when I have the time. Currently, I'm busy with other matters.

XuJiandong · 2024-01-31T08:41:15Z

In the latest commit, we have made the following improvements:

Support custom union id
Array is returned as [u8; N]
Use Enum as union id

* Refactor errors
  Remove strings because it takes more space in contracts.
* Check read_len == 0 on lazy_reader.rs: read_at
* Add test case for zero byte union
* Add clone
* fix length on read
* add slice_by_start

Co-authored-by: joii2020 <[email protected]>

XuJiandong force-pushed the lazy-reader branch from 2247179 to a58b9c0 Compare August 15, 2023 08:33

XuJiandong marked this pull request as ready for review August 17, 2023 03:00

quake reviewed Aug 22, 2023

quake requested review from yangby-cryptape and driftluo August 22, 2023 23:23

quake reviewed Aug 22, 2023

bindings/rust/src/lazy_reader.rs

quake reviewed Aug 22, 2023

bindings/rust/src/lazy_reader.rs

quake reviewed Aug 22, 2023

bindings/rust/src/lazy_reader.rs

quake reviewed Aug 23, 2023

examples/lazy-reader-tests/src/lib.rs

quake reviewed Aug 23, 2023

bindings/rust/src/lazy_reader.rs

quake reviewed Aug 23, 2023

tools/codegen/src/generator/languages/rust_lazy_reader/generator.rs

tools/codegen/src/generator/languages/rust_lazy_reader/generator.rs

XuJiandong force-pushed the lazy-reader branch from 703b9e9 to 2309307 Compare August 23, 2023 07:29

quake reviewed Aug 24, 2023

bindings/rust/src/lazy_reader.rs

bindings/rust/src/lazy_reader.rs

xxuejie reviewed Dec 28, 2023

eval-exec reviewed Dec 29, 2023

tools/codegen/src/generator/languages/rust_lazy_reader/generator.rs

XuJiandong and others added 7 commits January 31, 2024 15:58

feat: Add Rust lazy reader

597bd66

port Rust version from original repo: https://github.com/XuJiandong/moleculec-c2

Fix tempfile version

92b9e78

Add lazy-reader-tests

7f919d4

Add lazy reader tests

929d7fa

Including fuzzing tests

Update fuzzing test

53fea36

add corpus

Use build.rs to generate code

645ae10

Use quote instead of printing code in generator

Improve test cases and fuzz to improve code coverage (#1)

fcef512

XuJiandong and others added 3 commits January 31, 2024 16:01

Fixes to reviews

1415c18

* Support iterator
* Change read_at as method
* sub_size returns Result
* Check alignment on get_item_count
* Snake names
* Some typos

Update based on reviews

f51a30d

* Add verify methods
* Add iterator test cases
* Make lazy_reader.rs panic free
* Add impl_cursor_primitive
* Format error strings

Verify fields recursively

3406d30

Add test cases

XuJiandong force-pushed the lazy-reader branch from 14c0a88 to eaae0e2 Compare January 31, 2024 08:18

Support custom union id (#3)

7ac63f2

* support custom unicode id
* Array get item return Vec<u8>
* union uses Enum

XuJiandong force-pushed the lazy-reader branch from eaae0e2 to 7ac63f2 Compare January 31, 2024 08:32

xxuejie mentioned this pull request Feb 26, 2024

Implement Cobuild integration and Solana support cryptape/omnilock#3

Open

XuJiandong force-pushed the lazy-reader branch 2 times, most recently from ecc64af to 7ac63f2 Compare March 18, 2024 09:22

quake and others added 4 commits March 18, 2024 17:59

chore: refactor DataSource (#5)

4f04bf8

Add more imports (#8)

f72267f

Fix typo

examples/ci-tests make clean: remove some ignored files (#7)

032eecc

quake approved these changes Apr 3, 2024

joii2020 approved these changes Apr 3, 2024

XuJiandong requested a review from joii2020 April 3, 2024 02:46

joii2020 approved these changes Apr 3, 2024

yangby-cryptape enabled auto-merge April 3, 2024 02:50

yangby-cryptape approved these changes Apr 3, 2024

yangby-cryptape merged commit 9190849 into nervosnetwork:master Apr 3, 2024
6 checks passed
