Add support for Bao chunk groups #19
Conversation

See #17.

I don't have test vectors to compare against; @pcfreak30 or @redsolver, can you provide test vectors and/or test this implementation against them? (For 256 KiB chunk groups, pass group = 8.)
See:
Scripts:
Data:

I can test this branch soon as well.
Thanks, added those vectors and confirmed they are passing 👍🏻
I'm glad to see everything is working. I have looked at it and can test it fully once verifying slices are implemented. Thanks.
@lukechampine Is it possible to have an API that takes a chunk (e.g., 256 KiB of data), the chunk group size, the offset, the outboard proof, and the root hash, and statelessly verifies it? This means having complete control over any seeking or partial verification. I see what you added regarding the write streaming, but I really need to verify a single slice. Right now, I stream directly from an S3 bucket and wrap that in a reader. To give you an idea of what I currently do, here is some of the code:

Thanks.
It might not be as bad as you think to invert the control flow. For example, say you're reading a Bao encoding from the network and you want to send verified bytes as an HTTP response. That would look like:

```go
func HandleHTTP(w http.ResponseWriter, req *http.Request) {
    data, outboard, root := getBaoEncoding(req)
    ok, err := blake3.BaoDecode(w, data, outboard, 8, root)
    if !ok || err != nil {
        http.Error(/* ... */)
        return
    }
}
```

If you want to add logging to the reads, you can wrap the reader:

```go
type loggingReader struct {
    r   io.Reader
    log *zap.Logger
}

func (lr loggingReader) Read(p []byte) (int, error) {
    start := time.Now()
    n, err := lr.r.Read(p)
    lr.log.Debug("Read time", zap.Duration("", time.Since(start)))
    return n, err
}

// in HandleHTTP
ok, err := blake3.BaoDecode(w, loggingReader{data, log}, outboard, 8, root)
```

It's certainly possible to rewrite your Verifier to use this approach:

```go
// in NewVerifier
pipeReader, pipeWriter := io.Pipe()
go func() {
    ok, err := blake3.BaoDecode(v.buffer, pipeReader, outboard, 8, root)
    // handle ok and err, probably by sending them down a channel
}()

// in Verifier
for {
    n, err := io.ReadFull(v.r, buf)
    pipeWriter.Write(buf[:n])
    if err != nil {
        break
    }
}
```

(Needing a goroutine is pretty ugly here, admittedly. Maybe this is the coroutine approach you mentioned.)
I looked and thought I could hack BaoDecode to do the slice verification, but it would require either forking and adding to it, or forking and exposing internals. Any effort I make on that would be trial and error, since I'm not the expert on this algorithm. However, this is a requirement in the long term if any Go code wants to seek a file and verify it the way the canonical Rust version allows. I am not asking to change BaoDecode, but to add a separate stateless variant. And yes, what you are saying is similar: I would need a separate thread to let it process, and channels or something else to track it. I feel that's overcomplicating it, and an additional stateless version is best. The important logic in my code verifies a single chunk statelessly.

So, yes, I might make this work as-is (I am not 100% sure yet, as I'm not accepting an inbound HTTP request, but running a background cron that makes an S3 SDK request), but it would not be ideal, nor would it cover long-term possibilities with partial streaming/seeking. Thanks.
Just to be clear, is this your desired API?

```go
type BaoVerifier struct

func (BaoVerifier) Verify(chunk []byte) bool

func NewVerifier(outboard []byte, group int, root [32]byte) *BaoVerifier
```
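(For illustration, a hedged sketch of how a verifier like that might sit in a read loop; the 256 KiB chunk size and the assumption that Verify tracks the offset internally are mine, not part of the proposal:)

```go
// Hypothetical usage of the proposed API: verify each fixed-size chunk group
// as it is read from an arbitrary stream (group = 8, i.e. 256 KiB groups).
func copyVerified(dst io.Writer, src io.Reader, outboard []byte, root [32]byte) error {
    v := NewVerifier(outboard, 8, root)
    buf := make([]byte, 262144)
    for {
        n, err := io.ReadFull(src, buf)
        if n > 0 {
            if !v.Verify(buf[:n]) {
                return errors.New("chunk failed verification")
            }
            if _, werr := dst.Write(buf[:n]); werr != nil {
                return werr
            }
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            return nil
        }
        if err != nil {
            return err
        }
    }
}
```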
That is the API I use in my code (though as a transparent reader, not as a verifier object). It should ideally be a single function instead of a struct, but that depends on how it needs to be designed. Roughly, at a high level, I am asking for (argument order doesn't matter):
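(Something along these lines, as a purely hypothetical signature for illustration; the name and argument order are assumptions:)

```go
// Hypothetical: statelessly verify one chunk group at a given offset against
// the outboard proof and the root hash.
func VerifyChunk(chunk []byte, offset uint64, group int, outboard []byte, root [32]byte) bool
```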
The goal is that any chunk in the stream can be verified statelessly, knowing its data offset, the chunk data, the proof, the group size, and the root hash. That is basically what the Rust version does, e.g. in redsolver's code:

Thanks.
Ok, added support for slices. The equivalent of that Rust code is now:

```go
func verifyIntegrityInternal(chunk_bytes []byte, offset uint64, bao_outboard_bytes []byte, blake3_hash [32]byte) bool {
    var buf bytes.Buffer
    blake3.BaoExtractSlice(&buf, bytes.NewReader(chunk_bytes), bytes.NewReader(bao_outboard_bytes), 8, offset, 262144)
    _, ok := blake3.BaoVerifySlice(buf.Bytes(), 8, offset, 262144, blake3_hash)
    return ok
}
```

I'm not in love with this API, but it's workable. I think I implemented the slice format correctly, but pls test.
Thanks, I will test this tomorrow. One suggestion: I would change the signature so that this is an optional feature, rather than having to pass io.Discard or something.
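(To illustrate the general idea, a hypothetical wrapper, sketched around the BaoDecode call shown earlier with assumed parameter types, that treats a nil destination as "verify only":)

```go
// Hypothetical convenience wrapper: if the caller doesn't care about the
// decoded output, pass nil instead of io.Discard.
func baoVerifyOnly(dst io.Writer, data, outboard io.Reader, group int, root [32]byte) (bool, error) {
    if dst == nil {
        dst = io.Discard
    }
    return blake3.BaoDecode(dst, data, outboard, group, root)
}
```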
I have done testing, and assuming I have not made an error, it seems to fail at the first chainingValue based on my IDE debugger. I also found that your tests only cover group 0 and no higher group. I tested abao's default group 4 and S5's group 8 (each requires a new Rust build to change the feature config). Here is a test script I created based on some of my portal code:
```go
package main

import (
    "bytes"
    "encoding/hex"
    "fmt"
    "io"
    "os"

    "lukechampine.com/blake3"
)

// 16 KiB chunk groups: 1 KiB Bao chunks x 2^4 chunks per group for group = 4.
const VERIFY_CHUNK_SIZE = 16384

func main() {
    file, err := os.Open("<path to file>")
    if err != nil {
        panic(err)
    }
    proof, err := os.ReadFile("output.obao")
    if err != nil {
        panic(err)
    }
    var root [32]byte
    hex.Decode(root[:], []byte("871208da7506cf458575b8d9b44652c66e53a74f94cdbcb4ee1910d6359808c1"))
    v := &Verifier{
        r:     file,
        proof: proof,
        root:  root,
    }
    if _, err := io.Copy(io.Discard, v); err != nil {
        panic(err)
    }
}

type Verifier struct {
    r      io.Reader
    proof  []byte
    root   [32]byte
    buf    bytes.Buffer
    offset uint64
}

func (v *Verifier) Read(p []byte) (int, error) {
    if v.buf.Len() == 0 {
        n, err := io.CopyN(&v.buf, v.r, VERIFY_CHUNK_SIZE)
        if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
            return 0, err
        } else if !verifyIntegrityInternal(v.buf.Bytes()[:n], v.offset, v.proof, v.root) {
            v.buf.Reset() // don't expose unverified data to future Read calls
            return 0, fmt.Errorf("integrity check failed at offset %d", v.offset)
        }
        v.offset += uint64(n)
    }
    return v.buf.Read(p)
}

func verifyIntegrityInternal(chunk []byte, offset uint64, outboard []byte, root [32]byte) bool {
    const group = 4
    var buf bytes.Buffer
    length := uint64(len(chunk))
    blake3.BaoExtractSlice(&buf, bytes.NewReader(chunk), bytes.NewReader(outboard), group, offset, length)
    _, ok := blake3.BaoVerifySlice(buf.Bytes(), group, offset, length, root)
    return ok
}
```

I should note, though, that extracting a new slice for every chunk is not very efficient. (In fact, it is "accidentally quadratic": each extraction has to process everything before the requested offset, so the total work grows with the square of the file size.)
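(Rough arithmetic to make that concrete, assuming each extraction re-reads everything before its offset: a 1 GiB file has 4096 chunk groups of 256 KiB, so verifying it one slice at a time touches on the order of 1 + 2 + ... + 4096, about 8.4 million group-reads, or roughly 2 TiB of data, versus a single 1 GiB linear pass.)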
I re-dumped the encoding (to double-check) and hashed the file again, and I seem to be getting the same root hash. As for the quadratic issue, what I am seeing as a possible solution, from what I do understand, is creating a large array of all the outboard slice parts (possibly a struct type), split up, which would then get the data injected on every verification, so you're not doing a scan each run but just an in-memory index lookup. Thanks.
The root hash will always match; increasing the size of the chunk groups does not change the root hash. I checked again. It's also worth confirming which version of the code your build is actually picking up.
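(One generic way to do that in Go, as a sketch rather than necessarily the method suggested here, is to read the module information embedded in the binary:)

```go
package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // Print the module versions this binary was actually built against,
    // including any replace directives pointing at a local checkout.
    info, ok := debug.ReadBuildInfo()
    if !ok {
        fmt.Println("no build info available")
        return
    }
    for _, dep := range info.Deps {
        if dep.Replace != nil {
            fmt.Println(dep.Path, "=>", dep.Replace.Path, dep.Replace.Version)
            continue
        }
        fmt.Println(dep.Path, dep.Version)
    }
}
```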
As for being accidentally quadratic, I suspect it won't be a big deal in practice, but as always you should benchmark it to make sure.
The quadratic issue seems to be extreme. I also seem to be unable to get a group 8 encoding to verify, though a group 4 encoding works fine. In a group 4 encoding:

So there are definitely some outstanding issues here, IMHO, based on some basic testing. I am using go run to test this.
Hmm. Looking into this. Verification works for group <= 4, but not above that. Strange. In the meantime, can you describe how verified streaming fits into your broader system? I'm wondering if there's a way to avoid the quadratic behavior.
Right now, any file above 100 MB is uploaded to S3. It is then hashed when downloaded from S3, and both the file and the proof are sent to Sia. This roughly follows what S5 does. It knows the valid hash ahead of time, as it's passed in HTTP headers via TUS and stored in the db, following what S5 has implemented. In the more immediate term, I have network imports, where a file is downloaded off the S5 network and sent to S3. This is effectively Q2 per my grant this year; I will also be sharing the Sia file metadata. Longer term, I see the slice verification (streaming) being usable in Go applications, though I don't have any immediate plans besides the portal system. Overall, the key thing regarding the approach I've taken is that I'm streaming on the fly from A to B. Thanks.
Turned out to be a simple fix. All group sizes should work now.

I agree that there should be an easy, efficient way to verify chunk n given the full outboard encoding. Even the Rust code you posted above, IIUC, is suboptimal, because it extracts a new Bao slice just to immediately verify it -- meaning it duplicates all of the chunk data in memory! That said, even the optimal version of this (which would read directly from the outboard to verify, instead of materializing a new slice encoding) ends up duplicating work compared to verifying multiple chunks at a time. In the meantime, something like this would work:

```go
var buf bytes.Buffer
blake3.BaoExtractSlice(&buf, bytes.NewReader(chunkData), bytes.NewReader(outboard), group, offset, length)
v := NewVerifier(r, buf.Bytes(), group, root)
io.Copy(dst, v)
```

That is, scope each verifier to a particular offset and length, and initialize it with an extracted slice for that range.
I assume that would basically go in my Read loop. You also mention verifying multiple chunks at a time; if you mean somehow batch-processing multiple offsets, that's technically possible for me to do as well, I think, but I may be misunderstanding 🤷.
ok, added a helper for this.
I ran some tests myself on a 1 GB file and a 5 GB file. The following data is AI-generated:

The computation is linear. Based on all this, it is a massive improvement and seems to rival the Rust version 🙃. TBD how it performs with HTTP streaming, but on disk I/O it seems fine.
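(For collecting more datapoints, a rough sketch of a timing harness; the file list and the proofFor/rootFor helpers are placeholders, not the code used above:)

```go
// Hypothetical harness: time end-to-end verification of files of increasing
// size to check how the cost scales. Reuses the Verifier type from the
// earlier test script; proofFor and rootFor are assumed helpers.
func benchVerify(paths []string) {
    for _, path := range paths {
        f, err := os.Open(path)
        if err != nil {
            panic(err)
        }
        start := time.Now()
        n, err := io.Copy(io.Discard, &Verifier{r: f, proof: proofFor(path), root: rootFor(path)})
        fmt.Printf("%s: %d bytes in %s (err=%v)\n", path, n, time.Since(start), err)
        f.Close()
    }
}
```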
Nice! I definitely encourage collecting a few more datapoints to confirm the trend. I would expect it to grow linearithmically (n log n), so I'm curious what the actual time for a 100 GB file would be. Anyway, it seems like this is good to merge. However, I'm wary of polluting the blake3 package with all of these Bao functions.
I will provide feedback when I have some data collected on this.
I have gotten this implemented in the portal at LumeWeb/portal@8d98f13 and will be testing it on my dev node soon.
I've just started doing testing and debugging around the functions using the verification. The debug timer code I have in place is logging in zap that every 256 KiB chunk, streamed from S5 P2P up into S3, takes about 104 ms of processing; this CID, https://cid.one/#z6e5rKQLuohQGLqnRvkUrLVzcsgFhkyM2QxGfWcx5JHC6Z8jXqqYT, is 1073741824 bytes. I have yet to test anything larger, though I will likely end up testing a 1 TB file, as that will be something that will get some demand. This does not isolate all the reader code from the Bao verify code directly, so there could be inefficiencies on my side. Regardless, this is working, and the only thing left is to optimize it if needed in the future. Kudos 😄
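(Back-of-the-envelope arithmetic from the numbers above, not a measured figure: 1073741824 bytes divided by 262144 bytes per chunk is 4096 chunks; at roughly 104 ms per chunk that works out to about 426 s, on the order of seven minutes end to end, assuming the per-chunk processing dominates.)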
Merged! Note that all Bao-related code now lives in the bao subpackage.