firehose2 #69
Comments
Nice, this is great. I have some initial reactions (personal reactions, not necessarily representative of the bsky team).
I am a fan of the approach. One upside that I appreciate is that it's quite tidy to be able to say producers and verifying consumers all just need to know how to write to a repo.
I'm into the idea of supporting compression. The main downside of doing something at the app level is that it could be quite heavy on the protocol to specify. I expect it would have to include some form of negotiation, even if only for adaptability and future-proofing. Even if we come up with something optimal other than permessage-deflate, it would be useful to fit it into the WebSocket compression extension framework (RFC 7692). The sync protocol wouldn't strictly depend on a bespoke compression strategy: it could just say "you're welcome to use WebSocket compression extensions, we recommend this one". We'd inherit a system for negotiating the compression extension and its parameters, which covers adaptability and future-proofing. The extension framework exists and I'm pretty sure it suits our problem. I don't think it's overly prescriptive in a way that would make the compression less effective, but I would be interested to learn if it is.
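To make the negotiation point concrete, here is a minimal sketch of the two pieces RFC 7692 provides: an extension offer carried in the `Sec-WebSocket-Extensions` header, and per-message raw DEFLATE with the trailing `00 00 ff ff` removed. It uses only Python's standard `zlib`; the helper names (`parse_extension_offer`, `compress_message`, and so on) are made up for illustration and are not part of any firehose implementation.

```python
import zlib

# RFC 7692 strips these four bytes from every compressed message and the
# receiver appends them again before inflating.
DEFLATE_TAIL = b"\x00\x00\xff\xff"


def parse_extension_offer(header: str) -> dict:
    """Parse a Sec-WebSocket-Extensions value into {extension: {param: value}}."""
    offers = {}
    for offer in header.split(","):
        parts = [p.strip() for p in offer.split(";") if p.strip()]
        if not parts:
            continue
        params = {}
        for param in parts[1:]:
            key, sep, value = param.partition("=")
            # Parameters without a value (e.g. client_max_window_bits) map to True.
            params[key.strip()] = value.strip().strip('"') if sep else True
        offers[parts[0]] = params
    return offers


def new_compressor():
    # wbits=-15 selects raw DEFLATE with a 32 KiB window (no zlib header).
    return zlib.compressobj(wbits=-15)


def new_decompressor():
    return zlib.decompressobj(wbits=-15)


def compress_message(compressor, payload: bytes) -> bytes:
    """Compress one message, keeping the compressor's context for the next one."""
    data = compressor.compress(payload) + compressor.flush(zlib.Z_SYNC_FLUSH)
    if not data.endswith(DEFLATE_TAIL):
        raise ValueError("expected a sync-flush marker at the end of the message")
    return data[: -len(DEFLATE_TAIL)]


def decompress_message(decompressor, data: bytes) -> bytes:
    """Inverse of compress_message, using a decompressor with matching context."""
    return decompressor.decompress(data + DEFLATE_TAIL)
```

Reusing one compressor across messages corresponds to "context takeover" in the RFC: later messages can refer back to earlier ones, which matters for a stream where consecutive events share a lot of structure. A bespoke scheme would slot in as a different extension name with its own parameters, negotiated the same way.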
I'm fairly into this too. I believe you generally need to compute all the CIDs from block contents anyway as part of the verification process, in which case it's not all that useful to transport the CIDs themselves. This would of course mean departing from transporting CARs altogether, which I imagine could be on the table if we're talking about a proper v2. At the same time I imagine it's possible to go "too far" and come up with a weird structure by optimizing hard for compressibility, so would be interested to see what it looks like.
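As a rough illustration of why the CIDs are redundant on the wire, the sketch below recomputes a CID from block bytes alone. It assumes the CID form atproto repos use today (CIDv1, `dag-cbor` codec, SHA-256, base32 string encoding); the function names are hypothetical.

```python
import base64
import hashlib

# Assumed CID parameters (each fits in a single varint byte).
CID_VERSION = 0x01      # CIDv1
CODEC_DAG_CBOR = 0x71   # multicodec code for dag-cbor
HASH_SHA2_256 = 0x12    # multihash code for sha2-256
DIGEST_LEN = 0x20       # 32-byte digest


def cid_for_block(block: bytes) -> str:
    """Derive the base32 CIDv1 string for a dag-cbor block from its bytes."""
    digest = hashlib.sha256(block).digest()
    raw = bytes([CID_VERSION, CODEC_DAG_CBOR, HASH_SHA2_256, DIGEST_LEN]) + digest
    # Multibase prefix "b" means lowercase base32 without padding.
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


def index_blocks(blocks: list) -> dict:
    """Rebuild the CID -> block map a CAR would have carried explicitly."""
    return {cid_for_block(block): block for block in blocks}
```

A verifying consumer has to hash every block anyway, so a format that ships only block bytes lets it rebuild the same index with `index_blocks` at no extra hashing cost.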
I'm interested to hear a little more about what
Also, in the case of the full repo sync, you can elide all the data blocks except the root, as you point out. I don't have a precise description of it, but I believe there is a generalization of that which applies to syncing repo slices. In the case of syncing a collection, I believe you can elide all the data blocks except those on the "boundary" of the collection up to the root. I haven't worked out whether you can still usually avoid transmitting data blocks if you are always writing to the right side of a collection, as we typically do since TIDs are monotonic, or whether you may need to transmit some blocks when writing to a collection adjacent to the one you want to sync. I think it could be worth mapping this all out, though, to see if there's something we can exploit there.
I haven't done any more work on this since I opened the issue, but I just had an idea which I'll write down before I forget. IIUC, the current (relatively low) MST fanout factor of 4 was chosen to reduce Merkle proof sizes. If the Merkle proofs are no longer being transmitted over the firehose explicitly, their size matters a bit less, and there might be some perf gains (in terms of disk I/O) in increasing the fanout. This would be a very breaking change, though.
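For reference, a small sketch of where the fanout comes from. As I understand the current repo spec, a key's MST depth is the number of leading zero bits of SHA-256(key) divided by two, which gives an expected fanout of 4. The `bits_per_level` parameter below is a hypothetical knob to show what a larger fanout would mean; it does not exist in the spec.

```python
import hashlib


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits before the first set bit in a digest."""
    count = 0
    for byte in digest:
        if byte == 0:
            count += 8
            continue
        count += 8 - byte.bit_length()
        break
    return count


def key_depth(key: str, bits_per_level: int = 2) -> int:
    """MST depth of a key; bits_per_level=2 is assumed to match today's MST."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return leading_zero_bits(digest) // bits_per_level


def expected_fanout(bits_per_level: int) -> int:
    """Average number of entries per node for a given bits-per-level choice."""
    return 2 ** bits_per_level
```

Raising `bits_per_level` from 2 to 4 would move the expected fanout from 4 to 16, so the tree gets shallower with fewer, larger nodes. Since depth feeds into node layout and therefore into every node CID, changing it would alter the root hash of every existing repo, which is why this would be so breaking.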
I'm working on a draft for how "firehose" bandwidth could be significantly reduced, without making any sacrifices in terms of authentication of data.
The gist of the changes are:
It's at a very early stage right now (I haven't yet written code to benchmark my changes), but in the interests of developing in the open and getting early feedback, I'm logging my progress here: https://github.com/DavidBuchanan314/firehose2/