firehose2 #69

Open
DavidBuchanan314 opened this issue Oct 20, 2024 · 2 comments

DavidBuchanan314 commented Oct 20, 2024

I'm working on a draft for how "firehose" bandwidth could be significantly reduced, without making any sacrifices in terms of authentication of data.

The gist of the changes is:

  • Not transmitting MST blocks (draft spec'd)
  • Improving the compressibility of relevant on-wire formats (TODO) (For example, a lot of CIDs reference data that is also present on the wire, and thus there's no need to transmit the CIDs themselves)
  • Adding a compression layer - likely based on zstd (TODO) (I will also investigate the likes of permessage-deflate, but I suspect doing compression at the application level will give more control and better overall results)
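To make the trade-off in the last bullet concrete, here is a minimal sketch comparing two ways of compressing a stream of small messages: each message on its own, versus one compressor whose state is shared across messages. It uses Python's standard-library `zlib` as a stand-in for zstd (an assumption, chosen only so the example runs without third-party packages), and the sample messages are invented for illustration, not real firehose frames.

```python
import zlib

# Invented sample frames: small, similar messages, loosely like firehose events.
messages = [
    f'{{"repo":"did:plc:example{i % 3}","collection":"app.bsky.feed.post","seq":{i}}}'.encode()
    for i in range(50)
]

def independent_sizes(msgs):
    """Compress each message on its own: no state is carried between messages."""
    return [len(zlib.compress(m)) for m in msgs]

def shared_context_frames(msgs):
    """Compress all messages through one compressor, flushing after each one
    so every message can still be decoded as soon as its frame arrives."""
    comp = zlib.compressobj()
    return [comp.compress(m) + comp.flush(zlib.Z_SYNC_FLUSH) for m in msgs]

def decode_shared(frames):
    """Decode frames in order with a single decompressor (order matters)."""
    decomp = zlib.decompressobj()
    return [decomp.decompress(f) for f in frames]

frames = shared_context_frames(messages)
total_independent = sum(independent_sizes(messages))
total_shared = sum(len(f) for f in frames)
print(total_independent, total_shared)
```

The shared-context variant benefits from repetition between messages, at the cost of requiring the consumer to process frames in order from the start of the stream. An application-level scheme would have to decide how a consumer that joins mid-stream obtains that context.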

It's at a very early stage right now (I haven't yet written code to benchmark my changes), but in the interests of developing in the open and getting early feedback, I'm logging my progress here: https://github.com/DavidBuchanan314/firehose2/

Contributor

devinivy commented Oct 22, 2024

Nice, this is great. I have some initial reactions (personal reactions, not necessarily representative of the bsky team).

> Not transmitting MST blocks (draft spec'd)

I am a fan of the approach. One upside that I appreciate is that it's quite tidy to be able to say producers and verifying consumers all just need to know how to write to a repo.

> Adding a compression layer - likely based on zstd (TODO) (I will also investigate the likes of permessage-deflate, but I suspect doing compression at the application level will give more control and better overall results)

I'm into the idea of supporting compression. The main downside of doing something at the app level is that it could just be quite heavy on the protocol to specify. I expect it would have to include some form of negotiation, even if only for the purpose of adaptability/future-proofing.

Even if we come up with something optimal outside of permessage-deflate, it would be useful to fit it into the WebSocket compression extension framework (RFC 7692). The sync protocol wouldn't strictly depend on a bespoke compression strategy: it could just say "you're welcome to use WebSocket compression extensions, we recommend this one". We'd inherit a system for negotiating the compression extension and its parameters (adaptability/future-proofing). The extension framework exists and I'm pretty sure it suits our problem. I don't think it's overly prescriptive in a way that would make the compression less effective, but I would be interested to learn if it is.
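As a rough illustration of the negotiation that the extension framework provides, the sketch below parses a `Sec-WebSocket-Extensions` offer and picks the first extension the server supports. The parser is deliberately simplified (it does not handle quoted values containing `,` or `;`), and `permessage-zstd` with its `level` parameter is a hypothetical extension invented for this example; only `permessage-deflate` and `client_max_window_bits` come from RFC 7692.

```python
def parse_extension_offers(header_value):
    """Parse a Sec-WebSocket-Extensions value into (name, params) pairs.

    Simplified: quoted-string values containing ',' or ';' are not supported.
    """
    offers = []
    for offer in header_value.split(","):
        parts = [p.strip() for p in offer.split(";")]
        if not parts[0]:
            continue
        params = {}
        for p in parts[1:]:
            if not p:
                continue
            key, _, value = p.partition("=")
            # Parameters without a value (flags) are stored as None.
            params[key.strip()] = value.strip().strip('"') or None
        offers.append((parts[0], params))
    return offers

def choose_extension(offers, supported):
    """Return the first offered (name, params) the server supports, else None."""
    for name, params in offers:
        if name in supported:
            return name, params
    return None

# "permessage-zstd" and "level" are hypothetical, used only for illustration.
offers = parse_extension_offers(
    "permessage-zstd; level=3, permessage-deflate; client_max_window_bits"
)
chosen = choose_extension(offers, supported={"permessage-deflate"})
print(offers)
print(chosen)
```

A client that offers a newer extension first and an established one second degrades gracefully against servers that only know the latter, which is the adaptability argument made above.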

> Improving the compressibility of relevant on-wire formats

I'm fairly into this too. I believe you generally need to compute all the CIDs from block contents anyway as part of the verification process, in which case it's not all that useful to transport the CIDs themselves. This would of course mean departing from transporting CARs altogether, which I imagine could be on the table if we're talking about a proper v2. At the same time I imagine it's possible to go "too far" and come up with a weird structure by optimizing hard for compressibility, so would be interested to see what it looks like.
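A minimal sketch of why the CIDs are redundant on the wire, assuming every block is DAG-CBOR and is addressed by a CIDv1 with a SHA-256 multihash: the consumer can rebuild the CID-to-block index from the bare block bytes. The function names and the placeholder block contents are made up for this example.

```python
import hashlib

# Binary CIDv1 prefix for a DAG-CBOR block hashed with SHA-256:
# version 0x01, codec dag-cbor 0x71, multihash sha2-256 0x12, digest length 0x20.
CID_V1_DAG_CBOR_SHA256_PREFIX = bytes([0x01, 0x71, 0x12, 0x20])

def cid_bytes_for_block(block: bytes) -> bytes:
    """Derive the binary CID of a block from its contents alone."""
    return CID_V1_DAG_CBOR_SHA256_PREFIX + hashlib.sha256(block).digest()

def index_blocks(blocks):
    """Rebuild the CID -> block mapping on the consumer side."""
    return {cid_bytes_for_block(b): b for b in blocks}

# Placeholder bytes standing in for encoded blocks (not real records).
received_blocks = [b"block-one", b"block-two"]
index = index_blocks(received_blocks)
for cid, block in index.items():
    print(cid.hex(), len(block))
```

Since a verifying consumer hashes each block anyway, this costs nothing extra, and each elided CID saves 36 bytes of data that is essentially incompressible.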

opsCid

I'm interested to hear a little more about what opsCid is and isn't intended to help with. On today's firehose including full proofs, it would help consumers who only validate proofs against ops but don't keep repo structure locally—e.g. to protect from withheld/deletion ops. In this v2 it seems like it is superfluous in the case of full repo sync, but in the case of syncing a slice it helps verify the ops of an individual commit. If you're syncing a slice presumably you're trying not to witness every commit, though, and in those cases I suppose you don't know what you might be missing. I think it also means that you can't verify ops for firehose events that represent a coarse diff from multiple commits combined together, which is permitted today. I still feel like there could be some juice here, but could use help mapping it out.

Also—in the case of the full repo sync you can elide all the data blocks except the root, as you point out. I don't have a precise description of it, but I believe there is a generalization of that applying to sync of repo slices. In the case of syncing a collection, I believe you can elide all the data blocks except those on the "boundary" of the collection up to the root. I haven't worked out if you can still usually avoid transmitting data blocks if you are always writing to the right side of a collection, as we typically do since TIDs are monotonic. Or if you may need to transmit some blocks when writing to an adjacent collection to the one you want to sync. I think it could be worth mapping this all out though, see if there's something we can exploit there.

Author

DavidBuchanan314 commented Nov 15, 2024

I haven't done any more work on this since I opened the issue, but I just had an idea which I'll write down before I forget.

IIUC, the current (relatively low) MST fanout factor of 4 was chosen to reduce Merkle proof sizes. If the Merkle proofs are no longer being transmitted over the firehose explicitly, their size matters a bit less, and there might be some perf gains (in terms of disk I/O) from increasing the fanout. This would be a very breaking change, though.
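For a rough sense of scale, here is a sketch of how the fanout relates to tree height. It assumes a key's layer is derived from the number of leading zero bits of its SHA-256 hash, divided by a fixed number of bits per layer (2 bits for a fanout of 4), which is my understanding of the current MST design. The `bits_per_layer` parameter and the height estimate are illustrative, not part of any spec.

```python
import hashlib
import math

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a byte string."""
    count = 0
    for byte in digest:
        if byte == 0:
            count += 8
            continue
        count += 8 - byte.bit_length()
        break
    return count

def key_layer(key: bytes, bits_per_layer: int = 2) -> int:
    """Layer of a key: leading zero bits of SHA-256(key) // bits_per_layer.

    bits_per_layer=2 corresponds to a fanout of 4; 4 would give a fanout of 16.
    """
    return leading_zero_bits(hashlib.sha256(key).digest()) // bits_per_layer

def expected_height(num_keys: int, bits_per_layer: int) -> float:
    """Rough estimate of tree height: about log_fanout(num_keys) layers."""
    return math.log(num_keys, 2 ** bits_per_layer)

for bits in (2, 4):
    print(2 ** bits, round(expected_height(1_000_000, bits), 1))
```

Under this estimate, doubling the bits per layer halves the number of layers a lookup traverses, while each node holds more entries, which is the proof-size versus disk I/O trade-off described above.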
