-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[META] Leverage protobuf for serializing select node-to-node objects #15308
Comments
Having spent way too much time diving into the byte-level (or bit-level if you consider 7-bit Vints) details of these protocols, I want to make sure we're focusing on the correct things here. Protobuf isn't some magical thing that brings about 30% improvement the way the POC implemented it, and I don't think the existing POCs tested it fully vs. the existing capabilities of the transport protocol if used properly. In summary:
Protobuf's primary benefit is in backwards compatibility. There are probably additional significant benefits of gRPC but this meta issue doesn't seem to be about changing the entire protocol, so I don't think those should be considered as part of the benefits proposed. The performance impacts are very situation specific and shouldn't be assumed; it's possible a change in streaming to variable-length integers/longs (VInt/ZInt) can achieve similar gains. |
Hi @dbwiddis, thanks for the feedback!
I am noticing protobuf objects I implement write more bytes to stream than the native "writeTo" serialization. Small protobuf objects are about ~20% larger but the difference shrinks as request content grows. So far I've been focusing on FetchSearchResult and it's members. If i'm understanding correctly the best use case for protobuf (only considering performance) would be something like a As I learn more about protobuf I'm understanding any potential performance improvements just from making a change to serialization is gong to be speculative and I'll update this issue with benchmarks as I run them. Currently the concrete expected benefits would be out of the box backwards compatibility as well as providing a stepping stone for gRPC support. |
Initial numbers for protobuf vs native serializers. OSB Big 5 2.16 min distribution
FetchSearchResults as protobuf - Branch
Notes
|
Let's keep in mind that while node-to-node we may not see as much benefit, client-to-server switching from JSON to protobuf should. There's also a significant improvement in developer experience if we can replace the native implementation with protobuf. |
Some additional benchmarks for a vector search workload. 5 data nodes (r5.xlarge) per cluster. 10000 queries against sift-128-euclidean.hdf5 data set. FetchSearchResults as Protobuf
2.16 Native serialization
|
@finnegancarroll can we run vector search workload with client-to-server switching? Thanks. |
Experimenting with a previous POC with diff here: https://github.com/finnegancarroll/OpenSearch/pull/2/files Benchmarking locally with 5 nodes in separate docker containers with 20% big5 workload. POC diff:
|
Protobuf increases our payload size around 5% based on term queries run on the big5 dataset. On average the following query run on the big5 dataset produces a ~9.5mb response with a native serializer and ~10mb with protobuf.
Without trying to deconstruct the protobuf binary format one explanation could be protobuf losses some space efficiency by requiring each field in a protobuf object have its position marked with a "field number". In contrast the OpenSearch native node-to-node protocol simply expects to read fields in the exact pre-determined order in which they are written. |
Some micro benchmarks with flame graphs comparing protobuf serialization to native transport & client/server JSON serialization. Test branch can be found here: https://github.com/finnegancarroll/OpenSearch/tree/serde-micro-bench Note: Test data is generated with SearchHitsTests::createTestItem so the content is randomized and may not accurately reflect real use cases. While microbenchmarks are not particularly reliable this does seem to suggest protobuf is more performant in both cases. Much more so on client/server than transport layer. Serialization is likely not a large enough slice of the total latency to make an impact in OSB benchmarks.
|
Thanks a lot, @finnegancarroll , I am a bit surprised (since "in general" the specialized impl is expected to be a bit faster than generalized one). I am eager to run the |
End to end benchmarks with protobuf on transport layer as well as the REST response for a vector workload. Environment is identical to the previous tests here (with 2.17 as baseline): #15308 (comment)
The change in serialization does not appear to impact latencies in this case. Generated some flame graphs using OS 2.17 on a local 3 node cluster for this workload to estimate how much serialization impacts latency. Examining the coordinator node for the cluster:
On average REST response from the coordinator is a relatively small ~13kb. |
Hi @reta, i've cleaned up
I agree it is surprising that
Looking through the flame graphs of |
Thanks @finnegancarroll
Very true, I came to the same conclusions, I will try to briefly look what could be the issue there (without spending too much time). Thanks a lot for cleaning up the benchmarks! |
While investigating areas we might expect to benefit from protobuf I spent some time reproducing this previous POC: #10684 (comment) Below benchmarks were run locally with 5 nodes, Edit: Previous POC gives Protobuf Proof of Concept Branch
Native Branch
Inspecting flame graphs for the poc-search-protobuf I see nearly all cpu cycles are consumed by lucene processing the search and little time is spent on serialization. |
@finnegancarroll This is great seeing benchmarking and deep diving into the details here!
Having deconstructed the protobuf binary format, I'll confirm this observation. :) In short: protobuf uses a byte for each field that exists (type and index info) and 0 bytes for nonexistent fields. The binary transport protocol doesn't mark each field but assumes all fields are present in a specific order, and requires the use of a byte for a boolean "optional". So the takeaway from a "bandwidth" perspective is that if you tend to have a lot of optional information, you can save space with protobuf. But as your tests seem to show, this marginal difference in bandwidth seems to be less significant, and maybe not the thing to focus on.... |
Thanks @finnegancarroll , my apologies for the delay, just completed may part. I didn't spend too much time on native serialization but it looks like (with the benchmarks)
This is a very valid point @dbwiddis but we also should not forget the maintenance costs. Changes in native protocol are difficult now and incur a burden on everyone (update main, backport + change version, forwardport to main, ...). Protobuf takes care of that (as far as schema is properly evolved), I think that would bring more benefits than marginal latency improvements. |
Hi @reta, I spent some time exploring your above suggestion and I'm seeing benefits but only in some limited artificial cases for the moment.
( Stepping through with a debugger it looks like we do save on some calls to To quickly hack in a "best case" scenario dataset I modified the unit tests such that DocumentFields could only be 100 element long integer lists. With the optimized writing implemented like this. This gives pretty substantial speedup in microbenchmarks, especially considering the cpu now spends nearly all of it's time writing specifically DocumentFields.
|
Thanks a lot @finnegancarroll , it looks like we are at crossroads now: should we move forward with protobuf (since the latency wins are moderate) or (may be) move forward with optimizing the native protocol? |
Additional benchmarks comparing protobuf to native protocol with the same inflated DocumentFields size as used here.
|
Thanks @finnegancarroll , |
I think we should do both. Switching to an industry standard is a long term bet on improving developer experience and an entire community optimizing the binary protocol for us. |
Closing this issue to move forward with client facing gRPC server - #16556. |
Please describe the end goal of this project
Provide a framework and migrate some key node-to-node requests to serialize via protobuf rather than the current native stream writing (i.e.
writeTo(StreamOutput out)
). Starting with node-to-node transport messages maintains api compatibility while introducing protobuf implementations which have several benefits:Initial protobuf implementation for types - FetchSearchResult, SearchHits, SearchHit
Edit: Closing this issue as we see only marginal benefits to utilizing protobuf on the transport layer. The existing
writeTo()
implementation is as performant as protobuf, if not more so. Adding a second implementation for objects sent over the transport layer would create a second parallel object which needs to be maintained and would only provide the backwards compatibility benefit of protobuf for that specific transport request.Supporting References
Related component
Search:Performance
The text was updated successfully, but these errors were encountered: