
[v2][adjuster] Implement adjuster for deduplicating spans #6391

Merged

18 commits merged into jaegertracing:main from mahadzaryab1:hash on Dec 22, 2024

Conversation

mahadzaryab1
Collaborator

Which problem is this PR solving?

  • Towards #6344
Description of the changes

  • Implemented an adjuster to deduplicate spans.
  • The span deduplication is done by marshalling each span into protobuf bytes and applying the FNV hash algorithm to it.
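The approach described above — protobuf-marshal each span, then apply FNV to the bytes — can be sketched with the hashing half in plain Go. The protobuf serialization step is stubbed out with raw bytes here, so `hashBytes` and its inputs are illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashBytes mirrors the second half of the approach: run FNV-1a over
// the span's serialized bytes. In the real adjuster the input comes
// from protobuf-marshalling a ptrace.Span; here it is stubbed with
// raw bytes, so this is only a sketch.
func hashBytes(serialized []byte) uint64 {
	h := fnv.New64a()
	h.Write(serialized) // hash.Hash.Write never returns an error
	return h.Sum64()
}

func main() {
	a := hashBytes([]byte("span-A"))
	b := hashBytes([]byte("span-A"))
	c := hashBytes([]byte("span-B"))
	fmt.Println(a == b) // identical bytes always hash identically
	fmt.Println(a == c) // the differing byte changes the hash
}
```

Spans whose serialized bytes are identical therefore map to the same `uint64` key, which is what makes the hash usable as a dedup map key.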

How was this change tested?

  • Added unit tests

Checklist

Signed-off-by: Mahad Zaryab <[email protected]>
Comment on lines 48 to 51
if err != nil {
// TODO: what should we do here?
continue
}
Collaborator Author

@yurishkuro how should we handle the case where the hash code cannot be computed? This would happen if there was an error in protobuf serialization or if the hashing function returned an error. It's probably very unlikely this ever happens. Is skipping over the span sufficient? Do we want to add a warning?

Member

yeah, I think skipping the span is fine in this case. We could also add a warning with the error to that span.
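One possible shape of the handling discussed here — keep the unhashable span with a warning attached (dropping it is the alternative) — sketched with toy types; `span`, `computeHash`, and `dedupe` stand in for ptrace.Span and the adjuster's real functions, and all names are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// span and computeHash are toy stand-ins for ptrace.Span and the
// adjuster's hash function.
type span struct {
	name     string
	warnings []string
}

func computeHash(s *span) (uint64, error) {
	if s.name == "" {
		return 0, errors.New("cannot compute hash code")
	}
	return uint64(len(s.name)), nil // placeholder, not FNV
}

// dedupe keeps the first span per hash; spans whose hash cannot be
// computed skip dedup and are kept with a warning, per the discussion.
func dedupe(spans []*span) []*span {
	seen := make(map[uint64]bool)
	var kept []*span
	for _, s := range spans {
		h, err := computeHash(s)
		if err != nil {
			s.warnings = append(s.warnings, err.Error())
			kept = append(kept, s)
			continue
		}
		if !seen[h] {
			seen[h] = true
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	out := dedupe([]*span{{name: "op"}, {name: "op"}, {name: ""}})
	fmt.Println(len(out)) // duplicate "op" dropped, unhashable span kept
}
```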

Signed-off-by: Mahad Zaryab <[email protected]>

codecov bot commented Dec 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.21%. Comparing base (9fc9d75) to head (48c4021).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6391   +/-   ##
=======================================
  Coverage   96.20%   96.21%           
=======================================
  Files         362      363    +1     
  Lines       20705    20748   +43     
=======================================
+ Hits        19919    19962   +43     
  Misses        601      601           
  Partials      185      185           
Flag Coverage Δ
badger_v1 9.04% <ø> (ø)
badger_v2 1.64% <ø> (ø)
cassandra-4.x-v1-manual 15.04% <ø> (ø)
cassandra-4.x-v2-auto 1.58% <ø> (ø)
cassandra-4.x-v2-manual 1.58% <ø> (ø)
cassandra-5.x-v1-manual 15.04% <ø> (ø)
cassandra-5.x-v2-auto 1.58% <ø> (ø)
cassandra-5.x-v2-manual 1.58% <ø> (ø)
elasticsearch-6.x-v1 18.75% <ø> (ø)
elasticsearch-7.x-v1 18.84% <ø> (ø)
elasticsearch-8.x-v1 19.00% <ø> (ø)
elasticsearch-8.x-v2 1.64% <ø> (ø)
grpc_v1 10.71% <ø> (-0.01%) ⬇️
grpc_v2 7.98% <ø> (ø)
kafka-v1 9.40% <ø> (ø)
kafka-v2 1.64% <ø> (ø)
memory_v2 1.63% <ø> (ø)
opensearch-1.x-v1 18.88% <ø> (-0.01%) ⬇️
opensearch-2.x-v1 18.88% <ø> (-0.01%) ⬇️
opensearch-2.x-v2 1.63% <ø> (ø)
tailsampling-processor 0.47% <ø> (ø)
unittests 95.05% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown.


Signed-off-by: Mahad Zaryab <[email protected]>
scopeSpans := rs.ScopeSpans()
for j := 0; j < scopeSpans.Len(); j++ {
ss := scopeSpans.At(j)
spansByHash := make(map[uint64]ptrace.Span)
Member

this needs to be defined at the top level in the function, so that deduping is global. And the hashing must account for resource and scope attributes.
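The reviewer's two requirements — one dedup map shared across the whole trace, and resource/scope attributes folded into the hash input — can be sketched with plain maps standing in for pdata attribute collections (all names here are illustrative, not Jaeger's actual code):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// spanKey folds resource and scope attributes into the hash input,
// so identical spans under different resources never collide in a
// trace-wide dedup map. Plain string maps stand in for pcommon.Map.
func spanKey(resourceAttrs, scopeAttrs map[string]string, spanName string) uint64 {
	h := fnv.New64a()
	writeSorted := func(attrs map[string]string) {
		keys := make([]string, 0, len(attrs))
		for k := range attrs {
			keys = append(keys, k)
		}
		sort.Strings(keys) // normalize order for a stable hash
		for _, k := range keys {
			fmt.Fprintf(h, "%s=%s;", k, attrs[k])
		}
	}
	writeSorted(resourceAttrs)
	writeSorted(scopeAttrs)
	fmt.Fprintf(h, "span=%s", spanName)
	return h.Sum64()
}

func main() {
	resA := map[string]string{"service.name": "svc-a"}
	resB := map[string]string{"service.name": "svc-b"}
	scope := map[string]string{"scope": "lib"}
	// same span name under different resources must get distinct keys
	fmt.Println(spanKey(resA, scope, "op") != spanKey(resB, scope, "op"))
}
```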

Signed-off-by: Mahad Zaryab <[email protected]>
// the FNV hashing algorithm to the serialized data.
//
// To ensure consistent hash codes, this adjuster should be executed after
// SortAttributesAndEvents, which normalizes the order of collections within the span.
Member

A couple of thoughts on this:

  1. Some storage backends (Cassandra, in particular) perform similar deduping by computing a hash before the span is saved and using it as part of the partition key (this creates tombstones if an identical span is saved two or more times, but no dups on read). So we could make this hashing a part of the ingestion pipeline (e.g. in sanitizers) and simply store the hash as an attribute on the span. Then this adjuster would be "lazy": it would only recompute the hash if it doesn't already exist in storage.

  2. If we do this on the write path, we would want it to be as efficient as possible, so we would need to implement manual hashing by iterating through the attributes (pre-sorting them to avoid order dependencies) and by manually going through all fields of the Span / SpanEvent / SpanLink. The reason I was reluctant to do that in the past was to avoid unintended bugs if the data model changed, like a new field being added that we'd forget to include in the hash function. To protect against that we could probably use some fuzzing tests, setting/unsetting each field individually and making sure the hash code changes as a result.

We don't have to do it now, but let's open a ticket for future improvement (I think it could be a good-first-issue)
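The safeguard suggested in point 2 — mutate each field individually and check that the hash changes — can be sketched with a toy span model (`toySpan` and `hashToy` are illustrative stand-ins, not Jaeger's actual types; the real check would walk every field of Span / SpanEvent / SpanLink):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// toySpan models the idea only; a real test would cover the full model.
type toySpan struct {
	Name string
	Kind int
}

// hashToy hashes each field manually, the style discussed in point 2.
func hashToy(s toySpan) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s|%d", s.Name, s.Kind)
	return h.Sum64()
}

func main() {
	base := toySpan{Name: "op", Kind: 1}
	// mutate one field at a time; every mutation must change the hash,
	// which is exactly what a per-field fuzz test would assert
	mutants := []toySpan{
		{Name: "oq", Kind: 1},
		{Name: "op", Kind: 2},
	}
	for _, m := range mutants {
		fmt.Println(hashToy(m) != hashToy(base))
	}
}
```

A test like this would catch a newly added field that the manual hash function forgot to cover, because the corresponding mutation would no longer change the hash.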

Signed-off-by: Mahad Zaryab <[email protected]>
cmd/query/app/querysvc/adjuster/hash.go (outdated; resolved)
Comment on lines 71 to 75
traces := ptrace.NewTraces()
rs := traces.ResourceSpans().AppendEmpty()
resourceAttributes.CopyTo(rs.Resource().Attributes())
ss := rs.ScopeSpans().AppendEmpty()
scopeAttributes.CopyTo(ss.Scope().Attributes())
Member

I would rather do this outside of the loop for spans and only replace the span before hashing

@mahadzaryab1 mahadzaryab1 changed the title [WIP][v2][adjuster] Implement adjuster for deduplicating spans [v2][adjuster] Implement adjuster for deduplicating spans Dec 22, 2024
@mahadzaryab1 mahadzaryab1 marked this pull request as ready for review December 22, 2024 00:55
@mahadzaryab1 mahadzaryab1 requested a review from a team as a code owner December 22, 2024 00:55
@dosubot dosubot bot added the v2 label Dec 22, 2024
return 0, err
}
hasher := fnv.New64a()
hasher.Write(b) // never returns an error
Collaborator Author

Ignoring the error here because the Hash64 interface says that the writer never returns an error.

type Hash interface {
	// Write (via the embedded io.Writer interface) adds more data to the running hash.
	// It never returns an error.
	io.Writer
	// ... Sum, Reset, Size, and BlockSize elided
}
Member

This comment should go in the code as explanation

return 0, err
}
hasher := fnv.New64a()
hasher.Write(b) // never returns an error
Member

This comment should go in the code as explanation

Comment on lines 40 to 45
hashTrace := ptrace.NewTraces()
rs := resourceSpans.At(i)
hashResourceSpan := hashTrace.ResourceSpans().AppendEmpty()
rs.Resource().Attributes().CopyTo(hashResourceSpan.Resource().Attributes())
scopeSpans := rs.ScopeSpans()
hashScopeSpan := hashResourceSpan.ScopeSpans().AppendEmpty()
Member

hard to grok due to ordering and naming

Suggested change
hashTrace := ptrace.NewTraces()
rs := resourceSpans.At(i)
hashResourceSpan := hashTrace.ResourceSpans().AppendEmpty()
rs.Resource().Attributes().CopyTo(hashResourceSpan.Resource().Attributes())
scopeSpans := rs.ScopeSpans()
hashScopeSpan := hashResourceSpan.ScopeSpans().AppendEmpty()
rs := resourceSpans.At(i)
scopeSpans := rs.ScopeSpans()
hashTrace := ptrace.NewTraces()
hashResourceSpans := hashTrace.ResourceSpans().AppendEmpty()
hashScopeSpans := hashResourceSpans.ScopeSpans().AppendEmpty()
hashSpan := hashScopeSpans.Spans().AppendEmpty()
rs.Resource().Attributes().CopyTo(hashResourceSpans.Resource().Attributes())

Comment on lines 47 to 51
ss := scopeSpans.At(j)
ss.Scope().Attributes().CopyTo(hashScopeSpan.Scope().Attributes())
spans := ss.Spans()
newSpans := ptrace.NewSpanSlice()
hashSpan := hashScopeSpan.Spans().AppendEmpty()
Member

Suggested change
ss := scopeSpans.At(j)
ss.Scope().Attributes().CopyTo(hashScopeSpan.Scope().Attributes())
spans := ss.Spans()
newSpans := ptrace.NewSpanSlice()
hashSpan := hashScopeSpan.Spans().AppendEmpty()
ss := scopeSpans.At(j)
spans := ss.Spans()
ss.Scope().Attributes().CopyTo(hashScopeSpan.Scope().Attributes())
dedupedSpans := ptrace.NewSpanSlice()


func (s *SpanHashDeduper) Adjust(traces ptrace.Traces) {
spansByHash := make(map[uint64]ptrace.Span)
resourceSpans := traces.ResourceSpans()
Member

I'd recommend going forward to use terms resources and scopes. Makes the code more readable

Collaborator Author

sounds good - I can open a cleanup PR

@mahadzaryab1 mahadzaryab1 merged commit 54ceda2 into jaegertracing:main Dec 22, 2024
54 checks passed
@mahadzaryab1 mahadzaryab1 deleted the hash branch December 22, 2024 14:53
ekefan pushed a commit to ekefan/jaeger that referenced this pull request Dec 23, 2024
[v2][adjuster] Implement adjuster for deduplicating spans (jaegertracing#6391)

## Which problem is this PR solving?
- Towards jaegertracing#6344

## Description of the changes
- Implemented an adjuster to deduplicate spans. 
- The span deduplication is done by marshalling each span into protobuf
bytes and applying the FNV hash algorithm to it.

## How was this change tested?
- Added unit tests

## Checklist
- [x] I have read
https://github.com/jaegertracing/jaeger/blob/master/CONTRIBUTING_GUIDELINES.md
- [x] I have signed all commits
- [x] I have added unit tests for the new functionality
- [x] I have run lint and test steps successfully
  - for `jaeger`: `make lint test`
  - for `jaeger-ui`: `npm run lint` and `npm run test`

---------

Signed-off-by: Mahad Zaryab <[email protected]>