[v2][adjuster] Implement adjuster for deduplicating spans #6391

Merged · 18 commits · Dec 22, 2024 · Changes from 10 commits
88 changes: 88 additions & 0 deletions cmd/query/app/querysvc/adjuster/hash.go
@@ -0,0 +1,88 @@
// Copyright (c) 2024 The Jaeger Authors.
// SPDX-License-Identifier: Apache-2.0

package adjuster

import (
	"hash/fnv"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

var _ Adjuster = (*SpanHashDeduper)(nil)

// SpanHash creates an adjuster that deduplicates spans by removing all but one span
// with the same hash code. This is particularly useful for scenarios where spans
// may be duplicated during archival, such as with Elasticsearch archival.
//
// The hash code is generated by serializing the span into protobuf bytes and applying
// the FNV hashing algorithm to the serialized data.
//
// To ensure consistent hash codes, this adjuster should be executed after
// SortAttributesAndEvents, which normalizes the order of collections within the span.
Member commented:
A couple of thoughts on this:

1. Some storage backends (Cassandra, in particular) perform similar deduping by computing a hash before the span is saved and using it as part of the partition key (this creates tombstones if an identical span is saved two or more times, but no duplicates on read). So we could make this hashing a part of the ingestion pipeline (e.g. in sanitizers) and simply store the hash as an attribute on the span. This adjuster would then be "lazy": it would only recompute the hash if it doesn't already exist in storage.

2. If we do this on the write path, we would want it to be as efficient as possible, so we would need to implement manual hashing by iterating through the attributes (pre-sorting them to avoid ordering dependencies) and manually going through all fields of the Span / SpanEvent / SpanLink. The reason I was reluctant to do that in the past was to avoid unintended bugs if the data model changed, e.g. a new field added that we'd forget to include in the hash function. To protect against that we could probably use fuzzing-style tests, setting/unsetting each field individually and making sure the hash code changes as a result.

We don't have to do it now, but let's open a ticket for a future improvement (I think it could be a good-first-issue).

func SpanHash() SpanHashDeduper {
	return SpanHashDeduper{
		marshaler: &ptrace.ProtoMarshaler{},
	}
}

type SpanHashDeduper struct {
	marshaler ptrace.Marshaler
}

func (s *SpanHashDeduper) Adjust(traces ptrace.Traces) {
	spansByHash := make(map[uint64]ptrace.Span)
	resourceSpans := traces.ResourceSpans()
Member commented:

I'd recommend going forward to use the terms "resources" and "scopes". It makes the code more readable.

Collaborator (author) replied:

Sounds good - I can open a cleanup PR.

	for i := 0; i < resourceSpans.Len(); i++ {
		rs := resourceSpans.At(i)
		scopeSpans := rs.ScopeSpans()
		for j := 0; j < scopeSpans.Len(); j++ {
			ss := scopeSpans.At(j)
			spans := ss.Spans()
			newSpans := ptrace.NewSpanSlice()
			for k := 0; k < spans.Len(); k++ {
				span := spans.At(k)
				h, err := s.computeHashCode(
					span,
					rs.Resource().Attributes(),
					ss.Scope().Attributes(),
				)
				if err != nil {
					// TODO: Add Warning
					continue
				}
				if _, ok := spansByHash[h]; !ok {
					spansByHash[h] = span
					span.CopyTo(newSpans.AppendEmpty())
				}
			}
			newSpans.CopyTo(spans)
		}
	}
}

func (s *SpanHashDeduper) computeHashCode(
	span ptrace.Span,
	resourceAttributes pcommon.Map,
	scopeAttributes pcommon.Map,
) (uint64, error) {
	traces := ptrace.NewTraces()
	rs := traces.ResourceSpans().AppendEmpty()
	resourceAttributes.CopyTo(rs.Resource().Attributes())
	ss := rs.ScopeSpans().AppendEmpty()
	scopeAttributes.CopyTo(ss.Scope().Attributes())
Member commented:

I would rather do this outside of the loop over spans and only replace the span before hashing.

	newSpan := ss.Spans().AppendEmpty()
	span.CopyTo(newSpan)
	b, err := s.marshaler.MarshalTraces(traces)
	if err != nil {
		return 0, err
	}

	hasher := fnv.New64a()
	_, err = hasher.Write(b)
	if err != nil {
		return 0, err
	}

	return hasher.Sum64(), nil
}
163 changes: 163 additions & 0 deletions cmd/query/app/querysvc/adjuster/hash_test.go
@@ -0,0 +1,163 @@
// Copyright (c) 2024 The Jaeger Authors.
// SPDX-License-Identifier: Apache-2.0

package adjuster

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"go.opentelemetry.io/collector/pdata/ptrace"
)

func TestSpanHash_EmptySpans(t *testing.T) {
	adjuster := SpanHash()
	input := ptrace.NewTraces()
	expected := ptrace.NewTraces()
	adjuster.Adjust(input)
	assert.Equal(t, expected, input)
}

func TestSpanHash_RemovesDuplicateSpans(t *testing.T) {
	adjuster := SpanHash()
	input := func() ptrace.Traces {
		traces := ptrace.NewTraces()
		rs := traces.ResourceSpans().AppendEmpty()
		ss := rs.ScopeSpans().AppendEmpty()
		spans := ss.Spans()

		span1 := spans.AppendEmpty()
		span1.SetName("span1")
		span1.SetTraceID([16]byte{1})
		span1.SetSpanID([8]byte{1})

		span2 := spans.AppendEmpty()
		span2.SetName("span2")
		span2.SetTraceID([16]byte{1})
		span2.SetSpanID([8]byte{2})

		span3 := spans.AppendEmpty()
		span3.SetName("span1")
		span3.SetTraceID([16]byte{1})
		span3.SetSpanID([8]byte{1})

		span4 := spans.AppendEmpty()
		span4.SetName("span2")
		span4.SetTraceID([16]byte{1})
		span4.SetSpanID([8]byte{2})

		span5 := spans.AppendEmpty()
		span5.SetName("span3")
		span5.SetTraceID([16]byte{3})
		span5.SetSpanID([8]byte{4})

		rs2 := traces.ResourceSpans().AppendEmpty()
		ss2 := rs2.ScopeSpans().AppendEmpty()
		spans2 := ss2.Spans()

		span6 := spans2.AppendEmpty()
		span6.SetName("span4")
		span6.SetTraceID([16]byte{5})
		span6.SetSpanID([8]byte{6})

		span7 := spans2.AppendEmpty()
		span7.SetName("span3")
		span7.SetTraceID([16]byte{3})
		span7.SetSpanID([8]byte{4})

		return traces
	}
	expected := func() ptrace.Traces {
		traces := ptrace.NewTraces()
		rs := traces.ResourceSpans().AppendEmpty()
		ss := rs.ScopeSpans().AppendEmpty()
		spans := ss.Spans()

		span1 := spans.AppendEmpty()
		span1.SetName("span1")
		span1.SetTraceID([16]byte{1})
		span1.SetSpanID([8]byte{1})

		span2 := spans.AppendEmpty()
		span2.SetName("span2")
		span2.SetTraceID([16]byte{1})
		span2.SetSpanID([8]byte{2})

		span3 := spans.AppendEmpty()
		span3.SetName("span3")
		span3.SetTraceID([16]byte{3})
		span3.SetSpanID([8]byte{4})

		rs2 := traces.ResourceSpans().AppendEmpty()
		ss2 := rs2.ScopeSpans().AppendEmpty()
		spans2 := ss2.Spans()

		span4 := spans2.AppendEmpty()
		span4.SetName("span4")
		span4.SetTraceID([16]byte{5})
		span4.SetSpanID([8]byte{6})

		return traces
	}

	i := input()
	adjuster.Adjust(i)
	assert.Equal(t, expected(), i)
}

func TestSpanHash_NoDuplicateSpans(t *testing.T) {
	adjuster := SpanHash()
	input := func() ptrace.Traces {
		traces := ptrace.NewTraces()
		rs := traces.ResourceSpans().AppendEmpty()
		ss := rs.ScopeSpans().AppendEmpty()
		spans := ss.Spans()

		span1 := spans.AppendEmpty()
		span1.SetName("span1")
		span1.SetTraceID([16]byte{1})
		span1.SetSpanID([8]byte{1})

		span2 := spans.AppendEmpty()
		span2.SetName("span2")
		span2.SetTraceID([16]byte{1})
		span2.SetSpanID([8]byte{2})

		span3 := spans.AppendEmpty()
		span3.SetName("span3")
		span3.SetTraceID([16]byte{3})
		span3.SetSpanID([8]byte{4})

		return traces
	}
	expected := func() ptrace.Traces {
		traces := ptrace.NewTraces()
		rs := traces.ResourceSpans().AppendEmpty()
		ss := rs.ScopeSpans().AppendEmpty()
		spans := ss.Spans()

		span1 := spans.AppendEmpty()
		span1.SetName("span1")
		span1.SetTraceID([16]byte{1})
		span1.SetSpanID([8]byte{1})

		span2 := spans.AppendEmpty()
		span2.SetName("span2")
		span2.SetTraceID([16]byte{1})
		span2.SetSpanID([8]byte{2})

		span3 := spans.AppendEmpty()
		span3.SetName("span3")
		span3.SetTraceID([16]byte{3})
		span3.SetSpanID([8]byte{4})

		return traces
	}

	i := input()
	adjuster.Adjust(i)
	assert.Equal(t, expected(), i)
}

// TODO: write tests for duplicate spans with different outer attributes
// TODO: write tests for error cases
2 changes: 2 additions & 0 deletions cmd/query/app/querysvc/adjuster/sort.go
@@ -14,9 +14,11 @@ var _ Adjuster = (*SortAttributesAndEventsAdjuster)(nil)

// SortAttributesAndEvents creates an adjuster that standardizes trace data by sorting elements:
// - Resource attributes are sorted lexicographically by their keys.
// - Scope attributes are sorted lexicographically by their keys.
// - Span attributes are sorted lexicographically by their keys.
// - Span events are sorted lexicographically by their names.
// - Attributes within each span event are sorted lexicographically by their keys.
// - Attributes within each span link are sorted lexicographically by their keys.
func SortAttributesAndEvents() SortAttributesAndEventsAdjuster {
	return SortAttributesAndEventsAdjuster{}
}