Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: tune latency attribute buckets to reduce cardinality #2432

Merged
merged 2 commits into from
Aug 20, 2024

Conversation

deniseli
Copy link
Contributor

@deniseli deniseli commented Aug 19, 2024

Fixes #2286

Log Buckets

This PR introduces the ability to constrain logBucket with a max and min bucket. Log buckets just bucket numbers (in this case, ms latency):

Example for base 4:

<1, [1,4), [4,16), [16,64), [64,256), [256,1024), [1024,4096), ...

In DataDog, looking across the last week, we see several latency buckets displaying low counts in the noise, especially on the buckets closer to 1. This makes sense given the intrinsic nature of log buckets - they are densest close to 1 and sparser the further up you go. To reduce cardinality (thus, cost), we can chop away that noise at the low end.

It's similarly worth capping the buckets at the high end because although the buckets get exponentially larger, you can still have infinitely more buckets. We haven't been hit with any major issues yet, but you can imagine how this could impact cardinality during a latency spike.

Call Latency

DataDog links: bucketed counts, avg latency rollup

Currently set to base 2 without min/max.

  • <1, 1-2 .. 8-16: noise
  • 16-32 .. 1024-2048: healthy number of data points
  • 2048-4096, 4096-8192: some spikes, but low data
  • 8192-16384: healthy number of data points (worth investigating separately from this change)
  • 15 buckets total

Proposed change: base 4, min 4^3 (i.e. <64 is the smallest bucket), max 4^7 (i.e. >=16384 is the highest bucket, accounting for outliers).

  • => Max cardinality = 6 buckets

Async Call Latency

DataDog links: bucketed counts, avg latency rollup

Currently set to base 8 without min/max.

  • 8-64: noise. Note that <1 and 1-8 buckets have never occurred, though they are possible in the current setup.
  • 64-512, 512-4096: healthy number of data points
  • 4096-32768: noise
  • 4 buckets total

Proposed change: base 4, min 4^4 (i.e. <256 is the smallest bucket), max 4^6 (i.e. >=4096 is the highest bucket)

  • => Max cardinality = 4 buckets

Ingress Latency

DataDog links: bucketed counts, avg latency rollup
Currently set to base 2 without min/max.

  • <1, 1-2 .. 16-32: noise.
  • 32-64 .. 1024-2048: healthy number of data points
  • 2048-4096, 4096-8192 : noise
  • 8192-16384: healthy number of data points (worth investigating separately from this change)
  • 15 buckets total

Proposed change: base 4, min 4^3 (i.e. <64 is the smallest bucket), max 4^7 (i.e. >=16384 is the highest bucket).

  • => Max cardinality = 6 buckets

@ftl-robot ftl-robot mentioned this pull request Aug 19, 2024
@deniseli deniseli marked this pull request as ready for review August 19, 2024 22:05
@deniseli deniseli requested a review from alecthomas as a code owner August 19, 2024 22:05
@deniseli deniseli requested review from a team, gak, wesbillman, safeer and jonathanj-square and removed request for a team August 19, 2024 22:05
Copy link
Contributor

@safeer safeer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Might we want to have the largest bucket specifically for outliers? I.e. > 4^7 for the call latency and ingress latency buckets

@deniseli
Copy link
Contributor Author

Nice! Might we want to have the largest bucket specifically for outliers? I.e. > 4^7 for the call latency and ingress latency buckets

Great idea! I'll add that in now.

@deniseli deniseli added this pull request to the merge queue Aug 20, 2024
Merged via the queue into main with commit e54e220 Aug 20, 2024
18 checks passed
@deniseli deniseli deleted the dli/trim-buckets branch August 20, 2024 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

otel: code cleanup
2 participants