otel: code cleanup #2286

deniseli · 2024-08-07T19:37:21Z

The noop refactor
Check across the metrics for attribute consistency, esp if there are attributes we log on some metrics that would be useful to add elsewhere
Check if the async call queue depth and latency metrics/attrs are enough, or if we'll want equivalents on pubsub directly (hopefully not, but double check)

The text was updated successfully, but these errors were encountered:

…patterns (#2288) part 1 of #2286

deniseli · 2024-08-20T00:19:26Z

Check if the async call queue depth and latency metrics/attrs are enough, or if we'll want equivalents on pubsub directly (hopefully not, but double check)

We should ideally add source tracing through all of these calls, but that's not a P1.

Fixes #2286 ### Log Buckets This PR introduces the ability to constrain `logBucket` with a max and min bucket. Log buckets just bucket numbers (in this case, ms latency): Example for base 4: ``` <1, [1,4), [4,16), [16,64), [64,256), [256,1024), [1024,4096), ... ``` In DataDog, looking across the last week, we see several latency buckets displaying low counts in the noise, especially on the buckets closer to 1. This makes sense given the intrinsic nature of log buckets - they are densest close to 1 and sparser the further up you go. To reduce cardinality (thus, cost), we can chop away that noise at the low end. It's similarly worth capping the buckets at the high end because although the buckets get exponentially larger, you can still have infinitely more buckets. We haven't been hit with any _major_ issues yet, but you can imagine how this could impact cardinality during a latency spike. ### Call Latency DataDog links: [bucketed counts](https://app.datadoghq.com/metric/explorer?fromUser=false&graph_layout=split&start=1723496312134&end=1724101112134&paused=false#N4Ig7glgJg5gpgFxALlAGwIYE8D2BXJVEADxQEYAaELcqyKBAC1pEbghkcLIF8qo4AMwgA7CAgg4RKUAiwAHOChASAtnADOcAE4RNIKtrgBHPJoQaUAbVBGN8qVoD6gnNtUZCKiOq279VKY6epbINiAiGOrKQdpYZAYgUJ4YThr42gDGSsgg6gi6mZaBZnHKGniqyIIIaAB0mRho9UZBGhbAAFQ8dRgaTpn4IggAFACUIDwAulSu7niYoeFzqgsYMaXxkzMg7VhoOaDyTYgIOUk4MAOLGhCZiWiicE5yiuU4j1APT070TMoiNweNCTfgQeyYLAvBTnR4iJTTHh8XbyR4IADCUmEMBkKLRTigPjgIlujmsoCkzxg2gw8kYTkU2mUNXqjWadW0eBELyJTlUGjqACM8JkANaIbZUR6qcQoABMABYqOltIQKdoBEzcgINPckTwgA), [avg latency rollup](https://app.datadoghq.com/metric/explorer?fromUser=false&graph_layout=multi&start=1723498319625&end=1724103119625&paused=false#N4Ig7glgJg5gpgFxALlAGwIYE8D2BXJVEADxQEYAaELcqyKBAC1pEbghkcLIF8qo4AMwgA7CAgg4RKUAiwAHOChASAtnADOcAE4RNIKtrgBHPJoQaUAbVBGN8qVoD6gnNtUZCKiOq279VKY6epbINiAiGOrKQdpYZAYgUJ4YThr42gDGSsgg6gi6mZaBZnHKGABuMMiCCGgAdJkYaA2qGk4IOE6ZOKryaIhwwABUPPXaOC148gAUlTAUAAQAbAAMAJQgPAC6VK7ueJih4fuqhxgxpfFbuyAacgMyIPLNiAg5STgw3UcaEJmJNCiOAdBQfdJAqCA4FOehMZQiNweNBbfgQeyYLCgxTKIEiJQ7Hh8O79cQAYSkwhgKBEhzQPCAA) Currently set to base 2 without min/max. * `<1`, `1-2` .. `8-16`: noise * `16-32` .. `1024-2048`: healthy number of data points * `2048-4096`, `4096-8192`: some spikes, but low data * `8192-16384`: healthy number of data points (worth investigating separately from this change) * 15 buckets total Proposed change: base 4, min `4^3` (i.e. `<64` is the smallest bucket), max `4^7` (i.e. `>=16384` is the highest bucket, accounting for outliers). * => Max cardinality = 6 buckets ### Async Call Latency DataDog links: [bucketed counts](https://app.datadoghq.com/metric/explorer?fromUser=false&graph_layout=split&start=1723497145524&end=1724101945524&paused=false#N4Ig7glgJg5gpgFxALlAGwIYE8D2BXJVEADxQEYAaELcqyKBAC1pEbghkcLIF8qo4AMwgA7CAgg4RKUAiwAHOChASAtnADOcAE4RNIKtrgBHPJoQaUAbVBGN8qVoD6gnNtUZCKiOq279VKY6epbINiAiGOrKQdpYZAYgUJ4YThr42gDGSsgg6gi6mZaBZnHKGniqyIIIaAB0GBpYIplOmRho9Zk4qvJoiHBQwABUPA0abfgiCAAUAJQgPAC6VK7ueJih4WuqGxgxpfGLKyAacv0yIPIdiAg5STgwbZsaEJmJaKJwTnKK5TifKAfL5OehMZQiNweNCLfgQeyYLA-BT3T4iJTLHh8U59cQAYSkwhglwR4icUB8cBEr0c1lAUm+MG0GHkjCcim0yhq9UazVa7U6dTU31eLRFmTYUA2gycnicqg0dQARnhMgBrRDHKifVTiFAAJgALFR0tpCPTtAJObkBBp3lieEA), [avg latency rollup](https://app.datadoghq.com/metric/explorer?fromUser=false&graph_layout=multi&start=1723498319625&end=1724103119625&paused=false#N4Ig7glgJg5gpgFxALlAGwIYE8D2BXJVEADxQEYAaELcqyKBAC1pEbghkcLIF8qo4AMwgA7CAgg4RKUAiwAHOChASAtnADOcAE4RNIKtrgBHPJoQaUAbVBGN8qVoD6gnNtUZCKiOq279VKY6epbINiAiGOrKQdpYZAYgUJ4YThr42gDGSsgg6gi6mZaBZnHKGABuMMiCCGgAdBgaWCKZTpkYaA2qGk4IOO04qvJoiHDAAFQ89do4XXjyABSVMBQABABsAAwAlCA8ALpUru54mKHhJ6pnGDGl8ftHIBpyozIg8p2ICDlJODDtc4aCCZRJoURwPoKX7pcFQMEQpz0JjKERuDxofb8CD2TBYKGKZTgkRKQ48PjPEbiADCUmEMBQIjOaB4QA) Currently set to base 8 without min/max. * `8-64`: noise. Note that `<1` and `1-8` buckets have never occurred, though they are possible in the current setup. * `64-512`, `512-4096`: healthy number of data points * `4096-32768`: noise * 4 buckets total Proposed change: base 4, min `4^4` (i.e. `<256` is the smallest bucket), max `4^6` (i.e. `>=4096` is the highest bucket) * => Max cardinality = 4 buckets ### Ingress Latency DataDog links: [bucketed counts](https://app.datadoghq.com/metric/explorer?fromUser=false&graph_layout=split&start=1723497891973&end=1724102691973&paused=false#N4Ig7glgJg5gpgFxALlAGwIYE8D2BXJVEADxQEYAaELcqyKBAC1pEbghkcLIF8qo4AMwgA7CAgg4RKUAiwAHOChASAtnADOcAE4RNIKtrgBHPJoQaUAbVBGN8qVoD6gnNtUZCKiOq279VKY6epbINiAiGOrKQdpYZAYgUJ4YThr42gDGSsgg6gi6mZaBZnHKGniqyIIIaAB0ojB2GnVGQRoWwABUPHUYGk6Z+CIIABQAlCA8ALpUru54mKHh86qLGDGl8VOzIB1YaDmg8hiHCAg5STgwg0saEJmJaKJwTnKK5TjPUE8vTvRMZQiNweNBTfgQeyYLBvBSXZ4iJQzHh8PbyZ4IADCUmEMBkaIxTigPjgInujmsoCkryaGHkjCcim0yhq9UazVaeBEbxJTlULQARnhMgBrRA7KjPVTiFAAJgALFR0tpCFTtAJmbkBBpHiieEA), [avg latency rollup](https://app.datadoghq.com/metric/explorer?fromUser=false&graph_layout=multi&start=1723498319625&end=1724103119625&paused=false#N4Ig7glgJg5gpgFxALlAGwIYE8D2BXJVEADxQEYAaELcqyKBAC1pEbghkcLIF8qo4AMwgA7CAgg4RKUAiwAHOChASAtnADOcAE4RNIKtrgBHPJoQaUAbVBGN8qVoD6gnNtUZCKiOq279VKY6epbINiAiGOrKQdpYZAYgUJ4YThr42gDGSsgg6gi6mZaBZnHKGABuMMiCCGgAdKIwdhr1qhpOCDhOmTiq8miIcMAAVDz12jhoaHjyABSVMBQABABsAAwAlCA8ALpUru54mKHhh6rHGDGl8Tv7IBpygzIg8hiDCAg5STgwPScaCCZRJoURwToKb7pUFQEFgpz0JjKERuDxoHb8CD2TBYCGKZSgkRKPY8PgPAbiADCUmEMBQImOaB4QA) Currently set to base 2 without min/max. * `<1`, `1-2` .. `16-32`: noise. * `32-64` .. `1024-2048`: healthy number of data points * `2048-4096`, `4096-8192 `: noise * `8192-16384`: healthy number of data points (worth investigating separately from this change) * 15 buckets total Proposed change: base 4, min `4^3` (i.e. `<64` is the smallest bucket), max `4^7` (i.e. `>=16384` is the highest bucket). * => Max cardinality = 6 buckets

deniseli self-assigned this Aug 7, 2024

deniseli mentioned this issue Aug 7, 2024

Metrics pass over FTL #1988

Open

15 tasks

ftl-robot mentioned this issue Aug 7, 2024

Dashboard #728

Open

deniseli mentioned this issue Aug 7, 2024

chore: minor refactor to get all otel metric code following the same patterns #2288

Merged

github-merge-queue bot pushed a commit that referenced this issue Aug 7, 2024

chore: minor refactor to get all otel metric code following the same …

4f39234

…patterns (#2288) part 1 of #2286

deniseli mentioned this issue Aug 19, 2024

feat: tune latency attribute buckets to reduce cardinality #2432

Merged

deniseli closed this as completed in #2432 Aug 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

otel: code cleanup #2286

otel: code cleanup #2286

deniseli commented Aug 7, 2024 •

edited

Loading

deniseli commented Aug 20, 2024

otel: code cleanup #2286

otel: code cleanup #2286

Comments

deniseli commented Aug 7, 2024 • edited Loading

deniseli commented Aug 20, 2024

deniseli commented Aug 7, 2024 •

edited

Loading