(fix) Prevent Datadog timeout errors in logs #5692

Merged Jul 22, 2024 (4 commits)

Conversation

BrynCooke
Contributor

@BrynCooke BrynCooke commented Jul 19, 2024

The timeout errors are caused by connection pooling in the Datadog exporter.
Setting the pool timeout to a very low value makes the errors disappear.

Improves #2058

This is impossible to automatically test as the Datadog agent test image does not behave in the same way as the real agent.

Related issue:
open-telemetry/opentelemetry-rust-contrib#7

I did some manual testing.
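
One way to picture why a very low pool timeout helps: if the peer closes connections that sit idle, a pooled connection can already be dead when it is reused, which would surface as `Connection reset by peer`. The sketch below is a toy model of that idea and not the router's code. `Pool`, `IdleConn`, `checkout`, and `checkin` are hypothetical names, and the assumption that the agent drops idle connections is mine rather than something this PR states.

```rust
use std::time::{Duration, Instant};

/// A pooled connection, tagged with the moment it was returned to the pool.
struct IdleConn {
    id: u32,
    idle_since: Instant,
}

/// Toy connection pool (hypothetical; not a real HTTP client type).
struct Pool {
    idle: Vec<IdleConn>,
    idle_timeout: Duration,
    next_id: u32,
}

impl Pool {
    fn new(idle_timeout: Duration) -> Self {
        Pool { idle: Vec::new(), idle_timeout, next_id: 0 }
    }

    /// Returns (connection id, whether it was reused from the pool).
    fn checkout(&mut self, now: Instant) -> (u32, bool) {
        // Discard connections that have been idle for at least the timeout.
        let timeout = self.idle_timeout;
        self.idle.retain(|c| now.duration_since(c.idle_since) < timeout);
        match self.idle.pop() {
            Some(conn) => (conn.id, true),
            None => {
                // Nothing reusable: open a fresh connection.
                let id = self.next_id;
                self.next_id += 1;
                (id, false)
            }
        }
    }

    /// Hands a connection back to the pool and records when it became idle.
    fn checkin(&mut self, id: u32, now: Instant) {
        self.idle.push(IdleConn { id, idle_since: now });
    }
}

fn main() {
    let start = Instant::now();
    let later = start + Duration::from_secs(30);

    // Long idle timeout: the 30-second-old connection is handed out again.
    let mut long_lived = Pool::new(Duration::from_secs(90));
    let (id, _) = long_lived.checkout(start);
    long_lived.checkin(id, start);
    println!("90 s timeout reused: {}", long_lived.checkout(later).1);

    // Very short idle timeout: the old connection is dropped and replaced.
    let mut short_lived = Pool::new(Duration::from_millis(1));
    let (id, _) = short_lived.checkout(start);
    short_lived.checkin(id, start);
    println!("1 ms timeout reused: {}", short_lived.checkout(later).1);
}
```

With a short timeout every export effectively opens a new connection, trading a little connection setup cost for never writing to a socket the other side may have closed.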


#ROUTER-456
Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Tests added and passing [3]
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.


@router-perf

router-perf bot commented Jul 19, 2024

CI performance tests

  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

@BrynCooke BrynCooke changed the title from "Prevent Datadog timeout errors in logs" to "(fix) Prevent Datadog timeout errors in logs" Jul 19, 2024
@BrynCooke BrynCooke requested review from garypen and bnjjj July 19, 2024 16:55
@BrynCooke BrynCooke marked this pull request as ready for review July 22, 2024 08:12
@BrynCooke BrynCooke requested review from a team as code owners July 22, 2024 08:12
@BrynCooke BrynCooke merged commit c460962 into dev Jul 22, 2024
14 checks passed
@BrynCooke BrynCooke deleted the bryn/datadog-timeout branch July 22, 2024 09:55
BrynCooke added a commit that referenced this pull request Jul 26, 2024
@bnjjj bnjjj mentioned this pull request Jul 30, 2024
aaronArinder referenced this pull request in apollographql/rover Aug 1, 2024
[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [apollographql/router](https://togithub.com/apollographql/router) |
minor | `v1.51.0` -> `v1.52.0` |

---

### Release Notes

<details>
<summary>apollographql/router (apollographql/router)</summary>

###
[`v1.52.0`](https://togithub.com/apollographql/router/releases/tag/v1.52.0)

[Compare
Source](https://togithub.com/apollographql/router/compare/v1.51.0-rc.0...v1.52.0-rc.0)

#### 🚀 Features

##### Provide Helm support for when the router's `health_check` default path is not being used ([Issue #&#8203;5652](https://togithub.com/apollographql/router/issues/5652))

When the Helm chart defines the liveness and readiness probes, it now
uses the router's configured `health_check` path if a non-default one is
set, rather than the default (`/health`).

By [Jon Christiansen](https://togithub.com/theJC) in
[https://github.com/apollographql/router/pull/5653](https://togithub.com/apollographql/router/pull/5653)

##### Support new span and metrics formats for entity caching ([PR
#&#8203;5625](https://togithub.com/apollographql/router/pull/5625))

Metrics of the router's entity cache have been converted to the latest
format with support for custom telemetry.

The following example configuration shows the `cache` instrument,
the `cache` selector in the subgraph service, and the `cache` attribute
of a subgraph span:

```yaml
telemetry:
  instrumentation:
    instruments:
      default_requirement_level: none
      cache:
        apollo.router.operations.entity.cache:
          attributes:
            entity.type: true
            subgraph.name:
              subgraph_name: true
            supergraph.operation.name:
              supergraph_operation_name: string
      subgraph:
        only_cache_hit_on_subgraph_products:
          type: counter
          value:
            cache: hit
          unit: hit
          description: counter of subgraph request cache hit on subgraph products
          condition:
            all:
            - eq:
              - subgraph_name: true
              - products
            - gt:
              - cache: hit
              - 0
          attributes:
            subgraph.name: true
            supergraph.operation.name:
              supergraph_operation_name: string

```

To learn more, go to [Entity caching
docs](https://www.apollographql.com/docs/router/configuration/entity-caching).

By [@&#8203;Geal](https://togithub.com/Geal) and
[@&#8203;bnjjj](https://togithub.com/bnjjj) in
[https://github.com/apollographql/router/pull/5625](https://togithub.com/apollographql/router/pull/5625)

##### Helm: Support renaming key for retrieving APOLLO_KEY secret
([Issue
#&#8203;5661](https://togithub.com/apollographql/router/issues/5661))

A user of the router Helm chart can now rename the key used to retrieve
the value of the secret key referenced by `APOLLO_KEY`.

Previously, the router Helm chart hardcoded the key name to
`managedFederationApiKey`. This didn't support users whose
infrastructure required custom key names when getting secrets, such as
Kubernetes users who need to use specific key names to access a
`secretStore` or `externalSecret`. This change provides a user the
ability to control the name of the key to use in retrieving that value.

By [Jon Christiansen](https://togithub.com/theJC) in
[https://github.com/apollographql/router/pull/5662](https://togithub.com/apollographql/router/pull/5662)

#### 🐛 Fixes

##### Prevent Datadog timeout errors in logs ([Issue
#&#8203;2058](https://togithub.com/apollographql/router/issue/2058))

The router's Datadog exporter has been updated to reduce the frequency
of logged errors related to connection pools.

Previously, the connection pools used by the Datadog exporter frequently
timed out, and each timeout logged an error like the following:

2024-07-19T15:28:22.970360Z ERROR OpenTelemetry trace error occurred:
error sending request for url (http://127.0.0.1:8126/v0.5/traces):
connection error: Connection reset by peer (os error 54)

Now, the pool timeout for the Datadog exporter has been changed so that
timeout errors happen much less frequently.

By [@&#8203;BrynCooke](https://togithub.com/BrynCooke) in
[https://github.com/apollographql/router/pull/5692](https://togithub.com/apollographql/router/pull/5692)

##### Allow service version overrides ([PR
#&#8203;5689](https://togithub.com/apollographql/router/pull/5689))

The router now supports configuration of `service.version` via YAML file
configuration. This enables users to produce custom versioned builds of
the router.

The following example overrides the version to be `1.0`:

```yaml
telemetry:
  exporters:
    tracing:
      common:
        resource:
          service.version: 1.0
```

By [@&#8203;BrynCooke](https://togithub.com/BrynCooke) in
[https://github.com/apollographql/router/pull/5689](https://togithub.com/apollographql/router/pull/5689)

##### Populate Datadog `span.kind` ([PR
#&#8203;5609](https://togithub.com/apollographql/router/pull/5609))

Because Datadog traces use `span.kind` to differentiate between
different types of spans, the router now ensures that `span.kind` is
correctly populated using the OpenTelemetry span kind, which has a one-to-one
mapping to those set out in
[dd-trace](https://togithub.com/DataDog/dd-trace-go/blob/main/ddtrace/ext/span_kind.go).
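
As a rough illustration of such a mapping, the sketch below converts a span kind into the lowercase string values defined in the linked dd-trace-go file. The `SpanKind` enum here is a local stand-in for the OpenTelemetry type, and `datadog_span_kind` is a hypothetical helper name, not the router's actual function.

```rust
/// Local stand-in for the OpenTelemetry span kind (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum SpanKind {
    Client,
    Server,
    Producer,
    Consumer,
    Internal,
}

/// Maps each span kind to the string Datadog expects in `span.kind`.
fn datadog_span_kind(kind: SpanKind) -> &'static str {
    match kind {
        SpanKind::Client => "client",
        SpanKind::Server => "server",
        SpanKind::Producer => "producer",
        SpanKind::Consumer => "consumer",
        SpanKind::Internal => "internal",
    }
}

fn main() {
    let kinds = [
        SpanKind::Client,
        SpanKind::Server,
        SpanKind::Producer,
        SpanKind::Consumer,
        SpanKind::Internal,
    ];
    for kind in kinds.iter() {
        println!("{:?} -> span.kind = {}", kind, datadog_span_kind(*kind));
    }
}
```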

By [@&#8203;BrynCooke](https://togithub.com/BrynCooke) in
[https://github.com/apollographql/router/pull/5609](https://togithub.com/apollographql/router/pull/5609)

##### Remove unnecessary internal metric events from traces and spans
([PR #&#8203;5649](https://togithub.com/apollographql/router/pull/5649))

The router no longer includes some internal metric events in traces and
spans that shouldn't have been included originally.

By [@&#8203;bnjjj](https://togithub.com/bnjjj) in
[https://github.com/apollographql/router/pull/5649](https://togithub.com/apollographql/router/pull/5649)

##### Support Datadog span metrics ([PR
#&#8203;5609](https://togithub.com/apollographql/router/pull/5609))

When using the APM view in Datadog, the router now displays span metrics
for top-level spans or spans with the `_dd.measured` flag set.

The router sets the `_dd.measured` flag by default for the following
spans:

-   `request`
-   `router`
-   `supergraph`
-   `subgraph`
-   `subgraph_request`
-   `http_request`
-   `query_planning`
-   `execution`
-   `query_parsing`

To enable or disable span metrics for any span, configure `span_metrics`
for the Datadog exporter:

```yaml
telemetry:
  exporters:
    tracing:
      datadog:
        enabled: true
        span_metrics:
          # Disable span metrics for supergraph
          supergraph: false
          # Enable span metrics for my_custom_span
          my_custom_span: true
```
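
The interaction between the default list and the `span_metrics` overrides can be sketched as a simple lookup. This is an assumption about how the precedence might work (an explicit setting wins, otherwise the default list applies), not the router's implementation. `is_measured` and `DEFAULT_MEASURED` are hypothetical names, and the sketch ignores the separate rule for top-level spans.

```rust
use std::collections::HashMap;

/// Spans that get the `_dd.measured` flag by default, per the list above.
const DEFAULT_MEASURED: [&str; 9] = [
    "request",
    "router",
    "supergraph",
    "subgraph",
    "subgraph_request",
    "http_request",
    "query_planning",
    "execution",
    "query_parsing",
];

/// Decides whether a span should be flagged as measured.
/// An explicit override wins; otherwise fall back to the default list.
fn is_measured(span_name: &str, overrides: &HashMap<String, bool>) -> bool {
    match overrides.get(span_name) {
        Some(&enabled) => enabled,
        None => DEFAULT_MEASURED.iter().any(|&d| d == span_name),
    }
}

fn main() {
    // Mirrors the YAML example: supergraph disabled, my_custom_span enabled.
    let mut overrides = HashMap::new();
    overrides.insert("supergraph".to_string(), false);
    overrides.insert("my_custom_span".to_string(), true);

    for name in ["router", "supergraph", "my_custom_span", "other"].iter() {
        println!("{}: {}", name, is_measured(name, &overrides));
    }
}
```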

By [@&#8203;BrynCooke](https://togithub.com/BrynCooke) in
[https://github.com/apollographql/router/pull/5609](https://togithub.com/apollographql/router/pull/5609)
and
[https://github.com/apollographql/router/pull/5703](https://togithub.com/apollographql/router/pull/5703)

##### Use spawn_blocking for query parsing and validation ([PR
#&#8203;5235](https://togithub.com/apollographql/router/pull/5235))

To prevent its executor threads from blocking on large queries, the
router now runs query parsing and validation in a Tokio blocking task.
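
The general pattern is to move CPU-bound work off the threads that serve requests. The sketch below shows that pattern with `std::thread` and a channel as a stand-in, since Tokio's `spawn_blocking` needs the `tokio` crate. `parse_and_validate` is a hypothetical placeholder that only checks brace balance and is not the router's parser.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for expensive parsing and validation:
/// checks that braces balance and returns the maximum nesting depth.
fn parse_and_validate(query: &str) -> Result<usize, String> {
    let mut depth: usize = 0;
    let mut max_depth: usize = 0;
    for ch in query.chars() {
        match ch {
            '{' => {
                depth += 1;
                max_depth = max_depth.max(depth);
            }
            '}' => {
                // A closing brace without a matching opener is an error.
                depth = depth
                    .checked_sub(1)
                    .ok_or_else(|| "unbalanced braces".to_string())?;
            }
            _ => {}
        }
    }
    if depth == 0 {
        Ok(max_depth)
    } else {
        Err("unbalanced braces".to_string())
    }
}

/// Runs the parse on a separate thread so the caller is free to do
/// other work until it chooses to wait on the receiver.
fn parse_off_thread(query: String) -> mpsc::Receiver<Result<usize, String>> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore the send error if the receiver was dropped.
        let _ = tx.send(parse_and_validate(&query));
    });
    rx
}

fn main() {
    let pending = parse_off_thread("{ a { b } }".to_string());
    // The calling thread could handle other requests here.
    println!("parse result: {:?}", pending.recv().unwrap());
}
```

In an async runtime the equivalent step hands the closure to the runtime's blocking pool and awaits the result, which keeps executor threads available for other tasks while a large query is parsed.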

By [@&#8203;xuorig](https://togithub.com/xuorig) in
[https://github.com/apollographql/router/pull/5235](https://togithub.com/apollographql/router/pull/5235)

#### 🛠 Maintenance

##### chore: Update rhai to latest release (1.19.0) ([PR
#&#8203;5655](https://togithub.com/apollographql/router/pull/5655))

In Rhai 1.18.0, there were changes to how exceptions within functions
were created. For details see:
https://github.com/rhaiscript/rhai/blob/7e0ac9d3f4da9c892ed35a211f67553a0b451218/CHANGELOG.md?plain=1#L12

We've modified how we handle errors raised by Rhai to comply with this
change, which means error message output is affected. The change means
that errors in functions will no longer document which function the
error occurred in, for example:

```diff
-         "rhai execution error: 'Runtime error: I have raised an error (line 223, position 5)\nin call to function 'process_subgraph_response_string''"
+         "rhai execution error: 'Runtime error: I have raised an error (line 223, position 5)'"
```

Making this change allows us to keep up with the latest version (1.19.0)
of Rhai.

By [@&#8203;garypen](https://togithub.com/garypen) in
[https://github.com/apollographql/router/pull/5655](https://togithub.com/apollographql/router/pull/5655)

##### Add version in the entity cache hash ([PR
#&#8203;5701](https://togithub.com/apollographql/router/pull/5701))

The hashing algorithm of the router's entity cache has been updated to
include the entity cache version.
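
To see why this leads to cache regeneration, consider a minimal sketch in which the version is mixed into the hash. The function name `entity_hash` and its inputs are hypothetical, and `DefaultHasher` is used only for illustration, since its output is not guaranteed to be stable across Rust releases and so would not suit a persistent cache.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes an entity together with a cache version (hypothetical inputs).
/// Changing the version changes the hash, so old entries are never matched.
fn entity_hash(cache_version: &str, entity_type: &str, representation: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    cache_version.hash(&mut hasher);
    entity_type.hash(&mut hasher);
    representation.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let old = entity_hash("1.0", "Product", "{\"id\":\"42\"}");
    let new = entity_hash("1.1", "Product", "{\"id\":\"42\"}");
    // Same entity, different version: the lookup key no longer matches.
    println!("old = {:x}, new = {:x}, equal = {}", old, new, old == new);
}
```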

> [!IMPORTANT]
> If you have previously enabled [entity
> caching](https://www.apollographql.com/docs/router/configuration/entity-caching),
> you should expect additional cache regeneration costs when updating to
> this version of the router while the new hashing algorithm comes into
> service.

By [@&#8203;bnjjj](https://togithub.com/bnjjj) in
[https://github.com/apollographql/router/pull/5701](https://togithub.com/apollographql/router/pull/5701)

##### Improve testing by avoiding cache effects and redacting tracing
details ([PR
#&#8203;5638](https://togithub.com/apollographql/router/pull/5638))

We've had some problems with flaky tests and this PR addresses some of
them.

The router executes in parallel and concurrently. Many of our tests use
snapshots to try and make assertions that functionality is continuing to
work correctly. Unfortunately, concurrent/parallel execution and static
snapshots don't co-operate very well. Results may appear in
pseudo-random order (compared to snapshot expectations) and so tests
become flaky and fail without obvious cause.

The problem becomes particularly acute with features which are
specifically designed for highly concurrent operation, such as batching.

This set of changes addresses some of the router testing problems by:

1. Making items in a batch test different enough that caching effects
are avoided.
2. Redacting various details so that sequencing is not as much of an
issue in the otel traces tests.
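
The redaction idea can be sketched as a normalization step applied before comparing against a snapshot. This is an illustrative assumption, not the test code from the PR: the `trace_id=` token format and the `redact` and `normalize` helpers are hypothetical, and the sorting step is an addition of this sketch to show one way ordering can be made irrelevant.

```rust
/// Replaces the value of any `trace_id=...` token with a fixed placeholder.
fn redact(line: &str) -> String {
    line.split_whitespace()
        .map(|token| {
            if token.starts_with("trace_id=") {
                "trace_id=[REDACTED]".to_string()
            } else {
                token.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join(" ")
}

/// Redacts every line, then sorts so arrival order no longer matters.
fn normalize(lines: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = lines.iter().map(|line| redact(line)).collect();
    out.sort();
    out
}

fn main() {
    // Two runs of the same test: different ids and a different order.
    let run_one = ["span=b trace_id=2f", "span=a trace_id=9c"];
    let run_two = ["span=a trace_id=11", "span=b trace_id=77"];
    println!("snapshots match: {}", normalize(&run_one) == normalize(&run_two));
}
```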

By [@&#8203;garypen](https://togithub.com/garypen) in
[https://github.com/apollographql/router/pull/5638](https://togithub.com/apollographql/router/pull/5638)

#### 📚 Documentation

##### Update router naming conventions ([PR
#&#8203;5400](https://togithub.com/apollographql/router/pull/5400))

Renames our router product to distinguish between our non-commercial and
commercial offerings. Instead of referring to the **Apollo Router**, we
now refer to the following:

- **Apollo Router Core** is Apollo’s free-and-open (ELv2 licensed)
implementation of a routing runtime for supergraphs.
- **GraphOS Router** is based on the Apollo Router Core and fully
integrated with GraphOS. GraphOS Routers provide access to GraphOS’s
commercial runtime features.

By [@&#8203;shorgi](https://togithub.com/shorgi) in
[https://github.com/apollographql/router/pull/5400](https://togithub.com/apollographql/router/pull/5400)

#### 🧪 Experimental

##### Enable Rust-based API schema implementation ([PR
#&#8203;5623](https://togithub.com/apollographql/router/pull/5623))

The router has transitioned to solely using a Rust-based API schema
generation implementation.

Previously, the router used a JavaScript-based implementation. After
testing for a few months, we've validated the improved performance and
robustness of the new Rust-based implementation, so the router now only
uses it.

By [@&#8203;goto-bus-stop](https://togithub.com/goto-bus-stop) in
[https://github.com/apollographql/router/pull/5623](https://togithub.com/apollographql/router/pull/5623)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---


This PR was generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View the
[repository job
log](https://developer.mend.io/github/apollographql/rover).

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>