prep release: v1.59.0 #6460

Merged
merged 6 commits into from
Dec 17, 2024

Conversation

BrynCooke
Contributor

@BrynCooke BrynCooke commented Dec 16, 2024

Note

When approved, this PR will merge into the 1.59.0 branch which will — upon being approved itself — merge into main.

Things to review in this PR:

  • Changelog correctness (There is a preview below, but it is not necessarily the most up to date. See the Files Changed for the true reality.)
  • Version bumps
  • That it targets the right release branch (1.59.0 in this case!).

[1.59.0] - 2024-12-17

Important

If you have enabled distributed query plan caching, updates to the query planner in this release will result in query plan caches being regenerated rather than reused. On account of this, you should anticipate additional cache regeneration cost when updating to this router version while the new query plans come into service.

🚀 Features

General availability of native query planner

The router's native, Rust-based, query planner is now generally available and enabled by default.

The native query planner achieves better performance for a variety of graphs. In our tests, we observe:

  • 10x median improvement in query planning time (observed via apollo.router.query_planning.plan.duration)
  • 2.9x improvement in router’s CPU utilization
  • 2.2x improvement in router’s memory usage

Note: you can expect generated plans and subgraph operations in the native query planner to have slight differences when compared to the legacy, JavaScript-based query planner. We've ascertained these differences to be semantically insignificant, based on comparing ~2.5 million known unique user operations in GraphOS as well as comparing ~630 million operations across actual router deployments in shadow mode over a four-month period.

The native query planner supports Federation v2 supergraphs. If you are using Federation v1 today, see our migration guide on how to update your composition build step. Subgraph changes are typically not needed.

The legacy, JavaScript-based query planner is deprecated in this release, but you can still switch back to it if you are still using a Federation v1 supergraph:

experimental_query_planner_mode: legacy

Note: The subgraph operations generated by the query planner are not guaranteed to be consistent from release to release. We strongly recommend against relying on the shape of planned subgraph operations, as new router features and optimizations will continuously affect it.

By @sachindshinde, @goto-bus-stop, @duckki, @TylerBloom, @SimonSapin, @dariuszkuc, @lrlna, @clenfest, and @o0Ignition0o.

Ability to skip persisted query list safelisting enforcement via plugin (PR #6403)

If safelisting is enabled, a router_service plugin can skip enforcement of the safelist (including the require_id check) by adding the key apollo_persisted_queries::safelist::skip_enforcement with value true to the request context.

Note: this doesn't affect the logging of unknown operations by the persisted_queries.log_unknown option.

In cases where an operation would have been denied but is allowed due to the context key existing, the attribute persisted_queries.safelist.enforcement_skipped is set on the apollo.router.operations.persisted_queries metric with value true.
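
The decision described above can be modeled roughly as follows. This is a minimal Python sketch of the behavior, not router code: the `check_safelist` function, the plain-dict `context`, and the returned tuple are hypothetical. Only the context key and the metric attribute name are taken from the notes above.

```python
# Names taken from the changelog entry above.
SKIP_KEY = "apollo_persisted_queries::safelist::skip_enforcement"
SKIPPED_ATTR = "persisted_queries.safelist.enforcement_skipped"


def check_safelist(context, operation_is_safelisted):
    """Hypothetical model: return (allowed, metric_attributes) for one request.

    `context` stands in for the request context as a plain dict, and
    `operation_is_safelisted` stands in for the router's own safelist check.
    """
    attributes = {}
    if operation_is_safelisted:
        # Nothing to skip: the operation passes enforcement on its own.
        return True, attributes
    # The operation would be denied. A router_service plugin may have asked
    # to skip enforcement by setting the context key to true.
    if context.get(SKIP_KEY) is True:
        attributes[SKIPPED_ATTR] = True
        return True, attributes
    return False, attributes
```

In this model the attribute is only recorded when the context key actually changed the outcome, which mirrors the "would have been denied but is allowed" wording.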

By @glasser in #6403

Add fleet awareness plugin (PR #6151)

A new fleet_awareness plugin has been added that reports telemetry to Apollo about the configuration and deployment of the router.

The reported telemetry includes CPU and memory usage, CPU frequency, and other deployment characteristics such as operating system and cloud provider. For more details, along with a full list of data captured and how to opt out, go to our data privacy policy.

By @jonathanrainer, @nmoutschen, @loshz in #6151

Add fleet awareness schema metric (PR #6283)

The router now supports the apollo.router.instance.schema metric for its fleet_detector plugin. It has two attributes: schema_hash and launch_id.

By @loshz and @nmoutschen in #6283

Support client name for persisted query lists (PR #6198)

The persisted query manifest fetched from Apollo Uplink can now contain a clientName field in each operation. Two operations with the same id but different clientName are considered to be distinct operations, and they may have distinct bodies.

The router resolves the client name by taking the first from the following that exists:

  • Reading the apollo_persisted_queries::client_name context key that may be set by a router_service plugin
  • Reading the HTTP header named by telemetry.apollo.client_name_header, which defaults to apollographql-client-name

If a client name can be resolved for a request, the router first tries to find a persisted query with the specified ID and the resolved client name.

If there is no operation with that ID and client name, or if a client name cannot be resolved, the router tries to find a persisted query with the specified ID and no client name specified. This means that existing PQ lists that don't contain client names will continue to work.
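
The resolution and fallback order can be sketched as below. This is an illustrative Python model under stated assumptions, not the router's implementation: the function names, the dict-based manifest keyed by `(id, client_name)`, and the lowercase-keyed `headers` dict are all hypothetical. The context key and default header name come from the description above.

```python
# Names taken from the changelog entry above.
CONTEXT_KEY = "apollo_persisted_queries::client_name"
DEFAULT_CLIENT_NAME_HEADER = "apollographql-client-name"


def resolve_client_name(context, headers, header_name=DEFAULT_CLIENT_NAME_HEADER):
    """Return the first client name that exists, or None.

    Assumes `headers` is a dict with lowercase header names.
    """
    # 1. A router_service plugin may have set the context key.
    name = context.get(CONTEXT_KEY)
    if name is not None:
        return name
    # 2. Otherwise fall back to the configured HTTP header.
    return headers.get(header_name)


def find_operation(manifest, operation_id, client_name):
    """Look up an operation body in a hypothetical manifest.

    `manifest` maps (id, client_name or None) to an operation body.
    """
    if client_name is not None:
        body = manifest.get((operation_id, client_name))
        if body is not None:
            return body
    # No match for this client, or no client name: use the entry without one.
    return manifest.get((operation_id, None))
```

With a manifest holding `("abc", "web")` and `("abc", None)`, a request from client `web` gets the client-specific body, while a request from any other client, or from no named client, gets the entry without a client name.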

To learn more, go to persisted queries docs.

By @glasser in #6198

🐛 Fixes

Fix coprocessor empty body object panic (PR #6398)

Previously, the router would panic if a coprocessor responds with an empty body object at the supergraph stage:

{
  ... // other fields
  "body": {} // empty object
}

This has been fixed in this release.

Note: the previous issue didn't affect coprocessors that responded with well-formed responses.

By @BrynCooke in #6398

Ensure cost directives are picked up when not explicitly imported (PR #6328)

With the recent composition changes, importing @cost results in a supergraph schema with the cost specification import at the top. The @cost directive itself is not explicitly imported, as it's expected to be available as the default export from the cost link. In contrast, uses of @listSize do translate to an explicit import in the supergraph.

Old SDL link

@link(
    url: "https://specs.apollo.dev/cost/v0.1"
    import: ["@cost", "@listSize"]
)

New SDL link

@link(url: "https://specs.apollo.dev/cost/v0.1", import: ["@listSize"])

Instead of using the directive names from the import list in the link, the directive names now come from SpecDefinition::directive_name_in_schema, which is equivalent to the change we made on the composition side.

By @tninesling in #6328

Fix query hashing algorithm (PR #6205)

The router includes a schema-aware query hashing algorithm designed to return the same hash across schema updates if the query remains unaffected. This update enhances the algorithm by addressing various corner cases to improve its reliability and consistency.

By @Geal in #6205

Fix typo in persisted query metric attribute (PR #6332)

The apollo.router.operations.persisted_queries metric reports an attribute when a persisted query was not found.
Previously, the attribute name was persisted_quieries.not_found, with one i too many. Now it's persisted_queries.not_found.

By @goto-bus-stop in #6332

Fix telemetry instrumentation using supergraph query selector (PR #6324)

Previously, router telemetry instrumentation that used query selectors could log errors with messages such as "this is a bug and should not happen".

These errors have now been fixed, and configurations with query selectors such as the following work properly:

telemetry:
  exporters:
    metrics:
      common:
        views:
          # Define a custom view because operation limits are different than the default latency-oriented view of OpenTelemetry
          - name: oplimits.*
            aggregation:
              histogram:
                buckets:
                  - 0
                  - 5
                  - 10
                  - 25
                  - 50
                  - 100
                  - 500
                  - 1000
  instrumentation:
    instruments:
      supergraph:
        oplimits.aliases:
          value:
            query: aliases
          type: histogram
          unit: number
          description: "Aliases for an operation"
        oplimits.depth:
          value:
            query: depth
          type: histogram
          unit: number
          description: "Depth for an operation"
        oplimits.height:
          value:
            query: height
          type: histogram
          unit: number
          description: "Height for an operation"
        oplimits.root_fields:
          value:
            query: root_fields
          type: histogram
          unit: number
          description: "Root fields for an operation"

By @bnjjj in #6324

More consistent attributes on apollo.router.operations.persisted_queries metric (PR #6403)

Version 1.28.1 added several unstable metrics, including apollo.router.operations.persisted_queries.

When an operation is rejected, the router includes a persisted_queries.safelist.rejected.unknown attribute on the metric. Previously, this attribute had the value true if the operation was logged (via log_unknown), and false if the operation was not logged. (The attribute is not included at all if the operation is not rejected.) This appears to have been a mistake, as you can also tell whether it is logged via the persisted_queries.logged attribute.

The router now only sets this attribute to true, and never to false. Note that these metrics are unstable and will continue to change.

By @glasser in #6403

Drop experimental reuse fragment query optimization option (PR #6354)

Drop support for the experimental reuse fragment query optimization. This implementation was not only very slow but also very buggy due to its complexity.

Auto generation of fragments is a much simpler (and faster) algorithm that in most cases produces better results. Fragment auto generation has been the default optimization since the v1.58 release.

By @dariuszkuc in #6353

📃 Configuration

Add version number to distributed query plan cache keys (PR #6406)

The router now includes its version number in the cache keys of distributed cache entries. Given that a new router release may change how query plans are generated or represented, including the router version in a cache key enables the router to use separate cache entries for different versions.

If you have enabled distributed query plan caching, expect additional processing for your cache to update for this router release.
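
To illustrate why an upgrade leads to cache regeneration, here is a minimal Python sketch of a version-scoped cache key. The key layout, the `plan_cache_key` function, and the example version strings and schema identifier are assumptions for illustration only. The router's actual key format is not described in these notes.

```python
import hashlib


def plan_cache_key(router_version, schema_id, query):
    """Build a hypothetical cache key that is scoped to one router version."""
    query_hash = hashlib.sha256(query.encode("utf-8")).hexdigest()
    # Because the version is part of the key, entries written by one release
    # are never read back by another release.
    return f"plan:{router_version}:{schema_id}:{query_hash}"


# Same schema and query, different router versions: the keys differ, so the
# newer router misses the older entry and plans the query again.
old_key = plan_cache_key("1.58.1", "schema-a", "{ me { id } }")
new_key = plan_cache_key("1.59.0", "schema-a", "{ me { id } }")
```

The trade-off is a one-time cost after each upgrade in exchange for never reusing a plan that a different release generated or serialized differently.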

By @SimonSapin in #6406

🛠 Maintenance

Remove catch_unwind wrapper around the native query planner (PR #6397)

As part of internal maintenance of the query planner, the catch_unwind wrapper around the native query planner has been removed. This wrapper served as an extra safeguard for potential panics the native planner could produce. The native query planner, however, no longer has any code paths that could panic. We have also not witnessed a panic in the last four months, having processed 560 million real user operations through the native planner.

This maintenance work also removes backtrace capture for federation errors, which was used for debugging and is no longer necessary as we have confidence in the native planner's implementation.

By @lrlna in #6397

Deprecate various metrics (PR #6350)

Several metrics have been deprecated in this release, in favor of OpenTelemetry-compatible alternatives:

  • apollo_router_deduplicated_subscriptions_total - use the apollo.router.operations.subscriptions metric's subscriptions.deduplicated attribute.
  • apollo_authentication_failure_count - use the apollo.router.operations.authentication.jwt metric's authentication.jwt.failed attribute.
  • apollo_authentication_success_count - use the apollo.router.operations.authentication.jwt metric instead. If the authentication.jwt.failed attribute is absent or false, the authentication succeeded.
  • apollo_require_authentication_failure_count - use the http.server.request.duration metric's http.response.status_code attribute. Requests with authentication failures have HTTP status code 401.
  • apollo_router_timeout - this metric conflates timed-out requests from client to the router, and requests from the router to subgraphs. Timed-out requests have HTTP status code 504. Use the http.response.status_code attribute on the http.server.request.duration metric to identify timed-out router requests, and the same attribute on the http.client.request.duration metric to identify timed-out subgraph requests.

The deprecated metrics will continue to work in the 1.x release line.
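
When auditing dashboards or alerts for the deprecated names, the list above can be captured as a lookup table. The following Python sketch is only a convenience for that kind of audit: the `REPLACEMENTS` table and `replacements_for` helper are hypothetical, and the metric and attribute names are copied from the list above.

```python
JWT_METRIC = "apollo.router.operations.authentication.jwt"
STATUS_ATTR = "http.response.status_code"

# Deprecated metric name -> list of (replacement metric, attribute to inspect).
REPLACEMENTS = {
    "apollo_router_deduplicated_subscriptions_total": [
        ("apollo.router.operations.subscriptions", "subscriptions.deduplicated"),
    ],
    "apollo_authentication_failure_count": [
        (JWT_METRIC, "authentication.jwt.failed"),
    ],
    # Success is signaled by the failed attribute being absent or false.
    "apollo_authentication_success_count": [
        (JWT_METRIC, "authentication.jwt.failed"),
    ],
    # Authentication failures have status code 401.
    "apollo_require_authentication_failure_count": [
        ("http.server.request.duration", STATUS_ATTR),
    ],
    # Timeouts have status code 504: server side for client-to-router
    # requests, client side for router-to-subgraph requests.
    "apollo_router_timeout": [
        ("http.server.request.duration", STATUS_ATTR),
        ("http.client.request.duration", STATUS_ATTR),
    ],
}


def replacements_for(metric_names):
    """Return the deprecated metrics found in `metric_names` with their replacements."""
    return {name: REPLACEMENTS[name] for name in metric_names if name in REPLACEMENTS}
```

For example, `replacements_for(["apollo_router_timeout", "http.server.request.duration"])` reports only `apollo_router_timeout`, along with the two metrics that replace it.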

By @goto-bus-stop in #6350

@svc-apollo-docs
Collaborator

svc-apollo-docs commented Dec 16, 2024

✅ Docs Preview Ready

No new or changed pages found.

@router-perf

router-perf bot commented Dec 16, 2024

CI performance tests

  • connectors-const - Connectors stress test that runs with a constant number of users
  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

@garypen garypen self-requested a review December 16, 2024 13:55
garypen
garypen previously approved these changes Dec 16, 2024
Member

@lrlna lrlna left a comment


blocking for a little bit while I fix the changelog - i don't think dropping an experimental_ feature is a breaking change.

lrlna
lrlna previously approved these changes Dec 17, 2024
garypen
garypen previously approved these changes Dec 17, 2024
@BrynCooke BrynCooke dismissed stale reviews from garypen and lrlna via d8ad335 December 17, 2024 09:38
@BrynCooke BrynCooke merged commit bd8ea14 into 1.59.0 Dec 17, 2024
12 checks passed
@BrynCooke BrynCooke deleted the prep-1.59.0 branch December 17, 2024 10:36