feat(cache): add support for surrogate cache key #6234

Merged
merged 15 commits into from
Nov 20, 2024

@bnjjj bnjjj commented Nov 6, 2024

Context

Existing caching systems often support a concept of surrogate keys, where a key can be linked to a specific piece of cached data, independently of the actual cache key.

As an example, a news website might want to invalidate all cached articles linked to a specific company or person following an event. To that end, when returning the article, the service can add a surrogate key to the article response, and the cache would keep a map from surrogate keys to cache keys.
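
The idea can be sketched as a small in-memory index. This is a minimal illustration, not the router's or any cache's actual implementation: the class name `SurrogateKeyIndex`, its methods, and the `article:…`, `company:…` and `person:…` keys are all hypothetical names chosen for the example.

```python
from collections import defaultdict


class SurrogateKeyIndex:
    """Hypothetical index mapping surrogate keys to the cache keys tagged with them."""

    def __init__(self) -> None:
        self._index: dict[str, set[str]] = defaultdict(set)

    def tag(self, cache_key: str, surrogate_keys: list[str]) -> None:
        """Record that cache_key was returned together with these surrogate keys."""
        for surrogate in surrogate_keys:
            self._index[surrogate].add(cache_key)

    def cache_keys_for(self, surrogate_key: str) -> set[str]:
        """Return a copy of the cache keys linked to surrogate_key (empty if unknown)."""
        # .get avoids creating an empty entry for an unknown surrogate key
        return set(self._index.get(surrogate_key, set()))


# Two cached articles mention the same company (made-up keys).
index = SurrogateKeyIndex()
index.tag("article:42", ["company:acme", "person:jane"])
index.tag("article:43", ["company:acme"])
```

Invalidating everything related to `company:acme` then amounts to looking up `index.cache_keys_for("company:acme")` and evicting each returned cache key.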

Surrogate keys and the router’s entity cache

To support a surrogate key system with entity caching in the router, we make the following assumptions:

  • The subgraph returns surrogate keys with the response. The router will not manipulate those surrogate keys directly. Instead, it leaves that task to a coprocessor.
  • The coprocessor tasked with managing surrogate keys will store the mapping from surrogate keys to cache keys. That mapping makes it possible to invalidate, in Redis, all cache keys related to a surrogate key.
  • The router will expose a way to gather the cache keys used in a subgraph request.

Router side support

The router has two features to support surrogate cache keys:

  • An id field for subgraph requests and responses. This is a random, unique id per subgraph call. It can be used to keep state between the request and response sides, and to keep data from the various subgraph calls separate across the entire client request. It must be enabled in the configuration (subgraph_request_id):
coprocessor:
  url: http://127.0.0.1:3000 # mandatory URL which is the address of the coprocessor
  supergraph:
    response:
      context: true
  subgraph:
    all:
      response:
        subgraph_request_id: true
        context: true
  • The entity cache can store a map of subgraph request id => cache keys in the request context, at the key apollo::entity_cache::cached_keys_status. It does so only when expose_keys_in_context is enabled in the configuration:
preview_entity_cache:
  enabled: true
  expose_keys_in_context: true
  metrics:
    enabled: true
  invalidation:
    listen: 0.0.0.0:4000
    path: /invalidation
  # Configure entity caching per subgraph
  subgraph:
    all:
      enabled: true
      # Configure Redis
      redis:
        urls: ["redis://localhost:6379"]
        ttl: 24h # Optional, by default no expiration

The coprocessor will then work at two stages:

  • Subgraph response:
    • Extract the subgraph request id
    • Extract the list of surrogate keys from the response
  • Supergraph stage:
    • Extract the map subgraph request id => cache keys
    • Match it with the surrogate cache keys obtained at the subgraph response stage

The coprocessor then has a map of surrogate keys => cache keys that it can use to invalidate cached data directly from Redis.
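
The two stages can be sketched as plain functions over the coprocessor's JSON payloads. This is a sketch under stated assumptions rather than a reference implementation: it assumes the payload carries `headers` (name to list of values), `context.entries`, and a `subgraphRequestId` field, and it stores the surrogate keys under `example::surrogate_keys`, a context key invented for this example. Check the payload field names against the coprocessor documentation before relying on them.

```python
CACHED_KEYS_CONTEXT_KEY = "apollo::entity_cache::cached_keys_status"
# Hypothetical context key chosen for this sketch; it is not defined by the router.
SURROGATE_KEYS_CONTEXT_KEY = "example::surrogate_keys"


def parse_surrogate_keys(header_values):
    """Split header values such as 'homepage, feed' into a list of keys."""
    keys = []
    for value in header_values:
        keys.extend(part.strip() for part in value.split(",") if part.strip())
    return keys


def on_subgraph_response(payload):
    """Subgraph response stage: store surrogate keys under the subgraph request id."""
    # Header names are compared case-insensitively.
    headers = {name.lower(): values for name, values in payload.get("headers", {}).items()}
    surrogate_keys = parse_surrogate_keys(headers.get("surrogate-keys", []))
    entries = payload.setdefault("context", {}).setdefault("entries", {})
    per_request = entries.setdefault(SURROGATE_KEYS_CONTEXT_KEY, {})
    # "subgraphRequestId" is the assumed name of the id field in the payload.
    per_request[payload["subgraphRequestId"]] = surrogate_keys
    return payload


def on_supergraph_response(payload):
    """Supergraph stage: join surrogate keys and cache keys on the subgraph request id."""
    entries = payload.get("context", {}).get("entries", {})
    surrogate_by_request = entries.get(SURROGATE_KEYS_CONTEXT_KEY, {})
    cache_keys_by_request = entries.get(CACHED_KEYS_CONTEXT_KEY, {})
    mapping = {}
    for request_id, surrogate_keys in surrogate_by_request.items():
        for surrogate in surrogate_keys:
            mapping.setdefault(surrogate, []).extend(cache_keys_by_request.get(request_id, []))
    return mapping
```

A real coprocessor would wrap these functions in an HTTP handler that dispatches on the stage of the incoming payload and persists the resulting mapping somewhere durable. Both are left out here to keep the join logic visible.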

Example workflow

  • The router receives a client request
  • The router starts a subgraph request:
    • The entity cache plugin checks if the request has a corresponding cached entry:
      • If the entire response can be obtained from cache, we return a response here
      • If it cannot be obtained, or only partially (_entities query), a request is transmitted to the subgraph
    • The subgraph responds to the request. The response can contain a list of surrogate keys in a header: Surrogate-Keys: homepage, feed
    • The subgraph response stage coprocessor extracts the surrogate keys from the headers and stores them in the request context, associated with the subgraph request id 0ee3bf47-5e8d-47e3-8e7e-b05ae877d9c7:
{
  "0ee3bf47-5e8d-47e3-8e7e-b05ae877d9c7": ["homepage", "feed"]
}
}
  • The entity cache processes the subgraph response:
    • It generates a new subgraph response by interspersing data it got from cache with data from the original response
    • It stores the list of keys in the context. new indicates newly cached data coming from the subgraph, linked to the surrogate keys, while cached is data obtained from the cache. These are the keys directly used in Redis:
{
  "apollo::entity_cache::cached_keys_status": {
    "0ee3bf47-5e8d-47e3-8e7e-b05ae877d9c7": [
      {
        "key": "version:1.0:subgraph:products:type:Query:hash:af9febfacdc8244afc233a857e3c4b85a749355707763dc523a6d9e8964e9c8d:data:d9d84a3c7ffc27b0190a671212f3740e5b8478e84e23825830e97822e25cf05c",
        "status": "new",
        "cache_control": "max-age=60,public"
      }
    ]
  }
}
  • The supergraph response stage loads data from the context and creates the mapping:
{
  "homepage": [
    {
      "key": "version:1.0:subgraph:products:type:Query:hash:af9febfacdc8244afc233a857e3c4b85a749355707763dc523a6d9e8964e9c8d:data:d9d84a3c7ffc27b0190a671212f3740e5b8478e84e23825830e97822e25cf05c",
      "status": "new",
      "cache_control": "max-age=60,public"
    }
  ],
  "feed": [
    {
      "key": "version:1.0:subgraph:products:type:Query:hash:af9febfacdc8244afc233a857e3c4b85a749355707763dc523a6d9e8964e9c8d:data:d9d84a3c7ffc27b0190a671212f3740e5b8478e84e23825830e97822e25cf05c",
      "status": "new",
      "cache_control": "max-age=60,public"
    }
  ]
}
  • When a surrogate key must be used to invalidate data, that mapping is used to obtain the related cache keys
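
The final invalidation step could look like the sketch below. It assumes the mapping has the shape shown above (surrogate key to a list of entries with a `key` field) and a client object exposing `delete(*keys)` that returns the number of removed keys, as redis-py's `Redis.delete` does. `invalidate_surrogate_key` and `InMemoryClient` are hypothetical names; the latter is only a stand-in so the example runs without a Redis server.

```python
def invalidate_surrogate_key(client, mapping, surrogate_key):
    """Delete every cache key linked to surrogate_key and return how many were removed."""
    # Deduplicate, since one cache key may appear under several subgraph requests.
    cache_keys = sorted({entry["key"] for entry in mapping.get(surrogate_key, [])})
    if not cache_keys:
        return 0
    return client.delete(*cache_keys)


class InMemoryClient:
    """Stand-in for a Redis client, used only to demonstrate the function."""

    def __init__(self, data):
        self.data = dict(data)

    def delete(self, *keys):
        removed = 0
        for key in keys:
            if key in self.data:
                del self.data[key]
                removed += 1
        return removed


# Made-up cache keys and mapping, in the same shape as the example above.
client = InMemoryClient({"cache-key-1": "{}", "cache-key-2": "{}"})
mapping = {
    "homepage": [{"key": "cache-key-1", "status": "new"}],
    "feed": [{"key": "cache-key-1", "status": "new"}, {"key": "cache-key-2", "status": "cached"}],
}
removed = invalidate_surrogate_key(client, mapping, "feed")
```

Whether stale entries should also be pruned from the mapping after deletion is a design choice this sketch leaves open.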

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible [1]
  • Documentation [2] completed
  • Performance impact assessed and acceptable
  • Tests added and passing [3]
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

svc-apollo-docs commented Nov 6, 2024

✅ Docs Preview Ready

No new or changed pages found.

github-actions bot commented Nov 6, 2024

@bnjjj, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

router-perf bot commented Nov 6, 2024

CI performance tests

  • connectors-const - Connectors stress test that runs with a constant number of users
  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

Geal commented Nov 6, 2024

At line 714, let (new_entities, new_errors) = assemble_response_from_errors(: when we get an error from the subgraph response, we still return a partial response with some data from the cache, so we need to store the cache keys for those entities too.

bnjjj added 3 commits November 6, 2024 11:54
Signed-off-by: Benjamin <[email protected]>
Signed-off-by: Benjamin <[email protected]>
@bnjjj bnjjj requested review from Geal, garypen and BrynCooke November 6, 2024 16:01
@bnjjj bnjjj marked this pull request as ready for review November 6, 2024 16:02
@bnjjj bnjjj requested review from a team as code owners November 6, 2024 16:02
@bnjjj bnjjj requested a review from a team as a code owner November 7, 2024 15:31
apollo-router/src/plugins/cache/entity.rs Outdated Show resolved Hide resolved
apollo-router/src/plugins/cache/entity.rs Outdated Show resolved Hide resolved
@bnjjj bnjjj merged commit 83e6291 into dev Nov 20, 2024
14 checks passed
@bnjjj bnjjj deleted the bnjjj/feat_659 branch November 20, 2024 09:59
5 participants