Subgraph Entity Caching for the GraphOS Router
Configure Redis-backed caching for entities
Learn how the GraphOS Router can cache subgraph query responses using Redis to improve your query latency for entities in the supergraph.
Overview
An entity gets its fields from one or more subgraphs. To respond to a client request for an entity, the GraphOS Router must make multiple subgraph requests. Different clients requesting the same entity can make redundant, identical subgraph requests.
Entity caching enables the router to respond to identical subgraph queries with cached subgraph responses. The router uses Redis to cache data from subgraph query responses. Because cached data is keyed per subgraph and entity, different clients making the same client query—with the same or different query arguments—hit the same cache entries of subgraph response data.
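To make the keying idea concrete, here is a minimal in-memory sketch. The key layout (subgraph name, entity type, sorted key fields) and the `EntityCache` class are assumptions made for illustration; they are not the router's actual Redis key format.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical key layout: (subgraph name, entity type, entity key fields).
CacheKey = Tuple[str, str, Tuple[Tuple[str, Any], ...]]


def entity_cache_key(subgraph: str, typename: str, key_fields: Dict[str, Any]) -> CacheKey:
    """Build a key that depends only on the subgraph and the entity, not on the client query."""
    return (subgraph, typename, tuple(sorted(key_fields.items())))


class EntityCache:
    """In-memory stand-in for the Redis-backed cache."""

    def __init__(self) -> None:
        self._entries: Dict[CacheKey, dict] = {}
        self.subgraph_requests = 0

    def fetch(self, subgraph: str, typename: str, key_fields: Dict[str, Any],
              load: Callable[[], dict]) -> dict:
        key = entity_cache_key(subgraph, typename, key_fields)
        if key not in self._entries:
            # Cache miss: only now is the subgraph actually called.
            self.subgraph_requests += 1
            self._entries[key] = load()
        return self._entries[key]


def load_product() -> dict:
    # Placeholder for a real subgraph request.
    return {"name": "Table", "price": 100}


cache = EntityCache()
first = cache.fetch("products", "Product", {"id": "1"}, load_product)
second = cache.fetch("products", "Product", {"id": "1"}, load_product)
```

Two lookups for the same product, even if they come from different client queries, result in a single subgraph request in this sketch.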
Benefits of entity caching
Compared to caching entire client responses, entity caching supports finer control over:
the time to live (TTL) of cached data
the amount of data being cached
When caching an entire client response, the router must store it with a shorter TTL because application data can change often. Real-time data needs more frequent updates.
A client-response cache might not be shareable between users, because the application data might contain personal and private information. A client-response cache might also duplicate a lot of data between client responses.
For example, consider the Products and Inventory subgraphs from the Entities guide:
Products subgraph:
type Product @key(fields: "id") {
  id: ID!
  name: String!
  price: Int
}
Inventory subgraph:
type Product @key(fields: "id") {
  id: ID!
  inStock: Boolean!
}
Assume the client for a shopping cart application requests the following for each product in the cart:
The product's name and price from the Products subgraph.
The product's availability in inventory from the Inventory subgraph.
If caching the entire client response, it would require a short TTL because the cart data can change often and the real-time inventory has to be up to date. A client-response cache couldn't be shared between users, because each cart is personal. A client-response cache might also duplicate data because the same products might appear in multiple carts.
With entity caching enabled for this example, the router can:
Store each product's name and price separately with a long TTL.
Minimize the number of subgraph requests made for each client request, with some client requests fetching all product data from the cache and requiring no subgraph requests.
Share the product cache between all users.
Cache the cart per user, with a small amount of data.
Cache inventory data with a short TTL or not cache it at all.
For example, the diagram below shows how a price entity can be cached and then combined with purchase and inventory fragments to serve a products query. Because price data is subject to change less often than inventory data, it makes sense to cache it with a different TTL.
Use entity caching
Follow this guide to enable and configure entity caching in the GraphOS Router.
Prerequisites
To use entity caching in the GraphOS Router, you must set up:
A Redis instance or cluster that your router instances can communicate with
A GraphOS Enterprise plan that connects your router to GraphOS.
Configure router for entity caching
In router.yaml, configure preview_entity_cache:
Enable entity caching globally.
Configure Redis using the same conventions described in distributed caching.
Configure entity caching per subgraph, with overrides per subgraph for disabling entity caching and TTL.
For example:
# Enable entity caching globally
preview_entity_cache:
  enabled: true
  expose_keys_in_context: true # Optional, it will expose cache keys in the context in order to use it in coprocessors or Rhai
  subgraph:
    all:
      enabled: true
      # Configure Redis
      redis:
        urls: ["redis://..."]
        timeout: 2s # Optional, by default: 500ms
        ttl: 24h # Optional, by default no expiration
    # Configure entity caching per subgraph, overrides options from the "all" section
    subgraphs:
      products:
        ttl: 120s # overrides the global TTL
      inventory:
        enabled: false # disable for a specific subgraph
      accounts:
        private_id: "user_id"
In earlier router versions, Redis and per-subgraph caching options are set directly on preview_entity_cache, for example preview_entity_cache.redis. This configuration may change while the feature is in preview.
Configure time to live (TTL)
Besides configuring a global TTL for all the entries in Redis, the GraphOS Router also honors the Cache-Control header returned with the subgraph response. It generates a Cache-Control header for the client response by aggregating the TTL information from all response parts.
Every subgraph that uses entity caching must have a TTL configured, either in its per-subgraph configuration or inherited from the global configuration, so that a TTL is available when the subgraph returns a Cache-Control header without a max-age.
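The TTL rules above can be sketched as follows. This is an illustration under assumptions, not the router's implementation: the function names are hypothetical, and taking the minimum across response parts is one plausible reading of "aggregating the TTL information".

```python
import re
from typing import Iterable, Optional


def max_age(cache_control: str) -> Optional[int]:
    """Return the max-age value in seconds from a Cache-Control header, if present."""
    match = re.search(r"max-age=(\d+)", cache_control)
    return int(match.group(1)) if match else None


def effective_ttl(cache_control: str, configured_ttl: int) -> int:
    """Use the subgraph's max-age when it is given, else fall back to the configured TTL."""
    age = max_age(cache_control)
    return age if age is not None else configured_ttl


def client_max_age(part_ttls: Iterable[int]) -> int:
    """Assumption: the combined response is only as fresh as its shortest-lived part."""
    return min(part_ttls)


products_ttl = effective_ttl("public, max-age=60", configured_ttl=120)
accounts_ttl = effective_ttl("public", configured_ttl=120)
combined = client_max_age([products_ttl, accounts_ttl])
```

Here the products response carries its own max-age, while the accounts response relies on the configured fallback, which is why a TTL must always be available from configuration.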
Customize Redis cache key
If you need to store data for a particular request in different cache entries, you can configure the cache key through the apollo_entity_cache::key context entry.
This entry contains an object with the all field to affect all subgraph requests under one client request, and fields named after subgraph operation names to affect individual subgraph queries. The field's value can be any valid JSON value (object, string, etc).
{
  "all": 1,
  "subgraph_operation1": "key1",
  "subgraph_operation2": {
    "data": "key2"
  }
}
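As a rough illustration of how such an entry could separate cache entries, the sketch below selects the values that apply to one subgraph operation and hashes them into a key suffix. The helper names and the use of SHA-256 are assumptions for this example only; the router's real key derivation may differ.

```python
import hashlib
import json
from typing import Any, Dict, List


def extra_key_data(entry: Dict[str, Any], operation_name: str) -> List[Any]:
    """Collect the value under "all" plus the value for this subgraph operation, if any."""
    parts: List[Any] = []
    if "all" in entry:
        parts.append(entry["all"])
    if operation_name in entry:
        parts.append(entry[operation_name])
    return parts


def key_suffix(entry: Dict[str, Any], operation_name: str) -> str:
    """Hypothetical suffix: a stable hash of the selected values."""
    payload = json.dumps(extra_key_data(entry, operation_name), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


entry = {
    "all": 1,
    "subgraph_operation1": "key1",
    "subgraph_operation2": {"data": "key2"},
}
suffix_op1 = key_suffix(entry, "subgraph_operation1")
suffix_op2 = key_suffix(entry, "subgraph_operation2")
```

Changing the value under "all" changes every suffix, while changing the value for one operation name only affects that operation.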
Entity cache invalidation
You can invalidate entity cache entries with a specifically formatted request (see Invalidation request format) once you configure your router appropriately. For example, if price data changes before a price entity's TTL expires, you can send an invalidation request.
When existing cache entries need to be replaced, the router supports a couple of ways for you to invalidate entity cache entries:
Invalidation endpoint - the router exposes an invalidation endpoint that can receive invalidation requests from any authorized service. This is primarily intended as an alternative to the extensions mechanism described below. For example, a subgraph could use it to trigger invalidation events "out of band" from any requests received by the router, or a platform operator could use it to invalidate cache entries in response to events that aren't directly related to a router.
Subgraph response extensions - you can send invalidation requests via subgraph response extensions, allowing a subgraph to invalidate cached data right after a mutation.
One invalidation request can invalidate multiple cached entries at once. It can invalidate:
All cached entries for a specific subgraph
All cached entries for a specific type in a specific subgraph
All cached entries for a specific entity in a specific subgraph
To process an invalidation request, the router first sends a SCAN command to Redis to find all the keys that match the invalidation request. After iterating over the scan cursor, the router sends a DEL command to Redis to remove the matching keys.
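The SCAN-then-DEL sequence can be sketched as below. To keep the example self-contained it uses `FakeRedis`, a hypothetical in-memory stand-in that mimics the cursor protocol (a returned cursor of 0 means the iteration is complete); the key names and pattern are also made up for illustration and are not the router's real key format.

```python
import fnmatch
from typing import Dict, List, Tuple


class FakeRedis:
    """Tiny in-memory stand-in that mimics the cursor protocol of SCAN and the DEL command."""

    def __init__(self, data: Dict[str, str]) -> None:
        self.data = dict(data)

    def scan(self, cursor: int, match: str, count: int) -> Tuple[int, List[str]]:
        keys = sorted(self.data)
        window = keys[cursor:cursor + count]
        # A cursor of 0 signals that the whole key space has been visited.
        next_cursor = cursor + count if cursor + count < len(keys) else 0
        return next_cursor, [k for k in window if fnmatch.fnmatchcase(k, match)]

    def delete(self, *keys: str) -> int:
        return sum(1 for k in keys if self.data.pop(k, None) is not None)


def invalidate(client: FakeRedis, pattern: str, scan_count: int) -> int:
    """Collect every matching key by following the scan cursor, then delete them."""
    matching: List[str] = []
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor, pattern, scan_count)
        matching.extend(keys)
        if cursor == 0:
            break
    return client.delete(*matching) if matching else 0


store = FakeRedis({
    "subgraph:accounts:type:User:1": "a",
    "subgraph:accounts:type:User:2": "b",
    "subgraph:products:type:Product:1": "c",
})
removed = invalidate(store, "subgraph:accounts:*", scan_count=2)
```

In this sketch, invalidating the accounts pattern takes two scan calls and removes two keys, leaving the products entry untouched.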
Configuration
You can configure entity cache invalidation globally with preview_entity_cache.invalidation. You can also override the global setting for a subgraph with preview_entity_cache.subgraph.subgraphs.invalidation. The example below shows both:
preview_entity_cache:
  enabled: true

  # global invalidation configuration
  invalidation:
    # address of the invalidation endpoint
    # this should only be exposed to internal networks
    listen: "127.0.0.1:3000"
    path: "/invalidation"
    scan_count: 1000

  subgraph:
    all:
      enabled: true
      redis:
        urls: ["redis://..."]
      invalidation:
        # base64 string that will be provided in the `Authorization: Basic` header value
        shared_key: "agm3ipv7egb78dmxzv0gr5q0t5l6qs37"
    subgraphs:
      products:
        # per subgraph invalidation configuration overrides global configuration
        invalidation:
          # whether invalidation is enabled for this subgraph
          enabled: true
          # override the shared key for this particular subgraph. If another key is provided, the invalidation requests for this subgraph's entities will not be executed
          shared_key: "czn5qvjylm231m90hu00hgsuayhyhgjv"
listen
The address and port to listen on for invalidation requests.
path
The path to listen on for invalidation requests.
shared_key
A string that will be used to authenticate invalidation requests.
scan_count
The number of keys to scan in a single SCAN command. This can be used to reduce the number of requests to Redis.
Invalidation request format
Invalidation requests are defined as JSON objects with the following format:
Subgraph invalidation request:
{
  "kind": "subgraph",
  "subgraph": "accounts"
}
Subgraph type invalidation request:
{
  "kind": "type",
  "subgraph": "accounts",
  "type": "User"
}
Subgraph entity invalidation request:
{
  "kind": "entity",
  "subgraph": "accounts",
  "type": "User",
  "key": {
    "id": "1"
  }
}
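A small helper can build the three request shapes consistently. This is a hypothetical convenience function, not part of any router API; it assumes the kind values subgraph, type, and entity, as used in the endpoint and extensions examples elsewhere in this guide.

```python
from typing import Any, Dict, Optional


def invalidation_request(subgraph: str,
                         typename: Optional[str] = None,
                         key: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a subgraph, type, or entity invalidation request depending on the arguments."""
    if key is not None and typename is None:
        raise ValueError("an entity invalidation needs a type as well as a key")
    if typename is None:
        return {"kind": "subgraph", "subgraph": subgraph}
    if key is None:
        return {"kind": "type", "subgraph": subgraph, "type": typename}
    return {"kind": "entity", "subgraph": subgraph, "type": typename, "key": key}


whole_subgraph = invalidation_request("accounts")
one_type = invalidation_request("accounts", "User")
one_entity = invalidation_request("accounts", "User", {"id": "1"})
```

The narrower the request, the fewer cache entries it removes, so prefer the entity form when you know exactly which entity changed.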
The key field must match the fields of the entity's @key directive. If a subgraph has multiple keys defined and the entity is being invalidated, it is likely you'll need to send a request for each key definition.
Invalidation HTTP endpoint
The invalidation endpoint exposed by the router expects to receive an array of invalidation requests and will process them in sequence. For authorization, you must provide a shared key in the request header. For example, with the previous configuration you should send the following request:
POST http://127.0.0.1:3000/invalidation
Authorization: agm3ipv7egb78dmxzv0gr5q0t5l6qs37
Content-Length: 96
Content-Type: application/json
Accept: application/json

[{
  "kind": "type",
  "subgraph": "invalidation-subgraph-type-accounts",
  "type": "Query"
}]
The router would send the following response:
HTTP/1.1 200 OK
Content-Type: application/json

{
  "count": 300
}
The count field indicates the number of keys that were removed from Redis.
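From a service, the same call could be prepared with Python's standard library, as in the sketch below. The function name is hypothetical, the address and shared key are taken from the example configuration above, and the request is only built here, not sent.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_invalidation_call(url: str, shared_key: str,
                            requests: List[Dict[str, Any]]) -> urllib.request.Request:
    """Prepare a POST carrying a JSON array of invalidation requests."""
    body = json.dumps(requests).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Mirrors the example request above: the shared key is the header value.
            "Authorization": shared_key,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


call = build_invalidation_call(
    "http://127.0.0.1:3000/invalidation",
    "agm3ipv7egb78dmxzv0gr5q0t5l6qs37",
    [{"kind": "type", "subgraph": "accounts", "type": "User"}],
)
# To send it: urllib.request.urlopen(call), then read "count" from the JSON response.
```

Because the endpoint accepts an array, several invalidation requests can be batched into one HTTP call.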
Invalidation through subgraph response extensions
A subgraph can return an invalidation array with invalidation requests in its response's extensions field. This can be used to invalidate entries in response to a mutation.
{
  "data": { "invalidateProductReview": 1 },
  "extensions": {
    "invalidation": [{
      "kind": "entity",
      "subgraph": "invalidation-entity-key-reviews",
      "type": "Product",
      "key": {
        "upc": "1"
      }
    }]
  }
}
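On the subgraph side, a mutation resolver could attach the invalidation array when it assembles its response. The helper below is a framework-agnostic sketch with a hypothetical name; how you set response extensions depends on your GraphQL server library.

```python
from typing import Any, Dict, List


def with_invalidation(data: Dict[str, Any],
                      requests: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap mutation data in a GraphQL response that carries invalidation requests."""
    response: Dict[str, Any] = {"data": data}
    if requests:
        # Only add the extensions field when there is something to invalidate.
        response["extensions"] = {"invalidation": list(requests)}
    return response


response = with_invalidation(
    {"invalidateProductReview": 1},
    [{"kind": "entity", "subgraph": "reviews", "type": "Product", "key": {"upc": "1"}}],
)
```

This keeps the invalidation in the same round trip as the mutation, so no separate call to the invalidation endpoint is needed.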
Observability
Invalidation requests are instrumented with the following metrics:
apollo.router.operations.entity.invalidation.event - counter triggered when a batch of invalidation requests is received. It has a label origin that can be either endpoint or extensions.
apollo.router.operations.entity.invalidation.entry - counter measuring how many entries are removed per DEL call. It has a label origin that can be either endpoint or extensions, and a label subgraph.name with the name of the receiving subgraph.
apollo.router.cache.invalidation.keys - histogram measuring the number of keys that were removed from Redis per invalidation request.
apollo.router.cache.invalidation.duration - histogram measuring the time spent handling one invalidation request.
Invalidation requests are also reported under the following spans:
cache.invalidation.batch - span covering the processing of a list of invalidation requests. It has a label origin that can be either endpoint or extensions.
cache.invalidation.request - span covering the processing of a single invalidation request.
Failure cases
Entity caching greatly reduces traffic to subgraphs. If the Redis cache becomes unavailable, traffic to subgraphs could rise to a level that overwhelms your infrastructure. To guard against this, configure the router with rate limiting for subgraph requests. You can also pair it with subgraph query deduplication to further reduce traffic.
Scalability and performance
The scalability and performance of entity cache invalidation is based on its implementation with the Redis SCAN command. The SCAN command provides a cursor for iterating over the entire key space and returns a list of keys matching a pattern. When executing an invalidation request, the router first runs a series of SCAN calls and then it runs DEL calls for any matching keys.
The time complexity of a single invalidation request grows linearly with the number of entries, because SCAN must iterate over every entry. The router can also execute multiple invalidation requests simultaneously. This lowers latency but might increase the load on Redis instances.
To help tune invalidation performance and scalability, you should benchmark the ratio of the invalidation rate against the number of entries that will be recorded. If it's too low, you can tune it with the following:
Increase the number of pooled Redis connections.
Increase the SCAN count option. This shouldn't be too large, with 1000 as a generally reasonable value, because larger values will reduce the operation throughput of the Redis instance.
Use separate Redis instances for some subgraphs.
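When benchmarking, a back-of-the-envelope estimate of the round trips per invalidation request can help. The sketch below assumes each SCAN call examines roughly scan_count keys; Redis treats the count as a hint rather than a guarantee, so treat the result as an approximation, and note that the function name is hypothetical.

```python
import math


def estimated_scan_calls(total_keys: int, scan_count: int) -> int:
    """Approximate number of SCAN round trips needed to walk the whole key space."""
    if scan_count <= 0:
        raise ValueError("scan_count must be positive")
    # At least one call is always needed, even for an empty key space.
    return max(1, math.ceil(total_keys / scan_count))


calls_with_default = estimated_scan_calls(1_000_000, 1000)
calls_with_small_count = estimated_scan_calls(1_000_000, 10)
```

With one million cached entries, a count of 1000 needs on the order of a thousand round trips per invalidation request, while a count of 10 needs on the order of a hundred thousand.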
Private information caching
A subgraph can return a response with the header Cache-Control: private, indicating that it contains user-personalized data. Although this usually forbids intermediate servers from storing data, the router may be able to recognize different users and store their data in different parts of the cache.
To set up private information caching, you can configure the private_id option. private_id is a string pointing at a field in the request context that contains data used to recognize users (for example, user id, or sub claim in JWT).
As an example, if you are using the router's JWT authentication plugin, you can first configure the private_id option in the accounts subgraph to point to the user_id key in context, then use a Rhai script to set that key from the JWT's sub claim:
preview_entity_cache:
  enabled: true
  subgraph:
    all:
      enabled: true
      redis:
        urls: ["redis://..."]
    subgraphs:
      accounts:
        private_id: "user_id"
authentication:
  router:
    jwt:
      jwks:
        - url: https://auth-server/jwks.json

fn supergraph_service(service) {
  let request_callback = |request| {
    let claims = request.context[Router.APOLLO_AUTHENTICATION_JWT_CLAIMS];

    if claims != () {
      let private_id = claims["sub"];
      request.context["user_id"] = private_id;
    }
  };

  service.map_request(request_callback);
}
The router implements the following sequence to determine whether a particular query returns private data:
Upon seeing a query for the first time, the router queries the cache as if it were a public-only query.
When the subgraph returns the response with private data, the router recognizes it and stores the data in a user-specific part of the cache.
The router stores the query in a list of known queries with private data.
When the router subsequently sees a known query:
If the private id isn't provided, the router doesn't query the cache and instead returns the subgraph response directly.
If the private id is provided, the router queries the part of the cache for the current user and falls back to the subgraph if nothing is available.
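The sequence above can be simulated with a small in-memory model. The class and method names are hypothetical, and the model simplifies the router's behavior (for instance, it keys entries on the whole query rather than on individual entities).

```python
from typing import Callable, Dict, Optional, Set, Tuple


class PrivateAwareCache:
    """Toy model of the private-data decision sequence."""

    def __init__(self) -> None:
        # Entries are keyed by (query, private id); None marks the public part.
        self.entries: Dict[Tuple[str, Optional[str]], dict] = {}
        self.private_queries: Set[str] = set()

    def handle(self, query: str, private_id: Optional[str],
               call_subgraph: Callable[[], Tuple[dict, bool]]) -> dict:
        """call_subgraph() returns (data, is_private)."""
        if query in self.private_queries:
            if private_id is None:
                # Known private query without a user: bypass the cache entirely.
                data, _ = call_subgraph()
                return data
            scope: Optional[str] = private_id
        else:
            # First sighting: look it up as if it were public-only.
            scope = None
        if (query, scope) in self.entries:
            return self.entries[(query, scope)]
        data, is_private = call_subgraph()
        if is_private:
            self.private_queries.add(query)
            if private_id is None:
                return data  # nothing to key the private data on
            scope = private_id
        self.entries[(query, scope)] = data
        return data


calls = []


def subgraph_for(user: str) -> Callable[[], Tuple[dict, bool]]:
    def call() -> Tuple[dict, bool]:
        calls.append(user)
        return {"user": user}, True  # the subgraph marks the response as private
    return call


cache = PrivateAwareCache()
cache.handle("me", "alice", subgraph_for("alice"))           # miss, stored for alice
cache.handle("me", "alice", subgraph_for("alice"))           # hit in alice's part
bob_view = cache.handle("me", "bob", subgraph_for("bob"))    # miss, stored for bob
cache.handle("me", None, subgraph_for("anonymous"))          # cache bypassed
```

In this run only three subgraph calls are made for four requests, and one user's data is never served to another user.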
Observability
The router supports a cache selector in telemetry for the subgraph service. The selector returns the number of cache hits or misses by an entity for a subgraph request.
Spans
You can add a new attribute on the subgraph span for the number of cache hits. For example:
telemetry:
  instrumentation:
    spans:
      subgraph:
        attributes:
          cache.hit:
            cache: hit
Metrics
The router provides the telemetry.instrumentation.instruments.cache instrument to enable cache metrics:
telemetry:
  instrumentation:
    instruments:
      cache: # Cache instruments configuration
        apollo.router.operations.entity.cache: # A counter which counts the number of cache hit and miss for subgraph requests
          attributes:
            graphql.type.name: true # Include the entity type name. default: false
            subgraph.name: # Custom attributes to include the subgraph name in the metric
              subgraph_name: true
            supergraph.operation.name: # Add custom attribute to display the supergraph operation name
              supergraph_operation_name: string
            # You can add more custom attributes using subgraph selectors
You can use custom instruments to create metrics for the subgraph service. The following example creates a custom instrument to generate a histogram that measures the subgraph request duration when there's at least one cache hit for the "inventory" subgraph:
telemetry:
  instrumentation:
    instruments:
      subgraph:
        only_cache_hit_on_subgraph_inventory:
          type: histogram
          value: duration
          unit: hit
          description: histogram of subgraph request duration when we have cache hit on subgraph inventory
          condition:
            all:
              - eq:
                  - subgraph_name: true # subgraph selector
                  - inventory
              - gt: # If the number of cache hit is greater than 0
                  - cache: hit
                    # entity_type: Product # Here you could also only check for the entity type Product, it's `all` by default if we don't specify this config.
                  - 0
Implementation notes
Cache-Control header requirement
The router currently cannot know which types or fields should be cached, so it requires the subgraph to set a Cache-Control header in its response to indicate that it should be stored.
Responses with errors not cached
To prevent transient errors from affecting the cache for a long duration, subgraph responses with errors are not cached.
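Taken together, the two rules above amount to a simple check on each subgraph response. The function below is a hypothetical sketch of that check, covering only the conditions stated here; the router may apply further conditions that this example does not model.

```python
from typing import Any, Dict


def is_cacheable(headers: Dict[str, str], body: Dict[str, Any]) -> bool:
    """A response is stored only if it has a Cache-Control header and no errors."""
    has_cache_control = any(name.lower() == "cache-control" for name in headers)
    has_errors = bool(body.get("errors"))
    return has_cache_control and not has_errors


ok = is_cacheable({"Cache-Control": "public, max-age=60"}, {"data": {"id": "1"}})
no_header = is_cacheable({}, {"data": {"id": "1"}})
with_errors = is_cacheable(
    {"cache-control": "public, max-age=60"},
    {"data": None, "errors": [{"message": "upstream timeout"}]},
)
```

If your subgraph responses are never cached, checking for a missing Cache-Control header is a good first debugging step.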
Cached entities with unavailable subgraph
If some entities were obtained from the cache, but the subgraphs that provided them are unavailable, the router will return a response with the cached entities, and the other entities nullified (schema permitting), along with an error message for the nullified entities.
Authorization and entity caching
When used alongside the router's authorization directives, cache entries are separated by authorization context. If a query contains fields that need a specific scope, the requests providing that scope have different cache entries from those not providing the scope. This means that data requiring authorization can still be safely cached and even shared across users, without needing invalidation when a user's roles change because their requests are automatically directed to a different part of the cache.
Schema updates and entity caching
On schema updates, the router ensures that queries unaffected by the changes keep their cache entries. Queries with affected fields need to be cached again to ensure the router doesn't serve invalid data from before the update.