8. Query planning
10m

Overview

The router does a great job of handling all of our requests and delegating each piece to the responsible subgraph, but there's a lot more happening under the hood than meets the eye. Let's take a look at the overall router processing time, and see what we can do to speed things up.

In this lesson, we will:

  • Inspect traces to chart a query's path to execution
  • Review metrics for the router's query planner
  • Discuss caching options for query plans and requests to the planner

Router processing time

Returning now to the client-side metrics, something looks curious:

Router processing time

This panel shows the actual time the router spends processing a request.

Most of a client request's latency lies in subgraph requests. But there are some CPU-intensive tasks performed in the router, and the time they take is included in the metrics here. We can see that total processing time nearly approaches 1s in places, which is a lot of overhead for the router.

This can become especially problematic as time spent processing one request can easily affect other requests waiting for their turn to be processed.

Inspecting router tasks

In general, the router adds very little overhead to requests. Its main sources of busy work are:

  • Query planning
  • Response validation and formatting
  • Query parsing and validation
  • Telemetry
  • JSON serialization and deserialization

We can use traces to figure out which task in particular might be affecting our processing time. Jump back over to Explore in Grafana, and make sure that you've selected Tempo from the dropdown. (Recall that we can run our query for all traces by entering opening and closing curly braces ({}) in the input!)
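
If your versions of Grafana and Tempo support TraceQL filters, you can also narrow the search rather than listing every trace. For example, entering { name = "query_planning" } returns only traces containing a query-planning span, while { duration > 500ms } surfaces the slowest requests. (Treat these as sketches; the filters available depend on your Tempo setup.)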

Let's take a look at a trace for one of our more robust queries. Look for a trace named something like query Reviews5.

Router trace

A few things stand out:

  • Of the 729ms required to execute the query, we're waiting about 43ms for one subgraph_request; then 52ms for another; and finally 502ms for a third. So most of the time involved here is spent waiting on subgraph requests.
  • Fifth from the top, the query_planning span represents the time spent generating a query plan. This takes about 126ms.

Even to plan out a query, 126ms seems high. This adds significant overhead to the request, and that latency is passed on to subsequent requests waiting for processing. So what happens if this latency stacks up, and many more queries spend a lot of time in planning? That additional latency would quickly compound across all requests.

Query planner metrics

Let's return to our dashboard. We can check out how the query planner is behaving in the panels at the bottom.

Query planning time

The Query planning duration panel shows the time spent planning. We can see that it gets as high as 500ms, though depending on your computer it might approach one to five seconds. We also have a panel charting the size of the planner's queue, often approaching 24 queries waiting for a plan to be generated.

And finally, hits and misses for the query plan cache, where we see between 150 and 220 requests per second resulting in a cache miss.

Query planner cache hit and miss

To really make sense of these metrics, we need to understand how the router handles query planning.

Query planning and the cache

Query planning is an expensive task, usually taking between 1ms and 1s. Because generating plans can greatly impact request processing time, the router caches its query plan results in memory.

Query plans in particular are a great candidate for caching. The process of generating a query plan is highly deterministic: it depends on the query, the operation name, the schema, the configuration, and often some plugin-specific information, such as authorization. So if we've already generated the query plan for a particular query, we don't need to repeat the process when the router receives the same query again; we can simply reuse the query plan.

The router also reduces the planner's workload by deduplicating requests to the planner itself. If the router receives multiple identical queries, it generates the query plan just once and passes the generated plan to the others. The router additionally constrains the planner's footprint by limiting it to one CPU core by default.

To explore additional configuration options, check out the official router documentation on query planner pooling.
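
At the time of writing, those docs describe an experimental setting for running multiple planners in parallel. As a sketch (the exact key name may change between router versions, so verify it against the pooling docs):

supergraph:
  query_planning:
    # Experimental: size the planner pool to the available CPUs.
    # Verify the current option name in the query planner pooling docs.
    experimental_parallelism: auto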

Caching in memory is a great option, but it doesn't scale to multiple instances. A query plan generated by one router isn't automatically accessible to other routers that might receive identical requests. To allow multiple router instances to share the plans they have calculated, we also have the option to cache query plans in a Redis database.
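
As a sketch, here's roughly what that looks like in router.yaml, following the router's distributed caching docs at the time of writing (note that Redis-backed caching is a GraphOS Enterprise feature; double-check the exact keys against the docs linked at the end of this lesson):

supergraph:
  query_planning:
    cache:
      redis:
        # All router instances pointing at the same Redis share query plans.
        urls: ["redis://localhost:6379"]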

Improving cache performance

Returning to our Grafana dashboard, we should now have a better understanding of what's happening in the query planner.

  • Our data indicates many cache misses. This means the planner is spending a lot of time calculating query plans.
  • We have many different queries coming in, so the planner's queue continues to grow. Each query has to wait for its turn to be planned, increasing latency.
  • Planning time is high, between 10ms and 100ms.

It's highly likely that the router is spending most of its CPU time calculating query plans. We've seen something like this before: when we introduced timeouts for client requests in an earlier lesson, we applied a timeout value of 100ms, but still saw a higher execution time than expected. This was caused by the CPUs being occupied. If a request came in with a timeout of 100ms but needed to wait 200ms to get some time with the CPU, then the timeout would trigger too late.
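
As a refresher, that client request timeout lives under the traffic_shaping key in router.yaml, along these lines:

traffic_shaping:
  router:
    # Total time the router allows for handling a client request.
    timeout: 100ms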

Let's return to router.yaml and tweak our settings. At the top of the file, we'll see the supergraph key, along with a section all about query_planning.

router.yaml
supergraph:
  introspection: true
  listen: 0.0.0.0:4000
  query_planning:
    cache:
      in_memory:
        limit: 10

Right now, our in_memory cache is set with a limit of 10 items. With up to 25 queries in our queue, we're not giving the in-memory cache enough space to reuse the plans the planner has already generated. Let's increase the capacity to 50 and see what happens.

query_planning:
  cache:
    in_memory:
      limit: 50
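
If you launched the router with the --hot-reload flag (or APOLLO_ROUTER_HOT_RELOAD=true), it should pick up this change automatically; otherwise, restart the router before checking the dashboard again.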

What a difference!

Query planner cache effect

The effect is nearly immediate: cache misses drop to zero, and so does the queue size. Processing time has also dropped somewhat, while p50, p75, and p90 client latencies all decrease.

We'll also notice rate limiting errors increasing. This is due to an increase in requests to the accounts subgraph, because now the router can spend more of its time executing subgraph requests.

Looking to go deeper with the cache? Check out the official documentation for more on both in-memory and Redis-based caching.

Practice

Why are query plans a good candidate for caching in a database like Redis?

Key takeaways

  • As part of its "processing time", the router takes care of tasks such as query planning, query parsing and validation, response validation and formatting, telemetry, and serialization.
  • The router emits various query planner metrics, such as how many queries are in the planner's queue, the duration of planning queries, and cache hits and misses.
  • We can cache query plans as well as deduplicate requests to the planner, further reducing the processing time in the router for a query it has already handled before.
  • The query_planning property in the router's configuration file allows us to specify the number of query plans the router should hold in its in-memory cache.

Up next

We've applied timeouts, rate limiting, and deduplication, and increased the capacity of our query plan cache. In the next lesson, we'll explore a new feature that allows us to serve up responses without bothering the subgraphs more than once!
