6. Subgraph traffic control

Overview

We've looked at configuration for limiting inbound traffic to the router; now let's shift gears and set some controls at the subgraph level.

In this lesson, we will:

  • Apply subgraph-level timeouts and rate limiting

Subgraph performance

When building our API using federation, different subgraphs can end up with varying performance profiles.

One or two subgraphs could be responsible for top-level entry points (defining commonly-used fields on the Query type, for instance), and as a result would need high reliability. Even so, downstream subgraphs could receive a lot more traffic due to the amplification effect (such as requesting the reviews for each product, and the account details of the user who wrote each review).

When a query requires data from multiple subgraphs, a failure in one service does not cause a failure in the entire operation. Instead, federation comes with the benefit of partial responses. This gives us the flexibility to limit traffic at the level of a particular subgraph: we can test out receiving partial responses, built from the more responsive subgraphs, while lowering latency across the board.
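As a rough sketch of what a partial response can look like (the field names come from this course's schema, but the values and the error message are illustrative, not the router's verbatim output), the router returns the data it could resolve alongside an errors entry for the piece that failed:

{
  "data": {
    "recommendedProducts": [
      {
        "name": "Couch",
        "price": 1299,
        "reviews": null
      }
    ]
  },
  "errors": [
    {
      "message": "request to the reviews subgraph timed out",
      "path": ["recommendedProducts", 0, "reviews"]
    }
  ]
}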

Let's return to our router.yaml configuration file. First of all, we'll reset our router-level settings with a timeout of 1s, and a capacity and interval of 500 and 1s respectively.

router.yaml
traffic_shaping:
  router:
    timeout: 1s
    global_rate_limit:
      capacity: 500
      interval: 1s

Going back to our Grafana dashboard, we'll see a noticeable effect.

First of all, a longer timeout allows more client requests to succeed with a 200 status code.

The HTTP request count panel showing an increase in 200 status code (successful) requests

This means that more requests will pass on successfully to the subgraphs, and we can see this reflected in our Subgraph requests panel, where the number of requests has increased.

Subgraph request count and latency

Consider the request count and latency for each of our four subgraphs.

What we should see is that both reviews and inventory have reasonable request latency: each should have a p99 below 150ms (that is, 99% of requests resolve in 150 milliseconds or less).

Panels showing low request latency for both reviews and inventory subgraph requests

The accounts subgraph, on the other hand, shows a p99 latency of one second at about 870 requests per second.

Panels showing number of requests to accounts and the latency

And products has a p99 latency of one second at 400 requests per second. (The other percentiles are quite high too! This means that the majority of requests are taking much longer to resolve, with the median (p50) close to 500 milliseconds.)

Panels showing latency of products subgraph requests

This gives us a good place to start; accounts and products are both good candidates for timeouts and rate limiting.

Subgraph level timeout

To tweak these settings for our subgraphs, we can once again return to our router.yaml configuration file. Under traffic_shaping, we'll add a new property, all, at the same indentation level as the router key.

router.yaml
traffic_shaping:
  router:
    timeout: 1s
    global_rate_limit:
      capacity: 500
      interval: 1s
  all:

The settings that we apply under all will affect all subgraphs. Let's go ahead and set a global timeout for all of our subgraphs by adding a new timeout property, with a value of 500ms.

traffic_shaping:
  router:
    timeout: 1s
    global_rate_limit:
      capacity: 500
      interval: 1s
  all:
    timeout: 500ms

As of now, all of our subgraphs will officially have a timeout of 500 milliseconds. But we can get more granular and tweak them one by one as needed, by adding a subgraphs property next. We'll add this under the all section, but at the same level of indentation.

traffic_shaping:
  router:
    timeout: 1s
    global_rate_limit:
      capacity: 500
      interval: 1s
  all:
    timeout: 500ms
  subgraphs:

Under subgraphs we can identify specific subgraphs by name and give them more precise timeouts as needed. Let's add an accounts key, and give this service a timeout of 100ms.

traffic_shaping:
  router:
    timeout: 1s
    global_rate_limit:
      capacity: 500
      interval: 1s
  all:
    timeout: 500ms
  subgraphs:
    accounts:
      timeout: 100ms

We now observe an effect on client requests. Overall latency is dropping, but it's still quite high.

Effect of subgraph timeout on client requests

But we get a curious result on subgraph requests:

Effect of subgraph timeout on subgraph requests

While accounts and products are rejecting more requests (with a 504 status code), latency is actually increasing for the reviews and inventory subgraphs.

Increasing latency for reviews and inventory

Take some time to investigate why this is happening. Test various values for timeout and see what effect they have. Check the latency of various queries. Are you noticing anything?

Most of the queries we use here request data from multiple subgraphs in a row, so changing the performance profile of one subgraph will affect others. Let's take the following query:

query GetRecommendedProducts {
  recommendedProducts {
    name
    price
    shippingEstimate
    upc
    inStock
    reviews {
      id
      body
      author {
        id
        name
        username
      }
    }
  }
}

Broken down by responsibility, we can see the query consists of fields spread across our API.

A diagram of the query broken down by subgraph responsibility per field
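As a rough text version of that breakdown (the field-to-subgraph mapping below is assumed from the typical federation demo split, so treat it as illustrative rather than authoritative):

query GetRecommendedProducts {
  recommendedProducts {   # products: top-level entry point
    name                  # products
    price                 # products
    shippingEstimate      # assumed: inventory
    upc                   # products
    inStock               # inventory
    reviews {             # reviews
      id
      body
      author {            # accounts: author details
        id
        name
        username
      }
    }
  }
}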

This query will go first through the products subgraph, then in parallel through inventory (for the inStock field) and reviews, then through accounts for the author details.

A diagram showing how the query will move through the subgraphs, collecting data for each field

The benchmarking client is executing that query (along with others) in a loop, so when it finishes, it sends it again. If the accounts subgraph fails due to the timeout, the client gets its response sooner and sends the query again sooner. This means it begins the trip back through products again, then immediately on to reviews and inventory, slightly increasing their traffic. If those requests are expensive for those subgraphs, that extra load can be enough to increase their latency.

This is an effect that can be observed in many deployments: subgraphs are not isolated from each other, and can affect each other's performance.

And it's possible to see this effect even without timeouts. Let's say a subgraph is often called in the middle of complex queries, but is slow to respond, to the point where clients tend to abandon the query. Improving that subgraph's performance might mean that clients get their responses faster, but then other subgraphs called down the line would see an increase in traffic that they are not ready for.

Here we suffer from an issue similar to the one we saw with client-side traffic shaping: when applying timeouts in isolation, all of the queries are still reaching the subgraphs. We may need to use rate limiting along with timeouts to better control the traffic.

Subgraph level rate limiting

We will use the same traffic shaping options, but this time we will set a rate limit of 500 requests per second for accounts:

traffic_shaping:
  router:
    timeout: 1s
    global_rate_limit:
      capacity: 500
      interval: 1s
  all:
    timeout: 500ms
  subgraphs:
    accounts:
      timeout: 100ms
      global_rate_limit:
        capacity: 500
        interval: 1s

Latency in reviews and inventory immediately drops, so this helped!

Effect of subgraph rate limiting on subgraph requests

And p99 latency for client requests, which was quite volatile between 2s and 4s, is now stabilizing near 1s.

Effect of subgraph rate limiting on client requests

We are finally starting to get a handle on this infrastructure. Limiting traffic on the right subgraphs improves overall performance for the entire deployment, at the cost of missing data in some client responses.
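We also noted earlier that products is a candidate for these controls. As an illustrative sketch only (the products values below are assumptions to tune against your own dashboards, not settings from this lesson), its own entry can sit alongside accounts:

traffic_shaping:
  router:
    timeout: 1s
    global_rate_limit:
      capacity: 500
      interval: 1s
  all:
    timeout: 500ms
  subgraphs:
    accounts:
      timeout: 100ms
      global_rate_limit:
        capacity: 500
        interval: 1s
    products:
      # Illustrative values only: tune them against the dashboards above.
      timeout: 250ms
      global_rate_limit:
        capacity: 400
        interval: 1s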

Practice

Subgraph traffic control

We can set a global timeout for all of our subgraphs using the ______ configuration property. More precise timeouts can be applied by specifying each subgraph individually beneath the ______ property. In a federated architecture, subgraphs are ______ from one another. This means that a change in traffic shaping for one ______ impact other services.

Drag items from this box to the blanks above

  • can

  • subgraphs

  • global

  • isolated

  • all

  • accounts

  • not isolated

  • cannot

Key takeaways

  • We can use the router's traffic shaping configuration to limit traffic at a subgraph level.
  • Any settings we apply at the all level will apply to all the subgraphs the router communicates with.
  • We can use the subgraphs key to specify particular timeout and rate limit values for individual services.

Up next

Client requests are now getting a stable latency, but at one second, it's still not great. Coming up next, we'll get a huge performance boost with deduplication.
