Overview
Let's begin our investigation in the Grafana dashboard connected to our router.
In this lesson, we will:
- Identify our primary metrics for monitoring service health
Exploring Grafana
With your Docker process running, we'll jump back over to our Grafana router dashboard.
Already we should see that there's a lot happening here. Our router is receiving requests and routing them across our federated API to be resolved by the appropriate subgraph. The dashboard is composed of several different metrics we're tracking, sorted into a few discrete categories:
- Client: Metrics concerned with client experience, client requests, and latency.
- Subgraphs: Metrics specific to our underlying subgraph services. We should see a panel for latency in each of our four subgraphs.
- Query planner: Various metrics from monitoring the router's query planner, the mechanism that devises the "instructions" the router follows to resolve incoming queries.
Take a few moments to scroll through this dashboard. Yours likely won't look exactly like ours, but you should see some data coming through.
Let's highlight some of the most important data here.
HTTP request count
The first panel under the Client section, HTTP request count, shows us the number of requests the router receives per second. This number fluctuates slightly over time, but we should see our line approaching 500 requests per second. Hover over any of the plotted points to see its timestamp, along with the specific number of requests received at that time.
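To make the "requests per second" idea concrete, here's a minimal sketch of how a per-second rate can be derived from a cumulative request counter. The timestamps and counts below are invented for illustration; they aren't readings from this dashboard.

```python
# Hypothetical illustration: deriving a requests-per-second rate from
# cumulative request-count samples, the way a rate panel typically does.
# The sample values below are made up for demonstration.

samples = [
    # (timestamp in seconds, total requests observed so far)
    (0, 0),
    (15, 7_450),
    (30, 14_980),
    (45, 22_410),
    (60, 29_900),
]

# For each pair of consecutive samples, the rate is the change in the
# counter divided by the elapsed time.
for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
    rate = (c1 - c0) / (t1 - t0)
    print(f"{t0:>3}s to {t1:>3}s: {rate:.1f} requests/second")
```

With these made-up samples, each interval works out to roughly 500 requests per second, which is the kind of steady line we'd expect to see on the panel.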
Client connections
The next panel reflects how many active client connections we have over time.
Request latency and processing time
Jumping down to the next row, we'll see two panels with data plotted in percentiles, Request latency and Processing time.
Request latency shows how the latency of incoming requests is distributed across different percentiles (p50, p75, p90, p95, p99).
Here's how we break this down. (Note that the following are just example values; your dashboard may reflect something different for each percentile.)
| Percentile | Value | Significance |
| --- | --- | --- |
| p50 | 0.50s | 50% of requests execute in 0.5 seconds or less |
| p75 | 1.50s | 75% of requests execute in 1.5 seconds or less |
| p90 | 3.75s | 90% of requests execute in 3.75 seconds or less |
| p95 | 4.50s | 95% of requests execute in 4.5 seconds or less |
| p99 | 4.95s | 99% of requests execute in 4.95 seconds or less |
Put another way, this means that the majority of requests in our system take anywhere from 0.5 seconds to 5 seconds to execute.
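If it helps to see where numbers like these come from, here's a minimal sketch of computing percentile cut points from raw request latencies. The latency data below is randomly generated for illustration, not taken from the lesson's dashboard.

```python
# Hedged example: how percentile values like those in the table above
# could be computed from raw request latencies. The latencies list is
# simulated purely for illustration.
import random
import statistics

random.seed(42)
# Simulate 1,000 request latencies (in seconds), skewed toward slower
# responses and capped at 5 seconds.
latencies = [min(5.0, random.expovariate(1 / 1.2)) for _ in range(1_000)]

# statistics.quantiles with n=100 returns the 1st through 99th percentile
# cut points; index p - 1 corresponds to the pth percentile.
percentiles = statistics.quantiles(latencies, n=100)
for p in (50, 75, 90, 95, 99):
    print(f"p{p}: {percentiles[p - 1]:.2f}s")
```

Reading the output the same way as the table: the p50 value is the latency that half of all requests stay under, while the p99 value is the latency that all but the slowest 1% of requests stay under.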
Under Processing time, we'll see something similar. This panel shows us how long it takes the router to complete its processing-related tasks for most requests.
When the majority of our queries are taking multiple seconds to respond, we're left with a less-than-great user experience. Let's jump down to the Subgraphs section and get a more granular look at our services.
Subgraph requests and latency
Scroll down in your Grafana dashboard until you've reached the Subgraphs category. Spend a moment comparing the different latency panels for each of the four subgraphs we're working with.
See anything interesting? The latency rates for both the products and accounts subgraphs are consistently high, with 99% of requests resolving in about 5 seconds or less, while inventory and reviews have spikes every now and then but otherwise maintain reasonable levels.
A closer look shows that our accounts service has a consistent p99 value of five seconds, and receives close to 1,000 requests per second!
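One hypothetical way to frame that comparison in code is to collect each subgraph's p99 latency and flag anything above a latency budget we'd consider acceptable. The p99 values and the one-second budget below are assumptions chosen to roughly mirror the dashboard, not numbers from the lesson.

```python
# Illustrative only: flag subgraphs whose p99 latency exceeds a chosen
# budget. Both the per-subgraph values and the budget are assumptions.

LATENCY_BUDGET_SECONDS = 1.0

p99_by_subgraph = {
    "products": 4.9,
    "accounts": 5.0,
    "inventory": 0.4,
    "reviews": 0.6,
}

# Sort slowest-first so the subgraphs that need attention appear at the top.
for subgraph, p99 in sorted(p99_by_subgraph.items(), key=lambda kv: -kv[1]):
    status = "investigate" if p99 > LATENCY_BUDGET_SECONDS else "ok"
    print(f"{subgraph:<10} p99={p99:.1f}s  ->  {status}")
```

With these assumed values, products and accounts are the two subgraphs that stand out, which matches what we observed on the dashboard.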
That's a good start—we've taken a look at some key data points, and identified where latency is higher than we'd like. But we'll gather some more information before making any changes!
Key takeaways
- The GraphOS router exposes a wide variety of metrics that can be represented in a data dashboard, such as: client requests, request latency, processing time, query planner performance, and subgraph requests.
- Some metrics are represented as values distributed across different percentiles (such as p50, p75, p99). We use these distributions to visualize how a particular component of our system performs at various percentages. We can see, for example, how long it takes 50% of our requests to resolve (p50); and compare this to how long it takes 99% of our requests to resolve (p99). This gives us a general idea of latency across the board, and where improvements might need to be made.
Up next
Time series charts are super useful, but they're one piece of the puzzle in diagnosing performance issues. Next, we'll take a look at system traces and logs.