3. Exploring metrics
10m

Overview

Let's begin our investigation in the Grafana dashboard connected to our router.

In this lesson, we will:

  • Identify our primary metrics for monitoring service health

Exploring Grafana

With your Docker process running, we'll jump back over to our Grafana router dashboard.

Already we should see that there's a lot happening here. Our router is receiving requests and routing them across our federated API to be resolved by the appropriate subgraph. The dashboard is composed of several different metrics we're tracking, sorted into a few discrete categories:

  • Client: concerned with client experience, client requests, and latency.
  • Subgraphs: specific to our underlying services. We should see a panel for latency in each of our four subgraphs.
  • Query planner: various metrics from monitoring the router's query planning mechanism, which devises the "instructions" the router follows to resolve various queries.

Take a few moments to scroll through this dashboard. Yours likely won't look exactly like ours, but you should see some data coming through.

Grafana graph for request count, latency, etc

Let's highlight some of the most important data here.

HTTP request count

The first panel under the Client section, HTTP request count, shows us the number of requests the router receives per second. This number fluctuates slightly over time, but we should see our line approaching 500 requests per second. Hover over any of the plotted points to see their timestamp, along with the specific number of requests received at that time.

A screenshot highlighting the HTTP request count panel in Grafana, with an average of 400 requests per second
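To make the math behind this panel concrete, here's a minimal sketch of how a requests-per-second value can be derived from a monotonically increasing request counter. The counter readings and timestamps below are invented for illustration; the dashboard computes this rate for us automatically.

```python
# Hypothetical counter samples (total requests served), taken 15 seconds apart.
earlier_count, earlier_time = 120_000, 0.0   # total requests, seconds
later_count, later_time = 127_500, 15.0

# A request rate is simply the change in the counter divided by the elapsed time.
requests_per_second = (later_count - earlier_count) / (later_time - earlier_time)
print(f"{requests_per_second:.0f} requests/second")  # -> 500 requests/second
```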

Client connections

The next panel reflects how many active client connections we have over time.

A screenshot highlighting the client connections panel in Grafana with a data point of 7500 connections highlighted

Request latency and processing time

Jumping down to the next row, we'll see two panels with data plotted in percentiles, Request latency and Processing time.

Request latency reflects how latency values are distributed across different percentiles (p50, p75, p90, p95, p99).

A screenshot of the request latency panel in Grafana, plotting percentages of request against the time they take to execute

Here's how we break this down. (Note that the following are just example values; your dashboard may reflect something different for each percentile.)

Percentile   Value    Significance
p50          0.50s    50% of requests execute in 0.5 seconds or less
p75          1.50s    75% of requests execute in 1.5 seconds or less
p90          3.75s    90% of requests execute in 3.75 seconds or less
p95          4.50s    95% of requests execute in 4.5 seconds or less
p99          4.95s    99% of requests execute in 4.95 seconds or less

Put another way, this means that most requests in our system take anywhere from 0.5 seconds to 5 seconds to execute.
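If you'd like to see how percentile values like these are computed, here's a small sketch using Python's standard library. The latency samples are randomly generated stand-ins for the request latencies the dashboard records; only the mechanics of the percentile calculation matter here.

```python
import random
import statistics

# Invented latency samples (in seconds) standing in for recorded request latencies.
random.seed(42)
latencies = [random.uniform(0.1, 5.0) for _ in range(1_000)]

# statistics.quantiles with n=100 returns the 99 cut points p1 through p99.
cut_points = statistics.quantiles(latencies, n=100)
for p in (50, 75, 90, 95, 99):
    print(f"p{p}: {cut_points[p - 1]:.2f}s")
```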

Under Processing time, we'll see something similar. This panel shows us how long it takes the router to complete its processing-related tasks for most requests.

A screenshot of the processing time panel in Grafana, plotting percentages of requests against processing time

When the majority of our queries are taking multiple seconds to respond, we're left with a less-than-great user experience. Let's jump down to the Subgraphs section and get a more granular look at our services.

Subgraph requests and latency

Scroll down in your Grafana dashboard until you've reached the Subgraphs category. Spend a moment comparing the different latency panels for each of the four subgraphs we're working with.

A screenshot of the Subgraphs section of the Grafana dashboard

See anything interesting? The latency rates for both the products and accounts subgraphs are consistently high, with 99% of requests resolving in about 5 seconds or less, while inventory and reviews have occasional spikes but otherwise maintain reasonable levels.

Grafana dashboards for accounts and products highlighted with high latency warning

A closer look shows that our accounts service has a consistent p99 value of five seconds, and receives close to 1000 requests per second!

A screenshot of the Grafana panels subgraph requests and accounts latency
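As a rough sketch of the comparison we just made by eye, here's one way to flag subgraphs whose p99 latency exceeds a budget. The per-subgraph values and the one-second budget below are invented to mirror what we observed on the dashboard.

```python
# Invented p99 latencies (in seconds) per subgraph, mirroring the dashboard.
p99_latency = {"accounts": 5.0, "products": 4.9, "inventory": 0.8, "reviews": 0.6}

LATENCY_BUDGET = 1.0  # an assumed per-subgraph p99 budget, in seconds

# List the slowest subgraphs first and flag any that exceed the budget.
for subgraph, p99 in sorted(p99_latency.items(), key=lambda kv: kv[1], reverse=True):
    status = "over budget" if p99 > LATENCY_BUDGET else "ok"
    print(f"{subgraph:>9}: p99={p99:.1f}s ({status})")
```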

That's a good start—we've taken a look at some key data points, and identified where latency is higher than we'd like. But we'll gather some more information before making any changes!

Practice

What does it mean for our system if we have a request latency of 0.25s at p75?

Key takeaways

  • The router exposes a wide variety of metrics that can be represented in a data dashboard, such as: client requests, request latency, processing time, query planner performance, and subgraph requests.
  • Some metrics are represented as values distributed across different percentiles (such as p50, p75, p99). We use these distributions to visualize how a particular component of our system performs at various percentages. We can see, for example, how long it takes 50% of our requests to resolve (p50), and compare this to how long it takes 99% of our requests to resolve (p99). This gives us a general idea of latency across the board, and where improvements might need to be made.

Up next

Time series charts are super useful, but they're one piece of the puzzle in diagnosing performance issues. Next, we'll take a look at system traces and logs.
