4. Traces and logs
10m

Overview

We have a few other tools we can reach for to understand what's happening in a system. Traces and logs come in handy when we want a play-by-play of what might be going wrong.

In this lesson, we will:

  • Explore traces and spans
  • Learn how to access the router's logs

Traces

Traces are another good way of exploring what's going on in our system. We can access them by navigating to Explore in the left-hand menu, or by clicking this link.

By default, this will bring us to an interface with Tempo selected from the main dropdown. Below this, we'll see an input where we'll enter opening and closing curly braces ({}) to show all traces.

A screenshot of the interface with Tempo selected from the dropdown, and an empty object typed in the input

Next, click the blue Run query button to get our results.

A screenshot of all the traces
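The empty curly braces ({}) simply match every trace. If the list becomes noisy, Tempo's query language, TraceQL, also accepts filters. As a rough sketch (assuming the router reports its service name as unknown_service:router, as it appears in the Service column), the following query narrows the results to traces that involve the router:

{ resource.service.name = "unknown_service:router" }

Running this in place of {} should return only the router's traces.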

Let's look closer at one of our traces. Click directly on one of the values in the Trace ID column. (Look for one that shows unknown_service:router under the Service column.) This opens up another panel that lets us take a closer look at this particular trace. Alternatively, you can right-click on the Trace ID value and open the data in a new tab.

A trace with its individual spans opened up in a second panel on the screen

A trace is a way to represent the execution of a query, and the various systems involved in the process. The top-level line, called a "span", represents the time spent by the initial request in the system. Spans can correspond to different parts of a single service (the router by itself generates many!) or different services.

Tempo traces for the query execution

In this example, the top-level span is labeled query Reviews7. We can see that it executed in 313.15ms.

Just beneath it, there's a router span with a slightly shorter duration, 303.9ms. This tells us that nearly all of the total execution time was spent in the router and its children. And just beneath router, we can further explore its child spans (such as supergraph).

By looking at which spans take the most time, or considering the gaps between a parent and child span, we can infer where the request has spent most of its execution time. This can help point us in the direction of a bottleneck in a service, or a particularly expensive part of the code being executed.

Spans can also indicate where an error was raised, allowing us to identify the source of the problem even more quickly. We can see that some spans take a long time to execute—in particular, the spans for subgraph execution. These record the points in execution when the router calls into the subgraph services. If our subgraph services were set up to record traces as well, we'd see those reflected here as child spans.
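TraceQL can also help us hunt for slow or failing traces directly, rather than scanning the list by eye. The queries below are sketches with example values (the 300ms threshold is arbitrary): the first returns traces that ran longer than 300ms, and the second returns traces containing a span with an error status.

{ duration > 300ms }

{ status = error }

Either query can replace the empty {} in the Tempo search input.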

So if the subgraphs are taking a long time to execute, perhaps we'll find the source of our latency there? Possibly—but we're not done digging yet! Let's explore logs.

Exploring logs

It's common to see erratic behavior when a service is overloaded. This can result in the service failing to handle some requests, throwing errors, or becoming unresponsive. We can get a better idea of what's going on under the hood by exploring our logs.

We can access the logs on the "Explore" page in Grafana as well. Let's click the menu where Tempo is specified, and change the option to Loki, to match the service we're using in our Docker process to collect logs.

A screenshot showing the Explore dropdown with Loki selected

We'll see the interface update with the Loki query builder. Rather than construct our query using its prebuilt parameters, we'll switch our mode from Builder to Code. You'll find this option on the far right.

A screenshot showing the Builder-Code switcher

This gives us an input where we can write our own queries. Let's copy in the following query, which returns all of the logs from the router, and click the blue Run query button.

{job="router"} | json | line_format `{{.level}} {{.message}}`

Loki dashboard for router logs

Here's what this does.

  1. First, it fetches all log lines that match the filter {job="router"}.
  2. Next, it extracts keys and values from log lines formatted as json, and uses them as labels.
  3. Finally, each log line is rewritten using a specific line_format: the log level, followed by the message.

A screenshot of the Loki query results, showing logs returned

We see some errors in the logs:

A screenshot of the Loki logs in greater detail

You can click into each error for greater detail, such as the error message and the service it originated in.

A screenshot of greater detail in the Loki error log

We can also explore the logs by category, selecting info, warning, or error from the menu on the left.

A screenshot of the menus in Loki error logs to choose different levels
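If we'd rather narrow the query itself instead of using the menu, LogQL lets us filter on the labels extracted by the json stage. These are sketches that assume the router's log lines include a level field, as in the query above: the first keeps only error-level lines, and the second counts the errors in each five-minute window.

{job="router"} | json | level="error" | line_format `{{.level}} {{.message}}`

count_over_time({job="router"} | json | level="error" [5m])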

Practice

Traces, spans, and logs give us good insight into a system's __________. Each __________ represents the execution of a query and its ripple effects through the system. The top-level line is called a __________, and it tells us how much time the initial request spent in the system. When there are large gaps between blocks of work, this could indicate __________.

Drag items from this box to the blanks above

  • performance

  • compromised security

  • trace

  • metric dashboard

  • a bottleneck in performance

  • service

  • span

  • architecture

Key takeaways

  • Traces are visual depictions of how a particular request moves through our system. We can see a breakdown of the time spent in various services, and the different stages of work completed.
  • Each block of work represented in a trace is called a span.
  • Logs are text-based messages output by a system. We can inspect logs to pinpoint where an error occurred, or identify the details of a request's path through our system.

Up next

We've touched on metrics, traces (and spans), and logs; and with some high latency rates and a few lengthy traces, we're on the right track. In the next lesson, we'll dissect the data we gathered and begin making some fixes.

