Router Telemetry

Collect observable data to monitor your router and supergraph


Since the router is the single access point for all traffic to and from your graph, router telemetry is the most comprehensive way to observe your supergraph. By implementing telemetry, you can:

  • Monitor your supergraph's health and performance

  • Diagnose issues and deduce root causes

  • Optimize resource usage and system reliability

To understand how router telemetry fits into the broader set of GraphOS observability tooling, see the observability overview.

How router telemetry works

By default, the router doesn't collect or export any telemetry beyond the operation and field usage metrics it sends to GraphOS. You configure which additional telemetry data to collect and where to export it via your router's configuration file.

The router request lifecycle is the primary source of telemetry data, also called signals. Signals include logs, metrics, and traces. The section on router telemetry signals explains these data types and gives basic configuration examples. Exporters send telemetry data to your application performance monitoring (APM) and observability tools for storage, visualization, and analysis.

Telemetry exporters

The router emits telemetry in the industry-standard OpenTelemetry Protocol (OTLP) format and is therefore compatible with many APM tools, including:

  • Prometheus

  • OpenTelemetry Collector

  • Datadog

  • New Relic

  • Jaeger

  • Zipkin
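As a rough illustration of how an exporter is wired up, the sketch below sends both traces and metrics over OTLP to an OpenTelemetry Collector. The endpoint is a placeholder that assumes a collector listening locally on the default OTLP gRPC port. Check the exporter reference for your router version before relying on the exact keys.

```yaml
telemetry:
  exporters:
    tracing:
      otlp:
        enabled: true
        # Placeholder endpoint: assumes a collector on localhost, default OTLP gRPC port
        endpoint: "http://localhost:4317"
    metrics:
      otlp:
        enabled: true
        # Same placeholder collector endpoint as above
        endpoint: "http://localhost:4317"
```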

Attributes and selectors

Attributes and selectors are key-value pairs that add contextual information from the router request lifecycle to telemetry data. You can use attributes and selectors to annotate events, metrics, and spans so they can help you filter and group data in your APMs.

The router supports a set of standard attributes from OpenTelemetry semantic conventions. Example attributes include:

  • HTTP status code

  • GraphQL operation name

  • Subgraph name

Selectors allow you to define custom data points based on the router's request lifecycle.

Term       Description
Attribute  Standard data points that can be attached to spans, instruments, and events.
Selector   Custom data points extracted from the router's request lifecycle, tailored to specific needs.
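To make the distinction concrete, here is a minimal sketch that attaches one standard attribute and one selector-based custom attribute to the router span. The attribute name acme.client_version and the header x-client-version are hypothetical and chosen only for illustration.

```yaml
telemetry:
  instrumentation:
    spans:
      router:
        attributes:
          # Standard attribute: enabled with a boolean
          http.response.status_code: true
          # Custom attribute (hypothetical name) populated by a selector
          "acme.client_version":
            request_header: "x-client-version"
```

The standard attribute is switched on by name, while the custom attribute pairs a name you choose with a selector that says where in the request lifecycle its value comes from.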

Router telemetry signals

The router supports three signal types for collecting and exporting telemetry:

Signal Description
Logs and events
  • Capture and export logs in text or JSON format.
  • Trigger custom events to log critical actions during the router request lifecycle.
Metrics and instruments
  • Export standard metrics for router operations.
  • Leverage OpenTelemetry (OTEL) metrics to capture HTTP lifecycle data.
  • Define custom metrics using attributes and selectors.
Traces and spans
  • Export traces of router transactions.
  • Use spans to monitor specific actions within traces and attach attributes or selectors for deeper insights.

These mechanisms let you collect data about the inner workings of your router and graph and export it to your observability tools.

Logs and events

Logs record events in the router's request lifecycle. Examples of logged events include:

  • Information about the router lifecycle

  • Warnings about misconfiguration

  • Errors that occurred during a request

Log exporters

You can log events to standard output in either text or JSON format. Logs can also be consumed by logging exporters and as part of spans via tracing exporters.

Example log configuration

This configuration snippet enables stdout logging in JSON:

YAML
router.yaml
telemetry:
  exporters:
    logging:
      stdout:
        enabled: true
        format: json
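Beyond standard log output, the router can also trigger custom events during the request lifecycle. The following is a sketch of one such event, logged at the info level when a request reaches the router service. The event name acme.request_received and its message are hypothetical, and the exact set of supported keys may vary by router version.

```yaml
telemetry:
  instrumentation:
    events:
      router:
        # Hypothetical custom event name
        acme.request_received:
          message: "request received"
          level: info
          # Trigger when the request enters the router service
          on: request
          attributes:
            # Standard attribute attached to the logged event
            http.request.body.size: true
```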

Metrics and instruments

Metrics are measurements of the router's behavior that are collected and often analyzed over time to identify trends. Examples of router metrics include the number of incoming HTTP requests and the time spent processing a request.

Instruments define how to collect and report metrics. Different kinds of instruments include counters, gauges, and histograms. For example, given the metric "number of incoming HTTP requests," a counter records the total number of requests, a histogram captures the distribution of request counts over time, and a gauge provides a snapshot of the current request count at a given moment.

Instrument types

Metric instruments fall into three categories:

Instrument Type Description
OTEL instruments
Standard OpenTelemetry instruments around the HTTP lifecycle, including:
  • The number of HTTP requests by HTTP status
  • A histogram of HTTP router request duration
  • The number of active requests in flight
  • A histogram of request body sizes
Router instruments
Standard instruments for the router request lifecycle, including:
  • Count of GraphQL errors in responses
  • Time spent loading the schema in seconds
  • Number of entries in the router's cache
  • Time spent warming up the query planner queries in seconds
Custom instruments
Custom instruments defined in the router request lifecycle.

Example instrument configuration

This configuration snippet enables OTEL instrumentation for a histogram of request body sizes:

YAML
router.yaml
telemetry:
  instrumentation:
    instruments:
      router:
        http.server.request.body.size: true

See Instruments for an overview of available instruments and a guide for configuring and customizing instruments.
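For comparison with the standard instrument above, a custom instrument might look like the following sketch. It assumes a histogram of router request duration. The instrument name acme.request.duration and its description are hypothetical, so treat this as an illustration rather than a recommended configuration.

```yaml
telemetry:
  instrumentation:
    instruments:
      router:
        # Hypothetical custom instrument name
        acme.request.duration:
          value: duration      # record how long the request took
          type: histogram      # report as a distribution rather than a running total
          unit: s
          description: "Duration of router requests"
          attributes:
            # Standard attribute used to group measurements by status code
            http.response.status_code: true
```

Each attribute you add creates another dimension to group by, which is also what drives metric cardinality, so add them sparingly.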

Metric exporters

In addition to the operation metrics and field usage metrics that GraphOS Router sends to GraphOS, you can configure the router with metric exporters for other observability tools and APMs.

This configuration snippet enables exporting metrics to Prometheus:

YAML
router.yaml
telemetry:
  exporters:
    metrics:
      prometheus:
        enabled: true
        listen: 127.0.0.1:9090
        path: /metrics

Learn more about sending metrics to Prometheus and metric exporters in general.

Traces and spans

Traces help you monitor the flow of a request through the router. A trace is composed of spans. A span captures a request's duration as it flows through the router request lifecycle. Spans may include contextual information about the request, such as the HTTP status code or the name of the subgraph being queried.

Examples of spans include:

  • router - Wraps an entire request from the HTTP perspective

  • supergraph - Wraps a request once GraphQL parsing has taken place

  • subgraph - Wraps a request to a subgraph

Tracing exporters

If you've enabled federated tracing (also known as FTV1 tracing) in your subgraph libraries, the router sends field-level traces to GraphOS. Additionally, trace exporters can consume and report traces to your APM.

This configuration snippet does two things:

  • Sets attributes that Datadog uses to organize its APM view

  • Exports traces to a Datadog agent over OTLP

YAML
router.yaml
telemetry:
  instrumentation:
    spans:
      mode: spec_compliant
      # Attributes on each span type that Datadog uses to organize its APM view
      router:
        attributes:
          otel.name: router
          operation.name: "router"
          resource.name:
            request_method: true
      supergraph:
        attributes:
          otel.name: supergraph
          operation.name: "supergraph"
          resource.name:
            operation_name: string
      subgraph:
        attributes:
          otel.name: subgraph
          operation.name: "subgraph"
          resource.name:
            subgraph_operation_name: string
  exporters:
    tracing:
      otlp:
        enabled: true
        # Datadog agent host is read from the DATADOG_AGENT_HOST environment variable
        endpoint: "${env.DATADOG_AGENT_HOST}:4317"

Learn more about sending traces to Datadog and trace exporters in general.

Best practices

Collecting exactly the telemetry you need

Effective telemetry provides just the right amount and granularity of information to maintain your graph. Too much data, such as high-cardinality metrics, can overwhelm your system. Too little may not provide enough information to debug issues.

Specific events that need to be captured—and the conditions under which they need to be captured—can change as client applications and graphs change. Different environments, such as production and development, can have different observability requirements.

Router telemetry is customizable to meet the observability needs of different graphs. Keep in mind your particular environments' and graphs' requirements when configuring your telemetry.

Setting conditions for collecting telemetry

You can set conditions on instruments and events so that they collect telemetry data only when necessary. This configuration snippet collects the configured telemetry data only when the x-req-header request header is equal to "example-value":

YAML
eq:
  - "example-value"
  - request_header: x-req-header
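The condition above is a fragment. In a full configuration it sits under the condition key of an instrument or event. A minimal sketch, assuming a custom counter on the router service (the instrument name acme.flagged.requests and its description are hypothetical):

```yaml
telemetry:
  instrumentation:
    instruments:
      router:
        # Hypothetical custom counter
        acme.flagged.requests:
          value: unit          # increment by one per matching request
          type: counter
          unit: count
          description: "Requests carrying the example header value"
          # Only record a measurement when the header matches
          condition:
            eq:
              - "example-value"
              - request_header: x-req-header
```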

Dropping metrics using views

You can use the metric exporters' views property with the drop aggregation to prevent certain metrics from being sent to your APM. This configuration snippet drops all instruments whose names begin with apollo_router:

YAML
router.yaml
telemetry:
  exporters:
    metrics:
      common:
        service_name: apollo-router
        views:
          - name: apollo_router*
            aggregation: drop

Balancing telemetry and router performance

Keep in mind that the amount of telemetry you add can impact your router's performance.

  • Custom metrics, events, and attributes consume more processing resources than standard metrics. Adding too many (standard or custom) can slow your router down.

  • Configurations such as events.*.request|error|response that produce output for all router lifecycle services should only be used for development or debugging, not for production.

For properly logged telemetry, use a log verbosity of info. Set the RUST_LOG or APOLLO_ROUTER_LOG environment variable, or the --log CLI option, to info. A less verbose level, such as error, can cause some attributes to be dropped.

Next steps

Consult the following documentation for details on how to configure the various telemetry mechanisms and exporters:
