Introducing OpenTelemetry for Apollo Federation
Lenny Burdette
How do you debug or optimize a request as it travels through all the layers of your stack: from the client, through the Apollo Gateway, through your subgraph services, all the way to your databases?
First, you need visibility into what your code is doing during that request, regardless of how that work is distributed across machines. Fortunately, this is easier than ever with a Cloud Native Computing Foundation project called OpenTelemetry!
As of version 0.31.1, @apollo/gateway natively supports OpenTelemetry as part of a complete picture of the performance and reliability of a federated GraphQL request. You can find the documentation here.
In this article, I’ll cover some OpenTelemetry basics, talk about the new support for OpenTelemetry in Apollo Gateway, and explain how Apollo Studio and OpenTelemetry work together to give you a complete picture of your graph.
What is OpenTelemetry?
OpenTelemetry (or OTel) is “an observability framework for cloud-native software”. It is a collection of libraries, systems, and applications for instrumenting, collecting, and exporting information about workloads that span multiple applications.
A unit of work in OTel is a “span”. You can wrap a function call in a span, recording its duration and other metadata. You can also install instrumentation that automatically hooks into modules and functions, like the HTTP instrumentation for Node.js.
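To make the idea of a span concrete, here is a minimal sketch of wrapping a function call and recording its duration and metadata. It is a toy and not the OpenTelemetry API: `withSpan`, `SpanRecord`, and `finishedSpans` are hypothetical names invented for illustration, and the sketch only handles synchronous work.

```typescript
// Toy span recorder. NOT the OpenTelemetry API: every name here is
// hypothetical and exists only to illustrate what a span captures.
interface SpanRecord {
  name: string;
  durationMs: number;
  attributes: Record<string, string | number | boolean>;
  error?: string;
}

// Finished spans end up here; a real SDK would hand them to an exporter.
const finishedSpans: SpanRecord[] = [];

// Runs `work` synchronously and records how long it took, plus any error.
function withSpan<T>(
  name: string,
  attributes: Record<string, string | number | boolean>,
  work: () => T,
): T {
  const start = Date.now();
  const record: SpanRecord = { name, durationMs: 0, attributes };
  try {
    return work();
  } catch (err) {
    // Keep the failure on the span, then let the caller see the error too.
    record.error = err instanceof Error ? err.message : String(err);
    throw err;
  } finally {
    record.durationMs = Date.now() - start;
    finishedSpans.push(record);
  }
}
```

Calling `withSpan("parseQuery", { "graphql.operation": "ProductsPage" }, () => parse(source))` with a hypothetical `parse` function would return whatever `parse` returns and leave one record in `finishedSpans`. Automatic instrumentation does essentially this wrapping for you around library functions.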
A collection of related spans is called a “trace”. In a single system, a trace is very similar to a stack trace. But what makes OpenTelemetry (and its predecessors, OpenTracing and OpenCensus) powerful is the ability to correlate spans across systems. It does this by propagating trace identifiers in remote calls. When making HTTP requests, for example, OTel instrumentation adds trace IDs as HTTP headers.
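As a rough illustration of that propagation, the sketch below builds and parses a `traceparent` header in the W3C Trace Context format, which I am assuming is the propagator in use (other propagators, such as B3, use different headers). The function and type names are hypothetical; real instrumentation does this for you.

```typescript
// What travels between services: which trace this is, which span made the
// call, and whether the trace is being sampled.
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Builds a version "00" traceparent value: 00-<trace id>-<span id>-<flags>.
function formatTraceparent(ctx: TraceContext): string {
  const flags = ctx.sampled ? "01" : "00";
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

// Parses a version "00" traceparent value; returns null if it is malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero trace or span IDs are invalid in the W3C format.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // The lowest bit of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}
```

A downstream service that receives this header starts its own spans with the same `traceId` and uses the incoming `spanId` as their parent, which is how spans from different machines end up in one trace.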
Once you have traces, you can aggregate them in a “collector” and efficiently send them to your observability tools with an “exporter”.
There’s obviously a lot more to OpenTelemetry (see the docs!), but that covers the basics you need in the context of Apollo Federation.
OpenTelemetry in Apollo Server and Apollo Gateway
Apollo Server has always supported OpenTelemetry via the @opentelemetry/instrumentation-graphql library, which instruments the underlying graphql.js library.
However, Apollo Gateway doesn’t use graphql.js to execute a GraphQL request; instead, it executes a “query plan”, coordinating smaller requests to your subgraph services and assembling them into a single response. The Gateway executor has a few phases:
- Validation
- Planning
- Execution
- Fetching
- Postprocessing
Apollo Gateway now instruments these phases as spans in an OTel trace. Combined with HTTP instrumentation, which propagates the trace ID as headers in requests to subgraphs, we can now connect the gateway query planning and execution to the actual GraphQL field resolution work in your subgraph services.
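To picture what such a trace might look like, here is a small sketch that models the phases above as parent and child spans and prints them as a tree. The span names and the exact nesting are my assumptions for illustration, not a guaranteed description of what the gateway emits; `buildGatewayTrace` and `renderTree` are hypothetical helpers.

```typescript
interface Span {
  id: number;
  parentId: number | null; // null marks the root span of the trace
  name: string;
}

// Builds an illustrative gateway trace with one fetch span per subgraph.
// Span names and nesting are assumptions made for this sketch.
function buildGatewayTrace(subgraphs: string[]): Span[] {
  const spans: Span[] = [];
  let nextId = 1;
  const add = (name: string, parentId: number | null): number => {
    const id = nextId++;
    spans.push({ id, parentId, name });
    return id;
  };

  const request = add("gateway.request", null);
  add("gateway.validate", request);
  add("gateway.plan", request);
  const execute = add("gateway.execute", request);
  for (const subgraph of subgraphs) {
    add(`gateway.fetch (${subgraph})`, execute);
  }
  add("gateway.postprocessing", execute);
  return spans;
}

// Renders spans as an indented tree, one line per span, depth-first.
function renderTree(spans: Span[], parentId: number | null = null, depth = 0): string[] {
  const lines: string[] = [];
  for (const span of spans) {
    if (span.parentId !== parentId) continue;
    lines.push("  ".repeat(depth) + span.name);
    lines.push(...renderTree(spans, span.id, depth + 1));
  }
  return lines;
}
```

Printing `renderTree(buildGatewayTrace(["products", "reviews"])).join("\n")` shows one root span for the request with the phases nested beneath it. In a real trace, each fetch span would in turn be the parent of the spans recorded inside the corresponding subgraph service.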
Check out our documentation on setting up OpenTelemetry in your Apollo Gateway and Apollo Server apps.
Even if you’re using a GraphQL framework other than Apollo Server or a language other than JavaScript, you can still use OTel to instrument your application. There are OTel instrumentation libraries for many other languages and frameworks, and GraphQL-specific OTel support is small but growing.
OpenTelemetry in your Infrastructure
Instrumenting a request is cool, but it’s not very useful on its own. You can export traces into many different systems to view, inspect, or search them.
The open-source tracing projects Zipkin and Jaeger both support OpenTelemetry. Here’s an example of a federated GraphQL request in Zipkin.
The OpenTelemetry project has a number of exporters for various Application Performance Monitoring (APM) tools, including SaaS products like Datadog and Honeycomb, or cloud provider solutions like AWS X-Ray. You can find a complete list in the OpenTelemetry Registry.
OpenTelemetry doesn’t replace Apollo Tracing
OpenTelemetry is great at recording what happened, but it doesn’t tell you much about what could happen. Apollo Studio uses the declarative nature of GraphQL to help you catch potential errors before they happen. Consider this GraphQL operation:
query ProductsPage($categoryId: ID!) {
  products(category: $categoryId) {
    nodes {
      sku
      price
      reviews { # could return null
        nodes {
          id
          body
          rating
          author {
            name
            photoUrl
          }
        }
      }
    }
  }
}
If the Product.reviews field returns null, then the GraphQL executor won’t run resolvers for any of the fields nested inside that selection like Review.id, Review.body, or Author.name. OTel instrumentation won’t record any spans, so you’ll have no data about those fields.
However, if you remove or change the Author.name field, you could break all the clients that include that field in their operations.
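The first half of that argument, that nested resolvers never run under a null parent, can be sketched with a toy executor. This is a deliberately simplified stand-in for a real GraphQL executor (no lists, arguments, or async resolvers), and every name in it is hypothetical. The `trace` array stands in for the spans that resolver-level instrumentation would record.

```typescript
// Toy schema: each field says which object type it returns (null for
// scalars) and how to resolve it. All names are hypothetical.
interface FieldDef {
  returns: string | null;
  resolve: (parent: any) => unknown;
}
type Schema = Record<string, Record<string, FieldDef>>;

// A selection is either a leaf (true) or a nested set of fields.
interface SelectionSet {
  [field: string]: SelectionSet | true;
}

const schema: Schema = {
  Product: {
    sku: { returns: null, resolve: (p) => p.sku },
    reviews: { returns: "Review", resolve: (p) => p.reviews },
  },
  Review: {
    body: { returns: null, resolve: (r) => r.body },
    author: { returns: "Author", resolve: (r) => r.author },
  },
  Author: {
    name: { returns: null, resolve: (a) => a.name },
  },
};

// Resolves each selected field, logging "Type.field" whenever a resolver runs.
function executeSelection(
  schema: Schema,
  typeName: string,
  parent: any,
  selections: SelectionSet,
  trace: string[],
): Record<string, unknown> {
  const result: Record<string, unknown> = {};
  for (const field of Object.keys(selections)) {
    const def = schema[typeName][field];
    trace.push(`${typeName}.${field}`); // a resolver ran, so a span would exist
    const value = def.resolve(parent);
    const sub = selections[field];
    if (value == null || sub === true || def.returns === null) {
      // Leaf field or null parent: nothing nested gets resolved.
      result[field] = value ?? null;
      continue;
    }
    result[field] = executeSelection(schema, def.returns, value, sub, trace);
  }
  return result;
}
```

Executing the selection `{ sku: true, reviews: { body: true, author: { name: true } } }` against a product whose `reviews` is null records only `Product.sku` and `Product.reviews`. Nothing in the recorded data shows that the client also asked for `Author.name`.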
We designed Apollo Federated Tracing to collect field usage regardless of what actually happens (or doesn’t happen!) in your services. Our tracing data powers the Schema Checks feature, alerting you of potentially breaking changes and giving you confidence as you evolve your API.
Apollo Studio also knows schema-related information like the @deprecated directive, so we can help you understand how clients are using deprecated fields.
It’s possible to cross-reference OTel traces and Apollo Studio traces by customizing the Usage Reporting plugin in Apollo Server. By passing header values in the traces sent to Studio that align with attributes on your OTel spans, you can jump between Studio and your OTel observability tooling to get a complete picture of a request.
Wrapping Up
OpenTelemetry is a fantastic open-source project and we’re excited to participate in the growing ecosystem. Combined with Apollo Studio, OpenTelemetry gives you a clear picture of your running systems, helping you debug, optimize, and evolve your federated graphs.
Check out the OpenTelemetry documentation and our documentation on instrumenting federated graphs.