OpenTelemetry for Platform Engineers: The Setup Your Teams Will Actually Adopt

OTel is the CNCF standard for vendor-neutral instrumentation. Here is the collector topology, sampling policy, and golden path that get teams to actually use it.

TL;DR. Platform teams that adopt OpenTelemetry without a clear collector topology, sampling policy, and golden path integration get partial adoption and unreliable data. The decisions that matter: gateway collector over agent sidecars for most Series B companies, head-based sampling to start, auto-instrumentation in CI templates so developers get traces without writing code.

OpenTelemetry (OTel) is the Cloud Native Computing Foundation standard for collecting telemetry data — traces, metrics, and logs — from applications. It decouples instrumentation from the backend: you instrument once against the OTel SDK, and you can route data to Grafana Tempo, Jaeger, Datadog, or any other compatible backend without touching service code again. Source: opentelemetry.io.

For platform teams specifically, this matters for one reason above all others: vendor lock-in on observability infrastructure is expensive to unwind. If 30 services are instrumented directly against the Datadog SDK and you decide to move backends, each service requires a code change. If those 30 services send traces to an OTel collector, changing backends is a collector configuration change.

The platform team's job with OTel is not to instrument services. It is to make the right instrumentation the path of least resistance for every team building on the platform.

The platform team's four OTel decisions

1. Collector topology: agent vs. gateway

An agent deployment runs the OTel collector next to the workloads, either as a sidecar container in each pod or as a DaemonSet with one collector per node. A gateway deployment runs the collector as a standalone service that all applications send data to.

For a Series B company with a moderate number of services, the gateway is usually the right starting point. The reasons are operational:

  • One collector deployment to monitor and maintain instead of one per node or one per pod.
  • Configuration changes apply centrally. If you change sampling policy or swap backends, it happens in one place.
  • Resource usage is predictable. Sidecars add memory and CPU to every running pod; the overhead compounds as you scale.
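To put rough numbers on the overhead argument, here is a back-of-the-envelope sketch. Every figure in it (pod count, node count, replica count, and the memory assigned to each collector) is a hypothetical placeholder, not a measurement. Substitute the values from your own cluster.

```python
def collector_memory_mib(
    pods: int,
    nodes: int,
    gateway_replicas: int,
    sidecar_mib: int = 100,    # assumed memory per sidecar collector
    agent_mib: int = 200,      # assumed memory per DaemonSet collector
    gateway_mib: int = 1024,   # assumed memory per gateway replica
) -> dict[str, int]:
    """Total collector memory for each topology, in MiB."""
    return {
        "sidecar": pods * sidecar_mib,          # grows with every pod
        "daemonset": nodes * agent_mib,         # grows with every node
        "gateway": gateway_replicas * gateway_mib,  # grows only when you add replicas
    }


# Hypothetical cluster: 300 pods on 20 nodes, 3 gateway replicas.
for topology, mib in collector_memory_mib(pods=300, nodes=20, gateway_replicas=3).items():
    print(f"{topology}: {mib} MiB")
```

The point is the shape of the growth, not the exact totals: sidecar cost scales with pods, DaemonSet cost with nodes, and gateway cost with a replica count you control.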

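The agent topology becomes worth considering when you need to process telemetry at the source, for example when traffic volume is high enough that a single gateway becomes a bottleneck, or when you want node-local buffering and host metadata attached before data leaves the node. Tail-based sampling is not a reason to move to agents: it needs every span of a trace to reach the same collector instance, which is usually handled in a gateway tier. For most organizations at Series B scale, none of this is the problem they have today.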

2. Sampling policy

Sampling determines which traces you keep. Two approaches:

Head-based sampling makes the decision at the start of a request, before the trace is complete. A simple percentage: keep 10% of all requests. It is cheap to operate and predictable.
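A minimal sketch of a head-based decision, assuming 128-bit trace IDs written as 32 hex characters. It is similar in spirit to the ratio-based samplers that OTel SDKs ship, but it is not their exact algorithm. The function name `keep_trace` is illustrative.

```python
import secrets


def keep_trace(trace_id_hex: str, ratio: float) -> bool:
    """Decide at request start whether to keep a trace, using only its ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the trace ID as a uniformly distributed number.
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    # Keep the trace if that number falls in the lowest `ratio` share of the range.
    return low_bits < int(ratio * (1 << 64))


# Simulate 10,000 requests with random trace IDs and a 10% policy.
kept = sum(keep_trace(secrets.token_hex(16), 0.10) for _ in range(10_000))
print(f"kept {kept} of 10000 traces")
```

Because the decision depends only on the trace ID, every service that sees the same ID reaches the same answer without coordination. That is what makes head-based sampling cheap.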

Tail-based sampling makes the decision after the trace is complete, based on the outcome. Keep 100% of traces that include an error or exceed a latency threshold. Keep 1% of traces that completed successfully within expected parameters. It is more intelligent, because you capture the interesting requests, but operationally heavier. The collector has to buffer in-flight traces until it can make a decision, and every span of a trace has to reach the same collector instance, which means more memory and a more complex failure mode.
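The policy described above can be sketched as a function over a completed trace. This is an illustration of the decision logic only, not the collector's implementation. The `Span` type, the 500 ms threshold, and the use of the longest span as a stand-in for trace latency are all assumptions.

```python
import random
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    duration_ms: float
    is_error: bool = False


def keep_completed_trace(
    spans: list[Span],
    latency_threshold_ms: float = 500.0,   # assumed threshold
    baseline_ratio: float = 0.01,          # keep 1% of healthy traces
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to keep a trace once all of its spans have arrived."""
    # Keep every trace that contains an error.
    if any(span.is_error for span in spans):
        return True
    # Keep every trace whose longest span exceeded the latency threshold.
    if max((span.duration_ms for span in spans), default=0.0) > latency_threshold_ms:
        return True
    # Otherwise keep only a small random share of healthy traces.
    return rng() < baseline_ratio
```

The function needs the whole trace as input. That requirement is where the buffering cost comes from: something has to hold the spans until the trace is complete.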

The recommendation for most platform teams: start with head-based sampling at 5–10% depending on your request volume. Add tail-based sampling when you have a concrete reason — typically when you are debugging a class of errors that are rare enough that head-based sampling is missing most of them. The operational complexity of tail-based sampling is real. Do not take it on until you understand the problem it solves.

3. Auto-instrumentation

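OTel provides auto-instrumentation (also called zero-code instrumentation) for languages including Java, Python, and Node.js. These libraries attach to the runtime and generate baseline spans for HTTP requests, database calls, and inter-service communication without requiring the developer to write any instrumentation code. Go is handled differently, through a separate eBPF-based project, so check its current maturity before you put it in a template.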

Platform teams can enable this in CI/CD templates. If the deployment template for Node.js services includes the OTel Node.js auto-instrumentation initialization, every service deployed through that template gets baseline traces from day one. Developers who want richer instrumentation can add manual spans. Developers who want nothing more than the baseline do nothing.
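One way a template can do this is by injecting environment variables at deploy time. The sketch below renders such a set for a Node.js service. It assumes the standard OTel SDK environment variables and the `@opentelemetry/auto-instrumentations-node` package being installed in the image. The function name, service name, and gateway endpoint are hypothetical, so verify the variable names against the current OTel documentation before relying on them.

```python
def otel_env_for_node_service(
    service_name: str,
    collector_endpoint: str,
    sample_ratio: float = 0.1,
) -> dict[str, str]:
    """Environment variables a deployment template could inject for a Node.js service."""
    if not 0.0 <= sample_ratio <= 1.0:
        raise ValueError("sample_ratio must be between 0 and 1")
    return {
        # Identifies the service in the tracing backend.
        "OTEL_SERVICE_NAME": service_name,
        # Where the SDK sends OTLP data; here, the platform's gateway collector.
        "OTEL_EXPORTER_OTLP_ENDPOINT": collector_endpoint,
        # Head-based sampling that respects the caller's decision.
        "OTEL_TRACES_SAMPLER": "parentbased_traceidratio",
        "OTEL_TRACES_SAMPLER_ARG": str(sample_ratio),
        # Loads the auto-instrumentation before application code runs.
        "NODE_OPTIONS": "--require @opentelemetry/auto-instrumentations-node/register",
    }


# Hypothetical service and gateway address.
env = otel_env_for_node_service("checkout", "http://otel-gateway.observability:4318")
for key, value in env.items():
    print(f"{key}={value}")
```

Keeping the sampling ratio and endpoint in the template rather than in service code means the platform team can change either without asking any team to redeploy a code change.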

This is the path that actually produces adoption. Manual instrumentation requires developers to make an active decision to add traces. Auto-instrumentation in the template requires developers to make an active decision to remove them. The default being "instrumented" is what moves adoption from opt-in to opt-out.

4. Backend choice

For teams already on Grafana, Grafana Tempo is the lowest-friction backend: it integrates with Grafana dashboards, correlates with Loki logs and Prometheus metrics, and does not require an additional vendor relationship.

Jaeger is a reasonable alternative if you want a self-contained tracing system. It is more complex to operate than Tempo at scale but simpler to reason about in isolation.

The choice matters less than the collector topology decision. Because you are routing through an OTel collector, switching backends later does not require touching service code. Make a decision, deploy it, and revisit when you have an operational reason to change.
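To make the "configuration change, not code change" point concrete, here is a minimal trace pipeline modeled as a Python dictionary. The collector itself is configured in YAML, and this sketch only mirrors that structure. The endpoints are hypothetical, TLS and authentication settings are omitted, and it assumes both backends accept OTLP.

```python
import copy
import json

# Mirrors the shape of a collector config: receivers -> processors -> exporters.
BASE_CONFIG = {
    "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
    "processors": {"batch": {}},
    "exporters": {"otlp": {"endpoint": "tempo.observability:4317"}},  # hypothetical
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["otlp"],
            }
        }
    },
}


def with_backend(config: dict, endpoint: str) -> dict:
    """Return a copy of the config that exports to a different OTLP endpoint."""
    updated = copy.deepcopy(config)
    updated["exporters"]["otlp"]["endpoint"] = endpoint
    return updated


# Switching backends touches one exporter field; receivers stay as they are.
migrated = with_backend(BASE_CONFIG, "jaeger-collector.observability:4317")  # hypothetical
print(json.dumps(migrated["exporters"], indent=2))
```

Services keep sending OTLP to the same receiver throughout. Only the exporter side of the pipeline changes.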

What developers need from you for adoption

Platform teams often set up OTel and then wait for adoption that does not arrive. The adoption failure is not usually a technical problem. It is that developers do not trust the data.

Three things produce trust:

A working example in the golden path. Developers should be able to deploy a new service and immediately see traces in the backend without any configuration work. If the first trace a developer sees requires debugging the collector configuration, the mental model for OTel in your organization becomes "complicated thing that requires platform team help." The golden path needs to work the first time, without exceptions.

A queryable backend they know about. If you set up Grafana Tempo but do not tell developers it exists, the investment is not producing value. A 20-minute "here is how to trace a request" session — or a doc with three screenshots — pays for itself in the first incident where someone uses it.

A runbook for common debugging patterns. What does a trace look like when a database call is slow? How do you find all requests that hit a specific downstream service? How do you correlate a trace ID from a log entry to a trace in Tempo? These are questions with specific answers that do not require custom tooling. Writing them down once means they do not have to be answered live during an incident.

The adoption failure to watch for

The most common OTel adoption failure in platform teams is this: the collector is deployed, auto-instrumentation is enabled, traces are arriving in the backend — but developers do not use them during incidents because they do not trust the data to be complete or accurate.

The root cause is usually one of two things. Either the sampling policy is misconfigured and is dropping traces that matter (at a 0.1% sampling rate, most low-traffic services produce almost no traces), or the backend is intermittently unavailable because the collector is not monitored as production infrastructure.
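The sampling half of that diagnosis is simple arithmetic, and it is worth running against your own traffic before you pick a rate. The service names and request volumes below are hypothetical.

```python
def expected_traces_per_day(requests_per_day: float, sampling_ratio: float) -> float:
    """Average number of traces kept per day under head-based sampling."""
    if not 0.0 <= sampling_ratio <= 1.0:
        raise ValueError("sampling_ratio must be between 0 and 1")
    return requests_per_day * sampling_ratio


# Hypothetical daily request volumes per service.
services = {"checkout": 2_000_000, "internal-admin": 1_000, "nightly-report": 50}

for name, volume in services.items():
    kept = expected_traces_per_day(volume, 0.001)  # 0.1% sampling
    print(f"{name}: about {kept:g} traces per day")
```

A service handling 1,000 requests a day keeps about one trace a day at 0.1%. A developer who opens the backend during an incident and finds nothing will reasonably conclude the data cannot be trusted.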

The observability platform is production infrastructure. It needs uptime monitoring, alerting, and an on-call owner. A platform team that carefully monitors the services developers build but does not monitor the collector and backend is building a reliability blind spot in the worst possible place.
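A minimal sketch of the kind of alert condition this implies, assuming you can read counts of spans the collector exported and spans it failed to export over a time window. The collector's own internal telemetry can typically provide these, but metric names vary by version and are deliberately not named here. The 1% threshold is an arbitrary placeholder.

```python
def export_failure_ratio(sent_spans: int, failed_spans: int) -> float:
    """Share of spans the collector failed to export in a time window."""
    total = sent_spans + failed_spans
    return failed_spans / total if total else 0.0


def should_page(sent_spans: int, failed_spans: int, threshold: float = 0.01) -> bool:
    """Page the on-call owner if exports are failing or nothing is flowing at all."""
    if sent_spans + failed_spans == 0:
        # No telemetry in the window usually means the pipeline itself is down.
        return True
    return export_failure_ratio(sent_spans, failed_spans) > threshold


# Hypothetical five-minute window: 99,000 spans exported, 1,500 failed.
print(should_page(sent_spans=99_000, failed_spans=1_500))
```

The "nothing is flowing" branch matters as much as the failure ratio. A collector that has silently stopped receiving data reports no errors at all.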

For the full picture of what belongs in a platform observability standard before this layer is worth building, read Observability for Platform Teams. For how SLOs and error budgets connect to the reliability program, see Error budgets in practice and the Reliability Program.

Matías Caniglia

Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.
