Observability for Platform Teams: What You Instrument Before Developers Ask You To
TL;DR. Developers ask for observability after a 3am incident where nobody could tell which service broke. Platform teams build it before that happens. The sequence is: RED metrics everywhere first, structured logs second, distributed tracing third. Each layer is only useful once the previous one is reliable.
The conversation goes the same way every time. An engineer is on call. Something is degraded. They open the monitoring stack and find either nothing (no dashboard for this service) or something worse: five dashboards with inconsistent metric names, three different log formats, and no way to correlate a user-facing error to a root cause. By the end of the incident, the platform team has a new ticket in its backlog: "add observability for service X."
Platform teams that wait for this conversation are always behind. The better position is to build the standard before any team needs it, so that deploying through the golden path automatically means being observable from day one.
What observability means when the platform team owns it
Observability is the ability to understand system state from external outputs — logs, metrics, traces — without needing to modify the code to ask a new question. The classic framing comes from control systems theory: an observable system is one where you can reconstruct internal state from the signals it produces.
For a platform team at a Series A–C company, this framing has a practical implication. The platform team does not own every service's internal state. It owns the standard by which services produce signals, the infrastructure that collects and stores those signals, and the tooling that lets engineers query them. Product teams consume that standard. They do not build it from scratch.
This division matters because without a platform-owned standard, you get drift. Team A uses Datadog. Team B uses a custom Prometheus instance. Team C writes logs as unstructured text. All three decisions are reasonable in isolation. In aggregate, they make cross-service incident investigation much harder than it needs to be.
The three signals — and what the platform team provides for each
Metrics
Metrics are the first thing to establish because they are the cheapest to produce and the fastest to query during an incident. The baseline the platform team provides is a Prometheus exporter in the standard service scaffold, with Grafana dashboards templated around RED metrics: Rate, Errors, Duration.
RED metrics work because they answer the three questions an on-call engineer asks first. How many requests is this service handling? What fraction are failing? How long are they taking? These three questions cover most incident triage without requiring any service-specific knowledge.
The dashboard template is important. If the platform team provides RED dashboards for every service automatically, on-call engineers learn one dashboard shape and can navigate to any service they are unfamiliar with. If each team builds its own, on-call engineers spend the first ten minutes of every incident finding the right dashboard.
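To make the three RED questions concrete, here is a minimal Python sketch that computes them from a batch of request records. In a real setup these numbers would come from Prometheus counters and histograms rather than in-memory lists; the `Request` record and the `red_summary` function are hypothetical names used only for illustration, and treating status codes of 500 and above as errors is an assumption.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One handled request (hypothetical record shape, for illustration only)."""
    duration_ms: float
    status: int


def red_summary(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration for one service over one time window."""
    total = len(requests)
    # Assumption: 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]

    if total >= 2:
        # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
        p95 = quantiles(durations, n=100)[94]
    elif total == 1:
        p95 = durations[0]
    else:
        p95 = 0.0

    return {
        "rate_per_s": total / window_seconds,          # How many requests?
        "error_ratio": errors / total if total else 0.0,  # What fraction fail?
        "p95_ms": p95,                                  # How long do they take?
    }
```

The point of the sketch is the shape of the output: the same three numbers for every service, which is what lets one dashboard template cover all of them.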
Logs
Structured logging with mandatory fields is harder to enforce but pays off immediately when you need to correlate logs across services. The standard the platform team defines: JSON format, with at minimum a timestamp, severity level, service name, trace ID, and a message field. The trace ID field is what makes logs useful when you have distributed tracing later. Without it, you cannot connect a log entry to the request that produced it.
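A logging standard like this can be shipped as a small helper in the service scaffold. The following is a minimal sketch using only Python's standard `logging` module. The JSON key names (`timestamp`, `severity`, `service`, `trace_id`, `message`) and the `JsonFormatter` and `build_logger` names are assumptions chosen to mirror the mandatory fields listed above, not an established schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with the mandatory fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "severity": record.levelname,
            "service": self.service,
            # trace_id is supplied per call via `extra`; empty when no trace is active.
            "trace_id": getattr(record, "trace_id", ""),
            "message": record.getMessage(),
        })


def build_logger(service: str, stream=None) -> logging.Logger:
    """Return a logger that writes one JSON line per event (stderr by default)."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]   # replace any handlers from earlier configuration
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate lines through the root logger
    return logger
```

A service would call `build_logger("checkout")` once at startup and then log with `log.info("payment authorized", extra={"trace_id": trace_id})`. Keeping the trace ID in the schema from the start means the field is already present when tracing arrives.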
The platform team also owns the aggregation pipeline — Loki is a reasonable choice for organizations already on Grafana because it integrates without an additional vendor relationship — and the retention policy. Retention is often overlooked until someone asks for logs from an incident that happened three weeks ago and finds they no longer exist. Define the policy before that conversation.
Traces
Distributed tracing is the most powerful of the three signals and the most expensive to operate correctly. It is also the one that requires the other two to already be working well. If your metrics are unreliable or your logs are unstructured, fixing those is a better use of time than adding a tracing pipeline.
What the platform team provides for traces: an OpenTelemetry collector as the collection standard, a sampling policy (head-based sampling is a reasonable starting point), and a visualization backend. Grafana Tempo integrates well if you are already in the Grafana stack. Jaeger is a reasonable, self-contained alternative.
The collector is the component that matters most for platform-owned observability. If product teams send their traces to the collector and the collector routes them to the backend, you can change backends later without touching every service. Coupling product services directly to a tracing backend is an anti-pattern that creates migration work later.
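On the sampling policy: head-based sampling decides at the start of a trace whether to keep it, usually by keeping a fixed fraction. The sketch below shows one way such a decision can be made deterministic, so every service that sees the same trace ID reaches the same answer. It is an illustration of the idea, not the OpenTelemetry implementation; the `should_sample` name and the choice to compare the low 64 bits of the ID against a threshold are assumptions.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Head-based decision: keep roughly `ratio` of traces, chosen by trace ID.

    trace_id_hex is a 32-character hex trace ID; ratio is between 0.0 and 1.0.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the 128-bit trace ID as a uniform random value.
    low_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    # Keep the trace when that value falls below ratio * 2^64.
    return low_bits < int(ratio * 2**64)
```

Because the decision depends only on the trace ID, a trace is either kept in full or dropped in full, which avoids partially recorded traces. The trade-off is that the decision is made before anyone knows whether the request will turn out to be slow or failing.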
Three layers of ownership
Ownership of observability is not monolithic. The platform team owns some of it. Product teams own some of it. Both own the middle layer together.
Platform-owned: the collection infrastructure, storage, retention, the default dashboards, and the alerting framework. These are shared infrastructure. No product team should be building their own Prometheus instance or Loki pipeline. Duplication here is waste, and inconsistency makes cross-service investigation harder.
Shared: SLO tooling, alerting rules for SLOs, runbook templates. The platform team builds the tooling that lets product teams define SLOs. Pyrra and Sloth are reasonable options for generating multi-window, multi-burn-rate alert rules from a simple SLO definition. Product teams define their SLOs. Both agree on what an alert means and what the on-call response looks like.
Team-owned: specific alerts, custom dashboards, service-level SLOs. The product team knows what matters for their service in ways the platform team does not. The platform team should not be in the business of writing service-specific alerts. It should provide the tooling and templates that make writing those alerts straightforward.
The boundary matters because without it, the platform team gets pulled into writing alerts for every service, which does not scale, or product teams build their own alerting infrastructure, which produces the drift problem described above.
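The multi-window multi-burn-rate rules mentioned under the shared layer are easier to reason about with the arithmetic written out. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a page fires only when both a long window and a short window exceed the threshold. The sketch below is a simplified illustration, not what Pyrra or Sloth generate; the function names are hypothetical, and the default threshold of 14.4 is the commonly cited value for a 1-hour window paired with a 5-minute window against a 30-day SLO, used here as an assumption.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Fire only when both the long and the short window burn above the threshold."""
    return (burn_rate(long_window_errors, slo_target) > threshold
            and burn_rate(short_window_errors, slo_target) > threshold)
```

With a 99.9% SLO, a sustained 2% error ratio in both windows burns the budget at roughly 20 times the sustainable rate and pages. If the short window has already recovered, the alert stays quiet, which is the reason for pairing the two windows.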
What happens without this
Each team installs monitoring differently. One team configures a custom Prometheus with non-standard metric names. Another team sends logs to a third-party vendor the rest of the organization does not use. A third team has no tracing at all. This is not a hypothetical. It is what most Series B engineering organizations look like before their first serious reliability investment.
The consequence shows up during incidents. An on-call engineer inheriting a service they did not build has no consistent place to start investigation. Cross-service issues — a downstream service degrading, an upstream dependency slowing down — are invisible until engineers from both teams are on the same call trying to correlate signals from incompatible systems.
Incident response time in this state is dominated by investigation rather than recovery. Engineers can spend 45 minutes figuring out what broke before they spend 10 minutes fixing it. After several incidents of this shape, senior engineers start leaving because on-call is unsustainable.
Where to start
For a Series A–C company, the right sequence is RED metrics and structured logs before distributed tracing. Get the basics working everywhere before building advanced capabilities.
This is not a concession to resource constraints. It reflects what actually produces value first. RED metrics give you the triage baseline that every on-call rotation needs. Structured logs with consistent fields give you the correlation capability that makes investigation tractable. Distributed tracing is the layer that answers "why did this request take 4 seconds instead of 200 milliseconds" — a valuable question, but one you can only ask coherently once you know which requests are slow and which service they touched.
The practical order:
1. Add a Prometheus exporter to the standard service scaffold. Template one RED dashboard that all services share.
2. Define the structured logging standard. Enforce it in the build process, not via code review.
3. Deploy a Loki pipeline. Define retention.
4. Add an OpenTelemetry collector. Configure head-based sampling to start. Point it at Grafana Tempo.
5. Update the golden path so that new services produce all three signals from day one without developer action.
Step 5 is the compounding investment. Once the golden path produces observability automatically, the platform team's maintenance burden is managing the standard, not chasing each team to adopt it.
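Enforcing the logging standard in the build process, as the list above recommends, can be as simple as a check that runs against log output captured during a smoke test. The following is a minimal sketch under that assumption. The field names in `REQUIRED_FIELDS` and the `missing_fields` function are hypothetical and would need to match whatever schema the platform team actually defines.

```python
import json

# Assumed key names for the mandatory fields; adjust to the real standard.
REQUIRED_FIELDS = ("timestamp", "severity", "service", "trace_id", "message")


def missing_fields(log_line: str) -> list[str]:
    """Return the mandatory fields absent from one log line.

    A line that is not a JSON object is treated as missing every field.
    """
    try:
        entry = json.loads(log_line)
    except json.JSONDecodeError:
        return list(REQUIRED_FIELDS)
    if not isinstance(entry, dict):
        return list(REQUIRED_FIELDS)
    return [field for field in REQUIRED_FIELDS if field not in entry]
```

A CI job could run this over every captured line and fail the build when any call returns a non-empty list. That moves enforcement out of code review and into a gate that no team has to remember.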
For how reliability practices connect to SLO design and error budget management, read "SLO design guide" and "Error budgets in practice". For how to wire this into a broader reliability program, see the Reliability Program.

Mat Caniglia
Founder of Clouditive. 18+ years transforming engineering organizations across LATAM and globally through Developer Experience consulting.