SRE · 13 min read · Jun 2026

Observability Cost Management

Telemetry bills can rival your infra spend. Here are the cost drivers (high-cardinality metrics, log volume, trace retention) and the levers to cut spend without going blind.

Tags: SRE · Observability · Cost · FinOps

Sri Balaji

Founder · TheSimplifiedTech


The observability bill that doubled the infra bill

Six months after rolling out a shiny new observability stack, a team I worked with opened their vendor invoice and stopped breathing. The compute, storage, and networking for the entire product ran about $38k a month. The telemetry describing that product? $41k. Their monitoring cost more than the thing it was monitoring.

Nobody decided this. It accreted. Every new service shipped with default instrumentation. Every engineer added "just one more label" to a metric. Every incident ended with "let's log more so we catch it next time." Each choice was locally reasonable. The bill was the sum of a thousand reasonable choices nobody was measuring.

Observability is not free, and it does not scale linearly with usefulness. The first 20% of your telemetry answers 80% of your questions. The last 80% mostly sits in expensive storage, never queried, quietly compounding. This article is about keeping the signal and cutting the spend: the cost drivers, and the levers you actually have.

Who this is for

SREs, platform engineers, and tech leads who own a telemetry budget, or just got handed a vendor invoice and a mandate to cut it. Assumes you already run metrics, logs, and traces (see [Observability: Metrics, Logs & Traces](/blog/observability-metrics-logs-traces)) and now need to make them affordable without going blind.

The principle: instrument what you'll act on

Instrument what you'll act on, and sample the rest. Telemetry you never query is just a storage bill with extra steps.

Cost management in observability is not about collecting less; it's about collecting with intent. Every signal you keep should answer a question someone will actually ask: a dashboard you watch, an alert that pages, a query you run during an incident. If a metric, log line, or trace will never inform a decision, it is pure cost with zero return.

The reframe that unlocks the savings: you do not need every data point to understand a system; you need a *representative* sample. On a high-traffic service, a 1% sample of traces estimates your p99 latency nearly as well as 100% retention, at roughly a hundredth of the cost. The estimate gets noisier as volume drops, so low-traffic services need a higher rate. The skill is knowing which 1% to keep and which 99% to drop.
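
The "representative sample" idea is easy to sanity-check. The sketch below is a minimal illustration, not a benchmark: it uses synthetic, log-normally distributed latencies (an assumption standing in for real traffic) and compares the p99 of the full set with the p99 of a uniform 1% sample. The function names and parameters are hypothetical.

```python
import math
import random

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]

def compare_p99(num_requests=200_000, sample_rate=0.01, seed=7):
    """Return (full p99, sampled p99, traces kept) for synthetic latencies."""
    rng = random.Random(seed)
    # Synthetic latencies in ms; the log-normal shape is an assumption, not real data.
    latencies = [rng.lognormvariate(4.0, 0.5) for _ in range(num_requests)]
    # Uniform sampling: every request has the same chance of being kept.
    sampled = [x for x in latencies if rng.random() < sample_rate]
    return percentile(latencies, 99), percentile(sampled, 99), len(sampled)

full_p99, sampled_p99, kept = compare_p99()
print(f"full p99={full_p99:.0f}ms  sampled p99={sampled_p99:.0f}ms  kept={kept} traces")
```

Try lowering `num_requests` to a few thousand: the sampled p99 then rests on a handful of traces and swings widely, which is why a flat 1% is only safe where volume is high.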

| Documentary analogy | Observability practice |
| --- | --- |
| A documentary crew films selectively and keeps the telling moments | Sampling: capture a representative slice, not every frame |
| You keep raw footage a few weeks, the final cut forever | Retention tiers: hot for days, cold/aggregated for months |
| You don't mic every extra in the crowd | Cardinality control: don't label every unique user/request |
| The edit room cuts the boring hours before archiving | Filtering & aggregation: drop noise before it hits storage |

Observability cost is a documentary budget, not a surveillance budget.

The picture: from firehose to cheap storage

Cost lives in the pipeline between where telemetry is *born* and where it is *stored*. The expensive mistake is wiring every source straight into hot, indexed, fully-retained storage. The cheap design puts a processing layer (sampling, filtering, aggregation) between the firehose and the bill.

[Diagram: App + SDKs (metrics, logs, traces) and Infra agents (node, k8s, proxy) → OTel Collector (processing layer) → Sample / Filter (tail-sample, drop debug) → Aggregate (recording rules, rollups) → Hot tier (7-14d, indexed) → roll off, downsampled → Cold / archive (90d+, object store)]

Telemetry sources flow through a processing layer that samples, filters, and aggregates before anything reaches storage, and the survivors land in tiers priced by how fast you need them.

  1. Collect at the edge. Apps and infra agents emit to a local OTel Collector, not directly to the vendor. The collector is your single control point for everything downstream.
  2. Sample and filter before egress. Tail-sample traces (keep the errors and slow ones), drop debug logs in prod, and discard health-check spans. This is where most of the savings happen, before the data ever leaves your network.
  3. Aggregate high-volume metrics. Pre-compute the time series you actually graph with recording rules, so you store rollups instead of raw per-request cardinality.
  4. Land in the cheapest tier that answers the question. Recent, queryable data goes hot for a week or two; everything older rolls to cheap object storage, downsampled. You pay premium prices only for data you'd query at 3am.
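
To make "downsampled" in the last step concrete, here is a minimal sketch of a rollup, assuming raw metric points arrive as `(unix_timestamp, value)` pairs. The `downsample` function and the 5-minute bucket size are illustrative choices, not something a particular backend prescribes.

```python
from collections import defaultdict

def downsample(points, bucket_seconds=300):
    """Roll raw (unix_ts, value) points up into per-bucket averages."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)  # align to the bucket boundary
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]

# One hour of synthetic data at 15-second resolution.
raw = [(t, float(t % 60)) for t in range(0, 3600, 15)]
rolled = downsample(raw)
print(len(raw), "raw points ->", len(rolled), "rolled-up points")
```

Averaging alone flattens spikes, so a real rollup would usually keep min, max, and count per bucket as well. The point is the ratio: cold-tier data can be an order of magnitude smaller and still answer trend questions.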

The three cost drivers and their levers

Almost every runaway telemetry bill traces back to three drivers. Each has a direct, well-understood lever. Find which driver dominates your invoice first: vendors bill on different units (data points/min, GB ingested, spans retained), so the biggest line item tells you where to aim.

| Cost driver | Why it explodes | Lever |
| --- | --- | --- |
| High-cardinality metrics | Each unique label combination is a separate billable time series; one unbounded label (user_id, request_id) can mint millions | Cardinality control: drop/limit labels, aggregate away dimensions, set per-metric series limits |
| Log volume | Verbose levels and chatty libraries ingest GB/day; you pay per GB whether or not you ever search it | Log levels & filtering: INFO+ in prod, drop health checks, structure and sample repetitive lines |
| Trace retention | 100% retention at full fidelity stores millions of spans, most of which are healthy and identical | Sampling (head/tail) + retention tiers: keep errors and slow traces, sample the boring majority |

Match the cost driver to the lever that actually moves it.
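
The cardinality row is plain multiplication, and a worked example shows why one label matters so much. The sketch below computes the worst-case series count for a single metric as the product of distinct values per label. The label names and counts are made-up illustrations.

```python
from math import prod

def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Hypothetical label cardinalities for one HTTP request metric.
bounded = {"service": 20, "route": 50, "status_class": 5}
leaky = {**bounded, "user_id": 100_000}

print(series_count(bounded))  # 5000
print(series_count(leaky))    # 500000000
```

This is an upper bound: a backend only stores the combinations that actually occur. Even if a small fraction of them show up, adding one unbounded label still multiplies the series count by orders of magnitude.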

Config: tail-sampling + a cardinality cap

Here's an OpenTelemetry Collector pipeline that does the two highest-leverage things at once: tail-sample traces so you keep every error and slow request while dropping most healthy ones, and strip unbounded labels from metrics so a runaway label can't mint a million series. Drop this in front of your vendor exporter, and adjust the metric names, thresholds, and endpoint to your own setup.

otel-collector-config.yaml

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  # --- Trace tail-sampling: decide AFTER the full trace is assembled ---
  # A trace is kept if ANY policy below votes to keep it.
  tail_sampling:
    decision_wait: 10s          # buffer spans this long before deciding
    num_traces: 50000           # max traces held in memory while waiting
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow-requests
        type: latency
        latency: { threshold_ms: 500 }
      # everything else: keep a representative 5%
      - name: sample-the-rest
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

  # --- Metric cardinality control: aggregate away noisy labels ---
  metricstransform:
    transforms:
      - include: http_server_requests
        action: update
        operations:
          # keep only these labels; user_id / request_id are summed away
          - action: aggregate_labels
            label_set: [service, route, status_class]
            aggregation_type: sum

  # Safety net: drop any metric that still carries a request_id attribute.
  # This drops the whole metric; it is not a numeric series limit.
  filter/cardinality:
    metrics:
      metric:
        - 'HasAttrKeyOnDatapoint("request_id")'

  batch: {}

exporters:
  otlphttp/vendor:
    endpoint: https://otlp.vendor.example   # placeholder: your vendor's OTLP endpoint

service:
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [tail_sampling, batch]
      exporters:  [otlphttp/vendor]
    metrics:
      receivers:  [otlp]
      processors: [metricstransform, filter/cardinality, batch]
      exporters:  [otlphttp/vendor]
```

Order matters

Put sampling and filtering processors **before** the exporter in the pipeline. If you sample after egress, you've already paid to ship the data. The whole point is to shrink the stream before it crosses the billing boundary.

Head vs tail sampling

Sampling is the biggest lever for traces, but *when* you decide to keep a trace changes everything. The two strategies trade cost-predictability against signal-quality, and mature stacks often combine them.

| | Head sampling | Tail sampling |
| --- | --- | --- |
| When the decision is made | At the start, before the request runs | After the full trace completes |
| What it can see | Nothing yet, just a coin flip (or trace-id hash) | Errors, latency, the whole trace shape |
| Cost to operate | Near zero; drop at the SDK | Higher; must buffer spans in the collector |
| Risk | Drops errors and slow traces blindly | Memory pressure; needs all spans at one collector |
| Best for | High-volume, predictable cost ceiling | Keeping every error/slow trace that matters |

Head sampling is cheap and dumb; tail sampling is smart and pricier to run.

Head sampling flips a coin when the request arrives (keep 10%, drop 90%) before anything has happened. It's dirt cheap because you never even build the spans you drop. But it's blind: a request that's about to error or run slow gets dropped at the same rate as a healthy one. You save money and lose exactly the traces you most wanted.

Tail sampling waits until the trace finishes, then decides based on what actually happened: always keep errors, always keep anything over 500ms, sample the boring successful majority. You get the traces that matter at a fraction of full retention. The cost: the collector must buffer all spans of a trace in memory until the decision, which needs real resources and careful collector topology, because every span of a trace has to reach the same collector instance. The common pattern is to head-sample at the SDK to a survivable rate, then tail-sample in the collector for the keep-what-matters decision.
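
The two decisions can be sketched in a few lines. This is a simplified illustration, not how any particular SDK or collector implements them: spans are assumed to be dicts with hypothetical `status` and `duration_ms` keys, the slow check uses the longest span as a stand-in for total trace duration, and the hash-based coin flip is one common way to make head sampling deterministic per trace.

```python
import hashlib

def hash_fraction(trace_id: str, salt: str = "") -> float:
    """Map a trace ID to a roughly uniform fraction between 0 and 1."""
    digest = hashlib.sha256((salt + trace_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def head_keep(trace_id: str, rate: float) -> bool:
    """Head sampling: decide from the trace ID alone, before any span exists."""
    return hash_fraction(trace_id) < rate

def tail_keep(spans, trace_id, latency_threshold_ms=500, baseline_rate=0.05):
    """Tail sampling: decide over the completed trace."""
    if any(span["status"] == "ERROR" for span in spans):
        return True                                   # always keep errors
    if max(span["duration_ms"] for span in spans) >= latency_threshold_ms:
        return True                                   # always keep slow traces
    # Salted so this coin flip is independent of the head-sampling one.
    return hash_fraction(trace_id, salt="tail") < baseline_rate

healthy = [{"status": "OK", "duration_ms": 40}, {"status": "OK", "duration_ms": 12}]
failed = healthy + [{"status": "ERROR", "duration_ms": 8}]
print(tail_keep(failed, "trace-1"), head_keep("trace-1", 0.10))
```

Hashing the trace ID rather than calling a random generator means every service that sees the same trace makes the same head decision, so you don't end up with half-recorded traces.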

Common mistakes that cost thousands

  1. Logging everything at DEBUG in production. Debug logs are for debugging: locally, or behind a flag you flip during an incident. Shipping them 24/7 can inflate ingest by 5-10x for lines no one reads. Default to INFO in prod; make DEBUG runtime-togglable per service.
  2. Unbounded labels on metrics. Putting user_id, request_id, email, or full URLs in a metric label turns one time series into millions. Cardinality is the silent killer: a single unbounded dimension can 100x a metric's cost overnight. Label by bounded dimensions (service, route, status class) only.
  3. 100% trace retention "just in case." Healthy traces are nearly identical; storing all of them buys you almost nothing over a 5% sample plus every error. Full retention is the single most common reason a trace bill balloons. Sample the success path; keep the failures.
  4. One flat retention tier for everything. Keeping 90 days of raw, indexed, hot data because "we might need it" pays premium prices for cold data. Tier it: days hot, weeks warm, months cold and downsampled.
  5. No cost ownership. If no one watches the telemetry bill the way they watch the infra bill, it grows unchecked. Put it on a dashboard. Alert on ingest spikes the same way you alert on error spikes.
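
For the log-volume mistakes, one option is to sample repetitive lines at the source. The sketch below uses Python's standard `logging.Filter` hook. `RepeatSampler` and its `burst` and `every` parameters are hypothetical names for illustration, and the counter is deliberately naive (unbounded and not thread-safe).

```python
import logging

class RepeatSampler(logging.Filter):
    """Pass the first `burst` records per message template, then every `every`-th."""

    def __init__(self, burst=5, every=100):
        super().__init__()
        self.burst = burst
        self.every = every
        self.counts = {}

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True                       # never sample warnings or errors
        key = (record.name, record.msg)       # unformatted template, not the final text
        count = self.counts.get(key, 0) + 1
        self.counts[key] = count
        return count <= self.burst or count % self.every == 0

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)                 # INFO by default; DEBUG only when toggled
logger.addFilter(RepeatSampler())
```

Keying on the unformatted template groups "cache miss for %s" lines together no matter which argument they carry, which is usually what "repetitive" means in practice.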

Watch out

A sudden ingest spike is often a *bug*, not growth: a new label leaking request IDs, a retry storm logging every attempt, a debug flag left on after an incident. Alert on telemetry volume itself so a runaway dimension pages you before it pages your invoice.
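
A volume alert does not need to be sophisticated. As a minimal sketch, assuming you can pull daily ingest in GB from your vendor's usage data, the check below compares today against the median of recent days. The function name, the 2x threshold, and the sample figures are all illustrative.

```python
from statistics import median

def ingest_spike(history_gb, current_gb, ratio_threshold=2.0):
    """Flag the current interval if it exceeds the median of recent history by a set ratio."""
    if not history_gb:
        return False                          # no baseline yet, nothing to compare
    baseline = median(history_gb)
    return baseline > 0 and current_gb / baseline >= ratio_threshold

last_week = [41.0, 39.5, 40.2, 42.1, 38.9, 40.7, 41.3]  # hypothetical daily ingest in GB
print(ingest_spike(last_week, 43.0))   # False
print(ingest_spike(last_week, 95.0))   # True
```

The median keeps one earlier bad day from inflating the baseline and masking the next spike. Running the same check per service or per metric name points you at the offender rather than just the total.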

Takeaways

The whole article in seven lines

  • Telemetry cost can rival infra cost, and nobody decides it; it accretes.
  • Instrument what you'll act on; sample the rest. Unqueried data is pure cost.
  • Three drivers dominate: metric cardinality, log volume, trace retention.
  • Put a processing layer (OTel Collector) between sources and storage; sample and filter before egress.
  • Head sampling is cheap and blind; tail sampling keeps errors and slow traces at a fraction of full cost.
  • Tier retention: hot for days, cold/downsampled for months. Pay premium only for data you'd query at 3am.
  • Own the bill: dashboard it, and alert on ingest spikes like you alert on errors.

Where to go next

Cost management is the discipline you layer on *after* you have working observability. If the signals themselves are still fuzzy, start with the fundamentals and the instrumentation tooling, then come back to make them affordable.
