Observability Cost Management

On this page

The observability bill that doubled the infra bill
The principle: instrument what you'll act on
The picture: from firehose to cheap storage
The three cost drivers and their levers
Config: tail-sampling + a cardinality cap
Head vs tail sampling
Common mistakes that cost thousands
Takeaways
Where to go next

TL;DR

Stop telemetry bills from rivaling your infra spend. Know the three cost drivers (high-cardinality metrics, log volume, and trace retention), then pull levers like tail-sampling and cardinality caps to cut cost without going blind.

The observability bill that doubled the infra bill

Six months after rolling out a shiny new observability stack, a team I worked with opened their vendor invoice and stopped breathing. The compute, storage, and networking for the entire product ran about $38k a month. The telemetry describing that product? $41k. Their monitoring cost more than the thing it was monitoring.

Nobody decided this. It accreted. Every new service shipped with default instrumentation. Every engineer added "just one more label" to a metric. Every incident ended with "let's log more so we catch it next time." Each choice was locally reasonable. The bill was the sum of a thousand reasonable choices nobody was measuring.

Observability is not free, and it does not scale linearly with usefulness. The first 20% of your telemetry answers 80% of your questions. The last 80% mostly sits in expensive storage, never queried, quietly compounding. This article is about keeping the signal while cutting the spend: the cost drivers, and the levers you actually have.

Who this is for

SREs, platform engineers, and tech leads who own a telemetry budget, or just got handed a vendor invoice and a mandate to cut it. Assumes you already run metrics, logs, and traces (see Observability: Metrics, Logs & Traces) and now need to make them affordable without going blind.

The principle: instrument what you'll act on

Instrument what you'll act on, and sample the rest. Telemetry you never query is just a storage bill with extra steps.

Cost management in observability is not about collecting less; it's about collecting with intent. Every signal you keep should answer a question someone will actually ask: a dashboard you watch, an alert that pages, a query you run during an incident. If a metric, log line, or trace will never inform a decision, it is pure cost with zero return.

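The reframe that unlocks the savings: you do not need every data point to understand a system, you need a *representative* one. On a high-traffic service, a uniform 1% sample of traces estimates your p99 latency almost as well as 100% retention, at roughly a hundredth of the storage cost. (The estimate gets noisier as traffic drops, and a sample deliberately biased toward slow traces will overstate percentiles, so take latency percentiles from a uniform sample or from metrics.) The skill is knowing which 1% to keep and which 99% to drop.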
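To make the sampling claim concrete, here is a minimal sketch that compares the p99 of a full set of latencies with the p99 of a uniform 1% sample. The data is synthetic (a log-normal distribution picked purely for illustration), and the `p99` helper is a hypothetical nearest-rank implementation, not any vendor's percentile algorithm.

```python
import random

def p99(values):
    """Return the 99th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, -(-99 * len(ordered) // 100))  # integer ceil(0.99 * n)
    return ordered[rank - 1]

rng = random.Random(42)  # fixed seed so the run is repeatable

# Synthetic request latencies in ms (hypothetical data, median around 100 ms).
latencies = [rng.lognormvariate(4.6, 0.5) for _ in range(200_000)]

# Uniform 1% sample: every request has the same chance of being kept.
sample = [x for x in latencies if rng.random() < 0.01]

full_p99 = p99(latencies)
sample_p99 = p99(sample)
rel_error = abs(sample_p99 - full_p99) / full_p99

print(f"kept {len(sample)} of {len(latencies)} requests")
print(f"full p99 = {full_p99:.0f} ms, sampled p99 = {sample_p99:.0f} ms")
print(f"relative error = {rel_error:.1%}")
```

With a few thousand sampled requests the two numbers should land close together; rerun it with a much smaller population to see how quickly the tail estimate degrades on a low-traffic service.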

A documentary crew films selectively and keeps the telling moments	Sampling: capture a representative slice, not every frame
You keep raw footage a few weeks, the final cut forever	Retention tiers: hot for days, cold/aggregated for months
You don't mic every extra in the crowd	Cardinality control: don't label every unique user/request
The edit room cuts the boring hours before archiving	Filtering & aggregation: drop noise before it hits storage

Observability cost is a documentary budget, not a surveillance budget.

The picture: from firehose to cheap storage

Cost lives in the pipeline between where telemetry is *born* and where it is *stored*. The expensive mistake is wiring every source straight into hot, indexed, fully-retained storage. The cheap design puts a processing layer (sampling, filtering, aggregation) between the firehose and the bill.

Telemetry sources flow through a processing layer that samples, filters, and aggregates before anything reaches storage, and the survivors land in tiers priced by how fast you need them.

1. Collect at the edge. Apps and infra agents emit to a local OTel Collector, not directly to the vendor. The collector is your single control point for everything downstream.
2. Sample and filter before egress. Tail-sample traces (keep the errors and slow ones), drop debug logs in prod, and discard health-check spans. This is where most of the savings happen, before the data ever leaves your network.
3. Aggregate high-volume metrics. Pre-compute the time series you actually graph with recording rules, so you store rollups instead of raw per-request cardinality.
4. Land in the cheapest tier that answers the question. Recent, queryable data goes hot for a week or two; everything older rolls to cheap object storage, downsampled. You pay premium prices only for data you'd query at 3am.
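The "filter before egress" step can be sketched as a simple keep-or-drop predicate. This is an illustration of the idea only: the `LogRecord` shape, the level table, and the health-check paths are all assumptions made up for the example, and in practice this logic would live in collector configuration rather than application code.

```python
from dataclasses import dataclass

# Hypothetical severity ordering and health-check paths, for illustration only.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}
HEALTH_PATHS = {"/healthz", "/readyz"}

@dataclass
class LogRecord:
    level: str
    path: str
    message: str

def should_ship(record: LogRecord, min_level: str = "INFO") -> bool:
    """Return True if the record is worth paying to ingest."""
    if LEVELS[record.level] < LEVELS[min_level]:
        return False  # below the production log level
    if record.path in HEALTH_PATHS and LEVELS[record.level] < LEVELS["WARN"]:
        return False  # routine health-check chatter
    return True

records = [
    LogRecord("DEBUG", "/checkout", "cart contents dumped"),
    LogRecord("INFO", "/healthz", "ok"),
    LogRecord("INFO", "/checkout", "order placed"),
    LogRecord("ERROR", "/healthz", "dependency unreachable"),
]
shipped = [r for r in records if should_ship(r)]
print(f"shipped {len(shipped)} of {len(records)} records")
```

Failing health checks are kept on purpose: the goal is to drop the noise, not the one line that explains an outage.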

The three cost drivers and their levers

Almost every runaway telemetry bill traces back to three drivers. Each has a direct, well-understood lever. Find which driver dominates your invoice first: vendors bill on different units (data points/min, GB ingested, spans retained), so the biggest line item tells you where to aim.

Cost driver	Why it explodes	Lever
High-cardinality metrics	Each unique label combination is a separate billable time series; one unbounded label (user_id, request_id) can mint millions	Cardinality control: drop/limit labels, aggregate away dimensions, set per-metric series limits
Log volume	Verbose levels and chatty libraries ingest GB/day; you pay per GB whether or not you ever search it	Log levels & filtering: INFO+ in prod, drop health checks, structure and sample repetitive lines
Trace retention	100% retention at full fidelity stores millions of spans, most of which are healthy and identical	Sampling (head/tail) + retention tiers: keep errors and slow traces, sample the boring majority

Match the cost driver to the lever that actually moves it.
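The cardinality row is easiest to appreciate as arithmetic. The sketch below computes the worst-case series count for one metric as the product of distinct values per label. The label names and counts are hypothetical, and real series counts are usually lower because not every combination occurs.

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Hypothetical bounded labels for a single request-count metric.
bounded = {"service": 20, "route": 50, "status_class": 5}
print(series_count(bounded))  # 5000

# The same metric after someone adds one unbounded label.
leaky = {**bounded, "user_id": 100_000}
print(series_count(leaky))    # 500000000
```

One extra label multiplies the ceiling rather than adding to it, which is why a single leaked identifier can dominate the metrics line item.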

Config: tail-sampling + a cardinality cap

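Here's an OpenTelemetry Collector pipeline fragment that does the two highest-leverage things at once: tail-sample traces so you keep every error and slow request while dropping most healthy ones, and control metric cardinality by aggregating away unbounded labels and dropping any metric that still carries one. The tail_sampling, metricstransform, and filter processors ship in the Collector's contrib distribution, and the fragment assumes the receiver and exporter it references (otlp and otlphttp/vendor) are declared elsewhere in the same file. Put it in front of your vendor exporter.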

otel-collector-config.yaml

yaml

processors:
  # --- Trace tail-sampling: decide AFTER the full trace is assembled ---
  tail_sampling:
    decision_wait: 10s          # buffer spans this long before deciding
    num_traces: 50000           # max traces held in memory awaiting a decision
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow-requests
        type: latency
        latency: { threshold_ms: 500 }
      # everything else: keep a representative 5%
      - name: sample-the-rest
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

  # --- Metric cardinality control: aggregate away noisy labels ---
  metricstransform:
    transforms:
      - include: http_server_requests
        action: update
        operations:
          # keep only the listed labels; user_id / request_id are summed away
          - action: aggregate_labels
            label_set: [service, route, status_class]
            aggregation_type: sum

  # guardrail, not a global series limit: drop any metric that still
  # carries a request_id attribute on one of its datapoints
  filter/cardinality:
    error_mode: ignore
    metrics:
      metric:
        - 'HasAttrKeyOnDatapoint("request_id")'

  # referenced by both pipelines below, so it has to be declared here
  batch: {}

service:
  pipelines:
    traces:
      receivers:  [otlp]
      processors: [tail_sampling, batch]
      exporters:  [otlphttp/vendor]
    metrics:
      receivers:  [otlp]
      processors: [metricstransform, filter/cardinality, batch]
      exporters:  [otlphttp/vendor]

Order matters

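Processors run in the order they are listed, and all of them run before the exporter. List sampling and filtering ahead of batch so you only batch what you intend to ship, and do the dropping in your own collector: if you sample after egress, you've already paid to ship the data. The whole point is to shrink the stream before it crosses the billing boundary.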
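What the aggregate_labels operation in the config above does to a metric can be sketched in a few lines: discard every label outside the kept set, then sum the datapoints that now share the same labels. This is a conceptual illustration with made-up datapoints, not the processor's actual implementation.

```python
from collections import defaultdict

def aggregate_labels(datapoints, keep):
    """Sum datapoint values after discarding every label not listed in `keep`."""
    totals = defaultdict(float)
    for labels, value in datapoints:
        key = tuple((name, labels[name]) for name in keep if name in labels)
        totals[key] += value
    return dict(totals)

# Hypothetical datapoints: (labels, request count). user_id makes each one unique.
points = [
    ({"service": "api", "route": "/cart", "status_class": "2xx", "user_id": "u1"}, 3),
    ({"service": "api", "route": "/cart", "status_class": "2xx", "user_id": "u2"}, 4),
    ({"service": "api", "route": "/cart", "status_class": "5xx", "user_id": "u1"}, 1),
]

rolled_up = aggregate_labels(points, keep=["service", "route", "status_class"])
print(len(points), "series in,", len(rolled_up), "series out")  # 3 series in, 2 series out
```

The totals are preserved; only the ability to slice by user is lost, which is a question better answered by traces or logs anyway.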

Head vs tail sampling

Sampling is the biggest lever for traces, but *when* you decide to keep a trace changes everything. The two strategies trade cost-predictability against signal-quality, and mature stacks often combine them.

	Head sampling	Tail sampling
When the decision is made	At the start, before the request runs	After the full trace completes
What it can see	Nothing yet, just a coin flip (or trace-id hash)	Errors, latency, the whole trace shape
Cost to operate	Near zero: drop at the SDK	Higher: must buffer spans in the collector
Risk	Drops errors and slow traces blindly	Memory pressure; needs all spans at one collector
Best for	High-volume, predictable cost ceiling	Keeping every error/slow trace that matters

Head sampling is cheap and dumb; tail sampling is smart and pricier to run.

Head sampling flips a coin when the request arrives (keep 10%, drop 90%) before anything has happened. It's dirt cheap because you never even build the spans you drop. But it's blind: a request that's about to error or run slow gets dropped at the same rate as a healthy one. You save money and lose exactly the traces you most wanted.
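A minimal sketch of that coin flip, assuming the decision is derived from a hash of the trace ID so that every service seeing the same trace reaches the same verdict. OpenTelemetry SDKs provide a trace-ID-ratio sampler for this purpose; the SHA-256 hashing and the `head_sample` name below are stand-ins chosen for illustration, not the SDK's algorithm.

```python
import hashlib

def head_sample(trace_id: str, keep_ratio: float) -> bool:
    """Decide at request start, from the trace ID alone, whether to record the trace."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < keep_ratio

# Hypothetical trace IDs; real ones are 128-bit values generated by the SDK.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(head_sample(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Note what the function never receives: status, duration, or anything else about how the request turned out. That is the blindness described above.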

Tail sampling waits until the trace finishes, then decides based on what actually happened: always keep errors, always keep anything over 500ms, sample the boring successful majority. You get the traces that matter at a fraction of full retention. The cost: the collector must buffer all spans of a trace in memory until the decision, which needs real resources and careful collector topology, because every span of a given trace has to reach the same collector instance. The common pattern is head-sample at the SDK to a survivable rate, then tail-sample in the collector for the keep-what-matters decision. The trade-off of combining them is that the tail sampler only sees what survived head sampling, so errors dropped at the SDK are gone for good.
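The tail-sampling decision can be sketched as a function that runs once all spans of a trace are in hand. The `Span` shape and `tail_decision` name are hypothetical, the thresholds mirror the ones used in this article, and "slow" is simplified to the longest single span rather than the trace's end-to-end duration.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Span:
    duration_ms: float
    is_error: bool

def tail_decision(trace_id, spans, slow_ms=500, baseline_ratio=0.05):
    """Return (keep, reason) once every span of the trace has arrived."""
    if any(s.is_error for s in spans):
        return True, "error"  # always keep failures
    # Simplification: longest span stands in for end-to-end trace duration.
    if max((s.duration_ms for s in spans), default=0.0) >= slow_ms:
        return True, "slow"   # always keep slow requests
    # Healthy and fast: keep a deterministic baseline fraction.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_ratio, "baseline"

print(tail_decision("t1", [Span(20, False), Span(35, True)]))  # (True, 'error')
print(tail_decision("t2", [Span(750, False)]))                 # (True, 'slow')
```

Unlike the head-sampling coin flip, this function needs the complete list of spans as input, which is exactly why the collector has to buffer them.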

Common mistakes that cost thousands

1. Logging everything at DEBUG in production. Debug logs are for debugging: locally, or behind a flag you flip during an incident. Shipping them 24/7 can inflate ingest by 5-10x for lines no one reads. Default to INFO in prod; make DEBUG runtime-togglable per service.
2. Unbounded labels on metrics. Putting user_id, request_id, email, or full URLs in a metric label turns one time series into millions. Cardinality is the silent killer: a single unbounded dimension can 100x a metric's cost overnight. Label by bounded dimensions (service, route, status class) only.
3. 100% trace retention "just in case." Healthy traces are nearly identical; storing all of them buys you almost nothing over a 5% sample plus every error. Full retention is one of the most common reasons a trace bill balloons. Sample the success path; keep the failures.
4. One flat retention tier for everything. Keeping 90 days of raw, indexed, hot data because "we might need it" pays premium prices for cold data. Tier it: days hot, weeks warm, months cold and downsampled.
5. No cost ownership. If no one watches the telemetry bill the way they watch the infra bill, it grows unchecked. Put it on a dashboard. Alert on ingest spikes the same way you alert on error spikes.

Watch out

A sudden ingest spike is often a *bug*, not growth: a new label leaking request IDs, a retry storm logging every attempt, a debug flag left on after an incident. Alert on telemetry volume itself so a runaway dimension pages you before it pages your invoice.
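One simple way to alert on volume is to compare today's ingest with a trailing baseline. The sketch below uses the median of recent days and a multiplier; the function name, the factor of 2, and the sample numbers are all assumptions for illustration, and a real alert would more likely be a rule in your metrics backend.

```python
from statistics import median

def is_ingest_spike(history_gb, today_gb, factor=2.0):
    """Flag today's ingest if it exceeds `factor` times the trailing median."""
    if not history_gb:
        return False  # no baseline yet, so nothing to compare against
    return today_gb > factor * median(history_gb)

# Hypothetical daily ingest in GB for the previous seven days.
last_week = [118, 121, 125, 119, 130, 122, 127]

print(is_ingest_spike(last_week, 131))  # False
print(is_ingest_spike(last_week, 410))  # True
```

The median is used instead of the mean so that one earlier bad day does not raise the baseline and hide the next one.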

Takeaways

The whole article in seven lines

Telemetry cost can rival infra cost, and nobody decides it; it accretes.
Instrument what you'll act on; sample the rest. Unqueried data is pure cost.
Three drivers dominate: metric cardinality, log volume, trace retention.
Put a processing layer (OTel Collector) between sources and storage; sample and filter before egress.
Head sampling is cheap and blind; tail sampling keeps errors and slow traces at a fraction of full cost.
Tier retention: hot for days, cold/downsampled for months. Pay premium only for data you'd query at 3am.
Own the bill: dashboard it, and alert on ingest spikes like you alert on errors.

Where to go next

Cost management is the discipline you layer on *after* you have working observability. If the signals themselves are still fuzzy, start with the fundamentals and the instrumentation tooling, then come back to make them affordable.

Observability: Metrics, Logs & Traces (the three signals, what each is for, and when to reach for which).
OpenTelemetry in Practice (wiring up the Collector that becomes your cost-control choke point).
Cloud Cost & FinOps: Engineering for Spend (the broader discipline of treating spend as an engineering metric).
Follow the full SRE career path to connect observability, reliability, and cost into one operating model.

Frequently asked questions

How can a telemetry bill end up larger than the infra bill?

It accretes from a thousand locally reasonable choices nobody is measuring: every new service ships with default instrumentation, every engineer adds one more label, and every incident ends with "let's log more." In the article's example, telemetry cost about $41k a month against roughly $38k for the infra it described.

What are the main cost drivers in observability?

Almost every runaway bill traces back to three drivers: high-cardinality metrics, log volume, and trace retention. Each has a direct lever, so find which driver dominates your invoice first, since vendors bill on different units like data points per minute, GB ingested, or spans retained.

Will sampling traces make me blind to problems?

No, if you sample with intent. On a high-traffic service, a uniform 1% sample of traces estimates your p99 latency almost as well as full retention at roughly a hundredth of the cost, and tail sampling keeps every error and slow trace on top of that. The skill is knowing which 1% to keep and which 99% to drop.

What is the guiding principle for cutting telemetry cost?

Instrument what you will act on, and sample the rest. Every signal you keep should answer a question someone will actually ask through a dashboard, an alert, or an incident query, because telemetry you never query is just a storage bill with extra steps.
