
Observability: Tracing, Metrics, and Logs

The instinct of every new engineer is to add a print statement. The instinct of every senior engineer is to wonder why this system was set up so that the only way to investigate is to add a print statement. Production observability is the discipline of having enough signal in advance that you do not need to add print statements when an incident hits at 2 a.m.

This article covers the three pillars (logs, metrics, traces), how they interact, and the practical setup that lets a small team operate a real production system without drowning.

The three pillars

flowchart TB
    Service["Your service"] --> Logs["Logs<br/>discrete events"]
    Service --> Metrics["Metrics<br/>numeric time series"]
    Service --> Traces["Traces<br/>request flow across services"]
    Logs --> LStore["Log aggregator<br/>Loki, ELK, Datadog"]
    Metrics --> MStore["Metrics store<br/>Prometheus, InfluxDB"]
    Traces --> TStore["Trace backend<br/>Jaeger, Tempo, Honeycomb"]
    LStore --> Dashboard["Unified dashboard<br/>Grafana, Datadog"]
    MStore --> Dashboard
    TStore --> Dashboard
    Dashboard --> Engineer["On-call engineer"]

The three pillars feeding a unified view. The trick is correlation: an alert on a metric should let you click through to the relevant logs and traces immediately.

Logs

Logs are discrete events: a record of something that happened, with a timestamp and context. The fundamental log is a line of text; modern logs are structured (JSON) so they can be queried mechanically.

// Old style: free-form text
2026-04-25 14:32:11 INFO User 42 logged in from 10.0.0.5

// Modern: structured
{
  "ts": "2026-04-25T14:32:11Z",
  "level": "info",
  "msg": "user_login",
  "user_id": 42,
  "ip": "10.0.0.5",
  "trace_id": "abc123"
}

The structured form is queryable: "show me all logins for user 42 today" is a one-line query in Loki, ELK, or Datadog. The free-form version has to be regex-parsed first.

Critical fields:

  • Trace ID linking the log to a request trace. Without this, logs and traces are disconnected.
  • User / tenant ID for filtering by affected customer.
  • Service / version so you can correlate issues with deployments.
  • Severity level for filtering noise versus real problems.
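
A minimal sketch of emitting these fields from a Node.js service, assuming the pino logger and the OpenTelemetry API for the active trace ID (illustrative choices; the service name and fields are placeholders):

const pino = require('pino');
const { trace } = require('@opentelemetry/api');

// Base fields stamped on every line: service and version.
const logger = pino({
  base: { service: 'orders-api', version: process.env.GIT_SHA }
});

function logEvent(level, msg, fields) {
  // Attach the active trace ID so the log links back to the request trace.
  const span = trace.getActiveSpan();
  const trace_id = span ? span.spanContext().traceId : undefined;
  logger[level]({ ...fields, trace_id }, msg);
}

logEvent('info', 'user_login', { user_id: 42, ip: '10.0.0.5' });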

What logs are good for: investigating "what exactly happened to this specific request?". They are bad for: aggregate questions like "what is our error rate?" (use metrics), or "why is this request slow?" (use traces).

Metrics

Metrics are numbers over time. CPU usage. Request rate. Error rate. P99 latency. Each metric is a time series — a sequence of values at regular intervals.

Three metric types matter:

  • Counters only go up. Total requests, total errors. You compute rates by taking the derivative.
  • Gauges can go up or down. Active connections, queue depth, free memory.
  • Histograms capture distributions. Request latency is the canonical example — you want P50, P95, P99, not just the average.

The four golden signals (from Google's SRE book) cover most use cases:

Latency:    request duration distribution
Traffic:    requests per second
Errors:     errors per second / error rate
Saturation: how full the system is (CPU, memory, queue)

Build a dashboard with these four for every service and you have decent operational visibility for free. Alerts on these four cover most incidents.
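
A sketch of what instrumenting these signals looks like in a Node.js service, assuming the prom-client library (metric names and buckets are placeholders):

const client = require('prom-client');

// Counter: only goes up. rate() over it gives traffic and error rate.
const requests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'status']
});

// Histogram: latency distribution, so dashboards can show P50/P95/P99.
const latency = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]
});

// Gauge: saturation signals that go up and down.
const queueDepth = new client.Gauge({
  name: 'job_queue_depth',
  help: 'Jobs waiting in the queue'
});

// In a request handler: time the request and count the outcome.
async function handle(req, res) {
  const end = latency.startTimer();
  try {
    // ...do the work...
    requests.inc({ method: req.method, status: '200' });
  } finally {
    end();
  }
}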

Traces

A trace records the path of a single request across all the services it touches. Each unit of work within the request is a span. Spans nest: the top-level span is the user request; child spans are the database call, the cache lookup, the call to a downstream service.

Trace abc123 (450 ms total)
+-- HTTP GET /api/orders (450 ms)
    +-- auth check (12 ms)
    +-- db query (340 ms)  <-- BOTTLENECK
    |   +-- connection acquire (1 ms)
    |   +-- query execute (335 ms)
    +-- cache write (8 ms)
    +-- response serialise (90 ms)

The bottleneck is immediately visible — the database query is 75% of the request time. Without traces, finding this would require carefully placed log statements and stitching together timestamps.

Tracing requires instrumentation. Each service must propagate the trace ID (typically via the traceparent HTTP header in OpenTelemetry) and emit spans. Modern frameworks have automatic instrumentation that handles HTTP, database, and queue calls without manual code changes.
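
Auto-instrumentation normally does the propagation for you, but a manual sketch with the OpenTelemetry API shows what is actually happening (the handler is illustrative):

const { context, propagation } = require('@opentelemetry/api');

// Outgoing call: inject the current trace context into the HTTP headers.
// With the default W3C propagator this writes a header like
//   traceparent: 00-<trace-id>-<parent-span-id>-01
const headers = {};
propagation.inject(context.active(), headers);
// ...pass `headers` to fetch/axios so the downstream service continues the trace.

// Incoming call: extract the caller's context before doing any work.
function handleRequest(req) {
  const parentCtx = propagation.extract(context.active(), req.headers);
  // Spans created inside this callback become children of the caller's span.
  return context.with(parentCtx, () => {
    // ...handle the request...
  });
}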

OpenTelemetry: the standard

OpenTelemetry (OTel) is the industry-standard instrumentation API and protocol. You instrument your service once with OTel SDKs, and you can send data to any backend that understands the OTLP protocol — Jaeger, Tempo, Datadog, Honeycomb, New Relic, Splunk, and many more.

The reason this matters: vendor lock-in used to be brutal. Datadog APM only worked with Datadog's agent; New Relic only with theirs. OpenTelemetry made the instrumentation portable. Choose your backend; switch later if you want; the code does not change.

Sample setup for a Node.js service:

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces'
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});
sdk.start();

A dozen lines, and you have automatic tracing of every HTTP request, every database call, every Redis call, every outbound HTTP call. That is already more tracing coverage than most teams have.
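
Auto-instrumentation cannot see your business logic, though. For that you add spans by hand with the OpenTelemetry API; a sketch, with the tracer name and functions as placeholders:

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('orders-service');

async function applyDiscounts(order) {
  // startActiveSpan makes this span the parent of anything created inside it.
  return tracer.startActiveSpan('apply_discounts', async (span) => {
    try {
      span.setAttribute('order.item_count', order.items.length);
      return await computeDiscounts(order); // your business logic
    } catch (err) {
      span.recordException(err);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}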

The toolchain in 2026

The popular open-source stack:

  • Logs: Loki (Grafana's log aggregator) or Elastic / OpenSearch.
  • Metrics: Prometheus, scraping every service.
  • Traces: Tempo (Grafana's trace store) or Jaeger.
  • Dashboard: Grafana, unifying all three.
  • Alerting: Alertmanager (with Prometheus) or Grafana's built-in alerting.
  • Collection: OpenTelemetry Collector, receiving from services and routing to backends.

Hosted alternatives that handle everything for you:

  • Datadog — comprehensive, polished, expensive at scale.
  • Honeycomb — trace-first, excellent for high-cardinality debugging.
  • New Relic — mature, broad coverage.
  • Grafana Cloud — managed Loki + Prometheus + Tempo, OSS-compatible.
  • Better Stack, Axiom — newer entrants with simpler pricing.

What to alert on

Alerting is its own discipline. Two principles:

  1. Alert on symptoms, not causes. "Error rate above 1%" is good. "CPU at 80%" is usually noise. The first wakes you up when users are affected; the second wakes you up at random.
  2. Make every alert actionable. If you cannot say what to do when the alert fires, it is not a real alert — it is a metric you should put on a dashboard but not page on.

A starter set of alerts:

  • Error rate above 1% for 5 minutes (page on-call)
  • P99 latency above 1 second for 10 minutes (page on-call)
  • Service unhealthy for 2 minutes (page on-call)
  • Disk usage above 85% (notify, do not page)
  • Deploy failed (notify the deployer)

That is fewer than 10 alerts. The instinct is to add many more; resist it. Every alert that fires and is not actionable trains your team to ignore alerts. Quality over quantity.

Common mistakes

  • Logging too much. Logging every request at INFO level fills up storage. Use DEBUG for verbose detail; only INFO for events that matter.
  • Putting high-cardinality fields in metric labels. Adding user_id as a label to a Prometheus metric explodes cardinality and can overwhelm Prometheus. Put high-cardinality data in logs or traces, not metric labels (see the sketch after this list).
  • Sampling traces too aggressively. A flat 1% sample means you miss the rare slow or failing requests. Tail-based sampling, which decides after the trace completes, lets you keep the interesting traces (slow, errored) at high rates while dropping routine ones.
  • Disconnected dashboards. Three different tools with three different login pages. Aim for one pane of glass — usually Grafana — that shows everything.
  • Manual instrumentation only. Use the auto-instrumentation libraries; they cover 80% of cases. Custom spans only for business logic that auto-instrumentation cannot see.
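
To make the cardinality point concrete, a sketch with prom-client (metric names are made up):

const client = require('prom-client');

// Bad: one time series per user. With a million users, Prometheus now
// tracks a million series for this single metric.
const loginsByUser = new client.Counter({
  name: 'logins_by_user_total',
  help: 'Logins per user',
  labelNames: ['user_id']
});

// Better: keep labels low-cardinality; put user_id in the log line or the
// trace, where high cardinality is cheap.
const logins = new client.Counter({
  name: 'logins_total',
  help: 'Total logins',
  labelNames: ['status']
});

logins.inc({ status: 'success' });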

Frequently Asked Questions

Where do I start if I have nothing today?

Start with metrics: instrument the four golden signals on your main service. That gives you the ability to detect incidents. Then add structured logs for the same service. Tracing comes third: useful, but the highest-effort of the three to implement well. Many teams successfully operate with just structured logs and Prometheus metrics for years.

How much does observability cost?

For a small team self-hosting OSS: a few hundred dollars a month for the infrastructure. For Datadog with five engineers' worth of services: thousands per month. Honeycomb tends to be cheaper than Datadog at small scale; Grafana Cloud has a generous free tier. Costs scale with traffic and retention period.

What is the difference between APM and observability?

APM (Application Performance Monitoring) was the older term, focused on tracing and performance metrics. Observability is broader, including logs and unstructured exploration. In practice the two have converged — modern APM products do everything observability platforms do.

Should I use Sentry?

Sentry is excellent for error tracking specifically — capturing exceptions with full context, deduplicating, alerting. It complements rather than replaces a full observability stack. Many teams run both: Sentry for errors, Grafana / Datadog for everything else.

How do I instrument code I do not own (a vendor library)?

Auto-instrumentation in OpenTelemetry covers most popular libraries automatically by patching their methods. For libraries it does not cover, you wrap calls in your own spans. For totally opaque dependencies, observe at the boundary — trace the request that crosses into the dependency and the response that comes back.

Share your thoughts

Worked with this in production and have a story to share, or disagree with a tradeoff? Email us at support@mybytenest.com — we read everything.