Observability: Logs, Metrics, Traces
You can't fix what you can't see. Observability rests on three pillars, each answering a different question about your system.
The three pillars
Logs
Discrete events. "What happened at T?" Textual, unbounded cardinality. Slowest/most expensive to query.
Metrics
Time-series numbers. "How's it trending?" Cheap, low cardinality. Alerts live here.
Traces
A request's journey across services. "Where is the latency?" Connected spans, sampled.
Logs done right
- **Structured, not strings** — `{"level":"error","user_id":42,"err":"timeout"}` vs "user 42 got timeout". Queryable.
- **Log levels** — ERROR / WARN / INFO / DEBUG. Prod defaults to INFO; DEBUG behind a flag.
- **Request-scoped context** — every log line inside a request includes a request-id. Trivial to correlate.
- **What to log** — inputs, errors, important state transitions. Not every function call.
- **What NOT to log** — passwords, PII, tokens, credit cards. Redact at source.
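The logging rules above can be sketched with Python's standard library alone. This is a minimal illustration, not a recommended production setup: the `fields` key passed through `extra`, the `SENSITIVE_KEYS` set, and the `request_id_var` name are all assumptions made for the example. The formatter-level redaction is only a safety net; redacting at the call site remains the primary defence.

```python
import contextvars
import io
import json
import logging

# Holds the current request's id; set once when a request starts.
request_id_var = contextvars.ContextVar("request_id", default="-")

# Hypothetical field names that must never reach the log output.
SENSITIVE_KEYS = {"password", "token", "card_number"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Structured fields arrive via logger.error(..., extra={"fields": {...}}).
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key in SENSITIVE_KEYS else value
        return json.dumps(payload)


def make_logger(stream):
    logger = logging.getLogger("app")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # DEBUG stays off unless explicitly enabled
    logger.propagate = False
    return logger


# Example: one request, one error line, one suppressed debug line.
stream = io.StringIO()
log = make_logger(stream)
request_id_var.set("req-123")
log.error(
    "upstream call failed",
    extra={"fields": {"user_id": 42, "err": "timeout", "token": "abc"}},
)
log.debug("not emitted at INFO level")
lines = stream.getvalue().splitlines()
```

Because the request id lives in a `ContextVar`, every log call made while handling that request picks it up without it being passed through each function.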
Metrics that matter (the RED and USE methods)
- **RED** (request-oriented): **R**ate, **E**rrors, **D**uration. For every service you care about.
- **USE** (resource-oriented): **U**tilization, **S**aturation, **E**rrors. For every resource (CPU, disk, queue).
- Golden signals — latency p50/p95/p99, traffic (req/s), errors (%), saturation (resource pressure).
- Alert on symptoms (user-visible: error rate, p95), not causes (CPU > 80%): most cause-based alerts fire without any user impact, so they are noise.
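To make the RED numbers concrete, here is a small sketch that derives rate, error ratio, and latency percentiles from a window of raw request records, then applies a symptom-based alert rule. It is illustrative only: in practice these values usually come from a metrics backend's counters and histograms rather than raw records, and the `Request` type, the nearest-rank percentile, and the thresholds are assumptions chosen for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests, window_seconds):
    """Rate, Errors, Duration for one service over one time window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


def should_alert(summary, max_error_ratio=0.01, max_p95_ms=500):
    """Symptom-based rule: page only when users see errors or slowness."""
    return summary["error_ratio"] > max_error_ratio or summary["p95_ms"] > max_p95_ms


# Example window: 100 requests in 10 s, durations 1..100 ms, two server errors.
window = [
    Request(duration_ms=float(i), status=500 if i in (10, 20) else 200)
    for i in range(1, 101)
]
summary = red_summary(window, window_seconds=10)
```

Note that the alert rule looks only at what users experience; CPU or memory pressure would show up here only once it affects errors or latency.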
Tracing
- Each request gets a trace-id; each operation a span-id. Spans form a tree.
- OpenTelemetry is the de facto standard: vendor-neutral APIs, SDKs, and a wire protocol (OTLP) that can export to any compatible backend.
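- Sample: tracing every request at scale is expensive. Keeping 1-10% with head-based sampling is typical; keeping 100% of errors as well requires tail-based sampling, because a head-based decision is made before the outcome is known.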
- Propagate headers across service calls (`traceparent`) or the trace ends at the boundary.
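Header propagation is easier to reason about with the header format in view. The sketch below hand-rolls the W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`) to show what gets forwarded on each hop. In a real service an OpenTelemetry SDK would normally do this for you; the function names here are hypothetical, only version `00` is handled, and header names are assumed to be lower-cased already.

```python
import re
import secrets

# version "00", 32-hex trace-id, 16-hex parent (span) id, 2-hex flags
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled):
    """Start a new trace with a fresh trace-id and root span-id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header's parts, or None if it is malformed."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid per the Trace Context spec.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def outgoing_headers(incoming_headers, sampled_if_new=False):
    """Headers for a downstream call: same trace-id, new span-id, same sampling flag."""
    parsed = parse_traceparent(incoming_headers.get("traceparent", ""))
    if parsed is None:
        # No usable context arrived, so this service becomes the trace root.
        return {"traceparent": new_traceparent(sampled_if_new)}
    flags = "01" if parsed["sampled"] else "00"
    span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{parsed['trace_id']}-{span_id}-{flags}"}


# Example: continue a trace that arrived from an upstream caller.
incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
forwarded = outgoing_headers(incoming)
```

If a service drops the header instead of forwarding it, the next hop falls into the "no usable context" branch and starts an unrelated trace, which is exactly the broken-at-the-boundary behaviour described above.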
Tooling
- Self-host: Prometheus (metrics) + Loki (logs) + Tempo (traces) + Grafana (UI). OSS, solid, some ops work.
- Managed: Datadog, New Relic, Honeycomb, Grafana Cloud, Axiom. Pay per GB ingested/retained.
- At tiny scale (1 service) — good logging + /metrics endpoint is enough. Don't over-invest early.
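For the single-service case, a `/metrics` endpoint can be as small as the sketch below, which serves one counter in the Prometheus text exposition format using only the standard library. Treat it as an illustration of the shape of the output: the metric name `http_requests_total`, the `record_request` helper, and the port are assumptions, and an official Prometheus client library is usually the better choice once you need histograms or multiple processes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter, keyed by HTTP status code.
REQUESTS_TOTAL = {}


def record_request(status):
    """Call this from the request path of the real service."""
    REQUESTS_TOTAL[status] = REQUESTS_TOTAL.get(status, 0) + 1


def render_metrics():
    """Render the counter in Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for status, count in sorted(REQUESTS_TOTAL.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever; run it in a background thread next to the real service."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scraper at `http://127.0.0.1:8000/metrics` is enough to start graphing request rate and error rate for one service.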