Identity System Observability: OpenTelemetry Full-Link Tracing in Practice
The 2025 production incident at a major SaaS platform remains a stark reminder: without distributed tracing, a single “slow login” user complaint consumed 6 full hours across 3 SREs, from ticket creation to root cause identification. The culprit was a Redis connection pool misconfiguration in session-service that forced a new connection on every token validation — but without tracing, the team had to check each service and middleware manually, like finding a needle in a haystack.
This is why observability is not a “nice to have” but a lifeline for identity systems. When an identity system goes down or degrades, every business system that depends on it becomes unavailable — and that blast radius dwarfs any single business module.
Autional has treated observability as a first-class citizen from day one, building a unified log-metric-trace observability system on OpenTelemetry. This article breaks down each layer and demonstrates how to quickly locate problems using real-world cases.
Three Pillars of Observability and the Special Needs of Identity Systems
Logs: Recording “What Happened”
The pain point of traditional logging isn’t a lack of data — it’s too much. A medium-sized authentication system can generate gigabytes of access logs daily, but when an actual incident occurs, operators often get lost in a sea of unstructured log data.
Autional’s structured logging solution:
- 78+ standardized log keys: All logs use predefined key constants from the
base/loggerpackage, such aslogger_base.KeyUserID,logger_base.KeyTraceID, andlogger_base.KeyErrorCode. This means you can preciselygrepfor all operations by a specific user, all occurrences of a specific error code, or every log step within a trace span. - Request-level log context: On each HTTP request, middleware injects
request_id,tenant_id,user_id, andtrace_idintocontext.Context. All subsequentlogger_base.FromContext(ctx)calls automatically carry these identifiers — no need to manually pass log parameters as long as the function receivesctx. - Complete error chain recording:
error_base.Errcarries the full error chain (cause -> cause -> cause), which is automatically expanded into an"error_chain"field in log output, revealing the root cause at a glance.
logger.Info("user login successful",
slog.String(logger_base.KeyUserID, userID),
slog.String(logger_base.KeyTenantID, tenantID),
slog.Duration(logger_base.KeyDuration, elapsed),
)
Metrics: Quantifying “What Happened”
Logs tell you the details of a single request; metrics tell you the macro health of your system. Autional’s Prometheus metrics cover four layers:
| Layer | Example Metrics | Purpose |
|---|---|---|
| HTTP Service | http_requests_total, http_request_duration_seconds | Request volume, latency, error rate |
| Domain Events | events_published_total, events_publish_duration_seconds | Domain event throughput and latency |
| MQ Consumption | mq_consumer_messages_total, mq_consumer_message_duration_seconds | Consumption rate, processing latency, DLQ backlog |
| Infrastructure | db_connections_active, redis_commands_duration_seconds | Connection pool health, cache hit rate |
These metrics are automatically registered via the micro-middleware/metrics middleware — business code does not require manual instrumentation. But this doesn’t mean you can ignore metrics. The key is defining the right alerting rules, which we will cover in a separate article.
Traces: Understanding “How It Happened”
Logs tell you the result of each step. Metrics tell you the overall system trend. Traces connect the two — they reveal how many services a single request crosses, how long each service takes, and where the bottleneck is.
A typical “user login” request trace in Autional:
gateway (1ms)
→ identity-service /auth/login (45ms)
→ bcrypt password verify (30ms)
→ JWT token generate (2ms)
→ audit-service /log (5ms, async)
→ session-service /session/create (8ms)
→ Redis SET (2ms)
→ PostgreSQL INSERT (4ms)
→ profile-service /profile/me (12ms, parallel)
In the old architecture, if “login is slow” was the complaint, you had to SSH into each service and check logs manually. With distributed tracing, one screen shows you that identity-service bcrypt took 30ms (65% of total), while session-service’s PostgreSQL INSERT took only 4ms — everything is crystal clear.
Autional OpenTelemetry Implementation Architecture
Cross-Protocol Propagation
What makes identity systems unique is the need to support three communication protocols simultaneously: HTTP, gRPC, and MQ. /auth/login is HTTP → HTTP, but compliance-service may require gRPC calls, and audit logs are delivered asynchronously through MQ. If trace context cannot flow smoothly between protocols, the trace breaks.
Autional’s solution:
- HTTP: Uses the W3C Trace Context standard, propagated via the
traceparentheader. Themicro-middleware/tracingmiddleware automatically extracts and injects it. - gRPC: Uses
otelgrpc’sNewClientHandler()/NewServerHandler()propagated via gRPC metadata. All gRPC client connections mandatorily injectotelgrpc.NewClientHandler(). - MQ:
micro-pkg/event.Publisherautomatically injects the W3CtraceparentinbuildHeaders. Consumers extract it via theconsumer/middleware.Tracing()middleware.
This means even if a business flow crosses HTTP → MQ → gRPC → HTTP, the trace remains fully intact. This is especially critical for identity systems, where a “user registration” operation triggers audit logging (MQ), wallet creation (HTTP), and default role assignment (gRPC).
Integration Approach
Autional chose to use the OpenTelemetry SDK directly rather than a vendor-specific agent. The benefits:
- Vendor-neutral: Trace data can be exported to any OTLP-compatible backend — Jaeger, Tempo, Datadog, Alibaba Cloud ARMS — by changing just one environment variable:
OTEL_EXPORTER_OTLP_ENDPOINT. - Configurable sampling: 100% sampling in development; on-demand sampling in production (e.g., only error requests and slow requests), preventing trace data explosion.
- Zero code intrusion: All trace logic is handled by infrastructure-layer middleware. Business code only needs to pass
ctxnormally — zero manual instrumentation cost.
Real-World Case: Debugging Slow Logins with Tracing
One day, operations received an alert: “identity-service P99 latency spiked from 80ms to 500ms.” Here is the complete process of locating the root cause using distributed tracing:
Step 1: Look at the Big Picture
Open the http_request_duration_seconds panel in Grafana. Confirm that the latency spike began at 14:32, with P99 rising from 80ms to 500ms. The error rate is normal — meaning the service is not crashing, just degrading in performance.
Step 2: Find Representative Traces
In the tracing backend (e.g., Jaeger), query for traces with operation = POST /api/v1/auth/login and duration > 400ms. Sample 5 traces at random and find a common pattern:
identity-service auth/login 420ms
├── bcrypt compare 28ms ← Normal
├── JWT generate 2ms ← Normal
└── session save 385ms ← Anomaly!
├── PostgreSQL INSERT 383ms
└── Redis SET 2ms
The problem is in session-service’s PostgreSQL write.
Step 3: Correlate Logs
Find the trace_id in the trace and use it to search logs in Loki (all Autional logs carry the trace_id field):
14:32:15 [session-service] ERROR session save failed
error="could not serialize access token" cause="pq: value too long for type character varying(1024)"
trace_id=abc123
user_id=01ARZ3NDEKTSV4RRFFQ69G5FAV
The root cause is clear: a tenant configured abnormally large JWT claims (too many custom fields), causing the serialized token to exceed the database column length limit. Fix: increase the column length and add truncation protection before serialization.
Entire process: from alert receipt to root cause identification — 4 minutes. Without the observability system, this process could have taken 4 hours.
Beyond Tracing: The Next Frontier of Observability
Autional’s observability system continues to evolve. The next milestones include:
Convergence of Audit Logging and Observability
Identity systems naturally require audit capabilities — who performed what action and when. Autional bridges audit logging (audit-service writing to MongoDB) with structured logging: every audit record carries a trace_id, allowing you to jump directly from a trace to the corresponding audit record and confirm that an operation was initiated by the user (not an internal call). This is a killer feature for compliance auditing (GDPR Article 30 — records of processing activities).
Error Budget Dashboard
Based on SLO and error budget principles, we are building a centralized “Identity System Health” dashboard that translates core metrics into business language:
- Login success rate (last 1 hour) — actual vs SLO (99.9%)
- Remaining error budget (this month) — how much tolerance is left
- Token issuance P99 latency — is user experience impacted?
When the error budget is exhausted, alert levels are automatically escalated, and a ticket is created, forcing the team to pause feature development and prioritize stability — this is the essence of Google SRE and the decision-making layer we are building on top of observability.
Advice for Readers
If you are building observability for your identity system, here are three prioritized recommendations:
- Start with structured logging: Replace
fmt.Sprintf("user %s login", uid)withslog.Info("user login", "user_id", uid). This is the highest-ROI improvement — near-zero cost, immediate results. - Then add distributed tracing: If your architecture spans more than 3 services, distributed tracing is essential. Start by injecting trace headers at the gateway layer and core authentication services, then extract them at the consumer side. Autional’s
micro-pkgmiddleware is ready to use. - Finally, implement metrics: Define your SLOs (what counts as “available”) first, then build dashboards and alerts. Metrics without SLOs are just charts; metrics with SLOs are decision-making tools.
Observability is not a set of tools — it is a culture: “How can I know what’s wrong with my system faster?” For identity systems, this culture directly determines your security response speed and ultimately, your users’ trust.
All 15 Autional microservices have built-in OpenTelemetry support, ready to use out of the box. Learn how to integrate Autional authentication into your application in the developer documentation.