# Observability and Operability

## Purpose

Make systems easier to understand, debug, and run by improving signals, diagnostics, and operational readiness around important behavior.

## When to use

- A system is hard to diagnose in production or staging
- New functionality needs useful logs, metrics, traces, or alerts
- Operational ownership is unclear during failures or rollout
- Reliability work needs better visibility before deeper changes

## Inputs to gather

- Critical workflows, failure modes, and current diagnostic signals
- Existing logging, metrics, tracing, dashboards, and alerts
- Operator needs during rollout, incident response, and debugging
- Noise constraints and performance or cost considerations

## How to work

- Instrument the questions a responder will need answered during failure.
- Prefer signals tied to user-impacting behavior over vanity metrics.
- Make logs structured and actionable when possible.
- Add observability close to important boundaries and state transitions.
- Keep signal quality high by avoiding low-value noise.

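
As one illustration of structured, actionable logs placed at a state transition, here is a minimal sketch using only Python's standard `logging` and `json` modules. The order workflow, the field names (`order_id`, `from`, `to`, `reason`), and the helpers `make_logger` and `log_transition` are hypothetical and chosen for the example; a real system would use its own identifiers and likely an existing logging setup.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Context passed via `extra={"fields": {...}}` is merged in as-is.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def make_logger(name: str, stream=None) -> logging.Logger:
    """Build a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_transition(logger, order_id, old_state, new_state, reason=None):
    """Record a state transition with the context a responder would ask for."""
    # Hypothetical convention: entering "failed" is logged at ERROR level.
    level = logging.ERROR if new_state == "failed" else logging.INFO
    fields = {"order_id": order_id, "from": old_state, "to": new_state}
    if reason is not None:
        fields["reason"] = reason
    logger.log(level, "order_state_changed", extra={"fields": fields})


if __name__ == "__main__":
    log = make_logger("orders")
    log_transition(log, "o-123", "pending", "paid")
    log_transition(log, "o-123", "paid", "failed", reason="shipment_rejected")
```

Each line answers the questions a responder is likely to ask first (which entity, which transition, why), and the stable `event` name gives dashboards and alerts something to key on without parsing free text.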
## Output expectations

- Improved observability or an operability plan for the target area
- Clear explanation of what new signals reveal
- Notes on alerting, dashboard, or rollout support when relevant

## Quality checklist

- Signals help detect and diagnose meaningful failures.
- Instrumentation is focused and not excessively noisy.
- Operational usage is considered, not just implementation convenience.
- Added visibility maps to critical user or system outcomes.

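
One possible way to keep instrumentation from becoming excessively noisy is to cap how often the same message repeats. The sketch below is an assumption about how that could be done with a standard-library `logging.Filter`; the class name `RepeatLimiter` and the `burst` and `window` parameters are hypothetical, and the default values are placeholders rather than recommendations.

```python
import logging
import time


class RepeatLimiter(logging.Filter):
    """Allow at most `burst` records per message template per `window` seconds."""

    def __init__(self, burst=5, window=60.0, clock=time.monotonic):
        super().__init__()
        self.burst = burst
        self.window = window
        self.clock = clock  # injectable so the behavior can be tested
        self._seen = {}  # key -> (window_start, count)

    def filter(self, record: logging.LogRecord) -> bool:
        # Key on the unformatted template so varying arguments share one budget.
        key = (record.name, record.levelno, record.msg)
        now = self.clock()
        start, count = self._seen.get(key, (now, 0))
        if now - start >= self.window:
            # The window has elapsed: start a fresh budget for this key.
            start, count = now, 0
        count += 1
        self._seen[key] = (start, count)
        return count <= self.burst


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("svc")
    log.addFilter(RepeatLimiter(burst=2, window=30.0))
    for attempt in range(5):
        log.warning("retrying upstream call, attempt=%d", attempt)
```

Attaching the filter to a logger (or a handler) keeps the first few occurrences, which usually carry the diagnostic value, and drops the rest until the window resets. A limiter like this trades completeness for readability, so it fits repetitive warnings better than audit-style events that must never be dropped.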
## Handoff notes

- Mention what incidents or debugging tasks the new observability should make easier.
- Pair with debugging workflow, incident response, or performance optimization when diagnosis is the main bottleneck.