Initial commit from agent
skills/software/observability-operability.md
# Observability and Operability
## Purpose
Make systems easier to understand, debug, and run by improving signals, diagnostics, and operational readiness around important behavior.
## When to use
- A system is hard to diagnose in production or staging
- New functionality needs useful logs, metrics, traces, or alerts
- Operational ownership is unclear during failures or rollout
- Reliability work needs better visibility before deeper changes
## Inputs to gather
- Critical workflows, failure modes, and current diagnostic signals
- Existing logging, metrics, tracing, dashboards, and alerts
- Operator needs during rollout, incident response, and debugging
- Noise constraints and performance or cost considerations
## How to work
- Instrument the questions a responder will need answered during failure.
- Prefer signals tied to user-impacting behavior over vanity metrics.
- Make logs structured and actionable when possible.
- Add observability close to important boundaries and state transitions.
- Keep signal quality high by avoiding low-value noise.
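
As one illustration of structured, actionable logging at a state transition, the sketch below emits one JSON object per event using only the Python standard library. The `orders` logger name, the `order_state_changed` event, and the field names are hypothetical; they stand in for whatever a responder would actually filter on in the target system.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        # Structured fields arrive via `extra={"fields": {...}}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("orders")  # hypothetical component name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_transition(logger, order_id, old_state, new_state, reason=None):
    """Log one state transition with the fields a responder would search by."""
    # Assumption: only the "failed" state is user-impacting enough for WARNING.
    level = logging.WARNING if new_state == "failed" else logging.INFO
    fields = {"order_id": order_id, "from": old_state, "to": new_state}
    if reason is not None:
        fields["reason"] = reason
    logger.log(level, "order_state_changed", extra={"fields": fields})


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    log_transition(log, "o-123", "pending", "paid")
    log_transition(log, "o-124", "pending", "failed", reason="payment_timeout")
```

Logging the transition itself (old state, new state, and a reason on failure) rather than free-form messages keeps volume low and lets a responder answer "which orders failed, and why?" with a single filter.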
## Output expectations
- Improved observability or an operability plan for the target area
- Clear explanation of what new signals reveal
- Notes on alerting, dashboard, or rollout support when relevant
## Quality checklist
- Signals help detect and diagnose meaningful failures.
- Instrumentation is focused and not excessively noisy.
- Operational usage is considered, not just implementation convenience.
- Added visibility maps to critical user or system outcomes.
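
To make the checklist concrete, here is a minimal sketch of a signal tied to a user outcome: the success ratio of one user-facing workflow, with an alert condition that stays quiet when traffic is too low to trust. The `OutcomeCounter` class, the `should_alert` helper, and the `target` and `min_requests` defaults are all hypothetical; real thresholds would come from the system's own reliability goals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutcomeCounter:
    """Counts outcomes of one user-facing workflow (for example, checkout)."""

    succeeded: int = 0
    failed: int = 0

    def record(self, ok: bool) -> None:
        if ok:
            self.succeeded += 1
        else:
            self.failed += 1

    @property
    def total(self) -> int:
        return self.succeeded + self.failed

    def success_ratio(self) -> Optional[float]:
        """Fraction of successful attempts, or None when nothing was recorded."""
        return None if self.total == 0 else self.succeeded / self.total


def should_alert(counter: OutcomeCounter,
                 target: float = 0.99,      # assumed success target
                 min_requests: int = 50) -> bool:  # assumed noise floor
    """Alert only when enough traffic exists and the ratio is below target."""
    if counter.total < min_requests:
        return False  # too little data: avoid paging on noise
    return counter.success_ratio() < target
```

The `min_requests` floor addresses the "not excessively noisy" item, while measuring workflow success rather than, say, raw request counts keeps the signal mapped to a user outcome.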
## Handoff notes
- Mention what incidents or debugging tasks the new observability should make easier.
- Pair with debugging workflow, incident response, or performance optimization when diagnosis is the main bottleneck.