memer/skills/software/observability-operability.md

# Observability and Operability

## Purpose

Make systems easier to understand, debug, and run by improving the signals, diagnostics, and operational readiness around their critical behavior.

## When to use

- A system is hard to diagnose in production or staging
- New functionality needs useful logs, metrics, traces, or alerts
- Operational ownership is unclear during failures or rollout
- Reliability work needs better visibility before deeper changes

## Inputs to gather

- Critical workflows, failure modes, and current diagnostic signals
- Existing logging, metrics, tracing, dashboards, and alerts
- Operator needs during rollout, incident response, and debugging
- Noise constraints and performance or cost considerations

## How to work

- Instrument to answer the questions a responder will ask during a failure: what broke, for whom, and since when.
- Prefer signals tied to user-impacting behavior over vanity metrics.
- Make logs structured (consistent, machine-parseable fields) and actionable (each entry says what happened, to which entity, and why) when possible.
- Add observability close to important boundaries and state transitions.
- Keep signal quality high: drop, sample, or downgrade logs, metrics, and alerts that nobody acts on.
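
As one possible illustration of the guidance above, the sketch below emits a single structured log line at a state transition, using only Python's standard `logging` and `json` modules. The `orders` logger name, the `order_state_changed` event, the state names, and the field names (`order_id`, `from`, `to`, `reason`) are hypothetical and chosen only for the example; a real system would substitute its own entities and states.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": round(record.created, 3),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the line.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Create a logger (hypothetical name 'orders') that writes JSON lines."""
    logger = logging.getLogger("orders")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_transition(logger, order_id, old_state, new_state, reason=None):
    """Log one line per state transition; failures are logged at ERROR."""
    level = logging.ERROR if new_state == "failed" else logging.INFO
    fields = {"order_id": order_id, "from": old_state, "to": new_state}
    if reason is not None:
        fields["reason"] = reason
    logger.log(level, "order_state_changed", extra={"fields": fields})


# Usage: capture the output in memory and print it.
demo_stream = io.StringIO()
demo_logger = build_logger(demo_stream)
log_transition(demo_logger, "o-1", "pending", "authorized")
log_transition(demo_logger, "o-1", "authorized", "failed", reason="card_declined")
print(demo_stream.getvalue(), end="")
```

Logging once per transition, rather than at every intermediate step, is one way to keep volume low while still letting a responder filter by `order_id` and see where an order stopped and why.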

## Output expectations

- Improved observability or an operability plan for the target area
- Clear explanation of what new signals reveal
- Notes on alerting, dashboard, or rollout support when relevant

## Quality checklist

- Signals help detect and diagnose meaningful failures.
- Instrumentation is focused and not excessively noisy.
- Operational usage is considered, not just implementation convenience.
- Added visibility maps to critical user or system outcomes.

## Handoff notes

- Mention what incidents or debugging tasks the new observability should make easier.
- Pair with debugging workflow, incident response, or performance optimization when diagnosis is the main bottleneck.