Reliability Engineering

A set of principles for designing how systems behave under failure.

Reliability Engineering treats failure as a design problem: what can break, what the user sees when it does, and how the system degrades.

A Reliability Engineer applies these principles to answer those questions in advance, not during incidents.

Principles
  1. Start from the user, not the infrastructure

    The question is not "what happens to the server", but "what does the user see and lose". Reliability is defined not by internal metrics, but by whether a person can do their job.

  2. Design failure in advance

    Every critical component has a designed failure scenario. Not "what might break", but "what will happen when it breaks and how the system will behave".

  3. Degradation beats unavailability

    A system that works worse is more useful than a system that doesn't work at all. Degradation is not a bug; it is a designed state, with defined levels, defined behavior at each level, and a defined way of telling the user.

  4. Be honest with the user

    If something isn't working, the user should know. No searching, no status pages, no guessing. Feedback about problems appears where the user already is. Part of the resolution can be delegated to the user, provided you explain honestly what is going on.

  5. Start with the defaults

    Default settings are usually the most widely tested. Every deviation is a new failure surface: a new component that must be understood, maintained, and given its own failure scenario. Deviate deliberately, document the deviation, and assign an owner.

  6. Knowledge must be concrete and accessible

    A reliability knowledge base is not a collection of essays or a 50-page Confluence space. It is a set of short, specific specifications: what to do, what not to do, what to expect. Write like a spec for an engineer, not an article for a blog. One component, one document, two screens maximum.

  7. Make experiments cheap, keep production boring

    New code, AI-generated code, hypotheses — all of it lives in an isolated sandbox. Code in the sandbox has an expiration date. Production only gets what has been verified and understood. Chaos is contained to its territory; production stays predictable.
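
One way to make principles 2 to 4 concrete is to write each failure scenario down as data: for every critical component, the degradation level the product drops to, what the system does in that state, and what the user is told. The sketch below is only an illustration under assumptions. The level names, component names, and message wording are all hypothetical and are not part of the principles themselves.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Degradation levels, ordered from healthy to unusable (hypothetical names)."""
    FULL = 0
    DEGRADED = 1
    READ_ONLY = 2
    UNAVAILABLE = 3


@dataclass(frozen=True)
class FailureScenario:
    """What happens when one component breaks, decided before any incident."""
    component: str
    level: Level        # state the product drops to
    behavior: str       # what the system does in that state
    user_message: str   # what the user is told, inside the product itself


# Hypothetical components and wording, for illustration only.
SCENARIOS = {
    "search-index": FailureScenario(
        component="search-index",
        level=Level.DEGRADED,
        behavior="Fall back to a plain database query over recent items.",
        user_message="Search is limited to recent items right now.",
    ),
    "payments-api": FailureScenario(
        component="payments-api",
        level=Level.READ_ONLY,
        behavior="Disable checkout and keep carts and order history visible.",
        user_message="Checkout is paused. Your cart is saved.",
    ),
}

# Assumed fallback wording for a component nobody designed a scenario for.
GENERIC_MESSAGE = "Something is not working. We are looking into it."


def current_state(failed_components):
    """Return the worst designed level and the messages to show the user."""
    level = Level.FULL
    messages = []
    for name in failed_components:
        scenario = SCENARIOS.get(name)
        if scenario is None:
            # No designed scenario: assume the worst rather than guess.
            level = Level.UNAVAILABLE
            messages.append(GENERIC_MESSAGE)
            continue
        level = max(level, scenario.level)
        messages.append(scenario.user_message)
    return level, messages
```

Calling `current_state(["search-index"])` returns the `DEGRADED` level together with the search message, which the interface can show next to the search box. A component with no entry falls through to `UNAVAILABLE`, which makes a missing design visible instead of hiding it.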

Who this is for

Small teams. Startups and small products with no dedicated SRE and no budget for a platform team, but with users who will leave if the service lets them down.

Detailed principles, practices, and references — coming soon.