Chaos Engineering: Building Resiliency in Ourselves and Our Systems

Published on 10/12/2025 9:00 PM by Eph Baum (w/ Claude)

Practicing for the Inevitable

Picture this: you’re an engineer at Netflix in 2011, and someone just released a tool that randomly kills production servers during business hours. On purpose. Not as punishment, not as sabotage: as practice.

That tool was Chaos Monkey, and it changed how we think about building resilient systems. Instead of hoping that redundancy works, you test it. Instead of assuming your failover is solid, you prove it. Instead of waiting for the worst day to discover your weaknesses, you hunt them down deliberately.

This is Chaos Engineering: the discipline of practicing failure before it happens for real.

But here’s what’s fascinating: Chaos Engineering doesn’t just build resilient systems. It builds resilient people and resilient processes. The practice of intentionally introducing turbulence strengthens not just technical architectures, but the human cultures and workflows that support them.

🌐 What Is Chaos Engineering?

Wikipedia defines chaos engineering as:

Chaos engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.

At its core, Chaos Engineering is about deliberate practice. Just as athletes train under stress to prepare for game day, we introduce controlled adversity into our systems to prepare them — and ourselves — for real-world turbulence.

It’s not about breaking things for fun or proving someone wrong. It’s about creating safe, intentional experiments that answer critical questions:

The answers to these questions don’t come from architecture diagrams or documentation — they come from practice. And practice builds resiliency.

⚙️ Building Resiliency in Systems

Distributed systems are messy, unpredictable, and full of hidden dependencies. Something will always go wrong — what matters is how the system responds when it does.

Resilient systems don’t rely on perfection. They rely on processes and patterns that absorb shocks and recover gracefully:

But here’s the thing: we don’t wait for the worst day to find out if these patterns work. That’s where Chaos Engineering shines. By deliberately introducing turbulence (killing servers, adding latency, failing dependencies), we give our systems the chance to “train” for the unexpected.
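To make "adding latency" and "failing dependencies" concrete, here is a minimal sketch of a fault-injecting wrapper. It is illustrative only: `with_chaos`, `InjectedFault`, and `fetch_profile` are hypothetical names invented for this example, not part of Chaos Monkey or any real tool, and a real setup would gate this behind configuration and limit the blast radius.

```python
import random
import time


class InjectedFault(Exception):
    """Raised when the chaos wrapper deliberately fails a call."""


def with_chaos(func, failure_rate=0.0, max_latency_s=0.0, rng=None):
    """Wrap func so that calls may be delayed or failed on purpose."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # Add latency: sleep for a random duration up to max_latency_s.
        if max_latency_s > 0:
            time.sleep(rng.uniform(0, max_latency_s))
        # Fail the dependency: raise instead of calling through.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    # Hypothetical dependency call standing in for a real service.
    return {"id": user_id, "name": "example"}


# Roughly 20% of calls fail and every call gains up to 50 ms of delay.
chaotic_fetch = with_chaos(fetch_profile, failure_rate=0.2, max_latency_s=0.05)
```

Callers of `chaotic_fetch` now have to cope with slow and failed calls, which is exactly the behavior the experiment is meant to exercise.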

When we run chaos experiments, our systems reveal their weak points:

Each experiment is an opportunity to strengthen the architecture. We discover the gaps, patch them, and run the experiment again. Over time, the system becomes more predictable under stress. It learns to fail well — predictably, visibly, and recoverably.
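The discover, patch, and rerun loop can be sketched as a tiny experiment runner. This is a simplified sketch under stated assumptions: `run_experiment` and the toy `servers` dictionary are hypothetical stand-ins, and a real experiment would check actual health metrics rather than an in-memory flag.

```python
def run_experiment(check_steady_state, inject, rollback):
    """Run one chaos experiment and report whether the hypothesis held."""
    if not check_steady_state():
        return "aborted: system unhealthy before the experiment"
    inject()
    try:
        held = check_steady_state()
    finally:
        rollback()  # always undo the fault, even if the check raises
    return "hypothesis held" if held else "weakness found"


# Toy system: three servers, and "steady state" means at least two are up.
servers = {"a": True, "b": True, "c": True}


def healthy_enough():
    return sum(servers.values()) >= 2


def kill_one():
    servers["a"] = False


def restore():
    servers["a"] = True


result = run_experiment(healthy_enough, kill_one, restore)
```

If the result is "weakness found", that is the gap to patch before running the same experiment again.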

“A resilient system isn’t one that never fails — it’s one that fails predictably, visibly, and recoverably.”

🧠 Building Resiliency in People

But Chaos Engineering doesn’t just strengthen systems — it strengthens the teams that build and run them.

Resiliency in people is what lets us navigate ambiguity, recover from failure, and grow stronger through adversity. In engineering teams, resiliency shows up in the way we handle outages, missed deadlines, or shifting priorities. It’s the difference between a team that fractures under pressure and one that adapts, learns, and comes back sharper.

When we practice chaos engineering, our teams get to practice responding:

We see it in:

Resiliency is built through iteration. It’s not innate; it’s practiced. And it thrives in environments where psychological safety is real, not performative. When teams know that chaos experiments are about learning, not blaming, they become more willing to surface problems, ask hard questions, and experiment boldly.

The value compounds: every chaos experiment strengthens both the technical architecture and the human architecture. The system becomes more predictable under stress, and the people become more confident in their ability to adapt.

“Chaos Engineering isn’t just a technical discipline — it’s a cultural one. It teaches our systems how to recover, and our teams how to trust.”

📜 Eph’s Law: Process Over Blame

One of the most powerful lessons from Chaos Engineering is this: when something fails, the problem is rarely the individual. It is usually the process that allowed the failure.

We’ve got Murphy’s Law to remind us that anything that can go wrong, will. We’ve got Conway’s Law to remind us that our systems mirror our communication structures. But there’s a gap in the canon — something that speaks to resiliency, accountability, and the way we build trust in engineering organizations.

So here’s my contribution:

Eph’s Law

“If a single engineer can bring down production, the failure isn’t theirs — it’s the process.”

This isn’t about clever phrasing. It’s about reframing how we see incidents. When a deploy takes down production, it’s not proof that an engineer was careless — it’s proof that our safeguards, reviews, or automation weren’t strong enough. A resilient organization doesn’t punish the individual; it strengthens the process so the same mistake can’t happen again.

Think about it:

Eph’s Law is a reminder that resiliency is systemic. Just as distributed systems need redundancy, retries, and graceful degradation, organizations need processes that absorb human error without collapsing.
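On the systems side, retries and graceful degradation might look something like the sketch below. It is a hedged illustration, not a recommended library: `call_with_retries` and its parameters are hypothetical names chosen for this example, and production code would catch narrower exception types and add jitter to the backoff.

```python
import time


def call_with_retries(primary, fallback, attempts=3, base_delay_s=0.0):
    """Try primary up to `attempts` times, then degrade to fallback."""
    for attempt in range(attempts):
        try:
            return primary()
        except Exception:
            # Exponential backoff between attempts: base, 2x base, 4x base...
            if attempt < attempts - 1:
                time.sleep(base_delay_s * (2 ** attempt))
    # All attempts failed: serve a degraded answer instead of an error.
    return fallback()
```

The organizational analogue is the same shape: a process that expects individual steps to fail sometimes and has a safe default when they do.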

When something slips into production and causes a problem, resilient organizations ask:

This shift matters because it reframes incidents from personal failures into organizational learning opportunities. Instead of punishing individuals, we strengthen the system around them. By treating incidents as process breakdowns, we build trust. Engineers feel safe to experiment, to deploy, to move fast — because they know the organization has their back. And that safety is what fuels both innovation and resiliency.

“Blame fixes nothing. Process fixes everything.”

Resiliency as a Practice

Resiliency isn’t a feature you ship — it’s a practice you cultivate. In systems, it’s built through redundancy, observability, and experimentation. In people, it’s built through trust, iteration, and reflection. And both are strengthened through the deliberate practice of Chaos Engineering.

The beauty of Chaos Engineering is that you don’t need permission to start — though running chaos in production without buy-in may be a career-limiting move. Whether it’s running a small chaos experiment in staging, asking “what if” in a retro, or reframing an outage as a process gap instead of a personal failure — you can begin today.

Start small:

Because in a world of distributed systems and unpredictable dependencies, resiliency isn’t optional. It’s the trait that keeps us — and our systems — standing.
