Incident Response at Home

In the datacenter, when a server goes down, we don't scream at the hardware. We don't take it personally. We don't lecture the RAID controller about how it should have known better. We open a ticket, assess severity, stabilize the environment, and then (only then) we look for the root cause. The process exists because we learned, through years of painful outages, that emotional reactions in the middle of an incident make everything worse.

Now let's talk about what happens when a child melts down at 7:42 AM on a school day.

The Default Response Is Wrong

The instinct, the one most of us inherited from how we were raised, is to treat a meltdown as a behavioral problem. The child is being defiant. The child is choosing this. The child needs consequences. This framing is the equivalent of walking into the datacenter during a disk failure and yelling at the server for being unreliable. It might feel like you're doing something, but you're not moving toward resolution. You're adding noise to an already degraded environment.

What I've learned (slowly, imperfectly, and with plenty of my own failed responses in the rearview mirror) is that a meltdown is an incident, not an infraction. The child's system is overwhelmed, and the observable behavior is a symptom, not the root cause.

Mapping the Incident Framework

The incident management frameworks we use in IT translate to home environments with surprisingly little modification.

IT Incident Phase | Home Equivalent
--- | ---
Detection & Alerting | Recognizing escalation cues (voice pitch, stimming changes, withdrawal)
Triage & Severity | Is this a sensory overload, a transition failure, a hunger issue, or an accumulation event?
Stabilization | Reduce inputs. Lower your own volume. Remove audience. Offer co-regulation, not correction.
Root Cause Analysis | After calm is restored (not during), identify what triggered the cascade
Remediation | Adjust the environment, routine, or expectations to reduce recurrence

The critical insight is the same one that applies in IT: you cannot do root cause analysis during an active incident. You stabilize first. You investigate later. Trying to have a teaching moment with a dysregulated child is like trying to patch a kernel while the server is actively crashing. The system isn't in a state to receive your input.

Severity Levels Help

One framework that's helped in our household is borrowing the concept of severity levels. Not formally (we don't announce "Sev 2" at the breakfast table), but internally, as a mental model for calibrating my own response:

Severity | Presentation | My Response
--- | --- | ---
Low | Grumbling, minor frustration, low-level resistance | Acknowledge, offer a choice, keep moving
Medium | Elevated voice, tears, refusal to engage | Pause the current plan, reduce demands, offer space or co-regulation
High | Full dysregulation, loss of verbal capacity, physical escalation | Safety first, remove stimuli, presence without words, wait it out

The value here isn't in categorizing my kid's emotions (they're not tickets). It's in giving me a decision tree so I don't have to think through my response from scratch every time under pressure. It's the same reason we write runbooks: so that at 2 AM, when the pager goes off, you're not improvising.
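For readers who think in code, the runbook idea can be sketched as a plain lookup. This is only an illustration: the names Severity, RUNBOOK, and respond are hypothetical, and the steps are copied from the severity table above rather than taken from any real tool.

```python
from enum import Enum


class Severity(Enum):
    """Hypothetical severity levels mirroring the table above."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Illustrative runbook: each level maps to steps decided ahead of time,
# so nothing has to be invented under pressure.
RUNBOOK = {
    Severity.LOW: ["Acknowledge", "Offer a choice", "Keep moving"],
    Severity.MEDIUM: [
        "Pause the current plan",
        "Reduce demands",
        "Offer space or co-regulation",
    ],
    Severity.HIGH: [
        "Safety first",
        "Remove stimuli",
        "Presence without words",
        "Wait it out",
    ],
}


def respond(severity: Severity) -> list[str]:
    """Return the pre-decided steps for the given severity level."""
    return RUNBOOK[severity]


if __name__ == "__main__":
    for step in respond(Severity.MEDIUM):
        print(step)
```

The point of the sketch is that respond does no reasoning at all. The thinking happened when the table was written, which is the whole appeal of a runbook.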

The Blameless Postmortem

This is the part that changed everything for us. In mature engineering organizations, we run blameless postmortems. The question isn't "who screwed up?" It's "what conditions allowed this failure to occur, and what can we change systemically to prevent it?"

Applied at home, this sounds like a conversation, held well after the incident has resolved, where you say, "That morning was really hard. I'm not mad, I'm just trying to understand what happened so we can make it easier next time. Was it the noise? Was it that I changed plans without warning? Were you already running on empty from yesterday?"

It's funny in retrospect that the thing that made me a better incident commander at work (removing blame from the analysis) is the same thing that made me a better parent. In both cases, blame feels satisfying in the moment but produces zero useful data. What produces data is curiosity, and what produces lasting improvement is acting on that data to change the system rather than expecting the human (or the hardware) to simply perform better next time under identical conditions.

The Ongoing Practice

I'm not going to pretend we have this figured out. There are still mornings where my own regulation fails first, where I respond to a Sev 2 like it's a personal affront, where I forget everything I know about incident response and just react. The difference now is that I have a framework to return to. A way to debrief my own performance, identify what went wrong in my response, and adjust for next time.

The question for anyone managing similar dynamics: what does your current incident response look like? Are you stabilizing before investigating, or are you trying to root-cause in the middle of the outage? If it's the latter, that might be why the same incidents keep recurring.