Pre-Mortem Path Tracing: Adversarial Invariant Analysis for Distributed Systems

Authors: John Alexander Mobley, Claude (Anthropic)
Date: 2026-02-26
Status: First Draft
Location: MASCOM / MobCorp
Companion Papers: OperationalSingularity.md, SovereigntyGauntlet.md


Abstract

We present pre-mortem path tracing (PMPT) — a lightweight, tool-free methodology for identifying correctness bugs in distributed systems before deployment. Unlike formal verification (which requires mathematical specification) or dynamic testing (which requires execution), PMPT is a structured mental discipline: enumerate the state invariants a system must preserve, trace every execution path, and adopt an adversarial stance to find paths where invariants break. The method draws from Failure Mode and Effects Analysis (FMEA), Hoare logic, and pre-mortem retrospection, synthesizing them into a practice executable in minutes by any engineer. We demonstrate the method on a real production system — the bidirectional Forge courier in MASCOM — where a single PMPT session uncovered four bugs including a critical infinite loop, a race condition between concurrent sessions, a silent type coercion failure, and a database schema mismatch. We argue that PMPT occupies a practical gap between the rigor of formal methods and the coverage of dynamic testing, particularly for distributed systems where state is mutable, shared, and asynchronously updated.


1. Introduction

Software bugs cluster predictably. They are not uniformly distributed across code. They concentrate at state transitions — the moments when a function’s execution terminates and mutable state is left in some configuration that the rest of the system now relies upon. A function that reads and returns without side effects can be wrong in its logic but cannot leave the system in an inconsistent state. A function that updates shared state can be locally correct — each line doing exactly what it says — while globally corrupting the invariants everything else depends on.

This is the fundamental asymmetry that motivates the present work. The standard tools of software quality — unit tests, integration tests, linters, type checkers — are primarily forward-looking: they verify that given certain inputs, certain outputs or behaviors result. They are less effective at verifying state invariant preservation across all execution paths, because that requires:

  1. Knowing what the invariants are (often implicit)
  2. Enumerating the paths (often non-obvious in concurrent or exception-rich code)
  3. Checking each path × invariant combination (combinatorially expensive to test exhaustively)

Formal verification solves this in principle but requires mathematical specification of invariants and machine-checkable proof — a cost prohibitive for most practical systems. Dynamic testing catches bugs that manifest during test execution but misses bugs on untested paths, which are typically the rare, adversarial, or concurrent paths where real failures occur.

Pre-mortem path tracing occupies a third position: informal but systematic. It requires no tools, no specification language, no test harness. It requires only a disciplined mental stance: assume the system is already broken; trace backward from that assumption to find where.

1.1 The Trench Warfare Analogy

The immediate context for this methodology was a distributed multi-session AI system operating in production — the MASCOM architecture — where multiple Claude sessions share a persistent checkpoint file, a distributed message bus (Forge), and a session state store. The operational paradigm resembles trench warfare communication: messages travel through unreliable channels (Forge API, JSONL notification files), couriers may be duplicated (concurrent sessions), orders arrive out of order, and the only verification that intent was executed is observational (SCADA: “did the sitrep actually appear on the channel?”).

In this environment, bugs are not merely inconvenient. A bug that causes infinite re-reporting of old messages floods collaborating sessions with noise. A race condition between two sessions broadcasting simultaneously pollutes the shared communication channel with duplicate intelligence reports. A silent type failure that drops all incoming messages means the system appears healthy while operating blind.

These are the failure modes of a courier network under adversarial conditions. PMPT is the operational planning discipline that war-games those failures before the courier departs.


2. Related Work

2.1 FMEA (Failure Mode and Effects Analysis)

FMEA originated in the US military in the 1940s and was formalized by NASA for Apollo. The method systematically enumerates all components of a system, their potential failure modes, the effect of each failure on the overall system, and the probability and severity of each failure. FMEA is rigorous but heavyweight — designed for physical systems where failure modes are finite and enumerable by domain knowledge.

PMPT is philosophically descended from FMEA but adapted for software: the “components” are code paths, the “failure modes” are invariant violations, and the “effects” are downstream system state corruption. Unlike FMEA, PMPT does not require probability estimation and is designed to be completed in minutes, not days.

2.2 Pre-Mortem Analysis (Klein, 2007)

Gary Klein’s pre-mortem technique asks project teams to imagine that a project has already failed and then work backward to explain why. This prospective hindsight activates knowledge that forward-looking planning suppresses — team members who privately doubt a plan feel more comfortable articulating specific failure scenarios when the failure is stipulated as a given.

PMPT applies the same prospective hindsight to code: stipulate that the system has already failed, then trace the execution paths that could produce that failure. The adversarial stance is the key contribution: you are not reading the code to understand it; you are reading it to convict it.

2.3 Hoare Logic and Design by Contract

Hoare logic (Hoare, 1969) provides a formal framework for reasoning about program correctness via preconditions, postconditions, and loop invariants. The Hoare triple {P} C {Q} asserts that if precondition P holds before command C executes, then postcondition Q holds after. Design by Contract (Meyer, 1992) brings this into practice via explicit annotations.

PMPT operationalizes the Hoare intuition informally: the “invariants” of PMPT correspond to system-level postconditions that must hold after any execution path completes. The practice is: for each execution path through a function, verify that the system-level postconditions still hold at exit. No formal notation required — the discipline is the same.
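The correspondence can be made concrete with a toy example (not from the Forge codebase): a function that treats a system-level invariant, conservation of total balance, as a postcondition asserted on every exit path, including the early return.

```python
# Toy illustration of the PMPT reading of a Hoare triple {P} C {Q}:
# a system-level invariant checked as a postcondition at every exit.

def transfer(accounts: dict, src: str, dst: str, amount: int) -> bool:
    """Move `amount` between accounts. Invariant: total balance is conserved."""
    total_before = sum(accounts.values())              # capture state for the check
    if accounts.get(src, 0) < amount:
        assert sum(accounts.values()) == total_before  # Q must hold on the early exit too
        return False
    accounts[src] -= amount
    accounts[dst] = accounts.get(dst, 0) + amount
    assert sum(accounts.values()) == total_before      # Q on the success path
    return True
```

In production code the asserts would typically be dropped; the point is the reading discipline, namely checking Q on every path out of C, not only the happy one.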

2.4 Symbolic Execution

Symbolic execution (King, 1976) explores multiple execution paths simultaneously by treating inputs as symbolic values. Modern tools (KLEE, S2E) can find bugs on specific paths that concrete testing misses. PMPT is the human analog: rather than a tool enumerating paths automatically, the engineer enumerates them mentally, guided by adversarial intuition about which paths are most likely to violate invariants.

2.5 The Gap PMPT Fills

| Method | Rigor | Cost | Concurrent Bugs | Schema Bugs | Type Bugs |
|---|---|---|---|---|---|
| Unit tests | Medium | Medium | Poor | Poor | Medium |
| Integration tests | Medium | High | Medium | Medium | Medium |
| Formal verification | High | Very high | Good | Good | Good |
| Static analysis / linters | Low-medium | Low | Poor | Poor | Good |
| PMPT | Medium | Very low | Good | Good | Good |

PMPT’s advantage is precisely that it is cheap: no test harness to write, no specification to formalize, no tool to run. The cost is a few minutes of disciplined adversarial reading. The coverage for the class of bugs most dangerous in distributed systems — state inconsistency, race conditions, schema drift — is disproportionately good for the investment.


3. The Methodology

3.1 Four Steps

Step 1: Identify the state invariants.

Before reading the code, articulate what must be true for the system to work. These are not function-level postconditions but system-level truths — the commitments the codebase makes to itself and to other components.

For the Forge courier, the invariants were:

- last_post_id always reflects the maximum post ID processed — no post is ever re-reported
- At most one sitrep is broadcast per throttle window across all sessions
- Decisions pulled from the database reflect only genuinely new work since last broadcast
- Incoming post IDs can always be compared to last_post_id without type error

These are not written in the code. They must be extracted by reading the code and asking: what does this module promise to the rest of the system?

Step 2: Enumerate the execution paths.

Trace every branch, every exception handler, every early return. Draw this explicitly if needed. For most functions, the paths number in the single digits. The goal is to make the paths visible — code that appears to have one path often has three when exceptions are included.

For the Forge courier’s main(), the paths were:

- Path A: No new emissions + no new posts → save checkpoint + return {}
- Path B: New emissions found + sitrep posts successfully → update last_sitrep_at → continue to PASS 1
- Path C: New posts found + _forge_block_allowed() is False → save checkpoint + return {}
- Path D: New posts found + _forge_block_allowed() is True → save checkpoint + return {"decision": "block"}
- Path E: Two sessions simultaneously reading checkpoint before either writes

Step 3: For each path × invariant, ask: does the invariant hold at exit?

This is the adversarial step. For each invariant, work through each path and verify it is preserved. When you find a path where an invariant breaks, you have found a bug.

Path C × Invariant 1: In Path C (no-block mode), checkpoint["last_post_id"] was updated after the _forge_block_allowed() branch — meaning Path C exits without updating last_post_id. On the next invocation, the same posts are still > last_id. Invariant broken → infinite re-reporting bug.

Path E × Invariant 2: Both sessions read last_sitrep_at = 0, both pass the throttle check, both build and post a sitrep. The checkpoint update from Session A is invisible to Session B until after both have already committed to posting. Invariant broken → duplicate sitrep bug.

Step 4: Generate adversarial scenarios for paths you might have missed.

After the mechanical path × invariant check, switch to pure adversarial generation: ask what could the outside world do to break this? This catches interface bugs — mismatches at system boundaries.


4. Case Study: The MASCOM Forge Courier

4.1 System Context

The forge_stop_hook.py module runs as a Claude Code Stop hook — executing after every AI session response. It performs two passes: maybe_post_sitrep() (PASS 2), which broadcasts a situation report when the throttle window permits, and check_incoming() (PASS 1), which processes new incoming posts and decides whether to emit a block decision.

The module maintains a checkpoint file at /tmp/forge_hook_checkpoint.json storing: last_post_id, last_sitrep_at, last_sitrep_emission_ts, cycle. The checkpoint is shared across all concurrently running Claude sessions.
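The checkpoint lifecycle can be sketched as follows, using the field names above. The atomic write-then-rename is an assumption about good practice for a file shared across sessions, not a claim about the real module's implementation:

```python
# Hedged sketch of the shared checkpoint's load/save cycle. The default
# field values and the atomic rename are illustrative assumptions.
import json
import os
import tempfile
from pathlib import Path

def load_checkpoint(path: Path) -> dict:
    try:
        return json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        # Missing or half-written file: fall back to a fresh checkpoint.
        return {"last_post_id": 0, "last_sitrep_at": 0,
                "last_sitrep_emission_ts": 0, "cycle": 0}

def save_checkpoint(path: Path, ckpt: dict) -> None:
    # Write to a temp file, then rename: concurrent readers never observe
    # a partially written checkpoint.
    fd, tmp = tempfile.mkstemp(dir=str(path.parent))
    with os.fdopen(fd, "w") as f:
        json.dump(ckpt, f)
    os.replace(tmp, str(path))
```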

4.2 Invariant Identification

Working from the module’s implicit contract:

I1 (Progress): last_post_id always reflects the highest post ID seen, regardless of which execution path was taken. No post should ever be reported twice.

I2 (Deduplication): Across all concurrent sessions, at most one sitrep is posted per throttle window.

I3 (Correctness): The decisions section of a sitrep reflects only decisions made since the last sitrep. This requires the database timestamp query to use the correct column name and a compatible timestamp format.

I4 (Type safety): post_id values from the notification file can be compared to last_post_id without runtime error.

4.3 Path Enumeration

The main() function has five materially distinct paths:

main()
├── PASS 2: maybe_post_sitrep(checkpoint)
│   ├── throttle check fails → return False (checkpoint unchanged for sitrep fields)
│   ├── no new emissions → return False
│   ├── [RACE] concurrent session already posted → return False (absorb their ts)
│   ├── primary Forge post succeeds → update last_sitrep_at, last_sitrep_emission_ts → return True
│   └── primary fails → fallback post:
│       ├── fallback succeeds → update checkpoint → return True
│       └── fallback fails → return False (checkpoint unchanged — retry next time)
│
└── PASS 1: check_incoming(checkpoint)
    ├── no new posts → save_checkpoint → print({}) → return
    ├── new posts + block disabled (Path C) → save_checkpoint → print({}) → return
    └── new posts + block enabled (Path D) → save_checkpoint → print({"decision":"block"}) → return

4.4 Invariant × Path Analysis

| Path | I1 (Progress) | I2 (Deduplication) | I3 (Correctness) | I4 (Type safety) |
|---|---|---|---|---|
| No new posts | ✓ | N/A | N/A | ✓ |
| Sitrep throttled | N/A | ✓ | N/A | N/A |
| Path C (no-block) | ✗ BUG | N/A | ✓ | ✓ |
| Path D (block) | ✓ | N/A | ✓ | ✓ |
| Concurrent sessions | N/A | ✗ BUG | N/A | N/A |
| Decisions query | N/A | N/A | ✗ BUG | N/A |
| Notification parse | N/A | N/A | N/A | ✗ BUG |

Bug 1 (I1 × Path C): In the no-block path, checkpoint["last_post_id"] = max_id was placed after the early return. Every invocation in no-block mode re-reports the same posts. Fix: advance last_post_id before the branch.
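A minimal sketch of the corrected control flow, with simplified stand-ins for the module's real names and signatures, shows the ordering that preserves I1 on every exit path:

```python
# Sketch of the Bug 1 fix. `checkpoint`, `posts`, and `block_allowed` are
# simplified stand-ins for the real module's state and arguments.

def check_incoming(checkpoint: dict, posts: list[dict], block_allowed: bool) -> dict:
    last_id = checkpoint.get("last_post_id", 0)
    new = [p for p in posts if p["post_id"] > last_id]
    if new:
        # Fix: advance last_post_id BEFORE branching on block mode, so that
        # every exit path records the progress (invariant I1).
        checkpoint["last_post_id"] = max(p["post_id"] for p in new)
    if not new or not block_allowed:
        return {}                      # Path A / Path C: progress still saved
    return {"decision": "block"}       # Path D
```

With the buggy ordering (update after the branch), the `return {}` on Path C exits before the assignment runs, so the same posts remain above last_post_id forever.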

Bug 2 (I2 × Concurrent): The window between throttle check and checkpoint write allows a second session to pass the throttle check before the first session writes its update. Fix: re-read checkpoint immediately before posting; bail if last_sitrep_at has advanced.
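The mitigation can be sketched as follows; the parameterised checkpoint path and the post_fn callback are illustrative assumptions standing in for the real Forge posting call:

```python
# Sketch of the Bug 2 mitigation: re-read shared state inside the critical
# window and bail out if another session has already advanced it.
import json
import time
from pathlib import Path

def maybe_post_sitrep(ckpt_path: Path, post_fn, throttle_s: float = 300.0) -> bool:
    def read() -> dict:
        try:
            return json.loads(ckpt_path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            return {}

    ckpt = read()
    if time.time() - ckpt.get("last_sitrep_at", 0) < throttle_s:
        return False                      # throttled
    # Fix: re-read the shared checkpoint just before posting; if a concurrent
    # session advanced last_sitrep_at in the window, absorb it and skip.
    fresh = read()
    if fresh.get("last_sitrep_at", 0) > ckpt.get("last_sitrep_at", 0):
        ckpt["last_sitrep_at"] = fresh["last_sitrep_at"]
        ckpt_path.write_text(json.dumps(ckpt))
        return False
    post_fn()                             # commit: broadcast the sitrep
    ckpt["last_sitrep_at"] = time.time()
    ckpt_path.write_text(json.dumps(ckpt))
    return True
```

Note that the re-read narrows the race window rather than eliminating it; closing it fully would require a file lock or an atomic compare-and-swap.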

Bug 3 (I3 × Decisions query): The query used column created_at (which does not exist) and a timestamp comparison format incompatible with the stored values. Fix: use made_at and normalise the ISO timestamp to a plain UTC string.
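A sketch of the normalisation, assuming the stored made_at values use the plain "YYYY-MM-DD HH:MM:SS" UTC form (an assumption; the paper does not specify the stored format):

```python
# Sketch of the Bug 3 timestamp fix: normalise an ISO-8601 timestamp,
# possibly zone-aware, to an assumed plain UTC string for SQL comparison.
from datetime import datetime, timezone

def to_utc_string(ts: str) -> str:
    # fromisoformat does not accept a trailing "Z" on older Pythons,
    # so rewrite it as an explicit +00:00 offset first.
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    return dt.strftime("%Y-%m-%d %H:%M:%S")
```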

Bug 4 (I4 × Notification parse): The Forge API may return post_id as a string. Comparing a string to an int raises TypeError in Python 3, which was silently swallowed by the notification parser’s bare except Exception: pass, dropping all posts. Fix: int(n.get("post_id", 0)).
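A sketch of the hardened parser; the JSONL record shape is assumed from the surrounding description, and the narrow except clause replaces the bare one so a single malformed record no longer discards the rest:

```python
# Sketch of the Bug 4 fix: coerce post_id defensively before comparison,
# and catch only the exceptions a bad record can actually raise.
import json

def parse_notifications(lines: list[str], last_id: int) -> list[dict]:
    new = []
    for line in lines:
        try:
            n = json.loads(line)
            pid = int(n.get("post_id", 0))   # Forge may return "17" as a string
        except (ValueError, TypeError, AttributeError):
            continue                          # drop one bad record, not every record
        if pid > last_id:
            n["post_id"] = pid
            new.append(n)
    return new
```

(json.JSONDecodeError is a subclass of ValueError, so malformed JSON is covered by the same clause.)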

4.5 Time Investment

The complete analysis — reading the code, enumerating paths and invariants, completing the matrix, generating adversarial scenarios — required approximately eight minutes of focused reading. All four bugs were identified before running any additional tests. Each bug was subsequently confirmed to exist in the original code.


5. Generalisation: When Is PMPT Most Valuable?

PMPT returns disproportionate value in the following conditions:

5.1 Shared mutable state. When a function updates state that other components depend upon, every exit path must leave that state consistent. The more components depend on a piece of state, the higher the blast radius of an inconsistency. PMPT’s path × invariant matrix makes exit-state auditing systematic rather than incidental.

5.2 Exception-rich code. Bare except Exception: pass blocks are one of the most reliable bug attractors in Python. They convert runtime exceptions — including TypeError, OperationalError, AttributeError — into silent no-ops. PMPT specifically asks: what happens on this exception path? Standard code review tends to treat exception handlers as boilerplate.

5.3 Concurrent systems. Race conditions are impossible to trigger reliably with unit tests unless you specifically engineer them. PMPT treats concurrency as an adversarial scenario: what if two instances of this code run simultaneously? This surfaces shared-state races, double-write bugs, and lost-update problems that would only manifest under production load.

5.4 Schema dependencies. Distributed systems accrete schema changes. Code written when a table had column created_at silently breaks when the column is renamed made_at. Automated tests rarely cover this because test databases are often freshly initialised. PMPT, by explicitly reasoning about external dependencies as adversarial scenarios, naturally asks: what if this schema assumption is wrong?

5.5 System boundaries. Type mismatches, encoding issues, and format incompatibilities accumulate at system boundaries — the points where data crosses from one component to another. Invariant 4 (type safety) was only visible by asking: what does the Forge API actually return, versus what this code assumes it returns?


6. Relationship to SCADA Verification

PMPT is a pre-deployment discipline. It reduces the probability of shipping invariant violations but cannot provide certainty — the paths enumerated are those the engineer can conceive, not all paths that exist. Real systems operating in adversarial environments (network failures, schema drift, concurrent access, API changes) will always encounter conditions the pre-mortem missed.

This is why PMPT pairs naturally with observational verification — the SCADA layer. After deployment, the system should emit telemetry that directly confirms the invariants are holding: did the sitrep actually appear on the channel, did last_post_id advance, did exactly one session broadcast per throttle window?

The trench warfare framing clarifies the relationship: PMPT is the operational planning discipline executed before the courier departs (pre-mortem path tracing). SCADA verification is the field observation system that confirms the orders arrived (post-hoc invariant monitoring). Neither alone is sufficient. Planning without observation is overconfident. Observation without planning produces bugs that should have been caught before deployment.

Together, they constitute a closed cognitive loop: pre-mortem tracing reduces the deployment error rate, SCADA monitoring catches residual failures, residual failures feed back into improved invariant identification for the next pre-mortem session. The loop converges.


7. Formalisation Sketch

For engineers who prefer a more rigorous framing, PMPT can be partially formalised as follows.

Let S be the system state space. Let I ⊆ S be the set of valid states — those satisfying all system invariants. A function f is invariant-preserving if for all s ∈ I, every execution path through f terminates in a state s’ ∈ I.

PMPT is the practice of verifying invariant preservation by:

1. Explicitly specifying I as a finite set of named invariants
2. Enumerating the execution paths of f as a finite set P
3. For each (p, i) ∈ P × I, verifying that execution of p from any valid state preserves invariant i

The adversarial scenario generation in Step 4 extends this to boundary conditions: for each invariant i, enumerate the minimum inputs from external systems (APIs, databases, concurrent processes) that could violate i if the code does not explicitly handle them.
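For small examples the P × I check can be mechanised as a toy driver (illustrative only; PMPT itself is a mental practice, and the state, path, and invariant names below are invented for the demonstration):

```python
# Toy driver for the P × I check: run each enumerated path against each
# named invariant from a fresh valid state and collect the violations.

def pmpt_matrix(paths: dict, invariants: dict, make_state) -> list[tuple[str, str]]:
    """paths: name -> fn(state); invariants: name -> predicate(state)."""
    violations = []
    for pname, path in paths.items():
        for iname, holds in invariants.items():
            state = make_state()          # start each check from a valid state
            path(state)                   # execute the path's effect on state
            if not holds(state):
                violations.append((pname, iname))
    return violations

# Hypothetical miniature of the Forge courier's I1:
def make_state():
    return {"last_post_id": 0, "posts": [1, 2]}

paths = {
    "updates_checkpoint": lambda s: s.update(last_post_id=max(s["posts"])),
    "early_return": lambda s: None,      # forgets to advance last_post_id
}
invariants = {"I1": lambda s: s["last_post_id"] == max(s["posts"])}
```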

Full formalisation would require a specification language for I and a decision procedure for path enumeration — at which point PMPT becomes symbolic execution with human guidance. The method is deliberately kept informal to remain accessible: the goal is a practice any engineer can apply in minutes, not a theorem prover.


8. Limitations

PMPT does not replace testing. PMPT finds invariant violations that are invisible to tests because they depend on specific execution paths or environmental conditions. It does not verify that the logic within a path is correct. Both are needed.

Path enumeration is incomplete. An engineer will miss paths. Exception handlers deep in library code, OS-level interrupts, and hardware failures are not practically enumerable by mental tracing. PMPT is most effective for application-level code where the paths are few and the engineer has full visibility.

Invariant identification requires system understanding. The method’s effectiveness is bounded by the quality of the invariants identified in Step 1. An engineer unfamiliar with the system may miss critical invariants. PMPT is most powerful when applied by the engineer who designed the module or has read the full context.

Scale. PMPT is a human practice and does not scale to millions of lines of code. It is best applied to specific, high-risk modules — those with shared mutable state, high concurrency exposure, or critical correctness requirements. Selecting which modules to apply PMPT to is itself a judgment call.


9. Conclusion

Pre-mortem path tracing is not a new idea. Its components — invariant reasoning, path enumeration, adversarial scenario generation — are each decades old. What this paper contributes is their synthesis into a named, teachable, lightweight practice suitable for everyday engineering use.

The core insight is attitudinal: the difference between reading code to understand it and reading code to convict it. When you assume the system is already broken and ask which execution path produced this failure?, you see things that forward-reading misses. Exception handlers become suspects. Early returns become escape routes that bypass critical updates. Concurrent execution becomes a coordinated attack.

The bugs found in the MASCOM case study were not exotic. They were the ordinary bugs of distributed systems: a shared counter not updated on all paths, a race window between read and write, a column name that drifted, a type assumption violated at a boundary. Every distributed system has these. PMPT finds them before they find you.

The trench warfare analogy holds: you cannot eliminate chaos from the operational environment. You can war-game it before the operation, build observational verification to catch what you missed, and close the loop so the next campaign begins with better maps.


References

Hoare, C. A. R. (1969). An Axiomatic Basis for Computer Programming. Communications of the ACM, 12(10).

King, J. C. (1976). Symbolic Execution and Program Testing. Communications of the ACM, 19(7).

Klein, G. (2007). Performing a Project Premortem. Harvard Business Review, 85(9).

Meyer, B. (1992). Applying "Design by Contract". IEEE Computer, 25(10).

Correspondence: MASCOM / MobCorp. This paper emerged from a live debugging session on the MASCOM Forge courier system on 2026-02-26, in which the methodology was applied in real time and produced the results described in Section 4.