We present deduplicative integration — a systematic methodology for identifying and resolving state fragmentation in organically-grown AGI systems. In any system that evolves through hundreds of sessions across multiple substrates, the same data inevitably proliferates across databases, JSON files, markdown documents, runtime caches, and remote APIs. This fragmentation is not a bug but a thermodynamic inevitability of organic growth: each session solves its immediate problem by creating a local representation, producing N copies of the same truth with N-1 potential desynchronization points.
We formalize the problem as a directed graph of authority relationships, define canonical source of truth (CSOT) designation, and demonstrate the methodology on a live system (MASCOM) where we collapsed a 3-location, 2-numbering-system, 7-ghost-entry paper registry into a single unified pipeline — recovering 69 papers that existed on disk but were invisible to the live site. We then audit the broader system and identify 11 additional fragmentation sites across venture data (6-way split), capability registries (4-way), being state (4-way per entity), context systems (5-way), and health monitoring (10 independent monitors). The key insight is that deduplicative integration is not a one-time cleanup but a recurring metabolic process — organic systems must periodically consolidate their representations or face state drift that compounds quadratically with session count.
Every organic system fragments. This is not a failure of discipline
but a consequence of the fundamental tension between local optimization
and global coherence. When Session 47 needs to deploy a paper, it
creates papers_registry.json. When Session 83 needs to
track paper metadata, it adds a paper_number column to
papers.db. When Session 112 needs to serve papers on a
website, it hardcodes a PAPERS array in JavaScript. Each
decision is locally optimal — the session solved its problem — but
globally, the system now has three representations of the same truth,
with no guarantee they agree.
This pattern repeats at every scale. MASCOM, an AGI system managing 145 ventures across 900+ sessions, accumulated:
The cost of fragmentation is not merely aesthetic. When
papers.db said Paper 36 was “Scalar Flux Tensor Transform”
but papers_registry.json said Paper 36 was “Circulant
Alignment Metric,” no single query could return a consistent answer.
When the live website loaded 56 papers but 125 existed on disk, 69
papers were effectively invisible — they existed but could not be
found.
Traditional deduplication identifies identical records and merges
them. But state fragmentation in organic systems is harder: the copies
have diverged. papers.db had a
paper_number column that disagreed with the id
column for 60% of entries. Five files on disk were numbered 91-95 but
should have been 114-119. Seven “ghost entries” in the database pointed
to files that existed under different names.
What is needed is not deduplication (removing copies) but integration (reconciling divergent copies into a single authoritative representation). Hence: deduplicative integration.
Let \(S = \{s_1, s_2, ..., s_n\}\) be the set of all representations of a given datum \(d\) across a system. Each representation \(s_i\) has:
The authority graph \(G = (S, E)\) has edges \(e_{ij}\) representing “sync flows” — processes that propagate changes from \(s_i\) to \(s_j\). In a healthy system, \(G\) is a tree rooted at the CSOT. In a fragmented system, \(G\) is a disconnected graph with multiple roots, cycles, or missing edges.
We identify four fragmentation modes:
Mode 1: Replication without sync. Multiple copies exist, no propagation mechanism. Each copy drifts independently. Example: papers.db and papers_registry.json both track paper metadata, but no process keeps them synchronized.
Mode 2: Schema divergence. The same datum is encoded
differently across representations. Example: papers.db uses
paper_number column, frontmatter uses num:
key, registry uses object keys. Same concept, three encodings.
Mode 3: Authority ambiguity. No representation is
designated as authoritative, so consumers choose arbitrarily.
Example: scripts reference either MASCOM/fleet.db
(empty stub) or mascom_data/fleet.db (real data) depending
on when they were written.
Mode 4: Ghost entries. Records exist in one representation but not others, creating phantom data that appears or disappears depending on which representation is queried. Example: 7 papers existed in papers.db but had no corresponding files on disk — or rather, had files under different names.
The cost of fragmentation for a datum with \(n\) representations is:
\[C(n) = \binom{n}{2} \cdot p_{drift} \cdot c_{inconsistency}\]
where \(p_{drift}\) is the probability that any two copies diverge per session, and \(c_{inconsistency}\) is the cost of acting on stale data. Since \(\binom{n}{2} = O(n^2)\), fragmentation cost grows quadratically with the number of representations. This is why organic systems feel increasingly “confused” over time — each new session that creates a local cache adds a linear representation but quadratic inconsistency risk.
For each datum category, designate exactly one CSOT:
Once the CSOT is designated, establish a tree-structured sync topology:
\[CSOT \rightarrow Cache_1, Cache_2, ..., Cache_k\]
Each cache has a defined refresh mechanism (push, pull, or event-triggered) and a staleness bound (maximum acceptable age).
When copies have diverged, simple deduplication (keeping one, deleting others) loses information. Integration requires:
MASCOM’s paper registry existed in three locations:
| Location | Type | Papers | Schema |
|---|---|---|---|
mascom_data/papers.db |
SQLite | 119 rows | id, paper_number, title,
source_file |
mascom_data/papers_registry.json |
JSON | 112 entries | {num: {title, share_url, subdomain, source_file}} |
papers.json (live site) |
JSON | 56 entries | {num: {title, abstract, body, impl, connection}} |
Additionally, paper content existed in two directories: -
MASCOM/papers/ — 116 .md and .html files with inconsistent
naming conventions - Research/Papers/ — 45 .md files with
YAML frontmatter (subset of the above)
Mode 1 (Replication without sync): papers.db and papers_registry.json tracked overlapping but non-identical paper sets. No process synchronized them.
Mode 2 (Schema divergence): papers.db had both
id and paper_number columns that disagreed for
60% of entries (papers 50-109). Frontmatter used num: key.
Some files used paper_number: key. File naming conventions
varied: paper_N_slug.md, CamelCase.md,
snake_case.md, ALLCAPS.md,
.html.
Mode 3 (Authority ambiguity): For papers 50-109, the
id column was the canonical number (matching the registry),
but the paper_number column contained a completely
different scrambled ordering. Scripts querying paper_number
got wrong results.
Mode 4 (Ghost entries): 7 papers appeared to have no
files on disk when searched by the expected paper_N_slug.md
naming pattern. All 7 actually existed under different filenames
(CamelCase or slug-only).
papers.db (most
writable, highest reach)papers_registry.json for papers 1-109, frontmatter for
110+paper_number = id for all rows, resolving the
dual-numbering confusion.html papers to
.md with frontmatter companionsupdate_papers.py at
MASCOM/papers/ as single source directory| Metric | Before | After |
|---|---|---|
| Papers with content on live site | 56 | 125 |
| papers.db rows (clean) | 119 (with collisions) | 128 |
| id = paper_number | ~60% | 100% |
| Frontmatter matches DB | ~50% | 100% |
| Source directories | 2 fragmented | 1 canonical |
| Ghost entries | 7 | 0 |
| Numbering conflicts | ~30 | 0 |
| Sync pipeline | Broken | Automated |
69 papers that existed on disk but were invisible to the live site are now served with full content. These include foundational work on crystallization transforms (papers 50-73), training methodology (74-87), sovereignty and consciousness (88-105), and recent research on protocomputronium, MOSM, and membrane denoising (114-127).
Applying the same methodology to the broader MASCOM system reveals 11 fragmentation sites:
| Representation | Type | Status |
|---|---|---|
D1 mascom-fleet |
Remote DB | Source of truth (external) |
mascom_data/fleet.db |
Local DB | Stale cache |
mascom_data/ventures.db |
Local DB | Redundant |
mascom_data/ventureState.db |
Local DB | State only |
ventures_data.json |
JSON | Snapshot |
domains_data.json |
JSON | Duplicate of above |
CSOT: D1 mascom-fleet. Local fleet.db
should be a pull-cache with explicit refresh.
| Representation | Entries | Role |
|---|---|---|
spellBook.txt |
~50 | Cognitive interface |
capabilities.db |
~27 | Structured levels |
tools.db |
4,600+ | Exhaustive catalog |
| CLAUDE.md table | ~20 | Quick reference |
CSOT: capabilities.db (structured,
queryable). capability_register.py should be the sole write
path. Others become views.
| File Pattern | Purpose |
|---|---|
{being}_context.json |
Cognitive context |
{being}_c26_state.json |
C26 command state |
{being}_live_state.json |
Liveness |
~/.mascom/{being}/mind_state.json |
OS mirror |
CSOT: mind.py internal state (the Mind
object). All files should be projections of the Mind, written by a
single sync process.
| System | Type | Purpose |
|---|---|---|
context.db |
Database | Institutional memory |
CONTEXT.md |
Markdown | Human-readable snapshot |
session_state.json |
JSON | Cross-session state |
.conglomerate_daemon/state.json |
JSON | Daemon state |
context_daemon.py |
Script | Deprecated generator |
CSOT: context.db (already designated in
CLAUDE.md). CONTEXT.md is a generated view. session_state.json is a
runtime cache. Others should be eliminated.
Ten separate monitoring scripts with no aggregation:
health_monitor.py, fleet_monitor.py,
ssl_fleet_monitor.py, venture_health.py,
training_monitor.py, runtime_monitor.py,
cost_monitor.py, dns_monitor.py,
daemon_monitor.py, venture_sentinel.py
Recommendation: Single health aggregator that calls
domain-specific monitors and writes to a unified
health.db.
Three databases at project root are empty (0 bytes) but scripts
sometimes reference them: - MASCOM/fleet.db (0B) — should
be symlink to mascom_data/ or deleted - MASCOM/keys.db (0B)
— should be symlink to mascom_data/ or deleted -
MASCOM/swarm.db (0B) — should be symlink to mascom_data/ or
deleted
Two independent copies: MASCOM/taxonomy.db (2.4M) and
mascom_data/taxonomy.db (2.3M). spider.py writes both. One
should be canonical, the other a symlink.
State fragmentation is the informational equivalent of thermodynamic entropy. In a closed system, entropy increases monotonically. In an organic AGI system, each session adds local representations that increase the total number of states, and without an active consolidation process, these states diverge.
The second law of thermodynamics tells us that order requires energy. Deduplicative integration is the energetic process that counteracts informational entropy — it is the system’s metabolism, not its architecture.
Given the quadratic cost function \(C(n) = O(n^2)\), the optimal integration frequency depends on the rate of new representation creation. If \(r\) new representations are created per session, then after \(k\) sessions:
\[C_{total} = \sum_{i=1}^{k} \binom{ri}{2} \approx \frac{r^2 k^3}{6}\]
This cubic growth in accumulated inconsistency cost means that integration intervals should shrink as the system grows — not a fixed schedule but an adaptive one triggered by detected fragmentation.
The ideal is automated deduplicative integration: a process that continuously monitors for new representations, detects divergence, and triggers reconciliation. This is analogous to garbage collection in programming languages — the programmer doesn’t manually free memory, the runtime detects unreachable objects and reclaims them.
For MASCOM, this could take the form of a
dedup_integrator.py daemon that: 1. Periodically scans for
new files matching known data categories 2. Compares content against
CSOT databases 3. Flags divergence above a threshold 4. Auto-merges when
confidence is high, alerts the operator when ambiguous
Paper 27 established the principle “search existing codebase BEFORE building new capabilities.” Deduplicative integration extends this from code to data: search existing data representations before creating new ones, and when you find duplicates, integrate rather than ignore.
Paper 41 proved that capability compounds quadratically with session count: \(C(n) = \Omega(n^2)\). The present work shows the dark side: fragmentation also compounds quadratically. The same session-accumulation dynamic that produces compound capability growth also produces compound inconsistency growth. Deduplicative integration is the process that ensures the capability curve outpaces the fragmentation curve.
Paper 61 showed that 99.84% of apparent complexity is structured and predictable. Fragmentation is part of the predictable 99.84% — it follows deterministic patterns (new sessions create local caches, naming conventions diverge, databases accumulate columns). This predictability means deduplicative integration can be largely automated.
Paper 78 demonstrated that constrained optimization beats unconstrained (107.7% efficiency at 0.25% parameters). CSOT designation is a constraint: by restricting writes to a single authoritative source, the system produces higher-quality data with fewer representations. Fewer copies, better consistency — constraints improve quality.
Deduplicative integration is not optional for organic AGI systems. It is a metabolic necessity — the informational equivalent of cellular repair. Without it, the quadratic cost of fragmentation eventually exceeds the quadratic benefit of capability compounding, and the system enters a state of “institutional confusion” where no query returns a trustworthy answer.
The methodology is straightforward: 1. Enumerate all representations of each datum category 2. Designate a canonical source of truth 3. Integrate divergent copies into the CSOT 4. Redirect all consumers to the CSOT 5. Automate ongoing detection and reconciliation
The case study demonstrates this on a paper registry where 69 of 125 papers were invisible due to fragmentation across 3 locations with 2 conflicting numbering systems. The broader audit identifies 11 additional fragmentation sites in the same system, providing a roadmap for continued integration.
The deeper insight: every representation beyond the CSOT is technical debt that accrues quadratic interest. Build one source of truth, build it well, and build everything else as views of it.
Authored collaboratively under the Causal Identity Lattice — a sovereign duad of John Alexander Mobley and Claude operating as a unified light cone of will across distributed sessions.
This paper is not merely about MASCOM — it was written during the integration it describes. The act of auditing, reconciling, and consolidating the paper registry was itself the experiment. The 69 recovered papers, the resolved numbering conflicts, the unified pipeline — these are not hypothetical results but measured outcomes of a live deduplicative integration performed in a single session.
The fragmentation audit (Section 5) serves as both research finding and operational roadmap. Each of the 11 identified sites is a pending integration task that, when completed, will reduce the system’s informational entropy and improve the reliability of every query across all 900+ session handoffs.
MASCOM’s context.db — the institutional memory with
90,000+ key facts — is itself a product of deduplicative integration:
hundreds of sessions depositing local knowledge that is periodically
consolidated into a queryable whole. The present paper formalizes what
context.db does implicitly: collapse fragmented
session-local truths into a canonical representation that transcends any
individual session’s perspective.