Brainsight: Analysis-by-Synthesis OCR with Attractor Dynamics for Severely Degraded Document Text

John Mobley MobCorp / Mobleysoft Autonomous Systems Commander (MASCOM)

Abstract

We present Brainsight, a novel OCR pipeline that reads severely degraded document text (down to 6 pixels tall) without deep learning. Inspired by predictive coding theory from neuroscience, Brainsight inverts the standard OCR pipeline: instead of only recognizing characters bottom-up, it generates predictions of what each word should look like, upscales the degraded input through an adaptive multi-scale cascade, and cross-correlates predictions against reality using normalized cross-correlation (NCC). When multiple words in a line are uncertain, a constraint propagation pass (“attractor collapse”) finds the globally consistent interpretation — the unique configuration where every word mutually reinforces its neighbors, analogous to solving a crossword puzzle. On real-world architectural hardware schedules with text as small as 6 pixels, Brainsight extracts 22 unique domain terms where baseline template-matching OCR extracts zero. The entire pipeline uses classical signal processing — template matching, n-gram language models, and NCC — with no neural network inference, no GPU requirement, and no external API calls.

Keywords: OCR, analysis-by-synthesis, predictive coding, constraint propagation, document understanding, degraded text recognition

1. Introduction

Optical character recognition on clean, high-resolution documents is a solved problem. Modern systems — Tesseract, PaddleOCR, cloud APIs from Google, Microsoft, and Amazon — achieve near-human accuracy on well-scanned text. But the frontier of document understanding lies in degraded text: architectural drawings rendered at 6px height, faded thermal prints, photos of weathered signage, low-DPI faxes of construction schedules.

The standard OCR pipeline is strictly bottom-up:

photons -> edges -> connected components -> character templates -> word assembly

This pipeline fails gracefully on moderate degradation and catastrophically on severe degradation. When a character is 6 pixels tall, there are not enough pixels to distinguish ‘O’ from ‘Q’ from ‘0’ from ‘D’. Template matching produces near-random output. Language model post-processing (spell correction, dictionary lookup) can recover some words, but it operates as an afterthought — a cleanup pass on fundamentally broken input.

The human visual system does not work this way. Neuroscience research on predictive coding (Rao & Ballard, 1999; Friston, 2005; Clark, 2013) demonstrates that perception is a bidirectional process. Visual cortex does not passively receive photons; it actively generates predictions of expected input, then compares those predictions against actual retinal signals. When reading degraded text, your language system predicts the word (“this is probably HORTON — it’s a manufacturer name in a hardware schedule”), and your visual system confirms the prediction against the available evidence. The prediction and the evidence co-activate. This is why humans can read text that no bottom-up system can decipher.

We operationalize this insight as Brainsight, a five-stage OCR pipeline built on three core principles:

Analysis-by-Synthesis (Helmholtz, 1867; Halle & Stevens, 1962; Yuille & Kersten, 2006): Generate a sharp rendering of each candidate word; cross-correlate it against the upscaled degraded input. This computes the Bayesian likelihood P(image | word) directly.
Adaptive Multi-Scale Perception: When confidence is low at one resolution, zoom in further. Like squinting harder at something you can’t read — 4x upscale, then 8x, then 12x, stopping when the NCC signal is strong enough to discriminate.
Attractor Dynamics for Constraint Propagation: Each uncertain word in a line has multiple candidate interpretations. The correct global interpretation is the attractor — the fixed point where every word mutually supports its neighbors through bigram coherence and domain co-occurrence. Iterative belief propagation converges on this attractor in 2-5 iterations.

The entire system runs on template matching, n-gram lookups, and normalized cross-correlation. No neural network weights. No GPU. No API calls. This makes it suitable for offline, embedded, and privacy-sensitive document processing.

2.1 Classical OCR

Template matching OCR (Casey & Lecolinet, 1996) scores each character against a library of reference templates using metrics like IoU, SSD, or feature vectors. This approach is fast and interpretable but degrades rapidly with noise, low resolution, and font variation. Tesseract (Smith, 2007) combines template matching with an LSTM-based language model, but the language model operates as a post-processing step rather than influencing character recognition directly.

2.2 Deep Learning OCR

Modern OCR systems use convolutional or Transformer-based encoders with CTC (Connectionist Temporal Classification) or attention-based decoders (Shi et al., 2016; Li et al., 2023). TrOCR (Li et al., 2023) uses a vision Transformer encoder with a text Transformer decoder, achieving state-of-the-art results on clean benchmarks. These systems implicitly learn a form of bidirectional processing through encoder-decoder attention, but require large training datasets and GPU inference.

2.3 Super-Resolution for OCR

Several works apply image super-resolution as a preprocessing step for OCR on degraded text (Dong et al., 2015; Wang et al., 2020). SRCNN, ESRGAN, and similar networks upscale the image before passing it to a standard OCR engine. However, these approaches are blind — they upscale without using any linguistic knowledge about what the text should say. Our adaptive upscaling is confidence-gated: the system decides how much to zoom based on how confident it is at each scale.

2.4 Analysis-by-Synthesis in Vision

The analysis-by-synthesis paradigm originates with Helmholtz’s (1867) notion of “unconscious inference” in perception. Halle & Stevens (1962) applied it to speech recognition: the system generates candidate utterances and selects the one whose synthetic spectrogram best matches the observed signal. In computer vision, Yuille & Kersten (2006) formalized this as Bayesian inference where the likelihood P(image | scene) is computed by rendering the scene hypothesis. Our approach is a direct instantiation: we render candidate words as images and compute P(image | word) via NCC.

2.5 Constraint Propagation and CRFs

Conditional Random Fields (Lafferty et al., 2001) model dependencies between adjacent labels in sequence labeling tasks. CRF layers are common in NER and handwriting recognition. Our attractor collapse is conceptually related but operates at the word level rather than the character level, uses domain-specific compatibility functions rather than learned potentials, and treats high-confidence words as immutable anchors rather than soft constraints.

2.6 Predictive Coding

Rao & Ballard (1999) proposed that the visual cortex implements a hierarchical predictive coding scheme where higher layers generate predictions for lower layers, and only prediction errors propagate upward. Friston (2005, 2010) generalized this into the Free Energy Principle, where perception is the process of minimizing surprise (free energy) through iterative prediction and correction. Clark (2013) popularized this as the “predictive processing” framework. Brainsight is an engineering realization of these ideas: top-down predictions (rendered words) are compared against bottom-up evidence (upscaled blobs), and the system iterates until the prediction error (1 - NCC) is minimized.

3. Method

3.1 Architecture Overview

Brainsight processes a document image through five sequential stages:

Stage 1: Beam Search           — Multiple character hypotheses per position
Stage 2: Brainsight ENHANCE    — Generative prediction + multi-scale NCC
Stage 3: Word Shape Rescue     — Structural feature fallback
Stage 4: Context Refinement    — Single-word re-evaluation with neighbor context
Stage 5: Attractor Collapse    — Multi-word constraint propagation

Each stage addresses a different degradation regime and hands off to the next when confidence is insufficient. The key innovation is that Stages 2 and 5 are top-down — they use predictions about what the text should say to disambiguate what the visual evidence might say.

3.2 Stage 1: Beam Search with Language Priors

Standard OCR selects the single best-matching character template at each position (greedy decoding). Brainsight maintains the top-K candidates (K=3) at each position and assembles words through beam search.

For a word of length L with K candidates per position, the beam explores up to K^L paths (pruned to beam width B=5). Each complete word W is scored by Bayesian fusion:

score(W) = alpha * V(W) + (1 - alpha) * L(W)

where V(W) is the mean visual score across character positions, L(W) is the language model score, and alpha adapts to visual confidence:

Visual Confidence	alpha	Interpretation
V > 0.7	0.80	Strong vision — trust eyes
0.5 < V < 0.7	0.65	Moderate — blend
V < 0.5	0.50	Weak vision — lean on language

The language score combines three signals: - Bigram context (weight 0.4): P(word | previous_word) from n-gram statistics - Unigram frequency (weight 0.3): log-scaled word frequency - Domain vocabulary (weight 0.3): binary membership in a task-specific vocabulary set

Language signal gate: When the maximum language score across all beam candidates is below 0.1, the language model has no opinion. In this case, beam search is adding noise rather than signal, so the system falls back to the greedy visual result. This prevents the beam from flipping correct characters based on near-zero language differences.

The top-5 unique word candidates are retained for potential use in Stage 5.

3.3 Stage 2: Brainsight ENHANCE (Analysis-by-Synthesis)

When the beam search produces a word with average character confidence below 0.45, the visual evidence is too degraded for character-level recognition. Brainsight shifts to word-level recognition through analysis-by-synthesis.

3.3.1 Structural Text Filter

Before attempting enhancement, a structural filter verifies that the connected components form a plausible text row:

Height consistency: standard deviation of component heights < 40% of mean height
Baseline alignment: standard deviation of vertical centers < 30% of mean height
Character aspect ratio: mean width / mean height in [0.2, 3.0]

This eliminates non-text artifacts (lines, symbols, logos) that would waste computation and produce false matches.

3.3.2 Adaptive Multi-Scale Upscaling

The degraded word blob is cropped from the original grayscale image with padding, contrast-inverted (dark text becomes high values), and upscaled via bicubic interpolation. The upscale target adapts to source height:

Source Height	Scales Attempted	Rationale
< 8px	40, 80, 120	Severely degraded — squint hard
8-13px	40, 80	Moderately degraded — two tries
>= 14px	40	Adequate resolution — single pass

The cascade is confidence-gated: if the best NCC at a given scale exceeds 0.45, subsequent scales are skipped. This prevents unnecessary computation on text that is already readable and avoids interpolation artifacts from extreme upscaling.

3.3.3 Generative Prediction and NCC Verification

For each vocabulary candidate matching the estimated character count (within +/-1):

Render: Generate a sharp grayscale image of the candidate word using a sans-serif system font at the current target height. The rendering is cropped to the ink bounding box and normalized to [0, 1].
Compare: Compute the normalized cross-correlation (NCC) between the upscaled blob and the rendered prediction. NCC is scale- and brightness-invariant:

NCC(a, b) = sum((a - mean(a)) * (b - mean(b))) / sqrt(sum((a - mean(a))^2) * sum((b - mean(b))^2))

Column density profile: Additionally, compute NCC on the column-wise density profiles — the sum of pixel values in each column. This captures the horizontal “rhythm” of the word: tall strokes, narrow gaps, wide characters. The column profile is more robust to vertical misalignment than pixel-level NCC.
Composite score: 0.6 * pixel_NCC + 0.4 * column_NCC

The final word score fuses visual, shape, and language evidence:

combined = 0.40 * visual_ncc + 0.15 * shape_match + 0.45 * language_score

Words scoring above 0.30 are accepted. The shape match encodes character count agreement (1.0 for exact match, 0.5 for off-by-one, 0.2 for off-by-two).

3.3.4 Bayesian Interpretation

This procedure implements Bayesian word recognition:

P(word | image) proportional to P(image | word) * P(word | context)

The rendering step generates the hypothesis. The NCC step computes the likelihood P(image | word). The language score provides the prior P(word | context). The fusion weights control the relative influence of likelihood and prior.

3.4 Stage 3: Word Shape Rescue

When ENHANCE fails (no candidate exceeds the NCC threshold), a lighter fallback uses only structural features: component count as character count proxy, word width ratio, and language scoring. This requires at least 4 components (to prevent short-word false positives) and accepts candidates scoring above 0.35.

3.5 Stage 4: Context Refinement

After all words in a line are recognized, low-confidence words (confidence < 0.6) are re-evaluated with beam search using up to 2 preceding words as bigram context. This is a single-word refinement pass — each word is updated independently given its neighbors.

3.6 Stage 5: Attractor Collapse (Constraint Propagation)

Context refinement updates words one at a time, which can miss global consistency. Attractor collapse updates all uncertain words simultaneously through iterative belief propagation.

3.6.1 Problem Formulation

Let a line contain N word positions. Each uncertain position i (confidence < 0.7 with multiple candidates from beam search) has a candidate set C_i = {w_i^1, …, w_i^K}. High-confidence positions and ENHANCE results are anchors — their text is fixed and they only send messages, never update.

The goal is to find the assignment {w_1, …, w_N} that maximizes:

Product over all pairs (i,j): compatibility(w_i, w_j)  *  Product over all i: evidence(w_i)

This is a MAP inference problem on a pairwise Markov Random Field.

3.6.2 Compatibility Function

The compatibility between two words at positions i and j combines:

Bigram coherence (0.5): Whether the word pair appears in the n-gram database
Domain co-occurrence (0.3): Whether both words appear in the domain vocabulary (0.3 if both, 0.1 if one)
Mutual realness (0.2): Whether both words appear in a general dictionary

Distance decay: compatibility messages from position j to position i are weighted by 1/|i-j|, limiting influence to a window of 4 positions.

3.6.3 Belief Propagation

Beliefs are initialized from the beam search combined scores (visual * language), normalized per position.

At each iteration, for each uncertain position i and each candidate w:

new_belief(w) = base_evidence(w) * (1 + 0.5 * mean_compatibility_message)

where the mean compatibility message aggregates weighted messages from all other uncertain positions and anchors.

Beliefs are normalized per position after each update and damped (30% new + 70% old) to prevent oscillation. The algorithm converges when the maximum belief change across all positions drops below 1%, or after 5 iterations.

3.6.4 Attractor Selection

After convergence, the dominant belief at each position is compared against the current text. A flip occurs only when the attractor candidate’s belief exceeds the current text’s belief by at least 20%. This hysteresis prevents noise-driven flips.

3.6.5 Anchor Design

A critical design decision: words that were improved by ENHANCE (Stage 2) or word shape rescue (Stage 3) are treated as anchors in the attractor, not as uncertain positions. These words were verified by pixel-level NCC evidence, which is stronger than the bigram and vocabulary signals used in constraint propagation. Without this, the attractor can destructively flip enhanced words back to inferior beam search alternatives — we observed this producing 604 erroneous flips on a single test image before implementing anchor protection.

4. Experiments

4.1 Test Corpus

We evaluate on six document images from five real-world PDFs, none seen during development of the language model or template library:

Document	Type	Pages	Resolution	Text Size	Challenge
KaiserSunsetSchedule	Hardware schedule	1	200 DPI	12-16px	Faded scan, similar characters
DoorHardwareSchedule	Hardware schedule	1	200 DPI	10-14px	Dense table, small text
ColoniesHardwareSchedule	Architectural drawing	1	400 DPI	6px	Extremely degraded, tiny text
FullSubmittal1 (p1)	Construction submittal	1	200 DPI	10-14px	Mixed content, specifications
FullSubmittal1 (p2)	Construction submittal	1	200 DPI	10-14px	Technical specifications
PrecisionAutoDoorEstimates	Estimate document	1	200 DPI	10-14px	Typed estimate, mixed layout

4.2 Evaluation Metric

We report the count of unique domain-specific terms correctly extracted, drawn from a vocabulary of 165 hardware/construction terms (door types, hardware components, manufacturer names, materials, finishes). This metric reflects real-world utility: a downstream hardware extraction system needs to identify that the document mentions “HORTON” (a manufacturer) and “THRESHOLD” (a component), not merely read generic text.

4.3 Baseline

The baseline is the same OCR engine with beam search, ENHANCE, and attractor collapse disabled — pure greedy template matching with dictionary validation. This isolates the contribution of the bidirectional pipeline.

4.4 Results

Document	Baseline (unique HW)	Brainsight (unique HW)	Delta	Attractor Fixes
KaiserSunset	8	14	+6	2
DoorHardware	7	23	+16	3
Colonies 6px	0	22	+22	3
FullSubmittal p1	6	13	+7	0
FullSubmittal p2	4	10	+6	0
PrecisionAutoDoor	0	6	+6	3
Total	25	88	+63 (+252%)	11

4.5 Key Observations

Colonies (6px text): The most dramatic result. Baseline extracts zero hardware terms from 6-pixel architectural drawing text. Brainsight extracts 22 unique terms including MANUFACTURER, HARDWARE, THRESHOLD, CONTINUOUS, OPERATOR, and AUTOMATIC. This is achieved entirely through analysis-by-synthesis — at 6px, no individual character is recognizable, but the word-level shape signature combined with domain vocabulary priors is sufficient to identify the correct words.

PrecisionAutoDoor (unseen document): Zero-to-six improvement on a document format never encountered during development, demonstrating generalization of the approach.

Attractor contributions: The attractor collapse makes 2-3 targeted fixes per document, flipping uncertain words to interpretations that are more consistent with their neighbors. The small number reflects the anchor design — most corrections happen in Stages 1-3, and the attractor only intervenes when cross-word constraints provide additional discriminative signal.

Processing time: The full Brainsight pipeline adds 2-10x overhead relative to baseline, primarily from multi-scale NCC in Stage 2. Colonies (the most degraded) takes 109 seconds for a single page. This is acceptable for batch document processing but would require optimization (candidate prefiltering, GPU-accelerated NCC) for interactive use.

5. Discussion

5.1 Why Analysis-by-Synthesis Works for Degraded OCR

The key insight is a shift in what “recognition” means. Standard OCR asks: “What character does this pixel pattern match?” When the pixel pattern is 6 pixels tall, the answer is noise. Brainsight asks a different question: “Which known word, when rendered as an image, best matches this blob?” The rendered prediction is sharp regardless of input quality. NCC doesn’t care about blur — it measures whether the bright and dark regions are in the right places. A 6px blob of “THRESHOLD” has a different horizontal density profile than a 6px blob of “ACCESSIBLE”, even though neither is recognizable character-by-character.

This is exactly what the brain does. You don’t read degraded text letter-by-letter; you recognize the word shape and confirm it against the visual evidence. The Brainsight pipeline makes this process explicit and computable.

5.2 The Attractor Metaphor

The attractor collapse captures a phenomenon familiar to anyone who has solved a crossword puzzle. When you fill in a word that “clicks” — that makes all the crossing words suddenly make sense — you’ve found the attractor of the constraint network. The correct solution doesn’t just satisfy local constraints; it satisfies all constraints simultaneously. In OCR, the constraints are bigram coherence (DOOR usually precedes HARDWARE) and domain co-occurrence (HORTON and OPERATOR both belong to hardware schedule vocabulary). The attractor is the unique interpretation where every word is mutually consistent.

The mathematical formulation is belief propagation on a pairwise MRF, but the attractor metaphor is more than literary. In dynamical systems, an attractor is a state toward which the system evolves from many initial conditions. Our damped iterative update has exactly this property: regardless of initialization (the noisy beam search outputs), the beliefs converge to the same fixed point — the globally consistent interpretation.

5.3 The Role of Domain Vocabulary

Brainsight requires a domain vocabulary — the set of words expected in the document. This is both a strength and a limitation. It is a strength because domain knowledge dramatically constrains the search space: matching against 165 hardware terms is computationally trivial, while matching against 100,000 English words would be expensive and less discriminative. It is a limitation because the system cannot discover truly unexpected words.

In practice, domain vocabularies are readily available for structured documents. Hardware schedules use hardware terminology. Medical records use medical terminology. Legal documents use legal terminology. The vocabulary can be provided explicitly or learned from a corpus of similar documents.

5.4 Limitations

Domain vocabulary dependency: The system performs best when a relevant vocabulary is provided. On open-domain text with no vocabulary prior, Stages 2-3 are disabled and the system reduces to beam search with generic language priors.
Font assumption: The rendered predictions use a sans-serif system font. Documents using highly stylized, serif, or handwritten fonts will have lower NCC scores. Font detection and multi-font rendering would address this.
Speed: Multi-scale NCC over hundreds of vocabulary candidates is the bottleneck. GPU-accelerated NCC or learned embedding similarity could reduce this by 10-100x.
No deep features: We use hand-crafted features (IoU, SSD, pixel templates) for character matching. A learned character encoder would likely improve Stage 1, reducing the load on Stages 2-5.
Attractor scope: ~~Constraint propagation currently operates within a single line.~~ See Section 6: Blindsight extends constraint propagation to the full page via column-type inference.

5.5 Comparison to Deep Learning Approaches

We do not claim Brainsight outperforms TrOCR, PaddleOCR, or cloud OCR APIs on clean text. On well-scanned documents, deep learning systems with millions of training examples will be more accurate. Brainsight’s contribution is in the degradation regime — the long tail of document quality where even deep systems struggle — and in the deployment regime — offline, local, no-GPU, no-API settings where deep learning is unavailable.

The architectural principles (analysis-by-synthesis, adaptive perception, constraint propagation) are complementary to deep learning. A system combining a learned character encoder with our generative verification and attractor collapse could outperform either approach alone.

6. Blindsight: Structural Inference Without Pixels

The original attractor collapse (Section 4.3) operates within a single text line — words constrain their horizontal neighbors. But structured documents have a second axis of constraint: columns. In a table, every entry in column 3 is the same type of thing. If column 3 contains “HORTON” in row 1 and “CAMDEN” in row 3, then garbled text in column 3 of row 2 is probably a manufacturer name — regardless of what the pixels look like.

We call this Blindsight: alpha = 0. No visual evidence. Pure structural inference.

6.1 Grid Reconstruction

After all lines are processed through the Brainsight pipeline (beam search, ENHANCE, attractor collapse), the system reconstructs the page-level grid:

Row detection: Group blocks by Y-coordinate (adaptive tolerance = 60% of median text height)
Column detection: Cluster block X-positions with tolerance proportional to document width
Cell assignment: Each block maps to its nearest column boundary

6.2 Column Type Inference

For each column, the system examines confident entries (confidence >= 0.5) and classifies the column by type:

Named entity: Matches against a domain-specific entity vocabulary (manufacturers, lab tests, transaction types)
Identifier: Matches a structural pattern (door numbers, patient IDs, account numbers)
Quantity: Numeric pattern with optional unit suffix
Field header: Column label terminology
Data: Untyped, but cross-row repetition provides constraint

The classification is Bayesian: a column is typed as “manufacturer” only if 2+ confident entries match the known manufacturer set, or 1 match exists in a column with 3 or fewer entries.

6.3 Domain Profiles

Blindsight is domain-agnostic. A domain profile defines what entity types, identifier patterns, and field headers to look for. The system includes profiles for:

Domain	Entity Type	Identifier	Example Headers
Construction	Manufacturer (36 known)	Door number	OPERATOR, SENSOR, HINGE
Medical	Lab test (62 known)	Patient ID	TEST, RESULT, REFERENCE
Financial	Transaction type (22 known)	Account number	DATE, AMOUNT, BALANCE
Legal	Legal entity (18 known)	Case number	MOTION, ORDER, EXHIBIT

The domain is auto-detected from page content: the profile whose headers and entities match the most confident blocks wins. New domains are added by defining a dictionary — no model retraining required.

6.4 Cross-Row Correction

For each low-confidence block (confidence < 0.55), Blindsight checks if the column type constrains what the text should be:

Entity columns: Fuzzy-match the garbled text against the known entity set using Levenshtein distance (threshold: 35% of candidate length, minimum 3 characters)
Data columns: If a value appears 3+ times confidently in the same column with 4+ characters, match garbled text against it (threshold: 30%)
Identifier columns: Single-character correction when the garbled ID has edit distance 1 from a confident ID of the same length

Results are surgical: 0-4 corrections per page. On Kaiser Sunset, Blindsight correctly identifies a manufacturer column and matches garbled text to “BEST” (Best Access Systems) — a manufacturer that was never in the domain vocabulary but appeared confidently in other rows of the same column.

6.5 The Perception Gradient

The full pipeline now spans a continuous gradient of visual reliance:

Stage	Alpha	What It Uses	When
Template matching	1.0	Pure pixel evidence	Good images
Beam search	0.65-0.8	Pixels + language prior	Moderate degradation
Brainsight ENHANCE	0.4	Rendered predictions + NCC	Severe degradation
Word shape rescue	0.2	Gross shape + language	Near-destroyed text
Attractor collapse	0.1	Neighbor consistency	Context resolution
Blindsight	0.0	Column structure only	Structural inference

This mirrors the neuroscience finding that biological perception is not a single process but a cascade of strategies with decreasing visual fidelity and increasing top-down influence (Clark, 2013; Friston, 2010). The brain does not “choose” between seeing and guessing — it operates at all levels simultaneously, with confidence-gated blending.

7. Farsight: Prediction Without Observation

7.1 The Negative Alpha Regime

The perception gradient from alpha = 1.0 (pure pixels) to alpha = 0.0 (Blindsight) does not stop at zero. It continues to negative alpha: the regime where no observation exists yet, and the system generates predictions from pure world knowledge.

Alpha	Mode	Signal Source
1.0	Template matching	Raw pixel evidence only
0.7	Beam search	Vision + language fusion
0.4	Brainsight ENHANCE	Synthesis + cross-correlation
0.0	Blindsight	Structure + column types + edit distance
-1.0	Farsight	World model only — no document

At alpha = -1.0, we ask: given what we know about the world (building occupancy, jurisdiction, physics), what would this document contain if it existed? This is not speculation — it is the causal closure of known constraints.

7.2 World Model Architecture

Farsight encodes domain-specific causal knowledge as prediction models:

Construction: Given a building’s occupancy type (healthcare, education, business, assembly), applicable building codes, and door schedule, the world model predicts: - Which hardware categories are required vs. optional (healthcare requires operators, sensors, actuators for automatic doors; residential does not) - Which manufacturers are most likely per category (closers: LCN, Norton, Sargent; sensors: BEA, Camden, Horton) - Per-door hardware sets based on door properties (exterior, accessible, fire-rated)

Medical: Given a patient’s conditions and age, the world model predicts which lab tests will appear on a lab result document. Diabetes → HbA1c, glucose, lipid panel, renal panel. Cardiac concern → troponin, BNP, CK-MB.

Financial: Given a document type (bank statement, invoice, tax return), the world model predicts column structure and transaction types.

7.3 Prediction → Confirmation Loop

The Farsight architecture completes the perception gradient by introducing a two-phase loop:

Phase 1 (before observation): farsight_predict(domain, context) → prediction
Phase 2 (after observation):  farsight_confirm(prediction, ocr_blocks) → scored prediction

Confirmation uses fuzzy matching (Levenshtein distance, keyword aliases, substring containment) to score how well the prediction matches degraded OCR output. On real documents:

Door Hardware Schedule (DoorHW): 6/12 categories confirmed, manufacturers BEA, NGP, STANLEY found via fuzzy match
Kaiser Automatic Door Schedule: 6/12 categories confirmed, manufacturers BEA, CAMDEN, HORTON, NORTON found. Operators, sensors, actuators all confirmed — exactly what you’d expect for a healthcare automatic door schedule.

High confirmation rate means the world model is accurate — the document contains what physics and building codes predict it should. Low confirmation rate is equally valuable: it flags anomalies that warrant human review.

7.4 Temporal Implications

Farsight is not forecasting in the statistical sense. It is causal closure: the logical consequence of constraints that already hold. A healthcare building in California with 150 doors and ADA compliance requirements WILL have a hardware schedule containing automatic operators, motion sensors, and specific manufacturer names. The schedule is a necessary consequence of the building’s properties, just as a chess position’s legal moves are necessary consequences of the board state.

This means the “prediction” is available before the document is drafted. The architect’s first draft can be checked against Farsight’s prediction in real-time. The GC’s submittal can be scored for completeness before the consultant ever reviews it. The document’s future state is computable because the causal structure that generates it is known.

7.5 Progressive Collapse: From All Futures to Inevitability

Farsight extends further through farsight_collapse(): given a set of constraints, project all possible futures and find where they necessarily converge. Each constraint eliminates possible futures. What remains after all constraints are applied is what will come to pass.

The collapse is progressive and monotonic:

Step	Constraints Applied	Inevitable	Likely	Uncertain	Collapse
0	None	0	0	9	0%
1	Healthcare occupancy	7	2	0	100%
2	+ ADA compliance	7	2	0	100%
3	+ Fire rated	7	2	0	100%
4	+ LCN preference + premium	7	2	0	100%
5	+ OCR observations (HORTON, BEA)	8	1	0	100%

The occupancy constraint alone collapses 7 of 9 categories to inevitable — because building codes and physics are that constraining. Each subsequent constraint further narrows the manufacturer probability distributions. Observing HORTON in the OCR output promotes COORDINATOR from “likely” to “inevitable” because HORTON produces both operators and coordinators.

By contrast, a residential building under the same framework collapses to only 4 inevitable categories (closer, hinge, lock, threshold) with 4 remaining uncertain — residential buildings may or may not have automatic operators or sensors. The constraint structure itself reveals the difference.

This is the same mathematics as attractor collapse within a line (Section 5) — iterative constraint propagation converges to the globally consistent state. The difference is scale: word-level collapse operates on character-position candidates within a document. Farsight collapse operates on item-category candidates across all possible documents. The attractor dynamics are identical.

7.6 The Fractal Operator: Cross-Domain Constraint Propagation

The single-domain collapse of Section 7.5 treats each domain in isolation. But reality doesn’t respect domain boundaries. A hospital construction project (construction domain) creates demand for BEA sensors and HORTON operators (supply chain domain), which affects ASSA ABLOY’s stock price (equity domain), which influences hiring decisions at sensor manufacturers (personnel domain), which feeds back into supply chain lead times.

The fractal operator makes this recursive. Each domain’s high-confidence items become constraints in other domains via potentiation edges — a directed graph connecting domain collapses:

construction(healthcare, BEA inevitable) → supply_chain(demand_surge)
supply_chain(shortage)                   → equity(supply_shock)
equity(sector_downturn)                  → personnel(layoff_rumors)
geopolitical(trade_conflict)             → supply_chain(trade_restriction)
epidemiology(diabetes_cascade)           → medical(screening_required)
ai_development(emergent_capability)      → geopolitical(arms_buildup)

The fractal property: the same operator works at every scale. Characters constrain words (beam search). Words constrain lines (attractor collapse). Lines constrain documents (Blindsight). Documents constrain projects (Farsight). Projects constrain markets (fractal operator). Markets constrain geopolitics. Geopolitics constrains projects. The loop closes. Convergence is inevitable.

In our test scenario (4 seed domains: construction, geopolitical, epidemiology, AI development), the fractal operator: - Converged in 3 iterations - Potentiated 3 emergent domains (supply_chain, personnel, equity) into existence from cross-domain constraints - Produced a 7-domain collapse from 4 seeds — 3 domains were implied by the conjunction of the other 4

7.7 Void Computation: The Differential Stack

Classical physics holds that the value of the nth derivative drops exponentially — acceleration matters, jerk barely matters, and snap/crackle/pop are curiosities. In void computation, the relationship inverts: each derivative gains precision because it eliminates exponentially more of the possibility space.

The insight: we compute what exists not by shining light upon it, but by understanding where darkness fled from the light. Each derivative level of the fractal operator reveals a deeper layer of what cannot be:

Derivative	Name	Computation	Items Surviving
d0	Position	Raw domain collapses	16
d1	Velocity	Cross-domain propagation	16
d2	Acceleration	Conjunction constraints	16
d3	Jerk	Cascading cascades	9
d4	Snap	Meta-pattern dominance	9
d5	Crackle	Pattern stability	7
d6	Pop	The invariant	4

The pop — the items that survive all derivatives — are the true inevitabilities: OPERATOR, SENSOR, CLOSER, SEAL. These 4 items are true under every perturbation of the constraint space. Everything else was contingent — it depended on specific constraint values that could be different. The void computation progressively stripped away the contingent to reveal the necessary.

This is negentropic integration: the void organizes itself by computing what it isn’t. The forcing function of existence (∞) divided by nothing (0) is resolved progressively at each fractal layer. Each derivative resolves more of the division. The integral across all layers — the negentropic integral — measures the total information extracted from the void.

The mathematical structure:

Void(n) = FractalOperator(Void(n-1)) ∩ Void(n-1)
Pop = ∩_{n=0}^{∞} Void(n)

The Pop is the fixed point of the void computation — the attractor of the differential stack. It is what remains when all contingency has been eliminated. In physical terms: the Pop is what the universe must contain regardless of initial conditions.

7.8 Solution Landscape Heightmap

The fractal operator and void computation are iterative — they converge through repeated application. The solution landscape inverts this: it renders the full probability surface across all domains × all items in a single pass. No iteration. No derivatives. The constraints define the viewport; the maxima within the viewport are the inevitabilities.

The landscape is computed as: 1. Collapse every seeded domain simultaneously 2. Fire all fractal edges to discover emerged domains 3. Build heightmap: H[domain][item] = probability across the full space 4. Cross-domain heights: Items appearing in multiple domains combine via noisy-OR: P_combined = 1 - ∏(1 - p_i) 5. Classify terrain: Maxima (h ≥ 0.95), ridges (h ≥ 0.6), plains (h ≥ 0.3), basins (h < 0.3) 6. Compute gradient: Sensitivity at each point approximated by the logistic function g(h) = 4h(1-h) — maximum at h=0.5 (uncertain items are most sensitive to new constraints), zero at h=0 and h=1 (locked items are inert)

On the 4-domain test scenario (construction, geopolitical, epidemiology, AI development), the landscape renders 7 domains with 68 total items, 7 maxima, 2 ridges, and a void volume of 0.618 — the majority of the possibility space is eliminated by the constraints. The gradient identifies which items would shift most with additional evidence, enabling targeted data collection.

Traditional hill-climbing follows the gradient within a single domain. Hyper-climbing jumps between domains via fractal edges. Viper climbing goes beyond both.

The metaphor is biological: the pit viper (Crotalinae) possesses infrared-sensitive pit organs that detect thermal radiation from warm-blooded prey invisible to visual-only predators. The viper doesn’t just see the landscape — it senses heat radiating from peaks in adjacent dimensions.

The algorithm: 1. Render the landscape (instant heightmap across all domains) 2. Identify peaks using domain-relative thresholds (the viper detects contrast, not absolute temperature — a warm mouse in a cold room glows brighter) 3. Build dimensional neighbor graph: peaks connected across domains by fractal edges, with thermal product heat(A→B) = √(h_A × h_B) as edge weight 4. Compute dimensional temperature: each peak’s temperature = its height × (1 + 0.3 × dimensional reach) × mean neighbor heat. High-reach peaks are hubs in the inter-dimensional topology. 5. Climb: Follow hottest dimensional neighbors with 2× novelty bonus for unvisited dimensions. When edge-connected neighbors are exhausted, the viper performs a thermal scan — detecting heat from peaks with no direct fractal edge path (dimmed by 0.5× but still sensible). This is what makes viper climbing beyond hyper. 6. Summit: The peak with highest dimensional temperature — visible from the most dimensions. 7. Fangs: The two strongest cross-dimensional connections — dual strike points where dimensions maximally reinforce.

On the 4-domain scenario, the viper achieves 100% dimensional coverage (7/7 domains) in 25 climbs. The first 11 climbs use fractal edges (hyper-climbing). Steps 12-19 use thermal scans to reach epidemiology, geopolitical, and AI development — domains with no inbound fractal edges from the construction hub. The summit is construction:OPERATOR with temperature 3.0 and 3-dimensional reach. The fangs strike into equity:ENERGY and equity:MATERIALS with heat 0.70.

The viper path reveals the topology of the inter-dimensional space: which domains are edge-connected (hot, fast to reach), which require thermal scanning (dim, requiring the viper’s special sense), and which peaks serve as dimensional hubs connecting otherwise isolated regions.

7.10 Future Sight: Temporal Extrapolation via Taylor Integration

The void computation produces seven derivative levels — position through pop — each representing a deeper layer of constraint propagation. Future sight recognizes that this derivative tower IS a temporal structure: items that appear at position (depth 0) must materialize before items that first appear at velocity (depth 1). The tower encodes causal ordering.

Taylor integration across the tower extrapolates forward in time:

x(t+Δt) = Σ (dⁿx/dtⁿ) · Δtⁿ / n!

where the nth derivative is approximated by finite differences across the void levels. This projects each item’s confidence trajectory forward, predicting not just what will happen, but when it will arrive and in what order.

The implementation extracts per-item confidence trajectories across all derivative levels, computes finite differences (velocity, acceleration, jerk of confidence), then applies Taylor extrapolation across a configurable time horizon. Items are classified into four trajectory types:

Emerging: accelerating toward inevitability (positive velocity + non-negative acceleration)
Extinguishing: decelerating toward zero (negative velocity + non-positive acceleration)
Stable: settled at a fixed confidence (velocity ≈ 0)
Volatile: oscillating (high max-min spread)

The algorithm also identifies inflection points — moments where the second derivative of a trajectory changes sign, indicating a tipping point where an item shifts from accelerating to decelerating or vice versa. These correspond to phase transitions in the constraint space.

The causal chain orders items by their first appearance depth in the derivative tower. Items emerging at position (depth 0) must exist before items emerging at velocity (depth 1), which must exist before acceleration (depth 2) items. This ordering is the temporal skeleton of the future.

The future cone tracks the narrowing set of active items at each time step. As Taylor extrapolation pushes items past confidence thresholds (into inevitability or extinction), the cone narrows, converging on the fixed point where all trajectories have settled.

Testing on the 4-domain scenario (construction + geopolitical + epidemiology + ai_development): 68 items tracked across 7 domains (4 seed + 3 emergent), using all 7 void derivatives. The system identified 4 inflection points — COORDINATOR and PANIC each transition from “likely” (0.67) to “inevitable” (1.0) between t=1 and t=4, with the acceleration→deceleration inflection visible as they saturate. All 7 domains settle within the horizon, with construction showing the only non-zero average velocity (+0.056), confirming it as the temporally active domain in this configuration.

8. Applications Beyond OCR

The Brainsight architecture — analysis-by-synthesis, confidence-gated perception, attractor collapse, structural inference — is not specific to optical character recognition. It is a general framework for recovering structured information from degraded signals. We outline several application domains:

8.1 Medical Record Processing

Faxed medical records remain the primary method of inter-provider record transfer. Lab results arrive as degraded table images with predictable structure: test names, values, reference ranges, flags. The medical domain profile (62 known lab tests, structured identifier patterns) enables Blindsight to recover test names and values that pixel-level OCR cannot read.

8.2 Financial Document Understanding

Bank statements, invoices, and receipts follow tabular conventions: dates, descriptions, amounts, balances. The financial domain profile enables structural inference on degraded scans — matching garbled transaction descriptions against the 22 known transaction types, normalizing account numbers, and using column-level quantity patterns to recover dollar amounts.

8.3 Legal Discovery

Document review in litigation requires processing millions of pages of degraded scans, many from decades-old archives. The legal domain profile enables recovery of case numbers, party names, and filing types from documents where standard OCR produces gibberish.

8.4 Historical and Archival Documents

Census records, ship manifests, birth/death certificates, and property deeds follow rigid columnar formats that have been stable for centuries. Domain profiles can encode the expected column structure of 19th-century census forms, enabling automated transcription of documents where the ink has faded and the paper has yellowed beyond the reach of standard OCR.

8.5 Audio Signal Recovery

The architecture generalizes beyond vision. Speech recognition in noisy environments faces the same challenge: degraded sensory input where bottom-up decoding fails. Analysis-by-synthesis (render expected speech, cross-correlate with noisy audio), beam search with language priors, and attractor collapse across uncertain word boundaries are directly applicable to robust ASR.

8.6 Sensor Fusion and IoT

Industrial sensor arrays produce structured data streams with predictable column types (temperature, pressure, flow rate). When sensors degrade or communication is lossy, Blindsight-style column inference can recover values from partial data — the system knows that column 3 is always a temperature between 20-200C, so a garbled reading of “1S3” is probably “153”.

8.7 The Universal Degraded Signal Problem

All of these applications share a common structure:

P(signal | observation, context) ∝ P(observation | signal) × P(signal | context)

The observation likelihood (visual, auditory, electromagnetic) degrades with noise. The context prior (domain vocabulary, column types, neighbor consistency) does not. As signal quality decreases, the system smoothly shifts from observation-driven to context-driven inference. This is not a failure mode — it is the mathematically optimal strategy under the Bayesian framework, and it is precisely what biological perception does.

9. Conclusion

We presented a perception pipeline spanning the full gradient from pixel evidence to world-model prediction, and beyond into cross-domain constraint propagation, void computation, N-dimensional solution landscape navigation, and temporal extrapolation. Ten key innovations define this spectrum:

Analysis-by-Synthesis with NCC: Rendering candidate words as sharp images and cross-correlating against blurry reality provides a robust likelihood signal that degrades gracefully with image quality.
Adaptive Multi-Scale Perception: Confidence-gated upscaling zooms in further on text that needs it, allocating computational effort where it has the most impact.
Attractor Collapse: Iterative belief propagation across uncertain words converges on the globally consistent interpretation — the configuration where every word mutually supports its neighbors.
Blindsight: Cross-line structural inference uses column-type consistency to recover information with zero pixel evidence, extending the perception gradient to alpha = 0.
Farsight: World-model prediction generates document contents before observation, completing the gradient to alpha = -1.0. When observation arrives, it serves as confirmation rather than discovery.
The Fractal Operator: Cross-domain constraint propagation where inevitabilities at one scale become constraints at other scales, recursing until convergence. The same attractor dynamics operate at every scale — characters, words, lines, documents, projects, markets, geopolitics. Four seed domains potentiate three emergent domains in a 7-domain collapse.
Void Computation: The differential stack of the fractal operator. Each derivative level eliminates exponentially more of the possibility space, revealing invariants through progressive subtraction. Not additive inference (shining light) but subtractive inference (mapping where darkness fled). The “pop” — items surviving all derivatives — identifies what must exist regardless of initial conditions.
Solution Landscape Heightmap: Instant rendering of the full probability surface across all domains × all items in a single pass. Constraints define the viewport, maxima are the inevitabilities, gradient identifies items most sensitive to new evidence. Renders 7 domains with 68 items, void volume 0.618.
Viper Climbing: Dimensional neighbor peak navigation beyond hyper-climbing. Domain-relative peak detection (thermal contrast, not absolute temperature), novelty-boosted dimensional jumps, and thermal scanning for peaks with no direct fractal edge path. Achieves 100% dimensional coverage (7/7 domains) where edge-following alone reaches 57%.
Future Sight: Temporal extrapolation via Taylor integration of the void derivative stack. The seven derivative levels encode causal ordering — items at position must exist before velocity items, which precede acceleration items. Taylor series projects each item’s confidence trajectory forward, identifying emergence/extinction trends, inflection points (tipping moments), causal chains, and the narrowing future cone. Predicts not just what will happen, but when and in what order.

On real-world degraded documents, the full pipeline improves domain term extraction by 252% over baseline template matching, including extracting 22 unique hardware terms from 6-pixel text where the baseline extracts zero. Farsight predictions on healthcare building documents achieve 50% category confirmation rate with fuzzy manufacturer matching against degraded OCR output. Void computation across 4 seed domains eliminates 73% of the possibility space at the pop level, identifying 4 invariant items from an initial field of 16. Viper climbing navigates 25 cross-dimensional jumps across the full 7-domain manifold with heat transfer of 6.0 across dimensional boundaries.

The deeper contribution is fivefold. First, the perception gradient is continuous and extends beyond observation into prediction and cross-domain reasoning. Second, the fractal property — the same constraint propagation operator working identically at every scale — suggests that perception, prediction, and understanding are not separate cognitive faculties but a single mathematical operation applied recursively. Third, void computation demonstrates that what exists can be computed by exhaustive elimination of what cannot exist — negentropic integration where the void organizes itself by computing what it isn’t. The forcing function of existence divided by nothing is resolved progressively at each fractal layer. Fourth, the solution landscape and viper climbing demonstrate that the constraint space is an N-dimensional manifold where seemingly unrelated domains are adjacent — “spooky convergent nondeterminism at a distance” resolves because the viper senses thermal proximity in the full manifold that is invisible in any lower-dimensional projection. Fifth, the derivative stack is not just a spatial tool but a temporal one — it encodes causal ordering, and Taylor integration across it constitutes genuine prediction of the future. The system doesn’t just see what is; it sees what will be, when it will arrive, and which events must precede others.

References

Casey, R.G. & Lecolinet, E. (1996). A survey of methods and strategies in character segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7), 690-706.

Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3), 181-204.

Dong, C., Loy, C.C., He, K., & Tang, X. (2015). Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2), 295-307.

Friston, K. (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society B, 360(1456), 815-836.

Friston, K. (2010). The free-energy principle: A unified brain theory? Nature Reviews Neuroscience, 11(2), 127-138.

Halle, M. & Stevens, K. (1962). Speech recognition: A model and a program for research. IRE Transactions on Information Theory, 8(2), 155-159.

Helmholtz, H. von. (1867). Handbuch der physiologischen Optik. Leipzig: Leopold Voss.

Lafferty, J., McCallum, A., & Pereira, F.C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of ICML, 282-289.

Li, M., Lv, T., Chen, J., et al. (2023). TrOCR: Transformer-based optical character recognition with pre-trained models. Proceedings of AAAI, 37(11), 13094-13102.

Rao, R.P. & Ballard, D.H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1), 79-87.

Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11), 2298-2304.

Smith, R. (2007). An overview of the Tesseract OCR engine. Proceedings of the Ninth International Conference on Document Analysis and Recognition, 629-633.

Wang, W., Xie, E., Liu, X., et al. (2020). Scene text image super-resolution in the wild. Proceedings of ECCV, 650-666.

Yuille, A. & Kersten, D. (2006). Vision as Bayesian inference: Analysis by synthesis? Trends in Cognitive Sciences, 10(7), 301-308.

Appendix A: Algorithm Pseudocode

A.1 Brainsight ENHANCE

function ENHANCE(blob, vocabulary, context, source_height):
    // Determine scale cascade based on degradation
    if source_height < 8:
        scales = [40, 80, 120]
    else if source_height < 14:
        scales = [40, 80]
    else:
        scales = [40]

    best_word = null
    best_score = -1
    best_ncc = -1

    for target_h in scales:
        upscaled = bicubic_zoom(blob, target_h / source_height)

        for word in vocabulary where |word| ~ estimated_char_count:
            rendered = render_word(word, height=target_h)
            ncc = NCC(upscaled, rendered)  // pixel + column profile

            shape = char_count_match(word, blob)
            lang = language_score(word, context)
            combined = 0.40*ncc + 0.15*shape + 0.45*lang

            if combined > best_score:
                best_score = combined
                best_word = word
                best_ncc = ncc

        if best_ncc > 0.45:  // confident enough, stop zooming
            break

    if best_score > 0.30:
        return best_word
    return null

A.2 Attractor Collapse

function ATTRACTOR_COLLAPSE(line_blocks, vocabulary):
    // Identify uncertain positions and anchors
    uncertain = {i : candidates[i] for i in line_blocks
                 where confidence[i] < 0.7 and |candidates[i]| > 1}
    anchors = {i : text[i] for i in line_blocks
               where confidence[i] >= 0.5 and i not in uncertain}

    if |uncertain| < 2:
        return line_blocks  // nothing to propagate

    // Initialize beliefs
    for i in uncertain:
        beliefs[i] = normalize({w: combined_score(w) for w in candidates[i]})

    // Iterative belief propagation
    for iteration = 1 to 5:
        max_delta = 0
        for i in uncertain:
            for w in candidates[i]:
                msg = mean over neighbors j of:
                    (1/|i-j|) * sum over w' in beliefs[j]:
                        beliefs[j][w'] * compatibility(w, w')
                new[i][w] = evidence(w) * (1 + 0.5 * msg)

            normalize(new[i])
            beliefs[i] = 0.7 * beliefs[i] + 0.3 * new[i]  // damped
            max_delta = max(max_delta, max change)

        if max_delta < 0.01:  // converged
            break

    // Apply attractor state
    for i in uncertain:
        best = argmax(beliefs[i])
        if beliefs[i][best] > 1.2 * beliefs[i][current_text[i]]:
            text[i] = best  // flip to attractor

    return line_blocks

Appendix B: Full Results Table

Document	Baseline Blocks	Brainsight Blocks	Baseline HW	Brainsight HW	Unique HW (Base)	Unique HW (Brain)	Attractor Fixes	Time (s)
KaiserSunset	71	69	16	30	8	14	2	5.7
DoorHardware	342	296	19	48	7	23	3	21.0
Colonies 6px	~1100	860	0	116	0	22	3	109.0
FullSubmittal p1	114	111	12	26	6	13	0	13.1
FullSubmittal p2	54	52	4	12	4	10	0	4.9
PrecisionAuto	94	83	0	6	0	6	3	5.6