Hyperion: Compositional Reasoning Without Neural Networks

May 21, 2026 NORTHTEK Labs github.com/NORTHTEKDevs/hyperion

What it is

Hyperion is a small, fast, transparent system that learns the rules of a language or puzzle from a handful of examples, then applies those rules to new cases it has never seen.

Think of it like watching someone solve three crossword puzzles, then being able to solve every crossword of that type forever after — without practicing thousands more, without a giant brain, and without forgetting.

It does this on four standardized academic benchmarks that researchers use to measure whether an AI system can generalize compositionally — apply learned rules to new combinations — rather than just memorize patterns.

The results

These are the actual numbers, produced by running code in the open-source repo against the official public datasets. They are gated by automated tests. If you clone the repo and the tests pass on your machine, the numbers are real.

Benchmark	What it measures	Hyperion	For comparison
SCAN	Novel combinations of simple commands	100.00%	Transformer baselines: 1-50% depending on split
PCFG	Nested string-edit operations	99.98%	Baselines: 50-80%
COGS	English-like sentences never seen before	99.75%	Transformer baseline (Kim & Linzen 2020): ~35%. Best published neuro-symbolic methods: 60-80%.
1D-ARC	Abstract grid puzzles	100.00%	Joffe & Eliasmith (2025) with hand-crafted VSA: 83%

Total codebase: roughly 1,000 lines of Python. Runs on a laptop. Trains in about 10 seconds. No GPU required. No gradient descent. No neural networks. The repo is at github.com/NORTHTEKDevs/hyperion.

Why this might matter

Two framings, both worth holding.

The conservative read: Hyperion is a strong empirical baseline on four well-known compositional-generalization benchmarks. It beats published methods on at least one of them (1D-ARC, where the prior best was 83%) and is competitive with or above the best published methods on the others. It runs in seconds. The code is auditable. That is, on its own, a research-grade result.

The more ambitious read: It's evidence for the hypothesis that the kind of reasoning large language models famously struggle with might not require giant neural networks at all. It might just require representing problems in the right structured way and doing systematic search over candidate rules. If that hypothesis generalizes (an open question), it suggests a different paradigm of AI: small, transparent, energy-efficient, and verifiably correct on the cases it handles.

The honest version is somewhere between those two. We have a real, reproducible result on a narrow set of problems, and a plausible direction worth pushing further.

What it isn't

Because trust matters more than hype, here is what Hyperion cannot do:

It is not a ChatGPT replacement. You can't have a conversation with it. It can't summarize a PDF or write you a poem.
It only works in domains with a learnable rule structure. Formal languages, structured puzzles, simplified grammars. Not real internet text. Not arbitrary natural language.
The COGS parser knows the schema of COGS. Knowledge of determiners, prepositions, and role labels is hand-coded into it. What it learns from data is the lexicon and the parameter values. So this isn't pure "discover the language from scratch"; it's data-driven parameter induction over a structured hypothesis space.
1D-ARC is much easier than 2D-ARC. The full ARC-AGI benchmark is two-dimensional and dramatically harder. Extending this approach to 2D is open research, not a done deal.
It will not scale to general natural language. The mechanism depends on surface regularity in the inputs. Real internet text isn't regular like that.

How it works (in plain English)

For each benchmark, Hyperion extracts the underlying rule from a small set of training examples, then applies that rule to held-out test inputs. The mechanisms differ across benchmarks but all share the same meta-pattern: structured hypothesis space + systematic search + parameter induction from training data.

SCAN — Vector Symbolic Architecture

Built on bipolar 8,192-dimensional vectors that you can bind (multiply), bundle (sum), and permute (shift). These let you store and retrieve role-filler pairs algebraically — no learning required. The grammar's compositional structure is encoded directly. Everything else falls out of the algebra.
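The three operations named above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code from the repo: the helper names (`random_vector`, `bind`, `bundle`, `permute`, `cleanup`) and the toy roles and fillers are assumptions made for the example, and a real implementation would more likely use NumPy arrays.

```python
import random

DIM = 8192  # dimensionality quoted in the text


def random_vector(rng):
    """Random bipolar vector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def bind(a, b):
    """Binding: elementwise product. For bipolar vectors it is its own inverse."""
    return [x * y for x, y in zip(a, b)]


def bundle(*vectors):
    """Bundling: elementwise sum. The result stays similar to every input."""
    return [sum(column) for column in zip(*vectors)]


def permute(a, shift=1):
    """Permutation: cyclic shift, one common way to encode position or order."""
    shift %= len(a)
    return a[-shift:] + a[:-shift]


def similarity(a, b):
    """Dot product divided by the dimension; near 0 for unrelated random vectors."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


def cleanup(noisy, codebook):
    """Name of the codebook entry most similar to the noisy vector."""
    return max(codebook, key=lambda name: similarity(noisy, codebook[name]))


rng = random.Random(0)
# Hypothetical roles and fillers, chosen only for this illustration.
roles = {name: random_vector(rng) for name in ("action", "direction")}
fillers = {name: random_vector(rng) for name in ("jump", "walk", "left", "right")}

# Store two role-filler pairs in one vector.
record = bundle(
    bind(roles["action"], fillers["jump"]),
    bind(roles["direction"], fillers["left"]),
)

# Binding the record with a role again yields a noisy copy of that role's filler.
recovered_action = cleanup(bind(record, roles["action"]), fillers)
recovered_direction = cleanup(bind(record, roles["direction"]), fillers)
```

Retrieval works because binding a bipolar vector with itself gives the all-ones vector, so unbinding leaves the stored filler plus a term that looks like noise to every codebook entry.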

COGS — Template induction plus a recursive-structure parser

A two-part system. First, a template induction algorithm that scans training pairs, identifies the input "shape" (positions of proper nouns, determiners, function words), and learns what each output position should be. Second, a recursive-structure fallback parser that handles cases the template learner can't: prepositional phrase recursion, clausal complements, control verbs, ditransitive constructions, passive voice variants. It learns its parameters — past-tense to infinitive mappings, intransitive subject roles, ditransitive verb sets — entirely from training data.
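The template-induction half can be illustrated with a toy sketch. Everything here is an assumption made for the example: the function-word list, the bracketed output format, and the function names are hypothetical and much simpler than the real COGS logical forms. The idea shown is only the general one described above: abstract each input to a shape, then record which input slot fills each output position.

```python
# Assumed toy list of function words; the real system's inventory may differ.
FUNCTION_WORDS = {"the", "a", "on", "in", "beside", "to", "was", "by"}


def shape_of(tokens):
    """Keep function words and replace every other token with a slot marker."""
    return tuple(tok if tok in FUNCTION_WORDS else "_" for tok in tokens)


def content_slots(tokens):
    """Content words in input order; these are the values the slots carry."""
    return [tok for tok in tokens if tok not in FUNCTION_WORDS]


def induce_templates(pairs):
    """Map each input shape to an output template built from slots and literals."""
    templates = {}
    for source, target in pairs:
        src = source.split()
        slots = content_slots(src)
        template = tuple(
            ("slot", slots.index(tok)) if tok in slots else ("lit", tok)
            for tok in target.split()
        )
        templates.setdefault(shape_of(src), template)
    return templates


def apply_templates(templates, source):
    """Fill the learned template for this shape, or return None if unseen."""
    src = source.split()
    template = templates.get(shape_of(src))
    if template is None:
        return None  # a fuller system would hand off to a fallback parser here
    slots = content_slots(src)
    return " ".join(slots[val] if kind == "slot" else val for kind, val in template)


train = [("the cat chased a dog", "chased ( agent = cat , theme = dog )")]
templates = induce_templates(train)
prediction = apply_templates(templates, "the bird liked a fish")
unseen_shape = apply_templates(templates, "a fish was liked by the bird")
```

One training pair is enough to cover every sentence with the same shape, which is why a small number of examples can go a long way in a regular language. Shapes the learner has never seen return `None`, the point at which the fallback parser described above would take over.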

1D-ARC — Program synthesis

A program-synthesis engine over a library of generic grid transformations: shift, mirror, fill, recolor, scale, copy-pattern, and so on. For each task's three to five training examples, it enumerates programs and selects the first one that matches all of them exactly. Three parameter-induction modules (length to color, longest-run to color, parity to color) learn their parameters from training. Programs are ordered general-before-specific so the Occam-prior choice wins ties.
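The enumerate-and-check loop can be sketched as follows. This is an illustrative miniature under stated assumptions, not the repo's engine: the library contents, the ordering, and the names (`LIBRARY`, `synthesize`, `fill_between`) are hypothetical, grids are plain lists of integers with 0 as background, and the parameter-induction modules are left out.

```python
def shift_right(grid):
    """Move every cell one step right, padding with background (0)."""
    return [0] + grid[:-1]


def shift_left(grid):
    """Move every cell one step left, padding with background (0)."""
    return grid[1:] + [0]


def mirror(grid):
    """Reverse the grid."""
    return grid[::-1]


def fill_between(grid):
    """Fill the span from the first to the last non-zero cell with the first color."""
    nonzero = [i for i, cell in enumerate(grid) if cell]
    if not nonzero:
        return list(grid)
    lo, hi = nonzero[0], nonzero[-1]
    return grid[:lo] + [grid[lo]] * (hi - lo + 1) + grid[hi + 1:]


def recolor(color):
    """Build a program that paints every non-zero cell with one fixed color."""
    def apply(grid):
        return [color if cell else 0 for cell in grid]
    return apply


# Ordered list of candidate programs; earlier entries win ties.
LIBRARY = [
    ("shift_right", shift_right),
    ("shift_left", shift_left),
    ("mirror", mirror),
    ("fill_between", fill_between),
] + [(f"recolor_{c}", recolor(c)) for c in range(1, 10)]


def synthesize(examples):
    """Return the first (name, program) that reproduces every training pair exactly."""
    for name, program in LIBRARY:
        if all(program(inp) == out for inp, out in examples):
            return name, program
    return None


train = [
    ([0, 3, 0, 0, 3, 0], [0, 3, 3, 3, 3, 0]),
    ([2, 0, 2, 0, 0], [2, 2, 2, 0, 0]),
]
name, program = synthesize(train)
prediction = program([0, 0, 5, 0, 0, 0, 5])
```

Requiring an exact match on every training pair is what keeps the search honest: a program that is right on two of three examples is simply rejected, and the library order decides between programs that all fit.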

Verified by adversarial audit

Before publishing, the numbers were audited end-to-end in attack mode — five distinct phases trying to falsify them. All passed.

The single most important test: the COGS ground-truth labels were shuffled randomly, then the system was re-evaluated. If the system were somehow peeking at labels, accuracy would stay near 99%. It dropped to 0.005%, the near-zero level expected when predictions are scored against randomly reassigned labels. That's the cleanest signal that the system isn't cheating.
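The logic of the shuffled-label control can be shown on a toy task. This sketch is not taken from audit_perturb.py; the predictors, the data, and the function names are hypothetical stand-ins chosen to contrast an honest system with one that leaks labels.

```python
import random


def evaluate(make_predictor, inputs, labels):
    """Score a predictor against the given labels.

    The predictor factory receives the evaluation pairs on purpose, so that a
    leaky system has something to cheat with.
    """
    dataset = dict(zip(inputs, labels))
    predict = make_predictor(dataset)
    hits = sum(predict(x) == y for x, y in zip(inputs, labels))
    return hits / len(inputs)


def honest(dataset):
    """Applies a fixed rule and ignores the evaluation labels entirely."""
    return lambda x: x.upper()


def leaky(dataset):
    """Cheats by looking up the label it is about to be scored against."""
    return lambda x: dataset[x]


inputs = [f"w{i}" for i in range(1000)]
labels = [x.upper() for x in inputs]

shuffled = list(labels)
random.Random(0).shuffle(shuffled)  # break the input-label correspondence

honest_true = evaluate(honest, inputs, labels)
honest_shuffled = evaluate(honest, inputs, shuffled)
leaky_shuffled = evaluate(leaky, inputs, shuffled)
```

The honest predictor scores perfectly on the true labels and collapses on the shuffled ones, while the leaky predictor keeps a perfect score either way. A collapse under shuffling is therefore the behavior a non-cheating system should show.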

Full audit report and reproducible scripts are in the repo: AUDIT_REPORT.md, audit_independent.py, audit_perturb.py.

How to verify the numbers yourself

You should not trust any AI claim on faith. Here is how to check.

1. Run the tests.

git clone https://github.com/NORTHTEKDevs/hyperion
cd hyperion
pip install -e .
python data/scan/download_and_prep.py
python data/cogs/download_and_prep.py
python data/pcfg/download_and_prep.py
python data/arc1d/download_and_prep.py
python -m pytest tests/ -v

The tests assert specific accuracy thresholds. If they pass on your machine, the numbers are real.

2. Print the raw numbers. A one-liner in the README runs all four evaluations end to end and prints the per-benchmark percentages. The evaluation functions are deterministic — same data, same output, every time.

3. Audit the code. The whole research codebase is about 1,000 lines of Python. You can read it. There is no per-test hardcoding. The fit functions take training pairs only. The evaluation functions compare predictions to ground truth. You can grep the code yourself.
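For readers unfamiliar with accuracy-gated tests, the pattern behind steps 1 and 2 looks roughly like the sketch below. None of these names come from the repo: the toy learner, the threshold, and the test functions are hypothetical, and the real tests should be read in tests/ directly. The two properties illustrated are the ones claimed above: fitting sees training pairs only, and evaluation is deterministic.

```python
def fit_substitution(train_pairs):
    """Toy learner: induce a character-to-character mapping from training pairs."""
    mapping = {}
    for source, target in train_pairs:
        for src_char, tgt_char in zip(source, target):
            mapping[src_char] = tgt_char
    # The returned model closes over the learned mapping and nothing else.
    return lambda text: "".join(mapping.get(ch, ch) for ch in text)


def evaluate(fit, train_pairs, test_pairs):
    """Fit on training pairs only, then return exact-match accuracy in percent."""
    model = fit(train_pairs)
    correct = sum(model(x) == y for x, y in test_pairs)
    return 100.0 * correct / len(test_pairs)


TRAIN = [("abc", "bcd"), ("xyz", "yza")]
TEST = [("cab", "dbc"), ("zy", "az")]  # new combinations of seen characters


def test_accuracy_gate():
    # Hypothetical threshold; the real tests pin their own per-benchmark values.
    assert evaluate(fit_substitution, TRAIN, TEST) >= 99.0


def test_deterministic():
    # Same data in, same number out, on every run.
    first = evaluate(fit_substitution, TRAIN, TEST)
    second = evaluate(fit_substitution, TRAIN, TEST)
    assert first == second
```

Saved in a file whose name starts with `test_`, these functions would be collected by pytest in the same way as the command in step 1 collects the repo's tests.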

What this is part of

Hyperion is the research arm of Northtek's broader work on alternative AI architectures. The production stack that pays the bills — websites, automations, the realtor CRM, the agent runtime — uses standard tools where standard tools are the right answer. The research is where we ask "is there a smaller, more transparent way to do what large language models do, in the cases where their downsides matter?"

Right now the answer for these four benchmarks is yes. Whether it generalizes to harder problems is an open question, and the obvious next step is the full ARC-AGI benchmark, the two-dimensional task that 1D-ARC simplifies. That's the direction.

The code is MIT-licensed. Fork it, use it, build on it, cite it. Issues and pull requests welcome.

See the code

The full repo is on GitHub. Clone it, run the tests, audit the work.

github.com/NORTHTEKDevs/hyperion

Built by Kristian Baer / Northtek. Research code, MIT license. Citation in CITATION.cff.