NORTHTEK Labs · Research

Hyperion: Compositional Reasoning Without Neural Networks

An open-source reasoning system that solves four published AI benchmarks at or near 100% — without neural networks, training, or GPUs.

May 21, 2026 · NORTHTEK Labs · github.com/NORTHTEKDevs/hyperion

What it is

Hyperion is a small, fast, transparent system that learns the rules of a language or puzzle from a handful of examples, then applies those rules to cases it has never seen.

Think of it like watching someone solve three crossword puzzles, then being able to solve every crossword of that type forever after — without practicing thousands more, without a giant brain, and without forgetting.

It does this on four standardized academic benchmarks that researchers use to measure whether an AI system can generalize compositionally — apply learned rules to new combinations — rather than just memorize patterns.

The results

These are the actual numbers, produced by running code in the open-source repo against the official public datasets. They are gated by automated tests. If you clone the repo and the tests pass on your machine, you have reproduced the numbers yourself.

| Benchmark | What it measures | Hyperion | For comparison |
|---|---|---|---|
| SCAN | Novel combinations of simple commands | 100.00% | Transformer baselines: 1-50% depending on split |
| PCFG | Nested string-edit operations | 99.98% | Baselines: 50-80% |
| COGS | English-like sentences never seen before | 99.75% | Transformer baseline (Kim & Linzen 2020): ~35%. Best published neuro-symbolic methods: 60-80%. |
| 1D-ARC | Abstract grid puzzles | 100.00% | Joffe & Eliasmith (2025) with hand-crafted VSA: 83% |

Total codebase: roughly 1,000 lines of Python. Runs on a laptop. Trains in about 10 seconds. No GPU required. No gradient descent. No neural networks. The repo is at github.com/NORTHTEKDevs/hyperion.

Why this might matter

Two framings, both worth holding.

The conservative read: Hyperion is a strong empirical baseline on four well-known compositional-generalization benchmarks. It beats published methods on at least one of them (1D-ARC, where the prior best was 83%) and is competitive with or above the best published methods on the others. It runs in seconds. The code is auditable. That is, on its own, a research-grade result.

The more ambitious read: It's evidence for a hypothesis — that the kind of reasoning large language models famously struggle with might not require giant neural networks at all. It might just require representing problems in the right structured way and doing systematic search over candidate rules. If that hypothesis generalizes (an open question), it suggests a different paradigm of AI: small, transparent, energy-efficient, and provably correct on the cases it handles.

The honest version is somewhere between those two. We have a real, reproducible result on a narrow set of problems, and a plausible direction worth pushing further.

What it isn't

Because trust matters more than hype, here is what Hyperion cannot do:

How it works (in plain English)

For each benchmark, Hyperion extracts the underlying rule from a small set of training examples, then applies that rule to held-out test inputs. The mechanisms differ across benchmarks but all share the same meta-pattern: structured hypothesis space + systematic search + parameter induction from training data.
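As a rough illustration of that meta-pattern, here is a minimal sketch on a deliberately trivial toy domain (integer input/output pairs). It is not taken from the Hyperion repo: the hypothesis families, the names `induce_add`, `induce_scale`, `FAMILIES`, and `fit` are all made up for this example. Each family induces its parameter from the training pairs, and the search accepts the first family whose induced rule reproduces every training output.

```python
def induce_add(pairs):
    """Hypothesis family y = x + k; k is read off the first training pair."""
    k = pairs[0][1] - pairs[0][0]
    return lambda x: x + k


def induce_scale(pairs):
    """Hypothesis family y = k * x; k is read off the first usable pair."""
    for x, y in pairs:
        if x != 0 and y % x == 0:
            k = y // x
            return lambda v: k * v
    return None  # no pair lets us induce an integer k


# The structured hypothesis space, searched in a fixed order (hypothetical).
FAMILIES = [("add", induce_add), ("scale", induce_scale)]


def fit(pairs):
    """Return (family name, rule) for the first family that explains all pairs."""
    for name, induce in FAMILIES:
        rule = induce(pairs)
        if rule is not None and all(rule(x) == y for x, y in pairs):
            return name, rule
    return None  # nothing in the hypothesis space fits


train = [(1, 3), (2, 6), (5, 15)]
name, rule = fit(train)
print(name, rule(7))
```

The point of the sketch is only the shape of the loop: induction fills in parameters, and exact agreement with the training data decides which hypothesis survives.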

SCAN — Vector Symbolic Architecture

The SCAN solver is built on bipolar 8,192-dimensional vectors (every entry is +1 or -1) that you can bind (elementwise multiply), bundle (sum), and permute (shift). These operations let you store and retrieve role-filler pairs algebraically, with no learning required. The grammar's compositional structure is encoded directly. Everything else falls out of the algebra.
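The three operations can be sketched in a few lines of standard-library Python. This is a generic bipolar VSA illustration, not Hyperion's actual encoding: the role names (`ACTION`, `DIRECTION`), the filler vocabulary, and the helper names are assumptions chosen for the example, and the shift is assumed to be cyclic.

```python
import random

DIM = 8192  # dimensionality quoted in the text


def random_vector(rng):
    """Random bipolar vector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def bind(a, b):
    """Binding: elementwise product. Self-inverse for bipolar vectors."""
    return [x * y for x, y in zip(a, b)]


def bundle(*vectors):
    """Bundling: elementwise sum (left as integers, not re-thresholded)."""
    return [sum(column) for column in zip(*vectors)]


def permute(a, shift=1):
    """Permutation: cyclic shift to the right by `shift` positions."""
    shift %= len(a)
    return a[-shift:] + a[:-shift]


def similarity(a, b):
    """Dot product divided by the dimension."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
# Hypothetical roles and fillers, for illustration only.
roles = {name: random_vector(rng) for name in ("ACTION", "DIRECTION")}
fillers = {name: random_vector(rng) for name in ("jump", "walk", "left", "right")}

# Store two role-filler pairs in a single vector.
record = bundle(bind(roles["ACTION"], fillers["jump"]),
                bind(roles["DIRECTION"], fillers["left"]))


def query(record, role):
    """Unbind a role, then return the nearest known filler."""
    noisy = bind(record, roles[role])
    return max(fillers, key=lambda name: similarity(noisy, fillers[name]))


print(query(record, "ACTION"), query(record, "DIRECTION"))
```

Unbinding works because multiplying by a role vector twice cancels it out, leaving the stored filler plus noise from the other pair. In 8,192 dimensions that noise is nearly orthogonal to every filler, so the nearest-neighbor lookup recovers the right one.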

COGS — Template induction plus a recursive-structure parser

The COGS solver has two parts. The first is a template induction algorithm that scans training pairs, identifies the input "shape" (positions of proper nouns, determiners, function words), and learns what each output position should be. The second is a recursive-structure fallback parser that handles cases the template learner can't: prepositional phrase recursion, clausal complements, control verbs, ditransitive constructions, and passive voice variants. It learns its parameters (past-tense to infinitive mappings, intransitive subject roles, ditransitive verb sets) entirely from training data.
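To make the template idea concrete, here is a toy sketch under heavy assumptions. The closed-class word list, the function names, and above all the output notation (`chased ( agent = dog , theme = cat )`) are invented for this example; the notation is not the real COGS logical form, and the repo's learner may work quite differently. The sketch abstracts each input into a shape, records for each output token whether it is a literal or a copy of an input position, and reuses that template on unseen sentences with the same shape.

```python
# Assumed closed-class vocabulary; a real system would derive this from data.
FUNCTION_WORDS = {"a", "the", "was", "by", "."}


def shape_of(tokens):
    """Keep function words, replace every other token with a SLOT marker."""
    return tuple(t if t in FUNCTION_WORDS else "SLOT" for t in tokens)


def induce_template(src, tgt):
    """Describe each output token as a copy of an input position or a literal."""
    template = []
    for tok in tgt:
        if tok in src and tok not in FUNCTION_WORDS:
            template.append(("copy", src.index(tok)))
        else:
            template.append(("lit", tok))
    return template


def fit(pairs):
    """Learn one template per input shape from (sentence, output) pairs."""
    templates = {}
    for sentence, output in pairs:
        src, tgt = sentence.split(), output.split()
        templates.setdefault(shape_of(src), induce_template(src, tgt))
    return templates


def predict(templates, sentence):
    """Fill the template for this shape, or return None if the shape is unseen."""
    tokens = sentence.split()
    template = templates.get(shape_of(tokens))
    if template is None:
        return None  # this is where a fallback parser would take over
    return " ".join(tokens[arg] if kind == "copy" else arg
                    for kind, arg in template)


train = [("the dog chased a cat .", "chased ( agent = dog , theme = cat )")]
templates = fit(train)
print(predict(templates, "the girl liked a cake ."))
```

A single training pair is enough here because the new sentence has the same shape. Sentences with an unseen shape return `None`, which corresponds to the cases the text hands to the fallback parser.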

1D-ARC — Program synthesis

A program-synthesis engine over a library of generic grid transformations: shift, mirror, fill, recolor, scale, copy-pattern, and so on. For each task's three to five training examples, it enumerates programs and selects the first one that matches all of them exactly. Three parameter-induction modules (length to color, longest-run to color, parity to color) learn their parameters from training. Programs are ordered general-before-specific so the Occam-prior choice wins ties.
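The enumerate-and-check loop can be sketched as follows. The three transformations, their parameter ranges, and the names `candidate_programs` and `synthesize` are assumptions for illustration; the repo's library is larger and its ordering may differ. Candidates are yielded with parameter-free programs first, and the first one that reproduces every training pair exactly is returned.

```python
from itertools import product


def shift(grid, k):
    """Shift cells right by k (left if negative), padding with background 0."""
    out = [0] * len(grid)
    for i, value in enumerate(grid):
        if 0 <= i + k < len(grid):
            out[i + k] = value
    return out


def mirror(grid):
    """Reverse the grid."""
    return grid[::-1]


def recolor(grid, src, dst):
    """Replace every cell of color src with color dst."""
    return [dst if value == src else value for value in grid]


def candidate_programs(max_shift=3, colors=(1, 2, 3)):
    """Yield (name, function) pairs, parameter-free programs first."""
    yield "identity", lambda g: list(g)
    yield "mirror", mirror
    for k in range(1, max_shift + 1):
        for s in (k, -k):
            yield f"shift({s})", lambda g, s=s: shift(g, s)
    for src, dst in product(colors, repeat=2):
        if src != dst:
            yield f"recolor({src}->{dst})", lambda g, a=src, b=dst: recolor(g, a, b)


def synthesize(train_pairs):
    """Return the first candidate that matches all training pairs exactly."""
    for name, program in candidate_programs():
        if all(program(x) == y for x, y in train_pairs):
            return name, program
    return None  # no program in the library explains the examples


train = [
    ([0, 1, 1, 0, 0], [0, 0, 1, 1, 0]),
    ([2, 0, 0, 0, 0], [0, 2, 0, 0, 0]),
    ([0, 0, 3, 0, 0], [0, 0, 0, 3, 0]),
]
name, program = synthesize(train)
print(name, program([1, 0, 0, 2, 0]))
```

Note that `mirror` happens to explain the first training pair on its own. It is rejected only because it fails the second, which is why the check runs against all examples rather than just one.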

How to verify the numbers yourself

You should not trust any AI claim on faith. Here is how to check.

1. Run the tests.

git clone https://github.com/NORTHTEKDevs/hyperion
cd hyperion
pip install -e .
python data/scan/download_and_prep.py
python data/cogs/download_and_prep.py
python data/pcfg/download_and_prep.py
python data/arc1d/download_and_prep.py
python -m pytest tests/ -v

The tests assert specific accuracy thresholds. If they pass on your machine, you have reproduced the reported numbers.
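For readers unfamiliar with this kind of gate, a threshold test generally looks something like the sketch below. The function names, the stand-in data, and the 0.99 threshold are hypothetical and not copied from the repo's `tests/` directory; a real test would load a benchmark split and call the fitted model instead of using hard-coded lists.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that equal their target exactly."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("empty evaluation set")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def test_accuracy_meets_threshold():
    # Stand-in data so the sketch runs on its own.
    targets = ["a b", "c d", "e f", "g h"]
    predictions = ["a b", "c d", "e f", "g h"]
    assert exact_match_accuracy(predictions, targets) >= 0.99
```

pytest collects any function whose name starts with `test_` and reports a failure if the `assert` does not hold, so a drop in accuracy below the threshold fails the test run.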

2. Print the raw numbers. A one-liner in the README runs all four evaluations end to end and prints the per-benchmark percentages. The evaluation functions are deterministic — same data, same output, every time.

3. Audit the code. The whole research codebase is about 1,000 lines of Python. You can read it. There is no per-test hardcoding. The fit functions take training pairs only. The evaluation functions compare predictions to ground truth. You can grep the code yourself.

What this is part of

Hyperion is the research arm of Northtek's broader work on alternative AI architectures. The production stack that pays the bills — websites, automations, the realtor CRM, the agent runtime — uses standard tools where standard tools are the right answer. The research is where we ask "is there a smaller, more transparent way to do what large language models do, in the cases where their downsides matter?"

Right now the answer for these four benchmarks is yes. Whether it generalizes to harder problems is an open question, and the obvious next step is the full ARC-AGI benchmark (the 2D version of 1D-ARC). That's the direction.

The code is MIT-licensed. Fork it, use it, build on it, cite it. Issues and pull requests welcome.

See the code

The full repo is on GitHub. Clone it, run the tests, audit the work.

github.com/NORTHTEKDevs/hyperion

Built by Kristian Baer / Northtek. Research code, MIT license. Citation in CITATION.cff.