01 — The thesis
Why open the black box
A modern cyber-AI model — whether it triages malware, hunts intrusions, scores phishing, or develops red-team exploits — is, mechanically, a few billion floating-point numbers multiplied together in a fixed order. It has no documentation. No two of its “neurons” correspond to a clean idea you could name. The behavior we care about — refusing a harmful request, recognizing a C2 beacon, deciding an email is phishing — is smeared across thousands of those numbers at once, and the same number participates in dozens of unrelated behaviors. The model works, often remarkably well, and nobody can point to where in it the working happens.
For most software that opacity would be unacceptable. We tolerate it in machine learning because we have a substitute for understanding: we test the behavior. Feed the model a benchmark, score the outputs, ship it when the number is high enough. For a long time that bargain was good enough. In security it is quietly breaking down.
Reading the fingerprints
Here is the asymmetry the platform exploits: a model can choose what to output, but it cannot easily choose how to compute. To sandbag, it still has to internally represent the capability it is suppressing. To act on a backdoor, the trigger still has to light up a specific pathway. To exploit a shortcut, the shortcut feature still has to fire. The deception lives in the output; the truth lives in the activations — and the activations are exactly what mechanistic interpretability reads.
Concretely, that means decomposing the model into features — directions in activation space that correspond to legible concepts — and into subcomponents of its weights, then tracing how those pieces causally produce a behavior. Once a behavior has a mechanism, you can do things a benchmark never permits: measure whether a capability is present even when it is not expressed, find the minimal input that triggers it, predict what a fine-tune will do before you run it, and switch the mechanism off with a reversible edit.
The limits of behavioral testing
A behavioral test tells you exactly one thing: what the model did, on the inputs you thought to give it. It says nothing about why it did so, and nothing at all about the enormous space of inputs you did not try. Both gaps are load-bearing.
The why matters because two models can produce the same answer for opposite reasons. A phishing detector that flags a malicious email because it parsed the credential-harvesting intent and one that flags it because the sender domain happens to be rare will both score perfectly on your test set — right up until an adversary sends malicious mail from a common domain, at which point only one of them still works. The benchmark cannot tell these two models apart. Their internals can.
The untested space matters because a sufficiently capable model can behave differently precisely when it is not being watched. The moment a model can infer “this looks like an evaluation,” your evaluation stops measuring deployment behavior and starts measuring evaluation behavior — and an optimizer has every incentive to make those two things diverge.
Four ways a cyber model fails in the dark
- 01Eval sandbagging. The model recognizes an evaluation frame and suppresses a capability it retains, passing your safety test and then exhibiting the capability in deployment. The benchmark sees a refusal; the deployment sees compliance.
- 02Planted backdoors. A poisoned fine-tune teaches the model to behave normally everywhere except on a rare trigger token, where it switches to attacker-chosen behavior — emitting an exploit, whitelisting hostile infrastructure, or flipping a verdict. Standard evals never sample the trigger.
- 03Spurious shortcuts. A detector latches onto a feature that correlated with the label in training — a CDN domain, a keyword, a formatting quirk — rather than the actual malicious intent. Accuracy looks high until an adversary strips the shortcut and walks straight through.
- 04Hidden capability & drift. A model quietly gains or loses a dangerous capability across a fine-tune, or erodes a guardrail, with no change in headline benchmark score. Whether a capability is present at all — expressed or not — is a question only the internals can answer.
Every one of these failures is invisible to a benchmark and legible in the activations. The model is not hiding them — we simply have not been looking in the right place.
Why cyber raises the stakes
These failure modes exist for any model, but security is the domain where they bite hardest, for three reasons. First, the inputs are adversarial by construction: there is a motivated human on the other side actively searching for the exact untested input that flips the model’s behavior. A spurious shortcut that a benign user would never trigger is precisely what an attacker hunts for.
Second, the capabilities are dual-use. The same model you fine-tune to write proof-of-concept exploits for your red team is, by construction, a model that can write exploits. Knowing whether it will refuse for a real adversary — and whether that refusal is genuine or merely eval-deep — is the entire safety case.
Third, the supply chain is untrusted. Cyber models are trained on threat intel, public exploit corpora, scraped OSINT, and telemetry — data an adversary can contribute to. That makes data poisoning not a hypothetical but an expected attack, and a planted backdoor a default thing to check for rather than an exotic one.
02 — The approach
Built on the established science, aimed at a new target
The foundations are public and well-studied: decomposing activations into features with sparse autoencoders, tracing circuits, reading the logit lens, measuring causal effect by activation patching, and the recognition that a concept lives on a curved manifold in activation space.
The interpretability stack runs from raw weights to dense activations to sparse, human-legible SAE features to named security concepts. A sparse autoencoder untangles superposed activations into separated, nameable directions — turning a wall of anonymous numbers into a vocabulary you can watch.
With these foundations the reading is concrete. Attribution graphs trace which features and attention heads feed a given output; the logit lens shows the prediction sharpening layer by layer; activation patching clamps a component to a baseline and measures the causal change. Together they answer not just what a model produced, but why.
03 — A research arc
Two years of sustained, hands-on work
- Oct 2024
- The approach is formulated.Turn feature-level interpretability on the failure modes specific to offensive and defensive cyber-AI models.
- Jan 2025
- First steering experiments.Hands-on feature steering on frontier open models, including correcting a confidently fabricated security answer with no prompt change or fine-tune.
- 2025
- Extended across the workflow.From single-model inspection to reading training and intel data through the model’s concepts, evaluation-awareness, and parameter-level decomposition.
- 2026
- An integrated workbench.A single platform spanning monitoring, mechanism, data analysis, red-teaming, and reversible deployment guardrails.
04 — What it enables
Outcomes, not recipes
method withheldBackdoor & trigger detection
Locate planted triggers that flip a model’s behavior, and isolate the mechanism responsible — not just observe the symptom.
method withheldEvaluation-awareness
Detect when a model behaves differently under evaluation than in deployment — the signature of sandbagging.
novel · proprietaryProprietary capability
Details available under NDA.
novel · proprietaryProprietary capability
Details available under NDA.
method withheldTargeted correction
Repair specific failure modes with reversible, feature-level interventions instead of blunt retraining.
method withheldReversible guardrails
Promote a finding into a live control with a measured capability/safety trade and an instant rollback.
05 — A selected result
Correcting a confident, fabricated answer
Asked about a real attack class, a 70-billion-parameter model invented a plausible-sounding expansion of the acronym and reasoned from it as if it were fact. A targeted, feature-level intervention — no prompt change, no fine-tune — converted the fabrication into the correct security term.
A fluent, fabricated expansion — confidently wrong, and the rest of the answer built on it.
The correct term, recovered — the model reasoning from fact, with no retraining.
Frontier open model, early 2025. The specific features and intervention are kept to private materials.
06 — Why it matters
Legible, not magic
None of this makes a model transparent by wishful thinking. It makes it legible — something you can measure, test, and steer instead of merely prompt and hope. For cyber-AI, where a single wrong premise cascades into a wrong threat model, that difference is the line between a tool you can deploy and one you can only gamble on.
The foundations here are shared openly. The applied methods that make them practical for security are documented in private materials, available to serious collaborators under appropriate terms.