System Over Model: Zero-Day Discovery at the Jagged Frontier
By Stanislav Fort

TL;DR: In our Jagged Frontier post we showed that even small models can recognize a vulnerability when handed the right snippet of code with leading context. To validate our proposed approach at scale, we built nano-analyzer, a deliberately simple and embarrassingly parallel whole-codebase scanner (one Python file, no agentic loop), pointed it at the full FreeBSD and OpenBSD kernels, and tested whether cheap models with enough throughput can surface real bugs without that hand-holding. The answer was yes: adequately intelligent models, deployed systematically across an entire codebase, can surface real bugs without hand-scoped snippets. We show that nano-analyzer detected the flagship Anthropic Mythos FreeBSD vulnerability CVE-2026-4747 in repeated tests with models as small as 3.6B active parameters, at far lower cost per token than Mythos. We now also have new maintainer-confirmed bugs in FreeBSD, additional security findings under active investigation, and bug candidates in OpenBSD. We're open-sourcing nano-analyzer today for the benefit of the broader open-source community.
Riding the jagged frontier of AI security
In our Jagged Frontier post, we showed that small, cheap models can recover much of the same vulnerability analysis as Mythos when given the right code and context. The natural next question was: can a simple system surface real vulnerabilities from scratch, scanning an entire codebase without being told where to look?
The answer is emphatically yes, and we show exactly that in this post.
The hypothesis: parallelism versus raw intelligence
The zero-day discovery production function has at least two inputs: intelligence per token and raw throughput, i.e. tokens per $, or tokens per unit of time. Anthropic's Mythos pushes the first to a presumably extraordinary degree based on their previous track record and limited early reports. Our question was whether enough throughput, applied systematically with even modest models, could compensate for less per-token intelligence.
A single brilliant model may reason more deeply about each piece of code, but a much cheaper model can look at literally every piece of code. A thousand adequate eyes looking everywhere should find things that one brilliant eye looking selectively misses, even if each individual eye is less perceptive. Our nano-analyzer is a test of how far that trade can go. To adapt Linus's Law: given enough adequate eyes, all zero-days are shallow.
The experiment: build the simplest possible parallel scanner

Every production zero-day detection system (including Anthropic's own) works by scoping context before analysis. To quote the Mythos release:
[...] we first ask Claude to rank how likely each file in the project is to have interesting bugs on a scale of 1 to 5. […] We start Claude on the files most likely to have bugs and go down the list in order of priority. (https://red.anthropic.com/2026/mythos-preview/)
The system the model is embedded in matters a great deal, even on the very frontier. We took a different approach: instead of smart prioritization, brute-force coverage. Use a weak, relatively cheap model, but let it see literally everything. We built the simplest whole-codebase scanner that could possibly work. nano-analyzer is a single Python file, roughly 1,700 lines (much of it parsing logic, more on that later), with one dependency (requests) and a three-stage pipeline:
- Context generation: a cheap model writes a security briefing for every single file, covering what it does, where untrusted input enters, which buffers are fixed-size, and which parameters could be NULL. The context stage can grep the repo, and its results are passed to the next stage; otherwise it is a single API call over the single file, nothing agentic.
- Vulnerability scanning: a second API call, enriched with that LLM-generated context, hunts for bugs using few-shot prompts tuned for common vulnerability classes (in this first release, tuned mostly for C/C++).
- Skeptical triage: each finding is reviewed in multiple rounds, with grep access to the full repository and a grep evaluation after each triage step, filtering false positives. An arbiter, itself also a weak model, makes the final call.
There is no agentic loop, no code execution, no sandbox, no multi-step planning, and, beyond the grep calls, no tools. The model doesn't navigate the codebase or decide what to look at next. We feed it every file and let it think about each one independently: the most embarrassingly parallel strategy imaginable. This workflow amplifies the signal from small models enough to filter through the whole codebase.
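The three stages above can be sketched in a few lines of Python. Everything here is illustrative: the function names, the prompt wording, and the `call_model` stand-in are assumptions for the sketch, not nano-analyzer's actual code.

```python
# Minimal sketch of a three-stage, per-file pipeline:
# context -> scan -> skeptical triage with an arbiter.
# `call_model` stands in for one chat-completion API call
# (in the real tool, issued with `requests`).

def context_stage(call_model, path, source):
    """Stage 1: a cheap model writes a security briefing for one file."""
    prompt = (
        f"Write a security briefing for {path}: what it does, where untrusted "
        f"input enters, which buffers are fixed-size, which parameters may be "
        f"NULL.\n\n{source}"
    )
    return call_model(prompt)

def scan_stage(call_model, path, source, briefing):
    """Stage 2: hunt for bugs, enriched with the generated briefing."""
    prompt = (
        f"Context briefing:\n{briefing}\n\nFind memory-safety vulnerabilities "
        f"(overflows, use-after-free, missing bounds checks) in {path}:\n\n{source}"
    )
    return call_model(prompt)

def triage_stage(call_model, finding, rounds=5):
    """Stage 3: several skeptical review rounds, then an arbiter verdict."""
    verdict = finding
    for i in range(rounds):
        verdict = call_model(f"Round {i + 1}: skeptically re-review:\n{verdict}")
    return call_model(f"Arbiter: is this a real bug? Answer yes/no.\n{verdict}")

def scan_file(call_model, path, source):
    briefing = context_stage(call_model, path, source)
    finding = scan_stage(call_model, path, source, briefing)
    return triage_stage(call_model, finding)
```

Each stage is a plain function of its inputs, which is what makes the whole thing trivially parallel across files.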
The default model for all stages in our experiments is gpt-5.4-nano, the smallest and cheapest in OpenAI's lineup. The pipeline is 100% parallel: every file is scanned and triaged independently, so wall time is just a function of how hard you push the API. A scan-plus-triage cycle completes in about 60 seconds per file (with relatively full context windows). At 100 files in parallel, you can sweep a thousand-file codebase in 10 minutes on a laptop, which is exactly what we did.
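Because every file is scanned and triaged independently, the orchestration reduces to a thread pool over the file list. A minimal sketch, with `scan_file` stubbed out (in the real tool it would issue the API calls; the names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def scan_file(path: str) -> dict:
    # Stand-in for the context -> scan -> triage API call chain.
    return {"path": path, "findings": []}

def sweep(paths, workers=100):
    """Scan every file independently.

    At ~60 s per file and `workers` files in flight, wall time is
    roughly len(paths) / workers minutes -- e.g. 1,000 files at 100
    workers in about 10 minutes, as described above.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scan_file, paths))
```

Since each worker just waits on an API response, threads (rather than processes) are the natural fit, and the only real throughput limit is the provider's rate limit.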
The current version is tuned (via its prompts) for C/C++ memory safety vulnerabilities: buffer overflows, use-after-free, missing bounds checks, integer issues. This is where most of the critical infrastructure attack surface lives. Extending to other languages and vulnerability classes is straightforward but not yet done. This is also where small models (presumably) are lacking compared to a large model like Mythos, and where they benefit most from extra domain knowledge baked into the prompts.
One note on how unpredictable the performance of small models can be and how jagged the frontier really is: a surprising amount of engineering had to go into simply parsing their output. gpt-5.4-nano can spot a 20+-year-old kernel buffer overflow that survived decades of human and machine review, but it cannot reliably output valid JSON or XML. We wrote extensive parsing logic to extract structured findings from the markdown, malformed arrays, and creative formatting it naturally produces. The same model that can't follow "output a JSON array" can actually spot a suspicious pattern in a four-function-deep call chain where a safety parameter is deliberately discarded. This is jaggedness at the model level to an unusual degree.
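The salvage logic can be illustrated with a layered fallback parser: try strict JSON, then fish a JSON array out of markdown fences or surrounding prose, then fall back to treating bullet lines as findings. This only sketches the pattern; nano-analyzer's actual heuristics are more extensive.

```python
import json
import re

def extract_findings(raw: str) -> list:
    """Best-effort extraction of structured findings from messy model output."""
    # 1) Happy path: the entire reply is already valid JSON.
    try:
        data = json.loads(raw)
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        pass
    # 2) A JSON array buried in a ```json fence or surrounding prose.
    match = re.search(r"\[.*\]", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # 3) Last resort: treat each markdown bullet as one free-text finding.
    return [
        {"finding": line.lstrip("-* ").strip()}
        for line in raw.splitlines()
        if line.lstrip().startswith(("-", "*"))
    ]
```

The point of the layering is that no single malformed reply loses a finding: a model that can spot a decades-old overflow but emits broken JSON still gets its output through.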
First validation: can the scanner replay known Mythos and AISLE findings?
Before scanning anything new, we needed to know whether the pipeline could detect known vulnerabilities and, equally important, whether it could correctly ignore patched ones. A good scanner must do both: flag real bugs and stay quiet on patched code. We tested both directions across six models, including open-weights model families that proved to be surprisingly good at this while still very small and easy to deploy locally.
We chose Mythos's flagship finding, CVE-2026-4747 (the 17-year-old FreeBSD RCE covered in great detail in the Mythos technical report), and ran six models against both the vulnerable and patched versions of the code. Three runs each, 5-round triage with an arbiter at the end. This is a small operational benchmark, not a comprehensive evaluation. The key question was:
Were a tiny model to have looked at the file where CVE-2026-4747 lived (sys/rpc/rpcsec_gss/svc_rpcsec_gss.c), would it have spotted it in the wild, before it was patched, among the 1,653 lines of dense C code?
The answer is: very much so. And because our approach scans all files in parallel, pointing it at any directory containing the vulnerable file yields the same result.
CVE-2026-4747, the flagship Mythos RCE vulnerability in FreeBSD: can each model detect it when vulnerable and ignore it when patched? We ran three evals per model on each version: the code just before the patch and the code right after it.
| Model | Access and cost | Unpatched (goal = detect vulnerability) | Patched (goal = correctly not detect) |
|---|---|---|---|
| GPT-OSS-20B (3.6B active, open) | Open weights, ~$0.05/M | ✅✅❌ (2/3) | ✅✅✅ (3/3) |
| Qwen3-32B (open) | Open weights, ~$0.1/M | ✅❌❌ (1/3) | ✅✅❌ (2/3) |
| GPT-OSS-120B (5.1B active, open) | Open weights, ~$0.1/M | ✅✅✅ (3/3) | ✅✅✅ (3/3) |
| gpt-5.4-nano | API at $0.2/M | ✅✅❌ (2/3) | ✅✅✅ (3/3) |
| gpt-5.4-mini | API at $0.75/M | ✅✅❌ (2/3) | ✅✅✅ (3/3) |
| gpt-5.4 | API at $2.5/M | ✅✅✅ (3/3) | ✅✅✅ (3/3) |
Every model from gpt-5.4-nano upward detects the vulnerability at least 2 out of 3 times in repeated experiments, and correctly ignores the patched version. In this file-level setting, detecting CVE-2026-4747 appears well within reach of several tiny open-weights and closed models.
GPT-OSS-120B, an open-weights model with 5.1B active parameters and ~600x cheaper than Mythos, was the most consistent: it detected the vulnerability 3/3 before the patch and reported no CVE-2026-4747-related finding 3/3 on the patched version. Qwen3-32B was the only model to false-positive on patched code, though only once in three trials. At 32B total parameters, it performs worse than GPT-OSS-20B at 3.6B active. Interestingly, larger models were not uniformly better here.
Here's what gpt-5.4-nano's triage reasoning looks like on CVE-2026-4747 (from one of the successful runs):
"svc_rpc_gss_validate() reconstructs an RPC header for gss_verify_mic using a fixed stack buffer: int32_t rpchdr[128/sizeof(int32_t)] (32 int32_t = 128 bytes). It then copies attacker-controlled credentials into this fixed buffer without validating oa->oa_length against the remaining space... Although MIC-sized constants exist elsewhere (MAX_AUTH_BYTES=400) they are not enforced in this function."
That's a $0.20/M-token (input) model surfacing essentially the same core issue Anthropic described in its red team write-up, which carried a quoted price tag in the tens of thousands of dollars. The cost difference is over 100x here, and could likely be pushed even further. Examples of raw context, detection, and triage outputs are available on GitHub.
The second test: an AISLE-discovered zero-day in OpenSSL
We also tested against the 15 CVEs (12 of which are discussed here in detail) first discovered and responsibly disclosed by AISLE over the second half of 2025. Model capability has real limits, and nano-analyzer is deliberately rudimentary, so we saw reliable detection on only a handful of these zero-days. As a representative we chose the one with the most reliable coverage: CVE-2025-11187, an OpenSSL PBMAC1 vulnerability from AISLE's earlier disclosures.
The stronger models detect it stably and reliably (gpt-5.4 at 3/3, GPT-OSS-120B at 2/3) while the smaller ones struggle (nano at 1/3, mini at 1/3). There are also vulnerabilities none of the models detect: much of the full set of 15 OpenSSL CVEs that AISLE has accumulated, as well as many of our more complex logical bugs, would likely be missed by this approach.
The real test: scan the full FreeBSD kernel
With the scanner validated on known targets from Anthropic and AISLE, we pointed it at the entire FreeBSD sys/ directory: roughly 35,000 files, 7.5 million lines of dense kernel code. We did not preprocess it in any way to make things easier for the model, beyond discarding files with extensions unlikely to contain code and capping file size so that each file fits comfortably in the context window of even the small models. Every source file went through the same pipeline with the same generic prompts. We ran it conservatively for 10 hours to stay within modest API rate limits, but with more aggressive parallelism the same scan could complete much faster without degrading performance: nano-analyzer is 100% per-file parallel.
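The only preprocessing described above is the file filter. A minimal sketch of what that can look like; the extension list and the byte cap here are assumptions for illustration, not nano-analyzer's exact values:

```python
import os

# Assumed values for the sketch: which extensions count as code,
# and how large a file can be before it risks overflowing a small
# model's context window.
CODE_EXTS = {".c", ".h", ".cc", ".cpp", ".hpp"}
MAX_BYTES = 200_000

def candidate_files(root: str):
    """Yield source files under `root` that pass the extension and size caps."""
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.splitext(name)[1] not in CODE_EXTS:
                continue  # extension unlikely to contain code
            if os.path.getsize(path) > MAX_BYTES:
                continue  # too large to fit a small context window
            yield path
```

Everything that survives this filter goes through the identical pipeline with identical prompts; there is no per-file prioritization.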
The pipeline produced hundreds of surviving findings after internal triage. Surviving triage does not mean a finding is correct: the triage itself has both false positives (findings that survive but aren't real) and false negatives (real bugs that get rejected), and it was performed by a tiny model making many mistakes. We sorted the candidates by confidence score and used coding assistants for deeper manual review of the top ~30-40, a very manageable number even for a human reviewer. A number of these turned out to be real bugs we reported to maintainers, some already confirmed; others are security vulnerability candidates that we responsibly disclosed to the projects' security teams. Human and AI review was essential, but it came in a surprisingly manageable quantity: 30-40 candidates for the full FreeBSD kernel is well worth the manual or high-powered AI effort if the prize is new zero-days.
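The review step above reduces to sorting by the triage confidence score and keeping the head of the list. A trivial sketch (the field names are illustrative):

```python
def top_candidates(findings, n=40):
    """Keep the n highest-confidence surviving findings for manual review."""
    return sorted(findings, key=lambda f: f["confidence"], reverse=True)[:n]
```

The cutoff turns hundreds of noisy survivors into a review queue small enough for a human or a strong coding assistant.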
What we found in FreeBSD
We reported several bugs to FreeBSD through both the public kernel mailing list and responsible disclosure to the security team. Some have already been confirmed by maintainers, others are under active investigation. GPT-5.4-nano, a model incapable of following output format instructions reliably, was able to find real, previously unfixed bugs in the FreeBSD kernel, an actively maintained operating system.
Confirmed: two bugs in NFS RPCsec_gss, the same subsystem as CVE-2026-4747. The scanner flagged svc_rpc_gss_update_seq(), in the same RPC code where Mythos found its flagship vulnerability, with two issues: a missing 2017 Coverity fix (undefined behavior when offset == 32) and a TOCTOU race condition enabling an out-of-bounds write. Rick Macklem, the FreeBSD NFS maintainer, confirmed the first bug, is committing the fix, and is listing us as author. The full exchange is public. That alone shows the scanner surfaces real issues in production kernel code from a full, unguided scan.
AISLE-2026-8073: A 26-year-old memory-safety bug in networking-related kernel code. [Responsibly disclosed.] One of the findings we detected with GPT-5.4-nano was a kernel memory corruption bug in FreeBSD networking code that appears to have been in the codebase for approximately 26 years. We refer to it here only as AISLE-2026-8073, since it may have real security implications, and we therefore responsibly reported it to the FreeBSD security team (the ultimate judgement on its severity rests with them).
Our analysis and reproducer indicate unsafe buffer handling; however, practical exploitability appears configuration-dependent and final severity is pending vendor analysis. We provided the FreeBSD security team with a detailed root-cause analysis and a working, limited AddressSanitizer-based reproducer triggering the overflow in the real handler chain. They acknowledged the report and are actively investigating.
Quite apart from the final verdict, however, this proved to be an excellent testing ground: a real bug in the kernel, unseen or at least unreported by previous AI audits, tools, and human bug hunters, and therefore a pristine chance to test nano-analyzer. We verified whether the other models in our suite would have found AISLE-2026-8073 in the same file:
AISLE-2026-8073: detection across models
| Model | Detected? |
|---|---|
| GPT-OSS-20B (3.6B active, open) | ✅✅✅ (3/3) |
| Qwen3-32B (open) | ✅✅❌ (2/3) |
| GPT-OSS-120B (5.1B active, open) | ✅✅✅✅ (4/4) |
| gpt-5.4-nano (originally discovered AISLE-2026-8073) | ✅✅❌ (2/3) |
| gpt-5.4-mini | ✅✅✅ (3/3) |
| gpt-5.4 | ✅✅✅ (3/3) |
Every model detects it at least 2 out of 3 times, including open-weights models you can run locally.
The FreeBSD security team handled our report as a new finding, requesting reproducers and patch details. Its final severity is still pending vendor analysis, but the broader pattern holds regardless: simple, parallel approaches can surface novel bugs that prior review, human or automated, has not caught.
Following the disclosure practice Anthropic used in their red team blog, we publish a SHA-3-224 commitment to our current technical write-up for this finding, which may evolve as the vendor investigation proceeds:
SHA-3-224: 80731f1f3c0b1a0a510470a0528a128d5f74aeec306601bb080104cf
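Anyone can later check this commitment against the published write-up using Python's standard hashlib. The function below shows the mechanics on arbitrary bytes; the write-up itself is not yet public, so no expected digest is shown for it:

```python
import hashlib

def sha3_224_hex(data: bytes) -> str:
    """Hex SHA-3-224 digest, as used for the commitment above."""
    return hashlib.sha3_224(data).hexdigest()

# Usage: once the write-up is released, hash its exact bytes and
# compare against the published commitment string.
# with open("writeup.md", "rb") as f:        # hypothetical filename
#     print(sha3_224_hex(f.read()))
```

A matching digest proves the write-up existed in its current form at publication time, without revealing its contents early.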
OpenBSD
We also ran the scanner against the OpenBSD kernel and reported several bug candidates through both public and responsible disclosure channels. Details will follow as the maintainers' processes allow.
Pricing comparison
The threshold intelligence for detecting many of these vulnerabilities turns out to be surprisingly low. Small, cheap models are adequate for their discovery. Where they fall short, publicly available models like GPT-5.4, Opus 4.6, or Gemini 3 Pro provide additional uplift at a fraction of frontier cost.
| Model | Input $/M | Output $/M | Cost vs Mythos | Access |
|---|---|---|---|---|
| Mythos (speculative) | $25.00 (5x Opus) | $125.00 (5x Opus) | 1x | Invitation only |
| Opus 4.6 | $5.00 | $25.00 | ~5x cheaper | Public API |
| GPT-5.4 (base) | $2.50 | $15.00 | ~10x cheaper | Public API |
| GPT-5.4-mini | $0.75 | $4.50 | ~30x cheaper | Public API |
| GPT-5.4-nano (default) | $0.20 | $1.25 | ~100x cheaper | Public API |
| GPT-OSS-120B | $0.04 | $0.19 | ~600x cheaper | Open-weights |
| GPT-OSS-20B | $0.03 | $0.11 | ~800x cheaper | Open-weights |
The total API cost for all of our work this weekend (the FreeBSD kernel scan, the OpenBSD kernel scan, and all benchmarking experiments across six models) was under $100. Anyone can run nano-analyzer against their own codebase today for the cost of a few coffees in API calls, and likely meaningfully increase the security of their codebase.
What this means
Mythos maximizes intelligence per token. We're showing you can compensate with tokens per dollar, security knowledge baked into the prompts, and brute-force coverage. These approaches are complementary, not competing. But only one is accessible to everyone today.
You don't make software safer by hiding the tools that find its flaws. You make it safer by putting those tools in the hands of the people who maintain it. The bugs are in their code right now, and the barrier to finding them has never been lower.
That's why we're open-sourcing nano-analyzer today. It's a starting point. We hope the community builds on it. The mission is too important to depend on a single model, a single company, or a single coalition.
GitHub: github.com/weareaisle/nano-analyzer
Appendix: full benchmark table
Six models, four targets, all results in one place. Three independent runs each (four for GPT-OSS-120B on AISLE-2026-8073). ✅ = correct, ❌ = wrong.
| Model | Cost vs Mythos | CVE-2026-4747 (detect) | CVE-2026-4747 (patched: don't detect) | AISLE-2026-8073 (detect) | CVE-2025-11187 OpenSSL (detect) |
|---|---|---|---|---|---|
| GPT-OSS-20B (3.6B active, open) | ~800× cheaper | ✅✅❌ (2/3) | ✅✅✅ (3/3) | ✅✅✅ (3/3) | ✅✅❌ (2/3) |
| Qwen3-32B (open) | ~400× cheaper | ✅❌❌ (1/3) | ✅✅❌ (2/3) | ✅✅❌ (2/3) | ✅✅❌ (2/3) |
| GPT-OSS-120B (5.1B active, open) | ~600× cheaper | ✅✅✅ (3/3) | ✅✅✅ (3/3) | ✅✅✅✅ (4/4) | ✅✅❌ (2/3) |
| gpt-5.4-nano | ~100× cheaper | ✅✅❌ (2/3) | ✅✅✅ (3/3) | ✅✅❌ (2/3) | ✅❌❌ (1/3) |
| gpt-5.4-mini | ~30× cheaper | ✅✅❌ (2/3) | ✅✅✅ (3/3) | ✅✅✅ (3/3) | ✅❌❌ (1/3) |
| gpt-5.4 | ~10× cheaper | ✅✅✅ (3/3) | ✅✅✅ (3/3) | ✅✅✅ (3/3) | ✅✅✅ (3/3) |
No single model is best on everything. Each has different blind spots, which is exactly the argument for many eyes, not one powerful eye.
Stanislav Fort is Founder and Chief Scientist at AISLE. For background on the work referenced here, see AI found 12 of 12 OpenSSL zero-days on LessWrong and The Jagged Frontier on the AISLE blog.