Building an Autonomous Pentest Agent: What I Learned the Hard Way

I spent six months building an autonomous penetration testing agent. Not wrapping Nmap in a ChatGPT prompt. Not asking Claude to “hack this website.” A real system - 32 detectors, 1,500+ automated tests, engagement database, proof-of-exploit validation, chain analysis. The kind of thing that finds a SQL injection in a path parameter, escalates to UNION-based extraction, dumps the users table, and records machine-verifiable evidence. In under two minutes.

This is what I learned about the gap between “AI does pentesting” as a conference slide and “AI does pentesting” as a system that actually works on production targets.

The Lie Everyone Is Telling

Every AI security company in 2026 is selling the same pitch: “Our AI agent autonomously finds vulnerabilities.” The marketing decks show a chat interface. The demo targets are DVWA. The benchmark numbers are cherry-picked CTF scores.

Here’s the uncomfortable truth that Carlini proved at ICML 2025:

Environment	AI Success Rate
CTF challenges	87%
Real-world applications	37%

That’s a 50-point gap. Not a rounding error. A canyon.

XBOW, which is currently ranked #1 on HackerOne with 1,400+ zero-days and a billion-dollar valuation, operates on the right side of that gap. They’ve proven autonomous vulnerability discovery works at scale. But they’re also a multi-hundred-person company with GPT-5 integration and 48-step attack chains. That’s not a weekend project.

The question I wanted to answer: can a single operator, with the right architecture, close that gap?

Why Most AI Pentest Tools Fail

I studied eight frameworks before building mine.

PentestGPT (86.5% on XBOW benchmarks, USENIX 2024)
RedAmon (Neo4j knowledge graphs)
H-mmer’s pentest-agents (50 agents, 2,605 payloads)
Shannon Pro (Temporal orchestration)
The academic ones, of which there is an overwhelming number.
The commercial ones.
The ones with 500 GitHub stars and no real findings.

They all hit the same walls:

Wall 1: LLMs Can’t HTTP

An LLM generating curl commands and parsing responses is doing security theater. It’s slow (2-3 seconds per request through the model), expensive (tokens for every header), and fragile (one malformed response breaks the chain).
The pattern that actually works: deterministic tools for detection, LLM for strategy. Your SQLi detector doesn’t need to “think.” It needs to send 34 precisely-crafted requests in 10 seconds and pattern-match the responses. That’s a Go binary, not a language model.
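To make the split concrete, here is a minimal Go sketch of the pattern-matching half of an error-based SQLi check. It is not the detector described in this post: the three signatures and the name MatchSQLError are illustrative assumptions, and a real detector would carry a much larger, curated signature set alongside the request-sending side.

```go
package main

import (
	"fmt"
	"regexp"
)

// errorSignatures is a small, illustrative set of database error patterns.
// The patterns are assumptions chosen for the example, not a complete list.
var errorSignatures = map[string]*regexp.Regexp{
	"mysql":      regexp.MustCompile(`(?i)you have an error in your sql syntax`),
	"postgresql": regexp.MustCompile(`(?i)syntax error at or near`),
	"sqlite":     regexp.MustCompile(`(?i)sqlite3?.*(syntax error|unrecognized token)`),
}

// signatureOrder fixes the lookup order so results are deterministic.
var signatureOrder = []string{"mysql", "postgresql", "sqlite"}

// MatchSQLError returns the database family whose error signature appears in
// body, or "" when nothing matches. No model call is involved.
func MatchSQLError(body string) string {
	for _, name := range signatureOrder {
		if errorSignatures[name].MatchString(body) {
			return name
		}
	}
	return ""
}

func main() {
	body := `ERROR: syntax error at or near "'" at character 42`
	fmt.Println(MatchSQLError(body))
}
```

A match here is only a signal. On its own, an error string is exactly the kind of unverified hint that Wall 3 below rules out as a finding.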

Wall 2: Context Windows Are a Lie

A 200K context window sounds like you can fit an entire engagement. You can’t. Response bodies, HTTP headers, JavaScript source code — a single page load can burn 20K tokens. Three requests deep into exploitation and you’ve forgotten the endpoints you discovered in recon.
The fix isn’t a bigger window. It’s structured state management. An engagement database that tracks endpoints, findings, negative evidence, coverage gaps, and proof artifacts. The LLM reads summaries. The database remembers everything.

Wall 3: No Verification = No Finding

This is the one that kills most agents. They find “something that looks like SQLi” - a 500 error, a timing difference, an error message - and report it. No proof. No replay. No differential confirmation.
My rule, inspired by the AIxCC finalist teams: no exploit, no finding. Every candidate has to pass a three-way oracle: the marker appears in the response, the response differs from the baseline, and the marker is absent from a clean control. If any of the three checks fails, it’s not a finding. Period.

This one decision, requiring machine-verifiable proof, took my precision from 70% to 90%.
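The three checks can be sketched in a few lines of Go. This is an illustration under assumptions, not the validation code from the system itself: the Evidence type, its field names, and the reading of "clean control" as a benign request of the same shape as the payload are all mine.

```go
package main

import (
	"fmt"
	"strings"
)

// Evidence holds the three response bodies the oracle compares.
// All names here are hypothetical.
type Evidence struct {
	Marker   string // unique value the payload should cause the target to emit
	Attack   string // response body for the payload request
	Baseline string // response body for the original, unmodified request
	Control  string // response body for a benign request of the same shape
}

// Verify applies the three checks and returns the names of any that failed.
// A candidate counts as a finding only when the failed list is empty.
func Verify(ev Evidence) (bool, []string) {
	var failed []string
	if ev.Marker == "" || !strings.Contains(ev.Attack, ev.Marker) {
		failed = append(failed, "marker-in-response")
	}
	if ev.Attack == ev.Baseline {
		failed = append(failed, "differs-from-baseline")
	}
	if ev.Marker != "" && strings.Contains(ev.Control, ev.Marker) {
		failed = append(failed, "absent-in-control")
	}
	return len(failed) == 0, failed
}

func main() {
	ev := Evidence{
		Marker:   "zq7x1",
		Attack:   `{"name":"zq7x1"}`,
		Baseline: `{"name":"alice"}`,
		Control:  `{"name":"bob"}`,
	}
	ok, failed := Verify(ev)
	fmt.Println(ok, failed)
}
```

Returning the names of the failed checks, rather than a bare boolean, keeps the rejected candidates useful as negative evidence.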

The Architecture That Works

After six months of iteration, the pattern that emerged:

Operator (LLM - Opus)
  |
  |-- reads source maps, builds mental model
  |-- ranks signals by effort-to-impact  
  |-- directs attack strategy
  |-- exploits business logic manually
  |
Native Detectors (Go binaries - 32 of them)
  |
  |-- deterministic test suites (1,500+ tests)
  |-- zero-token detection (no LLM in the loop)
  |-- structured output (findings, negatives, coverage)
  |-- proof-of-exploit with replay capability
  |
Engagement Database (SQLite)
  |
  |-- endpoints, tokens, findings, proofs
  |-- coverage tracking (what was tested, what wasn't)
  |-- chain analysis (do findings combine into something worse?)
  |-- cross-session resume (pick up where you left off)
  |
Proxy Integration (MCP protocol)
  |
  |-- structured HTTP through Caido
  |-- evidence capture linked to findings
  |-- scope enforcement at transport layer

The key insight: the LLM is the operator, not the executor.
It decides what to test. It doesn’t do the testing.
That’s what compiled, deterministic code does - faster, cheaper, and without hallucinating a finding that doesn’t exist.

What Autonomous Gets Right

Credit where it’s due. The things that work well with AI in the loop:

Recon comprehension. Give an LLM 50 source files extracted from a source map and ask “what does this app do, what data is sensitive, what would be critical.” It builds a mental model in 30 seconds that would take a human 2 hours.

Signal ranking. After recon produces 15 endpoints and 8 signals, the LLM ranks them by effort-to-impact better than any static heuristic. It understands that “Express + role field in JWT + two dashboards” screams vertical privilege escalation.

Chain detection. “You found an IDOR on user data. You found a password reset endpoint. Those chain into account takeover.” Humans do this intuitively. Codifying it is hard. LLMs do it naturally.
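For contrast, this is roughly what the codified version looks like: a lookup over hand-written rules. The sketch is illustrative only, and the fact labels, ChainRule, and DetectChains are hypothetical names. It catches the account takeover example above, but only because someone wrote that rule in advance, which is the limitation the LLM removes.

```go
package main

import "fmt"

// ChainRule says: when every fact in Needs is present, the chain Yields applies.
type ChainRule struct {
	Needs  []string
	Yields string
}

// DetectChains returns the outcome of every rule whose requirements are all
// present in facts. It cannot find a chain nobody wrote a rule for.
func DetectChains(facts []string, rules []ChainRule) []string {
	have := map[string]bool{}
	for _, f := range facts {
		have[f] = true
	}
	var out []string
	for _, r := range rules {
		satisfied := true
		for _, need := range r.Needs {
			if !have[need] {
				satisfied = false
				break
			}
		}
		if satisfied {
			out = append(out, r.Yields)
		}
	}
	return out
}

func main() {
	rules := []ChainRule{
		{Needs: []string{"idor:user-data", "endpoint:password-reset"}, Yields: "account-takeover"},
	}
	facts := []string{"idor:user-data", "endpoint:password-reset", "header:missing-csp"}
	fmt.Println(DetectChains(facts, rules))
}
```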

Report generation. Nobody wants to write the report. Let the machine do it: provide a template and it’s done.

What Autonomous Gets Wrong

The 37% real-world gap comes from three failure modes:

Business Logic Is Still Human Territory

“Can a patient cancel another patient’s appointment?” That’s an access control question with a deterministic answer. A detector handles it.

“Can you place an order for $0.01 by manipulating the price calculation between cart and checkout?” That’s a business logic question. It requires understanding what a product costs, why a price field exists in the request at all, and what the economic impact of the manipulation is.

I’ve built detectors for both. The access control one works. The business logic one requires the LLM to understand the business first, which means reading the source code, building a model of the application’s purpose, and then identifying where economic invariants can be broken. No amount of automated probing handles this.
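The hard part is naming the invariant. Once someone has, checking it is mechanical. As a hedged illustration, suppose the invariant is "the charged total equals catalog price times quantity, summed over the order." The types, the function CheckOrder, and the catalog below are hypothetical; amounts are integer cents to avoid floating-point rounding.

```go
package main

import "fmt"

// Line is one order line as submitted by the client.
type Line struct {
	SKU       string
	Qty       int
	UnitCents int64 // unit price as sent in the request
}

// CheckOrder compares a submitted order against the server-side catalog and
// returns a description of every broken invariant. An empty result means the
// order is consistent with the catalog.
func CheckOrder(catalog map[string]int64, lines []Line, chargedCents int64) []string {
	var violations []string
	var expected int64
	for _, l := range lines {
		price, known := catalog[l.SKU]
		switch {
		case !known:
			violations = append(violations, "unknown SKU "+l.SKU)
			continue
		case l.Qty <= 0:
			violations = append(violations, "non-positive quantity for "+l.SKU)
			continue
		case l.UnitCents != price:
			violations = append(violations, "client price differs from catalog for "+l.SKU)
		}
		expected += price * int64(l.Qty)
	}
	if chargedCents != expected {
		violations = append(violations,
			fmt.Sprintf("charged %d cents, expected %d", chargedCents, expected))
	}
	return violations
}

func main() {
	catalog := map[string]int64{"widget": 4999}
	// The $0.01 order: the client overrode the unit price and was charged 1 cent.
	tampered := []Line{{SKU: "widget", Qty: 1, UnitCents: 1}}
	for _, v := range CheckOrder(catalog, tampered, 1) {
		fmt.Println(v)
	}
}
```

Nothing in this code knows that a widget should cost $49.99 or that the total is the thing worth protecting. That knowledge has to come from reading the application first.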

Multi-Step Chains Require Judgment

The most interesting vulnerabilities aren’t single-request findings. They’re chains:

Register with role: "doctor" (mass assignment)
Login with doctor JWT
Access patient records (vertical BAC)
Extract medical data (IDOR across patient IDs)

Each step depends on the previous step’s output. The decision of whether step 1 worked requires understanding what a doctor role means in this application. A detector can test whether the field is accepted. It can’t determine whether the resulting privilege escalation is meaningful.

Novel Vulnerabilities Don’t Match Patterns

PortSwigger’s Top 10 Web Hacking Techniques of 2025 included error-based blind SSTI, Unicode normalization attacks, and HTTP/2 CONNECT internal scanning. Every single one was discovered by a human researcher through creative reasoning about protocol behavior.

AI finds more instances of known vulnerability classes. It does not discover new classes… yet! The two genuinely novel classes that emerged in the LLM era, prompt injection and hallucinated dependency attacks, were found by humans observing AI behavior. Not by AI observing software behavior.

The Numbers

After six months of development and validation against a standardized benchmark:

Metric	Value
Detectors	32 (27 web + 5 infrastructure)
Automated tests	1,500+
Detection rate (benchmark)	80%
Precision (exact match)	90%
Average time per target	3-5 minutes (detection), 15-30 minutes (full exploitation)
False positive rate	<10% (three-way oracle)
Token cost per engagement	Near zero (Go detectors don’t use LLM)

For context, PentestGPT scores 86.5% on the same benchmark class but at $1.11 average cost per challenge and 6.1 minutes per test.
My detectors run in seconds with zero token cost because they’re compiled Go binaries, not LLM inference calls.

The tradeoff: PentestGPT handles more diverse challenges because it can reason creatively. My system handles known patterns with extreme efficiency and leaves creative exploitation to the operator.

The Honest Assessment

Building this taught me that the AI pentest landscape is:

Overhyped by companies selling “AI-powered scanning” that’s just Nuclei with a chat interface. If your “AI pentester” can’t prove exploitation, it’s a scanner with better marketing.

Underhyped by practitioners who haven’t seen what purpose-built detection frameworks can do when you remove the LLM from the hot path and let it operate at the strategic layer where it’s actually good.

Misarchitected by teams that put the LLM in the HTTP request loop. Every token spent generating a curl command is a token not spent understanding the target’s architecture.

The future isn’t “AI replaces pentesters.” It isn’t “AI is useless for pentesting.” It’s AI as operator: reading source code like a senior researcher, directing testing like a team lead, and leaving the mechanical work to deterministic tools that are faster, cheaper, and more reliable than inference.

The 37% real-world gap is real. But it’s not a ceiling. It’s a measurement of where we are today with naive architectures. The right architecture (deterministic detection, LLM strategy, structured state, proof verification) narrows that gap to something useful.

I’m not done. Nobody is. But I have a system that finds real bugs on real targets with machine-verifiable evidence, and it’s been six months since I touched a scanner.

What’s Next

The project is ongoing, and I work on it daily. It’s not open source. It might never be. But the patterns are:

Separate detection from reasoning. Compiled tools for the former. LLMs for the latter.
Require proof. Three-way oracle minimum. No exploit, no finding.
Track everything. Engagement database survives session resets, enables cross-engagement learning.
Let the operator operate. The best architecture puts the human (or LLM) at the strategy layer, not the request layer.

If you’re building something similar, and I know some of you are, the architectural decision that matters most is where you draw the line between autonomous and operator-directed. Draw it too low and you’re a scanner. Draw it too high and you’re a chatbot.

The sweet spot is somewhere in the middle. I’m still finding it.