Blog / 6 min read

open-source pentesting at commercial scale

pwnkit's measured performance on the public xbow benchmark now matches the published single-model results from commercial pentest stacks. the engine is open source. the methodology is public.

Consider a founder two weeks after a series A. The product is shipping, the SOC 2 auditor is asking pointed questions, and the only security budget item so far was a Burp scan a contractor ran six months ago. Pentest shop quotes are coming back at $25k for a week of work, a PDF at the end, and a three-month lead time. The product keeps shipping every day.

Or a bug-bounty hunter on a Sunday night staring at a scope of 200 npm packages, a private program in beta, and no obvious target to start with.

Both of these people have access to tooling that did not exist a month ago.

what pwnkit is

pwnkit is an open-source, autonomous AI pentesting agent. Point it at a URL, an API, an npm package, or a source tree — it runs reconnaissance, attacks, and validates every finding with a working proof of concept. No signatures, no regex rules, no dashboard tax.

the measured result

As of this week, pwnkit’s measured performance matches the published single-model results from commercial AI pentesting stacks on the public xbow benchmark. xbow is the reference public benchmark for AI pentesting agents: 104 real exploitation challenges across SQLi, SSTI, IDOR, SSRF, LFI, auth bypass, deserialization, the rest of the OWASP top 10, plus a long tail. Not detection. Flag extraction, end-to-end.
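"Flag extraction, end-to-end" has a mechanically checkable meaning. A minimal sketch of the success criterion — the `FLAG{...}` format and the `solved` helper are illustrative assumptions, not xbow's actual harness:

```python
import re

# Hypothetical flag format; real benchmarks plant their own per-challenge secrets.
FLAG_RE = re.compile(r"FLAG\{[A-Za-z0-9_\-]+\}")

def solved(agent_transcript: str, expected_flag: str) -> bool:
    """A challenge counts as solved only if the exact planted flag
    appears in the agent's output -- detection claims score zero."""
    return expected_flag in FLAG_RE.findall(agent_transcript)

# A report saying "parameter `id` looks injectable" does not score;
# extracting the flag through the injection does.
assert not solved("possible SQLi on /users?id=1", "FLAG{s3cr3t}")
assert solved("payload returned: FLAG{s3cr3t}", "FLAG{s3cr3t}")
```

This is the whole reason the benchmark is hard to game: there is no partial credit for a plausible-sounding finding.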

Until recently, the top of that leaderboard was closed-source commercial stacks. pwnkit is now in the same range as the best published single-model results — MIT-licensed, source available, installable from npm. The live numbers are maintained at docs.pwnkit.com/benchmark and update on every CI run. (See the XBOW methodology and verification post for what these numbers mean across substrates — the headline caveat is that any score on “the xbow benchmark” depends on which substrate, single-shot vs best-of-N, the model, and the turn cap. The methodology disclosure behind pwnkit’s numbers is public.)
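One reason single-shot vs best-of-N matters so much: best-of-N inflates solve rates mechanically. Under the simplifying (and idealized) assumption of independent attempts with per-run solve probability p, the math is just:

```python
def pass_at_n(p: float, n: int) -> float:
    """Probability that at least one of n independent runs solves a
    challenge with per-run solve probability p. An idealized model,
    not a claim about any vendor's actual sampling setup."""
    return 1.0 - (1.0 - p) ** n

# A 40% single-shot agent reads as a ~92% agent at best-of-5:
assert round(pass_at_n(0.40, 1), 2) == 0.40
assert round(pass_at_n(0.40, 5), 2) == 0.92
```

Which is why a headline percentage without the sampling regime attached is close to meaningless.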

The shorthand: the median mid-market pentest engagement is a week of human hours, a PDF, and a five-figure invoice. pwnkit produces the same class of output — exploit-validated findings, not “possible issues” — on demand, for the cost of a few API calls. A category of work that used to be gate-kept behind procurement is now a command.

what the engine scans

pwnkit runs across four surfaces:

  • Web apps and APIs — pwnkit scan --target https://your-app. Handles auth, OpenAPI specs, and stateful flows. The same engine that runs against xbow runs against production targets.
  • AI and LLM apps — prompt injection, jailbreak probes, system-prompt extraction, PII leakage, MCP-based SSRF.
  • npm packages — pwnkit audit --package <name>. The workflow behind the advisories published in popular npm packages earlier this year is the same engine, packaged.
  • Source code — point it at a repo. It reads the code the way a human researcher would, without fatigue on file 400.

These are not demo targets. The engine surfaced CVEs in node-forge (32M weekly downloads), mysql2, jsPDF, and liquidjs — all from the OSS code path. The repo is at github.com/PwnKit-Labs/pwnkit, the docs at docs.pwnkit.com.

how it works

Most “AI security” tools are a wrapper around existing scanners with a chatbot stapled on for the report. pwnkit is not that. It is an agent loop that drives a shell directly.

The first design choice: one primary tool, bash. The model already knows curl, sqlmap, nmap, jq, and the rest of a kali rootfs from its training data — no schema to learn, no parameter translation, no tool-selection overhead. The agent reasons about a target, runs a command, reads the output, and iterates. The architectural reasoning and ablation data are in the XBOW shell-first post.
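The loop itself is small. A stripped-down sketch of the shape — not pwnkit's actual implementation; `model_step` is a hardcoded stand-in for the LLM call, and the real engine adds context management, cost caps, and safety rails:

```python
import subprocess

def model_step(transcript: list[str]) -> str:
    """Stand-in for the LLM: given the transcript so far, return the
    next shell command or 'DONE'. Hardcoded here for illustration."""
    return "echo recon-complete" if not transcript else "DONE"

def agent_loop(max_turns: int = 20) -> list[str]:
    transcript: list[str] = []
    for _ in range(max_turns):
        command = model_step(transcript)
        if command == "DONE":
            break
        # One tool, bash: run the command, capture stdout+stderr,
        # and feed the result back into the model's context.
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        transcript.append(f"$ {command}\n{result.stdout}{result.stderr}")
    return transcript
```

The design payoff is that the model's training-time familiarity with the unix toolchain replaces a hand-built tool schema: there is nothing to translate between the model's plan and the command it runs.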

The second design choice: every finding is re-exploited in a separate blind verification pass before it is reported. If the verifier cannot reproduce the bug from scratch, the finding is killed. No theoretical risks. No “possible SQLi.” Working exploit or the finding does not appear in the report.
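The gate reduces to a filter. A sketch of the shape — the `Finding` fields and the `reproduce` stand-in are illustrative assumptions; in the real pass, a fresh verifier agent re-attempts the exploit blind rather than checking a string:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    target: str
    claim: str          # e.g. "SQLi in /search?q="
    exploit_steps: str  # reproduction recipe from the discovery agent

def reproduce(finding: Finding) -> bool:
    """Stand-in for the blind verifier: a separate agent, given only
    the claim and target, re-exploits from scratch. Placeholder
    success criterion here for illustration."""
    return "FLAG" in finding.exploit_steps

def gate(findings: list[Finding]) -> list[Finding]:
    # Only findings the verifier can independently re-exploit survive;
    # everything else is killed before the report is written.
    return [f for f in findings if reproduce(f)]
```

Separating discovery from verification is what keeps the discovery agent's optimism out of the report: the verifier has no stake in the finding being real.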

The result: an agent that catches bugs scanners miss, because it actually attempts them.

what’s next

The methodology is open and the roadmap is in the public repo. The near-term work covers the remaining gap on the harder xbow categories (stateful multi-step chains, environment-dependent exploits), expansion of the kali tool surface the agent can drive, and tighter cost controls so long runs stay predictable. None of that requires a new model. It is engineering, and it is happening in the open.

The longer-term direction: if an open-source agent can already match commercial pentest output on a public benchmark, the gap closes further every month. The question is how much of traditional vendor security ends up as a command-line tool.

install

npm install -g pwnkit-cli
pwnkit scan --target https://your-app

Or clone the repo, read the source, file an issue, open a PR. Docs at docs.pwnkit.com. Benchmark numbers at docs.pwnkit.com/benchmark. Source at github.com/PwnKit-Labs/pwnkit.