What Happens When You Let AI Write Your Security Scanner
A code review of an open-source AI-generated penetration testing framework that got press coverage, 280+ GitHub stars, and has never been reviewed by a human.
In February 2026, Help Net Security published an article about Zen-AI-Pentest, an open-source AI-powered penetration testing framework. The project promised multi-agent architecture, 20+ integrated security tools, and automated vulnerability scanning. Within weeks, it had accumulated nearly 300 GitHub stars.
We were evaluating the competitive landscape for our own security scanner, so we cloned the repository to see what a modern AI-generated security tool actually looks like on the inside. Everything below is independently verifiable — the repo is public and MIT-licensed.
What we found was not a security tool. It was an archaeological record of what happens when coding agents run unsupervised for weeks.
The numbers that looked impressive
From the outside, the project checked every box:
- 2,366 Python files across a structured directory tree
- Multi-agent architecture with specialized agents for reconnaissance, exploitation, reporting, and more
- 20+ security tools integrated — Nmap, SQLMap, Metasploit, Burp Suite, Nuclei, BloodHound
- CI/CD integration — GitHub Actions, GitLab CI, SARIF output
- Container sandboxing for safe exploit validation
- REST API + Web UI + CLI interfaces
On paper, this looks like a serious project. The press article certainly treated it as one. And with nearly 300 stars, the community seemed to agree. Then we looked at the code.
Finding #1: Dead code in the main entry file
The project's primary Python file — the one that runs the entire application — contains this pattern:
```python
# Try to import new components
try:
    AUTONOMOUS_AVAILABLE = True
except ImportError:
    AUTONOMOUS_AVAILABLE = False

try:
    RISK_ENGINE_AVAILABLE = True
except ImportError:
    RISK_ENGINE_AVAILABLE = False
```

Read that carefully. Each try block contains only VARIABLE = True. There is no import statement inside the block. A try block that only assigns a boolean literal cannot raise ImportError. The except branch is dead code.
Compare this with a correctly written version from the same file, just 15 lines below:
```python
try:
    from benchmarks import BenchmarkRunner
    BENCHMARK_AVAILABLE = True
except ImportError:
    BENCHMARK_AVAILABLE = False
```

The inconsistency between these two blocks, written in the same file and separated by a few lines, is the signature of code generated across multiple LLM sessions without human review. One session wrote the abstract pattern without filling in the actual import. Another session wrote it correctly. Nobody compared them.
Any human reviewer would catch this in five seconds. Not because it crashes the application, but because it reveals that nobody has read the code.
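The two patterns side by side make the bug obvious. This is a minimal, runnable reproduction of the shapes quoted above (the flag names come from the repository; the print comments describe what actually happens at runtime):

```python
# Broken pattern: the try body only assigns a boolean literal.
# Assignment cannot raise ImportError, so the except branch is dead code
# and the flag is unconditionally True.
try:
    AUTONOMOUS_AVAILABLE = True
except ImportError:
    AUTONOMOUS_AVAILABLE = False

# Correct pattern: the import itself sits inside the try block,
# so a genuinely missing optional dependency flips the flag to False.
try:
    from benchmarks import BenchmarkRunner  # optional dependency
    BENCHMARK_AVAILABLE = True
except ImportError:
    BENCHMARK_AVAILABLE = False

print(AUTONOMOUS_AVAILABLE)  # always True, in every environment
print(BENCHMARK_AVAILABLE)   # True only if a 'benchmarks' module is importable
```

Run it anywhere without a benchmarks module installed: the first flag still reads True, which is exactly why the feature-availability logic it guards can never take its fallback path.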
Finding #2: The version archaeology problem
Inside the agents directory, we found:
```
agents/
├── agent_base.py
├── agent_coordinator_v3.py   ← v3 — where are v1 and v2?
├── agent_orchestrator.py     ← a second orchestrator?
├── kimi_master_agent.py      ← a third coordination layer?
├── react_agent.py
├── react_agent_enhanced.py   ← "enhanced" variant
├── react_agent_vm.py         ← "vm" variant
├── v2/                       ← entire subdirectory for v2
└── recovery_engine.py
```

There's no single, authoritative agent implementation: at least four competing coordination layers, three variants of the react agent, and a v2/ subdirectory sitting alongside files that appear to be v3.
This is what happens when multiple AI coding sessions each build their own version without deleting the previous one. Each session starts fresh, generates what it thinks is needed, and leaves the artifact behind. Nobody consolidates. The codebase grows horizontally instead of vertically.
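This kind of drift is cheap to detect mechanically. A sketch of such a check, assuming a common suffix convention (the suffix list is our guess at typical names, not taken from the repo, and it would miss variants like react_agent_vm.py):

```python
import re
from pathlib import Path

# Smell test for "horizontal growth": files whose names carry version or
# variant suffixes instead of being consolidated into one implementation.
VARIANT = re.compile(r"_(v\d+|enhanced|final|old|new|backup|copy)$", re.IGNORECASE)

def find_variant_files(root: str) -> list[str]:
    """Return .py files under root whose stem ends in a version/variant suffix."""
    return sorted(str(p) for p in Path(root).rglob("*.py") if VARIANT.search(p.stem))
```

Pointed at the agents/ directory above, a check like this would flag agent_coordinator_v3.py and react_agent_enhanced.py immediately; a CI gate that fails on any hit forces consolidation before merge.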
Finding #3: 484 markdown files
The repository contains 484 markdown files. For comparison: Nuclei (ProjectDiscovery's widely-used vulnerability scanner) has ~20. Metasploit Framework has ~30. Among those 484:
- 17 files about code coverage, including titles like "COVERAGE_ACHIEVER_REPORT" and "MASSIVE_COVERAGE_PUSH_REPORT"
- 13 files about testing: "TEST_ACHIEVEMENT_SUMMARY", "MILESTONE_1000_TESTS"
- 7 different roadmaps
- 6 different READMEs, including "README_FINAL" (which, predictably, is not the final README)
The project also contains a Python script whose explicit purpose is to execute code lines to increase coverage percentage numbers, and another script that generates test files programmatically. Of 2,366 Python files, 1,424 are test files — 60% of the codebase. This is not a project optimized for finding vulnerabilities. This is a project optimized for making its coverage dashboard look good.
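Numbers like these are easy to reproduce on any checkout. A minimal profiling sketch, assuming the usual pytest-style test naming convention (test_*.py / *_test.py), which is our assumption, not a rule the repo is known to follow:

```python
from pathlib import Path

def repo_profile(root: str) -> dict:
    """Rough composition profile of a checkout: docs vs. code vs. tests."""
    base = Path(root)
    py = list(base.rglob("*.py"))
    # Convention-based heuristic: pytest discovers test_*.py and *_test.py.
    tests = [p for p in py if p.name.startswith("test_") or p.stem.endswith("_test")]
    return {
        "markdown": sum(1 for _ in base.rglob("*.md")),
        "python": len(py),
        "tests": len(tests),
    }
```

Counts of this kind are where figures like 2,366 Python files, 1,424 test files, and 484 markdown files come from, and running them takes seconds, which is why a skipped review is a choice rather than a cost.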
Finding #4: The operations kitchen sink
At the repository root:
- 13 docker-compose files, from .ci.yml to .zap.yml
- 15+ GitHub PR automation scripts, not part of the scanner, but scripts the AI wrote to manage its own pull requests
- 3 backup copies of the test configuration — using the filesystem as version control
- A file named with an error message as its filename — containing a colon character, which is reserved on Windows
That last one means the project cannot be cloned on Windows. The git checkout step fails on any Windows machine. For an open-source project with 280+ stars, this strongly suggests the stars are curiosity clicks from a press article, not actual users who downloaded and ran the software.
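Windows reserves a small set of characters in file names (including the colon), so one offending path in the tree breaks checkout for every Windows user. A pre-commit-style check for this is a few lines; the function name and scope here are illustrative (it ignores the reserved device names like CON and NUL for brevity):

```python
import re
from pathlib import Path

# Characters Windows forbids in file names; any path containing one
# cannot be materialized by `git checkout` on a Windows filesystem.
WINDOWS_FORBIDDEN = re.compile(r'[<>:"\\|?*]')

def windows_unsafe_paths(root: str) -> list[str]:
    """Return paths under root whose file name Windows cannot represent."""
    base = Path(root)
    return sorted(
        str(p.relative_to(base))
        for p in base.rglob("*")
        if WINDOWS_FORBIDDEN.search(p.name)
    )
```

An empty return list from a check like this is table stakes for any project that expects cross-platform users, which is part of why the stars-to-users ratio here looks so lopsided.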
Finding #5: The commit pattern
Looking at the last 20 commits:
- 10 of 20 were automated CI status updates — bot noise
- 5 of 20 were Dependabot dependency bumps — automated
- 3 of 20 were consecutive edits to coverage thresholds — adjusting the target, not improving tests
- 1 was a real security fix
- 1 was merging a Dependabot PR
Zero of the recent commits added a new security scanning capability, fixed a bug in an existing scanner, or improved the agent's decision-making. The project appears to be maintained by automated systems rather than by an engineer actively developing it.
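The triage above can be automated against `git log --format=%s`. A rough classifier sketch; the patterns are assumptions about typical bot phrasing (Dependabot subjects start with "Bump", conventional-commit CI noise starts with "ci:"), not the project's exact messages:

```python
import re

# Ordered rules: first match wins. Anything unmatched is presumed human work.
RULES = [
    ("bot/ci",     re.compile(r"^(ci:|chore\(ci\)|\[bot\]|update ci)", re.I)),
    ("dependabot", re.compile(r"^(bump |build\(deps\))", re.I)),
    ("coverage",   re.compile(r"coverage", re.I)),
]

def classify(subject: str) -> str:
    """Label a commit subject line as bot noise, dep bump, coverage churn, or human."""
    for label, pattern in RULES:
        if pattern.search(subject):
            return label
    return "human"
```

Feeding the last 20 subjects through this and tallying with collections.Counter reproduces the breakdown above in one command; a healthy project shows "human" dominating, not the bots.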
What this tells us about AI-generated code in security
We are not sharing this to mock anyone's work. The person behind this project is clearly enthusiastic about security and AI, and building in public takes courage. We respect that.
What we are saying is that the gap between "looks like a security tool" and "works like a security tool" has never been wider. AI coding agents can now produce a project structure, CI configuration, test infrastructure, documentation, and multi-agent architecture that looks professional from the outside. Press coverage follows naturally — the README is compelling, the feature list is impressive, the architecture diagram is clean.
But the actual code tells a different story. Dead code in the entry file. Competing orchestrator implementations. Coverage theater. Files that prevent cloning on the most popular desktop OS. A commit history dominated by bots.
This pattern is becoming common across the AI-generated open source landscape. And in security tooling specifically, it is dangerous — because the gap between "appears to scan for vulnerabilities" and "actually finds vulnerabilities" is the gap that attackers exploit.
What we do differently
At IsMySiteHacked.com, we use AI extensively — for report generation, finding prioritization, and plain-English explanations of technical findings. But every scanner module, every correlation rule, and every scoring algorithm has been written, reviewed, and tested by humans who understand what each check is actually measuring.
We run 33 security checks per scan. Not because 33 is the largest number we could claim, but because each of those 33 checks produces a finding that a human has validated against real-world attack patterns. We would rather run 33 checks that work than 1,000 that look good on a feature list.
See what a human-reviewed security scan looks like
33 checks. Real attack paths. Plain-English findings. About two minutes. No signup required.
Scan your site for free