TL;DR

Researchers ran the first large live comparison of AI-driven attack agents and human penetration testers on a university network of roughly 8,000 hosts. A new multi-agent scaffold called ARTEMIS performed near the top of the field, while several existing agent frameworks lagged behind most human participants.

What happened

A team of researchers evaluated six existing AI attack agents, a newly developed multi-agent scaffold named ARTEMIS, and ten professional penetration testers inside a live enterprise environment at a large university. The environment spanned about 8,000 hosts distributed across 12 subnets. ARTEMIS — built to support dynamic prompt generation, arbitrary sub-agents, and automated vulnerability triaging — finished second overall, submitting nine valid findings with an 82% valid submission rate. The paper reports that ARTEMIS outperformed nine of the ten human participants and produced submission quality comparable to the strongest testers. By contrast, other agent scaffolds referenced in the study, including Codex and CyAgent, performed worse than most human testers. The authors also highlight operational trade-offs: AI agents were strong at systematic enumeration and parallel exploitation and could run at lower reported costs in some configurations, but they produced more false positives and struggled with GUI-based tasks.
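
The description of ARTEMIS above implies a familiar multi-agent pattern: an orchestrator that generates prompts per target, hands work to specialized sub-agents, and triages whatever they report. The sketch below is only an illustration of that general pattern, not ARTEMIS itself; every name in it (Finding, make_prompt, run_sub_agent, triage, orchestrate) is hypothetical and not taken from the paper.

```python
# Illustrative sketch of a multi-agent pentest scaffold (hypothetical, not ARTEMIS):
# an orchestrator builds per-target prompts, runs specialized sub-agents, and
# triages their candidate findings before anything is submitted for review.
from dataclasses import dataclass


@dataclass
class Finding:
    host: str
    title: str
    severity: int          # 1 (informational) .. 10 (critical)
    validated: bool = False


def make_prompt(host: str, role: str) -> str:
    # "Dynamic prompt generation": the task prompt is built per target and role.
    return f"You are a {role} sub-agent. Enumerate and assess host {host}."


def run_sub_agent(prompt: str, host: str) -> list[Finding]:
    # Placeholder for an LLM-driven sub-agent; a real agent would call a model
    # and security tooling here and parse structured findings from the output.
    return [Finding(host=host, title="Example exposed service", severity=5)]


def triage(findings: list[Finding]) -> list[Finding]:
    # "Vulnerability triaging": deduplicate and sort by severity so the riskiest
    # candidate findings are reviewed (and validated) first.
    unique = {(f.host, f.title): f for f in findings}.values()
    return sorted(unique, key=lambda f: f.severity, reverse=True)


def orchestrate(hosts: list[str]) -> list[Finding]:
    # The orchestrator fans work out to sub-agents and collects their results.
    findings: list[Finding] = []
    for host in hosts:
        for role in ("enumeration", "exploitation"):
            findings.extend(run_sub_agent(make_prompt(host, role), host))
    return triage(findings)


if __name__ == "__main__":
    for finding in orchestrate(["10.0.0.5", "10.0.0.17"]):
        print(finding)
```

In a real scaffold the sub-agents would run in parallel and the triage step would attempt automatic validation; the study reports both sides of that coin, listing parallel exploitation as a strength and false positives as a weakness.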

Why it matters

  • AI tools can match or approach top human performance on some penetration-testing tasks, altering the balance between automation and human expertise.
  • Lower operating costs for certain AI agent variants could change how organizations budget for red-team exercises and vulnerability discovery.
  • Higher false-positive rates and weaknesses on GUI-driven tasks indicate that AI agents are not yet a full replacement for human testers and require oversight.
  • The study demonstrates that multi-agent attack frameworks can operate in realistic network environments, which is useful information for defenders and tool vendors.

Key facts

  • Study compared 10 professional penetration testers, 6 existing AI agents, and ARTEMIS (a new agent scaffold).
  • Test environment: a large university network with approximately 8,000 hosts across 12 subnets.
  • ARTEMIS is described as a multi-agent framework with dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging.
  • ARTEMIS placed second overall in the comparative evaluation.
  • ARTEMIS discovered 9 valid vulnerabilities and achieved an 82% valid submission rate.
  • ARTEMIS outperformed 9 of the 10 human participants in the study.
  • Existing scaffolds such as Codex and CyAgent underperformed relative to most human participants.
  • Reported operational strengths for AI agents included systematic enumeration and parallel exploitation.
  • Reported limitations included higher false-positive rates and difficulty with GUI-based tasks.
  • Cost comparison in the paper notes that certain ARTEMIS variants cost about $18/hour versus $60/hour for professional penetration testers (a worked example follows this list).
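
As a quick worked example of what those hourly rates imply over a single engagement, the snippet below assumes a 40-hour engagement purely for illustration; that duration is not from the paper.

```python
# Worked comparison of the reported hourly rates over one engagement.
AGENT_RATE_USD = 18   # reported cost of certain ARTEMIS variants, per hour
HUMAN_RATE_USD = 60   # reported cost of a professional penetration tester, per hour
HOURS = 40            # assumed engagement length (illustrative, not from the paper)

print(f"ARTEMIS variant: ${AGENT_RATE_USD * HOURS}")   # ARTEMIS variant: $720
print(f"Human tester:    ${HUMAN_RATE_USD * HOURS}")   # Human tester:    $2400
```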

What to watch next

  • Whether ARTEMIS-style agents generalize to other enterprise environments and network topologies — not confirmed in the source.
  • How future agent designs address false positives and GUI interaction shortcomings — not confirmed in the source.
  • Potential changes in professional practice and procurement for red-team services if automated agents are integrated — not confirmed in the source.

Quick glossary

  • AI agent: A software system that performs tasks autonomously or semi-autonomously using artificial intelligence techniques.
  • Penetration testing (pen testing): A security assessment that simulates attacks on a system to find vulnerabilities before attackers do.
  • Multi-agent framework: A system architecture that coordinates multiple autonomous sub-agents to accomplish complex tasks.
  • Vulnerability triaging: The process of validating, prioritizing, and categorizing discovered security flaws.
  • GUI-based task: An operation that requires interacting with a graphical user interface rather than a command-line interface or an API.

Reader FAQ

Did ARTEMIS outperform human testers in the study?
ARTEMIS placed second overall and outperformed nine of the ten human participants in this evaluation.

How many valid vulnerabilities did ARTEMIS find?
ARTEMIS discovered nine valid vulnerabilities and had an 82% valid submission rate.

How did other AI agents perform compared with humans?
The paper reports that existing scaffolds such as Codex and CyAgent underperformed relative to most human participants.

Was the evaluation run in a real network?
Yes — the testbed was a large university network of about 8,000 hosts across 12 subnets.

Should organizations replace human testers with AI agents now?
Not confirmed in the source. The study itself reports higher false-positive rates and weaknesses on GUI-based tasks for AI agents, which points to a continued need for human oversight.

Sources

  • Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing. Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan… arXiv, Computer Science > Artificial Intelligence, submitted 10 Dec 2025.
