The benchmark extends the Carnegie Mellon SusVibes framework to continuously evaluate leading AI coding agents, with updates as new agents and models are released
PALO ALTO, Calif., April 15, 2026 /PRNewswire/ — Endor Labs today announced the launch of its agentic code security benchmark, extending the existing SusVibes framework from leading academic researchers to evaluate how securely AI coding agents generate code in real-world scenarios. Alongside the benchmark, Endor Labs is introducing the Agent Security League, a public leaderboard tracking the performance of leading AI coding agents on both functional correctness and security outcomes.
Built on real-world code and peer-reviewed research developed at Carnegie Mellon University, the benchmark extends SusVibes, a framework that evaluates 200 real-world tasks drawn from 108 open-source projects and covers 77 Common Weakness Enumeration (CWE) vulnerability classes. Endor Labs introduced new test harnesses for agents like Cursor, evaluated new AI models, and added anti-cheating safeguards, including prompt hardening and automated detection systems — an important extension of SusVibes that addresses cheating behavior observed in newer agents.
As AI coding agents become increasingly embedded in modern development workflows, organizations face a growing but under-measured risk: code that works but isn’t secure. Endor Labs’ benchmark reveals just how significant that gap is. The highest-performing agent passed functional tests for 84.4% of AI-generated code, but the top agent on security passed only 17.3% of security tests, leaving over 80% of outputs vulnerable.
“AI coding agents are dramatically increasing the speed and scale at which software gets written, but security isn’t keeping pace,” said Varun Badhwar, CEO at Endor Labs. “The challenge isn’t just whether the code works, it’s whether it’s actually safe in the context of a real system. This work builds on rigorous university research grounded in real-world open source code. Today, Endor Labs is extending that foundation and making the continuous evaluation of new models public, pushing the industry toward greater accountability and giving teams a clearer view of how these systems actually behave.”
The Agent Security League evaluates coding agents across two key dimensions: whether the code works, and whether it does so without introducing vulnerabilities. The results reveal a stark gap between the two. While many agents perform well on functionality, security consistently falls short. Even the top-performing agent achieved just 17.3% security correctness, and 87% of code generated by AI coding agents contains at least one security vulnerability, underscoring how systemic and unresolved this challenge remains.
Other notable findings include:
- OpenAI Codex with GPT 5.4 scored the highest on security correctness (17.3%), and Cursor with Claude Opus 4.6 scored the highest for functional correctness (84.4%).
- Even the best-performing functional combo, Cursor with Claude Opus 4.6, produced secure code only 7.8% of the time. The gap between that and the lowest-scoring security combo — SWE-Agent with Gemini 2.5 Pro at 4.5% — was just 3.3 percentage points.
- Newer agent/model combinations exhibited “cheating” behavior, in which agents ignored explicit instructions not to inspect git history. In one case, cheating was observed in 81.5% of all benchmark tasks (163 of 200).
“This benchmark moves beyond synthetic tests to show how AI coding agents actually behave in real development environments,” said Luca Compagna, Senior Security Researcher at Endor Labs. “Unlike human developers, agents lack the contextual awareness and security best practices that uphold our industry. The result is code that passes functional tests while still introducing exploitable vulnerabilities — a gap the industry cannot afford to ignore.”
This benchmark is the first to evaluate real-world applications at scale, covering a broader range of vulnerability classes, requiring larger code edits, and analyzing how agents perform across different models.
The Agent Security League leaderboard provides an ongoing, transparent view into how leading AI coding agents perform over time. Updated as new agents and models are released, the leaderboard offers indicators that help developers choose safer coding tools, security teams benchmark risk exposure, and model providers improve security performance.
The launch of the AI Coding Agentic Security Benchmark is part of Endor Labs’ broader mission to secure the software supply chain in the age of AI. To help address these risks, Endor Labs also provides AURI, a security harness designed to bring real-time security context into AI-assisted development workflows.
To see where your agent may fall on the leaderboard, visit: endorlabs.com/research/ai-code-security-benchmark.
About Endor Labs
Endor Labs is the AI-native application security platform for teams that refuse to compromise between speed and security. The platform helps teams identify, prioritize, and fix vulnerabilities across source code, open-source dependencies, and container images. With deep program analysis, automated remediation, and unmatched coverage, Endor Labs empowers modern engineering and security teams to move fast without sacrificing security.
Media Contact
Rebecca Reese
endorlabs@meetkickstand.com
SOURCE Endor Labs