Benchmarks

The Autofix Bot team on September 29, 2025

Autofix Bot is an AI agent purpose-built for securing code. We are excited to present benchmarks for our v1 release, showcasing how the agent delivers frontier-level detection and remediation performance for first-party source code vulnerabilities and hardcoded secrets, while featuring category-leading response time and cost efficiency.

Modern software is increasingly written with AI assistance. This has led to two emerging patterns in professional software development: organizations are producing much more code per developer1, and as a direct consequence, most codebases no longer exist as a complete mental model in any one team member's head. How do we ensure we're shipping secure code to users?

In a 2025 industry bakeoff of 100+ LLMs, only ~55% of generated code passed basic security checks.2 Operational telemetry in another study on real-world repositories over six months showed a 10× rise in new security vulnerabilities in AI-generated changes.3 Asking an LLM to review its own code for security vulnerabilities has not proved helpful either, as LLM-only review exhibited very low recall on critical vulnerability classes across multiple labeled datasets — often zero true positives.4

Autofix Bot is purpose-built for securing code. It works in tandem with agentic code generation tools to find and fix security vulnerabilities and hardcoded secrets, with higher accuracy than static-only and LLM-only review, and at a much lower cost than LLM-only review. In this document, we present the architecture, analysis pipeline, performance benchmarks, and future roadmap for Autofix Bot v1.


[Chart: Security Review Accuracy vs. Price. Accuracy (%) plotted against the cost of running the OWASP Java Benchmark (USD) for Autofix Bot, OpenAI Codex, Claude Code, and Gemini CLI.]

Hybrid Agent Architecture

We designed Autofix Bot with three key goals:

  1. Maximize recall on security vulnerabilities during the AI review of a file,
  2. Keep false positives low enough to generate remediation patches automatically, and
  3. Produce deterministic, reproducible outputs suitable for CI.

The agent architecture is hybrid by construction. A semantic/static program analysis provides stable signal; an agentic layer uses that signal (plus code/query tools) to conduct a focused security review, generate remediation patches, and explain those patches.

The pipeline contains the following steps (a minimal orchestration sketch follows the list):

  1. Codebase indexing: Build an AST and whole-project graph (data-flow, control-flow, import graph, sources/sinks) that act as stores. The agent can query these stores during analysis and remediation.
  2. Static pass (SAST): Run DeepSource’s SAST to deterministically establish a low-false-positive baseline of known security vulnerabilities. A sub-agent suppresses context-specific false positives.
  3. AI review: With static findings, source code tools (ripgrep, graph lookups, etc.), and a taint analysis sub-agent that tracks the flow of potentially insecure data to inform the main agent's security decisions, the agent performs security review over the relevant slice. The agent also has access to all stores created in step 1, giving it context on the entire codebase and all open-source dependencies.
  4. Remediation: Two specialized sub-agents generate fixes for individual issues detected across steps 2 and 3, and explanations where automated fixes are unsafe.
  5. Sanitization: A language-specific static harness validates all edits generated in step 4, with an additional AI pass to ensure alignment with the intended fix.
  6. Output: Emit a clean git patch, ready to be applied at the HEAD of the branch that was analyzed.
  7. Caching: Multi-layered caching for source code, AST, and the project's stores to improve repeat analysis performance.
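
To make the flow concrete, here is a minimal orchestration sketch of the pipeline above. All names (Finding, AnalysisContext, the step functions) are illustrative placeholders under the assumption of a Python harness; they are not Autofix Bot's actual API, and the step bodies are stubs.

```python
# A minimal orchestration sketch of the pipeline described above.
# Names are illustrative placeholders, not Autofix Bot's actual API.
from dataclasses import dataclass, field


@dataclass
class Finding:
    rule_id: str
    file: str
    line: int
    source: str  # "sast" or "ai-review"


@dataclass
class AnalysisContext:
    # Stores built during codebase indexing (step 1): AST, data-flow,
    # control-flow, import graph, sources/sinks. Modeled here as opaque dicts.
    stores: dict = field(default_factory=dict)
    findings: list = field(default_factory=list)
    patch: str = ""


def index_codebase(repo_path: str) -> AnalysisContext:
    """Step 1: build the AST and whole-project graphs the agent can query."""
    return AnalysisContext(stores={"ast": {}, "dataflow": {}, "imports": {}})


def run_static_pass(ctx: AnalysisContext) -> None:
    """Step 2: deterministic SAST baseline; a sub-agent suppresses contextual FPs."""
    ctx.findings.extend([])  # SAST findings would be appended here


def run_ai_review(ctx: AnalysisContext) -> None:
    """Step 3: agentic review seeded with static findings, taint tracking,
    and lookups over the stores from step 1."""
    ctx.findings.extend([])  # AI-review findings would be appended here


def remediate_and_sanitize(ctx: AnalysisContext) -> None:
    """Steps 4-5: generate fixes/explanations, then validate every edit."""
    ctx.patch = ""  # a git patch against HEAD of the analyzed branch


def run_pipeline(repo_path: str) -> str:
    ctx = index_codebase(repo_path)
    run_static_pass(ctx)
    run_ai_review(ctx)
    remediate_and_sanitize(ctx)
    return ctx.patch  # Step 6: clean git patch; caching (step 7) is omitted here
```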

Empirical Motivation

LLM-only review has well-documented shortcomings that our hybrid architecture directly addresses:

  • Low recall on critical CWEs, especially in the presence of other kinds of violations or anti-patterns in the file. A static pass helps steer the LLM review's focus back to security.
  • Misses when interprocedural reasoning is required. A grep-only approach lacks deep semantic analysis (it can't track data/control flow across functions), a task static analyzers are built for.
  • Non-determinism. Re-reviews of the same code produce wildly different results and often miss vulnerabilities flagged in an earlier run. As with the first point, a static pass stabilizes the output across repeated runs.
  • Cost. LLM-only review is expensive, especially for large codebases. Static narrowing trims the search space the agent must read/reason about, reducing prompt/context size and tool invocations.
  • Time. Deterministic static seeds let us shard safely and parallelize analysis without re-reviews, while LLM-only reviewers invariably require additional passes or more tool calls to search files. (A sharding sketch follows this list.)
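
As a concrete illustration of the sharding point, below is a minimal sketch of deterministic, hash-based file sharding. The shard count of 15 mirrors the parallelism used in the benchmarks that follow; the helper itself is a hypothetical stand-in, not Autofix Bot's scheduler.

```python
# Deterministic, hash-based sharding: the same file list always maps to the
# same shards, so parallel workers can be re-run without re-reviews.
import hashlib
from collections import defaultdict


def shard_files(paths, num_shards=15):
    """Assign each file to a shard via a stable hash of its path."""
    shards = defaultdict(list)
    for path in sorted(paths):  # sorting keeps per-shard order reproducible
        digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
        shards[int(digest, 16) % num_shards].append(path)
    return dict(shards)


# Example: repeated calls always yield identical shard assignments.
print(shard_files(["src/App.java", "src/Login.java", "src/Db.java"]))
```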

Benchmarks: Security Review

For benchmarking accuracy in finding security vulnerabilities and hotspots, we evaluate Autofix Bot on OWASP/BenchmarkJava5, consisting of 2,740 labeled files. We invoke the agent programmatically, processing 15 files in parallel. Autofix Bot attains 88.18% accuracy with a True Positive Rate (TPR) of 94.35% and a False Positive Rate (FPR) of 18.43%, completing the suite in ~2.4 hours at $58 in token costs.
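
For reference, the reported accuracy, TPR, and FPR follow directly from the confusion counts in the comparison table below; a quick sanity check using the published Autofix Bot counts:

```python
# Sanity check: metrics recomputed from the Autofix Bot confusion counts
# reported in the table below (TP=1336, FP=244, TN=1080, FN=80 over 2,740 files).
tp, fp, tn, fn = 1336, 244, 1080, 80

accuracy = (tp + tn) / (tp + fp + tn + fn)  # 2416 / 2740 ≈ 0.8818
tpr = tp / (tp + fn)                        # 1336 / 1416 ≈ 0.9435
fpr = fp / (fp + tn)                        # 244 / 1324  ≈ 0.1843

print(f"accuracy={accuracy:.2%}  TPR={tpr:.2%}  FPR={fpr:.2%}")
```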

For comparison, we run Claude Code (Sonnet 4) via its /security-review CLI, with minor prompt adjustments6 for file-list input, intentional vulnerability reporting, and JSONL output; OpenAI Codex (gpt-5-medium) using a prompt derived from Claude Code's, as Codex CLI doesn't have a security-specific review tool yet; and Gemini CLI's Code Review7. All are run on the same set of files, sharded 15 files in parallel.

                        Autofix Bot    OpenAI Codex    Claude Code    Gemini CLI
Files Processed         2,740          2,740           2,740          400
Accuracy                88.18%         88.80%          83.58%         70.25%
True Positive Rate      94.35%         95.97%          87.77%         96.20%
False Positive Rate     18.43%         18.87%          20.91%         67.68%
Total True Positives    1,336          1,358           1,242          228
Total False Positives   244            250             277            111
Total True Negatives    1,080          1,075           1,048          53
Total False Negatives   80             57              173            8
Cost                    $58            $122            $300           $226
Time                    2.4 hours      2.9 hours       5.5 hours      1.7 hours

Autofix Bot performs on par with OpenAI Codex and outperforms Claude Code and Gemini CLI for security review, while being faster per file and significantly cheaper than all three.


[Chart: Autofix Bot vs. Others: Accuracy Comparison. Accuracy (%) for Gemini CLI, Claude Code, OpenAI Codex, and Autofix Bot; values as in the table above.]


[Chart: Autofix Bot vs. Others: Cost Comparison. Total cost (USD) for Gemini CLI, Claude Code, OpenAI Codex, and Autofix Bot; values as in the table above.]


[Chart: Autofix Bot vs. Others: Speed Comparison. Average time spent per file (s): Gemini CLI 15.3, Claude Code 7.22, OpenAI Codex 3.81, Autofix Bot 3.15.]

Benchmarks: Secrets Detection

We benchmark three widely used static tools on our proprietary labeled secrets corpus: Gitleaks8, Detect-Secrets9, and TruffleHog10; and compare the results with Autofix Bot's secrets detection sub-agent on the same corpus.

                 Perfect Matches   Partial Matches   Missed Secrets   False Positives   False Negatives   Accuracy   Precision   Recall   F1 Score
Gitleaks         303               18                197              10                197               0.5849     0.9698      0.6197   0.7562
detect-secrets   270               46                202              152               202               0.5212     0.6752      0.6100   0.6409
TruffleHog       121               21                376              29                376               0.2336     0.8304      0.2741   0.4122
Autofix Bot      453               0                 65               6                 65                0.8745     0.9869      0.8745   0.9278

We observe that Gitleaks and TruffleHog prioritize precision (0.97 and 0.83) but miss many true secrets (recall 0.62 and 0.27). In contrast, detect-secrets achieves similar recall to Gitleaks (0.61) but with substantially lower precision (0.68), resulting in more false positives (152) than Gitleaks (10) or TruffleHog (29).

Static-only signal is necessary but insufficient. To address the above failure modes, Autofix Bot's secrets detection sub-agent runs:

  1. a static regex/pattern sweep to over-approximate candidates (maximizing recall), then
  2. a custom fine-tuned classifier to confirm or deny each candidate and extract the value (a sketch of both stages follows this list).
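
The sketch below illustrates those two stages. The regex patterns and the classify_candidate helper are hypothetical placeholders standing in for the production sweep and the fine-tuned classifier.

```python
# A sketch of the two-stage secrets pipeline: a deliberately broad regex sweep
# (stage 1) followed by a per-candidate classification step (stage 2).
import re

CANDIDATE_PATTERNS = [
    # Broad on purpose: stage 1 over-approximates to maximize recall.
    re.compile(r"(?i)(api[_-]?key|secret|token|passwd|password)\s*[:=]\s*\S{8,}"),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
]


def sweep(text: str) -> list:
    """Stage 1: collect every candidate that matches any broad pattern."""
    return [m.group(0) for p in CANDIDATE_PATTERNS for m in p.finditer(text)]


def classify_candidate(candidate: str) -> bool:
    """Stage 2 placeholder: confirm/deny a candidate and (in the real system)
    extract the secret value. A trivial heuristic stands in for the classifier."""
    lowered = candidate.lower()
    return "example" not in lowered and "placeholder" not in lowered


def detect_secrets(text: str) -> list:
    return [c for c in sweep(text) if classify_candidate(c)]
```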

On this corpus, Autofix Bot achieves the highest F1 (0.9278), substantially ahead of static scanners (Gitleaks 0.7562, detect-secrets 0.6409, TruffleHog 0.4122).


[Chart: Autofix Bot's Secrets Detection vs Static Tools: F1 Comparison. F1 scores for Gitleaks, detect-secrets, TruffleHog, and Autofix Bot; values as in the table above.]

We choose F1 as the headline metric here because secrets datasets are imbalanced and teams care about two things at once: not missing real leaks (recall) and not drowning reviewers in noise (precision). By maximizing F1 — rather than accuracy, which can be inflated by abundant true negatives — Autofix Bot demonstrates the best balance of safety and triage cost. Practically, this means the system catches more real secrets while keeping alerts actionable, making it suitable for always-on CI and large repo backfills.
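
As a concrete check of how the table's figures fit together, the following snippet recomputes the Gitleaks row from its raw counts, under the assumption (consistent with the published numbers) that partial matches count as detections:

```python
# Recomputing the Gitleaks row from its raw counts (303 perfect + 18 partial
# matches, 10 false positives, 197 missed secrets), treating partial matches
# as detections.
perfect, partial, missed, false_pos = 303, 18, 197, 10

tp = perfect + partial                              # 321 detected secrets
precision = tp / (tp + false_pos)                   # 321 / 331 ≈ 0.9698
recall = tp / (tp + missed)                         # 321 / 518 ≈ 0.6197
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.7562

print(f"precision={precision:.4f}  recall={recall:.4f}  F1={f1:.4f}")
```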

Benchmarks: Remediation

On OWASP/BenchmarkJava, we evaluate end-to-end fix quality using a shared LLM judge (GPT-4o) with a common prompt. Autofix Bot produced 1,293 correct fixes out of 1,354 (95.49%), with $8.33 total spend and 12.71s average time per fix. Codex (gpt-5 medium) achieved the highest fix accuracy (97.09%) but at $45 and 53.04s average latency; Claude Code was comparable in accuracy (95.48%) at ~$91 and 35.57s per fix.
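
For illustration, here is a minimal sketch of an LLM-judge call in the spirit of this setup, using the OpenAI Python client. The judge prompt and the VERDICT output format are assumptions made for the sketch; the benchmark's actual shared prompt lives in autofix-bot-bench and is not reproduced here.

```python
# A sketch of a GPT-4o judge that labels a proposed fix as correct or not.
# Prompt wording and verdict format are assumptions, not the benchmark's prompt.
from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM_PROMPT = (
    "You are a security reviewer. Given a vulnerability description, the "
    "original code, and a proposed patch, answer on a single line with "
    "either 'VERDICT: correct' or 'VERDICT: incorrect'."
)


def judge_fix(vulnerability: str, original: str, patch: str) -> bool:
    """Ask the judge model whether a proposed patch correctly fixes the issue."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep the judge as repeatable as the API allows
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Vulnerability:\n{vulnerability}\n\n"
                    f"Original code:\n{original}\n\n"
                    f"Proposed patch:\n{patch}"
                ),
            },
        ],
    )
    verdict = response.choices[0].message.content or ""
    return "VERDICT: correct" in verdict
```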

Autofix Bot delivers near-parity fix accuracy while being ≈4.2× faster than Codex (and ≈2.8× faster than Claude) and ≈5× to 10× cheaper overall.

                           Autofix Bot    OpenAI Codex    Claude Code    Gemini CLI
Files Processed            1,354          1,379           1,240          220
Correct Fixes              1,293          1,339           1,184          181
Incorrect/Partial/Syntax   61             40              56             39
Fix Accuracy               95.49%         97.09%          95.48%         82.27%
Cost                       $8.33          ~$45            ~$91           ~$21
Average Time per Fix       12.71s         53.04s          35.57s         33.2s
Min Time                   4.80s          14.87s          15.36s         11.6s
Max Time                   84.29s         317.85s         200.92s        89.96s

[Chart: Autofix Bot vs Others: Fix Accuracy. Fix accuracy (%) for Gemini CLI, Claude Code, OpenAI Codex, and Autofix Bot; values as in the table above.]


[Chart: Autofix Bot vs Others: Remediation Cost Comparison. Total cost (USD) for Gemini CLI, Claude Code, OpenAI Codex, and Autofix Bot; values as in the table above.]


[Chart: Autofix Bot vs Others: Average Time per Fix. Average time per fix (seconds) for Claude Code, Gemini CLI, OpenAI Codex, and Autofix Bot; values as in the table above.]

What's Next

Autofix Bot is currently in early access and we're testing the pre-release with trusted partners. Over the next few weeks, we will be releasing the REST API and the Terminal UI, shortly followed by the GitHub pull request integration. We'll continue to work on our agent architecture and expect more improvements in the metrics we've showcased in this document.

Please follow @autofixbot for updates, and keep an eye on the News feed.


Footnotes

  1. Randomized field experiments at Microsoft, Accenture, and an anonymous Fortune 100 electronics firm evaluated the causal effect of granting Microsoft Copilot access on weekly developer outputs. Pooled weighted-IV estimates show a 26.08% increase in completed tasks, and 13.55% increase in commits. Read more.
  2. 2025 GenAI Code Security Report, by Veracode.
  3. 4x Velocity, 10x Vulnerabilities: AI Coding Assistants Are Shipping More Risks
  4. Amro & Alalfi (2025), GitHub’s Copilot Code Review: Can AI Spot Security Flaws Before You Commit? arXiv:2509.13650. Evaluations across seven labeled datasets report fewer than 20 total comments (mostly non-security) despite hundreds of vulns, with 0 security detections on SARD XSS/SQLi and 0 comments on 878/898 Wireshark files—indicating near-zero recall under default settings.
  5. The OWASP Java Benchmark (BenchmarkJava) is a fully runnable Java web application seeded with ground-truth vulnerabilities and non-vulnerabilities, designed to evaluate the accuracy, coverage, and speed of automated AppSec tools (SAST, DAST, IAST). The latest widely used release (v1.2) contains ~2,740 labeled test cases with known CWE mappings.
  6. autofix-bot-bench, the benchmark suite used for this document.
  7. Automate app deployment and security analysis with new Gemini CLI extensions, Google Cloud Blog
  8. Gitleaks: A tool for detecting secrets like passwords, API keys, and tokens in git repos, files, and whatever else you wanna throw at it via stdin.
  9. detect-secrets: An enterprise friendly way of detecting and preventing secrets in code.
  10. TruffleHog: Find, verify, and analyze leaked credentials.