Benchmarks

The Autofix Bot team on September 29, 2025

Autofix Bot is an AI agent purpose-built for securing code. We are excited to present benchmarks for our v1 release, showcasing how the agent delivers frontier-level detection and remediation performance for first-party source code vulnerabilities and hardcoded secrets, while featuring category-leading response time and cost efficiency.

Modern software is increasingly written with AI assistance. This has led to two emerging patterns in professional software development: organizations are producing much more code per developer1, and as a direct consequence, most codebases no longer exist as a complete mental model in any one team member's head. How do we ensure we're shipping secure code to users?

In a 2025 industry bakeoff of 100+ LLMs, only ~55% of generated code passed basic security checks.2 Operational telemetry in another study on real-world repositories over six months showed a 10× rise in new security vulnerabilities in AI-generated changes.3 Asking an LLM to review its own code for security vulnerabilities has not proved helpful either, as LLM-only review exhibited very low recall on critical vulnerability classes across multiple labeled datasets — often zero true positives.4

Autofix Bot is purpose-built for securing code. It works in tandem with agentic code generation tools to find and fix security vulnerabilities and hardcoded secrets, with higher accuracy than static-only and LLM-only review, and at a much lower cost than LLM-only review. In this document, we present the architecture, analysis pipeline, performance benchmarks, and future roadmap for Autofix Bot v1.


[Chart: Security Review Accuracy vs. Price. Accuracy (%) plotted against the cost of running the OWASP Java Benchmark (USD) for Autofix Bot, OpenAI Codex, Claude Code, and Gemini CLI.]

Hybrid Agent Architecture

We designed Autofix Bot with three key goals:

  1. Maximize recall on security vulnerabilities during the AI review of a file,
  2. Keep false positives low enough to generate remediation patches automatically, and
  3. Produce deterministic, reproducible outputs suitable for CI.

The agent architecture is hybrid by construction. A semantic/static program analysis provides stable signal; an agentic layer uses that signal (plus code/query tools) to conduct a focused security review, generate remediation patches, and explain those patches.

The pipeline contains the following steps (a minimal orchestration sketch follows the list):

  1. Codebase indexing: Build an AST and whole-project graph (data-flow, control-flow, import graph, sources/sinks) that act as stores. The agent can query these stores during analysis and remediation.
  2. Static pass (SAST): Run DeepSource’s SAST to deterministically establish a low-false-positive baseline of known security vulnerabilities. A sub-agent suppresses context-specific false positives.
  3. AI review: With static findings, source code tools (ripgrep, graph lookups, etc.), and a taint analysis sub-agent that tracks the flow of potentially insecure data to inform the main agent's security decisions, the agent performs security review over the relevant slice. The agent also has access to all stores created in step 1, giving it context on the entire codebase and all open-source dependencies.
  4. Remediation: Two specialized sub-agents generate fixes for individual issues detected across steps 2 and 3, and explanations where automated fixes are unsafe.
  5. Sanitization: A language-specific static harness validates all edits generated in step 4, with an additional AI pass to ensure alignment with the intended fix.
  6. Output: Emit a clean git patch, ready to be applied at the HEAD of the branch that was analyzed.
  7. Caching: Multi-layered caching for source code, AST, and the project's stores to improve repeat analysis performance.
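
To make the flow concrete, here is a minimal orchestration sketch of the pipeline above. All names (Finding, AnalysisContext, the step functions) are illustrative placeholders under the assumption of a Python harness; they are not Autofix Bot's actual API, and the step bodies are stubs.

```python
# A minimal orchestration sketch of the pipeline described above.
# Names are illustrative placeholders, not Autofix Bot's actual API.
from dataclasses import dataclass, field


@dataclass
class Finding:
    rule_id: str
    file: str
    line: int
    source: str  # "sast" or "ai-review"


@dataclass
class AnalysisContext:
    # Stores built during codebase indexing (step 1): AST, data-flow,
    # control-flow, import graph, sources/sinks. Modeled here as opaque dicts.
    stores: dict = field(default_factory=dict)
    findings: list = field(default_factory=list)
    patch: str = ""


def index_codebase(repo_path: str) -> AnalysisContext:
    """Step 1: build the AST and whole-project graphs the agent can query."""
    return AnalysisContext(stores={"ast": {}, "dataflow": {}, "imports": {}})


def run_static_pass(ctx: AnalysisContext) -> None:
    """Step 2: deterministic SAST baseline; a sub-agent suppresses contextual FPs."""
    ctx.findings.extend([])  # SAST findings would be appended here


def run_ai_review(ctx: AnalysisContext) -> None:
    """Step 3: agentic review seeded with static findings, taint tracking,
    and lookups over the stores from step 1."""
    ctx.findings.extend([])  # AI-review findings would be appended here


def remediate_and_sanitize(ctx: AnalysisContext) -> None:
    """Steps 4-5: generate fixes/explanations, then validate every edit."""
    ctx.patch = ""  # a git patch against HEAD of the analyzed branch


def run_pipeline(repo_path: str) -> str:
    ctx = index_codebase(repo_path)
    run_static_pass(ctx)
    run_ai_review(ctx)
    remediate_and_sanitize(ctx)
    return ctx.patch  # Step 6: clean git patch; caching (step 7) is omitted here
```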

Empirical Motivation

LLM-only review has well-documented shortcomings that our hybrid architecture directly addresses:

  • Low recall on critical CWEs, especially in the presence of other kinds of violations or anti-patterns in the file. A static pass helps steer the LLM review's focus back to security.
  • Misses when interprocedural reasoning is required. A grep-only approach lacks deep semantic analysis (it can't track data/control flow across functions), a task static analyzers are built for.
  • Non-determinism. Re-reviews of the same code produce wildly different results and often miss vulnerabilities flagged in an earlier run. As with the first point, a static pass stabilizes the output across repeated runs.
  • Cost. LLM-only review is expensive, especially for large codebases. Static narrowing trims the search space the agent must read/reason about, reducing prompt/context size and tool invocations.
  • Time. Deterministic static seeds let us shard safely and parallelize analysis without re-reviews, while LLM-only reviewers invariably require additional passes or more tool calls to search files. (A sharding sketch follows this list.)
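
As a concrete illustration of the sharding point, below is a minimal sketch of deterministic, hash-based file sharding. The shard count of 15 mirrors the parallelism used in the benchmarks that follow; the helper itself is a hypothetical stand-in, not Autofix Bot's scheduler.

```python
# Deterministic, hash-based sharding: the same file list always maps to the
# same shards, so parallel workers can be re-run without re-reviews.
import hashlib
from collections import defaultdict


def shard_files(paths, num_shards=15):
    """Assign each file to a shard via a stable hash of its path."""
    shards = defaultdict(list)
    for path in sorted(paths):  # sorting keeps per-shard order reproducible
        digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
        shards[int(digest, 16) % num_shards].append(path)
    return dict(shards)


# Example: repeated calls always yield identical shard assignments.
print(shard_files(["src/App.java", "src/Login.java", "src/Db.java"]))
```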

Benchmarks: Security Review

For benchmarking accuracy in finding security vulnerabilities and hotspots, we evaluate Autofix Bot on OWASP/BenchmarkJava5, consisting of 2,740 labeled files. We invoke the agent programmatically, processing 15 files in parallel. Autofix Bot attains 88.18% accuracy with a True Positive Rate (TPR) of 94.35% and a False Positive Rate (FPR) of 18.43%, completing the suite in ~2.4 hours at $58 in token costs.
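
For reference, the reported accuracy, TPR, and FPR follow directly from the confusion counts in the comparison table below; a quick sanity check using the published Autofix Bot counts:

```python
# Sanity check: metrics recomputed from the Autofix Bot confusion counts
# reported in the table below (TP=1336, FP=244, TN=1080, FN=80 over 2,740 files).
tp, fp, tn, fn = 1336, 244, 1080, 80

accuracy = (tp + tn) / (tp + fp + tn + fn)  # 2416 / 2740 ≈ 0.8818
tpr = tp / (tp + fn)                        # 1336 / 1416 ≈ 0.9435
fpr = fp / (fp + tn)                        # 244 / 1324  ≈ 0.1843

print(f"accuracy={accuracy:.2%}  TPR={tpr:.2%}  FPR={fpr:.2%}")
```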

For comparison, we run Claude Code (Sonnet 4) via its /security-review CLI, with minor prompt adjustments6 for file-list input, intentional vulnerability reporting, and JSONL output; OpenAI Codex (gpt-5-medium) using a prompt derived from Claude Code's, as Codex CLI doesn't have a security-specific review tool yet; and Gemini CLI's Code Review7. All are run on the same set of files, sharded 15 files in parallel.

                        Autofix Bot    OpenAI Codex    Claude Code    Gemini CLI
Files Processed         2,740          2,740           2,740          400
Accuracy                88.18%         88.80%          83.58%         70.25%
True Positive Rate      94.35%         95.97%          87.77%         96.20%
False Positive Rate     18.43%         18.87%          20.91%         67.68%
Total True Positives    1,336          1,358           1,242          228
Total False Positives   244            250             277            111
Total True Negatives    1,080          1,075           1,048          53
Total False Negatives   80             57              173            8
Cost                    $58            $122            $300           $226
Time                    2.4 hours      2.9 hours       5.5 hours      1.7 hours

Autofix Bot performs on par with OpenAI Codex and outperforms Claude Code and Gemini CLI for security review, while being faster per file and significantly cheaper than all three.


[Chart: Autofix Bot vs. Others: Accuracy Comparison. Accuracy (%) for Gemini CLI, Claude Code, OpenAI Codex, and Autofix Bot; values as in the table above.]


[Chart: Autofix Bot vs. Others: Cost Comparison. Total cost (USD) for Gemini CLI, Claude Code, OpenAI Codex, and Autofix Bot; values as in the table above.]


[Chart: Autofix Bot vs. Others: Speed Comparison. Average time spent per file (s): Gemini CLI 15.3, Claude Code 7.22, OpenAI Codex 3.81, Autofix Bot 3.15.]

Benchmarks: Secrets Detection

We benchmark three widely used static tools on our proprietary labeled secrets corpus: Gitleaks8, Detect-Secrets9, and TruffleHog10; and compare the results with Autofix Bot's secrets detection sub-agent on the same corpus.

                 Perfect Matches   Partial Matches   Missed Secrets   False Positives   False Negatives   Accuracy   Precision   Recall   F1 Score
Gitleaks         303               18                197              10                197               0.5849     0.9698      0.6197   0.7562
detect-secrets   270               46                202              152               202               0.5212     0.6752      0.6100   0.6409
TruffleHog       121               21                376              29                376               0.2336     0.8304      0.2741   0.4122
Autofix Bot      453               0                 65               6                 65                0.8745     0.9869      0.8745   0.9278

We observe that Gitleaks and TruffleHog prioritize precision (0.97 and 0.83) but miss many true secrets (recall 0.62 and 0.27). In contrast, detect-secrets achieves similar recall to Gitleaks (0.61) but with substantially lower precision (0.68), resulting in more false positives (152) than Gitleaks (10) or TruffleHog (29).

Static-only signal is necessary but insufficient. To address the above failure modes, Autofix Bot's secrets detection sub-agent runs:

  1. a static regex/pattern sweep to over-approximate candidates (maximizing recall), then
  2. a custom fine-tuned classifier to confirm or deny each candidate and extract the value (a sketch of both stages follows this list).
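
The sketch below illustrates those two stages. The regex patterns and the classify_candidate helper are hypothetical placeholders standing in for the production sweep and the fine-tuned classifier.

```python
# A sketch of the two-stage secrets pipeline: a deliberately broad regex sweep
# (stage 1) followed by a per-candidate classification step (stage 2).
import re

CANDIDATE_PATTERNS = [
    # Broad on purpose: stage 1 over-approximates to maximize recall.
    re.compile(r"(?i)(api[_-]?key|secret|token|passwd|password)\s*[:=]\s*\S{8,}"),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key ID
]


def sweep(text: str) -> list:
    """Stage 1: collect every candidate that matches any broad pattern."""
    return [m.group(0) for p in CANDIDATE_PATTERNS for m in p.finditer(text)]


def classify_candidate(candidate: str) -> bool:
    """Stage 2 placeholder: confirm/deny a candidate and (in the real system)
    extract the secret value. A trivial heuristic stands in for the classifier."""
    lowered = candidate.lower()
    return "example" not in lowered and "placeholder" not in lowered


def detect_secrets(text: str) -> list:
    return [c for c in sweep(text) if classify_candidate(c)]
```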

On this corpus, Autofix Bot achieves the highest F1 (0.9278), substantially ahead of static scanners (Gitleaks 0.7562, detect-secrets 0.6409, TruffleHog 0.4122).


[Chart: Autofix Bot's Secrets Detection vs Static Tools: F1 Comparison. F1 scores for Gitleaks, detect-secrets, TruffleHog, and Autofix Bot; values as in the table above.]

We choose F1 as the headline metric here because secrets datasets are imbalanced and teams care about two things at once: not missing real leaks (recall) and not drowning reviewers in noise (precision). By maximizing F1 — rather than accuracy, which can be inflated by abundant true negatives — Autofix Bot demonstrates the best balance of safety and triage cost. Practically, this means the system catches more real secrets while keeping alerts actionable, making it suitable for always-on CI and large repo backfills.
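
As a concrete check of how the table's figures fit together, the following snippet recomputes the Gitleaks row from its raw counts, under the assumption (consistent with the published numbers) that partial matches count as detections:

```python
# Recomputing the Gitleaks row from its raw counts (303 perfect + 18 partial
# matches, 10 false positives, 197 missed secrets), treating partial matches
# as detections.
perfect, partial, missed, false_pos = 303, 18, 197, 10

tp = perfect + partial                              # 321 detected secrets
precision = tp / (tp + false_pos)                   # 321 / 331 ≈ 0.9698
recall = tp / (tp + missed)                         # 321 / 518 ≈ 0.6197
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.7562

print(f"precision={precision:.4f}  recall={recall:.4f}  F1={f1:.4f}")
```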

Benchmarks: Remediation

On OWASP/BenchmarkJava, we evaluate end-to-end fix quality using a shared LLM judge (GPT-4o) with a common prompt. Autofix Bot produced 1,293 correct fixes out of 1,354 (95.49%), with $8.33 total spend and 12.71s average time per fix. Codex (gpt-5 medium) achieved the highest fix accuracy (97.09%) but at $45 and 53.04s average latency; Claude Code was comparable in accuracy (95.48%) at ~$91 and 35.57s per fix.
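
For illustration, here is a minimal sketch of an LLM-judge call in the spirit of this setup, using the OpenAI Python client. The judge prompt and the VERDICT output format are assumptions made for the sketch; the benchmark's actual shared prompt lives in autofix-bot-bench and is not reproduced here.

```python
# A sketch of a GPT-4o judge that labels a proposed fix as correct or not.
# Prompt wording and verdict format are assumptions, not the benchmark's prompt.
from openai import OpenAI

client = OpenAI()

JUDGE_SYSTEM_PROMPT = (
    "You are a security reviewer. Given a vulnerability description, the "
    "original code, and a proposed patch, answer on a single line with "
    "either 'VERDICT: correct' or 'VERDICT: incorrect'."
)


def judge_fix(vulnerability: str, original: str, patch: str) -> bool:
    """Ask the judge model whether a proposed patch correctly fixes the issue."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # keep the judge as repeatable as the API allows
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Vulnerability:\n{vulnerability}\n\n"
                    f"Original code:\n{original}\n\n"
                    f"Proposed patch:\n{patch}"
                ),
            },
        ],
    )
    verdict = response.choices[0].message.content or ""
    return "VERDICT: correct" in verdict
```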

Autofix Bot delivers near-parity fix accuracy while being ≈4.2× faster than Codex (and ≈2.8× faster than Claude) and ≈5× to 10× cheaper overall.

                           Autofix Bot    OpenAI Codex    Claude Code    Gemini CLI
Files Processed            1,354          1,379           1,240          220
Correct Fixes              1,293          1,339           1,184          181
Incorrect/Partial/Syntax   61             40              56             39
Fix Accuracy               95.49%         97.09%          95.48%         82.27%
Cost                       $8.33          ~$45            ~$91           ~$21
Average Time per Fix       12.71s         53.04s          35.57s         33.2s
Min Time                   4.80s          14.87s          15.36s         11.6s
Max Time                   84.29s         317.85s         200.92s        89.96s

[Chart: Autofix Bot vs Others: Fix Accuracy. Fix accuracy (%) for Gemini CLI, Claude Code, OpenAI Codex, and Autofix Bot; values as in the table above.]


[Chart: Autofix Bot vs Others: Remediation Cost Comparison. Total cost (USD) for Gemini CLI, Claude Code, OpenAI Codex, and Autofix Bot; values as in the table above.]


[Chart: Autofix Bot vs Others: Average Time per Fix. Average time per fix (seconds) for Claude Code, Gemini CLI, OpenAI Codex, and Autofix Bot; values as in the table above.]

What's Next

Autofix Bot is currently in early access and we're testing the pre-release with trusted partners. Over the next few weeks, we will be releasing the REST API and the Terminal UI, shortly followed by the GitHub pull request integration. We'll continue to work on our agent architecture and expect more improvements in the metrics we've showcased in this document.

Please follow @autofixbot for updates, and keep an eye on the News feed.


Footnotes

  1. Randomized field experiments at Microsoft, Accenture, and an anonymous Fortune 100 electronics firm evaluated the causal effect of granting Microsoft Copilot access on weekly developer outputs. Pooled weighted-IV estimates show a 26.08% increase in completed tasks, and 13.55% increase in commits. Read more.
  2. 2025 GenAI Code Security Report, by Veracode.
  3. 4x Velocity, 10x Vulnerabilities: AI Coding Assistants Are Shipping More Risks
  4. Amro & Alalfi (2025), GitHub’s Copilot Code Review: Can AI Spot Security Flaws Before You Commit? arXiv:2509.13650. Evaluations across seven labeled datasets report fewer than 20 total comments (mostly non-security) despite hundreds of vulns, with 0 security detections on SARD XSS/SQLi and 0 comments on 878/898 Wireshark files—indicating near-zero recall under default settings.
  5. The OWASP Java Benchmark (BenchmarkJava) is a fully runnable Java web application seeded with ground-truth vulnerabilities and non-vulnerabilities, designed to evaluate the accuracy, coverage, and speed of automated AppSec tools (SAST, DAST, IAST). The latest widely used release (v1.2) contains ~2,740 labeled test cases with known CWE mappings.
  6. autofix-bot-bench, the benchmark suite used for this document.
  7. Automate app deployment and security analysis with new Gemini CLI extensions, Google Cloud Blog
  8. Gitleaks: A tool for detecting secrets like passwords, API keys, and tokens in git repos, files, and whatever else you wanna throw at it via stdin.
  9. detect-secrets: An enterprise friendly way of detecting and preventing secrets in code.
  10. TruffleHog: Find, verify, and analyze leaked credentials.