Unveiling Narada
We created a hybrid secret detection system that pairs regex for rapid scanning with a fine-tuned Llama3.2-3B-Instruct model, achieving 97.0% precision. Starting with a teacher-student setup inspired by chain-of-thought reasoning, we iterated through base models, dataset challenges, and synthetic data to deliver a fast, accurate solution ready for real-world codebases.
Detecting sensitive secrets like API keys or credentials in code is critical but tricky. Traditional regex-based tools excel at spotting potential secrets but often overwhelm developers with false positives, flagging harmless strings like "your-api-key-here" as threats. This noise frustrates developers, erodes trust in the tool, and hurts adoption.
Our goal was to build a practical system that combines the speed of regular expressions with the contextual intelligence of language models, delivering high recall and precision without frustrating developers.
Limits of Regex
Regex detectors face a tradeoff:
- High sensitivity catches more secrets but floods users with false positives.
- Low sensitivity reduces noise but misses real risks.
Regex excels at pattern matching but lacks the ability to understand context, such as distinguishing live credentials from placeholders or examples. We aimed to enhance regex with a language model to add nuance while maintaining efficiency.
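To make the tradeoff concrete, here is a minimal sketch of a regex-only scanner. The patterns are illustrative, not our production rule set; note how a placeholder value matches just as readily as a live key:

```python
import re

# Illustrative patterns only -- not the actual production rule set.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][^'\"]{8,}['\"]"),
}

def regex_scan(text):
    """Return (line_number, rule_name, matched_text) for every candidate."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            m = pattern.search(line)
            if m:
                hits.append((lineno, name, m.group(0)))
    return hits

# A harmless placeholder still matches -- the false-positive problem:
print(regex_scan('API_KEY = "your-api-key-here"'))
```

The scanner has no way to tell this placeholder from a real credential; that contextual judgement is exactly what the language model adds.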
Our Initial Approach: Teacher-Student Architecture
We started with a teacher-student model inspired by chain-of-thought (CoT) reasoning, where a powerful "teacher" model generates detailed explanations to train a lightweight "student" model for deployment.
Picking the Right Base Model
We selected the following compact models (~1-1.5B parameters) to balance performance and speed:
- Qwen2.5-1.5B-Instruct
- Llama3.2-1B-Instruct
- Gemma2-1B-Instruct
Using CoT prompting (inspired by Google Brain’s reasoning research), we evaluated performance on 24 code files, feeding snippets with ±20 lines around potential secrets to mimic real-world scenarios. Here’s how they performed:
Model | Technique | Input Context | Time (s) | Accuracy |
---|---|---|---|---|
Qwen2.5-1.5B-Instruct | Chain-of-Thought | ±20 Lines | 292 | 54.1% |
Llama3.2-1B-Instruct | Chain-of-Thought | ±20 Lines | 168 | 37.0% |
While experimenting with Gemma2-1B-Instruct, we found that it often produced random, irrelevant responses during inference and failed to follow the system prompt instructions, so we excluded it from further evaluation.
Qwen 2.5 led in accuracy, so we proceeded with it for initial detection and labelling. To ensure accurate line numbers, we built a pipeline to preserve original line references when processing snippets.
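A minimal sketch of that windowing step, assuming the detector reports an absolute line number; the helper and its naming are illustrative, not our exact pipeline:

```python
def build_snippet(file_lines, hit_line, window=20):
    """Extract ±window lines around a regex hit, prefixing each line with
    its original 1-based number so the model's verdict can reference it."""
    start = max(1, hit_line - window)
    end = min(len(file_lines), hit_line + window)
    # Prefix each line with its original number to preserve references.
    numbered = [f"{n}: {file_lines[n - 1]}" for n in range(start, end + 1)]
    return "\n".join(numbered)

# A 100-line file with a hit on line 50 yields a snippet of lines 30-70,
# each still carrying its true line number.
lines = [f"code line {i}" for i in range(1, 101)]
snippet = build_snippet(lines, hit_line=50)
```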
Teacher Model Trials
Inspired by the OpenThoughts paper on data recipes for reasoning models, we had a larger teacher model generate reasoning chains to guide the student. We trialled three candidates:
- Qwen QwQ 32B
- ChatGPT o1-mini
- DeepSeek R1
DeepSeek R1 stood out for consistency, accuracy, and cost-effectiveness, making it our choice for generating training data.
Dataset Challenges: Laying a Strong Foundation
High-quality data is critical for secret detection, but sourcing diverse, reliable datasets proved challenging. Secrets appear in varied formats across programming languages, configs, and documentation. We initially used Samsung’s CredData but found its labels inconsistent. To address this, we employed LLMs to relabel data, using the teacher model to produce structured outputs with CoT reasoning.
Curating CredData
From CredData’s millions of entries, we curated ~8,000 diverse secrets, filtering out repetitive entries (e.g., redundant RSA keys) to reduce bias. The teacher model generated structured outputs:
```
<input snippet> : <chain-of-thought reasoning> : <final_verdict>

final_verdict: {
    "Label": "True Positive/False Positive",
    "Line Number": <line where secret exists>,
    "Secret Value": <secret value in snippet>
}
```
This ensured a balanced mix of true/false positives, focusing on key secret types such as API keys, database URLs, and passwords.
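For illustration, a small parser for the final_verdict portion of the teacher's output, assuming the teacher emits valid JSON; the normalised field names are our own for this sketch:

```python
import json

# A hypothetical teacher verdict for a placeholder value:
raw_verdict = """{
  "Label": "False Positive",
  "Line Number": 12,
  "Secret Value": "your-api-key-here"
}"""

def parse_verdict(raw):
    """Normalise the teacher's final_verdict into a training label."""
    v = json.loads(raw)
    return {
        "is_secret": v["Label"] == "True Positive",
        "line": v["Line Number"],
        "value": v["Secret Value"],
    }

print(parse_verdict(raw_verdict))
```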
Evolving the Model: Switching to Llama
Initial fine-tuning with Qwen 2.5 1.5B underperformed, so we scaled to Llama3.2-3B-Instruct and rethought our detection pipeline. A comparison of fine-tuned models showed Llama’s superiority:
Metric | Qwen2.5-1.5B-Instruct | Llama3.2-3B-Instruct |
---|---|---|
True Positives (TP) | 199 | 203 |
False Positives (FP) | 33 | 10 |
True Negatives (TN) | 944 | 982 |
False Negatives (FN) | 299 | 291 |
Overall Accuracy | 77.5% | 79.7% |
Precision | 85.8% | 95.3% |
Recall | 39.9% | 41.1% |
F1 Score | 54.6% | 57.3% |
Llama’s 95.3% precision, its 70% reduction in false positives (10 vs. 33), and its higher overall accuracy made it the clear winner.
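For reference, the metrics in these tables follow the standard confusion-matrix definitions. A small helper, run here on illustrative counts rather than the evaluation numbers above:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from a confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only (not taken from the tables above):
acc, prec, rec, f1 = classification_metrics(tp=80, fp=20, tn=80, fn=20)
# With these balanced counts every metric comes out to 0.8.
```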
Boosting with Synthetic Data
To handle edge cases, we generated synthetic data using:
- Gemini 2.5 Pro for diverse, realistic secret patterns.
- Claude Sonnet 4 for context-appropriate false positives.
This enriched our dataset with varied secret types and challenging scenarios, improving model robustness.
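Hypothetical examples of the two synthetic categories, invented here for illustration rather than taken from the generated data:

```python
# Invented examples of the two synthetic categories (not real generated data):
synthetic_examples = [
    {   # realistic-looking secret pattern (target label: True Positive)
        "snippet": 'db_url = "postgres://admin:S3cr3tPass@db.internal:5432/prod"',
        "label": "True Positive",
    },
    {   # context-appropriate false positive (target label: False Positive)
        "snippet": 'token = os.environ["GITHUB_TOKEN"]  # read from env, no literal',
        "label": "False Positive",
    },
]
```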
The Hybrid Solution: Regex + Fine-Tuned Model
Our final system combines regex and AI:
- Regex quickly identifies potential secrets.
- Fine-tuned Llama3.2-3B-Instruct classifies candidates contextually.
Benefits:
- Speed: Regex handles initial scans efficiently.
- Accuracy: AI reduces false positives with contextual understanding.
- Scalability: The 3B model keeps costs low for production.
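A minimal sketch of the two-stage flow, with a stub standing in for the fine-tuned model; the candidate pattern and the classifier interface are illustrative assumptions, not the production implementation:

```python
import re

# Illustrative candidate pattern -- not the production rule set.
CANDIDATE = re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][^'\"]{8,}['\"]")

def hybrid_scan(text, classify, window=20):
    """Two-stage scan: regex proposes candidates, the model confirms them.

    `classify(snippet, hit_line)` stands in for the fine-tuned
    Llama3.2-3B-Instruct call (its interface here is a hypothetical
    sketch) and must return "True Positive" or "False Positive".
    """
    lines = text.splitlines()
    confirmed = []
    for lineno, line in enumerate(lines, start=1):
        if CANDIDATE.search(line):                          # fast regex pass
            lo = max(0, lineno - 1 - window)
            snippet = "\n".join(lines[lo:lineno + window])  # ±window context
            if classify(snippet, line) == "True Positive":  # contextual filter
                confirmed.append((lineno, line.strip()))
    return confirmed

# Stub standing in for the model: reject obvious placeholder values.
def stub_classify(snippet, hit_line):
    return "False Positive" if "your-api-key" in hit_line else "True Positive"

code = 'api_key = "your-api-key-here"\nAPI_KEY = "9f8e7d6c5b4a3210"'
print(hybrid_scan(code, stub_classify))  # only the real-looking key survives
```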
Second Round of Fine-Tuning
We shifted from CoT to deterministic outputs, using ~800 synthetic examples labelled by LLMs along with our own dataset of ~110 false-positive entries. Each record followed this format:
```
{
    "line_number": <regex-detected line>,
    "label": "True Positive/False Positive",
    "secret_value": <value>,
    "reason": <brief teacher reasoning>
}
```
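As a sketch, records in this format can be validated before training, assuming each LLM-labelled example arrives as a JSON string; the validator itself is illustrative:

```python
import json

REQUIRED = {"line_number", "label", "secret_value", "reason"}
LABELS = {"True Positive", "False Positive"}

def validate_record(raw):
    """Check one LLM-labelled example against the deterministic schema."""
    rec = json.loads(raw)
    if set(rec) != REQUIRED:
        raise ValueError(f"unexpected fields: {sorted(set(rec) ^ REQUIRED)}")
    if rec["label"] not in LABELS:
        raise ValueError(f"bad label: {rec['label']}")
    return rec

# A hypothetical well-formed training record:
example = json.dumps({
    "line_number": 7,
    "label": "False Positive",
    "secret_value": "example-password",
    "reason": "Value is a documentation placeholder, not a live credential.",
})
print(validate_record(example)["label"])
```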
Results showed significant improvement:
Metric | Llama3.2-3B-Instruct Base | Llama3.2-3B-Instruct Fine-Tuned |
---|---|---|
Total Processed | 912 | 912 |
True Positives | 369 | 529 |
False Positives | 222 | 16 |
False Negatives | 180 | 20 |
True Negatives | 141 | 347 |
Overall Accuracy | 47.5% | 77.8% |
Recall | 67.2% | 96.3% |
Precision | 62.4% | 97.0% |
F1 Score | 64.7% | 96.7% |
Fine-tuning lifted overall accuracy by roughly 30 percentage points and recall by about 29 points, while raising precision from 62.4% to 97.0%.
Key Lessons
- Teacher-student setups generate high-quality training data but can be costly.
- Hybrid systems outperform standalone regex or AI approaches.
- Synthetic data addresses rare and edge-case scenarios effectively.
- Model size matters: 3B parameters hit the sweet spot for performance and efficiency.
Next Steps
To further enhance the system:
- Implement active learning to incorporate production feedback.
- Integrate with security tools for seamless workflows.
Conclusion
Our hybrid secret detection system blends regex's speed with the contextual power of a fine-tuned Llama3.2-3B-Instruct model. By leveraging teacher-student training, curated datasets, and synthetic data, we achieved 97.0% precision and cut false positives from 222 to 16. This approach proves that specialised, lightweight models can deliver enterprise-grade results for targeted security tasks.
The Narada-3.2-3B-v1 model is now available on Hugging Face.