Introducing Narada: An Open-Source Secrets Classification Model
We're excited to release Narada, an open-source secrets classification model fine-tuned from Llama3.2-3B-Instruct by the Autofix Bot team. When paired with regex-based detection, the model achieves 97% precision in detecting secrets and hard-coded credentials in source code. It can be used with any regular-expression-based secrets detection tool to dramatically reduce false positives.
As part of Autofix Bot's benchmarks, we showed how our secrets detection agent outperforms static-only tools in precision and accuracy. With this release, we're pulling back the curtain on the work we've done behind the scenes and open-sourcing an early version of the fine-tuned model that powers our secrets detection agent.
The Narada-3.2-3B-v1 model is available on Hugging Face, free to use under the MIT license.
Quickstart
You can pair the model with any secrets detection tool to improve its results. Here's a reference implementation:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import re
import json

# Load the model and tokenizer
model_name = "deepsource/Narada-3.2-3B-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare input: format the code snippet (with prepended line numbers)
# and the target line number as shown
code = """1: import requests
2: import sys
3:
4: class GitHubClient:
5:     def __init__(self):
6:         # Hardcoded API token (bad practice, for demo only)
7:         self.token = "ghp_9aB3XyZkLmN8QrT5UvWj7xY2ZaQ4b9CdEfGh"
8:         self.base_url = "https://api.github.com"
9:
10:     def get_user(self, username):
11:         headers = {
12:             "Authorization": f"token {self.token}",
13:             "Accept": "application/vnd.github.v3+json"
14:         }
15:         url = f"{self.base_url}/users/{username}"
16:         response = requests.get(url, headers=headers)
17:         if response.status_code == 200:
18:             return response.json()
19:         else:
20:             return {"error": response.status_code, "message": response.text}
"""
line_number = "7"
input_text = f"<input>\n# CODE:\n```{code}```\n\n# LINE NUMBER: {line_number}\n</input>"

# Replace with the system prompt present at
# https://huggingface.co/deepsource/Narada-3.2-3B-v1/blob/main/system_prompt.md
system_prompt = "<replace_with_system_prompt>"
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": input_text}
]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(device)

# Generate the response and decode only the newly generated tokens
outputs = model.generate(input_ids, max_new_tokens=1024)
generated_tokens = outputs[0][input_ids.shape[1]:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

# Parse the JSON output (the model responds with <json>{...}</json>)
match = re.search(r'<json>\s*(.*?)\s*</json>', response, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(data)
    # Example output: {'line_number': 7, 'label': 'True Positive',
    #                  'secret_value': 'ghp_9aB3XyZkLmN8QrT5UvWj7xY2ZaQ4b9CdEfGh', 'reason': '...'}
else:
    raise ValueError("No valid JSON found in response")
Limitations of regex-based tools
Traditional regex-based secrets detection tools face a tradeoff: high sensitivity catches more secrets but floods users with false positives, while low sensitivity reduces noise but misses real risks. Regex excels at pattern matching but lacks the ability to understand context, such as distinguishing live credentials from placeholders or examples. Flagging harmless strings like "YOUR_API_KEY_HERE" or sample tokens in tests and documentation buries the findings that actually matter.
We aimed to enhance regex with a language model to add nuance while maintaining efficiency. Our goal was to build a practical system that combines the speed of regular expressions with the contextual intelligence of language models, delivering high recall and precision.
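To make the tradeoff concrete, here's a minimal sketch (not one of our production rules) of a regex check for GitHub personal access tokens: it flags a live-looking credential and an obvious placeholder with equal confidence, because the pattern alone carries no context.

import re

# A simplified pattern for GitHub personal access tokens (illustrative only).
GITHUB_PAT = re.compile(r"ghp_[A-Za-z0-9]{36}")

snippets = [
    'self.token = "ghp_9aB3XyZkLmN8QrT5UvWj7xY2ZaQ4b9CdEfGh"',  # real-looking secret
    'token = "ghp_' + "x" * 36 + '"',                           # placeholder, e.g. from docs
]

for line in snippets:
    if GITHUB_PAT.search(line):
        # Both lines are flagged: the regex cannot tell a live credential
        # from an example value, so a human (or a model) has to decide.
        print("flagged:", line)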
Initial approach: Teacher-Student Architecture
We started with a teacher-student model inspired by chain-of-thought (CoT)1 reasoning, where a powerful "teacher" model generates detailed explanations to train a lightweight "student" model for deployment.
Picking the right base model
We selected the following compact models (~1-1.5B parameters) to balance performance and speed:
- Qwen2.5-1.5B-Instruct
- Llama3.2-1B-Instruct
- Gemma2-1B-Instruct
Using CoT prompting (inspired by Google Brain’s reasoning research), we evaluated performance on 24 code files, feeding snippets with ±20 lines around potential secrets to mimic real-world scenarios. Here’s how they performed:
Model | Technique | Input Context | Time (s) | Accuracy |
---|---|---|---|---|
Qwen 2.5 1.5B Instruct | Chain-of-Thought | ±20 Lines | 292 | 54.1% |
Llama3.2-1B-Instruct | Chain-of-Thought | ±20 Lines | 168 | 37.0% |
While experimenting with Gemma2-1B-Instruct, we observed that it often produced random, irrelevant responses during inference and failed to follow the system prompt instructions, so we excluded it from further evaluation.
Qwen 2.5 led in accuracy, so we proceeded with it for initial detection and labelling. To ensure accurate line numbers, we built a pipeline to preserve original line references when processing snippets.
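That pipeline isn't part of this release, but a minimal sketch of the idea, assuming the line-numbered snippet format shown in the quickstart above (the helper name is ours), looks like this:

def snippet_around(source: str, target_line: int, context: int = 20) -> str:
    """Return ±`context` lines around `target_line`, each prefixed with its
    original line number so the model's verdict can reference it."""
    lines = source.splitlines()
    start = max(0, target_line - 1 - context)
    end = min(len(lines), target_line + context)
    return "\n".join(f"{i + 1}: {lines[i]}" for i in range(start, end))

# Example: build the model input for a candidate secret on line 7 of a file.
# snippet = snippet_around(open("github_client.py").read(), target_line=7)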
Teacher model trials
Inspired by the OpenThoughts2 paper on data recipes for reasoning models, we had a larger teacher model generate reasoning chains to guide the student. We evaluated three candidates:
- Qwen QwQ 32B
- ChatGPT o1-mini
- DeepSeek R1
DeepSeek R1 stood out for consistency, accuracy, and cost-effectiveness, making it our choice for generating training data.
Dataset challenges
High-quality data is critical for secret detection, but sourcing diverse, reliable datasets proved challenging. Secrets appear in varied formats across programming languages, configs, and documentation. We initially used Samsung’s CredData3 but found its labels to be inconsistent. To address this, we employed LLMs to relabel data, using the teacher model to produce structured outputs with CoT reasoning.
Curating CredData
From CredData’s millions of entries, we curated ~8,000 diverse secrets, filtering out repetitive entries (e.g., redundant RSA keys) to reduce bias. The teacher model generated structured outputs:
<input snippet> : <Chain-of-thought reasoning> : <final_verdict>
final_verdict: {
    "Label": "True Positive/False Positive",
    "Line Number": <line where secret exists>,
    "Secret Value": <secret value in snippet>
}
This ensured a balanced mix of true/false positives, focusing on key secret types such as API keys, database URLs, and passwords.
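For illustration, a single teacher-labelled record in this shape might look like the following (a made-up example, not an actual entry from the curated set):

# Made-up teacher-labelled record, shown here only to illustrate the format.
example = {
    "input_snippet": '12: DATABASE_URL = "postgres://admin:s3cretPass@db.internal:5432/prod"',
    "reasoning": (
        "Line 12 embeds a password inside a connection string pointing at a "
        "non-placeholder host, so this is a real hard-coded credential."
    ),
    "final_verdict": {
        "Label": "True Positive",
        "Line Number": 12,
        "Secret Value": "postgres://admin:s3cretPass@db.internal:5432/prod",
    },
}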
Evolving the model: Switching to Llama
Initial fine-tuning with Qwen 2.5 1.5B underperformed, so we scaled to Llama3.2-3B-Instruct and rethought our detection pipeline. A comparison of fine-tuned models showed Llama’s superiority:
Metric | Qwen 2.5 1.5B Instruct | Llama3.2-3B-Instruct |
---|---|---|
True Positives (TP) | 199 | 203 |
False Positives (FP) | 33 | 10 |
True Negatives (TN) | 944 | 982 |
False Negatives (FN) | 299 | 291 |
Overall Accuracy | 77.5% | 79.7% |
Precision | 85.8% | 95.3% |
Recall | 39.9% | 41.1% |
F1 Score | 54.6% | 57.3% |
Llama’s 95.3% precision, roughly 70% reduction in false positives (10 vs. 33), and higher overall accuracy made it the clear winner.
Boosting with Synthetic Data
To handle edge cases, we generated synthetic data using:
- Gemini 2.5 Pro for diverse, realistic secret patterns.
- Claude Sonnet 4 for context-appropriate false positives.
This enriched our dataset with varied secret types and challenging scenarios, improving model robustness.
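The exact generation prompts aren't part of this release, but a rough, hypothetical sketch of the kind of instruction we gave the generator models for false-positive cases looks something like this:

# Hypothetical prompt shape for false-positive generation; the actual prompts
# used with Gemini 2.5 Pro and Claude Sonnet 4 are not part of this release.
FALSE_POSITIVE_PROMPT = """\
Write a realistic {language} source file of 20-40 lines that contains a string
which looks like a {secret_type} but is clearly not a live credential, for
example a documented placeholder, a test fixture, or a value in a comment.
Prepend each line with its line number and state which line a regex scanner
would flag."""

print(FALSE_POSITIVE_PROMPT.format(language="Python", secret_type="AWS access key"))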
A hybrid solution: Regex + fine-tuned model for filtering
Our final system combines regex and AI (see the sketch after the lists below):
- Regex quickly identifies potential secrets.
- Fine-tuned Llama3.2-3B-Instruct classifies candidates contextually.
Benefits:
- Speed: Regex handles initial scans efficiently.
- Accuracy: AI reduces false positives with contextual understanding.
- Scalability: The 3B model keeps costs low for production.
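Put together, the two stages look roughly like the sketch below. `classify_with_narada` is assumed to wrap the quickstart inference code, and `SECRET_PATTERNS` stands in for whatever rule set your regex scanner uses; both names are illustrative, not part of the released model.

import re

# Stand-in rule set; in practice this comes from your regex-based scanner.
SECRET_PATTERNS = {
    "github-pat": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "aws-access-key-id": re.compile(r"AKIA[0-9A-Z]{16}"),
}

def scan_file(source: str):
    """Stage 1: cheap regex pass yielding candidate (line_number, line) pairs."""
    for number, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS.values()):
            yield number, line

def filter_candidates(source: str, classify_with_narada):
    """Stage 2: keep only candidates the model labels as true positives.

    `classify_with_narada(source, line_number)` is assumed to wrap the
    quickstart inference code and return the parsed JSON verdict."""
    findings = []
    for line_number, _ in scan_file(source):
        verdict = classify_with_narada(source, line_number)
        if verdict["label"] == "True Positive":
            findings.append(verdict)
    return findings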
Fine-tuning round #2
We shifted from CoT to deterministic outputs, using ~800 synthetic examples labelled by LLMs, along with our own dataset of false positives (~110 entries). The dataset format included:
{
    "line_number": <regex-detected line>,
    "label": "True Positive/False Positive",
    "secret_value": <value>,
    "reason": <brief teacher reasoning>
}
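For illustration, one chat-formatted training example in this scheme might look like the following (content abbreviated; the input/output wrapping mirrors the quickstart format, while the exact serialisation is an assumption):

# Abbreviated, made-up fine-tuning example in the deterministic output format.
training_example = {
    "messages": [
        {"role": "system", "content": "<system prompt from the model card>"},
        {
            "role": "user",
            "content": "<input>\n# CODE:\n```1: token = \"ghp_...\"```\n\n# LINE NUMBER: 1\n</input>",
        },
        {
            "role": "assistant",
            "content": '<json>{"line_number": 1, "label": "True Positive", '
                       '"secret_value": "ghp_...", "reason": "Hard-coded GitHub token."}</json>',
        },
    ]
}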
Results showed significant improvement:
Metric | Llama3.2-3B-Instruct Base | Llama3.2-3B-Instruct Fine-Tuned |
---|---|---|
Total Processed | 912 | 912 |
True Positives | 369 | 529 |
False Positives | 222 | 16 |
False Negatives | 180 | 20 |
True Negatives | 141 | 347 |
Overall Accuracy | 47.5% | 77.8% |
Recall | 67.2% | 96.3% |
Precision | 62.4% | 97.0% |
F1 Score | 64.7% | 96.7% |
The fine-tuned model improved overall accuracy and recall by roughly 30 percentage points each, and lifted precision from 62.4% to 97.0%, an improvement of almost 35 points over the base model.
Key lessons
- Teacher-student setups generate high-quality training data but can be costly.
- Hybrid systems outperform standalone regex or AI approaches.
- Synthetic data addresses rare and edge-case scenarios effectively.
- Model size matters—3B parameters hit the sweet spot for performance and efficiency.
Conclusion
Our hybrid secret detection system blends regex's speed with the contextual power of a fine-tuned Llama3.2-3B-Instruct model. By leveraging teacher-student training, curated datasets, and synthetic data, we achieved 97% precision in secrets detection. This approach proves that specialised, lightweight models can deliver enterprise-grade results for targeted security tasks.
We're excited for you to try the Narada-3.2-3B-v1 model, which is now available on Hugging Face under the MIT license. If you have any feedback, please let us know by reaching out to us on @autofixbot.
Definitions
Accuracy
Accuracy measures how often the detector is correct overall—both when identifying real secrets and when correctly ignoring non-secrets. It’s calculated as the ratio of all correct predictions (true positives and true negatives) to the total number of predictions.
Precision
Precision measures how many of the items flagged as secrets are actually real secrets. It’s the ratio of true positives to all items flagged as positives, showing how reliable the detections are.
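In code, the reported metrics follow directly from the confusion-matrix counts. Plugging in the fine-tuned Llama column from the Qwen-vs-Llama comparison above reproduces the published figures to within rounding:

def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard classification metrics computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Fine-tuned Llama3.2-3B-Instruct column from the comparison table above.
print(detection_metrics(tp=203, fp=10, tn=982, fn=291))
# -> accuracy ≈ 0.797, precision ≈ 0.953, recall ≈ 0.411, f1 ≈ 0.574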
Footnotes
- Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." Advances in neural information processing systems 35 (2022): 24824-24837. ↩
- Guha, Etash, et al. "OpenThoughts: Data Recipes for Reasoning Models." ArXiv, 2025, https://arxiv.org/abs/2506.04178. Accessed 10 Oct. 2025. ↩
- Samsung CredData, https://github.com/Samsung/CredData ↩