Introducing Narada: An Open-Source Secrets Classification Model
We're excited to release Narada, an open-source secrets classification model fine-tuned from Llama3.2-3B-Instruct by the Autofix Bot team. When paired with regex-based detection, the model achieves 97% precision in detecting secrets and hard-coded credentials in source code. It can be used with any regular-expression-based secrets detection tool to dramatically reduce false positives.
As part of Autofix Bot's benchmarks, we showed how our secrets detection agent outperforms static-only tools in precision and accuracy. With this release, we're pulling back the curtain on the work we've done behind the scenes and open-sourcing an early version of the fine-tuned model that powers our secrets detection agent.
The Narada-3.2-3B-v1 model is available on Hugging Face, free to use under the MIT license.
Quickstart
You can pair the model with any secrets detection tool to improve its results. Here's a reference implementation:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import re
import json

# Load the model and tokenizer
model_name = "deepsource/Narada-3.2-3B-v1"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare input: format the code snippet (with prepended line numbers)
# and the target line number as shown
code = """1: import requests
2: import sys
3:
4: class GitHubClient:
5:     def __init__(self):
6:         # Hardcoded API token (bad practice, for demo only)
7:         self.token = "ghp_9aB3XyZkLmN8QrT5UvWj7xY2ZaQ4b9CdEfGh"
8:         self.base_url = "https://api.github.com"
9:
10:     def get_user(self, username):
11:         headers = {
12:             "Authorization": f"token {self.token}",
13:             "Accept": "application/vnd.github.v3+json"
14:         }
15:         url = f"{self.base_url}/users/{username}"
16:         response = requests.get(url, headers=headers)
17:         if response.status_code == 200:
18:             return response.json()
19:         else:
20:             return {"error": response.status_code, "message": response.text}
"""
line_number = "7"
input_text = f"<input>\n# CODE:\n```{code}```\n\n# LINE NUMBER: {line_number}\n</input>"

# Replace with the system prompt present at
# https://huggingface.co/deepsource/Narada-3.2-3B-v1/blob/main/system_prompt.md
system_prompt = "<replace_with_system_prompt>"
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": input_text}
]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(device)

# Generate the response and decode only the newly generated tokens
outputs = model.generate(input_ids, max_new_tokens=1024)
generated_tokens = outputs[0][input_ids.shape[1]:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()

# Parse the JSON output (the model responds with <json>{...}</json>)
match = re.search(r'<json>\s*(.*?)\s*</json>', response, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(data)
    # Example output: {'line_number': 7, 'label': 'True Positive',
    #                  'secret_value': 'ghp_9aB3XyZkLmN8QrT5UvWj7xY2ZaQ4b9CdEfGh', 'reason': '...'}
else:
    raise ValueError("No valid JSON found in response")
Limitations of regex-based tools
Traditional regex-based secrets detection tools face a tradeoff: high sensitivity catches more secrets but floods users with false positives, while low sensitivity reduces noise but misses real risks. Regex excels at pattern matching but lacks the ability to understand context, such as distinguishing live credentials from placeholders or examples. Flagging harmless strings like "YOUR_API_KEY_HERE" or sample tokens in tests and documentation buries the findings that actually matter.
We aimed to enhance regex with a language model to add nuance while maintaining efficiency. Our goal was to build a practical system that combines the speed of regular expressions with the contextual intelligence of language models, delivering high recall and precision.
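To make the tradeoff concrete, here's a minimal sketch (not one of our production rules) of a regex check for GitHub personal access tokens: it flags a live-looking credential and an obvious placeholder with equal confidence, because the pattern alone carries no context.

import re

# A simplified pattern for GitHub personal access tokens (illustrative only).
GITHUB_PAT = re.compile(r"ghp_[A-Za-z0-9]{36}")

snippets = [
    'self.token = "ghp_9aB3XyZkLmN8QrT5UvWj7xY2ZaQ4b9CdEfGh"',  # real-looking secret
    'token = "ghp_' + "x" * 36 + '"',                           # placeholder, e.g. from docs
]

for line in snippets:
    if GITHUB_PAT.search(line):
        # Both lines are flagged: the regex cannot tell a live credential
        # from an example value, so a human (or a model) has to decide.
        print("flagged:", line)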
Initial approach: Teacher-Student Architecture
We started with a teacher-student model inspired by chain-of-thought (CoT)1 reasoning, where a powerful "teacher" model generates detailed explanations to train a lightweight "student" model for deployment.
Picking the right base model
We selected the following compact models (~1-1.5B parameters) to balance performance and speed:
- Qwen2.5-1.5B-Instruct
- Llama3.2-1B-Instruct
- Gemma2-1B-Instruct
Using CoT prompting (inspired by Google Brain’s reasoning research), we evaluated performance on 24 code files, feeding snippets with ±20 lines around potential secrets to mimic real-world scenarios. Here’s how they performed:
Model | Technique | Input Context | Time (s) | Accuracy |
---|---|---|---|---|
Qwen 2.5 1.5B Instruct | Chain-of-Thought | ±20 Lines | 292 | 54.1% |
Llama3.2-1B-Instruct | Chain-of-Thought | ±20 Lines | 168 | 37.0% |
While experimenting with Gemma2-1B-Instruct, we observed that it often produced random, irrelevant responses during inference and failed to follow the system prompt instructions, so we excluded it from further evaluation.
Qwen 2.5 led in accuracy, so we proceeded with it for initial detection and labelling. To ensure accurate line numbers, we built a pipeline to preserve original line references when processing snippets.
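That pipeline isn't part of this release, but a minimal sketch of the idea, assuming the line-numbered snippet format shown in the quickstart above (the helper name is ours), looks like this:

def snippet_around(source: str, target_line: int, context: int = 20) -> str:
    """Return ±`context` lines around `target_line`, each prefixed with its
    original line number so the model's verdict can reference it."""
    lines = source.splitlines()
    start = max(0, target_line - 1 - context)
    end = min(len(lines), target_line + context)
    return "\n".join(f"{i + 1}: {lines[i]}" for i in range(start, end))

# Example: build the model input for a candidate secret on line 7 of a file.
# snippet = snippet_around(open("github_client.py").read(), target_line=7)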
Teacher model trials
Inspired by the OpenThoughts2 paper on data recipes for reasoning models, we had a larger teacher model generate reasoning chains to guide the student. We evaluated three candidates:
- Qwen QwQ 32B
- ChatGPT o1-mini
- DeepSeek R1
DeepSeek R1 stood out for consistency, accuracy, and cost-effectiveness, making it our choice for generating training data.
Dataset challenges
High-quality data is critical for secret detection, but sourcing diverse, reliable datasets proved challenging. Secrets appear in varied formats across programming languages, configs, and documentation. We initially used Samsung’s CredData3 but found its labels to be inconsistent. To address this, we employed LLMs to relabel data, using the teacher model to produce structured outputs with CoT reasoning.
Curating CredData
From CredData’s millions of entries, we curated ~8,000 diverse secrets, filtering out repetitive entries (e.g., redundant RSA keys) to reduce bias. The teacher model generated structured outputs:
<input snippet> : <Chain-of-thought reasoning> : <final_verdict>
final_verdict: {
    "Label": "True Positive/False Positive",
    "Line Number": <line where secret exists>,
    "Secret Value": <secret value in snippet>
}
This ensured a balanced mix of true/false positives, focusing on key secret types such as API keys, database URLs, and passwords.
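For illustration, a single teacher-labelled record in this shape might look like the following (a made-up example, not an actual entry from the curated set):

# Made-up teacher-labelled record, shown here only to illustrate the format.
example = {
    "input_snippet": '12: DATABASE_URL = "postgres://admin:s3cretPass@db.internal:5432/prod"',
    "reasoning": (
        "Line 12 embeds a password inside a connection string pointing at a "
        "non-placeholder host, so this is a real hard-coded credential."
    ),
    "final_verdict": {
        "Label": "True Positive",
        "Line Number": 12,
        "Secret Value": "postgres://admin:s3cretPass@db.internal:5432/prod",
    },
}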
Evolving the model: Switching to Llama
Initial fine-tuning with Qwen 2.5 1.5B underperformed, so we scaled to Llama3.2-3B-Instruct and rethought our detection pipeline. A comparison of fine-tuned models showed Llama’s superiority:
Metric | Qwen 2.5 1.5B Instruct | Llama3.2-3B-Instruct |
---|---|---|
True Positives (TP) | 199 | 203 |
False Positives (FP) | 33 | 10 |
True Negatives (TN) | 944 | 982 |
False Negatives (FN) | 299 | 291 |
Overall Accuracy | 77.5% | 79.7% |
Precision | 85.8% | 95.3% |
Recall | 39.9% | 41.1% |
F1 Score | 54.6% | 57.3% |
Llama’s 95.3% precision, roughly 70% reduction in false positives (10 vs. 33), and higher overall accuracy made it the clear winner.
Boosting with Synthetic Data
To handle edge cases, we generated synthetic data using:
- Gemini 2.5 Pro for diverse, realistic secret patterns.
- Claude Sonnet 4 for context-appropriate false positives.
This enriched our dataset with varied secret types and challenging scenarios, improving model robustness.
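The exact generation prompts aren't part of this release, but a rough, hypothetical sketch of the kind of instruction we gave the generator models for false-positive cases looks something like this:

# Hypothetical prompt shape for false-positive generation; the actual prompts
# used with Gemini 2.5 Pro and Claude Sonnet 4 are not part of this release.
FALSE_POSITIVE_PROMPT = """\
Write a realistic {language} source file of 20-40 lines that contains a string
which looks like a {secret_type} but is clearly not a live credential, for
example a documented placeholder, a test fixture, or a value in a comment.
Prepend each line with its line number and state which line a regex scanner
would flag."""

print(FALSE_POSITIVE_PROMPT.format(language="Python", secret_type="AWS access key"))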
A hybrid solution: Regex + fine-tuned model for filtering
Our final system combines regex and AI (see the sketch after the lists below):
- Regex quickly identifies potential secrets.
- Fine-tuned Llama3.2-3B-Instruct classifies candidates contextually.
Benefits:
- Speed: Regex handles initial scans efficiently.
- Accuracy: AI reduces false positives with contextual understanding.
- Scalability: The 3B model keeps costs low for production.
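Put together, the two stages look roughly like the sketch below. `classify_with_narada` is assumed to wrap the quickstart inference code, and `SECRET_PATTERNS` stands in for whatever rule set your regex scanner uses; both names are illustrative, not part of the released model.

import re

# Stand-in rule set; in practice this comes from your regex-based scanner.
SECRET_PATTERNS = {
    "github-pat": re.compile(r"ghp_[A-Za-z0-9]{36}"),
    "aws-access-key-id": re.compile(r"AKIA[0-9A-Z]{16}"),
}

def scan_file(source: str):
    """Stage 1: cheap regex pass yielding candidate (line_number, line) pairs."""
    for number, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS.values()):
            yield number, line

def filter_candidates(source: str, classify_with_narada):
    """Stage 2: keep only candidates the model labels as true positives.

    `classify_with_narada(source, line_number)` is assumed to wrap the
    quickstart inference code and return the parsed JSON verdict."""
    findings = []
    for line_number, _ in scan_file(source):
        verdict = classify_with_narada(source, line_number)
        if verdict["label"] == "True Positive":
            findings.append(verdict)
    return findings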
Fine-tuning round #2
We shifted from CoT to deterministic outputs, using ~800 synthetic examples labelled by LLMs, along with our own dataset of false positives (~110 entries). The dataset format included:
{
    "line_number": <regex-detected line>,
    "label": "True Positive/False Positive",
    "secret_value": <value>,
    "reason": <brief teacher reasoning>
}
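For illustration, one chat-formatted training example in this scheme might look like the following (content abbreviated; the input/output wrapping mirrors the quickstart format, while the exact serialisation is an assumption):

# Abbreviated, made-up fine-tuning example in the deterministic output format.
training_example = {
    "messages": [
        {"role": "system", "content": "<system prompt from the model card>"},
        {
            "role": "user",
            "content": "<input>\n# CODE:\n```1: token = \"ghp_...\"```\n\n# LINE NUMBER: 1\n</input>",
        },
        {
            "role": "assistant",
            "content": '<json>{"line_number": 1, "label": "True Positive", '
                       '"secret_value": "ghp_...", "reason": "Hard-coded GitHub token."}</json>',
        },
    ]
}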
Results showed significant improvement:
Metric | Llama3.2-3B-Instruct Base | Llama3.2-3B-Instruct Fine-Tuned |
---|---|---|
Total Processed | 912 | 912 |
True Positives | 369 | 529 |
False Positives | 222 | 16 |
False Negatives | 180 | 20 |
True Negatives | 141 | 347 |
Overall Accuracy | 47.5% | 77.8% |
Recall | 67.2% | 96.3% |
Precision | 62.4% | 97.0% |
F1 Score | 64.7% | 96.7% |
The fine-tuned model improved overall accuracy and recall by roughly 30 percentage points each, and lifted precision from 62.4% to 97.0%, an improvement of almost 35 points over the base model.
Key lessons
- Teacher-student setups generate high-quality training data but can be costly.
- Hybrid systems outperform standalone regex or AI approaches.
- Synthetic data addresses rare and edge-case scenarios effectively.
- Model size matters—3B parameters hit the sweet spot for performance and efficiency.
Conclusion
Our hybrid secret detection system blends regex's speed with the contextual power of a fine-tuned Llama3.2-3B-Instruct model. By leveraging teacher-student training, curated datasets, and synthetic data, we achieved 97% precision in secrets detection. This approach proves that specialised, lightweight models can deliver enterprise-grade results for targeted security tasks.
We're excited for you to try the Narada-3.2-3B-v1 model, which is now available on Hugging Face under the MIT license. If you have any feedback, please let us know by reaching out to us on @autofixbot.
Definitions
Accuracy
Accuracy measures how often the detector is correct overall—both when identifying real secrets and when correctly ignoring non-secrets. It’s calculated as the ratio of all correct predictions (true positives and true negatives) to the total number of predictions.
Precision
Precision measures how many of the items flagged as secrets are actually real secrets. It’s the ratio of true positives to all items flagged as positives, showing how reliable the detections are.
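In code, the reported metrics follow directly from the confusion-matrix counts. Plugging in the fine-tuned Llama column from the Qwen-vs-Llama comparison above reproduces the published figures to within rounding:

def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard classification metrics computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Fine-tuned Llama3.2-3B-Instruct column from the comparison table above.
print(detection_metrics(tp=203, fp=10, tn=982, fn=291))
# -> accuracy ≈ 0.797, precision ≈ 0.953, recall ≈ 0.411, f1 ≈ 0.574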
Footnotes
- Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." Advances in neural information processing systems 35 (2022): 24824-24837. ↩
- Guha, Etash, et al. "OpenThoughts: Data Recipes for Reasoning Models." ArXiv, 2025, https://arxiv.org/abs/2506.04178. Accessed 10 Oct. 2025. ↩
- Samsung CredData, https://github.com/Samsung/CredData ↩