Unveiling Narada
We created a hybrid secret detection system that pairs regex for rapid scanning with a fine-tuned Llama3.2-3B-Instruct model, achieving 97.0% precision. Starting with a teacher-student setup inspired by chain-of-thought reasoning, we iterated through base models, dataset challenges, and synthetic data to deliver a fast, accurate solution ready for real-world codebases.
Detecting sensitive secrets like API keys or credentials in code is critical but tricky. Traditional regex-based tools excel at spotting potential secrets but often overwhelm developers with false positives, flagging harmless strings like "your-api-key-here" as threats. This noise frustrates developers, erodes trust in the tool, and hurts adoption.
Our goal was to build a practical system that combines the speed of regular expressions with the contextual intelligence of language models, delivering high recall and precision without frustrating developers.
Limits of Regex
Regex detectors face a tradeoff:
- High sensitivity catches more secrets but floods users with false positives.
- Low sensitivity reduces noise but misses real risks.
Regex excels at pattern matching but lacks the ability to understand context, such as distinguishing live credentials from placeholders or examples. We aimed to enhance regex with a language model to add nuance while maintaining efficiency.
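To make the tradeoff concrete, here is a minimal sketch of a regex-only scanner. The patterns are illustrative, not our production rule set; note how a placeholder value matches just as readily as a live key:

```python
import re

# Illustrative patterns only -- not the actual production rule set.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][^'\"]{8,}['\"]"),
}

def regex_scan(text):
    """Return (line_number, rule_name, matched_text) for every candidate."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            m = pattern.search(line)
            if m:
                hits.append((lineno, name, m.group(0)))
    return hits

# A harmless placeholder still matches -- the false-positive problem:
print(regex_scan('API_KEY = "your-api-key-here"'))
```

The scanner has no way to tell this placeholder from a real credential; that contextual judgement is exactly what the language model adds.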
Our Initial Approach: Teacher-Student Architecture
We started with a teacher-student model inspired by chain-of-thought (CoT) reasoning, where a powerful "teacher" model generates detailed explanations to train a lightweight "student" model for deployment.
Picking the Right Base Model
We selected the following compact models (~1-1.5B parameters) to balance performance and speed:
- Qwen2.5-1.5B-Instruct
- Llama3.2-1B-Instruct
- Gemma2-1B-Instruct
Using CoT prompting (inspired by Google Brain’s reasoning research), we evaluated performance on 24 code files, feeding snippets with ±20 lines around potential secrets to mimic real-world scenarios. Here’s how they performed:
Model | Technique | Input Context | Time (s) | Accuracy |
---|---|---|---|---|
Qwen2.5-1.5B-Instruct | Chain-of-Thought | ±20 Lines | 292 | 54.1% |
Llama3.2-1B-Instruct | Chain-of-Thought | ±20 Lines | 168 | 37.0% |
While experimenting with Gemma2-1B-Instruct, we found that it often produced random, irrelevant responses during inference and failed to follow the system prompt instructions, so we excluded it from further evaluation.
Qwen 2.5 led in accuracy, so we proceeded with it for initial detection and labelling. To ensure accurate line numbers, we built a pipeline to preserve original line references when processing snippets.
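A minimal sketch of that windowing step, assuming the detector reports an absolute line number; the helper and its naming are illustrative, not our exact pipeline:

```python
def build_snippet(file_lines, hit_line, window=20):
    """Extract ±window lines around a regex hit, prefixing each line with
    its original 1-based number so the model's verdict can reference it."""
    start = max(1, hit_line - window)
    end = min(len(file_lines), hit_line + window)
    # Prefix each line with its original number to preserve references.
    numbered = [f"{n}: {file_lines[n - 1]}" for n in range(start, end + 1)]
    return "\n".join(numbered)

# A 100-line file with a hit on line 50 yields a snippet of lines 30-70,
# each still carrying its true line number.
lines = [f"code line {i}" for i in range(1, 101)]
snippet = build_snippet(lines, hit_line=50)
```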
Teacher Model Trials
Inspired by the OpenThoughts paper on data recipes for reasoning models, we had a larger teacher model generate reasoning chains to guide the student. We trialled three candidates:
- Qwen QwQ 32B
- ChatGPT o1-mini
- DeepSeek R1
DeepSeek R1 stood out for consistency, accuracy, and cost-effectiveness, making it our choice for generating training data.
Dataset Challenges: Laying a Strong Foundation
High-quality data is critical for secret detection, but sourcing diverse, reliable datasets proved challenging. Secrets appear in varied formats across programming languages, configs, and documentation. We initially used Samsung’s CredData but found its labels inconsistent. To address this, we employed LLMs to relabel data, using the teacher model to produce structured outputs with CoT reasoning.
Curating CredData
From CredData’s millions of entries, we curated ~8,000 diverse secrets, filtering out repetitive entries (e.g., redundant RSA keys) to reduce bias. The teacher model generated structured outputs:
```
<input snippet> : <chain-of-thought reasoning> : <final_verdict>

final_verdict: {
    "Label": "True Positive/False Positive",
    "Line Number": <line where secret exists>,
    "Secret Value": <secret value in snippet>
}
```
This ensured a balanced mix of true/false positives, focusing on key secret types such as API keys, database URLs, and passwords.
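For illustration, a small parser for the final_verdict portion of the teacher's output, assuming the teacher emits valid JSON; the normalised field names are our own for this sketch:

```python
import json

# A hypothetical teacher verdict for a placeholder value:
raw_verdict = """{
  "Label": "False Positive",
  "Line Number": 12,
  "Secret Value": "your-api-key-here"
}"""

def parse_verdict(raw):
    """Normalise the teacher's final_verdict into a training label."""
    v = json.loads(raw)
    return {
        "is_secret": v["Label"] == "True Positive",
        "line": v["Line Number"],
        "value": v["Secret Value"],
    }

print(parse_verdict(raw_verdict))
```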
Evolving the Model: Switching to Llama
Initial fine-tuning with Qwen 2.5 1.5B underperformed, so we scaled to Llama3.2-3B-Instruct and rethought our detection pipeline. A comparison of fine-tuned models showed Llama’s superiority:
Metric | Qwen2.5-1.5B-Instruct | Llama3.2-3B-Instruct |
---|---|---|
True Positives (TP) | 199 | 203 |
False Positives (FP) | 33 | 10 |
True Negatives (TN) | 944 | 982 |
False Negatives (FN) | 299 | 291 |
Overall Accuracy | 77.5% | 79.7% |
Precision | 85.8% | 95.3% |
Recall | 39.9% | 41.1% |
F1 Score | 54.6% | 57.3% |
Llama’s 95.3% precision, its 70% reduction in false positives (10 vs. 33), and its higher overall accuracy made it the clear winner.
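For reference, the metrics in these tables follow the standard confusion-matrix definitions. A small helper, run here on illustrative counts rather than the evaluation numbers above:

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from a confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only (not taken from the tables above):
acc, prec, rec, f1 = classification_metrics(tp=80, fp=20, tn=80, fn=20)
# With these balanced counts every metric comes out to 0.8.
```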
Boosting with Synthetic Data
To handle edge cases, we generated synthetic data using:
- Gemini 2.5 Pro for diverse, realistic secret patterns.
- Claude Sonnet 4 for context-appropriate false positives.
This enriched our dataset with varied secret types and challenging scenarios, improving model robustness.
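Hypothetical examples of the two synthetic categories, invented here for illustration rather than taken from the generated data:

```python
# Invented examples of the two synthetic categories (not real generated data):
synthetic_examples = [
    {   # realistic-looking secret pattern (target label: True Positive)
        "snippet": 'db_url = "postgres://admin:S3cr3tPass@db.internal:5432/prod"',
        "label": "True Positive",
    },
    {   # context-appropriate false positive (target label: False Positive)
        "snippet": 'token = os.environ["GITHUB_TOKEN"]  # read from env, no literal',
        "label": "False Positive",
    },
]
```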
The Hybrid Solution: Regex + Fine-Tuned Model
Our final system combines regex and AI:
- Regex quickly identifies potential secrets.
- Fine-tuned Llama3.2-3B-Instruct classifies candidates contextually.
Benefits:
- Speed: Regex handles initial scans efficiently.
- Accuracy: AI reduces false positives with contextual understanding.
- Scalability: The 3B model keeps costs low for production.
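A minimal sketch of the two-stage flow, with a stub standing in for the fine-tuned model; the candidate pattern and the classifier interface are illustrative assumptions, not the production implementation:

```python
import re

# Illustrative candidate pattern -- not the production rule set.
CANDIDATE = re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][^'\"]{8,}['\"]")

def hybrid_scan(text, classify, window=20):
    """Two-stage scan: regex proposes candidates, the model confirms them.

    `classify(snippet, hit_line)` stands in for the fine-tuned
    Llama3.2-3B-Instruct call (its interface here is a hypothetical
    sketch) and must return "True Positive" or "False Positive".
    """
    lines = text.splitlines()
    confirmed = []
    for lineno, line in enumerate(lines, start=1):
        if CANDIDATE.search(line):                          # fast regex pass
            lo = max(0, lineno - 1 - window)
            snippet = "\n".join(lines[lo:lineno + window])  # ±window context
            if classify(snippet, line) == "True Positive":  # contextual filter
                confirmed.append((lineno, line.strip()))
    return confirmed

# Stub standing in for the model: reject obvious placeholder values.
def stub_classify(snippet, hit_line):
    return "False Positive" if "your-api-key" in hit_line else "True Positive"

code = 'api_key = "your-api-key-here"\nAPI_KEY = "9f8e7d6c5b4a3210"'
print(hybrid_scan(code, stub_classify))  # only the real-looking key survives
```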
Second Round of Fine-Tuning
We shifted from CoT to deterministic outputs, using ~800 synthetic examples labelled by LLMs along with our own dataset of ~110 false-positive entries. Each record followed this format:
```
{
    "line_number": <regex-detected line>,
    "label": "True Positive/False Positive",
    "secret_value": <value>,
    "reason": <brief teacher reasoning>
}
```
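As a sketch, records in this format can be validated before training, assuming each LLM-labelled example arrives as a JSON string; the validator itself is illustrative:

```python
import json

REQUIRED = {"line_number", "label", "secret_value", "reason"}
LABELS = {"True Positive", "False Positive"}

def validate_record(raw):
    """Check one LLM-labelled example against the deterministic schema."""
    rec = json.loads(raw)
    if set(rec) != REQUIRED:
        raise ValueError(f"unexpected fields: {sorted(set(rec) ^ REQUIRED)}")
    if rec["label"] not in LABELS:
        raise ValueError(f"bad label: {rec['label']}")
    return rec

# A hypothetical well-formed training record:
example = json.dumps({
    "line_number": 7,
    "label": "False Positive",
    "secret_value": "example-password",
    "reason": "Value is a documentation placeholder, not a live credential.",
})
print(validate_record(example)["label"])
```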
Results showed significant improvement:
Metric | Llama3.2-3B-Instruct Base | Llama3.2-3B-Instruct Fine-Tuned |
---|---|---|
Total Processed | 912 | 912 |
True Positives | 369 | 529 |
False Positives | 222 | 16 |
False Negatives | 180 | 20 |
True Negatives | 141 | 347 |
Overall Accuracy | 47.5% | 77.8% |
Recall | 67.2% | 96.3% |
Precision | 62.4% | 97.0% |
F1 Score | 64.7% | 96.7% |
Fine-tuning lifted overall accuracy by roughly 30 percentage points and recall by about 29 points, while raising precision from 62.4% to 97.0%.
Key Lessons
- Teacher-student setups generate high-quality training data but can be costly.
- Hybrid systems outperform standalone regex or AI approaches.
- Synthetic data addresses rare and edge-case scenarios effectively.
- Model size matters: 3B parameters hit the sweet spot for performance and efficiency.
Next Steps
To further enhance the system:
- Implement active learning to incorporate production feedback.
- Integrate with security tools for seamless workflows.
Conclusion
Our hybrid secret detection system blends regex's speed with the contextual power of a fine-tuned Llama3.2-3B-Instruct model. By leveraging teacher-student training, curated datasets, and synthetic data, we achieved 97.0% precision and cut false positives from 222 to 16. This approach proves that specialised, lightweight models can deliver enterprise-grade results for targeted security tasks.
The Narada-3.2-3B-v1 model is now available on Hugging Face.