A domain-specific cybersecurity language model built entirely from scratch.
No fine-tuning of existing models. Custom tokenizer, custom transformer architecture, custom data pipeline, pretrained on 5 billion tokens of cybersecurity data.
You: What is SQL injection?
CyberLLM: SQL injection occurs when an attacker inserts malicious SQL into input
fields that get incorporated into database queries, allowing the execution of
arbitrary commands in the application's database. Prevention: use parameterized
queries and prepared statements for user input; proper validation and sanitization
of incoming data; least privilege database roles; and access management controls.
| Metric | Value |
|---|---|
| Parameters | 303.4M |
| Architecture | LLaMA-3 style decoder-only transformer |
| Training Tokens | 5B (1.5 epochs over 3.4B unique tokens) |
| Security Data | 3.2B tokens (90%+ of training mix) |
| Final Loss | 3.80 (pretrain) → 1.28 (SFT) |
| Vocab Size | 32,000 (custom SentencePiece) |
| Context Length | 2,048 tokens |
| Training Hardware | NVIDIA A40 48GB |
| Training Time | ~85 hours (pretrain) + 10 min (SFT) |
| Total Cost | ~$40 |
CyberLLM-350M
├── Token Embedding (32,000 × 1,024)
├── 24 × Transformer Block
│ ├── RMSNorm
│ ├── Grouped Query Attention (16 Q-heads, 4 KV-heads)
│ │ └── RoPE positional encoding
│ ├── RMSNorm
│ └── SwiGLU FFN (1,024 → 2,816 → 1,024)
├── RMSNorm
└── LM Head (1,024 → 32,000, tied weights)
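The 303.4M figure can be cross-checked against the architecture tree with a few lines of arithmetic. The sketch below assumes bias-free linear layers, a head dimension of 1,024 / 16 = 64, and one gain vector per RMSNorm; none of these are stated explicitly above, so treat it as a sanity check rather than a description of `scripts/count_params.py`.

```python
# Parameter count implied by the architecture tree (biases assumed absent).
VOCAB, D_MODEL, N_LAYERS = 32_000, 1_024, 24
N_Q_HEADS, N_KV_HEADS, D_FFN = 16, 4, 2_816
HEAD_DIM = D_MODEL // N_Q_HEADS  # 64, assuming head_dim = d_model / n_q_heads


def count_params() -> int:
    embedding = VOCAB * D_MODEL                       # shared with the LM head (tied weights)
    q_and_out = 2 * D_MODEL * (N_Q_HEADS * HEAD_DIM)  # query and output projections
    k_and_v = 2 * D_MODEL * (N_KV_HEADS * HEAD_DIM)   # key and value projections (4 KV heads)
    ffn = 3 * D_MODEL * D_FFN                         # SwiGLU: gate, up, and down projections
    norms = 2 * D_MODEL                               # two RMSNorm gain vectors per block
    per_block = q_and_out + k_and_v + ffn + norms
    return embedding + N_LAYERS * per_block + D_MODEL  # plus the final RMSNorm


total = count_params()
print(f"{total:,} parameters ({total / 1e6:.1f}M)")
```

Under these assumptions the script prints `303,350,784 parameters (303.4M)`, which agrees with the table. Because the LM head is tied to the embedding, it adds no parameters of its own.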
- Trained a 32,000-token SentencePiece BPE tokenizer on a cybersecurity-weighted corpus
- Includes security-specific vocabulary (CVE IDs, MITRE techniques, tool names)
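For readers new to BPE, the core idea is to repeatedly merge the most frequent adjacent symbol pair in the corpus. The toy implementation below illustrates that loop in plain Python. It is not the SentencePiece algorithm used by `tokenizer/train_tokenizer.py` (which adds normalization, whitespace handling, and many optimizations), and the function names are made up for this sketch.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # On ties, max() keeps the pair that was counted first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words


merges, words = train_bpe("cve cve cve cvss", num_merges=2)
print(merges)
print(words)
```

On this tiny corpus the first merge joins `c` and `v`, and the second joins `cv` and `e`, so `cve` ends up as a single token. Training on a security-weighted corpus has the same effect at scale: frequent domain strings earn their own vocabulary entries.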
Collected and processed 3.4B tokens from 10+ sources:
| Source | Tokens | Description |
|---|---|---|
| Primus-FineWeb | 3.07B | Cyber-filtered web pages (Trend Micro) |
| Primus-Seed | 72M | Curated security content |
| Stack Exchange | 45M | InfoSec, Crypto, RevEng, NetEng Q&A |
| NVD/CVE | 25M | Vulnerability descriptions |
| ArXiv cs.CR | 3.7M | Security research abstracts |
| GitHub Repos | 4M | OWASP, awesome-security, pentest wikis |
| MITRE ATT&CK | 390K | Adversary techniques & procedures |
| NIST SP 800 | 1.2M | Security standards (15 publications) |
| Security RFCs | 1.2M | Network security protocols |
| Security Wikipedia | 3.4M | Security topic articles |
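As a quick consistency check, the per-source counts can be summed and compared with the headline figures. The sketch below only copies numbers from the table; the dictionary and variable names are illustrative. It measures the share of unique tokens, not the effective training mix after any oversampling.

```python
# Token counts in millions, copied from the source table above.
SOURCES_M = {
    "Primus-FineWeb": 3070.0,
    "Primus-Seed": 72.0,
    "Stack Exchange": 45.0,
    "NVD/CVE": 25.0,
    "ArXiv cs.CR": 3.7,
    "GitHub Repos": 4.0,
    "MITRE ATT&CK": 0.39,
    "NIST SP 800": 1.2,
    "Security RFCs": 1.2,
    "Security Wikipedia": 3.4,
}
UNIQUE_TOKENS_M = 3400.0  # the "3.4B unique tokens" headline figure

security_total_m = sum(SOURCES_M.values())
security_share = security_total_m / UNIQUE_TOKENS_M
largest = max(SOURCES_M, key=SOURCES_M.get)

print(f"security sources: {security_total_m / 1000:.2f}B tokens")
print(f"share of unique tokens: {security_share:.1%}")
print(f"largest source: {largest} "
      f"({SOURCES_M[largest] / security_total_m:.1%} of security data)")
```

The listed sources add up to roughly 3.2B tokens, in line with the reported security-data figure and the "90%+" share. The gap to 3.4B is presumably the general data from `01_download.py`, though that is an inference rather than something the table states.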
- Implemented a LLaMA-3 style transformer from scratch in PyTorch
- Grouped Query Attention (GQA) with 16 query heads and 4 KV heads
- SwiGLU activation in feed-forward layers
- RoPE positional encoding
- RMSNorm pre-normalization
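To make the GQA choice concrete, the sketch below shows how 16 query heads could share 4 KV heads and what that does to the KV cache at the full 2,048-token context. It assumes contiguous head groups, a head dimension of 64, and a 2-byte (BF16/FP16) cache; the actual grouping in `model/layers.py` may differ.

```python
N_LAYERS, D_MODEL, CONTEXT = 24, 1_024, 2_048
N_Q_HEADS, N_KV_HEADS = 16, 4
HEAD_DIM = D_MODEL // N_Q_HEADS  # assumed: 64
BYTES_PER_VALUE = 2              # assumed: BF16/FP16 cache entries


def kv_head_for(q_head: int) -> int:
    """Map a query head to the KV head it shares (contiguous groups assumed)."""
    group_size = N_Q_HEADS // N_KV_HEADS
    return q_head // group_size


def kv_cache_bytes(n_kv_heads: int, seq_len: int = CONTEXT) -> int:
    """Bytes needed to cache keys and values for one sequence, all layers."""
    per_token = 2 * N_LAYERS * n_kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * seq_len


print([kv_head_for(h) for h in range(N_Q_HEADS)])
print(kv_cache_bytes(N_KV_HEADS) / 2**20, "MiB with 4 KV heads (GQA)")
print(kv_cache_bytes(N_Q_HEADS) / 2**20, "MiB with 16 KV heads (full MHA)")
```

With these assumptions the cache is 48 MiB per sequence instead of 192 MiB, a 4x reduction that matters most for inference on memory-constrained devices.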
- 5B tokens on NVIDIA A40 48GB GPU
- BF16 mixed precision training
- Cosine learning rate schedule with warmup
- Loss: 9.2 → 3.8 over 19,073 steps
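A cosine schedule with linear warmup can be written in a few lines. In the sketch below only the total step count comes from this README; the warmup length, peak learning rate, and floor are placeholder values, and the real ones live in `configs/model_350m.yaml`.

```python
import math

TOTAL_STEPS = 19_073   # from the training summary above
WARMUP_STEPS = 1_000   # hypothetical value, for illustration only
PEAK_LR = 3e-4         # hypothetical value, for illustration only
MIN_LR = 3e-5          # hypothetical floor, for illustration only


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < WARMUP_STEPS:
        # Linear warmup: reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


for step in (0, 500, 1_000, 10_000, 19_073):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

The warmup keeps early updates small while the optimizer statistics settle, and the cosine tail lowers the rate smoothly so the final steps refine rather than disrupt the weights.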
- 3,750 cybersecurity instruction-response pairs
- Topics: incident response, vulnerability analysis, threat detection, security architecture
- 3 epochs, validation loss: 1.28
- Local Flask app with real-time inference on Apple M4 MPS
- React portfolio demo with pre-generated responses
Python 3.11+
pip install torch sentencepiece pyyaml flask

# Clone repo
git clone https://github.com/Omkarth/CyberLLM.git
cd CyberLLM
# Download trained model from HuggingFace
pip install huggingface_hub
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id='Omkarth/CyberLLM-350M', filename='model.pt', local_dir='checkpoints')
"

python training/chat.py --question "What is SQL injection?"
# Interactive mode
python training/chat.py

python web/app.py
# Open http://localhost:5000

# 1. Train tokenizer
python tokenizer/train_tokenizer.py
# 2. Download security data
python data/pipeline/05_security_data_v2.py --all
# 3. Download general data
python data/pipeline/01_download.py --all
# 4. Clean and filter
python data/pipeline/02_clean_filter.py --skip-dedup --security-oversample 2
# 5. Tokenize into shards
python data/pipeline/03_tokenize_shard.py
# 6. Pretrain (requires GPU — ~50 hours on A40)
python training/pretrain.py --config configs/model_350m.yaml
# 7. SFT
python training/sft.py \
--model checkpoints/final_model/model.pt \
--config configs/model_350m.yaml \
--data data/sft/cybersec_sft_v3.jsonl \
--checkpoint-dir checkpoints/sft \
--epochs 3 --lr 2e-5
# 8. Chat
python training/chat.py

cyberllm/
├── model/
│ ├── config.py # Model configuration dataclass
│ ├── layers.py # RMSNorm, SwiGLU, RoPE, GQA
│ └── transformer.py # Full transformer + generation
├── tokenizer/
│ ├── train_tokenizer.py # SentencePiece BPE training
│ └── cybersec_tokenizer.model
├── data/
│ ├── pipeline/
│ │ ├── 01_download.py # General data downloader
│ │ ├── 02_clean_filter.py # Quality filtering + dedup
│ │ ├── 03_tokenize_shard.py # Tokenization + sharding
│ │ └── 05_security_data_v2.py # Security data pipeline
│ ├── dataloader.py # Training data loader
│ └── sft/
│ └── cybersec_sft_v3.jsonl # SFT instruction pairs
├── training/
│ ├── pretrain.py # Pretraining loop
│ ├── sft.py # Supervised fine-tuning
│ ├── chat.py # CLI chat interface
│ └── train_350m.sh # RunPod deployment script
├── web/
│ └── app.py # Flask web interface
├── configs/
│ ├── model_125m.yaml # 125M config (v1)
│ └── model_350m.yaml # 350M config (v2)
└── scripts/
  ├── count_params.py
  └── verify_environment.py
Step Loss Notes
─────────────────────────────
0 9.24 Training start
500 5.67 Rapid learning
1,000 4.92
3,000 4.37
5,000 4.10 Below the 125M model's final loss (4.70)
8,000 3.95
10,000 3.86
14,000 3.77 Lowest observed: 3.14
16,000 3.85 Lowest observed: 3.04
19,073 3.80 Training complete
─────────────────────────────
SFT 1.28 After instruction tuning
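Loss values are easier to interpret as perplexity, the exponential of the loss. The sketch below assumes the reported numbers are mean per-token cross-entropy in nats, which is the PyTorch default but is not stated in this README.

```python
import math

# Losses copied from the table above; assumed to be mean cross-entropy in nats.
LOSSES = {
    "step 0": 9.24,
    "pretrain final": 3.80,
    "SFT validation": 1.28,
}


def perplexity(loss_nats: float) -> float:
    """Convert a mean per-token cross-entropy (nats) to perplexity."""
    return math.exp(loss_nats)


for name, loss in LOSSES.items():
    print(f"{name:>15}: loss {loss:.2f} -> perplexity {perplexity(loss):,.1f}")
```

A perplexity near 45 at the end of pretraining means the model is, on average, about as uncertain as a uniform choice among 45 tokens. The SFT figure is measured on instruction data rather than the pretraining corpus, so the two numbers are not directly comparable.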
- Small model: at roughly 300M parameters, it handles common security topics well but struggles with niche technical details
- Repetition: it sometimes repeats phrases or concepts
- Hallucination: it may generate plausible but incorrect technical details
- Not a security tool: this is a research and educational project, not a production security advisor
- Data quality > model size: Primus-FineWeb's roughly 3B tokens of cyber-filtered data made more difference than model architecture tweaks
- SFT transforms base models: the base model produced semi-coherent text; SFT made it actually useful
- Infrastructure matters: getting mixed precision (AMP) to run in BF16 rather than FP32 was the difference between 16K and 91K tokens/sec
- Small models have limits: 350M can learn domain vocabulary and common patterns but can't reliably recall specific technical procedures
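To see why the throughput fix mattered, it helps to convert tokens per second into wall-clock time for a 5B-token run. The sketch below assumes a constant throughput for the entire run, which real training does not have (checkpointing, evaluation, and restarts all add time), so the results are rough bounds rather than reported training times.

```python
TOTAL_TOKENS = 5_000_000_000  # pretraining budget stated in this README


def hours_to_train(tokens_per_sec: float, total_tokens: int = TOTAL_TOKENS) -> float:
    """Wall-clock hours at a constant throughput (an idealized assumption)."""
    return total_tokens / tokens_per_sec / 3600


slow = hours_to_train(16_000)  # throughput before the AMP fix
fast = hours_to_train(91_000)  # throughput after the AMP fix
print(f"{slow:.1f} h at 16K tok/s, {fast:.1f} h at 91K tok/s, "
      f"{slow / fast:.1f}x faster")
```

At rented-GPU prices billed by the hour, a speedup of this size translates directly into cost, which is why profiling throughput early is worth the effort.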
- Framework: PyTorch
- Tokenizer: SentencePiece
- Training: RunPod (NVIDIA A40 48GB)
- Inference: Apple M4 MPS / CUDA
- Web: Flask + vanilla HTML/CSS/JS
- Data: HuggingFace Datasets, ArXiv API, NIST, MITRE
Omkar Thombre
Master of Computer Science, University of Adelaide
Portfolio · LinkedIn · GitHub
This project is for educational and research purposes. The trained model weights are available on HuggingFace.
