🛡️ CyberLLM

A domain-specific cybersecurity language model built entirely from scratch.

No fine-tuning of existing models: the tokenizer, the transformer architecture, and the data pipeline are all custom. The model is pretrained on 5 billion tokens, more than 90% of them cybersecurity data.

Results

You: What is SQL injection?

CyberLLM: SQL injection occurs when an attacker inserts malicious SQL into input
fields that get incorporated into database queries, allowing the execution of
arbitrary commands in the application's database. Prevention: use parameterized
queries and prepared statements for user input; proper validation and sanitization
of incoming data; least privilege database roles; and access management controls.

Model Card


Parameters	303.4M
Architecture	LLaMA-3 style decoder-only transformer
Training Tokens	5B (1.5 epochs over 3.4B unique tokens)
Security Data	3.2B tokens (90%+ of training mix)
Final Loss	3.80 (pretrain) → 1.28 (SFT)
Vocab Size	32,000 (custom SentencePiece)
Context Length	2,048 tokens
Training Hardware	NVIDIA A40 48GB
Training Time	~85 hours (pretrain) + 10 min (SFT)
Total Cost	~$40

Architecture

CyberLLM-350M
├── Token Embedding (32,000 × 1,024)
├── 24 × Transformer Block
│   ├── RMSNorm
│   ├── Grouped Query Attention (16 Q-heads, 4 KV-heads)
│   │   └── RoPE positional encoding
│   ├── RMSNorm
│   └── SwiGLU FFN (1,024 → 2,816 → 1,024)
├── RMSNorm
└── LM Head (1,024 → 32,000, tied weights)
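
As a sanity check on the diagram, the parameter count can be tallied by hand. The sketch below assumes bias-free linear layers, a head dimension of 1,024 / 16 = 64, and one weight vector per RMSNorm. None of these are stated above, but together they reproduce the 303.4M figure in the Model Card.

```python
# Parameter tally for the architecture diagram.
# Assumptions (not stated in the diagram): no bias terms,
# head_dim = d_model // n_q_heads, one weight vector per RMSNorm.
def count_params(vocab=32_000, d_model=1_024, n_layers=24,
                 n_q_heads=16, n_kv_heads=4, d_ff=2_816):
    head_dim = d_model // n_q_heads                        # 64
    kv_dim = n_kv_heads * head_dim                         # 256, shared across query groups
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim    # Wq and Wo, plus Wk and Wv
    ffn = 3 * d_model * d_ff                               # SwiGLU: gate, up, down
    norms = 2 * d_model                                    # two RMSNorms per block
    block = attn + ffn + norms
    embedding = vocab * d_model                            # counted once (tied with LM head)
    final_norm = d_model
    return embedding + n_layers * block + final_norm


total = count_params()
print(f"{total:,} parameters ({total / 1e6:.1f}M)")  # 303,350,784 parameters (303.4M)
```

Because the LM head shares the embedding matrix, the 32.8M embedding parameters are counted only once.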

What I Built (End-to-End)

1. Custom Tokenizer

Trained a 32,000-token SentencePiece BPE tokenizer on a cybersecurity-weighted corpus
Includes security-specific vocabulary (CVE IDs, MITRE techniques, tool names)
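
A tokenizer like this could be trained roughly as follows. This is a sketch, not the contents of `tokenizer/train_tokenizer.py`: the corpus path and the `user_defined_symbols` entries are hypothetical, and the security vocabulary may equally have come from corpus weighting alone. Only the 32,000 vocabulary size, the BPE model type, and the `cybersec_tokenizer` file name come from this README.

```python
# Sketch of a SentencePiece BPE training setup.
# corpus_path and the user_defined_symbols list are hypothetical placeholders.
def build_trainer_args(corpus_path="data/tokenizer_corpus.txt",
                       model_prefix="cybersec_tokenizer"):
    return dict(
        input=corpus_path,            # one sentence or document per line
        model_prefix=model_prefix,    # writes <prefix>.model and <prefix>.vocab
        vocab_size=32_000,
        model_type="bpe",
        character_coverage=1.0,
        byte_fallback=True,           # unseen characters fall back to byte pieces
        user_defined_symbols=["CVE-", "ATT&CK"],  # illustrative only
    )


def train_tokenizer(**overrides):
    # Imported lazily so the configuration can be inspected without the package.
    import sentencepiece as spm
    args = {**build_trainer_args(), **overrides}
    spm.SentencePieceTrainer.train(**args)
    return args["model_prefix"] + ".model"


print(build_trainer_args()["vocab_size"])
```

Calling `train_tokenizer()` needs a real corpus at the given path that is large enough to support a 32,000-piece vocabulary.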

2. Data Pipeline

Collected and processed 3.4B unique tokens from 10+ sources. The security sources listed below account for roughly 3.2B of them:

Source	Tokens	Description
Primus-FineWeb	3.07B	Cyber-filtered web pages (Trend Micro)
Primus-Seed	72M	Curated security content
Stack Exchange	45M	InfoSec, Crypto, RevEng, NetEng Q&A
NVD/CVE	25M	Vulnerability descriptions
ArXiv cs.CR	3.7M	Security research abstracts
GitHub Repos	4M	OWASP, awesome-security, pentest wikis
MITRE ATT&CK	390K	Adversary techniques & procedures
NIST SP 800	1.2M	Security standards (15 publications)
Security RFCs	1.2M	Network security protocols
Security Wikipedia	3.4M	Security topic articles
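
The table can be cross-checked against the Model Card with a few lines of arithmetic. The counts below are copied from the table, in millions of tokens. The 3.4B denominator is the "unique tokens" figure from the Model Card.

```python
# Token counts in millions, copied from the source table above.
SOURCES_M = {
    "Primus-FineWeb": 3070, "Primus-Seed": 72, "Stack Exchange": 45,
    "NVD/CVE": 25, "ArXiv cs.CR": 3.7, "GitHub Repos": 4,
    "MITRE ATT&CK": 0.39, "NIST SP 800": 1.2, "Security RFCs": 1.2,
    "Security Wikipedia": 3.4,
}
UNIQUE_TOKENS_M = 3400  # "3.4B unique tokens" from the Model Card

security_total = sum(SOURCES_M.values())
share = security_total / UNIQUE_TOKENS_M
fineweb_share = SOURCES_M["Primus-FineWeb"] / security_total

print(f"security sources: {security_total / 1000:.2f}B tokens")
print(f"share of unique tokens: {share:.0%}")
print(f"Primus-FineWeb share of security data: {fineweb_share:.0%}")
```

The listed sources sum to about 3.2B tokens, which matches the "Security Data" row and the "90%+" claim. Primus-FineWeb dominates the mix.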

3. Transformer Architecture

Implemented a LLaMA-3 style transformer from scratch in PyTorch
Grouped Query Attention (GQA) with 16 query heads and 4 KV heads
SwiGLU activation in feed-forward layers
RoPE positional encoding
RMSNorm pre-normalization
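
The building blocks listed above can be sketched in PyTorch as follows. This is an illustrative sketch, not the code in `model/layers.py`: the class names, the `eps` value, and the bias-free linear layers are assumptions, and RoPE is left out to keep it short. It relies on `F.scaled_dot_product_attention`, which requires PyTorch 2.0 or later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Scale activations by their root mean square (no mean-centering, no bias)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


class SwiGLU(nn.Module):
    """Gated feed-forward layer: down(silu(gate(x)) * up(x))."""

    def __init__(self, dim: int = 1024, hidden: int = 2816):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))


class GroupedQueryAttention(nn.Module):
    """Causal self-attention where each KV head serves a group of query heads."""

    def __init__(self, dim: int = 1024, n_q_heads: int = 16, n_kv_heads: int = 4):
        super().__init__()
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.head_dim = dim // n_q_heads
        self.wq = nn.Linear(dim, n_q_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_q_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        bsz, seq, _ = x.shape
        q = self.wq(x).view(bsz, seq, self.n_q, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(bsz, seq, self.n_kv, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(bsz, seq, self.n_kv, self.head_dim).transpose(1, 2)
        # RoPE would rotate q and k here; omitted in this sketch.
        group = self.n_q // self.n_kv              # 4 query heads per KV head
        k = k.repeat_interleave(group, dim=1)      # expand KV heads to match Q heads
        v = v.repeat_interleave(group, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(bsz, seq, -1))


x = torch.randn(2, 8, 1024)                          # (batch, sequence, d_model)
h = x + GroupedQueryAttention()(RMSNorm(1024)(x))    # pre-norm residual: attention
h = h + SwiGLU()(RMSNorm(1024)(h))                   # pre-norm residual: feed-forward
print(h.shape)
```

With 4 KV heads instead of 16, the key and value projections are a quarter of the size of the query projection, which also shrinks the KV cache at inference time.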

4. Pretraining

5B tokens on NVIDIA A40 48GB GPU
BF16 mixed precision training
Cosine learning rate schedule with warmup
Loss: 9.2 → 3.8 over 19,073 steps
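
Two of the points above lend themselves to a short sketch. First, 5B tokens over 19,073 steps implies about 262K tokens per step, which is consistent with 128 sequences of 2,048 tokens. That batch size is inferred from the arithmetic, not stated in this README. Second, a cosine schedule with linear warmup looks like the function below. The warmup length, peak rate, and floor are placeholder values, not the ones used for this run.

```python
import math

TOTAL_TOKENS = 5_000_000_000
TOTAL_STEPS = 19_073
CONTEXT_LEN = 2_048

tokens_per_step = TOTAL_TOKENS / TOTAL_STEPS
seqs_per_step = round(tokens_per_step / CONTEXT_LEN)  # inferred, not documented
print(f"~{tokens_per_step:,.0f} tokens/step, ~{seqs_per_step} sequences/step")


def lr_at(step, total_steps=TOTAL_STEPS, warmup_steps=1_000,
          peak_lr=3e-4, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay to min_lr (placeholder values)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 1_000, 10_000, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```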

5. Supervised Fine-Tuning (SFT)

3,750 cybersecurity instruction-response pairs
Topics: incident response, vulnerability analysis, threat detection, security architecture
3 epochs, validation loss: 1.28
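
A common way to train on instruction-response pairs is to compute the loss only on the response tokens. Whether `training/sft.py` does this is not stated here, so treat the following as a sketch of the general technique. The function name and the toy token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_example(prompt_ids, response_ids, eos_id, max_len=2048):
    """Concatenate prompt and response; mask the prompt out of the loss."""
    input_ids = (prompt_ids + response_ids + [eos_id])[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id])[:max_len]
    return input_ids, labels


# Toy IDs standing in for a tokenized question and answer.
input_ids, labels = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=2)
print(input_ids)  # [11, 12, 13, 21, 22, 2]
print(labels)     # [-100, -100, -100, 21, 22, 2]
```

The usual one-position shift between inputs and targets is applied later, when the loss is computed.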

6. Web Interface

Local Flask app with real-time inference on Apple M4 MPS
React portfolio demo with pre-generated responses

Quick Start

Prerequisites

# Requires Python 3.11+
pip install torch sentencepiece pyyaml flask

Download Model

# Clone repo
git clone https://github.com/Omkarth/CyberLLM.git
cd CyberLLM

# Download trained model from HuggingFace
pip install huggingface_hub
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id='Omkarth/CyberLLM-350M', filename='model.pt', local_dir='checkpoints')
"

Chat (CLI)

python training/chat.py --question "What is SQL injection?"

# Interactive mode
python training/chat.py

Web Interface

python web/app.py
# Open http://localhost:5000

Training Pipeline

Reproduce from Scratch

# 1. Train tokenizer
python tokenizer/train_tokenizer.py

# 2. Download security data
python data/pipeline/05_security_data_v2.py --all

# 3. Download general data
python data/pipeline/01_download.py --all

# 4. Clean and filter
python data/pipeline/02_clean_filter.py --skip-dedup --security-oversample 2

# 5. Tokenize into shards
python data/pipeline/03_tokenize_shard.py

# 6. Pretrain (requires GPU — ~50 hours on A40)
python training/pretrain.py --config configs/model_350m.yaml

# 7. SFT
python training/sft.py \
  --model checkpoints/final_model/model.pt \
  --config configs/model_350m.yaml \
  --data data/sft/cybersec_sft_v3.jsonl \
  --checkpoint-dir checkpoints/sft \
  --epochs 3 --lr 2e-5

# 8. Chat
python training/chat.py

Project Structure

cyberllm/
├── model/
│   ├── config.py              # Model configuration dataclass
│   ├── layers.py              # RMSNorm, SwiGLU, RoPE, GQA
│   └── transformer.py         # Full transformer + generation
├── tokenizer/
│   ├── train_tokenizer.py     # SentencePiece BPE training
│   └── cybersec_tokenizer.model
├── data/
│   ├── pipeline/
│   │   ├── 01_download.py     # General data downloader
│   │   ├── 02_clean_filter.py # Quality filtering + dedup
│   │   ├── 03_tokenize_shard.py # Tokenization + sharding
│   │   └── 05_security_data_v2.py # Security data pipeline
│   ├── dataloader.py          # Training data loader
│   └── sft/
│       └── cybersec_sft_v3.jsonl # SFT instruction pairs
├── training/
│   ├── pretrain.py            # Pretraining loop
│   ├── sft.py                 # Supervised fine-tuning
│   ├── chat.py                # CLI chat interface
│   └── train_350m.sh          # RunPod deployment script
├── web/
│   └── app.py                 # Flask web interface
├── configs/
│   ├── model_125m.yaml        # 125M config (v1)
│   └── model_350m.yaml        # 350M config (v2)
└── scripts/
    ├── count_params.py
    └── verify_environment.py

Training Loss Curve

Step      Loss    Notes
─────────────────────────────
0         9.24    Training start
500       5.67    Rapid learning
1,000     4.92    
3,000     4.37    
5,000     4.10    Below the 125M model's final loss (4.70)
8,000     3.95    
10,000    3.86    
14,000    3.77    Best low: 3.14
16,000    3.85    Best low: 3.04
19,073    3.80    Training complete
─────────────────────────────
SFT       1.28    After instruction tuning
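
Loss values are easier to interpret as perplexity. The sketch below assumes the reported losses are mean cross-entropy in nats per token, which is the PyTorch default but is not stated above. The SFT figure is measured on instruction data, so it is not directly comparable with the pretraining numbers.

```python
import math


def perplexity(loss_nats: float) -> float:
    """Perplexity of a mean per-token cross-entropy loss given in nats."""
    return math.exp(loss_nats)


for name, loss in [("pretrain start", 9.24), ("pretrain end", 3.80), ("after SFT", 1.28)]:
    print(f"{name:15s} loss {loss:.2f} -> perplexity {perplexity(loss):,.1f}")

# Reference point: a uniform guess over the 32,000-token vocabulary.
print(f"uniform baseline loss: {math.log(32_000):.2f}")
```

A pretraining loss of 3.80 corresponds to a perplexity of roughly 45, meaning the model is on average about as uncertain as a choice among 45 equally likely tokens.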

Limitations

350M parameters is small — the model handles common security topics well but struggles with niche technical details
Repetition — can sometimes repeat phrases or concepts
Hallucination — may generate plausible but incorrect technical details
Not a security tool — this is a research/educational project, not a production security advisor

What I Learned

Data quality > model size — Primus-FineWeb's 3B tokens of cyber-filtered data made more difference than model architecture tweaks
SFT transforms base models — the base model produced semi-coherent text; SFT made it actually useful
Infrastructure matters — debugging AMP (FP32 vs BF16) was the difference between 16K and 91K tokens/sec
Small models have limits — 350M can learn domain vocabulary and common patterns but can't reliably recall specific technical procedures
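
To put the throughput numbers from the AMP point in perspective, the following back-of-the-envelope calculation converts tokens per second into wall-clock hours for a 5B-token run. It assumes the rate stays constant for the whole run and ignores evaluation and checkpointing overhead.

```python
def hours_for(tokens: float, tokens_per_sec: float) -> float:
    """Wall-clock hours to process `tokens` at a constant throughput."""
    return tokens / tokens_per_sec / 3600


for rate in (16_000, 91_000):
    print(f"{rate:>6,} tok/s -> {hours_for(5e9, rate):5.1f} hours for 5B tokens")
```

At 16K tokens/sec the run would take close to 87 hours, and at 91K tokens/sec about 15 hours.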

Tech Stack

Framework: PyTorch
Tokenizer: SentencePiece
Training: RunPod (NVIDIA A40 48GB)
Inference: Apple M4 MPS / CUDA
Web: Flask + vanilla HTML/CSS/JS
Data: HuggingFace Datasets, ArXiv API, NIST, MITRE

Author

Omkar Thombre
Master of Computer Science, University of Adelaide
Portfolio · LinkedIn · GitHub

License

This project is for educational and research purposes. The trained model weights are available on HuggingFace.
