🛡️ CyberLLM

A domain-specific cybersecurity language model built entirely from scratch.

No fine-tuning of existing models. Custom tokenizer, custom transformer architecture, custom data pipeline, pretrained on 5 billion tokens of cybersecurity data.

Results

You: What is SQL injection?

CyberLLM: SQL injection occurs when an attacker inserts malicious SQL into input
fields that get incorporated into database queries, allowing the execution of
arbitrary commands in the application's database. Prevention: use parameterized
queries and prepared statements for user input; proper validation and sanitization
of incoming data; least privilege database roles; and access management controls.

Model Card

Parameters         303.4M
Architecture       LLaMA-3 style decoder-only transformer
Training Tokens    5B (1.5 epochs over 3.4B unique tokens)
Security Data      3.2B tokens (90%+ of training mix)
Final Loss         3.80 (pretrain) → 1.28 (SFT)
Vocab Size         32,000 (custom SentencePiece)
Context Length     2,048 tokens
Training Hardware  NVIDIA A40 48GB
Training Time      ~85 hours (pretrain) + 10 min (SFT)
Total Cost         ~$40

Architecture

CyberLLM-350M
├── Token Embedding (32,000 × 1,024)
├── 24 × Transformer Block
│   ├── RMSNorm
│   ├── Grouped Query Attention (16 Q-heads, 4 KV-heads)
│   │   └── RoPE positional encoding
│   ├── RMSNorm
│   └── SwiGLU FFN (1,024 → 2,816 → 1,024)
├── RMSNorm
└── LM Head (1,024 → 32,000, tied weights)
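
The 303.4M figure in the model card can be cross-checked against the layer shapes in the tree above. The sketch below is a back-of-the-envelope count, not the repository's scripts/count_params.py. It assumes bias-free linear layers (the usual LLaMA-style choice), one gain vector per RMSNorm, and an LM head that shares the embedding matrix; the function name is made up for illustration.

```python
def count_parameters(vocab=32_000, dim=1_024, n_layers=24,
                     n_q_heads=16, n_kv_heads=4, ffn_dim=2_816):
    """Estimate parameters for the architecture tree above (assumes no biases)."""
    head_dim = dim // n_q_heads              # 64 channels per head
    kv_dim = n_kv_heads * head_dim           # 256: K and V are narrower under GQA

    embedding = vocab * dim                  # reused by the tied LM head
    attention = 2 * dim * dim + 2 * dim * kv_dim   # Wq and Wo, plus Wk and Wv
    ffn = 3 * dim * ffn_dim                  # SwiGLU: gate, up and down projections
    norms = 2 * dim                          # two RMSNorm gain vectors per block
    per_block = attention + ffn + norms

    final_norm = dim
    return embedding + n_layers * per_block + final_norm


total = count_parameters()
print(f"{total:,} parameters ({total / 1e6:.1f}M)")
# → 303,350,784 parameters (303.4M)
```

Under these assumptions the count lands on the model card's 303.4M, so "350M" is best read as the name of the config rather than the exact size.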

What I Built (End-to-End)

1. Custom Tokenizer

  • Trained a 32,000-token SentencePiece BPE tokenizer on a cybersecurity-weighted corpus
  • Includes security-specific vocabulary (CVE IDs, MITRE techniques, tool names)

2. Data Pipeline

Collected and processed 3.4B tokens from 10+ sources:

Source              Tokens  Description
──────────────────────────────────────────────────────────────
Primus-FineWeb      3.07B   Cyber-filtered web pages (Trend Micro)
Primus-Seed         72M     Curated security content
Stack Exchange      45M     InfoSec, Crypto, RevEng, NetEng Q&A
NVD/CVE             25M     Vulnerability descriptions
ArXiv cs.CR         3.7M    Security research abstracts
GitHub Repos        4M      OWASP, awesome-security, pentest wikis
MITRE ATT&CK        390K    Adversary techniques & procedures
NIST SP 800         1.2M    Security standards (15 publications)
Security RFCs       1.2M    Network security protocols
Security Wikipedia  3.4M    Security topic articles
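
As a quick consistency check, the per-source counts above can be summed and compared with the 3.4B unique-token total. This is plain arithmetic on the published numbers; the dictionary and variable names are illustrative and do not come from the pipeline code. The share is computed over unique tokens, before any oversampling applied at filtering time.

```python
# Token counts from the table above, in millions.
SECURITY_SOURCES_M = {
    "Primus-FineWeb": 3070.0,
    "Primus-Seed": 72.0,
    "Stack Exchange": 45.0,
    "NVD/CVE": 25.0,
    "ArXiv cs.CR": 3.7,
    "GitHub Repos": 4.0,
    "MITRE ATT&CK": 0.39,
    "NIST SP 800": 1.2,
    "Security RFCs": 1.2,
    "Security Wikipedia": 3.4,
}
UNIQUE_TOTAL_M = 3400.0  # "3.4B unique tokens" from the model card

security_total_m = sum(SECURITY_SOURCES_M.values())
security_share = security_total_m / UNIQUE_TOTAL_M

print(f"security tokens: {security_total_m / 1000:.2f}B")
print(f"share of unique tokens: {security_share:.1%}")
```

The listed sources add up to roughly 3.2B tokens, in line with the model card's "Security Data" row.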

3. Transformer Architecture

  • Implemented a LLaMA-3 style transformer from scratch in PyTorch
  • Grouped Query Attention (GQA) with 16 query heads and 4 KV heads
  • SwiGLU activation in feed-forward layers
  • RoPE positional encoding
  • RMSNorm pre-normalization
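
To make the building blocks listed above concrete, here is a dependency-free sketch of each one operating on plain Python lists. It illustrates the techniques only and is not the code in model/layers.py, which is written in PyTorch. The function names, the RoPE base of 10000 and the interleaved channel pairing are assumptions; the source does not state which values or conventions the model uses.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root-mean-square, then apply a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation used inside SwiGLU."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


def rope(x, position, base=10000.0):
    """RoPE: rotate each consecutive channel pair by a position-dependent angle.

    Assumes an even number of channels and interleaved pairing (0,1), (2,3), ...
    """
    dim = len(x)
    out = list(x)
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


def kv_head_for(q_head, n_q_heads=16, n_kv_heads=4):
    """GQA: each group of n_q_heads // n_kv_heads query heads shares one KV head."""
    return q_head // (n_q_heads // n_kv_heads)


# With 16 query heads and 4 KV heads, query heads 0-3 read KV head 0, 4-7 read 1, ...
print([kv_head_for(h) for h in range(16)])
```

In the real model these operations run on batched tensors, but the arithmetic per token is the same as in this sketch.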

4. Pretraining

  • 5B tokens on NVIDIA A40 48GB GPU
  • BF16 mixed precision training
  • Cosine learning rate schedule with warmup
  • Loss: 9.2 → 3.8 over 19,073 steps
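
The schedule named above can be sketched in a few lines. Only the step count (19,073) and the token budget (5B) come from this README; the warmup length, peak learning rate and floor below are hypothetical placeholders, since the actual values live in configs/model_350m.yaml and are not quoted here.

```python
import math

TOTAL_STEPS = 19_073     # from the pretraining summary above
WARMUP_STEPS = 1_000     # hypothetical: not stated in this README
PEAK_LR = 3e-4           # hypothetical
MIN_LR = 3e-5            # hypothetical


def lr_at(step, total_steps=TOTAL_STEPS, warmup_steps=WARMUP_STEPS,
          peak_lr=PEAK_LR, min_lr=MIN_LR):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Implied batch size: 5B tokens spread over 19,073 optimizer steps.
tokens_per_step = 5_000_000_000 / TOTAL_STEPS
print(f"tokens per step: {tokens_per_step:,.0f}")
print(f"lr at steps 0 / 1,000 / 10,000 / 19,073: "
      f"{lr_at(0):.2e} / {lr_at(1_000):.2e} / {lr_at(10_000):.2e} / {lr_at(19_073):.2e}")
```

The implied tokens per step is close to 128 × 2,048 = 262,144, which would correspond to 128 full-context sequences per step, though the actual batch configuration is not stated here.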

5. Supervised Fine-Tuning (SFT)

  • 3,750 cybersecurity instruction-response pairs
  • Topics: incident response, vulnerability analysis, threat detection, security architecture
  • 3 epochs, validation loss: 1.28

6. Web Interface

  • Local Flask app with real-time inference on Apple M4 MPS
  • React portfolio demo with pre-generated responses

Quick Start

Prerequisites

# Requires Python 3.11+
pip install torch sentencepiece pyyaml flask

Download Model

# Clone repo
git clone https://github.com/Omkarth/CyberLLM.git
cd CyberLLM

# Download trained model from HuggingFace
pip install huggingface_hub
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id='Omkarth/CyberLLM-350M', filename='model.pt', local_dir='checkpoints')
"

Chat (CLI)

python training/chat.py --question "What is SQL injection?"

# Interactive mode
python training/chat.py

Web Interface

python web/app.py
# Open http://localhost:5000

Training Pipeline

Reproduce from Scratch

# 1. Train tokenizer
python tokenizer/train_tokenizer.py

# 2. Download security data
python data/pipeline/05_security_data_v2.py --all

# 3. Download general data
python data/pipeline/01_download.py --all

# 4. Clean and filter
python data/pipeline/02_clean_filter.py --skip-dedup --security-oversample 2

# 5. Tokenize into shards
python data/pipeline/03_tokenize_shard.py

# 6. Pretrain (requires GPU — ~50 hours on A40)
python training/pretrain.py --config configs/model_350m.yaml

# 7. SFT
python training/sft.py \
  --model checkpoints/final_model/model.pt \
  --config configs/model_350m.yaml \
  --data data/sft/cybersec_sft_v3.jsonl \
  --checkpoint-dir checkpoints/sft \
  --epochs 3 --lr 2e-5

# 8. Chat
python training/chat.py

Project Structure

cyberllm/
├── model/
│   ├── config.py              # Model configuration dataclass
│   ├── layers.py              # RMSNorm, SwiGLU, RoPE, GQA
│   └── transformer.py         # Full transformer + generation
├── tokenizer/
│   ├── train_tokenizer.py     # SentencePiece BPE training
│   └── cybersec_tokenizer.model
├── data/
│   ├── pipeline/
│   │   ├── 01_download.py     # General data downloader
│   │   ├── 02_clean_filter.py # Quality filtering + dedup
│   │   ├── 03_tokenize_shard.py # Tokenization + sharding
│   │   └── 05_security_data_v2.py # Security data pipeline
│   ├── dataloader.py          # Training data loader
│   └── sft/
│       └── cybersec_sft_v3.jsonl # SFT instruction pairs
├── training/
│   ├── pretrain.py            # Pretraining loop
│   ├── sft.py                 # Supervised fine-tuning
│   ├── chat.py                # CLI chat interface
│   └── train_350m.sh          # RunPod deployment script
├── web/
│   └── app.py                 # Flask web interface
├── configs/
│   ├── model_125m.yaml        # 125M config (v1)
│   └── model_350m.yaml        # 350M config (v2)
└── scripts/
    ├── count_params.py
    └── verify_environment.py

Training Loss Curve

Step      Loss    Notes
─────────────────────────────
0         9.24    Training start
500       5.67    Rapid learning
1,000     4.92    
3,000     4.37    
5,000     4.10    Below 125M final loss (4.70)
8,000     3.95    
10,000    3.86    
14,000    3.77    Best low: 3.14
16,000    3.85    Best low: 3.04
19,073    3.80    Training complete
─────────────────────────────
SFT       1.28    After instruction tuning
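
For a more intuitive reading of these numbers, a loss can be converted to perplexity. The sketch below assumes the reported values are mean per-token cross-entropy in nats, which is the PyTorch default but is not stated explicitly in this README. The labels and names are illustrative.

```python
import math

# Losses reported in the table above (assumed: mean cross-entropy per token, in nats).
REPORTED_LOSSES = {
    "pretrain start (step 0)": 9.24,
    "pretrain end (step 19,073)": 3.80,
    "after SFT": 1.28,
}


def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(loss_nats)


for label, loss in REPORTED_LOSSES.items():
    print(f"{label:28s} loss={loss:5.2f}  perplexity={perplexity(loss):10.1f}")
```

The pretraining and SFT losses are measured on different data (web-scale text versus instruction-response pairs), so the two perplexities describe different tasks and should not be read as a like-for-like improvement.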

Limitations

  • 350M parameters is small — the model handles common security topics well but struggles with niche technical details
  • Repetition — can sometimes repeat phrases or concepts
  • Hallucination — may generate plausible but incorrect technical details
  • Not a security tool — this is a research/educational project, not a production security advisor

What I Learned

  1. Data quality > model size — Primus-FineWeb's 3B tokens of cyber-filtered data made more difference than model architecture tweaks
  2. SFT transforms base models — the base model produced semi-coherent text; SFT made it actually useful
  3. Infrastructure matters — debugging AMP (FP32 vs BF16) was the difference between 16K and 91K tokens/sec
  4. Small models have limits — 350M can learn domain vocabulary and common patterns but can't reliably recall specific technical procedures
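
To put the throughput figures from point 3 in perspective, the sketch below converts tokens per second into wall-clock hours for a 5B-token run. It assumes the throughput stays constant for the whole run and ignores checkpointing, evaluation and restarts, none of which are described here. The function name is illustrative.

```python
TOTAL_TOKENS = 5_000_000_000  # pretraining budget from the model card


def hours_for(tokens_per_sec: float, total_tokens: int = TOTAL_TOKENS) -> float:
    """Wall-clock hours to process total_tokens at a constant throughput."""
    return total_tokens / tokens_per_sec / 3600


for tps in (16_000, 91_000):
    print(f"{tps:>6,} tokens/sec -> {hours_for(tps):5.1f} hours")
# → 16,000 tokens/sec ->  86.8 hours
# → 91,000 tokens/sec ->  15.3 hours

print(f"speedup: {91_000 / 16_000:.1f}x")
```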

Tech Stack

  • Framework: PyTorch
  • Tokenizer: SentencePiece
  • Training: RunPod (NVIDIA A40 48GB)
  • Inference: Apple M4 MPS / CUDA
  • Web: Flask + vanilla HTML/CSS/JS
  • Data: HuggingFace Datasets, ArXiv API, NIST, MITRE

Author

Omkar Thombre
Master of Computer Science, University of Adelaide
Portfolio · LinkedIn · GitHub

License

This project is for educational and research purposes. The trained model weights are available on HuggingFace.
