← Back to Ideas
📈 November 4 Update: Training In Progress!
Pivoted to nanochat (Karpathy's minimal LLM framework) instead of Phi-3-Mini. Currently training d20 model (561M params) on vortex with 2x RTX 3090.
Training Progress:
- Step 17,588 / 22,222 (79% complete)
- Data: 1M examples (800K train, 100K val, 100K test)
- Time: ~12-15 hours total (started Nov 4)
- Expected completion: Within 3 hours
- Training loss: 0.02-0.05 (excellent convergence)
Why nanochat? Simpler architecture (~8K lines vs complex frameworks), full control over training, proven to work on 100K validation run (1.3 hours, val loss 0.0426). Plus discovered we need multi-phase architecture: small skip classifier + larger extractor.
🎯 The Opportunity: Your Data is a Goldmine
You have 11.8 million Qwen 7B inferences sitting in PostgreSQL. That's not just processed data –
it's perfect training data to create a specialized model that runs circles around the teacher.
🤔 What is Model Distillation?
Knowledge distillation is the process of training a smaller "student" model to mimic a larger "teacher" model.
Large models (like Qwen 7B) learn general patterns across many domains. But your task is specialized: extracting events from SEC filings.
A smaller model (1-4B parameters) can match or even exceed large model performance on narrow, specialized tasks.
Think of it like this: A general practitioner knows medicine broadly, but a cardiologist knows hearts better than anyone.
💡 The Core Insight
You've already paid for Qwen 7B to process millions of sentences. Those inferences are worth far more than the GPU time you spent.
They're a training dataset asset potentially worth $100K-1M+. Use them to train a fast, cheap, specialized model
that's better at your specific task.
📊 Your Training Data Inventory
From your PostgreSQL kee.events table:
8.2M
High Confidence (≥8)
This is 80x larger than typical fine-tuning datasets. Most people fine-tune on 10K-100K examples.
You have millions of high-quality, domain-specific examples. That's your competitive advantage.
{
"instruction": "Extract SEC filing events from the sentence below. Return JSON with event_type, subject, object, magnitude_class, timing, sentiment, materiality, confidence, date_confidence, and event_date.",
"input": "The Company dismissed PricewaterhouseCoopers LLP as its independent auditor on August 15, 2024, following a disagreement on revenue recognition policies.",
"output": "{\"event_type\": \"dismissed_auditor\", \"subject\": \"The Company\", \"object\": \"PricewaterhouseCoopers LLP\", \"magnitude_class\": \"major\", \"timing\": \"definitive\", \"sentiment\": \"negative\", \"materiality\": \"material\", \"confidence\": 10, \"date_confidence\": 10, \"event_date\": \"2024-08-15\"}"
}
🔄 Multi-Phase Architecture: Skip Classifier + Extractor
Critical Discovery: The training data only includes positive examples (events that were extracted). Qwen's production pipeline actually rejects 90% of sentences with {"skip": true}, but these negative examples aren't in the training data.
⚠️ The Skip Logic Problem
Issue: A model trained only on positive examples (events) will hallucinate events from boilerplate sentences because it never learned when to say "no event here."
Solution: Two-stage architecture with specialized models for each task.
Production Pipeline Architecture
SEC Filing (10,000 sentences)
↓ (C keyword filter, ~90% rejected)
1,000 candidates
↓ (Skip Classifier: d8/d12, 200M params, CPU)
100 likely events
↓ (Event Extractor: d20, 561M params, CPU)
Final: ~100 extracted events
Phase 1 (Current)
Event Extractor
d20 model, 561M params
Training on 1M positive examples
79% complete, ~3 hours remaining
Phase 2 (Next)
Skip Classifier
d8/d12, 100-200M params
Binary task: skip vs extract
Balanced dataset: 500K pos + 500K neg
Phase 3 (Deploy)
Production
CPU inference with 32-thread pool
~2 seconds per filing
$0 cost vs $50/day H200
🎯 Actual Model: nanochat d20 (561M Parameters)
Why nanochat instead of Phi-3-Mini?
✅ Simplicity & Control
- ~8,000 lines of minimal LLM training code (Andrej Karpathy)
- Full control over training pipeline: base → midtraining → SFT → RL
- No complex framework dependencies
- Easy to understand, modify, and debug
- Proven to work: 100K validation run completed successfully (1.3 hours, val loss 0.0426)
⚡ Performance & Flexibility
- Multiple model sizes: d8 (100M), d12 (200M), d16 (350M), d20 (561M)
- Can train different sizes for different tasks (small skip classifier, larger extractor)
- CPU-friendly inference with quantization support
- Target: 10-20 events/sec on 32-core CPU
🎓 Validation Results (100K Model)
- Training duration: 1 hour 18 minutes
- Final validation loss: 0.0426 (excellent)
- MMLU: 22.7% (retained general knowledge)
- ARC-Easy: 25.0% (no catastrophic forgetting)
- Test extraction: Valid JSON, correct event_type, good confidence scoring
📚 Original Plan: Phi-3-Mini (3.8B Parameters)
Note: This was the initial plan before pivoting to nanochat. Kept for reference.
Why Phi-3-Mini?
✅ State-of-the-Art Performance
- Beats many 7B models despite being half the size
- Excellent instruction following (trained specifically for prompted tasks)
- Strong at structured output (JSON generation)
- Microsoft-backed with great documentation and support
⚡ Speed & Efficiency
- 3-4x faster than Qwen 7B on GPU
- 5-7x faster on CPU (with 4-bit quantization)
- 40-60 tokens/sec on modern CPU vs 8 tokens/sec for Qwen 7B
- Can run on CPU with no GPU required in production
💰 Cost Savings
- 90% reduction in inference costs
- Current: $6,250 per 10M sentences (Qwen 7B on H100)
- After: $0 per 10M sentences (Phi-3-Mini on CPU, just electricity)
- Break-even after processing ~30M sentences (likely 3-6 months)
🔬 Model Performance Comparison
| Model |
Size |
CPU Speed |
GPU Speed |
Cost/1M Tokens |
| Qwen 7B |
7B |
8 t/s |
100 t/s |
$0.20 |
| Phi-3-Mini |
3.8B |
25 t/s |
300 t/s |
$0.07 |
| Phi-3-Mini Q4 |
3.8B (4-bit) |
50 t/s |
N/A |
$0.00 |
🏗️ Training Approach: LoRA vs Full Fine-Tuning
LoRA (Low-Rank Adaptation) - Recommended First
⚡ Fast & Cheap
- Training time: 8-12 hours on 3x RTX 3090 (Vortex)
- Training cost: $0 on Vortex or $100-200 on cloud GPU
- Only trains 100M parameters (adapters, not full model)
- Adapter size: ~100MB (vs 7GB for full model)
📈 Quality Expectations
- Expected accuracy: 95-97% of Qwen 7B performance
- Minimal degradation from using adapters
- If quality is sufficient, ship it! If not, try full fine-tuning
Full Fine-Tuning - If LoRA Isn't Enough
- Training time: 20-24 hours on H200 GPU
- Training cost: ~$70 on Vast.ai
- Quality: 96-98% accuracy (maximum quality)
- When to use: If LoRA gives <95% accuracy
🚀 Implementation Roadmap
Phase 1: Data Extraction (30-60 minutes)
Extract 1M high-quality training examples from PostgreSQL. Create train/val/test splits (80/10/10).
Manual quality check on 100-1,000 examples to ensure Qwen 7B outputs are reliable.
python3 extract_training_data.py --output train_1m.jsonl --limit 1000000 --split
Phase 2: Setup Training Environment (5-10 minutes)
Setup Vortex (3x RTX 3090) or rent H200 GPU on Vast.ai. Install PyTorch, Transformers, Axolotl (training framework).
Verify GPU availability and configurations.
Phase 3: Train Model (8-24 hours)
Fine-tune Phi-3-Mini with LoRA using Axolotl. Monitor training loss and validation metrics.
Target: >95% agreement with Qwen 7B on test set.
# On Vortex
./train.sh vortex
# Or on H200 (if Vortex busy)
./train.sh h200
Phase 4: Quantize & Deploy (2-4 hours)
Convert model to 4-bit GGUF format (~2.3GB). Benchmark CPU inference speed (target: 40-60 t/s).
Deploy alongside Qwen 7B for side-by-side comparison.
Phase 5: Production Cutover (1 week)
Run both models in parallel for 1 week. Compare quality on 100K sentences.
If quality >95%, full cutover to Phi-3-Mini. Start saving thousands per month.
💰 Cost-Benefit Analysis
Investment Required
- Time: 3-4 weeks engineer time (data prep, training, testing, deployment)
- Money: $0 (Vortex) to $200 (cloud GPU for LoRA) to $1,000 (full fine-tuning)
Benefits
$6K+
Saved per 10M Sentences
ROI Calculation
If you process 50M sentences/year: $24K saved/year
If you process 200M sentences/year: $97K saved/year
Plus: Proprietary model = competitive advantage. Can sell as commercial product (see Commercial Product Strategy).
Foundation for 30+ specialist models (one per event type).
📚 Why This Works: Academic Evidence
Knowledge distillation is well-proven in academic research:
- DistilBERT: 97% of BERT performance with 40% fewer parameters
- TinyBERT: 96.8% of BERT performance with 87% fewer parameters
- MiniLM: 99%+ of BERT performance on specific tasks
Why it works for you:
- Narrow domain: SEC filings have consistent language and structure
- Structured output: JSON extraction is a learnable pattern
- Large dataset: Millions of examples (more than most distillation projects)
- Teacher quality: Qwen 7B is a strong teacher model
✅ Current Status: Ready to Extract Data
Setup is complete. Documentation created (README, strategy docs, training guides).
Training configs ready for both Vortex (3x RTX 3090) and H200. Helper scripts created for extraction and training.
Next step: Extract 1M training examples and kick off first training run.
Expected timeline: 4 days from data extraction to working model.
🎯 Expected Results
Quality Metrics
- LoRA: 95-97% accuracy vs Qwen 7B baseline
- Full fine-tuning: 96-98% accuracy
- Target: >95% agreement on event extraction
Performance Metrics
- Inference speed: 40-60 tokens/sec on CPU (vs 8 for Qwen 7B)
- Memory: 2.3GB (4-bit quantized) vs 14GB for Qwen 7B
- Latency: <100ms per sentence (vs ~300ms for Qwen 7B)
Cost Savings
- Current: $6,250 per 10M sentences (Qwen 7B on H100)
- After: $0 per 10M sentences (Phi-3-Mini on CPU)
- Savings: ~$6,000 per 10M sentences processed
🚀 Bonus: Advanced Opportunities
Specialist Models (Month 2-3)
Instead of one generalist model, train 30 specialist models – one per event type.
Each model is tiny (300-500M params), ultra-fast (100+ t/s on CPU), and expert at its specific event type.
- Parallelizable: Run 8 models simultaneously on 8-core CPU
- Easy to update: Retrain one model without affecting others
- Ultra-precise: Each model is hyper-specialized
Commercial Product Opportunities
- Fine-tuned Model as a Service: Sell Phi-3-Mini SEC extraction model
- Training Data: 11.8M labeled examples (worth $100K+)
- Event Extraction API: Fast CPU inference as a service
- White-Label Solution: Power Bloomberg/FactSet event feeds
🎯 Bottom Line
This is a No-Brainer
Your 11.8 million Qwen 7B inferences are worth far more than the GPU time you spent creating them.
They're a training dataset asset that can power a faster, cheaper, specialized model.
Investment: <$1,000 one-time
Timeline: 4 weeks to production
ROI: 3-6 months payback, then pure savings forever
Upside: Proprietary model + commercial product potential
Recommended first step: Extract 1M training examples, fine-tune Phi-3-Mini with LoRA on Vortex ($0 cost),
evaluate. If accuracy >95%, you're golden!