November 4, 2025 • 🏃 Phase 1 Training 79% Complete

Model Distillation: Train Ultra-Fast Event Extraction

Turn 11.8 million Qwen 7B inferences into training data for a specialized small model. Get 5-7x faster inference, 90% cost reduction, and CPU-friendly deployment while maintaining 95%+ accuracy.

📈 November 4 Update: Training In Progress!

Pivoted to nanochat (Karpathy's minimal LLM framework) instead of Phi-3-Mini. Currently training the d20 model (561M params) on Vortex with 2x RTX 3090.

Training Progress:

  • Step 17,588 / 22,222 (79% complete)
  • Data: 1M examples (800K train, 100K val, 100K test)
  • Time: ~12-15 hours total (started Nov 4)
  • Expected completion: Within 3 hours
  • Training loss: 0.02-0.05 (excellent convergence)

Why nanochat? A simpler codebase (~8K lines vs. heavyweight frameworks), full control over training, and it has already been proven on the 100K validation run (1.3 hours, val loss 0.0426). We also discovered we need a multi-phase architecture: a small skip classifier plus a larger extractor.

🎯 The Opportunity: Your Data is a Goldmine

You have 11.8 million Qwen 7B inferences sitting in PostgreSQL. That's not just processed data – it's perfect training data to create a specialized model that runs circles around the teacher.

  • 11.8M training examples
  • 5-7x faster inference
  • 90% cost reduction
  • 95%+ quality maintained

🤔 What is Model Distillation?

Knowledge distillation is the process of training a smaller "student" model to mimic a larger "teacher" model. Large models (like Qwen 7B) learn general patterns across many domains. But your task is specialized: extracting events from SEC filings.

A smaller model (1-4B parameters) can match or even exceed large model performance on narrow, specialized tasks. Think of it like this: A general practitioner knows medicine broadly, but a cardiologist knows hearts better than anyone.

💡 The Core Insight

You've already paid for Qwen 7B to process millions of sentences. Those inferences are worth far more than the GPU time you spent. They're a training dataset asset potentially worth $100K-1M+. Use them to train a fast, cheap, specialized model that's better at your specific task.

📊 Your Training Data Inventory

From your PostgreSQL kee.events table:

  • 11.8M total events
  • 8.2M high confidence (≥8)
  • 66K event types
  • 110K companies

This is 80x larger than typical fine-tuning datasets. Most people fine-tune on 10K-100K examples. You have millions of high-quality, domain-specific examples. That's your competitive advantage.

{ "instruction": "Extract SEC filing events from the sentence below. Return JSON with event_type, subject, object, magnitude_class, timing, sentiment, materiality, confidence, date_confidence, and event_date.", "input": "The Company dismissed PricewaterhouseCoopers LLP as its independent auditor on August 15, 2024, following a disagreement on revenue recognition policies.", "output": "{\"event_type\": \"dismissed_auditor\", \"subject\": \"The Company\", \"object\": \"PricewaterhouseCoopers LLP\", \"magnitude_class\": \"major\", \"timing\": \"definitive\", \"sentiment\": \"negative\", \"materiality\": \"material\", \"confidence\": 10, \"date_confidence\": 10, \"event_date\": \"2024-08-15\"}" }

🔄 Multi-Phase Architecture: Skip Classifier + Extractor

Critical Discovery: The training data includes only positive examples (sentences where an event was extracted). In production, Qwen actually rejects about 90% of candidate sentences with {"skip": true}, but none of those negative examples made it into the training set.

⚠️ The Skip Logic Problem

Issue: A model trained only on positive examples (events) will hallucinate events from boilerplate sentences because it never learned when to say "no event here."

Solution: Two-stage architecture with specialized models for each task.

Production Pipeline Architecture

SEC Filing (10,000 sentences)
  ↓ keyword filter (~90% rejected)
1,000 candidates
  ↓ Skip Classifier (d8/d12, 200M params, CPU)
100 likely events
  ↓ Event Extractor (d20, 561M params, CPU)
Final: ~100 extracted events
Phase 1 (Current): Event Extractor
  • d20 model, 561M params
  • Training on 1M positive examples
  • 79% complete, ~3 hours remaining

Phase 2 (Next): Skip Classifier
  • d8/d12, 100-200M params
  • Binary task: skip vs extract
  • Balanced dataset: 500K positive + 500K negative

Phase 3 (Deploy): Production
  • CPU inference with 32-thread pool
  • ~2 seconds per filing
  • $0 cost vs $50/day H200
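
For concreteness, here is a minimal sketch of how the three stages could be chained at inference time. The keyword list and the two model-call functions are hypothetical placeholders, not the project's actual code:

import re

# Hypothetical keyword list; the real filter presumably uses a much larger vocabulary.
EVENT_KEYWORDS = re.compile(
    r"\b(dismissed|appointed|acquired|merger|resigned|bankruptcy|restated|dividend)\b",
    re.IGNORECASE,
)

def keyword_filter(sentences):
    """Stage 1: cheap regex pass that rejects ~90% of boilerplate sentences."""
    return [s for s in sentences if EVENT_KEYWORDS.search(s)]

def skip_classifier(sentence):
    """Stage 2 placeholder: the small d8/d12 model answers 'is there an event here?'."""
    return True  # stub; in production this calls the 100-200M-param classifier on CPU

def event_extractor(sentence):
    """Stage 3 placeholder: the d20 (561M-param) model emits the structured event JSON."""
    return {"input": sentence}  # stub; in production this returns the extracted event dict

def process_filing(sentences):
    candidates = keyword_filter(sentences)                   # ~10,000 -> ~1,000
    likely = [s for s in candidates if skip_classifier(s)]   # ~1,000 -> ~100
    return [event_extractor(s) for s in likely]              # final extracted events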

🎯 Actual Model: nanochat d20 (561M Parameters)

Why nanochat instead of Phi-3-Mini?

✅ Simplicity & Control

⚡ Performance & Flexibility

🎓 Validation Results (100K Model)

📚 Original Plan: Phi-3-Mini (3.8B Parameters)

Note: This was the initial plan before pivoting to nanochat. Kept for reference.

Why Phi-3-Mini?

✅ State-of-the-Art Performance

⚡ Speed & Efficiency

💰 Cost Savings

🔬 Model Performance Comparison

Model         | Size         | CPU Speed | GPU Speed | Cost per 1M Tokens
Qwen 7B       | 7B           | 8 t/s     | 100 t/s   | $0.20
Phi-3-Mini    | 3.8B         | 25 t/s    | 300 t/s   | $0.07
Phi-3-Mini Q4 | 3.8B (4-bit) | 50 t/s    | N/A       | $0.00
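
As a sanity check on the 5-7x claim, the single-stream CPU numbers from the table work out as follows (rough arithmetic, not a benchmark):

# Single-stream CPU throughput taken from the table above.
qwen_cpu_tps = 8        # Qwen 7B on CPU
phi3_q4_cpu_tps = 50    # Phi-3-Mini quantized to 4-bit on CPU

print(f"CPU speedup: {phi3_q4_cpu_tps / qwen_cpu_tps:.1f}x")  # ~6.3x, inside the quoted 5-7x range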

🏗️ Training Approach: LoRA vs Full Fine-Tuning

LoRA (Low-Rank Adaptation) - Recommended First
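
For reference, this is roughly what a LoRA setup looks like with Hugging Face PEFT. The actual run would go through Axolotl, and the rank, alpha, and target modules below are placeholder assumptions rather than the project's config:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder hyperparameters; tune rank/alpha/dropout for the real run.
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
lora_config = LoraConfig(
    r=16,                                    # low-rank dimension
    lora_alpha=32,                           # scaling factor
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],   # attention projections (module names vary by model)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()           # typically well under 1% of total params are trainable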

⚡ Fast & Cheap

📈 Quality Expectations

Full Fine-Tuning - If LoRA Isn't Enough

🚀 Implementation Roadmap

Phase 1: Data Extraction (30-60 minutes)

Extract 1M high-quality training examples from PostgreSQL. Create train/val/test splits (80/10/10). Manual quality check on 100-1,000 examples to ensure Qwen 7B outputs are reliable.

python3 extract_training_data.py --output train_1m.jsonl --limit 1000000 --split
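
A minimal sketch of what such a script might do. The kee.events column names (sentence, event_json, confidence) and the connection string are assumptions, and the real extract_training_data.py may differ:

import json
import random
import psycopg2

INSTRUCTION = ("Extract SEC filing events from the sentence below. Return JSON with "
               "event_type, subject, object, magnitude_class, timing, sentiment, "
               "materiality, confidence, date_confidence, and event_date.")

conn = psycopg2.connect("dbname=kee")
cur = conn.cursor(name="events_cursor")   # server-side cursor to stream rows
cur.execute("""
    SELECT sentence, event_json
    FROM kee.events
    WHERE confidence >= 8          -- high-confidence teacher outputs only
    ORDER BY random()              -- random sample; fine for a sketch, slow at 11.8M rows
    LIMIT 1000000
""")

splits = {name: open(f"{name}.jsonl", "w") for name in ("train", "val", "test")}
for sentence, event_json in cur:
    record = {"instruction": INSTRUCTION, "input": sentence, "output": json.dumps(event_json)}
    r = random.random()            # approximate 80/10/10 split (~800K / 100K / 100K)
    name = "train" if r < 0.8 else ("val" if r < 0.9 else "test")
    splits[name].write(json.dumps(record) + "\n")

for f in splits.values():
    f.close()
conn.close()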

Phase 2: Setup Training Environment (5-10 minutes)

Set up Vortex (3x RTX 3090) or rent an H200 GPU on Vast.ai. Install PyTorch, Transformers, and Axolotl (training framework). Verify GPU availability and configuration.

Phase 3: Train Model (8-24 hours)

Fine-tune Phi-3-Mini with LoRA using Axolotl. Monitor training loss and validation metrics. Target: >95% agreement with Qwen 7B on test set.

# On Vortex
./train.sh vortex

# Or on H200 (if Vortex busy)
./train.sh h200

Phase 4: Quantize & Deploy (2-4 hours)

Convert model to 4-bit GGUF format (~2.3GB). Benchmark CPU inference speed (target: 40-60 t/s). Deploy alongside Qwen 7B for side-by-side comparison.
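
One way to benchmark CPU throughput of the quantized model is via llama-cpp-python; the GGUF filename is hypothetical, and the thread count matches the planned 32-thread pool:

import time
from llama_cpp import Llama

# Hypothetical GGUF path; adjust to wherever the quantized model lands.
llm = Llama(model_path="phi3-mini-extractor-q4_k_m.gguf", n_ctx=2048, n_threads=32)

prompt = ("Extract SEC filing events from the sentence below. Return JSON.\n\n"
          "The Company dismissed PricewaterhouseCoopers LLP as its independent auditor "
          "on August 15, 2024, following a disagreement on revenue recognition policies.")

start = time.time()
result = llm(prompt, max_tokens=256, temperature=0.0)
elapsed = time.time() - start

completion_tokens = result["usage"]["completion_tokens"]
print(f"{completion_tokens / elapsed:.1f} tokens/sec")  # target: 40-60 t/s on CPU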

Phase 5: Production Cutover (1 week)

Run both models in parallel for 1 week. Compare quality on 100K sentences. If quality >95%, full cutover to Phi-3-Mini. Start saving thousands per month.
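
One reasonable way to operationalize the >95% quality bar is per-field agreement between the two models on the same sentences. A sketch with hypothetical file names (one JSON event per line, aligned by sentence):

import json

FIELDS = ["event_type", "subject", "object", "magnitude_class", "timing",
          "sentiment", "materiality", "event_date"]

def agreement(teacher_path, student_path):
    """Average fraction of fields where the student's extraction matches the teacher's."""
    scores = []
    with open(teacher_path) as t, open(student_path) as s:
        for t_line, s_line in zip(t, s):
            t_event, s_event = json.loads(t_line), json.loads(s_line)
            matches = sum(t_event.get(f) == s_event.get(f) for f in FIELDS)
            scores.append(matches / len(FIELDS))
    return sum(scores) / len(scores)

print(f"Agreement: {agreement('qwen_test.jsonl', 'student_test.jsonl'):.1%}")  # target: >95%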

💰 Cost-Benefit Analysis

Investment Required

Benefits

  • 5-7x faster processing
  • 90% cost reduction
  • $6K+ saved per 10M sentences
  • 3-6 month payback period

ROI Calculation

If you process 50M sentences/year: $24K saved/year
If you process 200M sentences/year: $97K saved/year

Plus: Proprietary model = competitive advantage. Can sell as commercial product (see Commercial Product Strategy). Foundation for 30+ specialist models (one per event type).

📚 Why This Works: Academic Evidence

Knowledge distillation is well proven in academic research.

Why it works for you:

  1. Narrow domain: SEC filings have consistent language and structure
  2. Structured output: JSON extraction is a learnable pattern
  3. Large dataset: Millions of examples (more than most distillation projects)
  4. Teacher quality: Qwen 7B is a strong teacher model

✅ Original Plan Status: Ready to Extract Data (see the November 4 update above for current progress)

Setup is complete. Documentation created (README, strategy docs, training guides). Training configs ready for both Vortex (3x RTX 3090) and H200. Helper scripts created for extraction and training.

Next step: Extract 1M training examples and kick off first training run. Expected timeline: 4 days from data extraction to working model.

🎯 Expected Results

Quality Metrics

Performance Metrics

Cost Savings

🚀 Bonus: Advanced Opportunities

Specialist Models (Month 2-3)

Instead of one generalist model, train 30 specialist models – one per event type. Each model is tiny (300-500M params), ultra-fast (100+ t/s on CPU), and expert at its specific event type.

Commercial Product Opportunities

🎯 Bottom Line

This is a No-Brainer

Your 11.8 million Qwen 7B inferences are worth far more than the GPU time you spent creating them. They're a training dataset asset that can power a faster, cheaper, specialized model.

Investment: <$1,000 one-time
Timeline: 4 weeks to production
ROI: 3-6 months payback, then pure savings forever
Upside: Proprietary model + commercial product potential

Recommended first step: Extract 1M training examples, fine-tune Phi-3-Mini with LoRA on Vortex ($0 cost), evaluate. If accuracy >95%, you're golden!